Towards the end of 2021, students and staff from Aarhus University and University of Copenhagen with an interest in digital methods, data wrangling, and text and data mining were once again invited to join the annual datasprint organised by The University Libraries at The Royal Danish Library (Det Kgl. Bibliotek).
With the purpose of developing competencies within the field of digital humanities, the datasprint focused on the importance of open political data and the potential of text and data mining in this context.
Large historical datasets were made available to the participants as raw material to explore using the cloud-based Interactive High Performance Computing service, UCloud, developed for Danish universities. A hybrid group of staff from Center for Humanities Computing Aarhus (CHCAA) and students from Information Science, Aarhus University, took part in the datasprint in Aarhus (November 18th and 19th) and gained experience using UCloud in their work with large datasets.
Benefits of UCloud
High Performance Computing (HPC) systems, colloquially referred to as ‘supercomputers’, are characterised by computing power that far surpasses that of regular desktop computers.
With the cloud-based service UCloud, complex HPC systems are made accessible to researchers and students, who can work with large datasets even from a laptop.
According to the participants from CHCAA, Aarhus University, one main advantage of working with UCloud at the datasprint was efficiency: it offers more computing power and works faster than similar systems. The ability to process large amounts of data in a relatively short time is also described as a significant feature of UCloud, alongside its intuitive interface and easy error recovery.
The value of UCloud in the datasprint
UCloud was an important tool at the datasprint in Aarhus because the topic involved a considerable amount of data, namely the complete collection of Folketinget’s proceedings from 1953 to 2021.
A notable challenge in working with the large dataset from the Danish parliament was that only contemporary data, from the 2000s onwards, had already been categorised by subject. The participants from our hybrid group set out to solve this in order to improve the conditions for analysing the dataset.
Creating a new classifier for the older proceedings that lack subject categories would make the dataset more accessible and available for further analyses:
“We’re working with only 20 subjects, so it is very generic …like economy, labour, foreign affairs.”
– Jan Kostkan, Center for Humanities Computing, Aarhus University
A broader understanding of the dataset from Folketinget can thus be gained: the group found a way to categorise the proceedings, making them available for further analyses by experts with subject-matter knowledge, such as historians.
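For readers curious what such a categorisation step might look like in practice, here is a minimal sketch in Python. It is an illustration only: the article does not say which method or tools the group used, so the Naive Bayes approach, the function names and the toy example sentences below are all assumptions. The three subject labels are taken from the quote above. The idea is to learn from proceedings that already carry a subject label and then assign a subject to older, unlabelled text.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: invented sentences standing in for labelled,
# contemporary proceedings. The real dataset is not reproduced here.
LABELLED_EXAMPLES = [
    ("the budget and taxes for next year", "economy"),
    ("inflation and the state budget deficit", "economy"),
    ("wages and unions on the labour market", "labour"),
    ("unemployment benefits and working hours", "labour"),
    ("the treaty with nato and the embassy", "foreign affairs"),
    ("negotiations with foreign governments on the treaty", "foreign affairs"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would do more)."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents and words per subject from (text, subject) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, subject in labelled_docs:
        doc_counts[subject] += 1
        for token in tokenize(text):
            word_counts[subject][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the most probable subject under multinomial Naive Bayes."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_subject, best_score = None, float("-inf")
    for subject in doc_counts:
        # Log prior: share of training documents with this subject.
        score = math.log(doc_counts[subject] / total_docs)
        subject_total = sum(word_counts[subject].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[subject][token] + 1
            score += math.log(count / (subject_total + len(vocabulary)))
        if score > best_score:
            best_subject, best_score = subject, score
    return best_subject


model = train(LABELLED_EXAMPLES)
# An "older", unlabelled snippet (also invented for illustration).
print(predict(model, "a debate on taxes and the budget"))
```

With real parliamentary proceedings, the training set would be the subject-labelled data from the 2000s onwards, and the predictions would still need checking by people with subject-matter knowledge.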
Evaluating the datasprint
UCloud thus served as a valuable tool at the datasprint in Aarhus this November. All four participants agree that UCloud offers significant advantages when working with large datasets such as those in the datasprint, mainly because it has more computing power and works faster than other systems.
One quality of UCloud that the participants emphasise is its support for collaboration: the system makes it easy to work with others, even at a distance. Apart from minor issues in the user interface, UCloud is generally commended for its usability, even for beginners, and both students and staff from the group stress the potential of including UCloud in teaching.
Read more about the Data(Tinget) datasprint