TDM and the reading revolution

Posted: 24-04-2017

You will not catch Steven Claeyssens carrying a smartphone and he will always prefer a paper book to an e-reader. Yet he is the curator of digital collections at the National Library of the Netherlands. I interviewed him about his job, text and data mining (TDM) in the humanities and the role of libraries in the research landscape.

What does it mean to be a curator of digital collections at the National Library?

It’s important to know that I don’t focus on digital publications, but on collections. So I don’t look at individual publications, but at the datasets that come from them. That’s the level I work at. I do many things. I am the link between the researchers and the library, and that means I have an important advisory role. For instance, the digital library decided not to generate article segmentation for a set of digitised magazines. But since I work with text and data miners, I know that this segmentation is important to them. So I passed this information on to the library when I found out.

Do you work on your own, or do you have a team?

We have an informal digital humanities team. It consists of several people working in different departments and on different tasks. There are two research programmers and two digital scholarship advisors. One of the advisors, Lotte, leads the KB Research Lab. The other, Martijn, looks for opportunities to join in writing research proposals. I approach things from the point of view of a data supplier.

What do you like about your job?

I come from research, and what I really like is that I am still working closely with researchers. I can really join them at the fascinating forefront.
What also interests me are the parallels with book history. I have just written a short article about the fact that machines are currently reading more than people. Think about Delpher, our open collection of digitised books, newspapers and magazines. A large share of these texts will never turn up in a search and will therefore never be read. But machines do ‘read’ them. So right now, more digitised text is ‘read’ by machines than by humans. That’s fascinating.

What do you think the role of the National Library should be in the research landscape?

A lot of my colleagues work together with researchers. Curators, for instance, really participate in research. I think the library and research will be connected even more closely in the future, and they should collaborate even more. This goes for humanists as well as computer scientists: you need both.

A national library, as opposed to a university library, is not connected to a university and is therefore a bit further away from the universities. But because of digitisation, we have the power of aggregation. By curating national text collections, we can offer researchers something very valuable. Of course, we also have fellowship programmes. So in short, I think we are partners in research.

The National Library works on making content available and more accessible. Can you name a few examples?

Well, we have Delpher of course, which is a graphical user interface and works very well for the general public. But for some researchers, this is not enough. They want raw data as well. So we gave them access to the infrastructure through APIs. In the future, access will be more differentiated. For many humanities researchers who are not yet familiar with TDM, access through APIs is already too complex. We recently also released a newspaper set as a zip file, in order to make it accessible to less technical researchers. Now we’re also making a harvester/corpus builder. But then we have to deal with copyright, so this is not always possible.

What do you think about text and data mining, in general?

I got to know text and data mining through digital humanities. Digital humanities may feel like a bit of a hype, but I am convinced that the tools and methodologies that are being developed are here to stay. In that sense, digital humanities is the most exciting thing going on in the humanities at the moment.
What interests me most is that we are learning to deal with such huge amounts of text. That’s the most exciting part for me. The e-reader, for example, is not a revolution, but the fact that we are learning to read a million books is.

I see a parallel with the reading revolution of the 18th century. Before that, people read the same texts over and over again, for instance the Bible or a devotional book. But in the 18th century this started to shift, and people began to read a wider variety of texts. I think there is a new reading revolution going on, but on a much larger scale and using software.

What are the most important barriers to text and data mining?

Copyright would be number one. After that come infrastructural questions. Researchers want to have a copy of the data, because all of them have different requests and they study the data in different ways. This leads to changes to the data, and then they expect us to incorporate those changes again. But before you know it, you end up with a landscape with many streams, and that’s something you’d like to avoid.
So maybe we shouldn’t bring the data to the software, but the software to the data. Then we know for sure that everyone is working on the same data. You have three parties involved in this: IT staff, humanities researchers, and curators. In the Netherlands, CLARIAH is already working on a solution for this problem. I think people should bring their algorithms to us, and then we can run them on the collection.
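As an editorial illustration, the idea of bringing the software to the data can be sketched in a few lines of Python. This is only a toy sketch under stated assumptions, not a description of CLARIAH's or the National Library's actual infrastructure: the `COLLECTION` dictionary, its document identifiers and the `run_on_collection` helper are all hypothetical. The point it shows is that the researcher hands over a function, the library runs it on the single curated copy, and only the derived results leave the library.

```python
from typing import Callable, Dict, Iterable

# Hypothetical in-memory stand-in for the curated collection held by the library.
# In this sketch the texts themselves never leave this module.
COLLECTION: Dict[str, str] = {
    "doc-1": "De krant van vandaag",
    "doc-2": "Een oud tijdschrift",
}


def run_on_collection(
    analysis: Callable[[str], int], doc_ids: Iterable[str]
) -> Dict[str, int]:
    """Run a researcher-supplied function on the library's copy of each text.

    Only the derived values are returned, so every researcher works
    against the same, unmodified data.
    """
    return {doc_id: analysis(COLLECTION[doc_id]) for doc_id in doc_ids}


def word_count(text: str) -> int:
    """Example 'algorithm' a researcher might submit: count the words in a text."""
    return len(text.split())


results = run_on_collection(word_count, ["doc-1", "doc-2"])
print(results)
```

A real service would also need sandboxing, resource limits and a check that the returned results respect copyright, none of which this sketch attempts.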

What progress has already been made with TDM in the humanities?

In the humanities, it is still at an early stage. But one example is stylometry software, which can indicate who the actual author of a certain text is. Because of this software, researchers recently found out that the Dutch national anthem was probably written by someone other than we always thought.
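To give a rough idea of what stylometry software does, here is a minimal Python sketch. It is an editorial illustration, not the method used in the anthem study: it simply compares how often a handful of function words occur in a disputed text and in texts by candidate authors, then picks the closest match by cosine distance. The word list, the sample texts and the function names are all assumptions made for this example; real studies use much longer word lists, far more text and more robust distance measures.

```python
from collections import Counter
import math
import re

# Hypothetical, very short list of English function words (illustration only).
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "it", "but", "with", "not"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two frequency profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a profile without any function words as maximally distant
    return 1.0 - dot / norm


def closest_author(disputed, candidates):
    """Return the candidate whose function-word profile is nearest to the disputed text."""
    target = profile(disputed)
    return min(
        candidates,
        key=lambda name: cosine_distance(target, profile(candidates[name])),
    )


# Invented toy texts; real attribution needs thousands of words per author.
candidates = {
    "Author A": "the cat and the dog and the bird",
    "Author B": "it is not that it was not with it",
}
guess = closest_author("the fox and the hen", candidates)
print(guess)
```

A result from a sketch like this is only a hint about likely authorship, which is why the finding about the anthem is phrased as "probably".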

But in general it is still in the discovery phase. There are no clear results yet, which makes it hard to deal with TDM sceptics. And if there are others who are saying “this is THE new thing for the humanities”… that’s dangerous, I think. But soon the first PhD students will graduate on TDM in the humanities, and after that I expect more results.

Another challenge is that researchers in the humanities have to think about what the tools and techniques actually do. That is hard to understand, especially with software that becomes self-teaching, as in machine learning. Maybe we should be in touch more with scientists from other disciplines. Astronomy perhaps, or CERN, the things that happen there… I am sure humanities researchers can learn from that.

This article was originally posted on the websites of FutureTDM and OpenMinTeD, two projects in which LIBER is a partner.