In the last few days Elsevier has announced their policy on Text And Data Mining (TDM). I use the term “content mining” as I wish to mine every part of published content (images, audio, video) and not just text.
This post contains a lot of material (from Elsevier and my comments) so I’ll try to summarise. Note that Elsevier’s material seems inconsistent in places (common with this publisher). I have had to go behind Elsevier’s paywall to find one statement of agreement and rights and it is probable that I have not found everything. In essence:
- Elsevier asserts complete control over “its” content and requires both institutions and individuals to sign licences.
- Elsevier is the sole author and controller of the policy – there has been no Open discussion or agreement with scholarly bodies.
- Libraries have to – individually – sign agreements with Elsevier. There are no details of these policies or whether they entail additional institutional payment. It is also possible that institutions may be asked to give up content-mining rights in return for lower overall prices. (Over the last decade libraries have universally and unilaterally given away all these rights, and have supported publishers in forbidding machine access to content.)
- Researchers have to register as a developer (I think) and ask permission of Elsevier for every project they wish to do. It is not clear whether permission is automatic or whether Elsevier exercise control over choice and scope of project (they certainly did when I “negotiated” with them).
- Researchers can only access content through an Elsevier-controlled portal. They have to register as a Developer and get an APIKey (which conflicts with the “sign a click-through licence” wording).
- Researchers can only mine text. Images are specifically prohibited. This is useless for me – as I and colleagues are mining chemical structure diagrams.
- There is no indication of how current the material will be. I shall be mining the literature an hour after it appears. Will the API provide that?
- The amount that can be republished is often uselessly small (“200 characters”). I want to build corpora (impossible); vocabularies (essential to record precise words – impossible); chemical names (often > 200 characters, so impossible); and figure captions (impossible).
- The researchers must commit to a CC-NC licence. This effectively kills downstream use (I shall use CC0). It also trains them into thinking CC-NC is a “good thing”. It isn’t.
- If a researcher has a LEGITIMATE collection of papers that they wish to mine (say on their hard disk) they are forbidden to do so. They have to go to each publisher (if this awful protocol is promoted elsewhere), find the API, and mine the papers individually. Absurd.
This is licence-controlled TDM. The publishers tried very hard to get Europe (Neelie Kroes) to agree to licences for TDM (“Licences for Europe”). They failed.
They tried to stop the UK Hargreaves process exempting data analytics from copyright reform. They failed.
The leading library organizations and funders such as the British Library, JISC, LIBER, Wellcome Trust and RCUK are united in their opposition to licences. This is simply licences under another name.
The danger is that University libraries, who have signed these restrictive clauses for years, will continue to sign them.
Don’t take my word for this. Ask the BL, or JISC or LIBER.
BUT DON’T SIGN ELSEVIER’S TDM LICENCE.
YOU DO NOT NEED ANY API.
APIs make it HARDER to mine. We are releasing technology that will work directly on PDFs. It’s Open Source and works. And others are doing the same. If every publisher came up with a similar process it would make the burden of mining huge. This is probably what some publishers hope.
The full text of Peter’s blog piece, plus supporting analysis, is available here: http://blogs.ch.cam.ac.uk/pmr/2014/01/