Researchers Have Teamed Up in India To Build a Gigantic Store of Texts and Images Extracted From 73M Journal Articles (nature.com)
A giant data store quietly being built in India could free vast swathes of science for computer analysis -- but whether it is a legal pursuit remains unclear. From a report: Carl Malamud is on a crusade to liberate information locked up behind paywalls -- and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it. Over the past year, Malamud has -- without asking publishers -- teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day.
The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. "This is not every journal article ever written, but it's a lot," Malamud says. It's comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot. No one will be allowed to read or download work from the repository, because that would breach publishers' copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world's scientific literature to pull out insights without actually reading the text. The unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis.
Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control -- and often limit -- the speed and scope of such projects, which typically confine themselves to abstracts, not full text. Researchers in India, the United States and the United Kingdom are already making plans to use the JNU store instead. Malamud and Lynn have held workshops at Indian government laboratories and universities to explain the idea. "We bring in professors and explain what we are doing. They get all excited and they say, 'Oh gosh, this is wonderful'," says Malamud. But the depot's legal status isn't yet clear. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit.
Make sure to notify JSTOR (Score:4, Funny)
May the Swartz be with you.
Re: (Score:2)
Hell yeah!!!
Aaron + Machine Learning = something big
So, the Indian libgen? (Score:2)
Good luck, but I doubt it will go very far before the publishing cartel tries to crush him.
If a tech billionaire wanted to do good... (Score:2)
He would stop focusing on building himself a private space station gateway and instead purchase up all of the science journals and release them into the public domain.
Re: If a tech billionaire wanted to do good... (Score:2)
That is true about literally everything.
99% of songs written are simple and boring
99% of books written are at best forgettable
99% of hillsides don't have t-rex fossils.
There are still a lot of t-rex fossils out there to be found. The trick is finding them and you can't find them unless you have access to lots and lots of hillsides.
Re: (Score:3)
The majority of those articles are as boring as watching paint dry
It doesn't matter, because humans don't need to read them.
Machine Learning systems learn from scientific papers [sciencedaily.com]
Really Sad (Score:4, Insightful)
This is really sad. Not that they are pulling all this information together in one place. What is really sad is that copyright restrictions will prevent the information from being really useful. Sure, you can crawl it and hope to find useful snippets and hope not to run afoul of the copyright gestapo. But imagine if all the articles could be keyworded, properly indexed, and opened to researchers without restrictions.
Definitions (Score:2)
scanning through the world's scientific literature to pull out insights without actually reading the text.
Sorry, but scanning is just another form of reading, even if you have your electronic manservant read most of it for you.
I fully support the effort, but the idea that you can get around copyright like this strikes me as flawed.
We'll see what the courts say though!
Not the first (Score:1)
Regex and Word AI (Score:2)
A million monkeys with a million typewriters (Score:2)
woo vs non-woo (Score:2)
Maybe the system wouldn't need 576 terabytes if they skipped all of the pseudoscience. I'm not sure we need a repository of chakra realignment techniques.
What's that you say? This is just for science? How can it tell?