Researchers Have Teamed Up in India To Build a Gigantic Store of Texts and Images Extracted From 73M Journal Articles

Researchers Have Teamed Up in India To Build a Gigantic Store of Texts and Images Extracted From 73M Journal Articles (nature.com) 32

Posted by msmash on Friday July 19, 2019 @03:25PM from the good-fights dept.

A giant data store quietly being built in India could free vast swathes of science for computer analysis -- but whether it is a legal pursuit remains unclear. From a report: Carl Malamud is on a crusade to liberate information locked up behind paywalls -- and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it. Over the past year, Malamud has -- without asking publishers -- teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day.

The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. "This is not every journal article ever written, but it's a lot," Malamud says. It's comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot. No one will be allowed to read or download work from the repository, because that would breach publishers' copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world's scientific literature to pull out insights without actually reading the text. The unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis.

Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control -- and often limit -- the speed and scope of such projects, which typically confine themselves to abstracts, not full text. Researchers in India, the United States and the United Kingdom are already making plans to use the JNU store instead. Malamud and Lynn have held workshops at Indian government laboratories and universities to explain the idea. "We bring in professors and explain what we are doing. They get all excited and they say, 'Oh gosh, this is wonderful'," says Malamud. But the depot's legal status isn't yet clear. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit.

Make sure to notify JSTOR (Score:4, Funny)

by weilawei ( 897823 ) writes: on Friday July 19, 2019 @03:29PM (#58953052)

May the Swartz be with you.

- Re: (Score:2)
  
  by kyoko21 ( 198413 ) writes:
  
  Hell yeah!!!
  Aaron + Machine Learning = something big
So, the Indian libgen? (Score:2)

by Mr. Dollar Ton ( 5495648 ) writes:

Good luck, but I doubt it will go very far before the publishing cartel tries to crush him.
- Re: (Score:2)
  
  by BarbaraHudson ( 3785311 ) writes:
  
  In Indian courts? Where it can take 10 to 15 years just to get a preliminary hearing? By the time a final judgment is rendered, it would be moot because copyright will have expired and everyone involved in the original case will have died of old age. Think Bleak House Bollywood style. Lots of noise, no real plot, nobody cares.
If a tech billionaire wanted to do good... (Score:2)

by Dallas May ( 4891515 ) writes:

He would stop focusing on building himself a private space station gateway and instead purchase up all of the science journals and release them into the public domain.
- - Re: If a tech billionaire wanted to do good... (Score:2)
    
    by Dallas May ( 4891515 ) writes:
    
    That is true about literally everything.
    99% of songs written are simple and boring
    99% of books written are at best forgettable
    99% of hillsides don't have t-rex fossils.
    There are still a lot of t-rex fossils out there to be found. The trick is finding them and you can't find them unless you have access to lots and lots of hillsides.
  - Re: (Score:3)
    
    by ShanghaiBill ( 739463 ) writes:
    
    The majority of those articles are as boring as watching paint dry
    It doesn't matter, because humans don't need to read them.
    Machine Learning systems learn from scientific papers [sciencedaily.com]
Really Sad (Score:4, Insightful)

by jwhyche ( 6192 ) writes: on Friday July 19, 2019 @04:10PM (#58953266) Homepage

This is really sad. Not that they are pulling all this information together in one place. What is really sad is that copyright restrictions will prevent the information from being really useful. Sure, you can crawl it and hope to find useful snippets, and hope not to run afoul of the copyright gestapo. But imagine if all the articles could be keyworded, properly indexed, and open to researchers without restriction.

Definitions (Score:2)

by SuperKendall ( 25149 ) writes:

scanning through the world's scientific literature to pull out insights without actually reading the text.
Sorry, but scanning is just another form of reading, even if you have your electronic manservant read most of it for you.
I fully support the effort, but the idea that you can get around copyright like this strikes me as flawed.
We'll see what the courts say though!
- Re: (Score:2)
  
  by BarbaraHudson ( 3785311 ) writes:
  
  Nothing to prevent India from creating an exception to copyright law for this sort of project, same as the US has an exception for audio books for the blind.
Not the first (Score:1)

by BanHammer ( 5567450 ) writes:

Hasn't Google accomplished something similar with Google Books?
Regex and Word AI (Score:2)

by wolfheart111 ( 2496796 ) writes:

Use Regex to pull relevant content, then use Word AI https://wordai.com/ [wordai.com] to spin the article, all without reading the original article. I would imagine you would need a specific thesaurus for science though.
A million monkeys with a million typewriters (Score:2)

by jfdavis668 ( 1414919 ) writes:

They have enough to write everything. Sooner or later they will turn out all scientific papers, as well as some that no person has even written yet. (BTW, I am referring to monkeys, not the human population of India. A lot of monkeys also live there.)
woo vs non-woo (Score:2)

by sheramil ( 921315 ) writes:

Maybe the system wouldn't need 576 terabytes if they skipped all of the pseudoscience. I'm not sure we need a repository of chakra realignment techniques.
What's that you say? This is just for science? How can it tell?
