JPL Creates World's Largest PDF Archive to Aid Malware Research 21
NASA's Jet Propulsion Laboratory (JPL) has created the largest open-source archive of PDFs as part of DARPA's Safe Documents program, with the aim of improving internet security. The corpus consists of approximately 8 million PDFs collected from the internet. From a press release: "PDFs are used everywhere and are important for contracts, legal documents, 3D engineering designs, and many other purposes. Unfortunately, they are complex and can be compromised to hide malicious code or render different information for different users in a malicious way," said Tim Allison, a data scientist at JPL in Southern California. "To confront these and other challenges from PDFs, a large sample of real-world PDFs needs to be collected from the internet to create a shared, freely available resource for software experts." Building the corpus was no easy task. As a starting point, Allison's team used Common Crawl, an open-source public repository of web-crawl data, to identify a wide variety of PDFs to be included in the corpus -- files that are publicly available and not behind firewalls or in private networks. Conducted between July and August 2021, the crawl identified roughly 8 million PDFs.
Common Crawl limits downloaded data to 1 megabyte per file, meaning larger files were incomplete. But researchers need the entire PDF, not a truncated version, in order to conduct meaningful research on them. The file-size limit reduced the number of complete, untruncated files extracted directly from Common Crawl to 6 million. To get the other 2 million PDFs and ensure the corpus was complete, the JPL team re-fetched the truncated files using specialized software that downloaded the whole files from the incomplete PDFs' web addresses. Various metadata, such as the software used to create each PDF, was extracted and is included with the corpus. The JPL team also relied on free, publicly available geolocation software to identify the server location of the source website for each PDF. The complete data set totals about 8 terabytes, making it the largest publicly available corpus of its kind.
The corpus will do more than help researchers identify threats. Privacy researchers, for example, could study these files to determine how file-creation and editing software can be improved to better protect personal information. Software developers could use the files to find bugs in their code and to check if old versions of software are still compatible with newer versions of PDFs. The Digital Corpora project hosts the huge data archive as part of Amazon Web Services' Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.
Common Crawl limits downloaded data to 1 megabyte per file, meaning larger files were incomplete. But researchers need the entire PDF, not a truncated version, in order to conduct meaningful research on them. The file-size limit reduced the number of complete, untruncated files extracted directly from Common Crawl to 6 million. To get the other 2 million PDFs and ensure the corpus was complete, the JPL team re-fetched the truncated files using specialized software that downloaded the whole files from the incomplete PDFs' web addresses. Various metadata, such as the software used to create each PDF, was extracted and is included with the corpus. The JPL team also relied on free, publicly available geolocation software to identify the server location of the source website for each PDF. The complete data set totals about 8 terabytes, making it the largest publicly available corpus of its kind.
The corpus will do more than help researchers identify threats. Privacy researchers, for example, could study these files to determine how file-creation and editing software can be improved to better protect personal information. Software developers could use the files to find bugs in their code and to check if old versions of software are still compatible with newer versions of PDFs. The Digital Corpora project hosts the huge data archive as part of Amazon Web Services' Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.