How many AIs does it take to read a PDF?

• Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them, again all PDFs.
• One of the humblest and most ubiquitous file formats is stumping the world’s most advanced models.
• While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.
• “There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index.”

Article Summary:

The U.S. Department of Justice has released more than three million Epstein-related files, all as PDFs, separate from the 20,000 pages of Jeffrey Epstein estate documents released by the House Oversight Committee. Poor OCR and complex formatting make the documents largely unsearchable. Despite advances in AI, extracting structured data from PDFs remains a “grand challenge.” Reducto, a startup led by Adit Abraham, a former MIT classmate of Luke Igel, successfully parsed a wide range of Epstein-related PDFs, including redacted call logs, handwritten flight manifests, and scanned emails, and produced usable data. Using this output, Igel and collaborators built a suite of prototype tools (Jmail, Jflights, Jamazon, Jikipedia) to search and visualize the information, illustrating the potential impact of improved PDF parsing.

Sources:

https://www.theverge.com/ai-artificial-intelligence/882891/ai-pdf-parsing-failure