From the Trenches: Turning LLMs into Data Excavators
The Challenges of Entity Extraction with Generative AI
Recently, I've been working extensively on LLM pipelines for extracting complex data entities from PDF documents. The pipelines I built have processed millions of pages of text and extracted millions of entities. It’s not an easy process, but along the way I realized that, if the pipeline is designed correctly, the quality of the extracted data can be very high. I picked up many practical tips and lessons, which I plan to share over the next few posts, time permitting.
Let me start with a warning: it's easy to get started but hard to develop a production-quality data extraction system with Gen AI.
Getting started really is simple. Pick a framework like LangChain, LlamaIndex, or Haystack; they've implemented most of the plumbing needed to interact with the LLMs. Get familiar with the key concepts and the overall architecture (RAG and so on), go over a few examples, borrow some code for your project, and off you go. You can extract a trivial data entity from text pretty quickly (when was the US Constitution written? 1787, easy!).
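To make that concrete, the "hello world" of extraction looks roughly like this. It's a minimal sketch, not production code: the model name and prompt are illustrative, it assumes the langchain-openai package is installed and an OPENAI_API_KEY is set, and the exact imports shift between LangChain versions.

```python
# A minimal extraction sketch with LangChain. Assumes langchain-openai is
# installed and OPENAI_API_KEY is set; import paths vary by version.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is illustrative

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer with the requested value only, no explanation."),
    ("human", "From the following text, answer: when was the US Constitution "
              "written?\n\n{document}"),
])

chain = prompt | llm  # LCEL: pipe the prompt into the model
result = chain.invoke({"document": "... The Constitution was written in 1787 ..."})
print(result.content)  # expected something like "1787"
```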
Tutorials and documentation usually stop at simple examples like that, but real-life data extraction needs are far more complex. What if you want to extract an entire object in a consistent way, say a company, a product, or a patient record, where dozens of attributes are scattered throughout a long document? What if you want to extract a list of objects and make sure each attribute lands on the right object? What if you need to extract several types of related objects and then merge the entities together? How do you get non-text attributes such as dates and numbers out reliably? Units of measure? How deep do you go?
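To give a sense of the target shapes I mean, here is a hypothetical schema for a "company" entity. The fields are made up for illustration, and binding a schema like this to a model via something like LangChain's with_structured_output works on some model and version combinations and not others.

```python
# Hypothetical target schema for a "company" entity with nested objects,
# dates, numbers, and units -- the kind of structure tutorials rarely cover.
from datetime import date
from typing import Optional
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str
    launch_date: Optional[date] = None      # non-text attribute: a date
    list_price: Optional[float] = None      # non-text attribute: a number
    price_currency: Optional[str] = None    # unit of measure for the price

class Company(BaseModel):
    name: str
    founded: Optional[date] = None
    employee_count: Optional[int] = None
    revenue: Optional[float] = None
    revenue_currency: Optional[str] = None
    products: list[Product] = Field(default_factory=list)  # related objects

# With recent LangChain versions, a schema like this can be bound to the model:
#   structured_llm = llm.with_structured_output(Company)
#   company = structured_llm.invoke(prompt_text)
# Whether that works reliably depends heavily on the LLM and the schema depth.
```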
I have to tell you, an LLM that handles all of these use cases with ease doesn’t exist yet. Every LLM has limits, and each excels in different areas. As a developer, you will come to learn those limits, and that only happens through experimentation. Know that at some point, even on a project of average complexity, you will hit the limits of the LLM or of the abstraction framework around it, and you will need to find ways to work around them.
Many things can make a huge difference to the quality of your entity extraction: experimenting with prompt engineering; taking advantage of advanced LLM features such as function calling or JSON-mode output, which may or may not work for a given LLM and may or may not be supported by your abstraction layer (pointing at LangChain here); trying out various chunking methods and embedding models; experimenting with different document retrieval techniques and rerankers; dealing with context length limits; and fixing malformed JSON outputs.
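As one small example of the "fixing JSON outputs" part: even with JSON mode enabled, I end up wrapping model responses in a tolerant parser along these lines. This is a simplified sketch; real repair logic handles many more failure modes.

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Best-effort parse of an LLM response that is supposed to be JSON."""
    # Happy path: the response is already valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Common failure: the JSON is wrapped in markdown fences or extra prose.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))

    # Last resort: grab the outermost braces and hope for the best.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        return json.loads(raw[start:end + 1])

    raise ValueError("Model response did not contain parseable JSON")
```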
The LangChain and LlamaIndex libraries change on a monthly basis; many functions get deprecated or removed outright, and new ones appear. Unfortunately, documentation is often minimal or missing altogether. This constant churn means staying up to date and continuously adapting your codebase just to keep it stable and working.
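One mitigation that has worked for me is keeping framework-specific calls behind a thin interface of my own, so a deprecation only forces changes in one module. Here is a sketch of the idea; the class and method names are illustrative, not a prescribed design.

```python
# Keep framework-specific calls behind one small interface of your own, so a
# LangChain or LlamaIndex API change only forces edits in this one module.
from typing import Protocol
import json

class Extractor(Protocol):
    def extract(self, document: str) -> dict: ...

class LangChainExtractor:
    """Wraps whatever prompt/chain plumbing the current framework version needs."""

    def __init__(self, chain):
        # 'chain' is a prompt | llm pipeline built elsewhere (see the first sketch).
        self._chain = chain

    def extract(self, document: str) -> dict:
        response = self._chain.invoke({"document": document})
        # In practice this would go through a tolerant JSON parser like the one above.
        return json.loads(response.content)
```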
From my experience, you can write starter code to interact with the LLM pretty quickly, but developing reliable, production-level code for a complex project may take months of work, experimentation, and optimization, and thousands of lines of code.
In the next few posts, I'll dive deeper into specific challenges and solutions I've encountered, including prompt engineering techniques, handling edge cases in data extraction, testing various LLMs, and pushing them to the limits. Stay tuned for more insights from the trenches of AI-driven data extraction.
I’d love to hear about your experiences in this field! If you have any interesting ideas or projects in data extraction, please share them or reach out.


