One approach we can use for this challenge is Retrieval-Augmented Generation (RAG), a technique that combines a large language model with external knowledge retrieval. Essentially, RAG lets the model draw on passages retrieved from a database or collection of documents when it generates an answer.
This way, the model can use information beyond what it was trained on, so its responses are better grounded in the provided data. This approach is particularly useful for questions that require up-to-date or very specific information that may not be included in the model's training data.
Here is how you can use RAG:
1. PDF Preprocessing:
Convert the PDF reports into text using a PDF parser, which reads through each PDF and extracts its text content. This is a crucial step because RAG systems work with text data.
2. Building a Database for Retrieval:
Once you have the text, the next step is to organize it into a searchable database. This means structuring the text so that it can be easily queried. For example, you might split the text into chunks by the sections or topics found in the reports.
3. Setting Up the RAG System:
The next step is to connect a large language model to the database (the parsed report data). The LLM serves as the "brain" of the system: it interprets the query and produces an answer from the information retrieved from the database.
4. Query Processing with RAG:
Next, when you have a specific piece of information you want to extract from the PDF, you ask the language model. For instance, "What were the air emissions of Carbon Dioxide for Absa in the year 2022?"
The query is used to decide what information is relevant, and the retrieval system fetches the matching passages from the database, which contains the text extracted from the report PDFs. The large language model then reads those passages together with the query and produces the answer.
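To make steps 2 to 4 more concrete, here is a minimal sketch in plain Python. It assumes the PDF text has already been extracted in step 1 (for example with a parser library such as pypdf) and uses a simple TF-IDF word-overlap score in place of the embedding-based vector search that RAG systems more commonly use. The function names, the sample passages, and the figures in them are all made up for illustration, and the final call to a language model is left out because it depends on which model you choose.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(text, max_words=40):
    """Split extracted report text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(chunks):
    """Build TF-IDF vectors for each chunk (the 'searchable database')."""
    counts_per_chunk = [Counter(tokenize(c)) for c in chunks]
    n = len(chunks)
    doc_freq = Counter()
    for counts in counts_per_chunk:
        doc_freq.update(counts.keys())
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in counts.items()} for counts in counts_per_chunk]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, vectors, idf, k=2):
    """Return the k chunks most similar to the query."""
    q_counts = Counter(tokenize(query))
    q_vec = {t: c * idf.get(t, 0.0) for t, c in q_counts.items()}
    order = sorted(range(len(chunks)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    return [chunks[i] for i in order[:k]]


def build_prompt(query, passages):
    """Combine the retrieved passages and the question into one LLM prompt."""
    context = "\n---\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Made-up passages standing in for text extracted from a report (placeholder values).
sample_chunks = [
    "Water usage: total water withdrawn in 2022 was 10 megalitres (placeholder value).",
    "Air emissions: carbon dioxide emissions in 2022 were 123 tonnes (placeholder value).",
    "Employees: the company had 50 staff at year end (placeholder value).",
]

vectors, idf = build_index(sample_chunks)
question = "What were the carbon dioxide air emissions in 2022?"
top_passages = retrieve(question, sample_chunks, vectors, idf, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)  # this prompt would then be sent to the language model of your choice
```

For real reports you would likely replace the TF-IDF scoring with an embedding model and a vector store, but the flow stays the same: chunk, index, retrieve, then prompt the model with the retrieved passages.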
The whole process can then be repeated for each of the other AMKEYS to retrieve their values from the report data.
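The iteration over keys could look roughly like the sketch below. The key names and question wording are hypothetical placeholders, not the real AMKEYS from the challenge data, and `answer_fn` stands in for whatever function runs your retrieval and language model for a single question.

```python
from typing import Callable, Dict


def extract_all(questions_by_key: Dict[str, str],
                answer_fn: Callable[[str], str]) -> Dict[str, str]:
    """Ask one question per key and collect the answers in a dict."""
    results = {}
    for key, question in questions_by_key.items():
        results[key] = answer_fn(question)
    return results


# Hypothetical keys and questions; replace them with the real AMKEYS and wording.
questions_by_key = {
    "key_co2_emissions": "What were the air emissions of carbon dioxide in 2022?",
    "key_water_usage": "What was the total water usage in 2022?",
}


def dummy_answer_fn(question: str) -> str:
    """Stand-in for the RAG pipeline; a real version would retrieve and call the LLM."""
    return f"(answer to: {question})"


results = extract_all(questions_by_key, dummy_answer_fn)
for key, value in results.items():
    print(key, "->", value)
```

Keeping the question text in a mapping like this makes it easy to adjust the wording per key without touching the extraction loop.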
N.B. This is just one of many approaches you can use for this kind of challenge.
Here are some resources to get you started:
Thank you, this is helpful.
Thank you for sharing the roadmap!
Your insights are truly valuable, thank you.