Unifi Value Frameworks PDF Lifting Competition
Helping South Africa · $5 000 USD · Generative AI
Challenge completed over 1 year ago · 450 joined · 73 active
Start: Dec 21, 21 · Close: Mar 17, 24 · Reveal: Mar 17, 24
Koleshjr
Multimedia University of Kenya
10th Place Solution
Platform · 20 Mar 2024, 18:02 · 4

Hello Zindians,

This has been a very interesting competition. Even though I didn't win despite putting in the work, I honestly learnt a lot, and I am going to share it with you. But first, I want to congratulate all the winners, and I hope they too will share their approaches to this amazing competition.

My solution was honestly very straightforward: just the normal RAG pipeline of ingest, embed, retrieve and answer. I never tried the fancy things like reranking. I condensed these four stages into two: ingest and embed (handled by the index.py file) and retrieve and answer (handled by the main.py file).
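For anyone who wants a feel for the structure, here is a minimal sketch of what those two stages could look like. This is not my exact code: I am using pypdf, sentence-transformers and the OpenAI client as stand-ins, and the chunking is deliberately naive; the real pipeline is in the repo linked at the end.

```python
# index.py -- ingest and embed (sketch)
import pickle
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ingest(pdf_paths, company_names):
    chunks = []
    for path, company in zip(pdf_paths, company_names):
        text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
        # naive fixed-size chunking; company_name is stored as metadata on top of the source
        for i in range(0, len(text), 1000):
            chunks.append({"text": text[i:i + 1000], "source": path, "company_name": company})
    embeddings = embedder.encode([c["text"] for c in chunks])
    with open("index.pkl", "wb") as f:
        pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)

# main.py -- retrieve and answer (sketch)
import numpy as np
from openai import OpenAI

client = OpenAI()

def answer(query, k=5):
    with open("index.pkl", "rb") as f:
        store = pickle.load(f)
    q = embedder.encode([query])[0]
    embs = np.asarray(store["embeddings"])
    # cosine similarity between the query and every chunk
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = [store["chunks"][i] for i in np.argsort(-sims)[:k]]
    context = "\n\n".join(f"[{c['company_name']}] {c['text']}" for c in top)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content
```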

Key Takeaways, Tricks and Challenges

Proprietary models definitely won. In my case I had success with Gemini Pro and GPT-4. GPT-4 is expensive, but a full run still costs less than $11, which is below the competition limits. Gemini Pro, on the other hand, is extremely cheap (no more than $1 per experiment) and offers almost the same accuracy as GPT-4, although slightly lower.

Large language models love to be shown what to do. In my case, I significantly boosted my scores by passing the previous year's answers in the prompt for those queries that had previous-year values. Doing this stopped the models from confusing which values are for 2022 and which are for previous years. It also guided the model on which set of numbers to pay attention to, as it can see where the previous-year values were retrieved from. Another trick was to add the company_name to the metadata during the ingest and embed stage, on top of the source, since some documents had ambiguous names. These were the only tricks I used; everything else was pretty standard RAG.
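To make the previous-year trick concrete, here is roughly how such a prompt can be assembled (illustrative only; prev_year_value would come from the training data provided, and the wording of my real prompt was different):

```python
def build_prompt(question: str, context: str, prev_year_value=None) -> str:
    """Assemble the final prompt; when the previous year's answer is known, show it to
    the model so it anchors on the right figure and the right reporting year."""
    prompt = (
        "You answer questions about 2022 values reported in company PDFs.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
    if prev_year_value is not None:
        prompt += (
            f"Hint: the value for the previous year was {prev_year_value}. "
            "Use it to locate the correct figure, but report the 2022 value.\n"
        )
    return prompt
```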

One of the challenges the models faced was with units/magnitude. Some of the values given in the PDF were at a different magnitude from the ones expected in the submission. For example, the PDF might show a value like 8.8 Rbn while the expected answer is 8800000000. The model kept returning 8.8 even with prompts written specifically to handle those cases, although it did get some of them right. The models really struggled with this.

The other challenge I faced was that the metrics for some companies were not straightforward. For example, for a company like Impala, the accurate values had to be obtained by adding Impala Refineries, Impala Rustenburg and Marula. This in itself is hard for even humans to get; I am pretty sure 75% of the competitors never even noticed it, and you are expecting a model to get it? It was extremely hard for the models, even after adding this rule to the prompt template. Also, for a company like SSW, we were supposed to focus on values for the SA operations (PGM and Gold) and ignore the ones from Europe, and this separation was also hard to convey to the model.
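Just to illustrate the kind of rule this involves (a hypothetical helper, not code from my repo), the Impala and SSW cases effectively amount to summing or filtering segment-level values:

```python
# Hypothetical segment rules for the two awkward companies described above.
AGGREGATION_RULES = {
    "Impala": {"sum": ["Impala Refineries", "Impala Rustenburg", "Marula"]},
    "SSW": {"keep": ["SA Operations PGM", "SA Operations Gold"]},  # ignore European operations
}

def combine_segments(company, segment_values):
    """Combine per-segment values extracted from a report into the single expected figure."""
    rule = AGGREGATION_RULES.get(company, {})
    if "sum" in rule:
        return sum(segment_values.get(s, 0.0) for s in rule["sum"])
    if "keep" in rule:
        return sum(v for s, v in segment_values.items() if s in rule["keep"])
    # default: the report already gives a single group-level value
    return sum(segment_values.values())
```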

Potential concerns. This is for the @Zindi team to look at. It is very easy to get good scores in this competition. What one has to do is just read the PDFs, extract the correct values, store them in a CSV, embed that CSV instead of the PDFs, and voila, you have amazing accuracy. This approach is only beneficial if the goal is to build a question-answering system, and it is not scalable, since going through the data manually is time-consuming. For an extraction task, if someone does this, what is the need for building a model? You could just as well pay people to label the data for you.

Anyway, enough ranting. Congratulations to the @Zindi team for hosting these GenAI competitions, and I am looking forward to more.

Here is the GitHub link to the solution. Kindly star it if you find it useful; that will motivate me to publish more code solutions for the recently concluded competitions and those to come. Thank you.

https://github.com/koleshjr/Unifi_Instruct_Rag

Enjoy!

Discussion (4 answers)

swagatron

That's a great solution @Koleshjr! Glad you were able to share it!

Some insights on how I built my solution:

- My solution was quite simple (structurally): a good data extractor to precisely extract tabular data from the PDFs, plus GPT-4 with fine-tuned prompts specific to the type of data (structured/unstructured), letting it answer accordingly.

- Some insights are:

- Retriever:

- No chunking on tables (chunking might lose header-level information); each entire table is treated as one chunk.

- Before using a vector DB to return the top-K chunks, I look for exact/partial matches of the ActivityMetrics inside the tables and answer accordingly.

- Pass the entire table text in the prompt (GPT-4 handles 128K tokens and the biggest table was ~3K tokens, so it can handle entire tables very easily).

- Unit/Magnitude:

- This was indeed a problem, and I had a neat trick for it (a very simple function): if you have the previous year's value, you calculate the ratio [extracted value / previous year], check its order of magnitude, and multiply/divide the extracted answer accordingly (the offsets were mostly factors of a thousand, million or billion). See the sketch after this list.

- SSW: This was part of the unstructured data, where I prompted the model to calculate the values explicitly for SA Operations only (more scalable than writing a specific prompt for each group).

- Irregularities in Impala were left as is, as that case was quite custom and specific.
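A rough sketch of the magnitude-correction trick mentioned under Unit/Magnitude above (my paraphrase of the idea, not the exact function):

```python
import math

def fix_magnitude(extracted: float, prev_year: float) -> float:
    """Rescale the extracted value to the same order of magnitude as the known
    previous-year value; mismatches were mostly factors of a thousand, million or billion."""
    if extracted == 0 or prev_year == 0:
        return extracted
    # number of powers of 1000 separating the two values, rounded to the nearest step
    steps = round(math.log10(abs(prev_year) / abs(extracted)) / 3)
    return extracted * (1000 ** steps)

# e.g. a report says 8.8 (Rbn) while last year's value was 8 500 000 000:
# fix_magnitude(8.8, 8_500_000_000) -> 8_800_000_000.0
```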

The comment:

" What one has to do is just read the pdfs, extract the correct values, store them in a csv, embed this CSV instead of the pdfs and Voila you have amazing accuracies."

I agree you can obviously get such high scores by writing the answers in manually, but I firmly believe the Zindi team will look at the reproducibility of the code very carefully.

Hope we see more solutions in upcoming days!

20 Mar 2024, 19:23
Upvotes 1
Koleshjr
Multimedia University of Kenya

Thank you @swagatron for sharing your approach too. It would be nice to also see your code solution.

AdeptSchneider22
Kenyatta University

@swagatron When you mention no chunking on tables, does this mean you were able to extract the tables as-is from the PDFs? In my solution I used unstructured to extract the text, table and image elements; I then used a large language model to summarize the text and table elements, and a multimodal model (LLaVA) to extract the metrics and summarize the images. I'm interested in learning how you did it, given that most tables were in the form of an image in the PDFs.
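For context, here is a rough sketch of the unstructured-based extraction I am describing (the exact parameters in my pipeline may differ):

```python
from unstructured.partition.pdf import partition_pdf

# "hi_res" runs layout detection, so tables embedded as images are also detected;
# infer_table_structure=True keeps an HTML rendering of each detected table.
elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category not in ("Table", "Image")]
for table in tables:
    print(table.metadata.text_as_html)  # ready to be summarized by an LLM
```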

AdeptSchneider22
Kenyatta University

My initial plan was to fine-tune an open-source LLM on previous years' data to guide the model on how to extract the text, but the train.csv data wasn't that accurate. Let me look into your code to see how exactly you passed the previous years' values so the LLM aligns its answer with the examples you showed. It seems like you were doing few-shot prompting.

21 Mar 2024, 04:08
Upvotes 0