
Malawi Public Health Systems LLM Challenge

Helping Malawi
$2 000 USD
Challenge completed over 1 year ago
Question Answering
Generative AI
407 joined
74 active
Start: Jan 24, 2024
Close: Mar 03, 2024
Reveal: Mar 03, 2024
Koleshjr
Multimedia University of Kenya
CPU Inference sucks!
Platform · 3 Mar 2024, 17:37 · 7

Hello @avt_nyanja @Zindi

What is the essence of CPU inference? First, you are going to get extremely slow responses; second, you have to quantize your models, which leads to accuracy loss.

Case study based on my experiments:

  • 8-bit quantized model: approximately 1 hr per input on the Google Colab CPU using llama.cpp
  • Original (unquantized) model on the Google Colab free-tier GPU: 500 questions in approximately 2 hrs
  • Do the maths: CPU inference on just two questions takes about as long as 500 questions on the free-tier Colab GPU
  • So, at that estimate of 1 hr per question, the 8-bit quantized model is going to take 500 hrs to finish inference. 500/24 is roughly 21 days. What?! How are these models going to be usable, or was this a research project where these models won't be used in production? (A rough timing sketch of this kind of run follows the list.)
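To make the comparison concrete, here is roughly the kind of loop being timed: a minimal sketch with llama-cpp-python, where the GGUF file name, the prompt template, and the test question are placeholders rather than my exact setup.

```python
# Minimal sketch: timing one question on CPU with an 8-bit GGUF model via llama-cpp-python.
# The model file and question are placeholders; n_threads should match your CPU cores.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q8_0.gguf",  # hypothetical 8-bit quantized GGUF file
    n_ctx=2048,
    n_threads=2,                   # free-tier Colab CPU has 2 cores
    verbose=False,
)

question = "What are the symptoms of malaria?"  # placeholder question
prompt = f"Answer concisely.\nQuestion: {question}\nAnswer:"

start = time.time()
out = llm(prompt, max_tokens=256)
print(out["choices"][0]["text"].strip())
print(f"{time.time() - start:.1f} s for one question on CPU")
```

Swapping the Q8_0 file for a Q4_K_M one gives the 4-bit variant discussed next; nothing else in the loop changes.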

Okay, the other option: quantize to 4-bit. While that reduces the inference time, we still end up at 1-3 days of inference for 500 questions.

Imagine sacrificing 2 hrs of GPU inference for at least a day of inference on a 4-bit quantized model, and hurting accuracy on top of it. Add to that the fact that these inference platforms are not mature enough, e.g. Ollama, one of the most popular platforms for CPU inference, getting stuck after some runs:

Ollama stuck after few runs · Issue #1863 · ollama/ollama (github.com), which is still an open issue.
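If you do go the Ollama route, one crude workaround for the stalls is to call the local server over HTTP with a hard timeout and a retry, along the lines of the sketch below. The model name is a placeholder and the endpoint is the usual Ollama default; treat the details as assumptions rather than a tested fix for that issue.

```python
# Sketch: calling a local Ollama server with a timeout and a simple retry,
# as a crude guard against runs that hang. Model name is a placeholder.
import requests

def ask_ollama(prompt, model="llama2", timeout_s=600, retries=2):
    for attempt in range(retries + 1):
        try:
            r = requests.post(
                "http://localhost:11434/api/generate",   # default Ollama endpoint
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=timeout_s,
            )
            r.raise_for_status()
            return r.json()["response"]
        except requests.exceptions.RequestException as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
    return None

print(ask_ollama("What are the symptoms of malaria?"))
```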

So I ask the question again: what is the essence of CPU inference?

N.B.: These numbers are based on my experiments, so others can chip into the discussion with their own numbers, since we are not all using the same approaches.

@Reacher @DanielBruintjies @AdeptSchneider22 @Saifdaoud Are you facing similar challenges with CPU inference?

Discussion · 7 answers

I agree. While building my RAG, the requirement that the solution must run locally forced me to use 4-bit with nested quantization. I used hybrid retrievers, pre- and post-processing, fine-tuned both the retriever and the generator, etc., but I still couldn't get past a certain score due to that limitation. I wonder how other people were able to get the best scores with such constraints.
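For reference, "nested quantization" here is double quantization in the bitsandbytes sense; a typical Hugging Face transformers config looks like the sketch below. The model name is a placeholder, not my actual generator, and bitsandbytes 4-bit loading normally assumes a CUDA GPU, so a 4-bit GGUF via llama.cpp is the usual route for pure CPU.

```python
# Sketch of a 4-bit nested (double) quantization config with Hugging Face transformers
# and bitsandbytes. Model name is a placeholder; bitsandbytes 4-bit normally needs a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,       # the "nested" quantization step
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # hypothetical generator
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```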

4 Mar 2024, 08:53
Upvotes 1

For us, no problem: we can answer a question (embed, retrieve, and answer) on CPU in ~2 mins (7 seconds on GPU), with no optimizations.

4 Mar 2024, 12:22
Upvotes 1
Koleshjr
Multimedia University of Kenya

With Ollama or llama.cpp?

Native PyTorch.
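Roughly this shape of pipeline, for anyone curious about the timing claim; the model names and the toy corpus in the sketch are placeholders, not the real setup.

```python
# Sketch of timing an embed -> retrieve -> answer loop on CPU with plain PyTorch +
# transformers. Model names and the toy corpus are placeholders, not the real pipeline.
import time
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM

emb_tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
emb_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
gen_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

corpus = [
    "Malaria is transmitted through the bite of infected Anopheles mosquitoes.",
    "Cholera spreads through water or food contaminated with the bacterium.",
]

def embed(texts):
    # Mean-pooled, L2-normalised sentence embeddings.
    batch = emb_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = emb_model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return F.normalize((hidden * mask).sum(1) / mask.sum(1), dim=-1)

doc_vecs = embed(corpus)

start = time.time()
question = "How is malaria transmitted?"
q_vec = embed([question])
best = corpus[int((doc_vecs @ q_vec.T).argmax())]          # nearest passage by cosine
prompt = f"Context: {best}\nQuestion: {question}\nAnswer:"
inputs = gen_tok(prompt, return_tensors="pt")
with torch.no_grad():
    answer_ids = gen_model.generate(**inputs, max_new_tokens=64)
print(gen_tok.decode(answer_ids[0], skip_special_tokens=True))
print(f"{time.time() - start:.1f} s end to end on CPU")
```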

Koleshjr
Multimedia University of Kenya

Cool

Amy_Bray
Zindi

Hi Kolesh, Reacher, Daniel, Adept, Saif,

Thank you for your input into this!

This is a research project by the AI Lab. The CPU restriction is based on the intended implementation, but you have raised good points that the team will take into consideration.

As the competition has ended, we can't amend the rules. Please submit your solutions (if you are in the top 10) based on CPU inference. You can include a second submission that uses reasonable (determined by you) inference hardware. The code review team will start reviewing the CPU inference scripts; if we see that there are challenges across the board, we will make a judgement call and allow you sufficient time to update your scripts. Note that code review does take up to 3 weeks, so please be patient, and we will contact you if needed.

This is one of our first LLM challenges and we look forward to growing with you. Thank you for raising these points and this is something we will consider in depth for future LLM challenges.

5 Mar 2024, 06:29
Upvotes 4
Koleshjr
Multimedia University of Kenya

Thank you so much @amyflorida626 for your feedback. We appreciate it.