🏥 Must-Read: CPU Inference sucks!

Malawi Public Health Systems LLM Challenge

Challenge timeline: Start Jan 24, 2024 · Close Mar 03, 2024 · Reveal Mar 03, 2024

Koleshjr

Multimedia University of Kenya

Platform · 3 Mar 2024, 17:37 · 7

Hello @avt_nyanja @Zindi

What is the point of CPU inference? First, you are going to get extremely slow responses. Second, you have to quantize your models, which leads to accuracy loss.

Case study based on my experiments:

8-bit quantized model: approximately 1 hr per input on Google Colab CPU using llama.cpp
Original (unquantized) model on Google Colab free-tier GPU: 500 questions in approximately 2 hrs
Do the maths: CPU inference on two questions takes about as long as GPU inference on all 500 questions on free-tier Colab
At 1 hr per question, the 8-bit quantized model needs 500 hrs to finish inference. 500/24 is roughly 21 days. What!!! How are these models going to be usable? Or was this a research project where the models won't be used in production?
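
The arithmetic above can be checked with a short script. This is a minimal sketch that uses only the figures quoted in the post; the function name `total_hours` and the constant names are purely illustrative, and the numbers are one participant's rough measurements, not a benchmark.

```python
# Figures quoted in the post (rough, single-participant measurements).
NUM_QUESTIONS = 500
CPU_HOURS_PER_QUESTION = 1.0   # 8-bit model, llama.cpp, Colab CPU
GPU_HOURS_TOTAL = 2.0          # unquantized model, free-tier Colab GPU


def total_hours(hours_per_question: float, num_questions: int) -> float:
    """Extrapolate total wall-clock hours from a per-question estimate."""
    return hours_per_question * num_questions


cpu_hours = total_hours(CPU_HOURS_PER_QUESTION, NUM_QUESTIONS)
cpu_days = cpu_hours / 24
gpu_seconds_per_question = GPU_HOURS_TOTAL * 3600 / NUM_QUESTIONS
slowdown = cpu_hours / GPU_HOURS_TOTAL

print(f"CPU total: {cpu_hours:.0f} h (~{cpu_days:.1f} days)")
print(f"GPU per question: {gpu_seconds_per_question:.1f} s")
print(f"CPU is ~{slowdown:.0f}x slower on these numbers")
```

On these figures the GPU run works out to about 14.4 seconds per question, so the quoted CPU setup is roughly 250 times slower.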

Okay, the other option is to quantize to 4-bit. That does reduce inference time, but we still end up at 1-3 days of inference for 500 questions.
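
To see why fewer bits tends to cost accuracy, here is a toy sketch of symmetric absmax quantization on random numbers. It is an assumption-laden illustration only: it is not the scheme llama.cpp or any other library actually uses, the "weights" are synthetic, and rounding error on weights does not translate directly into a drop in answer quality. The helper names `quantize_dequantize` and `mean_abs_error` are made up for this example.

```python
import random


def quantize_dequantize(values, bits):
    """Toy symmetric absmax quantization: snap to a signed `bits`-bit grid, then map back."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax      # one scale for the whole list
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic stand-in for weights

err8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
print(f"8-bit mean abs error: {err8:.4f}")
print(f"4-bit mean abs error: {err4:.4f}")
```

The 4-bit grid has far fewer levels than the 8-bit one, so its rounding error is noticeably larger. Real quantizers use per-block scales and other tricks to shrink this gap.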

Imagine trading 2 hrs of GPU inference for at least 1 day of inference on a 4-bit quantized model, and hurting accuracy as well. On top of that, these inference platforms are not mature enough. E.g. Ollama, one of the most popular platforms for CPU inference, gets stuck after some runs:

Ollama stuck after few runs · Issue #1863 · ollama/ollama (github.com), which is still an open issue

So I ask the question again: what is the point of CPU inference?

N.B.: These numbers are based on my own experiments, so others are welcome to chip in with their numbers, since we are not all using the same approaches.
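
For anyone who wants to share comparable numbers, one way is to time a handful of questions and extrapolate. The sketch below assumes a hypothetical `answer_fn` that wraps whatever pipeline you use; `dummy_answer` is only a placeholder so the snippet runs on its own, and `estimate_total_hours` is an illustrative name, not part of any library.

```python
import time
from statistics import mean


def estimate_total_hours(answer_fn, sample_questions, total_questions):
    """Time answer_fn on a small sample and extrapolate to the full question set."""
    latencies = []
    for question in sample_questions:
        start = time.perf_counter()
        answer_fn(question)
        latencies.append(time.perf_counter() - start)
    seconds_per_question = mean(latencies)
    return seconds_per_question, seconds_per_question * total_questions / 3600


def dummy_answer(question: str) -> str:
    """Placeholder for a real pipeline (e.g. embed, retrieve, generate)."""
    time.sleep(0.01)
    return "stub answer to: " + question


per_q, hours = estimate_total_hours(dummy_answer, ["q1", "q2", "q3"], total_questions=500)
print(f"{per_q:.3f} s per question, ~{hours:.4f} h for 500 questions")
```

A small sample gives only a rough estimate, since latency varies with prompt and answer length, so it helps to report the hardware and sample size alongside the number.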

@Reacher @DanielBruintjies @AdeptSchneider22 @Saifdaoud Are you facing similar challenges with CPU inference?

Discussion 7 answers

Mutisyaboy

I agree. While building my RAG pipeline, the requirement that the solution must run locally forced me to use 4-bit with nested quantization. I used hybrid retrievers, pre- and post-processing, fine-tuned both the retriever and the generator, etc., but I still couldn't get past a certain score because of that limitation. I wonder how other people were able to get the best scores with such constraints.

4 Mar 2024, 08:53

Upvotes 1

Reacher

For us it was no problem: we can answer a question (embed, retrieve and answer) on CPU in ~2 mins (7 seconds on GPU) with no optimizations.

4 Mar 2024, 12:22

Upvotes 1

Koleshjr

Multimedia University of Kenya

With Ollama or llama.cpp?

replied to Reacher · 4 Mar 2024, 12:39

Upvotes 0

Reacher

Native PyTorch.

replied to Koleshjr · 4 Mar 2024, 12:41

Upvotes 0

Koleshjr

Multimedia University of Kenya

Cool

replied to Reacher · 4 Mar 2024, 12:51

Upvotes 0

Amy_Bray

Zindi

Hi Kolesh, Reacher, Daniel, Adept, Saif,

Thank you for your input into this!

This is a research project by the AI Lab. The CPU restriction is based on implementation but you have raised good points that the team will take into consideration.

As the competition has ended, we can't amend the rules. Please submit your solutions (if in the top 10) based on CPU inference. You can include a second submission that uses reasonable (determined by you) inference hardware. The code review team will start reviewing the CPU inference scripts. If we see that there are challenges across the board, we will make a judgement call and allow you sufficient time to update your scripts. Note that code review can take up to 3 weeks, so please be patient; we will contact you if needed.

This is one of our first LLM challenges and we look forward to growing with you. Thank you for raising these points and this is something we will consider in depth for future LLM challenges.

5 Mar 2024, 06:29

Upvotes 4

Koleshjr

Multimedia University of Kenya

Thank you so much @amyflorida626 for your feedback. We appreciate it.

replied to Amy_Bray · 5 Mar 2024, 06:44

Upvotes 0
