
Hello @avt_nyanja @Zindi
What is the point of CPU inference? First, you are going to get extremely slow responses; second, you have to quantize your models, leading to accuracy loss.
Case study based on my experiments:
Okay, the other option is to quantize to 4-bit. Even though that reduces inference time, we still end up at 1-3 days of inference for 500 questions.
Imagine trading 2 hours of GPU inference for at least 1 day of inference on a 4-bit quantized model, while also hurting accuracy. Top that off with these inference platforms not being mature enough, e.g. Ollama, one of the most popular platforms for CPU inference, getting stuck after a few runs:
Ollama stuck after few runs · Issue #1863 · ollama/ollama (github.com), which is still an open issue.
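For context, here is a minimal sketch of the kind of 4-bit CPU setup these numbers come from, using llama-cpp-python as an illustrative alternative to Ollama for running GGUF models on CPU; the model path, thread count, and prompt are placeholders, not my exact configuration:

```python
from llama_cpp import Llama

# Load a 4-bit (Q4_K_M) GGUF model entirely on CPU.
# model_path and n_threads are placeholders; adjust for your machine.
llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

# One question at a time; multiply the per-question latency by 500
# to see why the full evaluation set takes days on CPU.
out = llm(
    "Question: What does 4-bit quantization trade away?\nAnswer:",
    max_tokens=256,
    temperature=0.0,
)
print(out["choices"][0]["text"])
```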
So I ask the question again: what is the point of CPU inference?
N.B.: These numbers are based on my own experiments, so others can chip into the discussion with their numbers, since we are not all using the same approaches.
@Reacher @DanielBruintjies @AdeptSchneider22 @Saifdaoud Are you facing similar challenges with CPU inference?
I agree. While building my RAG system, the requirement that the solution must run locally forced me to use 4-bit with nested quantization. I used hybrid retrievers, pre- and post-processing, fine-tuned both the retriever and the generator, etc., but I still couldn't get past a certain score due to that limitation. I wonder how other people were able to get the best scores with such constraints.
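For anyone wondering what "4-bit with nested quantization" usually looks like in code, here is a minimal sketch of the standard Hugging Face + bitsandbytes recipe (nested quantization is what the library calls double quantization). The model name is a placeholder, not necessarily what I used, and note that bitsandbytes 4-bit kernels still require a CUDA device, so this is the local-GPU variant rather than pure CPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with nested ("double") quantization enabled.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,        # the "nested" quantization step
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder generator model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```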
For us it's no problem: we can answer a question (embed, retrieve, and answer) on CPU in ~2 minutes (7 seconds on GPU) with no optimizations.
With Ollama or llama.cpp?
Native PyTorch.
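In case it helps to compare setups, here is a bare-bones sketch of what plain-PyTorch CPU generation looks like (via transformers, with no Ollama or llama.cpp layer). The model name and prompt are placeholders, and the embedding/retrieval steps of the pipeline are not shown:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"  # placeholder; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()  # CPU, full precision, no quantization layer

prompt = "Context: <retrieved passages>\nQuestion: <question>\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```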
Cool
Hi Kolesh, Reacher, Daniel, Adept, Saif,
Thank you for your input on this!
This is a research project by the AI Lab. The CPU restriction is based on the implementation context, but you have raised good points that the team will take into consideration.
As the competition has ended, we can't amend the rules. Please submit your solutions (if you are in the top 10) based on CPU inference. You can include a second submission that uses reasonable (as determined by you) inference hardware. The code review team will start reviewing the CPU inference scripts; if we see that there are challenges across the board, we will make a judgement call and allow you sufficient time to update your scripts. Note that code review does take up to 3 weeks, so please be patient, and we will contact you if needed.
This is one of our first LLM challenges, and we look forward to growing with you. Thank you for raising these points; this is something we will consider in depth for future LLM challenges.
Thank you so much @amyflorida626 for your feedback. We appreciate it.