Hello @avt_nyanja @Zindi
What is the essence of CPU inference? First, you are going to get extremely slow responses. Second, you have to quantize your models, which leads to accuracy loss.
Case study based on my experiments:
Okay, the other option is to quantize to 4-bit. Although that reduces the inference time, we still end up at 1-3 days of inference for 500 questions.
Imagine trading 2 hrs of GPU inference for at least 1 day of inference on a 4-bit quantized model, and hurting accuracy as well. Top that off with these inference platforms not being mature enough. E.g. Ollama, which is one of the most popular platforms for CPU inference, gets stuck after some runs:
Ollama stuck after few runs · Issue #1863 · ollama/ollama (github.com), which is still an open issue.
So I ask the question again: what is the essence of CPU inference?
N.B.: These numbers are based on my own experiments, so others are welcome to chip in with their numbers, since we are not all using the same approaches.
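For anyone who wants to share comparable numbers, a minimal timing sketch might look like the one below. `answer_question` is a hypothetical placeholder for whatever pipeline you actually run, and `time_per_question` is just an illustrative helper name; only the timing logic matters here.

```python
import time
from statistics import mean

def time_per_question(answer_fn, questions):
    """Return the wall-clock seconds answer_fn takes on each question."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        answer_fn(q)
        latencies.append(time.perf_counter() - start)
    return latencies

def answer_question(question: str) -> str:
    # Hypothetical placeholder: swap in your real model call here.
    return question.upper()

latencies = time_per_question(answer_question, ["q1", "q2", "q3"])
print(f"mean: {mean(latencies):.4f} s over {len(latencies)} questions")
```

Reporting the mean per-question latency together with the hardware and quantization settings should make numbers from different setups easier to compare.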
@Reacher @DanielBruintjies @AdeptSchneider22 @Saifdaoud Are you facing similar challenges with CPU inference?
I agree. While building my RAG, the requirement that the solution must run locally forced me to use 4-bit with nested quantization. I used hybrid retrievers, pre- and post-processing, fine-tuned both the retriever and the generator, etc., but I still couldn't get past a certain score because of that limitation. I wonder how other people were able to get the best scores with such constraints.
For us it's no problem: we can answer a question (embed, retrieve and answer) on CPU in ~2 mins (7 seconds on GPU) with no optimizations.
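To put these per-question figures next to the multi-day totals quoted earlier in the thread, here is a rough back-of-the-envelope sketch. It assumes the same 500-question set and strictly sequential processing, which may not match everyone's setup; the function names are only illustrative.

```python
def total_hours(seconds_per_question: float, num_questions: int = 500) -> float:
    """Total wall-clock hours if questions are answered one after another."""
    return seconds_per_question * num_questions / 3600

def seconds_per_question(total_days: float, num_questions: int = 500) -> float:
    """Average seconds per question implied by a total run time in days."""
    return total_days * 24 * 3600 / num_questions

cpu_hours = total_hours(120)  # ~2 min per question on CPU
gpu_hours = total_hours(7)    # ~7 s per question on GPU
print(f"CPU: {cpu_hours:.1f} h, GPU: {gpu_hours:.1f} h")

# The 1-3 day range reported above, expressed per question
print(f"{seconds_per_question(1):.1f} s to {seconds_per_question(3):.1f} s per question")
```

Under those assumptions, ~2 mins per question works out to roughly 17 hours for the whole set, while a 1-3 day run corresponds to roughly 3 to 9 minutes per question.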
With Ollama or llama.cpp?
Native PyTorch.
Cool
Thank you so much @amyflorida626 for your feedback. We appreciate it.