
Kenya Clinical Reasoning Challenge

Helping Kenya
$10 000 USD
Completed (8 months ago)
Prediction
Natural Language Processing
SLM
1664 joined
440 active
Start
Apr 03, 25
Close
Jun 29, 25
Reveal
Jun 30, 25
Did anyone find a model that respects the inference-time limit of 100 ms per prompt?
Help · 13 Jun 2025, 19:31 · 8

Generating 100-300 tokens takes a few seconds with every model I tried.
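To see why this budget is so hard to hit, a quick back-of-the-envelope check (a sketch; the 100 ms budget and 100-300 token range are taken from the question above) gives the decode throughput a model would need:

```python
# Throughput required to generate 100-300 tokens within a 100 ms/prompt budget
# (numbers from the question above).

LATENCY_BUDGET_S = 0.100  # 100 ms per prompt

def required_tokens_per_second(n_tokens: int, budget_s: float = LATENCY_BUDGET_S) -> float:
    """Decode throughput needed to emit n_tokens within budget_s."""
    return n_tokens / budget_s

for n in (100, 300):
    print(f"{n} tokens in 100 ms -> {required_tokens_per_second(n):,.0f} tokens/s")
# Even the low end (1,000 tokens/s) is far beyond what a 1B-class model
# decodes on edge hardware such as a Jetson Nano.
```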

Discussion 8 answers
hark99
Self-employed

That's not possible so far for a single prompt.

13 Jun 2025, 19:43
Upvotes 0

So this rule is impossible to meet, right? What is the solution?

If they change the rules, I hope they will give us more time.

hark99
Self-employed
  • Tricky situation. According to them: "How implementable is your code in a real application? Have you taken into account that the solution will be deployed on an edge device? - 25%" and "Training should take no longer than 24 hours on a GPU similar to an NVIDIA T4 while inference should be on an NVIDIA Jetson Nano or equivalent." I think on the deployment side the inference should be 100 ms per prompt. That may be achievable if you convert the model to a particular format, such as AWQ, where accuracy drops slightly but inference is fast, a trade-off. @Amy_Bray, could you please assist?

Use TinyLlama 1.1B and fine-tune it, then use sentence-transformers, and quantize it using bitsandbytes and PEFT.

15 Jun 2025, 21:07
Upvotes 0

Very heavy to train though, right? Unless you spend on cloud GPUs.

hark99
Self-employed

TinyLlama (quantized and fine-tuned) takes 35.1 seconds for a batch of 100, which is well above 100 ms per vignette.
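That batch measurement implies the per-vignette latency directly; simple arithmetic on the numbers quoted above:

```python
# Per-vignette latency implied by the batch timing quoted above:
# 35.1 s for a batch of 100 vignettes.
batch_time_s = 35.1
batch_size = 100

per_prompt_ms = batch_time_s / batch_size * 1000
print(f"{per_prompt_ms:.0f} ms per vignette")  # 351 ms, ~3.5x over the 100 ms budget
```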

Try a T5 model (t5-small, t5-base, or t5-large),

and make sure you are running on the GPU.

I used it on Colab with an NVIDIA T4 GPU, and the mean inference time in my case after fine-tuning was around 76 seconds.

26 Jun 2025, 22:40
Upvotes 0