Eswardivi
Amrita Vishwa Vidyapeetham
Clarification about Inference Time
Platform · 17 May 2025, 13:02 · 10

What does 'Inference must be less than 100ms per vignette' mean? Does this refer to the total time taken to generate the entire output, or just the time to generate the first token?

Discussion 10 answers
hark99
Self-employed

The whole time for a single inference.
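Since the limit applies to the whole generation, a simple way to check it is to time the full call end to end rather than just the first token. A minimal sketch (the `generate_fn` placeholder stands in for whatever model call you are benchmarking; it is not part of the contest harness):

```python
import time

def measure_inference_ms(generate_fn, prompt, warmup=2, runs=10):
    """Time the FULL generation (prompt in -> complete output out),
    not just the time to first token. Returns the median wall-clock ms."""
    for _ in range(warmup):
        generate_fn(prompt)  # warm-up runs exclude one-time load/compile cost
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt)  # the entire output must be produced inside this window
        times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[len(times) // 2]  # median is more robust to outliers than mean

# Example with a trivial stand-in; swap in your real generate call:
latency = measure_inference_ms(lambda p: p.upper(), "sample vignette")
print(f"{latency:.2f} ms")
```

Using the median over several runs avoids being fooled by a single slow or fast outlier.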

17 May 2025, 13:06
Eswardivi
Amrita Vishwa Vidyapeetham

Is it possible for an SLM to achieve those time constraints?

17 May 2025, 13:08

I've been trying some things (quantization, Ollama, ...) and I don't see how it's possible, at least with Qwen 0.6.

Joseph_gitau
African center for data science and analytics

The fastest I have achieved with T5-base is 280 ms. A trial, I am sure.

T5 seems like a good choice. I think, because of the inference requirements, I'm not going to continue with this contest.

Joseph_gitau
African center for data science and analytics

Yes, I actually achieved 135 ms today.

@Joseph_gitau did you achieve that on a local PC? Just curious, I've been testing performance in Colab.

Joseph_gitau
African center for data science and analytics

Yes, it's on my local PC.

Joseph_gitau
African center for data science and analytics

I achieved an inference speed of 75.7 ms today, so I believe it's possible to stay below the constraint. I might share the notebook for reference. It has a score of 0.35.

24 May 2025, 09:08
Joseph_gitau
African center for data science and analytics

I have shared the notebook under Notebooks.