What does 'Inference must be less than 100ms per vignette' mean? Does this refer to the total time taken to generate the entire output, or just the time to generate the first token?
I achieved an inference speed of 75.7 ms today, so I believe it's possible to stay below the constraints. I might share the notebook for reference; it has a score of 0.35.
The whole time for a single inference.
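For anyone benchmarking against the 100 ms budget, here is a minimal sketch of how total per-vignette latency (wall-clock time for the whole generation, not time to first token) can be measured. The `generate` function below is a hypothetical stand-in; replace it with your actual model call.

```python
import time

def generate(vignette: str) -> str:
    # Hypothetical placeholder for the real model call
    # (e.g. a tokenizer + model.generate pipeline).
    return vignette.upper()

def latency_ms(fn, *args, warmup: int = 3, runs: int = 10) -> float:
    # Warm-up calls avoid counting one-time costs (cache fills, lazy init).
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    # Average wall-clock time per call, in milliseconds.
    return (time.perf_counter() - start) * 1000 / runs

print(f"{latency_ms(generate, 'sample vignette'):.1f} ms per vignette")
```

Averaging over several runs after a warm-up gives a more stable number than timing a single call, which matters when you are this close to the limit.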
Is it possible for an SLM to meet those time constraints?
I've been trying some things (quantization, Ollama, ...) and I don't see how it's possible, at least with Qwen 0.6B.
The fastest I have achieved with T5-base is 280 ms. It's a trial, I'm sure.
T5 seems like a good choice. Given the inference requirements, though, I don't think I'm going to continue with this contest.
Yes, I actually achieved 135 ms today.
@Joseph_gitau did you achieve that on a local PC? Just curious, I've been testing performance in colab.
Yes, it's on my local PC.
I have shared the notebook under the Notebooks section.