tricky situation. According to them, """How implementable is your code in a real application? Have you taken into account that the solution will be deployed on an edge device? - 25%""" and """Training should take no longer than 24 hours on a GPU similar to an NVIDIA T4, while inference should be on an NVIDIA Jetson Nano or equivalent.""" I think on the deployment side, inference should be under 100 ms per prompt. That may be achievable if you convert the model into a quantized format such as AWQ, where quality drops slightly but inference is faster, a trade-off. @Amy_Bray, could you please assist?
That's not possible so far for a single prompt.
So this rule is impossible to meet, right? What is the solution?
If they change the rules, I hope they will give us more time.
Use TinyLlama 1.1B and fine-tune your model, then use SentenceTransformers; quantize it using bitsandbytes and PEFT.
Very heavy to train though, right? Unless you spend on cloud GPUs.
TinyLlama (quantized and fine-tuned) takes 35.1 seconds on a batch of 100, which works out to roughly 351 ms per vignette, well above the 100 ms target.
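For the back-of-envelope check, batch time divided by batch size gives the per-item latency:

```python
# 35.1 s for a batch of 100 vignettes -> per-vignette latency in ms
batch_seconds = 35.1
batch_size = 100
per_vignette_ms = batch_seconds / batch_size * 1000
print(f"{per_vignette_ms:.0f} ms per vignette")  # prints "351 ms per vignette"
```

So even with quantization, this setup is about 3.5x over the 100 ms budget.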
Try a T5 model (t5-small, t5-base, or t5-large)
and ensure that you are using the GPU
I used it on Colab with an NVIDIA T4 GPU, and the mean inference time in my case after fine-tuning was around 76 seconds.