Participants are free to use any appropriate open-source dataset or to curate their own as long as this does not violate any data privacy regulations that apply.
Below are some databases and libraries that could be used to assemble custom datasets for fine-tuning text embedding models and for running preparatory validation tests:
Any dataset used for fine-tuning embedding models or LLMs must be submitted along with the proposed solution and made open-source, for reasons of transparency.
The test set will be made available on 16 May 2024 at 23:59 GMT. You will then have 24 hours to run inference and upload your submission. Note that only your most recent submission will be considered for evaluation.
You need to submit ONE .ZIP file that contains the following:
- Code that creates your solution
- Documentation
- Submission CSV, in the format of the SampleSubmission provided.
Please ensure your .ZIP file is smaller than 30 MB.
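The packaging step above can be checked locally before uploading. Below is a minimal sketch, not an official tool: the function name `build_submission_zip` and the file names in the usage comment are hypothetical, and it assumes "30 MB" means 30 * 1024 * 1024 bytes (the organisers may count differently, so leave some headroom).

```python
import zipfile
from pathlib import Path

# Assumption: the 30 MB limit is interpreted as 30 MiB.
MAX_BYTES = 30 * 1024 * 1024


def build_submission_zip(files, zip_path, max_bytes=MAX_BYTES):
    """Pack the given files into one ZIP and return its size in bytes.

    `files` is an iterable of paths to regular files (not directories).
    Each file is stored at the top level of the archive under its own name.
    Raises ValueError if the archive is not smaller than `max_bytes`.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            zf.write(f, arcname=f.name)

    size = zip_path.stat().st_size
    if size >= max_bytes:
        raise ValueError(
            f"{zip_path.name} is {size} bytes; it must be smaller than {max_bytes} bytes"
        )
    return size


# Example usage (hypothetical file names):
# build_submission_zip(
#     ["solution.py", "README.md", "submission.csv"],
#     "submission.zip",
# )
```

If the check fails, the largest contributors are usually model weights or cached datasets, which may be better shared by link if the rules allow it.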