We're supposed to build our retrieval system on publicly available documents while we wait for the actual documents, which will be released to us 2 days before the deadline. Will those documents be structured or unstructured? Should we build an ingestion pipeline that can handle both?
Most of the documents will be unstructured or semi-structured. Ideally, the ingestion pipeline should be optimized for a variety of public sector documents, including normative documents (e.g., recommendations and guidelines), legislative documents (e.g., laws, treaties, and policies), knowledge products (e.g., reports and handbooks), and strategic documents (e.g., plans and strategies).
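
As a rough illustration of a pipeline that tolerates mixed inputs, the sketch below normalizes unstructured, semi-structured, and structured files into one common record before chunking. It is only a sketch under stated assumptions: the `Record` schema, the `ingest` and `chunk` helpers, the extension-based routing, and the 500-character chunk size are all hypothetical and not part of the answer above. It also assumes text has already been extracted; real PDF or DOCX files would need a separate extraction step that is not shown.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class Record:
    """Common shape every input is normalized into (hypothetical schema)."""
    source: str
    structure: str  # "unstructured", "semi-structured", or "structured"
    text: str
    metadata: dict = field(default_factory=dict)


def _flatten_json(value, prefix=""):
    """Yield 'path: value' lines for every leaf of a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from _flatten_json(child, f"{prefix}{key}.")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from _flatten_json(child, f"{prefix}{index}.")
    else:
        yield f"{prefix.rstrip('.')}: {value}"


def ingest(filename: str, raw: str) -> Record:
    """Route by file extension (an assumption) and return a normalized Record."""
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".json":
        text = "\n".join(_flatten_json(json.loads(raw)))
        return Record(filename, "semi-structured", text)
    if suffix == ".csv":
        rows = csv.DictReader(io.StringIO(raw))
        text = "\n".join(
            "; ".join(f"{key}: {val}" for key, val in row.items()) for row in rows
        )
        return Record(filename, "structured", text)
    # Fallback: treat anything else as free text.
    return Record(filename, "unstructured", raw.strip())


def chunk(record: Record, max_chars: int = 500) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in record.text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    record = ingest("guideline.txt", "Section 1. Scope.\n\nSection 2. Definitions.")
    print(record.structure, chunk(record, max_chars=20))
```

Because every parser emits the same `Record`, the downstream chunking and indexing code does not need to know which kind of document it received, so support for new formats can be added later without touching the retrieval side.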