
Unifi Value Frameworks PDF Lifting Competition

Helping South Africa
$5 000 USD
Challenge completed over 1 year ago
Generative AI
450 joined
73 active
Start: Dec 21, 21
Close: Mar 17, 24
Reveal: Mar 17, 24
About

The data provided comes from the annual reports of companies in South Africa. These reports present information in different formats: paragraphs of text, tables of values in PDFs, or presentations with statistics embedded in charts.

The objective of this challenge is to create a solution that parses these annual reports in PDF format and extracts all statistical information relating to pre-defined Activity Metrics, so that Unifi can gain high-level information.
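
As a rough illustration of the extraction step, the sketch below pulls numeric values from lines of report text that mention a metric name. It assumes the PDF text has already been extracted by some other tool, that values sit on the same line as the metric name, and that commas are thousands separators. The function name `find_values` and the sample text are hypothetical and not part of the competition material.

```python
import re

# Digits with optional comma separators and an optional decimal part,
# not glued to letters (so the "2" in "tCO2e" is skipped).
NUMBER = re.compile(r"(?<![\w.,])\d[\d,]*(?:\.\d+)?(?!\w)")


def find_values(text, metric_name):
    """Return the numbers found on every line that mentions metric_name."""
    values = []
    for line in text.splitlines():
        if metric_name.lower() not in line.lower():
            continue
        # Drop the metric name itself so digits inside it ("Scope 1") are not counted.
        rest = re.sub(re.escape(metric_name), "", line, flags=re.IGNORECASE)
        values.extend(float(m.replace(",", "")) for m in NUMBER.findall(rest))
    return values


# Hypothetical extracted text, for illustration only.
sample = "GHG Scope 1 emissions (tCO2e) 12,345 11,020\nWater use 500"
print(find_values(sample, "Scope 1"))
```

Values embedded in charts or split across table cells would need a different approach; this only covers the plain-text case.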

Your solution must accept the date period as a parameter in the code, so that it can be run each year when a new set of reports is released without reprocessing information already extracted in previous years.
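
One possible way to honour this requirement is to pass the reporting year as an argument and skip anything extracted in an earlier run. The sketch below is a minimal illustration under assumptions: the record layout (dictionaries with `Group`, `AMKEY`, `Year` and `Value` keys), the function name `select_new_records` and the sample values are hypothetical.

```python
def select_new_records(records, report_year, already_processed=frozenset()):
    """Keep records for report_year whose (Group, AMKEY, Year) key is new."""
    fresh = []
    for rec in records:
        key = (rec["Group"], rec["AMKEY"], rec["Year"])
        if rec["Year"] == report_year and key not in already_processed:
            fresh.append(rec)
    return fresh


# Hypothetical extracted records, for illustration only.
records = [
    {"Group": "CompanyA", "AMKEY": 101, "Year": 2021, "Value": 11020.0},
    {"Group": "CompanyA", "AMKEY": 101, "Year": 2022, "Value": 12345.0},
    {"Group": "CompanyA", "AMKEY": 205, "Year": 2022, "Value": 500.0},
]
seen = {("CompanyA", 205, 2022)}  # keys stored from an earlier run
print(select_new_records(records, report_year=2022, already_processed=seen))
```

Keeping the year as a plain function argument means the same code can be rerun for the next reporting period by changing a single value.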

It is important to note that some companies do not refer to an Activity Metric by its standard name but rather by a synonym or name that is more applicable to their context. For example, one activity metric is the “GHG Scope 1 emissions” but some companies refer to this as “Scope 1” and another company might refer to it as something different. A sheet showing how each company refers to each Activity Metric is available for download.

An added benefit to the client would be if you can create a method that identifies different naming conventions so the client can use it for future iterations of the text. A simple rules-based approach might not be robust as companies might change how they refer to each activity metric each year.
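
To illustrate one alternative to exact-match rules, the sketch below scores a company's wording against the standard metric names by combining token overlap with a character-level similarity from the standard library's `difflib`. It is only a starting point under assumptions: the AMKEY numbers, the 0.7/0.3 weights, the 0.5 threshold and the function names are hypothetical, and a real solution might use embeddings or a language model instead.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Lowercase alphanumeric tokens of a metric name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def match_metric(client_name, standard_names, min_score=0.5):
    """Return the AMKEY whose standard name best matches client_name, or None."""
    query = tokens(client_name)
    best_key, best_score = None, 0.0
    for amkey, standard in standard_names.items():
        candidate = tokens(standard)
        if not query or not candidate:
            continue
        # Share of the client's words that also appear in the standard name.
        overlap = len(query & candidate) / len(query)
        # Character-level similarity, used mainly to break ties.
        ratio = SequenceMatcher(None, client_name.lower(), standard.lower()).ratio()
        score = 0.7 * overlap + 0.3 * ratio
        if score > best_score:
            best_key, best_score = amkey, score
    return best_key if best_score >= min_score else None


# Hypothetical AMKEY numbers; only the first name comes from the description above.
standard = {
    101: "GHG Scope 1 emissions",
    102: "GHG Scope 2 emissions",
    205: "Total water consumption",
}
print(match_metric("Scope 1", standard))
```

Names that fall below the threshold return `None`, which leaves room for a manual review step rather than forcing a wrong match.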

A list of all activity metrics (AMKEYs) is provided, along with annual reports from 10 companies.

The train set contains values from the years 2019-2021 and the test year is 2022. Please note that some reports will not have values for the train years, as they might only report the current year.

Files
This is a mapping between the PDF names from Data Sources.zip and the Group name mentioned in the Train and SampleSubmission files.
These are the 511 unique AMKEYs that at least one company has reported on. The column "ActivityMetric" is the standard naming convention for each AMKEY.
The train set contains values from the years 2019-2021 and the test year is 2022. Please note that some reports will not have values for the train years, as they might only report the current year.
This folder contains the PDFs you will use to train and test your solution on.
These are the "ClientMetrics". Some companies choose to refer to an AMKEY slightly differently from the convention, as it fits their use case better.
This is an example of what your submission file should look like. The order of the rows does not matter, but the names in the "ID" column must be correct.