The data provided comes from the annual reports of companies in South Africa. The reports are in PDF format and present information in different ways: as paragraphs of text, as values in tables, or as presentation-style pages with statistics embedded in charts.
The objective of this challenge is to create a solution that parses these annual reports in PDF format and extracts all statistical information relating to pre-defined Activity Metrics, so that Unifi can gain high-level information.
Your solution needs to accept the date period as a parameter in the code, so that it can be run each year when a new set of reports is released without re-extracting information already captured in previous years.
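One way to read the date-period requirement is sketched below in Python. This is only an illustration under stated assumptions: the `MetricValue` record, the `select_new_values` function, and the idea of tracking stored values as `(company, amkey, year)` keys are all hypothetical and not part of the challenge materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricValue:
    """Hypothetical record for one extracted statistic."""
    company: str
    amkey: str
    year: int
    value: float


def select_new_values(extracted, target_year, already_stored=frozenset()):
    """Keep only values for target_year that have not been stored before.

    already_stored is an assumed set of (company, amkey, year) keys
    collected during earlier runs.
    """
    fresh = []
    for record in extracted:
        key = (record.company, record.amkey, record.year)
        if record.year == target_year and key not in already_stored:
            fresh.append(record)
    return fresh
```

With this shape, running the solution for a new year means changing only `target_year`, and values already captured in earlier runs are skipped rather than extracted again.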
It is important to note that some companies do not refer to an Activity Metric by its standard name but rather by a synonym or a name that is more applicable to their context. For example, one Activity Metric is “GHG Scope 1 emissions”, but some companies refer to it as “Scope 1”, and other companies might use a different name altogether. A sheet showing how each company refers to each Activity Metric is available for download.
An added benefit to the client would be a method that identifies different naming conventions, which the client can reuse in future iterations. A simple rules-based approach might not be robust, as companies might change how they refer to each Activity Metric from year to year.
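As a starting point for identifying naming variants, the sketch below scores a label found in a report against known names using string similarity from Python's standard `difflib` module. It is a baseline under assumptions, not a prescribed method: the `synonyms` mapping, the function names, and the `0.6` threshold are hypothetical, and character-level similarity will miss synonyms that share little wording with the standard name.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse punctuation into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_metric_match(label, synonyms, threshold=0.6):
    """Return (standard_name, score) for the closest known metric name.

    synonyms maps each standard name to a list of known variants
    (an assumed structure, for example built from the provided sheet).
    Returns (None, score) when no candidate reaches the threshold.
    """
    best_name, best_score = None, 0.0
    target = normalise(label)
    for standard_name, variants in synonyms.items():
        for variant in [standard_name, *variants]:
            score = SequenceMatcher(None, target, normalise(variant)).ratio()
            if score > best_score:
                best_name, best_score = standard_name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score
```

Labels that fall below the threshold could be queued for manual review and, once confirmed, added to the mapping so that the next year's run recognises them.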
A list of all Activity Metrics (AMKEYs) is provided, along with annual reports from 10 companies.
The training set consists of values from the years 2019-2021, and the test year is 2022. Please note that some reports will not have values for every training year, as they might only report the current year.