AI4D Africa’s Anglophone Research Lab Tanzania Tourism Classification Challenge
Can you use tourism survey data and ML to classify the range of expenditures a tourist spends in Tanzania?
$1 000 USD
Good for beginners

The dataset contains 24,675 rows of up-to-date information on tourist expenditure collected by the National Bureau of Statistics (NBS) in Tanzania. It was collected to gain a better understanding of the status of the tourism sector and to provide an instrument that will enable sector growth.

Your goal is to accurately classify each tourist's expenditure in Tanzania into the correct range.

Most visitors in the 25-44 age group came for leisure and holidays (53.2%) or business (18.5%), which is consistent with this group being economically more productive. Those in the 45-64 age group were more prominent in holiday making and visiting friends and relatives. The results further reveal that most visitors in the 18-24 age group came for leisure and holidays (55.3%) or volunteering (13.7%). The majority of senior citizens (65 and above) came for leisure and holidays (80.9%) or visiting friends and relatives (9.5%).

The survey covers seven departure points, namely: Julius Nyerere International Airport, Kilimanjaro International Airport, Abeid Amani Karume International Airport, and the Namanga, Tunduma, Mtukula and Manyovu border points.

The variable definitions file provides definitions of the variables found in Test.csv and Train.csv.
The sample submission file is an example of what your submission file should look like. Note that this is a table of probabilities across the six cost categories (High Cost, Higher Cost, Highest Cost, Low Cost, Lower Cost and Normal Cost).
Test.csv is the dataset to which you will apply your model to test how well it performs. The test set contains 6,169 rows of tourist information. It includes the same fields as Train.csv except for the last column. Use your model and this dataset to predict which of the six categories each tourist most likely falls into (High Cost, Higher Cost, Highest Cost, Low Cost, Lower Cost and Normal Cost).
Train.csv contains the target. This is the dataset that you will use to train your model.
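
To make the submission format more concrete, here is a minimal sketch of a baseline that assigns every test row the same probabilities, namely the share of each cost category among the training labels. It is only an illustration under stated assumptions: the ID column name `Tour_ID`, the toy labels and the toy IDs are hypothetical stand-ins and are not taken from the challenge files. Check the sample submission file for the real column names and order.

```python
import csv
import io
from collections import Counter

# The six categories named in the challenge description.
CATEGORIES = ["High Cost", "Higher Cost", "Highest Cost",
              "Low Cost", "Lower Cost", "Normal Cost"]


def class_priors(labels):
    """Return the relative frequency of each category in the training labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in CATEGORIES}


def build_submission(test_ids, priors, id_column="Tour_ID"):
    """Build CSV text with one row per test ID and one probability column per category.

    The id_column name is an assumption made for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[id_column] + CATEGORIES,
                            lineterminator="\n")
    writer.writeheader()
    for test_id in test_ids:
        row = {id_column: test_id}
        row.update(priors)  # same prior probabilities for every row
        writer.writerow(row)
    return buffer.getvalue()


# Toy stand-ins for the target column of Train.csv and the IDs in Test.csv.
train_labels = ["Low Cost", "Low Cost", "Normal Cost", "High Cost"]
test_ids = ["tour_id_1", "tour_id_2"]

priors = class_priors(train_labels)
submission_csv = build_submission(test_ids, priors)
print(submission_csv)
```

A real model would replace the shared priors with per-tourist probabilities, but the output keeps the same shape: one row per test record and six probabilities that sum to 1.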