June Study Jam Series: Bank Transaction Volume Forecasting Challenge 💰

Helping South Africa

500 Points

Under code review

Skills you will learn: Feature Engineering, Time-series, Forecast

Start

Jun 10, 26

Jun 30, 26

Reveal

Jun 30, 26

Can you turn behavioural signals into accurate transaction forecasts?

Guided Walkthrough: https://youtu.be/3nRAkrh9p0Q

Every day, millions of Africans interact with their bank - swiping cards, making transfers, paying bills, receiving salaries. Behind each of these touch-points lies a rich behavioural signal. Understanding and anticipating customer transaction volumes is a foundational capability: it drives capacity planning, fraud detection, product development, and personalised service delivery. The question is deceptively simple - how many transactions will a given customer make in the next three months? - but the answer demands genuine data science skill.

In this challenge, you are provided with anonymised behavioural data for nearly 12,000 customers spanning up to 34 months of transaction history, monthly financial snapshots, and cleaned demographic profiles. Your task is to predict next_3m_txn_count - the total number of bank transactions each customer will make over a future three-month window (November 2015 through January 2016). This is a regression problem scored using Root Mean Squared Logarithmic Error (RMSLE), a metric that penalises large relative errors and handles the right-skewed distribution of transaction counts gracefully.

What makes this challenge compelling is its real-world texture. The data is not clean-room synthetic - it has the quirks of production banking data: high-cardinality free-text descriptions, partial nulls in income fields, seasonality effects from the November to January holiday period, and customers whose behaviour varies wildly from month to month. Success will reward thoughtful feature engineering, careful handling of temporal patterns, and models that generalise rather than memorise.
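
As one illustration of the feature engineering described above, the sketch below turns a raw transaction log into per-customer monthly count features. It is a minimal example under stated assumptions, not the organisers' method: the input layout (pairs of customer ID and 'YYYY-MM' month), the function names, and the feature names are all hypothetical, and the cutoff of October 2015 is assumed from the target window starting in November 2015. Only the Python standard library is used.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def month_index(year_month):
    """Convert 'YYYY-MM' to a running month number so months can be compared."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month) - 1


def monthly_count_features(transactions, last_month, window=3):
    """Summarise each customer's monthly transaction counts up to last_month.

    transactions: iterable of (customer_id, 'YYYY-MM') pairs, one per
    transaction (a hypothetical layout; adapt it to the real files).
    """
    end = month_index(last_month)
    counts = defaultdict(Counter)
    for customer_id, year_month in transactions:
        idx = month_index(year_month)
        if idx <= end:  # ignore anything after the cutoff to avoid target leakage
            counts[customer_id][idx] += 1

    features = {}
    for customer_id, by_month in counts.items():
        first = min(by_month)
        # Months with no activity become explicit zeros so quiet months count too.
        series = [by_month.get(m, 0) for m in range(first, end + 1)]
        features[customer_id] = {
            "months_observed": len(series),
            "mean_monthly_count": mean(series),
            "std_monthly_count": pstdev(series),  # captures month-to-month volatility
            f"count_last_{window}m": sum(series[-window:]),
        }
    return features


# Made-up toy log: two customers, assumed history ending in October 2015.
toy_log = [
    ("c1", "2015-08"), ("c1", "2015-08"), ("c1", "2015-10"),
    ("c2", "2015-10"),
    ("c2", "2015-11"),  # after the cutoff, so it is ignored
]
feats = monthly_count_features(toy_log, last_month="2015-10")
for customer, values in sorted(feats.items()):
    print(customer, values)
```

Filling inactive months with zeros matters here: without it, a customer who transacts in bursts would look more active on average than they really are.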

This challenge is more than a competition.

Challenge Resources

Slides: https://docs.google.com/presentation/d/1FYUFSc8LX42eZwDyjAbxZ-3R5lu5fVDiRq_05SQtLfw/edit?usp=sharing

Starter notebook: to improve on it, review the model parameters and tune them: https://colab.research.google.com/drive/1jVRENUlfu0Oov5BE_SIxdPWXc3poh_NH?usp=sharing

Prizes

This challenge is a learning opportunity. The award is 500 Zindi Points.

Evaluation

The error metric for this challenge is Root Mean Squared Logarithmic Error (RMSLE), implemented as RMSE on log-transformed values. See the submission instructions below for how to format your predictions.
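
To make the metric concrete, here is a small sketch of RMSLE for local validation. It is not the platform's scoring code; the function names and the sample counts are made up for illustration. It uses math.log1p from the standard library, which computes log(1 + x) for a single value, the same quantity np.log1p computes for arrays.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rmsle(y_true, y_pred):
    """RMSLE on raw (untransformed) counts: RMSE after applying log(1 + x)."""
    return rmse(
        [math.log1p(v) for v in y_true],
        [math.log1p(v) for v in y_pred],
    )


# Made-up counts, purely for illustration.
actual = [0, 9, 99]
predicted = [1, 9, 49]

# Scoring raw counts with RMSLE...
score_from_raw = rmsle(actual, predicted)

# ...matches scoring already log-transformed values with plain RMSE.
score_from_logged = rmse(
    [math.log1p(v) for v in actual],
    [math.log1p(v) for v in predicted],
)

print(score_from_raw, score_from_logged)
```

Because the error is measured on the log scale, predicting 19 when the truth is 9 costs the same as predicting 199 when the truth is 99: in both cases 1 + prediction is twice 1 + truth.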

Your submission file must contain exactly 2 columns: UniqueID and next_3m_txn_count.

Important - log-transformed submissions required: The platform scores using RMSE on log-transformed values, which is equivalent to RMSLE on raw counts. You must submit the natural log of one plus each prediction, that is log(1 + y_pred). In Python: np.log1p(y_pred). Do not submit raw predicted counts - your score will be incorrect if you do.

The order of rows does not matter, but you must include predictions for all 3,584 customers in Test.csv. Your submission should look like this:

UniqueID                                 next_3m_txn_count
6b62ce75-9823-4de6-ba7b-8b2b199df239     3.456
e193e600-a706-4bc6-8597-d5d6fb171ab5     4.321
8fd44803-12ed-46ab-a146-8496d95d1b13     2.789
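
The following sketch shows one way to build a file in this shape from raw predicted counts. It assumes the submission is a CSV file (suggested by Test.csv, but not stated outright), and the helper names submission_rows and write_submission are hypothetical. The IDs are the three shown in the sample above; the predicted counts are made up. Negative predictions are clipped to zero before the transform, which is a choice made for this example, since log(1 + x) is undefined for x <= -1.

```python
import csv
import math

HEADER = ["UniqueID", "next_3m_txn_count"]


def submission_rows(unique_ids, raw_counts):
    """Pair each ID with log(1 + predicted count), header row first."""
    if len(unique_ids) != len(raw_counts):
        raise ValueError("need exactly one prediction per UniqueID")
    rows = [list(HEADER)]
    for uid, count in zip(unique_ids, raw_counts):
        clipped = max(float(count), 0.0)  # counts cannot be negative
        rows.append([uid, math.log1p(clipped)])
    return rows


def write_submission(path, rows):
    """Write the rows to a CSV file at the given path."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


# IDs taken from the sample above; the raw predicted counts are invented.
ids = [
    "6b62ce75-9823-4de6-ba7b-8b2b199df239",
    "e193e600-a706-4bc6-8597-d5d6fb171ab5",
    "8fd44803-12ed-46ab-a146-8496d95d1b13",
]
raw_predictions = [30.7, 74.2, 15.3]

rows = submission_rows(ids, raw_predictions)
for row in rows:
    print(row)
```

Calling write_submission("submission.csv", rows) would then save the file. Before uploading, it is worth checking that the file has one row for each of the 3,584 customers in Test.csv and that the values are on the log scale rather than raw counts.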

Rules

Languages and tools: You may only use open-source languages and tools in building models for this challenge.
Who can compete: Open to all participants. To be eligible for prizes and the in-person finale, you must be a South African citizen or permanent resident, or hold a valid work permit for South Africa.
Submission Limits: 10 submissions per day, 300 submissions overall.
Team size: 0 (only individuals can compete)
Public-Private Split: Zindi maintains a public leaderboard and a private leaderboard for each challenge. The public leaderboard is scored on approximately 30% of the test dataset. The private leaderboard, revealed at the close of the challenge, is scored on the remaining 70%.
Data Sharing: CC-BY SA 4.0 license
Code sharing: Multiple accounts, or sharing of code and information across accounts not in teams, is not allowed and will lead to disqualification.
