Wow, what an honour, especially given the performance of my model.
One of the Polgar sisters famously complained that she never beat (at chess) a completely healthy male: each time she won, her opponent would develop flu symptoms or a headache.
My model's headache was that I did not model the impact of the petrol price correctly. During the test period petrol prices were stable, so I did not pick up the effect. But in the first month of the competition petrol prices rose dramatically and I sat with a serious model failure. Since this was a once-off, my model quickly caught up again but, after such a blow, it did not even end in the top 10.
But I got the prize for most innovative model - thank you RMB. Thank you Zindi. What a privilege. FWIW my wife thinks it's "cute", as she considers me innovative. Not sure what that means, especially since she would not allow me to choose, e.g., paint or furniture at all ...
Oh well - so what did I do? Here are some key elements of my model.
The correct approach for this, in my mind, is to use a proper multivariate econometric model. So you model the components of PPI and of CPI and you then solve them interactively, a la econometric modelling 101. Such a multivariate approach will be quite accurate and will also benefit a lot from averaging. Note that in this approach you do not model the more aggregate components; rather, you model at a disaggregate level and calculate the aggregate components using the CPI weights.
This was broadly the foundation on which I built my model.
Of course, I had to immediately trim this idea back a lot. At what level do you stop disaggregating? Then, for each disaggregated item, which external variables do you include? Here I tried to make use of some of the more exotic stuff, such as air quality, but these things are not sampled regularly. Mostly for data reasons, I decided to stick with STATSSA and FRED and used variables from these. I also discarded PPI, probably a big strategic error, but hey, this competition is mostly for fun - we are not trekking to the South Pole after all.
When I chose external variables I would, e.g., use the ZAR - this is easy, as it probably makes sense for almost every item. I also used the oil price in USD, converted it using the ZAR, and smoothed it a little, hoping that this would capture any petrol price movements. Of course, this has one glaring omission - taxes! It also does a very crude job of the timing chain from oil price increases to petrol price increases to inflation item increases, but hey, for small changes it should get it right. For big changes - not so much.
After this I also threw the whole interactive model out the window. This is probably a mistake, since an interactive model captures the interaction between the components, but from a causality point of view it is much easier to have a model without any temporal interaction. So on my right hand side I basically had just lagged variables. It solves easily and quickly - nice and stable.
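To make the "lagged variables only" idea concrete, here is a toy sketch with made-up data (the series, names and coefficient are purely illustrative, not from my model): the current value of an item is explained only by past values of a driver, so there is nothing contemporaneous to solve for.

```python
import numpy as np
import pandas as pd

# Illustrative data: a random-walk "oil in ZAR" driver and an item
# that follows it with a one-month lag (coefficient 0.5, made up)
rng = np.random.default_rng ( 0 )
oilzar = pd.Series ( np.cumsum ( rng.normal ( size = 60 ) ) )
item = 0.5 * oilzar.shift ( 1 ) + rng.normal ( scale = 0.1, size = 60 )

# Right hand side: only a lagged regressor, no contemporaneous terms
X = pd.DataFrame ( { "OILZAR_lag1" : oilzar.shift ( 1 ) } ).dropna ()
y = item.loc [ X.index ]

# Ordinary least squares with an intercept; recovers the lag coefficient
beta, *_ = np.linalg.lstsq ( np.c_ [ np.ones ( len ( X ) ), X.values ], y.values, rcond = None )
```

Because the right hand side is entirely lagged, each equation can be fitted on its own - no simultaneous system to iterate.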
The right hand side I chose for each item, at the level to which I disaggregated. For this I used STATSSA data such as retail sales or manufacturing production. Now, if this were a proper econometric model, I'd pore over each equation individually, but here I just clumped a bunch of variables together, threw out the lower-correlation ones, and used something like GBM on the remaining ones. Easy peasy ...
Finally, what helped me a lot was that, towards the end, I coded it up as a single Python program that did all the work for me. No tweaks required from month to month.
The rest, I guess, I will leave as an exercise to the reader. But below do find a few snippets to illustrate the implementation.
Data
I wrote two functions to read data downloaded from STATSSA and FRED respectively. Here you see them in action:
rts = statssa_data ( pd.read_excel ( inp_dir + "rts.xlsx" ) )
mts = statssa_data ( pd.read_excel ( inp_dir + "mts.xlsx" ) )
zar = fred_data ( pd.read_csv ( fred_dir + "EXSFUS.csv" ) )
oil = fred_data ( pd.read_csv ( fred_dir + "MCOILBRENTEU.csv" ) )
# Convert to zar
oilzar = ( zar [ "EXSFUS" ] * oil [ "MCOILBRENTEU" ] ).rename ( "OILZAR" )
Here
rts = retail trade sales from statssa
mts = motor trade sales from statssa
zar = ZAR/USD from FRED
oil = Brent from FRED
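For the FRED side, a minimal sketch of what such a loader can look like, assuming the classic FRED CSV layout (a DATE column plus one value column named after the series id) - a reconstruction of the idea, not necessarily the exact code:

```python
import pandas as pd

def fred_data ( df ) :
    # Sketch, assuming the classic FRED CSV layout: a DATE column plus
    # one value column named after the series id (e.g. EXSFUS)
    df = df.copy ()
    df [ "DATE" ] = pd.to_datetime ( df [ "DATE" ] )
    df = df.set_index ( "DATE" )
    # FRED marks missing observations with "." - coerce those to NaN
    return df.apply ( pd.to_numeric, errors = "coerce" )
```

With both series on a date index, the multiplication in the OILZAR line above aligns on dates automatically.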
Weights
Here you see me entering the CPI weights
# CPI weights
# Entered from https://www.statssa.gov.za/publications/P0141/P0141March2023.pdf
print ( "Weights" )
cpi_weights = {
"CPS00000": 100.00,
"CPS00021": 8.57,
"CPS00022": 6.73,
"CPS01000": 17.14,
"CPS01100": 15.30,
...
Below I specify one level of aggregation: clothing and footwear, made up of those two components. These will later be weighted and added to form the aggregate.
cpi_cloth_foot = [
"CPS03100",
"CPS03200"
]
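The weighting step itself is then just a weight-normalised sum of the component series. A small sketch - the helper and the weight values here are illustrative, not the published figures:

```python
# Illustrative sketch of the aggregation step - the weight values are
# made up, not the published CPI weights
cpi_weights = { "CPS03100" : 3.50, "CPS03200" : 1.50 }
cpi_cloth_foot = [ "CPS03100", "CPS03200" ]

def aggregate ( forecasts, components, weights ) :
    # group value = weighted sum of components, normalised by total weight
    total = sum ( weights [ c ] for c in components )
    return sum ( forecasts [ c ] * weights [ c ] for c in components ) / total
```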
Model
In the very abbreviated snippet below you'll see how I selected the variables to enter on the right hand side:
# Select best fitting lag and moving average from x
for i in rhs + rhs_lag :
    bestr2 = 0
    bestx = None
    # lag
    for j in range ( 6 ) :
        xl = x_data [ i ].shift ( j )
        # ma
        for k in range ( 6 ) :
            xlma = xl.rolling ( k + 1 ).mean ()
            r2 = 0
            for l in lhs :
                r = pd.DataFrame ( { "y" : y_data [ l ], "x" : xlma } ).corr ().iloc [ 0, 1 ]
                r2 += r * r
            # keep the lag / moving average combination that correlates best
            if r2 > bestr2 :
                bestr2 = r2
                bestx = xlma
This loops over all the right hand side and lagged right hand side variables and, for each, chooses the best combination out of 6 lags and 6 moving-average windows. So it finds the lag and moving average that best correlate with the left hand side and then proceeds with that variable.
dep_cloth_foot = cpi_cloth_foot
drv_cloth_foot = []
lag_cloth_foot = \
[ "ELEKTS10", "ELEKTR11", "PPC34110", "PPC34120", "PPE11000", "PPE11200", "EXSFUS", "OILZAR" ] + \
[ "MPI31100", "MPI31200", "MPI31300", "MPI31400", "MPI31600", "MPI31700", "MPI31999" ] + \
[ "MSS31100", "MSS31200", "MSS31300", "MSS31400", "MSS31600", "MSS31700", "MSS31999" ] + \
[ "MSV31100", "MSV31200", "MSV31300", "MSV31400", "MSV31600", "MSV31700", "MSV31999" ] + \
[ "PPC32000", "PPC32100", "PPC32200", "PPC32300" ] + \
[ "sales6232" ] + \
[ "sales6131" ]
Here you see the priming of the clothing and footwear section of the model. Lagged variables of electricity production and of lots of other STATSSA series will be handed to the model to select from and ultimately model the series with. The codes are mostly STATSSA series from the different publications, such as manufacturing production or retail trade sales.
Finally, the model itself. I used a weighted average of a nice selection of both univariate (mostly ARIMA) and multivariate (mostly GBM) models, and throughout the competition I basically played with this selection and with the weights a bit. While this was my main go-to area for tweaking, and while it did have a significant impact, the choice between alternative collections of univariate and multivariate models and model weights was certainly not the key that unlocked this puzzle.
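The blending step is nothing fancy - conceptually just a weighted average of the per-model forecasts. A sketch (the function name and weights are illustrative, not mine):

```python
import numpy as np

def ensemble_forecast ( forecasts, weights ) :
    # Illustrative sketch: blend per-model forecasts with normalised weights
    w = np.asarray ( weights, dtype = float )
    w = w / w.sum ()
    return sum ( wi * np.asarray ( f, dtype = float ) for wi, f in zip ( w, forecasts ) )
```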
Here you see, from an earlier version of the model, how I selected the multivariate and univariate models to use:
def make_models_mv ( rs ) :
    return [ LinearRegressionModel ( lags = 12, lags_past_covariates = 4, random_state = rs + 1 ),
             RandomForest ( lags = 12, lags_past_covariates = 4, random_state = rs + 2 ) ]

def make_models_uv ( rs ) :
    return [ ExponentialSmoothing ( seasonal_periods = 12, trend = ModelMode.ADDITIVE, seasonal = SeasonalityMode.ADDITIVE, random_state = rs + 1 ),
             ARIMA ( p = 3, d = 1, q = 1, seasonal_order = ( 1, 1, 0, 12 ), random_state = rs + 2 ),
             FourTheta ( theta = 2, seasonality_period = 12, trend_mode = TrendMode.LINEAR ) ]
The formatting is not kind to me, but if you squint you may recognize that I used linear regression and random forest (here) on the multivariate side, and exponential smoothing, ARIMA and the Theta model on the univariate side.
Congratulations Skaak on your achievement, very impressive. Thank you for sharing the approach and code.
Thanks Jaw22 - you know, when I started out, given the short horizon and short-term nature of this, I thought univariate models would do well. I mention this as I think you also used lots of ARIMA. In the very near term the fundamentals don't matter, and things like trend and cycle dominate. However, now that we are at the other end of this competition, I think that if you do the multivariate stuff properly, not as rushed as I did, then you can discard the univariate models.
Oh well - thoughts?