The first thing I noticed was that the training and testing data were split across multiple CSV files, so I wrote a script to concatenate the CSVs in the training folder and the CSVs in the test folder.
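A minimal sketch of that concatenation step, assuming the per-split CSVs live in folders (the folder names below are placeholders, not the actual competition layout):

```python
import glob
import pandas as pd

def concat_folder(folder: str) -> pd.DataFrame:
    """Read every CSV inside `folder` and stack them into one DataFrame."""
    files = sorted(glob.glob(f"{folder}/*.csv"))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Hypothetical folder names; adjust to the real directory structure.
# df_train_raw = concat_folder("train")
# df_test_raw = concat_folder("test")
```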
EDA:
I filtered the data for a single NE ID and observed that its history was broken across multiple sample IDs: each sample contained some past no-fault data points, then a fault at which we had to predict the data rate change. After that fault hour, there would again be some fault-free hours, then another fault, and so on. This made me believe that the past data of an NE ID is not confined to any single sample ID, so we should gather all available data for each NE ID.
Thus I sorted the whole data by NE ID and endtime in ascending order to utilize the complete history; from then on, the ID column was used only for submission, not for any analysis.
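That sort can be sketched as follows; the column names "ne_id" and "endtime" are assumptions standing in for the dataset's actual columns:

```python
import pandas as pd

def order_by_ne_history(df: pd.DataFrame) -> pd.DataFrame:
    """Sort so each NE ID's rows form one chronological history."""
    df = df.copy()
    df["endtime"] = pd.to_datetime(df["endtime"])
    return df.sort_values(["ne_id", "endtime"]).reset_index(drop=True)
```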
The second thing I observed was that a data point for the hour immediately before the fault was not guaranteed: we also had to predict for fault times where the last available data point was more than 9 hours earlier. This gave me the intuition that, along with the last values themselves, how far back in time those values were is also important.
The third thing I observed was that the values
'access_success_rate', 'resource_utilition_rate', 'TA', 'bler', 'cqi', 'mcs'
were missing for data points where a fault occurred in the test data, whereas they were present in the training data. We had to keep this in mind while building the model to avoid overfitting on training.
DATA PROCESSING POST EDA:
From the above observations and from the train and test metadata, I created df_train_meta_features / df_test_meta_features: the data from which I would create features, rather than the complete metadata (to avoid leakage). The train and test meta features contained only rows where fault_duration is 0. Then df_train was created, the binary target variable was added, and all columns unavailable in the test data at fault time were dropped from df_train. df_test was created similarly, with a target column.
Finally, a column called “data” was added to df_train_meta_features, df_test_meta_features, df_train, and df_test, with values no_fault, no_fault, train_fault, and test_fault respectively.
All four of the above datasets were combined into df_combined, with all columns not present in the test data removed, for train/test homogeneity. Note: this was the first approach; had it not worked, the fallback idea was to forecast the independent variables for both train and test from previous data. That was not required, and it would also have compounded errors.
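A toy sketch of the tagging-and-combining step; all frame contents and column names here are placeholders, not the real data:

```python
import pandas as pd

# Placeholder frames standing in for the four real ones.
df_train_meta = pd.DataFrame({"ne_id": ["A"], "data_rate": [1.2], "cqi": [9.0]})
df_test_meta = pd.DataFrame({"ne_id": ["B"], "data_rate": [0.8], "cqi": [7.0]})
df_train = pd.DataFrame({"ne_id": ["A"], "data_rate": [1.1], "cqi": [8.5]})
df_test = pd.DataFrame({"ne_id": ["B"], "data_rate": [0.9]})  # KPIs missing at fault

tagged = [
    df_train_meta.assign(data="no_fault"),
    df_test_meta.assign(data="no_fault"),
    df_train.assign(data="train_fault"),
    df_test.assign(data="test_fault"),
]
# Keep only the columns available in the test fault rows (plus the tag)
# so train and test stay homogeneous.
keep = list(df_test.columns) + ["data"]
df_combined = pd.concat(tagged, ignore_index=True)[keep]
```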
Finally, from this df_combined, two separate dataframes were created: df_total and df_total_resampled (hourly, per NE ID). These two dataframes were used for all feature engineering.
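A sketch of the hourly per-NE-ID resample behind df_total_resampled, again with assumed column names ("ne_id", "endtime"):

```python
import pandas as pd

def resample_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Resample each NE ID's history onto an hourly grid (mean of numerics)."""
    return (df.set_index("endtime")
              .groupby("ne_id")
              .resample("1h")
              .mean(numeric_only=True)
              .reset_index())
```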
Modelling Notebooks
Feature Engineering:
What worked:
- Lagged values 1, 2, 3 for all KPIs + data_rate, plus the difference between the current sample's endtime and the lagged sample's endtime. Adding more lags made the model overfit.
- Lagged values 24, 48, 72, to give the model an idea of that particular hour's KPI + data_rate values.
- Extracting month, day of month, hour, and day of week from the current sample's endtime.
- Sin and cos transformations of the hour column, so the model sees hour 0 and hour 23 as the closest hours rather than the farthest. This makes it easier for the model to split a leaf/node on this feature and obtain better Gini impurity.
- Introducing hour, month, day of month, and day of week as categorical features.
- Groupby (NE ID, hour): descriptive statistics of data_rate + the 6 KPIs.
- Groupby (NE ID, hour-1): descriptive statistics of data_rate + the 6 KPIs.
What did not work:
- Time since last fault for each current sample.
- Finding the nearest neighbour of each NE ID based on the last 10 data rates, then using the nearest neighbour's lagged values. I could not find neighbours with good accuracy, which resulted in noisy features. You should create this feature if you have latitude/longitude data in real life; my post-modelling error analysis suggested it could have improved the F1 score by at least 5%.
- Differences of consecutive lagged values for 1, 2, 3 and 24, 48, 72. This feature was also overfitting; it seems the model was already able to capture it from the lagged values.
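Several of the features described above (lags, time-since-lag deltas, calendar fields, cyclical hour, groupby statistics) can be sketched in one helper. Column names are assumptions, and note that in the real pipeline the groupby statistics would come from the no-fault meta features to avoid leakage, whereas this toy computes them from the same frame:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed columns: ne_id, endtime (datetime), data_rate (numeric)."""
    df = df.sort_values(["ne_id", "endtime"]).copy()
    g = df.groupby("ne_id")
    # Lagged values and how far back in time each lag actually is.
    for k in (1, 2, 3, 24, 48, 72):
        df[f"data_rate_lag{k}"] = g["data_rate"].shift(k)
        df[f"hours_since_lag{k}"] = (
            df["endtime"] - g["endtime"].shift(k)
        ).dt.total_seconds() / 3600
    # Calendar features from the current sample's endtime.
    df["hour"] = df["endtime"].dt.hour
    df["dow"] = df["endtime"].dt.dayofweek
    df["dom"] = df["endtime"].dt.day
    df["month"] = df["endtime"].dt.month
    # Cyclical encoding so hour 23 and hour 0 end up adjacent.
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    # Per-(NE ID, hour) descriptive statistics of the data rate.
    stats = (df.groupby(["ne_id", "hour"])["data_rate"]
               .agg(["mean", "std", "min", "max"])
               .add_prefix("ne_hour_data_rate_")
               .reset_index())
    return df.merge(stats, on=["ne_id", "hour"], how="left")
```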
Training data selection:
What worked: Taking, for each ID, the first endtime at which the fault occurred.
What did not work: I experimented with the first "n" samples and with "all" samples, on the intuition that their delta endtimes would differ and help the model learn the instances where the last data point was n hours away, but it seems this was already captured by the first endtimes across different IDs, so it did not help.
Feature Selection:
What worked: Removing all columns with <=1 unique value. No column had a null percentage above 75%, so none were removed based on nulls.
What did not work: Since there were only 6 lag variables, all of them had some feature importance (no features with zero importance), so removing any feature from the above set decreased the local CV.
Model selection:
What worked: Since we had lag features, there were many null values in the feature set, so I wanted to try models that handle nulls natively: at each split they route missing values to whichever child gives the better impurity. These models are: LightGBM, CatBoost, XGBoost, HistGradientBoosting.
What did not work:
Since we only had ~7200 training samples, I thought bagging models (random forest, extra trees) would work better, but they had low variance and high bias and were underfitting. Also, since the sklearn API for random forest does not support missing values, imputation was needed, and imputing with a very high/low value was leading to large errors.
One thing that could have been tried was imputing with ffill and bfill, but this would have to be done at the feature-creation stage, with a separate pipeline for the bagging models. Since I had already got good results with boosting models, I decided to try this only if needed at the end, and it ended up not being necessary.
Modelling:
What worked:
- 10-fold cross-validation for LightGBM, CatBoost, HistGradientBoosting, and XGBoost.
- Early stopping for overfitting detection in all models.
- Choosing an appropriate feature fraction for training based on the CV score, as I did not want the model to overfit on any single feature.
- Introducing a small L1 regularization term.
- The most important parameter to tune was scale_pos_weight for the best CV score. This parameter is important for imbalanced problems, or problems where the F1 score is the metric: it adds more weight for one class in the model's binary logistic loss.
- The final submission was the mean of the probability outputs of all 16 models (combinations of the feature engineering techniques above). No manual thresholding was applied to convert probabilities to binary values, as this generally overfits the leaderboard; it should be handled by scale_pos_weight during training itself.
What did not work: Since we had relatively little training data, changing model seeds also resulted in high variance in F1 scores. I tried averaging predictions over a random set of 5 seeds, but this did not improve the local or leaderboard score, so I dropped the idea and instead focused on building better models with a fixed seed of 42.
Surprisingly, the 4 LGBM models each scored 0.75+ on the private leaderboard on their own, with 2 of them above 0.765. They also had the best CV score (~0.745), but were scoring 0.73 on the public LB. All 3 of the other model types in the ensemble brought the private score down.
Very good writeup, any link to the code?
Please?
sorry won't be able to share the code before the leaderboard is finalized (9th Sep). once it is, will do that too :)
But this write-up contains everything that was coded.
Congrats again, thanks!
Can you please share the link to the code now?
@CoderBui https://github.com/ITU-AI-ML-in-5G-Challenge/Fault-Impact-Analysis-Solution-Krishna-Priya
Congratulations! Thank you for sharing your approach.
Thank you @Krishna_Priya, well-deserved first place! A lot to unpack here, and I will definitely go and try out all of these suggestions. Based on your feature importance, which features do you think were the most influential?
Lag 1 and 24 of the 6 KPI values + data rate were the most influential, along with the (NE ID, hour) grouped data rate mean and max values.
Wow - amazing. Thanks for sharing all the details. Interesting that you included consecutive rows where the time gap > 1h and had that time as a feature. I specifically discarded these rows as I wasn't sure how to handle that uneven time.
Anyhow great solution. So in the end, did you submit ensemble of GBMs or just a single one? Sometimes one GBM does really well, then I just randomly vary the LR a bit to have some ensemble rather than ensemble different GBMs.
I submitted the mean of 4 GBMs, which is my private score of 75%, whereas the LightGBM alone scored 76%, which I did not select for evaluation.
Nicely done! @Krishna_Priya, This is an amazing thought process. Thanks for sharing your approach.
Can you share the dataset for this use case?