The first thing I noticed was that the training and testing data were split across multiple CSV files, so I wrote a script to concatenate the CSVs in the training folder and the CSVs in the test folder.
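A minimal sketch of that concatenation step, assuming the per-split CSVs live in folders (the folder names below are placeholders, not the actual competition layout):

```python
import glob
import pandas as pd

def concat_folder(folder: str) -> pd.DataFrame:
    """Read every CSV inside `folder` and stack them into one DataFrame."""
    files = sorted(glob.glob(f"{folder}/*.csv"))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Hypothetical folder names; adjust to the real directory structure.
# df_train_raw = concat_folder("train")
# df_test_raw = concat_folder("test")
```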
EDA:
I filtered the data for a single NE ID and observed that its history was broken across multiple sample IDs: each sample contained some past no-fault data points, then a fault at which we had to predict the data rate change. After that fault hour, there would again be some fault-free hours, then another fault, and so on. This made me believe that the past data of an NE ID is not confined to any single sample ID, so we should gather all available data for each NE ID.
Thus I sorted the whole data by NE ID and endtime in ascending order to utilize the complete history; from then on, the ID column was used only for submission, not for any analysis.
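That sort can be sketched as follows; the column names "ne_id" and "endtime" are assumptions standing in for the dataset's actual columns:

```python
import pandas as pd

def order_by_ne_history(df: pd.DataFrame) -> pd.DataFrame:
    """Sort so each NE ID's rows form one chronological history."""
    df = df.copy()
    df["endtime"] = pd.to_datetime(df["endtime"])
    return df.sort_values(["ne_id", "endtime"]).reset_index(drop=True)
```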
The second thing I observed was that a data point for the hour immediately before the fault was not guaranteed: we also had to predict for fault times where the last available data point was more than 9 hours earlier. This gave me the intuition that, along with the last values themselves, how far back in time those values were is also important.
The third thing I observed was that the values
'access_success_rate', 'resource_utilition_rate', 'TA', 'bler', 'cqi', 'mcs'
were missing for data points where a fault occurred in the test data, whereas they were present in the training data. We had to keep this in mind while building the model to avoid overfitting on training.
DATA PROCESSING POST EDA:
From the above observations and from the train and test metadata, I created df_train_meta_features / df_test_meta_features: the data from which I would create features, rather than the complete metadata (to avoid leakage). The train and test meta features contained only rows where fault_duration is 0. Then df_train was created, the binary target variable was added, and all columns unavailable in the test data at fault time were dropped from df_train. df_test was created similarly, with a target column.
Finally, a column called “data” was added to df_train_meta_features, df_test_meta_features, df_train, and df_test, with values no_fault, no_fault, train_fault, and test_fault respectively.
All four of the above datasets were combined into df_combined, with all columns not present in the test data removed, for train/test homogeneity. Note: this was the first approach; had it not worked, the fallback idea was to forecast the independent variables for both train and test from previous data. That was not required, and it would also have compounded errors.
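A toy sketch of the tagging-and-combining step; all frame contents and column names here are placeholders, not the real data:

```python
import pandas as pd

# Placeholder frames standing in for the four real ones.
df_train_meta = pd.DataFrame({"ne_id": ["A"], "data_rate": [1.2], "cqi": [9.0]})
df_test_meta = pd.DataFrame({"ne_id": ["B"], "data_rate": [0.8], "cqi": [7.0]})
df_train = pd.DataFrame({"ne_id": ["A"], "data_rate": [1.1], "cqi": [8.5]})
df_test = pd.DataFrame({"ne_id": ["B"], "data_rate": [0.9]})  # KPIs missing at fault

tagged = [
    df_train_meta.assign(data="no_fault"),
    df_test_meta.assign(data="no_fault"),
    df_train.assign(data="train_fault"),
    df_test.assign(data="test_fault"),
]
# Keep only the columns available in the test fault rows (plus the tag)
# so train and test stay homogeneous.
keep = list(df_test.columns) + ["data"]
df_combined = pd.concat(tagged, ignore_index=True)[keep]
```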
Finally, from this df_combined, two separate dataframes were created: df_total and df_total_resampled (hourly, per NE ID). These two dataframes were used for all feature engineering.
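A sketch of the hourly per-NE-ID resample behind df_total_resampled, again with assumed column names ("ne_id", "endtime"):

```python
import pandas as pd

def resample_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Resample each NE ID's history onto an hourly grid (mean of numerics)."""
    return (df.set_index("endtime")
              .groupby("ne_id")
              .resample("1h")
              .mean(numeric_only=True)
              .reset_index())
```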
Modelling Notebooks
Feature Engineering:
What worked:
- Lagged values 1, 2, 3 for all KPIs + data_rate, plus the difference between the current sample's endtime and the lagged sample's endtime. Adding more lags made the model overfit.
- Lagged values 24, 48, 72, to give the model an idea of that particular hour's KPI + data_rate values.
- Extracting month, day of month, hour, and day of week from the current sample's endtime.
- Sin and cos transformations of the hour column, so the model sees hour 0 and hour 23 as the closest hours rather than the farthest. This makes it easier for the model to split a leaf/node on this feature and obtain better Gini impurity.
- Introducing hour, month, day of month, and day of week as categorical features.
- Groupby (NE ID, hour): descriptive statistics of data_rate + the 6 KPIs.
- Groupby (NE ID, hour-1): descriptive statistics of data_rate + the 6 KPIs.
What did not work:
- Time since last fault for each current sample.
- Finding the nearest neighbour of each NE ID based on the last 10 data rates, then using the nearest neighbour's lagged values. I could not find neighbours with good accuracy, which resulted in noisy features. You should create this feature if you have latitude/longitude data in real life; my post-modelling error analysis suggested it could have improved the F1 score by at least 5%.
- Differences of consecutive lagged values for 1, 2, 3 and 24, 48, 72. This feature was also overfitting; it seems the model was already able to capture it from the lagged values.
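Several of the features described above (lags, time-since-lag deltas, calendar fields, cyclical hour, groupby statistics) can be sketched in one helper. Column names are assumptions, and note that in the real pipeline the groupby statistics would come from the no-fault meta features to avoid leakage, whereas this toy computes them from the same frame:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Assumed columns: ne_id, endtime (datetime), data_rate (numeric)."""
    df = df.sort_values(["ne_id", "endtime"]).copy()
    g = df.groupby("ne_id")
    # Lagged values and how far back in time each lag actually is.
    for k in (1, 2, 3, 24, 48, 72):
        df[f"data_rate_lag{k}"] = g["data_rate"].shift(k)
        df[f"hours_since_lag{k}"] = (
            df["endtime"] - g["endtime"].shift(k)
        ).dt.total_seconds() / 3600
    # Calendar features from the current sample's endtime.
    df["hour"] = df["endtime"].dt.hour
    df["dow"] = df["endtime"].dt.dayofweek
    df["dom"] = df["endtime"].dt.day
    df["month"] = df["endtime"].dt.month
    # Cyclical encoding so hour 23 and hour 0 end up adjacent.
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    # Per-(NE ID, hour) descriptive statistics of the data rate.
    stats = (df.groupby(["ne_id", "hour"])["data_rate"]
               .agg(["mean", "std", "min", "max"])
               .add_prefix("ne_hour_data_rate_")
               .reset_index())
    return df.merge(stats, on=["ne_id", "hour"], how="left")
```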
Training data selection:
What worked: Taking, for each ID, the first endtime at which the fault occurred.
What did not work: I experimented with the first "n" samples and with "all" samples, on the intuition that their delta endtimes would differ and help the model learn the instances where the last data point was n hours away, but it seems this was already captured by the first endtimes across different IDs, so it did not help.
Feature Selection:
What worked: Removing all columns with <=1 unique value. No column had a null percentage above 75%, so none were removed based on nulls.
What did not work: Since there were only 6 lag variables, all of them had some feature importance (no features with zero importance), so removing any feature from the above set decreased the local CV.
Model selection:
What worked: Since we had lag features, there were many null values in the feature set, so I wanted to try models that handle nulls natively: at each split they route missing values to whichever child gives the better impurity. These models are: LightGBM, CatBoost, XGBoost, HistGradientBoosting.
What did not work:
Since we only had ~7200 training samples, I thought bagging models (random forest, extra trees) would work better, but they had low variance and high bias and were underfitting. Also, since the sklearn API for random forest does not support missing values, imputation was needed, and imputing with a very high/low value was leading to large errors.
One thing that could have been tried was imputing with ffill and bfill, but this would have to be done at the feature-creation stage, with a separate pipeline for the bagging models. Since I had already got good results with boosting models, I decided to try this only if needed at the end, and it ended up not being necessary.
Modelling:
What worked:
- 10-fold cross-validation for LightGBM, CatBoost, HistGradientBoosting, and XGBoost.
- Early stopping for overfitting detection in all models.
- Choosing an appropriate feature fraction for training based on the CV score, as I did not want the model to overfit on any single feature.
- Introducing a small L1 regularization term.
- The most important parameter to tune was scale_pos_weight for the best CV score. This parameter is important for imbalanced problems, or problems where the F1 score is the metric: it adds more weight for one class in the model's binary logistic loss.
- The final submission was the mean of the probability outputs of all 16 models (combinations of the feature engineering techniques above). No manual thresholding was applied to convert probabilities to binary values, as this generally overfits the leaderboard; it should be handled by scale_pos_weight during training itself.
What did not work: Since we had relatively little training data, changing model seeds also resulted in high variance in F1 scores. I tried averaging predictions over a random set of 5 seeds, but this did not improve the local or leaderboard score, so I dropped the idea and instead focused on building better models with a fixed seed of 42.
Surprisingly, the 4 LGBM models each scored 0.75+ on the private leaderboard on their own, with 2 of them above 0.765. They also had the best CV score (~0.745), but were scoring 0.73 on the public LB. All 3 of the other model types in the ensemble brought the private score down.
Very good writeup, any link to the code?
Please?
sorry won't be able to share the code before the leaderboard is finalized (9th Sep). once it is, will do that too :)
But this write-up contains everything that was coded.
Congrats again, thanks!
Can you please share the link to the code now?
@CoderBui https://github.com/ITU-AI-ML-in-5G-Challenge/Fault-Impact-Analysis-Solution-Krishna-Priya
Congratulations! Thank you for sharing your approach.
Thank you @Krishna_Priya, well-deserved first place! A lot to unpack here, and I will definitely go and try out all of these suggestions. Based on your feature importance, which features do you think were the most influential?
Lag 1 and 24 of the 6 KPI values + data rate were the most influential, along with the (NE ID, hour) grouped data rate mean and max values.
Wow - amazing. Thanks for sharing all the details. Interesting that you included consecutive rows where the time gap > 1h and had that time as a feature. I specifically discarded these rows as I wasn't sure how to handle that uneven time.
Anyhow great solution. So in the end, did you submit ensemble of GBMs or just a single one? Sometimes one GBM does really well, then I just randomly vary the LR a bit to have some ensemble rather than ensemble different GBMs.
I submitted the mean of 4 GBMs, which is my private score of 75%, whereas the LightGBM alone scored 76%, which I did not select for evaluation.
Nicely done! @Krishna_Priya, This is an amazing thought process. Thanks for sharing your approach.
Can you share the dataset for this use case?