Categorical (nominal) variable encoding done wrong???
published 30 Sep 2019, 09:43
edited 1 minute later

I've seen some top solutions for a number of competitions here using LabelEncoder for nominal variables. Just as a reminder, a nominal variable is a variable whose categories have no order between its levels, e.g. Cat, Dog, Mouse.

Are the teachings of top data scientists, who advise against encoding nominal variables with label/ordinal encoders, wrong? Do tree-based models somehow manage to handle it internally? Or is it just pure luck that the categories roughly match an order with respect to the target? Thanks.

If the categorical feature is multi-class, LabelEncoder will return a different value for each class.

Pandas' get_dummies method is a very straightforward, one-step way to get dummy variables for categorical features. The advantage is that you can apply it directly to the DataFrame, and it will recognize the categorical columns and perform the get-dummies operation on them.
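A minimal sketch of that one-step usage (toy data; the column names here are just for illustration):

```python
import pandas as pd

# Hypothetical frame: one nominal column, one numeric column.
df = pd.DataFrame({
    "animal": ["Cat", "Dog", "Mouse", "Dog"],
    "weight": [4.0, 12.0, 0.03, 9.5],
})

# get_dummies leaves numeric columns untouched and expands
# object/categorical columns into one indicator column per level.
encoded = pd.get_dummies(df)
print(sorted(encoded.columns))
# -> ['animal_Cat', 'animal_Dog', 'animal_Mouse', 'weight']
```

Note that there is no need to tell it which columns are categorical; it picks up the object-typed column on its own.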

First of all, LabelEncoder is meant for encoding the target variable, not the features; you can check the scikit-learn documentation on that. Second, label encoding turns a feature into an array where 1 < 2 < 3 < 4, in other words it artificially creates an order between the levels of a variable when there is none, as in the animal example I gave.
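To make the point concrete, here is a small sketch with scikit-learn's LabelEncoder (toy data):

```python
from sklearn.preprocessing import LabelEncoder

animals = ["Cat", "Dog", "Mouse", "Dog"]

le = LabelEncoder()
codes = le.fit_transform(animals)

# Levels get integers in alphabetical order:
# Cat -> 0, Dog -> 1, Mouse -> 2
print(codes.tolist())  # -> [0, 1, 2, 1]

# Nothing about the animals justifies it, but numerically the
# encoded feature now claims Cat < Dog < Mouse.
```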

Can you please be more specific? Are you talking about encoding the target variable or the input features with the LabelEncoder?

Features. I know you'll say that LabelEncoder is meant for encoding the target variable; read my next comment so you understand my point better.

Ah! In that case: it is highly inadvisable! As far as my knowledge goes, you shouldn't encode categories as numbers; you should stick with one-hot encoding. The only "exception" I've seen to this rule is in text classification when using the hashing trick.

I highly doubt that tree-based approaches handle this internally. As far as my knowledge goes, they were originally designed to handle categorical features; numerical features are effectively converted into categorical ones when the rules are established (their range is binned and turned into rules of the form "x < value").

Kaggle-style challenges are, in my honest opinion, the worst way to "learn" data science. Many of the strategies you see in winning solutions are actually bad practices with no scientific soundness to them.

I was doubting that. What I'm seeing as winning practices are bad practices when I read the theory, so I had to ask the question to clear it up: did I misunderstand what a nominal variable is, or what LabelEncoder does, or is it just luck to get those results with a bad approach? Thanks for the answer.

Let's say: Cat -> 0, Dog -> 1, Mouse -> 2.

What will most likely happen is the following: if the "Dog" category is useful for solving your problem, your tree will need a rule like "x > 0.5" followed by a subsequent rule "x < 1.5" (whereas it would have had a single, simpler rule "x == Dog" if you hadn't encoded it this way).
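A quick sketch of those two threshold rules versus the single one-hot rule (toy data, using the Cat/Dog/Mouse encoding above):

```python
import numpy as np

# Label-encoded feature: Cat -> 0, Dog -> 1, Mouse -> 2
x = np.array([0, 1, 2, 1, 0, 2])

# Isolating "Dog" on the encoded axis takes two threshold rules...
dog_via_thresholds = (x > 0.5) & (x < 1.5)

# ...whereas a one-hot "is Dog" indicator is a single rule.
dog_one_hot = (x == 1)

# Both select the same rows, but the tree had to spend two splits
# to express what one binary column expresses directly.
assert (dog_via_thresholds == dog_one_hot).all()
```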

Your understanding of nominal variables and of LabelEncoder is correct. And again: yes, most winning practices are not theoretically sound; most of the time it's the contrary. I've seen winning solutions in which feature engineering mainly consisted of calculating aggregates (means, standard deviations, etc.) on the whole training dataset before conducting the CV/model selection. The evaluation is then conducted on a test dataset from the same period as the training dataset. This means that (i) there is clear overfitting to the data, and (ii) in a real-world scenario, the aggregates that led to the winning solution aren't even computable at prediction time.
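The leakage described above can be sketched like this (toy data; the "shop"/"sales" names are purely hypothetical):

```python
import pandas as pd

# Hypothetical dataset: a categorical key and a numeric target.
train = pd.DataFrame({"shop": ["A", "A", "B"], "sales": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"shop": ["A", "B"], "sales": [100.0, 200.0]})

# Leaky: aggregate over ALL rows, including the ones being evaluated,
# before doing CV/model selection.
full = pd.concat([train, test], ignore_index=True)
leaky_means = full.groupby("shop")["sales"].mean()

# Sound: aggregate on the training rows only, then map onto the test rows.
train_means = train.groupby("shop")["sales"].mean()
test_feature = test["shop"].map(train_means)

# The leaky feature has peeked at the evaluation data:
# leaky_means["A"] != train_means["A"]
print(leaky_means["A"], train_means["A"])
```

At real prediction time only the train-side aggregate exists, which is the sense in which the leaky version isn't even computable in production.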

@Blenz good point, I only recently discovered this (the order LabelEncoder introduces) while working on a certain project, but had never really paid attention to it..