I set aside 10% of the training dataset for validation, then applied train_test_split to the remainder. In other words, I have two datasets to validate my model: a validation set and a test set. Both give great results, around 92% macro F1-score and accuracy. The problem is that my submission score is only about 70%. Any ideas on what might be happening? I couldn't find any data leakage or overfitting; it seems like nothing is wrong. I also use sklearn pipelines to keep me from doing anything stupid.
My code is something like this:
# load training data
train_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Train.csv')
# leave 10% for validation
# leave 10% for validation (use .iloc: .loc slicing is inclusive on both ends,
# so .loc[:35685] and .loc[35685:] would both contain row 35685)
train = train_set.iloc[:35685][["Tweet_ID", "tweet", "type"]]
# keep "type" here too, otherwise the validation set has no labels to score against
validation = train_set.iloc[35685:][["Tweet_ID", "tweet", "type"]]
# load the test set
submission_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Test.csv')
# load submission file
submission_file = pd.read_csv('gender-based-violence-tweet-classification-challenge/SampleSubmission.csv')
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # build once, not on every call

def preprocess_text(text):
    # Check characters to see if they are in punctuation, then rejoin the string
    nopunc = "".join(char for char in text if char not in string.punctuation)
    # Now just remove any stopwords
    return " ".join(word for word in nopunc.split() if word.lower() not in STOPWORDS)
X = train["tweet"]
y = train["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
pipe = Pipeline([ ])  # steps elided
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
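The pipeline's steps were left out above. A minimal, self-contained sketch of what such a pipeline might look like, assuming TF-IDF features and logistic regression (the actual steps and the toy data below are my assumptions, not from the original post):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy stand-in data (the real data comes from Train.csv)
texts = ["she was threatened online", "great weather today",
         "he harassed her at work", "lovely picnic in the park"] * 10
labels = ["abuse", "none", "abuse", "none"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorize and classify inside one Pipeline so preprocessing is fit
# only on the training fold (which helps avoid leakage)
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
```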
===========================================================
UPDATE: PROBLEM SOLVED!
You need to balance the data. The dataset is heavily imbalanced (for example, one class has about 30,000 samples while another has only 200).
There are multiple techniques to balance the data.
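One of the simplest techniques is reweighting the loss rather than resampling. A minimal sketch using sklearn's `class_weight="balanced"` (the toy data and class names are illustrative assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: 30 majority vs 3 minority samples (illustrative numbers)
texts = ["ordinary tweet number %d" % i for i in range(30)] + \
        ["violent threat example %d" % i for i in range(3)]
labels = ["none"] * 30 + ["abuse"] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # "balanced" reweights each class inversely to its frequency,
    # so the rare class contributes as much to the loss as the common one
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipe.fit(texts, labels)
```

Oversampling (e.g. imbalanced-learn's SMOTE) or undersampling the majority class are the other common options.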
I already tested SMOTE(), class_weight, and even undersampling the dataset. Nothing changed.
How did you solve the problem? I have the same issue.
I have the same problem, what did you do?
Try pseudo-labelling
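For reference, pseudo-labelling means training on the labelled data, predicting on unlabelled data, keeping only high-confidence predictions, and retraining on the union. A minimal sketch with toy data (the 0.9 confidence threshold and all sample text are my assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled_texts = ["she was threatened", "nice sunny day"] * 15
labelled_y = ["abuse", "none"] * 15
unlabelled_texts = ["he threatened her", "what a sunny afternoon"] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(labelled_texts, labelled_y)

# Keep only unlabelled samples the model is confident about
proba = pipe.predict_proba(unlabelled_texts)
confident = proba.max(axis=1) >= 0.9  # threshold is an arbitrary choice
pseudo_texts = [t for t, keep in zip(unlabelled_texts, confident) if keep]
pseudo_y = pipe.predict(unlabelled_texts)[confident].tolist()

# Retrain on labelled + pseudo-labelled data
pipe.fit(labelled_texts + pseudo_texts, labelled_y + pseudo_y)
```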