When we Create the id column using the three columns mentionned in the named of simple submission id and after merging this with this:
test = pd.merge(test, submission, how='outer', on=['CID X LOC_NUM X VENDOR']
we will get the same length of row for test and submission file but on the test set we will get 40300 missing values for others columns.
How to handle them ?
Absolutely the same issue here. 40300 rows are missing from the provided test data. Hope someone is gonna answer soon.
What I do:
break the test dataset in two parts: with nulls and without nulls
testOK = test[~test['customer_id'].isnull()]
testNull = test[test['customer_id'].isnull()]
print (test.shape)
print (testOK.shape)
print (testNull.shape)
After that I got the customer_id, loc_num and vendor for the CID X LOC_NUM X VENDOR column: (only for testNull)
# fill null columns in test after Merge - for customer_id, location_number_obj and id_obj
for i, row in testNull.iterrows():
if pd.isna(row['customer_id']):
text = row['CID X LOC_NUM X VENDOR']
x1 = text.find(' X ')
x2 = text.find(' X ', x1+2)
text1 = text[0:x1]
text2 = text[x1+2: x2]
text3 = text[x2+2:]
test.at[i,'customer_id'] = text1
test.at[i,'location_number_obj'] = text2
test.at[i,'id_obj'] = text3
Them recreate the test dataset:
frames = [testOK, testNull]
test = pd.concat(frames)
After that I apply my null column strategy for that customer, Loc_num or vendor