Hi Everyone, I am getting a lot of messages regarding problem formulation, so instead of replying to everyone, I am mentioning it here.
To prepare the test data, take the hint from the sample submission. Drop duplicate customers from the test customer file. Merge the test customers with the test locations, and then, for each row of the result, add all rows of vendors. Then create the ID column from the three columns named in the sample submission's ID column. Try to make the number of rows in the test set match the sample submission.
Follow the same steps for the train. This should get you started.
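A minimal sketch of the steps above on tiny made-up frames. The column names (`customer_id`, `location_number`, `vendor_id`) and the " X " separator are assumptions for illustration only; take the real ones from the sample submission's ID column. Which duplicate row to keep is also an assumption here (the first one).

```python
import pandas as pd

# Toy stand-ins for the competition files; all column names are assumed.
test_customers = pd.DataFrame({
    "customer_id": ["A1", "A1", "B2"],  # A1 appears twice (duplicate record)
    "verified": [1, 0, 1],
})
test_locations = pd.DataFrame({
    "customer_id": ["A1", "A1", "B2"],
    "location_number": [0, 1, 0],
})
vendors = pd.DataFrame({"vendor_id": [4, 13]})

# 1. Keep one row per customer (here: the first occurrence).
customers = test_customers.drop_duplicates(subset="customer_id")

# 2. Attach every location of each customer.
cust_loc = customers.merge(test_locations, on="customer_id", how="left")

# 3. Pair each customer-location row with every vendor (needs pandas >= 1.2).
test = cust_loc.merge(vendors, how="cross")

# 4. Build the ID from the three key columns.
test["id"] = (
    test["customer_id"].astype(str)
    + " X " + test["location_number"].astype(str)
    + " X " + test["vendor_id"].astype(str)
)

print(len(test))  # 3 customer-location rows * 2 vendors = 6
```

Comparing `len(test)` with the number of rows in the sample submission is a quick check that the steps were applied as intended.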
Thank you, Krishna. This will be helpful!
And what about "orders"?
Use it to make the target. If a train ID is present in orders, the target is 1; if not, it is 0.
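One possible way to build that target, assuming the train frame already has an ID column and that orders carries the same three key columns. All names below are hypothetical.

```python
import pandas as pd

# Toy data; column names and the ID format are assumptions.
train = pd.DataFrame({"id": ["A1 X 0 X 4", "A1 X 0 X 13", "B2 X 0 X 4"]})
orders = pd.DataFrame({
    "customer_id": ["A1", "A1"],
    "location_number": [0, 0],
    "vendor_id": [13, 13],  # the same pair can be ordered more than once
})

# Build the same ID in orders, then flag the train rows whose ID appears there.
order_ids = (
    orders["customer_id"].astype(str)
    + " X " + orders["location_number"].astype(str)
    + " X " + orders["vendor_id"].astype(str)
)
train["target"] = train["id"].isin(order_ids).astype(int)

print(train)
```

Using `isin` instead of a merge leaves the number of train rows unchanged, even when orders contains the same customer, location and vendor several times.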
I have a problem with the problem formulation you brilliantly described above, specifically the 'for each row of test add all rows of vendors' line. Let's assume I want to achieve that using a pandas merge or cross join of the two dataframes. What column should I use as the key, given that they share no common column? I'm really confused about that as a beginner, and I'd appreciate any help.
This will help :)
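For reference, a cross join does not need a shared column. A small sketch with made-up frames (the column names are assumptions): recent pandas versions (1.2 and later) accept `how="cross"`, and on older versions a constant helper column gives the same result.

```python
import pandas as pd

left = pd.DataFrame({"customer_id": ["A1", "B2"]})
vendors = pd.DataFrame({"vendor_id": [4, 13, 20]})

# pandas >= 1.2: a cross join needs no key at all.
crossed = left.merge(vendors, how="cross")

# Older pandas: add a constant helper column to both sides and merge on it.
crossed_old = (
    left.assign(_key=1)
    .merge(vendors.assign(_key=1), on="_key")
    .drop(columns="_key")
)

print(len(crossed))  # 2 rows * 3 vendors = 6
```

Either way the result has `len(left) * len(vendors)` rows, so it is worth estimating that product before running the join on the full data.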
Thank you very much for the prompt clarification.
I have tried :(
MemoryError: Unable to allocate 60.0 GiB for an array with shape (8050934409,) and data type int64
Hi, have you tried downcasting the dtypes (downcast_dtypes)?
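A sketch of what such a helper could look like. `downcast_dtypes` is not a pandas function; the version below is a hypothetical helper built on `pd.to_numeric`, and the example columns are made up.

```python
import numpy as np
import pandas as pd

def downcast_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns shrunk to the smallest safe dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical columns, for illustration only.
df = pd.DataFrame({"vendor_id": [4, 13, 20], "rating": [4.5, 3.0, 5.0]})
small = downcast_dtypes(df)
print(small.dtypes)
```

Note that the error above asks for about 8 billion int64 values. Even at one byte per value that is roughly 7.5 GiB, so downcasting alone may not be enough; a result that large may point to an unintended many-to-many merge, and the row count is worth checking first.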
Hey, read the instructions carefully. I never said to merge any dataframe with orders.
Sorry, it's my mistake.
If I understood correctly, it goes as follows:
1st - left merge train_customer and train_location
2nd - merge the result with Train Orders?
After that, I did not understand how we get the target :(
Thanks for your help, guys.
2nd - for each row of the result, add all rows of vendors. Read the instructions carefully.
Hi, I don't see why duplicate customers have to be dropped. A customer is duplicated because the same customer appears at another restaurant. If we eliminate the duplicates, we will predict the probability for only one restaurant. Another thing that is not clear to me in your methodology is the rating ("vendor_rating"). After all, collaborative filtering, for example, makes use of such metrics to build similarity matrices. PS: In fact, I'm not sure it is necessary to use a binary variable (0 and 1) as the target. Regards.
I am stuck on your 2nd line itself. Look at the data before writing. The test customer CSV has nothing to do with restaurants; it is all information about the customer. And when you finally spend some time with the data, you will realise that the same customer has both a verified and an unverified record, and the duplicate should be removed. Otherwise you won't get the same number of rows as the sample submission. But again, for this you have to spend time with the data before commenting.
Some of the customer IDs (CID) in the sample submission file are not present in the test_customer file.
Hi Suraj, yes. But the full test customer detail is obtained when you merge test customer and test location, after removing duplicates as I have mentioned. After merging, if you take the set intersection of the merged IDs and the submission file's IDs, you will find all the sample submission IDs in the test set.
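A quick way to run that check, sketched with toy IDs (the ID format shown is an assumption). In practice the two series would come from the merged test frame and the sample submission.

```python
import pandas as pd

# Toy IDs, for illustration only.
test_ids = pd.Series(["A1 X 0 X 4", "A1 X 0 X 13", "B2 X 0 X 4"])
sample_ids = pd.Series(["A1 X 0 X 4", "B2 X 0 X 4"])

common = set(sample_ids) & set(test_ids)   # IDs present in both
missing = set(sample_ids) - set(test_ids)  # sample IDs the test frame lacks

print(len(common), len(missing))
```

An empty `missing` set means every ID in the sample submission was reproduced by the merge.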
I get a large difference between my validation error and my test error, even though I applied the same formatting to both the train and test data as you described. Do you know what the reason might be?