I've written a few posts on getting going with this contest. Part 1 (https://datasciencecastnet.home.blog/2019/10/19/zindi-uberct-part-1-getting-started/) basically re-caps the starter notebook I shared earlier and is useful for getting a quick entry on the board. The second part (https://datasciencecastnet.home.blog/2019/10/21/zindi-uberct-part-2-stepping-up/) shares some next steps (adding features, using fast.ai) to boost the score (to >0.08 without much tweaking). Both have accompanying notebooks on Google Colab for easy duplication.
I'll be working on part 3, so please share any tips for things to include. Looking forward to questions and feedback :)
Thanks for making the starter notebook.
Sweet, thank you for the starter code notebook and the blog posts. They're useful so far.
By the way, I get an error when I try to run this part of the notebook:
locations = data.groupby('road_segment_id').mean()[['longitude', 'latitude']]
locations.head(2)
The error is as follows:
KeyError: "['longitude'] not in index"
It seems this could be a Pandas bug, with the groupby function misbehaving. I updated Pandas, but the error persists in the latest version. I verified that the longitude column is present in the data object after loading, and that it disappears right after the groupby method is called: running nothing but the groupby call and checking for the longitude column again shows it is gone.
Turns out there wasn't a bug in Pandas after all, but the train.csv file has a few dirty data entries. I suppose data cleaning is inevitable, but just a heads up to anyone else who is pulling their hair out.
Here's a tip: after loading the csv file into a dataframe, say one called data, call data.info(). If your longitude and latitude columns are not float64 types, you are not going to have a good time.
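To illustrate what the data.info() check catches, here's a minimal sketch using a made-up miniature of train.csv (the column names match the thread; the values and the single "Closed" entry are hypothetical stand-ins for the dirty rows people describe):

```python
import io
import pandas as pd

# Hypothetical miniature of train.csv: a single dirty "Closed" entry
# forces pandas to load the whole longitude column as object, not float64.
csv = io.StringIO(
    "road_segment_id,longitude,latitude\n"
    "1,18.42,-33.92\n"
    "1,Closed,-33.93\n"
    "2,18.45,-33.94\n"
)
data = pd.read_csv(csv)

data.info()  # longitude reported as object; latitude as float64
```

An object-typed longitude column is exactly the warning sign: mean() during the groupby can't aggregate it, so the column vanishes from the result and the later `[['longitude', 'latitude']]` selection raises the KeyError.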
Awesome. Thanks for sharing
Hi DevilEars. Have you been able to find your way around the dirty data entries?
There's no way around it other than data wrangling operations to reformat your data.
Yes, I just clean it up with data wrangling operations. I remove all the longitude entries with the value Closed, and then I change the dtype of the longitude column to float.
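A minimal sketch of that clean-up, again on a made-up miniature of train.csv (column names from the thread; the sample values are hypothetical):

```python
import io
import pandas as pd

# Toy stand-in for train.csv with one dirty "Closed" longitude entry.
csv = io.StringIO(
    "road_segment_id,longitude,latitude\n"
    "1,18.42,-33.92\n"
    "1,Closed,-33.93\n"
    "2,18.45,-33.94\n"
)
data = pd.read_csv(csv)

# Drop the rows whose longitude is the string "Closed",
# then restore the float dtype on the column.
data = data[data["longitude"] != "Closed"].copy()
data["longitude"] = data["longitude"].astype(float)

# The groupby from the notebook now works as expected.
locations = data.groupby("road_segment_id").mean()[["longitude", "latitude"]]
print(locations.head(2))
```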
Hi, could you please share a pyautogui code for automating download of uber data, which you mentioned in part 3?