I thought I'd share some info to help get started. We also just did a live stream, so if you want a video reference for connecting to the VM you can check it out here: https://www.twitch.tv/videos/648634668. But assuming that like me you're a text person, here's the juice:
- The training locations all fall within a single image tile, which I copied to the VM using `aws s3 cp s3://eohackathon-covid19/Hackthon_Data/Gauteng/2528C.tif 2528C.tif` (you need to set secret keys and things - see the README.yaml file for instructions). The test locations all fall within 2930D.tif. SO: you don't need to copy all the imagery from s3 to get going - just these two tiles (4GB each).
- The GP settlement layer is available as a shapefile, which means you can generate more training data if you want. BUT: careful with the class balance. Both train and (spoiler alert) test have ~20-30% positives (informal settlements) - randomly sampling locations will likely get you closer to 2% positives, and thus might give a model that does worse on the test set.
- The test set comes from an entirely different province. Check out some of the imagery and you'll notice it's from a fairly urban area, but with a good mix of land-use classes. If you do a random split on the training data, you'll see high accuracy (~95%) locally but will get a nasty surprise when your model is scored on the leaderboard (e.g. ~75% accuracy and a log loss of ~0.7, well above the ~0.2 seen in training). THINK ABOUT HOW TO MAKE A BETTER LOCAL VALIDATION SET. Maybe split by latitude, or generate a new validation set from a different location in GP...
- Make sure to save your notebook to your local machine after making a submission - the VMs disappear at the end of the weekend, and you don't want to be stuck with no code to submit :)
- Finally, although this is a competition, we're all trying to learn things. If you find a nice way to speed up something like image access, or have a nifty trick for generating more training locations, or you've done an image segmentation model using the shapefile as a mask, or you figured out how to install unrar... SHARE :) It's so nice as a beginner to get help from others and see how they've overcome challenges. Add tips in this thread or start your own.
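For the text people, here is a rough sketch of two of the tips above: splitting by latitude for a tougher local validation set, and downsampling negatives so that freshly sampled locations end up near the ~20-30% positive rate. It assumes you already have a latitude and a 0/1 label for each location as arrays. The function names, the 20% validation fraction and the 25% target are made up for illustration, not part of the starter code.

```python
import numpy as np

def split_by_latitude(lats, val_fraction=0.2):
    """Return boolean masks (train, val).

    The points with the smallest latitudes (the southernmost ones, since
    latitudes in South Africa are negative) go to validation, so train and
    validation cover different areas. val_fraction is an assumed value.
    """
    lats = np.asarray(lats, dtype=float)
    cutoff = np.quantile(lats, val_fraction)
    val_mask = lats <= cutoff
    return ~val_mask, val_mask

def downsample_negatives(labels, target_pos_frac=0.25, seed=0):
    """Return sorted indices that keep every positive plus just enough
    randomly chosen negatives to reach roughly target_pos_frac positives.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    # Number of negatives needed so that positives make up target_pos_frac.
    n_neg = int(round(len(pos_idx) * (1 - target_pos_frac) / target_pos_frac))
    n_neg = min(n_neg, len(neg_idx))  # cannot keep more negatives than exist
    keep_neg = rng.choice(neg_idx, size=n_neg, replace=False)
    return np.sort(np.concatenate([pos_idx, keep_neg]))
```

Typical use would be something like `train_mask, val_mask = split_by_latitude(df["lat"].values)` on your own table of locations (the `lat` column name is hypothetical). A latitude split is only one option; it will not fully mimic a different province, but it should give a less optimistic score than a random split.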
Good luck, and have fun :)
JW
Thanks Johnowhitaker! :-)
Thanks so much for the excellent session!
Thank you, John. I am using Windows and successfully connected over SSH with PuTTY. I typed `jupyter-notebook` in the terminal and it generated a token for the notebook, but I can't access localhost in the browser. Could you please help me with this?
You need to set up port forwarding in PuTTY. If I remember correctly it's under Connection -> SSH -> Tunnels.
Something like this: https://www.ccsl.carleton.ca/~falaca/comp4108_w17/ssh_putty/index.html (but the destination is localhost:8888 and the source is 8000 or whatever you choose).
An alternative: AWS has a guide on setting up Jupyter: https://docs.aws.amazon.com/dlami/latest/devguide/setup-jupyter.html
Thank you Johnowhitaker, I am finally able to forward a port. But now I've hit another problem: when I paste the token, it says "Invalid credentials". How do I deal with this?
You can set a password instead. On mobile at the moment so you'll have to Google around :)
Thank you, I set the password and it is working now. Thank you very much!