Congrats to the winning teams and everyone who participated in this competition. It was a worthy fight.
The greatest challenge in this competition was the sheer size of the data, which we are certain many people dodged because of limited hardware resources.
However, it would be more educational if we could all share how we handled the ~50 GB of data and what hardware resources we used.
That's very true.
We used Google Cloud AI Platform Jupyter Lab. But it is expensive, and we spent more than 50,000 naira in this competition (@crimacode and @MICADEE). There you can choose the CPU, GPU and RAM that fit your needs and your budget.
@crimacode: why don't you try Colab Pro?
@Moto Not available in Nigeria
I tried this method initially using the free $100 credit, but I could not do anything substantial before the credit was exhausted.
Seeing that the fields and labels were the only data not available on Google Earth Engine (GEE), we mosaicked and uploaded those files. We then created image composites and object (field) stats on GEE and extracted the data in batches. One benefit of this is zero cost, but it makes it difficult to perform image convolution unless you use the paid AI Platform. We also had numerous challenges with NA values and were missing 10 fields at the end (still unsolved).
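A minimal sketch of that kind of GEE workflow, in case it helps anyone (the asset ID, bands and dates here are placeholders, not necessarily what was actually used):

```python
# Sketch: per-field statistics on Google Earth Engine, exported in batches.
import ee

ee.Initialize()

# Field polygons uploaded as an Earth Engine table asset (hypothetical path).
fields = ee.FeatureCollection('users/your_name/field_boundaries')

# Build a Sentinel-2 median composite for one time window.
composite = (
    ee.ImageCollection('COPERNICUS/S2_SR')
    .filterBounds(fields)
    .filterDate('2021-04-01', '2021-05-01')
    .select(['B2', 'B3', 'B4', 'B8'])
    .median()
)

# Reduce the composite over every field polygon to get per-field means.
stats = composite.reduceRegions(
    collection=fields,
    reducer=ee.Reducer.mean(),
    scale=10,
)

# Export the table to Drive; repeat per time window / per batch of fields.
task = ee.batch.Export.table.toDrive(
    collection=stats,
    description='field_stats_2021_04',
    fileFormat='CSV',
)
task.start()
```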
Interesting, we also had some missing fields... still unsolved.
There were data with two different CRSs. When we used rasterio to create a mosaic, fields were projected to the wrong area because the mosaic inherits a single CRS. Fixing this partially solved the missing fields. Another issue was fields with NAs, which was only partially fixed.
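For anyone who hit the same CRS issue, a rough sketch of the fix (reproject every tile to one common CRS before mosaicking; file names and the target CRS are examples only):

```python
# Sketch: align all rasters to a single CRS, then mosaic the aligned tiles.
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling
from rasterio.merge import merge

def reproject_to(src_path, dst_path, dst_crs='EPSG:32734'):
    with rasterio.open(src_path) as src:
        transform, width, height = calculate_default_transform(
            src.crs, dst_crs, src.width, src.height, *src.bounds)
        profile = src.profile.copy()
        profile.update(crs=dst_crs, transform=transform,
                       width=width, height=height)
        with rasterio.open(dst_path, 'w', **profile) as dst:
            for band in range(1, src.count + 1):
                reproject(
                    source=rasterio.band(src, band),
                    destination=rasterio.band(dst, band),
                    src_transform=src.transform, src_crs=src.crs,
                    dst_transform=transform, dst_crs=dst_crs,
                    resampling=Resampling.nearest)

# Reproject every tile first, then merge the aligned tiles.
paths = ['tile_a.tif', 'tile_b.tif']
for i, path in enumerate(paths):
    reproject_to(path, f'aligned_{i}.tif')

sources = [rasterio.open(f'aligned_{i}.tif') for i in range(len(paths))]
mosaic, out_transform = merge(sources)
```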
Interesting!!
The satellite imagery was available on GEE?
wow.. @Geethen ..thanks for the tip...
But did it occur to you to extract the data from both CRSs? There might be some explanation for why the satellite data had both, or it may have just been a glitch.
I only have Colab Pro and a normal PC. At the beginning, I downloaded the data and created small samples to do exploration. Unzipping all the files on my PC took more than 12 hours :-)
Skaak has a strong server, but he also needed to wait hours (even days) for certain steps to finish.
We used fairly modest hardware.
I have an old iMac but upgraded its memory myself to 32 GB and used that for the XL competition. For the S2-only competition I used a MacBook Air (i5, 8 GB of memory), and it took about a day to fit a model (about double the time it took on the iMac). I think a GPU would have helped. We used time rather than hardware to solve this. As @moto mentions, just downloading and unzipping took forever.
Still early days, will post more detail once zindi accepts our solution.
PS: Initially I read all the images and wrote out details into CSV files. This took forever; later on I used numpy files (.npy) for this, which helped a lot.
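To illustrate the switch (a minimal sketch with stand-in data, not the actual arrays used): saving a float array as .npy is far smaller and much faster to reload than writing every value to a CSV.

```python
import numpy as np

pixels = np.random.rand(100_000, 13).astype(np.float32)  # stand-in data

# CSV: human-readable but slow and bulky on disk.
np.savetxt('pixels.csv', pixels, delimiter=',')

# .npy: binary, preserves dtype and shape, reloads almost instantly.
np.save('pixels.npy', pixels)
reloaded = np.load('pixels.npy')
assert reloaded.shape == pixels.shape
```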
@skaak I am keen to see the model/ model architecture you guys used. That is a long time to train :o. Well done:).
I look forward to your team's solution. Did you utilize all the data and all the bands?
I had 16 GB of RAM but could not fully utilize my resources. Preprocessing took me so long, and the internet cost of downloading discouraged me.
Truly, it was not only a matter of hardware resources, but more of how efficiently your code processed the data.
I used a normal system and Jupyter notebook for downloading and extracting the data. Then I saved the extracted data into .csv files and uploaded them to Google Drive. Preprocessing and other steps were done with Google Colab. I should mention that I worked with observations from 500 fields across all timestamps. I think using more fields could improve my results.
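The Colab side of that workflow could look roughly like this (the Drive path and file name are placeholders):

```python
# Sketch: mount Google Drive in Colab and read the uploaded CSVs.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Read the per-field observations uploaded from the local machine.
df = pd.read_csv('/content/drive/MyDrive/crop_comp/extracted_fields.csv')
print(df.shape)
```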
I did not use Colab, all local processing. Yes, it took much longer I suppose, but it was also nice to have complete control over the process.
PS: Why did you upload and not use your "normal system" for the model as well?
I get this message when using Jupyter notebook for processing on my system: "The kernel appears to have died".
RAM not enough
FWIW, at some stage I switched over to pure Python scripts - it was easiest. I think it would work as-is in a notebook as well, but scripts made editing and running easier.
What is the resource specification of your normal system?
i7, 4Gb of RAM
It's interesting to see how a number of folks tried to handle the data and the problem. For my team, we thought paid resources weren't allowed because of what was stated in the Zindi Rules under Data Standards, so we had to constrain ourselves to pure Python and truly hacky data-manipulation techniques. Downloading the data was the biggest challenge of all, but after we managed that, everything else got a bit easier.
"Paid resources weren't allowed" - where did you see that?
Rules tab
DATA STANDARDS
the last bullet point
In the end some rules might end up being relaxed, I guess...
Ah, I see. It mentioned "tools", not hardware. We are using Python and free software.
Yup...
There's a lot to be learned when the winners' solution code gets published... can't wait.
I used my old laptop to process the images into tables. It took about 24 hours. Next, I sent the data to the Mac and did some EDA and simple models. After that, I uploaded the data to a Kaggle kernel and fit the model there.
Nice - initially we did something very similar. We converted all those images into CSV, and it also took about 24 hours. After fitting some models to these, we could simplify and skip some of the steps. The initial conversion was very detailed, but later on it was much simpler, smaller, and quicker.
How big was the upload to Kaggle? And did you upload CSV or .npy?
PS: This was long ago, but I think initially we converted each and every pixel into a line in a CSV somewhere, and it was 200+ GB. Then we'd read the CSV and calculate averages and so on from it. Later on we did it directly on the images and stored the results as .npy, which made things much smaller and faster.
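A rough sketch of that "directly on the images" step (file names are examples; the raster is assumed to be float so NaN can mark masked pixels):

```python
# Sketch: average each band over every field polygon, store compactly as .npy.
import numpy as np
import geopandas as gpd
import rasterio
from rasterio.mask import mask

fields = gpd.read_file('fields.geojson')
field_means = []

with rasterio.open('composite.tif') as src:
    fields = fields.to_crs(src.crs)  # align vector and raster CRS
    for geom in fields.geometry:
        # Clip the raster to the field polygon and average each band.
        clipped, _ = mask(src, [geom], crop=True, nodata=np.nan)
        field_means.append(np.nanmean(clipped, axis=(1, 2)))

# Compact binary array instead of a huge per-pixel CSV.
np.save('field_means.npy', np.array(field_means))
```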
200 Gb is too big for my devices :)
I calculated the vegetation indices for each field in each picture and saved them in CSV files. The training data consisted of 265 CSV files totalling 4.2 GB. Then I expanded it into time series for each field.
I uploaded files to Kaggle in the feather format; it really saved both space and time. Perhaps numpy would help to save even more.
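A tiny sketch of that last step, turning per-field, per-date indices into one row per field (the column names and values here are made up for illustration):

```python
# Long format: one row per (field, date) with the computed index,
# e.g. NDVI = (NIR - Red) / (NIR + Red) computed earlier per image.
import pandas as pd

long_df = pd.DataFrame({
    'field_id': [1, 1, 2, 2],
    'date': ['2021-04-01', '2021-05-01', '2021-04-01', '2021-05-01'],
    'ndvi': [0.31, 0.45, 0.28, 0.52],
})

# Wide format: an NDVI time series per field, ready to feed a model.
wide_df = long_df.pivot(index='field_id', columns='date', values='ndvi')
print(wide_df)
```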
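The feather round-trip is a one-liner each way with pandas (requires pyarrow; the file name is an example):

```python
import pandas as pd

df = pd.read_csv('field_timeseries.csv')
df.to_feather('field_timeseries.feather')   # compact binary on disk

# On the Kaggle side, reading it back is near-instant:
df2 = pd.read_feather('field_timeseries.feather')
```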