Congrats to the winning teams and everyone who participated in this competition. It was a worthy fight.
The greatest challenge in this competition was the sheer size of the data, which we are certain many people dodged because of limited hardware resources.
However, it would be more educational if we could all share how we handled the ~50 GB of data and what hardware resources we used.
That's very true.
We used Google Cloud AI Platform Jupyter Lab. But it is expensive, and we spent more than 50,000 naira in this competition (@crimacode and @MICADEE). There you can choose the CPU, GPU and RAM that fit your needs and your budget.
@crimacode: why don't you try Colab Pro?
@Moto Not available in Nigeria
I tried this method initially using the free $100 credit, but I could not do anything substantial before the credit was exhausted.
Seeing that the fields and labels were the only data not available on Google Earth Engine (GEE), we mosaicked and uploaded those files. We then created image composites and object (field) stats on GEE and extracted the data in batches. One benefit of this is zero cost, but it makes it difficult to perform image convolution unless you use the paid AI Platform. We also had numerous challenges with NA values and were missing 10 fields at the end (still unsolved).
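A minimal sketch of that kind of GEE workflow, in case it helps anyone (the asset ID, bands and dates here are placeholders, not necessarily what was actually used):

```python
# Sketch: per-field statistics on Google Earth Engine, exported in batches.
import ee

ee.Initialize()

# Field polygons uploaded as an Earth Engine table asset (hypothetical path).
fields = ee.FeatureCollection('users/your_name/field_boundaries')

# Build a Sentinel-2 median composite for one time window.
composite = (
    ee.ImageCollection('COPERNICUS/S2_SR')
    .filterBounds(fields)
    .filterDate('2021-04-01', '2021-05-01')
    .select(['B2', 'B3', 'B4', 'B8'])
    .median()
)

# Reduce the composite over every field polygon to get per-field means.
stats = composite.reduceRegions(
    collection=fields,
    reducer=ee.Reducer.mean(),
    scale=10,
)

# Export the table to Drive; repeat per time window / per batch of fields.
task = ee.batch.Export.table.toDrive(
    collection=stats,
    description='field_stats_2021_04',
    fileFormat='CSV',
)
task.start()
```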
Interesting, we also had some missing fields... still unsolved.
There were data with two different CRSs. When we used rasterio to create a mosaic, fields were projected to the wrong area because the mosaic inherits a single CRS. Fixing this partially solved the missing fields. Another issue was fields with NAs, which was only partially fixed.
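For anyone who hit the same CRS issue, a rough sketch of the fix (reproject every tile to one common CRS before mosaicking; file names and the target CRS are examples only):

```python
# Sketch: align all rasters to a single CRS, then mosaic the aligned tiles.
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling
from rasterio.merge import merge

def reproject_to(src_path, dst_path, dst_crs='EPSG:32734'):
    with rasterio.open(src_path) as src:
        transform, width, height = calculate_default_transform(
            src.crs, dst_crs, src.width, src.height, *src.bounds)
        profile = src.profile.copy()
        profile.update(crs=dst_crs, transform=transform,
                       width=width, height=height)
        with rasterio.open(dst_path, 'w', **profile) as dst:
            for band in range(1, src.count + 1):
                reproject(
                    source=rasterio.band(src, band),
                    destination=rasterio.band(dst, band),
                    src_transform=src.transform, src_crs=src.crs,
                    dst_transform=transform, dst_crs=dst_crs,
                    resampling=Resampling.nearest)

# Reproject every tile first, then merge the aligned tiles.
paths = ['tile_a.tif', 'tile_b.tif']
for i, path in enumerate(paths):
    reproject_to(path, f'aligned_{i}.tif')

sources = [rasterio.open(f'aligned_{i}.tif') for i in range(len(paths))]
mosaic, out_transform = merge(sources)
```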
Interesting!!
The satellite imagery was available on GEE?
wow.. @Geethen ..thanks for the tip...
But did it occur to you to extract the data from both CRSs? There might be some explanation for why the satellite data had both, or it may have just been a glitch.
I only have Colab Pro and a normal PC. At the beginning, I downloaded the data and created small samples to do exploration. Unzipping all the files on my PC took more than 12 hours :-)
Skaak has a strong server, but he also needed to wait hours (even days) for certain steps to finish.
We used fairly modest hardware.
I have an old iMac but upgraded its memory myself to 32 GB and used that for the XL competition. For the S2-only competition I used a MacBook Air (i5, 8 GB of memory), and it took about a day to fit a model (about double the time it took on the iMac). I think a GPU would have helped. We used time rather than hardware to solve this. As @moto mentions, just downloading and unzipping took forever.
Still early days, will post more detail once zindi accepts our solution.
PS: Initially I read all the images and wrote out details into CSV files. This took forever; later on I used numpy files (.npy) for this, which helped a lot.
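To illustrate the switch (a minimal sketch with stand-in data, not the actual arrays used): saving a float array as .npy is far smaller and much faster to reload than writing every value to a CSV.

```python
import numpy as np

pixels = np.random.rand(100_000, 13).astype(np.float32)  # stand-in data

# CSV: human-readable but slow and bulky on disk.
np.savetxt('pixels.csv', pixels, delimiter=',')

# .npy: binary, preserves dtype and shape, reloads almost instantly.
np.save('pixels.npy', pixels)
reloaded = np.load('pixels.npy')
assert reloaded.shape == pixels.shape
```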
@skaak I am keen to see the model/ model architecture you guys used. That is a long time to train :o. Well done:).
I look forward to your team's solution. Did you utilize all the data and all the bands?
I had 16 GB of RAM but could not fully utilize my resources. Preprocessing took me so long, and the internet cost of downloading discouraged me.
Truly, it was not only a matter of hardware resources, but more of how efficiently your code processed the data.
I used a normal system and Jupyter notebook for downloading and extracting the data. Then I saved the extracted data into .csv files and uploaded them to Google Drive. Preprocessing and other steps were done with Google Colab. I should mention that I worked with observations from 500 fields across all timestamps. I think using more fields could improve my results.
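The Colab side of that workflow could look roughly like this (the Drive path and file name are placeholders):

```python
# Sketch: mount Google Drive in Colab and read the uploaded CSVs.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Read the per-field observations uploaded from the local machine.
df = pd.read_csv('/content/drive/MyDrive/crop_comp/extracted_fields.csv')
print(df.shape)
```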
I did not use Colab, all local processing. Yes, it took much longer I suppose, but it was also nice to have complete control over the process.
PS: Why did you upload and not use your "normal system" for the model as well?
I get this message when using Jupyter notebook for processing on my system: "The kernel appears to have died".
RAM not enough
FWIW, at some stage I switched over to pure Python scripts - it was easiest. I think it would work as-is in a notebook as well, but scripts made editing and running easier.
What is the resource specification of your normal system?
i7, 4Gb of RAM
It's interesting to see how a number of folks tried to handle the data and the problem. For my team, we thought paid resources weren't allowed because of what was stated in the Zindi Rules under Data Standards, so we had to constrain ourselves to pure Python and truly hacky data-manipulation techniques. Downloading the data was the biggest challenge of all, but after we managed that, everything else got a bit easier.
"Paid resources weren't allowed" - where did you see that?
Rules tab
DATA STANDARDS
the last bullet point
In the end some rules might end up being relaxed, I guess...
Ah, I see. It mentioned "tools", not hardware. We are using Python and free software.
Yup...
There's a lot to be learned when the winners' solution code gets published... can't wait.
I used my old laptop to process the images into tables. It took about 24 hours. Next, I sent the data to the Mac and did some EDA and simple models. After that, I uploaded the data to a Kaggle kernel and fit the model there.
Nice - initially we did something very similar. We converted all those images into CSV, and it also took about 24 hours. After fitting some models to these, we could simplify and skip some of the steps. The initial conversion was very detailed, but later on it was much simpler, smaller, and quicker.
How big was the upload to Kaggle? And did you upload CSV or .npy?
PS: This was long ago, but I think initially we converted each and every pixel into a line in a CSV somewhere, and it was 200+ GB. Then we'd read the CSV and calculate averages and so on from it. Later on we did it directly on the images and stored the results as .npy, which made things much smaller and faster.
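A rough sketch of that "directly on the images" step (file names are examples; the raster is assumed to be float so NaN can mark masked pixels):

```python
# Sketch: average each band over every field polygon, store compactly as .npy.
import numpy as np
import geopandas as gpd
import rasterio
from rasterio.mask import mask

fields = gpd.read_file('fields.geojson')
field_means = []

with rasterio.open('composite.tif') as src:
    fields = fields.to_crs(src.crs)  # align vector and raster CRS
    for geom in fields.geometry:
        # Clip the raster to the field polygon and average each band.
        clipped, _ = mask(src, [geom], crop=True, nodata=np.nan)
        field_means.append(np.nanmean(clipped, axis=(1, 2)))

# Compact binary array instead of a huge per-pixel CSV.
np.save('field_means.npy', np.array(field_means))
```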
200 Gb is too big for my devices :)
I calculated the vegetation indices for each field in each picture and saved them in CSV files. The training data consisted of 265 CSV files totalling 4.2 GB. Then I expanded it into time series for each field.
I uploaded files to Kaggle in the feather format; it really saved both space and time. Perhaps numpy would help to save even more.
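A tiny sketch of that last step, turning per-field, per-date indices into one row per field (the column names and values here are made up for illustration):

```python
# Long format: one row per (field, date) with the computed index,
# e.g. NDVI = (NIR - Red) / (NIR + Red) computed earlier per image.
import pandas as pd

long_df = pd.DataFrame({
    'field_id': [1, 1, 2, 2],
    'date': ['2021-04-01', '2021-05-01', '2021-04-01', '2021-05-01'],
    'ndvi': [0.31, 0.45, 0.28, 0.52],
})

# Wide format: an NDVI time series per field, ready to feed a model.
wide_df = long_df.pivot(index='field_id', columns='date', values='ndvi')
print(wide_df)
```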
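The feather round-trip is a one-liner each way with pandas (requires pyarrow; the file name is an example):

```python
import pandas as pd

df = pd.read_csv('field_timeseries.csv')
df.to_feather('field_timeseries.feather')   # compact binary on disk

# On the Kaggle side, reading it back is near-instant:
df2 = pd.read_feather('field_timeseries.feather')
```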