The Yield column has about 55% missing data in Train and about 62% in Test.
I believe that if we are to design a method for finding the correct field locations, it is important to know the actual yield. The observed yield would guide an exhaustive search for the band values that give the closest match to it in the fields of interest (using the models for yield calculation), and the locations could then be corrected accordingly.
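To make the idea concrete, here is a minimal sketch of that search for a single field. Everything in it is an assumption for illustration: `best_location`, `toy_model`, and the candidate coordinates are hypothetical, and the toy model merely stands in for whatever yield model would map the band values at a location to a predicted yield. It also shows why the missing yields matter: a field without an observed yield has nothing to compare against and is skipped.

```python
import math
from typing import Callable, Iterable, Optional, Tuple

Location = Tuple[float, float]  # (latitude, longitude)


def best_location(
    observed_yield: Optional[float],
    candidates: Iterable[Location],
    predict_yield: Callable[[Location], float],
) -> Optional[Location]:
    """Return the candidate whose predicted yield is closest to the observed one.

    Returns None when the observed yield is missing (None or NaN) or when
    there are no candidates, since there is nothing to compare in either case.
    """
    if observed_yield is None or math.isnan(observed_yield):
        return None
    return min(
        candidates,
        key=lambda loc: abs(predict_yield(loc) - observed_yield),
        default=None,
    )


def toy_model(loc: Location) -> float:
    """Hypothetical stand-in for a yield model driven by band values at `loc`."""
    lat, lon = loc
    return 2.0 + 10.0 * (lat - 1.0) + 5.0 * (lon - 30.0)


# Hypothetical candidate locations around one recorded field position.
candidates = [(1.00, 30.00), (1.01, 30.00), (1.00, 30.02), (1.02, 30.02)]

print(best_location(2.28, candidates, toy_model))  # field with an observed yield
print(best_location(None, candidates, toy_model))  # field with a missing yield
```

With 55% to 62% of the yields missing, a search like this could only be run directly on fewer than half of the fields.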
Missing values are an inherent part of any project, and so is finding a way to handle them, but I wonder whether the trade-off is worth it in this particular case. It would mean either building a model first to predict yields where they aren't available or simply imputing them, and neither may be the best option if accurate field locations are of utmost concern. We also already had an earlier competition for estimating yields.
Are these yields really not available?
If possible, could you clarify the rationale behind this?
This data is different from the data in the other yield estimation competition, which is why we don't have all the yield information available for this dataset. We hope that the yields that are available will still help improve the model, even with this much missing data.