You can try creating a new column (pseudo-location) with the first N characters of the filename, then splitting the training and validation sets so that items from the same pseudo-location never appear in both.
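A minimal sketch of that idea, assuming a pandas DataFrame with a `filename` column (the filenames, the column names, and the prefix length N here are all illustrative, not from the actual competition data):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for the real train csv (filenames are made up).
df = pd.DataFrame({
    "filename": [
        "field_A_2019_img_001.jpg",
        "field_A_2019_img_002.jpg",
        "field_B_2020_img_001.jpg",
        "field_B_2020_img_002.jpg",
        "field_C_2021_img_001.jpg",
        "field_C_2021_img_002.jpg",
    ]
})

N = 12  # prefix length; tune so near-duplicate shots share a prefix
df["pseudo_location"] = df["filename"].str[:N]

# Group-aware split: a pseudo_location lands in train OR valid, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, valid_idx = next(splitter.split(df, groups=df["pseudo_location"]))

train_groups = set(df.loc[train_idx, "pseudo_location"])
valid_groups = set(df.loc[valid_idx, "pseudo_location"])
assert train_groups.isdisjoint(valid_groups)
```

For k-fold CV instead of a single split, `GroupKFold` with the same `groups=` argument does the same thing fold by fold.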
Thanks. I tried your method with N=25 and found that a lot of files share the same filename up to N characters but are not actually the same file. Is there a correlation between these images if they have similar names? @Lemin
I'm not seeing quite as large a gap, but there's still a gap:
CV: 13.3 (quite high :c )
LB: 15.1
What is your CV method? Splitting randomly probably isn't the way to go - note that the competition mentions this:
"The hold-out is grouped by season and by growth_stage, damage type and extent."
A potential issue here is dataset drift / correlated images, i.e. images from the same season / damage type will have similar-looking damage.
I started with stratification using extent as the key. The dataset has a lot of similar-looking images, and my CV score is low because those similar images end up spread across the train and validation sets. I'm not sure how to prevent this. As @Lemin mentioned above, grouping images with similar names before splitting might be one strategy.
You could try using stratified CV splits based on season, growth_stage, etc.
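One way to stratify on several columns at once is to concatenate them into a single key. A rough sketch, assuming the train csv has columns named `season`, `growth_stage` and `damage` (the column names and values below are illustrative placeholders):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame; the real csv would have one row per image.
df = pd.DataFrame({
    "season": ["SR2020", "SR2020", "LR2021", "LR2021"] * 3,
    "growth_stage": ["S", "V", "S", "V"] * 3,
    "damage": ["DR", "G", "DR", "G"] * 3,
})

# Combine the grouping columns into one stratification key.
df["strat_key"] = df["season"] + "_" + df["growth_stage"] + "_" + df["damage"]

# Each fold's validation set then mirrors the overall key distribution.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(skf.split(df, df["strat_key"])):
    print(fold, df.loc[va, "strat_key"].value_counts().to_dict())
```

Note that `StratifiedKFold` needs every key to occur at least `n_splits` times, so very rare combinations may need to be merged into a catch-all bucket first.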
You may be making the same mistake as me: MAKE SURE you're using the train csv downloaded manually from this competition page, NOT the other train data you get when you download all the images.
An easy check is to run train_df.groupby("damage_type").extent.describe(). Every damage type other than DR should have extent 0.
If you're using the wrong data, you don't need to reload the correct file: just set the non-DR extent values in your current dataframe to 0 manually and rerun, and your LB score will drop by a very satisfying margin.
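The check and the in-place fix might look like this (the non-DR damage labels in the toy frame are made up for illustration; only `damage_type`, `extent`, and the `DR` label come from the thread):

```python
import pandas as pd

# Toy frame mimicking the wrong csv: non-DR rows carrying non-zero extent.
train_df = pd.DataFrame({
    "damage_type": ["DR", "DR", "G", "WD"],  # "G"/"WD" are placeholder labels
    "extent": [30, 70, 10, 5],
})

# The sanity check: only DR should show non-zero extent stats here.
print(train_df.groupby("damage_type").extent.describe())

# Manual fix: zero out extent for every non-DR row.
train_df.loc[train_df["damage_type"] != "DR", "extent"] = 0
```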