Each ID has multiple impression values because impressions are recorded daily. The train.csv data contains daily impressions for each unique ID from 2020 to 2024. The challenge is time series forecasting.
I get that it's a time series forecasting challenge and that it has daily impressions.
But for 1 ID and 1 date, only 1 entry should be present, right? Or am I missing something?
For reference, consider the ID 'ID_5da86e71bf5dee4cf5047046': it has 6 entries for the date '2020-01-01'.
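A quick way to check how common this is would be to count the rows per (ID, date) pair. This is only a sketch using the standard library: the rows below are made up, and the two-field layout (ID, date) is an assumption about how the relevant train.csv columns would be read in.

```python
from collections import Counter

# Toy rows standing in for train.csv; the IDs and dates are made up.
rows = [
    ("ID_a", "2020-01-01"),
    ("ID_a", "2020-01-01"),
    ("ID_a", "2020-01-02"),
    ("ID_b", "2020-01-01"),
]


def entries_per_id_date(rows):
    """Count how many rows share each (ID, date) pair."""
    return Counter((client_id, date) for client_id, date in rows)


counts = entries_per_id_date(rows)
# Keep only the pairs that occur more than once.
repeated = {key: n for key, n in counts.items() if n > 1}
print(repeated)
```

Any pair that shows up in `repeated` is a date on which that ID has more than one entry.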
For that instance, you can add an hours column. If you look at it, you'll see that the impressions were recorded at different hours.
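A rough sketch of the hours-column idea, assuming the date field carries a time component in an ISO-like format (that format, the field names, and the rows below are assumptions, not taken from the actual file):

```python
from datetime import datetime


def add_hour(records, date_key="date"):
    """Return copies of the records with an extra integer 'hour' field."""
    out = []
    for rec in records:
        stamp = datetime.fromisoformat(rec[date_key])
        out.append({**rec, "hour": stamp.hour})
    return out


# Made-up rows; the timestamp format is an assumption.
records = [
    {"ID": "ID_a", "date": "2020-01-01 05:00:00", "impressions": 10},
    {"ID": "ID_a", "date": "2020-01-01 17:00:00", "impressions": 25},
]
with_hours = add_hour(records)
print([r["hour"] for r in with_hours])
```

If the real column holds only a date with no time, this would give hour 0 for every row, so it is worth checking the raw values first.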
Hello,
Just to clarify, the training data is made up of daily entries related to clients' ads, and on some dates a client would have more than one ad on display at a time.
This information could be useful in building your model. However, the main objective of the challenge is to forecast the total number of clicks a client will get in the future.
`clients would have more than one ad on display at a time.` If that's the case, then there should be an identifier for the ads. I mean why do we even have the ID field then!?
I assumed that it's a snapshot of the number of clicks at a certain point in the day. Based on the EDA I have done on the dataset, I feel my assumption is right.
Please correct me if I am wrong. Attaching an example to support the point would be really helpful.
The IDs identify each unique client on the platform, and a client can run multiple ads on the same date, either concurrently or at different times of day.
You'll notice that the keyword and description lengths can differ between entries made on the same date. These can therefore be used to distinguish the individual ads and inform your model.
Features such as these are the primary reason the different ads were kept separate in the training set, and you should be able to use this information in building your model.
However, it is also possible to aggregate the data for dates with multiple entries and use the totals instead, since the challenge is focused on the total number of clicks.
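For the aggregation route, something along these lines could work. It is a minimal sketch with made-up rows: the field names `ID`, `date`, and `clicks` are assumptions about the train.csv columns, and `n_entries` is a hypothetical extra feature that keeps track of how many entries were merged into each total.

```python
from collections import defaultdict


def daily_totals(records):
    """Sum clicks per (ID, date) and count how many entries were merged."""
    totals = defaultdict(lambda: {"clicks": 0, "n_entries": 0})
    for rec in records:
        key = (rec["ID"], rec["date"])
        totals[key]["clicks"] += rec["clicks"]
        totals[key]["n_entries"] += 1
    return dict(totals)


# Made-up rows: ID_a has two entries on the same date.
records = [
    {"ID": "ID_a", "date": "2020-01-01", "clicks": 3},
    {"ID": "ID_a", "date": "2020-01-01", "clicks": 4},
    {"ID": "ID_a", "date": "2020-01-02", "clicks": 1},
    {"ID": "ID_b", "date": "2020-01-01", "clicks": 9},
]
totals = daily_totals(records)
for key in sorted(totals):
    print(key, totals[key])
```

The same result can be had from a pandas groupby on the ID and date columns; the plain-Python version is used here only to keep the example dependency-free.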
This makes sense. Thanks a lot for the in-depth explanation and for being patient!