Hello,
A major hassle in this challenge is generating proper engineered features: with such a large dataset, the time complexity becomes a real problem. Can we share in this discussion how we are going about optimizing the process? I will update this discussion with ideas as I explore ways to optimize as well. Let's learn from each other about parallelism and efficiency.
Thank you.
Update-
Here are some helpful resources for accelerated workflows and data processing, with a rough sketch of each approach after the list:
Memory usage reduction - https://www.kaggle.com/code/gemartin/load-data-reduce-memory-usage/notebook
RAPIDS cuDF - GPU-accelerated workflows for tabular data on CUDA: https://docs.rapids.ai/api/cudf/stable/
RAPIDS cuML - GPU-accelerated training for traditional machine learning algorithms: https://www.analyticsvidhya.com/blog/2022/01/cuml-blazing-fast-machine-learning-model-training-with-nvidias-rapids/, https://medium.com/rapids-ai/10-minutes-to-rapids-cudf-and-dask-cudf-3d16fcb84139
Pandas + Dask - https://pandas.pydata.org/docs/user_guide/scale.html, https://www.vantage-ai.com/en/blog/4-strategies-how-to-deal-with-large-datasets-in-pandas, https://docs.dask.org/en/stable/10-minutes-to-dask.html
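For the memory reduction idea, here is a minimal sketch of the dtype-downcasting trick the linked notebook is built around (the notebook casts column by column based on min/max values; this version leans on pandas' built-in downcasting instead), not yet tested on the competition data:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that fits their values."""
    start_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif np.issubdtype(df[col].dtype, np.floating):
            # Note: downcasting floats to float32/float16 trades precision for memory.
            df[col] = pd.to_numeric(df[col], downcast="float")
    end_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Memory usage: {start_mb:.1f} MB -> {end_mb:.1f} MB")
    return df
```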
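For cuDF, the appeal is that it mirrors a large part of the pandas API, so groupby-style feature engineering can run on the GPU with few code changes. A rough sketch, assuming a CUDA GPU with RAPIDS installed (the file and column names are placeholders):

```python
import cudf

# Read directly into GPU memory; most pandas-style operations then run on the GPU.
gdf = cudf.read_csv("train.csv")  # placeholder file name

# Example aggregation features per group (placeholder column names).
agg_features = gdf.groupby("customer_ID").agg({"amount": ["mean", "max", "std"]})

# Bring the much smaller result back to host memory as a pandas DataFrame.
agg_features = agg_features.to_pandas()
```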
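For cuML, training uses scikit-learn-style estimators backed by the GPU. A toy sketch with synthetic placeholder data (swap in the real features and labels), assuming the same RAPIDS environment:

```python
import numpy as np
from cuml.ensemble import RandomForestClassifier

# Synthetic placeholder data; replace with the engineered features and labels.
X = np.random.rand(10_000, 20).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

# scikit-learn-like interface, but training runs on the GPU.
model = RandomForestClassifier(n_estimators=100, max_depth=8)
model.fit(X, y)
preds = model.predict(X)
print(preds[:10])
```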
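For larger-than-memory work on CPU, Dask partitions the dataframe and parallelizes across cores with a mostly pandas-compatible API (pandas' own chunked read_csv is the simpler cousin). A sketch with placeholder names:

```python
import dask.dataframe as dd

# Dask splits the CSV into partitions and builds a lazy task graph.
ddf = dd.read_csv("train.csv", blocksize="256MB")  # placeholder file name

# Same groupby/agg style as pandas, but nothing runs yet (placeholder columns).
features = ddf.groupby("customer_ID")["amount"].agg(["mean", "std", "max"])

# .compute() executes the graph in parallel and returns a regular pandas object.
result = features.compute()
print(result.head())
```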
P.S. - I've yet to implement these myself and measure the benefits.
You're welcome to continue the conversation on our Discord: https://discord.gg/TwsnzK8k