
Hi there, fellow Zindians! I hope this message finds you well. I am currently facing a challenge running @Professor's GPU/CPU RAG starter notebook on Kaggle and would appreciate some guidance from the community.
When I run the starter notebook on Kaggle, I get a dead-kernel error at record 249 of the inference. This is quite peculiar, because when I run that question in isolation, or in a batch of 10-100 questions that includes it, I get no error. I assumed it was a resource issue caused by a memory leak, but when I monitor the resources nothing strange shows up. I even tried freeing memory every 10 iterations, but as soon as it reaches iteration 249 it simply stops. The same code runs on Colab without any issue. I would appreciate any insight, as I would prefer running my notebooks on Kaggle to save compute units. Thank you so much, and good luck!
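For reference, the inference loop with the periodic clean-up looks roughly like this (simplified, with placeholder data and a stub in place of the notebook's real LLM call):

```python
import gc
import torch

def generate_answer(question: str) -> str:
    """Stub standing in for the starter notebook's actual LLM call."""
    return f"answer to: {question}"

questions = [f"question {i}" for i in range(300)]  # placeholder data

answers = []
for i, question in enumerate(questions):
    answers.append(generate_answer(question))
    if (i + 1) % 10 == 0:             # free memory every 10 iterations
        gc.collect()                  # reclaim Python-level garbage
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # drop cached GPU allocations
```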
Here are some of the logs:
 
```
  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 730, in _async_poll_for_reply
    msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))
  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 96, in ensure_async
    result = await obj
  File "/opt/conda/lib/python3.10/site-packages/jupyter_client/channels.py", line 310, in get_msg
    ready = await self.socket.poll(timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 949, in async_execute_cell
    exec_reply = await self.task_poll_for_reply
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 119, in execute_notebook
    nb = papermill_engines.execute_notebook_with_engine(
  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 48, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 365, in execute_notebook
    cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/papermill/engines.py", line 434, in execute_managed_notebook
    return PapermillNotebookClient(nb_man, **final_kwargs).execute()
  File "/opt/conda/lib/python3.10/site-packages/papermill/clientwrap.py", line 45, in execute
    self.papermill_execute_cells()
  File "/opt/conda/lib/python3.10/site-packages/papermill/clientwrap.py", line 72, in papermill_execute_cells
    self.execute_cell(cell, index)
  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 84, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/nbclient/util.py", line 62, in just_run
    return loop.run_until_complete(coro)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/nbclient/client.py", line 953, in async_execute_cell
    raise DeadKernelError("Kernel died")
nbclient.exceptions.DeadKernelError: Kernel died

/opt/conda/lib/python3.10/site-packages/traitlets/traitlets.py:2930: FutureWarning: --Exporter.preprocessors=["remove_papermill_header.RemovePapermillHeader"] for containers is deprecated in traitlets 5.0. You can pass `--Exporter.preprocessors item` ... multiple times to add items to a list.
  warn(
```
Same here when I try the starter notebook given by @Professor.
Hi @Sach and @GIrum, unfortunately I don't have access to my PC for now, so I may not be able to do much. However, I'll try to make an edit from my mobile phone to test larger models. One important point to note: the textbook data that is ingested and converted to a vector database is very dirty, so retrieval uses a lot of RAM, especially because of the unicode characters. (I left it that way since it's a basic notebook.) What you'll want to do is clean the reference textbook properly.
Some ideas:
- Clean the reference textbook before ingesting it, especially stripping the unicode/junk characters (see the sketch below).
- Use a smaller model, or shorter prompts and chunks, so each inference call needs less RAM.

If I can, I'll try it from my end and let you know. But this should stop your kernel from exploding and give you a much better score.
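For the unicode clean-up, something along these lines should do (a minimal sketch: it normalises the text, keeps only ASCII, and collapses the leftover whitespace, assuming the textbook is loaded as one big string):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # map accented/compatibility characters to their closest ASCII form
    text = unicodedata.normalize("NFKD", text)
    # drop anything that still cannot be encoded as ASCII
    text = text.encode("ascii", "ignore").decode("ascii")
    # collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Thé cåt\u00a0sat\n\non the mat"))  # -> "The cat sat on the mat"
```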
Hi @Professor, thanks for the response. However, I doubt retrieval from the vector DB is actually the problem. I created an alternate solution where I iterated through every question, retrieved 3 chunks from the vector DB, formatted the LangChain prompt, and stored the prompt in a pandas DataFrame. I then imported the DataFrame in a separate notebook and ran the prompts through the LLM, but it still gives the same error.
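Roughly, the first notebook did this (a simplified sketch; `vectordb` and `questions` come from the starter notebook, and the template string here is just a placeholder for the real LangChain prompt):

```python
import pandas as pd

# placeholder template; the real one was built with LangChain's PromptTemplate
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

rows = []
for question in questions:
    # retrieve the 3 most similar chunks from the vector store
    docs = vectordb.similarity_search(question, k=3)
    context = "\n\n".join(doc.page_content for doc in docs)
    rows.append({
        "question": question,
        "prompt": PROMPT_TEMPLATE.format(context=context, question=question),
    })

# persist the prompts so a second notebook can run them through the LLM
pd.DataFrame(rows).to_csv("prompts.csv", index=False)
```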
Interesting. So I take it the problem is in the LLM inference, right? Any idea what the retrieved text (3 chunks) looks like? Are there unicode characters in it?
Hi @Sach and @GIrum, I just successfully ran Llama 13B too, without errors. Check my edits here. Let me know if the same data-cleaning edits work for your model of choice. The raw score on the public LB is 0.38+.
PS: it was run on Kaggle kernels.
Hi @Professor, I just tested it with TinyLlama and it is working perfectly. Thank you so much for the assistance and for the starter notebook. As a beginner, over the past 3 days I have searched countless forums trying to find a solution, and not once did it cross my mind to remove the unicode characters 😂😂 Out of curiosity, what effect do the unicode characters have during inference?
Hi @Sach, the issue with your kernel dying was the RAM maxing out, so the best thing is to conserve memory. You could also use an extremely small model, or smaller prompts (fewer tokens) and chunks, and it would work.
In this scenario, the textbook text contains a lot of junk because an unstructured loader was used. The best thing is to clean the data and remove that junk, especially the unicode characters, which eat up a lot of RAM. It's basically all about efficient memory management.
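For example, one cheap way to bound memory per call is to cap the prompt length with the model's own tokenizer (a sketch; the checkpoint name is just an example, swap in whichever model your notebook loads):

```python
from transformers import AutoTokenizer

# example checkpoint; use the model your notebook actually runs
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

MAX_PROMPT_TOKENS = 512

def truncate_prompt(prompt: str) -> str:
    """Keep at most MAX_PROMPT_TOKENS tokens of the prompt."""
    ids = tokenizer(prompt, truncation=True,
                    max_length=MAX_PROMPT_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```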