WhatsApp data
published 13 Nov 2019, 08:21

Of late, WhatsApp has become a major platform where groups are formed with a specific aim in mind. Contributions are made in those groups, but there is no proper way to track them and no proper record keeping, because the data is heavily duplicated. WhatsApp provides a means of exporting the data, but in text format only. The challenge comes in cleaning the data, as there is a lot of textual information that is not needed, especially if you are only interested in the numeric data. How do I go about that?

Hello Henry. I think you're better off providing sample data and specifying exactly what you would like to extract from it, so that we can help.

Assuming the data is a txt file, each contribution (John Doe - 3,000) is on its own line. You could read each line into a dataframe and use a combination of regex and string functions to extract the numeric figure and the contributor from each line.
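A minimal sketch of the regex idea, using only the standard library. The pattern is my assumption about the line shape (an optional "N." list number, a name, a hyphen, then an amount that may carry thousands separators); parse_contribution and the sample lines are made up for illustration, so adjust them to the real export.

```python
import re

# Assumed line shape: optional "N." prefix, a name, a hyphen, then an amount
# that may contain thousands separators, e.g. "John Doe - 3,000" or "1.alex -20000".
CONTRIBUTION = re.compile(
    r"^\s*(?:\d+\s*\.\s*)?(?P<name>[^\d-][^-]*?)\s*-\s*(?P<amount>\d[\d,]*)\s*$"
)

def parse_contribution(line):
    """Return (name, amount) for a contribution line, or None for anything else."""
    match = CONTRIBUTION.match(line)
    if match is None:
        return None
    amount = int(match.group("amount").replace(",", ""))
    return match.group("name").strip(), amount

# Hypothetical lines mixing contributions with ordinary chat.
sample = [
    "John Doe - 3,000",
    "1.alex -20000",
    "guys are reminded to continue contributing",
    "3.john- 20000",
]
records = [r for r in map(parse_contribution, sample) if r is not None]
print(records)
```

If pandas is available, the resulting list of tuples can be handed to pandas.DataFrame(records, columns=["name", "amount"]) for further clean-up.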

Another option worth exploring is using a WhatsApp API and adding the API number to the groups; with that you can manipulate and store the data in real time. One important advantage of this approach is being able to relate other types of media to their trailing comments and captions, which gives you more insight into the media content.

For old chats, the exported text follows a certain pattern (mobile number, time, message). If you are familiar with web scraping, retrieving such data is fairly easy. You can then choose to store the data in a table whose fields are the metadata from the texts.
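Roughly, splitting an export into those fields could look like the sketch below. The header layout ("date, time - sender: message") is an assumption on my part; real exports differ with phone locale and app version, and the phone numbers, dates and parse_export helper here are made up.

```python
import re

# Assumed header shape: "01/11/2019, 09:15 - +254 700 000000: message".
# Real exports vary with phone locale and app version, so adjust the pattern.
HEADER = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{2,4}), (?P<time>\d{1,2}:\d{2})"
    r" - (?P<sender>[^:]+): (?P<message>.*)$"
)

def parse_export(lines):
    """Group raw export lines into dicts with date, time, sender and message."""
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            messages.append(match.groupdict())
        elif messages:
            # No header: treat the line as a continuation of the previous message.
            messages[-1]["message"] += "\n" + line
    return messages

# Hypothetical export with one multi-line message and one single-line message.
export = [
    "01/11/2019, 09:15 - +254 700 000000: contributions towards the wedding\n",
    "1.alex -20000\n",
    "3.john -20000\n",
    "01/11/2019, 09:20 - +254 711 111111: guys, keep contributing\n",
]
messages = parse_export(export)
for m in messages:
    print(m["sender"], "|", m["time"], "|", repr(m["message"]))
```

Keeping continuation lines attached to their message matters here, because a pasted contribution list usually spans several lines under one sender and timestamp.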

Please provide sample code for the same, preferably in R.

contributions towards alex and Jane's wedding

1.alex -20000


3.john -20000

guys are remindend to continue contributing and redeeming your pledges towards alex's wedding.



3.john- 20000


That's just sample data.

Mwai, I think you understand this stuff well: a Kenyan scenario where contributions are made, then in between people chat, then they continue with the contributions, copy-pasting the list that was created.

This is in Python, though. If you load this into a dataframe, you could do some more clean-up. You could also explore the API as suggested.

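# Print each contribution line of the exported chat as ['N.name', 'amount'].
with open('chat.txt', 'r') as file:
    data = file.readlines()

for line in data:
    # Skip blank lines and chat messages, which have no '-' separator.
    if not line.strip() or '-' not in line:
        continue
    print(line.replace(' ', '').strip().split('-'))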


['1.alex', '20000']
['2.mary', '50000']
['3.john', '20000']
['1.alex', '20000']
['2.mary', '50000']
['3.john', '20000']
['4.wayne', '1000']
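Since the list gets copy-pasted, the same names show up more than once in that output. One possible clean-up, sketched under the assumption that a later list supersedes an earlier one (so only the last amount per person is kept), is below; the rows are the ones printed above.

```python
# Rows as printed by the script above: "N.name" and the amount as text.
rows = [
    ['1.alex', '20000'], ['2.mary', '50000'], ['3.john', '20000'],
    ['1.alex', '20000'], ['2.mary', '50000'], ['3.john', '20000'],
    ['4.wayne', '1000'],
]

latest = {}
for label, amount in rows:
    # Drop the "N." list number so the same person matches across lists.
    name = label.split('.', 1)[-1].strip().lower()
    # Assumption: a later list supersedes an earlier one, so keep the last amount.
    latest[name] = int(amount.replace(',', ''))

total = sum(latest.values())
print(latest)
print(total)
```

If repeated entries are actually separate payments rather than copies, the amounts would have to be added up instead of overwritten, so it is worth checking with the group how the list is maintained.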