Chapter 3 Data transformation
After gathering the data, we cleaned it to make it easier to work with in R. First, we used Python to convert the time variables. The script is in our GitHub repository at https://github.com/meganzhou62/tiktok_livesale/blob/main/processed%20data/assignmentand%20data/data%20processing/transfer%20duration%20to%20seconds.ipynb. The conversion was necessary because the raw durations were recorded as a number followed by a Chinese time unit, and the units varied among hours, minutes, and seconds. We therefore matched each Chinese character to its unit of time and converted every time variable to a number of seconds. We applied this conversion to the live stream duration and the average duration per viewer in Data_Livestream_general.xlsx, and to the video duration in Data_Short_Videos.xlsx.
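The unit matching described above can be sketched as follows. This is a minimal illustration and not the notebook's actual code: the exact unit characters in the raw data and the function name duration_to_seconds are assumptions.

```python
import re

# Assumed unit characters; the raw data may use only some of these variants.
UNIT_SECONDS = {
    "小时": 3600,  # hour
    "时": 3600,    # hour (short form)
    "分钟": 60,    # minute
    "分": 60,      # minute (short form)
    "秒": 1,       # second
}

# Longer unit names come first so that "小时" is tried before "时".
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(小时|分钟|时|分|秒)")


def duration_to_seconds(text):
    """Convert a string such as '1小时20分30秒' to a number of seconds."""
    matches = _PATTERN.findall(text)
    if not matches:
        raise ValueError(f"no recognizable duration in {text!r}")
    # Each match is a (number, unit) pair; scale by the unit and add up.
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in matches)
```

Raising an error on unrecognized input, rather than returning zero, makes unexpected formats visible during cleaning.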
We also recoded the Host_category variable from F/T to K/C, where K stands for KOL (key opinion leader) and C stands for celebrity. This change makes our later visualizations clearer. For the item categories, we reviewed the more than twenty original categories and combined them into seven broader categories for easier analysis and visualization. To further organize our data, we removed the Chinese title of each live stream session and assigned a unique ID to each session.
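A rough sketch of these recoding steps is given below. Only Host_category comes from our data; the column names Title, Category, and Session_id, the example category grouping, and the direction of the F/T mapping are assumptions made for illustration.

```python
# Assumed mapping direction; the text only states that F/T became K/C.
HOST_CODE = {"F": "K", "T": "C"}  # K = KOL, C = celebrity

# Hypothetical grouping; the real data maps 20+ categories into seven.
CATEGORY_GROUP = {
    "lipstick": "Beauty",
    "skincare": "Beauty",
    "snacks": "Food",
    "beverages": "Food",
}


def clean_sessions(rows):
    """Recode host type, merge categories, drop titles, and add session IDs."""
    cleaned = []
    for session_id, row in enumerate(rows, start=1):
        # Drop the Chinese title and keep every other column.
        new_row = {key: value for key, value in row.items() if key != "Title"}
        new_row["Session_id"] = session_id
        new_row["Host_category"] = HOST_CODE[row["Host_category"]]
        # Categories missing from the mapping fall into "Other".
        new_row["Category"] = CATEGORY_GROUP.get(row["Category"], "Other")
        cleaned.append(new_row)
    return cleaned
```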
Finally, we did some further data transformation in R: converting character variables to numeric, converting the Name variable to a factor, and reordering the levels of Name by the total revenue each host generated.
In addition to the transformations above, we processed the data in Python for our interactive component (code at https://github.com/meganzhou62/tiktok_livesale/blob/main/processed%20data/Make%20cumulative%20data.ipynb). We grouped the revenue by host and date and computed the cumulative sum for each host. We then printed the result in a layout that matches the data format expected by our JavaScript code.
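The grouping and cumulative sum can be sketched as below. This is an illustration under stated assumptions and not the notebook's code: the record layout (host, date, revenue), the output keys name, date, and value, and the helper names cumulative_revenue and to_js are all hypothetical, and the exact JavaScript format used by our interactive component may differ.

```python
import json
from collections import defaultdict
from itertools import accumulate


def cumulative_revenue(records):
    """records: iterable of (host, date, revenue), with dates as ISO strings."""
    # Sum revenue per (host, date), since a host may stream more than once a day.
    daily = defaultdict(float)
    for host, date, revenue in records:
        daily[(host, date)] += revenue

    # Collect each host's daily totals in date order (ISO strings sort by date).
    per_host = defaultdict(list)
    for (host, date), total in sorted(daily.items()):
        per_host[host].append((date, total))

    # Running total within each host.
    result = []
    for host, days in per_host.items():
        running = accumulate(total for _, total in days)
        for (date, _), cumulative in zip(days, running):
            result.append({"name": host, "date": date, "value": cumulative})
    return result


def to_js(rows, var_name="data"):
    """Render the rows as a JavaScript constant declaration (assumed format)."""
    body = json.dumps(rows, ensure_ascii=False, indent=2)
    return f"const {var_name} = {body};"
```

Calling print(to_js(cumulative_revenue(records))) then produces text that can be pasted into a .js file.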