Chapter 2 Data sources
Our dataset was collected from “Chanmama” (https://www.chanmama.com/), the largest database that collects and distributes data about “Douyin”. It provides general information about the live hosts, the detailed revenue of each live stream, and information on the products sold during each live stream. Because the database currently covers only “Douyin”, it is available only in Chinese, so some of the data had to be translated. Users must also register and subscribe before they can access the data.
The raw data is in Excel format and is located under each live host’s profile. Each live host has three Excel files, each covering a different area, and the files can only be downloaded one at a time rather than in bulk. We used the sales revenue ranking provided by the database to identify the top 20 live hosts by sales revenue from June to August 2021. We then downloaded the data files from these 20 host profiles and combined them into three Excel files, one per area.
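The combining step described above can be sketched in Python with pandas. This is a minimal illustration rather than the exact procedure we followed: the file naming pattern (`<host>_<suffix>.xlsx`), the folder layout, and the function names `load_host_files` and `combine_host_frames` are all assumptions made for the example.

```python
from pathlib import Path

import pandas as pd


def combine_host_frames(frames_by_host):
    """Stack per-host tables into one table, tagging each row with its host.

    `frames_by_host` maps a host name to that host's DataFrame.
    """
    tagged = [
        frame.assign(Name=host)  # identifier column so rows stay traceable
        for host, frame in frames_by_host.items()
    ]
    return pd.concat(tagged, ignore_index=True)


def load_host_files(folder, suffix):
    """Read every '<host>_<suffix>.xlsx' file in `folder`.

    The naming pattern is hypothetical; adjust it to the actual downloads.
    Reading .xlsx files with pandas requires the openpyxl package.
    """
    frames = {}
    for path in sorted(Path(folder).glob(f"*_{suffix}.xlsx")):
        host = path.stem[: -len(suffix) - 1]  # drop the trailing '_<suffix>'
        frames[host] = pd.read_excel(path)
    return frames
```

Under these assumptions, a call such as `combine_host_frames(load_host_files("downloads", "general"))` would produce one combined table for a given area, which could then be written back out with `DataFrame.to_excel`.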
The first Excel file is named Data_Livestream_General, and it includes general information about each past live stream of the hosts from June to August 2021. To support our analysis, we added three columns to the dataset: the host’s full name (Name), whether the host is an online KOL (key opinion leader) or a celebrity (Host_category), and the host’s occupation (Occupation). The name and occupation were available on the website but were not contained in the downloaded Excel files, so we added them manually. We assigned the host category based on our team’s general knowledge of the hosts. After these adjustments, the dataset has 21 variables and more than 20,000 observations, which we use selectively. The variables are mainly numeric; the important ones include the start time of the live stream, the number of products presented, the number of goods sold, the total sales amount, the duration, the number of views, and the average length of stay.
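One way to attach host-level attributes such as Host_category and Occupation is to keep them in a small lookup table and merge it onto the live stream records by Name. The sketch below assumes the Name column is already present; the function name `attach_host_attributes` and the sample values are placeholders for illustration, not our actual data.

```python
import pandas as pd


def attach_host_attributes(streams, host_info):
    """Left-join one row of host attributes onto every live stream record.

    `validate="many_to_one"` makes pandas raise an error if a host
    appears more than once in the lookup table.
    """
    return streams.merge(host_info, on="Name", how="left", validate="many_to_one")


# Placeholder records; real values come from the downloaded files and the website.
streams = pd.DataFrame({"Name": ["host_a", "host_a", "host_b"], "Views": [100, 250, 80]})
host_info = pd.DataFrame(
    {
        "Name": ["host_a", "host_b"],
        "Host_category": ["KOL", "Celebrity"],
        "Occupation": ["occupation_a", "occupation_b"],
    }
)
labelled = attach_host_attributes(streams, host_info)
```

Keeping the manually entered attributes in a separate table means each host is labelled once, which reduces the chance of inconsistent entries across thousands of rows.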
The second Excel file is named Data_Sales, and it includes detailed information about the specific goods sold by the 20 hosts within the three months. Again, to support our analysis, we added the host’s full name as an identifier column. After this adjustment, the dataset has 9 variables and more than 10,000 observations. Two of the variables are categorical: the goods category and the host’s name. The remaining variables are numeric: the retail price of a good in Chinese yuan, the commission rate, the sales volume, the sales amount, the conversion rate, the number of appearances of a good in short videos, and the number of appearances in live streams.
The third Excel file is named Data_Short_Videos, and it contains information about the short videos posted by the top 20 hosts from June to August 2021. It includes 9 variables and 18,051 observations in total. As with the previous data file, we added the identifier column Name to the raw data. This dataset records each video’s post time, duration, estimated sales volume in units, estimated sales amount, number of likes, number of comments, and number of reposts.