This Project was put out as data challenge by the Wikimedia Foundation for a data analyst position. I thought it would be a great opportunity to brush up on some data analysis tools in Pandas and decided to complete the challenge.
The goal of the project was to analyze data from event logging (EL) to track a variety of performance and usage metrics to help the company make decisions. Specifically, they were interested in:
clickthrough rate: the proportion of search sessions where the user clicked on one of the results displayed
zero results rate: the proportion of searches that yielded 0 results
The dataset comes from a tracking schema that the Wikimedia Foundation uses for assessing user satisfaction. Desktop users are randomly sampled to be anonymously tracked by this schema which uses a "I'm alive" pinging system that we can use to estimate how long our users stay on the pages they visit. The dataset contains just a little more than a week of EL data.
|uuid||string||Universally unique identifier (UUID) for backend event handling.|
|timestamp||integer||The date and time (UTC) of the event, formatted as YYYYMMDDhhmmss.|
|session_id||string||A unique ID identifying individual sessions.|
|group||string||A label ("a" or "b").|
|action||string||Identifies in which the event was created. See below.|
|checkin||integer||How many seconds the page has been open for.|
|page_id||string||A unique identifier for correlating page visits and check-ins.|
|n_results||integer||Number of hits returned to the user. Only shown for searchResultPage events.|
|result_position||integer||The position of the visited page's link on the search engine results page (SERP).|
The following are possible values for an event's action field:
The objectives of this project are to find answers to the following questions:
What is their daily overall clickthrough rate? How does it vary between the groups?
Which results do people tend to try first? How does it change day-to-day?
What is their daily overall zero results rate? How does it vary between the groups?
Let session length be approximately the time between the first event and the last event in a session. Choose a variable from the dataset and describe its relationship to session length. Visualize the relationship.
We'll dive in to the analysis below, but first well read in some basic libraries as well as some functions which we will use. The functions were defined outside this notebook can found on the GitHub repo.
import pandas as pd import numpy as np import os.path import pandas as pd from itertools import chain import matplotlib.pyplot as plt % matplotlib inline from plotting_functions import * from helper_functions import * import warnings warnings.filterwarnings('ignore') def read_in_data(): """ This function does the following, 1.) Reads in the csv file: 'events_logs.csv' and set the index to be the timestamp and print its to a new file called: 'new_events_log.csv' 2.) Then it just reads in'new_events_log.csv' and returns the corresponding dataframe. If the file 'new_events_log.csv' already exists then only 2.) is done. :return: Dataframe of the data. :rvalue: Pandas DataFrame. """ # 1. if os.path.isfile('new_events_log.csv') is False: df = pd.read_csv("events_log.csv") df['timestamp'] = df['timestamp'].astype(int) df.to_csv('new_events_log.csv') #2.) df = pd.read_csv("new_events_log.csv", parse_dates=['timestamp'], index_col='timestamp') # drop some extra column df = df.drop('Unnamed: 0',axis=1) return df
The data was provided in csv format in a file called, "events_log.csv." When reading in the data I wanted to set the index of the dataframe to be a TimeStamp. This can be achieved using the pandas command:
df = pd.read_csv("events_log.csv", parse_dates=['timestamp'], index_col='timestamp')
However, this would not work because the TimeStamp was being read in as an float in scientific notation and therefore wouldnt convert to DateTime properly. I was able to circumvent this by first reading in the dataframe as usual, then changing the timestamp variable to an integer with the command:
df = pd.read_csv('events_log.csv') df['timestamp'] = df['timestamp'].astype(int)
I then wrote the dataframe to csv as "new_events_log.csv" and read it in again using the previous technique. I wanted my code to work even if the person had download the original csv and not the new one. So, I wrote a function that did the above if "new_events_log.csv" doesn't exist, and if it did exist just read it in. This is the funciton called below,
df = read_in_data()
I included this function specifically in this notebook since it run much faster when it was defined within the notebook then as an outside function.
Now, we can get a basic look at our data by using the
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 400165 entries, 2016-03-01 10:38:42 to 2016-03-04 14:43:03 Data columns (total 8 columns): uuid 400165 non-null object session_id 400165 non-null object group 400165 non-null object action 400165 non-null object checkin 223824 non-null float64 page_id 400165 non-null object n_results 136234 non-null float64 result_position 230482 non-null float64 dtypes: float64(3), object(5) memory usage: 27.5+ MB
Let's get an idea of the number of sessions this takes into account for,
session_ids = df.session_id.unique() print "Number of sessions: " + str(len(session_ids))
Number of sessions: 68028
If we sort the dataframe by
session_id, we can see how long each person stayed on a webpage,
I dug deeper into looking at the data on my own, but won't show it because it's superfluous. Instead, let's get into analyzing the data. Instead lets get to answering the questions posed in the challenge.
We first set out to understand what the average daily clickthrough rate of the users is, we can do this by first finding out how many people visit the website per day. In order to find this number on a per day basis, we have to first perform a groupby day,
and then count the number of rows where a search was performed, i.e. when each rows has,
'action' = 'searchResultPage'
We then need the number of people that visited a search result page on a daily basis. We repeat the same groupby call, but now count the number of records where the corresponding rows have,
'action' == 'visitPage'
We then take the two values for each day, divide them and take the average to get the average daily clickthrough rate. This is what is completed in the function call,
rate = average_click_through_rate(df)
We also have to answer the same question on clickthrough rate, but this time taking into account the differences between 'group a' and 'group b'. This can be achieved using the same procedure as before, but we group by day as well as the session 'group', i.e.
We then count the number of rows that have the different actions discussed previously. This procedure is exactly what is done in the function,
The results of the grouped and ungrouped results are shown below:
diff = average_click_through_rate_by_group(df) plot_grouped_and_ungrouped(diff, rate, "Daily Clickthrough Rate")
We can see that the clickthrough rate of 'group a' is almost twice that of 'group b'.
In order to figure out which links clients click first we have get the number of search result pages that were clicked on and group by day,
result_positon. We do this using the groupby Pandas call as well as filtering the rows by which rows includes a webpage that has been visited. We can tell which page has been visited by getting the
result_position of the page that was visisted. The groupby and filter call are below,
clicked = df.groupby([pd.TimeGrouper('D'),'session_id','result_position'])\ .apply(lambda row: len(row[row['result_position'] > 0]))
timestamp session_id result_position 2016-03-01 001e2d0e159172d2 2.0 2 0024c4506bf92e1c 1.0 9 0029420a5f8c7d90 1.0 1 002b97995ca9ce77 3.0 2 18.0 2 dtype: int64
We note that the values in the series is the number of times that the
result_positon appeared in the rows of the dataframe. We see that in session,
002b97995ca9ce77, the user went to both the 3rd listed result and the 18th listed result. Let's take a look at the data for that session in chronological order:
df[df.session_id == '002b97995ca9ce77'].sort_index()
We can see that the person first checked out the 3rd result and then went back to the search results page and clicked on the 18th page. We have to be careful that we only take the first result webpage the user clicked on in the analysis. I did this by first filtering the dataframe to only included rows where the person was checkedin and then sorted by the
visited_pages = df[df.action == 'visitPage'].sort_index()
Next we groupby the day and
session_id and take the first result in the column corresponding to the
result_position column. Since the dataframe is sorted in chronological order we can assume that the result we get back is the first search result that the user clicked on:
first_clicked = visited_pages.groupby([pd.TimeGrouper('D'),'session_id',], sort=True)['result_position']\ .first()
We print out the first few reults to see what they like:
timestamp session_id 2016-03-01 001e2d0e159172d2 2.0 0024c4506bf92e1c 1.0 0029420a5f8c7d90 1.0 002b97995ca9ce77 3.0 003037c8d3c3fe94 2.0 Name: result_position, dtype: float64
As a sanity check we can see that the value for
002b97995ca9ce77, is 3 which is what we expected.
Now we loop over the number of days and count the
result_positions of the pages that were first visited and order them by
result_position. We plot the daily results as bar graphs, taking the top 20 most clicked
The first thing we see is that the most frequently clicked is the first result listed and the number of click on the rest of the results drop exponentially with the rank of the listing. This makes sense as hopefully the first returned result is the most relevant.
Another interesting thing to note is the number of clicks changes throughout the week. The number of clicks on the first result seem to be around 2400-2500 for the first three days. On the fourth day they drop to 2300, then drop down to 1700 on the fifth day before recovering to 2000 on the 6th. The clickthrough rate of the first results returns to the 2400-2500 range on the seventh day and finally drops again to around 2200 on the eigth day. We can see that most days the first search results get about 2200-2500 clicks, but a noticable drop in click rate for the first search result occurs on the 5th day. We can also see the same pattern occurs for the click rate on the second search result.
Now we are interested in the daily zero rates, that is the the number of searches that return zero results. We group the data by day as we did in Question 1. and then count the number of rows that have,
'n_results' == 0
and divide by the number of searches. This gives us the the zero rate for each day and is the done in the function:
zero_rate = average_zero_rate(df)
We now repeat the same process as above, but also group by the 'group', that is whether the session is in 'group a' or in 'group b'. This is done in the function,
and we plot the results below,
diff2 = average_zero_rate_by_group(df) plot_grouped_and_ungrouped(diff2, zero_rate, 'Average Daily Zero Rate By Group')
We can see that the average zero rate is much higher in group b than group a, its almost twice the rate in group a. This makes sense, since group b has a smaller clickthrough rate and having zero search results would definitely result in less clickthroughs.
Now we want to look into the session length and what factors may affect it.
We first get the start time of each session by grouping by
session_id and taking the first row. We have to use the command,
nth(0) instead of
first(), because of issues with
first() dropping cases that have
NaN as values. We drop the
uuid column since it seems like useless information and pushes the dataframe off the screen when I print it. The last thing I will mention is that we use
as_index = False in the
groupby(...) call so that we get keep the original index and a Pandas DataFrame is returned instead of Pandas Series:
session_begin = df.sort_index()\ .groupby(['session_id'], sort=True, as_index=False)\ .nth(0)\ .drop('uuid',axis=1) session_begin.sort_values('session_id') .head()
Now, I'm going to do a sanity check by looing at the session
0001382e027b2ea4 and making sure that the first TimeStamp is the same as the one above.
test_session = df[df.session_id == '0001382e027b2ea4'].sort_index()\ .drop('uuid',axis=1) test_session.head()
Our simple test was a success!
In order to get the correct answer for the last TimeStamp of each session, I had to drop the
page_id column, I'm not 100% confident as to why this was necessary, but I think it has to do with the issues of
df2 = df.drop(['uuid','page_id'], axis=1)\ .sort_index()
Now we do the same thing as previous for
session_begin, but add take the
nth(-1) instead of using
last() because of issues with
last() dropping cases with
session_end = df2.groupby(['session_id'], sort=True,as_index=False)\ .nth(-1) session_end.sort_values('session_id')\ .head()
Now we perfrom another sanity check to make sure that the timestamp above for
session_id=0001382e027b2ea4 is the same as the last timestamp in the
session_begin['time'] = session_begin.index.to_pydatetime()
Now we drop the TimeStamp as the index and drop the
checkin column since that will be irrellevant and depends on the action.
session_begin = session_begin.reset_index()\ .drop(['timestamp','checkin'],axis=1)
Let's repeat the same process, but for
session_end['time'] = session_end.index.to_pydatetime() session_end = session_end.reset_index()\ .drop(['timestamp','checkin'], axis=1)
df3 = pd.merge(session_begin, session_end, on='session_id', suffixes=('_start','_end')) df3.head()
|0||cbba9a98ba7b20cc||a||checkin||f79f531304a265f8||NaN||4.0||2016-03-01 00:00:01||a||searchResultPage||20.0||NaN||2016-03-01 00:00:17|
|1||c1f0f8499c6655cd||b||checkin||88bebe600b777ea9||NaN||NaN||2016-03-01 00:00:02||b||checkin||NaN||NaN||2016-03-01 00:00:12|
|2||614eaa17120fe46b||b||checkin||756d326815bc60c9||NaN||1.0||2016-03-01 00:00:02||b||checkin||NaN||1.0||2016-03-01 00:01:22|
|3||32edcb654a1dee57||b||checkin||d6122cfa99f1809f||NaN||1.0||2016-03-01 00:00:05||b||checkin||NaN||1.0||2016-03-01 00:04:38|
|4||34ba1f4ed8fa8368||b||checkin||b7502d32b09f32bb||NaN||NaN||2016-03-01 00:00:06||b||checkin||NaN||NaN||2016-03-01 00:05:38|
We then get the time difference between the start and end times and convert this number into seconds.
temp_session_length = df3['time_end'] - df3['time_start'] temp_session_length = temp_session_length.apply(lambda row: row.seconds)
Now we try to bin the number of counts of the session length into different bins up until 20 minutes,
bins = [0,10,30,60,90,120,150,180,210,240] time = 240 for i in range(16): time += 60 bins.append(time) top_session_lengths = temp_session_length.value_counts(bins=bins)
Let's plot the distribution of the frequency of sessions that had session length within the above time intervals,
plt.figure(figsize=(6,4)) top_session_lengths.sort_index()\ .plot(kind='bar', title='Frequency Of Session Lengths') plt.xlabel('Seconds') plt.ylabel('Frequency')
<matplotlib.text.Text at 0x10e25a790>
We can see that most people spend either 10 seconds in a session. This means they probably are not finding what they want. Then the next largest groups of people spend 30 seconds, 7 minutes and 0 seconds respectively. Let's see if there is a difference between the session length and session 'group'.
df3['session_length'] = temp_session_length session_length_by_group = df3.groupby(['group_start'])['session_length']\ .value_counts(bins=bins)
<matplotlib.figure.Figure at 0x10c782450>
We can see that the magnitudes of numbers are very different between the two groups but the shape of the distributions are roughly the same. Let's take a look at how the percentages change between the two groups, we can get this information from the series through,
session_length_a = session_length_by_group['a'] session_length_b = session_length_by_group['b'] tot_a = session_length_a.values.sum() tot_b = session_length_b.values.sum() session_length_a = session_length_a / tot_a session_length_b = session_length_b / tot_b
And plot the results:
It seems like group b had a larger percentage of people with both almost 10 or less seconds then group a. Group b also seems like it had larger percentage of sessions that were 7 minutes than group a. The 10 seconds or less is probably due to the higher zero rates in group b, but the 7 minutes I'm not able to explain right now.
Let's take a look at the number or results in each search. We'll look at the top 25, but we have to do this carefully since we cannot just count the number of rows that have
n_results equal a number, since some people may actually change the search criteria within a session. We therefore groupby
session_id and take the unique number of listed search results. Since this returns a Pandas Series where the index is the
session_id and the values are a list of
n_results for each session I had to chain the values together before counting them:
temp_results = df.groupby('session_id')['n_results'].unique() temp_results = pd.Series(list(chain(*temp_results.values))) number_of_results = temp_results.dropna()\ .value_counts()\ .head(25) number_of_results.sort_index()\ .plot(kind='bar', title='Frequency Of The Number Of Results') plt.xlabel('Number Of Results') plt.ylabel('Count') plt.show()
It seems that most searches end up with 20 results and the two next highest are 0 and 1 results respectively.
Let's take a look at the session lengths for number of
n_results being 0, 1, 2, 3, 5, 10 and 20 and how this may be affected by value of
n_results. We will take the number of
n_results from the end of the session. I wrote a function to find the disribution of percentages session length given the number of listed results
get_sessions_with_n_results. The function takes in
df3 and the number of listed results and then returns a Pandas Series that is the distribution of percentages sessions with
session_length and the value of listed of number of results
n_results = [0,1,2,5,10,20] results = [get_sessions_with_n_results(df3, n) for n in n_results]
We plot the results below
plt.figure(figsize=(8,12)) colors = ['g','r','b','m','c','k'] for i in range(1,len(n_results)+1): n = i-1 plt.subplot(4,2,i) results[n].sort_index()\ .plot(kind='bar', color=colors[n], label='n_results = '+ str(n_results[n])) plt.legend() plt.xlabel('Seconds') plt.ylim([0, 0.15]) plt.show()
It doent look like there is much a difference between the number of results and how long people spend on the first clicked page. Let's take the people who spent between 6 and 8 minutes on a link and take a look at which result page people clicked on:
title='result_position of page where people spent 6-8 minutes.' df_420 = df3[df3.session_length < 480] df_420 = df_420[df_420.session_length >= 360] df_420.result_position_end.value_counts() .plot(kind='bar', title=title) plt.xlabel('result_position') plt.ylabel('Frequency')
<matplotlib.text.Text at 0x10ede1f10>
It seems that the number people who spent 6-8 minutes on a resulting webpage were spending that time on the first listed result. So one way to improve the clickthrough rate would be to focus on optimizing the listing engine such that first result is the very relevant. Focusing on getting many relevant searrch reults probably is not the best track since the number of results dont seem to effect how the session_length.
In this project we looked into factors that influence clickthrough rates on eight days of data from the Wikimedia foundation. They had two different groups that users sessions were in, these are 'group a' and 'group b'. We found that,
There are definitely more factors we could look into that could influence click through rate and session length, but that will have to wait for another day!