3. What Are The Average Daily Clickthrough Rates?
4. Which Results Do Clients Click First?
5. What Is The Daily Average Zero Results Rate?
6. What Factors Affect User Session Length?
The source code can be found here.
This project was put out as a data challenge by the Wikimedia Foundation for a data analyst position. I thought it would be a great opportunity to brush up on some data analysis tools in Pandas, so I decided to complete the challenge.
The goal of the project was to analyze data from event logging (EL) to track a variety of performance and usage metrics to help the company make decisions. Specifically, they were interested in:
clickthrough rate: the proportion of search sessions where the user clicked on one of the results displayed
zero results rate: the proportion of searches that yielded 0 results
EL uses JavaScript to asynchronously send messages (events) to their servers when the user has performed specific actions.
The dataset comes from a tracking schema that the Wikimedia Foundation uses for assessing user satisfaction. Desktop users are randomly sampled to be anonymously tracked by this schema, which uses an "I'm alive" pinging system that we can use to estimate how long users stay on the pages they visit. The dataset contains a little more than a week of EL data.
Column | Value | Description |
---|---|---|
uuid | string | Universally unique identifier (UUID) for backend event handling. |
timestamp | integer | The date and time (UTC) of the event, formatted as YYYYMMDDhhmmss. |
session_id | string | A unique ID identifying individual sessions. |
group | string | A label ("a" or "b"). |
action | string | Identifies the context in which the event was created. See below. |
checkin | integer | How many seconds the page has been open for. |
page_id | string | A unique identifier for correlating page visits and check-ins. |
n_results | integer | Number of hits returned to the user. Only shown for searchResultPage events. |
result_position | integer | The position of the visited page's link on the search engine results page (SERP). |
The following are possible values for an event's action field:
searchResultPage: a new search was performed and the user was shown a search engine results page (SERP).
visitPage: the user clicked one of the results on the SERP.
checkin: the user stayed on a visited page for the number of seconds given in the checkin column.
The objectives of this project are to find answers to the following questions:
What is their daily overall clickthrough rate? How does it vary between the groups?
Which results do people tend to try first? How does it change day-to-day?
What is their daily overall zero results rate? How does it vary between the groups?
Let session length be approximately the time between the first event and the last event in a session. Choose a variable from the dataset and describe its relationship to session length. Visualize the relationship.
We'll dive into the analysis below, but first we'll import some basic libraries as well as some functions we will use. The functions defined outside this notebook can be found on the GitHub repo.
import pandas as pd
import numpy as np
import os.path
from itertools import chain
import matplotlib.pyplot as plt
from plotting_functions import (plot_grouped_and_ungrouped,
                                plot_daily_clicked,
                                plot_session_length_by_group,
                                plot_session_length_by_group_percentages)
from helper_functions import (searched_per_day_by_group,
                              searched_per_day,
                              average_click_through_rate,
                              average_click_through_rate_by_group,
                              average_zero_rate,
                              average_zero_rate_by_group,
                              get_sessions_with_n_results)
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
def read_in_data():
    """
    This function does the following:

    1.) Reads in the csv file 'events_log.csv', casts the timestamp
        column to an integer, and writes the result to a new file
        called 'new_events_log.csv'.

    2.) Reads in 'new_events_log.csv' and returns the corresponding
        dataframe with the timestamp as its index.

    If the file 'new_events_log.csv' already exists then only 2.)
    is done.

    :return: Dataframe of the data.
    :rtype: Pandas DataFrame.
    """
    # 1.) Create 'new_events_log.csv' if it does not exist yet.
    if os.path.isfile('new_events_log.csv') is False:
        df = pd.read_csv("events_log.csv")
        df['timestamp'] = df['timestamp'].astype(int)
        df.to_csv('new_events_log.csv')

    # 2.) Read the new file back in, parsing the timestamp as the index.
    df = pd.read_csv("new_events_log.csv",
                     parse_dates=['timestamp'],
                     index_col='timestamp')

    # Drop the extra index column written out by to_csv.
    df = df.drop('Unnamed: 0', axis=1)
    return df
The data was provided in csv format in a file called "events_log.csv". When reading in the data I wanted to set the index of the dataframe to be a timestamp. This can be achieved using the pandas command:
df = pd.read_csv("events_log.csv",
                 parse_dates=['timestamp'],
                 index_col='timestamp')
However, this would not work because the timestamp was being read in as a float in scientific notation and therefore wouldn't convert to DateTime properly. I was able to circumvent this by first reading in the dataframe as usual, then changing the timestamp variable to an integer with the commands:
df = pd.read_csv('events_log.csv')
df['timestamp'] = df['timestamp'].astype(int)
I then wrote the dataframe to csv as "new_events_log.csv" and read it in again using the previous technique. I wanted my code to work even if the person had downloaded the original csv and not the new one, so I wrote a function that does the above if "new_events_log.csv" doesn't exist, and otherwise just reads it in. This is the function called below:
df = read_in_data()
I defined this function directly in this notebook since it ran much faster that way than as an outside function.
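As an aside, the round trip through "new_events_log.csv" could probably be avoided by parsing the integer timestamps directly. Here is a minimal sketch of that alternative, assuming the YYYYMMDDhhmmss format from the schema (this is not the approach used in the rest of the notebook):

df = pd.read_csv('events_log.csv')
# Cast to int to drop the trailing '.0', then parse the fixed-width format.
df['timestamp'] = pd.to_datetime(df['timestamp'].astype('int64').astype(str),
                                 format='%Y%m%d%H%M%S')
df = df.set_index('timestamp')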
Now, we can get a basic look at our data by using the info() function:
df.info()
Let's get an idea of the number of sessions this accounts for:
session_ids = df.session_id.unique()
print("Number of sessions: {}".format(len(session_ids)))
If we sort the dataframe by session_id, we can see how long each person stayed on a webpage:
df.sort_values('session_id').head().sort_index()
I dug deeper into the data on my own, but won't show that here because it's superfluous. Instead, let's get to answering the questions posed in the challenge.
We first set out to understand the users' average daily clickthrough rate. To do this we first need the number of searches performed per day, which we get by grouping by day,
df.groupby([pd.TimeGrouper('D')])
and then counting the number of rows where a search was performed, i.e. the rows where,
'action' == 'searchResultPage'
We then need the number of times users clicked through to a result page each day. We repeat the same groupby call, but now count the number of records where the corresponding rows have,
'action' == 'visitPage'
We then divide the two counts for each day to get the daily clickthrough rate, and average over the days to get the average daily clickthrough rate. This is what is done in the function call:
rate = average_click_through_rate(df)
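For reference, here is a minimal sketch of how this helper could be implemented with the procedure just described. The real version lives in helper_functions on the GitHub repo; the exact return shape (the rate for each day, which is what gets plotted later) is my assumption:

def average_click_through_rate(df):
    # Daily counts of searches and of result-page visits.
    searches = df[df['action'] == 'searchResultPage']\
                 .groupby([pd.TimeGrouper('D')])\
                 .size()
    visits = df[df['action'] == 'visitPage']\
               .groupby([pd.TimeGrouper('D')])\
               .size()
    # Clickthrough rate for each day; averaging this series
    # gives the average daily clickthrough rate.
    return visits / searches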
We also have to answer the same question on clickthrough rate, but this time taking into account the differences between 'group a' and 'group b'. This can be achieved using the same procedure as before, but we group by day as well as the session 'group', i.e.
df.groupby([pd.TimeGrouper('D'),'group'])
We then count the number of rows that have the different actions discussed previously. This procedure is exactly what is done in the function,
average_click_through_rate_by_group(...)
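A sketch of the grouped version could look like the following, again with the caveat that the actual implementation is in helper_functions:

def average_click_through_rate_by_group(df):
    # Same counts as before, but split by day and by session group.
    searches = df[df['action'] == 'searchResultPage']\
                 .groupby([pd.TimeGrouper('D'), 'group'])\
                 .size()
    visits = df[df['action'] == 'visitPage']\
               .groupby([pd.TimeGrouper('D'), 'group'])\
               .size()
    # Daily clickthrough rate indexed by (day, group).
    return visits / searches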
The grouped and ungrouped results are shown below:
diff = average_click_through_rate_by_group(df)
plot_grouped_and_ungrouped(diff,
                           rate,
                           "Daily Clickthrough Rate")
We can see that the clickthrough rate of 'group a' is almost twice that of 'group b'.
In order to figure out which links clients click first, we have to get the number of search result pages that were clicked on and group by day, session_id, and result_position. We do this using the Pandas groupby call and by filtering for rows that include a visited webpage; we can tell which page has been visited from the result_position of the visited page. The groupby and filter call are below:
clicked = df.groupby([pd.TimeGrouper('D'), 'session_id', 'result_position'])\
            .apply(lambda row: len(row[row['result_position'] > 0]))
clicked.head()
We note that the values in the series are the number of times that the result_position appeared in the rows of the dataframe. We see that in session 002b97995ca9ce77, the user went to both the 3rd listed result and the 18th listed result. Let's take a look at the data for that session in chronological order:
df[df.session_id == '002b97995ca9ce77'].sort_index()
We can see that the person first checked out the 3rd result and then went back to the search results page and clicked on the 18th result. We have to be careful to take only the first result webpage the user clicked on in the analysis. I did this by first filtering the dataframe to only include rows where the user visited a page and then sorting by the timestamp (index):
visited_pages = df[df.action == 'visitPage'].sort_index()
Next we group by the day and session_id and take the first value in the result_position column. Since the dataframe is sorted in chronological order, we can assume that the value we get back is the first search result that the user clicked on:
first_clicked = visited_pages.groupby([pd.TimeGrouper('D'), 'session_id'], sort=True)['result_position']\
                             .first()
We print out the first few results to see what they look like:
first_clicked.head()
As a sanity check, we can see that the value for session_id 002b97995ca9ce77 is 3, which is what we expected.
Now we loop over the days and count the result_positions of the pages that were first visited, ordering them by result_position. We plot the daily results as bar graphs below, taking the 20 most clicked result_positions:
plot_daily_clicked(first_clicked, 20)
The first thing we see is that the most frequently clicked result is the first one listed, and the number of clicks on the rest of the results drops off roughly exponentially with the rank of the listing. This makes sense, as hopefully the first returned result is the most relevant.
Another interesting thing to note is that the number of clicks changes throughout the week. The number of clicks on the first result is around 2400-2500 for the first three days. On the fourth day it drops to 2300, then down to 1700 on the fifth day, before recovering to 2000 on the sixth. It returns to the 2400-2500 range on the seventh day and finally drops again to around 2200 on the eighth day. So on most days the first search result gets about 2200-2500 clicks, but a noticeable drop occurs on the 5th day. The same pattern occurs for the clicks on the second search result.
Now we are interested in the daily zero results rate, that is, the proportion of searches that return zero results. We group the data by day as we did in Question 1 and then count the number of rows that have,
'n_results' == 0
and divide by the number of searches. This gives us the zero results rate for each day and is done in the function:
zero_rate = average_zero_rate(df)
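Again, a rough sketch of what average_zero_rate might look like (the by-group variant used below simply adds 'group' to the groupby keys):

def average_zero_rate(df):
    # All searches per day, and the subset that returned no results.
    searches = df[df['action'] == 'searchResultPage']
    daily_searches = searches.groupby([pd.TimeGrouper('D')]).size()
    daily_zero = searches[searches['n_results'] == 0]\
                   .groupby([pd.TimeGrouper('D')])\
                   .size()
    # Zero results rate for each day.
    return daily_zero / daily_searches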
We now repeat the same process as above, but also group by the 'group', that is whether the session is in 'group a' or in 'group b'. This is done in the function,
average_zero_rate_by_group(...)
and we plot the results below,
diff2 = average_zero_rate_by_group(df)
plot_grouped_and_ungrouped(diff2,
                           zero_rate,
                           'Average Daily Zero Rate By Group')
We can see that the average zero results rate is much higher in group b than in group a; it's almost twice the rate of group a. This makes sense, since group b has a smaller clickthrough rate, and having zero search results would certainly result in fewer clickthroughs.
Now we want to look into the session length and what factors may affect it.
We first get the start time of each session by grouping by session_id and taking the first row. We have to use the command nth(0) instead of first(), because of issues with first() dropping cases that have NaN values. We drop the uuid column since it seems like useless information and pushes the dataframe off the screen when printed. The last thing I will mention is that we use as_index=False in the groupby(...) call so that we keep the original index and a Pandas DataFrame is returned instead of a Pandas Series:
session_begin = df.sort_index()\
                  .groupby(['session_id'], sort=True, as_index=False)\
                  .nth(0)\
                  .drop('uuid', axis=1)
session_begin.sort_values('session_id')\
             .head()
Now, I'm going to do a sanity check by looking at the session 0001382e027b2ea4 and making sure that its first timestamp is the same as the one above.
test_session = df[df.session_id == '0001382e027b2ea4'].sort_index()\
               .drop('uuid', axis=1)
test_session.head()
Our simple test was a success!
In order to get the correct answer for the last timestamp of each session, I had to drop the page_id column. I'm not 100% confident as to why this was necessary, but I think it has to do with the NaN issues mentioned above.
df2 = df.drop(['uuid', 'page_id'], axis=1)\
        .sort_index()
Now we do the same thing as we did for session_begin, but take the nth(-1) instead of using last(), again because of issues with last() dropping cases with NaNs.
session_end = df2.groupby(['session_id'], sort=True, as_index=False)\
                 .nth(-1)
session_end.sort_values('session_id')\
           .head()
Now we perform another sanity check to make sure that the timestamp above for session_id=0001382e027b2ea4 is the same as the last timestamp in test_session:
test_session.tail()
It worked!
So now let's get the start time of each session by creating a new time column in session_begin from the index of the old dataframe. We want to convert the Pandas Timestamp objects into Python DateTime objects, and we use to_pydatetime() to do this:
session_begin['time'] = session_begin.index.to_pydatetime()
Now we reset the index to drop the timestamp, and drop the checkin column since it is irrelevant here and depends on the action:
session_begin = session_begin.reset_index()\
                             .drop(['timestamp', 'checkin'], axis=1)
Let's repeat the same process, but for session_end:
session_end['time'] = session_end.index.to_pydatetime()
session_end = session_end.reset_index()\
                         .drop(['timestamp', 'checkin'], axis=1)
Now we can merge the two pandas dataframes into one:
df3 = pd.merge(session_begin,
               session_end,
               on='session_id',
               suffixes=('_start', '_end'))
df3.head()
We then get the time difference between the start and end times and convert this number into seconds.
temp_session_length = df3['time_end'] - df3['time_start']
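# Note: .seconds is the Timedelta's seconds component (ignoring days),
# which equals the total duration as long as no session spans a full day.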
temp_session_length = temp_session_length.apply(lambda row: row.seconds)
Now we bin the counts of the session lengths into intervals running up to 20 minutes,
bins = [0,10,30,60,90,120,150,180,210,240]
time = 240
for i in range(16):
    time += 60
    bins.append(time)
top_session_lengths = temp_session_length.value_counts(bins=bins)
Let's plot the frequency of sessions whose session length falls within the above time intervals,
plt.figure(figsize=(6,4))
top_session_lengths.sort_index()\
                   .plot(kind='bar',
                         title='Frequency Of Session Lengths')
plt.xlabel('Seconds')
plt.ylabel('Frequency')
We can see that most people spend 10 seconds or less in a session. This means they probably are not finding what they want. The next largest groups of people spend 30 seconds, 7 minutes, and 0 seconds respectively. Let's see if there is a difference in session length between the two session groups.
df3['session_length'] = temp_session_length
session_length_by_group = df3.groupby(['group_start'])['session_length']\
                             .value_counts(bins=bins)
plot_session_length_by_group(session_length_by_group)
We can see that the magnitudes of the counts are very different between the two groups, but the shapes of the distributions are roughly the same. Let's take a look at how the percentages compare between the two groups; we can get this information from the series through,
session_length_a = session_length_by_group['a']
session_length_b = session_length_by_group['b']
tot_a = session_length_a.values.sum()
tot_b = session_length_b.values.sum()
session_length_a = session_length_a / tot_a
session_length_b = session_length_b / tot_b
And plot the results:
plot_session_length_by_group_percentages(session_length_a,
                                         session_length_b)
It seems like group b had a larger percentage of sessions of 10 seconds or less than group a did. Group b also seems to have had a larger percentage of sessions around 7 minutes than group a. The 10 seconds or less is probably due to the higher zero results rate in group b, but the 7 minutes I'm not able to explain right now.
Let's take a look at the number of results in each search. We'll look at the top 25, but we have to do this carefully: we cannot just count the number of rows where n_results equals a given number, since some people may change the search criteria within a session. We therefore group by session_id and take the unique values of n_results. Since this returns a Pandas Series where the index is the session_id and the values are a list of n_results for each session, I had to chain the values together before counting them:
temp_results = df.groupby('session_id')['n_results'].unique()
temp_results = pd.Series(list(chain(*temp_results.values)))
number_of_results = temp_results.dropna()\
                                .value_counts()\
                                .head(25)
number_of_results.sort_index()\
                 .plot(kind='bar',
                       title='Frequency Of The Number Of Results')
plt.xlabel('Number Of Results')
plt.ylabel('Count')
plt.show()
It seems that most searches end up with 20 results and the two next highest are 0 and 1 results respectively.
Let's take a look at the session lengths for n_results values of 0, 1, 2, 5, 10 and 20, and how session length may be affected by the value of n_results. We will take the value of n_results from the end of the session. I wrote a function called get_sessions_with_n_results to find the distribution of session-length percentages given the number of listed results. The function takes in df3 and the number of listed results and returns a Pandas Series giving the percentage of sessions in each session_length bin for that value of n_results.
n_results = [0,1,2,5,10,20]
results = [get_sessions_with_n_results(df3, n) for n in n_results]
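For reference, a rough sketch of what get_sessions_with_n_results might look like; the n_results_end column name and the reuse of the bins list from above are assumptions on my part, and the actual implementation is in helper_functions:

def get_sessions_with_n_results(df3, n):
    # Sessions whose final search listed exactly n results
    # (n_results_end comes from the merge suffixes above).
    sessions = df3[df3['n_results_end'] == n]
    # Bin the session lengths with the bins list defined earlier.
    counts = sessions['session_length'].value_counts(bins=bins)
    # Turn raw counts into the fraction of these sessions per bin.
    return counts / counts.values.sum()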
We plot the results below:
plt.figure(figsize=(8,12))
colors = ['g','r','b','m','c','k']
for i in range(1, len(n_results)+1):
    n = i-1
    plt.subplot(4, 2, i)
    results[n].sort_index()\
              .plot(kind='bar',
                    color=colors[n],
                    label='n_results = ' + str(n_results[n]))
    plt.legend()
    plt.xlabel('Seconds')
    plt.ylim([0, 0.15])
plt.show()
It doesn't look like there is much of a difference between the number of results and how long people spend on the first clicked page. Let's take the people who spent between 6 and 8 minutes on a link and take a look at which result position they clicked on:
title = 'result_position of page where people spent 6-8 minutes.'
df_420 = df3[df3.session_length < 480]
df_420 = df_420[df_420.session_length >= 360]
df_420.result_position_end.value_counts()\
      .plot(kind='bar',
            title=title)
plt.xlabel('result_position')
plt.ylabel('Frequency')
It seems that the people who spent 6-8 minutes on a resulting webpage were spending that time on the first listed result. So one way to improve the clickthrough rate would be to focus on optimizing the listing engine so that the first result is highly relevant. Focusing on returning many relevant search results is probably not the best track, since the number of results doesn't seem to affect the session_length.
In this project we looked into factors that influence clickthrough rates using eight days of event logging data from the Wikimedia Foundation. User sessions were split into two groups, 'group a' and 'group b'. We found that:
The daily clickthrough rate of 'group a' was almost twice that of 'group b'.
Users most often clicked the first listed search result, with clicks dropping off rapidly with the rank of the listing.
The daily zero results rate of 'group b' was almost twice that of 'group a'.
Most sessions lasted 10 seconds or less, and the number of returned results did not appear to affect session length.
There are definitely more factors we could look into that could influence clickthrough rate and session length, but that will have to wait for another day!