## Introduction ¶

This Project was put out as data challenge by the Wikimedia Foundation for a data analyst position. I thought it would be a great opportunity to brush up on some data analysis tools in Pandas and decided to complete the challenge.

The goal of the project was to analyze data from event logging (EL) to track a variety of performance and usage metrics to help the company make decisions. Specifically, they were interested in:

• clickthrough rate: the proportion of search sessions where the user clicked on one of the results displayed

• zero results rate: the proportion of searches that yielded 0 results

EL uses JavaScript to asynchronously send messages (events) to their servers when the user has performed specific actions.

### Data ¶

The dataset comes from a tracking schema that the Wikimedia Foundation uses for assessing user satisfaction. Desktop users are randomly sampled to be anonymously tracked by this schema which uses a "I'm alive" pinging system that we can use to estimate how long our users stay on the pages they visit. The dataset contains just a little more than a week of EL data.

Column Value Description
uuid string Universally unique identifier (UUID) for backend event handling.
timestamp integer The date and time (UTC) of the event, formatted as YYYYMMDDhhmmss.
session_id string A unique ID identifying individual sessions.
group string A label ("a" or "b").
action string Identifies in which the event was created. See below.
checkin integer How many seconds the page has been open for.
page_id string A unique identifier for correlating page visits and check-ins.
n_results integer Number of hits returned to the user. Only shown for searchResultPage events.
result_position integer The position of the visited page's link on the search engine results page (SERP).

The following are possible values for an event's action field:

• searchResultPage: when a new search is performed and the user is shown a SERP.
• visitPage: when the user clicks a link in the results.
• checkin: when the user has remained on the page for a pre-specified amount of time.

### Objectives¶

The objectives of this project are to find answers to the following questions:

1. What is their daily overall clickthrough rate? How does it vary between the groups?

2. Which results do people tend to try first? How does it change day-to-day?

3. What is their daily overall zero results rate? How does it vary between the groups?

4. Let session length be approximately the time between the first event and the last event in a session. Choose a variable from the dataset and describe its relationship to session length. Visualize the relationship.

We'll dive in to the analysis below, but first well read in some basic libraries as well as some functions which we will use. The functions were defined outside this notebook can found on the GitHub repo.

In [1]:
import pandas as pd
import numpy as np
import os.path
import pandas as pd
from itertools import chain
import matplotlib.pyplot as plt
% matplotlib inline
from plotting_functions import *
from helper_functions import *
import warnings
warnings.filterwarnings('ignore')

"""
This function does the following,

1.) Reads in the csv file:

'events_logs.csv'

and set the index to be the timestamp and print its to a
new file called:

'new_events_log.csv'

2.) Then it just reads in'new_events_log.csv' and returns the
corresponding dataframe.

If the file 'new_events_log.csv' already exists then only 2.)
is done.

:return: Dataframe of the data.

:rvalue: Pandas DataFrame.

"""
# 1.
if os.path.isfile('new_events_log.csv') is False:
df['timestamp'] = df['timestamp'].astype(int)
df.to_csv('new_events_log.csv')

#2.)
parse_dates=['timestamp'],
index_col='timestamp')

# drop some extra column
df = df.drop('Unnamed: 0',axis=1)

return df


## Data Analysis ¶

The data was provided in csv format in a file called, "events_log.csv." When reading in the data I wanted to set the index of the dataframe to be a TimeStamp. This can be achieved using the pandas command:

df = pd.read_csv("events_log.csv",
parse_dates=['timestamp'],
index_col='timestamp')



However, this would not work because the TimeStamp was being read in as an float in scientific notation and therefore wouldnt convert to DateTime properly. I was able to circumvent this by first reading in the dataframe as usual, then changing the timestamp variable to an integer with the command:

df = pd.read_csv('events_log.csv')
df['timestamp'] = df['timestamp'].astype(int)



I then wrote the dataframe to csv as "new_events_log.csv" and read it in again using the previous technique. I wanted my code to work even if the person had download the original csv and not the new one. So, I wrote a function that did the above if "new_events_log.csv" doesn't exist, and if it did exist just read it in. This is the funciton called below,

In [2]:
df = read_in_data()


I included this function specifically in this notebook since it run much faster when it was defined within the notebook then as an outside function.

Now, we can get a basic look at our data by using the info() function:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 400165 entries, 2016-03-01 10:38:42 to 2016-03-04 14:43:03
Data columns (total 8 columns):
uuid               400165 non-null object
session_id         400165 non-null object
group              400165 non-null object
action             400165 non-null object
checkin            223824 non-null float64
page_id            400165 non-null object
n_results          136234 non-null float64
result_position    230482 non-null float64
dtypes: float64(3), object(5)
memory usage: 27.5+ MB


Let's get an idea of the number of sessions this takes into account for,

In [4]:
session_ids = df.session_id.unique()
print "Number of sessions: " + str(len(session_ids))

Number of sessions: 68028


If we sort the dataframe by session_id, we can see how long each person stayed on a webpage,

In [5]:
df.sort_values('session_id').head().sort_index()

Out[5]:
uuid session_id group action checkin page_id n_results result_position
timestamp
2016-03-03 15:20:45 e6f9280c8df5536490eb925a91979a3b 0000cbcb67c19c45 b searchResultPage NaN fdeeb9aaf0130d41 20.0 NaN
2016-03-07 08:49:55 5b394c862bf657d98bcd25589a5247ac 0001382e027b2ea4 b searchResultPage NaN 7aa28c36c03380a0 18.0 NaN
2016-03-07 08:50:18 f01620582b615c03812cb1af2e804102 0001382e027b2ea4 b checkin 20.0 f88793c95e778898 NaN 1.0
2016-03-07 08:51:28 fa6a1974487154aa95ddb93d8d5d3734 0001382e027b2ea4 b checkin 90.0 f88793c95e778898 NaN 1.0
2016-03-07 08:53:28 70d367a1c60156c1b1723d1e4196a987 0001382e027b2ea4 b checkin 210.0 f88793c95e778898 NaN 1.0

I dug deeper into looking at the data on my own, but won't show it because it's superfluous. Instead, let's get into analyzing the data. Instead lets get to answering the questions posed in the challenge.

### What The Average Daily Clickthrough Rates? ¶

We first set out to understand what the average daily clickthrough rate of the users is, we can do this by first finding out how many people visit the website per day. In order to find this number on a per day basis, we have to first perform a groupby day,

df.groupby([pd.TimeGrouper('D')])



and then count the number of rows where a search was performed, i.e. when each rows has,

'action' = 'searchResultPage'



We then need the number of people that visited a search result page on a daily basis. We repeat the same groupby call, but now count the number of records where the corresponding rows have,

'action' == 'visitPage'



We then take the two values for each day, divide them and take the average to get the average daily clickthrough rate. This is what is completed in the function call,

In [6]:
rate = average_click_through_rate(df)


We also have to answer the same question on clickthrough rate, but this time taking into account the differences between 'group a' and 'group b'. This can be achieved using the same procedure as before, but we group by day as well as the session 'group', i.e.

df.groupby([pd.TimeGrouper('D'),'group'])



We then count the number of rows that have the different actions discussed previously. This procedure is exactly what is done in the function,

average_click_through_rate_by_group(...)



The results of the grouped and ungrouped results are shown below:

In [7]:
diff = average_click_through_rate_by_group(df)

plot_grouped_and_ungrouped(diff,
rate,
"Daily Clickthrough Rate")


We can see that the clickthrough rate of 'group a' is almost twice that of 'group b'.

### Which Results Do Clients Click First? ¶

In order to figure out which links clients click first we have get the number of search result pages that were clicked on and group by day, session_id and result_positon. We do this using the groupby Pandas call as well as filtering the rows by which rows includes a webpage that has been visited. We can tell which page has been visited by getting the result_position of the page that was visisted. The groupby and filter call are below,

In [33]:
clicked = df.groupby([pd.TimeGrouper('D'),'session_id','result_position'])\
.apply(lambda row: len(row[row['result_position'] > 0]))

In [9]:
clicked.head()

Out[9]:
timestamp   session_id        result_position
2016-03-01  001e2d0e159172d2  2.0                2
0024c4506bf92e1c  1.0                9
0029420a5f8c7d90  1.0                1
002b97995ca9ce77  3.0                2
18.0               2
dtype: int64

We note that the values in the series is the number of times that the result_positon appeared in the rows of the dataframe. We see that in session, 002b97995ca9ce77, the user went to both the 3rd listed result and the 18th listed result. Let's take a look at the data for that session in chronological order:

In [10]:
df[df.session_id == '002b97995ca9ce77'].sort_index()

Out[10]:
uuid session_id group action checkin page_id n_results result_position
timestamp
2016-03-01 06:56:58 8e154e03862750dca50374c05170fd0b 002b97995ca9ce77 a searchResultPage NaN b9aef1d46f18a05c 20.0 NaN
2016-03-01 06:57:10 df81855edb66504e9ab1a3b3c8579c33 002b97995ca9ce77 a visitPage NaN ed9547024a71f9f2 NaN 3.0
2016-03-01 06:57:20 afe45772a9c55a54a4dcd36f8e70faf3 002b97995ca9ce77 a checkin 10.0 ed9547024a71f9f2 NaN 3.0
2016-03-01 06:57:25 62b963cbaca55a7fbd4c712832f19d36 002b97995ca9ce77 a searchResultPage NaN 374343e8ba419730 1.0 NaN
2016-03-01 06:57:30 6e9ae9a14484502596983c5ec9c9914d 002b97995ca9ce77 a searchResultPage NaN 60fda08472a6f1c5 20.0 NaN
2016-03-01 06:58:19 4c0f8453754257079692d2a55b4fe232 002b97995ca9ce77 a visitPage NaN e59ec99472571d5c NaN 18.0
2016-03-01 06:58:29 51b31a5a14855e1cbef3847a88fc1fb7 002b97995ca9ce77 a checkin 10.0 e59ec99472571d5c NaN 18.0

We can see that the person first checked out the 3rd result and then went back to the search results page and clicked on the 18th page. We have to be careful that we only take the first result webpage the user clicked on in the analysis. I did this by first filtering the dataframe to only included rows where the person was checkedin and then sorted by the TimeStamp (index):

In [11]:
visited_pages = df[df.action == 'visitPage'].sort_index()


Next we groupby the day and session_id and take the first result in the column corresponding to the result_position column. Since the dataframe is sorted in chronological order we can assume that the result we get back is the first search result that the user clicked on:

In [12]:
first_clicked = visited_pages.groupby([pd.TimeGrouper('D'),'session_id',], sort=True)['result_position']\
.first()


We print out the first few reults to see what they like:

In [13]:
first_clicked.head()

Out[13]:
timestamp   session_id
2016-03-01  001e2d0e159172d2    2.0
0024c4506bf92e1c    1.0
0029420a5f8c7d90    1.0
002b97995ca9ce77    3.0
003037c8d3c3fe94    2.0
Name: result_position, dtype: float64

As a sanity check we can see that the value for session_id, 002b97995ca9ce77, is 3 which is what we expected.

Now we loop over the number of days and count the result_positions of the pages that were first visited and order them by result_position. We plot the daily results as bar graphs, taking the top 20 most clicked result_positions below:

In [14]:
plot_daily_clicked(first_clicked, 20)


The first thing we see is that the most frequently clicked is the first result listed and the number of click on the rest of the results drop exponentially with the rank of the listing. This makes sense as hopefully the first returned result is the most relevant.

Another interesting thing to note is the number of clicks changes throughout the week. The number of clicks on the first result seem to be around 2400-2500 for the first three days. On the fourth day they drop to 2300, then drop down to 1700 on the fifth day before recovering to 2000 on the 6th. The clickthrough rate of the first results returns to the 2400-2500 range on the seventh day and finally drops again to around 2200 on the eigth day. We can see that most days the first search results get about 2200-2500 clicks, but a noticable drop in click rate for the first search result occurs on the 5th day. We can also see the same pattern occurs for the click rate on the second search result.

### What Are The Daily Average Zero Results Rate? ¶

Now we are interested in the daily zero rates, that is the the number of searches that return zero results. We group the data by day as we did in Question 1. and then count the number of rows that have,

'n_results' == 0



and divide by the number of searches. This gives us the the zero rate for each day and is the done in the function:

In [15]:
zero_rate = average_zero_rate(df)


We now repeat the same process as above, but also group by the 'group', that is whether the session is in 'group a' or in 'group b'. This is done in the function,

average_zero_rate_by_group(...)



and we plot the results below,

In [16]:
diff2 = average_zero_rate_by_group(df)
plot_grouped_and_ungrouped(diff2,
zero_rate,
'Average Daily Zero Rate By Group')


We can see that the average zero rate is much higher in group b than group a, its almost twice the rate in group a. This makes sense, since group b has a smaller clickthrough rate and having zero search results would definitely result in less clickthroughs.

### What Factors Affect User Session Length? ¶

Now we want to look into the session length and what factors may affect it.

We first get the start time of each session by grouping by session_id and taking the first row. We have to use the command, nth(0) instead of first(), because of issues with first() dropping cases that have NaN as values. We drop the uuid column since it seems like useless information and pushes the dataframe off the screen when I print it. The last thing I will mention is that we use as_index = False in the groupby(...) call so that we get keep the original index and a Pandas DataFrame is returned instead of Pandas Series:

In [17]:
session_begin = df.sort_index()\
.groupby(['session_id'], sort=True, as_index=False)\
.nth(0)\
.drop('uuid',axis=1)

session_begin.sort_values('session_id')

Out[17]:
session_id group action checkin page_id n_results result_position
timestamp
2016-03-03 15:20:45 0000cbcb67c19c45 b searchResultPage NaN fdeeb9aaf0130d41 20.0 NaN
2016-03-07 08:49:55 0001382e027b2ea4 b searchResultPage NaN 7aa28c36c03380a0 18.0 NaN
2016-03-05 09:24:15 0001e8bb90445cb2 b searchResultPage NaN 6b7f8832f5b6cc7d 20.0 NaN
2016-03-08 16:19:46 000216cf18ae1ab1 a searchResultPage NaN 08cebdafafbe069a 20.0 NaN
2016-03-08 05:33:00 000527f711d50dfc b searchResultPage NaN 41a14a053b06af6c 1.0 NaN

Now, I'm going to do a sanity check by looing at the session 0001382e027b2ea4 and making sure that the first TimeStamp is the same as the one above.

In [18]:
test_session = df[df.session_id == '0001382e027b2ea4'].sort_index()\
.drop('uuid',axis=1)

Out[18]:
session_id group action checkin page_id n_results result_position
timestamp
2016-03-07 08:49:55 0001382e027b2ea4 b searchResultPage NaN 7aa28c36c03380a0 18.0 NaN
2016-03-07 08:49:58 0001382e027b2ea4 b visitPage NaN f88793c95e778898 NaN 1.0
2016-03-07 08:50:08 0001382e027b2ea4 b checkin 10.0 f88793c95e778898 NaN 1.0
2016-03-07 08:50:18 0001382e027b2ea4 b checkin 20.0 f88793c95e778898 NaN 1.0
2016-03-07 08:50:28 0001382e027b2ea4 b checkin 30.0 f88793c95e778898 NaN 1.0

Our simple test was a success!

In order to get the correct answer for the last TimeStamp of each session, I had to drop the page_id column, I'm not 100% confident as to why this was necessary, but I think it has to do with the issues of NaNs.

In [19]:
df2 = df.drop(['uuid','page_id'], axis=1)\
.sort_index()


Now we do the same thing as previous for session_begin, but add take the nth(-1) instead of using last() because of issues with last() dropping cases with NaNs.

In [20]:
session_end = df2.groupby(['session_id'], sort=True,as_index=False)\
.nth(-1)

session_end.sort_values('session_id')\

Out[20]:
session_id group action checkin n_results result_position
timestamp
2016-03-03 15:20:45 0000cbcb67c19c45 b searchResultPage NaN 20.0 NaN
2016-03-07 08:54:58 0001382e027b2ea4 b checkin 300.0 NaN 1.0
2016-03-05 09:31:30 0001e8bb90445cb2 b checkin 420.0 NaN 1.0
2016-03-08 16:20:44 000216cf18ae1ab1 a searchResultPage NaN 20.0 NaN
2016-03-08 05:33:00 000527f711d50dfc b searchResultPage NaN 1.0 NaN

Now we perfrom another sanity check to make sure that the timestamp above for session_id=0001382e027b2ea4 is the same as the last timestamp in the test_session:

In [21]:
test_session.tail()

Out[21]:
session_id group action checkin page_id n_results result_position
timestamp
2016-03-07 08:52:28 0001382e027b2ea4 b checkin 150.0 f88793c95e778898 NaN 1.0
2016-03-07 08:52:58 0001382e027b2ea4 b checkin 180.0 f88793c95e778898 NaN 1.0
2016-03-07 08:53:28 0001382e027b2ea4 b checkin 210.0 f88793c95e778898 NaN 1.0
2016-03-07 08:53:58 0001382e027b2ea4 b checkin 240.0 f88793c95e778898 NaN 1.0
2016-03-07 08:54:58 0001382e027b2ea4 b checkin 300.0 f88793c95e778898 NaN 1.0

It worked!

So, now lets get the the start time of each session by creating a new column in the session_begin that strips the time from the index of the old dataframe, but we want to convert the Pandas TimeStamp objects into Python DateTime objects. We use the to_pydatetime() to do this:

In [22]:
session_begin['time'] = session_begin.index.to_pydatetime()


Now we drop the TimeStamp as the index and drop the checkin column since that will be irrellevant and depends on the action.

In [23]:
session_begin = session_begin.reset_index()\
.drop(['timestamp','checkin'],axis=1)


Let's repeat the same process, but for session_end:

In [24]:
session_end['time'] = session_end.index.to_pydatetime()

session_end = session_end.reset_index()\
.drop(['timestamp','checkin'], axis=1)


Now we can merge the two pandas dataframes into one:

In [25]:
df3 = pd.merge(session_begin,
session_end,
on='session_id',
suffixes=('_start','_end'))

Out[25]:
session_id group_start action_start page_id n_results_start result_position_start time_start group_end action_end n_results_end result_position_end time_end
0 cbba9a98ba7b20cc a checkin f79f531304a265f8 NaN 4.0 2016-03-01 00:00:01 a searchResultPage 20.0 NaN 2016-03-01 00:00:17
1 c1f0f8499c6655cd b checkin 88bebe600b777ea9 NaN NaN 2016-03-01 00:00:02 b checkin NaN NaN 2016-03-01 00:00:12
2 614eaa17120fe46b b checkin 756d326815bc60c9 NaN 1.0 2016-03-01 00:00:02 b checkin NaN 1.0 2016-03-01 00:01:22
3 32edcb654a1dee57 b checkin d6122cfa99f1809f NaN 1.0 2016-03-01 00:00:05 b checkin NaN 1.0 2016-03-01 00:04:38
4 34ba1f4ed8fa8368 b checkin b7502d32b09f32bb NaN NaN 2016-03-01 00:00:06 b checkin NaN NaN 2016-03-01 00:05:38

We then get the time difference between the start and end times and convert this number into seconds.

In [26]:
temp_session_length = df3['time_end'] - df3['time_start']
temp_session_length = temp_session_length.apply(lambda row: row.seconds)


Now we try to bin the number of counts of the session length into different bins up until 20 minutes,

In [27]:
bins = [0,10,30,60,90,120,150,180,210,240]
time = 240
for i in range(16):
time += 60
bins.append(time)

top_session_lengths = temp_session_length.value_counts(bins=bins)


Let's plot the distribution of the frequency of sessions that had session length within the above time intervals,

In [28]:
plt.figure(figsize=(6,4))
top_session_lengths.sort_index()\
.plot(kind='bar',
title='Frequency Of Session Lengths')

plt.xlabel('Seconds')
plt.ylabel('Frequency')

Out[28]:
<matplotlib.text.Text at 0x10e25a790>

We can see that most people spend either 10 seconds in a session. This means they probably are not finding what they want. Then the next largest groups of people spend 30 seconds, 7 minutes and 0 seconds respectively. Let's see if there is a difference between the session length and session 'group'.

In [34]:
df3['session_length'] = temp_session_length
session_length_by_group = df3.groupby(['group_start'])['session_length']\
.value_counts(bins=bins)

In [35]:
plot_session_length_by_group(session_length_by_group)

<matplotlib.figure.Figure at 0x10c782450>

We can see that the magnitudes of numbers are very different between the two groups but the shape of the distributions are roughly the same. Let's take a look at how the percentages change between the two groups, we can get this information from the series through,

In [31]:
session_length_a = session_length_by_group['a']
session_length_b = session_length_by_group['b']

tot_a = session_length_a.values.sum()
tot_b = session_length_b.values.sum()

session_length_a = session_length_a / tot_a
session_length_b = session_length_b / tot_b


And plot the results:

In [36]:
plot_session_length_by_group_percentages(session_length_a,
session_length_b)


It seems like group b had a larger percentage of people with both almost 10 or less seconds then group a. Group b also seems like it had larger percentage of sessions that were 7 minutes than group a. The 10 seconds or less is probably due to the higher zero rates in group b, but the 7 minutes I'm not able to explain right now.

Let's take a look at the number or results in each search. We'll look at the top 25, but we have to do this carefully since we cannot just count the number of rows that have n_results equal a number, since some people may actually change the search criteria within a session. We therefore groupby session_id and take the unique number of listed search results. Since this returns a Pandas Series where the index is the session_id and the values are a list of n_results for each session I had to chain the values together before counting them:

In [38]:
temp_results = df.groupby('session_id')['n_results'].unique()
temp_results = pd.Series(list(chain(*temp_results.values)))

number_of_results = temp_results.dropna()\
.value_counts()\

number_of_results.sort_index()\
.plot(kind='bar',
title='Frequency Of The Number Of Results')

plt.xlabel('Number Of Results')
plt.ylabel('Count')
plt.show()


It seems that most searches end up with 20 results and the two next highest are 0 and 1 results respectively.

Let's take a look at the session lengths for number of n_results being 0, 1, 2, 3, 5, 10 and 20 and how this may be affected by value of n_results. We will take the number of n_results from the end of the session. I wrote a function to find the disribution of percentages session length given the number of listed results n_results called get_sessions_with_n_results. The function takes in df3 and the number of listed results and then returns a Pandas Series that is the distribution of percentages sessions with session_length and the value of listed of number of results n_results.

In [41]:
n_results = [0,1,2,5,10,20]
results = [get_sessions_with_n_results(df3, n) for n in n_results]


We plot the results below

In [48]:
plt.figure(figsize=(8,12))
colors = ['g','r','b','m','c','k']
for i in range(1,len(n_results)+1):
n = i-1
plt.subplot(4,2,i)
results[n].sort_index()\
.plot(kind='bar',
color=colors[n],
label='n_results = '+ str(n_results[n]))

plt.legend()
plt.xlabel('Seconds')
plt.ylim([0, 0.15])

plt.show()


It doent look like there is much a difference between the number of results and how long people spend on the first clicked page. Let's take the people who spent between 6 and 8 minutes on a link and take a look at which result page people clicked on:

In [49]:
title='result_position of page where people spent 6-8 minutes.'
df_420 = df3[df3.session_length < 480]
df_420 = df_420[df_420.session_length >= 360]
df_420.result_position_end.value_counts()
.plot(kind='bar',
title=title)
plt.xlabel('result_position')
plt.ylabel('Frequency')

Out[49]:
<matplotlib.text.Text at 0x10ede1f10>

It seems that the number people who spent 6-8 minutes on a resulting webpage were spending that time on the first listed result. So one way to improve the clickthrough rate would be to focus on optimizing the listing engine such that first result is the very relevant. Focusing on getting many relevant searrch reults probably is not the best track since the number of results dont seem to effect how the session_length.

## Conclusion ¶

In this project we looked into factors that influence clickthrough rates on eight days of data from the Wikimedia foundation. They had two different groups that users sessions were in, these are 'group a' and 'group b'. We found that,

• 'group b' had a higher return of zero results and subsequently a lower clickthrough rate compared to 'group a'.
• Most of the search results either gave 20 or no search results.
• There wasnt much of a difference between how many listings there were and the session length.
• most of the sessions had time lengths that were either under 1 minute or between 6 and 8 minutes.
• Most of the people clicked on the first listed result and the majority of people who had sessions lengths between 6 and 8 clicked ended their session on the webpage of the first listed result.

There are definitely more factors we could look into that could influence click through rate and session length, but that will have to wait for another day!