Purpose
The objective of this analysis is to identify the differences between Cyclistic members and casual riders. The data covers the previous 12 months, from Apr-2020 to Mar-2021, as required by the case study.
Ask
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
Prepare and process the data
Loading the packages needed
library(tidyverse)
library(readxl)
library(lubridate)
library(hms)
library(scales)
library(geosphere)
library(RiverLoad)
library(prettydoc)
library(fontawesome)
Merging xlsx files
The dataset we are going to use is split across 12 files, one per month of the last twelve months. In order to clean it efficiently, we first need to merge them into one single dataframe.
I start by looping through the file list stored in lst_tables and binding each file's content to an initially empty dataframe named df_stp0_raw.
setwd("C:/Users/Nidal/Documents/case_study_1/r_project/xlsx_files")

lst_tables <- list.files()
df_stp0_raw <- data.frame()

for(table in lst_tables){
  dummy_data <- read_excel(table)
  df_stp0_raw <- rbind(df_stp0_raw, dummy_data)
}

remove(dummy_data, lst_tables, table) # Cleaning some values created for the task
Showing the error encountered
Before moving on, I would like to point out that the total number of observations in the merged dataframe was compared with the total across the source files, and both results were the same: 3,489,748 rows/observations each.
NOTE: The chunk of code above prompted several repetitions of the same warning:
## Warning: Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : NA inserted for impossible 1900-02-29 datetime
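As a side note, the same merge can also be written more compactly with purrr (loaded as part of the tidyverse). This is only a sketch, assuming the working directory still points to the xlsx_files folder:

lst_tables <- list.files()
df_stp0_raw <- purrr::map_dfr(lst_tables, read_excel) # read every monthly file and bind the rows in one call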
Taking a peek
The ride_length time difference calculated in Excel (as required in the case study) has created, and could still create, errors in the data.
Let's get a small glimpse of the merged data to check that.
glimpse(df_stp0_raw)
## Rows: 3,489,748
## Columns: 15
## $ ride_id <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4~
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke~
## $ started_at <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-~
## $ ended_at <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-~
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu~
## $ start_station_id <chr> "86", "503", "142", "216", "125", "173", "35", "434~
## $ end_station_name <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "~
## $ end_station_id <chr> "152", "499", "255", "657", "323", "35", "635", "38~
## $ start_lat <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896~
## $ start_lng <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -~
## $ end_lat <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892~
## $ end_lng <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -~
## $ member_casual <chr> "member", "member", "member", "member", "casual", "~
## $ ride_length <dttm> 1899-12-31 00:26:49, 1899-12-31 00:08:09, 1899-12-~
## $ day_of_week <dbl> 1, 6, 4, 3, 7, 5, 5, 3, 4, 7, 7, 7, 6, 7, 2, 7, 1, ~
Calculating the ride_length
Some of the errors noticed are listed below:
- the ride_length is recognized as "dttm", which stands for "date and time", instead of time only;
- after importing the data created in Excel, the ride_length calculation behaves like an absolute value, for example:
- "started_at: 2020-10-02 13:12:09" and "ended_at: 2020-10-02 13:08:46" result in a ride_length of "23:56:37";
- fields filled with "NA", even though the started_at and ended_at fields were correct.
To avoid any future error related to this, I am going to perform the same calculation in R and check the result; this entirely recalculates ride_length.
df_stp0_raw$ride_length <- as_hms(df_stp0_raw$ended_at - df_stp0_raw$started_at)

glimpse(df_stp0_raw$ride_length) # The result should only show the time difference now
## 'hms' num [1:3489748] 00:26:49 00:08:09 00:14:23 00:12:12 ...
## - attr(*, "units")= chr "secs"
NOTE: The chunk of code above was first checked independently in a new column before directly editing ride_length.
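For illustration, that check could look like the minimal sketch below; the column name ride_length_chk is only illustrative and not part of the original script.

df_stp0_raw$ride_length_chk <- as_hms(df_stp0_raw$ended_at - df_stp0_raw$started_at) # temporary check column
summary(as.numeric(df_stp0_raw$ride_length_chk)) # inspect the range (in seconds) before overwriting ride_length
df_stp0_raw$ride_length_chk <- NULL # drop the temporary column once satisfied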
Filtering the data
Another inconsistency noticed is that some bikes appear to be returned before being used, i.e. rides with a negative ride_length.
In order to work on a cleaner version, I decided to split the data into two dataframes based on ride_length:
- df_stp1_grteqal: Which is the data frame I will continue to work on.
- df_stp1_lesser: Which will be stored for further investigations.
df_stp1_grteqal <- subset(df_stp0_raw, ride_length >= 0)
df_stp1_lesser  <- subset(df_stp0_raw, ride_length < 0)
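A quick sanity check (not part of the original output) confirms that the two subsets account for every row of the merged dataframe:

stopifnot(nrow(df_stp1_grteqal) + nrow(df_stp1_lesser) == nrow(df_stp0_raw))
nrow(df_stp1_lesser) # rides with a negative duration, kept aside for further investigation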
Getting rid of missing values
To prevent any errors in the visualizations, observations that contain NA (not-available) values will be removed from df_stp1_grteqal, and the result will be stored in a new dataframe named df_stp2_prefinal.
df_stp2_prefinal <- na.omit(df_stp1_grteqal)
Calculating the distance travelled
In this step, the distance travelled is calculated from the start and end latitude/longitude points. According to the geosphere documentation, the default result is in meters.
The for loop required very long computation times, so to be able to keep working on the project I ran the following chunk of code on another computer and exported the result as a csv file.
Keep in mind: the "distance" column is in meters.
#
# DONE ON ANOTHER COMPUTER #
#
# df_stp2_prefinal['distance'] <- NA # Creating an empty column
# for (i in 1:nrow(df_stp2_prefinal)) {
# a <- df_stp2_prefinal$start_lng[i]
# b <- df_stp2_prefinal$start_lat[i]
# c <- df_stp2_prefinal$end_lng[i]
# d <- df_stp2_prefinal$end_lat[i]
# df_stp2_prefinal$distance[i] <- distm(c(a, b),c(c, d), fun = distHaversine)
# }
# remove(a,b,c,d,i)
# write.csv(df_stp2_prefinal,'C:/Users/Exia/Documents/case_study_1/r_project/df_stp2_prefinal2.csv')
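As a side note, geosphere::distHaversine also accepts two-column matrices of (longitude, latitude) pairs, which would make a vectorized version possible and avoid the slow row-by-row loop. A minimal sketch, assuming the same df_stp2_prefinal columns:

# Vectorized alternative (sketch): compute all distances at once, result in meters
start_pts <- as.matrix(df_stp2_prefinal[, c("start_lng", "start_lat")])
end_pts   <- as.matrix(df_stp2_prefinal[, c("end_lng", "end_lat")])
df_stp2_prefinal$distance <- distHaversine(start_pts, end_pts)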
Importing the result
Importing the csv file and storing it in a new dataframe.
df_stp2_prefinal2_imprt <- read.csv("C:/Users/Nidal/Documents/case_study_1/r_project/external_data/df_stp2_prefinal2.csv")

df_stp2_prefinal2_imprt$X <- NULL # This removes the X column holding row numbers
Comparing observations and variables
Checking the difference of observations and variables.
tab_diff <- matrix(c(nrow(df_stp2_prefinal),
                     ncol(df_stp2_prefinal),
                     nrow(df_stp2_prefinal2_imprt),
                     ncol(df_stp2_prefinal2_imprt)),
                   ncol = 2, byrow = TRUE)

colnames(tab_diff) <- c('Observations','Variables')
rownames(tab_diff) <- c('df_stp2_prefinal','df_stp2_prefinal2_imprt')

tab_diff
## Observations Variables
## df_stp2_prefinal 3284237 15
## df_stp2_prefinal2_imprt 3284237 16
remove(tab_diff) # removing the tab created to show the difference
Converting values type
The imported file had some data-type issues, so the affected columns had to be converted back to their correct types.
The data manipulation is as follows:
- The started_at and ended_at columns were converted from char to POSIXct.
- The ride_length subtraction done earlier was redone on the converted columns.
# Converting two columns from char to POSIXct
df_stp2_prefinal2_imprt$started_at <- as.POSIXct(df_stp2_prefinal2_imprt$started_at, tz = "UTC")
df_stp2_prefinal2_imprt$ended_at   <- as.POSIXct(df_stp2_prefinal2_imprt$ended_at, tz = "UTC")

# Already done above, but I preferred to rerun it to keep things as before
df_stp2_prefinal2_imprt$ride_length <- as_hms(df_stp2_prefinal2_imprt$ended_at - df_stp2_prefinal2_imprt$started_at)

# Getting a glimpse of the structure of the dataframe
str(df_stp2_prefinal2_imprt)
## 'data.frame': 3284237 obs. of 16 variables:
## $ ride_id : chr "A847FADBBC638E45" "5405B80E996FF60D" "5DD24A79A4E006F4" "2A59BBDF5CDBA725" ...
## $ rideable_type : chr "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : POSIXct, format: "2020-04-26 17:45:14" "2020-04-17 17:08:54" ...
## $ ended_at : POSIXct, format: "2020-04-26 18:12:03" "2020-04-17 17:17:03" ...
## $ start_station_name: chr "Eckhart Park" "Drake Ave & Fullerton Ave" "McClurg Ct & Erie St" "California Ave & Division St" ...
## $ start_station_id : chr "86" "503" "142" "216" ...
## $ end_station_name : chr "Lincoln Ave & Diversey Pkwy" "Kosciuszko Park" "Indiana Ave & Roosevelt Rd" "Wood St & Augusta Blvd" ...
## $ end_station_id : chr "152" "499" "255" "657" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 42 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
## $ ride_length : 'hms' num 00:26:49 00:08:09 00:14:23 00:12:12 ...
## ..- attr(*, "units")= chr "secs"
## $ day_of_week : int 1 6 4 3 7 5 5 3 4 7 ...
## $ distance : num 3990 980 2991 2138 9138 ...
Note: Values with “dbl” class were automatically converted to “num”. No data manipulation is needed since double() is the same as numeric() in R.
Last but not least: Missing values
Finally, we remove any remaining not-available (NA) fields so that visualizations can be built on the df_final dataframe.
df_final <- na.omit(df_stp2_prefinal2_imprt)
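As an optional check (not part of the original output), the number of rows dropped at this final step can be inspected:

nrow(df_stp2_prefinal2_imprt) - nrow(df_final) # observations removed because of NA values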
Prepare lists for figure 4 (RiverLoad package)
Splitting the df_final into two dataframes:
- df_final_members
- df_final_casuals
df_final_members <- subset(df_final, member_casual == "member")
df_final_casuals <- subset(df_final, member_casual == "casual")
To be able to use RiverLoad, the needed data has to be extracted into separate dataframes.
Since the distance calculated earlier is in meters, it needs to be converted to kilometers. Simply dividing the values by 1000 is the quickest method.
fig4_members <- data.frame(df_final_members$started_at,
                           (df_final_members$distance / 1000))

fig4_casuals <- data.frame(df_final_casuals$started_at,
                           (df_final_casuals$distance / 1000))

# Renaming the columns as asked in the documentation
names(fig4_members) <- c('datetime', 'flow')
names(fig4_casuals) <- c('datetime', 'flow')

# Using RiverLoad to calculate the mean of every month in the flow of records
mn_members <- monthly.year.mean(fig4_members)
mn_casuals <- monthly.year.mean(fig4_casuals)

# Creating member_casual
mn_members$member_casual <- "member"
mn_casuals$member_casual <- "casual"

# Binding the dataframes created and renaming the columns
fig4_dist_month <- rbind(mn_members, mn_casuals)
names(fig4_dist_month) <- c('month', 'distance_mean', 'member_casual')
# Cleaning
remove(fig4_casuals, fig4_members, mn_casuals, mn_members)
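A quick structural check (illustrative, not from the original chunk) can confirm that fig4_dist_month holds one mean-distance value per month and rider type:

str(fig4_dist_month) # expect the columns: month, distance_mean, member_casual
table(fig4_dist_month$member_casual) # expect the same number of monthly rows for each group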
Analyze the data
The final dataframe represents 94% of the raw data, which means around 6% of dirty data was cleaned and/or isolated.
Table 1: Mean/Max/Min
As asked for in the case study, here are the mean, maximum and minimum ride_length values for members and casuals using Cyclistic bikes.
| | Mean | Maximum | Minimum |
|---|---|---|---|
| Members | 00:15:55 | 978:40:02 | 00:00:00 |
| Casuals | 00:45:27 | 928:03:53 | 00:00:00 |
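For reference, a minimal sketch of how these values could be reproduced from df_final; the column names mean_ride, max_ride and min_ride are only illustrative.

df_final %>%
  group_by(member_casual) %>%
  summarise(mean_ride = as_hms(round(mean(as.numeric(ride_length)))), # mean duration, rounded to whole seconds
            max_ride  = as_hms(max(as.numeric(ride_length))),
            min_ride  = as_hms(min(as.numeric(ride_length))))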
Table 2: Sum of use per day of week
The number of rides by each type of user, per day of the week, over the last twelve months.
# For aesthetic purposes and HTML-compatible output, I created two functions to use in the pipe tables below.
sumd_m <- function(day) {
  res <- sum(df_final_members$day_of_week == day)
  return(res)
}

sumd_c <- function(day) {
  res <- sum(df_final_casuals$day_of_week == day)
  return(res)
}
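For illustration, the cell values in the table below can be obtained by calling these helpers with the day index (1 = Sunday through 7 = Saturday, matching the day_of_week column):

sumd_m(1) # member rides on Sundays
sumd_c(7) # casual rides on Saturdays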
| | Members | Casuals |
|---|---|---|
| Sunday | 250465 | 249761 |
| Monday | 251905 | 142042 |
| Tuesday | 268223 | 136218 |
| Wednesday | 288390 | 148352 |
| Thursday | 283731 | 156177 |
| Friday | 288897 | 196451 |
| Saturday | 304608 | 319017 |
Act
- Survey premium users to learn the reasons, if any, why they chose to upgrade to the annual plan.
- Survey casual riders to learn what makes them hesitate and what would make them change their mind.
- Depending on the survey results, consider splitting the premium plan into "Premium" and "Premium+" tiers; making the annual plan more affordable could convince more casual riders.
- Use seasonal promotions to give casual riders a glimpse of the annual plan.
- Develop and promote an app that helps users build consistency.
See also
Additional information about this case study.
Sources
Get in touch
If you have any questions, please feel free to contact me.