Cyclistic Bike - Case Study

Nidal Iguer

19/04/2021

Purpose

The objective of this analysis is to identify the differences between Cyclistic members and casual riders. The data used covers the previous 12 Months from Apr-2020 to Mar-2021 as asked for in the case study.

Ask

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?


Prepare and process the data

Loading the packages needed

library(tidyverse)
library(readxl)
library(lubridate)
library(hms)
library(scales)
library(geosphere)
library(RiverLoad)
library(prettydoc)
library(fontawesome)

Merging xlsx files

The datasets we’re going to use is separated in 12 files representing the last twelve previous months. In order to clean it efficiently we first need to merge them into one single dataframe.

I have started by looping through the list stored in the “table” value and binding each dataset to an empty one I’ll name df_stp0_raw.

setwd("C:/Users/Nidal/Documents/case_study_1/r_project/xlsx_files")
lst_tables = list.files()
df_stp0_raw = data.frame()

for(table in lst_tables){
  dummy_data <- read_excel(table)
  df_stp0_raw <- rbind(df_stp0_raw, dummy_data)
}

remove(dummy_data, lst_tables, table) # Cleaning some values created for the task

Showing the error encountered

Before resuming, I would like to point to the fact that the total of observations were compared and both results were the same. The total of rows/observations were 3489748 each.

NOTE: The chunk of code above has prompted several lines of the same error:

## Warning: Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : NA inserted for impossible 1900-02-29 datetime

Taking a peek

The time difference of the ride_length calculated in Excel (as asked for in the case study) has and could create possible errors in the data.

Getting a small glimpse of the generated data to check that.

glimpse(df_stp0_raw)
## Rows: 3,489,748
## Columns: 15
## $ ride_id            <chr> "A847FADBBC638E45", "5405B80E996FF60D", "5DD24A79A4~
## $ rideable_type      <chr> "docked_bike", "docked_bike", "docked_bike", "docke~
## $ started_at         <dttm> 2020-04-26 17:45:14, 2020-04-17 17:08:54, 2020-04-~
## $ ended_at           <dttm> 2020-04-26 18:12:03, 2020-04-17 17:17:03, 2020-04-~
## $ start_station_name <chr> "Eckhart Park", "Drake Ave & Fullerton Ave", "McClu~
## $ start_station_id   <chr> "86", "503", "142", "216", "125", "173", "35", "434~
## $ end_station_name   <chr> "Lincoln Ave & Diversey Pkwy", "Kosciuszko Park", "~
## $ end_station_id     <chr> "152", "499", "255", "657", "323", "35", "635", "38~
## $ start_lat          <dbl> 41.8964, 41.9244, 41.8945, 41.9030, 41.8902, 41.896~
## $ start_lng          <dbl> -87.6610, -87.7154, -87.6179, -87.6975, -87.6262, -~
## $ end_lat            <dbl> 41.9322, 41.9306, 41.8679, 41.8992, 41.9695, 41.892~
## $ end_lng            <dbl> -87.6586, -87.7238, -87.6230, -87.6722, -87.6547, -~
## $ member_casual      <chr> "member", "member", "member", "member", "casual", "~
## $ ride_length        <dttm> 1899-12-31 00:26:49, 1899-12-31 00:08:09, 1899-12-~
## $ day_of_week        <dbl> 1, 6, 4, 3, 7, 5, 5, 3, 4, 7, 7, 7, 6, 7, 2, 7, 1, ~

Calculating the ride_length

Some noticed errors are those in the list that follows:

  • the ride_length is recognized as “dttm” which stands for “Date and time” instead of “time” only ;
  • after importing the data created in Excel, the ride_length calculation behave like an absolute value, for example:
    • “started_in: 2020-10-02 13:12:09” and “ended_at: 2020-10-02 13:08:46” results with a ride_length of “23:56:37” ;
  • fields filled with “NA”, even if the started_in and ended_in fields were correct.

To avoid any future error related to this, I’m going to operate the same calculation with R and check the result, this will entirely recalculate the ride_length.

df_stp0_raw$ride_length <- as_hms(df_stp0_raw$ended_at - df_stp0_raw$started_at)

glimpse(df_stp0_raw$ride_length) # The result should only show the time difference now
##  'hms' num [1:3489748] 00:26:49 00:08:09 00:14:23 00:12:12 ...
##  - attr(*, "units")= chr "secs"

NOTE: The chunck of code above was dependently checked in a new column before directly editing the ride_length.

Filtering the data

Another incomprehensible inconsistency noticed is the bikes returned before being used.

In order to work on a cleaner version, I decided to split the ride_length into two dataframes:

  1. df_stp1_grteqal: Which is the data frame I will continue to work on.
  2. df_stp1_lesser: Which will be stored for further investigations.
df_stp1_grteqal <- subset(df_stp0_raw, ride_length >= 0)
df_stp1_lesser <- subset(df_stp0_raw, ride_length < 0)

Getting rid of missing values

To prevent any errors in visualizations, observations that contains NA (which is not available data) will be removed from df_stp1_grteqal and the result will be put in a new dataframe I’ll name df_stp2_prefinal.

df_stp2_prefinal <- na.omit(df_stp1_grteqal)

Calculating the distance travelled

In this step, the distance traveled will be calculated using latitude and longitude start and end points. According to the documentation of geosphere, the default result is in meters.

The for loop needed very long computations. Thereby, to be able to keep working on the project I ran the following chunk of code on another computer and exported it as a csv file.

Keep in mind: The column “distance” is in meters.

#
# DONE ON ANOTHER COMPUTER #
#

# df_stp2_prefinal['distance'] <- NA # Creating an empty column

# for (i in 1:nrow(df_stp2_prefinal)) {
#   a <- df_stp2_prefinal$start_lng[i]
#   b <- df_stp2_prefinal$start_lat[i]
#   c <- df_stp2_prefinal$end_lng[i]
#   d <- df_stp2_prefinal$end_lat[i]

#   df_stp2_prefinal$distance[i] <- distm(c(a, b),c(c, d), fun = distHaversine)
# }

# remove(a,b,c,d,i)

# write.csv(df_stp2_prefinal,'C:/Users/Exia/Documents/case_study_1/r_project/df_stp2_prefinal2.csv')

Importing the result

Importing the csv file and storing it in a new dataframe.

df_stp2_prefinal2_imprt <- read.csv("C:/Users/Nidal/Documents/case_study_1/r_project/external_data/df_stp2_prefinal2.csv")

df_stp2_prefinal2_imprt$X <- NULL # This removes the X column with rows numbers

Comparing observations and variables

Checking the difference of observations and variables.

tab_diff <- matrix(c(nrow(df_stp2_prefinal), 
                     ncol(df_stp2_prefinal), 
                     nrow(df_stp2_prefinal2_imprt), 
                     ncol(df_stp2_prefinal2_imprt)), 
                   ncol=2, byrow=TRUE)

colnames(tab_diff) <- c('Observations','Variables')
rownames(tab_diff) <- c('df_stp2_prefinal','df_stp2_prefinal2_imprt')

tab_diff
##                         Observations Variables
## df_stp2_prefinal             3284237        15
## df_stp2_prefinal2_imprt      3284237        16
remove(tab_diff) # removing the tab created to show the difference

Converting values type

The file imported had some issues with data type, I was forced to convert them to the correct ones.

The data manipulation done is as follows:

  1. The values started_at and ended_at were converted from char to POSIXct.
  2. I had to redo the ride_length subtraction done before.
# Converting two values from char to POSIXct
df_stp2_prefinal2_imprt$started_at <- as.POSIXct(df_stp2_prefinal2_imprt$started_at, tz = "UTC")
df_stp2_prefinal2_imprt$ended_at <- as.POSIXct(df_stp2_prefinal2_imprt$ended_at, tz = "UTC")

# Already done above, but in order to make things like before. I preferred to rerun it.
df_stp2_prefinal2_imprt$ride_length <- as_hms(df_stp2_prefinal2_imprt$ended_at - df_stp2_prefinal2_imprt$started_at)

# Getting a glimpse on the structure of the dataframe
str(df_stp2_prefinal2_imprt)
## 'data.frame':    3284237 obs. of  16 variables:
##  $ ride_id           : chr  "A847FADBBC638E45" "5405B80E996FF60D" "5DD24A79A4E006F4" "2A59BBDF5CDBA725" ...
##  $ rideable_type     : chr  "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct, format: "2020-04-26 17:45:14" "2020-04-17 17:08:54" ...
##  $ ended_at          : POSIXct, format: "2020-04-26 18:12:03" "2020-04-17 17:17:03" ...
##  $ start_station_name: chr  "Eckhart Park" "Drake Ave & Fullerton Ave" "McClurg Ct & Erie St" "California Ave & Division St" ...
##  $ start_station_id  : chr  "86" "503" "142" "216" ...
##  $ end_station_name  : chr  "Lincoln Ave & Diversey Pkwy" "Kosciuszko Park" "Indiana Ave & Roosevelt Rd" "Wood St & Augusta Blvd" ...
##  $ end_station_id    : chr  "152" "499" "255" "657" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 42 ...
##  $ end_lng           : num  -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
##  $ ride_length       : 'hms' num  00:26:49 00:08:09 00:14:23 00:12:12 ...
##   ..- attr(*, "units")= chr "secs"
##  $ day_of_week       : int  1 6 4 3 7 5 5 3 4 7 ...
##  $ distance          : num  3990 980 2991 2138 9138 ...

Note: Values with “dbl” class were automatically converted to “num”. No data manipulation is needed since double() is the same as numeric() in R.

Last but not least: Missing values

Finally, we clean any not available (NA) fields to start creating visualizations based on the df_final dataframe.

df_final <- na.omit(df_stp2_prefinal2_imprt)

Prepare lists for figure 4 (RiverLoad package)

Splitting the df_final into two dataframes:

  • df_final_members
  • df_final_casuals
df_final_members <- subset(df_final, member_casual == "member")
df_final_casuals <- subset(df_final, member_casual == "casual")

To be able to use RiverLoad, the data needed should be extracted and put in separate dataframe.

Since the distance calculated before is in meters, it will need to be converted to kilometers. Simply subtracting the result by 1000 is the quickest method.

fig4_members <- data.frame(df_final_members$started_at,
                       (df_final_members$distance / 1000))

fig4_casuals <- data.frame(df_final_casuals$started_at,
                       (df_final_casuals$distance / 1000))

# Renaming the column as asked in the documentation
names(fig4_members) <- c('datetime', 'flow')
names(fig4_casuals) <- c('datetime', 'flow')

# Using RiverLoad to calculate the mean of every month in the flow of records
mn_members <- monthly.year.mean(fig4_members)
mn_casuals <- monthly.year.mean(fig4_casuals)

# Creating member_casual
mn_members$member_casual <- "member"
mn_casuals$member_casual <- "casual"

# Binding the dataframes created, and renaming columns
fig4_dist_month <- rbind(mn_members, mn_casuals)
names(fig4_dist_month) <- c('month', 'distance_mean', 'member_casual')

# Cleaning
remove(fig4_casuals, fig4_members, mn_casuals, mn_members)

Analyze the data

The final data frame represents 94% of the raw data. Which results of around 6% of dirty data cleaned and/or isolated.

Table 1: Mean/Max/Min

As asked for in the case study, the mean, maximum and minimum values of the members and casuals using Cystistic bikes.

Mean Maximum Minimum
Members 00:15:55 978:40:02 00:00:00
Casuals 00:45:27 928:03:53 00:00:00

Table 2: Sum of use per day of week

The sum of use of each type of users by day during the last twelve months.

# For aesthetic purposes and HTML compatibility output, I created two functions to use in the pipe tables below.

sumd_m <- function(day) {
  res <- sum(df_final_members$day_of_week == day)
  return(res)
}

sumd_c <- function(day) {
  res <- sum(df_final_casuals$day_of_week == day)
  return(res)
}
Members Casuals
Sunday 250465 249761
Monday 251905 142042
Tuesday 268223 136218
Wednesday 288390 148352
Thursday 283731 156177
Friday 288897 196451
Saturday 304608 319017

Share

Figure 1: Amount of use of Cyclistic bikes

ggplot(df_final, aes(x = day_of_week, fill = member_casual)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  ggtitle("Difference of amount of use for each type of user") +
  xlab("Days of the week") +
  ylab("Amount of use") +
  scale_x_discrete(limits=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")) +
  theme(legend.title = element_blank()) +
  ggtitle("Amount of use of rideables by type of users")

Figure 2: Length difference (less-details)

ggplot(data = df_final, aes(x = member_casual,
                            y = ride_length,
                            color = member_casual)) +
  geom_point() +
  xlab("Membership") +
  ylab("Ride length") +
  ggtitle("Ride length difference between casuals and members (less-details)") +
  theme(legend.title = element_blank())

Figure 3: Length difference (more-details)

ggplot(df_final, aes(started_at, ride_length, color = member_casual, factor(p_year))) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~member_casual, dir = "v", labeller = labeller(member_casual = c("member" = "Members", "casual" = "Casuals"))) +
  ggtitle("Ride length difference by type of users with more details") +
  xlab("Months") +
  ylab("Ride length") +
  ylim(0, 12000) +
  theme(legend.title = element_blank())

Figure 4: Distance travelled

ggplot(fig4_dist_month, aes(x = as_date(month),
                            y = distance_mean,
                            color = member_casual)) +
  geom_line(size=4) +
  scale_x_date(breaks = "1 month",
               date_labels = "%m-%Y") +
  theme(axis.text.x = element_text(angle = 90),
        legend.title = element_blank()) +
  labs(subtitle="From Apr-2020 to Mar-2021",
       y="Distance travelled [km]",
       x="Months",
       title="Difference of distance travelled by each type of user")

Act

  • Survey premium users to know the reason (if any) they’ve chosen to upgrade to the annual plan.
  • Survey casuals to know what make them hesitate and what would make them change their mind.
  • Depending on the survey’s result, maybe separate the premium plan into “Premium” and “Premium+”. Making the annual plan more affordable will maybe convince more casuals.
  • Use seasonal promotions to give a glimpse of the annual plan.
  • Develop and promote an app that help users build consistency.

Comments on the Data

  • Negative values when subtracting ended_at with started_end.
  • Distance between lat/long start and end point resulting with a 0, while the ride length is positive.
  • Station IDs are integers or strings, this can lead to errors or additional work if the data is needed someday.
  • The figure 3 shows an empty gap for both casuals and members. I got the error message “Removed 18627 rows containing non-finite values (stat_smooth).” I let this to whoever created the data.

See also

Additional information about this case study.

More clarity

As it’s written in the title of the chart:

Flow chart describing data manipulation process

Sources

Get in touch

If you have any questions, please feel free to contact me.