Understanding Your Users With Cohort Analysis

Effectively using data to inform business decisions is critical to building and growing great products. But it's not always obvious how best to use all the data you collect. One line of inquiry championed by Evernote CEO Phil Libin, among others, is cohort analysis.

There are various definitions of cohort analysis, but for our purposes it consists of breaking a collection of users into groups based on some trait and computing a statistic for each group to determine how the trait affects that statistic.

This could mean grouping users by the marketing channel they came through and computing lifetime value to determine which channel yields the most valuable customers. Alternatively, we could group users by age and compute daily visits to our website to determine how age affects user engagement.
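As a minimal sketch of the first case, assuming each user record carries a marketing channel and a lifetime value (the `SampleUser` struct, its fields, and the dollar figures are all made up for illustration), the grouping and the per-group statistic could look like this:

```ruby
# Hypothetical user records; the field names and dollar values are made up.
SampleUser = Struct.new(:name, :channel, :lifetime_value)

users = [
  SampleUser.new("ann",  :search,   120.0),
  SampleUser.new("bob",  :search,    80.0),
  SampleUser.new("cara", :referral, 300.0),
  SampleUser.new("dev",  :referral, 100.0),
  SampleUser.new("erin", :social,    50.0),
]

# Split users into cohorts by trait, then compute one statistic per cohort.
cohorts = users.group_by(&:channel)
average_ltv = cohorts.transform_values do |members|
  members.sum(&:lifetime_value) / members.size
end

average_ltv.each { |channel, value| puts "#{channel}: #{value}" }
```

Swapping `:channel` for another trait, or the average for another statistic, changes the question being asked without changing the shape of the analysis.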

Given a set of cohorts grouped by a trait, we can compute many different statistics to determine how the trait affects each of them (e.g. use age cohorts to determine how age affects both lifetime value and user engagement).

We can also compute these statistics at various times in order to determine how values evolve in each group. For example, we could create cohorts based on user age and track retention by looking at what percentage of the cohort uses the site daily over time. Such an analysis might look something like this:

In the chart above, we see that daily engagement falls off over time for each group but that long-term retention is higher for the younger cohorts. You could interpret this as saying that 1) some subset of users in each group is testing out the product and will drop off over time and 2) a larger percentage of younger users will stick with the product for the long run.
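To make the retention calculation concrete, here is a rough sketch under the assumption that we know, for each user, the days since sign-up on which they visited. The `Visitor` struct, the age groups, and the activity data are hypothetical.

```ruby
# Hypothetical activity log: day offsets (since sign-up) on which each user visited.
Visitor = Struct.new(:age_group, :active_days)

visitors = [
  Visitor.new("18-29", [0, 1, 2, 3]),
  Visitor.new("18-29", [0, 1]),
  Visitor.new("30-49", [0, 1]),
  Visitor.new("30-49", [0]),
]

days = 0..3

# For each age cohort, the percentage of its members active on each day.
retention = visitors.group_by(&:age_group).transform_values do |members|
  days.map do |day|
    active = members.count { |visitor| visitor.active_days.include?(day) }
    (100.0 * active / members.size).round(1)
  end
end

retention.each { |group, series| puts "#{group}: #{series.inspect}" }
```

Each array is one line on the retention chart: one percentage per day, per cohort.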

Let's Do An Example

This is all a bit abstract. Let's go through an example to see how this type of analysis can be implemented in a practical way.

The most common way to bin cohorts, which we will implement here, is by the calendar month in which a user registered for the service. This allows us to easily track various elements of performance over time.

At Wealthsimple, we are interested in tracking our assets under management. So for this example, we will use cohorts to track the growth of assets under management over time.

We'll do this by writing the following code snippets:

  • A User class

  • A Cohort class

  • A CohortStatistic class

  • An AssetsUnderManagement class which will inherit from CohortStatistic

Let's start off by assuming we have a User class that looks roughly like this (note: the included code snippets are written in Ruby but should be clear enough to be broadly applicable):

class User

  def created_at
    # return date_that_user_was_created
  end

  def assets_under_management(date)
    # returns the dollar amount the user has invested on a given date
  end

  def self.all
    # returns all instances of User
  end
end  

Next, let's build a Cohort class to partition the users into cohorts.

require "date"

class Cohort
  attr_reader :month, :year

  # date range over which to create cohorts (from the beginning of 2014 to the beginning of 2015)
  # note: the range is inclusive, so January 2015 is the last cohort
  START_DATE = Date.new(2014, 1, 1)
  END_DATE = Date.new(2015, 1, 1)

  # get the first day of each unique month in the calendar range
  MONTH_RANGE = (START_DATE..END_DATE).map { |d| Date.new(d.year, d.month, 1) }.uniq

  def initialize(month:, year:)
    @month = month
    @year = year
  end

  def users
    # half-open interval: includes the first day of the month, excludes the first day of the next
    @users ||= User.all.select { |u| u.created_at >= start_date && u.created_at < end_date }
  end

  def start_date
    Date.new(year, month, 1)
  end

  def end_date
    # plain-Ruby equivalent of start_date + 1.month, which requires ActiveSupport
    start_date.next_month
  end

  def self.all
    MONTH_RANGE.map { |date| Cohort.new(month: date.month, year: date.year) }
  end
end  

This creates a Cohort class which bins all users by the month-year pair in which they were created and allows us to easily access the members of each cohort.
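The binning rule itself can be illustrated on its own. The sketch below is not part of the Cohort class; it assumes a made-up hash of sign-up dates and assigns every user to a month in a single pass with `group_by`, which may be preferable to scanning all users once per cohort when the user base is large.

```ruby
require "date"

# Hypothetical sign-up dates, including two that fall on the first day of a month.
signups = {
  "ann"  => Date.new(2014, 1, 1),
  "bob"  => Date.new(2014, 1, 31),
  "cara" => Date.new(2014, 2, 1),
  "dev"  => Date.new(2014, 2, 14),
}

# Key each user by the first day of their sign-up month. This matches the
# half-open interval start_date <= created_at < start_date.next_month.
bins = signups.keys.group_by do |name|
  date = signups[name]
  Date.new(date.year, date.month, 1)
end

bins.each do |month_start, names|
  puts "#{month_start.strftime('%Y-%m')}: #{names.join(', ')}"
end
```

Note that a user created on the first of a month lands in that month's cohort, and nobody is counted twice.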

We will also want to have a cohort statistic class. This class will return the values of a statistic of interest for a given cohort.

class CohortStatistic  
  attr_reader :cohort

  def initialize(cohort)
    @cohort = cohort
  end

  def values
    date_range = @cohort.start_date..Date.today #all dates since the inception of the cohort
    date_range.map { |date| calculate_value_for_date(date) }
  end

  def calculate_value_for_date(date)
    raise NotImplementedError, "Override in subclass"
  end
end

The advantage of this general cohort statistic class is that it makes creating individual statistics very simple. All we have to do is create a method which lets us compute the value of the statistic on a given date. For this example, our goal is to track the assets under management of each cohort.

class AssetsUnderManagement < CohortStatistic

  def calculate_value_for_date(date)
    # start the sum at 0 so that an empty cohort yields 0 rather than nil
    cohort.users.map { |user| user.assets_under_management(date) }.reduce(0, :+)
  end
end

Now we can easily view the assets under management over time for each cohort. We simply create an instance of AssetsUnderManagement for the cohort and access its values.

# create a cohort of all users created in January 2015
cohort_of_interest = Cohort.new(month: 1, year: 2015)

#create an instance of AssetsUnderManagement for the cohort
cohort_of_interest_aum = AssetsUnderManagement.new(cohort_of_interest)

#print an array of the cohort's total assets under management for each day since inception
puts cohort_of_interest_aum.values

And that's basically all there is to it. We now have the foundation to create new cohort statistics very simply. We just create a class which inherits from CohortStatistic and defines a method to compute the value of the statistic on a given date.
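To illustrate, here is a sketch of a second statistic built the same way. So that it runs on its own, it uses minimal stand-ins rather than the full classes above: `DemoUser`, `DemoCohort`, the `last_active_on` field, and the `as_of` argument (used in place of today's date so the result is reproducible) are all assumptions made for this example.

```ruby
require "date"

# Minimal stand-ins so the sketch runs on its own; they are not the full
# classes from this article. `last_active_on` and `as_of` are hypothetical.
DemoUser = Struct.new(:created_at, :last_active_on)
DemoCohort = Struct.new(:start_date, :users)

class DemoCohortStatistic
  attr_reader :cohort

  def initialize(cohort, as_of: Date.today)
    @cohort = cohort
    @as_of = as_of
  end

  # One value per day from the cohort's start through `as_of`.
  def values
    (cohort.start_date..@as_of).map { |date| calculate_value_for_date(date) }
  end

  def calculate_value_for_date(date)
    raise NotImplementedError, "Override in subclass"
  end
end

# The new statistic: how many cohort members were last active on or after a date.
class ActiveUserCount < DemoCohortStatistic
  def calculate_value_for_date(date)
    cohort.users.count { |user| user.last_active_on >= date }
  end
end

cohort = DemoCohort.new(
  Date.new(2015, 1, 1),
  [
    DemoUser.new(Date.new(2015, 1, 1), Date.new(2015, 1, 3)),
    DemoUser.new(Date.new(2015, 1, 1), Date.new(2015, 1, 1)),
  ]
)

active_counts = ActiveUserCount.new(cohort, as_of: Date.new(2015, 1, 4)).values
puts active_counts.inspect
```

The only statistic-specific code is the one-line body of `calculate_value_for_date`; everything else is inherited.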

Visualization

In addition to any other analysis, it is frequently useful to visualize the time series of values for each of the cohorts for a given statistic in a single plot. This provides an intuitive way to compare the performance of your most recent users against previous batches on a given metric.

The x-axis shows the time (in months) that has elapsed since the cohort formed; the shorter the time-series, the newer the cohort. How is the most recent cohort performing relative to previous cohorts? Simply compare values at a given point in time.

cohorts = Cohort.all

# x-axis: days elapsed since each cohort formed (values are computed daily);
# the oldest cohort comes first and has the longest series
days_elapsed = (0..(Date.today - cohorts.first.start_date).to_i).to_a

datasets = cohorts.map do |cohort|
  {
    "label" => cohort.start_date.to_s,
    "values" => AssetsUnderManagement.new(cohort).values,
  }
end

# some_chart_api is a placeholder; substitute a real visualization tool here.
some_chart_api.plot(x_axis: days_elapsed, datasets: datasets)
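Comparing cohorts at the same age does not require a chart. As a small sketch, assuming datasets shaped like the ones built above (the labels and numbers here are invented), we can read off every cohort's value at a fixed number of days since inception:

```ruby
# Hypothetical daily series, oldest cohort first; newer cohorts have fewer points.
datasets = [
  { "label" => "2014-11-01", "values" => [100, 140, 180, 210] },
  { "label" => "2014-12-01", "values" => [120, 170, 230] },
  { "label" => "2015-01-01", "values" => [90, 160] },
]

days_elapsed = 1

# Look up each cohort's value at the same age; nil if the cohort is too young.
comparison = datasets.to_h do |dataset|
  [dataset["label"], dataset["values"][days_elapsed]]
end

comparison.each { |label, value| puts "#{label}: #{value.inspect}" }
```

A cohort that has not yet reached the chosen age shows up as nil, which is a useful reminder not to compare it yet.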

Another useful visualization is to "stack" the time series for each cohort. We put time on the x-axis and as each new cohort begins, we add its statistic value on top of the previous cohorts'. In this way, we can view our total assets under management (or any other statistic) broken down by how much each cohort is contributing.
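The stacking step could be sketched as follows. This assumes each cohort's series is a plain array of daily values beginning on the cohort's start date; the `series` data and the four-day window are made up for illustration. Unlike the previous plot, the x-axis here is calendar time, so each cohort contributes zero before it exists.

```ruby
require "date"

# Hypothetical daily totals per cohort, each starting on the cohort's first day.
series = [
  { start: Date.new(2014, 1, 1), values: [10, 12, 15, 15] },
  { start: Date.new(2014, 1, 3), values: [5, 8] },
]

dates = (Date.new(2014, 1, 1)..Date.new(2014, 1, 4)).to_a

# Align every cohort to the shared calendar axis, using 0 before it exists.
aligned = series.map do |s|
  dates.map do |date|
    offset = (date - s[:start]).to_i
    offset >= 0 ? s[:values].fetch(offset, 0) : 0
  end
end

# Stack: each layer's top edge is its own value plus everything beneath it.
running = Array.new(dates.size, 0)
stacked = aligned.map do |layer|
  running = running.zip(layer).map { |below, value| below + value }
end

stacked.each { |layer| puts layer.inspect }
```

The last layer's top edge is the overall total on each date, and the gap between consecutive layers is a single cohort's contribution.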

Many statistics that can be tracked using cohorts are useful for most products, including user retention, cost of acquisition, generated revenue, and many others. In addition, most products will have application-specific metrics which are also useful to track. While some discretion is required to identify which statistics are most useful to track, the procedure for implementing them is highly consistent.