The VOD Clickstream dataset covers 42 months of user activity on Netflix websites. Users opted in to having their actions anonymously tracked in return for the use of services, tools and browser extensions.
The dataset comprises over 610 million clicks from VOD viewers worldwide, revealing what they watched and when they watched it. From this, we can see how tastes differed around the world.
What is ‘the clickstream’?
‘The clickstream’ is a log of clicks made by opted-in, anonymised users. This stream of clicks reveals their journeys across the web, including what they watched on VOD platforms and how they arrived at the content.
For each click, we know:
- The URL they visited
- The time and date
- Their country of origin
- Details of their browser
- A randomly generated ID for each user, which resets after a pre-defined period of time
It may not sound like much, but it opens up a whole world of data for us to crunch. The scale and breadth of the dataset gives us a unique window into the viewing habits of VOD viewers around the world.
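As a rough illustration of the record described above, the sketch below models one click with the five listed fields and groups clicks into per-user journeys. The class name `Click`, the field names and the `journeys` helper are all assumptions made for this example; the real dataset's schema is not published here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Click:
    """One clickstream record. Field names are illustrative assumptions."""
    url: str             # the URL they visited
    timestamp: datetime  # the time and date
    country: str         # their country of origin
    browser: str         # details of their browser
    user_id: str         # randomly generated ID, reset periodically


def journeys(clicks):
    """Group clicks by anonymised user ID, each group ordered by time."""
    by_user = defaultdict(list)
    for click in clicks:
        by_user[click.user_id].append(click)
    for user_clicks in by_user.values():
        user_clicks.sort(key=lambda c: c.timestamp)
    return dict(by_user)
```

Because the ID resets after a set period, a journey built this way only spans the lifetime of one ID, not the lifetime of a person.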
The VOD Clickstream dataset provides us with the first public granular study of VOD activity at scale. It opens up a world previously hidden to all but the VOD platforms themselves.
- Insightful. At last, we can get a sense of what VOD audiences want to watch and therefore what content VOD platforms may be looking to produce or acquire.
- Independent. We do not need to solely rely on occasional press releases from VOD platforms to understand the habits of VOD audiences.
- Global. The data covers all countries where the platform operates. Historically, the film industry has traded on a country-by-country basis but in this new world of streaming, we have to take a global viewpoint.
- Accurate. We have tested the dataset against all public claims the VOD platform has made about their viewers’ habits. Our dataset is broadly in alignment with these figures. There are some limitations on the data (see below) and so we like to think of it as looking at the truth through an imperfect lens. It’s accurate, not precise.
- Legal. Our users opted in to share their anonymised habits and the dataset contains no personal information (e.g. no names or IP addresses).
Limitations and caveats
The VOD Clickstream dataset has a number of limitations and caveats which mean that we cannot regard it as a perfect reflection of what VOD platforms see behind the scenes. It is an imperfect lens which paints a broad picture. Among the limitations are:
- Historical data. We have data for a 42-month period for the US and Canada, and a 30-month period for the rest of the world, both ending in June 2019. Unfortunately, it’s not possible to get more recent (or live) data due to changes in the clickstream data business. Since then, there have been two major changes in the VOD sector: the increase in proprietary VOD services exclusively showing their own content (such as Disney+ and Peacock) and the global COVID-19 pandemic. The former means that some of the content we are tracking is no longer available on Netflix and the latter has led to a sharp increase in VOD viewership. Our analysis focuses on larger trends rather than individual pieces of content, so we feel that the vast majority of the insights gained from the Clickstream data are still valid in the wider discussion around ‘what works on VOD?’.
- Defining a view. Our core dataset tracks clicks made by users, and from that, we have generated our measures of content success. We can identify which clicks led to a video being played, and we can monitor when the user next clicked elsewhere. For example, if a user clicks to watch a 22-minute episode and after 21 minutes clicks to watch the next episode, we can reasonably regard this as a complete view. We have tracked both types of interaction – the initial click loading the content and the signal that it was completed – and used them in our measures of content success. These are subtly different measures to those Netflix has previously stated they use when reporting viewing data. In December 2018, Netflix told The Verge that they regard a ‘view’ as being counted after 70% of the content is complete and that if an account watches a video more than once it still only counts as one view. More recently, in January 2020, Wired reported that Netflix now counts views after just two minutes have elapsed.
- Relative interest. We cannot give a raw figure for the total number of people who watched a piece of content as we are only sampling a subset of viewers. VOD platforms do not release enough data to be able to draw wider conclusions about raw viewing figures.
- Desktop and laptop. This data is only for desktop and laptop users. In March 2018, Netflix stated that around 25% of global users watch via a browser, although it fluctuates between countries. This is another reason why we’re only tracking relative interest in titles, rather than raw viewing figures.
- Anonymous. In order to comply with data protection laws, our data supplier stripped all personal data from the dataset prior to giving us access. This means we cannot know the demographic breakdown of our audiences. We can track their viewing over the course of months, due to the anonymised IDs, but this only reveals habits, not personal attributes.
- 99.3% coverage. A very small amount of content remains unknown to us (i.e. we can’t link the viewing activity with a known movie, TV episode or comedy special). We managed to find the titles behind 99.3% of tracked audience activity.
- Subjective focus. The raw data is not in a form which can be directly used. Over the course of the study period the platform’s subscriber numbers changed, the size of our audience panel fluctuated and content was constantly being added to and removed from the platform. This is why we have to process the raw data into our scores. In doing so, we inevitably make subjective judgments. In all cases we have sought to minimise subjectivity and, where this wasn’t possible, to lean on data signals to make these calls. These calls include requiring content to be on the platform for at least 100 days during our study period to be included in the reporting, and capping availability windows at eight months (i.e. content available for more than eight months has the same weighting applied as content that was available for eight months). These figures were chosen as they best reflect our understanding of the lifecycle of content.
- Use it wisely. The data and insights which spring from the clickstream should be used to gain a general understanding of the VOD audience and sector. No one should hang entire business, legal or personal decisions on this data alone. Always seek professional advice before making any decisions.
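The rules described under ‘Defining a view’ and ‘Subjective focus’ above can be sketched in a few lines. This is only an illustration under stated assumptions, not the scoring method actually used: the 90% completion threshold, the approximation of eight months as 240 days, the views-per-day normalisation and every function name below are hypothetical. Only the 100-day minimum and the eight-month cap come from the text.

```python
from datetime import datetime, timedelta

MIN_DAYS_ON_PLATFORM = 100       # from the text: minimum days to be reported
AVAILABILITY_CAP_DAYS = 8 * 30   # assumption: eight months taken as 240 days
COMPLETION_THRESHOLD = 0.9       # assumption: share of runtime that counts as complete


def is_complete_view(play_clicked_at, next_click_at, runtime_minutes,
                     threshold=COMPLETION_THRESHOLD):
    """Infer completion from the gap between the play click and the next click."""
    if next_click_at is None:
        return False  # no later click, so completion cannot be inferred
    elapsed_minutes = (next_click_at - play_clicked_at).total_seconds() / 60
    return elapsed_minutes >= threshold * runtime_minutes


def is_eligible(days_available):
    """Content must be on the platform for at least 100 days to be reported."""
    return days_available >= MIN_DAYS_ON_PLATFORM


def capped_days(days_available):
    """Availability beyond the cap is weighted the same as the cap itself."""
    return min(days_available, AVAILABILITY_CAP_DAYS)


def views_per_available_day(view_count, days_available):
    """Hypothetical normalisation; returns None for ineligible content."""
    if not is_eligible(days_available):
        return None
    return view_count / capped_days(days_available)
```

With these assumed values, the example from the text (a 22-minute episode followed by a click after 21 minutes) is treated as a complete view, and a title available for 480 days is weighted as though it had been available for 240.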
The raw clickstream data comes from a highly reputable company within the data industry. They have asked that we do not disclose their name. However, we were given full access to them, their staff and the larger dataset. We performed due diligence on them and their data, to the point that we are 100% confident the data is as described.
Furthermore, having spent many months crunching the data, we believe it would be a more impressive feat to have falsified it than to have gathered it in the manner shown to us. We are providing the full dataset to third parties who will also be able to analyse it for anomalies.