IMDb Top 250 Movies List Analysis, 4th Edition

Can we predict the future of the IMDb Top 250 Movies List with statistics? Spoiler alert: no, not really.

The distribution graph for the IMDb Top 250 Movies list as of 10/3/2011

It’s October, which means it’s time to subject the IMDb Top 250 Movies list to a level of quantitative scrutiny it probably doesn’t deserve. For those of you who are new to this series, here’s a quick recap: I’m four years into an effort to analyze changing movie tastes through lists of top movies.

In 2008, I compared the 1998 and 2007 editions of the AFI Top 100 movies list to the IMDb Top 250 movies list as of September 2008. I found that the industry (AFI) list skewed towards older movies compared to the fan list (IMDb), and that the contents of the AFI list had only advanced by 5 years in the 9 years between the two editions.

In 2009 and 2010 I returned to the IMDb lists by taking snapshots of the list at around the same time of year as the initial analysis (late September/early October) and attempted to extrapolate some meaning from the changes in the lists’ composition over the years.

And now, in 2011, I’m adding a fourth year of IMDb data to the analysis! Will any trends emerge? Can we predict the future of the IMDb Top 250 list? Read on to find out.

Before we begin, I should state the obvious limitations to using the IMDb list as a tool for analyzing changes in movie tastes over time. The IMDb list is far from authoritative, and the potentially skewed demographics of its voters cast even more doubt on the validity of the list. Furthermore, the very exercise of assigning a single numerical rating to a movie is more than reductive; it’s borderline absurd. But put all of that aside for a moment. The IMDb list, for all its flaws, is well-known, and its ratings are accepted as being decent rough indicators of movie quality.

With that being said, let’s pick up where we left off last year, when I asserted that the IMDb list’s perceived bias towards newer movies was real, and getting worse over time. From last year’s article:

When I fired up Excel to do this analysis, I was pretty sure that I would find that the overall shift of the dataset in terms of median year would be greater than the concurrent shift in time. And I was right:

  • Median Year of IMDb Top 250 List as of 9/30/2008: 1975
  • Median Year of IMDb Top 250 List as of 10/18/2009: 1977 (jumps 2 years after 1 year)
  • Median Year of IMDb Top 250 List as of 9/26/2010: 1981.5 (jumps 4.5 years after 1 year)

Yup, you read that right. Over the course of the last year, the median year of the IMDb Top 250 movies list increased by 4.5 years, which suggests that not only is the list’s bias towards newer movies still present, it’s intensifying over time.

So how does the most recent sampling fit in with this trend? The list as of 10/3/2011 had a median year of 1983.5, or a jump of 2 years after 1 year. So the tilt towards newer movies wasn’t as severe as it was from 2009 to 2010, but it was still more than the half-year or so of annual movement you’d expect if the movies were evenly distributed between the oldest and newest titles on each year’s list.

Let’s see how that plays out in chart and graph form:

Year   Min    Max    Actual Median   Theoretical Median
2008   1920   2008   1975            1964
2009   1921   2009   1977            1965
2010   1921   2010   1981.5          1965.5
2011   1921   2011   1983.5          1966

(Note: the “theoretical median” = what the median year of the list would be if the movies were evenly distributed between the oldest movie on the list and the newest.)
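For anyone who wants to check the "theoretical median" column, here’s a minimal Python sketch of the calculation. The original analysis was done in Excel, so the Python, the `snapshots` list, and the `theoretical_median` helper are all just illustrative assumptions; the numbers are copied from the table above.

```python
# Each tuple: (list year, oldest movie year, newest movie year, actual median year).
# Values are copied from the table above; the variable names are made up for this sketch.
snapshots = [
    (2008, 1920, 2008, 1975.0),
    (2009, 1921, 2009, 1977.0),
    (2010, 1921, 2010, 1981.5),
    (2011, 1921, 2011, 1983.5),
]


def theoretical_median(oldest, newest):
    """Median year if the movies were spread evenly from oldest to newest."""
    return (oldest + newest) / 2


for list_year, oldest, newest, actual in snapshots:
    expected = theoretical_median(oldest, newest)
    # A positive skew means the real list leans newer than an even spread would.
    skew = actual - expected
    print(f"{list_year}: theoretical {expected}, actual {actual}, skew {skew:+.1f} years")
```

The skew column is the interesting part: it measures how many years newer the real list is than an evenly spread list would be.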

I thought it would be fun to make a rudimentary projection as to what would happen to this list based on the trend of the last 4 years. So I looked at how the difference between list year and median year changed: from 2008 to 2009, the difference shrank from 33 to 32; from 2009 to 2010, from 32 to 28.5; and from 2010 to 2011, from 28.5 to 27.5. Each change can also be stated as a reduction factor (e.g., 32/33 = .9697). Averaging these three factors produces a single average reduction factor of .9417. Apply that factor to future years and you get the shocking prediction that the list’s median year will essentially equal the year of the list in about 70 years:

OK, calm down folks. You don’t have to be a statistician to see the problems with this approach. First, intuitively, it makes no sense. Can you imagine a top movies list in the year 2070 that has as many movies on it from all years prior to 2069 as it does for the years 2069 and 2070? Second, and more importantly, a sample size of 4 is way too small to make this sort of projection. (Also, my little trick of averaging reduction factors probably isn’t sound math, but it worked in that it produced the nice, albeit erroneous, graph you see above.)
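If you’d like to reproduce the fuzzy math, here’s a rough Python sketch of the averaged-reduction-factor projection described above. It’s an illustration only, not the original Excel workbook: the function names are invented, and the 0.5-year cutoff for "essentially equal" is my own assumption.

```python
# List years and actual median years from the 2008-2011 snapshots.
list_years = [2008, 2009, 2010, 2011]
median_years = [1975.0, 1977.0, 1981.5, 1983.5]

# Gap between the year of the list and its median movie year: 33, 32, 28.5, 27.5.
gaps = [year - median for year, median in zip(list_years, median_years)]

# Year-over-year reduction factors, e.g. 32 / 33.
factors = [later / earlier for earlier, later in zip(gaps, gaps[1:])]

# The "single average reduction factor" is a plain arithmetic mean of the three.
avg_factor = sum(factors) / len(factors)


def project_gap(start_gap, factor, years_ahead):
    """Gap after applying the same reduction factor once per year."""
    return start_gap * factor ** years_ahead


def years_until_gap_below(start_gap, factor, threshold=0.5):
    """Count years until the projected gap drops under the threshold.

    The 0.5-year threshold is an assumed stand-in for "essentially equal".
    """
    gap, years = start_gap, 0
    while gap >= threshold:
        gap *= factor
        years += 1
    return years


print("average reduction factor:", round(avg_factor, 4))
print("years until gap is under half a year:", years_until_gap_below(gaps[-1], avg_factor))
```

With the half-year cutoff, the loop lands in the same ballpark as the roughly 70 years quoted above. Note that a gap multiplied by a constant factor below 1 never actually reaches zero, which is one more reason not to take the projection seriously.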

About that sample size. Unfortunately, I’ve only been doing this for four years. Now, if only there were some way to go back in time and capture the status of the list from previous years. If only…

But wait! Such a thing exists. Some fortuitous Google searches led me to the “IMDB Top 250 History” website, which has snapshots of lists going all the way back to April 1996.

Jackpot! I took additional snapshots from the same time frame, added them to the analysis, and…

…was disappointed to find that the trend totally disappeared with the expanded dataset:

Year   Min    Max    Actual Median   Theoretical Median
1996   1925   1996   1986            1960.5
1997   1927   1997   1986.5          1962
1998   1925   1998   1983            1961.5
1999   1922   1999   1975.5          1960.5
2000   1922   2000   1975.5          1961
2001   1922   2001   1978.5          1961.5
2002   1922   2002   1976            1962
2003   1922   2003   1975            1962.5
2004   1922   2004   1976            1963
2005   1920   2005   1978.5          1962.5
2006   1920   2006   1974.5          1963
2007   1920   2007   1976.5          1963.5
2008   1920   2008   1975            1964
2009   1921   2009   1977            1965
2010   1921   2010   1981.5          1965.5
2011   1921   2011   1983.5          1966

Turns out that in its earlier days, the IMDb Top 250 list was even more skewed towards newer movies than it is today, both in relative and absolute terms. And even with the larger sample size, the swings in median year make any sort of projection unfeasible, whether it’s with my fuzzy math method or a more formal linear regression analysis. And we haven’t even factored in IMDb’s changes and tweaks to its ranking algorithm over the years.
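To see how much the picture changes with the longer history, here’s a small Python sketch that fits an ordinary least-squares line to the gap between list year and median year, once for the last four snapshots and once for all sixteen. This is a hedged illustration rather than the analysis behind this article: the `ols_slope` helper is hypothetical, and the data is simply the "Actual Median" column from the table above.

```python
# List years and actual median years, 1996 through 2011, from the table above.
years = list(range(1996, 2012))
medians = [1986, 1986.5, 1983, 1975.5, 1975.5, 1978.5, 1976, 1975,
           1976, 1978.5, 1974.5, 1976.5, 1975, 1977, 1981.5, 1983.5]

# Gap between the year of the list and its median movie year.
gaps = [year - median for year, median in zip(years, medians)]


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope_recent = ols_slope(years[-4:], gaps[-4:])  # 2008-2011 only
slope_all = ols_slope(years, gaps)               # 1996-2011

print(f"gap trend, 2008-2011: {slope_recent:+.2f} years per year")
print(f"gap trend, 1996-2011: {slope_all:+.2f} years per year")
```

The two slopes have opposite signs: over the last four snapshots the gap shrinks by about 2 years per year, while over the full sixteen it grows by roughly 1.26 years per year. A straight line through either window says more about where you start counting than about where the list is headed.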

Sorry, folks, we can’t predict the future of the IMDb Top 250 Movies List through statistics. Or even make an educated guess. But we can still have a lot of fun analyzing the changes that have occurred to the list. Read on for more: