UPDATE: Visualizing 2020: Was the Baseball Season Different? (Hint: Not Really)

In addition to my card paintings, I enjoy working with baseball data to visualize trends within a season or over time. It relates to what I do professionally, where I lead a team of researchers and designers at the National Library of Medicine to analyze and visualize how effectively our digital products are helping people find the medical and health information they need.

While data visualization for its own sake is fun, it is most relevant when it's applied to answer questions, whether in business or research. So as Major League Baseball teams have started spring training, I've come to wonder how unusual the 2020 season really was. Of course, it was much shorter -- 60 games compared to the usual 162, with games played in empty stadiums and, in the case of the playoffs, at neutral sites. But was performance really that different?

With the SABR Analytics Conference coming up next week, attendees got a free month of Stathead, the statistical service of Baseball Reference. (Thanks, Stathead!) To take advantage of the opportunity, I downloaded data for the 2017 through 2020 seasons to see if there are statistical differences between 2020 and the previous three seasons, and analyzed them in Tableau Public. All of these visualizations are on my Tableau Public page, where you can view them interactively. You can also follow my page so you will be notified when I post more data.

UPDATE: This morning, I ran another analysis looking at the years 2017 through 2020 separately. What I found confirms the initial analysis: while the 2020 season was only 60 games compared to the usual 162, offensive performances weren't proportionally different from the previous three seasons. Every dot on these two scatterplots represents a single season by a single player. Let's take a closer look:


As in the first analysis, the 2020 offensive statistics are bunched in the lower-left corner, which makes sense given the shorter season. But look at the trend lines (light blue for 2020; dark blue, orange, and red for 2017, 2018, and 2019 respectively). The slopes of the lines for 2018-2020 are all fairly similar (meaning a similar proportion of home runs to overall hits). It actually looks like 2017 is the outlier, with fewer home runs as a share of all hits. I'll leave it to another time to explore that.
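
For anyone who wants to check trend-line slopes like these outside Tableau, here is a minimal Python sketch. It assumes each row is a (season, hits, home runs) tuple and fits an ordinary least-squares line per season, on the assumption that the trend lines are linear. The function names, the season labels "A" and "B", and the sample rows are all made up for illustration; they are not the Stathead data.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_season(rows):
    """Group (season, hits, home_runs) rows by season and fit one slope each."""
    grouped = defaultdict(lambda: ([], []))
    for season, hits, home_runs in rows:
        grouped[season][0].append(hits)
        grouped[season][1].append(home_runs)
    return {season: ols_slope(h, hr) for season, (h, hr) in grouped.items()}


# Hypothetical player-seasons, NOT real data: (season label, hits, home runs).
SAMPLE_ROWS = [
    ("A", 100, 10), ("A", 150, 20), ("A", 200, 30),
    ("B", 40, 4), ("B", 50, 5), ("B", 60, 6),
]

for season, slope in sorted(slopes_by_season(SAMPLE_ROWS).items()):
    print(season, round(slope, 3))
```

A steeper slope means more home runs for each additional hit, which is the comparison the trend lines are making.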

Now let's look at on-base percentage (OBP) and slugging percentage (SLG), which are rate statistics and so adjust for the truncated length of the 2020 season.

As with the year-by-year analysis of hits vs. home runs, the trend lines here show that 2020 wasn't much of an outlier either, and that in the case of slugging percentage, 2019 seems to have been a higher-performing season for batters.

My next step will be to look at pitching: watch this space, and please comment with your thoughts.

The Basics: Hits vs. Home Runs

For the first analysis, I plotted home runs on the vertical axis against hits on the horizontal axis. Each dot on the graphic is a single player, with 2020 season data in orange and aggregate 2017-2019 data in blue. It's important to note here that these data aren't normalized to account for the short 2020 season, but for this initial look I was more interested in the slope of the trend lines than the raw number of hits or home runs.

The plot shows that (unsurprisingly) there were far fewer hits and home runs in 2020, because each team played 102 fewer games. It's the slope of the trend lines that's interesting, though. The slope of the trend line for 2020 (the orange line) is less steep than the one for the previous three years (the blue line). This means that in 2020, batters hit proportionally fewer home runs per hit than they did in the past three years. I'm not sure why this is -- it might be that given the long layoff between the cancellation of spring training in March and the resumption of training and games in June and July, hitters had to find their timing again, which was difficult to do over a sixty-game season.

The Next Step: Slugging Percentage vs. On Base Percentage

To compare the 2020 season to previous seasons, though, it's more meaningful to use percentage-based measures, which compensate for the truncated number of games. I chose on-base percentage because of its relationship to runs scored (at least as far as Billy Beane and his staff uncovered in Moneyball), and slugging percentage because it captures hitting for power (not only home runs, but all extra-base hits).
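
As a quick refresher on why these two measures don't depend on season length, here is a short Python sketch of the standard formulas: OBP is times on base divided by the plate appearances that count toward it, and SLG is total bases divided by at-bats. The batting line in the example is hypothetical, and the function and parameter names are my own.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (
        at_bats + walks + hit_by_pitch + sac_flies
    )


def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    """SLG = total bases / AB, where a single counts 1 base and a homer 4."""
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Hypothetical batting line (not a real player): 200 AB, 60 H, 12 2B,
# 1 3B, 10 HR, 25 BB, 2 HBP, 3 SF.
obp = on_base_percentage(hits=60, walks=25, hit_by_pitch=2, at_bats=200, sac_flies=3)
slg = slugging_percentage(hits=60, doubles=12, triples=1, home_runs=10, at_bats=200)
print(round(obp, 3), round(slg, 3))
```

Doubling every count in that batting line leaves both numbers unchanged, which is why rate statistics make a 60-game season comparable to a 162-game one.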

In this comparison (on the right), the two trend lines are very close together. In fact, the slopes differ by less than 10 percent. So on balance, these data suggest that as challenging as it was for MLB to shut down spring training and then start again three months later, batters performed much as they did in the previous three seasons, though home run totals may have been slightly depressed.
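
To make "differ by less than 10 percent" concrete, here is one way the comparison could be computed in Python. The choice of the earlier seasons as the baseline is an assumption, as are the function name and the two slope values, which are placeholders rather than the figures from the chart.

```python
def relative_slope_difference(slope, baseline_slope):
    """Absolute difference between two slopes as a fraction of the baseline."""
    return abs(slope - baseline_slope) / abs(baseline_slope)


# Placeholder slopes, NOT the values from the chart.
slope_2020 = 0.95
slope_baseline = 1.00

difference = relative_slope_difference(slope_2020, slope_baseline)
print(f"{difference:.1%}")
```

Note that the result depends on which slope is treated as the baseline, so it's worth stating that choice whenever a percentage like this is quoted.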

Next Steps

I'm going to continue expanding on this analysis in the coming days and weeks, and hope to get more ideas at the Analytics Conference next week. The areas I plan to look into include analyzing individual years (in other words, the 2017-2020 seasons as individual years, instead of 2020 against the previous three years in aggregate) and looking at pitching to see how the short season may have affected that part of the game.

Please continue to watch this space and to comment below. One of the most powerful benefits of data visualization is how it can spur discussions on interesting topics, which I hope it does here.


