Yahoo! gives every game a gameid. For World Series Game 7, the gameid is:
game_id = '361102105'
The game boxscore data that you see on Yahoo! can be accessed in real-time here:
game_url = 'https://api-secure.sports.yahoo.com/v1/editorial/s/boxscore/mlb.g.' + game_id + \
'?lang=en-US&region=US&tz=America%2FChicago&ysp_redesign=1&mode=&v=4&ysp_enable_last_update=1&polling=1'
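As an aside, the same query string can be assembled from a dictionary so that the percent-encoding is handled by the standard library. This is only a sketch: the helper name `build_boxscore_url` and its `tz` argument are made up for illustration, and the parameter names are simply copied from the URL above.

```python
from urllib.parse import urlencode

# Base endpoint copied from the URL above; the helper name is hypothetical.
BOXSCORE_BASE = 'https://api-secure.sports.yahoo.com/v1/editorial/s/boxscore/mlb.g.'

def build_boxscore_url(game_id, tz='America/Chicago'):
    """Return the boxscore URL for a Yahoo! game id."""
    params = {
        'lang': 'en-US',
        'region': 'US',
        'tz': tz,
        'ysp_redesign': 1,
        'mode': '',
        'v': 4,
        'ysp_enable_last_update': 1,
        'polling': 1,
    }
    # urlencode percent-encodes the values, so '/' in the time zone becomes %2F
    return BOXSCORE_BASE + game_id + '?' + urlencode(params)

game_url = build_boxscore_url('361102105')
```

Building the URL this way makes it easier to swap in a different game id or time zone without editing a long string by hand.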
Using requests, we can get the boxscore data.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'}
response = requests.get(game_url, headers=headers)
game_data = response.json()
Exploring the boxscore, I found a gamepitches record. This is Yahoo!'s pitch data.
pitches = game_data['service']['boxscore']['gamepitches']['mlb.g.'+game_id]
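The chained lookup above raises a bare KeyError if Yahoo! changes the layout of the response. A small helper can report which level was missing. This is a sketch under assumptions: `dig` is a hypothetical name, and the sample payload below is made up to mirror the nesting used above, not real boxscore data.

```python
def dig(data, *keys):
    """Walk nested dicts, naming the key path that is missing."""
    current = data
    for depth, key in enumerate(keys):
        if not isinstance(current, dict) or key not in current:
            path = ' -> '.join(str(k) for k in keys[:depth + 1])
            raise KeyError('missing key in boxscore payload: ' + path)
        current = current[key]
    return current

# Toy payload with the same nesting as the path used above (values are made up)
sample = {'service': {'boxscore': {'gamepitches': {'mlb.g.361102105': {'10100': {}}}}}}
pitches_sample = dig(sample, 'service', 'boxscore', 'gamepitches', 'mlb.g.361102105')
```

With the real response, the call would be `dig(game_data, 'service', 'boxscore', 'gamepitches', 'mlb.g.' + game_id)`.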
I printed out how many pitches were thrown in Game 7.
print("Total number of pitches thrown during the game: " + str(len(pitches)))
I pushed the pitch data into a pandas data frame.
import pandas as pd
pitch_df = pd.DataFrame(pitches)
pitch_df.head()
I transposed the data frame so that the columns corresponded to the pitch variables such as balls, strikes, and velocity.
transpose_pitch_df = pitch_df.transpose()
transpose_pitch_df.head()
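To see why the transpose is needed, here is a toy version of the same step. The records are made up; only the shape (pitch id mapped to a dictionary of fields) is assumed to match the gamepitches data. It also shows `DataFrame.from_dict` with `orient='index'`, which builds the one-row-per-pitch layout directly.

```python
import pandas as pd

# Made-up records shaped like the gamepitches payload: pitch id -> fields
toy_pitches = {
    '10100': {'pitcher': '9048', 'balls': '0', 'strikes': '0', 'velocity': '93'},
    '10101': {'pitcher': '9048', 'balls': '0', 'strikes': '1', 'velocity': '88'},
}

# DataFrame(dict_of_dicts) uses the outer keys (pitch ids) as columns...
wide = pd.DataFrame(toy_pitches)
# ...while orient='index' uses them as row labels, one row per pitch
by_pitch = pd.DataFrame.from_dict(toy_pitches, orient='index')
```

Either route should give the same table here; the transpose just makes the intermediate step visible.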
The data in the pitch data frame are all strings. I needed to convert numeric fields to numeric data types.
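for val in list(transpose_pitch_df.columns.values):
    # convert_objects() is gone from recent pandas releases; pd.to_numeric replaces it
    converted = pd.to_numeric(transpose_pitch_df[val], errors='coerce')
    # keep the original strings when nothing in the column parses as a number
    if converted.notna().any():
        transpose_pitch_df[val] = converted
transpose_pitch_df.dtypes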
Using BeautifulSoup, I retrieved the names of the pitchers and batters from their Yahoo! ID.
from bs4 import BeautifulSoup
I verified the starting pitcher and lead-off batter for Game 7.
pitcher_url = 'http://sports.yahoo.com/mlb/players/' + str(transpose_pitch_df['pitcher']['10100']) + '/'
req = requests.get(pitcher_url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
print("On the mound to start the game: " + soup.title.string)
batter_url = 'http://sports.yahoo.com/mlb/players/' + str(transpose_pitch_df['batter']['10100']) + '/'
req = requests.get(batter_url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
print("First batter faced: " + soup.title.string)
I also listed the pitchers who played in Game 7.
pitcher_list = transpose_pitch_df.pitcher.unique()
for pitcher in pitcher_list:
    pitcher_url = 'http://sports.yahoo.com/mlb/players/' + str(pitcher) + '/'
    req = requests.get(pitcher_url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    pitcher_info = soup.title.string
    print(pitcher_info.split('|')[0] + " of the " +
          pitcher_info.split('|')[1] + "(Yahoo ID: "+str(pitcher)+")")
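The title handling in the loop can be pulled into a small function that also copes with a title that has no team segment. This is a sketch: `describe_player` is a hypothetical helper, and the example title string is an assumption about how the page title is laid out (name, then team, separated by `|`), based only on the split used above.

```python
def describe_player(title, player_id):
    """Format a '<name> | <team> | ...' page title as a one-line description."""
    parts = [part.strip() for part in title.split('|')]
    name = parts[0]
    # Fall back to a placeholder when the title has no team segment
    team = parts[1] if len(parts) > 1 and parts[1] else 'unknown team'
    return '{} of the {} (Yahoo ID: {})'.format(name, team, player_id)

# Hypothetical title string; the real page title may be laid out differently
print(describe_player('Corey Kluber | Cleveland Indians | Yahoo Sports', 9048))
```

Stripping each segment also avoids the doubled or missing spaces that come from splitting on `|` alone.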
I thought that Kluber and Hendricks were going to be in Game 7 a lot longer than they were. I filtered the data to look only at their pitches.
kluber = transpose_pitch_df[transpose_pitch_df.pitcher == 9048]
hendricks = transpose_pitch_df[transpose_pitch_df.pitcher == 9758]
After filtering on only their pitches, I can create some pitch charts using matplotlib.
import matplotlib.pyplot as plt
%matplotlib inline
I plotted a histogram of Kluber's pitch velocity during Game 7.
plt.figure(figsize=(6,6))
plt.subplot(111)
k_max = kluber.velocity.max()
k_min = kluber.velocity.min()
ax1 = kluber['velocity'].hist(bins=int(k_max - k_min))  # bins must be an int: one bin per mph
plt.xlim(70, 100)
plt.title('Corey Kluber Pitch Velocity - Game 7')
plt.xlabel('Pitch Velocity')
plt.ylabel('Number of Pitches')
plt.show()
I also created a histogram of Hendricks's pitch velocity during Game 7.
plt.figure(figsize=(6,6))
plt.subplot(111)
h_max = hendricks.velocity.max()
h_min = hendricks.velocity.min()
ax2 = hendricks['velocity'].hist(bins=int(h_max - h_min))  # bins must be an int: one bin per mph
plt.xlim(70, 100)
plt.title('Kyle Hendricks Pitch Velocity - Game 7')
plt.xlabel('Pitch Velocity')
plt.ylabel('Number of Pitches')
plt.show()
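Both histograms aim for one bin per mile per hour. Another way to get that is to pass explicit bin edges instead of a bin count, since `hist` accepts a sequence for `bins`. The helper below is a sketch with a made-up name, `mph_bin_edges`, and made-up sample velocities.

```python
import math

def mph_bin_edges(velocities):
    """Return integer bin edges, one mph apart, that cover every velocity."""
    low = math.floor(min(velocities))
    high = math.ceil(max(velocities))
    # max(high, low + 1) guarantees at least one bin when all values are equal;
    # the extra +1 is there because range() excludes its stop value
    return list(range(low, max(high, low + 1) + 1))

edges = mph_bin_edges([88.0, 91.5, 93.0])
```

The result could then be passed as `kluber['velocity'].hist(bins=mph_bin_edges(kluber['velocity'].dropna()))`, which keeps the bin boundaries on whole numbers regardless of the data type of the column.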
Next, I looked at pitch location. I separated balls from everything else (strikes, foul balls, and hits) for each pitcher.
sorted_kluber_pitches = kluber.sort_values(['play_num'])
sorted_hendricks_pitches = hendricks.sort_values(['play_num'])
kluber_balls = sorted_kluber_pitches[sorted_kluber_pitches.result == 0]
hendricks_balls = sorted_hendricks_pitches[sorted_hendricks_pitches.result == 0]
kluber_others = sorted_kluber_pitches[sorted_kluber_pitches.result != 0]
hendricks_others = sorted_hendricks_pitches[sorted_hendricks_pitches.result != 0]
I plotted the pitch locations using matplotlib and seaborn.
import seaborn as sns
First, I checked whether the pitches that were called balls outlined the strike zone.
plt.figure(figsize=(6,6))
ax = plt.subplot(111)
ax.scatter(transpose_pitch_df[transpose_pitch_df.result==0].horizontal,
transpose_pitch_df[transpose_pitch_df.result==0].vertical,
marker='o', label='balls')
ax.set_xlim(-20000,20000)
ax.set_ylim(-20000,20000)
#Axis labels
ax.set_xlabel('Horizontal')
ax.set_ylabel('Vertical')
#set title
ax.set_title('The Strike Zone',
y=1.0, fontsize=18)
#show legend
ax.legend(loc=3, frameon=True, shadow=True)
plt.show()
When I filtered to show only balls, a big gap appeared in the pitch locations. The points around that gap outline the strike zone.
I created the starting pitchers' pitch charts. Here is Kyle Hendricks's.
# create our jointplot
# x and y are passed by keyword; stat_func=None is omitted because newer seaborn releases reject it
joint_chart = sns.jointplot(x=hendricks_others.horizontal,
                            y=hendricks_others.vertical,
                            color='r',
                            marker='o',
                            s=50,
                            kind='scatter',
                            space=0,
                            label='Strikes/Hits/Fouls',
                            alpha=1.0)
joint_chart.fig.set_size_inches(6,6)
joint_chart.x = hendricks_balls.horizontal
joint_chart.y = hendricks_balls.vertical
joint_chart.plot_joint(plt.scatter, marker='o',
c='b', s=50,label='Balls')
ax = joint_chart.ax_joint
ax.set_xlim(-20000,20000)
ax.set_ylim(-20000, 20000)
# Get rid of axis labels and tick marks
#ax.set_xlabel('')
#ax.set_ylabel('')
#ax.tick_params(labelbottom='off', labelleft='off')
# Add a title and legend
ax.set_title('Kyle Hendricks: Strikes/Fouls/Balls Hit',
y=1.2, fontsize=18)
ax.legend(loc=3, frameon=True, shadow=True)
# Add Data Source and Author
#ax.text(-20000,-20000,'Data Source: Yahoo!'
# '\nAuthor: Gregory Brunner, @gregbrunn',
# fontsize=12)
plt.show()
Here is Corey Kluber's pitch chart.
# x and y are passed by keyword; stat_func=None is omitted because newer seaborn releases reject it
joint_chart = sns.jointplot(x=kluber_others.horizontal,
                            y=kluber_others.vertical,
                            color='r',
                            marker='o',
                            s=50,
                            kind='scatter',
                            space=0,
                            label='Strikes/Hits/Fouls',
                            alpha=1.0)
joint_chart.fig.set_size_inches(6,6)
joint_chart.x = kluber_balls.horizontal
joint_chart.y = kluber_balls.vertical
joint_chart.plot_joint(plt.scatter, marker='o',
c='b', s=50,label='Balls')
ax = joint_chart.ax_joint
ax.set_xlim(-20000,20000)
ax.set_ylim(-20000, 20000)
# Get rid of axis labels and tick marks
#ax.set_xlabel('')
#ax.set_ylabel('')
#ax.tick_params(labelbottom='off', labelleft='off')
# Add a title and legend
ax.set_title('Corey Kluber: World Series Game 7',
y=1.2, fontsize=18)
ax.legend(loc=3, frameon=True, shadow=True)
# Add Data Source and Author
#ax.text(-20000,-20000,'Data Source: Yahoo!'
# '\nAuthor: Gregory Brunner, @gregbrunn',
# fontsize=12)
plt.show()
This is a really cursory analysis of the pitch data. There is a lot more I could do, but I'll save that for another day.
What I hope to do in a future post is look at the play-by-play data and plot hit locations on Gavin's team basemaps! If you are interested in doing this yourself, here is how you can access the hit locations.
play_by_play = game_data['service']['boxscore']['gameplay_by_play']['mlb.g.'+game_id]
play_by_play_df = pd.DataFrame(play_by_play)
play_by_play_df.transpose().head()