In January of this year, Seattle released a list of active/current pet licenses in the city.
For each of the 66,042 pets in the list we have the following data:
The data.seattle.gov website actually makes it fairly easy to explore, filter, and plot the data. Given that the only quantitative data we have are the License Number and ZIP code, however, the plots you can make online are a bit limited. Luckily, the dataset is easily downloaded as a csv file.
When I initially browsed the data online, I came across a Pug named Franklin Tucker. I thought this was a hilarious name for a Pug, and it gave me the idea to create word clouds of dog names separated by breed. These word clouds provide a fun way to look at the most common names for a given breed of dog. To make them, I used the wordcloud package.
#Import the usual stuff
import numpy as np
from PIL import Image
from os import path
import matplotlib.pyplot as plt
import random
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
The data are easily read into a dataframe given that we were able to download them as a csv file. I like using pandas for this sort of thing because it makes filtering the data on single or multiple columns extremely easy.
pet_data = pd.read_csv('../data/Seattle_Pet_Licenses.csv',index_col=None)
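As a rough illustration of what filtering on single or multiple columns looks like, here is a minimal sketch. The four rows are a made-up miniature of the license file (only the column names come from the real data), and reading the ZIP Code column as text is my own assumption so that it can be compared against strings like '98115'.

```python
import io
import pandas as pd

# Hypothetical miniature of the license file; the real csv has many more rows.
csv_text = """Animal's Name,Primary Breed,ZIP Code
Oscar,"Dachshund, Standard Smooth Haired",98115
Franklin Tucker,Pug,98103
Lucy,"Retriever, Golden",98115
Max,Pug,98115
"""

# dtype=str keeps ZIP codes as text so they can be matched as '98115'
toy = pd.read_csv(io.StringIO(csv_text), dtype={'ZIP Code': str})

# Single-column filter: every Pug
pugs = toy[toy['Primary Breed'] == 'Pug']

# Multi-column filter: Pugs that live in 98115
pugs_98115 = toy[(toy['Primary Breed'] == 'Pug') & (toy['ZIP Code'] == '98115')]

print(pugs["Animal's Name"].tolist())
print(pugs_98115["Animal's Name"].tolist())
```

Each condition is a boolean Series, so conditions can be combined with & (and) or | (or) as long as each one is wrapped in parentheses.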
One quick way to split the dataframe into groups is using the pandas groupby function.
breed_groups = pet_data.groupby(['Primary Breed']).groups
breed_groups is now a dictionary whose keys are the Primary Breeds available in the original data; each key maps to the index values of the matching rows in pet_data. Let's see how many different breeds are represented in the data.
print (len(breed_groups))
Let's also print each breed and the number of pets that match that breed. This highlights how we can use the results of the groupby function to slice the original data.
for breed in breed_groups.keys():
    print (breed,len(pet_data.iloc[breed_groups[breed]]))
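If all you want is the per-breed counts, the loop can probably be skipped. The sketch below uses a made-up four-row dataframe (not the real license data) to show what the groups dictionary holds and how size() produces the same counts in one call. It also uses .loc, since the stored values are index labels; .iloc gives the same rows only while the dataframe keeps its default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical stand-in for pet_data
toy = pd.DataFrame({
    "Animal's Name": ['Oscar', 'Lucy', 'Max', 'Bella'],
    'Primary Breed': ['Pug', 'Retriever, Golden', 'Pug', 'Pug'],
})

# .groups maps each breed to the index labels of its rows
groups = toy.groupby('Primary Breed').groups
pug_rows = toy.loc[groups['Pug']]

# size() gives the per-breed counts without looping over the keys
counts = toy.groupby('Primary Breed').size().sort_values(ascending=False)
print(counts)
```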
Now we can easily grab all of the data for a certain breed of pet. As an example let's grab the golden retrievers.
golden_df = pet_data.iloc[breed_groups['Retriever, Golden']]
golden_df.head()
You could also use groupby to group by Primary Breed and ZIP Code. This would give you a quick way to look up how prevalent a breed is in a given area.
breed_groups = pet_data.groupby(['Primary Breed', 'ZIP Code']).groups
As an example, let's look at all of the Standard Smooth Haired Dachshunds that live in the 98115 ZIP code.
greenlake_sausages = pet_data.iloc[breed_groups['Dachshund, Standard Smooth Haired','98115']]
greenlake_sausages
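To make the two-column lookup a little more concrete, here is a small sketch on invented data. The point is that the keys become (breed, ZIP) tuples, and that size() turns the same grouping into a prevalence table. I store the ZIP codes as strings here; if they were read in as integers, the key would need to be an integer as well.

```python
import pandas as pd

# Hypothetical stand-in for pet_data
toy = pd.DataFrame({
    'Primary Breed': ['Pug', 'Pug', 'Retriever, Golden', 'Pug'],
    'ZIP Code': ['98115', '98103', '98115', '98115'],
})

grouped = toy.groupby(['Primary Breed', 'ZIP Code'])

# Keys of .groups are (breed, zip) tuples
pugs_98115 = toy.loc[grouped.groups[('Pug', '98115')]]

# size() returns one count per (breed, zip) pair
prevalence = grouped.size()
print(prevalence[('Pug', '98115')])
```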
Below is a function that will pull out the data for a user-given Primary Breed and return a word cloud. In this function I illustrate another way to slice the data, using the pandas str.contains() function over the Primary Breed column.
This has the benefit of letting us slice the data with very specific or more general breed names (e.g. Terrier vs. Terrier, Jack Russell). Using the str.contains() function, it is easy to separate or combine the two related breeds.
all_terrier_df = pet_data[pet_data['Primary Breed'].str.contains('Terrier') == True]
jack_terrier_df = pet_data[pet_data['Primary Breed'].str.contains('Terrier, Jack Russell') == True]
all_terrier_df.tail()
jack_terrier_df.head()
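The == True comparison above is there because str.contains() returns a missing value, rather than False, for rows with no breed recorded. A possible alternative, sketched here on a made-up dataframe, is to pass na=False so that missing breeds count as non-matches. I also pass regex=False so that the breed name is matched literally; that is my own choice for the sketch, not something the original code does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pet_data, including one pet with no breed recorded
toy = pd.DataFrame({
    "Animal's Name": ['Rex', 'Milo', 'Zoe', 'Gus'],
    'Primary Breed': ['Terrier, Jack Russell', 'Terrier, Rat', 'Pug', np.nan],
})

breeds = toy['Primary Breed']

# na=False treats missing breeds as non-matches
# regex=False matches the text literally instead of as a regular expression
all_terriers = toy[breeds.str.contains('Terrier', na=False, regex=False)]
jack_russells = toy[breeds.str.contains('Terrier, Jack Russell', na=False, regex=False)]

print(all_terriers["Animal's Name"].tolist())
print(jack_russells["Animal's Name"].tolist())
```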
The breed_cloud function below returns a word cloud based on a user-given breed name.
At a minimum, the user must provide a pandas dataframe containing the pet license data and a breed name.
The user may optionally pass this function breed2 and mask arguments. The breed2 option is used to provide a Secondary Breed (e.g. breed = 'Terrier, Jack Russell', breed2 = 'Terrier, Rat').
The mask argument is a boolean. If mask is set to True, the user must also provide the file name of the mask image through the mask_image argument. This option is used if you want the shape of your word cloud to match the given mask_image.
def breed_cloud(all_pets_df,breed,**kwargs):
    try:
        # Select the rows for the requested breed(s); "== True" treats missing values as non-matches
        if 'breed2' in kwargs.keys():
            breed_df = all_pets_df[(all_pets_df['Primary Breed'].str.contains(breed) == True) &
                                   (all_pets_df['Secondary Breed'].str.contains(kwargs['breed2']) == True)]
        else:
            breed_df = all_pets_df[(all_pets_df['Primary Breed'].str.contains(breed) == True)]
        # Join all of the names into one space-separated string
        breed_names = list(breed_df["Animal's Name"].dropna())
        all_name_string = ' '.join(str(name) for name in breed_names)
        if kwargs.get('mask', False):
            # A mask was requested, so the mask image file name is required
            try:
                mask_image = kwargs['mask_image']
            except KeyError:
                print ('If mask is True must provide mask image')
                return None
            breed_mask = np.array(Image.open(mask_image))
            wc = WordCloud(background_color="white",max_words=100,
                           max_font_size=75,mask=breed_mask).generate(all_name_string)
        else:
            wc = WordCloud(background_color="white",max_words=100,
                           max_font_size=75).generate(all_name_string)
        return wc
    except ValueError:
        # generate() raises a ValueError when there are no words to draw
        print ('Maybe that breed is not in the data?')
wc_dachshund = breed_cloud(pet_data,'Dachshund')
plt.figure(figsize=(10,10))
plt.imshow(wc_dachshund, interpolation='bilinear')
plt.axis("off")
plt.show()
Unsurprisingly, Oscar is a popular name for Dachshunds. Given that Dachshunds have such a distinctive profile, they also provide a good example of the mask option.
wc_dachshund = breed_cloud(pet_data,'Dachshund',mask=True,mask_image='dach.png')
plt.figure(figsize=(20,20))
plt.imshow(wc_dachshund, interpolation='bilinear')
plt.axis("off")
plt.show()
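If you don't have a silhouette image handy, a mask can also be built directly as a numpy array. As far as I understand the wordcloud documentation, white (255) entries in the mask are left empty and every other entry is free to draw on. The sketch below builds a circular mask under that assumption; the canvas size and radius are arbitrary values I picked for illustration.

```python
import numpy as np

# Hypothetical 300x300 canvas with a circular drawing area
size, radius = 300, 130
y, x = np.ogrid[:size, :size]
outside = (x - size // 2) ** 2 + (y - size // 2) ** 2 > radius ** 2

# 255 (white) marks pixels to leave empty, 0 marks pixels words may occupy
circle_mask = (255 * outside).astype(np.uint8)

print(circle_mask.shape, circle_mask[150, 150], circle_mask[0, 0])
```

An array like circle_mask could then be handed to WordCloud through its mask argument, in the same way breed_cloud passes the array loaded from dach.png.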
wc_american = breed_cloud(pet_data,'Bulldog, American')
wc_french = breed_cloud(pet_data,'Bulldog, French')
wc_english = breed_cloud(pet_data,'Bulldog, English')
plt.figure(figsize=(25,25))
ax1 = plt.subplot(131)
ax2 = plt.subplot(132)
ax3 = plt.subplot(133)
ax1.imshow(wc_american, interpolation='bilinear')
ax1.set_title('American Bulldogs')
ax2.imshow(wc_french, interpolation='bilinear')
ax2.set_title('French Bulldogs')
ax3.imshow(wc_english, interpolation='bilinear')
ax3.set_title('English Bulldogs')
ax1.axis("off")
ax2.axis("off")
ax3.axis("off")
plt.show()
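Word clouds are fun to look at, but it can be hard to read exact rankings off of them. One way to check the most common names numerically is value_counts(), sketched here on invented names rather than the real license data. Keep in mind that the cloud is built from individual words, so a two-word name may be counted differently there than in this table of full names.

```python
import pandas as pd

# Hypothetical stand-in for pet_data; the names are invented for illustration
toy = pd.DataFrame({
    "Animal's Name": ['Oscar', 'Frank', 'Oscar', None, 'Lola', 'Oscar', 'Lola'],
    'Primary Breed': ['Dachshund, Standard Smooth Haired'] * 7,
})

# Same breed selection idea as breed_cloud, then count the full names
is_dachshund = toy['Primary Breed'].str.contains('Dachshund', na=False)
name_counts = toy.loc[is_dachshund, "Animal's Name"].dropna().value_counts()

print(name_counts.head(3))
```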