Survivorship of Titanic Passengers – Part 1

Everyone knows the story of the RMS Titanic, which sank in 1912 after colliding with an iceberg. 2,224 souls were aboard. At least 1,490 of them went down with it. Who survived? Predicting the survivorship of passengers, based on their demographics and socioeconomic statuses, is a traditional rite of passage in the machine learning world. It is one of the most popular problems on Kaggle.

The problem: given a test set of data, predict who lived or perished. A training set of data consisting of 891 passengers and their features is provided. Here’s a quick overview of the data.

Column Name    Description
PassengerId    Each passenger has a unique ID number
Survived       0 if died, 1 if survived
Pclass         1 for first class, 2 for second class, 3 for third class
Name           The passenger’s name
Sex            The passenger’s sex
Age            The passenger’s age
SibSp          The number of siblings and spouses of the passenger aboard Titanic
Parch          The number of parents and children of the passenger aboard Titanic
Ticket         The passenger’s ticket number
Fare           The passenger’s ticket fare
Cabin          The passenger’s assigned cabin, if there is one
Embarked       Where the passenger embarked: C for Cherbourg, Q for Queenstown, and S for Southampton

Obviously, some variables play a huge role in predicting a passenger’s survivorship. “Women and children first!” History also tells us that first-class passengers had more success getting on lifeboats than steerage. But do those features account completely for whether one survived or not? The main challenge in this problem is figuring out which set of features, measured or derived, best predicts survivorship. Let’s poke around and see what’s up.

Exploring the Data

First things first, load in train.csv as a pandas DataFrame.

import pandas as pd

passengers = pd.read_csv("train.csv")

Going through each feature shows some expected and unexpected correlations. I started with pclass, the passenger’s ticket class, a proxy for socioeconomic status. Since it’s a categorical variable, I went for a stacked bar graph.
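Here is a minimal sketch of how the counts behind such a stacked bar graph could be put together. The toy rows are made up to keep the snippet self-contained (they are not real passenger data), and survival_counts is a hypothetical helper name, not something from the original analysis.

```python
import pandas as pd

# Made-up toy rows standing in for train.csv (NOT real passenger data).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

def survival_counts(df, column):
    """Count deaths (0) and survivals (1) for each level of `column`."""
    return pd.crosstab(df[column], df["Survived"])

counts = survival_counts(toy, "Pclass")
print(counts)

# With matplotlib installed, the stacked bar graph is one more call:
# counts.plot(kind="bar", stacked=True)
```

Swapping toy for passengers should give the real counts, and the same helper works for any other categorical column.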

Oh boy, third class really got screwed, didn’t they? The predictor is definitely going to rely heavily on ticket class. I made similar bar charts for other categorical variables such as sex and port of embarkation.

The distribution for sex makes sense, since they prioritized loading women onto the lifeboats. The port of embarkation is interesting…Southampton provided the majority of passengers and its distribution mirrors that of the whole ship. Cherbourg doesn’t follow that though. Probably mostly first class types.
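One way to check a hunch like that is to cross-tabulate port against ticket class. The sketch below uses made-up toy rows (not real passenger data) purely to show the mechanics; class_share is a name I picked for illustration.

```python
import pandas as pd

# Made-up toy rows (NOT real passenger data), just to show the mechanics.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [3, 3, 2, 1, 1, 1, 3, 3],
})

# normalize="index" makes each row sum to 1, so every cell is the
# share of a ticket class among the passengers from one port.
class_share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_share)
```

Run against the real passengers frame, a large first-class share in the C row would support the guess, and a small one would sink it.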

Moving on to other variables, sibsp and parch may look numerical but they’re actually categorical. This is because one can’t have 0.5 siblings or 1.8 parents! This isn’t Ghost Ship. So it makes sense to do bar charts for these variables. I also looked at what it looks like if sibsp and parch were added together as a proxy for family size.
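Adding the two columns together is a one-liner. Here is a rough sketch on made-up toy rows (not real passenger data); FamilySize is a hypothetical column name of my choosing.

```python
import pandas as pd

# Made-up toy rows (NOT real passenger data).
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 3, 0, 4],
    "Parch":    [0, 0, 2, 2, 0, 2],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# FamilySize is a hypothetical column: relatives aboard, not counting
# the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# Deaths (0) and survivals (1) per family size, ready for a bar chart.
family_counts = pd.crosstab(toy["FamilySize"], toy["Survived"])
print(family_counts)
```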

It looks like the bigger the families are, the more likely they were to lose someone. This drives home just how horrible the tragedy was.

It gets a bit more difficult to examine the last few categorical variables: name, ticket, and cabin. They have way too many levels to graph nicely as a bar chart. It’s necessary to simplify, or reduce, the number of levels. For name, this is easy, since each name has a title such as Mr. or Mrs. Let’s make a new column, Titles, that stores the title or honorific of a name. Extraction is done with a regex that matches a capitalized word followed by a period: [A-Z][a-z]+\.

import re

titles = []
for name in passengers['Name']:
    # The first capitalized word followed by a period is the title, e.g. "Mr."
    titles.append(re.findall(r'[A-Z][a-z]+\.', name)[0])
passengers['Titles'] = titles

The set of all titles is quite interesting and shows a mixture of nobility and commoners: Capt, Col, Countess, Don, Dr, Jonkheer (Dutch for Squire), Lady, Major, Master (for boys), Miss, Mlle (French for Miss), Mme (Madame), Mr, Mrs, Ms, Rev, and Sir. Those will do nicely as levels for a categorical variable. Here’s the bar graph of titles with some of the less common titles excluded.
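Excluding the rarer titles can be done by counting each level and keeping only those above a cutoff. This is a sketch under a few assumptions: the names are made up in the dataset’s "Surname, Title. Given names" format, min_count is an arbitrary threshold I chose for illustration, and the regex captures the title without its trailing period.

```python
import pandas as pd

# Made-up names in the dataset's "Surname, Title. Given names" format.
names = pd.Series([
    "Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Loe, Mr. Tim",
    "Moe, Dr. Sam", "Noe, Miss. Eve", "Coe, Mr. Hal",
])

# Vectorized extraction: capture the capitalized word right before
# the first period, leaving the period itself out.
titles = names.str.extract(r"([A-Z][a-z]+)\.", expand=False)

# min_count is a hypothetical cutoff; tune it to taste.
min_count = 2
title_counts = titles.value_counts()
common = title_counts[title_counts >= min_count].index

# Keep only the passengers whose title is common enough to chart.
common_titles = titles[titles.isin(common)]
print(common_titles.value_counts())
```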

This agrees with the graph of survivorship of passengers by sex. Titles can be used as a proxy for sex. Now what about ticket and cabin? Ticket numbers don’t seem to have much bearing on survivorship because they were issued by multiple ticket agents all over Europe. So, let’s ignore them for now.

cabin is actually quite important because the passenger’s location on the ship can be inferred from it. Cabin numbers begin with a letter indicating the deck. Let’s make a new column, Deck, that holds that letter. There are quite a few missing values in cabin, so it’s necessary to compensate for that.

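deck = []
for cabin in passengers['Cabin']:
    # Missing cabins are read in as NaN (a float), not as a string
    if isinstance(cabin, str):
        # The first character of the cabin number is the deck letter
        deck.append(cabin[0])
    else:
        deck.append('Unknown')
passengers['Deck'] = deck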

Let’s see what the bar chart of survivorship of passengers by deck (if known) looks like.

Interesting! It looks like deck C held the majority of passengers whose cabin is listed. It also lost the most passengers. Also, for nearly every deck, the survivorship is favorable. It looks like passengers were much more likely to survive if they had an assigned cabin.
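To put a number on that last observation, one could compare survival rates for passengers with and without a listed cabin. A rough sketch on made-up toy rows (not real passenger data); CabinStatus is a hypothetical helper column.

```python
import pandas as pd

# Made-up toy rows (NOT real passenger data); None marks a missing cabin.
toy = pd.DataFrame({
    "Cabin":    ["C85", None, "E46", None, None, "B28"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# CabinStatus is a hypothetical column: "known" if a cabin is listed.
toy["CabinStatus"] = toy["Cabin"].notna().map({True: "known", False: "unknown"})

# Survived is 0/1, so its mean per group is that group's survival rate.
rate = toy.groupby("CabinStatus")["Survived"].mean()
print(rate)
```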

Finally, there are two numerical variables: age and fare amount. They need to be graphed as histograms with bins.
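Binning can be sketched with pd.cut before any plotting happens. The toy rows below are made up (not real passenger data), and the bin edges are an arbitrary choice for illustration, not the ones used for the graphs in this post.

```python
import pandas as pd

# Made-up toy rows (NOT real passenger data); None marks a missing age.
toy = pd.DataFrame({
    "Age":      [4, 9, 22, 35, 41, 58, 63, None],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Arbitrary illustrative bin edges; each bin is (left, right].
edges = [0, 12, 30, 50, 80]
toy["AgeBin"] = pd.cut(toy["Age"], bins=edges)

# Deaths (0) and survivals (1) per age bin; rows with a missing age
# fall in no bin and are left out.
binned = pd.crosstab(toy["AgeBin"], toy["Survived"])
print(binned)

# With matplotlib installed, a plain histogram is also one call:
# toy["Age"].plot(kind="hist", bins=edges)
```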

The histogram of age agrees with what we already know: children were more likely to survive. The histogram of fare ties in with passenger class. The cheaper the fare, the lower the passenger’s class, and hence the more likely that passenger was to have died. It’s also important to point out that crew members had a fare of zero. Morbid and unfair, eh?

Drawing pictures of data is fun and informative for us, but the pictures themselves don’t help the machine divine the key differences between a passenger who survived and one who didn’t. However, these graphs can be useful guides in teaching the machine which features are important to learn. Feature selection and extraction will be covered in the next part of this blog post.