Investigating the Titanic Dataset with Python

Udacity Data Analyst Nanodegree

First Glance at Our Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename)

First let’s take a quick look at what we’ve got:

titanic_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Handling Missing Values

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB

From this initial observation we notice that, from 891 passenger records: - 714 have valid ages; - only 204 have cabin records; - 2 embarkments are missing.

The rows with missing ages and embarkment values will be dropped whenever an analysis depends on them.

The cabin values are not going to be used in this analysis, so they will not be touched.

Other Considerations

I’m not going to analyze the number of Siblings/Spouses or Parents/Children isolatedly. Instead I am using the presence or not of family members aboard, represented by the ‘Family’ column.

titanic_df['Family'] = (titanic_df['SibSp'] > 0) | (titanic_df['Parch'] > 0)

We also are going to need a column stating if a passenger is a child or an adult. 15 is going to be the childhood age threshold for our study.

titanic_df['AgeRange'] = pd.cut(titanic_df['Age'], [0, 15, 80], labels=['child', 'adult'])

Now I’m getting rid of the data we are not going to use:

titanic_df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

Which leaves us with the following columns, plus ‘Sex’, ‘Embarked’ and ‘Family’:

titanic_df.describe()

	Survived	Pclass	Age	Fare
count	891.000000	891.000000	714.000000	891.000000
mean	0.383838	2.308642	29.699118	32.204208
std	0.486592	0.836071	14.526497	49.693429
min	0.000000	1.000000	0.420000	0.000000
25%	0.000000	2.000000	NaN	7.910400
50%	0.000000	3.000000	NaN	14.454200
75%	1.000000	3.000000	NaN	31.000000
max	1.000000	3.000000	80.000000	512.329200

We can see that aproximately 38% of the passengers survived and the highest fare is over 15 times the average.

Let’s raise some questions:

What is the survival rate by class, sex and age? What about combining these factors?
Was the fare the same for men and women?
What fraction of the passengers embarked on each port? Is there a difference in their survival rates?
Is the presence of a family member a good indicator for survival?

1. What is the survival rate by class, sex and age? What about combining these factors?

Let’s take a look at the distribution of passengers by age and fare, grouped by sex and class, and with survival information. It will give us some global insights about the data. But first, removing rows with missing ages:

titanic_df_clean_age = titanic_df.dropna(subset=['Age'])

def scatter_plot_class(pclass):
    g = sns.FacetGrid(titanic_df_clean_age[titanic_df_clean_age['Pclass'] == pclass], 
                      col='Sex',
                      col_order=['male', 'female'],
                      hue='Survived', 
                      hue_kws=dict(marker=['v', '^']), 
                      size=6)
    g = (g.map(plt.scatter, 'Age', 'Fare', edgecolor='w', alpha=0.7, s=80).add_legend())
    plt.subplots_adjust(top=0.9)
    g.fig.suptitle('CLASS {}'.format(pclass))

# plotted separately because the fare scale for the first class makes it difficult to visualize second and third class charts
scatter_plot_class(1)
scatter_plot_class(2)
scatter_plot_class(3)

png

It seems like women have a much higher survival rate, specially in first and second classes. It seems too that children have a higher survival rate, specially in first and second classes again. Let’s find out the survival rate by class, sex and age range, and plot the results for a better understanding:

survived_by_class = titanic_df_clean_age.groupby('Pclass')['Survived'].mean()
survived_by_class

Pclass
1    0.655914
2    0.479769
3    0.239437
Name: Survived, dtype: float64

survived_by_sex = titanic_df_clean_age.groupby('Sex')['Survived'].mean()
survived_by_sex

Sex
female    0.754789
male      0.205298
Name: Survived, dtype: float64

survived_by_age = titanic_df_clean_age.groupby('AgeRange')['Survived'].mean()
survived_by_age

AgeRange
child    0.590361
adult    0.381933
Name: Survived, dtype: float64

fig, (axis1,axis2,axis3) = plt.subplots(1, 3, figsize=(16,6))

ax = survived_by_class.plot.bar(ax=axis1, color='#5975A4', title='Survival Rate by Class', sharey=True)
ax.set_ylabel('Survival Rate')
ax.set_ylim(0.0,1.0)
ax = survived_by_sex.plot.bar(ax=axis2, color='#5F9E6E', title='Survival Rate by Sex', sharey=True)
ax.set_ylim(0.0,1.0)
ax = survived_by_age.plot.bar(ax=axis3, color='#B55D60', title='Survival Rate by Age Range', sharey=True)
ax.set_ylim(0.0,1.0)

(0.0, 1.0)

png

As expected (since we all watched the Titanic movie 😉), the first class has a higher survival rate than the second, which has a higher survival rate than the third, and women and children have a higher chance of survival than men and adults, respectively.

Now combining the three factors and visualizing the plots:

grouped_data = pd.concat(
    [titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange'])['Survived'].mean(),
     titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange'])['Survived'].count()],
    axis=1)
grouped_data.columns = ['Survived', 'Count']
grouped_data

			Survived	Count
Pclass	Sex	AgeRange
1	female	child	0.666667	3
	female	adult	0.975610	82
	male	child	1.000000	3
	male	adult	0.377551	98
2	female	child	1.000000	10
	female	adult	0.906250	64
	male	child	1.000000	9
	male	adult	0.066667	90
3	female	child	0.533333	30
	female	adult	0.430556	72
	male	child	0.321429	28
	male	adult	0.128889	225

g = sns.factorplot(
    x='AgeRange', 
    y='Survived', 
    col='Pclass',
    row='Sex',
    data=titanic_df_clean_age,
    margin_titles=True, 
    kind="bar", 
    ci=None)

png

Analysing the three factors combined gives us expected results too. It is interesting to see that even the women from the third class have a higher survival rate than the men from first. It indicates that saving women had a higher priority than saving the richer classes.

Saving children also seemed like a higher priority as on all permutations of factors except first class women, where one of three female children died, they had a higher survival rate.

So we can conclude that saving women and children was indeed a priority on the Titanic shipwreck.

2. Was the fare the same for men and women?

While looking at the scatter plots shown in the first question I noticed that women seemed to be more spreaded among the ‘Fare’ axis, so it motivated me to check if the average fare paid by women was really higher than men’s.

Let’s check the mean fare paid by each sex:

fare_by_sex = titanic_df.groupby('Sex')['Fare'].mean()
fare_by_sex

Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

ax = fare_by_sex.plot.bar(title='Fare Average and Sex')
ax.set_ylabel('Average Fare')

<matplotlib.text.Text at 0xa7d69b0>

png

It indeed seems that women paid way more than men on average. Women’s average fare is higher than I expected. Maybe it is due to the women of the first class. Let’s group the data by class and check it out:

fare_by_class_sex = titanic_df.groupby(['Pclass', 'Sex'])['Fare'].mean()
fare_by_class_sex

Pclass  Sex   
1       female    106.125798
        male       67.226127
2       female     21.970121
        male       19.741782
3       female     16.118810
        male       12.661633
Name: Fare, dtype: float64

ax = fare_by_class_sex.plot.bar(figsize=(16,4), title='Fare Average by Class and Sex')
ax.set_ylabel('Average Fare')

<matplotlib.text.Text at 0x8c7e590>

png

The average fare paid by women is higher than men’s on every class, although the fares on second class are almost equal. I wonder why women paid more… Maybe they demanded more privileges than men, but who knows…

3. What fraction of the passengers embarked on each port? Is there a difference in their survival rates?

Just for curiosity’s sake, let’s find out the proportion of passengers embarked on each port (C = Cherbourg; Q = Queenstown; S = Southampton), and their survival rates, but first, removing rows with missing embarkment values:

titanic_df_clean_embarked = titanic_df.dropna(subset=['Embarked'])

embarked = titanic_df_clean_embarked.groupby('Embarked').mean()
embarked['Count'] = titanic_df_clean_embarked['Embarked'].value_counts()
embarked

	Survived	Pclass	Age	Fare	Family	Count
Embarked
C	0.553571	1.886905	30.814769	59.954144	0.494048	168
Q	0.389610	2.909091	28.089286	13.276030	0.259740	77
S	0.336957	2.350932	29.445397	27.079812	0.389752	644

fig, (axis1,axis2) = plt.subplots(1, 2, figsize=(14,6))

sns.countplot(x='Embarked', data=titanic_df_clean_embarked, order=['S','C','Q'], ax=axis1)
sns.barplot(x=embarked.index, y='Survived', data=embarked, order=['S','C','Q'], ax=axis2)

<matplotlib.axes._subplots.AxesSubplot at 0xa98cb30>

png

The survival rate for passengers embarked on Cherbourg is higher than both other ports’. That is no wonder, since the mean ‘Pclass’ value for this port is 1.89 - way lower than Queenstown’s 2.91 and Southampton’s 2.35 - which means that people that embarked there belonged to richer classes, which we’ve already seen that have better survival rates than the poorer ones.

4. Is the presence of a family member a good indicator for survival?

Finally, let’s check if having a family member aboard means a higher survival chance:

survived_by_family = titanic_df_clean_age.groupby('Family')['Survived'].mean()
survived_by_family

Family
False    0.321782
True     0.516129
Name: Survived, dtype: float64

ax = survived_by_family.plot.bar(color='#5975A4', title='Survival Rate by Family Presence')
ax.set_ylabel('Survival Rate')

<matplotlib.text.Text at 0xaa26b90>

png

The data shows that having a family member aboard indicates a better chance for survival. But why is that? Let’s check some other numbers about family presence, like it’s relation with class, sex and age range:

family_by_class = titanic_df_clean_age.groupby('Pclass')['Family'].mean()
family_by_class

Pclass
1    0.537634
2    0.462428
3    0.366197
Name: Family, dtype: float64

family_by_sex = titanic_df_clean_age.groupby('Sex')['Family'].mean()
family_by_sex

Sex
female    0.616858
male      0.328918
Name: Family, dtype: float64

family_by_age = titanic_df_clean_age.groupby('AgeRange')['Family'].mean()
family_by_age

AgeRange
child    0.927711
adult    0.369255
Name: Family, dtype: float64

fig, (axis1,axis2,axis3) = plt.subplots(1, 3, figsize=(16,6))

ax = family_by_class.plot.bar(ax=axis1, color='#5975A4', title='Family Presence by Class', sharey=True)
ax.set_ylabel('Average Family Presence')
ax.set_ylim(0.0,1.0)
ax = family_by_sex.plot.bar(ax=axis2, color='#5F9E6E', title='Family Presence by Sex', sharey=True)
ax.set_ylim(0.0,1.0)
ax = family_by_age.plot.bar(ax=axis3, color='#B55D60', title='Family Presence by Age Range', sharey=True)
ax.set_ylim(0.0,1.0)

(0.0, 1.0)

png

We can see that family presence is higher on: - first class; - female sex; - children.

We have already discovered that these three factors show a higher survival rate, so maybe the higher survival rate for passengers with family members is more due to them than to the presence of family itself.

Conclusion

All the results presented on this report just show correlations between pieces of data. It is important to highlight that correlation does not imply causation. To make statistically valid statements, tests like chi-squared tests and t-tests should be applied.

To discover if class, sex and age have a relationship with survival, we make four chi-squared tests - one for each variable individually, and one for all combined - and find out if they really do matter, as this study suggests.

The same goes to find out if the embarkment site or the presence of a family member have relationships with survival.

To find out if the average fare was the same for men and women we must hypothesize that there was no difference, and then make a t-test to check if the difference is significative as this study suggests.

Thank you for reading!