Udacity Data Analyst Nanodegree

First Glance at Our Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename)

First let’s take a quick look at what we’ve got:

titanic_df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th…female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS


Handling Missing Values

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB

From this initial observation we notice that, from 891 passenger records: - 714 have valid ages; - only 204 have cabin records; - 2 embarkments are missing.

The rows with missing ages and embarkment values will be dropped whenever an analysis depends on them.

The cabin values are not going to be used in this analysis, so they will not be touched.

Other Considerations

I’m not going to analyze the number of Siblings/Spouses or Parents/Children isolatedly. Instead I am using the presence or not of family members aboard, represented by the ‘Family’ column.

titanic_df['Family'] = (titanic_df['SibSp'] > 0) | (titanic_df['Parch'] > 0)

We also are going to need a column stating if a passenger is a child or an adult. 15 is going to be the childhood age threshold for our study.

titanic_df['AgeRange'] = pd.cut(titanic_df['Age'], [0, 15, 80], labels=['child', 'adult'])

Now I’m getting rid of the data we are not going to use:

titanic_df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)

Which leaves us with the following columns, plus ‘Sex’, ‘Embarked’ and ‘Family’:

titanic_df.describe()
SurvivedPclassAgeFare
count891.000000891.000000714.000000891.000000
mean0.3838382.30864229.69911832.204208
std0.4865920.83607114.52649749.693429
min0.0000001.0000000.4200000.000000
25%0.0000002.000000NaN7.910400
50%0.0000003.000000NaN14.454200
75%1.0000003.000000NaN31.000000
max1.0000003.00000080.000000512.329200


We can see that aproximately 38% of the passengers survived and the highest fare is over 15 times the average.

Let’s raise some questions:

  1. What is the survival rate by class, sex and age? What about combining these factors?
  2. Was the fare the same for men and women?
  3. What fraction of the passengers embarked on each port? Is there a difference in their survival rates?
  4. Is the presence of a family member a good indicator for survival?

1. What is the survival rate by class, sex and age? What about combining these factors?

Let’s take a look at the distribution of passengers by age and fare, grouped by sex and class, and with survival information. It will give us some global insights about the data. But first, removing rows with missing ages:

titanic_df_clean_age = titanic_df.dropna(subset=['Age'])
def scatter_plot_class(pclass):
    g = sns.FacetGrid(titanic_df_clean_age[titanic_df_clean_age['Pclass'] == pclass], 
                      col='Sex',
                      col_order=['male', 'female'],
                      hue='Survived', 
                      hue_kws=dict(marker=['v', '^']), 
                      size=6)
    g = (g.map(plt.scatter, 'Age', 'Fare', edgecolor='w', alpha=0.7, s=80).add_legend())
    plt.subplots_adjust(top=0.9)
    g.fig.suptitle('CLASS {}'.format(pclass))

# plotted separately because the fare scale for the first class makes it difficult to visualize second and third class charts
scatter_plot_class(1)
scatter_plot_class(2)
scatter_plot_class(3)

png

png

png

It seems like women have a much higher survival rate, specially in first and second classes. It seems too that children have a higher survival rate, specially in first and second classes again. Let’s find out the survival rate by class, sex and age range, and plot the results for a better understanding:

survived_by_class = titanic_df_clean_age.groupby('Pclass')['Survived'].mean()
survived_by_class
Pclass
1    0.655914
2    0.479769
3    0.239437
Name: Survived, dtype: float64
survived_by_sex = titanic_df_clean_age.groupby('Sex')['Survived'].mean()
survived_by_sex
Sex
female    0.754789
male      0.205298
Name: Survived, dtype: float64
survived_by_age = titanic_df_clean_age.groupby('AgeRange')['Survived'].mean()
survived_by_age
AgeRange
child    0.590361
adult    0.381933
Name: Survived, dtype: float64
fig, (axis1,axis2,axis3) = plt.subplots(1, 3, figsize=(16,6))

ax = survived_by_class.plot.bar(ax=axis1, color='#5975A4', title='Survival Rate by Class', sharey=True)
ax.set_ylabel('Survival Rate')
ax.set_ylim(0.0,1.0)
ax = survived_by_sex.plot.bar(ax=axis2, color='#5F9E6E', title='Survival Rate by Sex', sharey=True)
ax.set_ylim(0.0,1.0)
ax = survived_by_age.plot.bar(ax=axis3, color='#B55D60', title='Survival Rate by Age Range', sharey=True)
ax.set_ylim(0.0,1.0)
(0.0, 1.0)

png

As expected (since we all watched the Titanic movie 😉), the first class has a higher survival rate than the second, which has a higher survival rate than the third, and women and children have a higher chance of survival than men and adults, respectively.

Now combining the three factors and visualizing the plots:

grouped_data = pd.concat(
    [titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange'])['Survived'].mean(),
     titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange'])['Survived'].count()],
    axis=1)
grouped_data.columns = ['Survived', 'Count']
grouped_data
SurvivedCount
PclassSexAgeRange
1femalechild0.6666673
adult0.97561082
malechild1.0000003
adult0.37755198
2femalechild1.00000010
adult0.90625064
malechild1.0000009
adult0.06666790
3femalechild0.53333330
adult0.43055672
malechild0.32142928
adult0.128889225


g = sns.factorplot(
    x='AgeRange', 
    y='Survived', 
    col='Pclass',
    row='Sex',
    data=titanic_df_clean_age,
    margin_titles=True, 
    kind="bar", 
    ci=None)

png

Analysing the three factors combined gives us expected results too. It is interesting to see that even the women from the third class have a higher survival rate than the men from first. It indicates that saving women had a higher priority than saving the richer classes.

Saving children also seemed like a higher priority as on all permutations of factors except first class women, where one of three female children died, they had a higher survival rate.

So we can conclude that saving women and children was indeed a priority on the Titanic shipwreck.

2. Was the fare the same for men and women?

While looking at the scatter plots shown in the first question I noticed that women seemed to be more spreaded among the ‘Fare’ axis, so it motivated me to check if the average fare paid by women was really higher than men’s.

Let’s check the mean fare paid by each sex:

fare_by_sex = titanic_df.groupby('Sex')['Fare'].mean()
fare_by_sex
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64
ax = fare_by_sex.plot.bar(title='Fare Average and Sex')
ax.set_ylabel('Average Fare')
<matplotlib.text.Text at 0xa7d69b0>

png

It indeed seems that women paid way more than men on average. Women’s average fare is higher than I expected. Maybe it is due to the women of the first class. Let’s group the data by class and check it out:

fare_by_class_sex = titanic_df.groupby(['Pclass', 'Sex'])['Fare'].mean()
fare_by_class_sex
Pclass  Sex   
1       female    106.125798
        male       67.226127
2       female     21.970121
        male       19.741782
3       female     16.118810
        male       12.661633
Name: Fare, dtype: float64
ax = fare_by_class_sex.plot.bar(figsize=(16,4), title='Fare Average by Class and Sex')
ax.set_ylabel('Average Fare')
<matplotlib.text.Text at 0x8c7e590>

png

The average fare paid by women is higher than men’s on every class, although the fares on second class are almost equal. I wonder why women paid more… Maybe they demanded more privileges than men, but who knows…

3. What fraction of the passengers embarked on each port? Is there a difference in their survival rates?

Just for curiosity’s sake, let’s find out the proportion of passengers embarked on each port (C = Cherbourg; Q = Queenstown; S = Southampton), and their survival rates, but first, removing rows with missing embarkment values:

titanic_df_clean_embarked = titanic_df.dropna(subset=['Embarked'])
embarked = titanic_df_clean_embarked.groupby('Embarked').mean()
embarked['Count'] = titanic_df_clean_embarked['Embarked'].value_counts()
embarked
SurvivedPclassAgeFareFamilyCount
Embarked
C0.5535711.88690530.81476959.9541440.494048168
Q0.3896102.90909128.08928613.2760300.25974077
S0.3369572.35093229.44539727.0798120.389752644


fig, (axis1,axis2) = plt.subplots(1, 2, figsize=(14,6))

sns.countplot(x='Embarked', data=titanic_df_clean_embarked, order=['S','C','Q'], ax=axis1)
sns.barplot(x=embarked.index, y='Survived', data=embarked, order=['S','C','Q'], ax=axis2)
<matplotlib.axes._subplots.AxesSubplot at 0xa98cb30>

png

The survival rate for passengers embarked on Cherbourg is higher than both other ports’. That is no wonder, since the mean ‘Pclass’ value for this port is 1.89 - way lower than Queenstown’s 2.91 and Southampton’s 2.35 - which means that people that embarked there belonged to richer classes, which we’ve already seen that have better survival rates than the poorer ones.

4. Is the presence of a family member a good indicator for survival?

Finally, let’s check if having a family member aboard means a higher survival chance:

survived_by_family = titanic_df_clean_age.groupby('Family')['Survived'].mean()
survived_by_family
Family
False    0.321782
True     0.516129
Name: Survived, dtype: float64
ax = survived_by_family.plot.bar(color='#5975A4', title='Survival Rate by Family Presence')
ax.set_ylabel('Survival Rate')
<matplotlib.text.Text at 0xaa26b90>

png

The data shows that having a family member aboard indicates a better chance for survival. But why is that? Let’s check some other numbers about family presence, like it’s relation with class, sex and age range:

family_by_class = titanic_df_clean_age.groupby('Pclass')['Family'].mean()
family_by_class
Pclass
1    0.537634
2    0.462428
3    0.366197
Name: Family, dtype: float64
family_by_sex = titanic_df_clean_age.groupby('Sex')['Family'].mean()
family_by_sex
Sex
female    0.616858
male      0.328918
Name: Family, dtype: float64
family_by_age = titanic_df_clean_age.groupby('AgeRange')['Family'].mean()
family_by_age
AgeRange
child    0.927711
adult    0.369255
Name: Family, dtype: float64
fig, (axis1,axis2,axis3) = plt.subplots(1, 3, figsize=(16,6))

ax = family_by_class.plot.bar(ax=axis1, color='#5975A4', title='Family Presence by Class', sharey=True)
ax.set_ylabel('Average Family Presence')
ax.set_ylim(0.0,1.0)
ax = family_by_sex.plot.bar(ax=axis2, color='#5F9E6E', title='Family Presence by Sex', sharey=True)
ax.set_ylim(0.0,1.0)
ax = family_by_age.plot.bar(ax=axis3, color='#B55D60', title='Family Presence by Age Range', sharey=True)
ax.set_ylim(0.0,1.0)
(0.0, 1.0)

png

We can see that family presence is higher on: - first class; - female sex; - children.

We have already discovered that these three factors show a higher survival rate, so maybe the higher survival rate for passengers with family members is more due to them than to the presence of family itself.

Conclusion

All the results presented on this report just show correlations between pieces of data. It is important to highlight that correlation does not imply causation. To make statistically valid statements, tests like chi-squared tests and t-tests should be applied.

To discover if class, sex and age have a relationship with survival, we make four chi-squared tests - one for each variable individually, and one for all combined - and find out if they really do matter, as this study suggests.

The same goes to find out if the embarkment site or the presence of a family member have relationships with survival.

To find out if the average fare was the same for men and women we must hypothesize that there was no difference, and then make a t-test to check if the difference is significative as this study suggests.

Thank you for reading!