<h1>Flight Performances for each Carrier in 2016</h1>
<p>The code for this vis can be found <a href="https://github.com/schiller/flight-delays-visualization" target="_blank">here</a>.</p>
<p>And you can check it out <a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/" target="_blank">here</a>.</p>
<p><a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/" alt="Visualisation preview" target="_blank"><img src="/assets/images/make-effective-data-visualisation/preview.png"></a></p>
<h2>Summary</h2>
<p>The chart shows the monthly percentages of flight delays and cancellations/diversions for each carrier in the year of 2016. The carriers are sorted by overall delay performance, and the total number of flights for each one is also depicted.</p>
<h2>Design</h2>
<p>I chose to draw a main stacked bar chart with the following visual encodings:</p>
<ul>
<li>The ratio of delayed or cancelled/diverted flights to total flights is represented vertically on the y axis;</li>
<li>Months are displayed horizontally on the x axis;</li>
<li>Delays and cancellations/diversions are represented by different colors.</li>
</ul>
<p>There is also a secondary bar chart with the following visual encodings:</p>
<ul>
<li>Carrier codes are displayed vertically;</li>
<li>Total flights for each carrier are represented by the lengths of the horizontal bars.</li>
</ul>
<p>At first I made the stacked bars show the number of delayed flights, but the y axis scale changed too much between carriers, so I changed it to show ratios, making the scales comparable.
I chose not to show “on time” flights on the chart, so I could zoom in on the scale, allowing a better view of the delays and cancellations/diversions.
The horizontal bars on the secondary chart were chosen so that viewers could reason about the number of flights each carrier had, and also to make it possible to order carriers by overall performance. This way the order of the bars contributes to the chart’s storytelling.</p>
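<p>The published chart itself is built with d3/dimple.js, but the ratio computation can be sketched in a few lines of pandas. This is a minimal sketch rather than the code behind the vis: the file name and the column names (carrier, month, arr_del15, cancelled, diverted) are assumptions loosely following the BTS on-time performance dataset.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Hypothetical input: one row per flight, with assumed column names.
flights = pd.read_csv('flights_2016.csv')

# Treat a flight as cancelled/diverted if either flag is set.
flights['cancelled_diverted'] = ((flights['cancelled'] == 1) |
                                 (flights['diverted'] == 1)).astype(int)

g = flights.groupby(['carrier', 'month'])
summary = pd.DataFrame({
    'total': g.size(),
    'delayed': g['arr_del15'].sum(),
    'cancelled_diverted': g['cancelled_diverted'].sum(),
})

# Ratios plotted on the y axis of the stacked bars.
summary['delayed_ratio'] = summary['delayed'] / summary['total']
summary['cancelled_diverted_ratio'] = summary['cancelled_diverted'] / summary['total']

# Total flights per carrier, used to order and size the secondary bar chart.
totals = summary['total'].groupby(level='carrier').sum().sort_values()
</code></pre></div>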
<p>After collecting feedback I changed the following:</p>
<ul>
<li>Ensured the months appear in the correct order on Firefox as well;</li>
<li>Changed the chart title from “2016 Flight Delays by Cause” to “Flight Performances for each Carrier in 2016”, and then to “Which air carrier had the worst performance in 2016?”;</li>
<li>Added a “References” section to communicate the source of the dataset;</li>
<li>Changed the y axis label from “Ratio” to “Flights Ratio”, and made it show percentage values;</li>
<li>Made the x axis of the secondary bar chart visible, to make it clearer that it was also a chart;</li>
<li>Fixed the height of the stacked bar chart, so that the “Month” label would be visible;</li>
<li>Switched month labels to abbreviations instead of numbers;</li>
<li>Stopped showing delays divided into causes, and instead displayed only total delays;</li>
<li>Aggregated cancellations and diversions;</li>
<li>Changed the animation instructions message into a play/pause clickable button;</li>
<li>Made the animation stop at the end of the first cycle;</li>
<li>Changed from carrier names to codes on the secondary horizontal bar chart y axis;</li>
<li>Updated the secondary horizontal bar chart colors to grayscale.</li>
</ul>
<h3>Chart Versions</h3>
<ul>
<li><a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/">Current</a></li>
<li><a href="https://bl.ocks.org/schiller/raw/ed5ea5c6199d2700d2c0458d5a8079e5/">Second</a></li>
<li><a href="https://bl.ocks.org/schiller/raw/7eb7e5f8236f5820f4b63e268a541884/">First</a></li>
</ul>
<h2>Feedback</h2>
<h3>Laurent de Vito</h3>
<p>“Hi,
Interestingly, in Firefox, the months are labeled 12,7,8,… whereas they are correctly labeled in Chromium, but usually, we cannot do much about it.
I find the title a bit misleading since you report not only the flights that were delayed but also those that were canceled.
Furthermore, could you please cite your sources ?
Overall, nicely done!”</p>
<h3>Morgana Secco (my wife)</h3>
<p>“The y axis show a ratio between what?
You should make it clearer that the horizontal bars on the right display the total flights for each carrier.
There is no month label on the x axis.”</p>
<h3>tianchuanting</h3>
<p>“Hi Luiz,</p>
<p>After spending a minute or two looking at your visualisation, my impression is that it is a very well made visualization. I especially like the small details you put into it, like the tooltip and animated guideline. And here is a list of feedback for you consideration.</p>
<ol>
<li>I had some difficulty understanding what the vertical axis ‘flight delay’ ratio means. Maybe using something like % of delayed flight might be intuitive.</li>
<li>Similarly, It took me a while to get what the 1-12 on the horizontal axis is presenting, maybe using month abbrev (Jan, Feb etc) instead will be a better idea.
LT”</li>
</ol>
<h3>John Enyeart</h3>
<ul>
<li><p>“It would probably be easier to read if you used month names instead of the numbers 1-12 on your x-axis.</p></li>
<li><p>The biggest cause of delay is “NAS”, and I have no idea what that is, so an explanation would be nice.</p></li>
<li><p>You might also consider putting in the option to switch the y-axis between ratio and number of flights.</p></li>
<li><p>Not sure how I feel about the stacked bar chart in terms of readability. Take a look at the following articles:</p></li>
</ul>
<p><a href="http://www.storytellingwithdata.com/blog/2012/11/to-stack-or-not-to-stack">storytellingwithdata.com - to stack or not to stack</a></p>
<p><a href="https://solomonmessing.wordpress.com/2014/10/11/when-to-use-stacked-barcharts/">https://solomonmessing.wordpress.com/2014/10/11/when-to-use-stacked-barcharts/</a>“</p>
<h3>martin-martin</h3>
<p>"Hello @luizschiller!</p>
<p>That’s a great visualization you are working on here! I agree that it seems you’re putting effort in the details, and it shows : )</p>
<p>Here’s my feedback:</p>
<ol>
<li>The encoding of the amount of flights that the airlines each have is very innovative and I haven’t seen this around yet. Great idea :+1: - looks really interesting!</li>
<li>I was initially confused about what is going on in the graph since it was changing so quickly. I generally prefer if I have the choice to first orient in a visualisation before starting the reel. If you want to have it running right when the user accesses the page, maybe you could make the instructional message on how to start/stop it more obvious (e.g. it could be presented as a clickable button!)</li>
<li>The tickmarks under the months are different than the ones in the rest of your visualisation. Generally you display ticks where the value descriptions are - but here they are in between the data points. I’d suggest to keep this consistent and simply move the ticks into the middle of the columns</li>
<li>What is the NAS value about? Most of the options in the legend on top are somewhat self-explanatory, however not all of them are. And without the context of the fact that they are reasons for delays, the correct interpretation becomes even more difficult. A good legend should also have a title explaining what it’s explaining. - Potentially the graphs title could also fulfill this function, but currently it says “Flight Performances” (which is overall better fitting, yet doesn’t explain that you’re displaying “Reasons for flight delays”, encoded with the different colors).</li>
</ol>
<p>Hope this helps, and great job!
Keep it up and you already have a great piece of data viz! : )”</p>
<h2>Resources</h2>
<ul>
<li><a href="http://dimplejs.org/examples_viewer.html?id=bars_vertical_stacked">http://dimplejs.org/examples_viewer.html?id=bars_vertical_stacked</a></li>
<li><a href="http://dimplejs.org/advanced_examples_viewer.html?id=advanced_storyboard_control">http://dimplejs.org/advanced_examples_viewer.html?id=advanced_storyboard_control</a></li>
<li><a href="https://codepen.io/mistkaes/pen/WvPrJL">https://codepen.io/mistkaes/pen/WvPrJL</a></li>
</ul>
<h1>Design an A/B Test</h1>
<h2>Experiment Design</h2>
<p>This project was made as part of Udacity’s Data Analyst Nanodegree.</p>
<p>The project instructions can be found here: <a href="https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True">https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True</a></p>
<h3>Metric Choice</h3>
<h4>Invariant Metrics:</h4>
<ul>
<li>Number of Cookies;</li>
<li>Number of Clicks;</li>
<li>Click-through Probability.</li>
</ul>
<p>Visiting the course overview page and clicking on the “start free trial” button both happen before the free trial screener is triggered, so these metrics should not differ between the control and experiment groups.</p>
<h4>Evaluation Metrics:</h4>
<ul>
<li>Gross Conversion: enrollments / clicks should be a good evaluation metric. It measures whether the proposed change really discourages users who report fewer than 5 hours of study per week from enrolling. This metric is expected to decrease significantly in order to launch the experiment.</li>
<li>Net Conversion: payments / clicks should also be a good evaluation metric. It measures whether the free trial screener changes the proportion of students who remain enrolled past the 14-day boundary after starting a free trial. This metric is expected not to decrease significantly in order to launch the experiment, since the students who complete payments usually dedicate 5 or more hours per week to studying.</li>
</ul>
<h4>Unused Metrics:</h4>
<ul>
<li>Number of user-ids: the number of enrollments could potentially be used as an evaluation metric, but it would be redundant given gross conversion. Also, comparing raw counts of user-ids assumes the control and experiment groups are equally sized, which is not always true.</li>
<li>Retention: payments / enrollments would be an ideal metric for this experiment, except for the experiment size needed for a sufficiently powerful test: at least 17 weeks would be required to complete the experiment, which is too long. This metric would be expected to show a significant increase in order to launch the experiment.</li>
</ul>
<h3>Measuring Standard Deviation</h3>
<ul>
<li>Gross Conversion: 0.0202</li>
<li>Net Conversion: 0.0156</li>
</ul>
<p>In both cases, the empirical and analytical variabilities are expected to be comparable, because the unit of diversion (cookies) and the unit of analysis (cookies) are the same.</p>
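<p>As a quick check, these figures follow from the usual binomial standard error, sqrt(p * (1 - p) / n). The sketch below assumes the baseline values from the project’s baseline table (40,000 daily pageviews, 3,200 daily clicks, 660 daily enrollments and a 0.53 probability of payment given enrollment), scaled to the sample of 5,000 cookies used for the variability estimates; if your copy of the instructions differs, treat these numbers as assumptions.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

# Baseline values (assumed from the project's baseline table).
pageviews = 40000.0
clicks = 3200.0
enrollments = 660.0
p_payment_given_enroll = 0.53

gross_conversion = enrollments / clicks                         # 0.20625
net_conversion = enrollments * p_payment_given_enroll / clicks  # ~0.1093

# Scale to a sample of 5,000 pageviews; both metrics use clicks as denominator.
sample_clicks = clicks / pageviews * 5000                       # 400 clicks

se_gross = sqrt(gross_conversion * (1 - gross_conversion) / sample_clicks)
se_net = sqrt(net_conversion * (1 - net_conversion) / sample_clicks)

print(round(se_gross, 4))  # 0.0202
print(round(se_net, 4))    # 0.0156
</code></pre></div>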
<h3>Sizing</h3>
<h4>Number of Samples vs. Power</h4>
<p>I’m not using the Bonferroni correction, since the metrics are correlated and I need a specific combination of results across all metrics in order to recommend the change, so the correction would be too conservative.
Number of Pageviews: 679,300.</p>
<h4>Duration vs. Exposure</h4>
<p>I would divert 100% of the traffic, which would lead to a duration of 17 days.
The experiment introduces a popup on the site, which is one more step on the way to enrolling in a free trial. This change does not present physical, psychological, emotional, social or economic risks above minimal risk.
If a student enrolls in the free trial, their data becomes personally identifiable, so there has to be an agreement on privacy policies covering that data, even though the collected information is not sensitive and does not involve political attitudes, financial or health data, for example.
Based on the factors cited above, I chose to divert all traffic.</p>
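<p>As a rough cross-check of the pageview and duration figures above, the sketch below uses the standard two-proportion sample size approximation with alpha = 0.05 and beta = 0.2, and the same assumed baseline values as in the standard deviation sketch. The 679,300 pageviews quoted above came from an online calculator, so this approximation lands close to it (around 685,000) rather than exactly on it.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

def required_samples(p, d_min, z_alpha=1.96, z_beta=0.8416):
    # Per-group sample size for a two-proportion z-test (normal approximation).
    p2 = p + d_min
    sd1 = sqrt(2 * p * (1 - p))
    sd2 = sqrt(p * (1 - p) + p2 * (1 - p2))
    return (z_alpha * sd1 + z_beta * sd2) ** 2 / d_min ** 2

# Net conversion drives the sizing (baseline ~0.1093, d_min = 0.0075).
clicks_per_group = required_samples(0.1093125, 0.0075)  # ~27,400 clicks
ctp = 3200.0 / 40000.0                                   # click-through probability
pageviews_needed = 2 * clicks_per_group / ctp            # ~685,000 pageviews

days = pageviews_needed / 40000.0  # ~17 days with 100% of traffic diverted
print(round(pageviews_needed))
print(round(days, 1))
</code></pre></div>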
<h2>Experiment Analysis</h2>
<h3>Sanity Checks</h3>
<p>Number of Cookies: CI = (0.4988, 0.5012), Observed = 0.5006, pass</p>
<p>Number of Clicks on “Start free trial”: CI = (0.4959, 0.5041), Observed = 0.5005, pass</p>
<p>Click-through-probability on “Start free trial”: CI = (-0.0013, 0.0013), Observed = 0.0001, pass</p>
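<p>For the count-based invariants, the check is a binomial confidence interval around the 0.5 split expected under equal diversion; the click-through-probability check compares the difference in proportions against zero instead. A minimal sketch of the count-based version (the actual cookie and click totals come from the experiment spreadsheet and are not repeated here):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

def count_sanity_check(n_control, n_experiment, z=1.96):
    # Under equal diversion, each cookie lands in the control group with p = 0.5.
    total = float(n_control + n_experiment)
    se = sqrt(0.5 * 0.5 / total)
    ci = (0.5 - z * se, 0.5 + z * se)
    observed = n_control / total
    passed = ci[0] &lt;= observed &lt;= ci[1]
    return ci, observed, passed

# Hypothetical usage, with the spreadsheet totals plugged in:
# ci, observed, passed = count_sanity_check(cookies_control, cookies_experiment)
</code></pre></div>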
<h3>Result Analysis</h3>
<h4>Effect Size Tests</h4>
<p>Gross Conversion: CI = (-0.0291, -0.0120), dmin = 0.0100, statistically and practically significant.</p>
<p>Net Conversion: CI = (-0.0116, 0.0019), dmin = 0.0075, not statistically nor practically significant.</p>
<h4>Sign Tests</h4>
<p>Gross Conversion: p-value = 0.0026, statistically significant.</p>
<p>Net Conversion: p-value = 0.6776, not statistically significant.</p>
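<p>The sign test p-values can be reproduced with a two-tailed binomial test on the day-by-day comparison. The day counts below (4 of 23 days with higher gross conversion in the experiment group, 10 of 23 for net conversion) are assumptions consistent with the reported p-values, not figures quoted from the spreadsheet:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from scipy.stats import binom_test

# Two-tailed binomial test: days where the experiment group was higher,
# out of the days with complete enrollment/payment data (assumed counts).
print(binom_test(4, 23, 0.5))   # ~0.0026 (gross conversion)
print(binom_test(10, 23, 0.5))  # ~0.6776 (net conversion)
</code></pre></div>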
<h4>Summary</h4>
<p>The Bonferroni correction is designed to reduce the risk of false positives when the launch of an experiment is conditioned on any one of a set of tests matching expectations. In my case, I need all of the tests to match my expectations in order to launch the experiment, so I’m not using the Bonferroni correction.
There were no discrepancies between the effect size tests and the sign tests.</p>
<h3>Recommendation</h3>
<p>The encountered results for the evaluation metrics were:</p>
<ul>
<li>Gross Conversion: in order to launch the experiment, this metric should show a statistically and practically significant decrease, which is exactly what the tests found.</li>
<li>Net Conversion: the confidence interval found for the effect size on net conversion includes the negative practical significance threshold. This means there is a chance, at an alpha of 0.05, that this metric suffered a practically significant decrease. It would be possible to repeat the experiment with more power, but it is unlikely that this trend would change.
Since I need both metrics to match my expectations and I cannot conclude that net conversion has not decreased, my recommendation is not to launch the experiment.</li>
</ul>
<h2>Follow-Up Experiment</h2>
<p>One follow-up experiment that could reduce early cancellations would be the following: when a student clicked on the “start free trial” button, a message would appear informing them that the course usually requires 5 or more hours of dedication per week, and they would be asked to block out the hours they will commit to the course on their agenda or calendar in order to proceed. There would be a checkbox saying “I have reserved the hours I will commit to the course”, and a “next” button that would stay disabled until the checkbox was checked. The next button would then lead to the usual checkout process.
This may seem similar to the attempted experiment, but it has an important difference: it does not suggest that students try the free course materials instead of engaging in the free trial. Maybe this fact could make a significant difference in the observed effects.
The hypothesis is that this new change might cause some students, who would otherwise not do so, to organize themselves and reserve some hours per week to study. This would reduce the number of students who abandon the free trial without significantly reducing the number of students who eventually complete the course.
The metrics would be gross and net conversion. They measure, respectively, the number of enrollments and the number of payments per click on the “start free trial” button. Combined, they can show whether the hypothesis holds. Also, as calculated for the attempted experiment, they are feasible in terms of experiment size.
The unit of diversion would be a cookie, and the invariant metrics could be the number of course pageviews and the number of clicks on “start free trial”.</p>
<h1>Investigating the Enron Fraud with Machine Learning</h1>
<h4>Udacity Data Analyst Nanodegree</h4>
<h2>Overview</h2>
<p>In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.</p>
<blockquote>
<p>Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?</p>
</blockquote>
<p>The goal of this project is to build a person of interest (POI, which means an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity) identifier based on financial and email data made public as a result of the Enron scandal. Machine learning is an excellent tool for this kind of classification task as it can use patterns discovered from labeled data to infer the classes of new observations.</p>
<p>Our dataset combines the public record of Enron emails and financial data with a hand-generated list of POI’s in the fraud case.</p>
<h2>Data Exploration</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">cPickle</span> <span class="kn">as</span> <span class="nn">pickle</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span><span class="p">,</span> <span class="n">StratifiedShuffleSplit</span><span class="p">,</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">SelectKBest</span>
<span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">GaussianNB</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">precision_score</span><span class="p">,</span> <span class="n">recall_score</span><span class="p">,</span> <span class="n">f1_score</span>
<span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"../tools/"</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">feature_format</span> <span class="kn">import</span> <span class="n">featureFormat</span><span class="p">,</span> <span class="n">targetFeatureSplit</span>
<span class="kn">from</span> <span class="nn">tester</span> <span class="kn">import</span> <span class="n">dump_classifier_and_data</span><span class="p">,</span> <span class="n">test_classifier</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s">'display.max_columns'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="c">### Load the dictionary containing the dataset</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"final_project_dataset.pkl"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">data_file</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">data_file</span><span class="p">)</span>
<span class="c"># dict to dataframe</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">data_dict</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'NaN'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">C:\Users\schil\Anaconda2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary 95 non-null float64
to_messages 86 non-null float64
deferral_payments 39 non-null float64
total_payments 125 non-null float64
exercised_stock_options 102 non-null float64
bonus 82 non-null float64
restricted_stock 110 non-null float64
shared_receipt_with_poi 86 non-null float64
restricted_stock_deferred 18 non-null float64
total_stock_value 126 non-null float64
expenses 95 non-null float64
loan_advances 4 non-null float64
from_messages 86 non-null float64
other 93 non-null float64
from_this_person_to_poi 86 non-null float64
poi 146 non-null bool
director_fees 17 non-null float64
deferred_income 49 non-null float64
long_term_incentive 66 non-null float64
email_address 111 non-null object
from_poi_to_this_person 86 non-null float64
dtypes: bool(1), float64(19), object(1)
memory usage: 24.1+ KB
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]])</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">18
</code></pre></div>
<p>There are 146 observations and 21 variables in our dataset - 6 email features, 14 financial features and 1 POI label - and they are divided between 18 POI’s and 128 non-POI’s.</p>
<p>There are a lot of missing values, so before the data is fed into the machine learning models they will be filled with zeros.</p>
<h2>Outlier Investigation</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'salary'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'bonus'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0x2d0fb38>
</code></pre></div>
<p><img src="/assets/images/enron/output_4_1.png" alt="png"></p>
<p>There is a salary bigger than 2.5 * 10<sup>7</sup> 🤔. That seems like too much even for Enron. Let’s find out whose it is.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'salary'</span><span class="p">]</span><span class="o">.</span><span class="n">idxmax</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">'TOTAL'
</code></pre></div>
<p>This huge salary is the TOTAL of the salaries of the listed employees, so I’m going to remove it.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'TOTAL'</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'salary'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'bonus'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xc7f6ef0>
</code></pre></div>
<p><img src="/assets/images/enron/output_8_1.png" alt="png"></p>
<h2>Create New Features</h2>
<blockquote>
<p>What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset – explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.</p>
</blockquote>
<p>In our dataset we’ve got the number of emails sent to POI’s and received from POI’s for most of the employees. However, if an employee sends or receives a lot of emails in general, it is likely that the quantity sent to or received from POI’s would be large as well. This is why we are creating these two new features:</p>
<ul>
<li>fraction of ‘to_messages’ received from a POI;</li>
<li>fraction of ‘from_messages’ sent to a POI.</li>
</ul>
<p>They can indicate if the majority of an employee’s emails were exchanged with POI’s. In fact, POI’s are grouped together in a scatter plot of the two new features. </p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'fraction_from_poi'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_poi_to_this_person'</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">'to_messages'</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">'fraction_to_poi'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_this_person_to_poi'</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_messages'</span><span class="p">]</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'fraction_to_poi'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'non-poi'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'fraction_to_poi'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'poi'</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xc9a8898>
</code></pre></div>
<p><img src="/assets/images/enron/output_10_1.png" alt="png"></p>
<p>Comparing the results for the final chosen model with and without our new engineered features, we get the following results:</p>
<table><thead>
<tr>
<th>New Features</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead><tbody>
<tr>
<td>yes</td>
<td>0.879</td>
<td>0.543</td>
<td>0.325</td>
<td>0.380</td>
</tr>
<tr>
<td>no</td>
<td>0.879</td>
<td>0.543</td>
<td>0.325</td>
<td>0.380</td>
</tr>
</tbody></table>
<p>Surprisingly the results were the same with and without the two engineered features.</p>
<h2>Properly Scale Features</h2>
<p>Since we are going to perform a Principal Component Analysis (PCA) to reduce dimensionality later on, and many machine learning models ask for scaled features, a standardization of the features is going to be tested as the first step of our classification pipeline. If it improves the evaluation score of the model then the chosen final model will have this scaling step.</p>
<p>To accomplish this I use the StandardScaler class from scikit-learn, which standardizes features by removing the mean and scaling to unit variance.</p>
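<p>A toy illustration of the effect (not part of the project pipeline): after scaling, each column has mean ~0 and standard deviation ~1, so features on very different scales become comparable.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: a money-like feature and an email-count-like feature.
X_toy = np.array([[200000.0, 10.0],
                  [1000000.0, 50.0],
                  [300000.0, 300.0]])

X_scaled = StandardScaler().fit_transform(X_toy)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
</code></pre></div>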
<h2>Intelligently Select Features</h2>
<p>The next step in the pipeline is selecting the features that convey the most information to our model.</p>
<p>Leaving some features behind has advantages, like reducing noise in the classification and saving processing time, since there are fewer features to compute.</p>
<p>The chosen method was scikit-learn’s SelectKBest using f_classif as the scoring function. The f_classif function computes the ANOVA F-value between labels and features for classification tasks.</p>
<p>A few feature counts were tested with the aid of a grid search (discussed in a later section), and for the final chosen model the 15 most important features were kept:</p>
<table><thead>
<tr>
<th>feature</th>
<th>score</th>
</tr>
</thead><tbody>
<tr>
<td>exercised_stock_options</td>
<td>22.84690056</td>
</tr>
<tr>
<td>total_stock_value</td>
<td>22.33456614</td>
</tr>
<tr>
<td>salary</td>
<td>16.96091624</td>
</tr>
<tr>
<td>bonus</td>
<td>15.49141455</td>
</tr>
<tr>
<td>fraction_to_poi</td>
<td>13.80595013</td>
</tr>
<tr>
<td>restricted_stock</td>
<td>8.61001147</td>
</tr>
<tr>
<td>total_payments</td>
<td>8.50623857</td>
</tr>
<tr>
<td>loan_advances</td>
<td>7.3499902</td>
</tr>
<tr>
<td>shared_receipt_with_poi</td>
<td>7.06339857</td>
</tr>
<tr>
<td>deferred_income</td>
<td>6.19466529</td>
</tr>
<tr>
<td>long_term_incentive</td>
<td>5.66331492</td>
</tr>
<tr>
<td>expenses</td>
<td>5.28384553</td>
</tr>
<tr>
<td>from_poi_to_this_person</td>
<td>5.05036916</td>
</tr>
<tr>
<td>other</td>
<td>4.42180729</td>
</tr>
<tr>
<td>fraction_from_poi</td>
<td>3.57449894</td>
</tr>
</tbody></table>
<p>The output of the feature selection was used as input to PCA. The features were projected to a lower dimensional space, reducing dimensionality from 15 features to 6 principal components in our final chosen model.</p>
<h2>Pick an Algorithm</h2>
<blockquote>
<p>What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? </p>
</blockquote>
<p>I ended up using a Gaussian Naïve-Bayes, which scored 0.366984126984 on the nested cross-validation f1. The algorithms tested were:</p>
<ul>
<li>Gaussian Naïve-Bayes;</li>
<li>Support Vector Machines;</li>
<li>Decision Tree Classifier.</li>
</ul>
<p>The scores obtained for them are as follows:</p>
<table><thead>
<tr>
<th>Algorithm</th>
<th>Nested CV f1</th>
</tr>
</thead><tbody>
<tr>
<td>Gaussian Naïve-Bayes</td>
<td>0.366984126984</td>
</tr>
<tr>
<td>Support Vector Machines</td>
<td>0.287132034632</td>
</tr>
<tr>
<td>Decision Tree Classifier</td>
<td>0.228430049483</td>
</tr>
</tbody></table>
<p>Although the other tested models scored better on some other evaluation metrics, the nested cross-validation score is what best depicts how a model generalizes to unseen data, so the Gaussian Naïve-Bayes was the chosen model.</p>
<h2>Tune the Algorithm</h2>
<blockquote>
<p>What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune – if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).</p>
</blockquote>
<p>A crucial part of selecting a machine learning algorithm is to adjust its parameters in order to maximize the evaluation metrics. If the parameters are not properly tuned, the algorithm can underfit or overfit the data, producing suboptimal results.</p>
<p>To tune the algorithms, I used the GridSearchCV tool provided by scikit-learn. It exhaustively searches for the best parameters among the ones specified in an array of possibilities. The parameters are chosen to optimize the chosen scoring function, in our case f1 (the evaluation metrics are addressed further in the ‘Usage of Evaluation Metrics’ section).</p>
<h2>Validation Strategy</h2>
<blockquote>
<p>What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?</p>
</blockquote>
<p>Validation in machine learning consists of evaluating a model using data that was not touched during the training process. A classic mistake is to ignore this rule, hence obtaining overly optimistic results due to overfitting the training data, but very poor performance on unseen data.</p>
<p>It is a good practice to separate data in three parts: training, cross-validation and test sets. The model is tuned to maximize the evaluation score on the cross-validation set, and then the final model efficiency is measured on the test set.</p>
<p>Since there are too few observations for us to train and test the algorithms, in order to extract the most information from the data, the selected strategy to validate our model was a Nested Stratified Shuffle Split Cross-Validation.</p>
<p>This strategy effectively uses a series of train/validation/test set splits. In the inner loop, the score is approximately maximized by fitting a model to each training set, and then the (hyper)parameters are selected to maximize the score over the validation set. In the outer loop, generalization error is estimated by averaging test set scores over several dataset splits. All sets are picked randomly, but keeping the same proportion of class labels.</p>
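<p>Schematically, the nesting looks like the sketch below (the same structure as the evaluate_model function in the ‘Additional Code’ section, which is where pipe and param_grid are actually defined): GridSearchCV handles the inner tuning loop, and cross_val_score wraps it to estimate generalization error on the outer splits.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score

# Inner loop: hyperparameter search maximizing f1 over validation splits.
inner_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=inner_cv)

# Outer loop: each outer test fold only sees the already-tuned model.
outer_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
nested_f1 = cross_val_score(grid, X, y, cv=outer_cv, scoring='f1').mean()
</code></pre></div>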
<h2>Usage of Evaluation Metrics</h2>
<blockquote>
<p>Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.</p>
</blockquote>
<p>For classification algorithms, some of the most common evaluation metrics are accuracy, precision, recall and the f1 score.</p>
<ul>
<li><p>Accuracy shows the ratio between right classifications and the total number of predicted labels. Since the POI/non-POI distribution is very uneven, accuracy does not mean much. A model that predicts always non-POI’s would get an accuracy of 87.6%, which is an apparently good score for a terrible classifier.</p></li>
<li><p>Precision is the ratio of right classifications over all observations with a given predicted label. For example, the ratio of true POI’s over all predicted POI’s.</p></li>
<li><p>Recall is the ratio of right classifications over all observations that are truly of a given class. For example, the ratio of observations correctly labeled POI over all true POI’s.</p></li>
<li><p>F1 is a way of balancing precision and recall, and is given by the following formula:</p></li>
</ul>
<p>$$F1 = 2 * (precision * recall) / (precision + recall)$$</p>
<p>For the final selected model, the average scores were the following:</p>
<table><thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead><tbody>
<tr>
<td>GaussianNB</td>
<td>0.879310344828</td>
<td>0.543333333333</td>
<td>0.325</td>
<td>0.38</td>
</tr>
</tbody></table>
<h2>Additional Code</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">### The first feature must be "poi".</span>
<span class="n">features_list</span> <span class="o">=</span> <span class="p">[</span><span class="s">'poi'</span><span class="p">,</span> <span class="s">'salary'</span><span class="p">,</span> <span class="s">'bonus'</span><span class="p">,</span> <span class="s">'long_term_incentive'</span><span class="p">,</span> <span class="s">'deferred_income'</span><span class="p">,</span> <span class="s">'deferral_payments'</span><span class="p">,</span>
<span class="s">'loan_advances'</span><span class="p">,</span> <span class="s">'other'</span><span class="p">,</span> <span class="s">'expenses'</span><span class="p">,</span> <span class="s">'director_fees'</span><span class="p">,</span> <span class="s">'total_payments'</span><span class="p">,</span>
<span class="s">'exercised_stock_options'</span><span class="p">,</span> <span class="s">'restricted_stock'</span><span class="p">,</span> <span class="s">'restricted_stock_deferred'</span><span class="p">,</span>
<span class="s">'total_stock_value'</span><span class="p">,</span> <span class="s">'to_messages'</span><span class="p">,</span> <span class="s">'from_messages'</span><span class="p">,</span> <span class="s">'from_this_person_to_poi'</span><span class="p">,</span>
<span class="s">'from_poi_to_this_person'</span><span class="p">,</span> <span class="s">'shared_receipt_with_poi'</span><span class="p">,</span> <span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="s">'fraction_to_poi'</span><span class="p">]</span>
<span class="c">### Load the dictionary containing the dataset</span>
<span class="n">filled_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">'NaN'</span><span class="p">)</span> <span class="c"># featureFormat expects 'NaN' strings</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">filled_df</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(</span><span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
<span class="c">### Store to my_dataset for easy export below.</span>
<span class="n">my_dataset</span> <span class="o">=</span> <span class="n">data_dict</span>
<span class="c">### Extract features and labels from dataset for local testing</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">featureFormat</span><span class="p">(</span><span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">,</span> <span class="n">sort_keys</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">y</span><span class="p">,</span> <span class="n">X</span> <span class="o">=</span> <span class="n">targetFeatureSplit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c">### Cross-validation</span>
<span class="n">sss</span> <span class="o">=</span> <span class="n">StratifiedShuffleSplit</span><span class="p">(</span><span class="n">n_splits</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">SCALER</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()]</span>
<span class="n">SELECTOR__K</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="s">'all'</span><span class="p">]</span>
<span class="n">REDUCER__N_COMPONENTS</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">]</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="p">):</span>
<span class="n">nested_score</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">cv</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span> <span class="s">"Nested f1 score: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">nested_score</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="n">grid</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span> <span class="s">"Best parameters: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">grid</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="n">cv_accuracy</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_precision</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_recall</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_f1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">train_index</span><span class="p">,</span> <span class="n">test_index</span> <span class="ow">in</span> <span class="n">cv</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">train_index</span><span class="p">],</span> <span class="n">X</span><span class="p">[</span><span class="n">test_index</span><span class="p">]</span>
<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">train_index</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">test_index</span><span class="p">]</span>
<span class="n">grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">cv_accuracy</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_precision</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">precision_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_recall</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">recall_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_f1</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f1_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_accuracy</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Precision: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_precision</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Recall: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_recall</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean f1: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_f1</span><span class="p">))</span>
</code></pre></div>
<h3>Gaussian Naïve-Bayes</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [None]</span>
<span class="c"># SELECTOR__K = [15]</span>
<span class="c"># REDUCER__N_COMPONENTS = [6]</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">GaussianNB</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span>
<span class="p">}</span>
<span class="n">gnb_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">gnb_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">gnb_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.366984126984
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
Best parameters: {'reducer__n_components': 6, 'selector__k': 15, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\feature_selection\univariate_selection.py:113: UserWarning: Features [5] are constant.
UserWarning)
Pipeline(steps=[('scaler', None), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=6, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('classifier', GaussianNB(priors=None))])
Accuracy: 0.85733 Precision: 0.44868 Recall: 0.30600 F1: 0.36385 F2: 0.32678
Total predictions: 15000 True positives: 612 False positives: 752 False negatives: 1388 True negatives: 12248
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">kbest</span> <span class="o">=</span> <span class="n">gnb_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">named_steps</span><span class="p">[</span><span class="s">'selector'</span><span class="p">]</span>
<span class="n">features_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">features_list</span><span class="p">)</span>
<span class="n">features_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">features_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">kbest</span><span class="o">.</span><span class="n">scores_</span><span class="p">)[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">k_features</span> <span class="o">=</span> <span class="n">kbest</span><span class="o">.</span><span class="n">get_support</span><span class="p">()</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">features</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">):</span>
<span class="n">features</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">features_array</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]])</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">features</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">kbest</span><span class="o">.</span><span class="n">scores_</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">)]][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">),</span> <span class="n">scores</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.4</span><span class="p">,</span> <span class="n">k_features</span><span class="p">),</span> <span class="n">features</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'SelectKBest Feature Importances'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img src="/assets/images/enron/output_16_0.png" alt="png"></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Without the engineered features</span>
<span class="c"># removing the 2 last columns</span>
<span class="n">X_2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">X_2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">X_2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">gnb_grid</span><span class="p">,</span> <span class="n">X_2</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.345079365079
Best parameters: {'reducer__n_components': 6, 'selector__k': 13, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38
</code></pre></div>
<h3>Support Vector Machine Classifier</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C_PARAM</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">GAMMA_PARAM</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">CLASS_WEIGHT</span> <span class="o">=</span> <span class="p">[</span><span class="s">'balanced'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="n">KERNEL</span> <span class="o">=</span> <span class="p">[</span><span class="s">'rbf'</span><span class="p">,</span> <span class="s">'sigmoid'</span><span class="p">]</span>
<span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [StandardScaler()]</span>
<span class="c"># SELECTOR__K = [18]</span>
<span class="c"># REDUCER__N_COMPONENTS = [10]</span>
<span class="c"># C_PARAM = [100]</span>
<span class="c"># GAMMA_PARAM = [.01]</span>
<span class="c"># CLASS_WEIGHT = ['balanced']</span>
<span class="c"># KERNEL = ['sigmoid']</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">SVC</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span><span class="p">,</span>
<span class="s">'classifier__C'</span><span class="p">:</span> <span class="n">C_PARAM</span><span class="p">,</span>
<span class="s">'classifier__gamma'</span><span class="p">:</span> <span class="n">GAMMA_PARAM</span><span class="p">,</span>
<span class="s">'classifier__class_weight'</span><span class="p">:</span> <span class="n">CLASS_WEIGHT</span><span class="p">,</span>
<span class="s">'classifier__kernel'</span><span class="p">:</span> <span class="n">KERNEL</span>
<span class="p">}</span>
<span class="n">svc_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">svc_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">svc_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.287132034632
Best parameters: {'reducer__n_components': 10, 'selector__k': 18, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__class_weight': 'balanced', 'classifier__gamma': 0.01, 'classifier__kernel': 'sigmoid', 'classifier__C': 100.0}
Mean Accuracy: 0.827586206897
Mean Precision: 0.460887445887
Mean Recall: 0.8
Mean f1: 0.566651681652
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=18, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('cla...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])
Accuracy: 0.76920 Precision: 0.33595 Recall: 0.74850 F1: 0.46375 F2: 0.60092
Total predictions: 15000 True positives: 1497 False positives: 2959 False negatives: 503 True negatives: 10041
</code></pre></div>
<h3>Decision Tree Classifier</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">CRITERION</span> <span class="o">=</span> <span class="p">[</span><span class="s">'gini'</span><span class="p">,</span> <span class="s">'entropy'</span><span class="p">]</span>
<span class="n">SPLITTER</span> <span class="o">=</span> <span class="p">[</span><span class="s">'best'</span><span class="p">,</span> <span class="s">'random'</span><span class="p">]</span>
<span class="n">MIN_SAMPLES_SPLIT</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span>
<span class="n">CLASS_WEIGHT</span> <span class="o">=</span> <span class="p">[</span><span class="s">'balanced'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [StandardScaler()]</span>
<span class="c"># SELECTOR__K = [18]</span>
<span class="c"># REDUCER__N_COMPONENTS = [2]</span>
<span class="c"># CRITERION = ['gini']</span>
<span class="c"># SPLITTER = ['random']</span>
<span class="c"># MIN_SAMPLES_SPLIT = [8]</span>
<span class="c"># CLASS_WEIGHT = ['balanced']</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">DecisionTreeClassifier</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span><span class="p">,</span>
<span class="s">'classifier__criterion'</span><span class="p">:</span> <span class="n">CRITERION</span><span class="p">,</span>
<span class="s">'classifier__splitter'</span><span class="p">:</span> <span class="n">SPLITTER</span><span class="p">,</span>
<span class="s">'classifier__min_samples_split'</span><span class="p">:</span> <span class="n">MIN_SAMPLES_SPLIT</span><span class="p">,</span>
<span class="s">'classifier__class_weight'</span><span class="p">:</span> <span class="n">CLASS_WEIGHT</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">tree_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">tree_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">tree_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.228430049483
Best parameters: {'reducer__n_components': 4, 'selector__k': 15, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__min_samples_split': 8, 'classifier__class_weight': 'balanced', 'classifier__splitter': 'random', 'classifier__criterion': 'gini'}
Mean Accuracy: 0.758620689655
Mean Precision: 0.325331890332
Mean Recall: 0.425
Mean f1: 0.321083916084
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=4, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('clas...=8, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='random'))])
Accuracy: 0.73587 Precision: 0.24677 Recall: 0.47800 F1: 0.32550 F2: 0.40256
Total predictions: 15000 True positives: 956 False positives: 2918 False negatives: 1044 True negatives: 10082
</code></pre></div>
<h2>References</h2>
<ul>
<li><a href="http://scikit-learn.org/">http://scikit-learn.org/</a></li>
<li><a href="http://sebastianraschka.com/Articles/2014_about_feature_scaling.html">http://sebastianraschka.com/Articles/2014_about_feature_scaling.html</a></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comExploring and Summarizing White Wine Data with R2016-11-07T00:00:00+00:002016-11-07T00:00:00+00:00http://luizschiller.com/white-wine<h4>Udacity Data Analyst Nanodegree</h4>
<h3>Project Overview</h3>
<p>This report explores a dataset containing attributes for 4898 instances of the Portuguese “Vinho Verde” white wine.</p>
<p>The attributes are the following:</p>
<ol>
<li>fixed acidity (tartaric acid - g / dm<sup>3</sup>): most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).</li>
<li>volatile acidity (acetic acid - g / dm<sup>3</sup>): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegary taste.</li>
<li>citric acid (g / dm<sup>3</sup>): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.</li>
<li>residual sugar (g / dm<sup>3</sup>): the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 g / dm<sup>3</sup>, and wines with more than 45 g / dm<sup>3</sup> are considered sweet.</li>
<li>chlorides (sodium chloride - g / dm<sup>3</sup>): the amount of salt in the wine.</li>
<li>free sulfur dioxide (mg / dm<sup>3</sup>): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine.</li>
<li>total sulfur dioxide (mg / dm<sup>3</sup>): the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.</li>
<li>density (g / cm<sup>3</sup>): the density of wine is close to that of water, depending on the alcohol and sugar content.</li>
<li>pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.</li>
<li>sulphates (potassium sulphate - g / dm<sup>3</sup>): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, and which acts as an antimicrobial and antioxidant.</li>
<li>alcohol (% by volume): the percent alcohol content of the wine.</li>
<li>quality: score between 0 and 10 (based on sensory data).</li>
</ol>
<h1>Univariate Plots Section</h1>
<div class="highlight"><pre><code class="language-" data-lang="">## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
</code></pre></div>
<h2>Main feature of interest: Quality</h2>
<p><img src="/assets/images/white-wine/quality-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
</code></pre></div>
<p>Quality follows a normal-like distribution with discrete integer values.</p>
<h2>Regarding acidity</h2>
<p><img src="/assets/images/white-wine/fixed.acidity-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
</code></pre></div>
<p><img src="/assets/images/white-wine/volatile.acidity-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
</code></pre></div>
<p><img src="/assets/images/white-wine/citric.acid-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
</code></pre></div>
<p>There is an interesting peak at 0.49 and a smaller one at 0.74 g / dm<sup>3</sup>. This suggests to me that a standard amount of citric acid may be added to some of the wines.</p>
<p><img src="/assets/images/white-wine/pH-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
</code></pre></div>
<p>The pH shows a bell-shaped distribution. I wonder how it relates individually to the concentrations of the acids.</p>
<h2>Regarding SO2</h2>
<p><img src="/assets/images/white-wine/free.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
</code></pre></div>
<p>Free sulfur dioxide has some extreme outliers to the right of the curve.</p>
<p><img src="/assets/images/white-wine/total.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
</code></pre></div><div class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">wines</span><span class="o">$</span><span class="n">bound.sulfur.dioxide</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wines</span><span class="p">,</span><span class="w">
</span><span class="n">total.sulfur.dioxide</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">free.sulfur.dioxide</span><span class="p">)</span><span class="w">
</span><span class="n">wines</span><span class="o">$</span><span class="n">sulfur.dioxide.ratio</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wines</span><span class="p">,</span><span class="w">
</span><span class="n">free.sulfur.dioxide</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">bound.sulfur.dioxide</span><span class="p">)</span><span class="w">
</span></code></pre></div>
<p>I created a bound sulfur dioxide variable by subtracting the free from the total sulfur dioxide. Then I created a feature consisting of the ratio between the free and bound sulfur dioxide present in the wine.</p>
<p><img src="/assets/images/white-wine/bound.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
</code></pre></div>
<p>It looks very similar to the total sulfur dioxide.</p>
<p><img src="/assets/images/white-wine/sulfur.dioxide.ratio-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02419 0.23600 0.33990 0.36750 0.46150 2.45500
</code></pre></div>
<p>I transformed the scale to log10 to better visualize the distribution. Maybe it will be useful when trying to predict the quality, or even give us some insight about the data.</p>
<p><img src="/assets/images/white-wine/sulphates-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
</code></pre></div>
<p>Sulphates are a little positively skewed. Since they can contribute to sulfur dioxide levels, it may be valuable to plot the relations between them.</p>
<h2>Other attributes</h2>
<p><img src="/assets/images/white-wine/residual.sugar-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
</code></pre></div>
<p>I transformed the residual sugar to a log10 scale to better visualize its distribution. The transformed variable appears bimodal, with peaks around 1.3 and 8.</p>
<p><img src="/assets/images/white-wine/chlorides-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
</code></pre></div>
<p>I transformed the long tail distribution with a log10 scale so it could be better visualized. After the transformation, the chlorides histogram appears normal, with some outliers on the right side of the curve.</p>
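<p>As a side note for readers more familiar with Python: the analysis in this post is done in R, but the same log10 idea can be sketched roughly with pandas and matplotlib. The CSV file name and column names below are assumptions about the local copy of the dataset, not code from the original analysis.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough Python sketch of a log10-scaled histogram (the post itself uses R/ggplot2).
# The CSV file name and column names are assumptions about the local dataset copy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

wines = pd.read_csv('wineQualityWhites.csv')

values = wines['residual.sugar']  # the same idea applies to wines['chlorides']
bins = np.logspace(np.log10(values.min()), np.log10(values.max()), 40)
plt.hist(values, bins=bins)
plt.xscale('log')
plt.xlabel('residual sugar (g / dm^3)')
plt.ylabel('count')
plt.show()
</code></pre></div>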
<p><img src="/assets/images/white-wine/density-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
</code></pre></div>
<p>Most of the density values are between 0.99 and 1.00 g / cm<sup>3</sup>, but there are some outliers near 1.01 and 1.04.</p>
<p><img src="/assets/images/white-wine/alcohol-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
</code></pre></div>
<p>Alcohol presents mostly discrete values, with intervals of .1%. There are a few exceptions though.</p>
<h1>Univariate Analysis</h1>
<h3>What is the structure of your dataset?</h3>
<p>There are 11 variables representing physicochemical measurements and 1 variable representing the median of at least 3 evaluations of quality made by wine experts, varying from 0 (very bad) to 10 (very excellent).</p>
<h3>What is/are the main feature(s) of interest in your dataset?</h3>
<p>Quality is the main feature of interest. The objective of the analysis is to determine the features that influence wine quality the most, and then to build a predictive model of quality using those variables.</p>
<h3>What other features in the dataset do you think will help support your investigation into your feature(s) of interest?</h3>
<p>Most features have an approximately normal distribution, just like the quality variable. This makes it hard to guess which features will have a greater impact on the prediction of quality.</p>
<h3>Did you create any new variables from existing variables in the dataset?</h3>
<p>I created “bound.sulfur.dioxide” (total minus free sulfur dioxide) and “sulfur.dioxide.ratio”, which consists of the ratio between “free.sulfur.dioxide” and “bound.sulfur.dioxide”.</p>
<h3>Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?</h3>
<p>The distribution of citric acid presented two unusual peaks which stood out from an otherwise normal distribution.</p>
<p>I performed a log transformation on the residual sugar and chlorides distributions, because they were very skewed, and the transformations allowed better visualizations of the data.</p>
<h1>Bivariate Plots Section</h1>
<p><img src="/assets/images/white-wine/Correlation_Matrix-1.png" alt=""></p>
<p>This correlation matrix naturally shows strong correlations between free sulfur dioxide, total sulfur dioxide and the constructed variables bound sulfur dioxide and sulfur dioxide ratio.</p>
<p>It also shows interesting relations between residual.sugar and density, and between alcohol and density.</p>
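<p>For reference, the Pearson correlations behind this matrix can be reproduced roughly in Python as well. This is only a sketch using the same assumed <code>wines</code> DataFrame from the earlier Python snippet; the original matrix was produced in R.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch of the correlation matrix shown above (the original is R).
# Assumes the `wines` DataFrame and column names from the earlier Python sketch.
wines['bound.sulfur.dioxide'] = wines['total.sulfur.dioxide'] - wines['free.sulfur.dioxide']
wines['sulfur.dioxide.ratio'] = wines['free.sulfur.dioxide'] / wines['bound.sulfur.dioxide']

corr = wines.corr(method='pearson')
print(corr['quality'].sort_values(ascending=False))  # correlations of each feature with quality
</code></pre></div>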
<h2>Density, residual sugar and alcohol</h2>
<p><img src="/assets/images/white-wine/density_sugar_alcohol-1.png" alt=""></p>
<p>Density varies approximately linearly with residual sugar (positive correlation) and with alcohol (negative correlation). This makes sense considering the fermentation process of wine, in which sugar is consumed to generate alcohol. Since residual sugar is denser than alcohol, this inverse relation appears.</p>
<h2>Sulfur dioxide</h2>
<p><img src="/assets/images/white-wine/sulfur_dioxide-1.png" alt=""></p>
<p>The sulfur dioxide ratio increases along with the free sulfur dioxide, and wines with greater ratios tend to have smaller concentrations of bound sulfur dioxide. I wonder how quality varies related to these variables.</p>
<h2>Acids</h2>
<p><img src="/assets/images/white-wine/acids-1.png" alt=""></p>
<p>The only acidity measure that shows a considerable correlation with pH is fixed acidity.</p>
<h2>Main feature of interest: Quality</h2>
<p><img src="/assets/images/white-wine/quality_fixed.acidity-1.png" alt=""></p>
<p>Better quality wines seem to have smaller fixed acidities on average.</p>
<p><img src="/assets/images/white-wine/quality_volatile.acidity-1.png" alt=""></p>
<p>The same seems to apply to volatile acidity, but nothing very conclusive emerges.</p>
<p><img src="/assets/images/white-wine/quality_citric.acid-1.png" alt=""></p>
<p>The low correlation seen in the matrix above is also apparent in these charts of quality by citric acid.</p>
<p><img src="/assets/images/white-wine/quality_pH-1.png" alt=""></p>
<p>Except for wines with quality score 3, the median pH increases along with quality score.</p>
<p><img src="/assets/images/white-wine/quality_free.sulfur.dioxide-1.png" alt=""></p>
<p>No clear relation between free sulfur dioxide and quality.</p>
<p><img src="/assets/images/white-wine/quality_bound.sulfur.dioxide-1.png" alt=""></p>
<p>Here a trend can be seen. Overall, quality decreases as bound sulfur dioxide increases.</p>
<p><img src="/assets/images/white-wine/quality_total.sulfur.dioxide-1.png" alt=""></p>
<p>Here a slight correlation can be seen, somewhat similar to that of bound sulfur dioxide.</p>
<p><img src="/assets/images/white-wine/quality_sulfur.dioxide.ratio-1.png" alt=""></p>
<p>In general, quality increases as the ratio of free to bound sulfur dioxide increases, but the correlation is weak.</p>
<p><img src="/assets/images/white-wine/quality_sulphates-1.png" alt=""></p>
<p>Sulphates don’t seem to add much on their own.</p>
<p><img src="/assets/images/white-wine/quality_residual.sugar-1.png" alt=""></p>
<p>Nothing very clear from these charts.</p>
<p><img src="/assets/images/white-wine/quality_chlorides-1.png" alt=""></p>
<p>There is a curious number of outliers for scores 5 and 6. I wonder why that happens.</p>
<p><img src="/assets/images/white-wine/quality_density-1.png" alt=""></p>
<p>A stronger correlation is evident here. This seems to be one of the most promising relations so far. Maybe it has something to do with the fact that density is highly correlated with residual sugar and alcohol concentration, features that may be more easily detected by the experts’ palates.</p>
<p><img src="/assets/images/white-wine/quality_alcohol-1.png" alt=""></p>
<p>Alcohol is the variable with the greatest correlation with quality, as can be seen in the chart. Wines with scores 3 and 4 go against the trend, but there are not many of those.</p>
<h1>Bivariate Analysis</h1>
<h3>Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?</h3>
<p>I analyzed the relations between quality and every other variable in the dataset. The two largest Pearson’s correlations found were with alcohol (.436) and density (-.307). With both variables, an approximately linear relation existed for wines with scores from 5 to 9. The same did not apply to scores 3 and 4.</p>
<p>Analyzing the wines with quality score 9, I observed that they have on average a high concentration of alcohol, a very low density, and also a low amount of residual sugar. I imagine this derives from a well-adjusted fermentation process, in which the sugar from the grapes is almost completely consumed, generating an above-average alcohol concentration and thus a lower density.</p>
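<p>A quick way to check this observation numerically is to compare the quality-9 wines against the overall averages. The sketch below assumes the same <code>wines</code> DataFrame used in the earlier Python snippets.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch comparing quality-9 wines with the overall averages.
# Assumes the `wines` DataFrame and column names from the earlier sketches.
cols = ['alcohol', 'density', 'residual.sugar']
print(wines[wines['quality'] == 9][cols].mean())  # averages for the top-rated wines
print(wines[cols].mean())                         # overall averages for comparison
</code></pre></div>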
<p>There is also a curious number of chloride outliers for scores 5 and 6. I wonder why that happens.</p>
<h3>Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?</h3>
<p>Density is strongly correlated with two other variables: residual sugar (positively) and alcohol (negatively). This makes sense considering the fermentation process of wine, in which sugar is consumed to generate alcohol. Since residual sugar is denser than alcohol, this inverse relation appears.</p>
<p>Another relationship found was between fixed acidity and pH. Among the measures of acidity in the dataset, fixed acidity was the only one presenting at least a weak linear relationship with pH.</p>
<h3>What was the strongest relationship you found?</h3>
<p>The one between density and residual sugar. These features have a Pearson’s correlation coefficient of .839.</p>
<h1>Multivariate Plots Section</h1>
<p>I am dividing alcohol into bins to be able to plot density, alcohol, residual sugar and quality together and see how they relate to each other:</p>
<p><img src="/assets/images/white-wine/alcohol_buckets-1.png" alt=""></p>
<p><img src="/assets/images/white-wine/quality_levels-1.png" alt=""></p>
<p>It can be seen that the points corresponding to higher amounts of alcohol show wines of better quality on average, and, for a given residual sugar, quality increases as density decreases.</p>
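<p>The binning itself was done in R; a rough Python equivalent with pandas is sketched below. The bucket edges are illustrative assumptions, since the cut points used in the original plots are not listed here.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch of binning alcohol into buckets (the post does this in R).
# The bin edges below are illustrative only; they are not the original cut points.
import pandas as pd

wines['alcohol.bucket'] = pd.cut(wines['alcohol'],
                                 bins=[8, 9.5, 10.5, 11.5, 14.2],
                                 include_lowest=True)
print(wines['alcohol.bucket'].value_counts().sort_index())
</code></pre></div>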
<p><img src="/assets/images/white-wine/sulfur_quality-1.png" alt=""></p>
<p>Revisiting this chart from the bivariate plots section, now colored by quality score. None of the charts indicate that these factors are a good fit for a linear model predicting quality. However, some regions with higher concentrations of good and bad quality wines emerge, although not very clearly.</p>
<p><img src="/assets/images/white-wine/acid_quality-1.png" alt=""></p>
<p>Revisiting these charts with quality added as color, nothing very useful appeared.</p>
<div class="highlight"><pre><code class="language-" data-lang="">##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wines)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar,
## data = wines)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio, data = wines)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates, data = wines)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density, data = wines)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH, data = wines)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity,
## data = wines)
## m9: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity +
## chlorides, data = wines)
## m10: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity +
## chlorides + citric.acid, data = wines)
##
## ===============================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
## -----------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** 3.017*** 2.356*** 2.264*** 2.014*** 82.862*** 102.754*** 145.254*** 143.910*** 144.536***
## (0.098) (0.098) (0.114) (0.114) (0.125) (12.567) (12.925) (18.216) (18.506) (18.561)
## alcohol 0.313*** 0.324*** 0.375*** 0.367*** 0.368*** 0.271*** 0.242*** 0.192*** 0.192*** 0.191***
## (0.009) (0.009) (0.010) (0.010) (0.010) (0.018) (0.019) (0.024) (0.024) (0.024)
## volatile.acidity -1.979*** -2.107*** -1.961*** -1.943*** -1.910*** -1.887*** -1.835*** -1.831*** -1.823***
## (0.110) (0.109) (0.111) (0.110) (0.110) (0.110) (0.111) (0.111) (0.113)
## residual.sugar 0.027*** 0.025*** 0.026*** 0.055*** 0.065*** 0.081*** 0.081*** 0.081***
## (0.002) (0.002) (0.002) (0.005) (0.005) (0.007) (0.007) (0.007)
## sulfur.dioxide.ratio 0.384*** 0.388*** 0.319*** 0.304*** 0.308*** 0.309*** 0.308***
## (0.056) (0.056) (0.057) (0.057) (0.056) (0.057) (0.057)
## sulphates 0.463*** 0.636*** 0.588*** 0.645*** 0.644*** 0.642***
## (0.095) (0.098) (0.098) (0.100) (0.100) (0.100)
## density -80.561*** -101.824*** -145.402*** -144.002*** -144.641***
## (12.522) (12.938) (18.457) (18.767) (18.823)
## pH 0.472*** 0.702*** 0.695*** 0.699***
## (0.076) (0.103) (0.105) (0.105)
## fixed.acidity 0.068*** 0.066** 0.065**
## (0.020) (0.021) (0.021)
## chlorides -0.224 -0.251
## (0.543) (0.546)
## citric.acid 0.043
## (0.095)
## -----------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.240 0.259 0.266 0.269 0.275 0.281 0.283 0.283 0.283
## adj. R-squared 0.190 0.240 0.258 0.265 0.268 0.274 0.280 0.281 0.281 0.281
## sigma 0.797 0.772 0.763 0.759 0.758 0.754 0.752 0.751 0.751 0.751
## F 1146.395 773.875 568.789 442.368 360.295 309.623 272.891 240.632 213.878 192.478
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5681.776 -5622.083 -5598.647 -5586.778 -5566.139 -5547.023 -5541.549 -5541.464 -5541.364
## Deviance 3112.257 2918.264 2847.993 2820.870 2807.231 2783.672 2762.028 2755.862 2755.766 2755.653
## AIC 11684.782 11371.552 11254.166 11209.295 11187.556 11148.278 11112.045 11103.098 11104.927 11106.728
## BIC 11704.272 11397.538 11286.649 11248.274 11233.032 11200.250 11170.515 11168.064 11176.389 11184.687
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ===============================================================================================================================================
</code></pre></div>
<h1>Multivariate Analysis</h1>
<h3>Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?</h3>
<p>There is a very interesting relation between density, alcohol, residual sugar and quality. In general, quality increases as alcohol increases, density decreases and residual sugar decreases. These variables were amongst the most important predictors in the linear model built.</p>
<h3>Were there any interesting or surprising interactions between features?</h3>
<p>Since I did not have much knowledge of wine appraising before this exercise, I did not set expectations for the role of each variable, and therefore I was not surprised by the relations between them.</p>
<h3>OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.</h3>
<p>I created a linear model for predicting quality. The R-squared value for the model was 0.283, which is quite low. It indicates that a linear model is probably not the best fit for this dataset. Alcohol, volatile acidity and residual sugar were the most important prediction variables. Since some of the variables are strongly correlated with each other, some form of feature selection would likely improve the model.</p>
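<p>For illustration, a model similar to m3 in the table above (quality ~ alcohol + volatile.acidity + residual.sugar) could be fit in Python with scikit-learn. This is only a sketch under the same assumptions about the <code>wines</code> DataFrame as the earlier snippets; the models reported above were fit with R’s lm.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough scikit-learn sketch of a linear model along the lines of m3 above
# (quality ~ alcohol + volatile.acidity + residual.sugar).
# Assumes the `wines` DataFrame and column names from the earlier sketches.
from sklearn.linear_model import LinearRegression

predictors = ['alcohol', 'volatile.acidity', 'residual.sugar']
X = wines[predictors]
y = wines['quality']

lm = LinearRegression().fit(X, y)
print('R-squared:', lm.score(X, y))
print(dict(zip(predictors, lm.coef_)))
</code></pre></div>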
<hr>
<h1>Final Plots and Summary</h1>
<h3>Plot One</h3>
<p><img src="/assets/images/white-wine/Plot_One-1.png" alt=""></p>
<h3>Description One</h3>
<p>This chart depicts the relation between alcohol concentration and quality score. For scores from 5 to 9, quality increases as alcohol increases, and for scores 3 and 4 the relation is the inverse. Alcohol has the largest correlation with quality among all the variables in the dataset, with a Pearson’s correlation coefficient of .436.</p>
<h3>Plot Two</h3>
<p><img src="/assets/images/white-wine/Plot_Two-1.png" alt=""></p>
<h3>Description Two</h3>
<p>A very interesting relation is shown in this chart. Given a value of residual sugar, density increases as alcohol decreases. This is to some extent due to the fermentation process of winemaking, in which sugar is consumed to generate alcohol. Since alcohol is less dense than water and sugar is denser than water, this process makes the density of the wine decrease.</p>
<h3>Plot Three</h3>
<p><img src="/assets/images/white-wine/Plot_Three-1.png" alt=""></p>
<h3>Description Three</h3>
<p>This chart shows how quality relates with density and residual sugar. The two lowest and highest quality levels have been grouped to improve visibility.</p>
<p>It is noticeable that, for a given residual sugar concentration, quality increases as density decreases. The same occurs if you fix density and increase residual sugar.</p>
<hr>
<h1>Reflection</h1>
<p>This exploratory data analysis, in which univariate, then bivariate, and finally multivariate examinations are performed, allows for a progressive understanding of the dataset and of the relations between its features.</p>
<p>Some interesting relations came up, like the one between alcohol, density, residual sugar and quality, which could be related to the fermentation process of wine. The fact that pH correlates with fixed acidity but not with volatile acidity or citric acid is also worth noting.</p>
<p>I struggled to find meaningful relations in the multivariate analysis section, and I have the feeling that some interesting relations may have been left aside among the many permutations of variables in the dataset. In any case, the whole analysis process was a very valuable experience, in which I practiced plotting various types of charts, handling overplotting and choosing the best chart type to convey the intended message.</p>
<p>A linear model for predicting quality was built, but it performed poorly, indicating that the dataset does not behave very linearly. The process of evaluating wines is very subjective, and experts can be biased by their histories and preferences, making the relation between quality and the other variables too complex to be explained by a linear model. In the future, a different set of quality prediction models could be applied, and an evaluation of the best fit could be performed.</p>
<h1>References</h1>
<ul>
<li><p>P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] <a href="http://dx.doi.org/10.1016/j.dss.2009.05.016">http://dx.doi.org/10.1016/j.dss.2009.05.016</a> [Pre-press (pdf)] <a href="http://www3.dsi.uminho.pt/pcortez/winequality09.pdf">http://www3.dsi.uminho.pt/pcortez/winequality09.pdf</a> [bib] <a href="http://www3.dsi.uminho.pt/pcortez/dss09.bib">http://www3.dsi.uminho.pt/pcortez/dss09.bib</a></p></li>
<li><p><a href="https://en.wikipedia.org/wiki/Acids_in_wine">https://en.wikipedia.org/wiki/Acids_in_wine</a></p></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comOpenStreetMap Data Wrangling2016-10-08T00:00:00+01:002016-10-08T00:00:00+01:00http://luizschiller.com/openstreetmap<h4>Udacity Data Analyst Nanodegree</h4>
<h2>Project Overview</h2>
<p>Choose any area of the world in <a href="https://www.openstreetmap.org">Open Street Map</a> and use data munging techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for a part of the world that you care about. Choose to learn SQL or MongoDB and apply your chosen schema to the project.</p>
<p>Find the Python code for the project here: <a href="https://github.com/schiller/wrangle-open-street-map-data">https://github.com/schiller/wrangle-open-street-map-data</a></p>
<h3>Map Area: Rio de Janeiro, Brazil</h3>
<ul>
<li><a href="https://mapzen.com/data/metro-extracts/metro/rio-de-janeiro_brazil/">https://mapzen.com/data/metro-extracts/metro/rio-de-janeiro_brazil/</a></li>
</ul>
<p>This area contains three cities that played a great part in my history. I lived about one third of my life in each: Petrópolis (where I was born), Niterói and Rio de Janeiro. That said, I would like to explore this extract a little bit and see what interesting data I can find.</p>
<h2>Problems Encountered in the Map</h2>
<p>After the initial cleaning of the data from the downloaded XML file, it was imported into MongoDB using the following command:
<code>
mongoimport --db osm --collection rio --file rio-de-janeiro_brazil.osm.json
</code></p>
<p>The elements were structured like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nt">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2406124091"</span><span class="p">,</span><span class="w">
</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"node"</span><span class="p">,</span><span class="w">
</span><span class="nt">"created"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nt">"version"</span><span class="p">:</span><span class="s2">"2"</span><span class="p">,</span><span class="w">
</span><span class="nt">"changeset"</span><span class="p">:</span><span class="s2">"17206049"</span><span class="p">,</span><span class="w">
</span><span class="nt">"timestamp"</span><span class="p">:</span><span class="s2">"2013-08-03T16:43:42Z"</span><span class="p">,</span><span class="w">
</span><span class="nt">"user"</span><span class="p">:</span><span class="s2">"linuxUser16"</span><span class="p">,</span><span class="w">
</span><span class="nt">"uid"</span><span class="p">:</span><span class="s2">"1219059"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nt">"pos"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mf">41.9757030</span><span class="p">,</span><span class="w"> </span><span class="mf">-87.6921867</span><span class="p">],</span><span class="w">
</span><span class="nt">"address"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nt">"housenumber"</span><span class="p">:</span><span class="w"> </span><span class="s2">"5157"</span><span class="p">,</span><span class="w">
</span><span class="nt">"postcode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"24230-062"</span><span class="p">,</span><span class="w">
</span><span class="nt">"street"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Rua Moreira César"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nt">"amenity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"restaurant"</span><span class="p">,</span><span class="w">
</span><span class="nt">"cuisine"</span><span class="p">:</span><span class="w"> </span><span class="s2">"mexican"</span><span class="p">,</span><span class="w">
</span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"La Cabana De Don Luis"</span><span class="p">,</span><span class="w">
</span><span class="nt">"phone"</span><span class="p">:</span><span class="w"> </span><span class="s2">"+55-21-95757782"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div>
<p>Analyzing a sample of the data, some problems showed up:</p>
<ul>
<li>Tags with k=“type” overriding the element’s ‘type’ field;</li>
<li>String ‘bicycle_parking’ capacities instead of numbers;</li>
<li>Abbreviated street types in ‘address.street’ tag;</li>
<li>Many different formats in ‘phone’ field;</li>
<li>pprint.pprint method not printing Unicode characters properly.</li>
</ul>
<h3>Tags with k=“type” overriding the element’s ‘type’ field</h3>
<p>Second level ‘k’ tags with the value ‘type’ were overriding the element’s ‘type’ field, which should equal ‘node’ or ‘way’ only. These tags were mapped to the element with the ‘type_tag’ key before being imported into MongoDB.</p>
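<p>The shaping code that does this lives in the project repository linked above; a minimal sketch of the idea, with assumed variable names, could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Minimal sketch of remapping second level k="type" tags (variable names are
# assumptions; the actual shaping code is in the project repository).
def handle_type_tag(node, tag):
    key = tag.attrib['k']
    value = tag.attrib['v']
    if key == 'type':
        # avoid clobbering the element's own 'type' field ('node' or 'way')
        node['type_tag'] = value
    else:
        node[key] = value
</code></pre></div>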
<h3>String ‘bicycle_parking’ capacities instead of numbers</h3>
<p>Nodes representing bicycle parkings had their capacity fields as strings, which did not allow the numeric operations I wanted to perform on them. All of them represented numbers, except for one ‘§0’ value. To solve this, I iterated over the XML file, updating the values with the parsed integer values. Whenever the parsing failed, the ‘capacity’ field was removed. The code used for the conversion and removal is shown below:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">handle_bicycle_parking_capacity</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="p">(</span><span class="s">'amenity'</span> <span class="ow">in</span> <span class="n">node</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="s">'amenity'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'bicycle_parking'</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'capacity'</span> <span class="ow">in</span> <span class="n">node</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">node</span><span class="p">[</span><span class="s">'capacity'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="s">'capacity'</span><span class="p">])</span>
<span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
<span class="n">node</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'capacity'</span><span class="p">)</span>
</code></pre></div>
<h3>Abbreviated street types in ‘address.street’ tag</h3>
<p>There were several street names with their type abbreviated, for example:
<code>
Estr. da Paciência
Av Castelo Branco
R. Miguel Gustavo
</code>
It is worth noting that in Portuguese the street types appear at the beginning of a street name, in contrast with English, where it appears at the end.
To deal with this a mapping was created to convert abbreviations to complete street types:
<code>python
mapping = { "Av": "Avenida",
"Av.": "Avenida",
"Est.": "Estrada",
"Estr.": "Estrada",
"estrada": "Estrada",
"Pca": u"Praça",
"Praca": u"Praça",
u"Pça": u"Praça",
u"Pça.": u"Praça",
"R.": "Rua",
"RUA": "Rua",
"rua": "Rua",
"Ruas": "Rua",
"Rue": "Rua",
"Rod.": "Rodovia",
"Trav": "Travessa" }
</code>
After the update, the abbreviation problem was solved for almost all cases, excluding only a few stranger cases, probably caused by erroneous human input.</p>
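<p>The function that applies this mapping is in the project repository; a minimal sketch, assuming the street type is always the first word of the name (as noted above, Portuguese street types come first), could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Minimal sketch of applying the abbreviation mapping above (an illustration,
# not the exact update code from the repository).
def update_street_name(name, mapping):
    parts = name.split(' ', 1)
    street_type = parts[0]
    if street_type in mapping and len(parts) == 2:
        return mapping[street_type] + ' ' + parts[1]
    return name

print(update_street_name(u'Av Castelo Branco', mapping))  # Avenida Castelo Branco
</code></pre></div>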
<h3>Many different formats in ‘phone’ field</h3>
<p>The ‘phone’ field of most elements was filled with phone numbers in many different formats, and often more than one number was inserted in the same field.
To organize this data I defined a standard pattern for the phone values and audited the file, classifying the values into four groups: ok, wrong_separators, missing_area_code and other. The groups were defined by regular expressions as follows:</p>
<h4>ok</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># +55 99 99999999
phone_ok_re = re.compile(r'^\+55\s\d{2}\s\d{8,9}$')
# 0800 999 9999
phone_0800_ok_re = re.compile(r'^0800\s\d{3}\s\d{4}$')
</code></pre></div>
<h4>wrong_separators</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># 55-99-9-99999999
wrong_separators_re = re.compile(r'^\D*55\D*\d{2}\D*(\d\D?)?\d{4}\D?\d{4}$')
# +55-99-0800-999-9999
wrong_separators_0800_re = re.compile(r'^\D*(55)?\D*(\d{2})?\D*0800\D?\d{3}\D?\d\D?\d{3}$')
</code></pre></div>
<h4>missing_area_code</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># missing +55 (Rio area codes start with 2)
missing_ddi_re = re.compile(r'^\D*2\d\D*(\d\D?)?\d{4}\D?\d{4}$')
# missing +55 2X
missing_ddd_re = re.compile(r'^(\d\D?)?\d{4}\D?\d{4}$')
</code></pre></div>
<h4>other</h4>
<div class="highlight"><pre><code class="language-" data-lang="">The remaining values.
</code></pre></div>
<p>Before the update of the values, which consisted of turning the phone values into lists of strings, removing non-alphanumeric characters, adding missing area codes and inserting spaces only where appropriate, the classification was like this:
<code>json
{
"missing_area_code": 72,
"wrong_separators": 2055,
"other": 41,
"ok": 151
}
</code>
and after the update it turned out like this:
<code>json
{
"missing_area_code": 18,
"wrong_separators": 0,
"other": 41,
"ok": 2260
}
</code>
With an improvement from 6.5% to 97.5% of ‘ok’ values, I was content with the phone cleaning for this wrangling exercise.</p>
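<p>The full update routine is in the project repository; a minimal sketch of normalizing a single value into the ‘+55 99 99999999’ pattern, assuming Rio’s ‘21’ as an illustrative default area code, could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import re

# Minimal sketch of normalizing one phone value (an illustration only; the real
# update code handles lists of numbers and more cases, and the default area
# code '21' is just an assumption for the example).
def update_phone(value, default_area_code='21'):
    digits = re.sub(r'\D', '', value)
    if digits.startswith('0800'):
        return '0800 {} {}'.format(digits[4:7], digits[7:])
    if digits.startswith('55'):
        digits = digits[2:]
    if len(digits) in (8, 9):  # missing area code
        digits = default_area_code + digits
    return '+55 {} {}'.format(digits[:2], digits[2:])

print(update_phone('55-21-9-99999999'))  # +55 21 999999999
</code></pre></div>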
<h3>pprint.pprint method not printing Unicode characters properly</h3>
<p>This problem is not related to the data itself, but it was harming the wrangling process.
When printing the results of some queries with the pprint.pprint method, characters outside the ASCII table were shown as their escaped Unicode representation, making the output hard to read.
To solve this I had to instantiate my own printer, which encoded Unicode objects as UTF-8, making them readable. Check the code below:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pprint</span>
<span class="k">class</span> <span class="nc">MyPrettyPrinter</span><span class="p">(</span><span class="n">pprint</span><span class="o">.</span><span class="n">PrettyPrinter</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">format</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">object</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">maxlevels</span><span class="p">,</span> <span class="n">level</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="nb">object</span><span class="p">,</span> <span class="nb">unicode</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">object</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">),</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pprint</span><span class="o">.</span><span class="n">PrettyPrinter</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">object</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">maxlevels</span><span class="p">,</span> <span class="n">level</span><span class="p">)</span>
</code></pre></div>
<h2>Data Overview</h2>
<p>This section contains basic statistics about the dataset and the MongoDB queries used to gather them. Some queries make use of the ‘aggregate’ function.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">MongoClient</span>
<span class="k">def</span> <span class="nf">get_db</span><span class="p">(</span><span class="n">db_name</span><span class="p">):</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">MongoClient</span><span class="p">(</span><span class="s">'localhost:27017'</span><span class="p">)</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">client</span><span class="p">[</span><span class="n">db_name</span><span class="p">]</span>
<span class="k">return</span> <span class="n">db</span>
<span class="k">def</span> <span class="nf">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">pipeline</span><span class="p">):</span>
<span class="k">return</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)]</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">(</span><span class="s">'osm'</span><span class="p">)</span>
</code></pre></div>
<h3>File Sizes</h3>
<div class="highlight"><pre><code class="language-" data-lang="">rio-de-janeiro_brazil.osm ........... 323 MB
rio-de-janeiro_brazil.osm.json ...... 369 MB
</code></pre></div>
<h3>Elements Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">1737174
</code></pre></div>
<h3>Nodes Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># node count</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">({</span><span class="s">'type'</span><span class="p">:</span> <span class="s">'node'</span><span class="p">})</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">1550716
</code></pre></div>
<h3>Ways Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># way count</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">({</span><span class="s">'type'</span><span class="p">:</span> <span class="s">'way'</span><span class="p">})</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">186458
</code></pre></div>
<h3>Number of Distinct Users</h3>
<p>This query uses the following ‘aggregate’ method:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">distinct_users</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'Distinct users:'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">distinct_users</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Distinct users:, count: 1239}]
</code></pre></div>
<h3>Top 10 Contributing Users</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">top_10_users</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">top_10_users</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Alexandrecw, count: 374621},
{_id: ThiagoPv, count: 186562},
{_id: smaprs_import, count: 185690},
{_id: AlNo, count: 169678},
{_id: Import Rio, count: 85129},
{_id: Geaquinto, count: 69987},
{_id: Nighto, count: 63148},
{_id: Thundercel, count: 55004},
{_id: Márcio Vínícius Pinheiro, count: 35985},
{_id: smaprs, count: 31507}]
</code></pre></div>
<h3>Users Appearing Only Once</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">users_appearing_once</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span><span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$count'</span><span class="p">,</span> <span class="s">'num_users'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span><span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">users_appearing_once</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: 1, num_users: 274}]
</code></pre></div>
<h2>Additional Ideas</h2>
<h3>City validation based on postcodes</h3>
<p>The city and postcode values could be cross-checked when a new address is input. Most countries have public APIs for retrieving addresses from postcodes, so this could be implemented with the help of contributors around the world.
This improvement could prevent many wrong data inputs - there are plenty of examples in the examined dataset - and it would make analyzing city-related data considerably easier and more accurate, benefiting users throughout the world.
On the other hand, such a change reduces the freedom of the user entering new addresses, since data could only be submitted if it agreed with the cross-checked value from another data source. These positive and negative impacts should be weighed before implementing this kind of improvement.</p>
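<p>A minimal sketch of the idea in Python follows. The reference lookup is a plain dictionary standing in for whatever public postcode API a country provides, and its entries are illustrative assumptions, not values taken from the dataset.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Stand-in for a public postcode API; the entries below are hypothetical examples.
REFERENCE_CITIES = {
    '20000-000': 'Rio de Janeiro',
    '24000-000': 'Niterói',
}

def city_matches_postcode(city, postcode):
    """Return True when the city agrees with the reference data for the postcode."""
    expected = REFERENCE_CITIES.get(postcode)
    if expected is None:
        return True  # no reference data available: accept the input as-is
    return expected.lower() == city.lower()

# Flag a suspicious address before it reaches the database.
address = {'city': 'Rio de Janeiro', 'postcode': '24000-000'}
if not city_matches_postcode(address['city'], address['postcode']):
    print('City/postcode mismatch: {}'.format(address))
</code></pre></div>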
<h3>Phone format validator</h3>
<p>The Open Street Map input tool could have a phone format validator, varying from country to country, to avoid the current mess of phone formats 😉. It could also join multiple phone numbers with a standard separator, since splitting them was one of the most difficult steps of the phone value wrangling.
The fact that each country has a different standard format makes this difficult to implement, but with the help of the open-source community around Open Street Map it could be done.
Again, it would reduce the freedom of the user entering the data, since the phone format would have to be validated against the standards. And every time the standards change, the validators would have to be updated, causing extra work that does not exist today.</p>
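<p>A rough sketch of what such a validator could look like is shown below. The regular expression and the standard separator are assumptions for Brazilian numbers written in international notation, not rules taken from Open Street Map.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import re

# Illustrative pattern for Brazilian numbers: +55 DD NNNN-NNNN or +55 DD NNNNN-NNNN.
PHONE_PATTERNS = {
    'BR': re.compile(r'^\+55 \d{2} \d{4,5}-\d{4}$'),
}
STANDARD_SEPARATOR = ';'

def normalise_phones(raw_value, country='BR'):
    """Split a raw phone field on common separators and keep only valid numbers."""
    candidates = re.split(r'[;,/]| e ', raw_value)
    valid = [p.strip() for p in candidates if PHONE_PATTERNS[country].match(p.strip())]
    return STANDARD_SEPARATOR.join(valid)

print(normalise_phones('+55 21 1234-5678 / +55 21 91234-5678'))
# '+55 21 1234-5678;+55 21 91234-5678'
</code></pre></div>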
<h3>Variety.js</h3>
<p>The open-source tool Variety (<a href="https://github.com/variety/variety">https://github.com/variety/variety</a>) lets the user get a sense of how the data in a MongoDB collection is structured. It does so by showing the number of occurrences of each key in the documents returned by a query.
It is a useful ally when analysing datasets like Open Street Map, which does not define an allowed key set.</p>
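<p>The short sketch below is not Variety itself, only a rough Python approximation of the same report, reusing the <code>db</code> connection opened earlier: it counts how often each top-level key appears in the documents returned by a query.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from collections import Counter

def key_frequencies(collection, query=None, limit=10000):
    """Count how many sampled documents contain each top-level key."""
    counter = Counter()
    for doc in collection.find(query or {}).limit(limit):
        counter.update(doc.keys())
    return counter

for key, count in key_frequencies(db.rio, {'type': 'node'}).most_common(10):
    print('{}: {}'.format(key, count))
</code></pre></div>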
<h3>Most Common Amenities</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">most_common_amenities</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$amenity'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">most_common_amenities</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: school, count: 1818},
{_id: bicycle_parking, count: 1409},
{_id: restaurant, count: 1080},
{_id: parking, count: 976},
{_id: fast_food, count: 890},
{_id: fuel, count: 678},
{_id: place_of_worship, count: 562},
{_id: bank, count: 534},
{_id: pub, count: 400},
{_id: pharmacy, count: 368}]
</code></pre></div>
<h3>Statistics on Bike Parking Capacity</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bike_parkings_capacity</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'bicycle_parking'</span><span class="p">,</span> <span class="s">'capacity'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'_id'</span><span class="p">:</span> <span class="s">'Bike parking stats:'</span><span class="p">,</span>
<span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="s">'max'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$max'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">},</span>
<span class="s">'min'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$min'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">},</span>
<span class="s">'avg'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$avg'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">}}}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">bike_parkings_capacity</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Bike parking stats:,
avg: 11.487840825350037,
count: 1357,
max: 700,
min: 1}]
</code></pre></div>
<h3>10 Most Common Cuisines</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">top_10_cuisines</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'restaurant'</span><span class="p">,</span> <span class="s">'cuisine'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$cuisine'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">top_10_cuisines</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: pizza, count: 88},
{_id: regional, count: 83},
{_id: japanese, count: 38},
{_id: italian, count: 38},
{_id: steak_house, count: 20},
{_id: barbecue, count: 18},
{_id: brazilian, count: 16},
{_id: international, count: 12},
{_id: seafood, count: 8},
{_id: chinese, count: 8}]
</code></pre></div>
<h3>10 Most Common Religions</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">most_common_religions</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'place_of_worship'</span><span class="p">,</span> <span class="s">'religion'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$religion'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">most_common_religions</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: christian, count: 495},
{_id: spiritualist, count: 7},
{_id: jewish, count: 6},
{_id: buddhist, count: 3},
{_id: religion_of_humanity, count: 1},
{_id: umbanda, count: 1},
{_id: macumba, count: 1},
{_id: muslim, count: 1},
{_id: seicho_no_ie, count: 1}]
</code></pre></div>
<p>The vast majority is christian. Among them, which are the most common denominations?</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">christian_denominations</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'place_of_worship'</span><span class="p">,</span> <span class="s">'religion'</span><span class="p">:</span> <span class="s">'christian'</span><span class="p">,</span> <span class="s">'denomination'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$denomination'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">christian_denominations</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: catholic, count: 157},
{_id: baptist, count: 33},
{_id: roman_catholic, count: 31},
{_id: evangelical, count: 27},
{_id: spiritist, count: 20},
{_id: pentecostal, count: 19},
{_id: protestant, count: 14},
{_id: methodist, count: 10},
{_id: presbyterian, count: 3},
{_id: assemblies_of_god, count: 2}]
</code></pre></div>
<h3>Fast-food Sites Near the Sugar Loaf</h3>
<p>Suppose you are visiting the Sugar Loaf in Rio and you are suddenly starving! Where should you go?
MongoDB’s geospatial index to the rescue!</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">GEO2D</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">create_index</span><span class="p">([(</span><span class="s">'pos'</span><span class="p">,</span> <span class="n">GEO2D</span><span class="p">)])</span>
<span class="n">sugar_loaf</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find_one</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Pão de Açúcar'</span><span class="p">,</span> <span class="s">'tourism'</span><span class="p">:</span> <span class="s">'attraction'</span><span class="p">})</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">(</span>
<span class="p">{</span><span class="s">'pos'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$near'</span><span class="p">:</span> <span class="n">sugar_loaf</span><span class="p">[</span><span class="s">'pos'</span><span class="p">]},</span> <span class="s">'amenity'</span><span class="p">:</span> <span class="s">'fast_food'</span><span class="p">},</span>
<span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'name'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'cuisine'</span><span class="p">:</span> <span class="mi">1</span><span class="p">})</span><span class="o">.</span><span class="n">skip</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">([</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">result</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{cuisine: corn, name: Tino},
{cuisine: sandwich, name: Max},
{cuisine: popcorn, name: França}]
</code></pre></div>
<p>Luckily there are Tino’s corn, Max’s sandwich and França’s popcorn to satisfy your hunger!</p>
<h3>Conclusion</h3>
<p>Data inserted by humans is almost certain to show inconsistencies. And even though a large part of it is inserted by bots, different bots may insert data using different patterns, so the inconsistency remains. On the other hand, this freedom in data input grants a lot of flexibility to users, and because of that the map may represent the real world even more faithfully than if there were key constraints or limitations.</p>
<p>In any case, for the purposes of this wrangling exercise the data has been well cleaned.</p>
<h3>References:</h3>
<h4>pprint Unicode</h4>
<ul>
<li><a href="http://stackoverflow.com/questions/10883399/unable-to-encode-decode-pprint-output">http://stackoverflow.com/questions/10883399/unable-to-encode-decode-pprint-output</a></li>
</ul>
<h4>MongoDB Geospatial Index</h4>
<ul>
<li><a href="https://docs.mongodb.com/v3.2/tutorial/build-a-2d-index/">https://docs.mongodb.com/v3.2/tutorial/build-a-2d-index/</a></li>
<li><a href="https://docs.mongodb.com/v3.2/tutorial/query-a-2d-index/">https://docs.mongodb.com/v3.2/tutorial/query-a-2d-index/</a></li>
<li><a href="http://api.mongodb.com/python/current/api/pymongo/collection.html?_ga=1.25837502.2095208423.1476211996#pymongo.collection.Collection.create_index">http://api.mongodb.com/python/current/api/pymongo/collection.html?_ga=1.25837502.2095208423.1476211996#pymongo.collection.Collection.create_index</a></li>
</ul>
<h4>Variety Open Source Tool</h4>
<ul>
<li><a href="https://github.com/variety/variety">https://github.com/variety/variety</a></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comInvestigating the Titanic Dataset with Python2016-09-08T00:00:00+01:002016-09-08T00:00:00+01:00http://luizschiller.com/titanic<h4>Udacity Data Analyst Nanodegree</h4>
<h3>First Glance at Our Data</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s">'titanic_data.csv'</span>
<span class="n">titanic_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
</code></pre></div>
<p>First let’s take a quick look at what we’ve got:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>PassengerId</th>
<th>Survived</th>
<th>Pclass</th>
<th>Name</th>
<th>Sex</th>
<th>Age</th>
<th>SibSp</th>
<th>Parch</th>
<th>Ticket</th>
<th>Fare</th>
<th>Cabin</th>
<th>Embarked</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>0</td>
<td>3</td>
<td>Braund, Mr. Owen Harris</td>
<td>male</td>
<td>22.0</td>
<td>1</td>
<td>0</td>
<td>A/5 21171</td>
<td>7.2500</td>
<td>NaN</td>
<td>S</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1</td>
<td>1</td>
<td>Cumings, Mrs. John Bradley (Florence Briggs Th…</td>
<td>female</td>
<td>38.0</td>
<td>1</td>
<td>0</td>
<td>PC 17599</td>
<td>71.2833</td>
<td>C85</td>
<td>C</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>1</td>
<td>3</td>
<td>Heikkinen, Miss. Laina</td>
<td>female</td>
<td>26.0</td>
<td>0</td>
<td>0</td>
<td>STON/O2. 3101282</td>
<td>7.9250</td>
<td>NaN</td>
<td>S</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>1</td>
<td>1</td>
<td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>
<td>female</td>
<td>35.0</td>
<td>1</td>
<td>0</td>
<td>113803</td>
<td>53.1000</td>
<td>C123</td>
<td>S</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>0</td>
<td>3</td>
<td>Allen, Mr. William Henry</td>
<td>male</td>
<td>35.0</td>
<td>0</td>
<td>0</td>
<td>373450</td>
<td>8.0500</td>
<td>NaN</td>
<td>S</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<h3>Handling Missing Values</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
</code></pre></div>
<p>From this initial observation we notice that, from 891 passenger records:
- 714 have valid ages;
- only 204 have cabin records;
- 2 embarkation ports are missing.</p>
<p>The rows with missing ages and embarkation values will be dropped whenever an analysis depends on them.</p>
<p>The cabin values are not going to be used in this analysis, so they will not be touched.</p>
<h3>Other Considerations</h3>
<p>I’m not going to analyze the numbers of Siblings/Spouses or Parents/Children in isolation. Instead, I’ll use whether or not a family member was aboard, represented by the ‘Family’ column.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Family'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'SibSp'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Parch'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div>
<p>We are also going to need a column stating whether a passenger is a child or an adult; 15 will be the childhood age threshold for this study.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="p">[</span><span class="s">'AgeRange'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Age'</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">80</span><span class="p">],</span> <span class="n">labels</span><span class="o">=</span><span class="p">[</span><span class="s">'child'</span><span class="p">,</span> <span class="s">'adult'</span><span class="p">])</span>
</code></pre></div>
<p>Now I’m getting rid of the data we are not going to use:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'PassengerId'</span><span class="p">,</span> <span class="s">'Name'</span><span class="p">,</span> <span class="s">'SibSp'</span><span class="p">,</span> <span class="s">'Parch'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span> <span class="s">'Cabin'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>
<p>Which leaves us with the following columns, plus ‘Sex’, ‘Embarked’ and ‘Family’:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Survived</th>
<th>Pclass</th>
<th>Age</th>
<th>Fare</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>891.000000</td>
<td>891.000000</td>
<td>714.000000</td>
<td>891.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.383838</td>
<td>2.308642</td>
<td>29.699118</td>
<td>32.204208</td>
</tr>
<tr>
<th>std</th>
<td>0.486592</td>
<td>0.836071</td>
<td>14.526497</td>
<td>49.693429</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>1.000000</td>
<td>0.420000</td>
<td>0.000000</td>
</tr>
<tr>
<th>25%</th>
<td>0.000000</td>
<td>2.000000</td>
<td>NaN</td>
<td>7.910400</td>
</tr>
<tr>
<th>50%</th>
<td>0.000000</td>
<td>3.000000</td>
<td>NaN</td>
<td>14.454200</td>
</tr>
<tr>
<th>75%</th>
<td>1.000000</td>
<td>3.000000</td>
<td>NaN</td>
<td>31.000000</td>
</tr>
<tr>
<th>max</th>
<td>1.000000</td>
<td>3.000000</td>
<td>80.000000</td>
<td>512.329200</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<p>We can see that approximately 38% of the passengers survived, and that the highest fare is over 15 times the average.</p>
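<p>As a quick sanity check, both figures can be recomputed directly from the dataframe:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Quick check of the two claims above, using the columns shown by describe().
survival_rate = titanic_df['Survived'].mean()
fare_ratio = titanic_df['Fare'].max() / titanic_df['Fare'].mean()
print('Survival rate: {:.1%}'.format(survival_rate))       # ~38.4%
print('Max fare / mean fare: {:.1f}x'.format(fare_ratio))  # ~15.9x
</code></pre></div>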
<h3>Let’s raise some questions:</h3>
<ol>
<li>What is the survival rate by class, sex and age? What about combining these factors?</li>
<li>Was the fare the same for men and women?</li>
<li>What fraction of the passengers embarked on each port? Is there a difference in their survival rates?</li>
<li>Is the presence of a family member a good indicator for survival?</li>
</ol>
<h3>1. What is the survival rate by class, sex and age? What about combining these factors?</h3>
<p>Let’s take a look at the distribution of passengers by age and fare, grouped by sex and class, and with survival information. It will give us some global insights about the data. But first, removing rows with missing ages:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df_clean_age</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'Age'</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scatter_plot_class</span><span class="p">(</span><span class="n">pclass</span><span class="p">):</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">FacetGrid</span><span class="p">(</span><span class="n">titanic_df_clean_age</span><span class="p">[</span><span class="n">titanic_df_clean_age</span><span class="p">[</span><span class="s">'Pclass'</span><span class="p">]</span> <span class="o">==</span> <span class="n">pclass</span><span class="p">],</span>
<span class="n">col</span><span class="o">=</span><span class="s">'Sex'</span><span class="p">,</span>
<span class="n">col_order</span><span class="o">=</span><span class="p">[</span><span class="s">'male'</span><span class="p">,</span> <span class="s">'female'</span><span class="p">],</span>
<span class="n">hue</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span>
<span class="n">hue_kws</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">marker</span><span class="o">=</span><span class="p">[</span><span class="s">'v'</span><span class="p">,</span> <span class="s">'^'</span><span class="p">]),</span>
<span class="n">size</span><span class="o">=</span><span class="mi">6</span><span class="p">)</span>
<span class="n">g</span> <span class="o">=</span> <span class="p">(</span><span class="n">g</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">,</span> <span class="s">'Age'</span><span class="p">,</span> <span class="s">'Fare'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'w'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">80</span><span class="p">)</span><span class="o">.</span><span class="n">add_legend</span><span class="p">())</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">top</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="n">g</span><span class="o">.</span><span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'CLASS {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">pclass</span><span class="p">))</span>
<span class="c"># plotted separately because the fare scale for the first class makes it difficult to visualize second and third class charts</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<p><img src="/assets/images/titanic/output_20_0.png" alt="png"></p>
<p><img src="/assets/images/titanic/output_20_1.png" alt="png"></p>
<p><img src="/assets/images/titanic/output_20_2.png" alt="png"></p>
<p>It seems that women had a much higher survival rate, especially in the first and second classes. Children also seem to have a higher survival rate, again especially in the first and second classes. Let’s find out the survival rate by class, sex and age range, and plot the results for a better understanding:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_class</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Pclass'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_class</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass
1 0.655914
2 0.479769
3 0.239437
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_sex</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 0.754789
male 0.205298
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_age</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'AgeRange'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_age</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">AgeRange
child 0.590361
adult 0.381933
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">,</span><span class="n">axis3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_class</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Class'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Survival Rate'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5F9E6E'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Sex'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#B55D60'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Age Range'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">(0.0, 1.0)
</code></pre></div>
<p><img src="/assets/images/titanic/output_25_1.png" alt="png"></p>
<p>As expected (since we all watched the Titanic movie 😉), the first class has a higher survival rate than the second, which has a higher survival rate than the third, and women and children have a higher chance of survival than men and adults, respectively.</p>
<p>Now combining the three factors and visualizing the plots:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">grouped_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span>
<span class="p">[</span><span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">,</span> <span class="s">'AgeRange'</span><span class="p">])[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span>
<span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">,</span> <span class="s">'AgeRange'</span><span class="p">])[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()],</span>
<span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">grouped_data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Count'</span><span class="p">]</span>
<span class="n">grouped_data</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th></th>
<th>Survived</th>
<th>Count</th>
</tr>
<tr>
<th>Pclass</th>
<th>Sex</th>
<th>AgeRange</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">1</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>0.666667</td>
<td>3</td>
</tr>
<tr>
<th>adult</th>
<td>0.975610</td>
<td>82</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>1.000000</td>
<td>3</td>
</tr>
<tr>
<th>adult</th>
<td>0.377551</td>
<td>98</td>
</tr>
<tr>
<th rowspan="4" valign="top">2</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>1.000000</td>
<td>10</td>
</tr>
<tr>
<th>adult</th>
<td>0.906250</td>
<td>64</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>1.000000</td>
<td>9</td>
</tr>
<tr>
<th>adult</th>
<td>0.066667</td>
<td>90</td>
</tr>
<tr>
<th rowspan="4" valign="top">3</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>0.533333</td>
<td>30</td>
</tr>
<tr>
<th>adult</th>
<td>0.430556</td>
<td>72</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>0.321429</td>
<td>28</td>
</tr>
<tr>
<th>adult</th>
<td>0.128889</td>
<td>225</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">factorplot</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="s">'AgeRange'</span><span class="p">,</span>
<span class="n">y</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span>
<span class="n">col</span><span class="o">=</span><span class="s">'Pclass'</span><span class="p">,</span>
<span class="n">row</span><span class="o">=</span><span class="s">'Sex'</span><span class="p">,</span>
<span class="n">data</span><span class="o">=</span><span class="n">titanic_df_clean_age</span><span class="p">,</span>
<span class="n">margin_titles</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">kind</span><span class="o">=</span><span class="s">"bar"</span><span class="p">,</span>
<span class="n">ci</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre></div>
<p><img src="/assets/images/titanic/output_28_0.png" alt="png"></p>
<p>Analysing the three factors combined also gives the expected results. It is interesting to see that even women from the third class have a higher survival rate than men from the first, which indicates that saving women had a higher priority than saving the richer classes.</p>
<p>Saving children also seems to have been a priority: children had a higher survival rate than adults in every combination of class and sex except first-class females, where one of the three girls died.</p>
<p>So we can conclude that saving women and children was indeed a priority on the Titanic shipwreck.</p>
<h3>2. Was the fare the same for men and women?</h3>
<p>While looking at the scatter plots from the first question, I noticed that women seemed to be more spread out along the ‘Fare’ axis, which motivated me to check whether the average fare paid by women was really higher than men’s.</p>
<p>Let’s check the mean fare paid by each sex:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fare_by_sex</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Fare'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">fare_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">fare_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Fare Average and Sex'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Fare'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0xa7d69b0>
</code></pre></div>
<p><img src="/assets/images/titanic/output_32_1.png" alt="png"></p>
<p>It indeed seems that women paid way more than men on average. Women’s average fare is higher than I expected. Maybe it is due to the women of the first class. Let’s group the data by class and check it out:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fare_by_class_sex</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">])[</span><span class="s">'Fare'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">fare_by_class_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass Sex
1 female 106.125798
male 67.226127
2 female 21.970121
male 19.741782
3 female 16.118810
male 12.661633
Name: Fare, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">fare_by_class_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">4</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s">'Fare Average by Class and Sex'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Fare'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0x8c7e590>
</code></pre></div>
<p><img src="/assets/images/titanic/output_35_1.png" alt="png"></p>
<p>The average fare paid by women is higher than men’s in every class, although the second-class fares are almost equal.
I wonder why women paid more… Maybe they demanded more privileges than men, but who knows…</p>
<h3>3. What fraction of the passengers embarked on each port? Is there a difference in their survival rates?</h3>
<p>Just for curiosity’s sake, let’s find out the proportion of passengers who embarked at each port (C = Cherbourg; Q = Queenstown; S = Southampton) and their survival rates. But first, removing rows with missing embarkation values:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df_clean_embarked</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">embarked</span> <span class="o">=</span> <span class="n">titanic_df_clean_embarked</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Embarked'</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">embarked</span><span class="p">[</span><span class="s">'Count'</span><span class="p">]</span> <span class="o">=</span> <span class="n">titanic_df_clean_embarked</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">embarked</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Survived</th>
<th>Pclass</th>
<th>Age</th>
<th>Fare</th>
<th>Family</th>
<th>Count</th>
</tr>
<tr>
<th>Embarked</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>C</th>
<td>0.553571</td>
<td>1.886905</td>
<td>30.814769</td>
<td>59.954144</td>
<td>0.494048</td>
<td>168</td>
</tr>
<tr>
<th>Q</th>
<td>0.389610</td>
<td>2.909091</td>
<td>28.089286</td>
<td>13.276030</td>
<td>0.259740</td>
<td>77</td>
</tr>
<tr>
<th>S</th>
<td>0.336957</td>
<td>2.350932</td>
<td>29.445397</td>
<td>27.079812</td>
<td>0.389752</td>
<td>644</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">countplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Embarked'</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">titanic_df_clean_embarked</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="p">[</span><span class="s">'S'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">embarked</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">embarked</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="p">[</span><span class="s">'S'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xa98cb30>
</code></pre></div>
<p><img src="/assets/images/titanic/output_40_1.png" alt="png"></p>
<p>The survival rate for passengers who embarked at Cherbourg is higher than that of both other ports. That is no surprise: the mean ‘Pclass’ value for this port is 1.89, well below Queenstown’s 2.91 and Southampton’s 2.35, which means that the people who embarked there belonged to richer classes, and we have already seen that those classes had better survival rates than the poorer ones.</p>
<h3>4. Is the presence of a family member a good indicator for survival?</h3>
<p>Finally, let’s check if having a family member aboard means a higher survival chance:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_family</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Family'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_family</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Family
False 0.321782
True 0.516129
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_family</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Family Presence'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Survival Rate'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0xaa26b90>
</code></pre></div>
<p><img src="/assets/images/titanic/output_44_1.png" alt="png"></p>
<p>The data shows that having a family member aboard indicates a better chance of survival. But why is that? Let’s check some other numbers about family presence, such as its relation to class, sex and age range:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_class</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Pclass'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_class</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass
1 0.537634
2 0.462428
3 0.366197
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_sex</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 0.616858
male 0.328918
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_age</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'AgeRange'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_age</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">AgeRange
child 0.927711
adult 0.369255
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">,</span><span class="n">axis3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_class</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Class'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Family Presence'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5F9E6E'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Sex'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#B55D60'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Age Range'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">(0.0, 1.0)
</code></pre></div>
<p><img src="/assets/images/titanic/output_49_1.png" alt="png"></p>
<p>We can see that family presence is higher among:
- first class passengers;
- women;
- children.</p>
<p>We have already seen that these three groups show higher survival rates, so the higher survival rate among passengers with family members may owe more to these factors than to the presence of family itself.</p>
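<p>One way to probe this further (a sketch only, reusing the <code>titanic_df_clean_age</code> DataFrame from above) is to compare survival rates by family presence within each class, sex and age range, so that passengers are compared with otherwise similar passengers:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Survival rate by family presence, controlling for class, sex and age range.
# If the family effect shrinks within each group, the confounders explain part of it.
controlled = titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange', 'Family'])['Survived'].mean()
print(controlled.unstack('Family'))
</code></pre></div>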
<h3>Conclusion</h3>
<p>All the results presented in this report only show correlations in the data. It is important to highlight that correlation does not imply causation. To make statistically valid statements, tests such as chi-squared tests and t-tests should be applied.</p>
<p>To determine whether class, sex and age are related to survival, we could run four chi-squared tests - one for each variable individually and one for all of them combined - and check whether they really do matter, as this study suggests.</p>
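<p>As an illustration only (not a test actually run in this study), a chi-squared test of independence for a single variable such as ‘Sex’ could be set up with <code>scipy.stats.chi2_contingency</code> on a contingency table built with <code>pandas.crosstab</code>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of sex vs. survival counts, assuming the cleaned DataFrame above.
contingency = pd.crosstab(titanic_df_clean_age['Sex'], titanic_df_clean_age['Survived'])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(contingency)
print('chi2 = {:.2f}, p = {:.4f}'.format(chi2, p_value))
</code></pre></div>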
<p>The same applies to finding out whether the port of embarkation or the presence of a family member is related to survival.</p>
<p>To find out whether the average fare was the same for men and women, we would hypothesize that there is no difference and then run a t-test to check whether the observed difference is significant, as this study suggests.</p>
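<p>A sketch of such a test (assuming the cleaned DataFrame keeps the ‘Fare’ column, and using Welch’s variant so equal variances are not assumed):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from scipy.stats import ttest_ind

# Independent two-sample t-test on the fares paid by men and women.
male_fares = titanic_df_clean_age[titanic_df_clean_age['Sex'] == 'male']['Fare']
female_fares = titanic_df_clean_age[titanic_df_clean_age['Sex'] == 'female']['Fare']
t_stat, p_value = ttest_ind(male_fares, female_fares, equal_var=False)
print('t = {:.2f}, p = {:.4f}'.format(t_stat, p_value))
</code></pre></div>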
<p>Thank you for reading!</p>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comStroop Effect - Testing a Perceptual Phenomenon2016-08-20T00:00:00+01:002016-08-20T00:00:00+01:00http://luizschiller.com/stroop-effect<h4>Udacity Data Analyst Nanodegree</h4>
<h3>Project Overview</h3>
<p>In this project, you will investigate a classic phenomenon from experimental psychology called the <a href="https://en.wikipedia.org/wiki/Stroop_effect">Stroop Effect</a>. You will learn a little bit about the experiment, create a hypothesis regarding the outcome of the task, then go through the task yourself. You will then look at some data collected from others who have performed the same task and will compute some statistics describing the results. Finally, you will interpret your results in terms of your hypotheses.</p>
<p>Find the spreadsheet with the calculations here: <a href="https://docs.google.com/spreadsheets/d/194Vc8K5SPjlEYZL97j4oDCDcvbP2ZrwNA6rtQ4NVMKQ/edit?usp=sharing">https://docs.google.com/spreadsheets/d/194Vc8K5SPjlEYZL97j4oDCDcvbP2ZrwNA6rtQ4NVMKQ/edit?usp=sharing</a></p>
<h3>1. What is our independent variable? What is our dependent variable?</h3>
<p>Independent: the words condition (congruent or incongruent);
Dependent: the time it takes to name the ink colors.</p>
<h3>2. What is an appropriate set of hypotheses for this task? What kind of statistical test do you expect to perform? Justify your choices.</h3>
<p><strong>Null hypothesis (H0)</strong>: The mean time for the population to name the ink colors is equal for the Congruent and Incongruent conditions (μC = μI);</p>
<p><strong>Alternative Hypothesis (H1)</strong>: The mean time for the population to name the ink colors is different for the Congruent and Incongruent conditions (μC ≠ μI).</p>
<p>We expect to perform a paired t-test (see the sketch after this list), because:
- We assume the distributions are normal;
- The two samples are dependent;
- We do not know the population’s standard deviation;
- The sample size is below 30.</p>
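<p>A minimal sketch of how this test could be run in Python (the file name <code>stroopdata.csv</code> and its ‘Congruent’/‘Incongruent’ columns are assumptions about how the dataset is laid out):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd
from scipy.stats import ttest_rel

# Each row holds one participant's times under both conditions, so the samples are paired.
stroop = pd.read_csv('stroopdata.csv')

# Paired (dependent) t-test on the two conditions.
t_stat, p_value = ttest_rel(stroop['Congruent'], stroop['Incongruent'])
print('t = {:.3f}, p = {:.6f}'.format(t_stat, p_value))
</code></pre></div>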
<h3>3. Report some descriptive statistics regarding this dataset. Include at least one measure of central tendency and at least one measure of variability.</h3>
<p>Mean difference: -7.96
Standard deviation of the difference: 4.86
Standard error of the mean difference: .99</p>
<h3>4. Provide one or two visualizations that show the distribution of the sample data. Write one or two sentences noting what you observe about the plot or plots.</h3>
<p><img src="/assets/images/stroop-effect/scatterplot.png" alt="Incongruent vs Congruent">
The scatter plot shows some degree of correlation between the two samples.</p>
<p><img src="/assets/images/stroop-effect/histograms.png" alt="Histograms">
The histograms show that the times in the incongruent sample are generally larger than in the congruent sample.</p>
<h3>5. Now, perform the statistical test and report your results. What is your confidence level and your critical statistic value? Do you reject the null hypothesis or fail to reject it? Come to a conclusion in terms of the experiment task. Did the results match up with your expectations?</h3>
<p>Confidence level = 99%</p>
<p>Alpha = .01</p>
<p>t-critical two-tailed = +-2.807</p>
<p>t-statistic = -8.021</p>
<p>r² = .737</p>
<p>Our t-statistic is less than the negative t-critical (-8.021 < -2.807) so we reject the null hypothesis.</p>
<p>This result means that the difference between the congruent and incongruent samples is statistically significant. Based on our r², about 73.7% of the variance in naming times is accounted for by the word condition (congruent or incongruent). The results match my expectations.</p>
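<p>These figures can be reproduced from the descriptive statistics above (a worked sketch; the sample size of 24, and hence 23 degrees of freedom, is an assumption that is consistent with the reported standard error):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math

# Reported descriptive statistics (see question 3 above).
mean_diff = -7.96
std_diff = 4.86
n = 24                              # assumed sample size; gives SE = 4.86 / sqrt(24), roughly 0.99
df = n - 1

se = std_diff / math.sqrt(n)        # standard error of the mean difference
t_stat = mean_diff / se             # about -8.0, matching the reported -8.021 up to rounding
r_squared = t_stat ** 2 / (t_stat ** 2 + df)  # about 0.737

print('SE = {:.2f}, t = {:.2f}, r^2 = {:.3f}'.format(se, t_stat, r_squared))
</code></pre></div>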
<h3>6. Optional: What do you think is responsible for the effects observed? Can you think of an alternative or similar task that would result in a similar effect? Some research about the problem will be helpful for thinking about these two questions!</h3>
<p>Since understanding the meaning of words is an automatic process built by habitual reading, while recognizing colors is not, the brain spends attentional resources on the word itself, which interferes with color recognition.
A similar experiment could show up or down arrows placed randomly above or below a central point (incongruent), and compare the response times with a condition in which up arrows appear above and down arrows appear below the point (congruent).</p>
<h3>REFERENCES:</h3>
<p><a href="https://en.wikipedia.org/wiki/Stroop_effect">https://en.wikipedia.org/wiki/Stroop_effect</a></p>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.com