Udacity Data Analyst Nanodegree

Overview

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.

Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?

The goal of this project is to build a person of interest (POI, which means an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity) identifier based on financial and email data made public as a result of the Enron scandal. Machine learning is an excellent tool for this kind of classification task as it can use patterns discovered from labeled data to infer the classes of new observations.

Our dataset combines the public record of Enron emails and financial data with a hand-generated list of POI’s in the fraud case.

Data Exploration

import sys
import cPickle as pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data, test_classifier

%matplotlib inline
pd.set_option('display.max_columns', None)

### Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)

# dict to dataframe
df = pd.DataFrame.from_dict(data_dict, orient='index')
df.replace('NaN', np.nan, inplace = True)

df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary                       95 non-null float64
to_messages                  86 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
bonus                        82 non-null float64
restricted_stock             110 non-null float64
shared_receipt_with_poi      86 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
loan_advances                4 non-null float64
from_messages                86 non-null float64
other                        93 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          146 non-null bool
director_fees                17 non-null float64
deferred_income              49 non-null float64
long_term_incentive          66 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
dtypes: bool(1), float64(19), object(1)
memory usage: 24.1+ KB
len(df[df['poi']])
18

There are 146 observations and 21 variables in our dataset - 6 email features, 14 financial features and 1 POI label - and they are divided between 18 POI’s and 128 non-POI’s.

There are a lot of missing values; before the data is fed into the machine learning models, they will be filled with zeros.
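A minimal sketch of that fill step, for illustration only (in the code further below the dataframe is converted back to 'NaN' strings and featureFormat performs the actual conversion to zeros; the names used here are just for this sketch):

# Zero-fill the numeric columns; 'poi' (bool) and 'email_address' (text) are left untouched
numeric_cols = df.columns.drop(['poi', 'email_address'])
df_zero_filled = df.copy()
df_zero_filled[numeric_cols] = df_zero_filled[numeric_cols].fillna(0)
print df_zero_filled[numeric_cols].isnull().sum().sum()  # 0 missing values remain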

Outlier Investigation

df.plot.scatter(x = 'salary', y = 'bonus')
<matplotlib.axes._subplots.AxesSubplot at 0x2d0fb38>

[Figure: scatter plot of salary vs. bonus]

There is a salary bigger than 2.5 × 10^7 🤔. It seems too high even for Enron. Let's find out whose it is.

df['salary'].idxmax()
'TOTAL'

This huge salary is the TOTAL of the salaries of the listed employees, so I’m going to remove it.

df.drop('TOTAL', inplace = True)
df.plot.scatter(x = 'salary', y = 'bonus')
<matplotlib.axes._subplots.AxesSubplot at 0xc7f6ef0>

[Figure: scatter plot of salary vs. bonus after removing the TOTAL row]

Create New Features

What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset – explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.

In our dataset we’ve got, for most of the employees, the number of emails sent to POI’s and received from POI’s. However, if an employee sends or receives a lot of emails in general, it is likely that the number exchanged with POI’s will be large as well. This is why we create these two new features:

  • fraction of ‘to_messages’ received from a POI;
  • fraction of ‘from_messages’ sent to a POI.

They can indicate if the majority of an employee’s emails were exchanged with POI’s. In fact, POI’s are grouped together in a scatter plot of the two new features.

df['fraction_from_poi'] = df['from_poi_to_this_person'] / df['to_messages']
df['fraction_to_poi'] = df['from_this_person_to_poi'] / df['from_messages']

ax = df[df['poi'] == False].plot.scatter(x='fraction_from_poi', y='fraction_to_poi', color='blue', label='non-poi')
df[df['poi'] == True].plot.scatter(x='fraction_from_poi', y='fraction_to_poi', color='red', label='poi', ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0xc9a8898>

[Figure: scatter plot of fraction_from_poi vs. fraction_to_poi, POI's in red, non-POI's in blue]

Comparing the results for the final chosen model with and without our new engineered features, we get the following results:

| New Features | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| yes | 0.879 | 0.543 | 0.325 | 0.380 |
| no | 0.879 | 0.543 | 0.325 | 0.380 |

Surprisingly, the mean accuracy, precision, recall and F1 were identical with and without the two engineered features (only the nested cross-validation f1 was slightly lower without them: 0.345 vs. 0.367).

Properly Scale Features

Since we are going to perform a Principal Component Analysis (PCA) to reduce dimensionality later on, and many machine learning models require scaled features, standardization of the features is tested as the first step of our classification pipeline. If it improves the evaluation score of the model, then the chosen final model will include this scaling step.

To accomplish this I use scikit-learn's StandardScaler, which standardizes features by removing the mean and scaling to unit variance.
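As an illustrative sketch of what the scaler does (standardization is applied column by column, z = (x - mean) / std; the salary column is used here only as an example):

from sklearn.preprocessing import StandardScaler

# Fit on a single column just to show the effect: after scaling,
# the column has mean ~0 and standard deviation ~1.
scaler = StandardScaler()
salary_scaled = scaler.fit_transform(df[['salary']].fillna(0))
print "mean: {:.2f}, std: {:.2f}".format(salary_scaled.mean(), salary_scaled.std())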

Intelligently Select Features

The next step in the pipeline is selecting the features that convey the most information to our model.

Leaving some features out has advantages, such as reducing noise in the classification and saving processing time, since there are fewer features to compute.

The chosen method was scikit-learn's SelectKBest using f_classif as the scoring function. The f_classif function computes the ANOVA F-value between labels and features for classification tasks.
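A minimal, self-contained sketch of this selector outside the pipeline (the zero-filled matrix X_demo and integer labels y_demo built here are only for illustration; in the project the selector runs inside the pipeline defined in the Additional Code section):

from sklearn.feature_selection import SelectKBest, f_classif

# Build a zero-filled numeric feature matrix and integer POI labels
feature_cols = df.columns.drop(['poi', 'email_address'])
X_demo = df[feature_cols].replace([np.inf, -np.inf], np.nan).fillna(0).values
y_demo = df['poi'].astype(int).values

# f_classif scores each feature with the ANOVA F-value against the labels;
# SelectKBest keeps the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=15)
X_selected = selector.fit_transform(X_demo, y_demo)
for name, score in sorted(zip(feature_cols[selector.get_support()],
                              selector.scores_[selector.get_support()]),
                          key=lambda pair: -pair[1]):
    print "{}: {:.2f}".format(name, score)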

A few values of k were tested with the aid of a grid search (discussed in a later section); for the final chosen model, the 15 highest-scoring features were kept:

| feature | score |
| --- | --- |
| exercised_stock_options | 22.84690056 |
| total_stock_value | 22.33456614 |
| salary | 16.96091624 |
| bonus | 15.49141455 |
| fraction_to_poi | 13.80595013 |
| restricted_stock | 8.61001147 |
| total_payments | 8.50623857 |
| loan_advances | 7.3499902 |
| shared_receipt_with_poi | 7.06339857 |
| deferred_income | 6.19466529 |
| long_term_incentive | 5.66331492 |
| expenses | 5.28384553 |
| from_poi_to_this_person | 5.05036916 |
| other | 4.42180729 |
| fraction_from_poi | 3.57449894 |

The output of the feature selection was used as input to PCA. The features were projected onto a lower-dimensional space, reducing dimensionality from the 15 selected features to 6 principal components in our final chosen model.
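A minimal sketch of that reduction step on its own (X_selected comes from the illustrative SelectKBest sketch above; in the final model PCA sits between the selector and the classifier in the pipeline):

from sklearn.decomposition import PCA

# Project the 15 selected features onto 6 orthogonal components that
# capture as much of the variance as possible.
reducer = PCA(n_components=6, random_state=42)
X_reduced = reducer.fit_transform(X_selected)
print X_reduced.shape                      # (n_samples, 6)
print reducer.explained_variance_ratio_    # share of variance per component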

Pick an Algorithm

What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?

I ended up using a Gaussian Naïve-Bayes classifier, which scored 0.367 on the nested cross-validation f1. The algorithms tested were:

  • Gaussian Naïve-Bayes;
  • Support Vector Machines;
  • Decision Tree Classifier.

The scores obtained for them are as follows:

| Algorithm | Nested CV f1 |
| --- | --- |
| Gaussian Naïve-Bayes | 0.366984126984 |
| Support Vector Machines | 0.287132034632 |
| Decision Tree Classifier | 0.228430049483 |

Although the other tested models scored better on some of the other evaluation metrics, the nested cross-validation score is the one that best reflects how the model generalizes to unseen data, so the Gaussian Naïve-Bayes was the chosen model.

Tune the Algorithm

What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune – if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).

A crucial part of selecting a machine learning algorithm is adjusting its parameters in order to maximize the evaluation metrics. If the parameters are not properly tuned, the algorithm can underfit or overfit the data, producing suboptimal results.

To tune the algorithms, I used scikit-learn's GridSearchCV. It exhaustively searches for the best combination among the candidate parameter values specified in a grid, choosing the combination that optimizes the chosen scoring function, in our case f1 (the evaluation metrics are discussed further in the 'Usage of Evaluation Metrics' section).
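As a toy illustration of the exhaustive search (the actual pipelines and grids used are listed in the Additional Code section; X_demo and y_demo come from the earlier SelectKBest sketch), GridSearchCV fits every combination of the candidate values, here 2 × 3 = 6 of them, scores each with cross-validated f1 and keeps the best one:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 2 criteria x 3 split sizes = 6 candidate parameter combinations
toy_grid = {'criterion': ['gini', 'entropy'],
            'min_samples_split': [2, 4, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), toy_grid, scoring='f1', cv=5)
search.fit(X_demo, y_demo)
print search.best_params_   # combination with the highest mean cv f1
print search.best_score_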

Validation Strategy

What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?

Validation in machine learning consists of evaluating a model on data that was not touched during the training process. A classic mistake is to ignore this rule: the results become overly optimistic because the model overfits the training data, while performance on unseen data is very poor.

It is good practice to separate the data into three parts: training, cross-validation and test sets. The model is tuned to maximize the evaluation score on the cross-validation set, and the final model's performance is then measured on the test set.

Since there are too few observations to set aside large training and test sets, and in order to extract the most information from the data, the selected strategy to validate our model was a nested stratified shuffle-split cross-validation.

This strategy effectively uses a series of train/validation/test set splits. In the inner loop, the score is approximately maximized by fitting a model to each training set, and then directly maximized when selecting (hyper)parameters over the validation set. In the outer loop, generalization error is estimated by averaging test-set scores over several dataset splits. All sets are picked randomly, but the proportion of class labels is kept the same in each split.
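A self-contained sketch of the nesting, using the illustrative X_demo and y_demo from the SelectKBest sketch above and a simplified pipeline and grid (the project's evaluate_model function in the Additional Code section follows the same pattern, reusing one splitter for both loops):

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# Inner split: GridSearchCV picks hyperparameters on it.
# Outer split: cross_val_score estimates how the whole tuning procedure generalizes.
inner_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
outer_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=1)

demo_pipe = Pipeline([('selector', SelectKBest()),
                      ('reducer', PCA(random_state=42)),
                      ('classifier', GaussianNB())])
demo_grid = {'selector__k': [10, 15], 'reducer__n_components': [4, 6]}

nested = cross_val_score(GridSearchCV(demo_pipe, demo_grid, scoring='f1', cv=inner_cv),
                         X=X_demo, y=y_demo, scoring='f1', cv=outer_cv)
print "Nested f1: {}".format(nested.mean())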

Usage of Evaluation Metrics

Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

For classification algorithms, some of the most common evaluation metrics are accuracy, precision, recall and the f1 score.

  • Accuracy is the ratio of correct classifications to the total number of predictions. Since the POI/non-POI distribution is very uneven, accuracy does not mean much here: a model that always predicts non-POI would get an accuracy of 87.6%, an apparently good score for a terrible classifier.

  • Precision is the ratio of correct classifications over all observations with a given predicted label; for example, the ratio of true POI’s over all predicted POI’s.

  • Recall is the ratio of correct classifications over all observations that truly belong to a given class; for example, the ratio of observations correctly labeled POI over all true POI’s.

  • F1 balances precision and recall, and is given by the following formula:

$$F1 = 2 * (precision * recall) / (precision + recall)$$
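As a sanity check with the hold-out counts that test_classifier reports for the final model in the Additional Code section (612 true positives, 752 false positives, 1388 false negatives):

$$precision = 612 / (612 + 752) \approx 0.449 \qquad recall = 612 / (612 + 1388) \approx 0.306$$

$$F1 = 2 * (0.449 * 0.306) / (0.449 + 0.306) \approx 0.364$$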

For the final selected model, the average scores were the following:

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| GaussianNB | 0.879310344828 | 0.543333333333 | 0.325 | 0.38 |

Additional Code

### The first feature must be "poi".
features_list = ['poi', 'salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments',
                 'loan_advances', 'other', 'expenses', 'director_fees', 'total_payments', 
                 'exercised_stock_options', 'restricted_stock', 'restricted_stock_deferred', 
                 'total_stock_value', 'to_messages', 'from_messages', 'from_this_person_to_poi', 
                 'from_poi_to_this_person', 'shared_receipt_with_poi', 'fraction_from_poi', 'fraction_to_poi']

### Convert the cleaned dataframe back into the dictionary format expected by featureFormat
filled_df = df.fillna(value='NaN') # featureFormat expects 'NaN' strings
data_dict = filled_df.to_dict(orient='index')

### Store to my_dataset for easy export below.
my_dataset = data_dict

### Extract features and labels from dataset for local testing
data = featureFormat(my_dataset, features_list, sort_keys = True)
y, X = targetFeatureSplit(data)
X = np.array(X)
y = np.array(y)

### Cross-validation
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

SCALER = [None, StandardScaler()]
SELECTOR__K = [10, 13, 15, 18, 'all']
REDUCER__N_COMPONENTS = [2, 4, 6, 8, 10]
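# evaluate_model: nested cross-validation helper.
# 1) cross_val_score over the outer splits gives the nested f1 estimate;
# 2) the grid is then refit on all the data to report the best parameters;
# 3) the best estimator is re-fit and scored on each split to report mean
#    accuracy, precision, recall and f1.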
def evaluate_model(grid, X, y, cv):
    nested_score = cross_val_score(grid, X=X, y=y, cv=cv, n_jobs=-1)
    print "Nested f1 score: {}".format(nested_score.mean())

    grid.fit(X, y)    
    print "Best parameters: {}".format(grid.best_params_)

    cv_accuracy = []
    cv_precision = []
    cv_recall = []
    cv_f1 = []
    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        grid.best_estimator_.fit(X_train, y_train)
        pred = grid.best_estimator_.predict(X_test)

        cv_accuracy.append(accuracy_score(y_test, pred))
        cv_precision.append(precision_score(y_test, pred))
        cv_recall.append(recall_score(y_test, pred))
        cv_f1.append(f1_score(y_test, pred))

    print "Mean Accuracy: {}".format(np.mean(cv_accuracy))
    print "Mean Precision: {}".format(np.mean(cv_precision))
    print "Mean Recall: {}".format(np.mean(cv_recall))
    print "Mean f1: {}".format(np.mean(cv_f1))

Gaussian Naïve-Bayes

### comment to perform a full hyperparameter search
# SCALER = [None]
# SELECTOR__K = [15]
# REDUCER__N_COMPONENTS = [6]
###################################################

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('reducer', PCA(random_state=42)),
        ('classifier', GaussianNB())
    ])

param_grid = {
    'scaler': SCALER,
    'selector__k': SELECTOR__K,
    'reducer__n_components': REDUCER__N_COMPONENTS
}

gnb_grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=sss)

evaluate_model(gnb_grid, X, y, sss)

test_classifier(gnb_grid.best_estimator_, my_dataset, features_list)
Nested f1 score: 0.366984126984


C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)


Best parameters: {'reducer__n_components': 6, 'selector__k': 15, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38


C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\feature_selection\univariate_selection.py:113: UserWarning: Features [5] are constant.
  UserWarning)


Pipeline(steps=[('scaler', None), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=6, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', GaussianNB(priors=None))])
    Accuracy: 0.85733   Precision: 0.44868  Recall: 0.30600 F1: 0.36385 F2: 0.32678
    Total predictions: 15000    True positives:  612    False positives:  752   False negatives: 1388   True negatives: 12248
kbest = gnb_grid.best_estimator_.named_steps['selector']

features_array = np.array(features_list)
features_array = np.delete(features_array, 0)
indices = np.argsort(kbest.scores_)[::-1]
k_features = kbest.get_support().sum()

features = []
for i in range(k_features):
    features.append(features_array[indices[i]])

features = features[::-1]
scores = kbest.scores_[indices[range(k_features)]][::-1]

plt.barh(range(k_features), scores)
plt.yticks(np.arange(0.4, k_features), features)
plt.title('SelectKBest Feature Importances')
plt.show()

[Figure: horizontal bar chart of SelectKBest feature scores]

# Without the engineered features:
# drop the last two columns (fraction_from_poi, fraction_to_poi)
X_2 = np.delete(X, -1, 1)
X_2 = np.delete(X_2, -1, 1)

evaluate_model(gnb_grid, X_2, y, sss)
Nested f1 score: 0.345079365079
Best parameters: {'reducer__n_components': 6, 'selector__k': 13, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38

Support Vector Machine Classifier

C_PARAM = np.logspace(-2, 3, 6)
GAMMA_PARAM = np.logspace(-4, 1, 6)
CLASS_WEIGHT = ['balanced', None]
KERNEL = ['rbf', 'sigmoid']

### comment to perform a full hyperparameter search
# SCALER = [StandardScaler()]
# SELECTOR__K = [18]
# REDUCER__N_COMPONENTS = [10]
# C_PARAM = [100]
# GAMMA_PARAM = [.01]
# CLASS_WEIGHT = ['balanced']
# KERNEL = ['sigmoid']
###################################################

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('reducer', PCA(random_state=42)),
        ('classifier', SVC())
    ])

param_grid = {
    'scaler': SCALER,
    'selector__k': SELECTOR__K,
    'reducer__n_components': REDUCER__N_COMPONENTS,
    'classifier__C': C_PARAM,
    'classifier__gamma': GAMMA_PARAM,
    'classifier__class_weight': CLASS_WEIGHT,
    'classifier__kernel': KERNEL
}

svc_grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=sss)

evaluate_model(svc_grid, X, y, sss)

test_classifier(svc_grid.best_estimator_, my_dataset, features_list)
Nested f1 score: 0.287132034632
Best parameters: {'reducer__n_components': 10, 'selector__k': 18, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__class_weight': 'balanced', 'classifier__gamma': 0.01, 'classifier__kernel': 'sigmoid', 'classifier__C': 100.0}
Mean Accuracy: 0.827586206897
Mean Precision: 0.460887445887
Mean Recall: 0.8
Mean f1: 0.566651681652
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=18, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('cla...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
    Accuracy: 0.76920   Precision: 0.33595  Recall: 0.74850 F1: 0.46375 F2: 0.60092
    Total predictions: 15000    True positives: 1497    False positives: 2959   False negatives:  503   True negatives: 10041

Decision Tree Classifier

CRITERION = ['gini', 'entropy']
SPLITTER = ['best', 'random']
MIN_SAMPLES_SPLIT = [2, 4, 6, 8]
CLASS_WEIGHT = ['balanced', None]

### comment to perform a full hyperparameter search
# SCALER = [StandardScaler()]
# SELECTOR__K = [18]
# REDUCER__N_COMPONENTS = [2]
# CRITERION = ['gini']
# SPLITTER = ['random']
# MIN_SAMPLES_SPLIT = [8]
# CLASS_WEIGHT = ['balanced']
###################################################

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('reducer', PCA(random_state=42)),
        ('classifier', DecisionTreeClassifier())
    ])

param_grid = {
    'scaler': SCALER,
    'selector__k': SELECTOR__K,
    'reducer__n_components': REDUCER__N_COMPONENTS,
    'classifier__criterion': CRITERION,
    'classifier__splitter': SPLITTER,
    'classifier__min_samples_split': MIN_SAMPLES_SPLIT,
    'classifier__class_weight': CLASS_WEIGHT,
}

tree_grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=sss)

evaluate_model(tree_grid, X, y, sss)

test_classifier(tree_grid.best_estimator_, my_dataset, features_list)
Nested f1 score: 0.228430049483
Best parameters: {'reducer__n_components': 4, 'selector__k': 15, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__min_samples_split': 8, 'classifier__class_weight': 'balanced', 'classifier__splitter': 'random', 'classifier__criterion': 'gini'}
Mean Accuracy: 0.758620689655
Mean Precision: 0.325331890332
Mean Recall: 0.425
Mean f1: 0.321083916084
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=4, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('clas...=8, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random'))])
    Accuracy: 0.73587   Precision: 0.24677  Recall: 0.47800 F1: 0.32550 F2: 0.40256
    Total predictions: 15000    True positives:  956    False positives: 2918   False negatives: 1044   True negatives: 10082

References