<h1>Flight Performances for each Carrier in 2016</h1>
<p>The code for this vis can be found <a href="https://github.com/schiller/flight-delays-visualization" target="_blank">here</a>.</p>
<p>And you can check it out <a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/" target="_blank">here</a>.</p>
<p><a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/" alt="Visualisation preview" target="_blank"><img src="/assets/images/make-effective-data-visualisation/preview.png"></a></p>
<h2>Summary</h2>
<p>The chart shows the monthly percentages of flight delays and cancellations/diversions for each carrier in the year of 2016. The carriers are sorted by overall delay performance, and the total number of flights for each one is also depicted.</p>
<h2>Design</h2>
<p>I chose to draw a main stacked bar chart with the following visual encodings:</p>
<ul>
<li>The ratio of delayed or cancelled/diverted flights to total flights is represented vertically on the y axis;</li>
<li>Months are displayed horizontally on the x axis;</li>
<li>Delays and cancellations/diversions are represented by different colors.</li>
</ul>
<p>There is also a secondary bar chart with the following visual encodings:</p>
<ul>
<li>Carrier codes are displayed vertically;</li>
<li>Total flights for each carrier are represented by the lengths of the horizontal bars.</li>
</ul>
<p>At first I made the stacked bars show the number of delayed flights, but the y axis scale changed too much between carriers, so I changed it to show ratios, making the scales comparable.
I chose not to show “on time” flights on the chart, so I could zoom in on the scale, allowing a better view of the delays and cancellations/diversions.
The horizontal bars on the secondary chart were chosen so that viewers could reason about the number of flights each carrier had, and also to make it possible to order carriers by overall performance. This way the order of the bars contributes to the chart’s storytelling.</p>
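<p>The published chart itself is built with d3/dimple.js, but the ratio computation can be sketched in a few lines of pandas. This is a minimal sketch rather than the code behind the vis: the file name and the column names (carrier, month, arr_del15, cancelled, diverted) are assumptions loosely following the BTS on-time performance dataset.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Hypothetical input: one row per flight, with assumed column names.
flights = pd.read_csv('flights_2016.csv')

# Treat a flight as cancelled/diverted if either flag is set.
flights['cancelled_diverted'] = ((flights['cancelled'] == 1) |
                                 (flights['diverted'] == 1)).astype(int)

g = flights.groupby(['carrier', 'month'])
summary = pd.DataFrame({
    'total': g.size(),
    'delayed': g['arr_del15'].sum(),
    'cancelled_diverted': g['cancelled_diverted'].sum(),
})

# Ratios plotted on the y axis of the stacked bars.
summary['delayed_ratio'] = summary['delayed'] / summary['total']
summary['cancelled_diverted_ratio'] = summary['cancelled_diverted'] / summary['total']

# Total flights per carrier, used to order and size the secondary bar chart.
totals = summary['total'].groupby(level='carrier').sum().sort_values()
</code></pre></div>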
<p>After collecting feedback I changed the following:</p>
<ul>
<li>Ensured the months appear in the correct order on Firefox as well;</li>
<li>Changed the chart title from “2016 Flight Delays by Cause” to “Flight Performances for each Carrier in 2016”, and then to “Which air carrier had the worst performance in 2016?”;</li>
<li>Added a “References” section to communicate the source of the dataset;</li>
<li>Changed the y axis label from “Ratio” to “Flights Ratio”, and made it show percentage values;</li>
<li>Made the x axis of the secondary bar chart visible, to make it clearer that it was also a chart;</li>
<li>Fixed the height of the stacked bar chart, so that the “Month” label would be visible;</li>
<li>Switched month labels to abbreviations instead of numbers;</li>
<li>Stopped showing delays divided into causes, and instead displayed only total delays;</li>
<li>Aggregated cancellations and diversions;</li>
<li>Changed the animation instructions message into a play/pause clickable button;</li>
<li>Made the animation stop at the end of the first cycle;</li>
<li>Changed from carrier names to codes on the secondary horizontal bar chart y axis;</li>
<li>Updated the secondary horizontal bar chart colors to grayscale.</li>
</ul>
<h3>Chart Versions</h3>
<ul>
<li><a href="https://bl.ocks.org/schiller/raw/8f340fe633cfdc7346b51058f36dada7/">Current</a></li>
<li><a href="https://bl.ocks.org/schiller/raw/ed5ea5c6199d2700d2c0458d5a8079e5/">Second</a></li>
<li><a href="https://bl.ocks.org/schiller/raw/7eb7e5f8236f5820f4b63e268a541884/">First</a></li>
</ul>
<h2>Feedback</h2>
<h3>Laurent de Vito</h3>
<p>“Hi,
Interestingly, in Firefox, the months are labeled 12,7,8,… whereas they are correctly labeled in Chromium, but usually, we cannot do much about it.
I find the title a bit misleading since you report not only the flights that were delayed but also those that were canceled.
Furthermore, could you please cite your sources ?
Overall, nicely done!”</p>
<h3>Morgana Secco (my wife)</h3>
<p>“The y axis show a ratio between what?
You should make it clearer that the horizontal bars on the right display the total flights for each carrier.
There is no month label on the x axis.”</p>
<h3>tianchuanting</h3>
<p>“Hi Luiz,</p>
<p>After spending a minute or two looking at your visualisation, my impression is that it is a very well made visualization. I especially like the small details you put into it, like the tooltip and animated guideline. And here is a list of feedback for you consideration.</p>
<ol>
<li>I had some difficulty understanding what the vertical axis ‘flight delay’ ratio means. Maybe using something like % of delayed flight might be intuitive.</li>
<li>Similarly, It took me a while to get what the 1-12 on the horizontal axis is presenting, maybe using month abbrev (Jan, Feb etc) instead will be a better idea.
LT”</li>
</ol>
<h3>John Enyeart</h3>
<ul>
<li><p>“It would probably be easier to read if you used month names instead of the numbers 1-12 on your x-axis.</p></li>
<li><p>The biggest cause of delay is “NAS”, and I have no idea what that is, so an explanation would be nice.</p></li>
<li><p>You might also consider putting in the option to switch the y-axis between ratio and number of flights.</p></li>
<li><p>Not sure how I feel about the stacked bar chart in terms of readability. Take a look at the following articles:</p></li>
</ul>
<p><a href="http://www.storytellingwithdata.com/blog/2012/11/to-stack-or-not-to-stack">storytellingwithdata.com - to stack or not to stack</a></p>
<p><a href="https://solomonmessing.wordpress.com/2014/10/11/when-to-use-stacked-barcharts/">https://solomonmessing.wordpress.com/2014/10/11/when-to-use-stacked-barcharts/</a>“</p>
<h3>martin-martin</h3>
<p>"Hello @luizschiller!</p>
<p>That’s a great visualization you are working on here! I agree that it seems you’re putting effort in the details, and it shows : )</p>
<p>Here’s my feedback:</p>
<ol>
<li>The encoding of the amount of flights that the airlines each have is very innovative and I haven’t seen this around yet. Great idea :+1: - looks really interesting!</li>
<li>I was initially confused about what is going on in the graph since it was changing so quickly. I generally prefer if I have the choice to first orient in a visualisation before starting the reel. If you want to have it running right when the user accesses the page, maybe you could make the instructional message on how to start/stop it more obvious (e.g. it could be presented as a clickable button!)</li>
<li>The tickmarks under the months are different than the ones in the rest of your visualisation. Generally you display ticks where the value descriptions are - but here they are in between the data points. I’d suggest to keep this consistent and simply move the ticks into the middle of the columns</li>
<li>What is the NAS value about? Most of the options in the legend on top are somewhat self-explanatory, however not all of them are. And without the context of the fact that they are reasons for delays, the correct interpretation becomes even more difficult. A good legend should also have a title explaining what it’s explaining. - Potentially the graphs title could also fulfill this function, but currently it says “Flight Performances” (which is overall better fitting, yet doesn’t explain that you’re displaying “Reasons for flight delays”, encoded with the different colors).</li>
</ol>
<p>Hope this helps, and great job!
Keep it up and you already have a great piece of data viz! : )”</p>
<h2>Resources</h2>
<ul>
<li><a href="http://dimplejs.org/examples_viewer.html?id=bars_vertical_stacked">http://dimplejs.org/examples_viewer.html?id=bars_vertical_stacked</a></li>
<li><a href="http://dimplejs.org/advanced_examples_viewer.html?id=advanced_storyboard_control">http://dimplejs.org/advanced_examples_viewer.html?id=advanced_storyboard_control</a></li>
<li><a href="https://codepen.io/mistkaes/pen/WvPrJL">https://codepen.io/mistkaes/pen/WvPrJL</a></li>
</ul>
<h1>Design an A/B Test</h1>
<h2>Experiment Design</h2>
<p>This project was made as part of Udacity’s Data Analyst Nanodegree.</p>
<p>The project instructions can be found here: <a href="https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True">https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True</a></p>
<h3>Metric Choice</h3>
<h4>Invariant Metrics:</h4>
<ul>
<li>Number of Cookies;</li>
<li>Number of Clicks;</li>
<li>Click-through Probability.</li>
</ul>
<p>Visiting the course overview page and clicking on the “start free trial” button both happen before the free trial screener is triggered, so these metrics should not differ between the control and experiment groups.</p>
<h4>Evaluation Metrics:</h4>
<ul>
<li>Gross Conversion: enrollments / clicks should be a good evaluation metric. It measures whether the proposed change really discourages users who report fewer than 5 hours of study per week from enrolling. This metric is expected to decrease significantly in order to launch the experiment.</li>
<li>Net Conversion: payments / clicks should also be a good evaluation metric. It measures whether the free trial screener changes the proportion of students who remain enrolled past the 14-day boundary after starting a free trial. This metric is expected not to decrease significantly in order to launch the experiment, since the students who complete payments usually dedicate 5 or more hours per week to studying.</li>
</ul>
<h4>Unused Metrics:</h4>
<ul>
<li>Number of user-ids: the number of enrollments could potentially be used as an evaluation metric, but it would be redundant given gross conversion. Also, comparing raw counts of user-ids assumes the control and experiment groups are equally sized, which is not always true.</li>
<li>Retention: payments / enrollments would be an ideal metric for this experiment, except for the experiment size needed for a sufficiently powerful test: at least 17 weeks would be required to complete the experiment, which is too long. This metric would be expected to show a significant increase in order to launch the experiment.</li>
</ul>
<h3>Measuring Standard Deviation</h3>
<ul>
<li>Gross Conversion: 0.0202</li>
<li>Net Conversion: 0.0156</li>
</ul>
<p>In both cases, the empirical and analytical variabilities are expected to be comparable, because the unit of diversion (cookies) and the unit of analysis (cookies) are the same.</p>
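<p>As a quick check, these figures follow from the usual binomial standard error, sqrt(p * (1 - p) / n). The sketch below assumes the baseline values from the project’s baseline table (40,000 daily pageviews, 3,200 daily clicks, 660 daily enrollments and a 0.53 probability of payment given enrollment), scaled to the sample of 5,000 cookies used for the variability estimates; if your copy of the instructions differs, treat these numbers as assumptions.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

# Baseline values (assumed from the project's baseline table).
pageviews = 40000.0
clicks = 3200.0
enrollments = 660.0
p_payment_given_enroll = 0.53

gross_conversion = enrollments / clicks                         # 0.20625
net_conversion = enrollments * p_payment_given_enroll / clicks  # ~0.1093

# Scale to a sample of 5,000 pageviews; both metrics use clicks as denominator.
sample_clicks = clicks / pageviews * 5000                       # 400 clicks

se_gross = sqrt(gross_conversion * (1 - gross_conversion) / sample_clicks)
se_net = sqrt(net_conversion * (1 - net_conversion) / sample_clicks)

print(round(se_gross, 4))  # 0.0202
print(round(se_net, 4))    # 0.0156
</code></pre></div>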
<h3>Sizing</h3>
<h4>Number of Samples vs. Power</h4>
<p>I’m not using the Bonferroni correction, since the metrics are correlated and I need a specific combination of results across all metrics in order to recommend the change, so the correction would be too conservative.
Number of Pageviews: 679,300.</p>
<h4>Duration vs. Exposure</h4>
<p>I would divert 100% of the traffic, which would lead to a duration of 17 days.
The experiment introduces a popup on the site, which is one more step on the way to enrolling in a free trial. This change does not present physical, psychological, emotional, social or economic risks above minimal risk.
If a student enrolls in the free trial, their data becomes personally identifiable, so there has to be an agreement on privacy policies covering that data, even though the collected information is not sensitive and does not involve political attitudes, financial or health data, for example.
Based on the factors cited above, I chose to divert all traffic.</p>
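<p>As a rough cross-check of the pageview and duration figures above, the sketch below uses the standard two-proportion sample size approximation with alpha = 0.05 and beta = 0.2, and the same assumed baseline values as in the standard deviation sketch. The 679,300 pageviews quoted above came from an online calculator, so this approximation lands close to it (around 685,000) rather than exactly on it.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

def required_samples(p, d_min, z_alpha=1.96, z_beta=0.8416):
    # Per-group sample size for a two-proportion z-test (normal approximation).
    p2 = p + d_min
    sd1 = sqrt(2 * p * (1 - p))
    sd2 = sqrt(p * (1 - p) + p2 * (1 - p2))
    return (z_alpha * sd1 + z_beta * sd2) ** 2 / d_min ** 2

# Net conversion drives the sizing (baseline ~0.1093, d_min = 0.0075).
clicks_per_group = required_samples(0.1093125, 0.0075)  # ~27,400 clicks
ctp = 3200.0 / 40000.0                                   # click-through probability
pageviews_needed = 2 * clicks_per_group / ctp            # ~685,000 pageviews

days = pageviews_needed / 40000.0  # ~17 days with 100% of traffic diverted
print(round(pageviews_needed))
print(round(days, 1))
</code></pre></div>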
<h2>Experiment Analysis</h2>
<h3>Sanity Checks</h3>
<p>Number of Cookies: CI = (0.4988, 0.5012), Observed = 0.5006, pass</p>
<p>Number of Clicks on “Start free trial”: CI = (0.4959, 0.5041), Observed = 0.5005, pass</p>
<p>Click-through-probability on “Start free trial”: CI = (-0.0013, 0.0013), Observed = 0.0001, pass</p>
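<p>For the count-based invariants, the check is a binomial confidence interval around the 0.5 split expected under equal diversion; the click-through-probability check compares the difference in proportions against zero instead. A minimal sketch of the count-based version (the actual cookie and click totals come from the experiment spreadsheet and are not repeated here):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from math import sqrt

def count_sanity_check(n_control, n_experiment, z=1.96):
    # Under equal diversion, each cookie lands in the control group with p = 0.5.
    total = float(n_control + n_experiment)
    se = sqrt(0.5 * 0.5 / total)
    ci = (0.5 - z * se, 0.5 + z * se)
    observed = n_control / total
    passed = ci[0] &lt;= observed &lt;= ci[1]
    return ci, observed, passed

# Hypothetical usage, with the spreadsheet totals plugged in:
# ci, observed, passed = count_sanity_check(cookies_control, cookies_experiment)
</code></pre></div>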
<h3>Result Analysis</h3>
<h4>Effect Size Tests</h4>
<p>Gross Conversion: CI = (-0.0291, -0.0120), dmin = 0.0100, statistically and practically significant.</p>
<p>Net Conversion: CI = (-0.0116, 0.0019), dmin = 0.0075, not statistically nor practically significant.</p>
<h4>Sign Tests</h4>
<p>Gross Conversion: p-value = 0.0026, statistically significant.</p>
<p>Net Conversion: p-value = 0.6776, not statistically significant.</p>
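<p>The sign test p-values can be reproduced with a two-tailed binomial test on the day-by-day comparison. The day counts below (4 of 23 days with higher gross conversion in the experiment group, 10 of 23 for net conversion) are assumptions consistent with the reported p-values, not figures quoted from the spreadsheet:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from scipy.stats import binom_test

# Two-tailed binomial test: days where the experiment group was higher,
# out of the days with complete enrollment/payment data (assumed counts).
print(binom_test(4, 23, 0.5))   # ~0.0026 (gross conversion)
print(binom_test(10, 23, 0.5))  # ~0.6776 (net conversion)
</code></pre></div>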
<h4>Summary</h4>
<p>The Bonferroni correction is designed to reduce the risk of false positives when the launch of an experiment is conditioned on any one of a set of tests matching expectations. In my case, I need all of the tests to match my expectations in order to launch the experiment, so I’m not using the Bonferroni correction.
There were no discrepancies between the effect size tests and the sign tests.</p>
<h3>Recommendation</h3>
<p>The encountered results for the evaluation metrics were:</p>
<ul>
<li>Gross Conversion: in order to launch the experiment, this metric should show a statistically and practically significant decrease, which is exactly what the tests found.</li>
<li>Net Conversion: the confidence interval found for the effect size on net conversion includes the negative practical significance threshold. This means there is a chance, at an alpha of 0.05, that this metric suffered a practically significant decrease. It would be possible to repeat the experiment with more power, but it is unlikely that this trend would change.
Since I need both metrics to match my expectations and I cannot conclude that net conversion has not decreased, my recommendation is not to launch the experiment.</li>
</ul>
<h2>Follow-Up Experiment</h2>
<p>One follow-up experiment that could reduce early cancellations would be the following: when a student clicked on the “start free trial” button, a message would appear informing them that the course usually requires 5 or more hours of dedication per week, and they would be asked to block out the hours they will commit to the course on their agenda or calendar in order to proceed. There would be a checkbox saying “I have reserved the hours I will commit to the course”, and a “next” button that would stay disabled until the checkbox was checked. The next button would then lead to the usual checkout process.
This may seem similar to the attempted experiment, but it has an important difference: it does not suggest that students try the free course materials instead of engaging in the free trial. Maybe this fact could make a significant difference in the observed effects.
The hypothesis is that this new change might cause some students, who would otherwise not do so, to organize themselves and reserve some hours per week to study. This would reduce the number of students who abandon the free trial without significantly reducing the number of students who eventually complete the course.
The metrics would be gross and net conversion. They measure, respectively, the number of enrollments and the number of payments per click on the “start free trial” button. Combined, they can show whether the hypothesis holds. Also, as calculated for the attempted experiment, they are feasible in terms of experiment size.
The unit of diversion would be a cookie, and the invariant metrics could be the number of course pageviews and the number of clicks on “start free trial”.</p>
<h1>Investigating the Enron Fraud with Machine Learning</h1>
<h4>Udacity Data Analyst Nanodegree</h4>
<h2>Overview</h2>
<p>In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives.</p>
<blockquote>
<p>Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?</p>
</blockquote>
<p>The goal of this project is to build a person of interest (POI, which means an individual who was indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity) identifier based on financial and email data made public as a result of the Enron scandal. Machine learning is an excellent tool for this kind of classification task as it can use patterns discovered from labeled data to infer the classes of new observations.</p>
<p>Our dataset combines the public record of Enron emails and financial data with a hand-generated list of POI’s in the fraud case.</p>
<h2>Data Exploration</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">cPickle</span> <span class="kn">as</span> <span class="nn">pickle</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span><span class="p">,</span> <span class="n">StratifiedShuffleSplit</span><span class="p">,</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">SelectKBest</span>
<span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">GaussianNB</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span>
<span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">precision_score</span><span class="p">,</span> <span class="n">recall_score</span><span class="p">,</span> <span class="n">f1_score</span>
<span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">"../tools/"</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">feature_format</span> <span class="kn">import</span> <span class="n">featureFormat</span><span class="p">,</span> <span class="n">targetFeatureSplit</span>
<span class="kn">from</span> <span class="nn">tester</span> <span class="kn">import</span> <span class="n">dump_classifier_and_data</span><span class="p">,</span> <span class="n">test_classifier</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s">'display.max_columns'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
<span class="c">### Load the dictionary containing the dataset</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"final_project_dataset.pkl"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">data_file</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">data_file</span><span class="p">)</span>
<span class="c"># dict to dataframe</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span class="n">data_dict</span><span class="p">,</span> <span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'NaN'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">C:\Users\schil\Anaconda2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary 95 non-null float64
to_messages 86 non-null float64
deferral_payments 39 non-null float64
total_payments 125 non-null float64
exercised_stock_options 102 non-null float64
bonus 82 non-null float64
restricted_stock 110 non-null float64
shared_receipt_with_poi 86 non-null float64
restricted_stock_deferred 18 non-null float64
total_stock_value 126 non-null float64
expenses 95 non-null float64
loan_advances 4 non-null float64
from_messages 86 non-null float64
other 93 non-null float64
from_this_person_to_poi 86 non-null float64
poi 146 non-null bool
director_fees 17 non-null float64
deferred_income 49 non-null float64
long_term_incentive 66 non-null float64
email_address 111 non-null object
from_poi_to_this_person 86 non-null float64
dtypes: bool(1), float64(19), object(1)
memory usage: 24.1+ KB
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]])</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">18
</code></pre></div>
<p>There are 146 observations and 21 variables in our dataset - 6 email features, 14 financial features and 1 POI label - and they are divided between 18 POI’s and 128 non-POI’s.</p>
<p>There are a lot of missing values, so before the data is fed into the machine learning models they will be filled with zeros.</p>
<h2>Outlier Investigation</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'salary'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'bonus'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0x2d0fb38>
</code></pre></div>
<p><img src="/assets/images/enron/output_4_1.png" alt="png"></p>
<p>There is a salary bigger than 2.5 * 10<sup>7</sup> 🤔. That seems like too much even for Enron. Let’s find out whose it is.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'salary'</span><span class="p">]</span><span class="o">.</span><span class="n">idxmax</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">'TOTAL'
</code></pre></div>
<p>This huge salary is the TOTAL of the salaries of the listed employees, so I’m going to remove it.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'TOTAL'</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="s">'salary'</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="s">'bonus'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xc7f6ef0>
</code></pre></div>
<p><img src="/assets/images/enron/output_8_1.png" alt="png"></p>
<h2>Create New Features</h2>
<blockquote>
<p>What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset – explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.</p>
</blockquote>
<p>In our dataset we’ve got the number of emails sent to POI’s and received from POI’s for most of the employees. However, if an employee sends or receives a lot of emails in general, it is likely that the quantity sent to or received from POI’s would be large as well. This is why we are creating these two new features:</p>
<ul>
<li>fraction of ‘to_messages’ received from a POI;</li>
<li>fraction of ‘from_messages’ sent to a POI.</li>
</ul>
<p>They can indicate if the majority of an employee’s emails were exchanged with POI’s. In fact, POI’s are grouped together in a scatter plot of the two new features. </p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="p">[</span><span class="s">'fraction_from_poi'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_poi_to_this_person'</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">'to_messages'</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">'fraction_to_poi'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_this_person_to_poi'</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">'from_messages'</span><span class="p">]</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'fraction_to_poi'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'non-poi'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'poi'</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'fraction_to_poi'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'poi'</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xc9a8898>
</code></pre></div>
<p><img src="/assets/images/enron/output_10_1.png" alt="png"></p>
<p>Comparing the results for the final chosen model with and without our new engineered features, we get the following results:</p>
<table><thead>
<tr>
<th>New Features</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead><tbody>
<tr>
<td>yes</td>
<td>0.879</td>
<td>0.543</td>
<td>0.325</td>
<td>0.380</td>
</tr>
<tr>
<td>no</td>
<td>0.879</td>
<td>0.543</td>
<td>0.325</td>
<td>0.380</td>
</tr>
</tbody></table>
<p>Surprisingly the results were the same with and without the two engineered features.</p>
<h2>Properly Scale Features</h2>
<p>Since we are going to perform a Principal Component Analysis (PCA) to reduce dimensionality later on, and many machine learning models ask for scaled features, a standardization of the features is going to be tested as the first step of our classification pipeline. If it improves the evaluation score of the model then the chosen final model will have this scaling step.</p>
<p>To accomplish this I use the StandardScaler class from scikit-learn, which standardizes features by removing the mean and scaling to unit variance.</p>
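<p>A toy illustration of the effect (not part of the project pipeline): after scaling, each column has mean ~0 and standard deviation ~1, so features on very different scales become comparable.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: a money-like feature and an email-count-like feature.
X_toy = np.array([[200000.0, 10.0],
                  [1000000.0, 50.0],
                  [300000.0, 300.0]])

X_scaled = StandardScaler().fit_transform(X_toy)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
</code></pre></div>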
<h2>Intelligently Select Features</h2>
<p>The next step in the pipeline is selecting the features that convey the most information to our model.</p>
<p>Leaving some features behind has advantages, like reducing noise in the classification and saving processing time, since there are fewer features to compute.</p>
<p>The chosen method was scikit-learn’s SelectKBest using f_classif as the scoring function. The f_classif function computes the ANOVA F-value between labels and features for classification tasks.</p>
<p>A few feature counts were tested with the aid of a grid search (discussed in a later section), and for the final chosen model the 15 most important features were kept:</p>
<table><thead>
<tr>
<th>feature</th>
<th>score</th>
</tr>
</thead><tbody>
<tr>
<td>exercised_stock_options</td>
<td>22.84690056</td>
</tr>
<tr>
<td>total_stock_value</td>
<td>22.33456614</td>
</tr>
<tr>
<td>salary</td>
<td>16.96091624</td>
</tr>
<tr>
<td>bonus</td>
<td>15.49141455</td>
</tr>
<tr>
<td>fraction_to_poi</td>
<td>13.80595013</td>
</tr>
<tr>
<td>restricted_stock</td>
<td>8.61001147</td>
</tr>
<tr>
<td>total_payments</td>
<td>8.50623857</td>
</tr>
<tr>
<td>loan_advances</td>
<td>7.3499902</td>
</tr>
<tr>
<td>shared_receipt_with_poi</td>
<td>7.06339857</td>
</tr>
<tr>
<td>deferred_income</td>
<td>6.19466529</td>
</tr>
<tr>
<td>long_term_incentive</td>
<td>5.66331492</td>
</tr>
<tr>
<td>expenses</td>
<td>5.28384553</td>
</tr>
<tr>
<td>from_poi_to_this_person</td>
<td>5.05036916</td>
</tr>
<tr>
<td>other</td>
<td>4.42180729</td>
</tr>
<tr>
<td>fraction_from_poi</td>
<td>3.57449894</td>
</tr>
</tbody></table>
<p>The output of the feature selection was used as input to PCA. The features were projected to a lower dimensional space, reducing dimensionality from 15 features to 6 principal components in our final chosen model.</p>
<h2>Pick an Algorithm</h2>
<blockquote>
<p>What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? </p>
</blockquote>
<p>I ended up using a Gaussian Naïve-Bayes, which scored 0.366984126984 on the nested cross-validation f1. The algorithms tested were:</p>
<ul>
<li>Gaussian Naïve-Bayes;</li>
<li>Support Vector Machines;</li>
<li>Decision Tree Classifier.</li>
</ul>
<p>The scores obtained for them are as follows:</p>
<table><thead>
<tr>
<th>Algorithm</th>
<th>Nested CV f1</th>
</tr>
</thead><tbody>
<tr>
<td>Gaussian Naïve-Bayes</td>
<td>0.366984126984</td>
</tr>
<tr>
<td>Support Vector Machines</td>
<td>0.287132034632</td>
</tr>
<tr>
<td>Decision Tree Classifier</td>
<td>0.228430049483</td>
</tr>
</tbody></table>
<p>Although the other tested models scored better on some other evaluation metrics, the nested cross-validation score is what best depicts how a model generalizes to unseen data, so the Gaussian Naïve-Bayes was the chosen model.</p>
<h2>Tune the Algorithm</h2>
<blockquote>
<p>What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms do not have parameters that you need to tune – if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).</p>
</blockquote>
<p>A crucial part of selecting a machine learning algorithm is to adjust its parameters in order to maximize the evaluation metrics. If the parameters are not properly tuned, the algorithm can underfit or overfit the data, producing suboptimal results.</p>
<p>To tune the algorithms, I used the GridSearchCV tool provided by scikit-learn. It exhaustively searches for the best parameters among the ones specified in an array of possibilities. The parameters are chosen to optimize the chosen scoring function, in our case f1 (the evaluation metrics are addressed further in the ‘Usage of Evaluation Metrics’ section).</p>
<h2>Validation Strategy</h2>
<blockquote>
<p>What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?</p>
</blockquote>
<p>Validation in machine learning consists of evaluating a model using data that was not touched during the training process. A classic mistake is to ignore this rule, hence obtaining overly optimistic results due to overfitting the training data, but very poor performance on unseen data.</p>
<p>It is a good practice to separate data in three parts: training, cross-validation and test sets. The model is tuned to maximize the evaluation score on the cross-validation set, and then the final model efficiency is measured on the test set.</p>
<p>Since there are too few observations for us to train and test the algorithms, in order to extract the most information from the data, the selected strategy to validate our model was a Nested Stratified Shuffle Split Cross-Validation.</p>
<p>This strategy effectively uses a series of train/validation/test set splits. In the inner loop, the score is approximately maximized by fitting a model to each training set, and then the (hyper)parameters are selected to maximize the score over the validation set. In the outer loop, generalization error is estimated by averaging test set scores over several dataset splits. All sets are picked randomly, but keeping the same proportion of class labels.</p>
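<p>Schematically, the nesting looks like the sketch below (the same structure as the evaluate_model function in the ‘Additional Code’ section, which is where pipe and param_grid are actually defined): GridSearchCV handles the inner tuning loop, and cross_val_score wraps it to estimate generalization error on the outer splits.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, cross_val_score

# Inner loop: hyperparameter search maximizing f1 over validation splits.
inner_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=inner_cv)

# Outer loop: each outer test fold only sees the already-tuned model.
outer_cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
nested_f1 = cross_val_score(grid, X, y, cv=outer_cv, scoring='f1').mean()
</code></pre></div>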
<h2>Usage of Evaluation Metrics</h2>
<blockquote>
<p>Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.</p>
</blockquote>
<p>For classification algorithms, some of the most common evaluation metrics are accuracy, precision, recall and the f1 score.</p>
<ul>
<li><p>Accuracy shows the ratio between right classifications and the total number of predicted labels. Since the POI/non-POI distribution is very uneven, accuracy does not mean much. A model that predicts always non-POI’s would get an accuracy of 87.6%, which is an apparently good score for a terrible classifier.</p></li>
<li><p>Precision is the ratio of right classifications over all observations with a given predicted label. For example, the ratio of true POI’s over all predicted POI’s.</p></li>
<li><p>Recall is the ratio of right classifications over all observations that are truly of a given class. For example, the ratio of observations correctly labeled POI over all true POI’s.</p></li>
<li><p>F1 is a way of balancing precision and recall, and is given by the following formula:</p></li>
</ul>
<p>$$F1 = 2 * (precision * recall) / (precision + recall)$$</p>
<p>For the final selected model, the average scores were the following:</p>
<table><thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead><tbody>
<tr>
<td>GaussianNB</td>
<td>0.879310344828</td>
<td>0.543333333333</td>
<td>0.325</td>
<td>0.38</td>
</tr>
</tbody></table>
<h2>Additional Code</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">### The first feature must be "poi".</span>
<span class="n">features_list</span> <span class="o">=</span> <span class="p">[</span><span class="s">'poi'</span><span class="p">,</span> <span class="s">'salary'</span><span class="p">,</span> <span class="s">'bonus'</span><span class="p">,</span> <span class="s">'long_term_incentive'</span><span class="p">,</span> <span class="s">'deferred_income'</span><span class="p">,</span> <span class="s">'deferral_payments'</span><span class="p">,</span>
<span class="s">'loan_advances'</span><span class="p">,</span> <span class="s">'other'</span><span class="p">,</span> <span class="s">'expenses'</span><span class="p">,</span> <span class="s">'director_fees'</span><span class="p">,</span> <span class="s">'total_payments'</span><span class="p">,</span>
<span class="s">'exercised_stock_options'</span><span class="p">,</span> <span class="s">'restricted_stock'</span><span class="p">,</span> <span class="s">'restricted_stock_deferred'</span><span class="p">,</span>
<span class="s">'total_stock_value'</span><span class="p">,</span> <span class="s">'to_messages'</span><span class="p">,</span> <span class="s">'from_messages'</span><span class="p">,</span> <span class="s">'from_this_person_to_poi'</span><span class="p">,</span>
<span class="s">'from_poi_to_this_person'</span><span class="p">,</span> <span class="s">'shared_receipt_with_poi'</span><span class="p">,</span> <span class="s">'fraction_from_poi'</span><span class="p">,</span> <span class="s">'fraction_to_poi'</span><span class="p">]</span>
<span class="c">### Load the dictionary containing the dataset</span>
<span class="n">filled_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="s">'NaN'</span><span class="p">)</span> <span class="c"># featureFormat expects 'NaN' strings</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">filled_df</span><span class="o">.</span><span class="n">to_dict</span><span class="p">(</span><span class="n">orient</span><span class="o">=</span><span class="s">'index'</span><span class="p">)</span>
<span class="c">### Store to my_dataset for easy export below.</span>
<span class="n">my_dataset</span> <span class="o">=</span> <span class="n">data_dict</span>
<span class="c">### Extract features and labels from dataset for local testing</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">featureFormat</span><span class="p">(</span><span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">,</span> <span class="n">sort_keys</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">y</span><span class="p">,</span> <span class="n">X</span> <span class="o">=</span> <span class="n">targetFeatureSplit</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c">### Cross-validation</span>
<span class="n">sss</span> <span class="o">=</span> <span class="n">StratifiedShuffleSplit</span><span class="p">(</span><span class="n">n_splits</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">SCALER</span> <span class="o">=</span> <span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()]</span>
<span class="n">SELECTOR__K</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">13</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="s">'all'</span><span class="p">]</span>
<span class="n">REDUCER__N_COMPONENTS</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">]</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="p">):</span>
<span class="n">nested_score</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">grid</span><span class="p">,</span> <span class="n">X</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">cv</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span> <span class="s">"Nested f1 score: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">nested_score</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="n">grid</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span> <span class="s">"Best parameters: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">grid</span><span class="o">.</span><span class="n">best_params_</span><span class="p">)</span>
<span class="n">cv_accuracy</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_precision</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_recall</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">cv_f1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">train_index</span><span class="p">,</span> <span class="n">test_index</span> <span class="ow">in</span> <span class="n">cv</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">train_index</span><span class="p">],</span> <span class="n">X</span><span class="p">[</span><span class="n">test_index</span><span class="p">]</span>
<span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">train_index</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="n">test_index</span><span class="p">]</span>
<span class="n">grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">cv_accuracy</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_precision</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">precision_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_recall</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">recall_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="n">cv_f1</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f1_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_accuracy</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Precision: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_precision</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean Recall: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_recall</span><span class="p">))</span>
<span class="k">print</span> <span class="s">"Mean f1: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cv_f1</span><span class="p">))</span>
</code></pre></div>
<h3>Gaussian Naïve-Bayes</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [None]</span>
<span class="c"># SELECTOR__K = [15]</span>
<span class="c"># REDUCER__N_COMPONENTS = [6]</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">GaussianNB</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span>
<span class="p">}</span>
<span class="n">gnb_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">gnb_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">gnb_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.366984126984
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
Best parameters: {'reducer__n_components': 6, 'selector__k': 15, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
'precision', 'predicted', average, warn_for)
C:\Users\schil\Anaconda2\lib\site-packages\sklearn\feature_selection\univariate_selection.py:113: UserWarning: Features [5] are constant.
UserWarning)
Pipeline(steps=[('scaler', None), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=6, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('classifier', GaussianNB(priors=None))])
Accuracy: 0.85733 Precision: 0.44868 Recall: 0.30600 F1: 0.36385 F2: 0.32678
Total predictions: 15000 True positives: 612 False positives: 752 False negatives: 1388 True negatives: 12248
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">kbest</span> <span class="o">=</span> <span class="n">gnb_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="o">.</span><span class="n">named_steps</span><span class="p">[</span><span class="s">'selector'</span><span class="p">]</span>
<span class="n">features_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">features_list</span><span class="p">)</span>
<span class="n">features_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">features_array</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">kbest</span><span class="o">.</span><span class="n">scores_</span><span class="p">)[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">k_features</span> <span class="o">=</span> <span class="n">kbest</span><span class="o">.</span><span class="n">get_support</span><span class="p">()</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">features</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">):</span>
<span class="n">features</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">features_array</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="n">i</span><span class="p">]])</span>
<span class="n">features</span> <span class="o">=</span> <span class="n">features</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">kbest</span><span class="o">.</span><span class="n">scores_</span><span class="p">[</span><span class="n">indices</span><span class="p">[</span><span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">)]][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">barh</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">k_features</span><span class="p">),</span> <span class="n">scores</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.4</span><span class="p">,</span> <span class="n">k_features</span><span class="p">),</span> <span class="n">features</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'SelectKBest Feature Importances'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img src="/assets/images/enron/output_16_0.png" alt="png"></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Without the engineered features</span>
<span class="c"># removing the 2 last columns</span>
<span class="n">X_2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">X_2</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">X_2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">gnb_grid</span><span class="p">,</span> <span class="n">X_2</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.345079365079
Best parameters: {'reducer__n_components': 6, 'selector__k': 13, 'scaler': None}
Mean Accuracy: 0.879310344828
Mean Precision: 0.543333333333
Mean Recall: 0.325
Mean f1: 0.38
</code></pre></div>
<h3>Support Vector Machine Classifier</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">C_PARAM</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">GAMMA_PARAM</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">CLASS_WEIGHT</span> <span class="o">=</span> <span class="p">[</span><span class="s">'balanced'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="n">KERNEL</span> <span class="o">=</span> <span class="p">[</span><span class="s">'rbf'</span><span class="p">,</span> <span class="s">'sigmoid'</span><span class="p">]</span>
<span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [StandardScaler()]</span>
<span class="c"># SELECTOR__K = [18]</span>
<span class="c"># REDUCER__N_COMPONENTS = [10]</span>
<span class="c"># C_PARAM = [100]</span>
<span class="c"># GAMMA_PARAM = [.01]</span>
<span class="c"># CLASS_WEIGHT = ['balanced']</span>
<span class="c"># KERNEL = ['sigmoid']</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">SVC</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span><span class="p">,</span>
<span class="s">'classifier__C'</span><span class="p">:</span> <span class="n">C_PARAM</span><span class="p">,</span>
<span class="s">'classifier__gamma'</span><span class="p">:</span> <span class="n">GAMMA_PARAM</span><span class="p">,</span>
<span class="s">'classifier__class_weight'</span><span class="p">:</span> <span class="n">CLASS_WEIGHT</span><span class="p">,</span>
<span class="s">'classifier__kernel'</span><span class="p">:</span> <span class="n">KERNEL</span>
<span class="p">}</span>
<span class="n">svc_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">svc_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">svc_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.287132034632
Best parameters: {'reducer__n_components': 10, 'selector__k': 18, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__class_weight': 'balanced', 'classifier__gamma': 0.01, 'classifier__kernel': 'sigmoid', 'classifier__C': 100.0}
Mean Accuracy: 0.827586206897
Mean Precision: 0.460887445887
Mean Recall: 0.8
Mean f1: 0.566651681652
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=18, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=10, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('cla...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])
Accuracy: 0.76920 Precision: 0.33595 Recall: 0.74850 F1: 0.46375 F2: 0.60092
Total predictions: 15000 True positives: 1497 False positives: 2959 False negatives: 503 True negatives: 10041
</code></pre></div>
<h3>Decision Tree Classifier</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">CRITERION</span> <span class="o">=</span> <span class="p">[</span><span class="s">'gini'</span><span class="p">,</span> <span class="s">'entropy'</span><span class="p">]</span>
<span class="n">SPLITTER</span> <span class="o">=</span> <span class="p">[</span><span class="s">'best'</span><span class="p">,</span> <span class="s">'random'</span><span class="p">]</span>
<span class="n">MIN_SAMPLES_SPLIT</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span>
<span class="n">CLASS_WEIGHT</span> <span class="o">=</span> <span class="p">[</span><span class="s">'balanced'</span><span class="p">,</span> <span class="bp">None</span><span class="p">]</span>
<span class="c">### comment to perform a full hyperparameter search</span>
<span class="c"># SCALER = [StandardScaler()]</span>
<span class="c"># SELECTOR__K = [18]</span>
<span class="c"># REDUCER__N_COMPONENTS = [2]</span>
<span class="c"># CRITERION = ['gini']</span>
<span class="c"># SPLITTER = ['random']</span>
<span class="c"># MIN_SAMPLES_SPLIT = [8]</span>
<span class="c"># CLASS_WEIGHT = ['balanced']</span>
<span class="c">###################################################</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s">'scaler'</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'selector'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'reducer'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)),</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">DecisionTreeClassifier</span><span class="p">())</span>
<span class="p">])</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'scaler'</span><span class="p">:</span> <span class="n">SCALER</span><span class="p">,</span>
<span class="s">'selector__k'</span><span class="p">:</span> <span class="n">SELECTOR__K</span><span class="p">,</span>
<span class="s">'reducer__n_components'</span><span class="p">:</span> <span class="n">REDUCER__N_COMPONENTS</span><span class="p">,</span>
<span class="s">'classifier__criterion'</span><span class="p">:</span> <span class="n">CRITERION</span><span class="p">,</span>
<span class="s">'classifier__splitter'</span><span class="p">:</span> <span class="n">SPLITTER</span><span class="p">,</span>
<span class="s">'classifier__min_samples_split'</span><span class="p">:</span> <span class="n">MIN_SAMPLES_SPLIT</span><span class="p">,</span>
<span class="s">'classifier__class_weight'</span><span class="p">:</span> <span class="n">CLASS_WEIGHT</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">tree_grid</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">param_grid</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'f1'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">sss</span><span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">tree_grid</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sss</span><span class="p">)</span>
<span class="n">test_classifier</span><span class="p">(</span><span class="n">tree_grid</span><span class="o">.</span><span class="n">best_estimator_</span><span class="p">,</span> <span class="n">my_dataset</span><span class="p">,</span> <span class="n">features_list</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Nested f1 score: 0.228430049483
Best parameters: {'reducer__n_components': 4, 'selector__k': 15, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier__min_samples_split': 8, 'classifier__class_weight': 'balanced', 'classifier__splitter': 'random', 'classifier__criterion': 'gini'}
Mean Accuracy: 0.758620689655
Mean Precision: 0.325331890332
Mean Recall: 0.425
Mean f1: 0.321083916084
Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selector', SelectKBest(k=15, score_func=<function f_classif at 0x000000000C5869E8>)), ('reducer', PCA(copy=True, iterated_power='auto', n_components=4, random_state=42,
svd_solver='auto', tol=0.0, whiten=False)), ('clas...=8, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='random'))])
Accuracy: 0.73587 Precision: 0.24677 Recall: 0.47800 F1: 0.32550 F2: 0.40256
Total predictions: 15000 True positives: 956 False positives: 2918 False negatives: 1044 True negatives: 10082
</code></pre></div>
<h2>References</h2>
<ul>
<li><a href="http://scikit-learn.org/">http://scikit-learn.org/</a></li>
<li><a href="http://sebastianraschka.com/Articles/2014_about_feature_scaling.html">http://sebastianraschka.com/Articles/2014_about_feature_scaling.html</a></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comExploring and Summarizing White Wine Data with R2016-11-07T00:00:00+00:002016-11-07T00:00:00+00:00http://luizschiller.com/white-wine<h4>Udacity Data Analyst Nanodegree</h4>
<h3>Project Overview</h3>
<p>This report explores a dataset containing attributes for 4898 instances of the Portuguese “Vinho Verde” white wine.</p>
<p>The attributes are the following:</p>
<ol>
<li>fixed acidity (tartaric acid - g / dm<sup>3</sup>): most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).</li>
<li>volatile acidity (acetic acid - g / dm<sup>3</sup>): the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegary taste.</li>
<li>citric acid (g / dm<sup>3</sup>): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.</li>
<li>residual sugar (g / dm<sup>3</sup>): the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 g / dm<sup>3</sup>, and wines with more than 45 g / dm<sup>3</sup> are considered sweet.</li>
<li>chlorides (sodium chloride - g / dm<sup>3</sup>): the amount of salt in the wine.</li>
<li>free sulfur dioxide (mg / dm<sup>3</sup>): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine.</li>
<li>total sulfur dioxide (mg / dm<sup>3</sup>): the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.</li>
<li>density (g / cm<sup>3</sup>): the density of wine is close to that of water, depending on the alcohol and sugar content.</li>
<li>pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.</li>
<li>sulphates (potassium sulphate - g / dm<sup>3</sup>): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, and which acts as an antimicrobial and antioxidant.</li>
<li>alcohol (% by volume): the percent alcohol content of the wine.</li>
<li>quality: score between 0 and 10 (based on sensory data).</li>
</ol>
<h1>Univariate Plots Section</h1>
<div class="highlight"><pre><code class="language-" data-lang="">## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
</code></pre></div>
<h2>Main feature of interest: Quality</h2>
<p><img src="/assets/images/white-wine/quality-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
</code></pre></div>
<p>Quality follows a normal-like distribution with discrete integer values.</p>
<h2>Regarding acidity</h2>
<p><img src="/assets/images/white-wine/fixed.acidity-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
</code></pre></div>
<p><img src="/assets/images/white-wine/volatile.acidity-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
</code></pre></div>
<p><img src="/assets/images/white-wine/citric.acid-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
</code></pre></div>
<p>There is an interesting peak at 0.49 and a smaller one at 0.74 g / dm<sup>3</sup>. This suggests to me that a standard amount of citric acid may be added to some of the wines.</p>
<p><img src="/assets/images/white-wine/pH-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
</code></pre></div>
<p>The pH shows a bell-shaped distribution. I wonder how it relates individually to the concentrations of the acids.</p>
<h2>Regarding SO2</h2>
<p><img src="/assets/images/white-wine/free.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
</code></pre></div>
<p>Free sulfur dioxide has some extreme outliers to the right of the curve.</p>
<p><img src="/assets/images/white-wine/total.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
</code></pre></div><div class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">wines</span><span class="o">$</span><span class="n">bound.sulfur.dioxide</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wines</span><span class="p">,</span><span class="w">
</span><span class="n">total.sulfur.dioxide</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">free.sulfur.dioxide</span><span class="p">)</span><span class="w">
</span><span class="n">wines</span><span class="o">$</span><span class="n">sulfur.dioxide.ratio</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wines</span><span class="p">,</span><span class="w">
</span><span class="n">free.sulfur.dioxide</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">bound.sulfur.dioxide</span><span class="p">)</span><span class="w">
</span></code></pre></div>
<p>I created a bound sulfur dioxide variable by subtracting the free from the total sulfur dioxide. Then I created a feature consisting of the ratio between the free and bound sulfur dioxide present in the wine.</p>
<p><img src="/assets/images/white-wine/bound.sulfur.dioxide-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
</code></pre></div>
<p>It looks very similar to the total sulfur dioxide.</p>
<p><img src="/assets/images/white-wine/sulfur.dioxide.ratio-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02419 0.23600 0.33990 0.36750 0.46150 2.45500
</code></pre></div>
<p>I transformed the scale to log10 to better visualize the distribution. Maybe it will be useful when trying to predict the quality, or even give us some insight about the data.</p>
<p><img src="/assets/images/white-wine/sulphates-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
</code></pre></div>
<p>Sulphates are a little positively skewed. Since they can contribute to sulfur dioxide levels, it may be valuable to plot the relations between them.</p>
<h2>Other attributes</h2>
<p><img src="/assets/images/white-wine/residual.sugar-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
</code></pre></div>
<p>I transformed the residual sugar to a log10 scale to better visualize its distribution. The transformed variable appears bimodal, with peaks around 1.3 and 8.</p>
<p><img src="/assets/images/white-wine/chlorides-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
</code></pre></div>
<p>I transformed the long tail distribution with a log10 scale so it could be better visualized. After the transformation, the chlorides histogram appears normal, with some outliers on the right side of the curve.</p>
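<p>As a side note for readers more familiar with Python: the analysis in this post is done in R, but the same log10 idea can be sketched roughly with pandas and matplotlib. The CSV file name and column names below are assumptions about the local copy of the dataset, not code from the original analysis.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough Python sketch of a log10-scaled histogram (the post itself uses R/ggplot2).
# The CSV file name and column names are assumptions about the local dataset copy.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

wines = pd.read_csv('wineQualityWhites.csv')

values = wines['residual.sugar']  # the same idea applies to wines['chlorides']
bins = np.logspace(np.log10(values.min()), np.log10(values.max()), 40)
plt.hist(values, bins=bins)
plt.xscale('log')
plt.xlabel('residual sugar (g / dm^3)')
plt.ylabel('count')
plt.show()
</code></pre></div>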
<p><img src="/assets/images/white-wine/density-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
</code></pre></div>
<p>Most of the density values are between 0.99 and 1.00 g / cm<sup>3</sup>, but there are some outliers near 1.01 and 1.04.</p>
<p><img src="/assets/images/white-wine/alcohol-1.png" alt=""></p>
<div class="highlight"><pre><code class="language-" data-lang="">## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
</code></pre></div>
<p>Alcohol presents mostly discrete values, with intervals of .1%. There are a few exceptions though.</p>
<h1>Univariate Analysis</h1>
<h3>What is the structure of your dataset?</h3>
<p>There are 11 variables representing physicochemical measurements and 1 variable representing the median of at least 3 evaluations of quality made by wine experts, varying from 0 (very bad) to 10 (very excellent).</p>
<h3>What is/are the main feature(s) of interest in your dataset?</h3>
<p>Quality is the main feature of interest. The objective of the analysis is to determine the features that influence wine quality the most, and then to build a predictive model of quality using those variables.</p>
<h3>What other features in the dataset do you think will help support your investigation into your feature(s) of interest?</h3>
<p>Most features have an approximately normal distribution, just like the quality variable. This makes it hard to guess which features will have a greater impact on the prediction of quality.</p>
<h3>Did you create any new variables from existing variables in the dataset?</h3>
<p>I created “bound.sulfur.dioxide” (total minus free sulfur dioxide) and “sulfur.dioxide.ratio”, which consists of the ratio between “free.sulfur.dioxide” and “bound.sulfur.dioxide”.</p>
<h3>Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?</h3>
<p>The distribution of citric acid presented two unusual peaks which stood out from an otherwise normal distribution.</p>
<p>I performed a log transformation on the residual sugar and chlorides distributions, because they were very skewed, and the transformations allowed better visualizations of the data.</p>
<h1>Bivariate Plots Section</h1>
<p><img src="/assets/images/white-wine/Correlation_Matrix-1.png" alt=""></p>
<p>This correlation matrix naturally shows strong correlations between free sulfur dioxide, total sulfur dioxide and the constructed variables bound sulfur dioxide and sulfur dioxide ratio.</p>
<p>It also shows interesting relations between residual.sugar and density, and between alcohol and density.</p>
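<p>For reference, the Pearson correlations behind this matrix can be reproduced roughly in Python as well. This is only a sketch using the same assumed <code>wines</code> DataFrame from the earlier Python snippet; the original matrix was produced in R.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch of the correlation matrix shown above (the original is R).
# Assumes the `wines` DataFrame and column names from the earlier Python sketch.
wines['bound.sulfur.dioxide'] = wines['total.sulfur.dioxide'] - wines['free.sulfur.dioxide']
wines['sulfur.dioxide.ratio'] = wines['free.sulfur.dioxide'] / wines['bound.sulfur.dioxide']

corr = wines.corr(method='pearson')
print(corr['quality'].sort_values(ascending=False))  # correlations of each feature with quality
</code></pre></div>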
<h2>Density, residual sugar and alcohol</h2>
<p><img src="/assets/images/white-wine/density_sugar_alcohol-1.png" alt=""></p>
<p>Density varies approximately linearly with residual sugar (positive correlation) and with alcohol (negative correlation). This makes sense considering the fermentation process of wine, in which sugar is consumed to generate alcohol. Since residual sugar is denser than alcohol, this inverse relation appears.</p>
<h2>Sulfur dioxide</h2>
<p><img src="/assets/images/white-wine/sulfur_dioxide-1.png" alt=""></p>
<p>The sulfur dioxide ratio increases along with the free sulfur dioxide, and wines with greater ratios tend to have smaller concentrations of bound sulfur dioxide. I wonder how quality varies related to these variables.</p>
<h2>Acids</h2>
<p><img src="/assets/images/white-wine/acids-1.png" alt=""></p>
<p>The only acidity measure that shows a considerable correlation with pH is fixed acidity.</p>
<h2>Main feature of interest: Quality</h2>
<p><img src="/assets/images/white-wine/quality_fixed.acidity-1.png" alt=""></p>
<p>Better quality wines seem to have smaller fixed acidities on average.</p>
<p><img src="/assets/images/white-wine/quality_volatile.acidity-1.png" alt=""></p>
<p>The same seems to apply to volatile acidity, but nothing very conclusive emerges.</p>
<p><img src="/assets/images/white-wine/quality_citric.acid-1.png" alt=""></p>
<p>The low correlation seen in the matrix above is also apparent in these charts of quality by citric acid.</p>
<p><img src="/assets/images/white-wine/quality_pH-1.png" alt=""></p>
<p>Except for wines with quality score 3, the median pH increases along with quality score.</p>
<p><img src="/assets/images/white-wine/quality_free.sulfur.dioxide-1.png" alt=""></p>
<p>No clear relation between free sulfur dioxide and quality.</p>
<p><img src="/assets/images/white-wine/quality_bound.sulfur.dioxide-1.png" alt=""></p>
<p>Here a trend can be seen. Overall, quality decreases as bound sulfur dioxide increases.</p>
<p><img src="/assets/images/white-wine/quality_total.sulfur.dioxide-1.png" alt=""></p>
<p>Here a slight correlation can be seen, somewhat similar to that of bound sulfur dioxide.</p>
<p><img src="/assets/images/white-wine/quality_sulfur.dioxide.ratio-1.png" alt=""></p>
<p>In general, quality increases as the ratio of free to bound sulfur dioxide increases, but the correlation is weak.</p>
<p><img src="/assets/images/white-wine/quality_sulphates-1.png" alt=""></p>
<p>Sulphates don’t seem to add much on their own.</p>
<p><img src="/assets/images/white-wine/quality_residual.sugar-1.png" alt=""></p>
<p>Nothing very clear from these charts.</p>
<p><img src="/assets/images/white-wine/quality_chlorides-1.png" alt=""></p>
<p>There is a curious number of outliers for scores 5 and 6. I wonder why that happens.</p>
<p><img src="/assets/images/white-wine/quality_density-1.png" alt=""></p>
<p>A stronger correlation is evident here. This seems to be one of the most promising relations so far. Maybe it has something to do with the fact that density is highly correlated with residual sugar and alcohol concentration, features that may be more easily detected by the experts’ palates.</p>
<p><img src="/assets/images/white-wine/quality_alcohol-1.png" alt=""></p>
<p>Alcohol is the variable with the greatest correlation with quality, as can be seen in the chart. Wines with scores 3 and 4 go against the trend, but there are not many of those.</p>
<h1>Bivariate Analysis</h1>
<h3>Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?</h3>
<p>I analyzed the relations between quality and every other variable in the dataset. The two largest Pearson’s correlations found were with alcohol (.436) and density (-.307). With both variables, an approximately linear relation existed for wines with scores from 5 to 9. The same did not apply to scores 3 and 4.</p>
<p>Analyzing the wines with quality score 9, I observed that they have on average a high concentration of alcohol, a very low density, and also a low amount of residual sugar. I imagine this derives from a well-adjusted fermentation process, in which the sugar from the grapes is almost completely consumed, generating an above-average alcohol concentration and thus a lower density.</p>
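<p>A quick way to check this observation numerically is to compare the quality-9 wines against the overall averages. The sketch below assumes the same <code>wines</code> DataFrame used in the earlier Python snippets.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch comparing quality-9 wines with the overall averages.
# Assumes the `wines` DataFrame and column names from the earlier sketches.
cols = ['alcohol', 'density', 'residual.sugar']
print(wines[wines['quality'] == 9][cols].mean())  # averages for the top-rated wines
print(wines[cols].mean())                         # overall averages for comparison
</code></pre></div>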
<p>There is also a curious number of chloride outliers for scores 5 and 6. I wonder why that happens.</p>
<h3>Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?</h3>
<p>Density is strongly correlated with two other variables: residual sugar (positively) and alcohol (negatively). This makes sense considering the fermentation process of wine, in which sugar is consumed to generate alcohol. Since residual sugar is denser than alcohol, this inverse relation appears.</p>
<p>Another relationship found was between fixed acidity and pH. Among the measures of acidity in the dataset, fixed acidity was the only one presenting at least a weak linear relationship with pH.</p>
<h3>What was the strongest relationship you found?</h3>
<p>The one between density and residual sugar. These features have a Pearson’s correlation coefficient of .839.</p>
<h1>Multivariate Plots Section</h1>
<p>I am dividing alcohol into bins to be able to plot density, alcohol, residual sugar and quality together and see how they relate to each other:</p>
<p><img src="/assets/images/white-wine/alcohol_buckets-1.png" alt=""></p>
<p><img src="/assets/images/white-wine/quality_levels-1.png" alt=""></p>
<p>It can be seen that the points corresponding to higher amounts of alcohol show wines of better quality on average, and, for a given residual sugar, quality increases as density decreases.</p>
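<p>The binning itself was done in R; a rough Python equivalent with pandas is sketched below. The bucket edges are illustrative assumptions, since the cut points used in the original plots are not listed here.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough pandas sketch of binning alcohol into buckets (the post does this in R).
# The bin edges below are illustrative only; they are not the original cut points.
import pandas as pd

wines['alcohol.bucket'] = pd.cut(wines['alcohol'],
                                 bins=[8, 9.5, 10.5, 11.5, 14.2],
                                 include_lowest=True)
print(wines['alcohol.bucket'].value_counts().sort_index())
</code></pre></div>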
<p><img src="/assets/images/white-wine/sulfur_quality-1.png" alt=""></p>
<p>Revisiting this chart from the bivariate plots section, now colored by quality score. None of the charts indicate that these factors are a good fit for a linear model predicting quality. However, some regions with higher concentrations of good and bad quality wines emerge, although not very clearly.</p>
<p><img src="/assets/images/white-wine/acid_quality-1.png" alt=""></p>
<p>Revisiting these charts with quality added as color, nothing very useful appeared.</p>
<div class="highlight"><pre><code class="language-" data-lang="">##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wines)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar,
## data = wines)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio, data = wines)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates, data = wines)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density, data = wines)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH, data = wines)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity,
## data = wines)
## m9: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity +
## chlorides, data = wines)
## m10: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar +
## sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity +
## chlorides + citric.acid, data = wines)
##
## ===============================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
## -----------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** 3.017*** 2.356*** 2.264*** 2.014*** 82.862*** 102.754*** 145.254*** 143.910*** 144.536***
## (0.098) (0.098) (0.114) (0.114) (0.125) (12.567) (12.925) (18.216) (18.506) (18.561)
## alcohol 0.313*** 0.324*** 0.375*** 0.367*** 0.368*** 0.271*** 0.242*** 0.192*** 0.192*** 0.191***
## (0.009) (0.009) (0.010) (0.010) (0.010) (0.018) (0.019) (0.024) (0.024) (0.024)
## volatile.acidity -1.979*** -2.107*** -1.961*** -1.943*** -1.910*** -1.887*** -1.835*** -1.831*** -1.823***
## (0.110) (0.109) (0.111) (0.110) (0.110) (0.110) (0.111) (0.111) (0.113)
## residual.sugar 0.027*** 0.025*** 0.026*** 0.055*** 0.065*** 0.081*** 0.081*** 0.081***
## (0.002) (0.002) (0.002) (0.005) (0.005) (0.007) (0.007) (0.007)
## sulfur.dioxide.ratio 0.384*** 0.388*** 0.319*** 0.304*** 0.308*** 0.309*** 0.308***
## (0.056) (0.056) (0.057) (0.057) (0.056) (0.057) (0.057)
## sulphates 0.463*** 0.636*** 0.588*** 0.645*** 0.644*** 0.642***
## (0.095) (0.098) (0.098) (0.100) (0.100) (0.100)
## density -80.561*** -101.824*** -145.402*** -144.002*** -144.641***
## (12.522) (12.938) (18.457) (18.767) (18.823)
## pH 0.472*** 0.702*** 0.695*** 0.699***
## (0.076) (0.103) (0.105) (0.105)
## fixed.acidity 0.068*** 0.066** 0.065**
## (0.020) (0.021) (0.021)
## chlorides -0.224 -0.251
## (0.543) (0.546)
## citric.acid 0.043
## (0.095)
## -----------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.240 0.259 0.266 0.269 0.275 0.281 0.283 0.283 0.283
## adj. R-squared 0.190 0.240 0.258 0.265 0.268 0.274 0.280 0.281 0.281 0.281
## sigma 0.797 0.772 0.763 0.759 0.758 0.754 0.752 0.751 0.751 0.751
## F 1146.395 773.875 568.789 442.368 360.295 309.623 272.891 240.632 213.878 192.478
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5681.776 -5622.083 -5598.647 -5586.778 -5566.139 -5547.023 -5541.549 -5541.464 -5541.364
## Deviance 3112.257 2918.264 2847.993 2820.870 2807.231 2783.672 2762.028 2755.862 2755.766 2755.653
## AIC 11684.782 11371.552 11254.166 11209.295 11187.556 11148.278 11112.045 11103.098 11104.927 11106.728
## BIC 11704.272 11397.538 11286.649 11248.274 11233.032 11200.250 11170.515 11168.064 11176.389 11184.687
## N 4898 4898 4898 4898 4898 4898 4898 4898 4898 4898
## ===============================================================================================================================================
</code></pre></div>
<h1>Multivariate Analysis</h1>
<h3>Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?</h3>
<p>There is a very interesting relation between density, alcohol, residual sugar and quality. In general, quality increases as alcohol increases, density decreases and residual sugar decreases. These variables were amongst the most important predictors in the linear model built.</p>
<h3>Were there any interesting or surprising interactions between features?</h3>
<p>Since I did not have much knowledge of wine appraising before this exercise, I did not set expectations for the role of each variable, and therefore I was not surprised by the relations between them.</p>
<h3>OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.</h3>
<p>I created a linear model for predicting quality. The R-squared value for the model was 0.283, which is quite low. It indicates that a linear model is probably not the best fit for this dataset. Alcohol, volatile acidity and residual sugar were the most important prediction variables. Since some of the variables are strongly correlated with each other, some form of feature selection would likely improve the model.</p>
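<p>For illustration, a model similar to m3 in the table above (quality ~ alcohol + volatile.acidity + residual.sugar) could be fit in Python with scikit-learn. This is only a sketch under the same assumptions about the <code>wines</code> DataFrame as the earlier snippets; the models reported above were fit with R’s lm.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Rough scikit-learn sketch of a linear model along the lines of m3 above
# (quality ~ alcohol + volatile.acidity + residual.sugar).
# Assumes the `wines` DataFrame and column names from the earlier sketches.
from sklearn.linear_model import LinearRegression

predictors = ['alcohol', 'volatile.acidity', 'residual.sugar']
X = wines[predictors]
y = wines['quality']

lm = LinearRegression().fit(X, y)
print('R-squared:', lm.score(X, y))
print(dict(zip(predictors, lm.coef_)))
</code></pre></div>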
<hr>
<h1>Final Plots and Summary</h1>
<h3>Plot One</h3>
<p><img src="/assets/images/white-wine/Plot_One-1.png" alt=""></p>
<h3>Description One</h3>
<p>This chart depicts the relation between alcohol concentration and quality score. For scores from 5 to 9, quality increases as alcohol increases, and for scores 3 and 4 the relation is the inverse. Alcohol has the largest correlation with quality among all the variables in the dataset, with a Pearson’s correlation coefficient of .436.</p>
<h3>Plot Two</h3>
<p><img src="/assets/images/white-wine/Plot_Two-1.png" alt=""></p>
<h3>Description Two</h3>
<p>A very interesting relation is shown in this chart. Given a value of residual sugar, density increases as alcohol decreases. This is to some extent due to the fermentation process of winemaking, in which sugar is consumed to generate alcohol. Since alcohol is less dense than water and sugar is denser than water, this process makes the density of the wine decrease.</p>
<h3>Plot Three</h3>
<p><img src="/assets/images/white-wine/Plot_Three-1.png" alt=""></p>
<h3>Description Three</h3>
<p>This chart shows how quality relates with density and residual sugar. The two lowest and highest quality levels have been grouped to improve visibility.</p>
<p>It is noticeable that, for a given residual sugar concentration, quality increases as density decreases. The same occurs if you fix density and increase residual sugar.</p>
<hr>
<h1>Reflection</h1>
<p>This exploratory data analysis, in which univariate, then bivariate, and finally multivariate examinations are performed, allows for a progressive understanding of the dataset and of the relations between its features.</p>
<p>Some interesting relations came up, like the one between alcohol, density, residual sugar and quality, which could be related to the fermentation process of wine. The fact that pH correlates with fixed acidity but not with volatile acidity or citric acid is also worth noting.</p>
<p>I struggled to find meaningful relations in the multivariate analysis section, and I have the feeling that some interesting relations may have been left aside among the many permutations of variables in the dataset. In any case, the whole analysis process was a very valuable experience, in which I practiced plotting various types of charts, handling overplotting and choosing the best chart type to convey the intended message.</p>
<p>A linear model for predicting quality was built, but it performed poorly, indicating that the dataset does not behave very linearly. The process of evaluating wines is very subjective, and experts can be biased by their histories and preferences, making the relation between quality and the other variables too complex to be explained by a linear model. In the future, a different set of quality prediction models could be applied, and an evaluation of the best fit could be performed.</p>
<h1>References</h1>
<ul>
<li><p>P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] <a href="http://dx.doi.org/10.1016/j.dss.2009.05.016">http://dx.doi.org/10.1016/j.dss.2009.05.016</a> [Pre-press (pdf)] <a href="http://www3.dsi.uminho.pt/pcortez/winequality09.pdf">http://www3.dsi.uminho.pt/pcortez/winequality09.pdf</a> [bib] <a href="http://www3.dsi.uminho.pt/pcortez/dss09.bib">http://www3.dsi.uminho.pt/pcortez/dss09.bib</a></p></li>
<li><p><a href="https://en.wikipedia.org/wiki/Acids_in_wine">https://en.wikipedia.org/wiki/Acids_in_wine</a></p></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comOpenStreetMap Data Wrangling2016-10-08T00:00:00+01:002016-10-08T00:00:00+01:00http://luizschiller.com/openstreetmap<h4>Udacity Data Analyst Nanodegree</h4>
<h2>Project Overview</h2>
<p>Choose any area of the world in <a href="https://www.openstreetmap.org">Open Street Map</a> and use data munging techniques, such as assessing the quality of the data for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for a part of the world that you care about. Choose to learn SQL or MongoDB and apply your chosen schema to the project.</p>
<p>Find the Python code for the project here: <a href="https://github.com/schiller/wrangle-open-street-map-data">https://github.com/schiller/wrangle-open-street-map-data</a></p>
<h3>Map Area: Rio de Janeiro, Brazil</h3>
<ul>
<li><a href="https://mapzen.com/data/metro-extracts/metro/rio-de-janeiro_brazil/">https://mapzen.com/data/metro-extracts/metro/rio-de-janeiro_brazil/</a></li>
</ul>
<p>This area contains three cities that played a great part in my history. I lived about one third of my life in each: Petrópolis (where I was born), Niterói and Rio de Janeiro. That said, I would like to explore this extract a little bit and see what interesting data I can find.</p>
<h2>Problems Encountered in the Map</h2>
<p>After the initial cleaning of the data from the downloaded XML file, it was imported into MongoDB using the following command:
<code>
mongoimport --db osm --collection rio --file rio-de-janeiro_brazil.osm.json
</code></p>
<p>The elements were structured like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span class="p">{</span><span class="w">
</span><span class="nt">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2406124091"</span><span class="p">,</span><span class="w">
</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"node"</span><span class="p">,</span><span class="w">
</span><span class="nt">"created"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nt">"version"</span><span class="p">:</span><span class="s2">"2"</span><span class="p">,</span><span class="w">
</span><span class="nt">"changeset"</span><span class="p">:</span><span class="s2">"17206049"</span><span class="p">,</span><span class="w">
</span><span class="nt">"timestamp"</span><span class="p">:</span><span class="s2">"2013-08-03T16:43:42Z"</span><span class="p">,</span><span class="w">
</span><span class="nt">"user"</span><span class="p">:</span><span class="s2">"linuxUser16"</span><span class="p">,</span><span class="w">
</span><span class="nt">"uid"</span><span class="p">:</span><span class="s2">"1219059"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nt">"pos"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mf">41.9757030</span><span class="p">,</span><span class="w"> </span><span class="mf">-87.6921867</span><span class="p">],</span><span class="w">
</span><span class="nt">"address"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nt">"housenumber"</span><span class="p">:</span><span class="w"> </span><span class="s2">"5157"</span><span class="p">,</span><span class="w">
</span><span class="nt">"postcode"</span><span class="p">:</span><span class="w"> </span><span class="s2">"24230-062"</span><span class="p">,</span><span class="w">
</span><span class="nt">"street"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Rua Moreira César"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nt">"amenity"</span><span class="p">:</span><span class="w"> </span><span class="s2">"restaurant"</span><span class="p">,</span><span class="w">
</span><span class="nt">"cuisine"</span><span class="p">:</span><span class="w"> </span><span class="s2">"mexican"</span><span class="p">,</span><span class="w">
</span><span class="nt">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"La Cabana De Don Luis"</span><span class="p">,</span><span class="w">
</span><span class="nt">"phone"</span><span class="p">:</span><span class="w"> </span><span class="s2">"+55-21-95757782"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div>
<p>Analyzing a sample of the data, some problems showed up:</p>
<ul>
<li>Tags with k=“type” overriding the element’s ‘type’ field;</li>
<li>String ‘bicycle_parking’ capacities instead of numbers;</li>
<li>Abbreviated street types in ‘address.street’ tag;</li>
<li>Many different formats in ‘phone’ field;</li>
<li>pprint.pprint method not printing Unicode characters properly.</li>
</ul>
<h3>Tags with k=“type” overriding the element’s ‘type’ field</h3>
<p>Second level ‘k’ tags with the value ‘type’ were overriding the element’s ‘type’ field, which should equal ‘node’ or ‘way’ only. These tags were mapped to the element with the ‘type_tag’ key before being imported into MongoDB.</p>
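<p>The shaping code that does this lives in the project repository linked above; a minimal sketch of the idea, with assumed variable names, could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Minimal sketch of remapping second level k="type" tags (variable names are
# assumptions; the actual shaping code is in the project repository).
def handle_type_tag(node, tag):
    key = tag.attrib['k']
    value = tag.attrib['v']
    if key == 'type':
        # avoid clobbering the element's own 'type' field ('node' or 'way')
        node['type_tag'] = value
    else:
        node[key] = value
</code></pre></div>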
<h3>String ‘bicycle_parking’ capacities instead of numbers</h3>
<p>Nodes representing bicycle parkings had their capacity fields as strings, which did not allow the numeric operations I wanted to perform on them. All of them represented numbers, except for one ‘§0’ value. To solve this, I iterated over the XML file, updating the values with the parsed integer values. Whenever the parsing failed, the ‘capacity’ field was removed. The code used for the conversion and removal is shown below:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">handle_bicycle_parking_capacity</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="k">if</span> <span class="p">(</span><span class="s">'amenity'</span> <span class="ow">in</span> <span class="n">node</span><span class="p">)</span> <span class="ow">and</span> <span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="s">'amenity'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'bicycle_parking'</span><span class="p">):</span>
<span class="k">if</span> <span class="s">'capacity'</span> <span class="ow">in</span> <span class="n">node</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">node</span><span class="p">[</span><span class="s">'capacity'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">node</span><span class="p">[</span><span class="s">'capacity'</span><span class="p">])</span>
<span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
<span class="n">node</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="s">'capacity'</span><span class="p">)</span>
</code></pre></div>
<h3>Abbreviated street types in ‘address.street’ tag</h3>
<p>There were several street names with their type abbreviated, for example:
<code>
Estr. da Paciência
Av Castelo Branco
R. Miguel Gustavo
</code>
It is worth noting that in Portuguese the street types appear at the beginning of a street name, in contrast with English, where it appears at the end.
To deal with this a mapping was created to convert abbreviations to complete street types:
<code>python
mapping = { "Av": "Avenida",
"Av.": "Avenida",
"Est.": "Estrada",
"Estr.": "Estrada",
"estrada": "Estrada",
"Pca": u"Praça",
"Praca": u"Praça",
u"Pça": u"Praça",
u"Pça.": u"Praça",
"R.": "Rua",
"RUA": "Rua",
"rua": "Rua",
"Ruas": "Rua",
"Rue": "Rua",
"Rod.": "Rodovia",
"Trav": "Travessa" }
</code>
After the update, the abbreviation problem was solved for almost all cases, excluding only a few stranger cases, probably caused by erroneous human input.</p>
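<p>The function that applies this mapping is in the project repository; a minimal sketch, assuming the street type is always the first word of the name (as noted above, Portuguese street types come first), could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Minimal sketch of applying the abbreviation mapping above (an illustration,
# not the exact update code from the repository).
def update_street_name(name, mapping):
    parts = name.split(' ', 1)
    street_type = parts[0]
    if street_type in mapping and len(parts) == 2:
        return mapping[street_type] + ' ' + parts[1]
    return name

print(update_street_name(u'Av Castelo Branco', mapping))  # Avenida Castelo Branco
</code></pre></div>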
<h3>Many different formats in ‘phone’ field</h3>
<p>The ‘phone’ field of most elements was filled with phone numbers in many different formats, and often more than one number was inserted in the same field.
To organize this data I defined a standard pattern for the phone values and audited the file, classifying the values into four groups: ok, wrong_separators, missing_area_code and other. The groups were defined by regular expressions as follows:</p>
<h4>ok</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># +55 99 99999999
phone_ok_re = re.compile(r'^\+55\s\d{2}\s\d{8,9}$')
# 0800 999 9999
phone_0800_ok_re = re.compile(r'^0800\s\d{3}\s\d{4}$')
</code></pre></div>
<h4>wrong_separators</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># 55-99-9-99999999
wrong_separators_re = re.compile(r'^\D*55\D*\d{2}\D*(\d\D?)?\d{4}\D?\d{4}$')
# +55-99-0800-999-9999
wrong_separators_0800_re = re.compile(r'^\D*(55)?\D*(\d{2})?\D*0800\D?\d{3}\D?\d\D?\d{3}$')
</code></pre></div>
<h4>missing_area_code</h4>
<div class="highlight"><pre><code class="language-" data-lang=""># missing +55 (Rio area codes start with 2)
missing_ddi_re = re.compile(r'^\D*2\d\D*(\d\D?)?\d{4}\D?\d{4}$')
# missing +55 2X
missing_ddd_re = re.compile(r'^(\d\D?)?\d{4}\D?\d{4}$')
</code></pre></div>
<h4>other</h4>
<div class="highlight"><pre><code class="language-" data-lang="">The remaining values.
</code></pre></div>
<p>Before the update of the values, which consisted of turning the phone values into lists of strings, removing non-alphanumeric characters, adding missing area codes and inserting spaces only where appropriate, the classification was like this:
<code>json
{
"missing_area_code": 72,
"wrong_separators": 2055,
"other": 41,
"ok": 151
}
</code>
and after the update it turned out like this:
<code>json
{
"missing_area_code": 18,
"wrong_separators": 0,
"other": 41,
"ok": 2260
}
</code>
With an improvement from 6.5% to 97.5% of ‘ok’ values, I was content with the phone cleaning for this wrangling exercise.</p>
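<p>The full update routine is in the project repository; a minimal sketch of normalizing a single value into the ‘+55 99 99999999’ pattern, assuming Rio’s ‘21’ as an illustrative default area code, could look like this:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import re

# Minimal sketch of normalizing one phone value (an illustration only; the real
# update code handles lists of numbers and more cases, and the default area
# code '21' is just an assumption for the example).
def update_phone(value, default_area_code='21'):
    digits = re.sub(r'\D', '', value)
    if digits.startswith('0800'):
        return '0800 {} {}'.format(digits[4:7], digits[7:])
    if digits.startswith('55'):
        digits = digits[2:]
    if len(digits) in (8, 9):  # missing area code
        digits = default_area_code + digits
    return '+55 {} {}'.format(digits[:2], digits[2:])

print(update_phone('55-21-9-99999999'))  # +55 21 999999999
</code></pre></div>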
<h3>pprint.pprint method not printing Unicode characters properly</h3>
<p>This problem is not related to the data itself, but it was harming the wrangling process.
When printing the results of some queries with the pprint.pprint method, characters outside the ASCII table were shown as their escaped Unicode representation, making the output hard to read.
To solve this I had to instantiate my own printer, which encoded Unicode objects as UTF-8, making them readable. Check the code below:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pprint</span>
<span class="k">class</span> <span class="nc">MyPrettyPrinter</span><span class="p">(</span><span class="n">pprint</span><span class="o">.</span><span class="n">PrettyPrinter</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">format</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">object</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">maxlevels</span><span class="p">,</span> <span class="n">level</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="nb">object</span><span class="p">,</span> <span class="nb">unicode</span><span class="p">):</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">object</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'utf8'</span><span class="p">),</span> <span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pprint</span><span class="o">.</span><span class="n">PrettyPrinter</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">object</span><span class="p">,</span> <span class="n">context</span><span class="p">,</span> <span class="n">maxlevels</span><span class="p">,</span> <span class="n">level</span><span class="p">)</span>
</code></pre></div>
<h2>Data Overview</h2>
<p>This section contains basic statistics about the dataset and the MongoDB queries used to gather them. Some queries make use of the ‘aggregate’ function.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">MongoClient</span>
<span class="k">def</span> <span class="nf">get_db</span><span class="p">(</span><span class="n">db_name</span><span class="p">):</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">MongoClient</span><span class="p">(</span><span class="s">'localhost:27017'</span><span class="p">)</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">client</span><span class="p">[</span><span class="n">db_name</span><span class="p">]</span>
<span class="k">return</span> <span class="n">db</span>
<span class="k">def</span> <span class="nf">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">pipeline</span><span class="p">):</span>
<span class="k">return</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)]</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">get_db</span><span class="p">(</span><span class="s">'osm'</span><span class="p">)</span>
</code></pre></div>
<h3>File Sizes</h3>
<div class="highlight"><pre><code class="language-" data-lang="">rio-de-janeiro_brazil.osm ........... 323 MB
rio-de-janeiro_brazil.osm.json ...... 369 MB
</code></pre></div>
<h3>Elements Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">1737174
</code></pre></div>
<h3>Nodes Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># node count</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">({</span><span class="s">'type'</span><span class="p">:</span> <span class="s">'node'</span><span class="p">})</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">1550716
</code></pre></div>
<h3>Ways Count</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># way count</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">({</span><span class="s">'type'</span><span class="p">:</span> <span class="s">'way'</span><span class="p">})</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">186458
</code></pre></div>
<h3>Number of Distinct Users</h3>
<p>This query uses the following ‘aggregate’ method:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">distinct_users</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'Distinct users:'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">distinct_users</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Distinct users:, count: 1239}]
</code></pre></div>
<h3>Top 10 Contributing Users</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">top_10_users</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">top_10_users</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Alexandrecw, count: 374621},
{_id: ThiagoPv, count: 186562},
{_id: smaprs_import, count: 185690},
{_id: AlNo, count: 169678},
{_id: Import Rio, count: 85129},
{_id: Geaquinto, count: 69987},
{_id: Nighto, count: 63148},
{_id: Thundercel, count: 55004},
{_id: Márcio Vínícius Pinheiro, count: 35985},
{_id: smaprs, count: 31507}]
</code></pre></div>
<h3>Users Appearing Only Once</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">users_appearing_once</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$created.user'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span><span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$count'</span><span class="p">,</span> <span class="s">'num_users'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span><span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">users_appearing_once</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: 1, num_users: 274}]
</code></pre></div>
<h2>Additional Ideas</h2>
<h3>City validation based on postcodes</h3>
<p>The city and postcode values could be cross-checked when a new address is input. Most countries have public APIs for retrieving addresses from postcodes, so this could be implemented with the help of contributors around the world.
This improvement could prevent many wrong data inputs - there are plenty of examples in the examined dataset - and it would make analyzing city-related data considerably easier and more accurate, benefiting users throughout the world.
On the other hand, such a change reduces the freedom of the user entering new addresses, since data could only be submitted if it agreed with the cross-checked value from another data source. These positive and negative impacts should be weighed before implementing this kind of improvement.</p>
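<p>A minimal sketch of the idea in Python follows. The reference lookup is a plain dictionary standing in for whatever public postcode API a country provides, and its entries are illustrative assumptions, not values taken from the dataset.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Stand-in for a public postcode API; the entries below are hypothetical examples.
REFERENCE_CITIES = {
    '20000-000': 'Rio de Janeiro',
    '24000-000': 'Niterói',
}

def city_matches_postcode(city, postcode):
    """Return True when the city agrees with the reference data for the postcode."""
    expected = REFERENCE_CITIES.get(postcode)
    if expected is None:
        return True  # no reference data available: accept the input as-is
    return expected.lower() == city.lower()

# Flag a suspicious address before it reaches the database.
address = {'city': 'Rio de Janeiro', 'postcode': '24000-000'}
if not city_matches_postcode(address['city'], address['postcode']):
    print('City/postcode mismatch: {}'.format(address))
</code></pre></div>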
<h3>Phone format validator</h3>
<p>The Open Street Map input tool could have a phone format validator, varying from country to country, to avoid the current mess of phone formats 😉. It could also join multiple phone numbers with a standard separator, since splitting them was one of the most difficult steps of the phone value wrangling.
The fact that each country has a different standard format makes this difficult to implement, but with the help of the open-source community around Open Street Map it could be done.
Again, it would reduce the freedom of the user entering the data, since the phone format would have to be validated against the standards. And every time the standards change, the validators would have to be updated, causing extra work that does not exist today.</p>
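<p>A rough sketch of what such a validator could look like is shown below. The regular expression and the standard separator are assumptions for Brazilian numbers written in international notation, not rules taken from Open Street Map.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import re

# Illustrative pattern for Brazilian numbers: +55 DD NNNN-NNNN or +55 DD NNNNN-NNNN.
PHONE_PATTERNS = {
    'BR': re.compile(r'^\+55 \d{2} \d{4,5}-\d{4}$'),
}
STANDARD_SEPARATOR = ';'

def normalise_phones(raw_value, country='BR'):
    """Split a raw phone field on common separators and keep only valid numbers."""
    candidates = re.split(r'[;,/]| e ', raw_value)
    valid = [p.strip() for p in candidates if PHONE_PATTERNS[country].match(p.strip())]
    return STANDARD_SEPARATOR.join(valid)

print(normalise_phones('+55 21 1234-5678 / +55 21 91234-5678'))
# '+55 21 1234-5678;+55 21 91234-5678'
</code></pre></div>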
<h3>Variety.js</h3>
<p>The open-source tool Variety (<a href="https://github.com/variety/variety">https://github.com/variety/variety</a>) lets the user get a sense of how the data in a MongoDB collection is structured. It does so by showing the number of occurrences of each key in the documents returned by a query.
It is a useful ally when analysing datasets like Open Street Map, which does not define an allowed key set.</p>
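<p>The short sketch below is not Variety itself, only a rough Python approximation of the same report, reusing the <code>db</code> connection opened earlier: it counts how often each top-level key appears in the documents returned by a query.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from collections import Counter

def key_frequencies(collection, query=None, limit=10000):
    """Count how many sampled documents contain each top-level key."""
    counter = Counter()
    for doc in collection.find(query or {}).limit(limit):
        counter.update(doc.keys())
    return counter

for key, count in key_frequencies(db.rio, {'type': 'node'}).most_common(10):
    print('{}: {}'.format(key, count))
</code></pre></div>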
<h3>Most Common Amenities</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">most_common_amenities</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$amenity'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">most_common_amenities</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: school, count: 1818},
{_id: bicycle_parking, count: 1409},
{_id: restaurant, count: 1080},
{_id: parking, count: 976},
{_id: fast_food, count: 890},
{_id: fuel, count: 678},
{_id: place_of_worship, count: 562},
{_id: bank, count: 534},
{_id: pub, count: 400},
{_id: pharmacy, count: 368}]
</code></pre></div>
<h3>Statistics on Bike Parking Capacity</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bike_parkings_capacity</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'bicycle_parking'</span><span class="p">,</span> <span class="s">'capacity'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span>
<span class="s">'_id'</span><span class="p">:</span> <span class="s">'Bike parking stats:'</span><span class="p">,</span>
<span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="s">'max'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$max'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">},</span>
<span class="s">'min'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$min'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">},</span>
<span class="s">'avg'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$avg'</span><span class="p">:</span> <span class="s">'$capacity'</span><span class="p">}}}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">bike_parkings_capacity</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: Bike parking stats:,
avg: 11.487840825350037,
count: 1357,
max: 700,
min: 1}]
</code></pre></div>
<h3>10 Most Common Cuisines</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">top_10_cuisines</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'restaurant'</span><span class="p">,</span> <span class="s">'cuisine'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$cuisine'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">top_10_cuisines</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: pizza, count: 88},
{_id: regional, count: 83},
{_id: japanese, count: 38},
{_id: italian, count: 38},
{_id: steak_house, count: 20},
{_id: barbecue, count: 18},
{_id: brazilian, count: 16},
{_id: international, count: 12},
{_id: seafood, count: 8},
{_id: chinese, count: 8}]
</code></pre></div>
<h3>10 Most Common Religions</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">most_common_religions</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'place_of_worship'</span><span class="p">,</span> <span class="s">'religion'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$religion'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">most_common_religions</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: christian, count: 495},
{_id: spiritualist, count: 7},
{_id: jewish, count: 6},
{_id: buddhist, count: 3},
{_id: religion_of_humanity, count: 1},
{_id: umbanda, count: 1},
{_id: macumba, count: 1},
{_id: muslim, count: 1},
{_id: seicho_no_ie, count: 1}]
</code></pre></div>
<p>The vast majority is christian. Among them, which are the most common denominations?</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">christian_denominations</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s">'$match'</span><span class="p">:</span> <span class="p">{</span><span class="s">'amenity'</span><span class="p">:</span> <span class="s">'place_of_worship'</span><span class="p">,</span> <span class="s">'religion'</span><span class="p">:</span> <span class="s">'christian'</span><span class="p">,</span> <span class="s">'denomination'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$exists'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$group'</span><span class="p">:</span> <span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="s">'$denomination'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$sum'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}}},</span>
<span class="p">{</span><span class="s">'$sort'</span><span class="p">:</span> <span class="p">{</span><span class="s">'count'</span><span class="p">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">}},</span>
<span class="p">{</span><span class="s">'$limit'</span><span class="p">:</span> <span class="mi">10</span><span class="p">}]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">db</span><span class="p">,</span> <span class="n">christian_denominations</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{_id: catholic, count: 157},
{_id: baptist, count: 33},
{_id: roman_catholic, count: 31},
{_id: evangelical, count: 27},
{_id: spiritist, count: 20},
{_id: pentecostal, count: 19},
{_id: protestant, count: 14},
{_id: methodist, count: 10},
{_id: presbyterian, count: 3},
{_id: assemblies_of_god, count: 2}]
</code></pre></div>
<h3>Fast-food Sites Near the Sugar Loaf</h3>
<p>Suppose you are visiting the Sugar Loaf in Rio and you are suddenly starving! Where should you go?
MongoDB’s geospatial index to the rescue!</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pymongo</span> <span class="kn">import</span> <span class="n">GEO2D</span>
<span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">create_index</span><span class="p">([(</span><span class="s">'pos'</span><span class="p">,</span> <span class="n">GEO2D</span><span class="p">)])</span>
<span class="n">sugar_loaf</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find_one</span><span class="p">({</span><span class="s">'name'</span><span class="p">:</span> <span class="s">'Pão de Açúcar'</span><span class="p">,</span> <span class="s">'tourism'</span><span class="p">:</span> <span class="s">'attraction'</span><span class="p">})</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">rio</span><span class="o">.</span><span class="n">find</span><span class="p">(</span>
<span class="p">{</span><span class="s">'pos'</span><span class="p">:</span> <span class="p">{</span><span class="s">'$near'</span><span class="p">:</span> <span class="n">sugar_loaf</span><span class="p">[</span><span class="s">'pos'</span><span class="p">]},</span> <span class="s">'amenity'</span><span class="p">:</span> <span class="s">'fast_food'</span><span class="p">},</span>
<span class="p">{</span><span class="s">'_id'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s">'name'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'cuisine'</span><span class="p">:</span> <span class="mi">1</span><span class="p">})</span><span class="o">.</span><span class="n">skip</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="n">MyPrettyPrinter</span><span class="p">()</span><span class="o">.</span><span class="n">pprint</span><span class="p">([</span><span class="n">item</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">result</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">[{cuisine: corn, name: Tino},
{cuisine: sandwich, name: Max},
{cuisine: popcorn, name: França}]
</code></pre></div>
<p>Luckily there are Tino’s corn, Max’s sandwich and França’s popcorn to satisfy your hunger!</p>
<h3>Conclusion</h3>
<p>Data inserted by humans is almost certain to show inconsistencies. And even though a large part of it is inserted by bots, different bots may insert data using different patterns, so the inconsistency remains. On the other hand, this freedom in data input grants a lot of flexibility to users, and because of that the map may represent the real world even more faithfully than if there were key constraints or limitations.</p>
<p>In any case, for the purposes of this wrangling exercise the data has been well cleaned.</p>
<h3>References:</h3>
<h4>pprint Unicode</h4>
<ul>
<li><a href="http://stackoverflow.com/questions/10883399/unable-to-encode-decode-pprint-output">http://stackoverflow.com/questions/10883399/unable-to-encode-decode-pprint-output</a></li>
</ul>
<h4>MongoDB Geospatial Index</h4>
<ul>
<li><a href="https://docs.mongodb.com/v3.2/tutorial/build-a-2d-index/">https://docs.mongodb.com/v3.2/tutorial/build-a-2d-index/</a></li>
<li><a href="https://docs.mongodb.com/v3.2/tutorial/query-a-2d-index/">https://docs.mongodb.com/v3.2/tutorial/query-a-2d-index/</a></li>
<li><a href="http://api.mongodb.com/python/current/api/pymongo/collection.html?_ga=1.25837502.2095208423.1476211996#pymongo.collection.Collection.create_index">http://api.mongodb.com/python/current/api/pymongo/collection.html?_ga=1.25837502.2095208423.1476211996#pymongo.collection.Collection.create_index</a></li>
</ul>
<h4>Variety Open Source Tool</h4>
<ul>
<li><a href="https://github.com/variety/variety">https://github.com/variety/variety</a></li>
</ul>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comInvestigating the Titanic Dataset with Python2016-09-08T00:00:00+01:002016-09-08T00:00:00+01:00http://luizschiller.com/titanic<h4>Udacity Data Analyst Nanodegree</h4>
<h3>First Glance at Our Data</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s">'titanic_data.csv'</span>
<span class="n">titanic_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
</code></pre></div>
<p>First let’s take a quick look at what we’ve got:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>PassengerId</th>
<th>Survived</th>
<th>Pclass</th>
<th>Name</th>
<th>Sex</th>
<th>Age</th>
<th>SibSp</th>
<th>Parch</th>
<th>Ticket</th>
<th>Fare</th>
<th>Cabin</th>
<th>Embarked</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>0</td>
<td>3</td>
<td>Braund, Mr. Owen Harris</td>
<td>male</td>
<td>22.0</td>
<td>1</td>
<td>0</td>
<td>A/5 21171</td>
<td>7.2500</td>
<td>NaN</td>
<td>S</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1</td>
<td>1</td>
<td>Cumings, Mrs. John Bradley (Florence Briggs Th…</td>
<td>female</td>
<td>38.0</td>
<td>1</td>
<td>0</td>
<td>PC 17599</td>
<td>71.2833</td>
<td>C85</td>
<td>C</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>1</td>
<td>3</td>
<td>Heikkinen, Miss. Laina</td>
<td>female</td>
<td>26.0</td>
<td>0</td>
<td>0</td>
<td>STON/O2. 3101282</td>
<td>7.9250</td>
<td>NaN</td>
<td>S</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>1</td>
<td>1</td>
<td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>
<td>female</td>
<td>35.0</td>
<td>1</td>
<td>0</td>
<td>113803</td>
<td>53.1000</td>
<td>C123</td>
<td>S</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>0</td>
<td>3</td>
<td>Allen, Mr. William Henry</td>
<td>male</td>
<td>35.0</td>
<td>0</td>
<td>0</td>
<td>373450</td>
<td>8.0500</td>
<td>NaN</td>
<td>S</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<h3>Handling Missing Values</h3>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 66.2+ KB
</code></pre></div>
<p>From this initial observation we notice that, from 891 passenger records:
- 714 have valid ages;
- only 204 have cabin records;
- 2 embarkation ports are missing.</p>
<p>The rows with missing ages and embarkation values will be dropped whenever an analysis depends on them.</p>
<p>The cabin values are not going to be used in this analysis, so they will not be touched.</p>
<h3>Other Considerations</h3>
<p>I’m not going to analyze the numbers of Siblings/Spouses or Parents/Children in isolation. Instead, I’ll use whether or not a family member was aboard, represented by the ‘Family’ column.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Family'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'SibSp'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Parch'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div>
<p>We are also going to need a column stating whether a passenger is a child or an adult; 15 will be the childhood age threshold for this study.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="p">[</span><span class="s">'AgeRange'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">cut</span><span class="p">(</span><span class="n">titanic_df</span><span class="p">[</span><span class="s">'Age'</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">80</span><span class="p">],</span> <span class="n">labels</span><span class="o">=</span><span class="p">[</span><span class="s">'child'</span><span class="p">,</span> <span class="s">'adult'</span><span class="p">])</span>
</code></pre></div>
<p>Now I’m getting rid of the data we are not going to use:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'PassengerId'</span><span class="p">,</span> <span class="s">'Name'</span><span class="p">,</span> <span class="s">'SibSp'</span><span class="p">,</span> <span class="s">'Parch'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span> <span class="s">'Cabin'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>
<p>Which leaves us with the following columns, plus ‘Sex’, ‘Embarked’ and ‘Family’:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Survived</th>
<th>Pclass</th>
<th>Age</th>
<th>Fare</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>891.000000</td>
<td>891.000000</td>
<td>714.000000</td>
<td>891.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.383838</td>
<td>2.308642</td>
<td>29.699118</td>
<td>32.204208</td>
</tr>
<tr>
<th>std</th>
<td>0.486592</td>
<td>0.836071</td>
<td>14.526497</td>
<td>49.693429</td>
</tr>
<tr>
<th>min</th>
<td>0.000000</td>
<td>1.000000</td>
<td>0.420000</td>
<td>0.000000</td>
</tr>
<tr>
<th>25%</th>
<td>0.000000</td>
<td>2.000000</td>
<td>NaN</td>
<td>7.910400</td>
</tr>
<tr>
<th>50%</th>
<td>0.000000</td>
<td>3.000000</td>
<td>NaN</td>
<td>14.454200</td>
</tr>
<tr>
<th>75%</th>
<td>1.000000</td>
<td>3.000000</td>
<td>NaN</td>
<td>31.000000</td>
</tr>
<tr>
<th>max</th>
<td>1.000000</td>
<td>3.000000</td>
<td>80.000000</td>
<td>512.329200</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<p>We can see that approximately 38% of the passengers survived, and that the highest fare is over 15 times the average.</p>
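<p>As a quick sanity check, both figures can be recomputed directly from the dataframe:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Quick check of the two claims above, using the columns shown by describe().
survival_rate = titanic_df['Survived'].mean()
fare_ratio = titanic_df['Fare'].max() / titanic_df['Fare'].mean()
print('Survival rate: {:.1%}'.format(survival_rate))       # ~38.4%
print('Max fare / mean fare: {:.1f}x'.format(fare_ratio))  # ~15.9x
</code></pre></div>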
<h3>Let’s raise some questions:</h3>
<ol>
<li>What is the survival rate by class, sex and age? What about combining these factors?</li>
<li>Was the fare the same for men and women?</li>
<li>What fraction of the passengers embarked on each port? Is there a difference in their survival rates?</li>
<li>Is the presence of a family member a good indicator for survival?</li>
</ol>
<h3>1. What is the survival rate by class, sex and age? What about combining these factors?</h3>
<p>Let’s take a look at the distribution of passengers by age and fare, grouped by sex and class, and with survival information. It will give us some global insights about the data. But first, removing rows with missing ages:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df_clean_age</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'Age'</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">scatter_plot_class</span><span class="p">(</span><span class="n">pclass</span><span class="p">):</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">FacetGrid</span><span class="p">(</span><span class="n">titanic_df_clean_age</span><span class="p">[</span><span class="n">titanic_df_clean_age</span><span class="p">[</span><span class="s">'Pclass'</span><span class="p">]</span> <span class="o">==</span> <span class="n">pclass</span><span class="p">],</span>
<span class="n">col</span><span class="o">=</span><span class="s">'Sex'</span><span class="p">,</span>
<span class="n">col_order</span><span class="o">=</span><span class="p">[</span><span class="s">'male'</span><span class="p">,</span> <span class="s">'female'</span><span class="p">],</span>
<span class="n">hue</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span>
<span class="n">hue_kws</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">marker</span><span class="o">=</span><span class="p">[</span><span class="s">'v'</span><span class="p">,</span> <span class="s">'^'</span><span class="p">]),</span>
<span class="n">size</span><span class="o">=</span><span class="mi">6</span><span class="p">)</span>
<span class="n">g</span> <span class="o">=</span> <span class="p">(</span><span class="n">g</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">,</span> <span class="s">'Age'</span><span class="p">,</span> <span class="s">'Fare'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'w'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">80</span><span class="p">)</span><span class="o">.</span><span class="n">add_legend</span><span class="p">())</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">top</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="n">g</span><span class="o">.</span><span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'CLASS {}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">pclass</span><span class="p">))</span>
<span class="c"># plotted separately because the fare scale for the first class makes it difficult to visualize second and third class charts</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">scatter_plot_class</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div>
<p><img src="/assets/images/titanic/output_20_0.png" alt="png"></p>
<p><img src="/assets/images/titanic/output_20_1.png" alt="png"></p>
<p><img src="/assets/images/titanic/output_20_2.png" alt="png"></p>
<p>It seems that women had a much higher survival rate, especially in the first and second classes. Children also seem to have a higher survival rate, again especially in the first and second classes. Let’s find out the survival rate by class, sex and age range, and plot the results for a better understanding:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_class</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Pclass'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_class</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass
1 0.655914
2 0.479769
3 0.239437
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_sex</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 0.754789
male 0.205298
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_age</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'AgeRange'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_age</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">AgeRange
child 0.590361
adult 0.381933
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">,</span><span class="n">axis3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_class</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Class'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Survival Rate'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5F9E6E'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Sex'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#B55D60'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Age Range'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">(0.0, 1.0)
</code></pre></div>
<p><img src="/assets/images/titanic/output_25_1.png" alt="png"></p>
<p>As expected (since we all watched the Titanic movie 😉), the first class has a higher survival rate than the second, which has a higher survival rate than the third, and women and children have a higher chance of survival than men and adults, respectively.</p>
<p>Now combining the three factors and visualizing the plots:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">grouped_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span>
<span class="p">[</span><span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">,</span> <span class="s">'AgeRange'</span><span class="p">])[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span>
<span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">,</span> <span class="s">'AgeRange'</span><span class="p">])[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()],</span>
<span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">grouped_data</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Count'</span><span class="p">]</span>
<span class="n">grouped_data</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th></th>
<th>Survived</th>
<th>Count</th>
</tr>
<tr>
<th>Pclass</th>
<th>Sex</th>
<th>AgeRange</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="4" valign="top">1</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>0.666667</td>
<td>3</td>
</tr>
<tr>
<th>adult</th>
<td>0.975610</td>
<td>82</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>1.000000</td>
<td>3</td>
</tr>
<tr>
<th>adult</th>
<td>0.377551</td>
<td>98</td>
</tr>
<tr>
<th rowspan="4" valign="top">2</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>1.000000</td>
<td>10</td>
</tr>
<tr>
<th>adult</th>
<td>0.906250</td>
<td>64</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>1.000000</td>
<td>9</td>
</tr>
<tr>
<th>adult</th>
<td>0.066667</td>
<td>90</td>
</tr>
<tr>
<th rowspan="4" valign="top">3</th>
<th rowspan="2" valign="top">female</th>
<th>child</th>
<td>0.533333</td>
<td>30</td>
</tr>
<tr>
<th>adult</th>
<td>0.430556</td>
<td>72</td>
</tr>
<tr>
<th rowspan="2" valign="top">male</th>
<th>child</th>
<td>0.321429</td>
<td>28</td>
</tr>
<tr>
<th>adult</th>
<td>0.128889</td>
<td>225</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">factorplot</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="s">'AgeRange'</span><span class="p">,</span>
<span class="n">y</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span>
<span class="n">col</span><span class="o">=</span><span class="s">'Pclass'</span><span class="p">,</span>
<span class="n">row</span><span class="o">=</span><span class="s">'Sex'</span><span class="p">,</span>
<span class="n">data</span><span class="o">=</span><span class="n">titanic_df_clean_age</span><span class="p">,</span>
<span class="n">margin_titles</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">kind</span><span class="o">=</span><span class="s">"bar"</span><span class="p">,</span>
<span class="n">ci</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre></div>
<p><img src="/assets/images/titanic/output_28_0.png" alt="png"></p>
<p>Analysing the three factors combined also gives the expected results. It is interesting to see that even women from the third class have a higher survival rate than men from the first, which indicates that saving women had a higher priority than saving the richer classes.</p>
<p>Saving children also seems to have been a priority: children had a higher survival rate than adults in every combination of class and sex except first-class females, where one of the three girls died.</p>
<p>So we can conclude that saving women and children was indeed a priority on the Titanic shipwreck.</p>
<h3>2. Was the fare the same for men and women?</h3>
<p>While looking at the scatter plots from the first question, I noticed that women seemed to be more spread out along the ‘Fare’ axis, which motivated me to check whether the average fare paid by women was really higher than men’s.</p>
<p>Let’s check the mean fare paid by each sex:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fare_by_sex</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Fare'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">fare_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">fare_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Fare Average and Sex'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Fare'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0xa7d69b0>
</code></pre></div>
<p><img src="/assets/images/titanic/output_32_1.png" alt="png"></p>
<p>It indeed seems that women paid way more than men on average. Women’s average fare is higher than I expected. Maybe it is due to the women of the first class. Let’s group the data by class and check it out:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fare_by_class_sex</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'Pclass'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">])[</span><span class="s">'Fare'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">fare_by_class_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass Sex
1 female 106.125798
male 67.226127
2 female 21.970121
male 19.741782
3 female 16.118810
male 12.661633
Name: Fare, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">fare_by_class_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">4</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s">'Fare Average by Class and Sex'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Fare'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0x8c7e590>
</code></pre></div>
<p><img src="/assets/images/titanic/output_35_1.png" alt="png"></p>
<p>The average fare paid by women is higher than men’s in every class, although the second-class fares are almost equal.
I wonder why women paid more… Maybe they demanded more privileges than men, but who knows…</p>
<h3>3. What fraction of the passengers embarked on each port? Is there a difference in their survival rates?</h3>
<p>Just for curiosity’s sake, let’s find out the proportion of passengers who embarked at each port (C = Cherbourg; Q = Queenstown; S = Southampton) and their survival rates. But first, removing rows with missing embarkation values:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">titanic_df_clean_embarked</span> <span class="o">=</span> <span class="n">titanic_df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">])</span>
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">embarked</span> <span class="o">=</span> <span class="n">titanic_df_clean_embarked</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Embarked'</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">embarked</span><span class="p">[</span><span class="s">'Count'</span><span class="p">]</span> <span class="o">=</span> <span class="n">titanic_df_clean_embarked</span><span class="p">[</span><span class="s">'Embarked'</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
<span class="n">embarked</span>
</code></pre></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Survived</th>
<th>Pclass</th>
<th>Age</th>
<th>Fare</th>
<th>Family</th>
<th>Count</th>
</tr>
<tr>
<th>Embarked</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>C</th>
<td>0.553571</td>
<td>1.886905</td>
<td>30.814769</td>
<td>59.954144</td>
<td>0.494048</td>
<td>168</td>
</tr>
<tr>
<th>Q</th>
<td>0.389610</td>
<td>2.909091</td>
<td>28.089286</td>
<td>13.276030</td>
<td>0.259740</td>
<td>77</td>
</tr>
<tr>
<th>S</th>
<td>0.336957</td>
<td>2.350932</td>
<td>29.445397</td>
<td>27.079812</td>
<td>0.389752</td>
<td>644</td>
</tr>
</tbody>
</table>
</div>
<p><br></p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">countplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Embarked'</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">titanic_df_clean_embarked</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="p">[</span><span class="s">'S'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">embarked</span><span class="o">.</span><span class="n">index</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'Survived'</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">embarked</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="p">[</span><span class="s">'S'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'Q'</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.axes._subplots.AxesSubplot at 0xa98cb30>
</code></pre></div>
<p><img src="/assets/images/titanic/output_40_1.png" alt="png"></p>
<p>The survival rate for passengers who embarked at Cherbourg is higher than that of both other ports. That is no surprise: the mean ‘Pclass’ value for this port is 1.89, well below Queenstown’s 2.91 and Southampton’s 2.35, which means that the people who embarked there belonged to richer classes, and we have already seen that those classes had better survival rates than the poorer ones.</p>
<h3>4. Is the presence of a family member a good indicator for survival?</h3>
<p>Finally, let’s check if having a family member aboard means a higher survival chance:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">survived_by_family</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Family'</span><span class="p">)[</span><span class="s">'Survived'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">survived_by_family</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Family
False 0.321782
True 0.516129
Name: Survived, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">ax</span> <span class="o">=</span> <span class="n">survived_by_family</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Survival Rate by Family Presence'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Survival Rate'</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang=""><matplotlib.text.Text at 0xaa26b90>
</code></pre></div>
<p><img src="/assets/images/titanic/output_44_1.png" alt="png"></p>
<p>The data shows that having a family member aboard indicates a better chance of survival. But why is that? Let’s check some other numbers about family presence, such as its relation to class, sex and age range:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_class</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Pclass'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_class</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Pclass
1 0.537634
2 0.462428
3 0.366197
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_sex</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Sex'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_sex</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">Sex
female 0.616858
male 0.328918
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">family_by_age</span> <span class="o">=</span> <span class="n">titanic_df_clean_age</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'AgeRange'</span><span class="p">)[</span><span class="s">'Family'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">family_by_age</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">AgeRange
child 0.927711
adult 0.369255
Name: Family, dtype: float64
</code></pre></div><div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">axis1</span><span class="p">,</span><span class="n">axis2</span><span class="p">,</span><span class="n">axis3</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">6</span><span class="p">))</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_class</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis1</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5975A4'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Class'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Average Family Presence'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_sex</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#5F9E6E'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Sex'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">family_by_age</span><span class="o">.</span><span class="n">plot</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axis3</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#B55D60'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">'Family Presence by Age Range'</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span><span class="mf">1.0</span><span class="p">)</span>
</code></pre></div><div class="highlight"><pre><code class="language-" data-lang="">(0.0, 1.0)
</code></pre></div>
<p><img src="/assets/images/titanic/output_49_1.png" alt="png"></p>
<p>We can see that family presence is higher among:
- first class passengers;
- women;
- children.</p>
<p>We have already seen that these three groups show higher survival rates, so the higher survival rate among passengers with family members may owe more to these factors than to the presence of family itself.</p>
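<p>One way to probe this further (a sketch only, reusing the <code>titanic_df_clean_age</code> DataFrame from above) is to compare survival rates by family presence within each class, sex and age range, so that passengers are compared with otherwise similar passengers:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># Survival rate by family presence, controlling for class, sex and age range.
# If the family effect shrinks within each group, the confounders explain part of it.
controlled = titanic_df_clean_age.groupby(['Pclass', 'Sex', 'AgeRange', 'Family'])['Survived'].mean()
print(controlled.unstack('Family'))
</code></pre></div>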
<h3>Conclusion</h3>
<p>All the results presented in this report only show correlations in the data. It is important to highlight that correlation does not imply causation. To make statistically valid statements, tests such as chi-squared tests and t-tests should be applied.</p>
<p>To determine whether class, sex and age are related to survival, we could run four chi-squared tests - one for each variable individually and one for all of them combined - and check whether they really do matter, as this study suggests.</p>
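<p>As an illustration only (not a test actually run in this study), a chi-squared test of independence for a single variable such as ‘Sex’ could be set up with <code>scipy.stats.chi2_contingency</code> on a contingency table built with <code>pandas.crosstab</code>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of sex vs. survival counts, assuming the cleaned DataFrame above.
contingency = pd.crosstab(titanic_df_clean_age['Sex'], titanic_df_clean_age['Survived'])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(contingency)
print('chi2 = {:.2f}, p = {:.4f}'.format(chi2, p_value))
</code></pre></div>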
<p>The same applies to finding out whether the port of embarkation or the presence of a family member is related to survival.</p>
<p>To find out whether the average fare was the same for men and women, we would hypothesize that there is no difference and then run a t-test to check whether the observed difference is significant, as this study suggests.</p>
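<p>A sketch of such a test (assuming the cleaned DataFrame keeps the ‘Fare’ column, and using Welch’s variant so equal variances are not assumed):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">from scipy.stats import ttest_ind

# Independent two-sample t-test on the fares paid by men and women.
male_fares = titanic_df_clean_age[titanic_df_clean_age['Sex'] == 'male']['Fare']
female_fares = titanic_df_clean_age[titanic_df_clean_age['Sex'] == 'female']['Fare']
t_stat, p_value = ttest_ind(male_fares, female_fares, equal_var=False)
print('t = {:.2f}, p = {:.4f}'.format(t_stat, p_value))
</code></pre></div>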
<p>Thank you for reading!</p>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.comStroop Effect - Testing a Perceptual Phenomenon2016-08-20T00:00:00+01:002016-08-20T00:00:00+01:00http://luizschiller.com/stroop-effect<h4>Udacity Data Analyst Nanodegree</h4>
<h3>Project Overview</h3>
<p>In this project, you will investigate a classic phenomenon from experimental psychology called the <a href="https://en.wikipedia.org/wiki/Stroop_effect">Stroop Effect</a>. You will learn a little bit about the experiment, create a hypothesis regarding the outcome of the task, then go through the task yourself. You will then look at some data collected from others who have performed the same task and will compute some statistics describing the results. Finally, you will interpret your results in terms of your hypotheses.</p>
<p>Find the spreadsheet with the calculations here: <a href="https://docs.google.com/spreadsheets/d/194Vc8K5SPjlEYZL97j4oDCDcvbP2ZrwNA6rtQ4NVMKQ/edit?usp=sharing">https://docs.google.com/spreadsheets/d/194Vc8K5SPjlEYZL97j4oDCDcvbP2ZrwNA6rtQ4NVMKQ/edit?usp=sharing</a></p>
<h3>1. What is our independent variable? What is our dependent variable?</h3>
<p>Independent: the words condition (congruent or incongruent);
Dependent: the time it takes to name the ink colors.</p>
<h3>2. What is an appropriate set of hypotheses for this task? What kind of statistical test do you expect to perform? Justify your choices.</h3>
<p><strong>Null hypothesis (H0)</strong>: The mean time for the population to name the ink colors is equal for the Congruent and Incongruent conditions (μC = μI);</p>
<p><strong>Alternative Hypothesis (H1)</strong>: The mean time for the population to name the ink colors is different for the Congruent and Incongruent conditions (μC ≠ μI).</p>
<p>We expect to perform a paired t-test (see the sketch after this list), because:
- We assume the distributions are normal;
- The two samples are dependent;
- We do not know the population’s standard deviation;
- The sample size is below 30.</p>
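<p>A minimal sketch of how this test could be run in Python (the file name <code>stroopdata.csv</code> and its ‘Congruent’/‘Incongruent’ columns are assumptions about how the dataset is laid out):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd
from scipy.stats import ttest_rel

# Each row holds one participant's times under both conditions, so the samples are paired.
stroop = pd.read_csv('stroopdata.csv')

# Paired (dependent) t-test on the two conditions.
t_stat, p_value = ttest_rel(stroop['Congruent'], stroop['Incongruent'])
print('t = {:.3f}, p = {:.6f}'.format(t_stat, p_value))
</code></pre></div>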
<h3>3. Report some descriptive statistics regarding this dataset. Include at least one measure of central tendency and at least one measure of variability.</h3>
<p>Mean difference: -7.96
Standard deviation of the difference: 4.86
Standard error of the mean difference: .99</p>
<h3>4. Provide one or two visualizations that show the distribution of the sample data. Write one or two sentences noting what you observe about the plot or plots.</h3>
<p><img src="/assets/images/stroop-effect/scatterplot.png" alt="Incongruent vs Congruent">
The scatter plot shows some degree of correlation between the two samples.</p>
<p><img src="/assets/images/stroop-effect/histograms.png" alt="Histograms">
The histograms show that the times in the incongruent sample are generally larger than in the congruent sample.</p>
<h3>5. Now, perform the statistical test and report your results. What is your confidence level and your critical statistic value? Do you reject the null hypothesis or fail to reject it? Come to a conclusion in terms of the experiment task. Did the results match up with your expectations?</h3>
<p>Confidence level = 99%</p>
<p>Alpha = .01</p>
<p>t-critical two-tailed = +-2.807</p>
<p>t-statistic = -8.021</p>
<p>r² = .737</p>
<p>Our t-statistic is less than the negative t-critical (-8.021 < -2.807) so we reject the null hypothesis.</p>
<p>This result means that the difference between the congruent and incongruent samples is statistically significant. Based on our r², about 73.7% of the variance in naming times is accounted for by the word condition (congruent or incongruent). The results match my expectations.</p>
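<p>These figures can be reproduced from the descriptive statistics above (a worked sketch; the sample size of 24, and hence 23 degrees of freedom, is an assumption that is consistent with the reported standard error):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math

# Reported descriptive statistics (see question 3 above).
mean_diff = -7.96
std_diff = 4.86
n = 24                              # assumed sample size; gives SE = 4.86 / sqrt(24), roughly 0.99
df = n - 1

se = std_diff / math.sqrt(n)        # standard error of the mean difference
t_stat = mean_diff / se             # about -8.0, matching the reported -8.021 up to rounding
r_squared = t_stat ** 2 / (t_stat ** 2 + df)  # about 0.737

print('SE = {:.2f}, t = {:.2f}, r^2 = {:.3f}'.format(se, t_stat, r_squared))
</code></pre></div>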
<h3>6. Optional: What do you think is responsible for the effects observed? Can you think of an alternative or similar task that would result in a similar effect? Some research about the problem will be helpful for thinking about these two questions!</h3>
<p>Since understanding the meaning of words is an automatic process built by habitual reading, while recognizing colors is not, the brain spends attentional resources on the word itself, which interferes with color recognition.
A similar experiment could show up or down arrows placed randomly above or below a central point (incongruent), and compare the response times with a condition in which up arrows appear above and down arrows appear below the point (congruent).</p>
<h3>REFERENCES:</h3>
<p><a href="https://en.wikipedia.org/wiki/Stroop_effect">https://en.wikipedia.org/wiki/Stroop_effect</a></p>Luiz Gustavo Schillerschillerbr@gmail.comhttp://luizschiller.com