Web scraping is a useful skill whether or not you are a data scientist. Much of the information that lives on the web is worth collecting, analyzing, and processing. This post on web scraping with Python is divided into three parts:
• Extracting data from the web using the Beautiful Soup module
• Manipulating and cleaning data using the pandas library
• Visualizing data using Matplotlib
The dataset used in this tutorial was taken from a race that took place in Hillsboro, OR in June 2017. With it, we can answer questions such as:
• What was the average finish time of the runners?
• Do the finish times follow a normal distribution?
• Were there differences in finish times between men and women of different age groups?
Web Scraping with BeautifulSoup
In a Jupyter Notebook, start by importing the necessary libraries (pandas, numpy, matplotlib.pyplot, seaborn). If you do not have Jupyter Notebook installed, Anaconda is a recommended way to get it. To display the plots inside the notebook, make sure to include the line %matplotlib inline, as shown below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The libraries specific to web scraping are urllib.request, which is used to open URLs, and the BeautifulSoup package, which is used to extract data from HTML files.
from urllib.request import urlopen
from bs4 import BeautifulSoup
Next, pass the URL that contains the dataset to urlopen() to get the HTML of the page.
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
With the HTML of the page in hand, the next step is to create a BeautifulSoup object by passing the HTML to BeautifulSoup(). This package parses the raw HTML into Python objects. The second argument, 'lxml', names the HTML parser to use.
soup = BeautifulSoup(html, 'lxml')
type(soup)
bs4.BeautifulSoup
The soup object lets you extract interesting information about the website, such as its title, as shown here.
# Get the title
title = soup.title
print(title)
<title>2017 Intel Great Place to Run 10K \ Urban Clash Games Race Results</title>
You can also get the text of the page and print it to check that it is what you expect.
# Print out the text
text = soup.get_text()
#print(soup.text)
You can view the HTML of a web page by right-clicking anywhere on it and selecting "Inspect".

Using soup's find_all() method, you can extract all HTML tags of a given type from a web page. Useful examples of tags are <a> for hyperlinks, <table> for tables, <tr> for table rows, <th> for table headers, and <td> for table cells.
soup.find_all('a')
[<a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button">5K</a>,
<a href="http://hubertiming.com/">Huber Timing Home</a>,
<a href="#individual">Individual Results</a>,
<a href="#team">Team Results</a>,
<a href="mailto:[email protected]">[email protected]</a>,
<a href="#tabs-1" style="font-size: 18px">Results</a>,
<a name="individual"></a>,
<a name="team"></a>,
<a href="http://www.hubertiming.com/"><img height="65" src="/sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>,
<a href="http://facebook.com/hubertiming/"><img src="/results/FB-f-Logo__blue_50.png"/></a>]
As the output above shows, HTML tags sometimes come with attributes such as class, src, etc. that provide additional information about the elements. You can use a for loop and the get("href") method to extract and print only the hyperlinks.
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))
/results/2017GPTR
http://hubertiming.com/
#individual
#team
mailto:[email protected]
#tabs-1
None
None
http://www.hubertiming.com/
http://facebook.com/hubertiming/
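The two None entries in the output come from <a> tags that carry a name attribute but no href. A minimal sketch of skipping them, assuming Beautiful Soup's attribute filter href=True: the HTML string below is a hypothetical excerpt modeled on the anchors printed above, and 'html.parser' (Python's built-in parser) is used instead of 'lxml' only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt modeled on the anchors printed above
sample_html = """
<a href="#individual">Individual Results</a>
<a name="individual"></a>
<a href="http://hubertiming.com/">Huber Timing Home</a>
"""

# 'html.parser' is Python's built-in parser (an assumption made for this sketch)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)
```

With this filter, the anchor that only has a name attribute never reaches the list, so no None values need to be handled afterwards.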
To print only table rows, pass the argument 'tr' to soup.find_all().
# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])
[<tr><td>Finishers:</td><td>577</td></tr>, <tr><td>Male:</td><td>414</td></tr>, <tr><td>Female:</td><td>163</td></tr>, <tr class="header">
<th>Place</th>
<th>Bib</th>
<th>Name</th>
<th>Gender</th>
<th>City</th>
<th>State</th>
<th>Chip Time</th>
<th>Chip Pace</th>
<th>Gender Place</th>
<th>Age Group</th>
<th>Age Group Place</th>
<th>Time to Start</th>
<th>Gun Time</th>
<th>Team</th>
</tr>, <tr>
<td>1</td>
<td>814</td>
<td>JARED WILSON</td>
<td>M</td>
<td>TIGARD</td>
<td>OR</td>
<td>00:36:21</td>
<td>05:51</td>
<td>1 of 414</td>
<td>M 36-45</td>
<td>1 of 152</td>
<td>00:00:03</td>
<td>00:36:24</td>
<td></td>
</tr>, <tr>
<td>2</td>
<td>573</td>
<td>NATHAN A SUSTERSIC</td>
<td>M</td>
<td>PORTLAND</td>
<td>OR</td>
<td>00:36:42</td>
<td>05:55</td>
<td>2 of 414</td>
<td>M 26-35</td>
<td>1 of 154</td>
<td>00:00:03</td>
<td>00:36:45</td>
<td>INTEL TEAM F</td>
</tr>, <tr>
<td>3</td>
<td>687</td>
<td>FRANCISCO MAYA</td>
<td>M</td>
<td>PORTLAND</td>
<td>OR</td>
<td>00:37:44</td>
<td>06:05</td>
<td>3 of 414</td>
<td>M 46-55</td>
<td>1 of 64</td>
<td>00:00:04</td>
<td>00:37:48</td>
<td></td>
</tr>, <tr>
<td>4</td>
<td>623</td>
<td>PAUL MORROW</td>
<td>M</td>
<td>BEAVERTON</td>
<td>OR</td>
<td>00:38:34</td>
<td>06:13</td>
<td>4 of 414</td>
<td>M 36-45</td>
<td>2 of 152</td>
<td>00:00:03</td>
<td>00:38:37</td>
<td></td>
</tr>, <tr>
<td>5</td>
<td>569</td>
<td>DEREK G OSBORNE</td>
<td>M</td>
<td>HILLSBORO</td>
<td>OR</td>
<td>00:39:21</td>
<td>06:20</td>
<td>5 of 414</td>
<td>M 26-35</td>
<td>2 of 154</td>
<td>00:00:03</td>
<td>00:39:24</td>
<td>INTEL TEAM F</td>
</tr>, <tr>
<td>6</td>
<td>642</td>
<td>JONATHON TRAN</td>
<td>M</td>
<td>PORTLAND</td>
<td>OR</td>
<td>00:39:49</td>
<td>06:25</td>
<td>6 of 414</td>
<td>M 18-25</td>
<td>1 of 34</td>
<td>00:00:06</td>
<td>00:39:55</td>
<td></td>
</tr>]
The goal is to take the table from the web page and convert it into a dataframe so the data is easier to work with in Python. To get there, we use a loop that iterates through the rows and collects the cells of each row. Printing after the loop shows the cells of the last row.
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)
[<td>14TH</td>, <td>INTEL TEAM M</td>, <td>04:43:23</td>, <td>00:58:59 - DANIELLE CASILLAS</td>, <td>01:02:06 - RAMYA MERUVA</td>, <td>01:17:06 - PALLAVI J SHINDE</td>, <td>01:25:11 - NALINI MURARI</td>]
bs4.element.ResultSet
The output shows that the row is printed with its HTML tags, which is not what we want. The tags can be removed with Beautiful Soup's get_text() method.
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]
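An alternative sketch, not part of the original walkthrough, avoids converting the whole ResultSet to a string: it calls get_text() on each <td> cell and keeps every row as a list of values. The HTML below is a hypothetical two-row excerpt modeled on the results table, and 'html.parser' is an assumption made to keep the sketch free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row excerpt modeled on the results table
sample_html = """
<table>
<tr><th>Place</th><th>Bib</th><th>Name</th><th>Gender</th></tr>
<tr><td>1</td><td>814</td><td>JARED WILSON</td><td>M</td></tr>
<tr><td>2</td><td>573</td><td>NATHAN A SUSTERSIC</td><td>M</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

table_rows = []
for tr in sample_soup.find_all("tr"):
    # get_text(strip=True) returns the cell text without surrounding whitespace
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has <th> cells only, so it yields an empty list
        table_rows.append(cells)

print(table_rows)
```

Because each row is already a list of clean strings, there are no brackets or commas to split and strip afterwards.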
Removing the tags with regular expressions is also possible, but it is highly discouraged since it requires several lines of code and is easy to get wrong. It requires importing the re (regular expressions) module. The code below shows how to build a regular expression that finds every HTML tag, such as <td>, in each table row and replaces it with an empty string.
First, you compile a regular expression by passing the pattern to re.compile(). In the pattern <.*?>, the dot, star, and question mark (.*?) sit between an opening and a closing angle bracket, so the pattern matches an opening angle bracket, followed by anything, followed by a closing angle bracket. It matches text in a non-greedy fashion, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket. After compiling a regular expression, you can use re.sub() to find all the substrings that match it and replace them with an empty string. The full code below creates an empty list, extracts the text between the HTML tags for each row, and appends it to the list.
import re

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = (re.sub(clean, '', str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 - DANIELLE CASILLAS, 01:02:06 - RAMYA MERUVA, 01:17:06 - PALLAVI J SHINDE, 01:25:11 - NALINI MURARI]
str
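The difference between the non-greedy and greedy patterns described above can be checked on a single row. This is a small sketch using only the standard re module; the row string is shortened from the scraped output.

```python
import re

# One table row, shortened from the scraped output shown earlier
row_html = "<td>14TH</td>, <td>INTEL TEAM M</td>, <td>04:43:23</td>"

# Non-greedy: each match stops at the first '>', so only the tags are removed
non_greedy = re.sub(r"<.*?>", "", row_html)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r"<.*>", "", row_html)

print(repr(non_greedy))  # '14TH, INTEL TEAM M, 04:43:23'
print(repr(greedy))      # ''
```

The greedy version deletes the data along with the tags, which is why the question mark matters here.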
The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using Pandas.
df = pd.DataFrame(list_rows)
df.head(10)
  | 0 |
---|---|
0 | [Finishers:, 577] |
1 | [Male:, 414] |
2 | [Female:, 163] |
3 | [] |
4 | [1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21… |
5 | [2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR, … |
6 | [3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:3… |
7 | [4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:… |
8 | [5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00… |
9 | [6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39… |
Data Manipulation and Cleaning
The dataframe is not in the format we want. To clean it up, split column 0 into multiple columns at each comma using the str.split() method.
df1 = df[0].str.split(',', expand=True)
df1.head(10)

The dataframe has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket in column 0.
df1[0] = df1[0].str.strip('[')
df1.head(10)
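To see what these two calls do to a single row, here is a plain-Python sketch; for this case the pandas str.split() and str.strip() methods behave like the built-in string methods. The row string is shortened from the output above. Splitting on ',' leaves a leading space on every field after the first, which is why column names such as ' Chip Time' and ' Gender' start with a space later in this post.

```python
# One row string, shortened from the dataframe output above
row = "[1, 814, JARED WILSON, M, TIGARD, OR]"

# Splitting on ',' keeps the space that follows each comma
fields = row.split(",")
print(fields)

# strip('[') only removes '[' characters at the ends of the string
first = fields[0].strip("[")
print(first)

# Optional extra cleaning (not done in the tutorial): drop spaces and both brackets
cleaned = [field.strip(" []") for field in fields]
print(cleaned)
```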

The table is missing its headers. You can use the find_all() method to get the table headers.
col_labels = soup.find_all('th')
As with the table rows, you can use Beautiful Soup to extract the text between the HTML tags of the headers.
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
print(all_header)
['[Place, Bib, Name, Gender, City, State, Chip Time, Chip Pace, Gender Place, Age Group, Age Group Place, Time to Start, Gun Time, Team]']
Next, convert the list of headers into a pandas dataframe.
df2 = pd.DataFrame(all_header)
df2.head()
  | 0 |
---|---|
0 | [Place, Bib, Name, Gender, City, State, Chip T… |
Similarly, you can split column 0 into multiple columns at each comma for all rows.
df3 = df2[0].str.split(',', expand=True)
df3.head()

The two dataframes can be concatenated into one using the concat() method.
frames = [df3, df1]
df4 = pd.concat(frames)
df4.head(10)

Below is how to assign the first row as the table header.
df5 = df4.rename(columns=df4.iloc[0])
df5.head()
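As an alternative sketch, when the header names and the rows are already available as Python lists, pandas can attach the headers at construction time through the columns argument, which avoids the concatenate-then-rename step. The headers and records lists below are hypothetical, shortened stand-ins for the scraped data.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the scraped headers and rows
headers = ["Place", "Bib", "Name", "Gender"]
records = [
    ["1", "814", "JARED WILSON", "M"],
    ["2", "573", "NATHAN A SUSTERSIC", "M"],
]

# columns= labels the dataframe directly, so no header row ends up in the data
results = pd.DataFrame(records, columns=headers)
print(results)
```

With this approach the header never appears as a data row, so there is nothing to drop later.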

With the table now in shape, you can get an overview of the data for analysis.
df5.info()
df5.shape
<class 'pandas.core.frame.DataFrame'>
Int64Index: 597 entries, 0 to 595
Data columns (total 14 columns):
[Place 597 non-null object
Bib 596 non-null object
Name 593 non-null object
Gender 593 non-null object
City 593 non-null object
State 593 non-null object
Chip Time 593 non-null object
Chip Pace 578 non-null object
Gender Place 578 non-null object
Age Group 578 non-null object
Age Group Place 578 non-null object
Time to Start 578 non-null object
Gun Time 578 non-null object
Team] 578 non-null object
dtypes: object(14)
memory usage: 70.0+ KB
(597, 14)
The table has 597 rows and 14 columns. You can drop all rows with any missing values.
df6 = df5.dropna(axis=0, how='any')
Notice also that the table header is replicated as the first row in df5. It can be dropped as follows.
df7 = df6.drop(df6.index[0])
df7.head()

You can clean the data further by renaming the '[Place' and ' Team]' columns. Python is very particular about whitespace, so make sure to include the space after the opening quotation mark in ' Team]'.
df7.rename(columns={'[Place': 'Place'}, inplace=True)
df7.rename(columns={' Team]': 'Team'}, inplace=True)
df7.head()

The final step is to remove the closing square bracket from the cells in the "Team" column.
df7['Team'] = df7['Team'].str.strip(']')
df7.head()

Done!
Data Analysis and Visualization
The first question is the average finish time of the runners.
To answer it, the "Chip Time" column must be converted to minutes. One way to do this is to first convert the column to a list for manipulation.
time_list = df7[' Chip Time'].tolist()
# You can use a for loop to convert 'Chip Time' to minutes
time_mins = []
for i in time_list:
    h, m, s = i.split(':')
    # total seconds divided by 60 gives the chip time in minutes
    minutes = (int(h) * 3600 + int(m) * 60 + int(s)) / 60
    time_mins.append(minutes)
#print(time_mins)
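The arithmetic inside the loop can be checked on one value. The helper below is a hypothetical name introduced only for this sketch. It converts the winner's chip time from the results table, 00:36:21, which is 36 × 60 + 21 = 2181 seconds.

```python
def chip_time_to_minutes(chip_time: str) -> float:
    """Convert an 'HH:MM:SS' string to minutes (hypothetical helper)."""
    # strip() tolerates the leading space left over from splitting on ','
    hours, minutes, seconds = (int(part) for part in chip_time.strip().split(":"))
    return (hours * 3600 + minutes * 60 + seconds) / 60


fastest = chip_time_to_minutes(" 00:36:21")
print(fastest)
```

2181 seconds divided by 60 is 36.35 minutes, the same value that appears as the minimum of Runner_mins in the summary statistics.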
The next step is to add the list back to the dataframe as a new column ("Runner_mins") holding the finish times expressed in minutes.
df7['Runner_mins'] = time_mins
df7.head()

Below is how to calculate statistics for the numeric columns in a dataframe.
df7.describe(include=[np.number])
  | Runner_mins |
---|---|
count | 577.000000 |
mean | 60.035933 |
std | 11.970623 |
min | 36.350000 |
25% | 51.000000 |
50% | 59.016667 |
75% | 67.266667 |
max | 101.300000 |
Interestingly, the average chip time for all runners was about 60 minutes. The fastest runner finished in 36.35 minutes, and the slowest in 101.30 minutes.
A boxplot is another useful tool for visualizing summary statistics (maximum, minimum, median, first quartile, third quartile, including outliers).
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
df7.boxplot(column='Runner_mins')
plt.grid(True, axis='y')
plt.ylabel('Chip Time')
plt.xticks([1], ['Runners'])
([<matplotlib.axis.XTick at 0x570dd106d8>],
<a list of 1 Text xticklabel objects>)

The second question is whether the finish times follow a normal distribution.
For this we use the seaborn library to plot the distribution, which looks almost normal.
x = df7['Runner_mins']
ax = sns.distplot(x, hist=True, kde=True, rug=False, color='m', bins=25, hist_kws={'edgecolor':'black'})
plt.show()
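As a rough numerical complement to the plot, and not a formal normality test, the quartiles reported by df7.describe() can be compared with the quartiles of a normal distribution that has the same mean and standard deviation. The sketch below uses only the standard library's statistics.NormalDist, and the numbers are copied from the summary statistics shown earlier.

```python
from statistics import NormalDist

# Summary statistics reported by df7.describe() earlier in the post
mean, std = 60.035933, 11.970623
observed_q1, observed_median, observed_q3 = 51.0, 59.016667, 67.266667

# Quartiles that a normal distribution with this mean and std would have
fitted = NormalDist(mu=mean, sigma=std)
expected_q1 = fitted.inv_cdf(0.25)
expected_q3 = fitted.inv_cdf(0.75)

print(f"Q1:     observed {observed_q1:.2f}, normal {expected_q1:.2f}")
print(f"Median: observed {observed_median:.2f}, normal {mean:.2f}")
print(f"Q3:     observed {observed_q3:.2f}, normal {expected_q3:.2f}")
```

The observed quartiles sit within about a minute of the normal ones, which is consistent with a distribution that looks almost, but not exactly, normal.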

The third question is whether there were performance differences between men and women of different age groups.
f_fuko = df7.loc[df7[' Gender']==' F']['Runner_mins']
m_fuko = df7.loc[df7[' Gender']==' M']['Runner_mins']
sns.distplot(f_fuko, hist=True, kde=True, rug=False, hist_kws={'edgecolor':'black'}, label='Female')
sns.distplot(m_fuko, hist=False, kde=True, rug=False, hist_kws={'edgecolor':'black'}, label='Male')
plt.legend()
<matplotlib.legend.Legend at 0x570e301fd0>

The distribution indicates that women were slower than men on average. You can use the groupby() method to compute summary statistics for men and women separately.
g_stats = df7.groupby(" Gender", as_index=True).describe()
print(g_stats)
        Runner_mins                                                         \
              count       mean        std        min        25%        50%
 Gender
 F            163.0  66.119223  12.184440  43.766667  58.758333  64.616667
 M            414.0  57.640821  11.011857  36.350000  49.395833  55.791667

               75%         max
 Gender
 F       72.058333  101.300000
 M       64.804167   98.516667
The average chip time was ~66 mins for all women and ~58 mins for all men.
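As a quick consistency check, the two group means weighted by their group sizes should reproduce the overall mean of about 60 minutes reported earlier. The sketch below is plain arithmetic on the numbers printed by groupby().describe().

```python
# Per-gender counts and means from the groupby output above
female_count, female_mean = 163, 66.119223
male_count, male_mean = 414, 57.640821

total_runners = female_count + male_count
# Weighted average of the two group means
overall_mean = (female_count * female_mean + male_count * male_mean) / total_runners

print(total_runners)           # 577
print(round(overall_mean, 2))  # 60.04
```

Both values agree with the count and mean of Runner_mins in the overall summary statistics.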
df7.boxplot(column='Runner_mins', by=' Gender')
plt.ylabel('Chip Time')
plt.suptitle("")
Text(0.5,0.98,'')

Source: DataCamp