For the final exercise, try to combine all the skills you’ve learned over this course. It is expected to take you longer than any of the exercises so far. If you reach this point during the session, feel free to start on it now; otherwise, treat it as practice for later.
Read the file ./data/titanic.csv. This file contains information about all the passengers and crew aboard the RMS Titanic. Its columns are:
name: strings with the name of the passenger.
gender: “male” or “female”.
age: a float with the person's age on the day of the sinking. The age of babies (under 12 months) is given as a fraction of one year.
class: a string specifying the class (1st, 2nd or 3rd) for passengers or the type of service aboard for crew members.
embarked: the person's place of embarkation.
country: the person's home country.
ticketno: the person's ticket number (NA for crew members).
fare: the ticket price (NA for crew members, musicians and employees of the shipyard company).
sibsp: a number specifying the number of siblings/spouses aboard.
parch: a number specifying the number of parents/children aboard.
survived: a string (“no” or “yes”) specifying whether the person survived the sinking.
Work through the following suggested questions. Feel free to go off-course and explore whatever you think is interesting in the data.
Summarising
Find the average age of all people on board
Use a filter to select only the males
Find the average age of the males on board
Filtering
Use a filter to select only the people in 3rd class.
Create a DataFrame which only contains the passengers on the ship (those in first, second or third class).
Plotting
Plot and compare the distribution of ages for males and females.
How does this differ by class?
Combining
Calculate the percentage that survived within each class.
Explore
Explore the data and see what you can find. For example, try exploring what the main factors predicting survival were.
Answer
There’s some help with the answers to some parts of this exercise here.
import pandas as pd
import seaborn as sns

titanic = pd.read_csv("./data/titanic.csv")
titanic
| | name | gender | age | class | embarked | country | ticketno | fare | sibsp | parch | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | S | United States | 5547.0 | 7.11 | 0.0 | 0.0 | no |
| 1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | S | United States | 2673.0 | 20.05 | 0.0 | 2.0 | no |
| 2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | S | United States | 2673.0 | 20.05 | 1.0 | 1.0 | no |
| 3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.05 | 1.0 | 1.0 | yes |
| 4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | S | Norway | 348125.0 | 7.13 | 0.0 | 0.0 | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2202 | Wynn, Mr. Walter | male | 41.0 | deck crew | B | England | NaN | NaN | NaN | NaN | yes |
| 2203 | Yearsley, Mr. Harry | male | 40.0 | victualling crew | S | England | NaN | NaN | NaN | NaN | yes |
| 2204 | Young, Mr. Francis James | male | 32.0 | engineering crew | S | England | NaN | NaN | NaN | NaN | no |
| 2205 | Zanetti, Sig. Minio | male | 20.0 | restaurant staff | S | England | NaN | NaN | NaN | NaN | no |
| 2206 | Zarracchi, Sig. L. | male | 26.0 | restaurant staff | S | England | NaN | NaN | NaN | NaN | no |

2207 rows × 11 columns
Summarising
Find the average age of all people on board
titanic["age"].mean()
30.436734693877504
Use a filter to select only the males
all_males = titanic[titanic["gender"] == "male"]
Find the average age of the males on board
all_males["age"].mean()
30.83231351981346
Filtering
Select only the people in 3rd class
titanic[titanic["class"] == "3rd"]
| | name | gender | age | class | embarked | country | ticketno | fare | sibsp | parch | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbing, Mr. Anthony | male | 42.0 | 3rd | S | United States | 5547.0 | 7.1100 | 0.0 | 0.0 | no |
| 1 | Abbott, Mr. Eugene Joseph | male | 13.0 | 3rd | S | United States | 2673.0 | 20.0500 | 0.0 | 2.0 | no |
| 2 | Abbott, Mr. Rossmore Edward | male | 16.0 | 3rd | S | United States | 2673.0 | 20.0500 | 1.0 | 1.0 | no |
| 3 | Abbott, Mrs. Rhoda Mary 'Rosa' | female | 39.0 | 3rd | S | England | 2673.0 | 20.0500 | 1.0 | 1.0 | yes |
| 4 | Abelseth, Miss. Karen Marie | female | 16.0 | 3rd | S | Norway | 348125.0 | 7.1300 | 0.0 | 0.0 | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1313 | Yūsuf, Mrs. Kātrīn | female | 23.0 | 3rd | C | Lebanon | 2668.0 | 22.0702 | 0.0 | 2.0 | yes |
| 1315 | Zakarian, Mr. Mapriededer | male | 22.0 | 3rd | C | Turkey | 2656.0 | 7.0406 | 0.0 | 0.0 | no |
| 1316 | Zakarian, Mr. Ortin | male | 27.0 | 3rd | C | Turkey | 2670.0 | 7.0406 | 0.0 | 0.0 | no |
| 1317 | Zenni, Mr. Philip | male | 25.0 | 3rd | C | Lebanon | 2620.0 | 7.0406 | 0.0 | 0.0 | yes |
| 1318 | Zimmermann, Mr. Leo | male | 29.0 | 3rd | S | Germany | 315082.0 | 7.1706 | 0.0 | 0.0 | no |

709 rows × 11 columns
Select just the passengers
The technique shown in class was to combine multiple selectors with the | operator.
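As a minimal sketch of that approach, the three comparisons below are joined with | so that a row is kept if its class is 1st, 2nd or 3rd. The function name select_passengers is hypothetical, chosen only for illustration, and the code assumes the column layout described at the start of the exercise.

```python
import pandas as pd


def select_passengers(people: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose class is 1st, 2nd or 3rd (drops the crew).

    `select_passengers` is an illustrative name, not part of the course material.
    """
    # Each comparison needs its own parentheses because | binds more
    # tightly than == in Python.
    is_passenger = (
        (people["class"] == "1st")
        | (people["class"] == "2nd")
        | (people["class"] == "3rd")
    )
    return people[is_passenger]
```

With the data loaded as titanic, calling select_passengers(titanic) should return a DataFrame without the crew rows. The expression people["class"].isin(["1st", "2nd", "3rd"]) builds the same mask in a single step.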
Plot the distribution of ages for males and females
Using displot with age as the main variable shows the distribution. You can overlay the two genders using hue="gender". To simplify the view, you can set kind="kde". Since KDE mode smooths the data, you can also set a cutoff of 0 (cut=0) so that the curves do not extend to negative ages.
To reduce the duplication of effort here, I create a function which, given a set of data, calculates the fraction that survived within it. This is then called three times, once for each class.
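One way this could be written is sketched below. The names survived_fraction and survival_percentages are hypothetical, and the code assumes the survived column holds the strings "yes" and "no" as described in the column list.

```python
import pandas as pd


def survived_fraction(people: pd.DataFrame) -> float:
    """Fraction of the rows in `people` whose survived value is "yes"."""
    # Comparing gives a column of True/False; its mean is the fraction of True.
    return (people["survived"] == "yes").mean()


def survival_percentages(people: pd.DataFrame, classes=("1st", "2nd", "3rd")) -> dict:
    """Call survived_fraction once per class and convert each result to a percentage."""
    return {
        cls: 100 * survived_fraction(people[people["class"] == cls])
        for cls in classes
    }
```

With the data loaded as titanic, survival_percentages(titanic) would return one percentage for each of the three passenger classes. A class with no rows gives NaN, since the mean of an empty column is undefined.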