import pandas as pd
Firstly, if we read the data in without passing any extra arguments, we get:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
) temperature.head()
Mean Central England Temperature (Degrees Celsius) | |
---|---|
1659-1973 Manley (Q.J.R.METEOROL.SOC. | 1974) |
1974 on Parker et al. (INT.J.CLIM. | 1992) |
Parker and Horton (INT.J.CLIM. | 2005) |
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual | NaN |
1659 3.0 4.0 6.0 7.0 11.0 13.0 16.0 16.0 13.0 10.0 5.0 2.0 8.9 | NaN |
So we need to dot he same as before, setting the skiprows
argument:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
=4, # skip first 4 rows of the header
skiprows
) temperature.head()
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Annual | |
---|---|
0 | 1659 3.0 4.0 6.0 7.0 11.0 13... |
1 | 1660 0.0 4.0 6.0 9.0 11.0 14... |
2 | 1661 5.0 5.0 6.0 8.0 11.0 14... |
3 | 1662 5.0 6.0 6.0 8.0 11.0 15... |
4 | 1663 1.0 1.0 5.0 7.0 10.0 14... |
It’s not separating the columns correctly so if we look at the data and see spaces, we might think that useing sep=" "
would work, but if we try it:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
=4,
skiprows=" ", # try this...
sep
) temperature.head()
Unnamed: 0 | Unnamed: 1 | Year | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Jan | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 43 | Unnamed: 44 | Unnamed: 45 | Nov | Unnamed: 47 | Unnamed: 48 | Unnamed: 49 | Dec | Unnamed: 51 | Annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | 1659 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | ... | NaN | 2.0 | NaN | NaN | NaN | 8.9 | NaN | NaN | NaN | NaN |
1 | NaN | NaN | 1660 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | ... | NaN | 5.0 | NaN | NaN | NaN | 9.1 | NaN | NaN | NaN | NaN |
2 | NaN | NaN | 1661 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | ... | NaN | 6.0 | NaN | NaN | NaN | 9.8 | NaN | NaN | NaN | NaN |
3 | NaN | NaN | 1662 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | ... | NaN | 3.0 | NaN | NaN | NaN | 9.5 | NaN | NaN | NaN | NaN |
4 | NaN | NaN | 1663 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | NaN | 5.0 | NaN | NaN | NaN | 8.6 | NaN | NaN | NaN | NaN |
5 rows × 53 columns
That doesn’t look right. This is because sep=" "
means “use a single space” as the separator, but in the data most columns are separated by multiple spaces. To make it use “any number of spaces” as the separator, you can instead set sep='\s+'
:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
=4,
skiprows='\s+', # whitespace-separated columns
sep
) temperature.head()
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1659 | 3.0 | 4.0 | 6.0 | 7.0 | 11.0 | 13.0 | 16.0 | 16.0 | 13.0 | 10.0 | 5.0 | 2.0 | 8.9 |
1 | 1660 | 0.0 | 4.0 | 6.0 | 9.0 | 11.0 | 14.0 | 15.0 | 16.0 | 13.0 | 10.0 | 6.0 | 5.0 | 9.1 |
2 | 1661 | 5.0 | 5.0 | 6.0 | 8.0 | 11.0 | 14.0 | 15.0 | 15.0 | 13.0 | 11.0 | 8.0 | 6.0 | 9.8 |
3 | 1662 | 5.0 | 6.0 | 6.0 | 8.0 | 11.0 | 15.0 | 15.0 | 15.0 | 13.0 | 11.0 | 6.0 | 3.0 | 9.5 |
4 | 1663 | 1.0 | 1.0 | 5.0 | 7.0 | 10.0 | 14.0 | 15.0 | 15.0 | 13.0 | 10.0 | 7.0 | 5.0 | 8.6 |
That looks much better! Now we set the index_col
:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
=4,
skiprows='\s+',
sep="Year", # Set the index
index_col
) temperature.head()
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | |||||||||||||
1659 | 3.0 | 4.0 | 6.0 | 7.0 | 11.0 | 13.0 | 16.0 | 16.0 | 13.0 | 10.0 | 5.0 | 2.0 | 8.9 |
1660 | 0.0 | 4.0 | 6.0 | 9.0 | 11.0 | 14.0 | 15.0 | 16.0 | 13.0 | 10.0 | 6.0 | 5.0 | 9.1 |
1661 | 5.0 | 5.0 | 6.0 | 8.0 | 11.0 | 14.0 | 15.0 | 15.0 | 13.0 | 11.0 | 8.0 | 6.0 | 9.8 |
1662 | 5.0 | 6.0 | 6.0 | 8.0 | 11.0 | 15.0 | 15.0 | 15.0 | 13.0 | 11.0 | 6.0 | 3.0 | 9.5 |
1663 | 1.0 | 1.0 | 5.0 | 7.0 | 10.0 | 14.0 | 15.0 | 15.0 | 13.0 | 10.0 | 7.0 | 5.0 | 8.6 |
And, as we should always do, we plot the data we’ve just read in:
temperature.plot()
Something is wrong with this. There’s a line on the right-hand side which seems wrong. If we look at the last few lines of the data to see what’s going on:
temperature.tail()
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | |||||||||||||
2018 | 5.3 | 3.1 | 5.0 | 9.9 | 13.3 | 16.1 | 19.3 | 16.8 | 13.7 | 10.5 | 8.3 | 6.8 | 10.7 |
2019 | 4.0 | 6.9 | 7.9 | 9.1 | 11.2 | 14.2 | 17.6 | 17.2 | 14.3 | 9.8 | 6.2 | 5.7 | 10.4 |
2020 | 6.4 | 6.4 | 6.8 | 10.5 | 12.6 | 15.3 | 15.8 | 17.7 | 14.0 | 10.4 | 8.5 | 4.9 | 10.8 |
2021 | 3.2 | 5.3 | 7.3 | 6.5 | 10.3 | 15.5 | 17.8 | 16.0 | 16.0 | 12.0 | 7.2 | 6.3 | 10.3 |
2022 | 4.7 | 6.9 | 8.0 | 9.2 | 13.1 | 14.9 | -99.9 | -99.9 | -99.9 | -99.9 | -99.9 | -99.9 | -99.9 |
We can see there are some -99.9
in the data, repsenting missing data. We should fix this with na_values
:
= pd.read_csv(
temperature "https://milliams.com/courses/data_analysis_python/meantemp_monthly_totals.txt",
=4,
skiprows='\s+',
sep="Year",
index_col=["-99.9"]
na_values
) temperature.tail()
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | |||||||||||||
2018 | 5.3 | 3.1 | 5.0 | 9.9 | 13.3 | 16.1 | 19.3 | 16.8 | 13.7 | 10.5 | 8.3 | 6.8 | 10.7 |
2019 | 4.0 | 6.9 | 7.9 | 9.1 | 11.2 | 14.2 | 17.6 | 17.2 | 14.3 | 9.8 | 6.2 | 5.7 | 10.4 |
2020 | 6.4 | 6.4 | 6.8 | 10.5 | 12.6 | 15.3 | 15.8 | 17.7 | 14.0 | 10.4 | 8.5 | 4.9 | 10.8 |
2021 | 3.2 | 5.3 | 7.3 | 6.5 | 10.3 | 15.5 | 17.8 | 16.0 | 16.0 | 12.0 | 7.2 | 6.3 | 10.3 |
2022 | 4.7 | 6.9 | 8.0 | 9.2 | 13.1 | 14.9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
temperature.plot()