1 Weather data

This week you will be investigating historic weather data.

Figure 1

An image of filter-like diagonal strips across various skies, such as an orange sunset, a storm and a clear blue sky

Of course, such data is hugely important for research into the large-scale, long-term shift in our planet’s weather patterns and average temperatures – climate change. However, such data is also incredibly useful for more mundane planning purposes. To demonstrate the learning this week, I, Rob Griffiths, will be using historic weather data to try and plan a summer holiday in the UK. You’ll use the data too and get a chance to work on your own project at the end of the week.

The dataset we’ll use to do this will come from the Weather Underground, which creates weather forecasts from data sent to them by a worldwide network of over 100,000 weather enthusiasts who have personal weather stations on their house or in their garden.

In addition to creating weather forecasts from that data, the Weather Underground also keeps that data as historic weather records allowing members of the public to download weather datasets for a particular time period and location. These datasets are downloaded as CSV files, explained in the next step.

Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data and ‘mould it’ for your purposes. You will then learn how to visualise data by creating graphs using the __plot()__ function.

1.1 What is a CSV file?

A CSV file is a plain text file that is used to hold tabular data. The acronym CSV is short for ‘comma-separated values’.

Figure 2

An image of many pins marking various countries on a globe

Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:


Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26 
Angola,21472,6900.0
Antigua and Barbuda,90,1.2
Argentina,41446,570.0 
Armenia,2977,170.0

Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it is helpful in understanding the data in the following lines and making sure the right values fall in the right place. In this example, the first value on every row must be a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. Note that the third value is a decimal (like 0.26 deaths for Andorra) and not an integer because it is an estimate obtained from statistical processing of collected data.

Note that each value or column name is separated by a comma. In fact, any character can be used to separate values in a CSV file, including spaces and tabs, which is why CSV is sometimes taken to stand for ‘character-separated values’.
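As a minimal sketch of reading a file that uses a different separator, the code below reads the same two rows twice: once separated by commas and once by semicolons. The `sep` argument of `read_csv()` tells pandas which character separates the values. The data is typed in as a string and wrapped in `StringIO` purely so that the example runs without a file on disk; the variable names are made up for illustration.

```python
from io import StringIO
from pandas import read_csv

# Two rows of the TB data, typed in as text (a stand-in for a real file).
comma_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
)
# The same data, but with a semicolon as the separating character.
semicolon_text = comma_text.replace(',', ';')

# StringIO lets read_csv() treat a string as if it were a file.
comma_df = read_csv(StringIO(comma_text))
semicolon_df = read_csv(StringIO(semicolon_text), sep=';')

print(semicolon_df)
# Both dataframes hold exactly the same data.
print(comma_df.equals(semicolon_df))
```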

Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.

Before anything can be done with a CSV file with pandas, the following import statement must be executed:

__In []:__
from pandas import *

As you learned in Week 2, the import statement loads into memory all the code in the pandas module.

To read a CSV file into a dataframe, the pandas function __read_csv()__ needs to be called.

__In []:__
df = read_csv('WHO POP TB all.csv')

The above code creates a dataframe from the data in the file __WHO POP TB all.csv__ and assigns it to the variable __df__. This is the simplest usage of the __read_csv()__ function, just using a single argument, a string that holds the name of the CSV file.

However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.

In the next step, find out about dataframes and the ‘dot’ notation.

1.2 Dataframes and the ‘dot’ notation

In Week 2 you learned that dataframes have methods, which are like functions, that can only be called in the context of a dataframe.

For example, because the TB deaths dataframe __df__ has a column named ‘Country’, the __sort_values()__ method can be called like this:

__In []:__
df.sort_values('Country')

Because there is a variable name, followed by a dot, followed by the method, this is called dot notation. Methods are said to be a property of a dataframe. In addition to methods, dataframes have another property – attributes.
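The following sketch shows dot notation at work on a small dataframe built from the first four rows of the TB data. The data is typed in as a string rather than read from the course file, and the variable name `by_population` is made up for illustration.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Andorra,79,0.26\n"
)
df = read_csv(StringIO(csv_text))

# Variable name, dot, method name, then the argument in round brackets.
by_population = df.sort_values('Population (1000s)')
print(by_population)

# sort_values() returns a new dataframe; df keeps its original row order.
print(df)
```

Note that the rows of the sorted dataframe keep their original index values, so the index column is no longer in ascending order.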

Figure 3

A multi-coloured image of many different sized circles. They could be described as bubbles

Attributes

A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is __columns__, which holds a dataframe’s column names.

So the expression __df.columns__ evaluates to the value of the __columns__ attribute inside the dataframe __df__. The following code will get and display the names of the columns in the dataframe __df__:

__In []:__
df.columns
__Out[]:__

Index(['Country', 'Population (1000s)', 'TB deaths'],
dtype='object')
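Because an attribute holds a value, it can be used wherever a value is expected. As a small sketch (again with the data typed in as a string, and with a made-up variable name), the column names can be turned into an ordinary Python list or counted:

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
)
df = read_csv(StringIO(csv_text))

# No round brackets after columns: it is an attribute, not a method.
column_names = list(df.columns)
print(column_names)

# The number of columns in the dataframe.
print(len(df.columns))
```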

1.3 Getting and displaying dataframe rows

Dataframes can have hundreds or thousands of rows, so it is not practical to display a whole dataframe.

However, there are a number of dataframe attributes and methods that allow you to get and display either a single row or a number of rows at a time. Three of the most useful are the __iloc__ attribute and the __head()__ and __tail()__ methods. Note that to distinguish methods from attributes, we write () after a method’s name.

Figure 4

An image of a data algorithm

The iloc attribute

A dataframe has a default integer index for its rows, which starts at 0 (zero). You can get and display any single row in a dataframe by using the __iloc__ attribute, followed by the index of the row you want to access in square brackets. For example, the following code will get and display the first row of data in the dataframe __df__, which is at index 0:

__In []:__
df.iloc[0]
__Out[]:__

Country Afghanistan
Population (1000s) 30552
TB deaths 13000
Name: 0, dtype: object

Similarly, the following code will get and display the third row of data in the dataframe __df__, which is at index 2:

__In []:__
df.iloc[2]
__Out[]:__

Country Algeria
Population (1000s) 39208
TB deaths 5100.0
Name: 2, dtype: object
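The sketch below repeats the idea on a four-row dataframe typed in as a string, so that it can be run on its own. It also shows that the row returned by `iloc` remembers its index in its `name` attribute, which is the number displayed after `Name:` in the output. The variable name `third_row` is made up for illustration.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Andorra,79,0.26\n"
)
df = read_csv(StringIO(csv_text))

# Square brackets, not round ones: iloc is an attribute.
third_row = df.iloc[2]
print(third_row)

# The index of the row within the dataframe.
print(third_row.name)

# A single value can be picked out of the row by column name.
print(third_row['Country'])
```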


The head() method

The first few rows of a dataframe can be printed out with the __head()__ method.

You can tell __head()__ is a method, rather than an attribute such as __columns__, because of the parentheses (round brackets) after the property name.

If you don’t give any argument, i.e. don’t put any number within those parentheses, the default behaviour is to return the first five rows of the dataframe. If you give an argument, it will print that number of rows (starting from the row indexed by 0).

For example, executing the following code will get and display the first five rows in the dataframe __df__.

__In []:__
df.head()
__Out[]:__
  Country Population (1000s) TB deaths
0 Afghanistan 30552 13000.00
1 Albania 3173 20.00
2 Algeria 39208 5100.00
3 Andorra 79 0.26
4 Angola 21472 6900.00

And executing the following code will get and display the first seven rows in the dataframe __df__:

__In []:__
df.head(7)
__Out[]:__
  Country Population (1000s) TB deaths
0 Afghanistan 30552 13000.00
1 Albania 3173 20.00
2 Algeria 39208 5100.00
3 Andorra 79 0.26
4 Angola 21472 6900.00
5 Antigua and Barbuda 90 1.20
6 Argentina 41446 570.00


The tail() method

The __tail()__ method is similar to the __head()__ method.

If no argument is given, the last five rows of the dataframe are returned, otherwise the number of rows returned is dependent on the argument, just like for the __head()__ method.

__In []:__
df.tail()
__Out[]:__
  Country Population (1000s) TB deaths
189 Venezuela (Bolivarian Republic of) 30405 480
190 Viet Nam 91680 17000
191 Yemen 24407 990
192 Zambia 14539 3600
193 Zimbabwe 14150 5700
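As a quick check of how the argument works for both methods, the sketch below uses the eight rows of the CSV file shown earlier, typed in as a string so that the example runs on its own. The variable names are made up for illustration.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Andorra,79,0.26\n"
    "Angola,21472,6900.0\n"
    "Antigua and Barbuda,90,1.2\n"
    "Argentina,41446,570.0\n"
    "Armenia,2977,170.0\n"
)
df = read_csv(StringIO(csv_text))

first_three = df.head(3)   # rows at index 0, 1 and 2
last_two = df.tail(2)      # rows at index 6 and 7
print(first_three)
print(last_two)

# With no argument, both methods return five rows.
print(len(df.head()), len(df.tail()))
```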

1.4 Getting and displaying dataframe columns

You learned in Week 2 that you can get and display a single column of a dataframe by putting the name of the column (in quotes) within square brackets immediately after the dataframe’s name.

For example, like this:

__In []:__
df['TB deaths']

You then get output like this:

__Out[]:__

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
5        1.20
6      570.00
...

Notice that although there is an index, there is no column heading. This is because what is returned is not a new dataframe with a single column but an example of the __Series__ data type.

Figure 5

A perspective image of the aisle between many data storage towers. The floor and the storage units are lit up.

Each column in a dataframe is an example of a series

The __Series__ data type is a collection of values with an integer index that starts from zero. In addition, the __Series__ data type has many of the same methods and attributes as the __DataFrame__ data type, so you can still execute code like:

__In []:__
df['TB deaths'].head()
__Out[]:__

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
Name: TB deaths, dtype: float64

And

__In []:__
df['TB deaths'].iloc[2]
__Out[]:__
5100.00

However, pandas does provide a mechanism for you to get and display one or more selected columns as a new dataframe in its own right. To do this you need to use a list. A list in Python consists of one or more items separated by commas and enclosed within square brackets, for example __['Country']__ or __['Country', 'Population (1000s)']__. This list is then put within outer square brackets immediately after the dataframe’s name, like this:

__In []:__
df[['Country']].head()
__Out[]:__
  __Country__
0 Afghanistan
1 Albania
2 Algeria
3 Andorra
4 Angola

Note that the column is now named. The expression __df[['Country']]__ (with two pairs of square brackets) evaluates to a new dataframe (which happens to have a single column) rather than a series.

To get a new dataframe with multiple columns you just need to put more column names in the list, like this:

__In []:__
df[['Country', 'Population (1000s)']].head()
__Out[]:__
  __Country__ __Population (1000s)__
0 Afghanistan 30552
1 Albania 3173
2 Algeria 39208
3 Andorra 79
4 Angola 21472

The code has returned a new dataframe with just the __'Country'__ and __'Population (1000s)'__ columns.
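One way to convince yourself of the difference between the two forms is to ask Python for the type of each result. This is a minimal sketch with the data typed in as a string; the variable names `as_series` and `as_dataframe` are made up for illustration.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Andorra,79,0.26\n"
)
df = read_csv(StringIO(csv_text))

# One pair of square brackets around a column name gives a series.
as_series = df['Country']
print(type(as_series).__name__)

# A list of column names inside the square brackets gives a dataframe.
as_dataframe = df[['Country']]
print(type(as_dataframe).__name__)

# Only a dataframe has a columns attribute holding column names.
print(list(as_dataframe.columns))
```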

Exercise 1 Dataframes and CSV files

Question

Now that you’ve learned about CSV files and more about pandas, you are ready to complete Exercise 1 in the Exercise notebook 2.

Open the Exercise notebook 2 and the data file you used last week, WHO POP TB all.csv, and save them in the folder you created in Week 1.

If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter. Once it’s open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter watch again the video in Week 1 Exercise 1.


1.5 Comparison operators

In Expressions, you learned that Python has arithmetic operators: +, /, - and * and that expressions such as 5 + 2 evaluate to a value (in this case the number 7).

Figure 6

An illustration of two girls holding up signs. One sign says, ‘YES’, the other says, ‘NO’.

Python also has what are called comparison operators. These are:


==    equals
!=    not equal
<     less than
>     greater than
<=    less than or equal to 
>=    greater than or equal to

Expressions involving these operators always evaluate to a Boolean value, that is __True__ or __False__. Here are some examples:


2 == 2       evaluates to True
2 + 2 == 5   evaluates to False
2 != 1 + 1   evaluates to False
45 < 50      evaluates to True
20 > 30      evaluates to False
100 <= 100   evaluates to True
101 >= 100   evaluates to True

The comparison operators can be used with other types of data, not just numbers. Used with strings they compare using alphabetical order. For example:

'aardvark' < 'zebra' evaluates to True
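One detail worth knowing is that Python compares strings character by character, and every uppercase letter comes before every lowercase letter. The order is therefore only truly alphabetical when the strings use the same case, as the country names in this dataset do with their initial capital. The short sketch below uses made-up variable names to show this.

```python
# Both strings are lowercase, so the order is alphabetical.
lowercase_order = 'aardvark' < 'zebra'
print(lowercase_order)

# Country names all start with a capital letter, so they also compare alphabetically.
country_order = 'Latvia' <= 'Sweden'
print(country_order)

# Uppercase 'Z' comes before lowercase 'a', so this is True as well.
mixed_case_order = 'Zebra' < 'aardvark'
print(mixed_case_order)
```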

In Calculating over columns you saw that when applied to whole columns, the arithmetic operators did the calculations row by row. Similarly, an expression like __df['Country'] >= 'K'__ will compare the country names, row by row, against the string ‘K’ and record whether the result is __True__ or __False__ in a series like this:


0    False
1    False
2    False
3    False
4    False
5    False
...
Name: Country, dtype: bool 

If such an expression is put within square brackets immediately after a dataframe’s name, a new dataframe is obtained with only those rows where the result is __True__. So:

df[df['Country'] >= 'K']

returns a new dataframe with all the columns of __df__ but with only the rows corresponding to countries starting with K or a letter later in the alphabet.

As another example, to see the data for countries with over 80 million inhabitants, the following code will return and display a new dataframe with all the columns of __df__ but with only the rows where it is __True__ that the value in the __'Population (1000s)'__ column is greater than __80000__:

__In []:__
df[df['Population (1000s)'] > 80000]
__Out[]:__
  Country Population (1000s) TB deaths
13 Bangladesh 156595 80000
23 Brazil 200362 4400
36 China 1393337 41000
53 Egypt 82056 550
58 Ethiopia 94101 30000
65 Germany 82727 300
77 India 1252140 240000
78 Indonesia 249866 64000
85 Japan 127144 2100
109 Mexico 122332 2200
124 Nigeria 173615 160000
128 Pakistan 182143 49000
134 Philippines 98394 27000
141 Russian Federation 142834 17000
185 United States of America 320051 490
190 Viet Nam 91680 17000

Exercise 2 Comparison operators

Question

You are ready to complete Exercise 2 in the Exercise notebook 2.

Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.

1.6 Bitwise operators

To build more complicated expressions involving column comparisons, there are two bitwise operators.

Figure 7

An image of someone constructing a building from wooden blocks

The __&__ operator means ‘and’ and the __|__ operator (vertical bar, not uppercase letter ‘i’) means ‘or’. So, for example the expression:

(df['Country'] >= 'Latvia') & (df['Country'] <= 'Sweden')

will evaluate to a series containing Boolean values where the values are __True__ only if the equivalent rows in the dataframe contain the countries ‘__Latvia__’ to ‘__Sweden__’, inclusive. However, the following expression, which uses | (or) rather than & (and):

(df['Country'] >= 'Latvia') | (df['Country'] <= 'Sweden')

will evaluate to __True__ for all countries, because every country comes alphabetically after ‘__Latvia__’ (e.g. the ‘UK’) or before ‘__Sweden__’ (e.g. ‘__Brazil__’).

Note the round brackets around each comparison. Without them you will get an error.
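The reason is operator precedence: Python applies `&` before `>`, so without the brackets it first tries to work out `30000 & deaths`, which is not what was intended. The sketch below, using four rows typed in as a string and made-up variable names and thresholds, shows both versions. The exact error type and message may vary between pandas versions, so the code catches either of the two likely error types and only reports which one occurred.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Argentina,41446,570.0\n"
)
df = read_csv(StringIO(csv_text))
population = df['Population (1000s)']
deaths = df['TB deaths']

# With brackets: each comparison is done first, then the results are combined.
with_brackets = (population > 30000) & (deaths > 1000)
print(df[with_brackets])

# Without brackets: Python tries 30000 & deaths first and fails.
failed = False
try:
    population > 30000 & deaths > 1000
except (TypeError, ValueError) as error:
    failed = True
    print('Error without brackets:', type(error).__name__)
```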

The whole expression with multiple comparisons has to be put within __df[…]__ to get a dataframe with only those rows that match the condition.

As a further example, using different columns, it is relatively easy to find the rows in __df__ where ‘__Population (1000s)__’ is greater than __80000__ and where ‘__TB deaths__’ are greater than __10000__.

__In []:__
df[(df['Population (1000s)'] > 80000) & (df['TB deaths'] > 10000)]
__Out[]:__
  Country Population (1000s) TB deaths
13 Bangladesh 156595 80000
36 China 1393337 41000
58 Ethiopia 94101 30000
77 India 1252140 240000
78 Indonesia 249866 64000
124 Nigeria 173615 160000
128 Pakistan 182143 49000
134 Philippines 98394 27000
141 Russian Federation 142834 17000
190 Viet Nam 91680 17000

These expressions can get long and complicated, making it easy to miss a crucial round or square bracket. In those cases it is best to break up the expression into small steps. The previous example could also be written as:

__In []:__

population = df['Population (1000s)'] 
deaths = df['TB deaths']
df[(population > 80000) & (deaths > 10000)]
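The same step-by-step approach can be taken further by giving each comparison its own name, which makes it easy to switch between `&` and `|`. The following sketch uses the eight rows of the CSV file shown earlier, typed in as a string; the thresholds and the variable names are made up for illustration.

```python
from io import StringIO
from pandas import read_csv

csv_text = (
    "Country,Population (1000s),TB deaths\n"
    "Afghanistan,30552,13000.0\n"
    "Albania,3173,20.0\n"
    "Algeria,39208,5100.0\n"
    "Andorra,79,0.26\n"
    "Angola,21472,6900.0\n"
    "Antigua and Barbuda,90,1.2\n"
    "Argentina,41446,570.0\n"
    "Armenia,2977,170.0\n"
)
df = read_csv(StringIO(csv_text))

# Name each comparison so the final expressions stay short.
large = df['Population (1000s)'] > 30000
high_deaths = df['TB deaths'] > 5000

# Rows where both comparisons are True.
both = df[large & high_deaths]
print(both)

# Rows where at least one comparison is True.
either = df[large | high_deaths]
print(either)
```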

Exercise 3 Bitwise operators

Question

Complete Exercise 3 in the Exercise notebook 2.