1 Weather data

This week you will investigate historic weather data.

Figure 1

An image of filter-like diagonal strips across various skies, such as an orange sunset, a storm and a clear blue sky

Of course, such data is hugely important for research into the large-scale, long-term shift in our planet’s weather patterns and average temperatures – climate change. However, such data is also incredibly useful for more mundane planning purposes. To demonstrate the learning this week, I, Rob Griffiths, will be using historic weather data to try to plan a summer holiday in the UK. You’ll use the data too and get a chance to work on your own project at the end of the week.

The dataset we’ll use to do this will come from the Weather Underground, which creates weather forecasts from data sent to them by a worldwide network of over 100,000 weather enthusiasts who have personal weather stations on their house or in their garden.

In addition to creating weather forecasts from that data, the Weather Underground also keeps that data as historic weather records allowing members of the public to download weather datasets for a particular time period and location. These datasets are downloaded as CSV files, explained in the next step.

Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data and ‘mould it’ for your purposes. You will then learn how to visualise data by creating graphs using the __plot()__ function.

1.1 What is a CSV file?

A CSV file is a plain text file that is used to hold tabular data. The acronym CSV is short for ‘comma-separated values’.

Figure 2

An image of many pins marking various countries on a globe

Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:

Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26 
Angola,21472,6900.0
Antigua and Barbuda,90,1.2
Argentina,41446,570.0 
Armenia,2977,170.0

Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it is helpful in understanding the data in the following lines and making sure the right values fall in the right place. In this example, the first value on every row must be a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. Note that the third value is a decimal (like 0.26 deaths for Andorra) and not an integer because it is an estimate obtained from statistical processing of collected data.

Note that each value or column name is separated by a comma. In fact, any character can be used to separate values in a CSV file, including spaces and tabs, hence CSV can also stand for ‘character-separated values’.

Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.

Before anything can be done with a CSV file with pandas, the following import statement must be executed:

__In []:__

from pandas import *

As you learned in Week 2, the import statement loads into memory all the code in the pandas module.

To read a CSV file into a dataframe, the pandas function __read_csv()__ needs to be called.

__In []:__

df = read_csv('WHO POP TB all.csv')

The above code creates a dataframe from the data in the file __WHO POP TB all.csv__ and assigns it to the variable __df__. This is the simplest usage of the __read_csv()__ function, just using a single argument, a string that holds the name of the CSV file.

However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.
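As a small illustration of one such additional argument, the sketch below reads semicolon-separated text by passing `sep=';'`. The text is a made-up stand-in for a file on disk, and `StringIO` is only used so that the example runs without needing a real file.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical semicolon-separated data standing in for a file on disk.
text = (
    "Country;Population (1000s);TB deaths\n"
    "Albania;3173;20.0\n"
    "Andorra;79;0.26\n"
)

# The sep argument tells read_csv() which character separates the values.
sample = read_csv(StringIO(text), sep=';')
print(sample)
```

With a real file you would pass the file name as the first argument, exactly as in the code above.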

In the next step, find out about dataframes and the ‘dot’ notation.

1.2 Dataframes and the ‘dot’ notation

In Week 2 you learned that dataframes have methods, which are like functions, that can only be called in the context of a dataframe.

For example, because the TB deaths dataframe __df__ has a column named ‘Country’, the __sort_values()__ method can be called like this:

__In []:__

df.sort_values('Country')

Because there is a variable name, followed by a dot, followed by the method, this is called dot notation. Methods are said to be a property of a dataframe. In addition to methods, dataframes have another property – attributes.
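To see dot notation in action without the full data file, here is a minimal sketch that builds a tiny dataframe from three rows of the CSV sample shown earlier. The variable names `sample`, `by_name` and `by_population` are only illustrative.

```python
from pandas import DataFrame

# A tiny dataframe built from three rows of the CSV sample.
sample = DataFrame({
    'Country': ['Algeria', 'Afghanistan', 'Albania'],
    'Population (1000s)': [39208, 30552, 3173],
})

# sort_values() returns a new, sorted dataframe; sample itself is unchanged.
by_name = sample.sort_values('Country')

# The ascending argument reverses the order when set to False.
by_population = sample.sort_values('Population (1000s)', ascending=False)

print(by_name)
print(by_population)
```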

Figure 3

A multi-coloured image of many different sized circles. They could be described as bubbles

Attributes

A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is __columns__, which holds a dataframe’s column names.

So the expression __df.columns__ evaluates to the value of the __columns__ attribute inside the dataframe __df__. The following code will get and display the names of the columns in the dataframe __df__:

__In []:__

df.columns

__Out[]:__

Index(['Country', 'Population (1000s)', 'TB deaths'],
dtype='object')
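If you want the column names as an ordinary Python list, one possible approach is to pass the attribute to `list()`. The sketch below uses a small stand-in dataframe called `sample` rather than the full TB dataset.

```python
from pandas import DataFrame

# A small stand-in dataframe with the same column names as the TB data.
sample = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
    'TB deaths': [13000.0, 20.0, 5100.0],
})

# columns is an attribute, so no round brackets follow its name.
names = list(sample.columns)
print(names)
```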

1.3 Getting and displaying dataframe rows

Dataframes can have hundreds or thousands of rows, so it is not practical to display a whole dataframe.

However, there are a number of dataframe attributes and methods that allow you to get and display either a single row or a number of rows at a time. Three of the most useful are the __iloc__ attribute and the __head()__ and __tail()__ methods. Note that to distinguish methods and attributes, we write () after a method’s name.

Figure 4

An image of a data algorithm

The iloc attribute

A dataframe has a default integer index for its rows, which starts at 0 (zero). You can get and display any single row in a dataframe by using the __iloc__ attribute, followed by the index of the row you want to access in square brackets. For example, the following code will get and display the first row of data in the dataframe __df__, which is at index 0:

__In []:__

df.iloc[0]

__Out[]:__

Country Afghanistan
Population (1000s) 30552
TB deaths 13000.0
Name: 0, dtype: object

Similarly, the following code will get and display the third row of data in the dataframe __df__, which is at index 2:

__In []:__

df.iloc[2]

__Out[]:__

Country Algeria
Population (1000s) 39208
TB deaths 5100.0
Name: 2, dtype: object
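The sketch below repeats the idea on a small stand-in dataframe (`sample`, built from the first rows of the CSV file). It also shows that the row you get back is labelled with its index, and that a negative position counts from the end, which is standard `iloc` behaviour.

```python
from pandas import DataFrame

# A small stand-in dataframe built from the first rows of the CSV sample.
sample = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra'],
    'Population (1000s)': [30552, 3173, 39208, 79],
    'TB deaths': [13000.0, 20.0, 5100.0, 0.26],
})

first = sample.iloc[0]    # row at index 0
third = sample.iloc[2]    # row at index 2
last = sample.iloc[-1]    # negative positions count back from the end

print(third)
print(third.name)         # the index of the row that was selected
print(last['Country'])
```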

The head() method

The first few rows of a dataframe can be printed out with the __head()__ method.

You can tell __head()__ is a method, rather than an attribute such as __columns__, because of the parentheses (round brackets) after the property name.

If you don’t give any argument, i.e. don’t put any number within those parentheses, the default behaviour is to return the first five rows of the dataframe. If you give an argument, it will print that number of rows (starting from the row indexed by 0).

For example, executing the following code will get and display the first five rows in the dataframe __df__:

__In []:__

df.head()

__Out[]:__


	Country	Population (1000s)	TB deaths
0	Afghanistan	30552	13000.00
1	Albania	3173	20.00
2	Algeria	39208	5100.00
3	Andorra	79	0.26
4	Angola	21472	6900.00

And executing the following code will get and display the first seven rows in the dataframe __df__:

__In []:__

df.head(7)

__Out[]:__


	Country	Population (1000s)	TB deaths
0	Afghanistan	30552	13000.00
1	Albania	3173	20.00
2	Algeria	39208	5100.00
3	Andorra	79	0.26
4	Angola	21472	6900.00
5	Antigua and Barbuda	90	1.20
6	Argentina	41446	570.00

The tail() method

The __tail()__ method is similar to the __head()__ method.

If no argument is given, the last five rows of the dataframe are returned; otherwise the number of rows returned depends on the argument, just as for the __head()__ method.

__In []:__

df.tail()

__Out[]:__


	Country	Population (1000s)	TB deaths
189	Venezuela (Bolivarian Republic of)	30405	480
190	Viet Nam	91680	17000
191	Yemen	24407	990
192	Zambia	14539	3600
193	Zimbabwe	14150	5700
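To try both methods without the full data file, the following sketch uses the eight rows from the CSV sample shown earlier. The name `sample` is just an illustrative stand-in for __df__.

```python
from pandas import DataFrame

# The eight data rows from the CSV sample shown earlier.
sample = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
                'Antigua and Barbuda', 'Argentina', 'Armenia'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472, 90, 41446, 2977],
    'TB deaths': [13000.0, 20.0, 5100.0, 0.26, 6900.0, 1.2, 570.0, 170.0],
})

first_five = sample.head()    # no argument: the first five rows
first_two = sample.head(2)    # the first two rows
last_three = sample.tail(3)   # the last three rows

print(first_two)
print(last_three)
```

Notice that the rows returned by `tail()` keep their original index values (5, 6 and 7 here) rather than being renumbered from zero.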

1.4 Getting and displaying dataframe columns

You learned in Week 2 that you can get and display a single column of a dataframe by putting the name of the column (in quotes) within square brackets immediately after the dataframe’s name.

For example, like this:

__In []:__

df['TB deaths']

You then get output like this:

__Out[]:__

Notice that although there is an index, there is no column heading. This is because what is returned is not a new dataframe with a single column but an example of the __Series__ data type.

Figure 5

A perspective image of the aisle between many data storage towers. The floor and the storage units are lit up.

Each column in a dataframe is an example of a series

The __Series__ data type is a collection of values with an integer index that starts from zero. In addition, the __Series__ data type has many of the same methods and attributes as the __DataFrame__ data type, so you can still execute code like:

__In []:__

df['TB deaths'].head()

__Out[]:__

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
Name: TB deaths, dtype: float64

And

__In []:__

df['TB deaths'].iloc[2]

__Out[]:__

5100.00

However, pandas does provide a mechanism for you to get and display one or more selected columns as a new dataframe in its own right. To do this you need to use a list. A list in Python consists of one or more items separated by commas and enclosed within square brackets, for example __['Country']__ or __['Country', 'Population (1000s)']__. This list is then put within outer square brackets immediately after the dataframe’s name, like this:

__In []:__

df[['Country']].head()

__Out[]:__


	__Country__
0	Afghanistan
1	Albania
2	Algeria
3	Andorra
4	Angola

Note that the column is now named. The expression __df[['Country']]__ (with two pairs of square brackets) evaluates to a new dataframe (which happens to have a single column) rather than a series.

To get a new dataframe with multiple columns you just need to put more column names in the list, like this:

__In []:__

df[['Country', 'Population (1000s)']].head()

__Out[]:__


	__Country__	__Population (1000s)__
0	Afghanistan	30552
1	Albania	3173
2	Algeria	39208
3	Andorra	79
4	Angola	21472

The code has returned a new dataframe with just the __'Country'__ and __'Population (1000s)'__ columns.
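If you are ever unsure whether an expression gives you a series or a dataframe, one way to check is Python's built-in `isinstance()` function. The sketch below uses a small stand-in dataframe; the names `sample`, `column` and `frame` are only illustrative.

```python
from pandas import DataFrame, Series

# A small stand-in dataframe built from the first rows of the CSV sample.
sample = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
})

column = sample['Country']     # one pair of square brackets: a series
frame = sample[['Country']]    # a list inside square brackets: a dataframe

print(isinstance(column, Series))
print(isinstance(frame, DataFrame))
print(list(frame.columns))
```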

Exercise 1 Dataframes and CSV files

Question

Now that you’ve learned about CSV files and more about pandas you are ready to complete Exercise 1 in the exercise notebook 2.

Open the exercise 2 notebook and the data file you used last week, WHO POP TB all.csv, and save them in the folder you created in Week 1.

If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter. Once it’s open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in Week 1 Exercise 1.

1.5 Comparison operators

In Expressions, you learned that Python has arithmetic operators: +, /, - and * and that expressions such as 5 + 2 evaluate to a value (in this case the number 7).

Figure 6

An illustration of two girls holding up signs. One sign says, ‘YES’, the other says, ‘NO’.

Python also has what are called comparison operators. These are:

==    equals
!=    not equal
<     less than
>     greater than
<=    less than or equal to 
>=    greater than or equal to

Expressions involving these operators always evaluate to a Boolean value, that is, __True__ or __False__. Here are some examples:

2 == 2       evaluates to True
2 + 2 == 5   evaluates to False
2 != 1 + 1   evaluates to False
45 < 50      evaluates to True
20 > 30      evaluates to False
100 <= 100   evaluates to True
101 >= 100   evaluates to True

The comparison operators can be used with other types of data, not just numbers. Used with strings they compare using alphabetical order. For example:

'aardvark' < 'zebra' evaluates to True
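You can check comparisons like these for yourself. The short sketch below stores a few results in illustrative variable names and prints them. The last line shows a detail worth knowing: Python compares strings character by character using character codes, so all uppercase letters come before all lowercase ones.

```python
# Each comparison evaluates to a Boolean value.
less_than = 45 < 50                    # True
greater_than = 20 > 30                 # False
sum_equals = 2 + 2 == 5                # False: the addition is done first
words = 'aardvark' < 'zebra'           # True

# Uppercase letters sort before lowercase letters.
mixed_case = 'Zebra' < 'aardvark'      # True

print(less_than, greater_than, sum_equals, words, mixed_case)
```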

In Calculating over columns you saw that when applied to whole columns, the arithmetic operators did the calculations row by row. Similarly, an expression like __df['Country'] >= 'K'__ will compare the country names, row by row, against the string ‘K’ and record whether the result is __True__ or __False__ in a series like this:

0    False
1    False
2    False
3    False
4    False
5    False
...
Name: Country, dtype: bool

If such an expression is put within square brackets immediately after a dataframe’s name, a new dataframe is obtained with only those rows where the result is __True__. So:

df[df['Country'] >= 'K']

returns a new dataframe with all the columns of __df__ but with only the rows corresponding to countries starting with K or a letter later in the alphabet.

As another example, to see the data for countries with over 80 million inhabitants, the following code will return and display a new dataframe with all the columns of __df__ but with only the rows where it is __True__ that the value in the __'Population (1000s)'__ column is greater than __80000__:

__In []:__

df[df['Population (1000s)'] > 80000]

__Out[]:__


	Country	Population (1000s)	TB deaths
13	Bangladesh	156595	80000
23	Brazil	200362	4400
36	China	1393337	41000
53	Egypt	82056	550
58	Ethiopia	94101	30000
65	Germany	82727	300
77	India	1252140	240000
78	Indonesia	249866	64000
85	Japan	127144	2100
109	Mexico	122332	2200
124	Nigeria	173615	160000
128	Pakistan	182143	49000
134	Philippines	98394	27000
141	Russian Federation	142834	17000
185	United States of America	320051	490
190	Viet Nam	91680	17000
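The two steps, building a Boolean series and then using it to select rows, can be seen separately in the sketch below. It uses the eight rows from the CSV sample and a lower, purely illustrative threshold of 30000; the names `sample`, `is_large` and `large` are assumptions made for this example.

```python
from pandas import DataFrame

# The eight data rows from the CSV sample shown earlier.
sample = DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
                'Antigua and Barbuda', 'Argentina', 'Armenia'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472, 90, 41446, 2977],
})

# Step 1: the comparison is done row by row, giving a Boolean series.
is_large = sample['Population (1000s)'] > 30000

# Step 2: only the rows where the series is True are kept.
large = sample[is_large]

print(is_large.tolist())
print(large)
```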

Exercise 2 Comparison operators

Question

You are ready to complete Exercise 2 in the Exercise notebook 2.

Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.

1.6 Bitwise operators

To build more complicated expressions involving column comparisons, you can combine the comparisons using two bitwise operators.

Figure 7

An image of someone constructing a building from wooden blocks

The __&__ operator means ‘and’ and the __|__ operator (vertical bar, not uppercase letter ‘i’) means ‘or’. So, for example the expression:


(df['Country'] >= 'Latvia') & (df['Country'] <= 'Sweden')

will evaluate to a series containing Boolean values where the values are __True__ only if the equivalent rows in the dataframe contain the countries ‘__Latvia__’ to ‘__Sweden__’, inclusive. However, the following expression which uses __|__ (or) rather than __&__ (and):

(df['Country'] >= 'Latvia') | (df['Country'] <= 'Sweden')

will evaluate to __True__ for all countries, because every country comes alphabetically after ‘__Latvia__’ (e.g. the ‘UK’) or before ‘__Sweden__’ (e.g. ‘__Brazil__’).

Note the round brackets around each comparison. Without them you will get an error.
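The reason is that Python gives & a higher priority than the comparison operators, so without round brackets it tries to work out 80000 & deaths first. The sketch below, using three rows taken from the population output shown earlier, demonstrates both versions. The exact type of error may depend on your pandas version and the column types, so the code simply catches either of the two likely ones.

```python
from pandas import DataFrame

# Three rows taken from the population output shown earlier.
sample = DataFrame({
    'Country': ['Bangladesh', 'Brazil', 'Germany'],
    'Population (1000s)': [156595, 200362, 82727],
    'TB deaths': [80000.0, 4400.0, 300.0],
})
population = sample['Population (1000s)']
deaths = sample['TB deaths']

# With round brackets, each comparison is done first, then & combines them.
both = (population > 80000) & (deaths > 10000)
print(both.tolist())

# Without round brackets, Python evaluates 80000 & deaths first, which fails.
try:
    population > 80000 & deaths > 10000
    failed = False
except (TypeError, ValueError) as error:
    failed = True
    print('Error:', type(error).__name__)
```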

The whole expression with multiple comparisons has to be put within __df[…]__ to get a dataframe with only those rows that match the condition.

As a further example, using different columns, it is relatively easy to find the rows in __df__ where ‘__Population (1000s)__’ is greater than __80000__ and where ‘__TB deaths__’ is greater than __10000__:

__In []:__

df[(df['Population (1000s)'] > 80000) & (df['TB deaths'] > 10000)]

__Out[]:__


	Country	Population (1000s)	TB deaths
13	Bangladesh	156595	80000
36	China	1393337	41000
58	Ethiopia	94101	30000
77	India	1252140	240000
78	Indonesia	249866	64000
124	Nigeria	173615	160000
128	Pakistan	182143	49000
134	Philippines	98394	27000
141	Russian Federation	142834	17000
190	Viet Nam	91680	17000

These expressions can get long and complicated, making it easy to miss a crucial round or square bracket. In those cases it is best to break up the expression into small steps. The previous example could also be written as:

__In []:__

population = df['Population (1000s)'] 
deaths = df['TB deaths']
df[(population > 80000) & (deaths > 10000)]
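You can take this one step further and give each comparison its own name, which makes it easy to swap & for |. The following is a minimal sketch on five rows taken from the outputs and CSV sample shown earlier; the names `is_big` and `is_deadly` are only illustrative.

```python
from pandas import DataFrame

# Five rows taken from the outputs and CSV sample shown earlier.
sample = DataFrame({
    'Country': ['Bangladesh', 'Brazil', 'Germany', 'Afghanistan', 'Albania'],
    'Population (1000s)': [156595, 200362, 82727, 30552, 3173],
    'TB deaths': [80000.0, 4400.0, 300.0, 13000.0, 20.0],
})

# Name each comparison separately.
is_big = sample['Population (1000s)'] > 80000
is_deadly = sample['TB deaths'] > 10000

# & keeps rows where both are True; | keeps rows where at least one is True.
both = sample[is_big & is_deadly]
either = sample[is_big | is_deadly]

print(both)
print(either)
```

Because each comparison is already stored in a variable, no extra round brackets are needed around `is_big` and `is_deadly`.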

Exercise 3 Bitwise operators

Question

Complete Exercise 3 in the Exercise notebook 2.