{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Enter the pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you probably realised, this way of coding is not practical for large scale data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://www.open.edu/openlearn/ocw/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1034.jpg)\n", "__Figure 1__\n", "\n", "An image of four giant panda cubs climbing a bamboo fence \n", "Three lines of code were required for each country, to store the number of deaths, store the population, and calculate the death rate. With roughly 200 countries in the world, my trivial analysis would require 400 variables and typing almost 600 lines of code! Life’s too short to be spent that way.\n", "\n", "Instead of using a separate variable for each datum, it is better to organise data as a table of rows and columns.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Table 1
CountryDeathsPopulation
Angola690021472
Brazil4400200362
Portugal14010608
\n", "In that way, instead of 400 variables, I only need one that stores the whole table. Instead of writing a mile long expression that adds 200 variables to obtain the total deaths, I’ll write a short expression that calculates the total of the ‘Deaths’ column, no matter how many countries (rows) there are.\n", "\n", "To organise data into tables and do calculations on such tables, you and I will use the pandas module, which is included in Anaconda and CoCalc. A __module__ is a package of various pieces of code that can be used individually. The pandas module provides very extensive and advanced data analysis capabilities to compliment Python. This course only scratches the surface of pandas.\n", "\n", "I have to tell the computer that I’m going to use a module.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "from pandas import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That line of code is an __import__ statement: from the pandas module, import everything. In plain English: load into memory all pieces of code that are in the pandas module, so that I can use any of them. In the above statement, the asterisk isn’t the multiplication operator but instead means ‘everything’.\n", "\n", "Each weekly project in this course will start with this import statement, because all projects need the pandas module.\n", "\n", "The words __" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__ and __" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__ are __reserved words__ : they can’t be used as variable, function or module names. Otherwise you will get a syntax error.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "from = 100\n", "File \"\", line 1\n", "from = 100\n", "^\n", "SyntaxError: invalid syntax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter notebooks show reserved words in boldface font to make them easier to spot. If you see a boldface name in an assignment (as you will for the code above), you must choose a different name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5 pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "Use Exercise 5 the Exercise notebook 1 to help you answer these questions about errors you might come across.\n", "\n", "__ 1. What kind of error will you get if you misspell 'pandas' as 'Pandas'? __\n", "\n", "A syntax error\n", "\n", "Remember that after the reserved word 'from' comes a module name.\n", "\n", "Take a look at The art of naming .\n", "\n", "A name error, reported as an import error\n", "\n", "The computer is expecting a name but there is no module with the name 'Pandas' in the Anaconda distribution. Remember that names are case-sensitive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "__ 2. What kind of error will you get if you misspell 'import' as 'impart'? __\n", "\n", "A name error\n", "\n", "A name error only occurs when a name is undefined, but import is not a name, it’s a reserved word.\n", "\n", "A syntax error\n", "\n", "The computer is expecting a reserved word and anything else will raise a syntax error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "__ 3. What kind of error will you get if you forget the asterisk? __\n", "\n", "A name error\n", "\n", "An asterisk is not a name so the reported error can’t be this one.\n", "\n", "A syntax error\n", "\n", "The statement cannot end with the reserved word 'import'; the computer is expecting an indication of what to import.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 This week’s data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the next part of the course you’ll need to download a file of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://www.open.edu/openlearn/ocw/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1026.jpg)\n", "__Figure 2__\n", "\n", "An image with a young boy wearing a medical mask, in the foreground; a patient in a South African tuberculosis clinic \n", "I have created a table with all the data necessary for the project and saved it in an Excel file. Excel is a popular application to create, edit and analyse tabular data. You won’t need Excel to complete this course, but many datasets are provided as Excel files.\n", "\n", "Open the data file WHO POP TB some.xls . The file is encoded using UTF-8, a character encoding that allows for accented letters. Do __not__ open or edit the file, as you may change how it is encoded, which will lead to errors later on. If you do want to look at its contents, make a copy of the file and look at the copy.\n", "\n", "Put the data file in the same folder (or CoCalc project) where you saved your exercise notebook. Done? Great, let’s proceed to loading the data – you’ll learn how to do this in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2 Loading the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many applications can read files in Excel format, and pandas can too. Asking the computer to read the data looks like this:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "\n", "data = read_excel('WHO POP TB some.xls')\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
CountryPopulation (1000s)TB deaths
0Angola214726900
1Brazil2003624400
2China139333741000
3Equatorial Guinea75767
4Guinea-Bissau17041200
5India1252140240000
6Mozambique2583418000
7Portugal10608140
8Russian Federation14283417000
9Sao Tome and Principe19318
10South Africa5277625000
11Timor-Leste1133990
\n", "The variable name data is not descriptive, but as there is only one dataset in our analysis, there is no possible confusion with other data, and short names help to keep the lines of code short.\n", "\n", "The function " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__read_excel()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " takes a file name as an argument and returns the table contained in the file. In pandas, tables are called __dataframes__ . To load the " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__data__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ", I simply call the function and store the returned dataframe in a variable.\n", "\n", "A file name must be given as a __string__ , a piece of text surrounded by quotes. The quote marks tell Python that this isn’t a variable, function or module name. Also, the quote marks state that this is a single name, even if it contains spaces, punctuation and other characters besides letters.\n", "\n", "Misspelling the file name, or not having the file in the same folder as the notebook containing the code, results in a __file not found__ error. In the example below there is an error in the file name.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "data = read_excel('WHO POP TB same.xls')\n", "data\n", "\n", "\n", "\n", "---------------------------------------------\n", "\n", "FileNotFoundError Traceback (most recent call last)\n", "\n", " in ()\n", "----> 1 data = read_excel(‘WHO POP TB same.xls’)\n", "2 data\n", "\n", "\n", "\n", "\n", "/Users/mw4687/anaconda/lib/python3.4/site-packages/pandas/io/excel.py in read_excel(io, sheetname, **kwds)\n", "\n", "130 engine = kwds.pop(‘engine’, None)\n", "131\n", "\n", "--> 132 return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)\n", "\n", "133\n", "134\n", "\n", "\n", "\n", "\n", "/Users/mw4687/anaconda/lib/python3.4/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)\n", "\n", "\n", "167 self.book = xlrd.open_workbook(file_contents=data)\n", "\n", "168 else:\n", "\n", "--> 169 self.book = xlrd.open_workbook(io)\n", "\n", "\n", "170 elif engine == ‘xlrd’ and isinstance(io, xlrd.Book):\n", "\n", "171 self.book = io\n", "\n", "\n", "\n", "\n", "/Users/mw4687/anaconda/lib/python3.4/site-packages/xlrd/__init__.py in open_workbook(filename, logfile,\n", "\n", " verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)\n", "\n", "392 peek = file_contents[:peeksz]\n", "393 else:\n", "--> 394 f = open(filename, \"rb\")\n", "395 peek = f.read(peeksz)\n", "396 f.close()\n", "\n", "\n", "\n", "\n", "FileNotFoundError: [Errno 2] No such file or directory: ‘WHO POP TB same.xls’\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter notebooks show strings in red. If you see red characters until the end of the line, you have forgotten to type the second quote that marks the end of the string.\n", "\n", "In the next section, find out how to select a column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.3 Selecting a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have the data, let the analysis begin!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://www.open.edu/openlearn/ocw/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1035.jpg)\n", "__Figure 3__\n", "\n", " An image of free standing Roman columns standing against a blue sky \n", "Let’s tackle the first part of the first question: ‘What are the total, smallest, largest and average number of deaths due to TB?’ Obtaining the total number will be done in two steps: first select the column with the TB deaths, then sum the values in that column.\n", "\n", "Selecting a single column of a dataframe is done with an expression in the format: __" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataFrame['column name']." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data['TB deaths']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "0 6900\n", "1 4400\n", "2 41000\n", "3 67\n", "4 1200\n", "5 240000\n", "6 18000\n", "7 140\n", "8 17000\n", "9 18\n", "10 25000\n", "11 990\n", "Name: TB deaths, dtype: int64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Strings are verbatim text, which means that the column name must be written exactly as given in the dataframe, which you saw after loading the data. The slightest deviation leads to a __key error__ , which can be seen as a kind of name error. You can try out in the Week 2 exercise notebook what happens when misspelling the column name. The error message is horribly long. In such cases, just skip to the last line of the error message to see the type of error.\n", "\n", "Put this learning into practice in Exercise 6." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6 selecting a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "In your Exercise notebook 1, select the population column and store it in a variable, so that you can use it in later exercises.\n", "\n", "Remember that to open the notebook you’ll need to launch Anaconda and then navigate to the notebook using Jupyter. Once it’s open, run all the code.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, you’ll learn about making calculations on a column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.4 Calculations on a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having selected the column with the number of deaths per country, I’ll add them with the appropriately named sum() method to obtain the overall total deaths.\n", "\n", "A __method__ is a function that can only be called in a certain context. In this course, the context will mostly be a dataframe or a column. A __method call__ looks like a function call, but adds the context in which to call the method: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__context.methodName(argument1, argument2, ...)__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " . In other words, a dataframe method can only be called on dataframes, a column method only on columns. Because methods are functions, a method call returns a value and is therefore an expression.\n", "\n", "If all that sounded too abstract, here’s how to call the " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__sum()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " method on the TB deaths column. Note that " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__sum()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " doesn’t need any arguments because all the values are in the column.\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "\n", "tbColumn = data['TB deaths']\n", "tbColumn.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "354715" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated total number of deaths due to TB in 2013 in the BRICS and Portuguese-speaking countries was over 350 thousand. An impressive number, for the wrong reasons.\n", "\n", "Calculating the minimum and maximum number of deaths is done in a similar way.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "tbColumn.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "tbColumn.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "240000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__sum()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " , the column methods " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__min()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " and " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__max()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " don’t need arguments, whereas the Python functions " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__min()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " and " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__max()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " did need them, because there was no context (column) providing the values.\n", "\n", "The average number is computed as before, dividing the total by the number of countries.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "tbColumn.sum() / 12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "29559.583333333332" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kind of average is called the __mean__ and there’s a method for that.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "tbColumn.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "29559.583333333332" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another kind of average measure is the __median__ , which is the number in the middle, i.e. half of the values are above the median and half below it.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "tbColumn.median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "5650.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean is five times higher than the median. While half the countries had less than 5650 deaths in 2013, some countries had far more, which pushes the mean up.\n", "\n", "The median is probably closer to the intuition you have of what ‘average’ should mean (pun intended). News reports don’t always make clear what average measure is being used, and using the mean may distort reality. For example, the mean household income in a country will be influenced by very poor and very rich households, whereas the median income doesn’t take into account how poor or rich the extremes are: it will always be half the households below and half above the median.\n", "\n", "Put this learning into practice in Exercise 7." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 7 calculations on a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "Practise the use of column methods by applying them to the population column you obtained in Exercise 6 in the Exercise notebook 1. Remember to run all code before doing the exercise.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.5 Sorting on a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the research questions was: which countries have the smallest and largest number of deaths?\n", "\n", "Being a small table, it is not too difficult to scan the TB deaths column and find those countries. However, such a process is prone to errors and impractical for large tables. It’s much better to sort the table by that column, and then look up the countries in the first and last rows.\n", "\n", "As you’ve guessed by now, sorting a table is another single line of code.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data.sort_values('TB deaths')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
CountryPopulation (1000s)TB deaths
9Sao Tome and Principe19318
3Equatorial Guinea75767
7Portugal10608140
11Timor-Leste1133990
4Guinea-Bissau17041200
1Brazil2003624400
0Angola214726900
8Russian Federation14283417000
6Mozambique2583418000
10South Africa5277625000
2China139333741000
5India1252140240000
\n", "The dataframe method " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__sort_values()__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " takes as argument a column name and returns a new dataframe where the rows are in ascending order of the values in that column. Note that sorting doesn’t modify the original dataframe.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data # rows still in original order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
CountryPopulation (1000s)TB deaths
0Angola214726900
1Brazil2003624400
2China139333741000
3Equatorial Guinea75767
4Guinea-Bissau17041200
5India1252140240000
6Mozambique2583418000
7Portugal10608140
8Russian Federation14283417000
9Sao Tome and Principe19318
10South Africa5277625000
11Timor-Leste1133990
\n", "It’s also possible to sort on a column that has text instead of numbers; the rows will be sorted in alphabetical order.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "data.sort_values('Country')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
CountryPopulation (1000s)TB deaths
0Angola214726900
1Brazil2003624400
2China139333741000
3Equatorial Guinea75767
4Guinea-Bissau17041200
5India1252140240000
6Mozambique2583418000
7Portugal10608140
8Russian Federation14283417000
9Sao Tome and Principe19318
10South Africa5277625000
11Timor-Leste1133990
\n", "\n", "### Exercise 8 sorting on a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question\n", "\n", "Use the Exercise notebook 1 to sort the table by population so that you can quickly see which are the least and the most populous countries. Remember to run all code before doing the exercise.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next section you’ll learn about calculations over columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.6 Calculations over columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last remaining task is to calculate the death rate of each country.\n", "\n", "You may recall that with the simple approach I’d have to write:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "\n", "rateAngola = deathsInAngola * 100 / populationOfAngola\n", "\n", "\n", "rateBrazil = deathsInBrazil * 100 / populationOfBrazil\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and so on, and so on. If you’ve used spreadsheets, it’s the same process: create the formula for the first row and then copy it down for all the rows. This is laborious and error-prone, e.g. if rows are added later on. Given that data is organised by columns, wouldn’t it be nice to simply write the following?\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "rateColumn = deathsColumn * 100 / populationColumn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say no more: your wish is pandas’s command.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "\n", "deathsColumn = data['TB deaths']\n", "populationColumn = data['Population (1000s)']\n", "rateColumn = deathsColumn * 100 / populationColumn\n", "rateColumn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "0 32.134873\n", "\n", "1 2.196025\n", "\n", "2 2.942576\n", "\n", "3 8.850727\n", "\n", "4 70.422535\n", "\n", "5 19.167186\n", "\n", "6 69.675621\n", "\n", "7 1.319759\n", "\n", "8 11.901928\n", "\n", "9 9.326425\n", "\n", "10 47.370017\n", "\n", "11 87.378641\n", "\n", "dtype: float64\n", "\n", "Tadaaa! With pandas, the arithmetic operators become much smarter. When adding, subtracting, multiplying or dividing columns, the computer understands that the operation is to be done row by row and creates a new column.\n", "\n", "All well and nice, but how to put that new column into the dataframe, in order to have everything in a single table? In an assignment " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__variable = expression__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " , if the variable hasn’t been mentioned before, the computer creates the variable and stores in it the expression’s value. Likewise, if I assign to a column that doesn’t exist in the dataframe, the computer will create it.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "__In []:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "\n", "data['TB deaths (per 100,000)'] = rateColumn\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "__Out[]:__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
CountryPopulation (1000s)TB deathsTB deaths (per 100,000)
0Angola21472690032.134873
1Brazil20036244002.196025
2China1393337410002.942576
3Equatorial Guinea757678.850727
4Guinea-Bissau1704120070.422535
5India125214024000019.167186
6Mozambique258341800069.675621
7Portugal106081401.319759
8Russian Federation1428341700011.901928
9Sao Tome and Principe193189.326425
10South Africa527762500047.370017
11Timor-Leste113399087.378641
\n", "That’s it! I’ve written all the code needed to answer the questions I had. Next I’ll write up the analysis into a succinct and stand-alone notebook that can be shared with friends, family and colleagues or the whole world. You’ll find that in the next section.\n" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "", "language": "", "name": "" } }, "nbformat": 4, "nbformat_minor": 2 }