WEBVTT 1 00:00:03.050 --> 00:00:04.070 - [Instructor] So, let's get started 2 00:00:04.070 --> 00:00:08.020 on the challenge for chapter six which is indexing. 3 00:00:08.020 --> 00:00:11.040 So, let's import Pandas, 4 00:00:11.040 --> 00:00:14.060 let's read the Olympic CSV file into our DataFrame 5 00:00:14.060 --> 00:00:17.080 called 00, so plot the total number 6 00:00:17.080 --> 00:00:20.040 of medals awarded at each of the Olympic Games 7 00:00:20.040 --> 00:00:23.040 throughout history, so since we want 8 00:00:23.040 --> 00:00:25.080 the number of medals throughout history, 9 00:00:25.080 --> 00:00:29.030 let's use the edition, so 00.Edition 10 00:00:29.030 --> 00:00:32.040 which is the series in the 00 DataFrame 11 00:00:32.040 --> 00:00:36.040 and let's use value_counts. 12 00:00:36.040 --> 00:00:41.090 And if we plot that, 13 00:00:41.090 --> 00:00:44.010 we don't seem to have a plot that we expected, 14 00:00:44.010 --> 00:00:46.070 so what seems to be wrong here? 15 00:00:46.070 --> 00:00:48.070 Well, let's take a few steps back 16 00:00:48.070 --> 00:00:52.080 and look at the value_counts object. 17 00:00:52.080 --> 00:00:55.040 Now, if I do a type here, 18 00:00:55.040 --> 00:01:03.050 we can see that 00.Edition.value_counts 19 00:01:03.050 --> 00:01:06.050 is in fact a Pandas series, 20 00:01:06.050 --> 00:01:08.020 so what that means then 21 00:01:08.020 --> 00:01:12.030 is that when I do a value_counts, 22 00:01:12.030 --> 00:01:14.050 the first column is the index 23 00:01:14.050 --> 00:01:17.080 and the second column are the values for that index, 24 00:01:17.080 --> 00:01:20.010 so part of the reason that we had a problem 25 00:01:20.010 --> 00:01:21.050 when plotting that graph 26 00:01:21.050 --> 00:01:24.050 was this index is not sorted, 27 00:01:24.050 --> 00:01:28.030 so what we can do then is to sort the index 28 00:01:28.030 --> 00:01:29.070 of these value_Counts because remember, 29 00:01:29.070 --> 00:01:32.010 this is a series, 30 00:01:32.010 --> 00:01:36.030 so, sort_index 31 00:01:36.030 --> 00:01:38.040 and let's look at the resulting structure 32 00:01:38.040 --> 00:01:40.050 and now we can see that we've got the series sorted 33 00:01:40.050 --> 00:01:44.030 by year, and so, then, 34 00:01:44.030 --> 00:01:49.020 if we go ahead and do a plot, 35 00:01:49.020 --> 00:01:51.080 we get the total number of medals presented 36 00:01:51.080 --> 00:01:58.000 at each of the Olympic Games throughout history. 37 00:01:58.000 --> 00:02:00.060 So, which countries did not win a medal 38 00:02:00.060 --> 00:02:03.060 in the 2008 Beijing Olympics 39 00:02:03.060 --> 00:02:06.070 and how many countries were there? 40 00:02:06.070 --> 00:02:07.060 So, let's think a little bit 41 00:02:07.060 --> 00:02:10.080 about how we will tackle this problem. 42 00:02:10.080 --> 00:02:13.070 Our approach will be to get a common index 43 00:02:13.070 --> 00:02:16.070 of NOC or the country, 44 00:02:16.070 --> 00:02:18.090 and we'll get the list of all of the countries 45 00:02:18.090 --> 00:02:22.020 and index them using the NOC code. 46 00:02:22.020 --> 00:02:24.040 So, if you visualize a column 47 00:02:24.040 --> 00:02:27.040 with all of the countries listed there, 48 00:02:27.040 --> 00:02:29.040 and what we're going to then do is we're going to take a list 49 00:02:29.040 --> 00:02:32.080 of all of the countries that won a medal in 2008 50 00:02:32.080 --> 00:02:35.090 and overlay that on that column 51 00:02:35.090 --> 00:02:38.030 with all of the countries. 52 00:02:38.030 --> 00:02:40.010 Where we have missing data 53 00:02:40.010 --> 00:02:43.040 is where a country has not won a medal. 54 00:02:43.040 --> 00:02:48.020 So, let's identify the countries that won a medal in 2008. 55 00:02:48.020 --> 00:02:52.030 So, 00, which is our DataFrame, 56 00:02:52.030 --> 00:02:55.060 so 00.Edition 57 00:02:55.060 --> 00:02:58.020 and 2008, equals 2008 58 00:02:58.020 --> 00:03:05.090 and let's store this in the DataFrame last Olympics. 59 00:03:05.090 --> 00:03:07.040 Now, this is another really good reason 60 00:03:07.040 --> 00:03:09.050 to familiarize yourself 61 00:03:09.050 --> 00:03:11.060 with your original dataset. 62 00:03:11.060 --> 00:03:14.010 If you remember, the data from The Guardian 63 00:03:14.010 --> 00:03:17.070 had a tab of all of the Olympic committees or countries, 64 00:03:17.070 --> 00:03:21.060 so let's grab that from The Guardian site. 65 00:03:21.060 --> 00:03:23.080 This file will also be in your data folder 66 00:03:23.080 --> 00:03:26.020 in the exercise file. 67 00:03:26.020 --> 00:03:30.020 So, I've done a search for Summer Olympics Medal Data, 68 00:03:30.020 --> 00:03:31.080 get the data 69 00:03:31.080 --> 00:03:35.090 and I'll download the full spreadsheet. 70 00:03:35.090 --> 00:03:37.050 So, this is the tab that I was referring to, 71 00:03:37.050 --> 00:03:39.070 the IOC country codes. 72 00:03:39.070 --> 00:03:40.080 We're going to need this information, 73 00:03:40.080 --> 00:03:47.080 so let's save this as a CSV file. 74 00:03:47.080 --> 00:03:50.040 And I'm going to copy this across 75 00:03:50.040 --> 00:03:54.050 to my data folder. 76 00:03:54.050 --> 00:03:58.040 Now, let's read that CSV file into a DataFrame. 77 00:03:58.040 --> 00:04:02.050 Let's call that DataFrame NOC for the Olympic countries, 78 00:04:02.050 --> 00:04:07.070 so pd.read_csv. 79 00:04:07.070 --> 00:04:10.010 That's in the data folder 80 00:04:10.010 --> 00:04:13.040 and I'm going to just leave the original file name 81 00:04:13.040 --> 00:04:18.080 and let's use the DataFrame name NOC here 82 00:04:18.080 --> 00:04:21.060 and let's just make sure that we have inputted 83 00:04:21.060 --> 00:04:23.070 the data that we expect, 84 00:04:23.070 --> 00:04:26.040 so NOC.Head 85 00:04:26.040 --> 00:04:29.010 and you can see that the first column is the country, 86 00:04:29.010 --> 00:04:31.060 the second column is the Olympic Committee code 87 00:04:31.060 --> 00:04:33.040 for that country, the ISO code 88 00:04:33.040 --> 00:04:35.090 and finally, country.1. 89 00:04:35.090 --> 00:04:38.060 Now, because I haven't seen this dataset before, 90 00:04:38.060 --> 00:04:39.060 I probably would want to know 91 00:04:39.060 --> 00:04:42.070 what's the difference between what's in the country column 92 00:04:42.070 --> 00:04:44.050 and the country.1 column. 93 00:04:44.050 --> 00:04:46.060 Is there anything that we need to understand 94 00:04:46.060 --> 00:04:49.010 about the differences between these columns? 95 00:04:49.010 --> 00:04:51.000 So, firstly I want to just check 96 00:04:51.000 --> 00:04:52.010 if there is any difference, 97 00:04:52.010 --> 00:04:57.060 so I'll do an NOC country 98 00:04:57.060 --> 00:04:59.080 and I want to see if there are any rules 99 00:04:59.080 --> 00:05:01.070 where the country and the country.1 differ, 100 00:05:01.070 --> 00:05:05.010 so NOC and country.1 101 00:05:05.010 --> 00:05:08.070 and so, I'm checking to see if NOC is not equal to country.1 102 00:05:08.070 --> 00:05:09.090 at any of the rows 103 00:05:09.090 --> 00:05:16.090 and I want to display those results. 104 00:05:16.090 --> 00:05:18.060 And given that this is empty, 105 00:05:18.060 --> 00:05:20.000 we can see that there is no difference 106 00:05:20.000 --> 00:05:21.070 between country and country.1 107 00:05:21.070 --> 00:05:23.060 and so, we don't need to be overly concerned 108 00:05:23.060 --> 00:05:25.090 about the fact that there's this additional column 109 00:05:25.090 --> 00:05:27.090 called country.1. 110 00:05:27.090 --> 00:05:31.030 So, let's set the NOC DataFrame to have an index 111 00:05:31.030 --> 00:05:35.050 of the Olympic Committee code. 112 00:05:35.050 --> 00:05:40.020 So, I see NOC and let's set the index 113 00:05:40.020 --> 00:05:45.070 to be the series name 114 00:05:45.070 --> 00:05:51.040 and let me just copy that series name from here. 115 00:05:51.040 --> 00:05:53.000 And I want to store this inplace, 116 00:05:53.000 --> 00:05:56.030 so inplace equals true 117 00:05:56.030 --> 00:05:59.030 and let's have a look at our DataFrame 118 00:05:59.030 --> 00:06:01.080 with its new index. 119 00:06:01.080 --> 00:06:03.020 So, in the next step, 120 00:06:03.020 --> 00:06:05.080 you want to get a unique list 121 00:06:05.080 --> 00:06:07.010 of all of the countries 122 00:06:07.010 --> 00:06:09.080 and we can do that using value_counts, 123 00:06:09.080 --> 00:06:13.020 so our original DataFrame from 2008 124 00:06:13.020 --> 00:06:19.080 is LO, the NOC's there and value_counts 125 00:06:19.080 --> 00:06:22.010 and remember that this is going to be a series, 126 00:06:22.010 --> 00:06:27.070 so we'll just store this in the series name medals_2008 127 00:06:27.070 --> 00:06:31.050 'cause this gives us the medals that have been won in 2008. 128 00:06:31.050 --> 00:06:35.010 Let's just check that 129 00:06:35.010 --> 00:06:36.050 and now what we're going to want to do 130 00:06:36.050 --> 00:06:39.020 is to add a new series 131 00:06:39.020 --> 00:06:41.010 to our DataFrame NOC 132 00:06:41.010 --> 00:06:46.040 which has the results of all of the medals won in 2008. 133 00:06:46.040 --> 00:06:50.010 So, we have NOC 134 00:06:50.010 --> 00:06:55.070 medal 2008 135 00:06:55.070 --> 00:06:58.080 and we're going to set the new series that we've created 136 00:06:58.080 --> 00:06:59.090 to that DataFrame, 137 00:06:59.090 --> 00:07:04.070 so let's have a look at our NOC DataFrame. 138 00:07:04.070 --> 00:07:08.080 So, here wherever we have a nan value or a missing data 139 00:07:08.080 --> 00:07:10.030 is where the country did not win 140 00:07:10.030 --> 00:07:13.050 any medals in the 2008 Olympics. 141 00:07:13.050 --> 00:07:15.050 So, if you want to get a list of that, 142 00:07:15.050 --> 00:07:20.080 we say NOC medal 2008 143 00:07:20.080 --> 00:07:23.060 is null 144 00:07:23.060 --> 00:07:28.010 and let's return that DataFrame. 145 00:07:28.010 --> 00:07:29.030 So, here we have it. 146 00:07:29.030 --> 00:07:31.040 These are all of the countries 147 00:07:31.040 --> 00:07:33.000 that did not win a medal 148 00:07:33.000 --> 00:07:36.010 in the 2008 Olympics. 149 00:07:36.010 --> 00:07:36.010 In the next video, we will look at Groupby.