WEBVTT

1
00:00:03.050 --> 00:00:04.070
- [Instructor] So, let's get started

2
00:00:04.070 --> 00:00:08.020
on the challenge for chapter six which is indexing.

3
00:00:08.020 --> 00:00:11.040
So, let's import Pandas,

4
00:00:11.040 --> 00:00:14.060
let's read the Olympic CSV file into our DataFrame

5
00:00:14.060 --> 00:00:17.080
called 00, so plot the total number

6
00:00:17.080 --> 00:00:20.040
of medals awarded at each of the Olympic Games

7
00:00:20.040 --> 00:00:23.040
throughout history, so since we want

8
00:00:23.040 --> 00:00:25.080
the number of medals throughout history,

9
00:00:25.080 --> 00:00:29.030
let's use the edition, so 00.Edition

10
00:00:29.030 --> 00:00:32.040
which is the series in the 00 DataFrame

11
00:00:32.040 --> 00:00:36.040
and let's use value_counts.

12
00:00:36.040 --> 00:00:41.090
And if we plot that,

13
00:00:41.090 --> 00:00:44.010
we don't seem to have a plot that we expected,

14
00:00:44.010 --> 00:00:46.070
so what seems to be wrong here?

15
00:00:46.070 --> 00:00:48.070
Well, let's take a few steps back

16
00:00:48.070 --> 00:00:52.080
and look at the value_counts object.

17
00:00:52.080 --> 00:00:55.040
Now, if I do a type here,

18
00:00:55.040 --> 00:01:03.050
we can see that 00.Edition.value_counts

19
00:01:03.050 --> 00:01:06.050
is in fact a Pandas series,

20
00:01:06.050 --> 00:01:08.020
so what that means then

21
00:01:08.020 --> 00:01:12.030
is that when I do a value_counts,

22
00:01:12.030 --> 00:01:14.050
the first column is the index

23
00:01:14.050 --> 00:01:17.080
and the second column are the values for that index,

24
00:01:17.080 --> 00:01:20.010
so part of the reason that we had a problem

25
00:01:20.010 --> 00:01:21.050
when plotting that graph

26
00:01:21.050 --> 00:01:24.050
was this index is not sorted,

27
00:01:24.050 --> 00:01:28.030
so what we can do then is to sort the index

28
00:01:28.030 --> 00:01:29.070
of these value_Counts because remember,

29
00:01:29.070 --> 00:01:32.010
this is a series,

30
00:01:32.010 --> 00:01:36.030
so, sort_index

31
00:01:36.030 --> 00:01:38.040
and let's look at the resulting structure

32
00:01:38.040 --> 00:01:40.050
and now we can see that we've got the series sorted

33
00:01:40.050 --> 00:01:44.030
by year, and so, then,

34
00:01:44.030 --> 00:01:49.020
if we go ahead and do a plot,

35
00:01:49.020 --> 00:01:51.080
we get the total number of medals presented

36
00:01:51.080 --> 00:01:58.000
at each of the Olympic Games throughout history.

37
00:01:58.000 --> 00:02:00.060
So, which countries did not win a medal

38
00:02:00.060 --> 00:02:03.060
in the 2008 Beijing Olympics

39
00:02:03.060 --> 00:02:06.070
and how many countries were there?

40
00:02:06.070 --> 00:02:07.060
So, let's think a little bit

41
00:02:07.060 --> 00:02:10.080
about how we will tackle this problem.

42
00:02:10.080 --> 00:02:13.070
Our approach will be to get a common index

43
00:02:13.070 --> 00:02:16.070
of NOC or the country,

44
00:02:16.070 --> 00:02:18.090
and we'll get the list of all of the countries

45
00:02:18.090 --> 00:02:22.020
and index them using the NOC code.

46
00:02:22.020 --> 00:02:24.040
So, if you visualize a column

47
00:02:24.040 --> 00:02:27.040
with all of the countries listed there,

48
00:02:27.040 --> 00:02:29.040
and what we're going to then do is we're going to take a list

49
00:02:29.040 --> 00:02:32.080
of all of the countries that won a medal in 2008

50
00:02:32.080 --> 00:02:35.090
and overlay that on that column

51
00:02:35.090 --> 00:02:38.030
with all of the countries.

52
00:02:38.030 --> 00:02:40.010
Where we have missing data

53
00:02:40.010 --> 00:02:43.040
is where a country has not won a medal.

54
00:02:43.040 --> 00:02:48.020
So, let's identify the countries that won a medal in 2008.

55
00:02:48.020 --> 00:02:52.030
So, 00, which is our DataFrame,

56
00:02:52.030 --> 00:02:55.060
so 00.Edition

57
00:02:55.060 --> 00:02:58.020
and 2008, equals 2008

58
00:02:58.020 --> 00:03:05.090
and let's store this in the DataFrame last Olympics.

59
00:03:05.090 --> 00:03:07.040
Now, this is another really good reason

60
00:03:07.040 --> 00:03:09.050
to familiarize yourself

61
00:03:09.050 --> 00:03:11.060
with your original dataset.

62
00:03:11.060 --> 00:03:14.010
If you remember, the data from The Guardian

63
00:03:14.010 --> 00:03:17.070
had a tab of all of the Olympic committees or countries,

64
00:03:17.070 --> 00:03:21.060
so let's grab that from The Guardian site.

65
00:03:21.060 --> 00:03:23.080
This file will also be in your data folder

66
00:03:23.080 --> 00:03:26.020
in the exercise file.

67
00:03:26.020 --> 00:03:30.020
So, I've done a search for Summer Olympics Medal Data,

68
00:03:30.020 --> 00:03:31.080
get the data

69
00:03:31.080 --> 00:03:35.090
and I'll download the full spreadsheet.

70
00:03:35.090 --> 00:03:37.050
So, this is the tab that I was referring to,

71
00:03:37.050 --> 00:03:39.070
the IOC country codes.

72
00:03:39.070 --> 00:03:40.080
We're going to need this information,

73
00:03:40.080 --> 00:03:47.080
so let's save this as a CSV file.

74
00:03:47.080 --> 00:03:50.040
And I'm going to copy this across

75
00:03:50.040 --> 00:03:54.050
to my data folder.

76
00:03:54.050 --> 00:03:58.040
Now, let's read that CSV file into a DataFrame.

77
00:03:58.040 --> 00:04:02.050
Let's call that DataFrame NOC for the Olympic countries,

78
00:04:02.050 --> 00:04:07.070
so pd.read_csv.

79
00:04:07.070 --> 00:04:10.010
That's in the data folder

80
00:04:10.010 --> 00:04:13.040
and I'm going to just leave the original file name

81
00:04:13.040 --> 00:04:18.080
and let's use the DataFrame name NOC here

82
00:04:18.080 --> 00:04:21.060
and let's just make sure that we have inputted

83
00:04:21.060 --> 00:04:23.070
the data that we expect,

84
00:04:23.070 --> 00:04:26.040
so NOC.Head

85
00:04:26.040 --> 00:04:29.010
and you can see that the first column is the country,

86
00:04:29.010 --> 00:04:31.060
the second column is the Olympic Committee code

87
00:04:31.060 --> 00:04:33.040
for that country, the ISO code

88
00:04:33.040 --> 00:04:35.090
and finally, country.1.

89
00:04:35.090 --> 00:04:38.060
Now, because I haven't seen this dataset before,

90
00:04:38.060 --> 00:04:39.060
I probably would want to know

91
00:04:39.060 --> 00:04:42.070
what's the difference between what's in the country column

92
00:04:42.070 --> 00:04:44.050
and the country.1 column.

93
00:04:44.050 --> 00:04:46.060
Is there anything that we need to understand

94
00:04:46.060 --> 00:04:49.010
about the differences between these columns?

95
00:04:49.010 --> 00:04:51.000
So, firstly I want to just check

96
00:04:51.000 --> 00:04:52.010
if there is any difference,

97
00:04:52.010 --> 00:04:57.060
so I'll do an NOC country

98
00:04:57.060 --> 00:04:59.080
and I want to see if there are any rules

99
00:04:59.080 --> 00:05:01.070
where the country and the country.1 differ,

100
00:05:01.070 --> 00:05:05.010
so NOC and country.1

101
00:05:05.010 --> 00:05:08.070
and so, I'm checking to see if NOC is not equal to country.1

102
00:05:08.070 --> 00:05:09.090
at any of the rows

103
00:05:09.090 --> 00:05:16.090
and I want to display those results.

104
00:05:16.090 --> 00:05:18.060
And given that this is empty,

105
00:05:18.060 --> 00:05:20.000
we can see that there is no difference

106
00:05:20.000 --> 00:05:21.070
between country and country.1

107
00:05:21.070 --> 00:05:23.060
and so, we don't need to be overly concerned

108
00:05:23.060 --> 00:05:25.090
about the fact that there's this additional column

109
00:05:25.090 --> 00:05:27.090
called country.1.

110
00:05:27.090 --> 00:05:31.030
So, let's set the NOC DataFrame to have an index

111
00:05:31.030 --> 00:05:35.050
of the Olympic Committee code.

112
00:05:35.050 --> 00:05:40.020
So, I see NOC and let's set the index

113
00:05:40.020 --> 00:05:45.070
to be the series name

114
00:05:45.070 --> 00:05:51.040
and let me just copy that series name from here.

115
00:05:51.040 --> 00:05:53.000
And I want to store this inplace,

116
00:05:53.000 --> 00:05:56.030
so inplace equals true

117
00:05:56.030 --> 00:05:59.030
and let's have a look at our DataFrame

118
00:05:59.030 --> 00:06:01.080
with its new index.

119
00:06:01.080 --> 00:06:03.020
So, in the next step,

120
00:06:03.020 --> 00:06:05.080
you want to get a unique list

121
00:06:05.080 --> 00:06:07.010
of all of the countries

122
00:06:07.010 --> 00:06:09.080
and we can do that using value_counts,

123
00:06:09.080 --> 00:06:13.020
so our original DataFrame from 2008

124
00:06:13.020 --> 00:06:19.080
is LO, the NOC's there and value_counts

125
00:06:19.080 --> 00:06:22.010
and remember that this is going to be a series,

126
00:06:22.010 --> 00:06:27.070
so we'll just store this in the series name medals_2008

127
00:06:27.070 --> 00:06:31.050
'cause this gives us the medals that have been won in 2008.

128
00:06:31.050 --> 00:06:35.010
Let's just check that

129
00:06:35.010 --> 00:06:36.050
and now what we're going to want to do

130
00:06:36.050 --> 00:06:39.020
is to add a new series

131
00:06:39.020 --> 00:06:41.010
to our DataFrame NOC

132
00:06:41.010 --> 00:06:46.040
which has the results of all of the medals won in 2008.

133
00:06:46.040 --> 00:06:50.010
So, we have NOC

134
00:06:50.010 --> 00:06:55.070
medal 2008

135
00:06:55.070 --> 00:06:58.080
and we're going to set the new series that we've created

136
00:06:58.080 --> 00:06:59.090
to that DataFrame,

137
00:06:59.090 --> 00:07:04.070
so let's have a look at our NOC DataFrame.

138
00:07:04.070 --> 00:07:08.080
So, here wherever we have a nan value or a missing data

139
00:07:08.080 --> 00:07:10.030
is where the country did not win

140
00:07:10.030 --> 00:07:13.050
any medals in the 2008 Olympics.

141
00:07:13.050 --> 00:07:15.050
So, if you want to get a list of that,

142
00:07:15.050 --> 00:07:20.080
we say NOC medal 2008

143
00:07:20.080 --> 00:07:23.060
is null

144
00:07:23.060 --> 00:07:28.010
and let's return that DataFrame.

145
00:07:28.010 --> 00:07:29.030
So, here we have it.

146
00:07:29.030 --> 00:07:31.040
These are all of the countries

147
00:07:31.040 --> 00:07:33.000
that did not win a medal

148
00:07:33.000 --> 00:07:36.010
in the 2008 Olympics.

149
00:07:36.010 --> 00:07:36.010
In the next video, we will look at Groupby.