WEBVTT

1
00:00:00.004 --> 00:00:02.008
- [Instructor] Having seen the security use cases

2
00:00:02.008 --> 00:00:05.003
and the use of AI to address them,

3
00:00:05.003 --> 00:00:08.002
it's time to look under the hood.

4
00:00:08.002 --> 00:00:12.001
I will start with a few commonly interchanged

5
00:00:12.001 --> 00:00:14.008
and often confused terms.

6
00:00:14.008 --> 00:00:17.000
If you read through the literature on AI,

7
00:00:17.000 --> 00:00:21.002
you will come across these terms very often.

8
00:00:21.002 --> 00:00:23.009
Now, I don't want you to walk away with the impression

9
00:00:23.009 --> 00:00:28.009
that the discipline of learning is the entire field of AI.

10
00:00:28.009 --> 00:00:31.009
In fact, it is one of the capabilities

11
00:00:31.009 --> 00:00:35.001
that is exhibited by AI systems.

12
00:00:35.001 --> 00:00:37.000
Machine learning, on the other hand,

13
00:00:37.000 --> 00:00:40.007
is a type of learning that uses statistical techniques

14
00:00:40.007 --> 00:00:45.007
and modeling to perform a task without explicit programming.

15
00:00:45.007 --> 00:00:49.003
So that leaves us with deep learning.

16
00:00:49.003 --> 00:00:52.001
Deep learning is a type of machine learning

17
00:00:52.001 --> 00:00:56.002
that layers many learning algorithms.

18
00:00:56.002 --> 00:00:59.003
It tries to mimic the way neural networks

19
00:00:59.003 --> 00:01:01.006
in our brains function.

20
00:01:01.006 --> 00:01:04.008
When you build an AI-based security solution,

21
00:01:04.008 --> 00:01:09.005
you have many machine learning algorithms at your disposal.

22
00:01:09.005 --> 00:01:12.000
The algorithm you end up choosing

23
00:01:12.000 --> 00:01:16.008
depends primarily on two macro factors.

24
00:01:16.008 --> 00:01:21.003
First, the type of training data available to you

25
00:01:21.003 --> 00:01:24.004
and next, the type of security problem

26
00:01:24.004 --> 00:01:27.006
you are trying to solve.

27
00:01:27.006 --> 00:01:31.001
If you study your organization's vulnerability database,

28
00:01:31.001 --> 00:01:35.001
log files, packet traces, and user access records,

29
00:01:35.001 --> 00:01:40.006
you will discover that you have two types of data samples.

30
00:01:40.006 --> 00:01:44.002
First, the type of data whose characteristics

31
00:01:44.002 --> 00:01:47.004
you fully understand. In other words,

32
00:01:47.004 --> 00:01:49.008
you already know that the data at hand

33
00:01:49.008 --> 00:01:55.004
is indicative of either genuine or suspicious behavior.

34
00:01:55.004 --> 00:01:59.002
For example, a website that you know is fraudulent

35
00:01:59.002 --> 00:02:02.004
and is likely being used in a phishing attack,

36
00:02:02.004 --> 00:02:06.007
or a program trace that is a clear sign

37
00:02:06.007 --> 00:02:08.008
of malware execution.

38
00:02:08.008 --> 00:02:11.007
In fact, you know the data so well,

39
00:02:11.007 --> 00:02:15.008
you can label it with tags such as good or bad.

40
00:02:15.008 --> 00:02:20.003
This type of data is known as labeled data.

41
00:02:20.003 --> 00:02:23.006
In the second type of data, you don't know beforehand

42
00:02:23.006 --> 00:02:28.004
whether the data at hand represents good or bad behavior.

43
00:02:28.004 --> 00:02:31.004
For example, when you look at the login dates

44
00:02:31.004 --> 00:02:34.008
and times of your employees over the past month,

45
00:02:34.008 --> 00:02:38.007
you just don't know which ones are suspicious.

46
00:02:38.007 --> 00:02:43.001
In other words, you can't put a good or bad label on it.

47
00:02:43.001 --> 00:02:48.001
This type of data is known as unlabeled data.

48
00:02:48.001 --> 00:02:52.008
Training a machine learning model using labeled data

49
00:02:52.008 --> 00:02:56.001
or, in other words, using our prior knowledge

50
00:02:56.001 --> 00:02:58.003
of the relationship between the data

51
00:02:58.003 --> 00:03:03.009
and the desired outcome, is known as supervised learning.

52
00:03:03.009 --> 00:03:08.002
On the other hand, when such a label doesn't exist

53
00:03:08.002 --> 00:03:10.002
and we use the machine learning model

54
00:03:10.002 --> 00:03:14.001
to discover new and interesting patterns within the data,

55
00:03:14.001 --> 00:03:19.001
that process is known as unsupervised learning.

56
00:03:19.001 --> 00:03:21.001
The second factor that determines

57
00:03:21.001 --> 00:03:23.007
your choice of algorithm is the type

58
00:03:23.007 --> 00:03:27.003
of security problem you aim to solve.

59
00:03:27.003 --> 00:03:30.005
Although machine learning has been applied commercially

60
00:03:30.005 --> 00:03:33.007
to a variety of problems across industries,

61
00:03:33.007 --> 00:03:37.003
in the field of security, it is commonly applied

62
00:03:37.003 --> 00:03:40.000
to four types of problems today.

63
00:03:40.000 --> 00:03:43.004
Of course, that may change in the future.

64
00:03:43.004 --> 00:03:47.003
You want to predict a future security event based

65
00:03:47.003 --> 00:03:50.006
on the information you have about past events.

66
00:03:50.006 --> 00:03:54.003
You want to categorize your data into known categories

67
00:03:54.003 --> 00:03:57.006
such as normal versus malicious.

68
00:03:57.006 --> 00:04:01.003
You want to find interesting and unusual patterns

69
00:04:01.003 --> 00:04:06.006
in your data that you couldn't have found yourself.

70
00:04:06.006 --> 00:04:09.002
And the last one, you want to generate

71
00:04:09.002 --> 00:04:13.004
adversarial synthetic data that is indistinguishable

72
00:04:13.004 --> 00:04:16.003
from the real data.

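
NOTE
Editor's sketch, not part of the narration. The contrast between supervised
and unsupervised learning described above can be illustrated with a toy
example. Everything here is an assumption made for illustration: the sample
values, the nearest-centroid classifier standing in for a supervised model,
and the standard-deviation outlier check standing in for an unsupervised one.
```python
from statistics import mean, pstdev
# Hypothetical labeled samples: (failed logins per hour, label).
labeled = [(1, "good"), (2, "good"), (3, "good"), (40, "bad"), (55, "bad"), (60, "bad")]
def train_supervised(samples):
    """Learn one centroid per label (a nearest-centroid classifier)."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}
def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))
# Hypothetical unlabeled samples: login hour (0-23) for one employee.
unlabeled = [9, 9, 10, 8, 9, 10, 9, 3, 8, 10]
def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]
centroids = train_supervised(labeled)
print(predict(centroids, 4))     # relies on the labels seen during training
print(find_outliers(unlabeled))  # uses no labels, only the shape of the data
```
The supervised half needs the good and bad tags to learn anything, while the
unsupervised half only surfaces samples that look different from the rest and
leaves the judgment of whether they are suspicious to an analyst.
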
73
00:04:16.003 --> 00:04:18.008
By clearly articulating the type

74
00:04:18.008 --> 00:04:21.002
of machine learning problem you have,

75
00:04:21.002 --> 00:04:23.009
combined with the type of data at hand,

76
00:04:23.009 --> 00:04:26.001
you come up with a subset of algorithms

77
00:04:26.001 --> 00:04:29.003
that you can start experimenting with.

78
00:04:29.003 --> 00:04:32.005
Here is a visual that summarizes the mapping

79
00:04:32.005 --> 00:04:35.005
of data type to various use cases.
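
NOTE
Editor's sketch, not part of the narration. The visual mentioned above is not
included in this transcript, so the table below is an assumption based on how
the four problem types are commonly paired with data types and algorithm
families. It is not a reproduction of the instructor's slide, and the names
CANDIDATES and shortlist are hypothetical.
```python
# Assumed pairing of (security problem, data type, algorithm family).
CANDIDATES = [
    ("predict", "labeled", "regression"),
    ("categorize", "labeled", "classification"),
    ("find_patterns", "unlabeled", "clustering"),
    ("generate", "unlabeled", "generative models"),
]
def shortlist(problem=None, data_type=None):
    """Return algorithm families matching the given problem and/or data type."""
    return [family for p, d, family in CANDIDATES
            if problem in (None, p) and data_type in (None, d)]
print(shortlist(data_type="labeled"))
print(shortlist(problem="find_patterns", data_type="unlabeled"))
```
Filtering on both the problem and the data type narrows the field to a small
set of families to start experimenting with, which is the selection step the
narration describes.
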