Thursday, March 9, 2017

Rise of the Machines: Machine Learning

Learning By Doing

Allow me to attempt to simulate what it's like to be two years old again. Below are two types of bugs. The bugs on the left are called moneks, and the bugs on the right are called plaples [1]. Study both types and pay particular attention to the attributes of each type of bug. You might even imagine your mom pointing to each one and saying, "That's a monek. Can you point to the monek?"

Figure 1. Examples of moneks and plaples.

Once you've familiarized yourself with these delightful creatures, test your knowledge by taking the following quiz [1]. You might want to scroll your window so that you're not tempted to cheat!

Figure 2. Test your knowledge of these two types of bugs. 

How did you do? Was it easy? What features did you rely on to figure out if something was a monek or a plaple?

What Is "Machine Learning"?

Learning to categorize two different types of bugs may not seem all that incredible. That is, until you try to teach a computer how to recognize and classify visual objects. It's not easy! How might you approach this problem? One method is called machine learning.

Maybe you've heard about machine learning as it applies to Facebook's facial recognition software, or Google's reliance on machine learning to serve up highly specific (and accurate) search results. Or maybe you heard about the machine learning project to identify pictures of cats on the internet (I've heard a rumor that there are a couple of pictures of cats on the internet).

As it turns out, all of the big tech companies are using it. Apple, Microsoft, and Amazon all rely on machine learning to solve some of their thorniest technical problems. But have you ever wondered what the heck "machine learning" is? Have you also wondered, "Can I learn how to harness the power of machine learning to solve my own problems?" If you've given any thought to either of these two questions, then this is your lucky day! I am going to attempt to explain what machine learning is.

"I said you're holding back" –Walk the Moon

To talk about machine learning, it's useful to introduce a few concepts. The first concept is the outcome that we would like to predict, which I'm going to call the label. An example whose label we already know is a labeled instance. If you recall the steps of the scientific method, you may remember talking about the dependent measure (or "outcome variable"). The label is analogous to the dependent measure. Second, each instance has a set of quantifiable or measurable properties. The properties are used to describe the instances.
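One way to picture these concepts is as a small record: a set of named properties plus, when we know it, a label. The sketch below is only an illustration; the `Instance` class and the property names (`antenna`, `num_legs`, and so on) are my own hypothetical choices, with values borrowed from the bug tables later in this post.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Instance:
    """One bug: its measurable properties plus (optionally) its label."""
    properties: Dict[str, Union[str, int]]
    label: Optional[str] = None  # None means the label is unknown

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# M-01 from Table 1: the properties and the label are both known.
m01 = Instance(
    properties={"antenna": "Fuzzy", "head": "Oval", "body": "Striped",
                "legs": "Short", "tail": "Stinger", "num_legs": 8},
    label="Monek",
)

# K-06 from Table 3: the same kinds of properties, but no label yet.
k06 = Instance(
    properties={"antenna": "Fuzzy", "head": "Oval", "body": "Spotted",
                "legs": "Short", "tail": "Long", "num_legs": 8},
)

print(m01.is_labeled, k06.is_labeled)  # True False
```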

Now that we've defined our data, there are three steps in developing our model.

Step 1 - Training

As in the monek/plaple example, we need to train our algorithm on a dataset for which we know the labels of the instances. When we are training a machine-learning algorithm, it helps if we can provide it with unequivocal examples, which we call the ground truth. Thus, the first step in machine learning is to run the algorithm on a training dataset. The training dataset has values for both the properties and the labels. The machine-learning algorithm attempts to learn the association between the values of the properties and their labels. Table 1 is an example of a very small training dataset, which is derived from Fig. 1.

Table 1: Training Data (with Labeled Instances)
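| ID   | Antenna | Head   | Body    | Legs  | Tail    | Number of Legs | Label  |
|------|---------|--------|---------|-------|---------|----------------|--------|
| M-01 | Fuzzy   | Oval   | Striped | Short | Stinger | 8              | Monek  |
| M-02 | Short   | Oval   | Spotted | Short | Stinger | 8              | Monek  |
| P-01 | Short   | Oval   | Striped | Long  | Long    | 4              | Plaple |
| P-02 | Fuzzy   | Square | Striped | Long  | Long    | 4              | Plaple |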
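To make "learning the association" a little more concrete, here is a minimal sketch of one very simple learner: for each property, it remembers which label each value most often goes with. I picked this one-rule approach because it fits a four-row table; it is an assumption for illustration, not the algorithm used by any particular library, and the names `learn_rule` and `accuracy` are hypothetical.

```python
from collections import Counter, defaultdict

PROPERTIES = ["antenna", "head", "body", "legs", "tail", "num_legs"]

# Table 1, one row per bug: (property values in PROPERTIES order, label).
TRAINING = [
    (("Fuzzy", "Oval", "Striped", "Short", "Stinger", 8), "Monek"),   # M-01
    (("Short", "Oval", "Spotted", "Short", "Stinger", 8), "Monek"),   # M-02
    (("Short", "Oval", "Striped", "Long", "Long", 4), "Plaple"),      # P-01
    (("Fuzzy", "Square", "Striped", "Long", "Long", 4), "Plaple"),    # P-02
]


def learn_rule(rows, column):
    """For one property, map each value to the label it most often appears with."""
    tallies = defaultdict(Counter)
    for values, label in rows:
        tallies[values[column]][label] += 1
    return {value: tally.most_common(1)[0][0] for value, tally in tallies.items()}


def accuracy(rule, rows, column):
    """Fraction of rows whose label the rule guesses correctly."""
    hits = sum(rule.get(values[column]) == label for values, label in rows)
    return hits / len(rows)


rules = {name: learn_rule(TRAINING, i) for i, name in enumerate(PROPERTIES)}
training_accuracy = {
    name: accuracy(rules[name], TRAINING, i) for i, name in enumerate(PROPERTIES)
}

# Properties whose rule gets every training bug right.
perfect = [name for name in PROPERTIES if training_accuracy[name] == 1.0]
print(perfect)  # ['legs', 'tail', 'num_legs']
```

Notice that three different properties explain the training data perfectly. With only four bugs to learn from, the training data alone cannot tell us which of those rules will hold up on new bugs.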

Step 2 - Validation

We withhold a subset of data so that we can start the second step, which is to evaluate our machine-learning model. We will call this the validation dataset. The goal is to measure how accurate our model is. We do this by feeding the model all of the property values, and we make it guess what the labels are. We then compare those guesses against the withheld "answers." It's common practice to keep track of the types of errors that the model makes and report them as accuracy statistics. Table 2 is an example of a validation dataset.

Table 2: Validation Data (label withheld)
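| ID   | Antenna | Head   | Body    | Legs  | Tail    | Number of Legs | Label  |
|------|---------|--------|---------|-------|---------|----------------|--------|
| M-03 | Fuzzy   | Oval   | Spotted | Long  | Stinger | 4              | Monek  |
| M-04 | Fuzzy   | Square | Spotted | Short | Stinger | 8              | Monek  |
| P-03 | Short   | Square | Striped | Short | Long    | 8              | Plaple |
| P-04 | Short   | Square | Spotted | Long  | Long    | 4              | Plaple |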
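Here is a rough sketch of what the validation step might look like in code. It assumes three hand-written candidate rules, each of which happens to fit Table 1 perfectly (I read them off the table by eye; a real project would get them from the training step). The `evaluate` function and the dictionary layout are hypothetical choices of mine.

```python
from collections import Counter

# Table 2: the model sees only the properties; "label" is the withheld answer.
VALIDATION = [
    {"id": "M-03", "legs": "Long", "tail": "Stinger", "num_legs": 4, "label": "Monek"},
    {"id": "M-04", "legs": "Short", "tail": "Stinger", "num_legs": 8, "label": "Monek"},
    {"id": "P-03", "legs": "Short", "tail": "Long", "num_legs": 8, "label": "Plaple"},
    {"id": "P-04", "legs": "Long", "tail": "Long", "num_legs": 4, "label": "Plaple"},
]

# Assumed candidate rules: each one labels every bug in Table 1 correctly.
CANDIDATE_RULES = {
    "legs": {"Short": "Monek", "Long": "Plaple"},
    "tail": {"Stinger": "Monek", "Long": "Plaple"},
    "num_legs": {8: "Monek", 4: "Plaple"},
}


def evaluate(prop, rule, rows):
    """Return (accuracy, errors); errors counts (actual, guessed) pairs."""
    errors = Counter()
    correct = 0
    for row in rows:
        guess = rule[row[prop]]
        if guess == row["label"]:
            correct += 1
        else:
            errors[(row["label"], guess)] += 1
    return correct / len(rows), errors


results = {
    prop: evaluate(prop, rule, VALIDATION) for prop, rule in CANDIDATE_RULES.items()
}
for prop, (score, errors) in results.items():
    print(prop, score, dict(errors))
```

On Table 2, the tail rule gets all four bugs right, while the leg-length rule and the number-of-legs rule each get only two of four, mistaking one monek for a plaple and one plaple for a monek. That is the payoff of withholding data: rules that looked equally good during training turn out not to be.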

Step 3 - Testing

Now it's time to release our fledgling machine and start categorizing instances for which we do not have labels. In other words, we feed our machine the property values, and we let the algorithm choose the labels. The dataset in this case doesn't have a ground truth. We are letting the machine do all the work now. Table 3 is an example of the input into our machine-learning algorithm that has been trained to recognize the two types of bugs.

Table 3: Test Data (label unknown)
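| ID   | Antenna | Head   | Body    | Legs  | Tail    | Number of Legs | Label |
|------|---------|--------|---------|-------|---------|----------------|-------|
| K-06 | Fuzzy   | Oval   | Spotted | Short | Long    | 8              | ???   |
| K-07 | Short   | Square | Striped | Short | Stinger | 4              | ???   |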
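As a small sketch of this last step, suppose our trained model boils down to a single rule about the tail. That rule is an assumption on my part (it is consistent with every bug in Tables 1 and 2), and the names below are hypothetical. The code simply fills in the question marks in Table 3.

```python
# Assumed model: a bug with a stinger is a monek; a bug with a long tail is a plaple.
TAIL_RULE = {"Stinger": "Monek", "Long": "Plaple"}

# Table 3: property values only, no labels.
UNLABELED = [
    {"id": "K-06", "antenna": "Fuzzy", "head": "Oval", "body": "Spotted",
     "legs": "Short", "tail": "Long", "num_legs": 8},
    {"id": "K-07", "antenna": "Short", "head": "Square", "body": "Striped",
     "legs": "Short", "tail": "Stinger", "num_legs": 4},
]

# Fall back to "unknown" if the model meets a tail type it has never seen.
predictions = {bug["id"]: TAIL_RULE.get(bug["tail"], "unknown") for bug in UNLABELED}
print(predictions)  # {'K-06': 'Plaple', 'K-07': 'Monek'}
```

Because there is no ground truth here, we have no way to score these guesses. A model built on leg length instead would have called K-06 a monek, which is why the accuracy we measured during validation is the only evidence we have for trusting one answer over the other.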

The S.T.E.M. Connection

Suppose you teach math or computer science, and your students are curious about learning to set up a machine-learning project. There are many different tutorials out there, but these two seem like particularly good starting places:
  1. Categorize Lilies using Python libraries
  2. Handwriting recognition using TensorFlow
The first is a little more basic, and it leaves out many details. However, the author does a good job of getting the user up and running quickly. You may need to install some software on your computer, but I found doing so was as simple as advertised. Personally, I'm not super-excited about categorizing lilies, but this is a good project to get your feet wet. 

The second tutorial is a little more advanced. The authors discuss matrix multiplication and vector addition. If you need a way to motivate these topics in your own class [2], then this would be a good resource. In addition, the topic is cool. Your goal is to teach a computer to recognize handwritten digits from zero to nine. Banks, for example, rely on this technology for cashing personal checks.
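To give students a taste of where matrix multiplication and vector addition show up, here is a toy sketch of the kind of calculation a simple digit recognizer performs: multiply a weight matrix by the pixel values, add a bias vector, and pick the class with the highest score. This is not the tutorial's code. The numbers are made up, the "image" has only four pixels, and there are three classes instead of ten.

```python
def matvec(matrix, vector):
    """Matrix-vector product: one weighted sum of the inputs per row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def vector_add(u, v):
    """Element-by-element sum of two vectors of the same length."""
    return [a + b for a, b in zip(u, v)]


# A made-up four-pixel "image" (0.0 = white, 1.0 = black).
pixels = [0.0, 1.0, 1.0, 0.0]

# Made-up weights: one row per class, one column per pixel.
weights = [
    [0.5, -1.0, -1.0, 0.5],   # class 0 dislikes ink in the middle pixels
    [-0.5, 1.0, 1.0, -0.5],   # class 1 likes ink in the middle pixels
    [0.0, 0.5, -0.5, 0.0],    # class 2 is mostly indifferent
]
biases = [0.25, -0.25, 0.0]

# scores = weights * pixels + biases
scores = vector_add(matvec(weights, pixels), biases)
predicted_class = max(range(len(scores)), key=scores.__getitem__)

print(scores)           # [-1.75, 1.75, 0.0]
print(predicted_class)  # 1
```

In a real recognizer, the weights and biases are not typed in by hand. They are the numbers the training step adjusts until the scores line up with the labels.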

Machine learning is cool for so many reasons. It is accessible to people who are interested in the topic [3], it solves many difficult problems, and it has a connection to psychology. For example, learning how to categorize objects is a fundamental skill that young brains must master to make sense of the world!

Share and Enjoy!

Dr. Bob

Going Beyond the Information Given

[1] I am indebted to Takashi Yamauchi for allowing me to recreate the stimuli he used in his study on categorization: 

Yamauchi, T., & Markman, A. B. (2000). Inference using categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26(3), 776.

[2] By "motivate," I am of course referring to a potential answer to the age-old student lament: When are we ever going to need to know this?!

[3] I would be remiss if I didn't mention the Weka workbench, which is also freely available. It's generally used for educational data-mining projects.