Khan Academy: Machine Learning → Measurable Learning
With millions of problems attempted per day, Khan Academy’s interactive math exercises are an important and popular feature of the site. (Math practice. Popular. Really!) For over a year, we’ve used a statistical model of each student’s accuracy on each exercise as the basis for awarding proficiency. When a user reaches a high enough accuracy on an exercise, we award the proficiency badge, which is both a reward for the student’s accomplishment and a signal that they can move on to learn even cooler math. So it’s important that we are intelligent about when we suggest that a student move on… and when we don’t.
This post details recent improvements we have made in the accuracy of the statistical model, and provides evidence that use of the improved model has led to increased learning gains.
Part I: Building a Better Model
The baseline model was a logistic regression on six data features. The six features were engineered as a compact representation of the user’s practice history on the same exercise being predicted, and the same coefficient weights were used for every exercise. It was a simple setup that made quality assurance super easy! We had only one model, and the same pattern of correct or incorrect answers would always lead to the same predicted accuracy on every exercise.
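In code, the baseline setup looked something like the sketch below. The feature definitions and weights here are hypothetical stand-ins (the actual six engineered features aren’t enumerated in this post); the point is the shape of the model: one shared weight vector, applied identically to every exercise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def baseline_predict(features, weights, bias):
    """Predicted accuracy from a logistic regression with one shared
    weight vector, applied identically to every exercise."""
    return sigmoid(np.dot(weights, features) + bias)

# Hypothetical engineered features summarizing a user's history on the
# exercise being predicted (names and values are illustrative only).
features = np.array([
    0.80,  # fraction correct so far
    1.0,   # most recent answer correct? (0/1)
    0.67,  # fraction correct over the last three problems
    2.3,   # log(1 + problems attempted)
    0.0,   # used a hint on the last problem? (0/1)
    0.50,  # exponentially weighted moving average of correctness
])
weights = np.array([2.0, 0.5, 1.0, 0.1, -0.8, 1.5])  # shared by all exercises
bias = -1.0

p = baseline_predict(features, weights, bias)
```

Because the weights are global, the same answer pattern always yields the same prediction, which is exactly what made QA so easy.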
Of course, that simplicity also constrained performance in two unsurprising yet important ways. First, the lack of exercise-specific parameters made it tough to account for exercise-specific difficulty or learning dynamics. The model didn’t reflect the fact that a calculus exercise is a priori tougher than a fractions exercise. Second, cross-exercise evidence was completely ignored in each prediction! So, if you aced a differential equations exercise, we disregarded that information instead of using it to smartly infer that you probably wouldn’t sweat a basic algebra exercise.
Our attempts to improve involved two techniques. First, we trained a separate model for every exercise. Brain-dead and obvious, I know. But for you data science rookies, if you have lots and lots of data, there is no more consistently reliable low-hanging fruit than this: try taking your existing model and creating separate instances for more specific segments of your data—day of the week, user type, moon phase, anything—and then make use of the more specific models.
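The “segment your data, fit a model per segment” recipe is only a few lines. This sketch uses toy data and a hand-rolled logistic fit to stay self-contained; in practice you’d use a proper optimizer.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Minimal gradient-descent logistic regression (illustrative only)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Toy rows: (exercise, feature_vector, answered_correctly)
rng = np.random.default_rng(0)
rows = [(ex, rng.normal(size=3), int(rng.integers(0, 2)))
        for ex in ("fractions", "calculus") for _ in range(100)]

# Instead of one global model, fit a separate model per exercise.
models = {}
for exercise in ("fractions", "calculus"):
    X = np.array([f for e, f, _ in rows if e == exercise])
    y = np.array([c for e, _, c in rows if e == exercise], dtype=float)
    models[exercise] = fit_logistic(X, y)
```

The per-exercise weights are free to reflect that calculus is just harder than fractions, which the shared-weights baseline could never do.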
Second, we incorporated cross-exercise evidence by adding new features computed from the user’s history on all exercises. In order to compress a user’s activity from a large and variable number of possible activities, we used a clever trick we called “compressed features”. It’s inspired by work in compressed sensing, and could be considered a distant cousin to Bloom filters or feature hashing. If you’re not familiar with those things, no worries— you can read my companion post with example code to see exactly how compressed features work, or feel free to skip it.
In the context of our application, the information we wanted to compress was the historical performance on every exercise. As a super simple proxy of exercise performance, we created a feature for getting the first problem of each exercise correct, as well as a feature for getting it incorrect. We separated the features for correct and incorrect responses so that their significance in forecasting could be scaled differently. With approximately 400 exercises, we had ~400 * 2 = 800 original features, which we compressed to 100 features.
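The companion post has the exact construction; as a rough sketch, here is one plausible realization in the spirit of compressed sensing, using a fixed random projection (the ±1 projection and the sparse indicator encoding are assumptions for illustration).

```python
import numpy as np

N_EXERCISES = 400
N_RAW = 2 * N_EXERCISES   # a "first problem correct" and a "first problem
                          # incorrect" indicator per exercise
N_COMPRESSED = 100

# A fixed random projection shared by all users (one plausible realization
# of compressed features; see the companion post for the real details).
rng = np.random.default_rng(42)
projection = rng.choice([-1.0, 1.0], size=(N_COMPRESSED, N_RAW))

def compress(first_correct, first_incorrect):
    """Map ~800 sparse binary history indicators to 100 dense features."""
    raw = np.zeros(N_RAW)
    for ex in first_correct:
        raw[2 * ex] = 1.0        # first problem of exercise `ex` was correct
    for ex in first_incorrect:
        raw[2 * ex + 1] = 1.0    # first problem of exercise `ex` was incorrect
    return projection @ raw

compressed = compress(first_correct=[3, 17, 42], first_incorrect=[5])
```

Because a user has attempted only a handful of exercises, the raw vector is very sparse, which is what lets an 8x compression preserve most of the signal.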
Time to see what those new features can do! Table 1 summarizes the features included in each model variant tested; for each variant, a model was created for every exercise via logistic regression.
Below are the out-of-sample ROC Curves for each model variant. ROC curves are my favorite way to compare classifier performance, but if you’re not familiar with them, all you need to know is that more area under the curve is better! You can see by comparing the baseline and bias+custom lines that simply training per exercise—even using identical features—provided a substantial boost over the baseline. Adding in the compressed features in bias+custom+compressed provided only minor further improvement.
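If the area-under-the-curve summary is new to you, it has a handy interpretation: it’s the probability that a randomly chosen correct answer is scored higher than a randomly chosen incorrect one. A tiny self-contained way to compute it (fine for small samples; real evaluations use optimized library routines):

```python
def auc(y_true, scores):
    """Area under the ROC curve via its pairwise-ranking interpretation:
    the chance a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A better model ranks correct answers above incorrect ones more often.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 is coin-flipping; 1.0 is a perfect ranking.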
Interestingly, though, the marginal benefit of the compressed features is greater when predicting the first few problems done on any exercise. The overall performance improvement relative to the baseline model is also much, much greater.
Why are the new techniques more powerful for the first few problems of an exercise? Because if a user has already provided many direct observations on a particular exercise, the baseline model uses that very direct evidence well. The blind spot occurs when we don’t have many attempts yet from a user on an exercise, or maybe no attempts at all. That’s when we’d really like to know whether the exercise is a priori tough or easy. That’s when we’d really like to leverage the user’s history on other exercises as indirect evidence. And that’s when the new techniques serve well.
Part II: Measuring the Learning Benefit
So it appeared that we could more rapidly predict a learner’s exercise accuracy, perhaps with just a problem attempt or two. Now how the heck could we use that to help an actual human being? Well, if our new model could tell us right away that a user had an exercise down pat, perhaps we could usefully nudge them along to new material by granting proficiency faster than the status quo. We set up a randomized experiment (A/B test) to investigate just that. Half of the users were randomly assigned to a control group operating under the status quo, while an experimental group was eligible to receive an “accelerated proficiency” if the new model predicted within the first five problems done on an exercise that the user met the accuracy requirement (94%). After an experimental group learner did five problems on an exercise, their proficiency on that exercise was determined identically to a control group learner.
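The branching logic of the experiment can be sketched like this. Take it as a simplification: the real status-quo proficiency logic is richer than shown, and here it’s reduced to the same 94% accuracy requirement for illustration.

```python
def grants_proficiency(group, problems_done, predicted_accuracy,
                       threshold=0.94, window=5):
    """Sketch of the A/B rule (the status-quo path is simplified)."""
    if group == "experiment" and problems_done <= window:
        # Accelerated proficiency: the new model may grant it within
        # the first five problems on an exercise.
        return predicted_accuracy >= threshold
    # Control group, or experiment group past five problems: status quo,
    # modeled here as the same accuracy requirement after more practice.
    return problems_done > window and predicted_accuracy >= threshold
```

Note the asymmetry: the experiment group can only reach proficiency sooner, never later, which is why the count of proficiencies could only go up.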
We fired up the experiment using our A/B testing framework and watched the numbers roll in. Within hours, we knew the number of proficiencies earned was up by almost a third. That wasn’t too surprising— the logic of the experiment meant it could only stay the same or go higher. More encouragingly, we also tracked the number and accuracy of subsequent review problems attempted on the proficient exercises, and found the accuracy to be almost identical. This was important evidence that the accelerated proficiencies were still valid, since learners were retaining the information at comparable accuracy. Finally, problem attempts were down nominally, which caused a slight concern over possibly lower engagement. Recall, however, that experimental group users were being accelerated to more difficult problems, which take more time per problem to complete. Table 2 has the results.
So our users were no doubt happier with all their shiny new proficiencies. But one question still nagged: Did the increase in proficiencies indicate that users were actually learning more on Khan Academy, or only that our system was now better at labeling their pre-existing abilities? Had we actually created better learning, or just better assessment?
To find out, we created “domain learning curves”. Let’s break that phrase down a bit. A learning curve simply represents ability level achieved versus effort invested, with effort usually expressed in units of time or number of exposures to a skill. While the curve for a single person can be very erratic, averaging over larger numbers of subjects can yield a smoother curve that gives insight into the dynamics and efficiency of learning gains.
Learning curves can be used to evaluate tutoring systems, and here we adapted them to meet our needs. Because the models and algorithms under investigation could alter the distribution of the learner’s effort across individual skills, it was important that our measurement focus be at a higher level— the domain level, where a domain is the whole collection of skills the user is trying to learn. So we created a straightforward sample of domain performance with the following algorithm: while a user worked on a topic, every so often (with low frequency) we’d give them a random problem sampled uniformly from the topic domain. Those “analytics problems”, as we called them, gave us a simple, low-footprint way to monitor the user’s progress in a topic.
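The sampling scheme is simple enough to sketch in a few lines. The 10% rate below is purely illustrative (the post only says “low frequency”), and the function names are made up for this example.

```python
import random

def next_problem(scheduled_exercise, domain_exercises,
                 rng=random, rate=0.1):
    """Occasionally swap in an "analytics problem" drawn uniformly from
    the whole topic domain instead of the normally scheduled exercise.
    The 10% rate is illustrative; the real system just samples rarely."""
    if rng.random() < rate:
        return "analytics", rng.choice(domain_exercises)
    return "scheduled", scheduled_exercise
```

Because analytics problems are drawn uniformly from the whole domain, their accuracy is an unbiased probe of domain-level ability, no matter how the scheduler is steering regular practice.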
Now, the naive way to construct domain learning curves is to simply plot the percentage accuracy versus the attempt number of the analytics problem within the domain:
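The naive construction is just a group-by-and-average. In this toy log, user 3 quits after one attempt, which is exactly the dropout effect discussed next.

```python
# Toy analytics-problem log: (user_id, attempt_index_in_domain, correct)
records = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 0)]

def naive_learning_curve(records, max_index):
    """Mean accuracy at each analytics-attempt index: the naive
    construction, with no correction for who drops out along the way."""
    curve = []
    for i in range(1, max_index + 1):
        outcomes = [c for _, idx, c in records if idx == i]
        curve.append(sum(outcomes) / len(outcomes) if outcomes else None)
    return curve

curve = naive_learning_curve(records, max_index=2)  # [0.333..., 1.0]
```

Accuracy “improves” from 1/3 to 1.0, but partly because the weakest learner vanished from the sample, not because anyone learned.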
But there’s a potentially huge problem with that. Namely, the density and characteristics of the user population that makes it to any given point on the experience axis can change a lot, which confounds our ability to understand changes in accuracy or compare curves. For example, we could raise the average accuracy for the more experienced end of a learning curve just by frustrating weaker learners early on and causing them to quit, but that hardly seems like the thing to do! If we really want to understand total learning gains, we need a way to combine learning dynamics with engagement dynamics.
Let’s define a set of curves (or functions) to give us what we want:
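The original figure with the definitions doesn’t survive in this text. A plausible reconstruction, consistent with how D(i) is described and G(i) is used below, is the following; take the exact forms as assumptions rather than the post’s own notation:

```latex
\begin{aligned}
E(i) &= \text{fraction of learners who attempt at least $i$ problems in the domain (engagement)}\\
D(i) &= \text{normalized learning gain allocated to problem index $i$ (computed below)}\\
G(i) &= \sum_{j \le i} E(j)\, D(j) \quad \text{(cumulative system learning gain)}
\end{aligned}
```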
Optional details on computing D(i): We first take all pairs of consecutive analytics problems, and compute the change in the correctness— either 1-1=0, 1-0=1, 0-1=-1, or 0-0=0. We divide and distribute that gain evenly among the problem indices spanning the two analytics problems. D(i) is the total allocated gain for index i, divided by the theoretical maximum that was possible for index i.
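The description above translates directly into code. One detail is an assumption on my part: exactly which indices a gain is “spread over” isn’t pinned down, so this sketch uses the indices strictly after the first analytics problem, up to and including the second.

```python
def learning_gain_per_index(analytics, max_index):
    """D(i) per the description above. `analytics` maps each user to
    their time-ordered list of (problem_index, correct) analytics
    results, with correct encoded as 0 or 1."""
    allocated = [0.0] * (max_index + 1)
    possible = [0.0] * (max_index + 1)
    for history in analytics.values():
        for (i0, c0), (i1, c1) in zip(history, history[1:]):
            span = range(i0 + 1, i1 + 1)   # indices the gain is spread over
            share = 1.0 / len(span)
            for i in span:
                allocated[i] += (c1 - c0) * share  # gain of +1, 0, or -1
                possible[i] += share               # best case was a gain of 1
    # D(i): allocated gain over the maximum that was possible at index i.
    return [a / p if p else 0.0 for a, p in zip(allocated, possible)]

D = learning_gain_per_index(
    {"u1": [(2, 0), (4, 1)], "u2": [(1, 1), (4, 1)]}, max_index=4)
```

Here u1 improves between attempts 2 and 4, so indices 3 and 4 each get half a unit of gain; u2 holds steady, contributing only to the denominator.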
Simple though it may be, we now have a powerful way to measure total learning gains, and in a way that balances engagement and learning efficiency. With G(i), higher engagement is rewarded if and only if the learning remains productive; higher learning efficiency is rewarded only to the extent enough learners stay engaged and benefit.
Click here for a nice, humungous graphic (or see a small sample below) showing learning curves for our A/B test on the ten most popular topics on Khan Academy. The blue lines are for the “accelerated proficiency” group; green is for the control group. Note that system learning gain curves eventually level out, asymptoting to the total learning gain for the system on each topic.
It’s lovely to see those learning curves going “up and to the right”, and the blue learning curves show more efficient learning for the experimental group. You’ll also see that despite generally faster dropout rates in the experimental group (remember the 5% drop in problems attempted?), total learning gain is indeed consistently higher for the experimental group. In fact, aggregating numerical results across topics, learning gains were up 33% in the top ten topics. We tested the effect in multiple time periods and found it to be robust. And learning efficiency, defined as the total system learning gain divided by problems done, was up a whopping 61% because the greater learning gains came despite fewer problems done.
It should be noted that similar learning gains were not present in more difficult and less heavily trafficked topics, as the accelerated proficiencies were not granted frequently enough to have a clear impact on behavior.
Future Research and Conclusions
We’re thrilled to have shown such substantial effects with a relatively narrow intervention, and think there is exciting follow-on work to be done, such as:
Helping learners avoid problems that will be unproductively difficult.
Generally understanding and modeling the relationships between problem difficulty, engagement, and learning efficiency.
Experiments directly focusing on optimizing the engagement component of learning gains.
Finally, if you’ve read this far, you have impressive stamina and/or you really love this stuff! Feel free to follow me on Twitter or consider joining our team. We’re hiring, and there’s lots more work to be done.
Acknowledgments: Thank you to Jascha Sohl-Dickstein for leading the modeling work described here, and Jascha and Eliana Feasley for helping me debug a portion of the learning curve analysis.