Khan Academy: Machine Learning → Measurable Learning

With millions of problems attempted per day, Khan Academy’s interactive math exercises are an important and popular feature of the site.  (Math practice.  Popular.  Really!)  For over a year, we’ve used a statistical model of each student’s accuracy on each exercise as the basis for awarding proficiency. When a user reaches a high enough accuracy on an exercise, we award the proficiency badge, which is both a reward for the student’s accomplishment and a signal that they can move on to learn even cooler math.  So it’s important that we are intelligent about when we suggest a that a student moves on… and when we don’t.

This post details recent improvements we have made in the accuracy of the statistical model, and provides evidence that use of the improved model has led to increased learning gains.

Read More

Compressed Features for Machine Learning

When building supervised learning models, I often come across situations where there is a large class of features I’d like to try adding, but the corresponding increase in model parameters seems daunting. For example, in my work at Khan Academy, many user models might benefit from knowing whether each our 4,000 videos has been watched by the user or not.  I usually end up trying to engineer a heuristic summary feature—e.g., how many videos have been watched within a few broad topics?—which is time consuming, or just giving up.  However, if the features are sparse with non-zero values, there’s a really easy and efficient way to compress the information.

Let’s say we have a set of m sparse-valued feature variables, and we’d like to compress their information into an n-dimensional feature vector (n < m).  First, we permanently associate each of the m original feature variables with a shorter n-dimensional vector of length 1 that points in a completely random direction.  Then we compute the compressed feature vector by summing the random vectors scaled (multiplied) by the value of the associated original feature variable.  The following code snippet spells it out clearer than any other explaining I can do:

As written, it does require that you build and maintain an associative array mapping the m feature types to the random components. That may be a pain if m is really large, or you are parallelizing your algorithms, or spanning research and production environments. An easy improvement would be to use the hash of the original feature key to generate the random components on-the-fly, either by seeding the RNG or some other means. 

Thanks to Jascha Sohl-Dickstein for teaching me this technique.

turadg said: I really like how you and other Khan Academy data scientists post your analyses. As a graduate student in HCI and learning sciences, I've been surprised that none of the analyses (that I've seen!) have any cognitive modeling. Is that true? If so, why? Thanks.

While I certainly am nowhere close to giving up, in the limited experience I have comparing cognitively-based models versus, let’s say, more purely empirical models, the purely empirical models have performed much better.  

The Science of Learning

Online education is hot field right now.  That means a lot of programmers and data scientists are interested in learning about education and instruction; a lot of teachers are rethinking the potential uses of technology and analytics.  What may be less apparent or accessible to the newly interested, though, is the fantastic body of research available to them through the field of learning science(s).

The science of learning is concerned with questions like: How do people learn? How does learning vary across subjects, people, and environments?  What reliably improves or harms learning?  What is known about the effectiveness of online learning?  Can we use psychological interventions to boost learning?  

With answers to questions like these, online education professionals can have a huge head start on building great learning experiences.  

Over the last year, I’ve tried to acquire a solid footing in the learning sciences.  Given the surging interest in the field, I made this reading list in an attempt to streamline the process for others.

The list was made with the following goals in mind.  

  1. Emphasize non-commercial, thoroughly footnoted survey reports.  Avoid narrow studies or articles that might be incentivized to “sell” a certain result or view, but provide plenty of jumping off points for exploring studies in more detail.
  2. Use free and online resources where possible.
  3. Favor scientifically sound but mathematically non-technical writing.
  4. Keep length flexible and reasonable— An investment of a couple weekends or weeks of evening reading should yield a basic understanding.

Here’s the list..

Core Readings:

How People Learn
 - (Free after signup with NAP).  If you only have time to read one book, this is your winner.  Summarizes decades of research in the science of learning and highlights its connection to educational practices.  This is the foundation on which everything below is built.

e-Learning and the Science of Instruction - Summary of how learning science informs best practices when designing and delivering learning experiences online.

Knowing What Students Know / Executive Summary 
- (free from NAP) Learn about the science of effective assessment practices— and the shortcomings of many current-day assessments.  Presents a framework for building designed to improve learning, not just score it.

Bonus Readings:

Why Don’t Students Like School? / Sample Article -  
Written by a cognitive scientist, this is fairly overlapped with How People Learn.  But it was a fun, short read, and did as much as any reading to drive the learning principles home for me.  And at this point you deserve some fun, right?  It’s well worth a couple hours and ten bucks.

How Students Learn Math, History, Science / Executive Summary -
Follow-up report to How People Learn with chapters written by expert educators in each field.  Even greater emphasis on application to the classroom.

Yes, the list is made entirely of books.  No blogs, no zippy articles.  I’m sorry.  But the truth is, while I read a lot of articles, it was through these books that I felt like I was deeply learning and my perspective was changing.  And given what I have learned in these books, I think I might know why!  Built-in to the learning experience of this list are:

  1. Spacing - Most of the books were not read in one sitting.  Revisiting them multiple times helped my retention.
  2. Repetition and practice - While the books have lots of overlap, the redundancy provided extra reinforcement and helped me master the vocabulary so I could begin to focus on the concepts.
  3. Lots of examples - Book-length writing provided room for the authors to use ample concrete examples.  Processing of many examples is the most proven path to expert knowledge.
  4. Thinking about meaning - Something about the commitment and pace of reading a book encouraged me to take time and reflect more deeply about what I read.  And as you will learn, knowledge is memory, and memory is the residue of thought.

The list is a work in progress, and I’d love to hear your own suggestions via comment here or on Twitter.  


Bayesian networks (and probabilistic graphical models more generally) are cool. We computer geeks can love ‘em because we’re used to thinking of big problems modularly and using data structures.  But better than being cool, they’re useful. Especially if you have the kind of problem that involves hundreds or thousands of interrelated variables, any one of which you might want to predict based on some subset of the others. Did I mention that your variables can have noisy, missing, or just plain unobservable data?

At Khan Academy, we’ve got problems like that.

  •  What are the underlying concepts that relate mastery of our hundreds of exercises to each other? 
  • Can we predict the best ordering of topics for a user to minimize time spent and maximize success? 
  • What instructional interventions can an intelligent learning system make to best aid the user? 

So I sat down to start applying these tools, and there were plenty of online resources to help with the book learnin’. I’m not going to duplicate them—I recommend this or that. When I started coding, though, I had several practical questions and really wanted a simple code example. I didn’t find a great one, so I’m posting what I came up with in the hope of helping the next soul.


For our example problem, let’s say there is a hidden (unobservable) binary variable T for every Khan Academy user that represents whether they have mastered a given topic (1=mastered, 0= not mastered). You can think of a topic as a collection of any N exercises. While topic mastery is hidden, we can partially observe performance on the exercises in that topic. I say ‘partially’ because the user may not do problems on all of the exercises. So the example program needs to handle missing data for all of the exercise variables E_i, which for simplicity can also be binary variables (1=good performance, 0=bad). In the idiom of Bayes nets, then, our graph looks like this:

Of course, lots of problems could be modeled by this simple graph structure of a hidden parent variable with a collection of children with missing data. Let me know if you invent another interesting application!


Here’s the code. It contains functions to learn from a simulated example or from a data file representing the evidence from your child variables. The learning algorithm is an expectation-maximization, of which the most interesting piece is the expectation step where we must impute the hidden T-variable given whatever E-variable evidence is available for that data sample.  To see the algebra worked out using Bayes rule, check out this excellent write up courtesy of Jascha.

If for some sad reason you have an aversion to matrices (or NumPy), you might also take a look at my rough draft version written with the excellent Pandas library instead of just NumPy. (It doesn’t support handling of missing data for the E-variables and I ended up converting strictly to NumPy for speed.)

What I find fascinating (and what the Pandas version illustrates nicely) is that the most complicated math really needed here is… counting! The idea that we can learn the full joint distribution with nary a gradient or a step size parameter in sight kind of feels like magic to me.


A few lesson learned here and over the years:

  • Learning from simulated data is very useful for testing and debugging your algorithm. Two useful tests are:
    • Make sure your algorithm learns a model for which you know the true distribution.
    • If the above doesn’t work, initialize your parameters to the correct values and see if they “walk away” 
  • Graphing the log-likelihood evolution is helpful to visualize convergence behavior.
  • I initially tested an example model with just two children variables.  That system was “underdetermined” because there were more parameters than constraints.  That made things not converge as expected, and I spent many hours searching for a bug before I was mercifully rescued.  
  • When imputing values, make sure you calculate probabilities and sample from that distribution (or pass the full distribution through to future imputations).  


Despite the simplicity of this example, it is already Really Useful.  I created a data set for a subset of the Khan Academy exercises, which you can download here. I heuristically chose a definition for the E-variables of whether a user answered more or less than 85% of problems correct on an exercise. Once I had learned the full joint distribution (“theta” in the code), I could infer the probability of mastery for a given user on any exercise, including exercises for which they had not yet done any problems. When I plugged those predictions in as an additional feature to our accuracy model, it was a highly significant feature, especially on the first few problems done for an exercise.

Of course, there are many different ways we could construct new features to summarize performance across a pool of exercises, but this  a clean, robust, and transparent option.  It’s easily extended to a full hierarchical model of our knowledge map.   And the graphical modeling framework is powerful enough that we can eventually accomodate temporal effects, a decision-making agent, prior knowledge of experts (hello, teachers!), and more.


I’m tremendously excited about the potential of probabilistic graphical models to power optimized online learning at Khan Academy and elsewhere. If you’re interested in learning about these modeling techniques in general, hustle over to Stanford’s free and recently-launched online course, and follow me on Twitter  for more practical examples and updates. If you want to directly improve the future of education, there’s a place to do that, too.