Latent Skill Embedding

In this post, we explore a probabilistic model of student learning and assessment that can be used to recommend personalized lesson sequences. A student using Knewton-powered products currently receives content recommendations one step at a time. The model presented in this blog post is motivated by the idea that such a student could receive additional context and motivation by also seeing the most likely trajectory, or sequence of recommendations, leading to her goals, given her current state.


Students using Knewton-powered adaptive learning products are working in an online environment, acquiring knowledge that enables them to pass specific assignments and achieve their learning goals. These students receive recommendations from Knewton about what to work on, in a sequence that is determined by the work the student has previously done.

The goal of this blog post is to explore a new way of evaluating pathways through content based on their ability to improve student learning. We leverage existing log data (entries that look like Student A completed Lesson B and Student C passed assessment D) to compare the performance of different lesson sequences. Moreover, our method can be leveraged to generate displayable lesson sequences that maximize educational gains while meeting practical time constraints (i.e., the student has at most n days to maximize her probability of passing goal assessments). The additional context these sequences of recommendations provide could enhance students’ motivation, and consequently their overall performance.

General Approach

We use historical student-module interactions to learn a joint probabilistic embedding of the latent skills of students, skill requirements of assessments, and skill gains from lessons (modulated by prerequisite knowledge). Through offline evaluations on log data, we have shown that this model can successfully distinguish between lesson sequences that lead students to mastery or failure. In this post, we will outline how the model can be used to recommend lesson sequences in real time.


Probabilistic embeddings are used for a variety of tasks, including music playlist prediction,1 feedback propagation in MOOCs,2 visualization of high-dimensional data,3 and language modeling for natural language processing.4 We are interested in embeddings that perform nonlinear dimensionality reduction on high-dimensional data, and also can be used to learn useful relationships between objects that are not explicitly expressed in the data. We will use a probabilistic embedding to capture the dynamics of student learning and testing, and in the process we will describe a method to recommend personalized lesson sequences. Our approach resembles a proposed sparse factor analysis method.5

In our model, students, lessons, and assessments are jointly embedded in a low-dimensional semantic space that we call the “latent skill space.” Students have trajectories through the latent skill space, while assessments and lessons have fixed locations. Formally, a student is represented as a set of d latent skill levels \vec{s}_t \in \mathbb{R}_+^d that can change over time; a lesson module is represented as a vector of skill gains \vec{\ell} \in \mathbb{R}_+^d and a set of prerequisite skill requirements \vec{q} \in \mathbb{R}_+^d; an assessment module is represented as a set of skill requirements \vec{a} \in \mathbb{R}_+^d.

Students interact with lessons and assessments in the following way. First, a student can be tested on an assessment module with a pass-fail result R \in \{0, 1\}, where the likelihood of passing is high when a student has skill levels that exceed all of the assessment requirements, and low otherwise. Second, a student can work on lesson modules to improve her skill levels over time. To fully realize the skill gains associated with completing a lesson module, a student must satisfy its prerequisites; fulfilling the prerequisites only partially results in correspondingly smaller gains. Time is discretized such that at every timestep t \in \mathbb{N}, a student completes a lesson and may complete zero or more assessments. The evolution of student knowledge can be formalized as an input-output hidden Markov model.

There are two equations that characterize the dynamics of the latent skill space: the “assessment result likelihood,” and the “learning update.” Both are discussed in the following two sections.

Modeling Assessment Results

For student \vec{s}, assessment \vec{a}, and pass-fail result R,

    \[ R \sim \mathrm{Bernoulli}(\phi(\Delta(\vec{s},\vec{a}))) \]

where \phi is the sigmoid function, and

    \[ \Delta(\vec{s},\vec{a}) = \frac{\vec{s} \cdot \vec{a}}{|\vec{a}|} - |\vec{a}| + \gamma_s + \gamma_a \]

where \gamma_s, \gamma_a \in \mathbb{R}. A pass result is indicated by R=1, and a fail by R=0. The term \frac{\vec{s} \cdot \vec{a}}{|\vec{a}|} can be rewritten as |\vec{s}|\cos(\theta), where \theta is the angle between \vec{s} and \vec{a}; it can be interpreted as “relevant skill”. The term |\vec{a}| can be interpreted as assessment difficulty. The bias term \gamma_s is a student-specific term that captures a student’s general (assessment-invariant and time-invariant) ability to pass; it can be interpreted as a measure of how often the student guesses correct answers. The bias term \gamma_a is a module-specific term that captures an assessment’s general (student-invariant and time-invariant) difficulty. These bias terms are analogous to the bias terms used for modeling song popularity in playlists.1
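As a concrete sketch, the assessment result likelihood can be computed as follows (the function name and NumPy representation are our own choices for illustration, not Knewton's implementation):

```python
import numpy as np

def pass_probability(s, a, gamma_s, gamma_a):
    """Likelihood of a pass result (Equation 1): sigmoid of the
    relevant-skill term minus difficulty, plus the two bias terms."""
    a_norm = np.linalg.norm(a)
    delta = s.dot(a) / a_norm - a_norm + gamma_s + gamma_a
    return 1.0 / (1.0 + np.exp(-delta))  # phi, the sigmoid function
```

A student whose skill levels exceed the assessment requirements gets a pass probability above 1/2, e.g. `pass_probability(np.array([2.0, 2.0]), np.array([1.0, 1.0]), 0.0, 0.0)`.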

Modeling Student Learning from Lessons

For student \vec{s} who worked on a lesson with skill gains \vec{\ell} and prerequisites \vec{q} at time t+1, the updated student state is

    \[ \vec{s}_{t+1} \sim \mathcal{N}\left(\vec{s}_t + \vec{\ell} \cdot \phi(\Delta(\vec{s}_t,\vec{q})),\ \sigma^2\mathbf{I}_d\right) \]

where \Delta(\vec{s}_t,\vec{q}) = \frac{\vec{s}_t \cdot \vec{q}}{|\vec{q}|} - |\vec{q}| and \sigma is a constant. The intuition behind this equation is that the skill gain from a lesson should be weighted according to how well a student satisfies the lesson prerequisites. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites. The same principle applies to satisfying assessment skill requirements. Figure 0 illustrates the vector field of skill gains possible for different students under fixed lesson prerequisites. Without prerequisites, the vector field is uniform.
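A minimal sketch of this learning update (naming is ours; the published Python implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lesson_update(s, gains, prereqs, sigma=0.1):
    """Sample the next student state (Equation 2): the lesson's skill
    gains are scaled by how well the student satisfies the prerequisites."""
    q_norm = np.linalg.norm(prereqs)
    delta = s.dot(prereqs) / q_norm - q_norm
    return rng.normal(s + gains * sigmoid(delta), sigma)
```

Because the sigmoid lies strictly between 0 and 1, a student always realizes some fraction of the lesson's gains, with the fraction approaching 1 as the prerequisites are satisfied.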

Figure 0

The vector field of skill gains for a lesson with skill gains (0.5, 1) and prerequisites (0.7, 0.3). Contours are drawn for varying update magnitudes. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites.

Parameter Estimation

How do we take a bunch of logged student-module interactions and turn them into an embedding? We compute maximum a posteriori (MAP) estimates of model parameters \Theta by maximizing the following objective function:

L(\Theta) = \sum\limits_{\mathcal{A}} \log{(Pr(R \mid \vec{s}_t, \vec{a}, \gamma_s, \gamma_a))} + \sum\limits_{\mathcal{L}} \log{(Pr(\vec{s}_{t+1} \mid \vec{s}_t, \vec{\ell}, \vec{q}))} - \beta \cdot \lambda(\Theta)

where \mathcal{A} is the set of assessment interactions, \mathcal{L} is the set of lesson interactions, \lambda(\Theta) is a regularization term that penalizes the L_2 norms of embedding parameters to reduce overfitting, and \beta is a regularization parameter. Non-negativity constraints on embedding parameters are enforced. We solve the optimization problem using the L-BFGS-B algorithm, which extends the popular L-BFGS algorithm to handle box constraints. We randomly initialize parameters and run the iterative optimization until the relative difference between consecutive objective function evaluations is less than 10^{-3}. Averaging validation accuracy over multiple runs during cross-validation reduces sensitivity to the random initializations (since the objective function is non-convex).
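As a rough illustration of the optimization machinery (the toy data and all names below are ours, and only a single student's skill vector is fit while the assessment embedding is held fixed), SciPy's L-BFGS-B solver handles the non-negativity box constraints directly:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance: one student, two assessments with known requirement
# vectors, observed pass/fail results, and L2 regularization.
A = np.array([[1.0, 0.2], [0.3, 1.0]])   # assessment requirement vectors
R = np.array([1, 0])                      # observed pass/fail results
beta = 0.1                                # regularization parameter

def neg_log_posterior(s):
    norms = np.linalg.norm(A, axis=1)
    delta = A.dot(s) / norms - norms
    p = 1.0 / (1.0 + np.exp(-delta))      # pass likelihoods
    ll = R * np.log(p) + (1 - R) * np.log(1 - p)
    return -ll.sum() + beta * s.dot(s)    # MAP objective (negated)

# Box constraints enforce non-negativity of embedding parameters.
res = minimize(neg_log_posterior, x0=np.ones(2), method="L-BFGS-B",
               bounds=[(0, None)] * 2, options={"ftol": 1e-3})
```

The full model optimizes over all students, lessons, and assessments jointly, but the same solver and bounds apply.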

Toy Examples

To illustrate how an embedding can capture the underlying geometry of different scenarios, we conducted a series of experiments on small, synthetically-generated interaction histories.


Figure 1

The key observation here is that the model recovered positive skill gains for L1, and “correctly” arranged Alice and A1 in the latent space. Initially, Alice fails A1, so her skill level is behind the requirements of A1. After completing L1, Alice passes A1, indicating that her skill level has probably improved past the requirements of A1. Note that this scenario could have been explained with only one latent skill.

Figure 2

A two-dimensional embedding, where an intransitivity in assessment results requires more than one latent skill to explain. The key observation here is that the assessments are embedded on two different axes, meaning they require two completely independent skills. This makes sense, since student results on A1 are uncorrelated with results on A2. Fogell fails both assessments, so his skill levels are behind the requirements for A1 and A2. McLovin passes both assessments, so his skill levels are beyond the requirements for A1 and A2. Evan and Seth are each able to pass one assessment but not the other. Since the assessments have independent requirements, this implies that Evan and Seth have independent skill sets (i.e. Evan has enough of skill 2 to pass A2 but not enough of skill 1 to pass A1, and Seth has enough of skill 1 to pass A1 but not enough of skill 2 to pass A2).

Figure 3

We replicate the setting in Figure 2, then add two new students Slater and Michaels, and two new lesson modules L1 and L2. Slater is initially identical to Evan, while Michaels is initially identical to Seth. Slater reads lesson L1, then passes assessments A1 and A2. Michaels reads lesson L2, then passes assessments A1 and A2. The key observation here is that the skill gain vectors recovered for the two lesson modules are orthogonal, meaning they help students satisfy completely independent skill requirements. This makes sense, since initially Slater was lacking in Skill 1 while Michaels was lacking in Skill 2, but after completing their lessons they passed their assessments, showing that they gained from their respective lessons what they were lacking initially.

Figure 4

We replicate the setting in Figure 2, then add a new assessment module A3 and a new lesson module L1. All students initially fail assessment A3, then read lesson L1, after which McLovin passes A3 while everyone else still fails A3. The key observation here is that McLovin is the only student who initially satisfies the prerequisites for L1, so he is the only student who realizes significant gains.



We have some evidence that an embedding can make good sequence recommendations in an offline evaluation. How can we use an embedding to make sequence recommendations in an online setting? Formally, we must address the following problem: Given a student’s current skill levels and a set of assessments the student wants to pass, what is the optimal sequence of lessons for the student? We can formulate the problem two ways: specify a minimum pass likelihood for the goal assessments and find the shortest path to mastery, or specify a time constraint and find the path that leads to the highest pass likelihood for the goal assessments while not exceeding the prescribed length.

In general, both problems can be tackled by using a trained Latent Skill Embedding model to specify a Markov Decision Process, in which the possible states are given by the student embedding, the set of possible actions corresponds to the set of lesson modules that the student can work on, the transition function is the learning update (Equation 2), and the reward function is the likelihood of passing all goal assessments (Equation 1) discounted by the expected time taken to complete the lesson sequence. Various methods for finding the optimal policy for an MDP with a continuous state space, discrete action space, and probabilistic transitions have been explored in the literature.6

For an embedding model without lesson prerequisites, the order in which lessons are completed is irrelevant since sums are commutative. Time-constrained lesson recommendation thus reduces to the 0-1 knapsack problem. In this case, each lesson contributes skill gains, but costs the student an expected amount of time, which can be predicted using matrix factorization. The 0-1 knapsack problem can be solved efficiently using dynamic programming.
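A standard dynamic-programming sketch of this reduction (function and variable names are ours; expected lesson times are assumed to be discretized into integer units):

```python
def best_lesson_subset(gains, times, budget):
    """0-1 knapsack over lessons: maximize total skill gain subject
    to a total expected-time budget (integer time units)."""
    n = len(gains)
    best = [0.0] * (budget + 1)
    take = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # Iterate capacities downward so each lesson is used at most once.
        for t in range(budget, times[i] - 1, -1):
            cand = best[t - times[i]] + gains[i]
            if cand > best[t]:
                best[t] = cand
                take[i][t] = True
    # Recover the chosen lessons by walking the table backwards.
    chosen, t = [], budget
    for i in range(n - 1, -1, -1):
        if take[i][t]:
            chosen.append(i)
            t -= times[i]
    return best[budget], chosen[::-1]
```

For example, with gains `[3.0, 4.0, 5.0]`, times `[2, 3, 4]`, and a budget of 5, the first two lessons (total gain 7.0) beat the single most valuable lesson.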

In practice, content recommendations should intersperse assessments in the lesson sequence to keep students engaged and make sure they are learning. We are currently working on an algorithm that incorporates this idea into the sequence optimization problem.


A Python implementation of the Latent Skill Embedding model is available at The IPython notebooks used to perform data exploration, model evaluations, sensitivity analyses, and synthetic experiments are available at The data sets from Knewton used for these experiments are not available to the public. The project will be maintained at


Siddharth Reddy is a junior at Cornell University, studying computer science and applied mathematics. He’s interested in machine learning, adaptive education, and startups.

This work was a collaboration between Siddharth Reddy, Igor Labutov, and Thorsten Joachims at Cornell University, in conjunction with the data science team at Knewton. The project was presented at the Machine Learning for Education workshop at ICML 2015 (see for details).


  1. Shuo Chen, Joshua L. Moore, Douglas Turnbull, and Thorsten Joachims. Playlist prediction via metric embedding. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 714–722. ACM, 2012.
  2. Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas Guibas. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969, 2015.
  3. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  4. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
  5. Andrew S. Lan, Christoph Studer, and Richard G. Baraniuk. Time-varying learning and content analytics via sparse factor analysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 452–461. ACM, 2014.
  6. Marco Wiering and Martijn van Otterlo. Reinforcement Learning: State-of-the-Art, volume 12. Springer Science & Business Media, 2012.

How Well Can You Predict Student Responses?

NIPS 2014 and Human Propelled Machine Learning

Last month Knewton gathered along with the global data science community at NIPS, which has become one of the top conferences for cutting-edge machine learning. Topics included game theory, climate informatics, energy infrastructure, neural nets and deep learning, and a wealth of different optimization algorithms.

The workshop Human Propelled Machine Learning was of particular interest, as much of the research presented is directly relevant to online learning. Jacob Whitehill (HarvardX) presented a model predicting student dropout in online courses, largely based on log-in times. Jonathan Huang (Google) discussed how to automatically grade programming homework on Coursera by clustering different solutions into equivalence classes. Richard Baraniuk (Rice) introduced the concept of Mathematical Language Processing, analogous to NLP, to interpret mathematical expressions. Nisar Ahmed (CU Boulder) swapped the roles of humans and machines by exploring systems where humans operated merely as a sensor in an automated process, rather than as a controller in operations.

On behalf of Knewton, I presented the paper On the Limits of Psychometric Testing in Online Education (co-authored by Zack Nichols), which provides an in-depth look at the accuracy of several well-known test grading approaches in online education. Traditional grading models were developed for brick-and-mortar settings, in which students were tested in one sitting and required to answer all the questions on a test. Today, students may be assessed continuously over the course of a semester, and may not answer all available questions.

The Naive Way to Predict Student Responses

The most intuitive way to express how proficient a student is at a certain concept is to compute what percentage of questions she answered correctly. If a student S — let’s call her Sarah — attempted ten questions and answered eight correctly, she would get a proficiency score of 80%, which we denote as θS = 80%. The higher the proportion of questions Sarah answered correctly, the more proficient we think she is.

Given this information, how do we predict how Sarah would respond to an arbitrary question? If we don’t know anything about the question, we could argue that Sarah’s proficiency is pretty high, so she probably would give the right answer. Applying this logic to each individual question, we’d predict that Sarah would answer all questions correctly. Analogously, we would predict that her classmate Theo with θT = 45% would answer each question incorrectly.

We call this prediction method the naive classifier, as it naively assumes that a student would respond the same to each question. Obviously, we could do better if we knew a bit more about the questions being answered.
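In code, the naive classifier amounts to a couple of one-liners (names are ours):

```python
def proficiency(responses):
    """Fraction of attempted questions answered correctly."""
    return sum(responses) / len(responses)

def naive_predict(theta):
    """Naive classifier: predict the same response to every question,
    correct iff the student's overall proficiency is at least 50%."""
    return int(theta >= 0.5)
```

With Sarah's proficiency of 80%, every prediction is "correct"; with Theo's 45%, every prediction is "incorrect."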

Classical Test Theory (CTT)

Classical test theory uses an additional parameter, difficulty, to predict student performance. To measure a question’s difficulty, we look at the percentage of students who answered the question incorrectly. For example, if 90% of all students got question Q wrong, then this question is probably very hard. We give Q a difficulty score 90%, or write βQ = 90%.

So, how would Sarah and Theo respond to question Q? Using classical test theory, we simply compare each student’s proficiency with the question’s difficulty. If the student’s proficiency is greater than the question’s difficulty, we predict a correct response. In this example, we predict that both Sarah and Theo will answer Q incorrectly.
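A minimal sketch of the CTT prediction rule (our own naming):

```python
def question_difficulty(responses):
    """Fraction of students who answered the question incorrectly."""
    return 1.0 - sum(responses) / len(responses)

def ctt_predict(theta, beta):
    """CTT prediction: correct iff the student's proficiency
    exceeds the question's difficulty."""
    return int(theta > beta)
```

Since both Sarah (θ = 80%) and Theo (θ = 45%) fall below Q's difficulty of 90%, both are predicted to answer incorrectly.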

One shortcoming of classical test theory is that a student’s proficiency only depends on how many questions a student answered correctly, not which questions she got right. What if Sarah only answered very easy questions? And what if Theo wanted to be challenged and only answered very hard questions? Shouldn’t we assess their proficiency differently? There is a similar problem with how question difficulty is calculated: What if only very knowledgeable students answered a given question? Enter IRT.

Item Response Theory (IRT)

Like CTT, item response theory uses parameters for a student’s proficiency and a question’s difficulty, but these parameters are computed differently. In short, if a student gets a hard question right, she earns more points than when she gets an easy question right. Mathematically, for the probability P that a student with proficiency θ correctly answers a question with difficulty β, we write

P = 1 / (1 + e^{−(θ − β)})

Here we compute the θs and βs for all students and questions using Bayesian statistics, which means that we look for the θs and βs that would make the given student responses the most likely to occur.

The formula is illustrated in figure A. This example shows that a student with some proficiency θ (vertical dotted line) is likely to answer Q1 incorrectly (dashed line), but Q2 correctly (solid line).
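The IRT formula can be written in code as follows (this is the one-parameter logistic, or Rasch, form implied by the discussion):

```python
import math

def irt_probability(theta, beta):
    """Rasch (1PL) IRT model: probability of a correct response
    from a student with proficiency theta on a question with
    difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```

When proficiency equals difficulty, the probability is exactly 1/2; it rises toward 1 as θ grows past β.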

In our paper, we tested both CTT and IRT on tens of millions of student interactions in two online educational products. We found that CTT is a better model for describing the observed student interactions (i.e., the training set), but that IRT works better for predicting unseen student responses (i.e., the test set). In other words, CTT is more prone to overfitting than IRT.

Comparing Once Again

Let’s look at the IRT formula once more. If θ < β, then the power e^{−(θ − β)} > 1, so that the probability P < ½, which we could round down to zero when making a prediction. Indeed, if a student’s proficiency is less than a question’s difficulty, we’d predict her response to be incorrect. Conversely, if θ ≥ β, then e^{−(θ − β)} ≤ 1, and hence P ≥ ½, which we would round up to 1. This implies a correct response.

Again, we predict a correct response if and only if a student’s proficiency is greater than a question’s difficulty. When predicting a single response, we are not interested in how much exactly a student’s proficiency or a question’s difficulty is or how far they are apart — we simply need to know which one is bigger.

This means that all we need is an ordering of students and questions, from most proficient to least proficient, and difficult to easy, respectively. A student will correctly respond to all questions below their proficiency, and incorrectly to all questions above it. Rather than learning specific values for θ and β, couldn’t we deduce such an ordering directly from the student responses?

Bipartite Partial Tournament Ranking

In our paper, we introduce another model for describing student responses. We interpret the student responses as a tournament, in which a student engages in a battle with a question, and either wins or loses. We call this a partial tournament, since students generally do not face all of the questions, and it’s bipartite, since the students only challenge questions and not each other. This tournament is a list of pair-wise comparisons: Which is bigger, the student’s proficiency or the question’s difficulty? 

We could draw this tournament as a bipartite directed graph (see fig. C), where vertices represent students and questions, and a student’s outgoing edges represent correct answers, and vice versa. Since we predict a correct response if and only if a student’s proficiency is higher than a question’s difficulty, an edge points to a vertex with a lower parameter. Therefore, paths always go through vertices with decreasing proficiency or difficulty. If we can find a path through all vertices (called a topological sorting), then we will have a ranking of all students and questions that perfectly describes all student responses.
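A sketch of this check using Kahn's algorithm (names are ours): it returns a topological ordering of the tournament graph when one exists, and None when a cycle makes a consistent ranking impossible.

```python
from collections import defaultdict, deque

def topological_sort(vertices, edges):
    """Kahn's algorithm: return a topological ordering of the
    tournament graph, or None if the graph contains a cycle."""
    indeg = {v: 0 for v in vertices}
    adj = defaultdict(list)
    for u, v in edges:  # edge u -> v: u's parameter exceeds v's
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order if len(order) == len(vertices) else None
```

The returned ordering ranks students and questions together, from highest proficiency/difficulty to lowest.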


However, such a sorting does not exist if we have cycles (see fig. D). In that case, an edge must be broken (or flipped in the opposite direction) to unambiguously rank students and items. This means that even the most accurate ranking must contain at least one prediction error: the edges you flip are exactly the student responses you will predict incorrectly. Therefore, the model with the highest accuracy is one that finds a minimum feedback arc set (FAS): the smallest set of edges whose removal eliminates all cycles.

It is practically impossible to find the minimum FAS for a large graph, but good approximations exist. In our paper, we combine a well-known algorithm by Eades with a simple local optimization (a combination we call Eades+LO), which in simulations virtually always recovered the “true” accuracy of the simulated data.
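For illustration, here is a sketch of the greedy Eades heuristic (without the local-optimization pass; all names are ours): vertices are peeled off as sinks and sources where possible, and ties are broken by outdegree minus indegree. The backward edges of the resulting ordering form an approximate feedback arc set.

```python
from collections import defaultdict

def eades_ordering(vertices, edges):
    """Greedy heuristic of Eades et al.: order vertices so that the
    backward edges approximate a minimum feedback arc set."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nbrs[u].add(v)
        in_nbrs[v].add(u)
    left, right = [], []
    remaining = set(vertices)
    while remaining:
        # Repeatedly peel off sinks (placed rightmost) ...
        sinks = [v for v in remaining if not (out_nbrs[v] & remaining)]
        while sinks:
            for v in sinks:
                right.append(v)
                remaining.discard(v)
            sinks = [v for v in remaining if not (out_nbrs[v] & remaining)]
        # ... and sources (placed leftmost).
        sources = [v for v in remaining if not (in_nbrs[v] & remaining)]
        while sources:
            for v in sources:
                left.append(v)
                remaining.discard(v)
            sources = [v for v in remaining if not (in_nbrs[v] & remaining)]
        if remaining:
            # Otherwise move the vertex with maximum
            # outdegree minus indegree to the left side.
            v = max(remaining, key=lambda x: len(out_nbrs[x] & remaining)
                                           - len(in_nbrs[x] & remaining))
            left.append(v)
            remaining.discard(v)
    order = left + right[::-1]
    pos = {v: i for i, v in enumerate(order)}
    feedback = [(u, v) for u, v in edges if pos[u] > pos[v]]
    return order, feedback
```

The number of backward edges is exactly the number of responses the resulting ranking predicts incorrectly.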

Finding the Maximum Accuracy

The above approaches model student behavior with the same total number of parameters (i.e., only one per student or question, for proficiency or difficulty), and vary only in their regularizations and fitting algorithms. In our paper, we show that these variations can still result in dramatic differences for in-sample and out-of-sample accuracy (see figures E and F below).


By construction, the tournament ranking-based model will generate a ranking of proficiencies and difficulties that best describes the given student interactions. Therefore, its in-sample prediction accuracy will always be higher than that of any other one-parameter model. This is also shown in our paper: on the training set, it outperforms both CTT and IRT by a landslide. On the test set, however, it performs much worse than the other two models.

Even though the tournament ranking performs poorly on the test set, it yields a useful reference point in comparison with other algorithms. For example, if a dataset has a tournament ranking-based prediction accuracy of 85%, then an IRT model fit to this dataset cannot be expected to have a prediction accuracy higher than 85% on either the training set or the test set. (As a side note, it’s not difficult to show that this upper bound also holds for 2PL-IRT models.)

So what accuracies can our student predictions maximally reach? Although we ran the model on two very different educational products, the results were remarkably similar. In both products, the tournament ranking-based model reached accuracies between 82% and 100% for predicting all interactions in a concept, averaging around 93%. This suggests that, on average, no one-parameter model will be able to eliminate the remaining error rate of 7%. This is either noise or student behavior that needs a second parameter to be described.

Predict How You Will Predict

Surprisingly, these maximum accuracies correlated very well with simple dataset features like mean response score, number of students, and average number of questions answered per student, with a coefficient of determination (R²) of up to 85%. In other words, without running your tournament ranking or IRT model, you can easily predict how well your model will be able to model student responses.


This is important, since the accuracy of tournament ranking-based predictions is defined by the size of a feedback arc set in the data: Predictability of this feature hints at some conserved student behavior across the datasets. It is possible that modeling approaches which are better tuned to continuous assessment in online educational environments could explain this property.

Final Thoughts

Classical test theory, which compares the proportion of correct responses, is a very intuitive model that is easy to understand and implement, and performs well in describing student behavior. However, for predicting unseen student interactions, item response theory outperforms the other discussed models. Interpreting the student responses as a tournament of students versus questions and finding the best ranking gives us an upper bound of the prediction accuracy of these models, and all other one-parameter models as well.

The consistency and predictability of maximum accuracies suggest that these models fail to capture hidden patterns in student behavior. One obvious explanation could be that these models ignore the fact that students learn. It stands to reason that an assessment method that can take learning into account should have measurable advantages for this type of data. This work is intended as a step in that direction, by providing a necessary point of comparison.


The paper On the Limits of Psychometric Testing in Online Education was written by Ruben Naeff and Zack Nichols, with valuable input from David Kuntz, Jesse St. Charles, Illya Bomash, Kevin Wilson and Yan Karklin. References can be found in the paper.


Offline Parameter Estimation

As part of my summer internship at Knewton, I helped design and implement an offline parameter estimation batch. This blog post discusses some of the design considerations and challenges associated with the implementation of offline parameter learning. But first, I’ll provide a little context around the model and an overview of the parameter estimation process.

Vocabulary and Context

  • A student event is any interaction a student has with educational content that is recorded in Knewton’s data store (e.g., watching an instructional video or answering a quiz question).
    • When an event consists of a student responding to an assessment question, it is referred to as an item interaction. In this case, the item refers to the actual question.
  • Items are responsible for assessing one or more concepts (e.g., long division). The content for a particular book consists of a collection of concepts.
  • A graph structure identifies the relationships between concepts (e.g., a concept might be a prerequisite for another concept). The graph structure also identifies which concept(s) a particular item assesses. Therefore, a single graph identifies the relationships for all items and concepts contained in the book used for a course.
  • By understanding the characteristics of each item (e.g., its level of difficulty) as well as a student’s current level of proficiency in a particular concept, Knewton is able to provide a recommendation on what piece of educational content the student should interact with next in order to meet the learning objectives of the course. This explanation glosses over a number of other factors taken into consideration by Knewton’s recommendation engine, but suffice it to say, the recommendation relies on an understanding of item characteristics and student proficiency.
  • Item characteristics and student proficiency are modeled using Item Response Theory (IRT), as discussed in a previous blog post.

Parameter Estimation Overview

Once the initial model components are identified based on experimentation and evaluation by data science teams, Knewton has to assess how the parameters for this model will be estimated over time to ensure robustness and accuracy as more data are collected. The team can choose to train parameters in an ad hoc mode, a real time online mode, a recurring offline mode, or some hybrid.

Ad hoc

In this mode, the data scientist(s) might derive parameters outside of the operational functions of the product based on their analysis. Parameters are configured into the product based on this analysis in an ad hoc manner based on additional experiments/analysis by the data scientist(s).

The key operational drawback of this approach is the lack of a systematic and real-time response to new student events. Parameter updates have to wait until the next time the data scientist(s) choose to gather and incorporate new student events into a parameter estimation run. In addition, without a standardized and repeatable process in place, different data scientists may approach the parameter updates in a slightly different way based on their familiarity with the problem. This can pose a challenge in maintaining quality consistency and reproducibility.

The benefit of this approach, however, is that it has few constraints imposed upon it, and so is incredibly flexible. Data scientists are able to tweak the training process as they see fit.


Offline

This mode refers to a recurring operational process by which parameters are regularly re-estimated as part of some (semi-)automated process. At Knewton, this is achieved through scheduled batch jobs that automate the gathering of training data and execute the parameter learning.

While this limits the flexibility afforded to the parameter estimation method (since the logic has to be implemented in an automated batch), it improves robustness by simplifying the process by which model parameters can be updated in response to changes in the underlying data. This process still does not provide real-time parameter updates, but, at the time of execution, it can incorporate all available data into its parameter estimates.


Online

This mode refers to the integration of the parameter update process into the online behavior of the product. At Knewton, this means that each incoming student event can trigger a parameter update. This is necessary because each additional student interaction provides insight into the current proficiency level attained by the student. An application of this approach outside of Knewton involves having to cluster news articles into related topic areas. Since new articles are constantly coming in over a newswire, there is a need to associate and update the clusters to reflect the updated corpus of news articles [1].

The key operational challenge of online estimation is that data generally have to be processed incrementally. Multiple scans of the full data set are not always feasible. In addition, real-time response rate requirements limit the amount of computation that can take place for each incoming data point (particularly if the number of incoming events is high) [2]. As an example in the context of news articles and topic models, Banerjee et al. employ algorithms which hold the “number of clusters” constant and only update the cluster parameters for the topic area which appears to be most closely aligned with an incoming news article [1].
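The incremental style these constraints force can be sketched in a few lines. This is a toy illustration of the idea, not the algorithm from [1]: the number of clusters is held constant, and each incoming point updates only the running mean of its nearest cluster.

```python
import numpy as np

def online_cluster_update(centroids, counts, x):
    """Assign point x to its nearest centroid and update only that
    centroid with an O(1) running-mean step; no rescan of old data."""
    distances = [np.linalg.norm(x - c) for c in centroids]
    j = int(np.argmin(distances))          # closest existing cluster
    counts[j] += 1
    centroids[j] += (x - centroids[j]) / counts[j]  # incremental mean
    return j
```

Because each event touches a single centroid, the per-event cost stays constant even as the corpus grows, which is exactly the trade-off discussed above: fast updates, but no global re-optimization.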

While these techniques are effective in providing real-time response rates, the accuracy of model predictions may suffer over time if there are underlying changes to the model parameters that were held constant. In the case of news topic clusters, the study finds that, in general, “online algorithms give worse nMI (a measure of cluster quality) results than corresponding batch algorithms, which is expected since the online algorithms can only update the cluster statistics incrementally” [1]. In Knewton’s case, new data might suggest that item parameters need to be tweaked from the default settings with which they were originally configured.


Hybrid Estimation

The authors of the news topic clustering study — like Knewton — propose the use of a hybrid model in which offline parameter estimation is used to regularly refresh model parameters that are otherwise being updated incrementally through the online parameter updates [1]. This allows model parameters to react in real time to incoming data points, while systematically refreshing parameters that might otherwise become stale as a result of only making incremental online adjustments.

The hybrid approach does pose the challenge of having to identify an optimal frequency at which offline parameter estimation should be carried out to augment the performance of the online parameter updates. The frequency is likely to be dependent on how quickly errors accumulate due to online incremental updates and the tolerance range for such prediction errors.

Design Considerations for Offline Parameter Estimation

The remainder of this post discusses the offline batch process that was to be introduced in support of the hybrid parameter estimation approach. As such, the objectives of this batch job were to:

  • Extract item interactions from the student events recorded by Knewton;
  • Parse them into training data for a parameter learning algorithm for the IRT model; and
  • Estimate new item parameters, grouped by book. By grouping items by book, the estimation process can use relationships between different concepts (e.g., that mastery of concept 1 is a prerequisite for mastery of concept 2) to inform parameter selection for items assessing each of those concepts.

There were four major design considerations which went into achieving these objectives: robustness, flexibility, scale, and transparency.


Robustness

One of the main considerations was to design an offline parameter estimation job that would be able to accommodate future changes in the estimation process as well as in the underlying data. As a result, there was heavy use of modularized design to decouple various components of the batch process:

    • Rather than directly retrieve student events from the database, the batch relies on an intermediary batch job that retrieves student events from the database and parses them into an easily consumable input format.
    • A separate library accepts a training data set as input and uses it to perform the actual parameter estimation.

This allows the batch process to gather a training dataset and pass it to a machine learning algorithm for parameter estimation while being agnostic of the underlying database structure, graph structure, and parameter estimation algorithm details.


Flexibility

Though the batch job is itself written in Java, the library responsible for performing parameter estimation is written in Python in order to provide the flexibility of leveraging mathematical packages such as pandas and numpy. To facilitate this, the batch job accommodates Python code execution as a post-Reduce calculation step using a custom MapReduce framework (which uses Thrift to communicate between the Java and Python components).

The library used to perform the actual parameter estimation is the same library which the data science teams utilize for their experimental analysis. This eliminates the need to generate duplicate code going from exploratory analysis to offline parameter estimation — particularly as the parameter estimation techniques are tweaked. In addition, the batch is able to support input parameters corresponding to the full range of optimization settings used by the data science teams to tweak the parameter learning process.


Scale

The biggest challenge was dealing with the amount of data involved in the parameter estimation step. Knewton records over 500 million student interactions each month. Using a MapReduce design allowed the extraction and parsing of item interactions from student events to be achieved by distributing the student event records across multiple Hadoop partitions. Using a 15-node cluster of R3 AWS instances, the batch was able to parse student events for a given month within an hour. The R3 instances are optimized for in-memory analytics, providing 61 GiB of memory and 160 GB of SSD storage.

Even after the batch had extracted the item interactions from the student event records, this still left between one and five million training data instances per book for the estimation step. Such large data sets place a heavy load on memory during the iterative parameter estimation process. While this issue was addressed by using AWS instances optimized for memory, another option would be to sub-sample down to manageable levels as the amount of data increases over time. The other alternative would be to modify the learning algorithm to avoid storing all 5 million training data instances in memory during the update calculations. However, devising a special algorithm for this may limit the long-term flexibility of the parameter estimation process.


Transparency

The batch job uses Counters and logging to provide transparency into the parameter estimation steps. The logging is provided by the parameter estimation library, which lets the user see parameter estimation statistics (such as log-likelihood values) at the end of each iteration of the algorithm. This was designed so that one can look inside the parameter estimation process and identify discrepancies or optimization opportunities.

Parameter Validations

Finally, a systematic QA step is an important feature for any automated offline parameter estimation. Knewton exercises a number of manual validation steps ranging from quantitative analysis (e.g., running historical events through the model to evaluate predictive power) to qualitative analysis (e.g., reviewing recommendations produced with the parameter estimates). In the future, some of these QA steps could be incorporated into the batch process to provide fully automated offline parameter estimation.


[1] A. Banerjee and S. Basu, “Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning,” Society for Industrial and Applied Mathematics, pp. 431-436, 2007.
[2] S. Zhong, “Efficient streaming text clustering,” Neural Networks, vol. 18, no. 5-6, pp. 790-798, 2005.

Online Cross-Validation to Predict Future Student Behavior

At Knewton we use various mathematical models to understand how students learn. When building such models we want to make sure they generalize, or perform well for a large population of students. Cross-validation, a technique in machine learning, is a way to assess the predictive performance of a mathematical model. At Knewton, we use cross-validation extensively to test our models.

Cross-validation is based on the fact that we don’t have access to unlimited data. If we had all the possible data on student learning patterns, the solution would be straightforward. We would test all our models with the data and pick the one with the lowest error rate. In reality, we only have a finite set of student data to work with. Given a limited amount of data, how do we decide which model performs the best?

One approach is to use all of the available data to test our model. A major problem with this approach is overfitting, which is demonstrated in Figure 1.


Figure 1: Left: the model (blue) underfits the data (orange). This is an over-simplistic explanation of the data where the model would be a better fit if it had more parameters. Middle: the model fits the data just right, where the model captures the overall pattern in the data well. Right: the model overfits the data, where the model fits the noise in the dataset. (Source)

If our model overfits the data, the error rate will be low but if new data is added to the dataset, the model might perform poorly as the fit doesn’t explain the new data well. This is why models that overfit do not generalize well and should be avoided.

This is where cross-validation comes into play. In this approach, rather than fitting the model to the full dataset we split it into training and test sets. This is also referred to as holdout cross-validation, as we are leaving a portion of the data out for testing. The model is fitted using only the training portion of the dataset. Then we assess the predictive performance of the model on the left-out data, which is the test set.

As an example, one model we use to assess student learning is Item Response Theory (IRT). We want to cross-validate our IRT model for a set of student responses to test the performance of our model. To do this, we can split the student response data into training and test sets, fit the model to the training data, and validate it on the test data. If the fitted model predicts the student responses in the test set accurately we can accept this IRT model.
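As a minimal sketch of this workflow (the (student, item, correct) triple format and helper names here are illustrative assumptions, not Knewton's actual data model), a holdout split plus a predictive-accuracy check might look like:

```python
import numpy as np

def holdout_split(responses, test_fraction=0.2, seed=0):
    """Randomly split (student, item, correct) triples into train/test sets."""
    rng = np.random.default_rng(seed)
    responses = list(responses)
    rng.shuffle(responses)
    n_test = int(len(responses) * test_fraction)
    return responses[n_test:], responses[:n_test]

def accuracy(predict, test_set):
    """Fraction of held-out responses a fitted model predicts correctly,
    where predict(student, item) returns P(correct)."""
    hits = sum(
        (predict(student, item) >= 0.5) == bool(correct)
        for student, item, correct in test_set
    )
    return hits / len(test_set)
```

The model would be fitted on the training triples only; its accuracy on the test triples then estimates how well it generalizes to unseen responses.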

When measuring how students learn, we assume they learn over time, so it is useful to understand how their behavior evolves as time progresses. A shortcoming of the holdout cross-validation technique is that it splits past student data at random, so it cannot make predictions about how students will behave in the future. It would be very useful to be able to predict students’ future behavior given their past learning patterns.

Online cross-validation is a version of cross-validation which can validate over time series data. Going back to our student response data example, online cross-validation uses a student’s past data to predict how that student will behave in the future. The dataset for online cross-validation is a time-ordered set of responses the student gave in the past. We take the first k responses of a student and use them for the training set, then we try to predict that student’s k+1st, k+2nd, …, k+nth response. If our prediction accuracy is high, we can say that our model is a good fit for our dataset.

Let’s look at how online cross-validation works in more detail. Suppose a student answers a series of questions over time; some of these responses are correct (green) and some are incorrect (red). Online cross-validation starts by training on the student’s first response only (k=1), then uses this model to predict whether the student will get the next item (k+1=2) correct or incorrect.


Figure 2: The first iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first response (k=1) and then used to predict the second, k+1st item (k+1=2). If our prediction matches the student response, our model accuracy increases. 0/1 refers to incorrect/correct.

In the next iteration of online cross-validation, we can use the first two responses (k=2) as our training set, fit the model using these two data points, and predict the third response (k+1=3).


Figure 3: The second iteration of online cross-validation. The dots represent whether a student got a question correct (green) or incorrect (red). The model is fitted using the first two responses (k=2) and then used to predict the third, k+1st item (k+1=3). 0/1 refers to incorrect/correct.

Online cross-validation continues until we run through all the iterations by increasing the training set one student response at a time. We expect to make better predictions as we add more data to our training set.

With online cross-validation, we are not limited to predicting only the next response in the future. We can predict a student’s next 2, 3, …, n responses. This makes online cross-validation a very useful technique if we want to make predictions far in the future.
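The walk-forward procedure above can be sketched model-agnostically. Here the fit and predict callables are placeholders for a real model such as IRT; the majority-vote "model" in the usage note below is just a stand-in:

```python
def online_cross_validation(responses, fit, predict):
    """Walk forward through a student's time-ordered 0/1 responses:
    fit on the first k responses, predict response k+1, score the match."""
    hits, total = 0, 0
    for k in range(1, len(responses)):
        model = fit(responses[:k])   # train on everything seen so far
        guess = predict(model, k)    # predict the next response
        hits += int(guess == responses[k])
        total += 1
    return hits / total
```

For example, with the toy model fit = lambda past: int(2 * sum(past) >= len(past)) (majority vote over past responses) and predict = lambda model, k: model, the response sequence [1, 1, 1, 0, 1, 1] scores 4 correct predictions out of 5.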

Both holdout cross-validation and online cross-validation are useful methods for assessing model performance. Holdout cross-validation is well suited to static datasets, whereas online cross-validation is helpful when we want to test a model on time series data.


Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. The Kalman filter has numerous applications in finance and different areas of engineering, and in this blog post, we will show how the Kalman filter can apply to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.


To begin, we are going to describe our inference problem and a model of the system – a set of assumptions of how student learning works.

Given a series of scalar measurements:

Z = \{z_1, z_2, \ldots, z_k\}

where z_i denotes the natural logarithm of the time it took our student to answer question i, we want to infer the scalar latent variables:

X = \{x_1, x_2, \ldots, x_k\}

where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), where \tau is the time the student spent between questions, and \sigma^2_x is a hyperparameter that corresponds to the variance for this latent process (1).

Having the variance of the noise increase linearly with the time difference makes our equation consistent with a continuous Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience significant change in ability if they’re between questions in a quiz, but if it’s been a day since they’ve answered a question, they’re more likely to have taken the time to learn more about the concept, or, conversely, they might have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter, since it assumes a continuous-time model for a discrete set of observations.
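Equations (1) and (2) are straightforward to simulate, which is a useful sanity check on the model's assumptions. The parameter values below are arbitrary choices for illustration:

```python
import numpy as np

def simulate_student(times, sigma_x=0.5, sigma_z=0.3, x0=0.0, seed=0):
    """Simulate the model: ability follows a Gaussian random walk whose
    step variance grows with the gap tau between questions (equation 1),
    and each observed log response time is ability plus noise (equation 2)."""
    rng = np.random.default_rng(seed)
    x = x0
    abilities, observations = [], []
    prev_t = times[0]
    for t in times:
        tau = t - prev_t
        if tau > 0:
            # w_k ~ N(0, tau * sigma_x^2): more elapsed time, more drift.
            x += rng.normal(0.0, np.sqrt(tau) * sigma_x)
        # z_k = x_k + v_k, with v_k ~ N(0, sigma_z^2).
        observations.append(x + rng.normal(0.0, sigma_z))
        abilities.append(x)
        prev_t = t
    return abilities, observations
```

A long gap (say, a day between sessions) lets the latent ability drift much further than the seconds between quiz questions, matching the intuition above.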

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), where \sigma^2_z is a hyperparameter that corresponds to the variance of the Gaussian noise (2).

[Figure: graphical model of the hybrid Kalman filter]

The resulting model is pictured by the diagram above: the ability of the student at the previous question x_{k-1} determines the log response time z_{k-1}. The ability of the student at current question x_k is determined by the ability at the previous question x_{k-1}, and determines the current log response time, z_k.


The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given observations Z = {z_1, z_2,...,z_k}. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on syntax:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the k-1th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents our estimate of the variance given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of current latent state x_k having seen everything but our current observation z_{k}.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the sum of two independent Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the addition of two independent random variables can be expressed as a convolution of the two composite probability densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1})\, p(x_k|x_{k-1})\, dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that the likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the multiplication of two Gaussians is still a Gaussian, although unnormalized:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1}) p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:

x_{k|k} = x_{k|k-1} + C_k (z_k - x_{k|k-1})

C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.
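Putting the prediction and correction steps together gives a very small recursive update. This is a sketch of one filter step, with the posterior mean written in the standard Kalman gain form x_{k|k} = x_{k|k-1} + C_k(z_k - x_{k|k-1}):

```python
def kalman_step(x_prev, var_prev, z, tau, sigma_x2, sigma_z2):
    """One hybrid Kalman filter step: predict with the random-walk prior,
    then correct using the new observation z."""
    # Predict: the mean carries over; variance grows with elapsed time tau.
    x_prior = x_prev
    var_prior = var_prev + tau * sigma_x2
    # Correct: blend prior and observation by their relative precisions.
    gain = var_prior / (var_prior + sigma_z2)
    x_post = x_prior + gain * (z - x_prior)
    var_post = 1.0 / (1.0 / var_prior + 1.0 / sigma_z2)
    return x_post, var_post
```

Each new log response time costs only a few arithmetic operations, which is what makes real-time latent state estimation feasible.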

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is on Knewton’s GitHub:


Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.

Deciding Who is Right: Using Item Response Theory Instead of Majority Rule to Crowdsource Answers

As the saying goes, it takes a village to build an adaptive learning technology provider. Even when you have an entire village at your disposal, however, the added help may contribute more headache than help as even the simplest question can generate differing opinions.

To place the situation in more concrete terms, imagine you have a stack of 10,000 photos, each depicting either a kangaroo or a kitten, but you do not know which photo depicts which. Because object recognition remains a difficult problem in artificial intelligence, even the most powerful computers will have a difficult time determining whether a photo shows a kangaroo or a kitten.

Classifying them all by yourself would be time consuming and potentially inaccurate if you start to lose focus. Luckily, 100 of your closest friends have offered to help; unluckily, some informal surveying already reveals that sometimes they disagree with each other. After all, kangaroos and kittens do sometimes look similar. What to do?

How can we decide the correct answer amidst a sea of potentially contradictory information? One straightforward approach would be to gather two or three (or ten) labels for each photo and take the majority vote. While the majority method would give us a rough idea of the correct label for each photo, it fails to incorporate the fact that some of your friends may be more gifted kangaroo/kitten labelers than others. Additionally, some of the photos might be harder to label, which would skew calculations about your friends’ abilities.

All of a sudden, our kangaroo-kitten problem is starting to sound like a problem we have already tackled: Item Response Theory (IRT)!

IRT Model

In order to solve our kangaroo-kitten problem, we can use a slightly modified form of IRT. Here at Knewton, we use IRT to determine student proficiency. Unlike standard tests where all questions are weighted equally, IRT assigns each student an ability value and each question a difficulty value, which provides a more nuanced picture of how a classroom of students is doing.

In the one-parameter logistic (1PL) IRT model, for question j with difficulty \beta_j and student i with ability \theta_i, the probability that question j is answered correctly, x_{ij} = 1, is

p(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 + e^{-(\theta_i - \beta_j)}}
For more information about IRT, check out Alejandro Companioni’s previous N Choose K post.
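In code, the 1PL probability is a one-line logistic function of the ability-difficulty gap:

```python
import math

def p_correct(theta, beta):
    """1PL (Rasch) probability that a student with ability theta answers
    a question of difficulty beta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```

When ability equals difficulty the probability is exactly 0.5, and it rises toward 1 as ability exceeds difficulty.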

The question then becomes, how can we use our knowledge about abilities and difficulties to determine who is better at kangaroo/kitten labeling?

Beyond IRT

If we stretch our minds a little bit, we can imagine our kangaroo/kitten problem as a test where the questions are the photos and the students are our friends. We want to determine which students are most proficient at the very strenuous task of animal classification. In addition to the parameters included in the 1PL IRT model, however, we also want to compute a probability to capture how likely each picture is to be either a kangaroo or a kitten.

Similar to the 1PL IRT model, the parameters in our model now include the labels L, a vector of abilities \theta, and a vector of difficulties \beta. To make sure we’re all on the same page: the labels L represent all of the labels given by our labelers (not every labeler needs to label every photo). Each ability \theta can range from negative infinity to positive infinity; the greater the ability, the more skilled the labeler. Our difficulties range from zero to infinity, where the higher the difficulty, the harder the image is to label correctly.

Consider how the observed labels, true labels, abilities, and difficulties all relate to each other. Would the difficulty of the question affect the accuracy of the observed label? Potentially. Would the true label of the image affect the ability of the labeler? Unlikely. Below we have drawn the general graphical model describing the relationships between these parameters where the shaded variables are observed.


Remember that in our case, we have 10,000 images and 100 labelers. Unsurprisingly, the difficulties, abilities, and the true labels are all independent of each other, meaning the accuracy of a labeler has no effect on the likelihood that a photo depicts a kangaroo or a kitten!

What does this have to do with whether the photo is a kangaroo or a kitten? For a specific photo j, we can derive how likely the photo is to depict either adorable animal: the posterior probability of the correct label z_j for photo j.

Because we know that the photo contains one of the two animals, we can designate kangaroo as 0 and kitten as 1. Our posterior probability then ranges from 0 to 1, indicating how likely the photo is to contain either animal. If we assume that the correct label z_j is independent of both the abilities \theta and the difficulties \beta, the probability simplifies dramatically.

p(z_j \mid L, \theta, \beta) \propto p(z_j) \prod_i p(l_{ij} \mid z_j, \theta_i, \beta_j)
The posterior probability now consists of two components: a prior belief and an IRT-based probability. Our first term, p(z_j), captures our prior knowledge about how many of the photos contain each animal. For example, if we suspected that the majority of the photos were kittens rather than kangaroos, we could use this parameter to encode that prior belief in the model. The second term uses our 1PL IRT probability to denote the probability that the labeler gave a label (i.e., answered a test question) conditioned on the correct answer, the labeler’s ability, and the difficulty of the question.


Now that we have established our graphical model including relevant dependencies, we can use an Expectation-Maximization (EM) approach to obtain maximum likelihood estimates of the parameters of interest. Essentially we can now alternate between the expectation and maximization steps to find the most likely probabilities and parameters, given all of the other probabilities and parameters.

By a lucky coincidence, we have actually already determined our expectation step above when we computed the posterior probability of each label! A simpler way to think about the expectation step is to imagine that our abilities and difficulties are all fixed, and we calculate the animal image probabilities accordingly. If we only calculated the probabilities once, however, the probabilities would only depend on whatever values of abilities and difficulties we initialized in our model! How can we keep adjusting the model?

This revelation brings us to the second half of EM: the maximization step. For this step, we want to find a way to make our posterior probabilities as large as possible—denoting how certain we are overall of our guesses of the correct label—by adjusting our parameters \theta and \beta. More formally, we are trying to maximize the expectation of the joint log-likelihood of the observed and hidden variables (L, Z) given the parameters (\theta, \beta) with respect to the posterior probabilities that we calculated in the expectation step.

Our joint log-likelihood function is the expected value of the logarithm of our joint probabilities. That is, how certain are we of everything so far? Using our conditional independence assumptions outlined earlier, we can find our joint log-likelihood function:


Using gradient ascent, we can then find values of \theta and \beta that locally maximize Q.

From here, we simply alternate between expectation and maximization steps until convergence. To recap, expectation holds ability and difficulty constant, while calculating posterior probabilities. Maximization then calculates the ability and difficulty parameters to maximize joint log-likelihood, given constant posterior probabilities.
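A toy version of this alternation can be written compactly. This sketch makes several simplifying assumptions: difficulties are held at zero, and the M-step is a crude agreement-based ability update standing in for the gradient ascent described above; it illustrates the E/M loop rather than reproducing the full model.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def em_labels(labels, n_iters=20, prior_kitten=0.5):
    """labels[i][j] is labeler i's 0/1 vote on photo j (NaN if unlabeled).
    Returns posterior P(z_j = 1) per photo and an ability per labeler."""
    labels = np.asarray(labels, dtype=float)
    n_labelers, n_photos = labels.shape
    beta = np.zeros(n_photos)           # difficulties (fixed, for simplicity)
    theta = np.zeros(n_labelers)        # abilities
    post = np.nanmean(labels, axis=0)   # initialize posteriors via majority vote
    for _ in range(n_iters):
        # M-step (simplified): ability tracks expected agreement with posteriors.
        for i in range(n_labelers):
            seen = ~np.isnan(labels[i])
            agree = np.where(labels[i, seen] == 1, post[seen], 1 - post[seen]).mean()
            agree = min(max(agree, 1e-6), 1 - 1e-6)
            theta[i] = np.log(agree / (1 - agree))
        # E-step: combine the prior with the 1PL likelihood of each vote.
        for j in range(n_photos):
            ll1, ll0 = np.log(prior_kitten), np.log(1 - prior_kitten)
            for i in range(n_labelers):
                if np.isnan(labels[i, j]):
                    continue
                p = sigmoid(theta[i] - beta[j])   # P(vote matches the truth)
                ll1 += np.log(p) if labels[i, j] == 1 else np.log(1 - p)
                ll0 += np.log(p) if labels[i, j] == 0 else np.log(1 - p)
            post[j] = 1.0 / (1.0 + np.exp(ll0 - ll1))
    return post, theta
```

With three consistent labelers and one contrarian, the posteriors follow the reliable majority and the contrarian's estimated ability goes negative.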

Depending on the gradient ascent implementation, the algorithm should converge quickly, revealing our best guesses for which animal is featured in which photo. As we can see below from our implementation based on simulated data, the EM approach initially outscores the majority approach by nearly 5% before the two converge. Additionally, as we increase the number of voters, the accuracy increases. Success!


While our improvement over the majority method may be impressive, our E-M IRT model still has plenty of room to expand. What if we also had pictures of koalas and killer whales, increasing the number of options? What if we had reason to believe that the abilities of our friends fall in a Gaussian distribution, creating a prior distribution on our parameters? What if we assumed that our friends might become better labelers as they continued to label, making our model intertemporal?

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and J. Movellan. (2009). Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043.

de Ayala, R.J. (2008). The Theory and Practice of Item Response Theory, New York, NY: The Guilford Press.

Algorithmically Generating Questions

Developing assessment materials is a lot of work. Teachers spend countless hours scavenging textbooks and developing original exercises for practice worksheets, homework problems, remedial materials, and exams. To prevent cheating, many teachers write several versions of each test, multiplying the work required for creating problems (not to mention grading!). The issue is exacerbated by the rising popularity of MOOCs, in which tens of thousands of students may be enrolled in the same course.

Furthermore, access to a large, varied pool of assessment items is a key to the success of many adaptive courses of study. Without a large pool of items, proficiency estimates can be compromised and personalized courses of study can become less effective than they would be otherwise.

Machine Generated Questions

Machine generated questions have been studied for decades as a component of intelligent tutoring systems. Most research falls into two categories: solution-oriented approaches, and template-based approaches.

Solution-Oriented Approach

In this approach, questions are generated based on the set of skills and concepts required to solve them. For example, skills related to addition include adding single-digit numbers, adding multi-digit numbers, adding three or more numbers, and carrying digits.

A recent paper, entitled “A Trace-Based Framework for Analyzing and Synthesizing Educational Progressions,” describes an interesting implementation of solution-oriented question generation. On page three, the authors write out pseudocode for the standard, classroom addition procedure. They then annotate the code with symbols representing skills (for example, C for “carry digit”). Thus, by running the pseudocode and keeping track of the symbols, one can obtain a “trace” that categorizes each possible addition problem.

Because solution-oriented approaches group problems based on skills, they lend themselves well to adaptivity. As a student answers questions, one can identify skills he or she is struggling with, and then recommend material reinforcing those skills. However, a major drawback of solution-oriented approaches is that developing questions even for a topic as simple as addition requires a fair amount of labor and domain expertise.

Template-Based Approach

In this approach, a question template is used to represent a potentially large class of problems. For example, consider a familiar question:

Find all roots of _x^2 + _x + _.

The underlines are “holes” that must be filled in by the question generator. A template might also specify valid ways to fill in the holes. For example, maybe each hole can only be filled in by the integers one through 10, leading to 10^3= 1000 possible questions. The instructor may wish to further restrict the template to only permit quadratics with real, distinct roots.
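The restriction step is easy to picture as a filter over the template's sample space. A sketch, assuming integer holes a, b, c in [1, 10] for the quadratic ax^2 + bx + c:

```python
from itertools import product

def quadratic_instances(lo=1, hi=10):
    """Enumerate hole fillings for 'Find all roots of ax^2 + bx + c',
    keeping only quadratics with a positive discriminant
    (i.e., real, distinct roots)."""
    return [
        (a, b, c)
        for a, b, c in product(range(lo, hi + 1), repeat=3)
        if b * b - 4 * a * c > 0
    ]
```

Of the 10^3 = 1000 raw fillings, only those passing the discriminant check survive, so the instructor's restriction shrinks the question pool without any manual authoring.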

The biggest advantage of this approach is that it is accessible to a majority of instructors, provided there is an intuitive and robust templating language. In addition, template-based approaches are easily generalizable, capable of representing entire domains. A disadvantage of templates is that they tend to group problems based on appearance, not skills.

This summer, I set out to create an assessments generator engine that would be both accessible and expressive enough to generate a wide variety of problems. For these reasons, a templating approach made the most sense. Furthermore, Knewton already has infrastructure in place that will enable us to achieve adaptivity even with a template-oriented approach.

The Art of Templating

My first task was to devise a templating language. I decided that it would be a good exercise to define a domain specific language (DSL) that formalizes the space of possible templates. This DSL must let instructors specify the following:

  • Which variables in the question can be adjusted?

  • What values are these variables allowed to take on?

  • How is the correct answer computed?

  • How are the incorrect answers computed? (for multiple choice questions)

After several iterations, I came up with a format general enough to cover many (if not most) of the questions used by Knewton Math Readiness. I’ll go through some examples, beginning with simple addition, that illustrate the main features of the templating language.

Adding Two Numbers

The below template is used to generate questions of the form _ + _ = ?

template "add 2 numbers" {
  question is "<x>+<y>=?"
  answer is "<sum>"
  variables {
    x, y are integers between 1 and 9
    sum = x + y
  }
}

The question and answer are simply strings, with variable names denoting the “holes.” Variables come in two flavors: generated (x and y) and derived (sum). Generated variables are bound to a sample set, which could be a range of integers, numbers with two decimal places, or even the set of Fibonacci numbers. Derived variables are defined by mathematical expressions.

Finding Quadratic Roots

Let’s return to the earlier question of finding the roots of a quadratic. Following the addition example, we might try:

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
}

Here, we generate each coefficient (a, b, c) from the range [-10, 10]. However, the table below illustrates an issue with this template: depending on the coefficients, the solutions can be integers, fractions, irrational numbers, or even complex numbers.



(a, b, c)      Roots
(1, -10, 16)   2, 8
(-4, 7, -3)    0.75, 1
(3, 6, 2)      -1.577…, -0.422…
(1, 2, 1)      -1 (repeated root)
(1, 4, 8)      -2 + 2i, -2 - 2i

Because of this, the template can represent questions requiring different skill sets and mastery levels. It is important to give instructors the ability to maintain a consistent difficulty level, and to control the concepts covered by a single template. This is achieved using constraints.

template "quadratic roots" {
  question is "<a>x^2 + <b>x + <c> = 0. Solve for x."
  answer is "x = <root1>, <root2>"
  variables {
    a, b, c are integers between -10 and 10
    discriminant = b^2 - 4*a*c
    root1 = (-b + sqrt(discriminant))/(2*a)
    root2 = (-b - sqrt(discriminant))/(2*a)
  }
  constraints {
    root1, root2 must be integers
    root1 <> root2
    gcd(a, b, c) = 1
  }
}

In general, constraints are useful for narrowing the skill set covered by the template and for ensuring that instantiations of the template are sufficiently varied. In this example, requiring distinct integer roots controls the difficulty level, while requiring gcd(a, b, c) = 1 ensures that no question is a multiple of another.

Car Distance

Another important feature of the templating language is the ability to specify wrong answers.

template "car distance" {
  question is "How far does a car travel in <time> minutes traveling <rate> miles/hour?"
  answer is "<dist> miles"
  variables {
    time is an integer between 30 and 90 divisible by 10
    rate is an integer between 40 and 65 divisible by 5
    dist = rate*time/60
    wrong1 = rate*time
    wrong2 = rate/time
    wrong3 = rate/time/60
  }
  wrong answers {
    "<wrong1> miles", "<wrong2> miles", "<wrong3> miles"
  }
}

Wrong answers can be either static or variable. What’s powerful about this feature is that each wrong answer might be associated with a particular deficiency or misconception. In the example, a student might choose “rate/time/60” because she doesn’t know the correct distance formula, while another student might choose “rate*time” because she has trouble converting units. This is another source of information that Knewton can use to provide more targeted recommendations.

The Problem Generation Algorithm

Great, so we have a template. Now how do we actually generate questions? My first inclination was to start with the simplest possible algorithm:

  1. Go down the list of variables, selecting values for the generated variables uniformly at random from their sample sets, and using the formulas to compute the derived variables.

  2. If the variables satisfy all of the constraints, add the resulting question to the list of questions.

  3. Repeat.

This naive algorithm performs nicely given one key assumption: a large enough fraction of the sample space (the set of all possible questions, i.e., the Cartesian product of the sample sets) must meet the constraints specified in the template. For instance, if 100 questions are desired and the algorithm can run 100,000 iterations, then roughly 1 in 1,000 candidate questions needs to be valid. This isn’t too daunting: as long as we offer an expressive library of sample sets and constraints, instructors can be expected to provide templates meeting this requirement.
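As an illustration, the naive algorithm applied to the quadratic-roots template might look like the following Python sketch (a hypothetical re-implementation for exposition only; the actual engine was written in Scala and driven by the DSL):

```python
import math
import random

def generate_quadratics(n_questions, max_iters=100_000, rng=None):
    """Rejection sampling: draw coefficients uniformly at random and
    keep only those satisfying the template's constraints."""
    rng = rng or random.Random(0)
    questions = []
    for _ in range(max_iters):
        # Step 1: sample the generated variables, compute the derived ones.
        a = rng.choice([k for k in range(-10, 11) if k != 0])
        b = rng.randint(-10, 10)
        c = rng.randint(-10, 10)
        disc = b * b - 4 * a * c
        if disc <= 0:
            continue  # roots would be complex or repeated
        s = math.isqrt(disc)
        if s * s != disc:
            continue  # roots would be irrational
        root1 = (-b + s) / (2 * a)
        root2 = (-b - s) / (2 * a)
        # Step 2: check the remaining constraints.
        if root1 != int(root1) or root2 != int(root2):
            continue  # roots must be integers
        if math.gcd(math.gcd(a, b), c) != 1:
            continue  # no question should be a multiple of another
        questions.append((a, b, c))
        if len(questions) == n_questions:
            break
    return questions

print(len(generate_quadratics(100)))
```

Note how each `continue` corresponds to one constraint from the template; the loop simply discards invalid draws, which is exactly why a reasonable fraction of the sample space must be valid.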

It is very difficult to come up with a more efficient general-purpose approach. For some problems, algorithms do exist to generate solutions directly (see Euclid’s method for Pythagorean triples), but for others it is mathematically impossible (see Hilbert’s tenth problem). In many cases, introducing heuristics can improve the random algorithm. For instance, it may be possible to rule out a large chunk of the sample space that leads to solutions that are too large, non-integral, negative, etc.

The Prototype

I chose to implement the assessment generator in Scala for several reasons:

  • Scala’s interoperability with Java made integrating with the rest of the Knewton code base an easy task.

  • Scala’s powerful Parser Combinators library made implementing the template DSL straightforward. Because of their low overhead, I also used parser combinators for converting math expressions like “sqrt(x^3 + 5)” into my internal data model. While there are many existing Java/Scala libraries that accomplish this, I was unable to find any capable of manipulating general mathematical objects, such as fractions, matrices, polynomials, polygons, and groups.

  • Scala’s parallel collections allowed running iterations of the problem generator routine in parallel. Doing so only required swapping out a Map with a ParMap, and appending “.par” to the main program loop.

*For examples of solution-oriented approaches in the literature, see (survey) and (grade school math).

*For examples of template-based approaches in the literature, see (embedded systems).

Ku Pima: To Measure

Trivia: Ku Pima means “To Measure” in the Xitsonga language.

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic, in this case Engagement, along with data from different sources within our products, and engineer tools to calculate this analytic. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, the measure must be analyzed and validated on a larger scale than the one initially used to develop it. This necessitates a feedback loop between the data analysis and data infrastructure components.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There are many ways of quantifying this: time spent on the system, number of logins per week, time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that takes the student interactions and recommendations into account and calculates the engagement analytic. Further, engagement is an example of an analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: A user’s intention matters because it can make an analytic more or less useful. For example, consider a student who did not take the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). In this situation, we would have to treat the event as a data point that is not useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic one way as opposed to another. It might be more efficient to infer an analytic from a readily available dataset than from a much richer, hard-to-obtain dataset that provides only a small boost in accuracy at a large cost in resources.

Computing Engagement

The Engagement pipeline takes in data generated by Knewton and its partners that contains student interactions as well as the recommendations sent to students. From this data the student learning/task sessions are inferred, and from those sessions we calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.


After the initial pipeline is engineered, there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected. We need to know whether, on sample data, the pipeline produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Generative Models for Description and Prediction

This summer I worked on the Analytics team at Knewton. One of the goals of analytics is to condense the huge amount of information generated by students interacting with educational products and provide teachers with data-driven insights into how their students are doing.

In this post, I will discuss an approach to conveying information about student behaviors using generative models.

A generative model is a description of how to generate data that is similar to the observed data.

Imagine we have a stream of binary observations for each student. The observations represent whether the student did any work on the system today, whether the student answered questions correctly as she progressed through a course, or anything else about the student that can be represented in binary terms. The observations are viewed as a stream; over time more observations are added.

Here are some examples of observations of correctness on practice questions from when each student started a course until today.

StudentA: “1111111110”

StudentB: “100011110111100111111011100110011”

StudentC: “111011110111111111111111111111111”

We could specify different generative models for each student. One possible generative model for student A could be:

Model #1 (Student A)

“print nine 1’s and then one 0”.

If we wanted to describe student B similarly the description would be:

Model #1 (Student B)

“print one 1, then three 0’s, then four 1’s, then one 0 …”.

Models like Model #1 exactly preserve all of the information that was present in the student data, but in doing so they don’t reduce the complexity of the observations at all. Often, in communicating information about student behaviors, we seek to preserve the important information while reducing the complexity so we can communicate concisely.

A generative model does not need to reproduce the data exactly. One way of reducing complexity while preserving important information is to attribute parts of the observation that we believe to be unimportant as being generated from a process with some randomness.

Throughout this post, we’ll use a simple generative model for demonstration. The generative model belongs to a family of models, parametrized by a value w.

Model #2:

“Flip a weighted coin that lands on heads with probability w. If it’s heads, print 1, otherwise print 0”.
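Model #2 is simple enough to write down directly. Here is a small Python sketch (illustrative only; the function name and interface are my own):

```python
import random

def model2(w, n, rng=None):
    """Generate n observations from Model #2: flip a coin that lands
    heads with probability w; print 1 for heads, 0 for tails."""
    rng = rng or random.Random(0)
    return "".join("1" if rng.random() < w else "0" for _ in range(n))

print(model2(0.9, 10))  # a StudentA-like stream, mostly 1's
```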

In reporting to a teacher, I can report that the model that best fits StudentA’s history is w=0.9.

That can be pretty informative. It means that in the family of Model #2, w=0.9 gives the model that is closest to the full description of StudentA (using KL divergence as a measure of closeness). Often, a student’s full data description is inconvenient to communicate, but now I can summarize it concisely with a description of the model family (Model #2) and a single parameter (w=0.9).

While the model parameters may be meaningful themselves (in this case the parameter is the student’s overall score on practice questions), they also define a space in which we can compare students to each other (or the same student at different points in time). For example, I can note that StudentA is best described by w=0.9 while StudentB is best described by w≈0.65. If I wanted to find the student most similar to StudentC (w≈0.94), I would find the student with the closest value for w and choose StudentA. Unlike standard string comparisons (e.g., Levenshtein distance, which would yield StudentB as the closest student to StudentC), a distance measure based on a generative model is not sensitive to differences in the number of observations for each student.
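This comparison is easy to sketch in code. Fitting Model #2 by maximum likelihood amounts to taking the fraction of 1’s (the sample mean), after which students can be compared in the one-dimensional space of w (a hypothetical sketch using the streams above):

```python
def fit_w(observations: str) -> float:
    # The maximum-likelihood estimate of w is the fraction of 1's.
    return observations.count("1") / len(observations)

students = {
    "StudentA": "1111111110",
    "StudentB": "100011110111100111111011100110011",
    "StudentC": "111011110111111111111111111111111",
}
w = {name: fit_w(obs) for name, obs in students.items()}

# Nearest neighbour to StudentC in model space rather than string space:
closest = min(("StudentA", "StudentB"), key=lambda s: abs(w[s] - w["StudentC"]))
print(closest)  # StudentA
```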

In this case, because the generative model is quite simple, I’m just reporting a summary statistic of the data (in this case the sample mean), but that is not always the case.*

In choosing Model #2 to convey my information, I’ve consciously made some decisions about which qualities of the data are important. For example, this model is insensitive to the order of the data. If I were to imagine a new student (StudentD) whose data was the same as StudentA’s but in reverse order, the two students would be reported identically. If I believed that the order of the 1’s and 0’s was important, I could choose a different generative model. While many generative models will be valid for describing the same set of observations, some of these models will describe the data better, so it is also possible to empirically evaluate the relative descriptiveness of models.

Generative models can also be used for prediction. For concreteness, let’s imagine that the data we’re interested in is related to student work habits and the 0’s and 1’s represent whether the student interacted with the system in each hour. If we want to predict how many hours a student will spend on the system next week, we could fit a generative model to the student’s past habits and then run that model forward from where it left off to estimate a distribution for hours spent on the system next week.
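Under Model #2, “running the model forward” is just more coin flips. A hedged sketch (the function name and the 168-hour week are my own framing, not Knewton’s):

```python
import random

def predict_active_hours(history: str, horizon=168, n_sims=10_000, rng=None):
    """Fit Model #2 to a student's past hourly activity, then run the
    model forward `horizon` hours to sample a distribution over how
    many hours the student will be active."""
    rng = rng or random.Random(0)
    w = history.count("1") / len(history)
    return [sum(rng.random() < w for _ in range(horizon))
            for _ in range(n_sims)]

sims = predict_active_hours("1101001110" * 10)   # w = 0.6
print(sum(sims) / len(sims))  # roughly 0.6 * 168 ≈ 101 hours
```

The returned samples form the distribution "X" referred to below: a description of past habits, phrased as a projection of the coming week.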

In one sense this is still a description of the past work habits of a student, just phrased as a prediction of the future: “If these habits continue for another week, the distribution over time spent in the system is X.”

In some cases, this future prediction may be the most useful type of summary. For example, if a teacher wants to intervene before a struggling student fails an exam, an extremely useful description of the student’s improvements of proficiency is phrased this way: “If this rate of improvement continues until the exam date, the student is projected to get a 20% on the exam!”

*For example, a common generative model used for summarizing text documents, the Latent Dirichlet Allocation (LDA) model, is parameterized by a set of distributions. These distributions can be viewed as a summary of the documents.

Another example is the IRT model of student proficiency described by Alex in his N choose K post. Reporting parameters from this model forms the basis of how we at Knewton communicate about what students know.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni discussed in a previous post, Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a two-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question answered at that time: red dots are questions answered incorrectly, green dots are questions answered correctly, and the {y}-value of each dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions the student has answered.

In this post, I will discuss three methods we used to evaluate the performance of our algorithms, along with their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.


A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.
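In code, the MSE is essentially a one-liner (a generic sketch, not Knewton’s implementation):

```python
def mse(actual, recovered):
    # Mean of the squared differences between corresponding parameters.
    return sum((a - r) ** 2 for a, r in zip(actual, recovered)) / len(actual)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]))  # small errors -> small MSE
```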

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario: the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice versa; we cannot hope to know their values absolutely. So there are many equally valid interpretations of the same data at different scales. Because the MSE looks at the difference between the two sets of parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!
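A quick way to see this numerically: in a Rasch-style (1PL) model, the response probability depends only on the difference between proficiency and difficulty, so translating every parameter by the same constant leaves the likelihood of the data untouched while inflating the MSE. (This sketch uses translation for simplicity; the same indeterminacy arises with scale in richer IRT models.)

```python
import math

def p_correct(theta, beta):
    # 1PL item response function: depends only on theta - beta.
    return 1 / (1 + math.exp(-(theta - beta)))

thetas = [-1.0, 0.0, 1.5]   # "actual" student proficiencies
betas = [-0.5, 0.8]         # "actual" question difficulties
shift = 2.0                 # "recovered" parameters, offset by a constant

# Every predicted response probability is identical...
same = all(
    abs(p_correct(t, b) - p_correct(t + shift, b + shift)) < 1e-12
    for t in thetas for b in betas
)
# ...yet the MSE between actual and recovered proficiencies is shift^2.
mse = sum(((t + shift) - t) ** 2 for t in thetas) / len(thetas)
print(same, mse)  # True 4.0
```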

Let’s look at Figure 2. We see that the ordering is basically preserved, which we could measure with Pearson’s {r}, but the recovered parameters are not on the same scale as the actual ones. If they were, the line of best fit would have slope one and {y}-intercept {0}; in this case, however, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.


Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At initialization, our algorithms should have low log-likelihoods: the parameters we guess first are random and don’t fit the data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer it correctly, then maximizing the log-likelihood will mark this very hard question as easier than it actually is. This could, of course, be solved with more data from large numbers of extremely proficient students, but such data is often hard to come by.

Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\#\text{ of data}} \Pr(\text{datum} \mid \theta^{\text{act}} )


\displaystyle P'(\text{datum}) = \frac{1}{\#\text{ of data}} \Pr(\text{datum} \mid \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and the question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \sum_i \ln\left(\frac{P(i)}{P'(i)}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL divergence will be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
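For reference, the discrete form of the KL divergence can be computed directly (a generic sketch assuming {P} and {P'} share support):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || P') = sum_i P(i) * ln(P(i) / P'(i));
    # terms with P(i) = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.7, 0.3], [0.7, 0.3]))      # 0.0 when P = P'
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)  # True, by Gibbs' inequality
```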

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about how well the parameters fit the data, not about exact convergence to the original parameters. This makes them reliable measures for judging our algorithms. For instance, even though the MSE of our recovered parameters is large, the figure below shows that our algorithm has likely converged (since the log-likelihood and KL divergence are no longer changing much) and gives us a reliable measure of model fit.


The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.