Latent Skill Embedding

In this post, we explore a probabilistic model of student learning and assessment that can be used to recommend personalized lesson sequences. A student using Knewton-powered products currently receives content recommendations one step at a time. The model presented in this blog post is motivated by the idea that such a student could receive additional context and motivation by also seeing the most likely trajectory, or sequence of recommendations, leading to her goals, given her current state.


Students using Knewton-powered adaptive learning products are working in an online environment, acquiring knowledge that enables them to pass specific assignments and achieve their learning goals. These students receive recommendations from Knewton about what to work on, in a sequence that is determined by the work the student has previously done.

The goal of this blog post is to explore a new way of evaluating pathways through content based on their ability to improve student learning. We leverage existing log data (entries that look like Student A completed Lesson B and Student C passed assessment D) to compare the performance of different lesson sequences. Moreover, our method can be leveraged to generate displayable lesson sequences that maximize educational gains while meeting practical time constraints (i.e., the student has at most n days to maximize her probability of passing goal assessments). The additional context these sequences of recommendations provide could enhance students’ motivation, and consequently their overall performance.

General Approach

We use historical student-module interactions to learn a joint probabilistic embedding of the latent skills of students, skill requirements of assessments, and skill gains from lessons (modulated by prerequisite knowledge). Through offline evaluations on log data, we have shown that this model can successfully distinguish between lesson sequences that lead students to mastery or failure. In this post, we will outline how the model can be used to recommend lesson sequences in real time.


Probabilistic embeddings are used for a variety of tasks, including music playlist prediction,1 feedback propagation in MOOCs,2 visualization of high-dimensional data,3 and language modeling for natural language processing.4 We are interested in embeddings that perform nonlinear dimensionality reduction on high-dimensional data, and also can be used to learn useful relationships between objects that are not explicitly expressed in the data. We will use a probabilistic embedding to capture the dynamics of student learning and testing, and in the process we will describe a method to recommend personalized lesson sequences. Our approach resembles a proposed sparse factor analysis method.5

In our model, students, lessons, and assessments are jointly embedded in a low-dimensional semantic space that we call the “latent skill space.” Students have trajectories through the latent skill space, while assessments and lessons have fixed locations. Formally, a student is represented as a set of d latent skill levels \vec{s}_t \in \mathbb{R}_+^d that can change over time; a lesson module is represented as a vector of skill gains \vec{\ell} \in \mathbb{R}_+^d and a set of prerequisite skill requirements \vec{q} \in \mathbb{R}_+^d; an assessment module is represented as a set of skill requirements \vec{a} \in \mathbb{R}_+^d.

Students interact with lessons and assessments in the following way. First, a student can be tested on an assessment module with a pass-fail result R \in {0, 1}, where the likelihood of passing is high when a student has skill levels that exceed all of the assessment requirements and vice versa. Second, a student can work on lesson modules to improve skill levels over time. To fully realize the skill gains associated with completing a lesson module, a student must satisfy its prerequisites; fulfilling the prerequisites only partially yields relatively small gains. Time is discretized such that at every timestep t \in \mathbb{N}, a student completes a lesson and may complete zero or many assessments. The evolution of student knowledge can be formalized as an input-output hidden Markov model.

There are two equations that characterize the dynamics of the latent skill space: the “assessment result likelihood,” and the “learning update.” Both are discussed in the following two sections.

Modeling Assessment Results

For student \vec{s}, assessment \vec{a}, and pass-fail result R,

    \[ R \sim \mathrm{Bernoulli}(\phi(\Delta(\vec{s},\vec{a}))) \]

where \phi is the sigmoid function, and

    \[ \Delta(\vec{s},\vec{a}) = \frac{\vec{s} \cdot \vec{a}}{|\vec{a}|} - |\vec{a}| + \gamma_s + \gamma_a \]

where \gamma_s, \gamma_a \in \mathbb{R}. A pass result is indicated by R=1, and a fail by R=0. The term \frac{\vec{s} \cdot \vec{a}}{|\vec{a}|} can be rewritten as |\vec{s}|\cos(\theta), where \theta is the angle between \vec{s} and \vec{a}; it can be interpreted as “relevant skill”. The term |\vec{a}| can be interpreted as assessment difficulty. The bias term \gamma_s is a student-specific term that captures a student’s general (assessment-invariant and time-invariant) ability to pass; it can be interpreted as a measure of how often the student guesses correct answers. The bias term \gamma_a is a module-specific term that captures an assessment’s general (student-invariant and time-invariant) difficulty. These bias terms are analogous to the bias terms used for modeling song popularity in playlists.1
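As a concrete illustration, here is a minimal Python sketch (not Knewton's implementation) of this likelihood, assuming the embedding vectors are held in numpy arrays and the bias terms default to zero:

```python
# A minimal sketch of the assessment result likelihood.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pass_probability(s, a, gamma_s=0.0, gamma_a=0.0):
    """Probability that a student with skill vector s passes assessment a."""
    delta = np.dot(s, a) / np.linalg.norm(a) - np.linalg.norm(a) + gamma_s + gamma_a
    return sigmoid(delta)

# Example: a student slightly below the assessment's requirements.
s = np.array([0.6, 0.8])
a = np.array([0.7, 1.0])
print(pass_probability(s, a))  # roughly 0.45, i.e. a borderline student
```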

Modeling Student Learning from Lessons

For student \vec{s} who worked on a lesson with skill gains \vec{\ell} and prerequisites \vec{q} at time t+1, the updated student state is

    \[ \vec{s}_{t+1} \sim \mathcal{N}\left(\vec{s}_t + \vec{\ell} \cdot \phi(\Delta(\vec{s}_t,\vec{q})), \sigma^2\mathbf{I}_d\right) \]

where \Delta(\vec{s}_t,\vec{q}) = \frac{\vec{s}_t \cdot \vec{q}}{|\vec{q}|} - |\vec{q}| and \sigma is a constant. The intuition behind this equation is that the skill gain from a lesson should be weighted according to how well a student satisfies the lesson prerequisites. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites. The same principle applies to satisfying assessment skill requirements. Figure 0 illustrates the vector field of skill gains possible for different students under fixed lesson prerequisites. Without prerequisites, the vector field is uniform.

Figure 0

The vector field of skill gains for a lesson with skill gains (0.5, 1) and prerequisites (0.7, 0.3). Contours are drawn for varying update magnitudes. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites.
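A matching sketch of the learning update, under the same assumptions as the snippet above; clamping skill levels to be non-negative is an illustrative choice mirroring the model's non-negativity constraints:

```python
# Minimal sketch of the learning update: skill gains are discounted by how
# well the student satisfies the lesson prerequisites, then Gaussian noise is added.
import numpy as np

def learning_update(s, lesson_gain, prereq, sigma=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    delta = np.dot(s, prereq) / np.linalg.norm(prereq) - np.linalg.norm(prereq)
    weight = 1.0 / (1.0 + np.exp(-delta))            # sigmoid of prerequisite satisfaction
    mean = s + lesson_gain * weight
    return np.maximum(rng.normal(mean, sigma), 0.0)  # keep skill levels non-negative

s_next = learning_update(np.array([0.6, 0.8]),
                         lesson_gain=np.array([0.5, 1.0]),
                         prereq=np.array([0.7, 0.3]))
```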

Parameter Estimation

How do we take a bunch of logged student-module interactions and turn them into an embedding? We compute maximum a posteriori (MAP) estimates of model parameters \Theta by maximizing the following objective function:

L(\Theta) = \sum\limits_{\mathcal{A}} \log{\Pr(R \mid \vec{s}_t, \vec{a}, \gamma_s, \gamma_a)} + \sum\limits_{\mathcal{L}} \log{\Pr(\vec{s}_{t+1} \mid \vec{s}_t, \vec{\ell}, \vec{q})} - \beta \cdot \lambda(\Theta)

where \mathcal{A} is the set of assessment interactions, \mathcal{L} is the set of lesson interactions, \lambda(\Theta) is a regularization term that penalizes the L_2 norms of embedding parameters to reduce overfitting, and \beta is a regularization parameter. Non-negativity constraints on embedding parameters are enforced. We solve the optimization problem using the L-BFGS-B algorithm, which extends the popular L-BFGS algorithm to handle box constraints. We randomly initialize parameters and run the iterative optimization until the relative difference between consecutive objective function evaluations is less than 10^{-3}. Averaging validation accuracy over multiple runs during cross-validation reduces sensitivity to the random initializations (since the objective function is non-convex).
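The sketch below shows the general shape of such a fit with SciPy's L-BFGS-B solver. The objective here is only a stand-in for the real regularized log-likelihood, and flattening the embedding into a single parameter vector is an assumption made for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def negative_objective(theta, beta=0.1):
    # Placeholder: a real implementation would unpack theta into student,
    # lesson, and assessment embeddings and sum the interaction log-likelihoods.
    log_likelihood = -np.sum((theta - 1.0) ** 2)  # stand-in for the data terms
    regularizer = beta * np.sum(theta ** 2)       # L2 penalty on embedding parameters
    return -(log_likelihood - regularizer)

theta0 = np.random.rand(10)              # random initialization
bounds = [(0.0, None)] * len(theta0)     # non-negativity (box) constraints
result = minimize(negative_objective, theta0, method="L-BFGS-B",
                  bounds=bounds, options={"ftol": 1e-3})
print(result.x)
```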

Toy Examples

To illustrate how an embedding can capture the underlying geometry of different scenarios, we conducted a series of experiments on small, synthetically-generated interaction histories.


Figure 1

The key observation here is that the model recovered positive skill gains for L1, and “correctly” arranged Alice and A1 in the latent space. Initially, Alice fails A1, so her skill level is behind the requirements of A1. After completing L1, Alice passes A1, indicating that her skill level has probably improved past the requirements of A1. Note that this scenario could have been explained with only one latent skill.

Figure 2

A two-dimensional embedding, where an intransitivity in assessment results requires more than one latent skill to explain. The key observation here is that the assessments are embedded on two different axes, meaning they require two completely independent skills. This makes sense, since student results on A1 are uncorrelated with results on A2. Fogell fails both assessments, so his skill levels are behind the requirements for A1 and A2. McLovin passes both assessments, so his skill levels are beyond the requirements for A1 and A2. Evan and Seth are each able to pass one assessment but not the other. Since the assessments have independent requirements, this implies that Evan and Seth have independent skill sets (i.e. Evan has enough of skill 2 to pass A2 but not enough of skill 1 to pass A1, and Seth has enough of skill 1 to pass A1 but not enough of skill 2 to pass A2).

Figure 3

We replicate the setting in Figure 2, then add two new students Slater and Michaels, and two new lesson modules L1 and L2. Slater is initially identical to Evan, while Michaels is initially identical to Seth. Slater reads lesson L1, then passes assessments A1 and A2. Michaels reads lesson L2, then passes assessments A1 and A2. The key observation here is that the skill gain vectors recovered for the two lesson modules are orthogonal, meaning they help students satisfy completely independent skill requirements. This makes sense, since initially Slater was lacking in Skill 1 while Michaels was lacking in Skill 2, but after completing their lessons they passed their assessments, showing that they gained from their respective lessons what they were lacking initially.

Figure 4

We replicate the setting in Figure 2, then add a new assessment module A3 and a new lesson module L1. All students initially fail assessment A3, then read lesson L1, after which McLovin passes A3 while everyone else still fails A3. The key observation here is that McLovin is the only student who initially satisfies the prerequisites for L1, so he is the only student who realizes significant gains.



We have some evidence that an embedding can make good sequence recommendations in an offline evaluation. How can we use an embedding to make sequence recommendations in an online setting? Formally, we must address the following problem: Given a student’s current skill levels and a set of assessments the student wants to pass, what is the optimal sequence of lessons for the student? We can formulate the problem two ways: specify a minimum pass likelihood for the goal assessments and find the shortest path to mastery, or specify a time constraint and find the path that leads to the highest pass likelihood for the goal assessments while not exceeding the prescribed length.

In general, both problems can be tackled by using a trained Latent Skill Embedding model to specify a Markov Decision Process, in which the possible states are given by the student embedding, the set of possible actions corresponds to the set of lesson modules that the student can work on, the transition function is the learning update described above, and the reward function is the likelihood of passing all goal assessments (given by the assessment result likelihood) discounted by the expected time taken to complete the lesson sequence. Various methods for finding the optimal policy for an MDP with a continuous state space, discrete action space, and probabilistic transitions have been explored in the literature.6

For an embedding model without lesson prerequisites, the order in which lessons are completed is irrelevant since sums are commutative. Time-constrained lesson recommendation thus reduces to the 0-1 knapsack problem. In this case, each lesson contributes skill gains, but costs the student an expected amount of time, which can be predicted using matrix factorization. The 0-1 knapsack problem can be solved efficiently using dynamic programming.
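As a sketch of that reduction, with hypothetical per-lesson gains toward a single goal assessment and integerized expected times, a standard 0-1 knapsack dynamic program might look like this:

```python
# Minimal 0-1 knapsack sketch: pick lessons that maximize total skill gain
# without exceeding the student's time budget. The gains and times below are
# hypothetical; in practice they would come from the trained embedding and a
# time-prediction model.
def best_lesson_set(gains, times, budget):
    # best[t] = (max total gain achievable within time budget t, chosen lesson indices)
    best = [(0.0, [])] * (budget + 1)
    for i in range(len(gains)):
        for t in range(budget, times[i] - 1, -1):
            candidate = best[t - times[i]][0] + gains[i]
            if candidate > best[t][0]:
                best[t] = (candidate, best[t - times[i]][1] + [i])
    return best[budget]

gains = [1.2, 0.8, 2.0, 0.5]   # hypothetical skill gains toward the goal
times = [3, 2, 5, 1]           # hypothetical expected time costs
print(best_lesson_set(gains, times, budget=6))  # -> (2.5, [0, 1, 3]) for this data
```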

In practice, content recommendations should intersperse assessments in the lesson sequence to keep students engaged and make sure they are learning. We are currently working on an algorithm that incorporates this idea into the sequence optimization problem.


A Python implementation of the Latent Skill Embedding model, along with the IPython notebooks used to perform data exploration, model evaluations, sensitivity analyses, and synthetic experiments, is publicly available, and the project will continue to be maintained there. The data sets from Knewton used for these experiments are not available to the public.


Siddharth Reddy is a junior at Cornell University, studying computer science and applied mathematics. He’s interested in machine learning, adaptive education, and startups.

This work was a collaboration between Siddharth Reddy, Igor Labutov, and Thorsten Joachims at Cornell University, in conjunction with the data science team at Knewton. The project was presented at the Machine Learning for Education workshop at ICML 2015.


  1. Shuo Chen, Joshua L. Moore, Douglas Turnbull, and Thorsten Joachims. Playlist prediction via metric embedding. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 714–722. ACM, 2012.
  2. Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas Guibas. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969, 2015.
  3. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579–2605):85, 2008.
  4. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
  5. Andrew S. Lan, Christoph Studer, and Richard G. Baraniuk. Time-varying learning and content analytics via sparse factor analysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 452–461. ACM, 2014.
  6. Marco Wiering and Martijn van Otterlo. Reinforcement Learning: State-of-the-Art, volume 12. Springer Science & Business Media, 2012.

Data Visualizations for Model Validation

In order to provide students with sensible, targeted recommendations, the Knewton platform uses a variety of statistical models to track and interpret student progress. These models regularly undergo improvements designed to progressively increase our understanding of student behavior. One tool for assessing the accuracy of these improvements is a model dashboard: a visual display of model fitness metrics that allows a reviewer to interpret a model’s ability to capture and explain diverse student behaviors within the Knewton platform.

Model dashboards are primarily useful for analyzing performance on student behavior observed after the initial model release. Models are rigorously validated during the development phase, wherein data scientists clearly state assumptions made by the model and test how well it explains held-out data collected from student interactions. However, as Knewton launches new adaptive integrations, the population of students, and therefore student behaviors, diversifies and drives the need for further monitoring of model performance. A dashboard that displays quantitative metrics on the model’s function can help pinpoint when and how the model deviates from expectations, providing empirical motivation and direction for adjustments. By monitoring the models during the cycle of modification and validation, data scientists discover ways to make improvements responsibly and scientifically.

Since proficiency estimation is a fundamental pillar of Knewton recommendations, it was the first model that I instrumented with a dashboard. Via the proficiency model, Knewton computes a quantitative measure of a student’s proficiencies on concepts for every recommendation, which can then be used to predict the probability of a student getting an item (e.g., a question) correct. The system then fuses this proficiency data with several other types of inferences to develop a comprehensive understanding of a student’s academic abilities and thereby make personalized recommendations.

Choosing Metrics

One of the most fundamental requirements for the proficiency model is the ability to successfully predict student responses. There are several ways to measure accuracy. Two examples of relevant questions are, “How far off were the predictions from the observations?” and “How well do our predictions discriminate correct from incorrect responses?” The metrics I chose for the dashboard’s initial presentation attempt to address a few of these different nuances.

For each item that Knewton recommends, we compute a response probability (RP) representing our prediction of whether the student will respond correctly to the item given her estimated proficiency in each of its related concepts. For example, a response probability of 0.7 would predict a 70% chance that a student would answer the question correctly. To explore prediction accuracy from a variety of perspectives, I implemented three separate accuracy metrics for the first iteration of the proficiency dashboard.

For the first metric, I used a very rough measure of how close the predictions were to the observed responses by computing a mean squared error (MSE) between the correctness values and the prediction probabilities. While a low MSE is a strong indicator of accurate predictions, a large MSE is not as easily interpreted. Another metric is needed to characterize this case.
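As a minimal illustration (for binary outcomes this quantity is also known as the Brier score), assuming the logged response probabilities and correctness values are held in numpy arrays:

```python
import numpy as np

rp = np.array([0.9, 0.7, 0.2, 0.6])   # predicted response probabilities (hypothetical)
correct = np.array([1, 1, 0, 0])      # observed correctness (1 = correct)
mse = np.mean((rp - correct) ** 2)    # mean squared error between predictions and outcomes
print(mse)
```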

It may be that although the RPs are not near 0 or 1, by defining a threshold of, say, 0.5, the RPs do a reasonable job of separating correct responses from incorrect responses, in that most RPs greater than 0.5 pair with a correct response, and vice versa. It would not be very informative, though, to choose a single threshold, since for different choices of threshold there will often be a tradeoff between correct and incorrect categorizations. In order to easily visualize this tradeoff, I computed a receiver operating characteristic (ROC) curve for the second metric. An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold slides from 0 to 1. More specifically, using the following data, known as a confusion matrix:

                  Incorrect response    Correct response
RP > threshold    FP                    TP
RP < threshold    TN                    FN


the TPR (TP/(TP+FN)) is plotted against the FPR (FP/(FP+TN)). As the probability threshold decreases from 1, a highly predictive model will increase the TPR more rapidly than the FPR, while a completely random prediction would increase them both roughly equally. So, the ROC curve provides an easy interpretation at a glance: a strong deviation from a 45-degree line shows that the RPs are good at separating correct from incorrect responses.
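A sketch of this computation with scikit-learn, using hypothetical response probabilities and correctness values in place of the logged data:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rp = np.array([0.9, 0.7, 0.2, 0.6, 0.8, 0.3])   # hypothetical response probabilities
correct = np.array([1, 1, 0, 0, 1, 0])          # observed correctness

fpr, tpr, thresholds = roc_curve(correct, rp)   # sweep the threshold from 1 down to 0
print(auc(fpr, tpr))                            # area under the ROC curve
```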

For the last metric, I computed the empirical distribution of probabilities for the correct and incorrect responses, displayed as a histogram of RPs. Data scientists examine the asymmetry of RPs over correct versus incorrect responses to validate prior assumptions made about student proficiencies in the model. In addition, outliers in this distribution (such as a peak around high RPs for incorrect responses) may not be apparent in the previous two metrics, yet may reflect subsets of students whose behaviors do not adhere to model assumptions and ought to be investigated.

Technical Implementation

These metrics provide a graded set of interpretations of the model’s accuracy. In order to implement the dashboard, data required by each metric must be transformed over a series of phases from the live service to their final presentable form. The proficiency model runs as a component of the Recommendation Service, which runs on an AWS EC2 instance. The model running within the service takes in student responses and updates student proficiencies online. For each batch of responses, metadata are logged in a format containing RPs and item correctness. These logs are swept on a periodic basis to an S3 bucket, providing a persistent, canonical record of model state at the time that we actually served a student.

A workflow executes periodically to refresh the metrics on display. Steps include ETL scripts for loading the logs into an Amazon Redshift cluster, parsing the logs, transforming them into the metrics, and publishing the results to the dashboard. For displaying these metrics, Knewton currently uses Looker, which provides a sophisticated graphical visualization for the metrics stored in the database.

Looking Forward

The dashboard currently implements the three metrics outlined above, yet there are many other important validation criteria for the proficiency model that will benefit from similar automation and reporting. Iterating on and augmenting the metrics will provide a fresher and more holistic snapshot of model behavior that will continue to expedite how Knewton data scientists track model performance day-to-day. The dashboard that I delivered provides our data scientists with actionable feedback to ensure that our models serve diverse student needs with a powerful adaptive education experience.

Student Latent State Estimation with the Kalman Filter

The Kalman filter is an algorithm that estimates the values of unknown, or latent, variables from a series of noisy measurements taken over time. The Kalman filter has numerous applications in finance and different areas of engineering, and in this blog post, we will show how the Kalman filter can apply to student learning. We will use a Kalman filter to infer an approximation to a student’s ability given their response times to a set of questions.

To fully understand the concepts below, you’ll need a background in basic probability and statistics.


To begin, we are going to describe our inference problem and a model of the system – a set of assumptions of how student learning works.

Given a series of scalar measurements:

z_1, z_2, ..., z_k

where z_i denotes the natural logarithm of the time it took our student to answer question i,

we want to infer the scalar latent variables:

x_1, x_2, ..., x_k

where the student has ability x_i at the time question i is answered.

We model the change in student ability over time as a Gaussian random walk, meaning that the current ability value is based on the previous ability value, plus some Gaussian noise:

x_k=x_{k-1}+w_k, where w_k is drawn from N(0, \tau \sigma^2_x), where \tau is the time the student spent between questions, and \sigma^2_x is a hyperparameter that corresponds to the variance for this latent process (1).

Having the variance of the noise increase linearly with the time difference makes our equation consistent with a continuous Gaussian random walk, and is consistent with our intuition that a student’s ability needs time to change. In other words, a student is unlikely to experience significant change in ability if they’re between questions in a quiz, but if it’s been a day since they’ve answered a question, they’re more likely to have taken the time to learn more about the concept, or, conversely, they might have forgotten the material. Because the latent state variance is time-dependent, our filter is technically called a hybrid Kalman filter, since it assumes a continuous-time model for a discrete set of observations.

We don’t assume that the student ability x_k accounts for all the variability in the log of the student response time, z_k. For example, it’s possible that the student is familiar with this particular problem or is distracted by her environment. Therefore, we say that the observed times are likewise corrupted by some Gaussian noise, v_k:

z_k=x_k+v_k, where v_k is drawn from N(0, \sigma^2_z), where \sigma^2_z is a hyperparameter that corresponds to the variance of the Gaussian noise (2).

Figure: the graphical model of the hybrid Kalman filter, relating student abilities to observed log response times.

The resulting model is pictured by the diagram above: the ability of the student at the previous question x_{k-1} determines the log response time z_{k-1}. The ability of the student at current question x_k is determined by the ability at the previous question x_{k-1}, and determines the current log response time, z_k.
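As a quick illustration of the generative model in equations (1) and (2), here is a small simulation sketch; the hyperparameter values and time gaps are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_x, sigma2_z = 0.01, 0.25       # hypothetical hyperparameters
taus = rng.integers(1, 60, size=20)   # time elapsed between consecutive questions

x = [0.0]                             # latent ability, starting at x_0 = 0
z = []                                # observed log response times
for tau in taus:
    x.append(x[-1] + rng.normal(0, np.sqrt(tau * sigma2_x)))  # random walk step, equation (1)
    z.append(x[-1] + rng.normal(0, np.sqrt(sigma2_z)))        # noisy observation, equation (2)
```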


The Kalman filter is an algorithm for estimating each x_k a posteriori — that is, it computes an estimate for each x_k given observations Z = {z_1, z_2,...,z_k}. It does this recursively, which means that it only needs an estimate of x_{k-1}, along with observation z_k, to output an estimate for x_k.

To make a new estimate, we first need to compute two intermediate pieces of information: a prior distribution and a likelihood distribution.

A note on syntax:

z_{1:k-1} denotes our observations z_1 through z_{k-1}, and x_{k-1|k-1} represents our estimate of ability at the k-1th question given observations z_{1:k-1}; likewise \sigma^2_{k-1|k-1} represents our estimate of the variance given observations z_{1:k-1}.

f denotes the Gaussian probability density function (pdf) with mean \mu  and variance \sigma^2:

f(x;\mu,\sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2 \sigma^2}}

Calculating our prior distribution

Our prior term represents the knowledge we have of current latent state x_k having seen everything but our current observation z_{k}.

To calculate our prior, we start off with an estimate of x_{k-1}, represented as a Gaussian distribution with mean x_{k-1|k-1} and variance \sigma^2_{k-1|k-1}:

p(x_{k-1} |z_{1:k-1}) = f(x_{k-1};x_{k-1|k-1}, \sigma^2_{k-1|k-1}) (3)

From equation 1, we see that x_k is simply the sum of two independent Gaussian random variables, x_{k-1} and w_k.

From probability theory, we know that the probability density of the addition of two independent random variables can be expressed as a convolution of the two composite probability densities. It happens that the convolution of two Gaussians is also a Gaussian:

p(x_k|z_{1:k-1}) = \displaystyle\int p(x_{k-1}|z_{1:k-1}) p(x_k|x_{k-1}) dx_{k-1} = f(x_k; x_{k|k-1}, \sigma^2_{k|k-1}), where

x_{k|k-1}=x_{k-1|k-1} and

\sigma^2_{k|k-1}=\sigma^2_{k-1|k-1}+ \tau \sigma^2_x  (4)

We call p(x_k|z_{1:k-1}) the prior knowledge we have about x_k. The next step is to look at the current observation, z_k and see what information it adds.

Calculating our likelihood distribution

Our likelihood term represents the information our current observation z_k gives us about our latent state x_k.

From equation 2, we see that the likelihood of our observation z_k given our hidden variable x_k is simply a Gaussian centered at x_k. This becomes our likelihood term:

p(z_k|x_k) = f(z_k; x_k, \sigma^2_z) (5)

Combining prior and likelihood

The Kalman filter combines the prior knowledge we have about x_k and our likelihood term in accordance with Bayes’ rule, by multiplying the prior term with the likelihood term. We call this resulting distribution the posterior, or our estimate of x_k given all the information we have.

Luckily, the multiplication of two Gaussians is still a Gaussian, although unnormalized:

p(x_k|z_{1:k}) = \frac{1}{Z} p(x_k|z_{1:k-1})*p(z_k|x_k) = f(x_k;x_{k|k}, \sigma^2_{k|k}), where Z is a normalizing constant, and where:

C_k=\frac{\sigma^2_{k|k-1}}{\sigma^2_{k|k-1} + \sigma^2_z}

x_{k|k} = x_{k|k-1} + C_k (z_k - x_{k|k-1})

\sigma^2_{k|k} = (\frac{1}{\sigma^2_{k|k-1}} + \frac{1}{\sigma^2_z})^{-1} (6)

To summarize, given a Gaussian posterior distribution for x_{k-1} (Equation 3) and a new observation z_k, the Kalman filter estimates a new Gaussian posterior for x_k (Equation 6). By updating the Kalman filter as we receive new observations, we can obtain fast, real-time estimates of our latent state.
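Here is a compact sketch of these updates in Python; the variable names follow the equations above, and the hyperparameter values are placeholders:

```python
import numpy as np

def kalman_step(x_prev, var_prev, z, tau, sigma2_x=0.01, sigma2_z=0.25):
    """One recursive update of the hybrid Kalman filter.

    x_prev, var_prev: posterior mean and variance for x_{k-1} given z_{1:k-1}
    z: new log response time; tau: time elapsed since the previous question
    """
    # Predict (prior), equation (4)
    x_prior = x_prev
    var_prior = var_prev + tau * sigma2_x
    # Correct (posterior), equation (6)
    gain = var_prior / (var_prior + sigma2_z)
    x_post = x_prior + gain * (z - x_prior)
    var_post = 1.0 / (1.0 / var_prior + 1.0 / sigma2_z)
    return x_post, var_post

# Run the filter over a short sequence of (log response time, elapsed time) pairs.
x_est, var_est = 0.0, 1.0
for z, tau in [(1.2, 5), (0.9, 30), (1.1, 2)]:
    x_est, var_est = kalman_step(x_est, var_est, z, tau)
```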

Open Source Code

An open source implementation of this hybrid Kalman filter algorithm is available on Knewton’s GitHub.


Sophie Chou is a senior at Columbia University majoring in Computer Science. She’s interested in Machine Learning, Natural Language Processing, and becoming a better teacher. You can follow her on twitter @mpetitchou.

Andersen Chen is a senior at Brown University, majoring in Mathematics and Computer Science. He’s interested in data science, startups, and machine learning.

Ku Pima: To Measure

Trivia: Ku Pima means To Measure in the Xitsonga Language

An Analytics (Data Science) team is made up of engineers/scientists with a wide array of skills. This results from the nature of the goals the team has to meet. As an Electrical Engineering major at Wits University, I’ve spent two summers as an instrumentation engineering intern. Instrumentation deals with the task of engineering instruments that can measure certain quantities for industrial processes to be controlled. Examples of environments include manufacturing and chemical plants, houses, or even the International Space Station. I find analytics to be a similar process to instrumentation engineering in that useful measurements are sought and then the instruments to calculate those measures are engineered.

Building the Analytics Pipeline

On the Analytics team at Knewton, the data scientists develop measures that are useful to track, whether directly for a business case or for building blocks for future analytics. Within the Analytics team there is a Data Analysis component that develops analytics (measures). Another component, Data Infrastructure, engineers the pipeline (instruments) to actually calculate these analytics on a large/production scale. Initially an analytic is developed by exploring some interesting idea of a measure, using available organization data, and then refining it to arrive at the final analytic.

My internship was concerned with creating Data Infrastructure (the instrumentation) to compute some analytics at Knewton. My initial major task was to take a newly developed analytic (in this case, Engagement), along with data from different sources within our products, and engineer tools to calculate this analytic. This task encompasses not only the implementation of an algorithm but also the engineering processes necessary to construct the components needed for the measure to be calculated. Further, there is a need to analyze and validate the measure on a larger scale than the initial one used to develop it, which necessitates a feedback loop between the data analysts and the data infrastructure component.

Engagement is a many-faceted construct. One facet is an analytic that serves to indicate how much time a student spends “actively engaged” on the Knewton platform. There are a number of ways it can be defined. Here is the basic premise: let’s say a student is busy with a homework assignment. After they have submitted their task, the Knewton system sends recommendations on which material the student should tackle next. From these interactions one might want to know how engaged a student is with the learning platform. There can be many ways of quantifying this: time spent on the system; number of logins per week; time spent on recommended material, etc. The analytics team is tasked with investigating and developing the analytic into a form that will be useful internally or to a partner. After this initial synthesis, we need to engineer a pipeline that will take student interactions and the recommendations into account and calculate the engagement analytic. Further, this is an example of an analytic that needs to be inferred. By “infer” we mean that we cannot directly observe the data we want and thus have to deduce it from other data.

There Can Be Multiple Reasons to Infer an Analytic

The raw student interaction data needs to be cleaned and clustered: The raw data captures a lot of information, some of which may not be useful. Thus, there is a need for cleaning/filtering. Some student interactions can be clustered and thought of as a single event instead of multiple events. (Think of a student doing multiple tasks consecutively on a single homework assignment.)

You have to categorize the users’ intentions: The user’s intention is important as it can make an analytic useful or less so. For example: there is a situation where the student did not intend to do the next recommended action, not because they thought it was irrelevant, but because they had to move to another task (say, another homework assignment with an upcoming deadline). In this situation we would have to treat this as a data point that would not be useful in calculating engagement.

Resources: Available resources are always a factor for any organization. It might be faster to calculate the analytic in one way as opposed to another. It might also be more efficient to infer an analytic from a readily available dataset than to use a much richer but harder-to-obtain dataset that provides only a small boost in accuracy at a large cost in resources.

Computing Engagement

The Engagement pipeline takes in data generated by Knewton and its partners that contains student interactions as well as recommendations sent to students. From this data the student learning/task sessions are inferred, and from those sessions we calculate the Engagement analytic. The computed analytic is reported, and the data used for its calculation is stored for future use and analysis.


After the initial pipeline is engineered there are still a number of tasks to complete. Validation is very important in determining whether the analytic can be interpreted as expected. We need to know whether, given sample data, it produces results similar to those from the analytic development stage. This part of the process involves some modeling and statistics, requires analysis tools to detect errors, and may lead to refining the analytic itself.

Through this validation process we are able to refine the measure further. Validation gives us a better understanding of the shortcomings of the data we collect and what other data might be useful.

If you’re interested in this subject, I’d also recommend checking out these tech blog posts: Netflix’s Recommendation Engine, Finding similar users on Twitter, How Facebook Enabled Graph Search On Such a Scale.

Parameter Recovery

1. Introduction

Providing students with good study recommendations begins with understanding what they already know. At Knewton, we’re combining existing work on psychometrics and large scale machine learning to perform proficiency estimation more accurately and at a larger scale than ever before.

As Alejandro Companioni discussed in a previous post, Item Response Theory (IRT) is the core model around which our proficiency estimation is built. IRT proposes that there are latent features of students (proficiency) and questions (difficulty, discrimination, and guessability) which explain the accuracy of responses we see on assessments. However, most IRT analyses assume that students bring the same level of proficiency to every question asked of them. This might be fine for a 2-hour standardized test, but we are interested in helping students over an entire course. If we’re doing our job, students should be improving their proficiency as time goes on!

This summer, I made our IRT models more sensitive to gains (and losses) in student proficiencies over time. I’ll leave the details of that model to another post, but as a teaser, I’ll show you our first figure. In this figure, the x-axis represents time as measured by the total number of questions answered. Each dot represents the question that was answered at that time. Red dots represent incorrectly answered questions, while green dots represent correctly answered questions. The {y}-value of a dot is its difficulty. The green line tracks our simulated student proficiency, and the blue line tracks our recovered student proficiency parameter given the questions the student has answered.

In this post, I will discuss three methods that we used to evaluate the performance of our algorithms, along with their relative strengths and weaknesses. We’ll focus on the mean-squared error, the log-likelihood, and the Kullback-Leibler divergence.


A visual representation of our temporal IRT model

2. Evaluating results 

To tackle the problems I faced, I explored statistical inference algorithms on probabilistic graphical models. To judge my results, I simulated classes of students answering sets of questions, and saw how accurately my algorithms recovered the parameters of the students and questions.

One method of quantifying the accuracy of estimates is the mean-squared error (MSE). It takes the mean of the square of the differences between the estimated parameters and their actual values. In symbols, if {\theta^{\text{act}}} is the vector of model parameters used to generate our data, and {\theta^{\text{rec}}} are the parameters our method recovered, then

\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^N \left(\theta^{\text{act}}_i - \theta_i^{\text{rec}}\right)^2.

While the MSE is a good indicator for accuracy in many places, it has problems when models have multiple solutions at different scales. Let’s see why this is through an example.

Suppose we have a class of students answering a set of questions. Suppose the questions are actually very hard and that the students happen to be very proficient at the material. Just looking at this set of data, however, we have no way of knowing that these students are actually very proficient. We can only assume the most likely scenario–the students have average proficiencies and are answering the questions competently. From the data alone, we can only hope to discern the values of item parameters relative to the proficiency of students and vice-versa. We cannot hope to know their values absolutely. So there are many equally valid interpretations at different scales of the same data. Because the MSE looks at the difference between the two parameters, it will penalize parameters that are scaled differently, even though those parameters might be an equally valid solution given the data!

Let’s look at Figure 2. We see that the ordering is basically preserved (which we could measure with Pearson’s {r}), but the recovered parameters are not on the same scale. If they were, the line of best fit would have slope one and {y}-intercept {0}; in this case, however, the slope is closer to {2/3}. In statistical terms, the variance of the recovered parameters is much less than the variance of the actual parameters.


Recovered student proficiencies plotted against actual student proficiencies from a simulation.

The log-likelihood gives us a more meaningful method of measuring accuracy. The log-likelihood tells us the log-probability of our recovered parameters given our data. At instantiation, our algorithms should have low log-likelihoods–the parameters that we guess first are random and don’t fit our data well. Our algorithms should iterate toward higher log-likelihoods, hopefully converging at the set of parameters with the highest log-likelihood. This is the philosophy behind the Expectation-Maximization algorithm. But the log-likelihood is susceptible to tail events. For instance, if a question is in reality extremely hard but, through random chance, a few students with average proficiency answer the question correctly, then maximizing the log-likelihood will lead to marking these very hard questions as easier than they actually are. This, of course, could be solved with more data from large numbers of extremely proficient students, but this data is often hard to come by. Instead, we introduce another way of measuring model fit: the Kullback-Leibler (KL) divergence. Suppose we have probability distributions {P} and {P'}, generated by our original and recovered parameters, respectively. That is,

\displaystyle P(\text{datum}) = \frac{1}{\# \text{ of data}} \Pr(\text{datum} \mid \theta^{\text{act}} )


\displaystyle P'(\text{datum}) = \frac{1}{\# \text{ of data}} \Pr(\text{datum} \mid \theta^{\text{rec}} )

In our case, a datum is one student’s response to one question, and the {\theta} parameters are the student proficiency and question discrimination, difficulty, and guessability discussed in Alex’s post. Then we can define

\displaystyle D_{KL} (P||P') = \sum_i \ln\left(\frac{P(i)}{P'(i)}\right) P(i).

If {P = P'}, then {D_{KL}(P||P') = 0}. Moreover, Gibbs’ inequality implies that {D_{KL}(P||P') \geq 0}. If the original and recovered parameters generate similar probabilities, then the KL divergence should be small. Note that since the KL divergence involves the ratio of the likelihoods, it is less susceptible to the influence of tail events than the log-likelihood alone.
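As a small sketch of how this comparison might be computed, with made-up per-datum likelihoods standing in for those produced by the actual and recovered parameters:

```python
import numpy as np

# Hypothetical per-datum likelihoods under the actual and recovered parameters,
# normalized into distributions P and P' over the observed data.
p_act = np.array([0.8, 0.6, 0.9, 0.4])
p_rec = np.array([0.7, 0.5, 0.85, 0.5])
P = p_act / p_act.sum()
P_prime = p_rec / p_rec.sum()

kl = np.sum(P * np.log(P / P_prime))   # D_KL(P || P'), zero if and only if P == P'
print(kl)
```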

The log-likelihood and KL divergence both use likelihoods to measure fit, which means that they only care about fit of the parameters to the data, and not the exact convergence to the original. So they often prove to be reliable measures of fit to judge our algorithms on. For instance, even though the MSE of our recovered parameters is large, Figure 3 shows us that our algorithm has likely converged (since the log-likelihood and KL divergence are not changing much) and gives us a reliable measure of model fit.


The log-likelihood and KL divergence of our recovered parameters through a run of our algorithm.

The Mismeasure of Students: Using Item Response Theory Instead of Traditional Grading to Assess Student Proficiency

Imagine for a second that you’re teaching a math remediation course full of fourth graders. You’ve just administered a test with 10 questions. Of those 10 questions, two questions are trivial, two are incredibly hard, and the rest are equally difficult. Now imagine that two of your students take this test and answer nine of the 10 questions correctly. The first student answers an easy question incorrectly, while the second answers a hard question incorrectly. How would you try to identify the student with higher ability?

Under a traditional grading approach, you would assign both students a score of 90 out of 100, grant both of them an A, and move on to the next test. This approach illustrates a key problem with measuring student ability via testing instruments: test questions do not have uniform characteristics. So how can we measure student ability while accounting for differences in questions?

Item response theory (IRT) attempts to model student ability using question level performance instead of aggregate test level performance. Instead of assuming all questions contribute equally to our understanding of a student’s abilities, IRT provides a more nuanced view on the information each question provides about a student. What kind of features can a question have? Let’s consider some examples.

First, think back to an exam you have previously taken. Sometimes you breeze through the first section, work through a second section of questions, then battle with a final section until the exam ends. In the traditional grading paradigm described earlier, a correct answer on the first section would count just as much as a correct answer on the final section, despite the fact that the first section is easier than the last! Similarly, a student demonstrates greater ability as she answers harder questions correctly; the traditional grading scheme, however, completely ignores each question’s difficulty when grading students!

The one-parameter logistic (1PL) IRT model attempts to address this by allowing each question to have an independent difficulty variable. It models the probability of a correct answer using the following logistic function:
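In standard notation, writing theta for the current student's ability and beta_j for item j's difficulty, the 1PL item response function takes the form

    \[ \Pr(\text{correct}_j) = \frac{1}{1 + e^{-(\theta - \beta_j)}} \]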

where j represents the question of interest, theta is the current student’s ability, and beta is item j’s difficulty. This function is also known as the item response function. We can examine its plot (with different values of beta) below to confirm a couple of things:

  1. For a given ability level, the probability of a correct answer increases as item difficulty decreases. It follows that, between two questions, the question with a lower beta value is easier.
  2. Similarly, for a given question difficulty level, the probability of a correct answer increases as student ability increases. In fact, the curves displayed above take a sigmoidal form, thus implying that the probability of a correct answer increases monotonically as student ability increases.

Now consider using the 1PL model to analyze test responses provided by a group of students. If one student answers one question, we can only draw information about that student’s ability from the first question. Now imagine a second student answers the same question as well as a second question, as illustrated below.

We immediately have the following additional information about both students and both test questions:

  1. We now know more about student 2’s ability relative to student 1 based on student 2’s answer to the first question. For example, if student 1 answered correctly and student 2 answered incorrectly we know that student 1’s ability is greater than student 2’s ability.
  2. We also know more about the first question’s difficulty after student 2 answered the second question. Continuing the example from above, if student 2 answers the second question correctly, we know that Q1 likely has a higher difficulty than Q2 does.
  3. Most importantly, however, we now know more about the first student! Continuing the example even further, we now know that Q1 is more difficult than initially expected. Student 1 answered the first question correctly, suggesting that student 1 has greater ability than we initially estimated!

This form of message passing via item parameters is the key distinction between IRT’s estimates of student ability and other naive approaches (like the grading scheme described earlier). Interestingly, it also suggests that one could develop an online version of IRT that updates ability estimates as more questions and answers arrive!

But let’s not get ahead of ourselves. Instead, let’s continue to develop item response theory by considering the fact that students of all ability levels might have the same probability of correctly answering a poorly-written question. When discussing IRT models, we say that these questions have a low discrimination value, since they do not discriminate between students of high- or low-ability. Ideally, a good question (i.e. one with a high discrimination) will maximally separate students into two groups: those with the ability to answer correctly, and those without.

This gets at an important point about test questions: some questions do a better job than others of distinguishing between students of similar abilities. The two-parameter logistic (2PL) IRT model incorporates this idea by attempting to model each item’s level of discrimination between high- and low-ability students. This can be expressed as a simple tweak to the 1PL:
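In the same notation as before, with alpha_j as item j's discrimination, the standard 2PL item response function is

    \[ \Pr(\text{correct}_j) = \frac{1}{1 + e^{-\alpha_j(\theta - \beta_j)}} \]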


How does the addition of alpha, the item discrimination parameter, affect our model? As above, we can take a look at the item response function while changing alpha a bit:


As previously stated, items with high discrimination values can distinguish between students of similar ability. If we’re attempting to compare students with abilities near zero, a higher discrimination sharply decreases the probability that a student with ability < 0 will answer correctly, and increases the probability that a student with ability > 0 will answer correctly.

We can even go a step further here, and state that an adaptive test could use a bank of high-discrimination questions of varying difficulty to optimally identify a student’s abilities. As a student answers each of these high-discrimination questions, we could choose a harder question if the student answers correctly (and vice versa). In fact, one could even identify the student’s exact ability level via binary search, if the student is willing to work through a test bank with an infinite number of high-discrimination questions with varying difficulty!


Of course, the above scenario is not completely true to reality. Sometimes students will identify the correct answer by simply guessing! We know that answers can result from concept mastery or filling in your Scantron like a Christmas tree. Additionally, students can increase their odds of guessing a question correctly by ignoring answers that are obviously wrong. We can thus model each question’s “guess-ability” with the three-parameter logistic (3PL) IRT model. The 3PL’s item response function looks like this:
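In the same notation, with chi_j as item j's pseudoguess value, the standard 3PL item response function is

    \[ \Pr(\text{correct}_j) = \chi_j + \frac{1 - \chi_j}{1 + e^{-\alpha_j(\theta - \beta_j)}} \]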


where chi represents the item’s “pseudoguess” value. Chi is not considered a pure guessing value, since students can use some strategy or knowledge to eliminate bad guesses. Thus, while a “pure guess” would be the reciprocal of the number of options (i.e. a student has a one-in-four chance of guessing the answer to a multiple-choice question with four options), those odds may increase if the student manages to eliminate an answer (i.e. that same student increases her guessing odds to one-in-three if she knows one option isn’t correct).

As before, let’s take a look at how the pseudoguess parameter affects the item response function curve:


Note that students of low ability now have a higher probability of guessing the question’s answer. This is also clear from the 3PL’s item response function (chi is an additive term and the second term is non-negative, so the probability of answering correctly is at least as high as chi). Note that there are a few general concerns in the IRT literature regarding the 3PL, especially regarding whether an item’s “guessability” is instead a part of a student’s “testing wisdom,” which arguably represents some kind of student ability.
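Putting the three models side by side, here is a small sketch of their item response functions; the parameter values in the example are illustrative:

```python
import numpy as np

def irt_probability(theta, beta, alpha=1.0, chi=0.0):
    """Item response function: 1PL when alpha=1 and chi=0, 2PL when chi=0,
    and the full 3PL otherwise."""
    return chi + (1.0 - chi) / (1.0 + np.exp(-alpha * (theta - beta)))

theta = 0.5                                                   # a moderately able student
print(irt_probability(theta, beta=0.0))                       # 1PL
print(irt_probability(theta, beta=0.0, alpha=2.0))            # 2PL: sharper discrimination
print(irt_probability(theta, beta=0.0, alpha=2.0, chi=0.25))  # 3PL: guessing raises the floor
```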

Regardless, at Knewton we’ve found IRT models to be extremely helpful when trying to understand our students’ abilities by examining their test performance.


de Ayala, R. J. (2008). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press.
Kim, J.-S., & Bolt, D. (2007). “Estimating Item Response Theory Models Using Markov Chain Monte Carlo Methods.” Educational Measurement: Issues and Practice 38 (51).
Sheng, Y. (2008). “Markov Chain Monte Carlo Estimation of Normal Ogive IRT Models in MATLAB.” Journal of Statistical Software 25 (8).

Also, thanks to Jesse St. Charles, George Davis, and Christina Yu for their helpful feedback on this post!