Latent Skill Embedding

In this post, we explore a probabilistic model of student learning and assessment that can be used to recommend personalized lesson sequences. A student using Knewton-powered products currently receives content recommendations one step at a time. The model presented in this blog post is motivated by the idea that such a student could receive additional context and motivation by also seeing the most likely trajectory, or sequence of recommendations, leading to her goals, given her current state.


Students using Knewton-powered adaptive learning products are working in an online environment, acquiring knowledge that enables them to pass specific assignments and achieve their learning goals. These students receive recommendations from Knewton about what to work on, in a sequence that is determined by the work the student has previously done.

The goal of this blog post is to explore a new way of evaluating pathways through content based on their ability to improve student learning. We leverage existing log data (entries that look like Student A completed Lesson B and Student C passed assessment D) to compare the performance of different lesson sequences. Moreover, our method can be leveraged to generate displayable lesson sequences that maximize educational gains while meeting practical time constraints (i.e., the student has at most n days to maximize her probability of passing goal assessments). The additional context these sequences of recommendations provide could enhance students’ motivation, and consequently their overall performance.

General Approach

We use historical student-module interactions to learn a joint probabilistic embedding of the latent skills of students, skill requirements of assessments, and skill gains from lessons (modulated by prerequisite knowledge). Through offline evaluations on log data, we have shown that this model can successfully distinguish between lesson sequences that lead students to mastery or failure. In this post, we will outline how the model can be used to recommend lesson sequences in real time.


Probabilistic embeddings are used for a variety of tasks, including music playlist prediction,1 feedback propagation in MOOCs,2 visualization of high-dimensional data,3 and language modeling for natural language processing.4 We are interested in embeddings that perform nonlinear dimensionality reduction on high-dimensional data, and also can be used to learn useful relationships between objects that are not explicitly expressed in the data. We will use a probabilistic embedding to capture the dynamics of student learning and testing, and in the process we will describe a method to recommend personalized lesson sequences. Our approach resembles a proposed sparse factor analysis method.5

In our model, students, lessons, and assessments are jointly embedded in a low-dimensional semantic space that we call the “latent skill space.” Students have trajectories through the latent skill space, while assessments and lessons have fixed locations. Formally, a student is represented as a set of d latent skill levels \vec{s}_t \in \mathbb{R}_+^d that can change over time; a lesson module is represented as a vector of skill gains \vec{ell} \in \mathbb{R}_+^d and a set of prerequisite skill requirements \vec{q} \in \mathbb{R}_+^d; an assessment module is represented as a set of skill requirements \vec{a} \in \mathbb{R}_+^d.

Students interact with lessons and assessments in the following way. First, a student can be tested on an assessment module with a pass-fail result R \in {0, 1}, where the likelihood of passing is high when a student has skill levels that exceed all of the assessment requirements and vice versa. Second, a student can work on lesson modules to improve skill levels over time. To fully realize the skill gains associated with completing a lesson module, a student must satisfy prerequisites, though fulfilling some of the prerequisites to some extent will result in relatively small gains. Time is discretized such that at every timestep t \in \mathbb{N}, a student completes a lesson and may complete zero or many assessments. The evolution of student knowledge can be formalized as an input-output hidden Markov model.

There are two equations that characterize the dynamics of the latent skill space: the “assessment result likelihood,” and the “learning update.” Both are discussed in the following two sections.

Modeling Assessment Results

For student \vec{s}, assessment \vec{a}, and pass-fail result R,

    \[ R \sim \mathrm{\Bernoulli}(\phi(\Delta(\vec{s},\vec{a}))) \]

where (\phi\) is the sigmoid function, and

    \[ \Delta(\vec{s},\vec{a}) = \frac{vec{s} \cdot \vec{a}}{|\vec{a}|} - |\vec{a}| + \gamma_s + \gamma_a \]

where \gamma_s, \gamma_a \in \mathbb{R}. A pass result is indicated by R=1, and a fail by R=0. The term \frac{\vec{s} \cdot \vec{a}}{|\vec{a}|} can be rewritten as |\vec{s}|\mathrm{cos}(\theta), where theta is the angle between \vec{s} and \vec{a}; it can be interpreted as “relevant skill”. The term |\vec{a}| can be interpreted as assessment difficulty. The bias term \gamma_s is a student-specific term that captures a student’s general (assessment-invariant and time-invariant) ability to pass; it can be interpreted as a measure of how often the student guesses correct answers. The bias term \gamma_a is a module-specific term that captures an assessment’s general (student-invariant and time-invariant) difficulty. These bias terms are analogous to the bias terms used for modeling song popularity in playlists.1

Modeling Student Learning from Lessons

For student \vec{s} who worked on a lesson with skill gains \vec{ell} and prerequisites \vec{q} at time t+1, the updated student state is

    \[ \vec{s}_{t+1} \sim \mathcal{N}\left(\vec{s}_t + \vec{ell} \cdot \phi(\Delta(\vec{s}_t,\vec{q})),\mathbf{\sigma^2\mathbf{I}_d}\right) \]

where \Delta(\vec{s}_t,\vec{q}) = \frac{\vec{s}_t \cdot \vec{q}}{|\vec{q}|} - |\vec{q}| and \sigma is a constant. The intuition behind this equation is that the skill gain from a lesson should be weighted according to how well a student satisfies the lesson prerequisites. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites. The same principle applies to satisfying assessment skill requirements. Figure 0 illustrates the vector field of skill gains possible for different students under fixed lesson prerequisites. Without prerequisites, the vector field is uniform.

Figure 0

The vector field of skill gains for a lesson with skill gains \0.5, 1 and prerequisites 0.7, 0.3. Contours are drawn for varying update magnitudes. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites.

Parameter Estimation

How do we take a bunch of logged student-module interactions and turn them into an embedding? We compute maximum a posteriori (MAP) estimates of model parameters \Theta by maximizing the following objective function:

L(\Theta) = \sumlimits_{\mathcal{A}} \log{(Pr(R \mid \vec{s}_t, \vec{a}, \gamma_s, \gamma_a))} + \sumlimits_{\mathcal{L}} \log{(Pr(\vec{s}_{t+1} \mid \vec{s}_t, \vec{ell}, \vec{q}))} - \beta \cdot \lambda(\Theta)

where \mathcal{A} is the set of assessment interactions, \mathcal{L} is the set of lesson interactions, \lambda(\Theta) is a regularization term that penalizes the L_2 norms of embedding parameters to reduce overfitting, and \beta is a regularization parameter. Non-negativity constraints on embedding parameters are enforced. We solve the optimization problem using the L-BFGS-B algorithm, which extends the popular L-BFGS algorithm to handle box constraints. We randomly initialize parameters and run the iterative optimization until the relative difference between consecutive objective function evaluations is less than 10^{-3}. Averaging validation accuracy over multiple runs during cross-validation reduces sensitivity to the random initializations (since the objective function is non-convex).

Toy Examples

To illustrate how an embedding can capture the underlying geometry of different scenarios, we conducted a series of experiments on small, synthetically-generated interaction histories.


Figure 1

The key observation here is that the model recovered positive skill gains for L1, and “correctly” arranged Alice and A1 in the latent space. Initially, Alice fails A1, so her skill level is behind the requirements of A1. After completing L1, Alice passes A1, indicating that her skill level has probably improved past the requirements of A1. Note that this scenario could have been explained with only one latent skill.

Figure 2

A two-dimensional embedding, where an intransitivity in assessment results requires more than one latent skill to explain. The key observation here is that the assessments are embedded on two different axes, meaning they require two completely independent skills. This makes sense, since student results on A1 are uncorrelated with results on A2. Fogell fails both assessments, so his skill levels are behind the requirements for A1 and A2. McLovin passes both assessments, so his skill levels are beyond the requirements for A1 and A2. Evan and Seth are each able to pass one assessment but not the other. Since the assessments have independent requirements, this implies that Evan and Seth have independent skill sets (i.e. Evan has enough of skill 2 to pass A2 but not enough of skill 1 to pass A1, and Seth has enough of skill 1 to pass A1 but not enough of skill 2 to pass A2).

Figure 3

We replicate the setting in Figure 2, then add two new students Slater and Michaels, and two new lesson modules L1 and L2. Slater is initially identical to Evan, while Michaels is initially identical to Seth. Slater reads lesson L1, then passes assessments A1 and A2. Michaels reads lesson L2, then passes assessments A1 and A2. The key observation here is that the skill gain vectors recovered for the two lesson modules are orthogonal, meaning they help students satisfy completely independent skill requirements. This makes sense, since initially Slater was lacking in Skill 1 while Michaels was lacking in Skill 2, but after completing their lessons they passed their assessments, showing that they gained from their respective lessons what they were lacking initially.

Figure 4

We replicate the setting in Figure 2, then add a new assessment module A3 and a new lesson module L1. All students initially fail assessment A3, then read lesson L1, after which McLovin passes A3 while everyone else still fails A3. The key observation here is that McLovin is the only student who initially satisfies the prerequisites for L1, so he is the only student who realizes significant gains.



We have some evidence that an embedding can make good sequence recommendations in an offline evaluation. How can we use an embedding to make sequence recommendations in an online setting? Formally, we must address the following problem: Given a student’s current skill levels and a set of assessments the student wants to pass, what is the optimal sequence of lessons for the student? We can formulate the problem two ways: specify a minimum pass likelihood for the goal assessments and find the shortest path to mastery, or specify a time constraint and find the path that leads to the highest pass likelihood for the goal assessments while not exceeding the prescribed length.

In general, both problems can be tackled by using a trained Latent Skill Embedding model to specify a Markov Decision Process, in which the possible states are given by the student embedding, the set of possible actions corresponds to the set of lesson modules that the student can work on, the transition function is the learning update (Equation 2), and the reward function is the likelihood of passing all goal assessments (Equation 1) discounted by the expected time taken to complete the lesson sequence. Various methods for finding the optimal policy for an MDP with a continuous state space, discrete action space, and probabilistic transitions have been explored in the literature.6

For an embedding model without lesson prerequisites, the order in which lessons are completed is irrelevant since sums are commutative. Time-constrained lesson recommendation thus reduces to the 0-1 knapsack problem. In this case, each lesson contributes skill gains, but costs the student an expected amount of time,which can be predicted using matrix factorization. The 0-1 knapsack problem can be solved efficiently using dynamic programming.

In practice, content recommendations should intersperse assessments in the lesson sequence to keep students engaged and make sure they are learning. We are currently working on an algorithm that incorporates this idea into the sequence optimization problem.


A Python implementation of the Latent Skill Embedding model is available at The IPython notebooks used to perform data exploration, model evaluations, sensitivity analyses, and synthetic experiments are available at The data sets from Knewton used for these experiments are not available to the public. The project will be maintained at


Siddharth Reddy is a junior at Cornell University, studying computer science and applied mathematics. He’s interested in machine learning, adaptive education, and startups.

This work was a collaboration between Siddharth Reddy, Igor Labutov, and Thorsten Joachims at Cornell University, in conjunction with the data science team at Knewton. The project was presented at the Machine Learning for Education workshop at ICML 2015 (see for details).


  1. Shuo Chen, Joshua L Moore, Douglas Turnbull, and Thorsten Joachims.

Playlist prediction via metric embedding. In Proceedings of the 18th ACM

SIGKDD international conference on Knowledge discovery and data mining,

pages 714–722. ACM, 2012.

  1. Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran

Sahami, and Leonidas Guibas. Learning program embeddings to propagate

feedback on student code. arXiv preprint arXiv:1505.05969, 2015.

  1. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.

Journal of Machine Learning Research, 9(2579-2605):85, 2008.

  1. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities

in continuous space word representations. In HLT-NAACL, pages 746–751,


  1. Andrew S Lan, Christoph Studer, and Richard G Baraniuk. Time-varying

learning and content analytics via sparse factor analysis. In Proceedings of

the 20th ACM SIGKDD international conference on Knowledge discovery

and data mining, pages 452–461. ACM, 2014.

  1. Marco Wiering and Martijn van Otterlo. Reinforcement Learning: State-of-the-art,

volume 12. Springer Science & Business Media, 2012.

What's this? You're reading N choose K, the Knewton tech blog. We're crafting the Knewton Adaptive Learning Platform that uses data from millions of students to continuously personalize the presentation of educational content according to learners' needs. Sound interesting? We're hiring.