This post was written by Joy Zheng, a student at Harvard University who interned at Knewton this summer.
How do you quickly visualize and validate data in a large-scale, distributed, and real-time system? How do you make it deployable in multiple cloud environments? And how do you track changes to this data over time?
These are questions that I set out to answer in my internship this summer, as I worked to develop a web-based interface that displays how much data the Knewton system is receiving from hundreds of thousands of Pearson students across the country.
At a very high level, this is how things work: Hundreds of thousands of college kids across the country use a recommended textbook published by Pearson. These books have an online component with which the students interact, learn, take tests and quizzes, etc. Information regarding all this activity flows through to us at Knewton. We process it to make recommendations to each student about how to progress through the course, so that the student can achieve the milestones set by the professor.
For any given student, we need to know something about the student and his or her history of previous answers, as well as something about the course and the textbook that the course uses. We also need to track any custom content that instructors have created. In order to serve up recommendations, the work gets divided up among different services in the system — along with the data.
With this problem set in mind, I came up with a list of overarching goals for a web-based dashboard that would provide this information. Before building the system, I had to keep in mind the following:
Recentness of data: While it isn’t vital for the dashboard to have up-to-the-second data, someone has to be able to login at any time and see fairly recent data — say, from the last hour. We also want to make sure that, when needed, users can get even more recent data on demand.
Responsiveness: How fast can we refresh the data?
Redundancy: When two users want the same piece of data from the platform, the dashboard should not have to fetch it twice.
Ability to get details on the fly: Given thousands of courses, how do you get more descriptive information on the fly?
Ease of use: Using the dashboard should not require deep knowledge of how data is being handled in the platform itself.
Configuration: In line with the last point, the dashboard should be configurable from the website; one should be able to change things like “refresh time” on the fly, for example.
Ultimately, we settled on an implementation where data would be polled at regular intervals from various pieces of the Knewton platform and then cached in memory.
This resulted in three strategies for obtaining the data:
Auto-refresh: We query for updates every hour.
On-demand refreshes: If a user wants the most recent state for some piece of information, they can force a refresh.
Stats over time: Permanent persistence of summary statistics — because we want to be able to see some numbers over time, we save relevant data once every 24 hours.
As a result, data presented to the users is almost always drawn from the cache; we preferred this to live querying for two primary reasons.
First, as mentioned previously, we didn’t want to have to send off queries to the platform for data multiple times if we would be getting the same data back; we only wanted to have to pipe data when it had been updated.
Secondly, speed. After all, drawing information from local cache is faster than having to gather it, and we were even able to cache significant portions of the web pages themselves.
Additionally, as part of collating data, one of the roles of the dashboard was to cross-reference it — to line up the data about a textbook with the data for a course and the data for a student. Thus, the nature of the internal back-end references meant that querying on-demand for say, information about one course or user, would require querying in sequence, rather than querying in parallel.
Given that most of the Knewton platform is written in Java, and non-Java pieces had Java APIs in place, the language choice for the dashboard was very straightforward; admittedly, though, there were points where I missed functional programming, especially when trying to make sure that the dashboard could handle multiple simultaneous users during parallelizing refreshes.
The webpage itself was run using an embedded Jetty servlet. In retrospect, now that I am more familiar with Jetty and having run through a cycle of configurations, I likely would have given this choice more consideration — either by looking at different server technologies or different configuration methods. Especially compared with Ruby on Rails, which defined my previous experience with servers, resolving relative URLS both for webpage and resource files ended up taking notably more effort than expected with this configuration style. This challenge was compounded by the use of JSP (Java Server Page = Java-embedded HTML) files which facilitated UI and CSS better than the Java-generated HTML that I had in the first running dashboard version.
The interaction within the system can be represented in the diagram below:
While testing functionality, I ran the code on a local machine against a very small set of data (really small — like counting actual numbers rather than trailing zeros on your fingers) just to make sure that everything was running and that we were gathering the expected data. Consequently, in our first run against a bigger sized data set (approaching 100,000), some areas of data took a few hours to refresh. We were also piping enough data around to have a tangible effect on the performance of the platform as a whole (ouch!).
To improve performance, we took a few steps:
Reduced field set: First, we took another look at all the data we gathered to make sure that we weren’t collecting anything we didn’t need. While our trimming efforts here probably didn’t have as much of an effect on performance, they definitely reduced the memory footprint.
Delta Refreshes: Next, we made the dashboard smarter by only receiving data that had been updated since the last refresh. Given API restrictions, this wasn’t possible for every service that we queried, but it still gave us a big speed boost.
Lazy loading: Data that we might reasonably want (but could live without 90% of the time) was taken off the automatic update list, and instead only queried and cached on demand. This sometimes meant a wait of 15-30 seconds by users if they really wanted to dig into the details, but it reduced the amount of data we had to cycle through every hour by an order of magnitude.
These tweaks improved performance by orders of magnitude, too, so that refreshes could be measured in minutes or seconds rather than in hours.
All in all, I was able to get the dashboard up and running into production just a few minutes before my final presentation of my internship. The dashboard is now being used officially at Knewton to track data across multiple environments and for several hundred thousand students across the country!