Software Engineering Internships at Knewton

I love it when intern candidates ask how we run internships. And if they don’t ask, I tell them anyway and explain why they should ask. What does the answer to “how does a company run internships?” tell you? It’s simple: it tells you how much time you will spend working closely with experienced members of the organization, from engineers to product managers to UX designers. And spending that time working with great engineers is the fastest way to become a great engineer.

How a company runs internships is one of the two fundamental questions you need answered before accepting an internship. The other is, “What is the company culture like?” Of course company brand and money play a role in the decision, but I would not trade my summer for a notch on my resume if it doesn’t also translate into a great summer of becoming a better engineer. That, however, is a topic for another post.

Our approach to internships is simple: every engineering intern becomes a full-fledged member of one of our Product teams.

Making sure that engineers in the company have the time and incentive to teach and spend time with interns is crucial to a great learning experience. Building the company’s lunch app with a team of other interns is a glorified homework assignment that can just as well be done with your college roommate.

Making every intern a full-fledged member of one of our Product teams puts his or her work on par with everyone else’s on the team: the intern’s problem becomes the team’s problem, collaboration with other members of the team (including engineers, PMs, data scientists, etc.) happens naturally, and the work is meaningful and important to the company as well as the team. No one will be too busy to spend time on a problem that they know needs to be solved. Working on a fast-moving Product team is a challenge for someone who lacks experience, so every intern has a dedicated mentor. “Learn and Teach” is one of Knewton’s core values, and everyone goes out of their way to share and help, but the mentor has the explicit responsibility for it, and the mentor’s career growth within the company depends on it.

The final piece of the puzzle is figuring out what to work on. This is tricky and can make or break an internship, so we spend time getting it right. The project has to be in the team’s semi-critical path and have a meaningful outcome. Since “Ship and Learn” is another Knewton core value, an intern will spend a couple of weeks just working with the team—learning, shipping, and seeing what sparks her interest. At the end of that period, with help from her mentor and manager, she will have outlined a project that is meaningful and something she can feel passionate about.

This is a simple approach geared toward cross-functional collaboration and executing a project that matters to the team and the intern. It is about learning how to build web-scale software with an experienced team.

We want interns to have great experiences: learn from our talented engineers, data scientists, and PMs, on top of having an exciting summer in the heart of NYC.

Remember, before committing to an internship, ask how it is run. You will definitely make your interviewer happy and you’ll end up more fulfilled, gaining valuable skills for your future.

Analyzing Java “Garbage First Garbage Collection” (G1GC) Logs

Garbage Collection can take a big toll on any Java application, so it’s important to understand its behavior and impact. After a JVM upgrade of Knewton’s Cassandra database, we needed a tool to compare the performance and impact of different garbage collection strategies, but we couldn’t find an existing tool that would parse GC logs and analyze them.

This post explains the process we followed and discusses some results that we believe may be useful when evaluating Java garbage collection strategies.

Evolution of Garbage Collection in Java

One of the biggest challenges for any programming language is the way it allocates memory while executing a task. Garbage collection is the process of reclaiming sectors of memory that are no longer in use. Garbage collection mechanisms are usually evaluated in three dimensions known as the three legs of the performance stool: latency, throughput, and footprint. In real life there is always a tradeoff, and if all the stars align, you can optimize for two of the three dimensions.

carlos-blog-1

Source: Explorations of the three legged performance stool

Over the course of its history, Java has handled garbage collection in different ways. The early JDK (before 1.4) had a single thread to sweep memory to find out which areas were in use and which ones could be reclaimed.

Java 1.4 introduced more complex algorithms based on tracing collectors. The idea was to find root objects and identify all instances that were still in use. If an object couldn’t be traced back to a root, it was considered not in use, and that area of memory would be reclaimed. The trouble with this approach is that applications had to cease all activity to allow the analysis. Examples of these “stop the world” algorithms include Mark and Sweep, Heap Compaction, and Copying Collector.

Garbage collection strategies before Java 7 relied on the definition of fixed regions of memory named generations:

  • Objects created recently were gathered in the Eden Generation.
  • When the garbage collector ran, objects that could still be traced (i.e., that were in use) were “promoted” to the Survivor Generation; objects that were no longer referenced by other objects were removed.
  • Objects that survived multiple passes were eventually promoted to the Old (Tenured) Generation.

One disadvantage of these garbage collection algorithms is that they require copying and deleting objects during promotions, which temporarily uses more memory and in some cases causes an overflow that can crash an application. Because the size of those memory buckets is inflexible, this approach could limit applications when they needed to scale. Examples of these strategies are Concurrent Mark-Sweep (CMS) and the Parallel Collector.

Java 7 update 4 came with a new algorithm: Garbage First (G1), which divides memory into heap regions of the same size. Unlike CMS, G1 uses a pause prediction model while sweeping memory to identify the regions most likely to contain objects no longer in use, and it targets those regions first for evacuation, hence the name. Only the evacuation process, in which objects from the heap that are still in use are compacted and copied to a different region to avoid fragmentation, has stop-the-world behavior. In our tests, however, the stop-the-world phase has been imperceptible.

G1 doesn’t sort objects by generation, so there are no limits on generation size. The result is a strategy that frees up large portions of memory in less time and that keeps adjusting based on previous garbage collections.

Java 8 supports G1, CMS, and Parallel Collector, and G1 has been proposed as the default strategy for Java 9.

Why Knewton Created a Parser/Analyzer

While upgrading the Cassandra database, the database team searched for a tool that could objectively evaluate the performance of G1 vs. CMS. Although G1 was released in April 2012, the utilities available for analyzing garbage collection logs, like Netflix’s gcviz, didn’t support G1 logs. We also tried an approach proposed by Jeff Taylor of Oracle, but his code was not able to interpret all the information from our servers. Our scripts are an update of Taylor’s approach: with AWK they collect the logs and format them as a table so that they can be interpreted and analyzed with R.

The JVM options that we have been using are:

-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms1995M -Xmx1995M
-Xss256k -XX:StringTableSize=1000003
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:CompileCommandFile=/hotspot_compiler
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=25
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:+ParallelRefProcEnabled
-XX:ParallelGCThreads=2
-XX:ConcGCThreads=2
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/<valid path>/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M

The resulting logs contain entries like the following:

2016-05-12T20:42:14.917+0000: 2.438: Total time for which application threads were stopped: 0.0204068 seconds, Stopping threads took: 0.0000649 seconds
{Heap before GC invocations=2 (full 0):
garbage-first heap  total 2043904K, used 120064K [0x0000000083400000, 0x0000000083503e60, 0x0000000100000000) region size 1024K, 111 young (113664K), 7 survivors (7168K)
Metaspace     used 18551K, capacity 18694K, committed 18944K, reserved 1067008K 
class space    used 2265K, capacity 2299K, committed 2304K, reserved 1048576K
2016-05-12T20:42:15.303+0000: 2.824: [GC pause (G1 Evacuation Pause) (young) 
Desired survivor size 7340032 bytes, new threshold 1 (max 1) - age   1:    6402368 bytes,    6402368 total, 0.0183564 secs]
  [Parallel Time: 17.2 ms, GC Workers: 2]
     [GC Worker Start (ms): Min: 2824.3, Avg: 2824.3, Max: 2824.3, Diff: 0.0]
     [Ext Root Scanning (ms): Min: 2.2, Avg: 2.2, Max: 2.2, Diff: 0.0, Sum: 4.5]
     [Update RS (ms): Min: 0.4, Avg: 0.4, Max: 0.4, Diff: 0.1, Sum: 0.8]
        [Processed Buffers: Min: 2, Avg: 3.5, Max: 5, Diff: 3, Sum: 7]
     [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.2]
     [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.2]
     [Object Copy (ms): Min: 9.2, Avg: 11.7, Max: 14.2, Diff: 5.0, Sum: 23.4]
     [Termination (ms): Min: 0.0, Avg: 2.6, Max: 5.2, Diff: 5.2, Sum: 5.2]
        [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 2]
     [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
     [GC Worker Total (ms): Min: 17.1, Avg: 17.1, Max: 17.1, Diff: 0.0, Sum: 34.3]
     [GC Worker End (ms): Min: 2841.5, Avg: 2841.5, Max: 2841.5, Diff: 0.0]
  [Code Root Fixup: 0.0 ms]
  [Code Root Purge: 0.0 ms]
  [Clear CT: 0.1 ms]
  [Other: 1.1 ms]
     [Choose CSet: 0.0 ms]
     [Ref Proc: 0.6 ms]
     [Ref Enq: 0.0 ms]
     [Redirty Cards: 0.0 ms]
     [Humongous Register: 0.1 ms]
     [Humongous Reclaim: 0.0 ms]
     [Free CSet: 0.1 ms]
  [Eden: 104.0M(104.0M)->0.0B(1189.0M) Survivors: 7168.0K->8192.0K Heap: 117.3M(1996.0M)->15.0M(1996.0M)]
Heap after GC invocations=3 (full 0):
garbage-first heap   total 2043904K, used 15360K [0x0000000083400000, 0x0000000083503e60, 0x0000000100000000) region size 1024K, 8 young (8192K), 8 survivors (8192K)
Metaspace       used 18551K, capacity 18694K, committed 18944K, reserved 1067008K
class space     used 2265K, capacity 2299K, committed 2304K, reserved 1048576K
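
To give a flavor of the parsing step (the published scripts, linked in the gist below, do this with AWK), here is a minimal Python sketch, not the actual code, that turns the “Total time for which application threads were stopped” lines into a table that can be loaded into R:

import csv
import re
import sys

# Matches lines such as:
# 2016-05-12T20:42:14.917+0000: 2.438: Total time for which application
# threads were stopped: 0.0204068 seconds, ...
STOP_LINE = re.compile(
    r"^(?P<timestamp>\S+): (?P<uptime>[\d.]+): "
    r"Total time for which application threads were stopped: "
    r"(?P<stopped>[\d.]+) seconds"
)

def pauses(log_path):
    """Yield one record per stop-the-world pause found in the log."""
    with open(log_path) as log:
        for line in log:
            match = STOP_LINE.match(line)
            if match:
                yield match.groupdict()

if __name__ == "__main__":
    writer = csv.DictWriter(sys.stdout, fieldnames=["timestamp", "uptime", "stopped"])
    writer.writeheader()
    for row in pauses(sys.argv[1]):
        writer.writerow(row)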

The Cassandra upgrade was a consequence of a fix for a Thrift Buffer anomaly; this is another example of the continuous efforts of Knewton’s database team to tune and optimize our database servers.

Results

Knewton has conducted extensive analysis of the performance and tuning of our Cassandra nodes. What follows is some of the information collected from a 24-node cluster, presented as the hypothetical Knerd Cassandra cluster also described in Dr. Josh Wickman’s post. Each node has 2 CPUs and 8 GB of RAM.

The following examples show ways to analyze the gc.log files. They consider only the logs obtained from production Cassandra nodes during a 24-hour window; during that period a full repair of the database and an automated backup were also performed. Note that the results shown come from the test conducted after the change was made; during exploratory phases we used the cassandra-stress tool for functional testing and ran an end-to-end load test using Locust.

The following histogram compares the durations of garbage collection executions and groups them into three categories (a minimal sketch for extracting these values from the logs follows the list):

  1. Real Time: the total time from start to finish of the garbage collection.
  2. User Time: the time spent on processing not related to the kernel (operating system) or to pauses.
  3. Sys Time: the time spent on kernel operations, most of it corresponding to time when the CPU is locked, also known as the stop-the-world scenario.
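
With -XX:+PrintGCDetails enabled, each collection also ends with a line of the form “[Times: user=0.04 sys=0.00, real=0.02 secs]”. The sketch below, again a Python approximation rather than the published AWK/R code, shows one way to collect the three values plotted in the histogram:

import re

# "[Times: user=... sys=..., real=... secs]" is appended to each collection
# when -XX:+PrintGCDetails is enabled.
TIMES_LINE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def gc_times(log_path):
    """Return one (user, sys, real) tuple per collection found in the log."""
    samples = []
    with open(log_path) as log:
        for line in log:
            match = TIMES_LINE.search(line)
            if match:
                samples.append((float(match.group("user")),
                                float(match.group("sys")),
                                float(match.group("real"))))
    return samples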

carlos-blog-2

Time spent on garbage collection in a “stop-the-world” phase (Sys Time) is only a fraction of the time spent on user processing. The garbage collector adapts to minimize the impact on the application.

We can also compare the sizes for each generation before and after the garbage collection and confirm the impact that a Cassandra nodetool repair action can have on the garbage collection.

First, a brief explanation of how repairs work in Cassandra: Cassandra is a masterless database, which means that each server that is part of the same cluster will share data (sometimes several copies of the same record). Each node is responsible for knowing where to find a record for a fraction of the data; each record will have a unique token identification; and each node will be responsible for keeping track of all the copies of specific tokens.

When a repair is requested, each node will be responsible for validating the consistency of all copies of the data. (If there are differences the latest version will prevail.) Depending on the amount of data and the replication factor, this can be a very expensive operation on CPU and memory.

The graph below shows the total memory used before and after each garbage collection.

carlos-blog-3

At (1), the command nodetool repair executed a sequential repair in the entire cluster, which used more memory after the garbage collection since the anti-entropy mechanism of the repair put more objects in use.

carlos-blog-4

Garbage collection is non-deterministic: it does not start at the exact moment the memory threshold is surpassed (2); sometimes it starts fractions of a second afterward. There can also be a delay while G1 determines the areas that will release the most memory; that delay depends on the heap size allocated and how much of it is in use. This can be an area of concern for environments with limited resources (e.g., development virtual environments or Docker containers).

carlos-blog-5

After the repair, the pattern changed: garbage collections effectively decreased the memory used to below 0.25 GB (3). One explanation is that the garbage collector learned from its previous runs while the repair was being executed.

carlos-blog-6

Finally, there is an additional spike (4) caused by our automated backup mechanism.

carlos-blog-7

Beyond the total heap size, we can also analyze and compare its components; in this example we’ll show only the parts of the New Generation, the Eden and Survivor spaces, as they are usually the earliest indicators of issues with memory consumption.

carlos-blog-8
carlos-blog-9

Garbage collection is triggered by the total heap size, so it will happen even when the Eden and Survivor spaces are relatively small.

After garbage collection, the Eden space should be at or close to zero, while the Survivor space gets bigger. The Eden space is also larger than the Survivor space; in this example, by 20:1. This ratio will be different depending on the application.
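
The before/after sizes in these plots come from the “[Eden: ...]” summary lines shown in the log excerpt above. As a rough illustration, assuming that line format, a Python sketch for extracting them might look like this:

import re

# Example:
# [Eden: 104.0M(104.0M)->0.0B(1189.0M) Survivors: 7168.0K->8192.0K Heap: ...]
EDEN_LINE = re.compile(
    r"\[Eden: (?P<eden_before>[\d.]+[BKMG])\(.*?\)->(?P<eden_after>[\d.]+[BKMG])\(.*?\) "
    r"Survivors: (?P<surv_before>[\d.]+[BKMG])->(?P<surv_after>[\d.]+[BKMG])"
)

UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(size):
    """Convert a size such as '104.0M' or '7168.0K' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]

def eden_survivor_sizes(log_path):
    """Yield (eden_before, eden_after, survivor_before, survivor_after) in bytes."""
    with open(log_path) as log:
        for line in log:
            match = EDEN_LINE.search(line)
            if match:
                yield tuple(to_bytes(match.group(g)) for g in
                            ("eden_before", "eden_after", "surv_before", "surv_after"))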

If needed, we can get more granular information about the different operations (like scanning memory sections or copying objects) that take place during a garbage collection; in the same way as before, we can also identify the change in behavior that occurs during a repair.
carlos-blog-10
carlos-blog-11

As explained before, a Cassandra repair can consume a lot of resources, but this is necessary to ensure that the data is consistent (especially when there are constant updates). One finding from this breakdown is a change in the behavior pattern: the Object Copy operation (in green) stayed longer for hours after the repair was completed, which could also be an effect of the adaptive nature of the G1 algorithm.

Conclusions

The algorithms and strategies for garbage collection will keep evolving and changing, as we have not yet found one that maximizes throughput with minimum latency and footprint. In the meantime, Java applications will continue to rely upon G1GC for versions 7, 8, and, most likely, 9. By sharing some of our experiences with Garbage First Garbage Collection (G1GC) logs and open-source data modeling tools, we hope to help other database teams understand their systems better.

References

For Garbage Collection

IBM – Java theory and practice: A brief history of garbage collection

infoq – G1: One Garbage Collector To Rule Them All

Oracle – Getting Started with the G1 Garbage Collector

Oracle – Java SE 6 Garbage Collection Tuning

jaxenter – Java 9’s new garbage collector: What’s changing? What’s staying?

Oracle – Understanding G1 GC Logs

For visualization tools of Garbage Collection logs

GCViewer

Netflix’s gcviz

For data modeling with R

r-project.org

Code

Code is available at this gist

How to Replace a Microservice: Swapping Out the Engine in Mid-Flight

Photo by Michael VH via Flickr (CC BY 2.0)


Introduction

So you’ve built an important new service. It satisfies current use cases and is also forward looking. Pretty soon it’s mission-critical, so you optimize where you can and keep it humming along.

Then the product changes, and your service needs to do something different. You don’t have time to rewrite from scratch (and you’re never supposed to do that anyway), so you refactor what you can, then start bolting stuff on.

A couple of years later, the product changes again. This time, refactoring isn’t going to cut it: the entire design needs to change. So you take a deep breath and make your case for the big rewrite. The team is justifiably skeptical, but you’re persuasive, and everyone signs off.

Great! The service is going to be so much better this time. With two years of domain knowledge, you know where all the landmines are, and you know the stack and the usage patterns like the back of your hand.

Still, it’s slow-going. A bunch of other services now depend on the existing interface in production. Downtime is unacceptable, and a flip-the-switch cutover is just asking for trouble.

Now what?

The Knewton Graph Store

Knewton’s mission is to personalize education for everyone. To do that, Knewton organizes educational content in graph form that shows how knowledge builds on itself. Graphing is a cornerstone of what gives our platform its power.

Knewton’s graph store initially used an event-based model. In order to modify a graph, internal client code would do the following operations:

  • Read graph data using a library built on Titan (a property graph library with transactions and pluggable datastores) and backed by a Cassandra cluster.
  • Make the desired modifications and commit the Titan transaction.
  • Intercept the Titan transaction, package the modifications up into a delta format and write the delta to a queue.
  • Consumers of the queue would pick up the delta messages and write them back into another Cassandra database in different forms.

By the middle of 2015, Knewton was using graphs in ways that required much lower latencies, which meant that frequent round trips to a database wouldn’t work any more. Graphs had grown significantly bigger, which had caused problems with storage and database schemas. Beyond that, many smaller problems also made the service difficult to work with. So we undertook a project to rewrite the graph store.

Design and Development

The design and development of the new service were relatively painless. It’s tempting to claim that this was purely due to how fantastic Knewton’s engineers are (although we are and you should come work with us), but it’s worth noting how important hindsight was to the process. Having seen how the graph store was used in the real world helped us know what to build, what to fix, and what to avoid. One of Knewton’s core values is to “ship and learn”, and after several years of seeing the graph store evolve, we knew what we needed and built the new service accordingly.

Migration

Migration was by far the most complicated phase of the project. It was composed of two parts: migrating the existing data to the new service, then migrating clients to the new service.

Migrating the data

Best practices for doing data migration usually specify a five-step process:

  1. Set up the new datastore (or new table or schema).
  2. Modify existing business logic to write to both the old and new datastores while still reading from the old one.
  3. Backfill historical data into the new datastore.
  4. Start reading from the new datastore.
  5. Stop writing to the old datastore.

Due to the event-sourced nature of writes, we were able to accomplish Step 2 by adding a new consumer to the events topic. Knewton uses Kafka as our main messaging platform, and one of its most useful features is the ability to have multiple consumers subscribe to the same message stream at the same time. This allowed us to mirror incoming data to both the new and old datastores with a minimum of trouble.
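
As a sketch of that idea, using the open-source kafka-python client and hypothetical topic, broker, and helper names rather than Knewton’s actual code, the mirroring consumer is little more than a second consumer group that writes every delta into the new datastore:

from kafka import KafkaConsumer

def write_to_new_store(delta_bytes):
    """Hypothetical helper standing in for the new datastore's write path."""
    ...

# A separate group_id gives this consumer its own offset tracking, so it reads
# the same delta stream as the existing consumers without interfering with them.
consumer = KafkaConsumer(
    "graph-deltas",                    # assumed topic name
    bootstrap_servers=["kafka:9092"],  # assumed broker address
    group_id="graph-store-migration",
)

for message in consumer:
    # Mirror every delta into the new datastore; the existing consumer keeps
    # writing to the old datastore, so both stores receive the same writes.
    write_to_new_store(message.value)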

Difficulties arose in Step 3 due to issues with old data. Midway through the migration, we discovered that data stored several years earlier didn’t match current schemas. The validation logic had evolved quite a bit, but the historical data had never been cleansed.

This absence of strict validation mechanisms is a downside of many NoSQL solutions: since the database isn’t enforcing any schemas or constraints, the burden falls entirely on application logic, which typically changes much more quickly than schemas do.

Migrating the data ended up requiring several weeks of time-consuming manual investigation and data introspection.

Migrating the clients

Though ultimately worth it, migrating the clients off of the old service took a lot of time and effort because of all the necessary cross-team coordination. When trying to switch clients onto a new interface for your services, it’s customary to leave the old interfaces available for a period of time to give other teams some breathing room to do their migration. While we were able to accommodate our clients using this strategy, unforeseen interdependencies between clients forced us to spend several more weeks than planned coordinating the final rollout.

Results

In the end, we were able to serve all of our clients’ needs while simplifying the system and cutting hosting expenses by more than 70%. Knewton completed the migration with no data loss, no downtime, and no outages. Knewton is now able to handle significantly larger graphs and build new features that enable our partners to do things that weren’t possible before.

Lessons Learned

Migrating takes longer than you think

Migration is often the phase of the project with the most uncertainty, so it’s difficult to forecast accurately. It’s easy to assume that the difficulty of migrating to a new service will be purely proportional to the amount of data being moved, but there are several other factors to keep in mind:

Datastore differences

Moving between different datastores makes the migration much more complicated. You have to account for the particular mechanisms, guarantees, and performance characteristics of your new datastore. Changing datastores also typically means that you have to rewrite tooling and documentation as well.

Data hygiene

Data will get dirty. There are always going to be a few objects or rows that don’t exactly conform to the rules you expect them to, especially in NoSQL databases where hygiene mechanisms such as constraints and foreign keys often don’t exist. Unless you’ve taken great pains in the past to make sure your data is exactly what you think it is, migration time is when you’ll find out which of your assumptions aren’t true.

Clients

You have much less control over the clients of your service than you do over your own code. Clients have their own services and their own schedules, and you’re the one throwing in a monkey wrench by changing one of their dependencies. There are several ways to accommodate clients, but each approach has costs:

  • Be flexible about cutover deadlines
    • Clients have their own schedules to manage, so it’s a good idea to give them some breathing room to do the migration. A hard cutover date is a recipe for problems.
    • Tradeoff: total migration time (and thus the total time you have to maintain the old service for backwards compatibility) can stretch out longer.
  • Get feedback early and continuously
    • A good way to avoid any unforeseen issues in client migration is to continually solicit feedback on both technical considerations (like interface design and documentation) and project management considerations (like timing and scheduling).
    • Tradeoff: communication requires a lot of energy and effort from all involved parties, with a payoff that is not immediately visible.
  • Minimize the need for coordination
    • Coordination across multiple teams is difficult, and even more so with scale. In general, it’s a good practice to avoid it because it reduces the amount of time any one team spends waiting for other teams to complete precursor tasks.
    • Tradeoff: Enabling all clients to complete the migration independently requires more effort and planning from the service owner, especially in response to unforeseen events.

Rewrite only when you need to rearchitect

It’s sometimes difficult to judge when to rewrite a service from the ground up, but a good rule of thumb is to rewrite only if the system’s architecture itself is blocking your progress. It can be easy to convince yourself that the code may be confusing, or that the data model doesn’t make sense, or that there’s too much technical debt, but all those issues are solvable “in place.”

If you can refactor the code, refactor the code. If you can change your database schema incrementally, change your schema. When you can’t move forward any more without changing the architecture, it’s time to think about a rewrite.

Don’t look too far down the road

One of the easiest traps to fall into in software engineering is to build a service (or feature or library or tool) because “it might be needed down the road.” After all, software engineers are supposed to be forward looking and anticipate issues before they occur. But the further you look, the less clearly you can see. It’s important to resist temptation and build only those things that you know you need now (or, at most, in the immediate future).

System Design Documents at Knewton: RFCs

The team is in agreement: the Flimflamulator service is a mess of tech debt and something needs to be done. Other software teams at the company write code that depends on this service, however. They’ve built workarounds, and any changes will propagate into their systems. Flimflamulator provides some customer-facing functionality; the Product team will want to weigh in too.

How do you make sure you’re not creating new problems? What if someone has already thought through solutions to this? There are so many stakeholders that a design review meeting would be chaos. How do you collect feedback from everyone?

At Knewton we use the RFC process. You might call it design documents or architecture review, but the goal is the same.

What is an RFC?

RFC stands for Request for Comment; at its most basic level an RFC is a way to collect feedback about an idea. A document explaining the idea is shared with peers and stakeholders who provide feedback and ultimately sign off (or don’t) on the idea. RFCs can be product proposals, design documents, process changes, or any other thing you’d like someone’s opinion on.

RFC does not stand for Request for Consensus. People are rarely equally informed or equally affected by issues, and some team members naturally hold more sway than others. RFCs are not decided by majority vote, and that’s OK.

Knewton’s process for this is loosely based on the RFC process used by the Internet Engineering Task Force to discuss and set standards for internet protocols.

When to Write an RFC?

Write an RFC when you’re considering something that will be hard to change later: typically public interfaces, data schemas, frameworks, platforms, or human processes.

How to Write an RFC?

Knewton’s RFC template explains what an RFC should cover. We use Google Docs to share and discuss documents because it has great commenting tools.

RFC Ownership

Each RFC must have at least one owner, who should be listed at the top of the RFC along with the RFC’s status and any relevant dates (refer to the template for examples). The owner is responsible for:

  • doing research on prior art
  • identifying the right audience of stakeholders and peers
  • publishing the RFC to that audience
  • addressing all questions, concerns, and suggestions transparently and expeditiously
  • documenting any dissenting opinions

The “Right” RFC Audience

“An RFC is like a wedding proposal; if you’re not sure your stakeholders are going to say yes, it’s too early to write it,” as my colleague Rémi likes to say.

Not all RFCs require involvement of the entire tech team. In fact, most don’t.

The RFC owner should identify key stakeholders and solicit their opinions first. These are the people affected by what you’re proposing, subject matter experts, or people who can make the proposal happen/not happen. An informal chat before you start writing can save lots of time.

Once you have a doc, share it with a small audience who can give you rapid feedback. Your manager is often a good place to start. Focus on quick iteration and tell people when you expect to receive their feedback. Early in the process, a 24-hour turnaround is a reasonable request. Be proactive in soliciting feedback. You’ll probably get a lot of comments on your first draft, and an in-person review can be useful to speed things along.

As major issues get worked out and details solidified, expand the audience in a few rounds: more stakeholders, your whole team, tech-wide. It should be like rolling a snowball down a hill. Allow up to 5 business days for the final audience to sign off. This will be a judgment call based on the size of the audience and urgency of the proposal. At Knewton it’s customary to share your RFC with the entire tech team as an FYI even if it isn’t relevant to everyone.

How to Address Comments

It’s unlikely everyone will be in perfect agreement about your proposal, and you’ll probably receive some feedback that you disagree with. Disagreement is OK.

A few things to keep in mind when addressing dissenting comments:

  • What would it take to change your mind? If you think incorporating the dissenting feedback would cause problems, ask the commenter to provide a solution to those problems.
  • Does the commenter have skin in the game? Are they a subject matter expert? You don’t have to address every comment on your proposal if it isn’t relevant.
  • Close out comments you won’t incorporate by briefly but respectfully saying you disagree, and mark the comment resolved. If the commenter feels strongly, they’ll let you know.
  • Long comment threads can be a time sink. Resolve them with a real-time conversation.

RFC Statuses

RFC documents should list their current status in the document header (see RFC template).

Proposed means open for comment, actively reviewed and publicly debated.

Accepted means closed for comment, all comments “resolved,” dissenting opinions documented, and work to implement the proposal has been scheduled. Doc status should be updated to “Accepted on [date].”

Cancelled means the RFC owner decided to not proceed and cancel the RFC. Doc status should be updated to “Cancelled on [date].”

Summing It All Up

  1. Decide on problem to solve
  2. Identify and talk to key stakeholders
  3. Write RFC
  4. Solicit feedback
    1. Primary stakeholder or mentor (no more than 1 day per iteration)
    2. Team or wider group of stakeholders (2-3 days)
    3. Full audience (5 business days)
  5. Resolve comments
  6. Close RFC
    1. Update doc to indicate it has been closed