WiP Limits and Velocity

I recently presented at a conference on the topic of Work in Progress (WiP) limits, and in particular Little’s Law.  As part of it I gave a demonstration, and from it drew the conclusion that WiP limits have no direct bearing on velocity.

This caused some controversy so I’ll try to explain again here.

Applying Little’s Law

Little’s Law is easiest to picture as a queue of people waiting to be served. Mathematically speaking, we can reduce the time it takes to get served in two ways:

  1. We reduce the number of people in the queue (we limit WiP), or
  2. We increase the speed at which each person is served (we increase Throughput, in this case by adding more servers or installing faster equipment)

Note:  When applying Little’s Law mathematically, the pre-condition is that the variables are independent: the assumption is that the number of people in the queue does not affect how quickly the server works.  This holds in mathematics, but maybe not so much with people in reality.

The principle is that WiP and Throughput are two independent variables and changing one does not change the other, only cycle time is impacted.

The wider impact of WiP limits

Of course the goal of my talk was to illustrate that whilst you may feel you are getting more done by not limiting your WiP and having a whole bunch of work in progress, this has no bearing on your actual throughput, which is governed by your bottleneck.

When using Little’s Law in a controlled environment we can show that limiting WiP doesn’t impact throughput (unless we limit it to less than our bottleneck can handle and starve it, say two servers with the queue limited to 1 person). What limiting WiP does do is reduce cycle-time, enabling you to be more flexible and more responsive and to have more reliable forecasts.
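A minimal sketch of that controlled environment, assuming a deterministic model in which every server works at a fixed rate and the system is always filled up to its WiP limit. The function names and the rate of 5 customers per hour per server are illustrative assumptions, not figures from the talk.

```python
def throughput(wip_limit: int, servers: int, rate_per_server: float) -> float:
    """Items completed per hour: only servers that have an item to work on contribute."""
    busy_servers = min(wip_limit, servers)
    return busy_servers * rate_per_server


def cycle_time(wip_limit: int, servers: int, rate_per_server: float) -> float:
    """Little's Law: average time in the system = WiP / throughput."""
    return wip_limit / throughput(wip_limit, servers, rate_per_server)


# Two servers, each assumed to serve 5 customers per hour.
for limit in (20, 10, 4, 2, 1):
    t = throughput(limit, servers=2, rate_per_server=5.0)
    c = cycle_time(limit, servers=2, rate_per_server=5.0)
    print(f"WiP limit {limit:>2}: throughput {t:.0f}/hour, cycle time {c:.1f} hours")
```

In this toy model throughput stays at 10 per hour for every limit of 2 or more, while cycle time falls with the limit. Only at a limit of 1 does one of the two servers sit idle and throughput drop.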

Work in Progress Limits will directly impact Cycle-time

Contradictory Claims

Unfortunately this contradicted one of my other claims in the presentation where I said that Limiting WiP helps increase throughput of the team.

The confusion is of course understandable. In a mathematical example there is no cost for context switching, and the example was there to show the impact of WiP on a stable system, isolating variables to demonstrate what happens.  The real world is far more complex and there are many more variables.

In the real world

Generally speaking the ability to focus on fewer things means less multi-tasking, and reduced cycle time means you have fewer things on the go at once, and for less time. These help you be more productive and thus indirectly throughput will go up, but this is a result of you being able to focus, not a direct impact of the WiP limits.

There are many indirect benefits of using WiP limits,  but Cycle-time is the prime beneficiary, the rest is a bonus or an indirect consequence.

 

 

 

Understanding cycle-time in software

Cycle-time and Lead-time and even Little’s Law are terms that have migrated over into the software development context and are becoming more widely used, but I am not sure they are fully understood so I’ll attempt to clarify my understanding of the terms.

How does cycle time differ from lead time?

In manufacturing:

There are two different ways I have seen to express Cycle-time:

“Cycle Time” is the overall amount of elapsed time taken to create an item, measured from the point when work starts until the item is delivered.

or

“Cycle Time” is the average amount of time between each delivered item.  e.g. 7 items delivered in a week is a cycle time of 1 day

Lead-time seems to be used consistently:

“Lead Time” is the amount of elapsed time from when the order was first requested by the customer until the customer has received it.

Note: The different uses of cycle time cause confusion and I prefer the first description in a software context, as the second one ignores the impact of WiP. 
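To make the difference between the two definitions concrete, here is a small sketch using invented data: seven items, each assumed to take five days from start to delivery, with one delivered per day. The dates are hypothetical and only there to show how the two measures diverge.

```python
from datetime import date, timedelta

# Hypothetical data: seven items, each started one day apart and delivered 5 days later.
first_start = date(2024, 1, 1)
items = [
    (first_start + timedelta(days=i), first_start + timedelta(days=i + 5))
    for i in range(7)
]

# Definition 1: elapsed time per item, from work starting to delivery.
elapsed_days = [(delivered - started).days for started, delivered in items]
average_elapsed = sum(elapsed_days) / len(elapsed_days)

# Definition 2: average time between deliveries (length of period / items delivered).
period_days = 7
average_interval = period_days / len(items)

print(f"Definition 1 (elapsed per item): {average_elapsed} days")
print(f"Definition 2 (time between deliveries): {average_interval} days")
```

Both numbers describe the same week of work, yet one says 5 days and the other says 1 day. The second is really throughput turned upside down, which is why it tells you nothing about how much work was in progress.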

Clearly both of these are more complicated in software terms as the boundaries are not as clear.

Cycle-time

Cycle-time is generally counted from the point when a backlog item is committed and is “in progress” through our development cycle. In Kanban terms this is likely to be counted from when it moves out of the ‘backlog’ column. We stop counting when it is moved to the ‘done’ column.   There can be a whole debate about what ‘done’ means and that is for another topic. But for simplicity the ‘cycle’ ends when your team stops work on it (ideally it will be in the hands of the customer).

In Scrum cycle-time is typically fixed at a sprint length: we commit at the start of the sprint and deliver at the end.  But that is not universally true. Some teams do not deliver at the end of the sprint and need further effort later to deploy as part of a scheduled release, and some teams deliver as soon as a story is complete.

Note: Cycle-time is sometimes called delivery time. But again this gets confused with lead-time, because manufacturing terms do not map neatly onto software.

Lead-time

Lead-time is a bit more complicated, it may be counted from the point a need is first identified by a customer to when it is resolved but this is rarely a one-to-one mapping, or it may be considered from the point when the need is identified and refined into a backlog item that will be worked on and delivered.

But really for agile backlogs we generally don’t work in a linear fashion, just because a request has been on the backlog longer doesn’t mean it will be delivered sooner. We prioritize and so a request identified last week may be delivered sooner than one from last year.

As a result Lead-time is generally not used as frequently as Cycle-time, not because we can’t measure it but because it is not as meaningful or useful.

Throughput

Another often used term is “Throughput”. This is simply the number of units that have passed through the system (units completed) in a given period,  e.g.  our throughput is 10 stories per week, or 5 customers per hour.

Unless stated otherwise you can usually assume that it is an average (mean) or the most recent time period.

Little’s Law

Cycle time is a trailing measurement, and we are able to come up with metrics and averages from historic results.  But a mathematician from MIT, John Little, devised a mathematical formula for predictively calculating Cycle-time from the number of units in the system.

Average Cycle Time =  Average Number of items in progress / Average Throughput

Projected Cycle Time   =  Number of items currently in progress  / Average Throughput 

For example, let’s say you have 50 items in progress and on average you complete 10 per week.

Number of items currently in progress (50)  / Average Throughput (10 per week)

Projected Cycle Time:   50/10  =  5 weeks

What this tells us is that for the new units entering the system, on average it will be 5 weeks before each of them is completed.

If we want to reduce Cycle-time we have two options we can either reduce the amount of work in progress, or we can increase throughput.  Often our throughput is limited by other factors and so may be harder to change, such as team size or equipment availability. But work in progress (WiP) is typically more within our discretion.

Let’s say we reduce the work in progress to just 20.

Number of items currently in progress (20)  / Average Throughput (10 per week)

Projected Cycle Time:   20/10  =  2 weeks

What this tells us is that for the new units entering the system, on average it will be 2 weeks before each of them is completed.
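The same arithmetic as a small sketch. The function name is my own invention for illustration; the figures are the ones from the two examples above.

```python
def projected_cycle_time(items_in_progress: float, average_throughput: float) -> float:
    """Little's Law: projected cycle time = items in progress / average throughput.

    The result is in the same time unit as the throughput (weeks here).
    """
    if average_throughput <= 0:
        raise ValueError("average throughput must be positive")
    return items_in_progress / average_throughput


print(projected_cycle_time(50, 10))  # 5.0 weeks
print(projected_cycle_time(20, 10))  # 2.0 weeks
```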

What you see here is that our throughput is the same: we are still delivering the same amount, but by reducing work in progress each item is delivered in a shorter elapsed time. If we want more agility, a shorter cycle time enables us to change direction sooner.  Less work in progress does not increase our throughput; however, it makes us more dynamic and more adaptable to change.

The less work you have in progress the sooner it will get done.

Yorke’s motto

In essence – the less work you have in progress the sooner it will get done, that is the lesson that should be taken from this.

But there comes a point where reducing the work in progress starts to impact throughput (the system can become starved), and thus there is a trade-off.
In some situations cycle-time is the critical factor and so an element of slack time is desirable: say an urgent need to see a doctor, compared to a regular appointment.
But in most other cases we would aim to limit our WiP as much as we can, down to the point just before throughput is impacted.

Other less used terms

Then we come to the fun ones. These terms come from manufacturing and are used much less frequently in a software context, but they may be useful to be aware of as they can help you understand the composition of cycle-time and where you can take action to tighten up your process.

  • Process Time
    • This is the time when someone is working on creating something: writing code, documents, design, tests, etc.
  • Move time
    • In software terms this would be the time to move from one user to another or from one action to another, and so would include compilation, building, deploying and hand over/knowledge sharing.
  • Inspection time
    • This might be QA time or code reviews, demonstration to Product Owners or stakeholders, but this may overlap with process time as this is still value added work.
  • Queue time
    • This is time when a unit (story/task) is in progress but not being worked on: say blocked, or waiting to be code reviewed, waiting for QA, or waiting for a demo. It covers any time work is paused for any reason before it is done.
  • Wait time
    • This is the time a request spends in the backlog before we commit to work on it, e.g. the time from when a customer identifies a feature/function or bug until we start work on it.
  • Value added time
    • Any time in the system where activity that adds value is happening, i.e. not queuing or waiting.
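As a rough sketch of how these terms fit together, the breakdown below uses invented hours for a single story. It assumes, following the definitions above, that wait time sits before cycle-time starts and that everything in progress except queue time counts as value added. Whether move time really adds value is a judgment call.

```python
# Hypothetical hours for one story; the numbers are invented for illustration.
hours = {
    "process": 16,
    "move": 4,
    "inspection": 6,
    "queue": 54,
    "wait": 120,  # time spent in the backlog before we commit to the work
}

# Cycle-time covers everything after commitment; wait time comes before it.
cycle_time = hours["process"] + hours["move"] + hours["inspection"] + hours["queue"]
lead_time = hours["wait"] + cycle_time

# Value added time, per the definition above: time in progress that is not queuing.
value_added = cycle_time - hours["queue"]
value_added_share = value_added / cycle_time

print(f"cycle-time {cycle_time}h, lead-time {lead_time}h, "
      f"value added {value_added}h ({value_added_share:.1%} of cycle-time)")
```

With numbers like these, most of the cycle-time is queue time, and that is usually the first place to look when tightening up a process.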

Summary

Cycle-time and Little’s Law are becoming much more frequently used, and whilst the basics are not complicated to understand, they do bring a new set of terms and associated confusion.

For example I have seen a number of times recently people advising that reducing cycle-time increases throughput.  Mathematically speaking this is incorrect (although the reverse is true: with WiP held constant, increased throughput reduces cycle-time). Anecdotally, less context switching does increase throughput.
What they are advising is good advice: reduce Work In Progress. However, their expectation and reasoning are wrong and add yet more confusion.

Cycle-time is not a measure of quantity of output, but a measure of the duration it is in the system. It is important to remember that throughput and cycle-time are different measures.

I hope this helps and I’d be happy to add any further clarification if there are parts that are unclear.

 

Why do we use WiP limits?

I participated in a great discussion this week on the use of WiP limits.  For those that don’t know, WiP stands for Work in Progress, and is a measure of how many activities you have started but not completed.

It is important to note that this is not a measure of how many tasks you are currently actively working on.  If I started a task but have become blocked, so I start another then my Work in Progress is 2 – even though I am only actively working on one. WiP is a measure of how many started but incomplete tasks I have.

For example, think of call-waiting or being put on hold during a phone call.  You are talking away and get another call, so you switch to the second call without hanging up, and when you get another call you take that too. How many people can you have on hold at once?  You are only working on one call at a time, although you must remember the content and context of all those other calls. I expect the people still on hold while you have all these conversations are wishing you had a WiP limit.  Your WiP (Work in Progress) is the count of all active (incomplete) phone calls.

Limiting WiP

There are many reasons for limiting Work in Progress, not least because I hate being put on hold.  I will go into some later, and it is likely that if you follow any type of Agile framework you already limit WiP, although you may not be consciously aware of it.

If you have a backlog, then you are limiting WiP. You are consciously choosing not to start new work immediately but are prioritising it and organising it to work on later. It is likely that you are informally limiting your WiP according to what you feel you or your team can manage at a given time. It is important to realise that this is a WiP limit even though it may not be conscious or scientific.

If you follow Scrum: WiP is consciously and explicitly limited at Sprint Planning, often by estimating effort, counting stories or calculating story points. The limit is generally set based on Velocity, our average achieved over previous sprints. If on average we finish 10 stories a Sprint, or we average 35 points, then that average is generally taken as the starting point for our WiP limit. We may choose to push for a few more, or if we are expecting to be slower due to vacation time or known issues we may choose to commit to less. In Scrum we don’t usually refer to it as a WiP limit, but that is exactly what it is.

KanBan is actually very similar, if we on average are completing 10 stories a week, then it is likely that we will consciously or subconsciously limit ourselves to only take on 10 new stories within a week, although in the case of KanBan this is not done in a Big Bang explicit decision but gradually over the measured period and is more a consequence than a plan (A Pull rather than a Push). We pull stories when we are ready and we pull at the pace we are working at, that just happens to be 10.

Slightly off-topic, but one of the crucial differences between Scrum and KanBan is that Scrum is a ‘Push’ model and KanBan is a ‘Pull’ model.  In Scrum we Push an amount of work to the board and spend the Sprint working to Remove(complete) it.  With KanBan we are aiming for a more steady amount of Work in Progress and will pull new work as current work is completed, which creates a smoother flow, but conversely lacks a unified time based goal or target.  But it is important to understand that both frameworks limit WiP and both focus on ‘completed’ work being the primary objective.

Why do we limit WiP?

I’d like to think it was obvious by now and I think the basic principle is, but there are nuances that may complicate things which may be where the cause for debate comes from.

At its simplest level, if I can only complete an average of 10 stories a week/sprint then starting an average of 20 a week without changing anything else is nuts. All that will happen is that on average 10 or more of those stories each week will remain incomplete and stay ‘in progress’. By the start of week 4, once that week’s 20 have been started, we will still only have completed 30 but will have 50 now in progress.  Chances are by now we will actually be going slower, because we will be distracted and falling over ourselves trying to juggle more than we can possibly achieve.
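A quick sketch of that arithmetic, assuming each week’s new stories are all started at the beginning of the week and that the completion rate stays at exactly 10 (in reality it would probably fall). The function name and the way the weeks are counted are my own illustrative choices.

```python
def simulate(weeks: int, started_per_week: int, completed_per_week: int) -> list[tuple[int, int, int]]:
    """Return (week, completed so far, in progress) for each week,
    measured just after that week's new work has been started."""
    rows = []
    started = 0
    completed = 0
    for week in range(1, weeks + 1):
        started += started_per_week
        rows.append((week, completed, started - completed))
        # During the week we finish what we can, capped by what is in progress.
        completed += min(completed_per_week, started - completed)
    return rows


for week, completed, in_progress in simulate(4, started_per_week=20, completed_per_week=10):
    print(f"Start of week {week}: {completed} completed, {in_progress} in progress")
```

Work in progress grows by 10 every week and never comes back down, while the completed count climbs at exactly the same rate it would have with a limit in place.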

Think of all those people on hold, growing frustrated at being ignored, and you trying to remember all those conversations, the phone rings again, can you cope with another caller on hold?

We limit WiP because we understand that by working only on what we can achieve, we can focus and be more effective.

Limiting WiP and WiP limits

“Ah but you have talked about Limiting WiP but you haven’t mentioned WiP limits” I hear you say…

And this is where I think the conversation really stems from and where we get into nuances.  KanBan began in manufacturing where work was passed from one workstation to another and there was a measurable flow through the system.   Limiting the production on one workstation so that it maintained pace with the entire system limits waste. Having one hyper-performing efficient widget maker that can produce widgets 4 times faster than they can possibly be used is wasted effort. And the same applies to a software process, we use WiP limits to regulate one part of the flow to maintain pace with the flow as a whole, the goal is to visualise bottlenecks and by visualising bottlenecks we can take steps to increase the flow of the whole system.

Other forms of WiP limit

But WiP limits can take many forms and are applicable to almost all aspects of life. The most common example used is a highway, when a road gets busy it slows down, when it gets very busy it grinds to a halt.  Limiting access actually speeds things up for everyone on the road, and ultimately more cars can get through far quicker.

But there are other less obvious WiP limits. A long-distance runner wants to improve her times. She starts fast, but she runs out of energy and is slowing down near the end of the distance. By setting herself a slower time per mile early on (pace is a form of WiP limit) she keeps a consistent speed and reserves energy, and so is able to run the full distance faster by slowing down during the early part of the race.

Are you dedicated to a single product/project?  That is a very low WiP limit of 1 project.

Do you use a calendar and try not to book two meetings at the same time?  That is a very tight WiP limit of 1 item at a time.

What about a budget?  Do you manage your finances as an individual or company?  A budget is a WiP limit for your money: by limiting money spent on beer you have more available for clothes. By regulating your spending you can save for a holiday.  By remaining in credit at the bank you do not have to pay interest, and so on.

Example of how to choose a WiP limit in a workflow

In the case of the widget maker, if the system can only use 10 widgets a day then we may set his WiP limit to 10, and when he gets to 10 he is free to go and help someone else.  Suddenly rather than unused widgets we have an extra pair of hands to do other work.   But hold on! These widgets are easy to make and only take a short while to do, so we could probably set a lower WiP limit and jump on the workstation as and when needed to produce more.  So how do we set the WiP limit?  My advice is not very scientific at this point.  Start by continuing to do what you currently do and just watch. If your workstation/activity columns are subdivided into doing and done, take a look on a regular basis and see where the biggest stacks are. If one column regularly has more cards in done than in doing, it ‘may’ be a sign that you should reduce your WiP limit for that activity. Try it and see if it improves your flow.

The largest queues of done work are usually the area that needs limiting, but be aware of ‘bursts’ of activity, for example there may be a workstation that is only available for one day a week, in which case the queue for that workstation may need to be longer – or maybe you should be asking for more regular access to the workstation. But when imposing limits do it gradually and monitor to see if it improves the flow.
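As a rough illustration of that “just watch” step, the sketch below takes a snapshot of a board and flags the activities whose done stack is larger than their doing stack. The column names and card counts are hypothetical, and a single snapshot can be misleading because of the bursts mentioned above, so treat the output as a prompt for a conversation and not as a rule.

```python
# Hypothetical snapshot of a board: card counts per activity, split into doing and done.
board = {
    "Analysis": {"doing": 2, "done": 1},
    "Development": {"doing": 3, "done": 6},
    "Code review": {"doing": 2, "done": 2},
    "QA": {"doing": 4, "done": 0},
}


def columns_to_review(snapshot: dict[str, dict[str, int]]) -> list[str]:
    """Return activities with more cards in done than in doing, biggest done stack first."""
    candidates = [
        (name, counts["done"])
        for name, counts in snapshot.items()
        if counts["done"] > counts["doing"]
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in candidates]


print(columns_to_review(board))
```

Taking the same snapshot at regular intervals and looking for columns that keep appearing is closer to the “regularly” in the advice above than any single reading.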

Anything in a done column is normally waste: if it is done and not in production, that effort cost you money and is just sat there.  Sometimes that makes sense, but the car industry learned that complete cars sat in storage amounted to $millions of wasted investment, and that components stockpiled in advance were essentially big piles of cash that could be used more effectively. We can learn the same.

What I am saying is that a WiP limit being reached is a conversation trigger, a long queue is a conversation trigger, and so on. As with so many things in Agile, we create early warning systems to start conversations and make decisions when necessary.  We make small changes to the process and watch. We see what changes, and if it is better we carry on, but look for another new small improvement.

Summary

WiP limits are not unique to KanBan, and like it or not you are already limiting your work and activities in all aspects of your daily life, you just perhaps didn’t realize it.  In the context of a KanBan board we do not arbitrarily impose a WiP limit because we can or because it is fun. We impose a limit when we feel it adds value, normally where we see a queue of completed work growing faster than it is being consumed. We limit various stages of the work because our goal is the output of the system (the whole team) and not the perceived utilization of individuals.  One team member being 100% occupied all day producing something that will sit untouched for a week is not efficient.

We are not trying to manage a system where our goal is to keep busy a group of individuals, our goal is to create a cooperative and coordinated team working towards a common purpose in the most effective way possible. A WiP limit is just one of many tools that can be used to enhance the prioritisation of work and focus the team on the true goal.