Scaling teams like parallel computing systemsWhy output is not linear with size

Kiril VidelovNovember 29, 2021

Reading time: 11 minutes

The dynamics of scaling teams and organizations fascinate me. It is well known that the productivity of a team does not scale linearly with team size. Going from a team of 5 to a team of 10 does not double the output, despite doubling of the team size. The goal of this post is to answer why that is, using math.

There are of course entire management books dedicated to the subject as well as a lot of anecdotal evidence to this claim.

A well known rule of thumb is sticking to Two-Pizza teams. In a nutshell, this means a team that can be fed with two pizzas. Personally, I would say this really depends on how large the pizzas are but nevertheless, the spirit of the message is - there is a productivity sweet spot when it comes to team size. But why is it this way?

In this article I make the following conjecture:

A team is no different from a parallel computing system. As such, it is subject to Amdahl's law, which governs it's scalability.

I am motivated by creating models that approximate different aspects of life. Sometimes such models allow us to develop a deeper understanding of the world.

Let's start by dissecting the claim:

A team is no different from a parallel computing system.

Teams are Parallel Computing Systems

Teams are systems

Per definition, a system is a collection of entities that interact or collaborate in order to achieve a common goal. The parts of a system are united in that they have a common purpose.

turing machine illustration

Similarly, teams are formed in order to achieve objectives that are too big for an individual to accomplish - in other words, a shared goal.

A common characteristic between systems and teams is the need for interaction and collaboration. Without such interaction you have just a collection of workers or entities that produce value individually.

Teams compute

Here is an interesting fact:

The term "Computer" used to refer to a profession!

Fundamentally, computation / calculation simply means transforming one or more inputs into one or more outputs. As a team, you outputs may be products or services while the inputs may include business requests or your own inspiration.

Again, it is key to highlight the need of a common objective in the process of computing - in the absence of an aligned common objective, team members are better considered discrete systems (teams).

Teams are (usually) parallel

Parallel computation is one where multiple calculations are carried out simultaneously. This typically applies when problems can be broken down into smaller ones. A classical example of a problem well suited for parallelization is matrix multiplication, where Divide-and-Conquer techniques are appropriate.

When it comes to teamwork, the system can choose to work on the same task in order to improve the quality of the output. For example, this is the case with brainstorming or mob/pair programming.

More typically however, teams break tasks into smaller ones (Divide-and-Conquer) in order to be able to execute on them simultaneously. The sub-tasks are of course related and sometimes interdependent because they contribute to the common objective of the system / team.

Next, let's discuss parallel systems in general and the Amdahl's law. Later in the article, I discuss how we can use this theory in the context of teams.

Amdahl's scalability law

In computer systems with more than one processing units, there exists an equation that governs the theoretical speedup in execution of a fixed workload as we increase the number of processors - known as Amdahl's law.

At it's core, this rule stipulates that because workloads always include sections which cannot be parallelised (eg. requiring synchronisation between processors), the speedup is constrained by the proportion of work that must be performed sequentially.

For example if half of the workload cannot be parallelised, then at most we could reduce the execution time in half - a speedup of 2x, no matter how much we increase the processor count.

Introducing Amdahl's law

Amdahl's law is an equation that gives us the theoretical speedup of a workload and is formalised as follows:
Speedup(n)=1(1p)+pnSpeedup(n) = \frac{1}{(1-p) + \frac{p}{n}}
    where:
  • p is the proportion of the workload that benefits from parallelization (eg. does not require synchronising)
  • n is the parallelism factor (or number of processors)

Let's take for example a workload where 25% of it can not parallelized - this means p = 0.75. We can calculate the speedup with parallelism factor (number of processors) n = 16. The speedup is just 3.367x, despite 16 processors!

1(10.75)+0.7516=3.367\frac{1}{(1-0.75) + \frac{0.75}{16}} = 3.367

Even more interesting is plotting the results for p = 0.75. Below, on the x axis we have the number of processors and on the y axis we have the resulting speedup. You can adjust the p value interactively.

This visualises the way as parallelism increases, the marginal gain in speedup decreases. With a p value of 0.75 (meaning 25% of the workload can not be parallelized), as n approaches infinity, the speedup approaches 4.

More formally, we can say that as n grows towards infinity, the Speedup tends to 1/(1-p).

A key observation to be made here is that the slope of the curve platoes. In other words, the marginal effect of increasing n diminishes. Let's look further into that.

Rate of speedup change

The first order derivative (rate of change) is the curve slope of our function. This means it gives us an indication of the sensitivity with respect to the parallelism factor n. A value of 1 would indicate linear relationship, and we can see that as n increases, the rate of change tends towards 0.

f(n)=p((p1)np)2f'(n) = \frac{p}{((p-1)n-p)^2}

This is the formal definition of the first order derivative of the Amdahl's law. Let's plot it!

It's clear that even with high values of p, there is a quick dropoff in sensitivity towards n. This insight is useful because n is a finite resource - processor units. In this context f'(n) represents a measure of efficiency. A greater rate of change for increasing n means higher efficiency.

Returning to the example where three quarters of the workload are parallelizable (p = 0.75), an argument could be made that it is wasteful to allocate more than 16 processor units (n = 16) and depending on exactly how scarce the resource is, perhaps n = 10 is a more appropriate allocation.

However, if the workload could be modified to have a greater parallelizable proportion p, parallelization could be efficient at higher n values!

This brings up an interesting question:

If we can establish a threshold of a minimum acceptable efficiency f'(n), how much of the workload needs to be parallelizable to efficiently utilize a given number of processors?

We can answer this by rearranging the equation and solve for p.

Solving for p

So far we had the parallel proportion p set and we plotted the Speedup and the rate of change with respect of the parallelism factor n.

p=(4dn24dn+12dn2+2dn1)2dn24dn+2dp = \frac{-(\sqrt{4dn^2 - 4dn +1} -2dn^2 + 2dn -1)}{2dn^2 - 4dn +2d}

Since we use the rate of change as our efficiency measure, let's solve for p instead. Let's refer to the rate of change f'(n) as d for brevity. The equation is starting to look bulky but we can plot it and reason around it visually.

If we can establish a reasonable minimum efficiency d for a given workload, we can see what proportion p of that workload must be parallelizable for different n values.

The beautiful part is that this equation gives us an indication as to how the workload must be adjusted in order to be able to continue scaling! In other words:

This allows us to reason around the inherent bottleneck caused by the need of task synchronization.

If we choose a minimum efficiency d (or worst rate of improvement per processor added) of 0.2, and if we want to utilize 5 processors, then 76% of the workload must be parallelizable. On the other hand, if we want to utilize 20 processors and maintain the same level of efficiency, then 94% of the workload needs to be parallelizable.

Let's get back to discussing teams.

Scaling teams

So far I have made the case that teams are parallel computing systems. We also discussed the Amdahl's law, which governs the speedup in completing a workload (objective) gained from increasing the number of processing units.

When we made the case that teams are parallel computing systems, we put an emphasis of a common goal / objective of the team / system. There is an inherent need for interaction and collaboration between the parts of the system. Without it you have just a collection of workers or entities that produce value individually. Building on this:

The common team goal or objective is its workload. Parts of the workload (subtasks) may be dependent on each other, and some subtasks may require synchronization between team members.

Let's consider the points of collaboration within a team as serial (non-parallelizable) portions of their workload. For example it is necessary for a healthy team to:

  • coordinate work (planning)
  • share updates (standups)
  • re-evaluate & optimize their process (retrospectives)

And now the really cool part - If we take a look at how a team spends their week, we can approximate the proportion of the workload that is non-parallelizable. Why is this useful? We can use the Amdahl's law and it's first order derivative to reason around team dynamics. For instance:

We can answer the question "What would be the maximum team size where team members can feel reasonably productive?" from first principles.

Let's do it! First, we assume a 40 hour working week. How many of those hours are spent for the healthy operation of a team collaborating towards a common objective? This is a portion of the work that can not be parallelized and for a good reason. This number can vary, but 5 hours would be reasonable, which represents 12.5% of the total time. In other words, the parallelizable proportion p is 0.875.

Our second question is - what is a reasonable minimum efficiency in a team setting? Earlier, when discussing the rate of change with respect to the number of processors n, a value d of 0.1 may have been reasonable - CPUs cores are cheap. For a team a much higher value is desirable, I would say 0.3 or higher.

A p = 0.875 and d = 0.3 gives us a team size n of 6.68.

Solving for p

As earlier, let's plot this, solving for p. We can pick a minimum efficiency value d and solve for what ratio of the workload must be parallelizable. For convenience, below I am also converting p to the number of hours per week that can be spent in synchronization tasks.

If you have in mind the number of hours your team needs to spend synchronizing every week, this plot visualizes how large the given team can be!

Keep in mind the meaning of the parameter d here - it is a factor that determines the marginal increase in speedup/output when adding an additional processing unit n.

Conclusions

Teams are parallel computer systems who eat pizzas.

Because by definition a team works towards a common goal, there is an inherent need for synchronization between the members of the team. 12% is a reasonable estimate of the proportion of time spend synchronizing.

The proportion of time that cannot be parallelized places an upper bound of how much of the workload can be sped up. Moreover, the rate of change in output decreases as the number of team member increases. In other words:

After a certain threshold, increasing a team size is a very inefficient way of reducing the time to achieve the team's goal (boosting output).

So, what is the right team size? Assuming 12% of the workload needs to be synchronized to facilitate healthy collaboration within the team, and if we want a reasonable minimum efficiency of 0.3, then the answer is a size of 6-7. This happens to be align really well with the "Two-Pizzas Team" size!

In my next post, I will control for different pizza sizes!

Thanks for reading!
- Kiril