Every project should have 2 measurements associated with it - the measure of success (does it accomplish what the business wants?) and the measure of sunsetting (when should this project be retired?).

For example, let’s say your company has decided to create a knowledge base for the customers to use. Measures of success would be the time it takes for customers to become proficient at using the product (time should go down) and number of requests customer success gets for help around normal tasks (number should go down). A measure for sunsetting would be if the number of customers accessing the knowledge base goes to a trivial number (for example if improvements to the user flow led to customers not needing to consult the knowledge base).

Why would we want these?

The first measure will tell us if we built the right thing with the right knowledge. If the project did not have the intended effect, we can look into what happened - was the product right, but the customer path to the product wrong? Was the product itself wrong and we need to look into better customer analysis?

A lot of things can go into the first measure, and sunsetting projects which did not achieve success can go a long way towards keeping an engineering group focused.

Is the job of a dev to write code?


Writing code is part of the job, but it is not the whole job.

Is it the job of an author to write words? Or is their job to create a book/article/screenplay/etc?

Most developers are employed to create or maintain a software product. Accomplishing this takes many activities, of which writing code represents a significant portion, but it is not the only one. While other roles will engage with several parts of creating a product, lowering the dev’s responsibilities in those areas, there are still many activities beyond coding which fall to developers.

Treating devs as just a matter of writing code pushes us towards ideas like trying to ‘protect’ devs from useful meetings, instead of demanding that all activities, including coding, be looked at to make sure they are efficient and useful.

You have probably heard this adage (or one of its variants):
“If you have five minutes to chop down a tree, spend the first two and a half sharpening your axe”

Chopping down a tree is not about swinging an axe - it is about getting the tree down. As a result, you want each swing to be as precise and useful as possible. This idea can be applied to many things, including the act of delivering software.

A developer is not done just because they wrote code. They are done when the product is delivered (…and then maintenance work begins).

When looking at how developers (and the entire development team) spend their time, the question is not “are they spending too much time not coding?” but “are they spending their time on activities which enable delivery?”

Not all code is written equally (looking at you, code pushed at 2am), and not all meetings are useful. That said, many meetings are useful, just as much code is. However, just as code can be refactored to perform better, so can meetings and other non-code activities.

Everyone should be able to ask of any activity “is this actually helping us deliver or is this wasting our time or is there a better way to do this?”

PRs are a great example of an activity which is not writing code, but most people would agree is part of the job. If you work at a place which uses a branch-PR strategy, there is a good chance several PRs feel like a waste of time, or people feel like they waste time trying to get them reviewed.

There are several ways to accomplish the goals of “code in the product is reviewed for errors” and “knowledge sharing of code in product”, which tend to be the goals of PRs. For instance, pair programming is a useful way to accomplish both without the need for PRs. Ship/Show/Ask + strong CI/CD is another way of changing the activity from one generally dreaded and bottlenecked to a smoother experience.

Note: Reminder that merging code to master is not the same as being done. If your job is to build/deliver a product, you are not done until it is deliverable to the consumer.

This means quality is part of the job. Security is part of the job. Resilience, stability, etc, are all parts of the job.

While all of the above have specialists, this does not absolve developers of responsibility. These specialists guide and enable developers.

A dev is not there to write code. They are there to deliver software.

Imagine this: you work in an app supported by a large network of microservices. You are happily going along trying random things when you suddenly find an error message which makes no sense. You use the information in the message to trace its route through the system. With a bit more testing around the services highlighted by the trace, you provide developers with a detailed report on the conditions which can cause this error.

Now imagine a similar, but slightly different scenario: you work in an app supported by a large network of microservices. You are happily going along trying random things when you suddenly find an error message which makes no sense. The information in the error message is only enough for you to report the last service used in the response, so all you have to provide developers is what you did to cause it and the response received.

Now think about those scenarios again, but with a customer as the receiver of the weird message. In the first scenario, you would be able to perform a detailed analysis of the problematic system. In the second…you would have to hope the issue is easily replicable.

The second scenario is all too common in debugging, whether it be for customer purposes or for QA. It results in more time spent just trying to figure out what happened and less time addressing the problem.

Traceability is the idea that, so long as you have the response, you can trace the path it took through the system. Frequently this is accomplished via a trace ID: search the logs for the ID and you have the full path.

Whenever dealing with any form of layered system, traceability is one of the most important features your team can put into place to make their own lives easier and it absolutely matters for quality. With traceability, you can see how a request moves through the system and determine if the path it follows is the expected one.

If I can take a trace ID and use it to pull every log for every service my request interacted with, we have good traceability.
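A minimal sketch of the idea, with hypothetical service names standing in for real microservices: each "service" logs the trace ID it received and passes it downstream, and the ID is echoed in the response, so a single log search reconstructs the request's full path.

```python
import logging
import uuid
from io import StringIO

# Capture logs in memory so the example is self-contained; in a real system
# these lines would land in your centralized logging platform.
log_output = StringIO()
handler = logging.StreamHandler(log_output)
handler.setFormatter(logging.Formatter("%(message)s"))
log = logging.getLogger("trace-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

def gateway(request):
    # Mint the trace ID at the edge if the caller did not supply one.
    trace_id = request.get("trace_id") or uuid.uuid4().hex
    log.info("trace=%s service=gateway", trace_id)
    return orders(trace_id)

def orders(trace_id):
    log.info("trace=%s service=orders", trace_id)
    return inventory(trace_id)

def inventory(trace_id):
    log.info("trace=%s service=inventory", trace_id)
    # Echo the trace ID in the response so a weird message can be traced later.
    return {"status": "ok", "trace_id": trace_id}

response = gateway({})
# Given only the response, recover every hop the request touched.
path = [line for line in log_output.getvalue().splitlines()
        if response["trace_id"] in line]
```

In practice the ID travels in a header (eg, the W3C `traceparent` header) rather than a function argument, but the property is the same: the response alone is enough to find every log line along the path.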

Short post warning.

Many problems have been solved before. Most problems in the IT industry have, in fact, been solved before. It is highly unlikely you are dealing with a completely unique problem.

Even if prior solutions do not solve your problem the way you need them to, they might give ideas or something to build on to make a more appropriate solution. Look to what has been done, what is standard, and ask why you cannot use it.

Assuming uniqueness is a good way to saddle yourself with something terrible to maintain in the future.

Introducing a new retro format called “Improvement Themes”!

In order to have a more valuable retro with small, actionable items as an outcome, this format takes inspiration from Toyota’s Improvement Kata and consists of small, focused discussions with the goal of finding 3 small action items per topic.

The retro format is broken into 5 timeboxed sections:
Topics to Discuss
(For Each Topic Chosen)
Current State
Ideal State
Next Steps
Assign Action Items

In Topics to Discuss, everyone spends 5 minutes writing down topics to discuss. Ideally, these topics are ones which affected the previous iteration and can be improved. Once everyone is done writing, the group then votes on what to talk about.

Current State focuses on the current state of the project - what the team does, what effects the team is feeling, etc. Many people want to say the lack of something (eg, code reviews) is the current state, so we ask what the effect of the missing item is. This is probably the hardest discussion for people to get the hang of, since we are asking them to separate the effects from causes and potential solutions and just focus on what has been going on. The reason to do this is to give everyone a chance to understand what the problem is before people start solutioning.

Ideal State is when we discuss where we would like to be. Similar to Current State, there is a tendency to skip straight to solutions, or to state that doing a thing is ideal, when we want to focus on what the effect of doing the thing is. An example ideal state for Delivery would be ‘All stories are small enough to be delivered within a single iteration’. While the solutions proposed might be good, we want the entire team to have a common understanding of the ideal state before proposing solutions.

Next Steps looks for 3 small action items completable within an iteration. Since the team should now have a common understanding of the current and ideal states, everyone can work together to come up with solutions. Small action items are ones which can be placed above story work, completed quickly, and should have a positive impact on the team’s ability to work. Bigger action items tend to linger long enough that people forget the context, and they are rarely completed quickly.

Assign Action Items goes the same as usual for other retro formats with an effort to make sure the same person does not have a lot of action items assigned to them.

An example chart for a topic after these discussions:

Every part of the retro is limited to 5 minutes, which produces good discussions and action items for at least 2 topics in an hour-long block. After the retro, you can look into what sort of metrics can track the effects of the action items towards the Ideal State. Topics can be reviewed in future retros or in another meeting when there is a large need (eg, a major issue happened and we want to do a special retro on it).

We review our action items regularly, usually every Monday, as a reminder to the whole team and to give owners a chance to give updates. Likewise, we created a PowerPoint deck, kept in Drive, where every topic has 2 slides - the first contains the descriptions of the current and ideal states, along with the next steps which came out of the last retro. The second contains metrics for our progress towards the ideal state.

I have found these changes result in action items which actually happen and in more focused retros.

You are starting a new project with a new team and a new website. Everyone agrees on practices (such as agile, unit testing, test driven development, everyone should be in the same room for better communication, etc). The team is happy and working together and things are great.

Then the big Questions come up.

“Who should own the automated regression suite? Who should write it?”

Enter chaos.

This tricky question around developing and maintaining automated regression tests can split a team which is otherwise on the same page. Given the tests are code, the obvious option looks like the devs! Yet the tests are there to make QAs’ jobs easier by removing the need to do tedious manual regression! Before you know it, you have a lively debate on your hands about how to handle these tests.

There is a tendency in the industry to espouse the idea that automated regression tests are the Holy Grail of QA and will fix all the problems introduced by manual testing. After all, no one wants to sit there and hit the same buttons in the same sequence over and over again.

Yet, while automated regression tests are useful, there is a large caveat we rarely talk about.

They are a tool, just like any other tool we use. There are good times and bad times to use them. When the tool fits the job, it works well and most people will agree it is useful. When things are not as clear cut, the tool can cause dissonance in a team. Splits form between those who like it and those who do not. When the tool does not fit, teams tend to dislike the tool itself for being a complete pain to work with and vow to never use such a thing again. So, how do we avoid falling into the trap and use the tool appropriately?

First, we need to define what automated regression tests are. A lot of teams will call these tests functional, UI, etc, but these names are also used for different sorts of tests and can lead to much confusion (see What is in a Name). For this article, we are talking about UI-driven full stack tests which aim to replicate a user going through the app and ensure various scenarios are possible.

The second thing which makes these tests difficult to maintain is the question of who the tests are for, and thus who should have ultimate responsibility over them. Are they for devs, to act as higher level tests? Are they for QAs, to avoid doing repetitive manual testing? I know a lot of people would argue they are for both, or that the intended audience is the development team as a whole, but it is my opinion that serving two masters/purposes weakens how much of an effect something can have.

Perhaps this is actually the wrong question. Instead of asking who they are for, let us focus on what the intention is. So, what is the intention of an automated regression suite? In the typical testing pyramid, these tests occupy the top of the pyramid, indicating there should be few of them. There should be few of them because they are usually slow and the technologies powering them tend to be flaky, whereas unit tests tend to be quick and stable.

This gives us a guideline on how many, but not what they are or what their purpose is. Still following the testing pyramid, you usually try to push a test down the pyramid; this indicates the tests on the pyramid are for the same purpose (making sure code changes do not break prior functionality).

Despite this, we also tout these tests as a replacement for rote verification testing. The thing is, the bottom two layers of the testing pyramid are all about whether some piece of logic works - does the code work? QAs test the user interactions - does the code work from the user’s point of view? These are two different sets of tests, also known as white box and black box testing respectively. Automated regression tests are black box tests from the UI, with the only internal interaction happening at the beginning to set up data.

I argue the fact these tests are black box tests indicates they should be kept separate from your lower level tests. Instead of focusing on code coverage and changes, these tests should focus on User Flows.
Ideally, a QA and BA will work together to identify critical business user flows to automate. The primary reason to limit this is that automated regression tests tend to be slow and flaky, so the more of them exist, the less useful they are due to maintenance cost. If you have stable and quick tests, there is nothing wrong with adding more scenarios (going down the priority ladder).

As a result, I believe these tests are owned by the team as a whole, but QAs should have full ownership of what the suite tests. It is a tool to aid them in making their regression as automated as possible.

We are always looking for easy ways to measure ourselves and what we produce. The dream is to have some metric which is automatically calculated, point at it, and say “this proves we (do not) have quality!”

Enter code coverage. Many people promote this metric to satisfy this need, and it is frequently one of the first things implemented when a company decides to care about quality, so let’s examine this claim.

Unit test code coverage can be good. It encourages people to write tests and exercise every line of code in a test suite. Branch coverage can be better by also looking at the logical options. However, just because tests are being written does not mean they are worthwhile tests. Using code coverage to encourage test writing needs to go hand in hand with practices such as reviewing or pairing on tests to share knowledge, and a culture striving for useful tests.

Additionally, code coverage is a weird metric. 100% is not a meaningful number - it ironically does not tell you anything useful. On the other hand, 50% is a very useful number for identifying areas with a high need for tests.
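A hypothetical example of why 100% proves little: the test below exercises every line of the buggy function, so line coverage reports 100%, yet the assertion is too weak to catch the bug.

```python
def apply_discount(price, percent):
    # Bug: subtracts the raw percent instead of a percentage of the price.
    return price - percent

def test_apply_discount():
    # Runs every line of apply_discount, so line coverage is 100%...
    result = apply_discount(200, 10)
    # ...but only checks that *something* came back, so the bug survives.
    assert result is not None

test_apply_discount()  # passes, even though a 10% discount on 200 should be 180
```

The coverage report and the test suite are both green; only a reviewer (or a pairing partner) looking at the assertion itself would notice nothing is actually verified.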

Code coverage is at its strongest when it is used as a tool to target areas which need tests. It is a measure of absence of quality, but not of presence.

In conclusion, code coverage is not a metric which says you have quality. It can be useful and has a place within a quality strategy, but take care to treat it as a proof of absence, not a proof of presence.

Note: UAT, regression, and acceptance tests should not be counted towards code coverage - think of them as QA tools (see other blog post) which aid in reducing the manual effort required (or free up resources away from regression and the like, enabling them to be put towards things like exploratory testing). They have a different focus: not what code is there, but how the code enables users.

It is All The Same

The first step to understanding testing APIs is to understand it is still the same as testing anything else. I have noted the primary differences to keep in mind below, but beyond these you will want to apply the testing principles you would for any other application (eg, performance testing, integration testing).

Identify Users

Users of APIs consist of two parts - the actual consumer (typically another app, such as a mobile app, or a website), and the developers writing the consumers.

Know who your targeted consumers are, the same way you would know your target users.

Acknowledge the User Interface

The user interface for an API consists of 2 primary things - the documentation and the response messages (both good and error).

A developer writing a consuming app needs to be able to use the documentation and response messages to fix any mistakes they have made and know when it is successful - the same way a web site would guide a user through a task.

User Flows

APIs are always written in service of something. There are tasks they enable the users of the consuming products to do. While API calls can be made independently of each other, there still exist primary user flows they enable. Identify these flows and understand how they fit together.

Automated Testing Basics

Automate the flows from the previous section. Put yourself in your users’ shoes and use the documentation and responses to guide your path. If you have to ask the developers who worked on the part you are automating against, that’s a UI bug which should be addressed.

As the users of the API are primarily developers and their apps, automating tasks and interacting with the API the same way they do is one of the most important things you can do.
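A sketch of what this looks like in practice. The tiny create-then-fetch API below is entirely hypothetical (spun up locally so the example is self-contained); the point is the flow test at the bottom, which drives the API purely through requests and responses, exactly as a consuming app would.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class FakeApi(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the API under test."""
    items = {}

    def do_POST(self):
        # POST /items creates an item and returns it with its new ID.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        item_id = str(len(FakeApi.items) + 1)
        FakeApi.items[item_id] = body
        self._reply(201, {"id": item_id, **body})

    def do_GET(self):
        # GET /items/<id> returns the item, or an actionable error message.
        item_id = self.path.rsplit("/", 1)[-1]
        if item_id in FakeApi.items:
            self._reply(200, {"id": item_id, **FakeApi.items[item_id]})
        else:
            self._reply(404, {"error": f"no item with id '{item_id}'"})

    def _reply(self, status, payload):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep the example's output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeApi)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# The user flow: create an item, then read it back - driven only by the
# documented requests and responses, with no peeking at internals.
req = Request(f"{base}/items", data=json.dumps({"name": "widget"}).encode(),
              headers={"Content-Type": "application/json"}, method="POST")
created = json.loads(urlopen(req).read())
fetched = json.loads(urlopen(f"{base}/items/{created['id']}").read())
server.shutdown()
```

If any step of the flow requires knowledge you could not get from the documentation or the previous response, you have found a UI bug in the API.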

Further Resources

API Tester is my attempt to automate myself out of ever doing boilerplate API testing again.

It checks for many common issues with APIs, such as consistent error messaging, by making a variety of calls based on how you tell it the API works (ideally based on the API’s documentation).

While as of writing this API Tester’s DSL is rough to use, the concepts it represents are valuable ones any API Team should consider.

I subscribe to this idea: “Every test has a purpose and every concern has a layer” (if this phrase sounds nonsensical to you, please review the Swiss Cheese Model).
As a result, I hereby propose we purge the phrase “Functional Tests” from our collective vocabulary. It is a phrase which tells you nothing other than that there are tests and they test functionality. Likewise, I propose we stop considering “Integration Tests” as its own test type and instead treat it as an umbrella term for a test category.
What I am getting at is that those two phrases fail to convey meaning. What is the purpose of a test which belongs in the “Functional” suite? How do you know a test belongs there? When implementing functionality, how do you know when you need to write a functional test?

As a result, I think that asking unit vs integration vs functional is asking the wrong question.

Name test suites after their purpose. What is the testing protecting? What concern is the test addressing?
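As a sketch, a purpose-named layout might look like this (the suite names and the dependency are illustrative, not prescriptive):

```
tests/
  unit/          # protect and document code behavior; fast, stable, no internet
  user_flows/    # protect expected user journeys through the system
  contract/      # protect our API contract with a volatile dependency
```

With names like these, each suite answers "what is this protecting?" on sight, and it is obvious where a new test belongs.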

Note, I do usually keep unit, as it is a term the industry recognizes and has some level of agreement on, but I define a unit test as one whose purpose is to protect and document the functionality of code, specifically a line or method. To me, the focus of these tests is the ease of writing, the speed of running, and the stability/reliability of the test. If it needs mocks to achieve that, use mocks. If it needs to be a “social” unit test to achieve that, have it be social. The only exception is that I do not allow internet connectivity in my unit tests, as internet connections always slow them down. Ultimately, I believe unit tests are written for the express purpose of letting developers know when they have made unintended changes to their code, and whatever method allows for the quickest and most reliable test of this is that project’s version of a unit test.

“Functional Tests” is the worst term, as most tests are technically functional tests. Instead ask yourself - is this a user flow test? Am I protecting my own API contract? Is this a usability test?

For all tests, instead of “functional”, name them after their purpose. For example, frequently the tests named “Functional” are actually a variant of “User Flow” tests. Their purpose is to test the ability of the system to handle expected user journeys and interactions. By changing the name to “User Flow” thinking about these tests becomes easier. When implementing functionality, I can ask “what user flows have changed, been introduced, or removed as a result of this?” With that question, I am well armed to make appropriate changes to my User Flow suite in response to a card.

As another example, integration is another umbrella term. Many of these tests are concerned with contracts or usability, or might not even need to exist. If you check the Swiss Cheese Model, you will see that the architecture section highlights that one should identify and characterize integrations. Those characteristics then point to what integration tests you would want. For example, if you have a dependency which is under development and volatile, you might consider a contract test to ensure that development does not break your project. Another example: if your dependency has frequent downtime, you might want to consider some form of monitoring or heartbeat. Finally, many integration tests are written without a purpose beyond “we integrate, therefore integration test”. These tests take time away from what could be useful tests. Identify your concerns and test those. Testing for the sake of testing helps no one.

It is much easier to enforce test boundary via thinking of test purpose first.

Every test has a purpose and every concern has a layer.

Each test suite represents a layer (and things other than test suites can be layers), so make your layers based on what sort of concerns/problems you expect each layer to be able to catch.

*I recognize the irony of this post given this website has a (not yet fleshed out) Test Types section. That section is intended to demonstrate various test types from a purpose view point to give inspiration and examples for use in developing a test plan.