Our Large Scale DevOps Transformation

Our Large Scale DevOps Transformation

Over the past three and a half years, Visma has worked in a focused and structured way to modernize how we design, develop, deliver and operate our cloud services.

Where we previously released a new version of our cloud services every six weeks on average, we now have many teams who continuously deliver new functionality to our customers, several times a week, and even several times a day.

We have about 40 cross-functional autonomous development teams (we call them Service Delivery Teams) that have full responsibility for both development and operations. With an average of about 9 people in each team, the Service Delivery Team is fully responsible for design, development (including testing), delivery (deployment), monitoring, infrastructure (this term includes both IaaS, PaaS and other resources the service depends on), handling incidents, and everything in between.

In our experience, having all the relevant experts working together in one team which is responsible for “everything”, is not just more efficient (because there are fewer dependencies on other parts of the organization), but it also results in higher quality cloud services (primarily because multiple types of competence can be utilized throughout the development process).

This transformation did not come easily or quickly. It required us to organize ourselves differently and radically change how we think and how we work. New technology provided by public cloud platforms like Amazon Web Services and Microsoft Azure, has been essential in enabling us to work how we want to work. So how did this get started and how did we get here?

Origins

In the spring of 2015, a small group of people from different parts of Visma gathered in a workshop in Oslo, Norway. Consciously disregarding the current state of affairs (meaning how we were organized and used to work at the time), we tried to describe what we thought would be the ideal way of building and delivering successful cloud services. Starting from a clean slate, how do we wish we could be organized, and how do we wish we could work?

We did not phrase it like this at the time, but at the end of the initial workshops, we had described a way of working that continue to live on today as our core principles: DevOps, Continuous Delivery and Public Cloud.

DevOps

We wanted to break down any resemblance of silos between development and operations, so we developed the concept of a “Service Delivery Team”, which is similar to a development team, but includes the operational competence necessary for the team to take full responsibility, for both development and operations. This way, development can take advantage of operational competence, and operations can take advantage of development competence, on a daily basis, early in the process – not as an afterthought.

We gave this type of team a new name to underline that they were different from traditional development teams. The responsibility of a Service Delivery Team is not just to “develop a product”, but to continuously deliver a good service to the customers. A great product is not very valuable if it’s unavailable, doesn’t perform well or delivers a bad user experience.

Moving from a siloed way of working to cross-functional teams with aligned incentives

Monitoring

In order to be more agile, we focused heavily on monitoring so that we could better understand how our cloud services behave and how they are used. Using monitoring tools to gain insights that steer our development process is critical to making sure we spend time on features that are valuable to customers.

Note that a Service Delivery Team is responsible for designing and implementing monitoring on all different levels (e.g. infrastructure, application, user experience) as well as reacting to alerts and extracting valuable insights.

Automation

Being capable of making small changes to our cloud services, learning from the results of those changes and repeating this cycle frequently became a high priority – but this had many other implications. It meant that we needed to focus a lot more on automated testing, as there would be a lot less time to do manual testing if we were going to release a new version on a daily basis. It meant that the release process itself had to be fully automated. Even automated infrastructure (infrastructure as code) became a requirement for us.

This meant that infrastructure changes could be automatically applied and verified (similar to application changes), they could be code reviewed (to catch “bugs” and share knowledge), but also that our disaster recovery procedures became more efficient and less error-prone.

Zero downtime

Instead of planning downtime in a scheduled maintenance window for releasing a new version of a cloud service, we needed to be capable of releasing new versions without impacting the end-users. We refer to this as zero-downtime deployment, and nowadays we support this as a matter of course, and regularly release new versions of our cloud services inside of office hours. If something should go wrong once in a blue moon, all the relevant experts who can help, are at work.

Public Cloud

Many of the nonfunctional requirements we think are important, are easier to implement on one of the large public cloud platforms, than in less mature cloud platforms. The total self-service nature of public cloud means that a team can more effectively take full responsibility for everything related to infrastructure and operations. The consumption-based pricing and detailed cost insights makes it easier than ever for teams to experiment and make cost effective decisions when it comes to architecture and operational issues. Mature APIs and other automation functionality makes it easier to automate in many different areas, be it infrastructure provisioning and configuration (infrastructure as code), deployment, self-healing architecture, disaster recovery, or elastic scaling (meaning continuously and automatically matching provisioned resources with required resources). This translates into more efficient teams and more cost efficient and reliable cloud services.

Onboarding

After a few initial workshops in the spring of 2015 we had an initial draft of what we called the Visma Cloud Delivery Model (VCDM) which described some core principles, how we should be organized, how we should work and some technical requirements we thought were important for successful cloud service delivery. We also developed an onboarding process, the primary tool for actually implementing VCDM in our organization. The purpose of the onboarding process is to start with any team, introduce them to the VCDM principles, processes and requirements, and by the end of it, have them work in full alignment with the model.

One of the first activities for an onboarding team is to ensure they are able to fill the different roles we require in all VCDM Service Delivery Teams. This includes primary (~full time) roles like VCDM Service Owner (ultimately accountable for “everything” regarding their service), VCDM Infrastructure Engineer (primary operations specialist), VCDM Developers and VCDM Business Analysts as well as part-time roles that typically are combined with a primary role, like VCDM Agile Lead, VCDM Security Engineer and VCDM Incident Coordinator to name a few of them.

Then, the teams are introduced to different processes by dedicated VCDM Process Guides. As an example, a Service Delivery Team that is onboarding to VCDM, will spend one hour with the VCDM Guide for Testing to ensure a common understanding in terms of how to think about and work with automated testing, exploratory testing, performance testing, etc. There are similar sessions for development, release, monitoring, security and incident handling. VCDM Guides may assist teams beyond these introductory sessions, if needed.

Gap analysis

Another important part of the onboarding process, are the self-assessments. These are continuously updated documents that function as gap analyses for the teams. Teams that onboard to VCDM will do four self-evaluations of their own maturity. One self-assessment is focused on Continuous Delivery practices, one is focused on service reliability and performance, one is focused on service security and one is about technical debt management. Why technical debt management? If teams take on technical debt, it should be a conscious decision. All technical debt should be identified, documented and managed according to the associated risks.

Some requirements in these self-assessments are about the team and how they work, while others are about the cloud service and how it works. The teams analyze where they are today compared to where they should be according to VCDM requirements. Significant deviations have to be resolved by the teams as part of the onboarding process, and this is typically what takes the most time. In our experience, the onboarding process can take anywhere from one month to a year, depending on the maturity of the team and their service.

The onboarding process ensures a minimum maturity level for all teams and services that are aligned with the model. This is how we can be confident that a VCDM onboarded service is secure, reliable, and performs well. It is how we can be confident that a VCDM Service Delivery Team is capable of efficiently developing and delivering a high quality cloud service, and that they are capable of handling any unforeseen incidents that might happen.

Alignment

Visma has a tendency to acquire other companies, and VCDM is an excellent alignment tool for us in terms of processes, requirements, mindset and vocabulary. As an example, all VCDM teams handle incidents in the same way. They use the same template for incident reports, and they can all be found in the same place. Similarly, all VCDM services use the same framework for disaster recovery planning and testing. Because of this alignment, we are able to gather the same type of data across all VCDM teams and cloud services.

Based on this data we have developed different maturity indexes, that measure the maturity of VCDM teams and cloud services within different areas. Currently we have one index for security, one for Continuous Delivery and DevOps, and one for architecture and technology. Everyone in Visma has access to these indexes which are calculated on a daily basis, and both the teams themselves as well as their stakeholders pay attention to them which encourages continuous improvement.

Piloting

A couple of months after the first workshop, we had a model and an onboarding process that was ready to be piloted. Out of three initial pilot teams, two of them completed onboarding in four months, meaning that they were now working in alignment with the model. One team had to back out from the pilot phase, as they were not mature enough to complete onboarding within a reasonable time frame.

We saw that it was very difficult to fulfill the VCDM requirements in a traditional private cloud, with less self-service and automation possibilities. In the four month pilot period we continued to make changes to the model and the onboarding process, and we have continued to do so ever since. It’s important for us to continuously improve and keep VCDM up to date.

Implementing and extracting value from Continuous Delivery, DevOps and Cloud Technology

Today

We are now around 12 people who work on continuously updating and improving the Visma Cloud Delivery Model as well as guiding teams through the onboarding process. We are continuously gathering feedback from the onboarded teams on how we can improve, striving to minimize any overhead and maximize the value of the model for the teams and for Visma. To that end, we have a structured retrospective process, and role-based competence foras that ensure experience sharing and organizational learning. We also use these foras to inform about and ensure adoption of changes to the model.

It has been quite the journey since the spring of 2015. About 40 teams have onboarded to the model so far, and there are at least 40 more in the pipeline. I’m excited to see where we will be in another three and a half years.

Alexander spoke about the Visma Cloud Delivery Model at DevOpsDays Oslo 2018, October 29-30. You can see his presentation below:

Alexander Lystad is the Chief Cloud Architect in Visma Enterprise. He has been with Visma for 6 years, implementing and extracting value from Continuous Delivery, DevOps and Cloud Technology.
Connect with Alexander: