DevOps is a trendy topic these days – you’ve probably noticed many references to it in articles, tech conference talks, videocasts, training courses and so on. General awareness in this area is on the rise, which is not surprising: modern systems are becoming more and more complex, with vast amounts of data to process and many users accessing them simultaneously. Most importantly, modern systems have to be open and flexible to change while meeting the demands of an always-on, 24/7 working environment. Without this commitment there are losses – of revenue, reputation… or in some cases lives!
So, what is DevOps and how does it help? Well, there is no concrete definition for DevOps – in fact, if you search Wikipedia you’ll find:
“DevOps is a set of software development practices that combines software development (Dev) and information technology operations (Ops) to shorten the systems development life cycle while delivering features, fixes, and updates frequently in close alignment with business objectives.” – https://en.wikipedia.org/wiki/DevOps
And while this definition offers a very broad overview of what DevOps is, there’s so much more to it, and so many more questions than the basic ones like:
Q: Are you doing DevOps? A: Oh, yes we are! We have Jenkins in place for our Dev/QA environments. All we have to do is hit a button and watch the magic happen!
Q: Does it also extend to the Production environment? A: No, our customer’s security policy does not allow us to do so.
Having a Continuous Integration tool, or a shiny new system version shipped to a production environment, is only the start of what’s involved. DevOps is more of an organisational change: the company is an engine that needs all of its parts and components running correctly, and that needs to be maintained and inspected regularly to guard against potential issues.
So, how can DevOps help with this? How can we move towards something that will fit our needs and tackle problems in a better manner? I’ve provided a compact guide below:
* Define ops and development standards in your Organisation (if you haven’t already!) and follow them – apply the KISS (Keep It Simple, Stupid) principle and try not to re-invent the wheel. Be aware that complex solutions have longer learning curves and are harder to maintain.
* Follow best practices – if you have a problem you need to tackle, there is a high chance that someone has faced it already. Don’t be afraid to ask questions and learn from other people’s experience – whether inside or outside of your Organisation.
* Don’t lock yourself into a specific stack/technology. There are no perfect solutions, but some are better at dealing with specific problems than others. Always try to seek alternatives.
* Create a Technology radar for your Organisation that will help you track new technologies – in IT everything is changing fast so you need to plan ahead and be prepared for introducing changes.
* Try to automate any repeatable manual work – deployments, testing, daily checks – you name it. This will not only save a lot of time but also prevent mistakes caused by human error.
* In order for automation to work correctly, enforce standardised environments – all fixes and patches should be applied across DEV/QA/UAT/PROD.
* Is there an issue each time you request a new environment? Adopt and implement Infrastructure as Code using dedicated tools/platforms.
* Once you have automation in place – focus on reducing its error and failure rate – this will get you up to speed in no time.
* People are the true value of every IT Organisation. Be aware of that and trust them. Don’t micromanage your team, as this leads to problems. Remember that good teams can be fully autonomous and can manage on their own.
* Try to balance team workloads and be aware that multitasking can do more harm than good. Apply WIP (Work in Progress) limits if necessary.
* Support people and teams during failures – everyone makes mistakes. The key is to learn from them and put measures in place so they do not happen again.
* Create an on-boarding plan for new Employees – not everyone has the same knowledge level, but steps should be taken to close the gap.
* Move towards Agile methodologies – Waterfall is an enemy of DevOps and causes issues when applying some of its best practices.
* Increase Deployment frequency – if you only have Continuous Integration in place, try expanding it into Continuous Delivery and ultimately into Continuous Deployment.
* Follow deployment patterns that can be adapted to your Product.
* Define a standardised and transparent process to be followed by your Organisation – by default it should have as little overhead as possible and a quick turnaround.
* Define and follow a Release management process for provisioning your systems.
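To make the point about standardised environments a bit more tangible, here is a minimal sketch of a drift check. It assumes you can already export the set of patches applied to each environment; the `applied_patches` inventory, the patch identifiers and the `find_drift` helper are all hypothetical and only serve as an illustration.

```python
from typing import Dict, Set

# Hypothetical inventory: patches applied per environment (illustrative data only).
applied_patches: Dict[str, Set[str]] = {
    "DEV": {"patch-001", "patch-002", "patch-003"},
    "QA": {"patch-001", "patch-002", "patch-003"},
    "UAT": {"patch-001", "patch-002"},
    "PROD": {"patch-001"},
}


def find_drift(patches_by_env: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per environment, the patches it lacks that some other environment has."""
    # Union of every patch seen in any environment.
    all_patches: Set[str] = set().union(*patches_by_env.values())
    drift: Dict[str, Set[str]] = {}
    for env, patches in patches_by_env.items():
        missing = all_patches - patches
        if missing:  # only report environments that are actually behind
            drift[env] = missing
    return drift


for env, missing in find_drift(applied_patches).items():
    print(f"{env} is missing: {sorted(missing)}")
```

A check like this could run as one of the automated daily checks mentioned above, so that differences between DEV/QA/UAT/PROD are spotted before they break an automated deployment.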
Monitoring and Alerts:
* Watch your systems as they work to gain insights and to prevent or predict possible failures/issues at either the system or the infrastructure level.
* Implement an alerting mechanism so that you can act on issues before they turn into failures.
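One common way to get warned before something actually breaks is to use two thresholds: a warning level that leaves time to react and a critical level that needs immediate action. The sketch below assumes a metric where higher values are worse (for example disk usage in percent); the `Level` enum, the `evaluate` function and the threshold values are hypothetical and not taken from any particular monitoring tool.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def evaluate(value: float, warn_at: float, critical_at: float) -> Level:
    """Classify a metric reading where higher values are worse."""
    if warn_at > critical_at:
        raise ValueError("warn_at must not exceed critical_at")
    if value >= critical_at:
        return Level.CRITICAL
    if value >= warn_at:
        return Level.WARNING  # early signal: still time to mitigate
    return Level.OK


# Illustrative readings for a disk-usage metric (percent used).
for reading in (72.0, 85.0, 97.0):
    print(reading, evaluate(reading, warn_at=80.0, critical_at=95.0).value)
```

The gap between the two thresholds is the time you buy yourself, so it is worth tuning it per metric rather than using one pair of values everywhere.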
What to do if your Production system is down or facing a serious issue?
* The key rule for most production issues is to mitigate first so that operations can continue. A ‘proper’ fix can be applied a little later.
* All services need to be recovered based on impact. Look into what’s most critical first to get you up and running.
* It’s very important to record all evidence related to the issue (times, logs etc.) as this will be used for Root Cause Analysis later on.
* When facing issues, communication is key: it needs to be timely and always transparent – do not try to hide anything.
* Don’t cause panic – communicate the issue only to the affected parties.
* Separate external and internal communication – customers usually won’t understand tech talk so the information provided to them needs to be clear and understandable. And consider ChatOps as a good solution to have one centralised place for internal communication.
* Have an ‘Incident Manager’ in place who will coordinate the response.
* It’s good to have real-time insight into your system, otherwise your support might get overwhelmed with questions like ‘When will it start working?’ – a dedicated health check/incident page is useful here.
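As a small illustration of recovering services based on impact, the sketch below orders affected services so the most critical ones come first. The `Service` class, the impact scale (1 = most critical) and the service names are assumptions made up for this example; in practice the impact rating would come from your own service catalogue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    impact: int  # hypothetical scale: 1 = most critical, larger = less critical


def recovery_order(services: List[Service]) -> List[Service]:
    """Sort affected services by impact; ties are broken by name for a stable plan."""
    return sorted(services, key=lambda s: (s.impact, s.name))


# Illustrative list of services affected by an incident.
affected = [
    Service("reporting", impact=3),
    Service("payments", impact=1),
    Service("search", impact=2),
    Service("login", impact=1),
]

for position, service in enumerate(recovery_order(affected), start=1):
    print(position, service.name)
```

Agreeing on the impact ratings before an incident happens is the important part; during an outage nobody has time to debate which service matters most.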
Now that we’ve mitigated the issue and restored initial functionality – what’s next?
Root Cause Analysis:
* Perform accurate Root Cause Analysis (RCA) – go through recorded timelines, think of lessons learned and a possible future improvement plan to ensure the issue does not happen again.
* Create a Post Mortem document that is clear and accessible to everyone.
* Review your Post Mortems regularly to check whether you keep falling into the same rabbit holes.
* It is very important that the RCA and the Post Mortem are done with a blameless approach – everyone can make mistakes, and if you approach the analysis as a witch hunt, your team is likely to hold back important details that are crucial to it.
* Create an improvement plan – this should be initiated from the Root Cause Analysis document.
* Track your improvement proposals so these get implemented properly – regardless of whether this is a process modification, monitoring improvement, code fix or Architectural change.
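Reviewing Post Mortems for repeated problems can be partly automated once each document records its root cause in a consistent way. The sketch below assumes every Post Mortem is available as a record with a `root_cause` field; the record layout, the sample data and the `recurring_causes` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def recurring_causes(post_mortems: List[Dict[str, str]], min_count: int = 2) -> Dict[str, int]:
    """Return root causes that appear in at least `min_count` Post Mortems."""
    counts = Counter(pm["root_cause"] for pm in post_mortems)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Illustrative Post Mortem records (not real incidents).
history = [
    {"id": "PM-1", "root_cause": "expired certificate"},
    {"id": "PM-2", "root_cause": "disk full"},
    {"id": "PM-3", "root_cause": "expired certificate"},
    {"id": "PM-4", "root_cause": "manual config change"},
]

print(recurring_causes(history))
```

Any cause that shows up more than once is a hint that the earlier improvement plan was either not implemented or did not address the real problem.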
Great, what’s next?
So much to cover, right? Do you have to do all of this from the start? No! Improve on what you have. Start small, fail often and learn from it, adapt, adjust, improve! Improve! Improve!