Day 2 after going live. Monitoring? Backups? Maintainability? Crazy talk!?!
Only thinking of these things after you have gone live? You've probably already shot yourself in the proverbial foot. Thinking about how your Next Amazing Thing™ is going to fare in the real world of inherently unreliable infrastructure, flawed implementation and unpredictable external factors (yes, that includes users 👹) should start at the moment of product inception.
I'm going to sound like a fun-sponge here, but I think this stuff is important. Unfortunately, I don't know whether the emergence of new methodologies / ideologies / tools has helped the situation either. Sometimes I think the new tools at our disposal make things a bit too easy, making us complacent about what we are nurturing. Just because the Next Amazing Thing™ has a pretty smile and can be deployed auto-magically within minutes doesn't mean it doesn't have a mean bite when things go wrong.
I suppose what I'm getting at here is this: it may sound boring, but don't forget the basics.
I am lucky to work for a large IT consultancy. I am surrounded by experienced IT veterans who really have seen it all before. The vast majority of what we do hasn't changed in the last 3-4 decades; it is just badged/framed/interpreted differently. Not everyone is this lucky though.
If only someone had collated thorough, up-to-date information on how to run a product in the modern IT age. Oh yes, they have. A small little startup you may have heard of - AWS. The AWS Well-Architected Framework is your one-stop shop for what you should be doing (or have a good plan for) by Day 2.
"I don't run my services on AWS", you say? That's fine - a lot of this framework applies outside of AWS too. This article will cover some of the high-level concepts that I find really noteworthy - I'm not going to attempt to replicate all the content of the framework. Even if you don't use AWS, I highly recommend setting up an account and using their free assessment tool to audit your solution.
Operating a solution in the real world
Do I really know what's going on with my solution?
- Do you capture and keep logs? (Archive to S3, back up elsewhere)
- Are your application errors handled appropriately, with details being retained, alerted on and analysed? (Sentry, CloudWatch Custom Metrics)
- At a crunch point, do you have a reliable way to log in and view logs?
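As a rough illustration of "retaining the details", here is a minimal sketch using only Python's standard `logging` module. It writes one JSON object per line, including the stack trace when an exception is logged, which tends to be easier to archive and search later. The names `JsonFormatter` and `build_logger` and the field names are my own assumptions for the example, not part of any particular tool; shipping the output to S3, Sentry or CloudWatch is left out.

```python
import json
import logging
import traceback
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Keep the full stack trace so the error can be analysed later.
        if record.exc_info:
            entry["exception"] = "".join(traceback.format_exception(*record.exc_info))
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Hypothetical helper: a logger that emits JSON lines (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    try:
        1 / 0
    except ZeroDivisionError:
        # logger.exception records the message at ERROR level plus the traceback.
        log.exception("order total calculation failed")
```

One JSON object per line is a deliberate choice here: most log shippers and search tools can ingest it without custom parsing.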
Do I have a way to fix an issue quickly and reliably?
- Do you manage your source control system in a way that allows you to issue patches rapidly for code issues? (Short lived branches, clear branching strategy, good test coverage, automated deployment capabilities)
How do I know my solution is performing efficiently?
- This is a bit tougher, but do you have an appreciation of the underlying platform (e.g. AWS IaaS, Heroku) to know where you might have performance bottlenecks?
- Do you log timestamps for time-sensitive operations?
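One low-effort way to get timing data for time-sensitive operations is a small decorator that logs how long a call took. This is only a sketch: the `timed` decorator, the `generate_report` function and the `operation=... duration_ms=...` log format are all made up for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("timing")


def timed(operation):
    """Log the wall-clock duration of the wrapped function under a given name."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on failure, so slow errors are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("operation=%s duration_ms=%.1f", operation, elapsed_ms)

        return wrapper

    return decorator


@timed("generate_report")
def generate_report(rows):
    """Stand-in for a real time-sensitive operation."""
    return sum(rows)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    generate_report(range(1_000_000))
```

Once durations are in your logs you can at least see a slowdown after the fact, even without a full metrics stack.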
Living in a hostile world
Can I easily control and clearly see who has access to my system?
- If using a platform like AWS, best practice should be followed - Root Account locked down, multi-factor authentication on IAM users, use IAM roles attached to EC2 instances over embedding keys in code.
- Do you use a role-based model for allocating permissions to users? (this applies to any system, not just AWS or similar)
- Do you have 'a single source of truth' where possible? (as opposed to synchronising users between systems)
- If you manage your own network equipment, is that tied into your single source of truth? (RADIUS or TACACS)
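To make the role-based idea concrete, here is a toy sketch: permissions hang off roles, users only ever get roles, and one lookup answers "is this allowed?". The role names, permission strings and users are invented for the example; in a real system this data would come from your single source of truth rather than hard-coded dictionaries.

```python
# Hypothetical roles and permissions, purely for illustration.
ROLE_PERMISSIONS = {
    "viewer": {"logs:read"},
    "operator": {"logs:read", "deploy:run"},
    "admin": {"logs:read", "deploy:run", "users:manage"},
}

# Users are assigned roles, never individual permissions.
USER_ROLES = {
    "alice": {"admin"},
    "bob": {"viewer"},
}


def permissions_for(user):
    """Union of the permissions granted by every role the user holds."""
    granted = set()
    for role in USER_ROLES.get(user, set()):
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def is_allowed(user, permission):
    """Unknown users and unknown roles get nothing (deny by default)."""
    return permission in permissions_for(user)


if __name__ == "__main__":
    print(is_allowed("alice", "users:manage"))
    print(is_allowed("bob", "deploy:run"))
```

The benefit is that answering "who can deploy?" becomes a question about one table of roles, not an audit of every user.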
How do I manage security events?
- Do you get alerts for suspected security events?
- Do you have sensible thresholds before locking out accounts?
- Security events aren't just the typical 'hacker incursion / breach' - taken holistically, security is a triangle of Availability (is your system up when your users need it?), Confidentiality (does your system leak data?) and Integrity (is your system resistant to accidental and malicious corruption?)
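For the lockout threshold question, one possible approach (a sketch, not a recommendation of specific numbers) is to count failed logins inside a sliding time window, so that a typo from last week doesn't count against you today. The class name `LoginThrottle` and the defaults of 5 failures in 300 seconds are assumptions picked for the example.

```python
from collections import deque


class LoginThrottle:
    """Lock an account after too many failed logins within a time window."""

    def __init__(self, max_failures=5, window_seconds=300):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = {}  # account -> deque of failure timestamps (seconds)

    def record_failure(self, account, now):
        attempts = self._failures.setdefault(account, deque())
        attempts.append(now)
        self._prune(attempts, now)

    def record_success(self, account):
        # A successful login clears the failure history.
        self._failures.pop(account, None)

    def is_locked(self, account, now):
        attempts = self._failures.get(account)
        if attempts is None:
            return False
        self._prune(attempts, now)
        return len(attempts) >= self.max_failures

    def _prune(self, attempts, now):
        # Drop failures that have aged out of the window.
        while attempts and now - attempts[0] > self.window_seconds:
            attempts.popleft()


if __name__ == "__main__":
    throttle = LoginThrottle(max_failures=3, window_seconds=60)
    for second in (0, 10, 20):
        throttle.record_failure("alice", now=second)
    print(throttle.is_locked("alice", now=25))
    print(throttle.is_locked("alice", now=100))
```

Passing `now` in explicitly keeps the sketch easy to test; in practice you would feed it `time.time()` and probably raise an alert when an account locks.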
How do I protect my solution?
- Do you have a clear way of managing cryptographic material? (personal SSH keys, TLS private keys, GPG signing keys)
Expect the unexpected
Do I know the impact of the data I process?
- Living in the EU? GDPR will dictate that you should know this inside out. Email addresses, telephone numbers and real-world addresses are all personal information that needs to be handled appropriately.
- What would happen if your data was breached? Would it cause harm?
How would my solution tolerate environment or component failure?
- All systems will fail. Operating your system reliably will depend on your architecture, but consider: redundancy (active/active or active/passive), autoscaling (cloud), clustering and automated recovery.
- Do you know when a component has failed? What fault domains comprise your solution?
- Do you depend on something or someone outside of your control to facilitate disaster recovery or failover? (Hint: You shouldn't).
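A very simple way to start thinking about fault domains is to write down where each replica of each service lives and flag anything that sits entirely inside one domain. The sketch below does just that; the service names and zone labels are hypothetical, and a "domain" could equally be a rack, a region or a third-party provider.

```python
def single_points_of_failure(placements):
    """Return services whose replicas all share one fault domain.

    placements maps a service name to a list of fault domains,
    one entry per replica. Services with no replicas are flagged with None.
    """
    risky = {}
    for service, domains in placements.items():
        distinct = set(domains)
        if len(distinct) <= 1:
            risky[service] = next(iter(distinct), None)
    return risky


if __name__ == "__main__":
    # Hypothetical layout: "db" has two replicas, but both are in zone-a.
    layout = {
        "web": ["zone-a", "zone-b"],
        "db": ["zone-a", "zone-a"],
        "cache": ["zone-b"],
    }
    print(single_points_of_failure(layout))
```

It won't catch subtle shared dependencies (the same DNS provider, the same deploy pipeline), but it is a cheap first pass.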
How do I back up information?
- Do you back up your information regularly?
- When did you last test your restore capability?
- How do you protect your backups from loss?
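Testing a restore can start small: restore a file somewhere safe and check it is byte-for-byte what you backed up. Here is a minimal sketch that compares SHA-256 checksums using only the standard library. The function names are my own, and it assumes you can get the original and the restored copy onto the same machine as plain files.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Checksum a file in chunks so large backups don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

For a database you would more likely restore into a scratch instance and run a few sanity queries, but the principle is the same: a backup you have never restored is only a hope.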
Around every corner is a beancounter
Is what I have designed cost effective to procure and operate?
- Have you considered and ranked options for how your solution will be hosted and the solutions you will use?
- Is your solution overkill for what you actually need?
- Do you know what under-utilisation looks like and what your costs are going towards?
- If using a cloud provider, is your solution elastic, to avoid wastage?
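If you want a crude number for under-utilisation, multiply what a resource costs by the share of it that sits idle. The sketch below does exactly that; the function name and the figures in the usage example are made up, and it deliberately ignores things like reserved pricing or the headroom you genuinely need for peaks.

```python
def idle_spend(monthly_cost, average_utilisation):
    """Rough monthly spend on unused capacity.

    average_utilisation is a fraction between 0.0 and 1.0.
    """
    if not 0.0 <= average_utilisation <= 1.0:
        raise ValueError("average_utilisation must be between 0.0 and 1.0")
    return monthly_cost * (1.0 - average_utilisation)


if __name__ == "__main__":
    # Hypothetical figures: a 400/month server averaging 25% utilisation.
    print(idle_spend(400.0, 0.25))
```

Some idle capacity is a sensible price for resilience; the point is to know the figure rather than to drive it to zero.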
Some of this might look like overkill to some. It may well be, but if I really boil it down:
Ensure that you keep your logs, implement security recommendations for your platform, backup your data, and understand how your application could fail (and mitigate where possible).
What do you think? Comments as ever welcome below. That's all, thanks for reading!
Fault Domain Analysis - https://lethain.com//fault-domains/
An example Git branching model - https://docs.gitlab.com/ee/workflow/gitlab_flow.html