Architecting in the Cloud: Taking on the GCP Dress4Win Case Study - Part 1
One thing I wasn't so keen on in the AWS Certification exam was that the case studies used in questions seemed quite shallow (mainly because they don't carry across questions). Google has taken a different approach, providing three case studies that feature in the exam. This allows for a deeper dive and really shows the breadth of skills required of an architect.
This longer-than-usual post is me trying to work through the case study, detailing how I would approach the situation using AWS (I wouldn't like to claim I know GCP well enough, yet!). Play along and think about how you would approach it.
The full case study can be read at https://cloud.google.com/certification/guides/cloud-architect/casestudy-dress4win-rev2. To summarise, the company provides a web application and mobile app to "allow users to manage their wardrobe" and "connect them to relevant designers and retailers". They have a multi-tiered environment with Database, Compute, Messaging and Big Data Processing.
From my consultancy experience, technology is not the only barrier or driver to change (and is regularly the least troublesome). So, in true rebellious thinking, let's pick apart the case study from beginning to end and analyse the business drivers. In their executive statement:
"concerned about our ability to scale and contain costs with our current infrastructure" - concerns around keeping up with business growth while maintaining cost control.
"traffic patterns are highest in the mornings and weekend evenings; during other times, 80% of our capacity is sitting idle." - the business has identified that they have significant waste in their current infrastructure.
"Our capital expenditure is now exceeding our quarterly projections. Migrating to the cloud will likely cause an initial increase in spending, but we expect to fully transition before our next hardware refresh cycle." - capital expenditure is financially tough on a business. It is tough to see large amounts of cash go, in one go, towards something that is depreciating in value before it has even been turned on.
"For the first phase of their migration to the cloud, Dress4Win is moving their development and test environments." - a safe but understandable way to move towards cloud. By learning the pitfalls of cloud they will be better equipped to move their production environment - the heart of their business.
"They are also building a disaster recovery site, because their current infrastructure is at a single location." - building DR using public cloud means that elasticity can be well and truly exploited here. Depending on their architecture, a 'pilot-light' DR solution would allow a very cost-effective deployment to be running, waiting for when it will be needed (DR is not a case of if, it is a case of when!).
The main business driver for this migration to the cloud appears to be cost efficiency, followed by the wish to increase agility to keep up with their competitors. Is cloud the right choice at this point? On balance, yes. Their desire to innovate in how they create and use their non-production environments will be enabled by cloud, along with building a cost-effective DR solution for their production system.
Platform & Tooling
Dress4Win wishes to move to the cloud, and AWS seems as good a choice as any. It has the dominant market share, meaning that expertise is easier to obtain compared to other platforms.
There is a strong desire to improve agility within the development cycle for Dress4Win. The most effective drivers here would be the use of IaC tooling and ensuring that flow is maximised by applying DevOps principles and tooling.
IaC is a rapidly growing space. I am a fan of Terraform over vendor-specific solutions such as CloudFormation. Because it is multi-vendor, there is a growing skill base and a large number of organisations supporting this toolset.
Dress4Win explicitly stated they wish to introduce a CD pipeline. Although their current Java based solution could be introduced into a CD flow using a tool like Ansible, containerising their solution may be a better approach.
AWS Solution Approach
At a high level, AWS shall be used to host functionally representative development environments and the DR environment.
To separate the differing environments, a multi-account setup shall be employed:
- Master / Billing & Security Account,
- Pre-Production / NLE,
- Production / DR.
Each account shall have one or more VPCs to separate each environment.
A conservative approach will be taken, starting with a replatforming exercise, and recommendations made for transformation to use cloud native products where appropriate.
- All logs will be collated into CloudWatch (including Docker as part of transformation),
- X-Ray integrated for application level monitoring.
- Databases shall be backed up on a daily schedule using features built into RDS,
- Big Data cluster data is persisted on S3, no further backup beyond this shall be provided. Sane permissions management will be utilised to reduce risk of accidental deletion.
VPC / Network Architecture
Each environment shall have 1 VPC. Each VPC shall have subnets deployed into 2 availability zones (3 for DR). Subnets shall exist for;
Connectivity into the development environments shall be via EC2 Instance Connect instead of via a bastion host.
DR shall be managed via EC2 Instance Connect as well. Connectivity to on-premise for data replication shall be via VPN Gateway, using IPsec to connect to Dress4Win's on-premise deployment.
MySQL provides the core RDBMS backend for the Dress4Win solution. The decision here is whether to deploy MySQL on an EC2 instance via script, or to use RDS.
RDS certainly has operational advantages; however, it does come at a cost. Compared with an equivalent EC2 t3.micro instance, RDS is nearly twice as expensive. Much of RDS' advantage is about streamlining the ongoing operation of an RDBMS, and there is an implication that development environments may be short lived. Assuming that a consistent MySQL configuration profile can be established, I propose the usage of EC2 with a scripted MySQL install.
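As a rough illustration of the "subnets in 2 availability zones (3 for DR)" layout, the sketch below carves one subnet per tier per availability zone out of a VPC range using only Python's standard library. The CIDR ranges, the availability zone names and the tier names are all assumptions made for the example; in particular the tier names are placeholders, not the design's actual subnet list.

```python
import ipaddress


def plan_subnets(vpc_cidr, az_names, tiers, new_prefix=24):
    """Assign one subnet per (tier, AZ) pair, carved in order from the VPC range."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = network.subnets(new_prefix=new_prefix)  # generator of equal-sized blocks
    plan = {}
    for tier in tiers:
        for az in az_names:
            plan[(tier, az)] = str(next(blocks))
    return plan


# Hypothetical values: none of these come from the case study.
TIERS = ["public", "application", "data"]
dev_plan = plan_subnets("10.10.0.0/16", ["eu-west-1a", "eu-west-1b"], TIERS)
dr_plan = plan_subnets(
    "10.20.0.0/16", ["eu-west-1a", "eu-west-1b", "eu-west-1c"], TIERS
)

for (tier, az), cidr in dev_plan.items():
    print(f"{tier:12} {az}  {cidr}")
```

Keeping the plan as plain data makes it easy to feed the same layout into whichever IaC tool ends up creating the subnets.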
Web Application Servers
In the brief they detailed their ambition to implement a CD pipeline and improve agility of development. Containerisation is a good fit here; however, without an explicit brief to transform their environments from development through to production, this would definitely be in the 'evolve' category of their cloud migration, once they have replatformed.
Stage 1 - Improve development flow by implementing CD. Control costs using Autoscaling of development environments. This meshes with their existing deployment model. Deploy via:
- AMI with dependencies installed,
- Deployment of builds via Ansible or similar upon successful run of testing in CI/CD pipeline,
- Drain and replace instances in an Autoscaling Group via CD AWS CLI trigger / script.
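The "drain and replace" step could be scripted in several ways. One option, sketched here, is the Auto Scaling instance refresh feature, which boto3 exposes as `start_instance_refresh` on the `autoscaling` client. This is my assumption of a reasonable approach rather than something the case study prescribes; the group name and the threshold values are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def replace_instances(autoscaling_client, group_name,
                      min_healthy_percent=50, warmup_seconds=120):
    """Start a rolling instance refresh on the group and return its refresh id."""
    response = autoscaling_client.start_instance_refresh(
        AutoScalingGroupName=group_name,
        Strategy="Rolling",
        Preferences={
            # Share of the group that must stay in service during replacement.
            "MinHealthyPercentage": min_healthy_percent,
            # Seconds to wait before a new instance counts as ready.
            "InstanceWarmup": warmup_seconds,
        },
    )
    return response["InstanceRefreshId"]


# Example call from a CD job (hypothetical group name, needs AWS credentials):
#   import boto3
#   refresh_id = replace_instances(boto3.client("autoscaling"), "dress4win-dev-web")
```

A CD pipeline would call this after the build has passed testing, then poll the refresh status before marking the deployment complete.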
Stage 2 - Containerisation of their solution from development through to prod. Offload of static content to S3 + CloudFront. ALB to perform SSL offload. This will improve flow, increase cost-efficiency and scale well.
Dress4Win has Hadoop / Spark clusters for real-time analytics and trending. The decision here is whether to replatform the existing solution onto EC2, or use EMR to manage the creation and operation of clusters.
The cost overhead of EMR is relatively minimal, and the ability to scale easily in Production makes EMR a good choice over self-hosting.
Messaging is core to an application like the one Dress4Win describes, so how it can be scaled and run optimally needs consideration. RabbitMQ uses AMQP 0-9-1, which is not compatible with the AMQP 1.0 supported by Amazon MQ's ActiveMQ engine (Amazon MQ is also relatively expensive compared to self-hosting on EC2).
Therefore, the suggestion here would be to replatform to RabbitMQ installed on EC2, with a roadmap decision to be made on whether to re-architect to use SNS.
There is likely minimal benefit in trading the CI and scanning tools for cloud native equivalents if significant effort has already gone into creating effective developer tooling. They should be replatformed as-is.
Bastion servers can either be replatformed, or adoption of EC2 Instance Connect considered, depending on security appetite.
The DR architecture should functionally replicate their on-premise deployment. No statement is made around RPO/RTO or the style of DR deployment Dress4Win wants (e.g. Pilot Light, Multi-site). Running with the theme of cost control, I propose a pilot light DR strategy:
- A VPC deployed in the Production AWS account,
- VPN Gateway to on premise deployment,
- RDS MySQL Instance, replicating from the on-premise MySQL host,
- Autoscaling groups configured with Web Application and Messaging Server launch configurations,
- Push production Hadoop / Spark data to S3 (seeded by Snowball),
- Have small EMR cluster configured to use S3 for backend, ready to scale if needed,
- Route53 redirecting all live traffic to the on-premise deployment (public IP, not via their VPN).
Build and Migration Approach
Dress4Win want to innovate in the way they manage their development environments. To allow rapid (and reproducible) creation of environments, Terraform shall be used to create all non-production environments.
DR shall comprise a Terraform scripted solution to create a pilot-light DR for Dress4Win's production infrastructure. To achieve this:
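To give a feel for what "reproducible" means here, the sketch below generates one Terraform definition per environment from a single template function. Terraform configuration is normally written in HCL; it also accepts an equivalent JSON syntax (files ending in `.tf.json`), which is used here only so the example can stay in Python. The environment names, CIDR ranges and tag values are hypothetical.

```python
import json


def environment_config(name, vpc_cidr):
    """Return a Terraform JSON document describing one environment's VPC."""
    return {
        "resource": {
            "aws_vpc": {
                name: {
                    "cidr_block": vpc_cidr,
                    "enable_dns_hostnames": True,
                    "tags": {"Name": f"dress4win-{name}", "Environment": name},
                }
            }
        }
    }


# Hypothetical environments: every one is stamped out from the same template.
ENVIRONMENTS = {"dev": "10.10.0.0/16", "test": "10.11.0.0/16"}

for env_name, cidr in ENVIRONMENTS.items():
    # Each document would be saved as main.tf.json in that environment's directory.
    print(json.dumps(environment_config(env_name, cidr), indent=2))
```

Because every environment comes from the same function, adding a new one is a one-line change to the dictionary rather than a hand-built copy.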
- Standby replication shall be running to an RDS instance,
- Application Server AMI built. Autoscaling group set up with a minimum of 1 instance configured, and set up to scale with demand,
- EC2 ALB configured to route traffic to ASG,
- EMR cluster setup, with 1 node in place and scaling enabled. Replication to S3 configured from production Hadoop cluster,
- Route53 performing DNS resolution with a Failover Policy, primary being on-premise, failover to EC2 ALB.
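The failover policy in the last bullet can be sketched as a pair of Route 53 record sets: a PRIMARY record pointing at the on-premise public IP, guarded by a health check, and a SECONDARY alias record pointing at the ALB. The function below only builds the change batch in the shape expected by boto3's `change_resource_record_sets`. The domain name, IP address, health check id, ALB DNS name and hosted zone ids are all placeholders I have made up for illustration.

```python
def failover_change_batch(record_name, onprem_ip, health_check_id,
                          alb_dns_name, alb_zone_id):
    """Build a Route 53 change batch with on-premise as primary and the ALB as failover."""
    primary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "on-premise-primary",
        "Failover": "PRIMARY",
        "TTL": 60,  # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": onprem_ip}],
        "HealthCheckId": health_check_id,
    }
    secondary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "aws-dr-secondary",
        "Failover": "SECONDARY",
        # Alias records take no TTL; they point straight at the load balancer.
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    return {
        "Comment": "Pilot-light DR failover",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ],
    }


# Placeholder values only.
batch = failover_change_batch(
    "www.example.com.", "203.0.113.10", "hypothetical-health-check-id",
    "dr-alb.example.elb.amazonaws.com.", "ZEXAMPLEALBZONE",
)
# Applying it needs AWS credentials and the real hosted zone id:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLEHOSTEDZONE", ChangeBatch=batch)
```

While the health check on the primary passes, Route 53 answers with the on-premise address; once it fails, queries resolve to the ALB in front of the DR autoscaling group.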
That's the theory. Part 2 is coming soon - let the terraforming begin!