Case Study

Data Migration

Migrating Monolith Application to Cloud Native Stack.

Context.

The client provides cloud and on-premise based Hospital Management Services (HMS) SaaS platform. This HMS serves 800+ B2B customers (tenants), including large hospital chains across the Middle East, Africa, and APAC region. The tech stack for the application is a Java-based backend, JSP-based server rendered UI, Redis, and MySQL as datastores. The HMS platform is deployed as a SaaS and in customers' cloud environments and on-premise data centers.

Problem Statement.

Improving Observability: Set up a centralized, cloud-agnostic observability stack to enable platform-wide monitoring, with the ability to enable per-tenant monitoring

Deployment automation: Improve the application release process to reduce the deployment time across SaaS and customer environments.

Outcome/Impact

Set up a single pane of observability for the client team to monitor all their customer environments.

Set up per-customer observability so that the support team can debug application and infrastructure problems.

Containerize the application, enable a Continuous Integration workflow, and implement GitOps-based deployments.

Bring down deployment time from days to a few minutes.

Deployed this solution for all tenants across cloud and on-premise environments.

Solution.

The HMS is deployed in one of two ways:

  1. A SaaS solution hosted and managed by the client team

  2. An on-premise solution hosted at the customer site (cloud and data centers) and managed by the client team

When the One2N team started the work, there were two major challenges we set out to solve:

  1. Set up the observability stack for all environments

  2. Improve the application release process

Let’s look at these in detail.

Set up Observability

Let’s answer some foundational questions like:

  • Why was centralized observability needed?

  • What did we do?

  • How did we set it up?

Client's Observability Setup

Why: The client team wanted a single pane to observe all the environments, including SaaS and customer environments. Without this, the client team would always be reactive and know about the problems only when customers reported them. The goal of a centralized observability system is to help the client team be more proactive in understanding system failures.

What: We evaluated the build-vs-buy option by performing a proof of concept with a centralized observability SaaS solution (e.g., New Relic). However, we decided not to proceed with it due to the outgoing data transfer costs it would incur (to send logs and metrics to a centralized monitoring stack outside the cloud environment). We couldn’t use a cloud-specific solution (e.g., AWS CloudWatch) either, since some deployments were on other clouds (e.g., Azure). Hence, we needed a self-hosted monitoring solution based on open-source tooling.

How: For a homogeneous, self-hosted observability solution across all environments, we finally settled on the Prometheus, Fluentd, Loki, Grafana, and Thanos stack. Each customer environment got its own Prometheus, Loki, and Grafana stack with environment-specific dashboards and alerts. Logs and metrics were also pushed to the central Loki and Thanos setup via remote write. This way, the client team had access to monitoring data for all customer environments.
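
To give a flavor of the remote write setup, here is a minimal sketch of a per-environment Prometheus configuration; the Thanos Receive URL and the tenant label are hypothetical placeholders, not the actual values used.

  # Minimal sketch of a per-environment prometheus.yml (placeholder values).
  global:
    external_labels:
      tenant: customer-a            # identifies this environment in the central stack
  remote_write:
    - url: https://thanos-receive.central.example.com/api/v1/receive
      # Authentication/TLS settings omitted for brevity.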

Based on each customer’s deployment setup and SLO needs, we rolled out this solution via Helm chart deployments or systemd services. We kept the monitoring infrastructure separate from the application setup so that if production goes down, the monitoring infra doesn’t go down with it. On the client’s side, we deployed the monitoring stack via Helm charts. We also automated the creation of various infrastructure and application dashboards so that these dashboards don’t have to be created by hand for each environment.
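
As an illustration of the dashboard automation, a Grafana dashboard provisioning file along these lines lets pre-built dashboard JSON files ship with the stack; the provider name, folder, and path below are hypothetical.

  # Sketch of a Grafana dashboard provisioning file (placeholder names/paths).
  apiVersion: 1
  providers:
    - name: hms-dashboards              # hypothetical provider name
      folder: HMS                       # Grafana folder the dashboards appear under
      type: file
      disableDeletion: false
      options:
        path: /var/lib/grafana/dashboards   # dashboards shipped as JSON files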

This centralized observability solution has been deployed in all environments and is actively used by developers and system administrators.

Let’s now look at how we solved some challenges related to the application release process.

Improve the application release process

To improve the application release process, we started by improving the local dev setup and application packaging. Here are some of the challenges we solved in the process.

Challenge 1: The application was stateful and had many heterogeneous environments

The application wasn’t designed to run in a stateless (and distributed) manner. It used stateful HTTP sessions tied to a single server IP. We worked with the dev team to find and fix these issues. They updated the app to run statelessly, so that we could horizontally scale the API layer by running the same app on multiple instances. The stateful sessions were moved to the cache (Redis).
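
As an illustration only (the case study doesn’t name the servlet container or session library), moving HTTP sessions out of the JVM and into Redis for a JSP/servlet app can look like the following Tomcat context.xml snippet using Redisson’s session manager.

  <Context>
    <!-- Store HTTP sessions in Redis instead of the local JVM, so any
         instance behind the load balancer can serve a request. -->
    <Manager className="org.redisson.tomcat.RedissonSessionManager"
             configPath="${catalina.base}/conf/redisson.yaml"
             readMode="REDIS"
             updateMode="DEFAULT"/>
  </Context>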

Almost all the production environments were unique in their own way. These environments were treated as pets instead of cattle. To solve this problem, we containerized the application so it could run anywhere without custom scripting. For this, we worked with the dev team to apply the 12-factor app principle of separating config from code. We removed the hardcoded config from the code and moved it into environment variables passed in at runtime. We created multiple Dockerfiles that compiled the backend and frontend code using multi-stage Docker builds. In the end, we brought the container image size down from about 2 GB to about 600 MB.
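
For illustration, a multi-stage Dockerfile along these lines keeps the build toolchain out of the runtime image; the build tool, Java version, and artifact name here are assumptions, not the client’s actual setup.

  # Build stage: compile the application; only the packaged artifact
  # is carried into the final image.
  FROM maven:3.9-eclipse-temurin-17 AS build
  WORKDIR /app
  COPY pom.xml .
  RUN mvn -q dependency:go-offline
  COPY src ./src
  RUN mvn -q package -DskipTests

  # Runtime stage: a much smaller image with just the servlet container
  # and the built WAR file (artifact name is hypothetical).
  FROM tomcat:9.0
  COPY --from=build /app/target/hms.war /usr/local/tomcat/webapps/ROOT.war
  EXPOSE 8080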

Challenge 2: Ad-hoc deployment automation

The deployment scripts used to deploy the app were tied to the Ubuntu OS. These ad-hoc scripts grew in size and complexity over time. The existing scripts had a lot of dead code and conditional logic (e.g., to support older application releases). Imagine 1000+ lines of bash scripting logic to build and deploy the app.

We standardized the deployment automation across all environments. For production deployments, we used Kubernetes, Helm, and ArgoCD. For non-prod environments, we created an Ansible playbook that deployed the latest Docker image from the dev branch onto the test server. This helped the QA team quickly deploy and test new application changes without getting overwhelmed by Kubernetes tooling. For running DB schema migrations, we used init containers in Kubernetes.
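
For example, a Deployment along these lines runs schema migrations in an init container before the application container starts; the names, image, migration entrypoint, and secret are hypothetical.

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: hms-api                        # hypothetical service name
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: hms-api
    template:
      metadata:
        labels:
          app: hms-api
      spec:
        initContainers:
          - name: db-migrate
            image: registry.example.com/hms-api:1.2.3     # hypothetical image
            command: ["sh", "-c", "./run-migrations.sh"]  # hypothetical migration entrypoint
            envFrom:
              - secretRef:
                  name: hms-db-credentials                # hypothetical secret with DB config
        containers:
          - name: hms-api
            image: registry.example.com/hms-api:1.2.3
            ports:
              - containerPort: 8080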


In summary, we were able to roll out this whole solution (Observability and Application Release Process improvements) across all of the client's B2B customers over a period of one year.

As part of these improvements, we also set up on-demand Preview environments for Pull Requests using ArgoCD. Here are some details about the workflow.

  1. The developer creates a Pull Request (PR) and attaches a "preview" label to it. This label is customizable.

  2. PR creation triggers the Continuous Integration build process using GitHub Actions. A Docker image artifact is created and pushed to Amazon Elastic Container Registry (ECR).

  3. The ArgoCD ApplicationSet controller detects the PR and triggers the application deployment in a separate namespace (a trimmed-down ApplicationSet sketch follows this list).

  4. In this namespace, ArgoCD provisions the service-specific Kubernetes resources such as Ingress, Service, and Deployment.

  5. The developers and QA team can access the Preview environment via the Ingress URL.

  6. Once the PR is closed/merged, or the "preview" label is removed, ArgoCD automatically removes all the resources of the Preview environment.
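
Here is a trimmed-down sketch of such an ApplicationSet using the pull request generator; the GitHub organization, repository, chart path, and sync options are placeholders rather than the actual configuration.

  apiVersion: argoproj.io/v1alpha1
  kind: ApplicationSet
  metadata:
    name: preview-envs
  spec:
    generators:
      - pullRequest:
          github:
            owner: example-org            # placeholder organization
            repo: example-service         # placeholder repository
            labels:
              - preview                   # only PRs carrying the "preview" label
            # API token reference omitted for brevity
          requeueAfterSeconds: 120
    template:
      metadata:
        name: 'preview-{{number}}'
      spec:
        project: default
        source:
          repoURL: https://github.com/example-org/example-service.git
          targetRevision: '{{head_sha}}'
          path: deploy/helm               # placeholder chart path
          helm:
            values: |
              image:
                tag: '{{head_sha}}'       # image built by CI for this PR
        destination:
          server: https://kubernetes.default.svc
          namespace: 'preview-{{number}}'
        syncPolicy:
          automated:
            prune: true                   # cleans up when the PR no longer matches
          syncOptions:
            - CreateNamespace=true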

ArgoCD provides a general framework for creating Preview environments. However, here are some challenges we had to solve when building Preview environments on top of ArgoCD.

  • Dependency Management for Services

  • Seed Data Management

  • Keeping costs low for Preview environments

Dependency Management for Services

For frontend services, setting up Preview environments is easy. For backend services, we also have to set up dependent components such as the database, queue, cache, and other backend services.

For this, we chose a hybrid approach where the database, queue, and cache were provisioned specifically for the Preview environment in the same namespace, while the dependent backend services were used from the shared Staging environment. This allowed us to save resource costs on shared backend services while still not sharing the database, queue, and cache.

We introduced configuration overrides to solve an interesting scenario. Consider two services, Service A and Service B, where Service A depends on Service B. A developer working on a Pull Request for Service A can use Service B either from the Staging environment or from any Preview environment of Service B. We enabled developers to point to any version of Service B (either Staging or another Preview environment) by changing the configuration in ArgoCD.
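
As a hypothetical example of such an override, the Helm values for Service A’s Preview environment might point at either the shared Staging instance of Service B or a specific Preview instance; the service names and URLs here are placeholders.

  # Default: Service A in a Preview environment talks to Service B in Staging.
  serviceB:
    baseUrl: http://service-b.staging.svc.cluster.local:8080

  # Override: point Service A at a Preview environment of Service B instead.
  # serviceB:
  #   baseUrl: http://service-b.preview-1234.svc.cluster.local:8080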

Seed Data Management

Each backend service in the Preview environment has its own newly created database. To test the service in a Preview environment, the team had to create and update database records, which was time-consuming. We solved this by creating customized Docker images with pre-seeded data. We also made it easy for developers to update the seed data by replacing a dump file in a Git repository, which let us version control the seed data as well. With this, the team doesn’t have to start from scratch, effectively reducing the time to test in a Preview environment.
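
One way to build such a pre-seeded image (assuming the official MySQL image and a dump file checked into Git; the file names are placeholders) is:

  # The official MySQL image executes any *.sql file placed under
  # /docker-entrypoint-initdb.d/ when the database is first initialized.
  FROM mysql:8.0
  COPY seed/service-a-seed.sql /docker-entrypoint-initdb.d/01-seed.sql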

Keeping costs low for Preview environments

Since Preview environments are created for every Pull Request with a specific label, the number of provisioned environments can grow quickly. To ensure this doesn’t add too much infrastructure cost, we suggested running the Preview environments on spot instances in Kubernetes. Spot instances are a good fit for non-critical workloads like Preview environments.
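
A minimal sketch of pinning Preview workloads to spot capacity is shown below; the node label and taint are placeholders and depend on how the cluster’s spot node pool is configured.

  # Fragment of the pod template used by Preview environment Deployments.
  nodeSelector:
    node-lifecycle: spot            # placeholder label on the spot node pool
  tolerations:
    - key: node-lifecycle
      operator: Equal
      value: spot
      effect: NoSchedule            # allows scheduling onto tainted spot nodes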

If a Pull Request doesn’t have any new activity (commits, comments, etc.) for a certain period of time, the preview label is automatically removed. This, in turn, deletes the Preview environment to save costs.
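
The case study doesn’t specify the mechanism, but one way to implement this is a scheduled GitHub Actions job that strips the label from inactive PRs; the three-day threshold and the label name below are assumptions.

  # Hypothetical scheduled workflow: remove the "preview" label from PRs
  # with no updates in the last 3 days (threshold is an assumption).
  name: cleanup-stale-previews
  on:
    schedule:
      - cron: "0 * * * *"            # run hourly
  jobs:
    remove-preview-label:
      runs-on: ubuntu-latest
      permissions:
        pull-requests: write
      steps:
        - name: Remove label from inactive PRs
          env:
            GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          run: |
            cutoff=$(date -u -d '3 days ago' +%Y-%m-%dT%H:%M:%SZ)
            gh pr list --repo "$GITHUB_REPOSITORY" --label preview \
              --json number,updatedAt \
              --jq ".[] | select(.updatedAt < \"$cutoff\") | .number" |
            while read -r pr; do
              gh pr edit "$pr" --repo "$GITHUB_REPOSITORY" --remove-label preview
            done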

In summary, we were able to roll out Preview environments for all of Manatal’s services (both frontend and backend) on their existing tech stack. All this work was carried out by just one senior engineer in less than 2 months.
