Context.
The client provides credit insights to financial institutions and NBFCs across Southeast Asia. They have 15 deployments across AWS and GCP and six on-premise data centers in different geographies. Every on-premise data center has 40+ bare-metal machines, and every cloud deployment has 10+ VMs with 30+ microservices.
Problem Statement.
Outcome/Impact.
Solution.
Every deployment (Cloud or On-Premise) had its own Nomad cluster for scheduling workloads. Prometheus and StatsD exporter were used for metrics collection.
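Configured the Nomad agents and servers to emit metrics to the StatsD exporter over UDP, and configured Prometheus to scrape the StatsD exporter's HTTP endpoint. Because new machines push their metrics to the exporter instead of being scraped directly, the Prometheus scrape configuration stays the same, which allowed us to set up monitoring for new VMs without having to restart Prometheus.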
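To illustrate what travels over UDP in this setup, here is a minimal Go sketch of the plain StatsD line format and an unacknowledged send. The function names, the metric name, and the address are illustrative assumptions, not taken from the client's setup; Nomad emits such lines itself once its telemetry configuration points at a StatsD address.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// formatStatsD renders one metric in the plain StatsD line format
// <name>:<value>|<type>, where type is for example "g" (gauge),
// "c" (counter) or "ms" (timer).
func formatStatsD(name string, value float64, kind string) string {
	return name + ":" + strconv.FormatFloat(value, 'f', -1, 64) + "|" + kind
}

// sendStatsD writes a single metric line to a StatsD listener over UDP.
// Delivery is not acknowledged, so a nil error does not prove receipt.
func sendStatsD(addr, line string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}
```

For example, `sendStatsD("127.0.0.1:9125", formatStatsD("nomad.client.allocations.running", 3, "g"))` would send one gauge to an exporter listening locally. Port 9125 is the statsd_exporter default for UDP and is assumed here; the actual deployments may differ.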
Configured Prometheus to write metrics to a custom Go HTTP remote-write service, which forwards them to CloudWatch or Stackdriver. This allowed us to scale Prometheus by using CloudWatch or Stackdriver as the remote storage backend.
Created host and service monitoring alerts using Terraform scripts, integrated with PagerDuty.
Used Grafana for visualizations, with CloudWatch and Stackdriver as backends.
Configured custom alerts in Grafana with PagerDuty integration.
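For readers unfamiliar with what a PagerDuty integration ultimately delivers, the following Go sketch builds a trigger event in the shape used by the PagerDuty Events API v2. It is only an illustration: the deployments described here relied on the Terraform and Grafana integrations rather than hand-written code, and the type and function names and the sample values are hypothetical.

```go
package main

import "encoding/json"

// eventPayload carries the human-readable details of an alert.
type eventPayload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"` // critical, error, warning or info
}

// triggerEvent is the top-level body of an Events API v2 request.
type triggerEvent struct {
	RoutingKey  string       `json:"routing_key"` // integration key of the service
	EventAction string       `json:"event_action"`
	Payload     eventPayload `json:"payload"`
}

// buildTrigger serializes a "trigger" event for the given alert details.
func buildTrigger(routingKey, summary, source, severity string) ([]byte, error) {
	return json.Marshal(triggerEvent{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		Payload: eventPayload{
			Summary:  summary,
			Source:   source,
			Severity: severity,
		},
	})
}
```

The resulting JSON would be sent with an HTTP POST to the Events API v2 endpoint, `https://events.pagerduty.com/v2/enqueue`, which is what alerting integrations do on the operator's behalf.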