Learn how to leverage Kubernetes for efficient auto-scaling in micro-service architectures. This blog explores Horizontal Pod Autoscaling (HPA) using internal and external metrics, including the setup of Prometheus Adapter for monitoring external Redis servers. Follow our step-by-step guide to optimize resource utilization and ensure minimal downtime in your Kubernetes deployments.
Modern micro-service architectures are managed by orchestrators. Kubernetes is one of them, providing benefits such as resource optimization, minimal or zero downtime deployments, reliability and auto-scaling, to name a few. Auto-scaling solutions are feedback loops driven by specific metrics, typically traffic throughput or resource utilization (CPU/memory) of the services. These metrics originate inside the cluster and are monitored to take auto-scaling decisions, but what about external metrics? This blog covers both kinds of metrics for deploying an auto-scaling solution that we used in production for a client.
One of our clients was using a Redis server which was outside of the Kubernetes cluster. We had to collect metrics from the Redis queues and auto-scale the pods based on a threshold.
What is Horizontal Pod Autoscaling (HPA)?
Kubernetes is inherently scalable, providing a number of tools that allow applications as well as the infrastructure to scale up and down depending on demand, efficiency and a number of other metrics. What I’m going to discuss in this article is one such feature, which allows the user to horizontally scale Pods based on certain metrics, provided either by Kubernetes itself or as custom metrics generated by the user.
The Horizontal Pod Autoscaler automatically scales the number of Pods in a replication controller, deployment, replica set or stateful set based on some metrics. It is implemented as a Kubernetes API resource and a controller.
The HPA controller retrieves metrics from a series of APIs, which include:
metrics.k8s.io
API for resource metrics. These include metrics like CPU/memory usage of a Pod.
custom.metrics.k8s.io
API for custom metrics. These can be defined by using operators and are generated from within the cluster, for example by the Prometheus Operator.
external.metrics.k8s.io
API for external metrics. These metrics originate from outside the Kubernetes cluster, for example the number of pending jobs in an external Redis queue, and have to be made available to the cluster so that the HPA controller can monitor them.
How are we going to implement HPA?
For this article, we will be using the Prometheus Adapter to make a Prometheus metric available to the Kubernetes cluster as an external metric.
The following steps outline how HPA can be implemented in the cluster:
There will be an application running in the cluster, which connects to the external Redis service to pick up the next job from the queue.
After picking up a job from the queue, the application will send the number of pending jobs remaining in the queue to StatsD as a gauge metric (a minimal sketch follows these steps).
The external Prometheus will scrape StatsD and now has the metric.
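As a minimal sketch of step 2, the gauge can be pushed using the StatsD plaintext protocol over UDP; the metric name, StatsD host and port below are illustrative assumptions:

# Report 42 pending jobs as a gauge via the StatsD plaintext protocol over UDP
echo "trigger_prod_hpa:42|g" | nc -u -w0 statsd.example.com 8125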
For example, below is the metric that will be used to trigger the autoscaling event:
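Reconstructed from the metric name and labels used later in this article (the sample value of 50 is illustrative), the Prometheus time series looks like this:

trigger_prod_hpa{instance="sh119.global.temp.domains/~onetwoni",job="trigger_prod_hpa"} 50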
Now, you can deploy the Prometheus Adapter in the cluster to query the external Prometheus and expose the metric to the cluster via the external metric API.
The manifests for deploying the Prometheus Adapter can be found in the adapter’s git repository, under the deploy/manifests directory.
The following changes are to be made to the Prometheus Adapter’s manifests:
Update the URL for the Prometheus service in deploy/manifests/custom-metrics-apiserver-deployment.yaml.
Update deploy/manifests/custom-metrics-config-map.yaml with the correct rule for querying Prometheus. For example, our metric is named trigger_prod_hpa, which has the labels {instance="sh119.global.temp.domains/~onetwoni",job="trigger_prod_hpa"}. The corresponding Prometheus Adapter rule for the above metric would be:
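The adapter’s configuration schema has changed across versions, so treat the following as a minimal sketch that assumes a release supporting the externalRules block; check the adapter’s configuration documentation for the exact fields of your version:

externalRules:
# select the external time series in Prometheus
- seriesQuery: 'trigger_prod_hpa{job="trigger_prod_hpa"}'
  # the query the adapter runs against Prometheus when the metric is requested
  metricsQuery: 'max(<<.Series>>)'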
The git repository for the Prometheus Adapter contains a very detailed explanation on how the adapter rules are written.
TLS certificates will have to be generated for the Prometheus Adapter.
A Makefile can be used to generate the certs.
Run the following commands to generate the manifest which will contain the certs as a k8s secret:
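The exact target depends on the Makefile being used; with the commonly used cert-generation Makefiles for the adapter it is typically a single target (the target name here is an assumption, so check the Makefile itself):

# Generates the serving certs and writes them out as a Kubernetes secret manifest
make certs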
The above commands will generate a yaml file which has the secret configured.
Copy the generated cm-adapter-serving-certs.yaml to Prometheus Adapter’s deploy/manifests directory.
Note: Make sure that the namespace of the generated secret is the same as the namespace used by the manifests in Prometheus Adapter’s deploy/manifests.
After the adapter has been successfully deployed, we now have to confirm that the adapter configuration is applied correctly:
Confirm that the external.metrics.k8s.io API is active and aware of the metric:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
Next, confirm that the metric’s value from Prometheus is correctly available to the cluster:
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/*/trigger_prod_hpa" | jq .
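If everything is wired up correctly, the response is an ExternalMetricValueList; its shape is roughly as follows (the timestamp and value are illustrative):

{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "trigger_prod_hpa",
      "metricLabels": {},
      "timestamp": "2020-01-01T00:00:00Z",
      "value": "50"
    }
  ]
}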
Now that we have confirmed that our external metric is available to the cluster, we are ready to define the HPA configuration which will have:
the threshold value, compared against the external metric, that triggers the autoscaling event
minimum number of Pods that must be running when the value is below the threshold
maximum number of Pods that can be scaled up to when the value crosses the threshold
Below is the HPA configuration yaml:
hpa.yaml
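The original manifest is not reproduced here, so below is a sketch reconstructed from the values used in this article (threshold 40, minimum of 3 replicas) using the autoscaling/v2 API; the HPA name, the target Deployment name and the maximum of 6 replicas are assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trigger-prod-hpa       # assumption: name of the HPA object
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker         # assumption: the queue-consumer Deployment
  minReplicas: 3               # replicas maintained while the metric stays below the threshold
  maxReplicas: 6               # assumption: upper bound for scale-up
  metrics:
  - type: External
    external:
      metric:
        name: trigger_prod_hpa # external metric exposed by the Prometheus Adapter
      target:
        type: Value
        value: "40"            # threshold that triggers the autoscaling event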
The above configuration can be applied by the following command:
kubectl apply -f hpa.yaml
We can check the applied HPA configuration by running the following command:
kubectl describe hpa -n <namespace>
As can be seen above, the HPA configuration has been applied to the cluster and the HPA controller is able to access the external metric correctly. It will compare the value of the external metric against the threshold, and when the metric crosses the threshold, it will trigger a scale-up action. Similarly, when the external metric’s value goes below the threshold, the HPA controller will trigger a scale-down action.
The HPA controller keeps track of the desired number of Pods based on the following formula:
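desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

Here desiredMetricValue is the target (threshold) value from the HPA configuration, and currentMetricValue is what the external metric currently reports.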
Test the HPA
In order to test our HPA configuration and make sure that the scaling up/down occurs correctly, we will update the value of the trigger_prod_hpa metric to a value above the threshold.
Update the value of the trigger_prod_hpa metric at Prometheus to a value above “40” (the threshold we have set). Let’s set the value to “50”, and after the scale-up event has happened, update it to “30”. Once the metric value is updated, the HPA controller starts scaling up our Pods until the maximum allowed number is reached.
When the value of the trigger_prod_hpa metric eventually falls below the threshold, the HPA controller will start scaling down the Pods based on the formula mentioned above.
It is possible to track the number of replicas you will end up with after scaling down:
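For example, assuming our illustrative numbers (metric at 30, threshold 40, and a scale-up that reached the assumed maximum of 6 replicas), repeatedly applying the formula above gives:

ceil(6 * 30/40) = 5
ceil(5 * 30/40) = 4
ceil(4 * 30/40) = 3
ceil(3 * 30/40) = 3   (stable, and equal to our configured minimum)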
As can be seen, the HPA ends up maintaining 3 replicas, our configured minimum.
Conclusion
We were able to use the Horizontal Pod Autoscaling feature of Kubernetes by ingesting external metrics into the cluster, defining the HPA configuration, and letting Kubernetes handle auto-scaling the pods.
Always set both the maximum and the minimum number of replicas carefully. There can be situations where the maximum number of replicas is not enough, or the minimum is more than desired.