We recently got a chance to try out a SaaS monitoring solution called SignalFx. We really enjoyed working with it and thought it was worth sharing our experience. Let's have a look at some of its main benefits and everything else you need to know before you deploy any managed monitoring solution:
SignalFx is a tool that lets you offload all the complicated monitoring infrastructure to a reliable partner who really understands how to run such important components properly. Do you sometimes ask yourself any of the following questions?
What happens when my monitoring infrastructure dies?
How can I monitor my monitoring infrastructure?
What happens when the storage holding historical data gets corrupted?
How am I supposed to provision cost-effective infrastructure when the monitoring components are so huge?
With a managed solution, you don't have to ask such questions. Instead, you can fully focus on your business and reduce operational expenses.
Of course, there are some drawbacks, as every technology requires a certain level of “special treatment”. Today I'm going to show you some of the best practices for using SignalFx deployed on AWS EKS.
Simply follow the instructions in the official documentation. I'd just like to add that it's very important to use the advanced installation method (you'll be using good old kubectl), since it allows you to customise the entire Smart Agent configuration. I'll explain the reason for this recommendation later on. ;)
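All of the tweaks in this post happen in the Smart Agent ConfigMap, which holds the agent configuration with a monitors: list inside it. As a rough sketch of the shape you'll be editing (the metadata and the agent.yaml key below are just an illustration of the usual layout; keep whatever the official manifests give you):
apiVersion: v1
kind: ConfigMap
metadata:
  name: signalfx-agent
data:
  agent.yaml: |
    monitors:
      - type: kubelet-stats
      - type: kubernetes-cluster
      # ... this is the list we'll be extending in the sections below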
Even after five years of existence, the world of Kubernetes is still pretty wild: each of the managed services might behave a bit differently. EKS is a good example, as its kubelet endpoints differ slightly from other managed services, so you might not see all the metrics in the SignalFx interface.
Thankfully, this requires just a minor adjustment of the Smart Agent ConfigMap. Open the ConfigMap YAML file and extend the kubelet-stats section to something like this:
- type: kubelet-stats
  kubeletAPI:
    url: "https://${MY_NODE_NAME}:10250"
    skipVerify: true
    authType: serviceAccount
Have you ever read the horror stories about CPU throttling? If you have, then you know we should be watching these stats carefully. If you haven't … well, follow my lead anyway. While you're in the kubelet-stats section, extend the configuration a bit and add the following extra metrics: container_cpu_cfs_periods and container_cpu_cfs_throttled_periods.
The whole kubelet-stats section will then look like this:
- type: kubelet-stats
  extraMetrics:
    - container_cpu_cfs_periods
    - container_cpu_cfs_throttled_periods
    - container_fs_limit_bytes
    - container_fs_usage_bytes
  kubeletAPI:
    url: "https://${MY_NODE_NAME}:10250"
    skipVerify: true
    authType: serviceAccount
Now you can watch CPU throttling in real time. No more oppressed containers!
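A handy ratio to keep an eye on (just a rule of thumb, not an official SignalFx formula): divide container_cpu_cfs_throttled_periods by container_cpu_cfs_periods to get the share of CFS enforcement periods in which a container was throttled. Since both are counters, compare their rates rather than the raw values; a ratio creeping towards 1 means the container spends almost every period pinned against its CPU limit.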
The best practice for fully leveraging scaling capabilities, shortening recovery times and so on is to run stateless workloads on Kubernetes. However, sometimes it's just not possible to stick to this recommendation (what would you do with a ZooKeeper cluster, for instance?) and you have to run stateful workloads. In such cases, it's a good idea to monitor free space to prevent data loss or corruption caused by a full file system.
In SignalFx, you have to opt in to this feature by enabling the kubernetes-volumes monitor:
- type: kubernetes-volumes
  kubeletAPI:
    url: "https://${MY_NODE_NAME}:10250"
    skipVerify: true
    authType: serviceAccount
But that's not all! You also need to modify the ClusterRole to allow SignalFx to read information about PersistentVolumes and PersistentVolumeClaims. This adjustment is actually pretty easy. Just open the manifest for the ClusterRole and add the following entries to the list of resources:
- persistentvolumeclaims
- persistentvolumes
Now you can get all the metrics you need to observe the available space in persistent volumes.
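For reference, the patched rule in the ClusterRole might end up looking roughly like this; the other resources and the verbs are only an illustration of a typical read-only setup, so keep whatever your generated manifest already contains and just extend the resources list:
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - nodes
      - persistentvolumeclaims
      - persistentvolumes
    verbs:
      - get
      - list
      - watch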
By default, SignalFx collects these workload metrics only for Kubernetes Deployments. That's useful when you want to identify failing workloads, deployments that are not fully ready and so on. This step is actually connected to the previous one: sometimes you really do need to run a stateful workload, and in such cases it comes in handy to receive the same metrics for StatefulSets too.
To achieve this, you just need to add a few extra metrics to the kubernetes-cluster section:
- type: kubernetes-cluster
  useNodeName: true
  extraMetrics:
    - "kubernetes.stateful_set.current"
    - "kubernetes.stateful_set.desired"
    - "kubernetes.stateful_set.ready"
    - "kubernetes.stateful_set.updated"
From now on, you can be alerted when something happens to your StatefulSets as well!
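A simple starting point for such an alert (just a suggestion, not an official recipe) is to compare the ready and desired gauges, for example firing when kubernetes.stateful_set.ready stays below kubernetes.stateful_set.desired for more than a few minutes.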
Now let's talk about the most important part: almost everything mentioned in this article counts as custom metrics. These are, of course, limited depending on your subscription level: the standard license gets you 50 custom metrics per host, while the enterprise license gives you 200. Please note that it's extremely easy to exceed the limit, which can bring an unpleasant surprise when the invoice arrives.
I strongly recommend always reading the documentation of the monitored components carefully and filtering out the metrics and dimensions you don't need. For instance, I really don't need Docker metrics, as I can retrieve similar data from other components, so there's nothing easier than deleting this section from the ConfigMap:
- type: docker-container-stats
  dockerURL: unix:///var/run/docker.sock
  excludedImages:
    - '*pause-amd64*'
    - 'k8s.gcr.io/pause*'
  labelsToDimensions:
    io.kubernetes.container.name: container_spec_name
    io.kubernetes.pod.name: kubernetes_pod_name
    io.kubernetes.pod.uid: kubernetes_pod_uid
    io.kubernetes.pod.namespace: kubernetes_namespace
And you know what? We've just saved tons of metrics, so you can reuse this capacity for other important business metrics.
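By the way, if you'd rather keep a monitor but drop only part of its output, the Smart Agent also supports metric filtering at the top level of the agent configuration (next to the monitors: list). Here is a minimal sketch, assuming the metricsToExclude option is available in your agent version (newer releases may prefer a different filtering block, so check the docs for your release); the metric patterns below are purely illustrative:
metricsToExclude:
  - monitorType: docker-container-stats
    metricNames:
      # illustrative patterns, replace with the metrics you actually want to drop
      - "blkio.*"
      - "memory.stats.*"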
If you'd like my advice: take some time and go through all these recommendations one by one. Try fetching the metrics endpoints with curl or wget and try to understand what actually happens in each of the Kubernetes infrastructure components. Playing with monitoring is a great way to get closer to Kubernetes and mentally connect all these pieces together.
My second piece of advice: always think twice before adding new metrics endpoints to the SignalFx configuration. Ask yourself whether you really need those metrics, because custom metrics are limited by your subscription and exceeding the limit can be a bit pricey.
And last but not least, have fun! It all comes down to a bit of math and experimenting with the actual metrics. There's no need to be afraid of observability.
And what if something bad happens and you can't see it in the monitoring? It'll get better in the next iteration; continuous improvement is an integral part of monitoring, so always bear this in mind and don't panic.