I created a simple application that runs in a Docker container to test an S3 service and generate metrics from it - in this case, Ceph.
I needed to perform external monitoring on our S3 endpoints to better observe how the service was performing, and thus get a closer representation of the user experience. I used basically Python and Docker to accomplish the task.
We have two relatively large Ceph clusters (2.5PB, 600 OSDs) at the company that provide object storage implementing the S3 API, so we can leverage the AWS SDK to perform operations like creating buckets, putting objects and so on, measuring the success and the time consumed by each operation. With those metrics we can then build graphs, panels and alerts using Grafana, for instance.
Following some kind of KISS methodology (keep it simple, stupid), running everything inside a Docker container gives us the flexibility to throw it anywhere and get the same results. The initial goal was to run two containers on AWS pointing to both sites (where the two main Ceph clusters live) and two others, one on each site, pointing at the other site.
For the task I used Python as the programming language, with the Prometheus client for Python found at https://github.com/prometheus/client_python. I divided the code into sections to talk about each part.
Libraries
prometheus_client for the metrics, boto3 for the S3 connection, plus random and a few standard-library modules:
from prometheus_client import Histogram, start_http_server, Summary, Gauge, Counter
import boto3
import random
from io import BytesIO
import os
import time
import logging
Variables
Here we define our endpoint address, credentials, the names of the bucket and objects, the interval between tests, the histogram buckets and how many objects we want to put simultaneously (for the histogram metrics):
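The variable block itself didn't make it into this section, so here's a minimal sketch of what it could look like - every name and default below is an illustrative assumption, not the original code:

```python
import os

# Hypothetical configuration block; names and defaults are illustrative.
ENDPOINT_URL = os.getenv('ENDPOINT_URL', 'https://s3.example.com')
ACCESS_KEY = os.getenv('ACCESS_KEY', '')
SECRET_KEY = os.getenv('SECRET_KEY', '')
BUCKET_NAME = os.getenv('BUCKET_NAME', 's3metrics-test-bucket')
OBJECT_NAME = os.getenv('OBJECT_NAME', 's3metrics-test-object')
PORT = int(os.getenv('PORT', '8000'))                  # Prometheus exporter port
TIME_TO_SLEEP = int(os.getenv('TIME_TO_SLEEP', '60'))  # seconds between test rounds
HISTOGRAM_BUCKETS = (0.25, 0.5, 1.0, 2.5, 5.0, 10.0)   # bucket boundaries, seconds
OBJECT_COUNT = int(os.getenv('OBJECT_COUNT', '10'))    # objects put per round
```

Reading the configuration from environment variables also plays nicely with `docker run -e` later on.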
Here we define what metrics we want to generate, or collect: ALL_TESTS, S3_FAIL, CREATE, PUT, DELETE, GET_OBJECT, LIST_OBJECTS, REMOVE_BUCKET and DELETE_OBJECTS. For each operation three metrics are defined: a Summary, a Gauge and a Histogram, with labels used to distinguish each operation.
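As a rough sketch of that per-operation trio with prometheus_client - the metric names, label values and bucket boundaries here are my assumptions, not the post's exact code:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Illustrative bucket boundaries in seconds.
BUCKETS = (0.25, 0.5, 1.0, 2.5, 5.0, 10.0)

# Counters for overall test rounds and failed operations.
ALL_TESTS = Counter('s3_tests_total', 'Total S3 test rounds executed')
S3_FAIL = Counter('s3_fail_total', 'Failed S3 operations', ['operation'])

# One Summary, Gauge and Histogram shared by all operations, distinguished
# by an "operation" label (put, get_object, list_objects, ...).
OP_SUMMARY = Summary('s3_request_seconds_summary',
                     'Time spent per S3 operation', ['operation'])
OP_GAUGE = Gauge('s3_request_last_seconds',
                 'Duration of the last S3 operation', ['operation'])
OP_HISTOGRAM = Histogram('s3_request_seconds',
                         'Distribution of S3 operation durations',
                         ['operation'], buckets=BUCKETS)
```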
This is where stuff gets done! There is an init function to set the parameters from the variables, start logging and create an S3 resource from the boto3 library. Decorators are used to measure the time spent in each function, and the run_all_tests function calls all the tests - it's the one we use, as we'll see next.
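The decorator-based timing can be sketched like this - a simplified, self-contained illustration where the metric and method names are assumptions, and the actual boto3 call is replaced by a stand-in so the shape of the class is visible:

```python
import time
from prometheus_client import Summary

# Illustrative metric name; the real script defines a trio per operation.
PUT_TIME = Summary('s3_put_seconds', 'Time spent putting an object')

class S3Metrics:
    """Sketch of the test runner; method names are assumptions."""

    def __init__(self, bucket_name, object_name):
        self.bucket_name = bucket_name
        self.object_name = object_name
        # The real __init__ also sets up logging and a boto3 resource, e.g.:
        # self.s3 = boto3.resource('s3', endpoint_url=ENDPOINT_URL,
        #                          aws_access_key_id=ACCESS_KEY,
        #                          aws_secret_access_key=SECRET_KEY)

    @PUT_TIME.time()  # the decorator records how long the method takes
    def test_put_object(self):
        try:
            # Real code would be something like:
            # self.s3.Object(self.bucket_name, self.object_name).put(Body=data)
            time.sleep(0.01)  # stand-in for the actual S3 call
        except Exception:
            # The real code increments the failure counter (S3_FAIL) here.
            raise

    def run_all_tests(self):
        # The real method calls every individual test in sequence.
        self.test_put_object()
```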
Here we start the Prometheus exporter and call the method run_all_tests() from the S3Metrics class inside an infinite while loop, so the script runs continuously until manually stopped:
def main():
    start_http_server(PORT)
    s3metrics = S3Metrics(BUCKET_NAME, OBJECT_NAME)
    while True:
        s3metrics.run_all_tests()
        time.sleep(TIME_TO_SLEEP)

if __name__ == '__main__':
    main()
So now that we have our s3metrics.py, we have to build a container to run it. I created a requirements.txt file to make it easier to adjust package versions in the future, and used the python:3 base image in the Dockerfile, since it already has most of what I need.
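The post doesn't reproduce the Dockerfile itself, but based on that description (requirements.txt plus the python:3 base image) it would be something along these lines - the file layout and exposed port are assumptions:

```dockerfile
FROM python:3

WORKDIR /app

# requirements.txt pins prometheus_client and boto3 versions
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY s3metrics.py .

# port the Prometheus exporter listens on (assumed to be 8000)
EXPOSE 8000

CMD ["python", "s3metrics.py"]
```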
Now we can run the container inside an EC2 instance or anywhere capable of running a Docker container (you don't say, lol). I'll later post the Terraform script I used to deploy it on AWS, with more details. Now for the Grafana part:
Grafana
Here’s how I configured a Bar gauge graph to see the PUT Histogram:
[Image: Grafana bar gauge showing the PUT histogram]
We can see that the majority of the operations take less than 0.5 seconds to finish, so if we start seeing an increase in the other histogram buckets, it means that we probably have a problem.
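For reference, a bar gauge like that is typically fed by a per-bucket rate query. Assuming a histogram metric named s3_request_seconds with an operation label (both names are assumptions, not the post's actual metric names), it could look like:

```
# increase of each histogram bucket over the last 5 minutes,
# grouped by the bucket boundary (le) for the bar gauge
sum by (le) (increase(s3_request_seconds_bucket{operation="put"}[5m]))
```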
I also created a panel to watch for errors and one to see all the operations combined:
[Image: Grafana panels for errors and for all operations combined]
Conclusion
This kind of monitoring can provide good observability for a service like the Ceph S3 gateway, but it could easily be applied to any other service with a few changes. Before this, we used UptimeRobot to achieve a similar goal, but from my perspective you get much more control and flexibility with a container like this performing connectivity tests - and in this case we go much deeper by mimicking user operations in our tests.