
Deploying microservices at CCData

This blog post provides insight into how our systems work and why they are built the way they are. It explains how we deploy and monitor our network of over 800 microservices, which span our Go and Node.js codebases. While this post focuses on Node.js, the process for Go is very similar.

  • February 15, 2021
  • Vlad Cealicu

A short history of our deployment system, 2014–2021

Deployment v1 (2014): when we started the company, we only had one developer, so deployment was very manual: ssh into the VM, do a git pull, restart the service. To check logs we would ssh into the VM and tail a log file. We did not have any monitoring or alerts. That was 7 years ago, and a lot has changed since.

Deployment v2 (2017): an Ansible playbook that automates the same process as the manual v1, plus monitoring of VM stats in Icinga and Grafana.

Deployment v3 (2019): added a monitoring endpoint exposing general stats for each microservice. Deployment is done through an interface: choose the service, choose the VMs it needs to be deployed to, run the playbook.


Deployment v4 (2021) — Setup, monitoring, logging, and alerting

Third-party services used: GitHub, Jenkins, Aptly, Ansible, Icinga, Grafana, Graylog, Slack

Setting up a new microservice

It all starts from our microservice template.

Fig. 1 Simple service template example, systemd definition

Each service is self-contained in its own folder with a package.json file and a deploy folder containing .env variables and a systemd definition. Each microservice runs on its own VM using Aptly and systemd. When setting up a service, our devs have to answer a few basic questions: what is the service name, what does the service need access to, how many resources do you expect it to use, does it need local Redis, Nginx, etc.
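To give a rough idea of what sits in that deploy folder, a unit file might look something like the sketch below. The service name, user, and paths here are hypothetical placeholders, not our actual template:

```
# deploy/price-aggregator.service (hypothetical name and paths)
[Unit]
Description=price-aggregator microservice
After=network-online.target

[Service]
Type=simple
User=svc-price-aggregator
EnvironmentFile=/etc/price-aggregator/.env
ExecStart=/usr/bin/node /opt/price-aggregator/index.js
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```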

Fig. 2 Empty service example with statusCode, monitoring, and hooks for code

After running the create-service scaffolding script and answering all the questions, a pull request with an empty service gets created and devops is informed. Devops then run playbooks to set up the VMs needed by the service, and the empty service gets merged and deployed. We then manually add all the Network Access Filter rules. We decided to do the network part manually to avoid an automated network rule outage; imagine someone accidentally making our Redis Cluster unavailable by changing the Network Access Filter rules on the Redis Cluster group.
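For illustration, the empty service skeleton looks roughly like the sketch below. The hook and variable names are simplified and illustrative, not taken from our actual template:

```javascript
// Rough sketch of an empty service skeleton; names are illustrative only.
const STATUS = { OK: 0, WARN: 1, CRITICAL: 2 };

let runNumber = 0;
let lastRunStart = null;
let lastRunTimeMs = null;

// Hook where the service-specific logic goes.
async function runOnce() {
  lastRunStart = Date.now();
  runNumber += 1;
  // ... fetch data, process it, publish results ...
  lastRunTimeMs = Date.now() - lastRunStart;
}

// Every service implements its own status code function for monitoring
// and alerting (see Fig. 2 and the alerting section below).
function getStatusCode() {
  return STATUS.OK;
}

// Run on an interval; the template also wires up the monitoring endpoint.
setInterval(() => {
  runOnce().catch((err) => console.error('run failed', err));
}, Number(process.env.RUN_INTERVAL_MS) || 60000);
```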

Deploying a microservice

To deploy a microservice, we built an internal dashboard that runs Ansible playbooks in the background: think Ansible Tower, but as an internal tool developed by CCData for CCData-specific needs.

Fig. 3 Choose the server group, the branch, the server Id, and the service name, press deploy.

After choosing the Server Group Id (a group of multiple VMs used for a specific microservice / API), the developer chooses the branch they want to deploy, the server they want to deploy to (options are based on the server group), and then the service that they want to deploy.

Fig. 4 Show progress on deployment and save deployment history.

When they press Deploy, the following chain of events happens: Jenkins gets the name of the service that needs to be deployed, runs the unit tests, and builds the Aptly package for the service. The package is then published to our internal Aptly repository, and a playbook installs it (apt-get install {service-name}) on the selected VMs.
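As a rough sketch of that last step (not our actual playbook; the variable names are hypothetical), the install-and-restart part of such a playbook could look like this:

```yaml
# Hypothetical sketch of the install/restart step; not our actual playbook.
- name: Install and restart a microservice from the internal Aptly repo
  hosts: "{{ target_vms }}"
  become: true
  tasks:
    - name: Install the latest package for the service
      ansible.builtin.apt:
        name: "{{ service_name }}"
        state: latest
        update_cache: true

    - name: Restart the systemd unit for the service
      ansible.builtin.systemd:
        name: "{{ service_name }}"
        state: restarted
        daemon_reload: true
```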

Monitoring a microservice

Fig. 5 Monitoring example: common data across all microservices.

All microservices and APIs expose an internal status endpoint that gets queried by Icinga, and we run our own package to convert the response into Icinga perf data. The data is then used to build a Grafana dashboard with the following common fields (a rough sketch of such an endpoint follows the list).

  • Last Deploy/Restart — the last time the service was restarted
  • Last Run Start — the last time the service ran
  • Run Number — the number of times this service ran since the last deploy/restart
  • CPU, Load, Bandwidth (general VM stats)
  • Next Iteration — how long until the service runs again
  • Last Run Time — how long it took to run through all the data
  • Service Uptime — just a human-readable time, similar to Last Deploy/Restart
  • Status Code — this is used for alerting and general system health. Each service implements a status code function, like in the example in Fig. 2 above.
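As a rough illustration of the shape of that status response, a minimal Node.js endpoint might look like the sketch below. The field names are paraphrased from the dashboard fields above, not our exact keys, and the environment variable is an assumption:

```javascript
// Minimal sketch of an internal status endpoint, assuming the service keeps
// these counters in memory; field names are illustrative, not our exact keys.
const http = require('http');

const state = {
  lastDeployRestart: Date.now(), // set when the process starts
  lastRunStart: null,
  lastRunTimeMs: null,
  runNumber: 0,
  nextIterationMs: null,
  statusCode: 0, // 0 = OK, 1 = WARN, 2 = CRITICAL
};

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({
    ...state,
    serviceUptimeSec: Math.floor(process.uptime()),
  }));
}).listen(process.env.STATUS_PORT || 8080);
```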

Logging for a microservice

Fig. 6 Logging example: general common stats and individual run

Each microservice has a log level set up as an environment variable and a series of log entries. Since all our services run as systemd processes, the logs go into journald. We run a journald-to-Graylog data converter, and all the log entries then make it to our centralized Graylog server, tagged with the service name and other metadata. We then build dashboards for each service in Graylog as needed.
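As a sketch of the service side of this (the variable names and level set are assumptions, not our exact setup), a leveled logger that respects the configured level and writes to stdout, where systemd captures it for journald, could look like this:

```javascript
// Minimal leveled logger sketch; LOG_LEVEL and SERVICE_NAME are assumed names.
// Writing to stdout/stderr is enough for systemd to capture entries in journald.
const LEVELS = { debug: 0, info: 1, warn: 2, error: 3 };
const configured = LEVELS[(process.env.LOG_LEVEL || 'info').toLowerCase()] ?? LEVELS.info;

function log(level, message, meta = {}) {
  if (LEVELS[level] < configured) return;
  const line = JSON.stringify({ level, message, ...meta, ts: new Date().toISOString() });
  (level === 'error' ? process.stderr : process.stdout).write(line + '\n');
}

log('info', 'service started', { service: process.env.SERVICE_NAME });
log('warn', 'external source slow', { source: 'example-exchange' });
```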

Alerts for a microservice

Fig. 7 Default service status code behavior

Alerts go to Slack and email, and we also have a pager duty setup with direct messages/calls to team leads.

Our alerts are built on the statusCode returned by each service. If the status code is 0, the service is healthy. A status code of 1 means the service is in WARN mode: one or more of the external data sources are having issues, or other parts of our infrastructure that the service relies on are unavailable. A status code of 2 means the service is in CRITICAL mode and not working as intended. If a service is in CRITICAL mode, somebody needs to look at the logs and figure out what is happening.

We tried to automate the alert system as much as possible, but in the end it all comes down to the team who wrote the microservice. General alert rules in Grafana ended up requiring a lot of work to set up, so we went with the three status codes and a bespoke implementation at the microservice level.

The best source of truth for the state of a service is the service itself.

Our default alert behavior is to set the service to the CRITICAL state if it takes too long to run or if there have been CRITICAL errors since the last Icinga Agent call. We set the service to WARN mode if there have been any WARN errors since the last Icinga Agent call. If none of the above happens, the service is in an OK state.
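In code, that default behavior boils down to something like the sketch below. The counter names and the run-time threshold are illustrative, not our actual defaults:

```javascript
// Sketch of the default status code logic described above; names are illustrative.
const STATUS = { OK: 0, WARN: 1, CRITICAL: 2 };

function defaultStatusCode({ lastRunTimeMs, maxRunTimeMs, criticalErrorsSinceLastCheck, warnErrorsSinceLastCheck }) {
  if (lastRunTimeMs > maxRunTimeMs || criticalErrorsSinceLastCheck > 0) {
    return STATUS.CRITICAL; // took too long, or critical errors since the last Icinga Agent call
  }
  if (warnErrorsSinceLastCheck > 0) {
    return STATUS.WARN; // warn-level errors since the last Icinga Agent call
  }
  return STATUS.OK;
}
```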

Summary

After a lot of trial and error, we ended up with a system that caters to our needs and behaves in a similar way to a Kubernetes cluster: we use VMs instead of Pods, and we set up our network rules at the VM level.

We think this approach will allow us, long term, to migrate easily to Kubernetes once we see clear advantages to doing so.

Disclaimer: Please note that the content of this blog post was created prior to our company's rebranding from CryptoCompare to CCData.
