Sunday, March 19, 2017

Monitoring & Alerting for Micro Services


I got opportunity to design and develop Monitoring and Alerting framework for all the micro services deployed in the organisations. Monitoring majorly classified into 3
  1.   System Monitoring
  2.   Application Monitoring
  3.   Server Monitoring 
Any abnormolity on any of the above 3 will raise an alert. Alert can be   "Slack Notification",  "Email",  "Pager" and "Call".

NOTE: This is not for micro services tracing. For micro services tracing sophisticated open source tools available like Jaeger and  zipkin

 Monitoring & Alerting Tools used:
   Sensu    
  Uchiwa  
  Pagerduty 

Technologies:
 Fluentd           -  Used for data/log forwarding
OpenTSDB      - Used for Time Series Data. Replaces old RRD tools
                          ( competitors would be InfluxDB, Druid, Cassandra)
ElasticSearch   -  Used for log search (Ex:  "ERROR"  count > 1  on app1 raise an alert)


Monitoring Types:

System Monitoring:  
        includes CPU, Disk, I/O, Processes,Virtual Memory, DHCP, Network etc.
Application Monitoring : 
        includes Failed Services, Batch/Cron jobs, Cache monitoring,
        DB monitoring, transaction, 3rd party interactions etc.
Server Monitoring:  
       Apache Tomcat, Ngnix, HAProxy , Request/Response latency, Server health etc


Scripts has to write for all of the above monitoring modules. Scripts can be written in Ruby where Sensu has many sensu plugins which will help to have less number of lines in scripts.

Deployment Topology:


All scripts has to upload to chef server with version. Project configuration and other artefacts will be uploaded to chef-server.

Chef-client can be run from development machine/laptop.

Micro service will be up and running with all monitoring scripts, configuration files, application jar,fluentd, opentsdb agents.
Below shows final Java micro services which will be up and running.






Detail explanation:

Flow - (a)
Sensu agent runs on Micro service. Agent periodically executes scripts(& server monitoring) and sends output to sensu server.
Sensu Server  aggregates data from the micro services.  Sensu forwards alerts to PagerDuty if any threshold breach by a sensu metric.
Uchiwa is the dashboard for Sensu. It gives nice alerts view in order Data centre, VMs, Metrics.

Use Cases:  CPU, Disk, File I/O, Server health etc.

Flow - (b)
Fluentd agent runs on micro service is a data forwarder. Fluentd listens on a application log file path and forwards data to Elastic Search.
ElasticSearch does indexing for the log data. Error count cron job runs on elastic search which does search on "error" count on logs group by application. If there is any error on the log script forwards to sensu server which in turn converts to Pager Duty alert.

Use Cases:  Server access logs( tells how many 401 Rest codes group by Region and application,   Application logs error count group by Region and application)

Flow - (c)
This will be very interesting use case. I have used lot of time series metrics forwarded to OpenTSDB. But I would like to mention metric which helped a lot. REST call requests are recorded as time series metrics.  Example:   (Rest EndPoint, timestamp,  hitCount) .

Use Cases:
For every rest call OpenTSDB client on micro-service sends data to OpenTSDB server.   OpenTSDB graph shows traffic group by Region and Date. Which helps to understand how HTTP traffic on each data centre.


Example:
 How to use sensu checks.

Below few checks on RAM & Disk

sensu_check 'ram' do
  command 'check-ram.rb -w 20 -c 10'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

sensu_check 'disk' do
  command 'check-disk.rb'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

Above ruby script raises alert if Ram breaches threshold.  File  "check-ram.rb" will be available at Checf cookbooks as default file
https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb

Subscribe: is for notification.