I got the opportunity to design and develop a monitoring and alerting framework for all the microservices deployed in the organisation. Monitoring falls into three broad categories:
- System Monitoring
- Application Monitoring
- Server Monitoring
Any abnormality in any of the three raises an alert. An alert can be delivered as a Slack notification, an email, a page, or a phone call.
NOTE: This framework does not cover microservice tracing. For tracing, sophisticated open source tools such as Jaeger and Zipkin are available.
Monitoring & Alerting Tools used:
Technologies:
Fluentd - Used for data/log forwarding
OpenTSDB - Used for Time Series Data. Replaces old RRD tools
(competitors would be InfluxDB, Druid, Cassandra)
Elasticsearch - Used for log search (e.g. raise an alert when the "ERROR" count on app1 exceeds 1)
Monitoring Types:
System Monitoring:
includes CPU, Disk, I/O, Processes, Virtual Memory, DHCP, Network etc.
Application Monitoring:
includes Failed Services, Batch/Cron jobs, Cache monitoring,
DB monitoring, transactions, 3rd party interactions etc.
Server Monitoring:
includes Apache Tomcat, Nginx, HAProxy, Request/Response latency, Server health etc.
Scripts have to be written for each of the monitoring modules above. They can be written in Ruby, where the many existing Sensu plugins help keep the scripts short.
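As a rough illustration of what such a script boils down to, here is a minimal sketch in plain Ruby. It relies only on the Sensu convention that a check reports its state through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The helper names classify_free_percent and check_line are hypothetical and not part of any Sensu plugin; a real script would more likely build on the sensu-plugin gem.

```ruby
# Sensu check convention: the exit code carries the state of the check.
STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Hypothetical helper: classify a "percent free" reading against two thresholds.
# Returns the exit code the check script should terminate with.
def classify_free_percent(free_pct, warn:, crit:)
  return 3 unless free_pct.is_a?(Numeric) # no usable reading -> UNKNOWN
  return 2 if free_pct < crit             # below critical threshold
  return 1 if free_pct < warn             # below warning threshold
  0
end

# Hypothetical helper: build the one-line output plus the exit code.
def check_line(name, free_pct, warn:, crit:)
  status = classify_free_percent(free_pct, warn: warn, crit: crit)
  ["#{name} #{STATUS_NAMES[status]}: #{free_pct}% free", status]
end

line, status = check_line('CheckDisk', 15.0, warn: 20, crit: 10)
puts line
```

A real check script would finish with "exit status" so that the Sensu agent can read the result.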
Deployment Topology:
All scripts have to be uploaded to the Chef server with a version. Project configuration and other artefacts are uploaded to the Chef server as well.
chef-client can be run from a development machine or laptop.
The microservice then comes up with all monitoring scripts, configuration files, the application jar, and the Fluentd and OpenTSDB agents.
The following shows the final Java microservices up and running.
Detail explanation:
Flow - (a)
A Sensu agent runs on each microservice. The agent periodically executes the scripts (and server monitoring checks) and sends the output to the Sensu server.
The Sensu server aggregates data from the microservices. Sensu forwards an alert to PagerDuty whenever a Sensu metric breaches its threshold.
Uchiwa is the dashboard for Sensu. It gives a clear view of alerts organised by data centre, VM, and metric.
Use Cases: CPU, Disk, File I/O, Server health etc.
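To make Flow (a) a little more concrete, the sketch below mimics one step of what an agent does: run a check command, capture its output and exit code, and decide whether the result should be escalated. This is not Sensu's implementation. The names run_check and page? are hypothetical, and the rule that only CRITICAL results are paged is an assumption made for the example.

```ruby
require 'open3'
require 'rbconfig'

# Run one check command and collect what an agent would report:
# the check name, its combined stdout/stderr, and its exit code.
def run_check(name, *command)
  output, process_status = Open3.capture2e(*command)
  { name: name, output: output.strip, status: process_status.exitstatus }
end

# Assumption for this sketch: only CRITICAL (exit code 2) results are paged.
def page?(result)
  result[:status] == 2
end

# Stand-in for a real check script: a Ruby one-liner that reports CRITICAL.
result = run_check('demo', RbConfig.ruby, '-e', 'puts "CheckDemo CRITICAL: demo"; exit 2')
puts "#{result[:name]} status=#{result[:status]} page=#{page?(result)}"
```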
Flow - (b)
A Fluentd agent runs on each microservice as a data forwarder. Fluentd watches the application log file path and forwards the data to Elasticsearch.
Elasticsearch indexes the log data. An error-count cron job runs against Elasticsearch and searches the logs for "error", counting matches grouped by application. If any errors are found, the script forwards them to the Sensu server, which in turn converts them into a PagerDuty alert.
Use Cases: server access logs (e.g. how many HTTP 401 responses, grouped by region and application) and application logs (error count grouped by region and application).
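The error-count job in Flow (b) could look roughly like the sketch below. It builds an Elasticsearch search body with a terms aggregation and then picks the applications whose error count exceeds a threshold out of the response. The field names "message" and "application", the aggregation name by_application, and both helper names are assumptions for illustration; the actual index mapping and script are not shown in this post.

```ruby
require 'json'

# Build a search body that counts matching log lines per application.
# Assumed fields: "message" holds the log line, "application" the app name.
def error_count_query(term: 'error', app_field: 'application', message_field: 'message')
  {
    size: 0, # only the aggregation is needed, not the matching documents
    query: { match: { message_field => term } },
    aggs: { by_application: { terms: { field: app_field } } }
  }
end

# Pick the applications whose error count exceeds the threshold
# from a parsed Elasticsearch response.
def applications_over_threshold(response, threshold: 1)
  buckets = response.dig('aggregations', 'by_application', 'buckets') || []
  buckets.select { |b| b['doc_count'] > threshold }
         .map { |b| [b['key'], b['doc_count']] }
         .to_h
end

puts JSON.pretty_generate(error_count_query)
```

Each application returned by applications_over_threshold would then be reported to the Sensu server as a failing check.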
Flow - (c)
This is a very interesting use case. I have forwarded a lot of time series metrics to OpenTSDB, but I would like to mention one metric that helped a lot: REST call requests recorded as time series metrics. Example: (REST endpoint, timestamp, hitCount).
Use Cases:
For every REST call, the OpenTSDB client on the microservice sends data to the OpenTSDB server. The OpenTSDB graph shows traffic grouped by region and date, which helps to understand the HTTP traffic on each data centre.
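As a sketch of what one such data point might look like, the snippet below builds the (REST endpoint, timestamp, hitCount) tuple in the JSON shape that OpenTSDB's HTTP /api/put endpoint accepts (metric, timestamp, value, tags). The metric name rest.hits, the tag names endpoint and region, and the helper name rest_hit_datapoint are assumptions for illustration, not the names used in the actual project.

```ruby
require 'json'

# Build one OpenTSDB data point for a REST call.
# Metric and tag names here are hypothetical.
def rest_hit_datapoint(endpoint:, region:, hit_count:, timestamp: Time.now.to_i)
  {
    metric: 'rest.hits',
    timestamp: timestamp, # seconds since the Unix epoch
    value: hit_count,
    tags: { endpoint: endpoint, region: region }
  }
end

point = rest_hit_datapoint(endpoint: '/orders', region: 'us-east', hit_count: 1)
puts JSON.generate(point)
```

The resulting JSON can be POSTed to /api/put on the OpenTSDB server. Tagging by region is what makes it possible to group the traffic graph by region later.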
Example:
How to use Sensu checks.
Below are a few checks on RAM and disk:
sensu_check 'ram' do
  command 'check-ram.rb -w 20 -c 10'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

sensu_check 'disk' do
  command 'check-disk.rb'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end
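The first check above raises an alert if RAM breaches its threshold: with these flags, check-ram.rb warns when free RAM drops below 20% and goes critical below 10%. The file "check-ram.rb" is shipped in the Chef cookbook as a default file: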
https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb
subscribers: lists the Sensu subscriptions whose clients run the check; here, every client subscribed to "all".