Thursday, December 21, 2017

Kafka Performance Tuning

We all know the power and advantages of Kafka. Apache Kafka is a publish-subscribe messaging system which basically has three major components:

KAFKA CONSUMER
KAFKA PRODUCER and
KAFKA BROKER




Broker Side Configuration

Producer Side Configuration
Compression
Batch size
Sync or Async

Consumer Side Configuration
Fetch size - applicable for all consumer instances in a consumer group.


Broker Side Configuration

num.replica.fetchers
This configuration parameter defines the number of threads that replicate data from the leader to the followers. The value can be increased if more threads are available; with more replica fetchers, replication completes in parallel and followers catch up faster.

replica.fetch.max.bytes
This parameter controls how much data a follower fetches from a partition in each fetch request. It is good to increase this value so that replicas on the followers are built faster.

replica.socket.receive.buffer.bytes
If fewer threads are available for replication, we can increase the size of this buffer. It helps hold more data when the replication threads are slow compared to the incoming message rate.

num.partitions
This is a very important configuration which should be taken care of when running Kafka in production. The number of partitions determines the level of parallelism: the more partitions a topic has, the more data can be written in parallel, which automatically increases throughput.

However, having more partitions slows down performance and throughput if the OS and hardware configuration cannot handle them.
How many partitions to create for a topic depends on the available threads and disks.

num.io.threads
The value for I/O threads depends directly on how many disks you have in the cluster. These threads are used by the broker to execute requests. We should have at least as many I/O threads as we have disks.
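
Putting the above broker settings together, a minimal server.properties sketch might look like the following. The values are illustrative assumptions only and need to be sized against your own hardware, topics and message rates:

# server.properties - illustrative values, not recommendations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.socket.receive.buffer.bytes=65536
num.partitions=8
num.io.threads=8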

Producer:

Compression
compression.codec
Compression reduces the disk footprint, leading to faster reads and writes. Currently Kafka supports the values 'none', 'gzip' and 'snappy'.
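
As a minimal sketch with the newer Java producer client (where the equivalent property is compression.type; the broker address below is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("compression.type", "snappy");          // 'none', 'gzip' or 'snappy'
KafkaProducer<String, String> producer = new KafkaProducer<>(props);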

Batch Size
batch.size measures the batch size in total bytes instead of the number of messages. It controls how many bytes of data to collect before sending messages to the Kafka broker. Set this as high as possible, without exceeding available memory. The default value is 16384.

If you increase the size of your buffer, it might never get full. The Producer sends the information eventually, based on other triggers, such as linger time in milliseconds.

It is always a bit confusing to decide what batch size will be optimal. A large batch size is great for high throughput, but messages sit longer in the buffer, so you pay for it with latency. In other words, batching is a trade-off between latency and throughput.
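
Continuing the same Properties object from the compression sketch above, batching in the Java producer is controlled with batch.size and linger.ms (the values below are illustrative only):

props.put("batch.size", 16384);   // bytes collected per partition before a batch is sent (default 16384)
props.put("linger.ms", 5);        // wait up to 5 ms for a batch to fill before sending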

Async Producers
Publish the message and get a callback with the acknowledgement of the send status, instead of blocking on every send. In the old (Scala) producer this is controlled by:
'producer.type=async' to make the producer asynchronous
'queue.buffering.max.ms' = duration of the batch window
'batch.num.messages' = number of messages to be sent in one batch
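
With the newer Java producer sketched above, send() is already asynchronous and the acknowledgement arrives through a callback. A minimal sketch, reusing that producer (the topic name is an assumption):

import org.apache.kafka.clients.producer.ProducerRecord;

producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace();          // send failed - log, retry or alert
    } else {
        // acknowledgement: partition and offset the record was written to
        System.out.println("acked partition " + metadata.partition() + ", offset " + metadata.offset());
    }
});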

Large Messages  
Consider placing large files on the standard storage and using Kafka to send a message with the file location. In many cases this can be much faster than using Kafka 
to send the large file itself.
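
A minimal sketch of this pointer pattern, reusing the same producer (the storage path and topic name are hypothetical):

// Copy the large file to shared storage first, then publish only its location through Kafka.
String fileLocation = "hdfs://storage/incoming/bigfile.csv";   // hypothetical path
producer.send(new ProducerRecord<>("file-events", "bigfile.csv", fileLocation));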

Sunday, March 19, 2017

Monitoring & Alerting for Micro Services


I got an opportunity to design and develop a Monitoring and Alerting framework for all the micro services deployed in the organisation. Monitoring is majorly classified into 3 types:
  1.   System Monitoring
  2.   Application Monitoring
  3.   Server Monitoring 
Any abnormality in any of the above 3 will raise an alert. An alert can be a "Slack Notification", "Email", "Pager" or "Call".

NOTE: This is not for micro services tracing. For micro services tracing, sophisticated open source tools are available, like Jaeger and Zipkin.

Monitoring & Alerting Tools used:
  Sensu
  Uchiwa
  PagerDuty

Technologies:
  Fluentd        - used for data/log forwarding
  OpenTSDB       - used for Time Series Data; replaces old RRD tools
                   (competitors would be InfluxDB, Druid, Cassandra)
  ElasticSearch  - used for log search (Ex: "ERROR" count > 1 on app1 raises an alert)


Monitoring Types:

System Monitoring:
        includes CPU, Disk, I/O, Processes, Virtual Memory, DHCP, Network etc.
Application Monitoring:
        includes Failed Services, Batch/Cron jobs, Cache monitoring,
        DB monitoring, Transactions, 3rd party interactions etc.
Server Monitoring:
        Apache Tomcat, Nginx, HAProxy, Request/Response latency, Server health etc.


Scripts have to be written for all of the above monitoring modules. Scripts can be written in Ruby, where Sensu provides many plugins which help keep the number of lines in the scripts small.

Deployment Topology:


All scripts have to be uploaded to the Chef server with a version. Project configuration and other artefacts will also be uploaded to the chef-server.

Chef-client can be run from a development machine/laptop.

The micro service will be up and running with all monitoring scripts, configuration files, the application jar, and the fluentd and opentsdb agents.
Below shows the final Java micro service which will be up and running.






Detailed explanation:

Flow - (a)
The Sensu agent runs on the micro service. The agent periodically executes the scripts (and server monitoring checks) and sends the output to the Sensu server.
The Sensu server aggregates data from the micro services. Sensu forwards alerts to PagerDuty if any threshold is breached by a Sensu metric.
Uchiwa is the dashboard for Sensu. It gives a nice alerts view organised by data centre, VMs and metrics.

Use Cases:  CPU, Disk, File I/O, Server health etc.

Flow - (b)
The Fluentd agent running on the micro service is a data forwarder. Fluentd tails the application log file path and forwards the data to ElasticSearch.
ElasticSearch indexes the log data. An error-count cron job runs against ElasticSearch and searches the "error" count in the logs, grouped by application. If there are errors in the logs, the script forwards the result to the Sensu server, which in turn converts it into a PagerDuty alert.

Use Cases: server access logs (how many 401 REST response codes, grouped by region and application), application log error counts grouped by region and application.
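
A minimal Fluentd (td-agent) configuration sketch for this flow, assuming the application writes to /var/log/app/app.log and ElasticSearch runs on a host named es1 (both hypothetical), with the fluent-plugin-elasticsearch output:

<source>
  @type tail
  path /var/log/app/app.log                # hypothetical application log path
  pos_file /var/log/td-agent/app.log.pos
  tag app.log
  format none
</source>

<match app.log>
  @type elasticsearch
  host es1
  port 9200
  logstash_format true
</match>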

Flow - (c)
This is a very interesting use case. I have used a lot of time series metrics forwarded to OpenTSDB, but I would like to mention one metric which helped a lot: REST call requests are recorded as time series metrics. Example: (REST endpoint, timestamp, hitCount).

Use Cases:
For every REST call, the OpenTSDB client on the micro service sends a data point to the OpenTSDB server. The OpenTSDB graph shows traffic grouped by region and date, which helps to understand the HTTP traffic on each data centre.
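
A minimal sketch of pushing such a data point to OpenTSDB's HTTP /api/put endpoint from Java (the host, metric name and tags are assumptions):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// One data point: a hit count for a REST endpoint at a given timestamp.
String json = "{\"metric\":\"rest.endpoint.hits\","                       // hypothetical metric name
            + "\"timestamp\":" + (System.currentTimeMillis() / 1000) + ","
            + "\"value\":1,"
            + "\"tags\":{\"endpoint\":\"/orders\",\"region\":\"dc1\"}}";

HttpURLConnection conn = (HttpURLConnection) new URL("http://opentsdb:4242/api/put").openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/json");
conn.setDoOutput(true);
try (OutputStream os = conn.getOutputStream()) {
    os.write(json.getBytes(StandardCharsets.UTF_8));
}
int status = conn.getResponseCode();   // OpenTSDB returns 204 No Content on success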


Example:
How to use Sensu checks.

Below are a few checks on RAM & disk:

sensu_check 'ram' do
  command 'check-ram.rb -w 20 -c 10'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

sensu_check 'disk' do
  command 'check-disk.rb'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

The above Ruby resource raises an alert if RAM breaches the threshold. The file "check-ram.rb" will be available in the Chef cookbooks as a default file:
https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb

subscribers: defines which Sensu clients (by subscription, here 'all') pick up and run this check.