Thursday, December 21, 2017

Kafka Performance Tuning

We all know the power and advantages of Kafka. Apache Kafka is a publish-subscribe messaging system which basically has three major components:

KAFKA CONSUMER
KAFKA PRODUCER and
KAFKA BROKER




Broker Side Configuration

Producer Side Configuration
Compression
Batch size
Sync or Async

Consumer Side Configuration
Fetch size - applicable for all consumer instances in a consumer group.


Broker Side Configuration

num.replica.fetchers
This configuration parameter defines the number of threads that replicate data from the leader to the followers. The value can be increased if more threads are available; with more replica fetchers, replication completes in parallel and followers catch up faster.

replica.fetch.max.bytes
This parameter controls how much data a follower fetches from a partition in each fetch request. It is good to increase this value so that replicas on the followers are built faster.

replica.socket.receive.buffer.bytes
If fewer threads are available for replication, we can increase the size of this buffer. It helps hold more data when the replication threads are slow compared to the incoming message rate.

num.partitions
This is a very important configuration which should be taken care of when running Kafka in production. The number of partitions determines the level of parallelism: the more partitions a topic has, the more data can be written in parallel, which automatically increases throughput.

However, having more partitions slows down performance and throughput if the OS and hardware configuration cannot handle them.
How many partitions to create for a topic depends on the available threads and disks.

num.io.threads
The value for I/O threads depends directly on how many disks you have in the cluster. These threads are used by the broker to execute requests. We should have at least as many I/O threads as we have disks.
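
Putting the above broker settings together, a minimal server.properties sketch might look like the following. The values are illustrative assumptions only and need to be sized against your own hardware, topics and message rates:

# server.properties - illustrative values, not recommendations
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
replica.socket.receive.buffer.bytes=65536
num.partitions=8
num.io.threads=8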

Producer:

Compression
compression.codec
Compression reduces the disk footprint, leading to faster reads and writes. Currently Kafka supports the values 'none', 'gzip' and 'snappy'.
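
As a minimal sketch with the newer Java producer client (where the equivalent property is compression.type; the broker address below is an assumption):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("compression.type", "snappy");          // 'none', 'gzip' or 'snappy'
KafkaProducer<String, String> producer = new KafkaProducer<>(props);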

Batch Size
batch.size measures the batch size in total bytes instead of the number of messages. It controls how many bytes of data to collect before sending messages to the Kafka broker. Set this as high as possible, without exceeding available memory. The default value is 16384.

If you increase the size of your buffer, it might never get full. The Producer sends the information eventually, based on other triggers, such as linger time in milliseconds.

It is always a bit confusing to decide what batch size will be optimal. A large batch size is great for high throughput, but messages sit longer in the buffer, so you pay for it with latency. In other words, batching is a trade-off between latency and throughput.
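
Continuing the same Properties object from the compression sketch above, batching in the Java producer is controlled with batch.size and linger.ms (the values below are illustrative only):

props.put("batch.size", 16384);   // bytes collected per partition before a batch is sent (default 16384)
props.put("linger.ms", 5);        // wait up to 5 ms for a batch to fill before sending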

Async Producers
Publish the message and get a callback with the acknowledgement of the send status, instead of blocking on every send. In the old (Scala) producer this is controlled by:
'producer.type=async' to make the producer asynchronous
'queue.buffering.max.ms' = duration of the batch window
'batch.num.messages' = number of messages to be sent in one batch
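
With the newer Java producer sketched above, send() is already asynchronous and the acknowledgement arrives through a callback. A minimal sketch, reusing that producer (the topic name is an assumption):

import org.apache.kafka.clients.producer.ProducerRecord;

producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace();          // send failed - log, retry or alert
    } else {
        // acknowledgement: partition and offset the record was written to
        System.out.println("acked partition " + metadata.partition() + ", offset " + metadata.offset());
    }
});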

Large Messages  
Consider placing large files on the standard storage and using Kafka to send a message with the file location. In many cases this can be much faster than using Kafka 
to send the large file itself.
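
A minimal sketch of this pointer pattern, reusing the same producer (the storage path and topic name are hypothetical):

// Copy the large file to shared storage first, then publish only its location through Kafka.
String fileLocation = "hdfs://storage/incoming/bigfile.csv";   // hypothetical path
producer.send(new ProducerRecord<>("file-events", "bigfile.csv", fileLocation));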

Sunday, March 19, 2017

Monitoring & Alerting for Micro Services


I got an opportunity to design and develop a Monitoring and Alerting framework for all the micro services deployed in the organisation. Monitoring is majorly classified into 3 types:
  1.   System Monitoring
  2.   Application Monitoring
  3.   Server Monitoring 
Any abnormality in any of the above 3 will raise an alert. An alert can be a "Slack Notification", "Email", "Pager" or "Call".

NOTE: This is not for micro services tracing. For micro services tracing, sophisticated open source tools are available, like Jaeger and Zipkin.

Monitoring & Alerting Tools used:
  Sensu
  Uchiwa
  PagerDuty

Technologies:
  Fluentd        - used for data/log forwarding
  OpenTSDB       - used for Time Series Data; replaces old RRD tools
                   (competitors would be InfluxDB, Druid, Cassandra)
  ElasticSearch  - used for log search (Ex: "ERROR" count > 1 on app1 raises an alert)


Monitoring Types:

System Monitoring:
        includes CPU, Disk, I/O, Processes, Virtual Memory, DHCP, Network etc.
Application Monitoring:
        includes Failed Services, Batch/Cron jobs, Cache monitoring,
        DB monitoring, Transactions, 3rd party interactions etc.
Server Monitoring:
        Apache Tomcat, Nginx, HAProxy, Request/Response latency, Server health etc.


Scripts have to be written for all of the above monitoring modules. Scripts can be written in Ruby, where Sensu provides many plugins which help keep the number of lines in the scripts small.

Deployment Topology:


All scripts have to be uploaded to the Chef server with a version. Project configuration and other artefacts will also be uploaded to the chef-server.

Chef-client can be run from a development machine/laptop.

The micro service will be up and running with all monitoring scripts, configuration files, the application jar, and the fluentd and opentsdb agents.
Below shows the final Java micro service which will be up and running.






Detailed explanation:

Flow - (a)
The Sensu agent runs on the micro service. The agent periodically executes the scripts (and server monitoring checks) and sends the output to the Sensu server.
The Sensu server aggregates data from the micro services. Sensu forwards alerts to PagerDuty if any threshold is breached by a Sensu metric.
Uchiwa is the dashboard for Sensu. It gives a nice alerts view organised by data centre, VMs and metrics.

Use Cases:  CPU, Disk, File I/O, Server health etc.

Flow - (b)
The Fluentd agent running on the micro service is a data forwarder. Fluentd tails the application log file path and forwards the data to ElasticSearch.
ElasticSearch indexes the log data. An error-count cron job runs against ElasticSearch and searches the "error" count in the logs, grouped by application. If there are errors in the logs, the script forwards the result to the Sensu server, which in turn converts it into a PagerDuty alert.

Use Cases: server access logs (how many 401 REST response codes, grouped by region and application), application log error counts grouped by region and application.
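
A minimal Fluentd (td-agent) configuration sketch for this flow, assuming the application writes to /var/log/app/app.log and ElasticSearch runs on a host named es1 (both hypothetical), with the fluent-plugin-elasticsearch output:

<source>
  @type tail
  path /var/log/app/app.log                # hypothetical application log path
  pos_file /var/log/td-agent/app.log.pos
  tag app.log
  format none
</source>

<match app.log>
  @type elasticsearch
  host es1
  port 9200
  logstash_format true
</match>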

Flow - (c)
This is a very interesting use case. I have used a lot of time series metrics forwarded to OpenTSDB, but I would like to mention one metric which helped a lot: REST call requests are recorded as time series metrics. Example: (REST endpoint, timestamp, hitCount).

Use Cases:
For every REST call, the OpenTSDB client on the micro service sends a data point to the OpenTSDB server. The OpenTSDB graph shows traffic grouped by region and date, which helps to understand the HTTP traffic on each data centre.
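
A minimal sketch of pushing such a data point to OpenTSDB's HTTP /api/put endpoint from Java (the host, metric name and tags are assumptions):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// One data point: a hit count for a REST endpoint at a given timestamp.
String json = "{\"metric\":\"rest.endpoint.hits\","                       // hypothetical metric name
            + "\"timestamp\":" + (System.currentTimeMillis() / 1000) + ","
            + "\"value\":1,"
            + "\"tags\":{\"endpoint\":\"/orders\",\"region\":\"dc1\"}}";

HttpURLConnection conn = (HttpURLConnection) new URL("http://opentsdb:4242/api/put").openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/json");
conn.setDoOutput(true);
try (OutputStream os = conn.getOutputStream()) {
    os.write(json.getBytes(StandardCharsets.UTF_8));
}
int status = conn.getResponseCode();   // OpenTSDB returns 204 No Content on success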


Example:
How to use Sensu checks.

Below are a few checks on RAM & disk:

sensu_check 'ram' do
  command 'check-ram.rb -w 20 -c 10'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

sensu_check 'disk' do
  command 'check-disk.rb'
  interval node['monitor']['check_interval']
  subscribers %w[all]
end

The above Ruby resource raises an alert if RAM breaches the threshold. The file "check-ram.rb" will be available in the Chef cookbooks as a default file:
https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb

subscribers: defines which Sensu clients (by subscription, here 'all') pick up and run this check.