Sunday, October 23, 2016

Data pipeline design for Mobile Data Traffic using AWS


I worked for a mobile operating system company with a user base of more than 50 million across the globe. Together, all of these mobiles generate about a billion data events per day.

What kind of Mobile data ?
Every user activity on the mobile is treated as data. Below are a few examples:
a) Install/Uninstall app
b) Open an app
c) Closing an app
d) Total time spent on an app
e) Network connectivity details
f) Heartbeat (contains OS build number, single/dual SIM, network operator names)


Problem:
The organisation needs data collection, a data pipeline and analytics for all of the mobile data traffic.


Solution:
I leveraged AWS services (EC2, Route 53, ELB, EBS, S3, Lambda, Redshift) for the implementation. Beyond the pipeline design and analytics, I also implemented a robust monitoring and alerting system for the entire pipeline, and took approaches that minimise the operational cost of AWS. Many factors were tested (including ApacheBench load testing) and optimised during implementation.

Overall Architecture

STAGE - 1   (DATA Collection & Data Enriching)

Data Collection (Shopvac):
  The data-collection service, code-named Shopvac, is front facing, so it must be low latency and high throughput. The Shopvac service is hosted on EC2. Below is the network topology diagram, before I go deeper into the service.


Mobiles:
    All mobiles post data using HTTP POST to a hostname whose DNS resolves through AWS Route 53.

AWS Route 53:
   Route 53 routes the traffic to an AWS ELB.

AWS ELB:
     The ELB fronts a list of instances, each running a Java process for data collection. In the non-Amazon world, the same would be done using ZooKeeper/etcd for service discovery.

Shopvac Service: (Data Collection)
    It is a lightweight Java process. It exposes a REST endpoint that listens for data POSTed by mobiles. The endpoint stores data to the local file system (/var/app/shopvac/metric/<eventname.json>).

This file system acts as a buffer. Fluentd tails this path and forwards the data to Amazon S3, flushing whenever a file reaches 100 MB in size or every 5 seconds, whichever comes first.

Fluentd is an open-source log forwarder, comparable to Logstash, syslogd, collectd, rsyslog, etc.
Fluentd parses each record for its event name and creates a file named after the event if one does not exist; otherwise it appends the data to that file.
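A minimal Fluentd configuration for this tail-and-forward setup might look like the sketch below. The bucket name, region, tag and buffer paths are illustrative assumptions, and the flat buffer options are v0.12-style; exact parameter names vary between Fluentd versions.

```conf
# Tail the files Shopvac writes (path from the text above)
<source>
  @type tail
  path /var/app/shopvac/metric/*.json
  pos_file /var/log/td-agent/shopvac.pos
  format json
  tag shopvac.metrics
</source>

# Forward to S3, flushing at 100 MB or every 5 seconds
<match shopvac.**>
  @type s3
  s3_bucket example-mobile-metrics    # assumption: illustrative bucket name
  s3_region us-east-1
  path metrics/
  buffer_path /var/log/td-agent/buffer/s3
  buffer_chunk_limit 100m
  flush_interval 5s
</match>
```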

Below is a look inside the Shopvac service.

Every service developed in the organisation is bootstrapped with Chef so that the processes above are running.

Java Process (Vert.x): The service uses Vert.x asynchronous programming. Vert.x exposes the REST endpoint that is invoked on each mobile POST. The data is then enriched; for example, City, State and Country information (derived from latitude and longitude) is added to the existing metrics.
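As a sketch of that enrichment step, the logic might look like the following Python (the production service is Java/Vert.x; the lookup table, field names and coordinate rounding here are purely illustrative assumptions):

```python
# Sketch of the enrichment step: resolve (latitude, longitude) to
# City/State/Country and merge the result into the incoming event.
# This tiny lookup table is an illustrative stand-in for a real
# reverse-geocoding database.
GEO_LOOKUP = {
    # (rounded lat, rounded lon) -> (city, state, country)
    (13, 80): ("Chennai", "Tamil Nadu", "India"),
    (37, -122): ("San Jose", "California", "USA"),
}

def enrich_event(event):
    """Return a copy of the event with location fields added."""
    key = (round(event["lat"]), round(event["lon"]))
    city, state, country = GEO_LOOKUP.get(key, ("unknown",) * 3)
    enriched = dict(event)  # do not mutate the caller's event
    enriched.update({"city": city, "state": state, "country": country})
    return enriched
```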

The Java process stores the data on the local file system.

Fluentd: Fluentd tails the local file system where the metrics are stored and forwards the mobile data to AWS S3. It also forwards logs to Elasticsearch.

OpenTSDB Client: All time-series data is written to OpenTSDB. One of the most important data points we store in OpenTSDB is REST endpoint calls: whenever the REST endpoint is invoked, a time-series data point is recorded. This shows how many times the endpoint is invoked per hour, per day and per week, giving traffic insight such as which periods have peak traffic on the server.
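A sketch of how such a data point could be shaped for OpenTSDB's HTTP `/api/put` endpoint follows; the metric and tag names are assumptions for illustration, not the production schema:

```python
import json
import time

def build_datapoint(endpoint, ts=None):
    """Build one OpenTSDB data point recording a REST-endpoint hit.
    Metric and tag names here are illustrative assumptions."""
    return {
        "metric": "shopvac.rest.invocations",  # hypothetical metric name
        "timestamp": int(ts if ts is not None else time.time()),
        "value": 1,                            # one invocation
        "tags": {"endpoint": endpoint},
    }

# In production this JSON body would be POSTed to
# http://<opentsdb-host>:4242/api/put
payload = json.dumps(build_datapoint("/metric", ts=1477180800))
```

Aggregating the `value` counter by hour, day or week then yields the per-period invocation counts described above.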

Sensu Client: Sensu check scripts are deployed along with Chef. Basic checks such as disk utilisation, RAM utilisation, server health and many other metrics are reported by the Sensu client to the Sensu server. On an abnormality or threshold breach, the Sensu server alerts through PagerDuty.
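Sensu checks follow a simple exit-code convention (0 = OK, 1 = warning, 2 = critical). A disk-utilisation check in that style might look like this sketch; the thresholds are example values, not the production ones:

```python
# Sketch of a Sensu-style disk check. Sensu's convention:
# exit 0 = OK, 1 = warning, 2 = critical.
import shutil

def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, message) for disk usage of `path`.
    Thresholds are illustrative example percentages."""
    usage = shutil.disk_usage(path)
    pct = usage.used / usage.total * 100
    if pct >= crit:
        return 2, f"CRITICAL: {pct:.1f}% used on {path}"
    if pct >= warn:
        return 1, f"WARNING: {pct:.1f}% used on {path}"
    return 0, f"OK: {pct:.1f}% used on {path}"
```

When run by the Sensu client, a thin wrapper would print the message and exit with the returned code so the Sensu server can alert on 1 or 2.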


STAGE - 2   (Data Processing)

AWS S3 is the source of truth: all types of metrics are forwarded to S3 unconditionally.

AWS Lambda (Serverless Architecture)
Lambda listens on the S3 bucket and filters for the metrics of interest. Instead of forwarding these metrics directly to Redshift, it forwards them to the CMET service (developed internally as a proxy service).

Note: Lambda is billed for the compute time your code consumes, so Lambda code should be as small as possible and also error free. Lambda is difficult to debug.
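A minimal sketch of such a Lambda handler is below, assuming the standard S3 event-notification shape; the "interested" key prefixes are hypothetical, and the actual forward to CMET is left as a placeholder:

```python
# Sketch of the filtering Lambda. It inspects S3 event notifications
# and keeps only the object keys under prefixes we care about.
# The prefixes are illustrative assumptions, not the real event list.
import urllib.parse

INTERESTED_PREFIXES = ("metrics/app_install", "metrics/app_open")

def lambda_handler(event, context):
    """Collect interesting S3 keys; in production each key's contents
    would be forwarded to the CMET proxy in front of Redshift."""
    forwarded = []
    for record in event.get("Records", []):
        # S3 URL-encodes keys in notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(INTERESTED_PREFIXES):
            forwarded.append(key)  # placeholder for the CMET POST
    return {"forwarded": forwarded}
```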


Note (Non-Amazon World):
S3 would be replaced with Kafka.
Lambda would be replaced with Kafka consumers/Storm.

STAGE-3  (DATA Storing)

Redshift was selected as the store because of its advantages: it is a columnar, massively parallel data warehouse that supports standard SQL.

Lambda could forward metrics directly to Redshift, but that would not be the right approach: the data warehouse would be exposed to events that are abnormal in behaviour, and every database has a limit on concurrent connections. There is a need for an intermediary service that acts as a proxy to Redshift. CMET is that service: it acts as a proxy, manages connections and drops long-held connections.


Good use cases of CMET:

  1.  Long-running queries are stopped and marked as failures so that Lambda retries. This helps when Redshift can neither respond nor process at that point in time, and it also reduces the load on Redshift.
  2.  Read connections are exposed to one set of users (for GUI visualisation), while write connections go to users such as Lambda and admins.
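The two CMET behaviours above (capping concurrent connections, and failing long queries so that Lambda retries) can be sketched as follows; the class name, limits and call shape are illustrative assumptions, not the internal service's API:

```python
# Sketch of a CMET-style proxy: a semaphore caps concurrent Redshift
# connections, and a query that overruns its timeout is reported as a
# failure so the caller (e.g. Lambda) retries later.
import threading

class CmetProxy:
    def __init__(self, max_connections=10, query_timeout_s=30):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = query_timeout_s

    def execute(self, run_query):
        """run_query(timeout_s) performs the actual Redshift call.
        Returns (ok, result); ok=False tells the caller to retry."""
        if not self._slots.acquire(timeout=1.0):
            return False, "no free connection slot"   # back-pressure
        try:
            return True, run_query(self._timeout)
        except TimeoutError:
            return False, "query exceeded timeout"    # caller retries
        finally:
            self._slots.release()
```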