Thursday, March 8, 2018

Cassandra Indexing


How does Cassandra work?
Cassandra is all about indexing, and it is a good fit for write-heavy workloads.
Cassandra can be thought of as a key-value database: every piece of data is reached through a lookup key, in the form of a primary key.

Primary Key:    (mandatory for create/read)
  • Every table must define a primary key.
  • Primary key values are immutable.

  PRIMARY KEY (partition_key):
  A partition key always maps to one node in the cluster (plus that node's replicas, depending on the replication factor), and that partition’s data will always be found there.
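
To make the partition-key-to-node mapping concrete, here is a minimal Python sketch. It assumes a toy three-node ring and uses MD5 as the hash purely for illustration; Cassandra's own partitioner and token ranges differ, replication is ignored, and the node names and the `token_for` / `node_for` helpers are hypothetical.

```python
import hashlib
from bisect import bisect_left

# Assumption: tokens live in [0, 2**128 - 1] because MD5 yields 128 bits.
MAX_TOKEN = 2**128 - 1

# Hypothetical ring: each node owns the token range that ends at its token.
RING = [
    (MAX_TOKEN // 3, "node-a"),
    (2 * MAX_TOKEN // 3, "node-b"),
    (MAX_TOKEN, "node-c"),
]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (illustrative hash, not Cassandra's)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def node_for(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    idx = bisect_left(TOKENS, token_for(partition_key))
    return RING[idx][1]


# The same partition key always lands on the same node.
print(node_for("jochasinga") == node_for("jochasinga"))
```

Because the node is derived only from the partition key, a read that supplies the key can go straight to the owning node instead of asking every node.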


We can never access the second-level data (for instance, the email of a user) without first going through the primary key, here the username.
The CQL (Cassandra Query Language) query to get the email would be:
SELECT "email" FROM "user_tweets" WHERE "username" = 'jochasinga';      // The primary key must be passed in the query

Another form of primary key:
PRIMARY KEY ((partition_key), primary_key1, primary_key2)
Primary keys 1 & 2 are also called clustering columns.
While the partition key determines data locality (it points to a node in the cluster), the clustering columns specify the order in which the data is arranged inside the partition.
The association looks like a nested map:    PARTITION KEY -> PRIMARY KEY -> DATA
Map[String, Map[String, Data]]   // keys of a map are unique
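
The nested-map association above can be sketched in Python as follows. This is only a mental model, not how Cassandra stores data on disk; the `PartitionedTable` class, its method names, and the integer tweet id used as the clustering key are assumptions made for illustration.

```python
class PartitionedTable:
    """Toy model: partition key -> {clustering key -> row data}."""

    def __init__(self):
        self._partitions = {}

    def upsert(self, partition_key, clustering_key, data):
        # Keys of a map are unique, so writing the same key pair overwrites the row.
        self._partitions.setdefault(partition_key, {})[clustering_key] = data

    def read_partition(self, partition_key):
        # Rows inside a partition come back ordered by clustering key.
        rows = self._partitions.get(partition_key, {})
        return sorted(rows.items())

    def read_row(self, partition_key, clustering_key):
        # Returns None when the partition or the row does not exist.
        return self._partitions.get(partition_key, {}).get(clustering_key)


tweets = PartitionedTable()
tweets.upsert("jochasinga", 2, {"body": "second tweet"})
tweets.upsert("jochasinga", 1, {"body": "first tweet"})
tweets.upsert("jochasinga", 1, {"body": "first tweet, edited"})

for tweet_id, row in tweets.read_partition("jochasinga"):
    print(tweet_id, row["body"])
```

The outer lookup corresponds to finding the partition, and the inner sorted map corresponds to the ordering that the clustering columns impose inside it.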

More information

SELECT * FROM "user_tweets" WHERE "email" = 'jo.chasinga@gmail.com';     // Throws an error, because no primary key is passed in the query.
Cassandra introduced secondary indexes to support queries like the one above, which look up by email without the primary key.
A query that uses a secondary index (with no primary key) spans all nodes in the cluster (fan-out).
Cassandra supports secondary indexes on column families where the column name is known (not on dynamic columns).
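
The fan-out cost can be illustrated with a small Python simulation. It assumes each node keeps a local index covering only its own rows, and it places rows with a toy byte-sum rule instead of a real partitioner; the `Node` and `Cluster` classes and their methods are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.rows = {}         # username -> row
        self.email_index = {}  # email -> usernames stored on this node only

    def put(self, username, email):
        self.rows[username] = {"username": username, "email": email}
        self.email_index.setdefault(email, set()).add(username)


class Cluster:
    def __init__(self, names):
        self.nodes = [Node(name) for name in names]

    def _owner(self, username):
        # Toy placement rule (assumption): byte sum modulo node count.
        return self.nodes[sum(username.encode("utf-8")) % len(self.nodes)]

    def put(self, username, email):
        self._owner(username).put(username, email)

    def get_by_username(self, username):
        # The primary key identifies the owning node, so one node is contacted.
        node = self._owner(username)
        return node.rows.get(username), 1

    def get_by_email(self, email):
        # No primary key: every node's local index has to be consulted.
        hits, contacted = [], 0
        for node in self.nodes:
            contacted += 1
            for username in node.email_index.get(email, ()):
                hits.append(node.rows[username])
        return hits, contacted


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.put("jochasinga", "jo.chasinga@gmail.com")
print(cluster.get_by_username("jochasinga")[1])          # nodes contacted by key
print(cluster.get_by_email("jo.chasinga@gmail.com")[1])  # nodes contacted by index
```

In this model the number of nodes contacted by the index lookup grows with the cluster size, while the lookup by primary key stays at one node.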


Advantages with Cassandra:
  •   Supports write operations at massive scale 
  •   Highly Available
  •   Elastic scalability  (new nodes can be added to cluster seamlessly)
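
Key information on Cassandra:
  •  The schema has to be defined as per the Cassandra documentation.
  •  Choosing the right primary key and clustering keys will be a challenge for the MO data set.
  •  Primary key values in Cassandra are immutable (updates are not allowed on the primary key).
  •  The partition key must be provided for read operations; clustering keys can additionally be supplied to narrow the read within the partition.
  •  Secondary indexes do not scale well in Cassandra, as they work without the partition key (fan-out pattern).
  •  Secondary indexes are applicable only to known, statically defined columns.
  •  Aggregation support on the Cassandra nodes is limited: CQL offers basic functions such as COUNT, MIN, MAX, SUM and AVG, but aggregating across many partitions is expensive, so heavier aggregation is usually done in the client or an external engine.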
