Wednesday 20 November 2019

Interview Q and A for Cassandra DB Part - 1

1. Compare MongoDB and Cassandra

Criteria                 MongoDB            Cassandra

Data Model               Document           Big Table like

Database scalability     Read               Write

Querying of data         Multi-indexed      Using Key or Scan

 

2. What is Cassandra?

Cassandra is one of the most favored NoSQL distributed database management systems from Apache. An open-source technology, Cassandra is designed to store and manage large volumes of data efficiently and without any failure. Highly scalable for Big Data models and originally developed at Facebook, Apache Cassandra is written in Java and supports flexible schemas. Apache Cassandra has no single point of failure. Among the various types of NoSQL databases, Cassandra is a hybrid of a column-oriented and a key-value store database.

The keyspace is the outermost container for an application, and a table (column family) is an entity within a keyspace.

 

3. List the benefits of using Cassandra.

Unlike traditional relational databases and many other NoSQL stores, Apache Cassandra delivers near real-time performance, simplifying the work of Developers, Administrators, Data Analysts and Software Engineers.

• Instead of a master-slave architecture, Cassandra is built on a peer-to-peer architecture, ensuring there is no single point of failure.

• It also assures phenomenal flexibility, as it allows the insertion of multiple nodes into any Cassandra cluster in any datacenter. Further, any client can forward its request to any server.

• Cassandra offers elastic scalability and can easily be scaled up or down as required. With high throughput for read and write operations, this NoSQL application does not need to be restarted while scaling.

• Cassandra is also valued for its strong data replication across nodes, as it stores data at multiple locations, enabling users to retrieve data from another location if one node fails. Users can set the number of replicas they want to create.

• It shows excellent performance on massive datasets and is therefore the preferred NoSQL database for many organizations.

• It operates on a column-oriented structure, which speeds up and simplifies slicing. Data access and retrieval also become more efficient with a column-based data model.

• Further, Apache Cassandra supports a schema-free/schema-optional data model, which removes the need to define up front all the columns required by your application.

 

4. Explain the concept of Tunable Consistency in Cassandra.

Tunable Consistency is a phenomenal characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big Data Architects. Consistency refers to up-to-date and synchronized data rows across all of their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistency models: Eventual Consistency and Strong Consistency.

The former guarantees that, when no new updates are made to a given data item, all accesses eventually return the last updated value. Systems with eventual consistency are said to have achieved replica convergence.

For Strong consistency, Cassandra supports the following condition:

R + W > N, where

N – Number of replicas

W – Number of nodes that need to agree for a successful write

R – Number of nodes that need to agree for a successful read
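For example, with a replication factor of N = 3, writing at QUORUM (W = 2) and reading at QUORUM (R = 2) gives R + W = 4 > 3 = N, so every read overlaps with at least one replica that holds the latest successful write, and strong consistency is achieved.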

 

5. How does Cassandra write?

Cassandra performs the write function by applying two commits-

first it writes to a commit log on disk and then commits to an in-memory structure known as a memtable. Once these two commits are successful, the write is acknowledged. When the memtable is flushed, the data is written to disk in an SSTable (Sorted String Table). Cassandra offers speedier write performance.

 

6. Define the management tools in Cassandra.

• DataStax OpsCenter: a web-based management and monitoring solution for Cassandra clusters and DataStax. It is free to download, and an additional (paid) edition of OpsCenter is also offered.

• SPM primarily administers Cassandra metrics and various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, zookeeper and other Big Data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, creating real-time graphs with zooming, anomaly detection and heartbeat alerting.

 

7. Define memtable.

It is a memory-resident data structure. After the commit log, the data is written to the memtable. Similar to a table, the memtable is an in-memory/write-back cache holding content in key and column format. The data in a memtable is sorted by key, and each column family has a distinct memtable from which column data is retrieved via its key. It stores writes until it is full and is then flushed to disk.

Sometimes, for a single-column family, there will be multiple mem-tables.

 

8. What is SSTable? How is it different from other relational tables?

SSTable expands to ‘Sorted String Table,’ an important data file in Cassandra that receives data from regularly flushed memtables. SSTables are stored on disk and exist for each Cassandra table. Being immutable, SSTables do not allow any further addition or removal of data items once written. For each SSTable, Cassandra creates separate companion files such as a partition index, a partition summary and a Bloom filter.

SSTable is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

 

9. Explain the concept of Bloom Filter.

Associated with SSTables, a Bloom filter is an off-heap (off the Java heap, in native memory) data structure used to check whether an SSTable is likely to contain the requested data before performing any disk I/O operation.

On the read path, Cassandra first consults the memtable and the row cache. The Bloom filter then acts as a per-SSTable partition filter: its role in the read path is to avoid checking every SSTable to find one particular piece of data.

 

10. Explain CAP Theorem.

With a strong requirement to scale systems when additional resources are needed, the CAP Theorem plays a major role in shaping the scaling strategy. It is an efficient way to reason about scaling in distributed systems. The Consistency, Availability and Partition tolerance (CAP) theorem states that in distributed systems like Cassandra, users can enjoy only two of these three characteristics.

One of them must be sacrificed. Consistency guarantees that the client sees the most recent write, Availability guarantees a reasonable response within minimal time, and Partition Tolerance means the system continues operating when network partitions occur. The two practical options are AP and CP.

§  Consistency: means that data is the same across the cluster, so you can read or write to/from any node and get the same data.

§  Availability: means the ability to access the cluster even if a node in the cluster goes down.

§  Partition Tolerance: means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

In order to get both availability and partition tolerance, you have to give up consistency. Consider two nodes, X and Y, in a master-master setup. Now, there is a break in network communication between X and Y, so they can't sync updates. At this point you can either:

A) Allow the nodes to get out of sync (giving up consistency), or

B) Consider the cluster to be "down" (giving up availability)

 

All the combinations available are:

§  CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved).

§  CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down.

§  AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)


 

 

 

 

11. State the differences between a node, a cluster and datacenter in Cassandra.

There are various components in Cassandra. While a node is a single machine running Cassandra, a cluster is a collection of nodes grouped together that hold similar types of data. Data centers are useful components when serving customers in different geographical areas; you can group different nodes of a cluster into different data centers.

 

12. How to write a query in Cassandra?

Using CQL (Cassandra Query Language). cqlsh is used for interacting with the database.
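A minimal illustration in cqlsh, querying the built-in system.local table (present on every node):

SELECT cluster_name, release_version FROM system.local;   -- a simple CQL read

This returns the cluster name and the Cassandra version of the node cqlsh is connected to.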

 

13. What OS Cassandra supports?

Windows and Linux

 

14. What is Cassandra Data Model?

Cassandra Data Model consists of four main components:

Cluster: Made up of multiple nodes and keyspaces

Keyspace: a namespace that groups multiple column families, typically one keyspace per application

Column: consists of a column name, value and timestamp

ColumnFamily: a container of rows, each row holding multiple columns referenced by a row key.
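A minimal CQL sketch of this hierarchy, using hypothetical names (keyspace shop, column family orders):

CREATE KEYSPACE IF NOT EXISTS shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};   -- the keyspace groups column families

CREATE TABLE IF NOT EXISTS shop.orders (   -- a column family (table) inside the keyspace
  order_id uuid PRIMARY KEY,               -- row key
  customer text,                           -- each column stores a name, a value and a timestamp
  amount   decimal
);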

 

15. What is CQL?

CQL is the Cassandra Query Language used to access and query the Apache distributed database. It consists of a CQL parser, and the implementation details are kept on the server. The syntax of CQL is similar to SQL, but it does not alter the Cassandra data model.

 

16. Explain the concept of compaction in Cassandra.

Compaction refers to a maintenance process in Cassandra in which SSTables are reorganized to optimize the data structures on disk, for example after memtables have been flushed into new SSTables. There are two types of compaction in Cassandra:

Minor compaction: started automatically when a new SSTable is created. Here, Cassandra condenses all the equally sized SSTables into one.

Major compaction: triggered manually using nodetool; it compacts all SSTables of a column family into one.
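For example, a major compaction on a hypothetical table shop.orders can be triggered with:

nodetool compact shop orders   # merges all SSTables of shop.orders into one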

 

17. Does Cassandra support ACID transactions?

Unlike relational databases, Cassandra does not support ACID transactions.

 

18. Explain Cqlsh

cqlsh stands for Cassandra Query Language Shell, which provides the CQL interactive terminal. It is a Python-based command-line prompt used on Linux or Windows to execute CQL and shell commands such as ASSUME, CAPTURE, CONSISTENCY, COPY, DESCRIBE and many others. With cqlsh, users can define a schema, insert data and execute queries.
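A short illustrative session (the keyspace name shop is hypothetical):

cqlsh> DESCRIBE KEYSPACES;      -- list all keyspaces on the cluster
cqlsh> USE shop;                -- switch to a keyspace
cqlsh:shop> DESCRIBE TABLES;    -- list the tables in that keyspace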

 

19. What is SuperColumn in Cassandra?

A Cassandra super column is a special element consisting of a collection of similar columns. Super columns are actually key-value pairs whose values are columns. A super column is a sorted array of columns, and they follow a hierarchy when in action: keyspace > column family > super column > column data structure.

Similar to row keys, super column data entries contain no independent values but are used to collect other columns. It is interesting to note that super column keys appearing in different rows do not necessarily have to match.

These super columns are used to improve the performance of the database

 

20. Define the consistency levels for write operations in Cassandra.

• ALL: Highly consistent. A write must be written to commitlog and memtable on all replica nodes in the cluster

• EACH_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in all data centers.

• LOCAL_QUORUM: A write must be written to commitlog and memtable on a quorum of replica nodes in the same data center.

• ONE: A write must be written to commitlog and memtable of at least one replica node.

• TWO, THREE: Same as ONE, but the write must reach at least two and three replica nodes, respectively

• LOCAL_ONE: A write must be written for at least one replica node in the local data center

• ANY: A write must be written to at least one node; if all replica nodes are down, storing a hint suffices

• SERIAL: Linearizable Consistency to prevent unconditional update

• LOCAL_SERIAL: Same as Serial but restricted to local data center
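In cqlsh, the consistency level for the statements that follow can be set interactively, for example:

cqlsh> CONSISTENCY QUORUM;   -- use QUORUM for subsequent requests
cqlsh> CONSISTENCY;          -- show the current consistency level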

 

21. What is difference between Column and Super Column?

Both elements work on the principle of a tuple having a name and a value. However, the former's value is a string, while the value of the latter is a map of columns with different data types.

Unlike Columns, Super Columns do not contain the third component of timestamp.

 

22. What is Column Family?

As the name suggests, a column family refers to a structure that can hold a virtually unlimited number of rows. Each entry is a key-value pair, where the key is the name of the column and the value represents the column data. It is much like a hashmap in Java or a dictionary in Python. Remember, the rows are not limited to a predefined list of columns here; the column family is absolutely flexible, with one row having 100 columns while another has only 2.

 

23. Define the use of Source Command in Cassandra.

The SOURCE command is used to execute a file consisting of CQL statements.
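For example (the file name is hypothetical):

cqlsh> SOURCE './setup_tables.cql';   -- runs every CQL statement in the file and prints the results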

 

24. What is Thrift?

Thrift is the name of the RPC client used to communicate with the Cassandra server.

Thrift is a legacy RPC protocol/API combined with a code generation tool. The purpose of using Thrift in Cassandra is to facilitate access to the database across programming languages.

 

25. Explain Tombstone in Cassandra.

A tombstone is a row marker indicating a column deletion. These marked columns are removed during compaction. Tombstones are of great significance because Cassandra supports eventual consistency: the deletion marker must be propagated to all replicas so that deleted data does not reappear.

 

26. What Platforms Cassandra runs on?

Since Cassandra is a Java application, it can successfully run on any Java-driven platform, i.e. any platform with a Java Runtime Environment (JRE)/Java Virtual Machine (JVM). Cassandra also runs on RedHat, CentOS, Debian and Ubuntu Linux platforms.

 

27. Name the ports Cassandra uses.

By default, Cassandra uses port 7000 for cluster (internode) communication (7001 if SSL is enabled), 9042 for native protocol clients, 9160 for legacy Thrift clients, and 7199 for JMX. The internode communication and native protocol ports are configurable in the Cassandra configuration file (cassandra.yaml), while the JMX port is configurable in cassandra-env.sh (through JVM options). All of these are TCP ports.

 

28. Can you add or remove Column Families in a working Cluster?

Yes, column families can be added and removed on a working cluster, keeping in mind the following steps.

•           Do not forget to clear the commitlog with ‘nodetool drain’

•           Turn off Cassandra to check that there is no data left in commitlog

•           Delete the sstable files for the removed CFs

 

29. What is Replication Factor in Cassandra?

The replication factor is the measure of the number of copies of the data that exist in the cluster. It is defined per keyspace and, as a rule, should not exceed the number of nodes in the cluster.

 

30. Can we change Replication Factor on a live cluster?

Yes, but it will require running repair to alter the replica count of existing data.

 

31. How to iterate all rows in ColumnFamily?

Using get_range_slices. You can start the iteration with an empty string and, after each iteration, the last key read serves as the start key for the next iteration.

 

32. What do you understand by Data Replication in Cassandra?

Database replication is the frequent electronic copying of data from a database on one computer or server to a database on another, so that all users share the same level of information.

Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule,  the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.
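A brief sketch of how the replication factor is declared per keyspace (the keyspace name and data center labels are hypothetical):

CREATE KEYSPACE IF NOT EXISTS analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};
  -- three replicas of each row in DC1 and two in DC2; no replica is a master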

 

33. What are the three components of Cassandra write?

The three components are:

1.         Commitlog write

2.         Memtable write

3.         SStable write

Cassandra first writes data to the commit log, then to an in-memory table structure called the memtable, and finally to an SSTable on disk.

 

34. Explain zero consistency.

In zero consistency, the write operations are handled in the background, asynchronously. It is the fastest way to write data.

 

35. When do you have to avoid secondary indexes?

Avoid secondary indexes on columns containing a high count of unique (high-cardinality) values, as each index lookup will then produce only a few results and the index becomes inefficient.

 

37. What are secondary indexes?

Secondary indexes are indexes built over column values. In other words, let’s say you have a user table which contains each user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, solving the inverse query (given an email, fetch the user ID) requires a secondary index.
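A minimal sketch of that users/email scenario, using hypothetical keyspace and table names:

CREATE TABLE IF NOT EXISTS app.users (
  user_id uuid PRIMARY KEY,   -- primary index: look a user up by ID
  email   text
);

CREATE INDEX IF NOT EXISTS users_by_email ON app.users (email);   -- secondary index over the email values

SELECT user_id FROM app.users WHERE email = 'alice@example.com';  -- the inverse query is now possible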

 

38. When to use secondary indexes?

You want to query on a column that isn’t the primary key and isn’t part of a composite key. The column you want to query on should have few unique values (for example, a Town column is a good candidate for secondary indexing because many people will be from the same town, whereas a date-of-birth column will not be such a good choice).

 

39. I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?

XX%

 

40. What is the use of “void close()” method?

This method is used to close the current session instance.

 

41. What are the collection data types provided by CQL?

There are three collection data types:

1.         List : A list is a collection of one or more ordered elements.

2.         Map : A map is a collection of key-value pairs.

3.         Set : A set is a collection of one or more elements.

 

42. Mention what is Cassandra- CQL collections?

Cassandra CQL collections help you to store multiple values in a single variable. In Cassandra, you can use CQL collections in following ways

•           List: It is used when the order of the data needs to be maintained and a value may be stored multiple times (a list can hold duplicate elements)

•           SET: It is used for a group of elements to store and return in sorted order (holds unique, non-repeating elements)

•           MAP: It is a data type used to store a key-value pair of elements
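A small sketch of all three collection types, using hypothetical names and values:

CREATE TABLE IF NOT EXISTS app.user_profile (
  user_id uuid PRIMARY KEY,
  phones  list<text>,            -- ordered, may contain duplicates
  emails  set<text>,             -- unique elements, returned in sorted order
  todo    map<timestamp, text>   -- key-value pairs
);

UPDATE app.user_profile
SET phones = phones + ['555-1234'],
    emails = emails + {'alice@example.com'},
    todo['2019-11-20'] = 'renew certificate'
WHERE user_id = 12345678-1234-1234-1234-123456789abc;   -- hypothetical UUID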

 

43. Which command is used to start the cqlsh prompt?

cqlsh

 

44. What is the use of “cqlsh –version” command?

This command provides the version of cqlsh you are using.

 

45. List the steps in which Cassandra writes changed data into commitlog?

Cassandra appends changed data to the commitlog. The commitlog then acts as a crash-recovery log for the data. A write operation is never considered successful until the changed data has been appended to the commitlog.

Data will not be lost once commitlog is flushed out to file.

 

46. What is the use of “ResultSet execute(Statement statement)” method?

This method is used to execute a query. It requires a statement object.

 

47. Mention what are the values stored in the Cassandra Column?

There are three values in Cassandra Column. They are:

1.         Column Name

2.         Value

3.         Time Stamp

 

48. What do you understand by Kundera?

Kundera is an object-relational mapping (ORM) implementation for Cassandra which is written using Java annotations.

 

49. Define composite type in Cassandra?

In Cassandra, a composite type allows you to define a key or a column name as a concatenation of data of different types. You can use two types of composite types:

1.         Row Key

2.         Column Name
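A brief CQL sketch of a composite row key (the keyspace, table and column names are hypothetical):

CREATE TABLE IF NOT EXISTS iot.sensor_events (
  sensor_id text,
  day       date,
  ts        timestamp,
  reading   double,
  PRIMARY KEY ((sensor_id, day), ts)   -- (sensor_id, day) is a composite partition key; ts is a clustering column
);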

 

50. Explain what is a keyspace in Cassandra?

In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster contains one or more keyspaces.

51. Partition key columns are optional if you have clustering columns.

False

 

52. What benefits do Clustering columns provide?

Reading sorted data is a matter of seeking the disk head once.
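A short sketch of why (hypothetical names): rows within a partition are stored on disk in clustering-column order, so a slice comes back already sorted:

CREATE TABLE IF NOT EXISTS iot.readings (
  sensor_id text,
  ts        timestamp,
  value     double,
  PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

SELECT * FROM iot.readings WHERE sensor_id = 's1' LIMIT 10;   -- newest readings first, no extra sort needed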

 

53. Cassandra works best on network attached storage.

False

 

54. How much data can a single Cassandra node effectively handle?

1 to 3 terabytes

 

55. What are Partitions?

Partitions group rows physically together on disk, based on the partition key.

 

56. What is the role of the partitioner?

It hashes the partition key values to create a partition token.

 

57. How does a node join the cluster?

Nodes join the cluster by communicating with any existing node.

Cassandra finds these seed nodes (a list of possible contact points) in cassandra.yaml.

Seed nodes communicate the cluster topology to the joining node.

Once the new node joins the cluster, all nodes are peers.

 

58. What is the role of Drivers in node coordination?

Drivers intelligently choose which node would best coordinate a request.

TokenAwarePolicy – the driver chooses the node that contains the data

RoundRobinPolicy – the driver round-robins around the ring

DCAwareRoundRobinPolicy – the driver round-robins within the target data center

 

59. Why Cassandra 

Peers instead of master/slave

Linear scale performance 

Always on reliability 

Data can be stored geographically close to clients

 

60. What is VNodes

Each node has several tokens (token ranges) it manages

Adding/removing nodes with vnodes should not make the cluster unbalanced 

By default, each node has 256 vnodes 

VNodes automate token range assignment

Configure vnode settings in cassandra.yaml

num_tokens: a value greater than one turns on vnodes (256 by default)

 

61. What is Gossip?

Gossip is the protocol Cassandra nodes use to spread cluster metadata among themselves.

 

62. How is a gossip node chosen?

Each Node initiates a gossip round every few seconds 

Picks one to three nodes to gossip with 

Nodes can gossip with ANY other node in cluster 

Seed nodes and downed nodes are slightly favored (probabilistically)

Nodes do not track which nodes they gossiped with previously

Gossip reliably and efficiently spreads node metadata through the cluster

Fault tolerant – continues to spread when nodes fail 

 

63. What is Snitch 

A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places the replicas based on the information provided by the snitch. All nodes in a cluster must use the same snitch. Cassandra does its best not to have more than one replica on the same rack.

The topology of the cluster 

Determines / declares each node’s rack and data center 

Several different types of snitches 

Configured in cassandra.yaml

endpoint_snitch: SimpleSnitch

 

64. What is simple snitch ?

Places all nodes in the same data center and rack

Default snitch

 

65. Property File snitch

Reads datacenter and rack information for all nodes from a file

You must keep the file in sync on all nodes in the cluster

cassandra-topology.properties file

175.56.12.105=DC1:RAC1

 

66. Gossiping property File Snitch

Relieves the pain of the property file snitch 

Declare the current node’s DC / Rack information in a file 

 You must set each individual node’s settings

But you don’t have to copy settings as with property file snitch 

Gossip spreads the setting through the cluster

cassandra-rackdc.properties file

dc=DC1

rack=RAC1

 

67. Rack inferring Snitch 

Infers the rack and DC from the IP address 

110.100.200.105

100 – Datacenter

200- Rack 

105 - Node

 

68. Dynamic Snitch ?

Layered on top of your actual snitch 

Maintains a pulse on each node’s performance

Determines which node to query replicas from depending on node health

Turned on by default for all snitches 

 

69. How to configure snitches?

All nodes in the cluster must use the same snitch

Changing cluster network topology requires restarting all nodes 

Run sequential repair and cleanup on each node  

 

70. Snitch is used to...

Determine/declare each node's rack and data center.

 

71. Snitch is configured in the cassandra.yaml file.

True

 

72. Which of the following is *not* a type of snitch?

            SimpleSnitch

            PropertyFileSnitch

            DynamicSnitch

Ans -   CassandraSnitch

 

73. A replication factor of three means that Cassandra will store a total of four copies: the master and three copies.

False

 

74. A replication factor greater than one...

            widens the range of token values a single node is responsible for.

            causes overlap in the token ranges amongst nodes.

            requires more storage in your cluster.

 

75. Where does Cassandra reside in the CAP theorem?

availability/partition tolerance

 

76. With a replication factor of two, how many nodes must respond with success using consistency level quorum to indicate a successful operation?

2
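This follows from quorum = (replication_factor / 2) + 1, rounded down: with a replication factor of 2, quorum = (2 / 2) + 1 = 2, so both replicas must respond.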

 

77. How do you configure hinted handoff in Cassandra?

cassandra.yaml

You can disable hinted handoff

Set the amount of time a node will store a hint 

Default is three hours 

Consistency level of ANY means storing a hint suffices

Consistency level of ONE or more means at least one replica must successfully write

A hint alone does not suffice
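For reference, the related cassandra.yaml settings (names as found in recent releases; the values shown are the defaults) look like:

hinted_handoff_enabled: true

max_hint_window_in_ms: 10800000   # 3 hours, in milliseconds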

 

78. The default time for a node to store a hint is:

3 hours

 

79. Hinted handoff is disabled by default.

False

 

80. Mention when you can use ALTER KEYSPACE.

ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_writes setting of a keyspace.
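A brief example (the keyspace name and data center label are hypothetical):

ALTER KEYSPACE analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3}
  AND durable_writes = true;

After changing the replication settings of an existing keyspace, run nodetool repair so that the existing data matches the new replica count (see question 30).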