Sunday, 6 September 2020

Interview Q and A for Cassandra DB Part - 2

81. What do the cqlsh shell commands “CAPTURE” and “CONSISTENCY” do?

There are various cqlsh shell commands in Cassandra. The “CAPTURE” command captures the output of a command and appends it to a file, while the “CONSISTENCY” command displays the current consistency level or sets a new one.

 

82. What is mandatory while creating a table in Cassandra?

A primary key is mandatory when creating a table. It is made up of one or more columns of the table.

 

83. What needs to be taken care of while adding a column?

While adding a column, you need to make sure that:

·       The column name does not conflict with existing column names

·       The table is not defined with the COMPACT STORAGE option

 

84. How Can We Maintain Consistency Across Multiple Data Centers?

LOCAL_QUORUM: Only replicas in the local data center are counted when acknowledging a write; the data still gets written to the other data centers. It provides strong consistency within the local data center while keeping latency low.

The main consistency levels in Cassandra (roughly weakest to strongest) are as follows:

  • ANY (writes only)
  • LOCAL_ONE
  • ONE, TWO, THREE
  • LOCAL_QUORUM
  • QUORUM
  • EACH_QUORUM
  • ALL: lowest availability, strongest consistency

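For multiple data centers, the usual choices are LOCAL_QUORUM when strong consistency within each data center is enough, EACH_QUORUM when a write must reach a quorum in every data center, and LOCAL_ONE when low latency matters more than consistency.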
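The quorum arithmetic behind these levels can be sketched in a few lines of Python. This is only an illustration: the data center names and replication factors below are assumptions, not values from any real cluster.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Reads see the latest write when the read and write replica sets must overlap."""
    return read_replicas + write_replicas > replication_factor


# Hypothetical cluster: two data centers with three replicas each.
rf_per_dc = {"dc1": 3, "dc2": 3}

local_quorum = quorum(rf_per_dc["dc1"])          # replicas needed in the local DC
global_quorum = quorum(sum(rf_per_dc.values()))  # replicas needed across all DCs

print(local_quorum, global_quorum)  # 2 4
```

With LOCAL_QUORUM only two local replicas have to answer, whereas QUORUM across both data centers needs four, which is why LOCAL_QUORUM is faster in a multi-data-center setup.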

 

85. How many types of NoSQL databases are there?

There are four types of NoSQL databases, namely:

  1. Document Stores (MongoDB, Couchbase)
  2. Key-Value Stores (Redis, Voldemort)
  3. Column Stores (Cassandra)
  4. Graph Stores (Neo4j, Giraph)

 

86.  What do you understand by Commit log in Cassandra?

The commit log is a crash-recovery mechanism in Cassandra. Every write operation is appended to the commit log, so that writes can be replayed if a node crashes.

87. How does Cassandra provide high availability?

Cassandra is robust software. Nodes joining and leaving are handled automatically. With proper settings, Cassandra can be made failure resistant: if some of the servers fail, no data is lost as long as other replicas of that data survive. So, you can deploy Cassandra on cheap commodity hardware or in a cloud environment, where hardware or infrastructure failures may occur.

88. When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?

Cassandra is a NoSQL database and does not provide full ACID guarantees or relational features. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a good fit. You could make it work, but you would end up writing a lot of application code to handle ACID properties and would lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.

 

89. What do you understand by Node in Cassandra?

Node is the place where data is stored.

 

90. What do you understand by Data center in Cassandra?

Data center is a collection of related nodes.

 

91. What do you understand by Cluster in Cassandra?

Cluster is a component that contains one or more data centers.

 

92. What is the syntax to create keyspace in Cassandra?

Syntax for creating keyspace in Cassandra is

CREATE KEYSPACE <identifier> WITH <properties>

 

93. What does an SSTable consist of?

An SSTable consists mainly of 2 files

·       Index file (Bloom filter & key-offset pairs)

·       Data file (Actual column data)

 

94. What is a Bloom filter used for in Cassandra?

A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. In other words, it is used to determine whether an SSTable has data for a particular row. In Cassandra it is used to save I/O when performing a key lookup.

Bloom filters are quick, probabilistic structures for testing whether an element is a member of a set: they can report false positives but never false negatives. A Bloom filter is a special kind of cache. Bloom filters are accessed on every query.
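To make the idea concrete, here is a minimal Bloom filter sketch in Python. It is not Cassandra's implementation; the class name, the bit-array size, the number of hashes and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user:42")
print(sstable_filter.might_contain("user:42"))  # True
```

When `might_contain` returns False, the SSTable can be skipped without touching the disk, which is where the I/O saving comes from.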

 

95. How does Cassandra delete data?

SSTables are immutable, so a row cannot be removed from an SSTable in place. When a row needs to be deleted, Cassandra writes a special marker called a tombstone for it. When the data is read, anything covered by a tombstone is treated as deleted.
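The read-time behaviour can be sketched as follows. This is a simplified model, assuming each cell version is a (timestamp, value) pair and that the newest timestamp wins; the names `TOMBSTONE` and `resolve_cell` are made up for the example.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a delete


def resolve_cell(versions):
    """Return the live value among (timestamp, value) versions, or None if deleted."""
    if not versions:
        return None
    _, newest_value = max(versions, key=lambda version: version[0])
    return None if newest_value is TOMBSTONE else newest_value


# A value written at t=1 and deleted at t=2 reads as deleted.
print(resolve_cell([(1, "alice"), (2, TOMBSTONE)]))               # None
# A later write at t=3 makes the cell live again.
print(resolve_cell([(1, "alice"), (2, TOMBSTONE), (3, "bob")]))   # bob
```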

 

96. What does JMX stand for?

JMX stands for Java Management Extensions

 

97. Cassandra is written in which language?

Java

 

98. What happens to existing data in my cluster when I add new nodes?

When a new node joins a cluster, it will automatically contact the other nodes in the cluster and copy the right data to itself.

 

99. What are “Seed Nodes” in Cassandra?

A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes help the bootstrapping process for a new node joining a cluster. It is recommended to use 2 seed nodes per data center.

 

100.  What is Nodetool Repair ?

Syncs all data in the cluster 

Expensive  -- Grows with amount of data in cluster 

Use with clusters servicing high writes/deletes 

Last line of defense

Run to synchronize a failed node coming back online 

Run on nodes not read from very often 

 

101. Read repair always occurs when consistency level is set to...

ALL

 

102. What does read_repair_chance do?

Sets the probability with which Cassandra will perform a read repair for reads at a consistency level less than ALL.

 

103. The purpose of the commit log is...

to replay if a crashed node restarts.

 

104. What is Read repair chance ?

Performed when read is at consistency level less than ALL 

Request reads only a subset of the replicas 

We can’t be sure replicas are in sync 

Generally you are safe, but no guarantees

Response sent immediately when consistency level is met

10 % by default 

 

105. When is a write acknowledged to the client?

After the commit log and MemTable are written
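The write path and the crash-recovery role of the commit log can be sketched like this. It is a toy model only: `ToyNode` is a hypothetical class, and a Python list stands in for the durable, append-only commit log file.

```python
class ToyNode:
    """Toy model of a node's write path: commit log first, then memtable."""

    def __init__(self):
        self.commit_log = []  # stands in for the durable append-only log
        self.memtable = {}    # in-memory structure, lost on a crash

    def write(self, key, value) -> str:
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. update the memtable
        return "ack"                          # 3. only now acknowledge

    def crash(self) -> None:
        self.memtable = {}  # memory is lost, the commit log survives

    def replay(self) -> None:
        # On restart, re-apply every logged write in order.
        for key, value in self.commit_log:
            self.memtable[key] = value


node = ToyNode()
node.write("user:1", "alice")
node.crash()
node.replay()
print(node.memtable)  # {'user:1': 'alice'}
```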

 

106. Which are stored sorted by clustering columns? 

SSTable , MemTable

 

107. The partition summary...

stores byte offsets into the partition index.

 

108. The key cache...

stores the byte offset of the most recently accessed records.

 

109. Which of the structures reside on disk? 

            SSTable

            partition index

 

110. Which are benefits from compaction? 

More optimal disk usage

Faster reads

Less memory pressure

 

111. All tombstones are discarded during compaction.

False

 

112. In which scenarios would a new partition on disk be larger than either of its input partition segments after a compaction?

 The input partition segments are made up of mostly INSERT operations.
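A rough sketch of why this happens, assuming each segment is a mapping from row key to a (timestamp, value) pair and that compaction keeps the newest version of each row; the function name `compact` and the sample data are illustrative only.

```python
def compact(*segments):
    """Merge partition segments, keeping the newest version of each row."""
    merged = {}
    for segment in segments:
        for row_key, (timestamp, value) in segment.items():
            if row_key not in merged or timestamp > merged[row_key][0]:
                merged[row_key] = (timestamp, value)
    return merged


# Mostly INSERTs of distinct rows: the merged partition is larger than either input.
inserts_a = {"r1": (1, "x"), "r2": (2, "y")}
inserts_b = {"r3": (3, "z"), "r4": (4, "w")}
print(len(compact(inserts_a, inserts_b)))  # 4

# Mostly UPDATEs of the same rows: the merged partition does not grow.
updates = {"r1": (5, "x2"), "r2": (6, "y2")}
print(len(compact(inserts_a, updates)))    # 2
```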

 

113. Adding Nodes 

You might want to consider adding a new node if you have:

-        Reached data capacity

-- Your data has outgrown the node’s hardware capacity

-       Reached traffic capacity

-- Your application needs faster responses with lower latency

-       A need for more operational headroom

-- You need more resources for node repair, compaction, and other resource-intensive operations

 

114. Adding Nodes Best Practices 

Single-token nodes -- Double the size of the cluster

Vnodes -- For vnode clusters, we can grow the cluster incrementally as more nodes are needed

-Wait a period of time before adding each additional node (single-token and vnodes)

-Follow the ‘2 Minute rule’

-This ensures the range announcement is known to all nodes before the next one begins entering the cluster.

 

115. What are the main parameters for node setup?

There are four main parameters of a node for bootstrapping

These are configured in the cassandra.yaml file

cluster_name, rpc_address, listen_address, seeds

 

116. What is the bootstrapping process?

Simple process but pretty critical 

Can be a long running process

Node announces itself to ring using seed node

Calculate ranges of new node, notify ring of these pending ranges 

Calculate the nodes that currently own these ranges and will no longer own them once the bootstrap completes 

Stream the data from these nodes to the bootstrapping node ( monitor with nodetool netstats)

Join the new node to the ring so it can serve traffic 

Length of time it takes to join will depend on the amount of data to be streamed 

 

117. What if bootstrap fails ?

Two scenarios 

-       Bootstrapping node could not even connect to cluster 

Fairly easy to deal with 

Something fundamental like could not  find cluster

First examine the log file to understand what is going on (error conditions of this kind should be flagged early in the bootstrap process), then change the configuration and try again

-       Streaming portion fails 

 Node exists in cluster in joining state 

 

Nodetool rebuild to rebootstrap data

 

118. Nodetool Cleanup 

Perform cleanup after a bootstrap on the OTHER nodes 

You don’t have to do this

Reads all SSTables to make sure there is no token out of range for that particular node

Data that is in range is simply copied; data that is out of range is dropped

If you don’t run cleanup, the out-of-range data will get picked up through compaction over time.

Cleanup is basically a compaction

 

119. How do we run a cleanup operation 

The nodetool cleanup command cleans up all data in a keyspace and tables that are specified 

bin/nodetool [options] cleanup -- <keyspace> (<table>)

 

Use flags to specify 

-h host/IP address

-p port

-pw password 

-u username

The nodetool cleanup command cleans all keyspaces if no keyspace is specified

 

120. Why would I remove a node ?

Two very different scenarios : 

 

You are going to reduce capacity, need to decommission ( some sort of operational requirement) 

The node is offline and will never come back online 

 

121. Removing a live node from the cluster 

Perhaps you want to decrease the size of your cluster 

Perhaps you might want to swap out an older machine with a newer machine 

Decommissioning a node assigns the ranges of the old node to other nodes and replicates the appropriate data to those nodes

Decommissioned node’s data will be streamed from the decommissioning node itself

Once data has been moved to other nodes, the process for removing or replacing is similar for both 

 

122. When a node is decommissioned 

Node is marked as ‘LEAVING’ and will stream data to other live nodes.

The data directories will still exist – remove these if the node will go back into production 

The Cassandra JVM is still running – but with Gossip , Thrift and Native Transport ports all down. 

This allows admin to hook up a JMX client to analyze the metrics maintained in the JVM 

Then the JVM process can be shutdown manually 

 

123. Decommission a node using nodetool

/bin/nodetool [option] decommission

 

Removes the node that nodetool is connected to (selected with -h) from the cluster

-h host/IP address

-p port 

-pw password 

-u username 

Monitor progress with nodetool netstats

 

124. Can we remove a node ?

Before doing anything, check nodetool status to see the state of the node in question 

nodetool status   -- shows whether each node is up or down

 

If the node is down ( and not coming back online), choose the appropriate option:

-Remove the node using the nodetool removenode command 

Adjust your tokens to avoid creating a hot spot if using single-token nodes.

-If removenode fails, run nodetool assassinate

-nodetool repair should be run once the node is removed from the cluster 

 

/bin/nodetool [options] removenode [host id]

 

-h host/IP address

-p port 

-pw password 

-u username

Additional arguments: status, force

 

125. The pros  of replacing a downed node 

You don’t have to move the data twice 

A backup of the old node will work for the replacement node, because the same tokens are used to bring the replacement node into the cluster

 

126. replacing a downed node using nodetool

First find the ip address of the down node using nodetool status 

On the new node, open the cassandra-env.sh file

Set the IP address of the dead node as the replace_address value in the JVM options. This will enable bootstrapping of the new node.

Use nodetool removenode to remove the dead node

Use the force option if necessary (nodetool assassinate)

You can monitor the process using nodetool netstats

 

127. what if the node was also a seed node ?

Consideration 

The new node needs to be added to the list of seeds in cassandra.yaml

Cassandra will not allow a seed node to auto-bootstrap

Thus you will have to run repair on the new seed node to populate it with data.

Steps 

 

Add a new node, making the necessary changes to the cassandra.yaml file

Specify the new seed node in the cassandra.yaml file

Start Cassandra on new seed node 

Run nodetool repair on the new seed node to manually bootstrap

Remove the old seed node using nodetool removenode with the Host  ID of the downed node 

Run nodetool cleanup on previously existing nodes 

 

128. By default, how many vnodes does each node have?

256 (the default up to Cassandra 3.x; Cassandra 4.0 lowers it to 16)

 

129. Which parameter in the cassandra.yaml file configures vnodes?

num_tokens

 

130. When using vnodes, Cassandra automatically assigns token ranges for you.

True

 

131. Nodes can only gossip with specific other nodes in the cluster.

False

 

132. Which of the statements are true concerning gossip? 

Constant trickle of network traffic

Does not cause network spikes

Minimal compared to data streaming

            

133. In a full network partition, that is, parts of the cluster are completely disconnected from the whole, only the largest group of nodes can still satisfy queries.

False

 

134. What are the three main layers (in order) of data modeling?

Conceptual, Logical, Physical

 

135.  Data modeling 

Analyze requirements 

Identify entities and relationships

Identify queries

Specify the schema 

Optimize 

 

Conceptual data model / Application workflow -> Mapping conceptual to logical -> Logical data model -> Physical optimization -> Physical data model

 

Think outside of the box

Non standard solution – requires creativity 

Different data models have different costs 

 

136. Keyspaces   

Top level namespace/container

Similar to a relational database schema

Replication parameters required 

Keyspaces contain tables 

Tables contain data

Uniquely identify rows

 

137. How to switch between keyspaces

By USE command 

USE keyspacename

 

138. What are UUID and TIMEUUID?

UUID -  Universally Unique identifier 

Generate via uuid()

 

TIMEUUID embeds a Timestamp value 

Sortable 

Generate via now()
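The difference between the two types can be seen outside Cassandra as well. The sketch below uses Python's standard `uuid` module, where `uuid4()` produces a random UUID (comparable to CQL `uuid()`) and `uuid1()` produces a time-based one (comparable to CQL `now()`). The helper `timeuuid_to_unix_seconds` is a name made up for this example.

```python
import uuid

# Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# measured in 100-nanosecond intervals.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_unix_seconds(u: uuid.UUID) -> float:
    """Extract the embedded timestamp from a version 1 (time-based) UUID."""
    if u.version != 1:
        raise ValueError("only version 1 UUIDs embed a timestamp")
    return (u.time - UUID_EPOCH_OFFSET) / 1e7


random_id = uuid.uuid4()  # random, no timestamp inside
time_id = uuid.uuid1()    # time-based, embeds the creation time

print(random_id.version, time_id.version)  # 4 1
print(timeuuid_to_unix_seconds(time_id))   # seconds since 1970-01-01
```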

 

 

 

139. Copy command 

Imports/ exports CSV 

Header parameter skips the first line in the file 

 

COPY table1 (c1, c2, c3) FROM 't1.csv' WITH HEADER = true;

 

140. What command bulk-loads data files?

COPY

 

141. Why do we use UUIDs in Cassandra to uniquely identify records?

To avoid conflicts in auto generating IDs between nodes

 

142. Cassandra requires you to specify the width of textual types, for example VARCHAR(50).

False

 

143. Partition Storage 

Cassandra distributes partitions across nodes 

A WHERE clause on any field other than the partition key would require searching all partitions on all nodes

Cassandra does not like this

We can use WHERE on a partition key value

Cassandra uses a hashing algorithm to quickly determine which nodes contain the desired partition 

 

144. What is the smallest atomic unit of storage in Cassandra?

partition

 

145. What is a cell?

key-value pair

 

146. What is a partition?

group of cells

 

147. What is the significance of the partition key?

Cassandra hashes the key value to determine which node the partition resides on
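A minimal sketch of this lookup, assuming a three-node ring with made-up tokens and node names. MD5 is used here only as a stand-in hash; Cassandra's default partitioner is Murmur3.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Map a partition key to a 64-bit token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key: str, ring) -> str:
    """ring is a sorted list of (token, node); a node owns tokens up to its own."""
    tokens = [node_token for node_token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    index = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return ring[index][1]


# Hypothetical ring: three nodes spread over the 64-bit token space.
ring = [(2**62, "node-a"), (2**63, "node-b"), (3 * 2**62, "node-c")]

print(owner("alice", ring))  # always the same node for the same key
```

Because the token depends only on the partition key, any coordinator can compute the owning node without asking the rest of the cluster.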

 

148. Clustering columns

Come after partition key within PRIMARY KEY clause

Clustering columns order CQL rows within a partition.

Clustering column values stored sorted 

Default is ascending 

 

149. Querying clustering columns 

You must first provide a partition key 

Clustering columns can follow thereafter 

You can perform either equality (=) or range queries (<, >) on clustering columns 

All equality comparisons must come before inequality comparisons

Since data is sorted on disk, range searches are a binary search followed by a linear read
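The "binary search followed by a linear read" can be sketched with Python's `bisect` module. The list below stands in for the sorted clustering column values of one partition; the function name and the data are assumptions for illustration.

```python
import bisect


def range_query(sorted_values, low, high):
    """Return all values v with low <= v <= high from a sorted list."""
    # Binary search for the first value >= low.
    start = bisect.bisect_left(sorted_values, low)
    result = []
    # Linear read until the upper bound is passed.
    for value in sorted_values[start:]:
        if value > high:
            break
        result.append(value)
    return result


clustering_values = [1, 3, 5, 7, 9]
print(range_query(clustering_values, 3, 7))  # [3, 5, 7]
```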

 

150. Change default Ordering  of clustering columns

Clustering columns default to ascending order

Change ordering direction via WITH CLUSTERING ORDER BY

Must include all clustering columns up to and including the columns you wish to order descending

 

151. Allow filtering 

ALLOW FILTERING relaxes the constraint that a query must specify the partition key

You can then query on just clustering columns 

Causes Cassandra to scan all partitions in the table 

Don’t use it    --  Unless you really have to   -- Best on small data sets 

 

152. What is an upsert?

A write that either inserts a new row or overwrites an existing one with the same primary key: INSERTs may cause UPDATEs; UPDATEs may cause INSERTs
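The behaviour can be modelled with a plain dictionary keyed by primary key. This is only a sketch; the `upsert` function and the sample columns are made up for the example.

```python
def upsert(table: dict, primary_key, **columns) -> None:
    """Create the row if it is missing, otherwise overwrite the given columns."""
    table.setdefault(primary_key, {}).update(columns)


# In this model INSERT and UPDATE are the same operation.
insert = update = upsert

users = {}
update(users, 1, name="alice")              # UPDATE on a missing row creates it
insert(users, 1, name="bob", city="Paris")  # INSERT on an existing row overwrites it
print(users)  # {1: {'name': 'bob', 'city': 'Paris'}}
```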

 

153. What purpose do clustering columns serve?

Provide uniqueness within the partition as well as ordering criteria

 

154. What is the relationship between a partition key and a clustering column?

Partition keys determine a grouping criteria whereas clustering columns determine ordering criteria

 

155. What is NODETOOL

Node management 

Located in the bin/ folder 

/bin/nodetool help

 

Help --  Lists all possible sub commands 

Info – Current node settings and stats 

Status – Reports basic node health information 

 

156. Alter Table statement 

Adding column 

Dropping column 

Cannot alter primary key 

 

157. Collection column 

Collection columns are multi valued columns 

Designed to store a small amount of data 

Retrieved in its entirety

Cannot nest a collection inside another collection (unless the inner collection is frozen)

 

158. What are UDTs (user-defined types)?

UDTs group related fields of information

Allow embedding more complex data within a single column

CREATE TYPE address (street text, city text);

Use a UDT as a column type by adding the frozen keyword, for example frozen<address>.

 

159. What command drops all records from an existing table?

TRUNCATE

 

160. What command adds/removes columns to/from a table?

ALTER

 

161. Which is a Cassandra column type?

            LIST<>

            SET<>

            MAP<>

            

162. Cassandra counters are always 100% accurate.

FALSE

 

163. What command executes a file of CQL statements?

SOURCE

 

164. Conceptual data modeling 

Abstract view of the domain 

Technology independent 

Not specific to any database system 

 

165. Which is an advantage of conceptual data modeling?

            Collaboration between both technical and non-technical team members

            Provides abstraction from the problem details

            Better understanding of the domain

            

166. Which is a type found in a conceptual data model?

            Entity types

            Relationship types

            Attribute types

 

167. Attribute types can be...

            key

            composite

            multi-valued

 

168. How do you determine the key of a 1-1 relationship?

Key attributes of either participating entity types

 

169. How do you determine the key of a 1-n relationship?

Key attributes of entity type on the many side

 

170. How do you determine the key of a m-n relationship?

Key attributes of both participating entity types

 

171. What does disjoint mean?

            An entity can participate in only one subtype role

 

172. What is an application workflow?

            Tasks formed by causal dependencies

 

173. How do we indicate a partition key in a Chebotko diagram?

K

 

174. How do we indicate a clustering column in a Chebotko diagram?

C with up/down arrow

 

175. What is a table's main purpose in a Cassandra database?

Serve a query

 

176. Data  Modeling Principles

1 --Know  your data

Data captured by conceptual data model 

 Define what is stored in database

Preserve properties so that data is organized correctly 

2 --- Know your queries 

Queries captured by application workflow model 

Table schema design changes if queries change

3 ---Nest data 

Nesting organizes multiple entities into a single partition 

Support partition per query data access

 

Three data nesting mechanisms

Clustering column – multi row partitions 

Collection columns

User defined type columns 

4 --- Duplicate data

Better to duplicate than to join data 

Partition per query and data nesting may result in data duplication 

     Query results are pre computed and materialized

     Data can be duplicated across tables, partitions,  or rows 

 

177. What are the two preferable table query strategies?

            Partition per query and partition+ per query

 

178. Why do we nest data in Cassandra?

Support a partition per query access pattern

 

179. Mapping Rules For the query driven methodology 

Mapping rules ensure that a logical data model is correct 

Each query has a corresponding table 

Tables are designed to allow queries to execute properly 

Tables return data in the correct order

MR1 --  Entities and Relationships 

 

Entity and relationship types map to tables 

Entity and relationship map to partitions or rows 

Partition may have data about one or more entities and  relationships

Attributes are represented by columns 

 

180. Choose the option that lists the mapping rules in proper order

Entities and relationships, equality search attributes, inequality search attributes, ordering attributes, key attributes
