Manoj Simar - The Technology Expert: Interview Q and A for Cassandra DB Part

81. What do the cqlsh shell commands “CAPTURE” and “CONSISTENCY” do?

Cassandra’s cqlsh provides various shell commands. CAPTURE captures the output of a command and appends it to a file, while CONSISTENCY displays the current consistency level or sets a new one.

82. What is mandatory while creating a table in Cassandra?

A primary key is mandatory when creating a table. It is made up of one or more columns of the table.

83. What needs to be taken care of when adding a column?

When adding a column, make sure that:

· The column name does not conflict with an existing column name

· The table is not defined with the compact storage option

84. How Can We Maintain Consistency Across Multiple Data Centers?

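LOCAL_QUORUM: Only local replicas are considered when acknowledging writes; data still gets written to the other data center. It provides strong consistency within the local data center along with speed.

The main consistency levels in Cassandra are listed below. They run roughly from weakest to strongest, although the LOCAL_ variants restrict a level to the local data center rather than sitting strictly above the levels listed before them:

ANY
ONE, TWO, THREE
QUORUM
LOCAL_ONE
LOCAL_QUORUM
EACH_QUORUM
ALL: gives up availability in exchange for the strongest consistency

For multiple data centers, the consistency levels most often chosen are ONE, QUORUM, and LOCAL_ONE, alongside LOCAL_QUORUM as described above.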
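To make the difference between these levels concrete, here is a minimal Python sketch of how many replica acknowledgements each level waits for. It assumes the usual quorum formula (floor(RF / 2) + 1); the function names, the data center names, and the replication factors are hypothetical and chosen only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, rf_by_dc: dict[str, int], local_dc: str) -> int:
    """Number of replica acknowledgements the coordinator waits for."""
    total_rf = sum(rf_by_dc.values())
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "QUORUM":
        return quorum(total_rf)            # quorum over all data centers
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])  # quorum in the local data center only
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError(f"unsupported consistency level: {level}")


# Hypothetical cluster: two data centers with three replicas each.
rf = {"dc1": 3, "dc2": 3}
for level in ("LOCAL_ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_acks(level, rf, local_dc="dc1"))
```

With these assumed replication factors, LOCAL_QUORUM waits for 2 acknowledgements from the local data center, while QUORUM and EACH_QUORUM both wait for 4 spread over both data centers, which is why LOCAL_QUORUM avoids cross-data-center latency.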

85. How many types of NoSQL databases are there?

There are four types of NoSQL databases, namely:

Document Stores (MongoDB, Couchbase)
Key-Value Stores (Redis, Voldemort)
Column Stores (Cassandra)
Graph Stores (Neo4j, Giraph)

86. What do you understand by Commit log in Cassandra?

The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.

87. How does Cassandra provide high availability?

Cassandra is robust software. Nodes joining and leaving are handled automatically. With proper settings, Cassandra can be made failure resistant: if some of the servers fail, no data is lost as long as other replicas of that data survive. So you can deploy Cassandra on cheap commodity hardware or in a cloud environment, where hardware or infrastructure failures may occur.

88. When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?

Cassandra is a NoSQL database and does not provide ACID transactions or relational features such as joins. If you have a strong requirement for ACID properties (for example, for financial data), Cassandra is not a good fit. You can make it work, but you will end up writing a lot of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.

89. What do you understand by Node in Cassandra?

A node is the place where data is stored.

90. What do you understand by Data center in Cassandra?

A data center is a collection of related nodes.

91. What do you understand by Cluster in Cassandra?

A cluster is a component that contains one or more data centers.

92. What is the syntax to create keyspace in Cassandra?

The syntax for creating a keyspace in Cassandra is:

CREATE KEYSPACE <identifier> WITH <properties>

93. What does an SSTable consist of?

An SSTable consists mainly of two files:

· Index file (Bloom filter and key-offset pairs)

· Data file (actual column data)

94. What is a Bloom filter used for in Cassandra?

A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. In other words, it is used to determine whether an SSTable may have data for a particular row. In Cassandra it is used to save I/O when performing a key lookup.

Bloom filters are quick, probabilistic structures for testing whether an element is a member of a set: they can return false positives but never false negatives. They act as a special kind of cache and are consulted on every read.
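The membership test described above can be sketched in a few lines of Python. This is a toy illustration, not Cassandra’s implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # If any position is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user:1")
sstable_filter.add("user:2")
print(sstable_filter.might_contain("user:1"))  # True: the key was added
```

A read would only go to the SSTable’s index and data files when might_contain returns True; a False answer lets it skip that SSTable without any disk I/O.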

95. How does Cassandra delete data?

SSTables are immutable, so a row cannot be removed from them in place. When a row needs to be deleted, Cassandra writes a special marker called a tombstone for it. When the data is read, anything covered by a tombstone is treated as deleted.

96. What does JMX stand for?

JMX stands for Java Management Extensions

97. Cassandra is written in which language?

Java

98. What happens to existing data in my cluster when I add new nodes?

When a new node joins a cluster, it automatically contacts the other nodes in the cluster and copies the right data to itself.

99. What are “Seed Nodes” in Cassandra?

A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes help with the bootstrapping process for a new node joining a cluster. It is recommended to use two seed nodes per data center.

100. What is nodetool repair?

Syncs all data in the cluster

Expensive -- Grows with amount of data in cluster

Use with clusters servicing high writes/deletes

Last line of defense

Run to synchronize a failed node coming back online

Run on nodes not read from very often

101. Read repair always occurs when consistency level is set to...

ALL

102. What does read_repair_chance do?

Sets the probability with which Cassandra performs a read repair when a read uses a consistency level lower than ALL.

103. The purpose of the commit log is...

to replay if a crashed node restarts.

104. What is read repair chance?

Performed when read is at consistency level less than ALL

Request reads only a subset of the replicas

We can’t be sure replicas are in sync

Generally you are safe, but no guarantees

Response sent immediately when consistency level is met

10 % by default

105. When is a write acknowledged to the client?

After the commit log and MemTable are written

106. Which are stored sorted by clustering columns?

SSTable, MemTable

107. The partition summary...

stores byte offsets into the partition index.

108. The key cache...

stores the byte offset of the most recently accessed records.

109. Which of the structures reside on disk?

SSTable

partition index

110. Which are benefits from compaction?

More optimal disk usage

Faster reads

Less memory pressure

111. All tombstones are discarded during compaction.

False

112. In which scenarios would a new partition on disk be larger than either of its input partition segments after a compaction?

The input partition segments are made up of mostly INSERT operations.
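The last two answers can be illustrated with a toy model of compaction in Python. This is a simplified sketch under stated assumptions, not Cassandra’s algorithm: each segment is modelled as a dict of row key to (write timestamp, value), the newest write wins, and a tombstone is dropped only once it is older than a grace period. The names TOMBSTONE, compact, and gc_grace_seconds (as a function parameter) are hypothetical here.

```python
TOMBSTONE = object()  # marker standing in for a deleted value


def compact(segments, now, gc_grace_seconds):
    """Merge partition segments, keeping the newest write for each row key."""
    merged = {}
    for segment in segments:
        for row_key, (timestamp, value) in segment.items():
            if row_key not in merged or timestamp > merged[row_key][0]:
                merged[row_key] = (timestamp, value)
    # Drop a tombstone only once it is older than the grace period.
    return {
        row_key: (timestamp, value)
        for row_key, (timestamp, value) in merged.items()
        if not (value is TOMBSTONE and now - timestamp > gc_grace_seconds)
    }


older = {"a": (100, "v1"), "b": (110, "v2")}
newer = {"c": (200, "v3"), "a": (900, TOMBSTONE), "b": (150, TOMBSTONE)}
result = compact([older, newer], now=1000, gc_grace_seconds=500)
print(sorted(result))  # row keys that survive compaction

# Segments made up only of inserts for different rows: nothing can be merged
# away, so the output is larger than either input.
inserts_only = compact([{"x": (1, "p")}, {"y": (2, "q"), "z": (3, "r")}],
                       now=1000, gc_grace_seconds=500)
print(len(inserts_only))
```

In this model the old tombstone for "b" is discarded together with the value it shadows, while the recent tombstone for "a" is kept, which is why the statement that all tombstones are discarded during compaction is false.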

113. Adding Nodes

You might want to consider adding a new node if you have:

- Reached data capacity

-- Your data has outgrown the node’s hardware capacity

- Reached traffic capacity

-- Your application needs more rapid response with less latency

- A need for more operational headroom

-- You need more resources for node repair, compaction, and other resource-intensive operations

114. Adding Nodes Best Practices

Single-token nodes -- Double the size of the cluster

Vnodes -- For vnode clusters, we can grow the cluster incrementally as more nodes are needed

- Wait a period of time before adding each additional node (single-token and vnodes)

- Follow the ‘2 minute rule’

- This ensures the range announcement is known to all nodes before the next node begins entering the cluster.

115. What are the main parameters for node setup?

There are four main parameters for bootstrapping a node

These are configured in the cassandra.yaml file

cluster_name, rpc_address, listen_address, seeds

116. What is the bootstrapping process?

Simple process but pretty critical

Can be a long running process

Node announces itself to ring using seed node

Calculate ranges of new node, notify ring of these pending ranges

Calculate the nodes that currently own these ranges and will no longer own them once the bootstrap completes

Stream the data from these nodes to the bootstrapping node ( monitor with nodetool netstats)

Join the new node to the ring so it can serve traffic

Length of time it takes to join will depend on the amount of data to be streamed

117. What if bootstrap fails ?

Two scenarios

- Bootstrapping node could not even connect to cluster

Fairly easy to deal with

Something fundamental like could not find cluster

First examine the log file to understand what is going on (these kinds of error conditions are flagged early in the bootstrap process), then change the config and try again

- Streaming portion fails

Node exists in cluster in joining state

Run nodetool rebuild to re-bootstrap the data

118. Nodetool Cleanup

Perform cleanup after a bootstrap on the OTHER nodes

You don’t have to do this

Reads all SSTables to make sure there is no token out of range for that particular node

Data that is within range is simply copied; data that is out of range is dropped

If you don’t run cleanup, the out-of-range data will get picked up through compaction over time.

Cleanup is basically a compaction

119. How do we run a cleanup operation

The nodetool cleanup command cleans up the data in the keyspaces and tables that are specified

bin/nodetool [options] cleanup -- <keyspace> (<table>)

Use flags to specify

-h host/IP address

-p port

-pw password

-u username

If no keyspace is specified, the nodetool cleanup command cleans all keyspaces

120. Why would I remove a node?

Two very different scenarios:

You are going to reduce capacity and need to decommission a node (some sort of operational requirement)

The node is offline and will never come back online

121. Removing a live node from the cluster

Perhaps you want to decrease the size of your cluster

Perhaps you might want to swap out an older machine with a newer machine

Decommissioning a node assigns the ranges of the old node to other nodes and replicates the appropriate data to those nodes

Decommissioned node’s data will be streamed from the decommissioning node itself

Once data has been moved to other nodes, the process for removing or replacing is similar for both

122. When a node is decommissioned

Node is marked as ‘LEAVING’ and will stream data to other live nodes.

The data directories will still exist – remove these if the node will go back into production

The Cassandra JVM is still running, but with the Gossip, Thrift, and Native Transport ports all down.

This allows an admin to hook up a JMX client to analyze the metrics maintained in the JVM

Then the JVM process can be shut down manually

123. Decommission a node using nodetool

/bin/nodetool [options] decommission

Decommissions the node that nodetool is connected to (selected with -h), so no host ID argument is needed

-h host/IP address

-p port

-pw password

-u username

Monitor progress with nodetool netstats

124. Can we remove a node ?

Before doing anything, check nodetool status to see the state of the node in question

nodetool status -- shows whether each node is up or down

If the node is down ( and not coming back online), choose the appropriate option:

-Remove the node using the nodetool removenode command

Adjust your tokens to avoid creating a hot spot if using single-token nodes.

-If removenode fails, run nodetool assassinate

-nodetool repair should be run once the node is removed from the cluster

/bin/nodetool [options] removenode [host id]

-h host/IP address

-p port

-pw password

-u username

Additional arguments -- status, force

125. The pros of replacing a downed node

You don’t have to move the data twice

A backup of the node will work for the replacement node, because the same tokens are used to bring the replacement node into the cluster

126. Replacing a downed node using nodetool

First find the IP address of the down node using nodetool status

On the replacement node, open the cassandra-env.sh file

Set the IP address of the dead node as the replace_address value in the JVM options. This will enable bootstrapping of the new node.

Use nodetool removenode to remove the dead node

Use the force option if necessary (nodetool assassinate)

You can monitor the process using nodetool netstats

127. What if the node was also a seed node?

Considerations

The new node needs to be added to the list of seeds in cassandra.yaml

Cassandra will not allow a seed node to auto-bootstrap

Thus you will have to run repair on the new seed node to populate its data.

Steps

Add a new node, making the necessary changes to the cassandra.yaml file

Specify the new seed node in the cassandra.yaml file

Start Cassandra on new seed node

Run nodetool repair on the new seed node to manually bootstrap

Remove the old seed node using nodetool removenode with the Host ID of the downed node

Run nodetool cleanup on previously existing nodes

128. By default, how many vnodes does each node have?

256 (the default through Cassandra 3.x; Cassandra 4.0 lowered the default to 16)

129. Which parameter in the cassandra.yaml file configures vnodes?

num_tokens

130. When using vnodes, Cassandra automatically assigns token ranges for you.

True

131. Nodes can only gossip with specific other nodes in the cluster.

False

132. Which of the statements are true concerning gossip?

Constant trickle of network traffic

Does not cause network spikes

Minimal compared to data streaming

133. In a full network partition, that is, parts of the cluster are completely disconnected from the whole, only the largest group of nodes can still satisfy queries.

False

134. What are the three main layers (in order) of data modeling?

Conceptual, Logical, Physical

135. Data modeling

Analyze requirements

Identify entities and relationships

Identify queries

Specify the schema

Optimize

Conceptual data model / application workflow -> mapping conceptual to logical -> logical data model -> physical optimization -> physical data model

Think outside of the box

Non standard solution – requires creativity

Different data models have different costs

136. Keyspaces

Top level namespace/container

Similar to a relational database schema

Replication parameters required

Keyspaces contain tables

Tables contain data

Uniquely identify rows

137. How to switch between keyspaces

By USE command

USE keyspacename

138. What is UUID & TIMEUUID

UUID - Universally Unique identifier

Generate via uuid()

TIMEUUID embeds a Timestamp value

Sortable

Generate via now()
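The difference between the two types can be seen with Python’s standard uuid module, used here only as an analogy for the CQL functions: uuid4() produces a random (version 4) UUID like uuid(), and uuid1() produces a time-based (version 1) UUID like now(). The variable names are made up for this sketch.

```python
import uuid
from datetime import datetime, timedelta, timezone

random_id = uuid.uuid4()  # version 4: random, comparable to CQL uuid()
time_id = uuid.uuid1()    # version 1: time-based, comparable to CQL now()

# A version 1 UUID counts 100-nanosecond intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)
embedded = GREGORIAN_EPOCH + timedelta(microseconds=time_id.time // 10)

print(random_id.version, time_id.version)
print("timestamp embedded in the time-based UUID:", embedded.isoformat())
```

Because the timestamp can be recovered from a time-based UUID, Cassandra can sort TIMEUUID values chronologically, which a random UUID does not allow.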

139. Copy command

Imports/exports CSV

With HEADER = true, the first line in the file is treated as column names and skipped on import

COPY table1 (c1, c2, c3) FROM 't1.csv' WITH HEADER = true;

140. What command bulk-loads data files?

COPY

141. Why do we use UUIDs in Cassandra to uniquely identify records?

To avoid conflicts in auto generating IDs between nodes

142. Cassandra requires you to specify the width of textual types, for example VARCHAR(50).

False

143. Partition Storage

Cassandra distributes partitions across nodes

Using WHERE on any field other than the partition key would require searching all partitions on all nodes

Cassandra does not like that

We can use WHERE on a partition key value

Cassandra uses a hashing algorithm to quickly determine which nodes contain the desired partition
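The lookup described above can be sketched as a small token ring in Python. This is an illustration under loud assumptions: SHA-256 stands in for the hash (Cassandra’s default partitioner uses Murmur3), each node owns a single token, replication is ignored, and the names token_for, Ring, and the node names are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (SHA-256 as a stand-in hash)."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_tokens: dict[str, int]):
        # Sort by token so each node owns the range that ends at its token.
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._entries[index % len(self._entries)][1]


ring = Ring({"node1": 2**62, "node2": 2**63, "node3": 2**64 - 1})
print(ring.owner("alice"), ring.owner("bob"))
```

The point of the sketch is that finding the owning node needs only the partition key and the sorted token list; no node has to be asked which partitions it holds.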

144. What is the smallest atomic unit of storage in Cassandra?

partition

145. What is a cell?

key-value pair

146. What is a partition?

group of cells

147. What is the significance of the partition key?

Cassandra hashes the key value to determine which node the partition resides on

148. Clustering columns

Come after partition key within PRIMARY KEY clause

Clustering columns order CQL rows within a partition.

Clustering column values stored sorted

Default is ascending

149. Querying clustering columns

You must first provide a partition key

Clustering columns can follow thereafter

You can perform either equality (=) or range queries (<, >) on clustering columns

All equality comparisons must come before inequality comparisons

Since data is sorted on disk, range searches are a binary search followed by a linear read
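The last point, a binary search followed by a linear read, can be sketched in Python with the standard bisect module. The partition is modelled as a list of (clustering value, data) tuples already sorted by clustering value; range_read and the sample rows are hypothetical names and data for illustration.

```python
import bisect


def range_read(sorted_rows, low, high):
    """Return rows whose clustering value v satisfies low <= v < high."""
    # Binary search: (low,) sorts before any (low, data) tuple, so this
    # finds the first row whose clustering value is >= low.
    start = bisect.bisect_left(sorted_rows, (low,))
    result = []
    # Linear read: walk forward until the clustering value leaves the range.
    for clustering, data in sorted_rows[start:]:
        if clustering >= high:
            break
        result.append((clustering, data))
    return result


partition = [(1, "a"), (3, "b"), (5, "c"), (7, "d"), (9, "e")]
print(range_read(partition, 3, 8))  # [(3, 'b'), (5, 'c'), (7, 'd')]
```

This only works because the rows are kept sorted by clustering column, which is also why a range condition has to come after the equality conditions.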

150. Change default Ordering of clustering columns

Clustering columns default to ascending order

Change ordering direction via WITH CLUSTERING ORDER BY

Must include all columns including and up to columns you wish to order descending

151. Allow filtering

ALLOW FILTERING relaxes the constraint that a query must specify the partition key

You can then query on just clustering columns

Causes Cassandra to scan all partitions in the table

Don’t use it -- Unless you really have to -- Best on small data sets

152. What is an upsert?

INSERTs may cause UPDATEs; UPDATEs may cause INSERTs
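A toy Python model shows why the two statements behave alike. It assumes a table is just a mapping from primary key to column values and that both statements go through the same write path; the names table, write, insert, and update are made up for this sketch and it ignores details such as timestamps.

```python
table = {}  # primary key -> row (dict of column values)


def write(primary_key, **columns):
    """Shared write path: create the row if missing, then set the columns."""
    table.setdefault(primary_key, {}).update(columns)


insert = update = write  # in this model both statements are upserts

update("user1", email="a@example.com")  # no such row yet: acts like an INSERT
insert("user1", name="Alice")           # row already exists: acts like an UPDATE
print(table)
```

Neither call checks whether the row exists first, which mirrors the fact that Cassandra writes do not read before writing.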

153. What purpose do clustering columns serve?

Provide uniqueness within the partition as well as ordering criteria

154. What is the relationship between a partition key and a clustering column?

Partition keys determine the grouping criteria, whereas clustering columns determine the ordering criteria

155. What is NODETOOL

Node management

Located in the bin/ folder

/bin/nodetool help

Help -- Lists all possible sub commands

Info – Current node settings and stats

Status – Reports basic node health information

156. Alter Table statement

Adding column

Dropping column

Cannot alter primary key

157. Collection column

Collection columns are multi valued columns

Designed to store a small amount of data

Retrieved in its entirety

Cannot nest a collection inside another collection

158. What are UDTs (user-defined types)?

UDTs group related fields of information

They allow embedding more complex data within a single column

CREATE TYPE address (street text, city text);

Use a UDT as a column type by adding the frozen keyword.

159. What command drops all records from an existing table?

TRUNCATE

160. What command adds/removes columns to/from a table?

ALTER

161. Which is a Cassandra column type?

LIST<>

SET<>

MAP<>

162. Cassandra counters are always 100% accurate.

FALSE

163. What command executes a file of CQL statements?

SOURCE

164. Conceptual data modeling

Abstract view of the domain

Technology independent

Not specific to any database system

165. Which is an advantage of conceptual data modeling?

Collaboration between both technical and non-technical team members

Provides abstraction from the problem details

Better understanding of the domain

166. Which is a type found in a conceptual data model?

Entity types

Relationship types

Attribute types

167. Attribute types can be...

key

composite

multi-valued

168. How do you determine the key of a 1-1 relationship?

Key attributes of either participating entity types

169. How do you determine the key of a 1-n relationship?

Key attributes of entity type on the many side

170. How do you determine the key of a m-n relationship?

Key attributes of both participating entity types

171. What does disjoint mean?

An entity can participate in only one subtype role

172. What is an application workflow?

Tasks formed by causal dependencies

173. How do we indicate a partition key in a Chebotko diagram?

K

174. How do we indicate a clustering column in a Chebotko diagram?

C with up/down arrow

175. What is a table's main purpose in a Cassandra database?

Serve a query

176. Data Modeling Principles

1 -- Know your data

Data captured by conceptual data model

Define what is stored in database

Preserve properties so that data is organized correctly

2 -- Know your queries

Queries captured by application workflow model

Table schema design changes if queries change

3 -- Nest data

Nesting organizes multiple entities into a single partition

Support partition per query data access

Three data nesting mechanisms

Clustering column – multi row partitions

Collection columns

User defined type columns

4 -- Duplicate data

Better to duplicate than to join data

Partition per query and data nesting may result in data duplication

Query results are pre computed and materialized

Data can be duplicated across tables, partitions, or rows

177. What are the two preferable table query strategies?

Partition per query and partition+ per query

178. Why do we nest data in Cassandra?

Support a partition per query access pattern

179. Mapping Rules For the query driven methodology

Mapping rules ensure that a logical data model is correct

Each query has a corresponding table

Tables are designed to allow queries to execute properly

Tables return data in the correct order

MR1 -- Entities and Relationships

Entity and relationship types map to tables

Entity and relationship map to partitions or rows

Partition may have data about one or more entities and relationships

Attributes are represented by columns

180. Choose the option that lists the mapping rules in proper order

Entities and relationships, equality search attributes, inequality search attributes, ordering attributes, key attributes

Sunday, 6 September 2020

Interview Q and A for Cassandra DB Part - 2
