81. What do the cqlsh shell commands “CAPTURE” and “CONSISTENCY” do?
Cqlsh provides several shell commands. CAPTURE captures the output of a command and appends it to a file, while CONSISTENCY displays the current consistency level or sets a new one.
82. What is mandatory while creating a table in Cassandra?
A primary key is mandatory when creating a table. It is made up of one or more columns of the table.
83. What needs to be taken care of while adding a column?
When adding a column, make sure that:
· The column name does not conflict with an existing column name
· The table is not defined with the compact storage option
84. How Can We Maintain Consistency Across Multiple Data Centers?
LOCAL_QUORUM: Only replicas in the local data center are counted when acknowledging a write; the data is still written to the other data centers. It provides strong consistency within the local data center along with low latency.
The main consistency levels in Cassandra, roughly from weakest to strongest, are as follows:
- ANY (writes only)
- ONE, LOCAL_ONE
- TWO, THREE
- LOCAL_QUORUM
- QUORUM
- EACH_QUORUM
- ALL: lowest availability, strongest consistency
For multiple data centers, the best consistency levels to choose are ONE, QUORUM, and LOCAL_ONE.
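As a rough illustration of the levels above, the number of replicas that must acknowledge a request can be worked out from the replication factor with the usual quorum formula, floor(RF / 2) + 1. The data center names and replication factors below are hypothetical, and the sketch only counts replicas; it does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication: dict, local_dc: str) -> int:
    """Replicas that must acknowledge a request at the given level.

    `replication` maps a data center name to its replication factor.
    """
    if level == "LOCAL_ONE":
        return 1
    if level == "LOCAL_QUORUM":
        return quorum(replication[local_dc])
    if level == "QUORUM":
        return quorum(sum(replication.values()))
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in replication.values())
    if level == "ALL":
        return sum(replication.values())
    raise ValueError(f"level not covered by this sketch: {level}")


# Hypothetical keyspace replicated to two data centers, RF 3 in each.
replication = {"dc_east": 3, "dc_west": 3}
for level in ("LOCAL_ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
    print(level, replicas_required(level, replication, "dc_east"))
```

With these numbers, LOCAL_QUORUM waits for 2 replicas in the local data center, while QUORUM waits for 4 of the 6 replicas and therefore has to hear from a remote data center as well.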
85. How many types of NoSQL databases are there?
There are four types of NoSQL databases, namely:
- Document Stores (MongoDB, Couchbase)
- Key-Value Stores (Redis, Voldemort)
- Column Stores (Cassandra)
- Graph Stores (Neo4j, Giraph)
86. What do you understand by Commit log in Cassandra?
The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
87. How does Cassandra provide high availability?
Cassandra is robust software. Nodes joining and leaving the cluster are handled automatically. With proper settings (for example, a replication factor greater than one), Cassandra can be made failure resistant: if some of the servers fail, no data is lost as long as a replica survives. You can therefore deploy Cassandra on cheap commodity hardware or in a cloud environment, where hardware or infrastructure failures may occur.
88. When should you not use Cassandra? OR When to use RDBMS instead of Cassandra?
Cassandra is a NoSQL database and does not provide ACID transactions or relational features. If you have a strong requirement for ACID properties (for example, for financial data), Cassandra is not a good fit. You can make it work, but you will end up writing a lot of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.
89. What do you understand by Node in Cassandra?
Node is the place where data is stored.
90. What do you understand by Data center in Cassandra?
Data center is a collection of related nodes.
91. What do you understand by Cluster in Cassandra?
Cluster is a component that contains one or more data centers.
92. What is the syntax to create keyspace in Cassandra?
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE <identifier> WITH <properties>
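For illustration, here is one way to fill in <identifier> and <properties>. The helper name, the keyspace name demo, and the choice of SimpleStrategy with a replication factor of 3 are assumptions made for this sketch; the function only builds the statement text and does not connect to Cassandra.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CREATE KEYSPACE statement that uses SimpleStrategy."""
    return (
        f"CREATE KEYSPACE {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


# Hypothetical keyspace name and replication factor.
print(create_keyspace_cql("demo", 3))
```

The printed statement can be pasted into cqlsh. For a cluster with several data centers, NetworkTopologyStrategy with a replication factor per data center would normally replace SimpleStrategy.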
93. What does an SSTable consist of?
An SSTable consists mainly of two files:
· Index file (Bloom filter and key-offset pairs)
· Data file (actual column data)
94. What is a Bloom filter used for in Cassandra?
A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. In other words, it is used to determine whether an SSTable has data for a particular row. In Cassandra it is used to save I/O when performing a key lookup.
Bloom filters are quick, probabilistic structures for testing set membership: they can report false positives but never false negatives. A Bloom filter acts as a special kind of cache and is consulted on every read query.
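The idea can be sketched in a few lines. This is a minimal toy Bloom filter, not Cassandra's implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 to derive bit positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user:42")
print(sstable_filter.might_contain("user:42"))
```

When might_contain returns False, the SSTable can be skipped without touching disk, which is where the I/O saving comes from.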
95. How does Cassandra delete data?
SSTables are immutable, so a row cannot be removed from them directly. When a row needs to be deleted, Cassandra writes a special marker called a tombstone in place of the value. When the data is read, anything covered by a tombstone is treated as deleted.
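A toy model of this behavior, assuming SSTables are read from newest to oldest and that the newest entry for a key wins. Real Cassandra resolves conflicts with write timestamps rather than list order; the names flush, read, and TOMBSTONE are hypothetical.

```python
from types import MappingProxyType

TOMBSTONE = object()  # special marker meaning "this key was deleted"


def flush(memtable: dict):
    """Freeze a memtable into a read-only toy 'SSTable'."""
    return MappingProxyType(dict(memtable))


def read(key, sstables):
    """Look a key up in SSTables ordered oldest to newest."""
    for sstable in reversed(sstables):  # newest entry wins
        if key in sstable:
            value = sstable[key]
            return None if value is TOMBSTONE else value
    return None


older = flush({"row1": "alice", "row2": "bob"})
newer = flush({"row1": TOMBSTONE})  # the delete is recorded as a new write

print(read("row1", [older, newer]))
print(read("row2", [older, newer]))
```

The old value of row1 is still physically present in the older SSTable; it is only hidden by the tombstone until compaction removes both.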
96. What does JMX stand for?
JMX stands for Java Management Extensions
97. Cassandra is written in which language?
Java
98. What happens to existing data in my cluster when I add new nodes?
When a new node joins a cluster, it automatically contacts the other nodes in the cluster and copies the right data to itself.
99. What are “Seed Nodes” in Cassandra?
A seed node in Cassandra is a node that is contacted by other nodes when they first start up and join the cluster. A cluster can have multiple seed nodes. Seed nodes help with the bootstrapping process for a new node joining a cluster. It is recommended to use two seed nodes per data center.
100. What is Nodetool Repair ?
Syncs all data in the cluster
Expensive -- Grows with amount of data in cluster
Use with clusters servicing high writes/deletes
Last line of defense
Run to synchronize a failed node coming back online
Run on nodes not read from very often
101. Read repair always occurs when consistency level is set to...
ALL
102. What does read_repair_chance do?
Sets the probability with which Cassandra performs a read repair when the consistency level is less than ALL.
103. The purpose of the commit log is...
to replay if a crashed node restarts.
104. What is Read repair chance ?
Performed when read is at consistency level less than ALL
Request reads only a subset of the replicas
We can’t be sure replicas are in sync
Generally you are safe, but no guarantees
Response sent immediately when consistency level is met
10 % by default
105. When is a write acknowledged to the client?
After the commit log and MemTable are written
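The write path and the crash-recovery role of the commit log can be sketched with a toy node. The Node class and its methods are hypothetical; a Python list stands in for the durable commit log file and a dict for the MemTable.

```python
class Node:
    """Toy model of one node's write path (not Cassandra's real classes)."""

    def __init__(self):
        self.commit_log = []  # stand-in for the durable, append-only log on disk
        self.memtable = {}    # in-memory structure, lost on a crash

    def write(self, key, value) -> str:
        self.commit_log.append((key, value))  # 1. record the write durably
        self.memtable[key] = value            # 2. apply it in memory
        return "ack"                          # 3. only now acknowledge

    def crash(self) -> None:
        self.memtable = {}  # memory is lost; the commit log survives

    def restart(self) -> None:
        for key, value in self.commit_log:  # replay the log in order
            self.memtable[key] = value


node = Node()
node.write("sensor:1", 20.5)
node.write("sensor:1", 21.0)
node.crash()
node.restart()
print(node.memtable)
```

Because every acknowledged write is already in the commit log, replaying it after a restart rebuilds the MemTable contents that were lost.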
106. Which are stored sorted by clustering columns?
SSTable , MemTable
107. The partition summary...
stores byte offsets into the partition index.
108. The key cache...
stores the byte offset of the most recently accessed records.
109. Which of the structures reside on disk?
SSTable
partition index
110. Which are benefits from compaction?
More optimal disk usage
Faster reads
Less memory pressure
111. All tombstones are discarded during compaction.
False
112. In which scenarios would a new partition on disk be larger than either of its input partition segments after a compaction?
The input partition segments are made up of mostly INSERT operations.
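A small sketch of why this happens, assuming each partition segment maps a clustering key to a (timestamp, value) pair and that compaction keeps the newest version of each row. The compact function and the data are hypothetical, and tombstone handling is left out.

```python
def compact(*segments: dict) -> dict:
    """Merge partition segments, keeping the newest version of each row."""
    merged = {}
    for segment in segments:
        for row_key, (timestamp, value) in segment.items():
            if row_key not in merged or timestamp > merged[row_key][0]:
                merged[row_key] = (timestamp, value)
    return merged


# Mostly INSERTs of distinct rows: nothing is superseded, so rows add up.
inserts_a = {1: (10, "a"), 2: (11, "b")}
inserts_b = {3: (12, "c"), 4: (13, "d")}
print(len(compact(inserts_a, inserts_b)))

# An UPDATE of an existing row: the older version is dropped instead.
updates = {1: (20, "a2")}
print(len(compact(inserts_a, updates)))
```

In the first case the merged partition holds 4 rows, more than either input; in the second it stays at 2 rows because the update replaces an existing row.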
113. Adding Nodes
You might want to consider adding a new node if you have:
- Reached data capacity
-- Your data has outgrown the node’s hardware capacity
- Reached traffic capacity
-- Your application needs more rapid response with less latency
- Need for more operational headroom
-- You need more resources for node repair, compaction, and other resource-intensive operations
114. Adding Nodes Best Practices
Single-token nodes -- Double the size of the cluster
Vnodes -- For vnode clusters, the cluster size can be increased incrementally as more nodes are needed
- Wait a period of time before adding each additional node (single-token and vnodes)
- Follow the ‘2 minute rule’
- This ensures the range announcement is known to all nodes before the next one begins entering the cluster.
115. What are the main parameters for node setup?
There are four main parameters of a node for bootstrapping
These are configured in the cassandra.yaml file
cluster_name, rpc_address, listen_address, seeds
116. What is the bootstrapping process?
Simple process but pretty critical
Can be a long running process
Node announces itself to ring using seed node
Calculate ranges of new node, notify ring of these pending ranges
Calculate the nodes that currently own these ranges and will no longer own them once the bootstrap completes
Stream the data from these nodes to the bootstrapping node ( monitor with nodetool netstats)
Join the new node to the ring so it can serve traffic
Length of time it takes to join will depend on the amount of data to be streamed
117. What if bootstrap fails ?
Two scenarios
- Bootstrapping node could not even connect to cluster
Fairly easy to deal with
Something fundamental like could not find cluster
First examine the log file to understand what is going on (error conditions of this kind are usually flagged early in the bootstrap process), then change the configuration and try again
- Streaming portion fails
Node exists in cluster in joining state
Nodetool rebuild to rebootstrap data
118. Nodetool Cleanup
Perform cleanup after a bootstrap on the OTHER nodes
You don’t have to do this
Reads all SSTables to make sure there is no token out of range for that particular node
If data is out of range, the SSTable is rewritten (copied) without it
If you don’t run cleanup, will get picked up through compaction over time.
Cleanup is basically a compaction
119. How do we run a cleanup operation
The nodetool cleanup command cleans up the keyspace and tables that are specified
bin/nodetool [options] cleanup -- <keyspace> (<table>)
Use flags to specify
-h host/IP address
-p port
-pw password
-u username
If no keyspace is specified, nodetool cleanup cleans all keyspaces
120. Why would I remove a node ?
Two very different scenarios :
You are going to reduce capacity, need to decommission ( some sort of operational requirement)
The node is offline and will never come back online
121. Removing a live node from the cluster
Perhaps you want to decrease the size of your cluster
Perhaps you might want to swap out an older machine with a newer machine
Decommissioning a node assigns the ranges of the old node to other nodes and replicates the appropriate data to those nodes
Decommissioned node’s data will be streamed from the decommissioning node itself
Once data has been moved to other nodes, the process for removing or replacing is similar for both
122. When a node is decommissioned
Node is marked as ‘LEAVING’ and will stream data to other live nodes.
The data directories will still exist – remove these if the node will go back into production
The Cassandra JVM is still running – but with Gossip , Thrift and Native Transport ports all down.
This allows admin to hook up a JMX client to analyze the metrics maintained in the JVM
Then the JVM process can be shutdown manually
123. Decommission a node using nodetool
/bin/nodetool [option] decommission
Decommissions the node that nodetool is connected to (the host given with -h)
-h host/IP address
-p port
-pw password
-u username
Monitor progress with nodetool netstats
124. Can we remove a node ?
Before doing anything, check nodetool status to see the state of the node in question
nodetool status -- the status column shows whether the node is up or down
If the node is down ( and not coming back online), choose the appropriate option:
-Remove the node using the nodetool removenode command
Adjust your tokens to avoid creating a hot spot if using single-token nodes.
-If removenode fails, run nodetool assassinate
-nodetool repair should be run once the node is removed from the cluster
/bin/nodetool [options] removenode [host id]
-h host/IP address
-p port
-pw password
-u username
Additional arguments -- status, force
125. The pros of replacing a downed node
You don’t have to move the data twice
A backup of the old node will work for the replacement node, because the same tokens are used to bring the replacement node into the cluster
126. replacing a downed node using nodetool
First find the ip address of the down node using nodetool status
On the replacement node, open the cassandra-env.sh file
Swap in the IP address of dead node as the replace_address value in the JVM option. This will enable bootstrapping of the new node.
Use nodetool removenode to remove the dead node
Use the force option if necessary (nodetool assassinate)
You can monitor the process using nodetool netstats
127. What if the node was also a seed node?
Considerations
The new node needs to be added to the list of seeds in cassandra.yaml
Cassandra will not allow a seed node to auto-bootstrap
You will therefore have to run repair on the new seed node to load its data.
Steps
Add a new node, making the necessary changes to the cassandra.yaml file
Specify the new seed node in the cassandra.yaml file
Start Cassandra on new seed node
Run nodetool repair on the new seed node to manually bootstrap
Remove the old seed node using nodetool removenode with the Host ID of the downed node
Run nodetool cleanup on previously existing nodes
128. By default, how many vnodes does each node have?
256 (the long-standing default; Cassandra 4.0 lowered the default num_tokens to 16)
129. Which parameter in the cassandra.yaml file configures vnodes?
num_tokens
130. When using vnodes, Cassandra automatically assigns token ranges for you.
True
131. Nodes can only gossip with specific other nodes in the cluster.
False
132. Which of the statements are true concerning gossip?
Constant trickle of network traffic
Does not cause network spikes
Minimal compared to data streaming
133. In a full network partition, that is, parts of the cluster are completely disconnected from the whole, only the largest group of nodes can still satisfy queries.
False
134. What are the three main layers (in order) of data modeling?
Conceptual, Logical, Physical
135. Data modeling
Analyze requirements
Identify entities and relationships
Identify queries
Specify the schema
Optimize
Conceptual data model + application workflow -> (mapping conceptual to logical) -> logical data model -> (physical optimization) -> physical data model
Think outside of the box
Non standard solution – requires creativity
Different data models have different costs
136. Keyspaces
Top level namespace/container
Similar to a relational database schema
Replication parameters required
Keyspaces contain tables
Tables contain data
Uniquely identify rows
137. How to switch between keyspaces
By USE command
USE keyspacename
138. What is UUID & TIMEUUID
UUID - Universally Unique identifier
Generate via uuid()
TIMEUUID embeds a Timestamp value
Sortable
Generate via now()
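Python's standard uuid module offers a close analogue for experimenting with the two types: uuid4() produces a random (version 4) UUID like the one CQL's uuid() returns, and uuid1() produces a time-based (version 1) UUID like the one now() returns. The helper timeuuid_to_datetime is a hypothetical name introduced for this sketch.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals from this date.
UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Extract the timestamp embedded in a time-based (version 1) UUID."""
    if value.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    return UUID_EPOCH + timedelta(microseconds=value.time // 10)


random_id = uuid.uuid4()  # analogue of a UUID column value
time_id = uuid.uuid1()    # analogue of a TIMEUUID column value

print(random_id, random_id.version)
print(time_id, time_id.version, timeuuid_to_datetime(time_id))
```

The embedded timestamp is what makes TIMEUUID values sortable by creation time, which is why they are often used as clustering columns.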
139. Copy command
Imports/ exports CSV
Header parameter skips the first line in the file
COPY table1 (c1, c2, c3) FROM 't1.csv' WITH HEADER = true;
140. What command bulk-loads data files?
COPY
141. Why do we use UUIDs in Cassandra to uniquely identify records?
To avoid conflicts in auto generating IDs between nodes
142. Cassandra requires you to specify the width of textual types, for example VARCHAR(50).
False
143. Partition Storage
Cassandra distributes partitions across nodes
A WHERE on any field other than the partition key would require searching all partitions on all nodes
Cassandra does not like this
We can WHERE on a partition key value
Cassandra uses a hashing algorithm to quickly determine which nodes contain the desired partition
144. What is the smallest atomic unit of storage in Cassandra?
partition
145. What is a cell?
key-value pair
146. What is a partition?
group of cells
147. What is the significance of the partition key?
Cassandra hashes the key value to determine which node the partition resides on
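A minimal sketch of that lookup, assuming a token ring in which each node owns the range ending at its own token. SHA-256 is used here only as a stand-in hash (Cassandra's default partitioner is Murmur3), and the node names and token values are hypothetical.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 64


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (stand-in hash, not Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: a node owns the tokens up to and including its own."""

    def __init__(self, node_tokens: dict):
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner_of_token(self, tok: int) -> str:
        # First node whose token is >= tok; wrap around past the last node.
        index = bisect.bisect_left(self._tokens, tok)
        return self._entries[index % len(self._entries)][1]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token_for(partition_key))


ring = Ring({
    "node_a": TOKEN_SPACE // 3,
    "node_b": 2 * TOKEN_SPACE // 3,
    "node_c": TOKEN_SPACE - 1,
})
print(ring.owner("user:42"))
```

Because the owner is computed from the key alone, any node can work out where a partition lives without searching the cluster.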
148. Clustering columns
Come after partition key within PRIMARY KEY clause
Clustering columns order CQL rows within a partition.
Clustering column values stored sorted
Default is ascending
149. Querying clustering columns
You must first provide a partition key
Clustering columns can follow thereafter
You can perform either equality (=) or range queries (<, >) on clustering columns
All equality comparisons must come before inequality comparisons
Since data is sorted on disk, range searches are a binary search followed by a linear read
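The "binary search followed by a linear read" pattern can be illustrated on a sorted list. This sketch assumes one partition held in memory as a list of (clustering_value, data) tuples sorted by clustering value; the function name and sample rows are hypothetical.

```python
import bisect


def range_query(sorted_rows: list, low, high) -> list:
    """Return rows with low <= clustering value < high."""
    # Binary search for the first row at or after `low`. The 1-tuple (low,)
    # sorts before any (low, data) tuple, so bisect_left lands on that row.
    start = bisect.bisect_left(sorted_rows, (low,))
    result = []
    # Linear read from the starting point until the range is exhausted.
    for clustering_value, data in sorted_rows[start:]:
        if clustering_value >= high:
            break
        result.append((clustering_value, data))
    return result


# One partition, rows already sorted by clustering column.
partition = [(1, "a"), (3, "b"), (5, "c"), (7, "d")]
print(range_query(partition, 3, 7))
```

The search cost depends on the logarithm of the partition size, and the read cost only on the number of rows returned, which is why range queries on clustering columns are cheap.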
150. Change default Ordering of clustering columns
Clustering columns default to ascending order
Change ordering direction via WITH CLUSTERING ORDER BY
Must include all clustering columns up to and including the ones you wish to order descending
151. Allow filtering
ALLOW FILTERING relaxes the constraint of querying on the partition key
You can then query on just clustering columns
Causes Cassandra to scan all partitions in the table
Don’t use it -- Unless you really have to -- Best on small data sets
152. What is an upsert?
An operation that inserts a row if it does not exist and updates it if it does. In Cassandra, INSERTs may act as UPDATEs and UPDATEs may act as INSERTs.
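A toy model of upsert semantics, assuming a table is a dict keyed by primary key. The function and table names are hypothetical, and features such as timestamps and lightweight transactions (IF NOT EXISTS) are ignored.

```python
def upsert(table: dict, primary_key: tuple, **columns) -> None:
    """Create the row if it is missing, otherwise overwrite the given columns."""
    table.setdefault(primary_key, {}).update(columns)


# In this toy model INSERT and UPDATE are the same operation.
insert = upsert
update = upsert

users = {}
update(users, ("u1",), name="Ann")               # UPDATE creates a missing row
insert(users, ("u1",), name="Bob", city="Oslo")  # INSERT overwrites an existing row
print(users)
```

Neither call checks whether the row exists first, which mirrors why Cassandra writes need no read before the write.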
153. What purpose do clustering columns serve?
Provide uniqueness within the partition as well as ordering criteria
154. What is the relationship between a partition key and a clustering column?
Partition keys determine a grouping criteria whereas clustering columns determine ordering criteria
155. What is NODETOOL
Node management
Located in the bin/ folder
/bin/nodetool help
Help -- Lists all possible sub commands
Info – Current node settings and stats
Status – Reports basic node health information
156. Alter Table statement
Adding column
Dropping column
Cannot alter primary key
157. Collection column
Collection columns are multi valued columns
Designed to store a small amount of data
Retrieved in its entirety
Cannot nest a collection inside another collection
158. What are UDTs (user-defined types)?
UDTs group related fields of information
Allow embedding more complex data within a single column
CREATE TYPE address (street text, city text);
Use a UDT as a column type by adding the frozen keyword, for example frozen<address>.
159. What command drops all records from an existing table?
TRUNCATE
160. What command adds/removes columns to/from a table?
ALTER
161. Which is a Cassandra column type?
LIST<>
SET<>
MAP<>
162. Cassandra counters are always 100% accurate.
FALSE
163. What command executes a file of CQL statements?
SOURCE
164. Conceptual data modeling
Abstract view of the domain
Technology independent
Not specific to any database system
165. Which is an advantage of conceptual data modeling?
Collaboration between both technical and non-technical team members
Provides abstraction from the problem details
Better understanding of the domain
166. Which is a type found in a conceptual data model?
Entity types
Relationship types
Attribute types
167. Attribute types can be...
key
composite
multi-valued
168. How do you determine the key of a 1-1 relationship?
Key attributes of either participating entity types
169. How do you determine the key of a 1-n relationship?
Key attributes of entity type on the many side
170. How do you determine the key of a m-n relationship?
Key attributes of both participating entity types
171. What does disjoint mean?
An entity can participate in only one subtype role
172. What is an application workflow?
Tasks formed by causal dependencies
173. How do we indicate a partition key in a Chebotko diagram?
K
174. How do we indicate a clustering column in a Chebotko diagram?
C with up/down arrow
175. What is a table's main purpose in a Cassandra database?
Serve a query
176. Data Modeling Principles
1 -- Know your data
Data captured by conceptual data model
Define what is stored in database
Preserve properties so that data is organized correctly
2 -- Know your queries
Queries captured by application workflow model
Table schema design changes if queries change
3 -- Nest data
Nesting organizes multiple entities into a single partition
Support partition per query data access
Three data nesting mechanisms
Clustering column – multi row partitions
Collection columns
User defined type columns
4 -- Duplicate data
Better to duplicate than to join data
Partition per query and data nesting may result in data duplication
Query results are pre computed and materialized
Data can be duplicated across tables, partitions, or rows
177. What are the two preferable table query strategies?
Partition per query and partition+ per query
178. Why do we nest data in Cassandra?
Support a partition per query access pattern
179. Mapping Rules For the query driven methodology
Mapping rules ensure that a logical data model is correct
Each query has a corresponding table
Tables are designed to allow queries to execute properly
Tables return data in the correct order
MR1 -- Entities and Relationships
Entity and relationship types map to tables
Entity and relationship map to partitions or rows
Partition may have data about one or more entities and relationships
Attributes are represented by columns
180. Choose the option that lists the mapping rules in proper order
Entities and relationships, equality search attributes, inequality search attributes, ordering attributes, key attributes