Wednesday, 20 November 2019

Interview Q and A for Cassandra DB Part - 1

1. Compare MongoDB and Cassandra

Criteria            MongoDB       Cassandra

Data Model     Document       Big Table like

Database scalability    Read    Write

Querying of data         Multi-indexed Using Key or Scan

 

2. What is Cassandra?

Cassandra is one of the most favored NoSQL distributed database management systems by Apache. With open source technology, Cassandra is efficiently designed to store and manage large volumes of data without any failure. Highly scalable for Big Data models and originally designed by Facebook, Apache Cassandra is written in Java comprising flexible schemas. Apache Cassandra has no single point of failure. There are various types of NoSQL databases and Cassandra is a hybrid of column-oriented and key-value store database. 

The keyspace is outermost container for an application and table or column family in Cassandra is keyspace entity.

 

3. List the benefits of using Cassandra.

Unlike traditional or any other database, Apache Cassandra delivers near real-time performance simplifying the work of Developers, Administrators, Data Analysts and Software Engineers.

• Instead of master-slave architecture, Cassandra is established on peer-to-peer architecture ensuring no failure.

• It also assures phenomenal flexibility as it allows insertion of multiple nodes to any Cassandra cluster in any datacenter. Further, any client can forward its request to any server.

• Cassandra facilitates extensible scalability and can be easily scaled up and scaled down as per the requirements. With a high throughput for read and write operations, this NoSQL application need not be restarted while scaling.

• Cassandra is also revered for its strong data replication on nodes capability as it allows data storage at multiple locations enabling users to retrieve data from another location if one node fails. Users have the option to set up the number of replicas they want to create.

• Shows brilliant performance when used for massive datasets and thus, the most preferable NoSQL DB by most organizations.

• Operates on column-oriented structure and thus, quickens and simplifies the process of slicing. Even data access and retrieval become more efficient with column-based data model.

• Further, Apache Cassandra supports schema-free/schema-optional data model, which un-necessitate the purpose of showing all the columns required by your application.

 

4. Explain the concept of Tunable Consistency in Cassandra.

Tunable Consistency is a phenomenal characteristic that makes Cassandra a favored database choice of Developers, Analysts and Big data Architects. Consistency refers to the up-to-date and synchronized data rows on all their replicas. Cassandra’s Tunable Consistency allows users to select the consistency level best suited for their use cases. It supports two consistencies -Eventual and Consistency and Strong Consistency.

The former guarantees consistency when no new updates are made on a given data item, all accesses return the last updated value eventually. Systems with eventual consistency are known to have achieved replica convergence.

For Strong consistency, Cassandra supports the following condition:

R + W > N, where

N – Number of replicas

W – Number of nodes that need to agree for a successful write

R – Number of nodes that need to agree for a successful read

 

5. How does Cassandra write?

Cassandra performs the write function by applying two commits-

first it writes to a commit log on disk and then commits to an in-memory structured known as memtable. Once the two commits are successful, the write is achieved. Writes are written in the table structure as SSTable (sorted string table). Cassandra offers speedier write performance.

 

6. Define the management tools in Cassandra.

DataStaxOpsCenter: internet-based management and monitoring solution for Cassandra cluster and DataStax. It is free to download and includes an additional Edition of OpsCenter

• SPM primarily administers Cassandra metrics and various OS and JVM metrics. Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, zookeeper and other Big Data platforms. The main features of SPM include correlation of events and metrics, distributed transaction tracing, creating real-time graphs with zooming, anomaly detection and heartbeat alerting.

 

7. Define memtable.

 It is a memory-resident data structure. After commit log, the data will be written to the mem-table. Similar to table, memtable is in-memory/write-back cache space consisting of content in key and column format. The data in memtable is sorted by key, and each Column Family consist of a distinct memtable that retrieves column data via key. It stores the writes until it is full, and then flushed out.

Sometimes, for a single-column family, there will be multiple mem-tables.

 

8. What is SSTable? How is it different from other relational tables?

SSTable expands to ‘Sorted String Table,’ which refers to an important data file in Cassandra and accepts regular written memtables. They are stored on disk and exist for each Cassandra table. Exhibiting immutability, SStables do not allow any further addition and removal of data items once written. For each SSTable, Cassandra creates three separate files like partition index, partition summary and a bloom filter.

SSTable is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

 

9. Explain the concept of Bloom Filter.

Associated with SSTable, Bloom filter is an off-heap (off the Java heap to native memory) data structure to check whether there is any data available in the SSTable before performing any I/O disk operation.

A bloom filter is a tool used by Cassandra. The read path of Cassandra has to go through Memtable and the row cache. A bloom filter is a partition cache, and its role is the read path is to avoid checking every SStable to find one particular data.

 

10. Explain CAP Theorem.

With a strong requirement to scale systems when additional resources are needed, CAP Theorem plays a major role in maintaining the scaling strategy. It is an efficient way to handle scaling in distributed systems. Consistency Availability and Partition tolerance (CAP) theorem states that in distributed systems like Cassandra, users can enjoy only two out of these three characteristics.

One of them needs to be sacrificed. Consistency guarantees the return of most recent write for the client, Availability returns a rational response within minimum time and in Partition Tolerance, the system will continue its operations when network partitions occur. The two options available are AP and CP.

§  Consistency: means that data is the same across the cluster, so you can read or write to/from any node and get the same data.

§  Availability: means the ability to access the cluster even if a node in the cluster goes down.

§  Partition: Tolerance means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).

In order to get both availability and partition tolerance, you have to give up consistency. Consider if you have two nodes, X and Y, in a master-master setup. Now, there is a break between network comms in X and Y, so they can't synch updates. At this point you can either:

A) Allow the nodes to get out of sync (giving up consistency), or

B) Consider the cluster to be "down" (giving up availability)

 

All the combinations available are:

§  CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved).

§  CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down.

§  AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)

A drawing of a face

Description automatically generated

 

 

 

 

11. State the differences between a node, a cluster and datacenter in Cassandra.

There are various components of Cassandra. While a node is a single machine running Cassandra, cluster is a collection of nodes that have similar type of data grouped together. Data Centers are useful components when serving customers in different geographical areas. You can group different nodes of a cluster into different data centers.

 

12. How to write a query in Cassandra?

Using CQL (Cassandra Query Language).Cqlsh is used for interacting with database.

 

13. What OS Cassandra supports?

Windows and Linux

 

14. What is Cassandra Data Model?

Cassandra Data Model consists of four main components:

Cluster: Made up of multiple nodes and keyspaces

Keyspace: a namespace to group multiple column families, especially one per partition

Column: consists of a column name, value and timestamp

ColumnFamily: multiple columns with row key reference.

 

15. What is CQL?

CQL is Cassandra Query language to access and query the Apache distributed database. It consists of a CQL parser that incites all the implementation details to the server. The syntax of CQL is similar to SQL but it does not alter the Cassandra data model.

 

16. Explain the concept of compaction in Cassandra.

Compaction refers to a maintenance process in Cassandra , in which, the SSTables are reorganized for data optimization of data structures on the disk. The compaction process is useful during interactive with memtable. There are two types of compaction in Cassandra:

Minor compaction: started automatically when a new sstable is created. Here, Cassandra condenses all the equally sized sstables into one.

Major compaction is triggered manually using nodetool. Compacts all sstables of a Column Family into one.

 

17. Does Cassandra support ACID transactions?

Unlike relational databases, Cassandra does not support ACID transactions.

 

18. Explain Cqlsh

Cqlsh expands to Cassandra Query language Shell that configures the CQL interactive terminal. It is a Python-base command-line prompt used on Linux or Windows and execute CQL commands like ASSUME, CAPTURE, CONSITENCY, COPY, DESCRIBE and many others. With cqlsh, users can define a schema, insert data and execute a query.

 

19. What is SuperColumn in Cassandra?

Cassandra Super Column is a unique element consisting of similar collections of data. They are actually key-value pairs with values as columns. It is a sorted array of columns, and they follow a hierarchy when in action: keystore> column family> super column> column data structure in JSON.

Similar to row keys, super column data entries contain no independent values but are used to collect other columns. It is interesting to note that super column keys appearing in different rows do not necessarily match and will not ever.

These super columns are used to improve the performance of the database

 

20. Define the consistency levels for read operations in Cassandra.

• ALL: Highly consistent. A write must be written to commitlog and memtable on all replica nodes in the cluster

• EACH_QUORUM: A write must be written to commitlog and memtable on quorum of replica nodes in all data centers.

• LOCAL_QUORUM:A write must be written to commitlog and memtable on quorum of replica nodes in the same center.

• ONE: A write must be written to commitlog and memtable of at least one replica node.

• TWO, Three: Same as One but at least two and three replica nodes, respectively

• LOCAL_ONE: A write must be written for at least one replica node in the local data center

• ANY

• SERIAL: Linearizable Consistency to prevent unconditional update

• LOCAL_SERIAL: Same as Serial but restricted to local data center

 

21. What is difference between Column and Super Column?

Both elements work on the principle of tuple having name and value. However, the former‘s value is a string while the value in latter is a Map of Columns with different data types.

Unlike Columns, Super Columns do not contain the third component of timestamp.

 

22. What is Column Family?

As the name suggests, Column Family refers to a structure having infinite number of rows. That are referred by a key-value pair, where key is the name of the column and value represents the column data. It is much similar to a hashmap in java or dictionary in Python. Remember, the rows are not limited to a predefined list of Columns here. Also, the ColumnFamily is absolutely flexible with one row having 100 Columns while the other only 2 columns.

 

23. Define the use of Source Command in Cassandra.

Source command is used to execute a file consisting of CQL statements.

 

24. What is Thrift?

Thrift is the name of the RPC client used to communicate with the Cassandra server.

Thrift is a legacy RPC protocol or API unified with a code generation tool for CQL. The purpose of using Thrift in Cassandra is to facilitate access to the DB across the programming language.

 

25. Explain Tombstone in Cassandra.

Tombstone is row marker indicating a column deletion. These marked columns are deleted during compaction. Tombstones are of great significance as Cassnadra supports eventual consistency, where the data must respond before any successful operation.

 

26. What Platforms Cassandra runs on?

Since Cassandra Online Training is a Java application, it can successfully run on any Java-driven platform or Java Runtime Environment (JRE) or Java Virtual Machine (JVM). Cassandra also runs on RedHat, CentOS, Debian and Ubuntu Linux platforms.

 

27. Name the ports Cassandra uses.

The default settings state that Cassandra uses 7000 ports for Cluster Management, 9160 for Thrift Clients, 8080 for JMX. These are all TCP ports and can be edited in the configuration file: bin/Cassandra.in.sh

 

By default, Cassandra uses 7000 for cluster communication (7001 if SSL is enabled), 9042 for native protocol clients, and 7199 for JMX. The internode communication and native protocol ports are configurable in the Cassandra Configuration File. The JMX port is configurable in cassandra-env.sh (through JVM options). All ports are TCP.

 

28. Can you add or remove Column Families in a working Cluster?

Yes, but keeping in mind the following processes.

•           Do not forget to clear the commitlog with ‘nodetool drain’

•           Turn off Cassandra to check that there is no data left in commitlog

•           Delete the sstable files for the removed CFs

 

29. What is Replication Factor in Cassandra?

Replication Factor is the measure of number of data copies existing. It is important to increase the replication factor to log into the cluster.

 

30. Can we change Replication Factor on a live cluster?

Yes, but it will require running repair to alter the replica count of existing data.

 

31. How to iterate all rows in ColumnFamily?

Using get_range_slices. You can start iteration with the empty string and after each iteration, the last key read serves as the start key for next iteration.

 

32. What do you understand by Data Replication in Cassandra?

Database replication is the frequent electronic copying data from a database in one computer or server to a database in another so that all users share the same level of information.

Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines the nodes where replicas are placed. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule,  the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.

 

33. What are the three components of Cassandra write?

The three components are:

1.         Commitlog write

2.         Memtable write

3.         SStable write

Cassandra first writes data to a commit log and then to an in-memory table structure memtable and at last in SStable.

 

34. Explain zero consistency.

In zero consistency the write operations will be handled in the background, asynchronously. It is the fastest way to write data.

 

35. When do you have to avoid secondary indexes?

Try not using secondary indexes on columns containing a high count of unique values as that will produce few results.

 

37. What are secondary indexes?

Secondary indexes are indexes built over column values. In other words, let’s say you have a user table, which contains a user’s email. The primary index would be the user ID, so if you wanted to access a particular user’s email, you could look them up by their ID. However, to solve the inverse query given an email, fetch the user ID requires a secondary index.

 

38. When to use secondary indexes?

You want to query on a column that isn’t the primary key and isn’t part of a composite key. The column you want to be querying on has few unique values (what I mean by this is, say you have a column Town, that is a good choice for secondary indexing because lots of people will be form the same town, date of birth however will not be such a good choice).

 

39. I have a row or key cache hit rate of 0.XX123456789 reported by JMX. Is that XX% or 0.XX% ?

XX%

 

40. What is the use of “void close()” method?

This method is used to close the current session instance.

 

41. What are the collection data types provided by CQL?

There are three collection data types:

1.         List : A list is a collection of one or more ordered elements.

2.         Map : A map is a collection of key-value pairs.

3.         Set : A set is a collection of one or more elements.

 

42. Mention what is Cassandra- CQL collections?

Cassandra CQL collections help you to store multiple values in a single variable. In Cassandra, you can use CQL collections in following ways

•           List: It is used when the order of the data needs to be maintained, and a value is to be stored multiple times (holds the list of unique elements)

•           SET: It is used for group of elements to store and returned in sorted orders (holds repeating elements)

•           MAP: It is a data type used to store a key-value pair of elements

 

43. Which command is used to start the cqlsh prompt?

Cqlsh

 

44. What is the use of “cqlsh –version” command?

This command will provides the version of the cqlsh you are using.

 

45. List the steps in which Cassandra writes changed data into commitlog?

Cassandra concatenates changed data to commitlog. Then Commitlog acts as a crash recovery log for data. Until the changed data is concatenated to commitlog, write operation will never be considered successful.

Data will not be lost once commitlog is flushed out to file.

 

46. What is the use of “ResultSet execute(Statement statement)” method?

This method is used to execute a query. It requires a statement object.

 

47. Mention what are the values stored in the Cassandra Column?

There are three values in Cassandra Column. They are:

1.         Column Name

2.         Value

3.         Time Stamp

 

48. What do you understand by Kundera?

Kundera is an object-relational mapping (ORM) implementation for Cassandra which is written using Java annotations.

 

49. Define composite type in Cassandra?

In Cassandra, composite type allows to define a key or a column name with a concatenation of data of different type. You can use two types of Composite Types:

1.         Row Key

2.         Column Name

 

50. Explain what is a keyspace in Cassandra?

In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consists of one keyspace per node.

51. Partition key columns are optional if you have clustering columns.

False

 

52. What benefits do Clustering columns provide?

Reading sorted data is a matter of seeking the disk head once.

 

53. Cassandra works best on network attached storage.

False

 

54. How much data can a single Cassandra node effectively handle?

1 to 3 terabytes

 

55. What is Partitions...

group rows physically together on disk based on the partition key.

 

56. What is the role of the partitioner?

It hashes the partition key values to create a partition token.

 

57. How Node joining the cluster ?

Nodes join the cluster by communicating with any node 

Cassandra finds these seed nodes list of possible nodes in Cassandra.yaml

Seed nodes communicate cluster topology to the joining node 

Once the new node joins the cluster, all nodes are peers

 

58. What is role of Drivers in Node joining

Drivers intelligently choose which node would best coordinate a request 

Token AwarePolicy – Driver chooses node which contains the data

RoundRobinPolicy – Driver round robins the ring 

DCAwareRoundRobinPolicy – Driver round robins the target data center 

 

59. Why Cassandra 

Peers instead of master/slave

Linear scale performance 

Always on reliability 

Data can be stored geographically close to clients

 

60. What is VNodes

Each nodes has several tokens it manages 

Adding/removing nodes with vnodes should not make the cluster unbalanced 

By default, each node has 256 vnodes 

VNodes automate token range assignment

Configure vnode settings in Cassandra.yaml

Num_tokens     Value greater than one turns on vnodes

 

61. What is Gossip 

Cluster Metadata

 

62. How to Choosing a Gossip Node 

Each Node initiates a gossip round every few seconds 

Picks one to three nodes to gossip with 

Nodes can gossip with ANY other node in cluster 

Probabilistically (slight favor) seed and downed nodes 

Nodes do not track which nodes they gossiped with prior 

Reliably and efficient spread node metadata through the cluster 

Fault tolerant – continues to spread when nodes fail 

 

63. What is Snitch 

A snitch determines which data centers and racks nodes belong to. They inform Cassandra about the network topology so that requests are routed efficiently and allows Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places the replicas based on the information provided by the new snitch. All nodes must return to the same rack and data center. Cassandra does its best not to have more than one replica on the same rack.

The topology of the cluster 

Determines / declares each node’s rack and data center 

Several different types of snitches 

Configured in Cassandra.yaml

Endpoint_snitch: simpleSnitch

 

64. What is simple snitch ?

Places all nodes in the same data center and rack

Default snitch

 

65. Property File snitch

Reads  datacenter and rack information for all nodes from a file

You must keep files in sync with all nodes in the cluster

Cassaandra-topology.properties file

175.56.12.105=DC1:RAC1

 

66. Gossiping property File Snitch

Relieves the pain of the property file snitch 

Declare the current node’s DC / Rack information in a file 

 You must set each individual node’s settings

But you don’t have to copy settings as with property file snitch 

Gossip spreads the setting through the cluster

Cassandra-rackdc.properties file

dc=DC1

rack=RAC1

 

67. Rack inferring Snitch 

Infers the rack and DC from the IP address 

110.100.200.105

100 – Datacenter

200- Rack 

105 - Node

 

68. Dynamic Snitch ?

Layered on top of your actual snitch 

Maintains a pulse on each node’s performance

Determines which node to query replicas from depending on node health

Turned on by default for all snitches 

 

69. How to configuring snitches 

All nodes in the cluster must use the same snitch

Changing cluster network topology requires restarting all nodes 

Run sequential repair and cleanup on each node  

 

70. Snitch is used to...

Determine/declare each node's rack and data center.

 

71. Snitch is configured in the cassandra.yaml file.

True

 

72. Which is the following is *not* a type of snitch?

            SimpleSnitch

            PropertyFileSnitch

            DynamicSnitch

Ans -   CassandraSnitch

 

73. A replication factor of three means that Cassandra will store a total of four copies: the master and three copies.

Flase

 

74. A replication factor greater than one...

            widens the range of token values a single node is responsible for.

            causes overlap in the token ranges amongst nodes.

            requires more storage in your cluster.

 

75. Where does Cassandra reside in the CAP theorem?

availability/partition tolerance

 

76. With a replication factor of two, how many nodes must respond with success using consistency level quorum to indicate a successful operation?

2

 

77. How configure hinted handoff in Cassandra ?

Cassandra.yaml

You can disable hinted handoff

Set the amount of time a node will store a hint 

Default is three hours 

Consistency level of ANY means storing a hint suffices

Consistency level of ONE or more means at least one replica must successfully write 

Hint does not suffice 

 

78. The default time for a node to store a hint is:

3 hours

 

79. Hinted handoff is disabled by default.

False

 

80. Mention when you can use Alter keyspace?

ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_write of a keyspace.

 

Friday, 16 November 2018

Interview Q and A for Greenplum DB Part - 2

111. What is gpcrondump and how to work ?
Answer:    The Greenplum Database parallel dump utility gpcrondump backs up the Greenplum master instance and each active segment instance at the same time.
By default, gpcrondump creates dump files in the gp_dump subdirectory.
Several dump files are created for the master, containing database information such as DDL statements, the Greenplum system catalog tables, and metadata files.
gpcrondump creates one dump file for each segment, which contains commands to recreate the data on that segment.
You can perform full or incremental backups. To restore a database to its state when an incremental backup was made, you will need to restore the previous full backup and all subsequent incremental backups.
Each file created for a backup begins with a 14-digit timestamp key that identifies the backup set the file belongs to.
gpcrondump can be run directly in a terminal on the master host, or you can add it to crontab on the master host to schedule regular backups.
To run a full backup, use “gpcrondump -x database -u /path/for/backup -a”. This will backup the entire database to the directory given without prompting the user.

$ gpcrondump -x tutorial -u /tmp -a -r

112. Is it Indexes useful for performance in greenplum ?

Answer:    Greenplum Database does not depend upon indexes to the same degree as conventional data warehouse systems. Because the segments execute table scans in parallel, each segment scanning a small segment of the table, the traditional performance advantage from indexes is diminished.
Indexes consume large amounts of space and require considerable CPU time to compute during data loads. There are, however, times when indexes are useful, especially for highly selective queries.
When a query looks up a single row, an index can dramatically improve performance.

113. What is Row vs. column orientation ?

Answer:    Greenplum Database offers the ability to store a table in either row or column orientation. Both storage options have advantages, depending upon data compression characteristics, the kinds of queries executed, the row length, and the complexity and number of join columns.
As a general rule, very wide tables are better stored in row orientation, especially if there are joins on many columns.
Column orientation works well to save space with compression and to reduce I/O when there is much duplicated data in columns.

CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true, orientation=column)

114. What is partitioning  a table in greenplum?

Answer:    Partitioning a table can improve query performance and simplify data administration. The table is divided into smaller child files using a range or a list value, such as a date range or a country code.
Partitions can improve query performance dramatically. When a query predicate filters on the same criteria used to define partitions, the optimizer can avoid searching partitions that do not contain relevant data.
Using the ALTER TABLE statement, an existing partition can be dropped by removing its child file. This is much more efficient than scanning the entire table and removing rows with a DELETE statement.
Partitions may also be subpartitioned. For example, a table could be partitioned by month, and the month partitions could be subpartitioned by week. Greenplum Database creates child files for the months and weeks. The actual data, however, is stored in the child files created for the week subpartitions—only child files at the leaf level hold data.
When a new partition is added, you can run ANALYZE on just the data in that partition. ANALYZE can run on the root partition (the name of the table in the CREATE TABLE statement) or on a child file created for a leaf partition. If ANALYZE has already run on the other partitions and the data is static, it is not necessary to run it again on those partitions.
Greenplum Database supports:
  • Range partitioning: division of data based on a numerical range, such as date or price.
  • List partitioning: division of data based on a list of values, such as sales territory or product line.
  • A combination of both types.

115. Explain Query Planning and Dispatch  ?

Answer:    The master receives, parses, and optimizes the query. The resulting query plan is either parallel or targeted. The master dispatches parallel query plans to all segments. Each segment is responsible for executing local database operations on its own set of data.query plans.
Most database operations—such as table scans, joins, aggregations, and sorts—execute across all segments in parallel. Each operation is performed on a segment database independent of the data stored in the other segment databases.

Understanding Greenplum Query Plans

A query plan is the set of operations Greenplum Database will perform to produce the answer to a query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation, or sort. Plans are read and executed from bottom to top.
In addition to common database operations such as tables scans, joins, and so on, Greenplum Database has an additional operation type called motion. A motion operation involves moving tuples between the segments during query processing.
To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.

116. What is  Parallel Query Execution and how work ?

Answer:    Greenplum creates a number of database processes to handle the work of a query. On the master, the query worker process is called the query dispatcher (QD). The QD is responsible for creating and dispatching the query plan. It also accumulates and presents the final results. On the segments, a query worker process is called a query executor (QE). A QE is responsible for completing its portion of work and communicating its intermediate results to the other worker processes.
There is at least one worker process assigned to each slice of the query plan. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel.
Related processes that are working on the same slice of the query plan but on different segments are called gangs. As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between the segments is referred to as the interconnect component of Greenplum Database.

117. What is MVCC ?

Answer:    Greenplum uses Multiversion Concurrency Control (MVCC) to guarantee isolation, one of the ACID properties of relational databases. MVCC enables multiple users of the database to obtain consistent results for a query, even if the data is changing as the query is being executed. There can be multiple versions of rows in the database, but a query sees a snapshot of the database at a single point in time, containing only the versions of rows that were valid at that point in time. When a row is updated or deleted and no active transactions continue to reference it, it can be removed.

118. What is VACUUM , VACUUM FULL & ANALYZE ?
Answer:    The VACUUM command removes older versions that are no longer needed, leaving free space that can be reused. In a Greenplum database, normal OLTP operations do not create the need for vacuuming out old rows, but loading data while tables are in use may. It is a best practice to VACUUM a table after a load. If the table is partitioned, and only a single partition is being altered, then a VACUUM on that partition may suffice. The VACUUM FULL command behaves much differently than VACUUM, and its use is not recommended in Greenplum databases. It can be expensive in CPU and I/O, cause bloat in indexes, and lock data for long periods of time. The ANALYZE command generates statistics about the distribution of data in a table. In particular it stores histograms about the values in each of the columns. The query optimizer depends on these statistics to select the best plan for executing a query. For example, the optimizer can use distribution data to decide on join orders. One of the optimizer’s goals in a join is to minimize the volume of data that must be analyzed and potentially moved between segments by using the statistics to choose the smallest result set to work with first.

119. What is explain plans ?

Answer:    An explain plan explains the method the optimizer has chosen to produce a result set. Depending on the query, there can be a variety of methods to produce a result set. The optimizer calculates the cost for each method and chooses the one with the lowest cost. In large queries, cost is generally measured by the amount of I/O to be performed.
An explain plan does not do any actual query processing work. Explain plans use statistics generated by the ANALYZE command, so plans generated before and after running ANALYZE can be quite different. This is especially true for queries with multiple joins, because the order of the joins can have a tremendous impact on performance.

120. How to Changing optimizers ?

Answer:    You may see “legacy query optimizer” listed in the EXPLAIN output under “Optimizer status.”
  1. Check whether the Pivotal Query Optimizer is enabled.
$ gpconfig -s optimizer
Values on all segments are consistent
GUC          : optimizer
Master  value: off
Segment value: off
  1. Enable the Pivotal Query Optimizer
$ gpconfig -c optimizer -v on --masteronly
 20151201:09:08:31:172854 gpconfig:gpdb-sandbox:gpadmin-[INFO]:-completed successfully
  1. Reload the configuration on master and segment instances.
$ gpstop -u

121. How to Load external data into Greenplum Database ?
Answer:    Loading external data into Greenplum Database tables can be accomplished in different ways. We will use three methods to load the FAA data:
  • The simplest data loading method is the SQL INSERT statement. You can execute INSERT statements directly with psql or another interactive client, run a script containing INSERT statements, or run a client application with a database connection. This is the least efficient method for loading large volumes of data and should be used only for small amounts of data.
  • You can use the COPY command to load the data into a table when the data is in external text files. The COPY command syntax allows you to define the format of the text file so the data can be parsed into rows and columns. This is faster than INSERT statements but, like INSERT statements, it is not a parallel process.
    The SQL COPY command requires that external files be accessible to the host where the master process is running. On a multi-node Greenplum Database system, the data files may be on a file system that is not accessible to the master process. In this case, you can use the psql \copy meta-command, which streams the data to the server over the psql connection.
  • You can use a pair of Greenplum utilities, gpfdist and gpload, to load external data into tables at high data transfer rates. In a large scale, multi-terabyte data warehouse, large amounts of data must be loaded within a relatively small maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. Administrators can also load external tables in single row error isolation mode to filter bad rows into a separate error table while continuing to load properly formatted rows. Administrators can specify an error threshold for a load operation to control how many improperly formatted rows cause Greenplum to abort the load operation.
By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system.
Another Greenplum utility, gpload, runs a load task that you specify in a YAML-formatted control file. You describe the source data locations, format, transformations required, participating hosts, database destinations, and other particulars in the control file and gpload executes the load. This allows you to describe a complex task and execute it in a controlled, repeatable fashion.


122. What is steps to Load data with gpfdist ?
Answer:    For the FAA fact table, we will use an ETL (Extract, Transform, Load) process to load data from the source gzip files into a loading table, and then insert the data into a query and reporting table. For the best load speed, use the gpfdist Greenplum utility to distribute the rows to the segments. In a production system, gpfdist runs on the servers where the data is located. With a single-node Greenplum Database instance, there is only one host, and you run gpdist on it. Starting gpfdist is like starting a file server; there is no data movement until a request is made on the process.
Note: This exercise loads data using the Greenplum Database external table feature to move data from external data files into the database. Moving data between the database and external tables is a security consideration, so only superusers are permitted to use the feature. Therefore, you will run this exercise as the gpadmin database user.
  1. Execute gpfdist. Use the –d switch to set the “home” directory used to search for files in the faa directory. Use the –p switch to set the port and background the process.
$ gpfdist -d ~/gpdb-sandbox-tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
  1. Check that gpfdist is running with the ps command:
$ ps -A | grep gpfdist
  1. View the contents of the gpfdist log.
more /tmp/gpfdist.log
  1. Start a psql session as gpadmin and execute the create_load_tables.sql script. This script creates two tables: the faa_otp_load table, into which gpdist will load the data, and the faa_load_errors table, where load errors will be logged. (The faa_load_errors table may already exist. Ignore the error message.) The faa_otp_load table is structured to match the format of the input data from the FAA Web site.

$ psql -U gpadmin tutorial
tutorial=# \i create_load_tables.sql
CREATE TABLE
CREATE TABLE
Create an external table definition with the same structure as the faa_otp_load table.
tutorial=# \i create_ext_table.sql
psql:create_ext_table.sql:5: NOTICE:  HEADER means that each one of the data files has a header row.
CREATE EXTERNAL TABLE
This is a pure metadata operation. No data has moved from the data files on the host to the database yet. The external table definition references files in the faa directory that match the pattern otp*.gz. There are two matching files, one containing data for December 2009, the other for January 2010.
Move data from the external table to the faa_otp_load table.
tutorial=#  INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;
NOTICE:  Found 26526 data formatting errors (26526 or more input rows).
Rejected related input data.
INSERT 0 1024552
Greenplum moves data from the gzip files into the load table in the database. In a production environment, you could have many gpfdist processes running, one on each host or several on one host, each on a separate port number.
Examine the errors briefly. (The \x on psql meta-command changes the display of the results to one line per column, which is easier to read for some result sets.)
tutorial=# \x
Expanded display is on.
tutorial=# SELECT DISTINCT relname, errmsg, count(*)
FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg  | invalid input syntax for integer: "", column deptime
count   | 26526
Exit the psql shell:
tutorial=# \q

123. How to Load data with gpload ?
Answer:    Greenplum provides a wrapper program for gpfdist called gpload that does much of the work of setting up the external table and the data movement. In this exercise, you reload the faa_otp_load table using the gpload utility.
1.         Since gpload executes gpfdist, you must first kill the gpfdist process you started in the previous exercise.
            [gpadmin@gpdb-sandbox faa]$ ps -A | grep gpfdist
            4035 pts/0    00:00:02 gpfdist
Your process id will not be the same, so kill the appropriate one with the kill command, or just use the simpler killall command:
  gpadmin@gpdb-sandbox faa]$ killall gpfdist
  [1]+  Exit 1    gpfdist -d $HOME/gpdb-sandbox-tutorials/faa -p   8081 > /tmp/  gpfdist.log 2>&1
2.         Edit and customize the gpload.yaml input file. Be sure to set the correct path to the faa directory. Notice the TRUNCATE: true preload instruction ensures that the data loaded in the previous exercise will be removed before the load in this exercise starts.
vi gpload.yaml
---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: localhost
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gpdb-sandbox
         PORT: 8081
         FILE:
           - /Users/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true
3.         Execute gpload with the gpload.yaml input file. (Include the -v flag if you want to see details of the loading process.)
$ gpload -f gpload.yaml -l gpload.log
2015-10-21 15:05:39|INFO|gpload session started 2015-10-21 15:05:39
2015-10-21 15:05:39|INFO|started gpfdist -p 8081 -P 8082 -f "/home/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz" -t 30
2015-10-21 15:05:58|WARN|26528 bad rows
2015-10-21 15:05:58|INFO|running time: 18.64 seconds
2015-10-21 15:05:58|INFO|rows Inserted          = 1024552
2015-10-21 15:05:58|INFO|rows Updated           = 0
2015-10-21 15:05:58|INFO|data formatting errors = 0
2015-10-21 15:05:58|INFO|gpload succeeded with warnings

124. How to perform Data loading ?

Answer:    The ability to load billions of rows quickly into the Greenplum database is one of its key features. Using “Extract, Load and Transform” (ELT) allows load processes to make use of the massive parallelism of the Greenplum system by staging the data (perhaps just the use of external tables) and then applying data transformations within Greenplum Database. Set-based operations can be done in parallel, maximizing performance.
With other loading mechanisms such as COPY, data is loaded through the master in a single process. This does not take advantage of the parallel processing power of the Greenplum segments. External tables provide a means of leveraging the parallel processing power of the segments for data loading. Also, unlike other loading mechanisms, you can access multiple data sources with one SELECT of an external table.
External tables make static data available inside the database. External tables can be defined with file:// or gpfdist:// protocols. gpfdist is a file server program that loads files in parallel. Since the data is static, external tables can be rescanned during a query execution.
External Web tables allow http:// protocol or an EXECUTE clause to execute an operating system command or script. That data is assumed to be dynamic—query plans involving Web tables do not allow rescanning because the data could change during query execution. Execution plans may be slower, as data must be materialized (I/O) if it cannot fit in memory.
The script or process to populate a table with external Web tables may be executed on every segment host. It is possible, therefore, to have duplication of data. This is something to be aware of and check for when using Web tables, particularly with SQL extract calls to another database.

125. What is Greenplum Database ?
Answer:    Greenplum Database is a massively parallel processing (MPP) database server with an architectur specially designed to manage large-scale analytic data warehouses and business intelligence workloads.

MPP (also known as a shared nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system and disks. Greenplum uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and can use all of a system’s resources in parallel to process a query.
Greenplum Database is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS). It is based on PostgreSQL 8.2.15, and in most cases is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality. Database users interact with Greenplum Database as they would a regular PostgreSQL DBMS.
The internals of PostgreSQL have been modified or supplemented to support the parallel structure of Greenplum Database. For example, the system catalog, optimizer, query executor, and transaction manager components have been modified and enhanced to be able to execute queries simultaneously across all of the parallel PostgreSQL database instances. The Greenplum interconnect (the networking layer) enables communication between the distinct PostgreSQL instances and allows the system to behave as one logical database.
Greenplum Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, Greenplum has added parallel data loading (external tables), resource management, query optimizations, and storage enhancements, which are not found in standard PostgreSQL.

126. Explain Greenplum Database Architecture?
Answer:    Greenplum Database stores and processes large amounts of data by distributing the data and processing workload across several servers or hosts. Greenplum Database is an array of individual databases based upon PostgreSQL 8.2 working together to present a single database image. The master is the entry point to the Greenplum Database system. It is the database instance to which clients connect and submit SQL statements. The master coordinates its work with the other database instances in the system, called segments, which store and process the data.
Figure 1. High-Level view of the Greenplum Database Architecture

Greenplum Master

The Greenplum Database master is the entry to the Greenplum Database system, accepting client connections and SQL queries, and distributing work to the segment instances.
Greenplum Database end-users interact with Greenplum Database (through the master) as they would with a typical PostgreSQL database. They connect to the database using client programs such as psql or application programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides. The global system catalog is the set of system tables that contain metadata about the Greenplum Database system itself. The master does not contain any user data; data resides only on the segments. The master authenticates client connections, processes incoming SQL commands, distributes workloads among segments, coordinates the results returned by each segment, and presents the final results to the client program.

Greenplum Segments

Greenplum Database segment instances are independent PostgreSQL databases that each store a portion of the data and perform the majority of query processing.
When a user connects to the database via the Greenplum master and issues a query, processes are created in each segment database to handle the work of that query.
User-defined tables and their indexes are distributed across the available segments in a Greenplum Database system; each segment contains a distinct portion of data. The database server processes that serve segment data run under the corresponding segment instances. Users interact with segments in a Greenplum Database system through the master.
Segments run on a servers called segment hosts. A segment host typically executes from two to eight Greenplum segments, depending on the CPU cores, RAM, storage, network interfaces, and workloads. Segment hosts are expected to be identically configured. The key to obtaining the best performance from Greenplum Database is to distribute data and workloads evenly across a large number of equally capable segments so that all segments begin working on a task simultaneously and complete their work at the same time.

Greenplum Interconnect

The interconnect is the networking layer of the Greenplum Database architecture.
The interconnect refers to the inter-process communication between segments and the network infrastructure on which this communication relies. The Greenplum interconnect uses a standard 10-Gigabit Ethernet switching fabric.
By default, the interconnect uses User Datagram Protocol (UDP) to send messages over the network. The Greenplum software performs packet verification beyond what is provided by UDP. This means the reliability is equivalent to Transmission Control Protocol (TCP), and the performance and scalability exceeds TCP. If the interconnect used TCP, Greenplum Database would have a scalability limit of 1000 segment instances. With UDP as the current default protocol for the interconnect, this limit is not applicable.

127. What is Pivotal Query Optimizer ?
Answer:    The Pivotal Query Optimizer brings a state of the art query optimization framework to Greenplum Database that is distinguished from other optimizers in several ways:
  • Modularity. Pivotal Query Optimizer is not confined inside a single RDBMS. It is currently leveraged in both Greenplum Database and Pivotal HAWQ, but it can also be run as a standalone component to allow greater flexibility in adopting new backend systems and using the optimizer as a service. This also enables elaborate testing of the optimizer without going through the other components of the database stack.
  • Extensibility. The Pivotal Query Optimizer has been designed as a collection of independent components that can be replaced, configured, or extended separately. This significantly reduces the development costs of adding new features, and also allows rapid adoption of emerging technologies. Within the Query Optimizer, the representation of the elements of a query has been separated from how the query is optimized. This lets the optimizer treat all elements equally and avoids the issues with the imposed order of optimizations steps of multi-phase optimizers.
  • Performance. The Pivotal Query Optimizer leverages a multi-core scheduler that can distribute individual optimization tasks across multiple cores to speed up the optimization process. This allows the Query Optimizer to apply all possible optimizations as the same time, which results in many more plan alternatives and a wider range of queries that can be optimized. For instance, when the Pivotal Query Optimizer was used with TPC-H Query 21 it generated 1.2 Billion possible plans in 250 ms. This is especially important in Big Data Analytics where performance challenges are magnified by the volume of data that needs to be processed. A suboptimal optimization choice could very well lead to a query that just runs forever.

128. What is Users and Roles ?

Answer:    Greenplum Database manages database access using roles. Initially, there is one superuser role—the role associated with the OS user who initialized the database instance, usually gpadmin. This user owns all of the Greenplum Database files and OS processes, so it is important to reserve the gpadmin role for system tasks only.
A role can be a user or a group. A user role can log in to a database; that is, it has the LOGIN attribute. A user or group role can become a member of a group.
Permissions can be granted to users or groups. Initially, of course, only the gpadmin role is able to create roles. You can add roles with the createuser utility command, CREATE ROLE SQL command, or the CREATE USER SQL command. The CREATE USER command is the same as the CREATE ROLE command except that it automatically assigns the role the LOGIN attribute.

129. How to Create and Prepare Database ?

Answer:    Create a new database with the CREATE DATABASE SQL command in psql or the createdb utility command in a terminal. The new database is a copy of the template1 database, unless you specify a different template. To use the CREATE DATABASE command, you must be connected to a database. With a newly installed Greenplum Database system, you can connect to the template1 database to create your first user database. The createdb utility, entered at a shell prompt, is a wrapper around the CREATE DATABASE command.

130. How to Create a schema and set a search path ?

Answer:    A database schema is a named container for a set of database objects, including tables, data types, and functions. A database can have multiple schemas. Objects within the schema are referenced by prefixing the object name with the schema name, separated with a period. For example, the person table in the employee schema is written employee.person.
The schema provides a namespace for the objects it contains. If the database is used for multiple applications, each with its own schema, the same table name can be used in each schema employee.person is a different table than customer.person. Both tables could be accessed in the same query as long as they are qualified with the schema name.
The database contains a schema search path, which is a list of schemas to search for objects names that are not qualified with a schema name. The first schema in the search path is also the schema where new objects are created when no schema is specified. The default search path is user,public, so by default, each object you create belongs to a schema associated with your login name. In this exercise, you create an faa schema and set the search path so that it is the default schema.
1.         Change to the directory containing the FAA data and scripts:
$ cd ~/gpdb-sandbox-tutorials/faa
2.         Connect to the tutorial database with psql:
$ psql -U user1 tutorial
3.         Create the faa schema:
tutorial=# DROP SCHEMA IF EXISTS faa CASCADE;
tutorial=# CREATE SCHEMA faa;
4.         Add the faa schema to the search path:
tutorial=# SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;
5.         View the search path:
            tutorial=# SHOW search_path;
             search_path
            -------------------------------------
             faa, public, pg_catalog, gp_toolkit
            (1 row)
            The search path you set above is not persistent; you have to set it each time you connect to the database. You can associate a search path with the user role by using the ALTER ROLE command, so that each time you connect to the database with that role, the search path is restored:
tutorial=# ALTER ROLE user1 SET search_path TO faa, public, pg_catalog, gp_toolkit;

 

131. What is Greenplum data distribution policy

Answer:    The definition of a table includes the distribution policy for the data, which has great bearing on system performance. The goals for the distribution policy are to:
  • distribute the volume of data and query execution work evenly among the segments, and to
  • enable segments to accomplish the most expensive query processing steps locally.
The distribution policy determines how data is distributed among the segments. Defining an effective distribution policy requires an understanding of the data’s characteristics, the kinds of queries that will be run once the data is loaded into the database, and what distribution strategies best utilize the parallel execution capacity of the segments.
Use the DISTRIBUTED clause of the CREATE TABLE statement to define the distribution policy for a table. Ideally, each segment will store an equal volume of data and perform an equal share of work when processing queries. There are two kinds of distribution policies:
  • DISTRIBUTED BY (column, …) defines a distribution key from one or more columns. A hash function applied to the distribution key determines which segment stores the row. Rows that have the same distribution key are stored on the same segment. If the distribution keys are unique, the hash function ensures the data is distributed evenly. The default distribution policy is a hash on the primary key of the table, or the first column if no primary key is specified.
  • DISTRIBUTED RANDOMLY distributes rows in round-robin fashion among the segments.
When different tables are joined on the same columns that comprise the distribution key, the join can be accomplished at the segments, which is much faster than joining rows across segments. The random distribution policy makes this impossible, so it is best practice to define a distribution key that will optimize joins.

132.  Steps to migrate data from Oracle to Greenplum

Answer:   Step1 : Get the information about source schema.
Step2 : Generate DDL for Greenplum schema from Oracle schema
Step3: Generate CSV data dump for oracle tables.
Step 4: Load the database using GPFDist
step 5: Validate the data

§  Test Oracle Database Connectivity
§  Oracle Database Information Report
§  Oracle Table Rows Count Report
§  Oracle Table Checksum Report
§  Generate Greenplum Schema Table DDL corresponding to Oracle Schema
§  Generate Greenplum External Table DDL corresponding to Oracle Schema
§  Generate Load data insert table scripts to insert data into Greenplum table
§  Generate Select count DML scripts to count no of rows in greenplum internal and external tables
§  Export Oracle Table Data in CSV Format consumed by Greenplum External Table
§  **Export very large partitioned tables data in parallel and store in different location
§  **Generate External table DDL of large partitioned tables
§  Test Greenplum Database Connectivity
§  Create table in Greenplum using DDL generated from option 5
§  Create external table in Greenplum using DDL generated from option 6 or option 10
§  Load Data in Greenplum
§  Generate table counts DML script
§  Create Checksum Report of Migrated data in Greenplum
§  Compare Oracle and Greenplum Checksum Report
§  Schema Manager - Rebuild, Upgrade, Clone Schema
§  Users and Roles Management - Create/Modify/Delete users
§  Grants and Privileges - Database Access Control
§  Call a calculator
§  Display files in user directory
§  Display Disk Usage
§  Display CPU Utilization
§  Display Memory Utilization
§  Find and Replace
§  Find files for a particular date (example APR 19)

§  List files in MB size