Friday 16 November 2018

Interview Q and A for Greenplum DB Part - 2

111. What is gpcrondump and how does it work?
Answer:    The Greenplum Database parallel dump utility gpcrondump backs up the Greenplum master instance and each active segment instance at the same time.
By default, gpcrondump creates dump files in the db_dumps subdirectory of the master and each segment instance's data directory.
Several dump files are created for the master, containing database information such as DDL statements, the Greenplum system catalog tables, and metadata files.
gpcrondump creates one dump file for each segment, which contains commands to recreate the data on that segment.
You can perform full or incremental backups. To restore a database to its state when an incremental backup was made, you will need to restore the previous full backup and all subsequent incremental backups.
Each file created for a backup begins with a 14-digit timestamp key that identifies the backup set the file belongs to.
gpcrondump can be run directly in a terminal on the master host, or you can add it to crontab on the master host to schedule regular backups.
To run a full backup, use “gpcrondump -x database -u /path/for/backup -a”. This backs up the entire database to the given directory without prompting the user.

$ gpcrondump -x tutorial -u /tmp -a -r
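An incremental backup of the same database can be taken with the --incremental flag (a sketch; it assumes a prior full backup of the tutorial database already exists in the same backup location):

$ gpcrondump -x tutorial --incremental -u /tmp -a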

112. Are indexes useful for performance in Greenplum?

Answer:    Greenplum Database does not depend upon indexes to the same degree as conventional data warehouse systems. Because the segments execute table scans in parallel, each segment scanning a small portion of the table, the traditional performance advantage from indexes is diminished.
Indexes consume large amounts of space and require considerable CPU time to compute during data loads. There are, however, times when indexes are useful, especially for highly selective queries.
When a query looks up a single row, an index can dramatically improve performance.
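As a sketch (the index name and column are assumptions; index a column that your selective queries actually filter on), an index on the faa.otp_r table might be created and used like this:

CREATE INDEX otp_r_flightnum_idx ON faa.otp_r (flightnum);
SELECT * FROM faa.otp_r WHERE flightnum = 1234;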

113. What is row vs. column orientation?

Answer:    Greenplum Database offers the ability to store a table in either row or column orientation. Both storage options have advantages, depending upon data compression characteristics, the kinds of queries executed, the row length, and the complexity and number of join columns.
As a general rule, very wide tables are better stored in row orientation, especially if there are joins on many columns.
Column orientation works well to save space with compression and to reduce I/O when there is much duplicated data in columns.

CREATE TABLE faa.otp_c (LIKE faa.otp_r) WITH (appendonly=true, orientation=column);

114. What is partitioning a table in Greenplum?

Answer:    Partitioning a table can improve query performance and simplify data administration. The table is divided into smaller child files using a range or a list value, such as a date range or a country code.
Partitions can improve query performance dramatically. When a query predicate filters on the same criteria used to define partitions, the optimizer can avoid searching partitions that do not contain relevant data.
Using the ALTER TABLE statement, an existing partition can be dropped by removing its child file. This is much more efficient than scanning the entire table and removing rows with a DELETE statement.
Partitions may also be subpartitioned. For example, a table could be partitioned by month, and the month partitions could be subpartitioned by week. Greenplum Database creates child files for the months and weeks. The actual data, however, is stored in the child files created for the week subpartitions—only child files at the leaf level hold data.
When a new partition is added, you can run ANALYZE on just the data in that partition. ANALYZE can run on the root partition (the name of the table in the CREATE TABLE statement) or on a child file created for a leaf partition. If ANALYZE has already run on the other partitions and the data is static, it is not necessary to run it again on those partitions.
Greenplum Database supports:
  • Range partitioning: division of data based on a numerical range, such as date or price.
  • List partitioning: division of data based on a list of values, such as sales territory or product line.
  • A combination of both types.
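A minimal sketch of a range-partitioned table, modeled on the FAA tables used elsewhere in this tutorial (the flightdate column and the date boundaries are assumptions):

CREATE TABLE faa.otp_part (LIKE faa.otp_r)
PARTITION BY RANGE (flightdate)
( PARTITION mth START ('2009-12-01'::date) END ('2010-02-01'::date)
  EVERY ('1 month'::interval) );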

115. Explain query planning and dispatch?

Answer:    The master receives, parses, and optimizes the query. The resulting query plan is either parallel or targeted. The master dispatches parallel query plans to all segments and targeted query plans to a single segment. Each segment is responsible for executing local database operations on its own set of data.
Most database operations—such as table scans, joins, aggregations, and sorts—execute across all segments in parallel. Each operation is performed on a segment database independent of the data stored in the other segment databases.

Understanding Greenplum Query Plans

A query plan is the set of operations Greenplum Database will perform to produce the answer to a query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation, or sort. Plans are read and executed from bottom to top.
In addition to common database operations such as table scans, joins, and so on, Greenplum Database has an additional operation type called motion. A motion operation involves moving tuples between the segments during query processing.
To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.

116. What is parallel query execution and how does it work?

Answer:    Greenplum creates a number of database processes to handle the work of a query. On the master, the query worker process is called the query dispatcher (QD). The QD is responsible for creating and dispatching the query plan. It also accumulates and presents the final results. On the segments, a query worker process is called a query executor (QE). A QE is responsible for completing its portion of work and communicating its intermediate results to the other worker processes.
There is at least one worker process assigned to each slice of the query plan. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel.
Related processes that are working on the same slice of the query plan but on different segments are called gangs. As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between the segments is referred to as the interconnect component of Greenplum Database.

117. What is MVCC?

Answer:    Greenplum uses Multiversion Concurrency Control (MVCC) to guarantee isolation, one of the ACID properties of relational databases. MVCC enables multiple users of the database to obtain consistent results for a query, even if the data is changing as the query is being executed. There can be multiple versions of rows in the database, but a query sees a snapshot of the database at a single point in time, containing only the versions of rows that were valid at that point in time. When a row is updated or deleted and no active transactions continue to reference it, it can be removed.

118. What are VACUUM, VACUUM FULL & ANALYZE?
Answer:    The VACUUM command removes older row versions that are no longer needed, leaving free space that can be reused. In a Greenplum database, normal OLTP operations do not create the need for vacuuming out old rows, but loading data while tables are in use may. It is a best practice to VACUUM a table after a load. If the table is partitioned, and only a single partition is being altered, then a VACUUM on that partition may suffice.
The VACUUM FULL command behaves much differently than VACUUM, and its use is not recommended in Greenplum databases. It can be expensive in CPU and I/O, cause bloat in indexes, and lock data for long periods of time.
The ANALYZE command generates statistics about the distribution of data in a table. In particular, it stores histograms of the values in each of the columns. The query optimizer depends on these statistics to select the best plan for executing a query. For example, the optimizer can use distribution data to decide on join orders. One of the optimizer’s goals in a join is to minimize the volume of data that must be analyzed and potentially moved between segments by using the statistics to choose the smallest result set to work with first.
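For example, after a large load into the faa_otp_load table used later in these exercises, you might run (a sketch; substitute your own table):

tutorial=# VACUUM faa.faa_otp_load;
tutorial=# ANALYZE faa.faa_otp_load;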

119. What are explain plans?

Answer:    An explain plan explains the method the optimizer has chosen to produce a result set. Depending on the query, there can be a variety of methods to produce a result set. The optimizer calculates the cost for each method and chooses the one with the lowest cost. In large queries, cost is generally measured by the amount of I/O to be performed.
An explain plan does not do any actual query processing work. Explain plans use statistics generated by the ANALYZE command, so plans generated before and after running ANALYZE can be quite different. This is especially true for queries with multiple joins, because the order of the joins can have a tremendous impact on performance.
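To see a plan without executing the query, prefix the statement with EXPLAIN. A minimal sketch against the faa_otp_load table used later in these exercises (the actual plan output depends on your data and statistics; motion nodes in the output mark the slice boundaries discussed earlier):

tutorial=# ANALYZE faa.faa_otp_load;
tutorial=# EXPLAIN SELECT count(*) FROM faa.faa_otp_load;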

120. How to change optimizers?

Answer:    You may see “legacy query optimizer” listed in the EXPLAIN output under “Optimizer status.” To switch to the Pivotal Query Optimizer, follow these steps:
  1. Check whether the Pivotal Query Optimizer is enabled.
$ gpconfig -s optimizer
Values on all segments are consistent
GUC          : optimizer
Master  value: off
Segment value: off
  2. Enable the Pivotal Query Optimizer.
$ gpconfig -c optimizer -v on --masteronly
 20151201:09:08:31:172854 gpconfig:gpdb-sandbox:gpadmin-[INFO]:-completed successfully
  3. Reload the configuration on master and segment instances.
$ gpstop -u
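You can then repeat step 1 to confirm the change; after gpstop -u, the master value reported by gpconfig should be on:

$ gpconfig -s optimizer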

121. How to load external data into Greenplum Database?
Answer:    Loading external data into Greenplum Database tables can be accomplished in different ways. We will use three methods to load the FAA data:
  • The simplest data loading method is the SQL INSERT statement. You can execute INSERT statements directly with psql or another interactive client, run a script containing INSERT statements, or run a client application with a database connection. This is the least efficient method for loading large volumes of data and should be used only for small amounts of data.
  • You can use the COPY command to load the data into a table when the data is in external text files. The COPY command syntax allows you to define the format of the text file so the data can be parsed into rows and columns. This is faster than INSERT statements but, like INSERT statements, it is not a parallel process.
    The SQL COPY command requires that external files be accessible to the host where the master process is running. On a multi-node Greenplum Database system, the data files may be on a file system that is not accessible to the master process. In this case, you can use the psql \copy meta-command, which streams the data to the server over the psql connection.
  • You can use a pair of Greenplum utilities, gpfdist and gpload, to load external data into tables at high data transfer rates. In a large scale, multi-terabyte data warehouse, large amounts of data must be loaded within a relatively small maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. Administrators can also load external tables in single row error isolation mode to filter bad rows into a separate error table while continuing to load properly formatted rows. Administrators can specify an error threshold for a load operation to control how many improperly formatted rows cause Greenplum to abort the load operation.
By using external tables in conjunction with Greenplum Database’s parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system.
Another Greenplum utility, gpload, runs a load task that you specify in a YAML-formatted control file. You describe the source data locations, format, transformations required, participating hosts, database destinations, and other particulars in the control file and gpload executes the load. This allows you to describe a complex task and execute it in a controlled, repeatable fashion.
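As a small sketch of the COPY approach (the sample file path is hypothetical; the faa.faa_otp_load table is created in the next exercise), the server-side command and its client-side \copy equivalent look like this:

tutorial=# COPY faa.faa_otp_load FROM '/home/gpadmin/otp_sample.csv' CSV HEADER;
tutorial=# \copy faa.faa_otp_load from '/home/gpadmin/otp_sample.csv' csv header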


122. What are the steps to load data with gpfdist?
Answer:    For the FAA fact table, we will use an ETL (Extract, Transform, Load) process to load data from the source gzip files into a loading table, and then insert the data into a query and reporting table. For the best load speed, use the gpfdist Greenplum utility to distribute the rows to the segments. In a production system, gpfdist runs on the servers where the data is located. With a single-node Greenplum Database instance, there is only one host, and you run gpfdist on it. Starting gpfdist is like starting a file server; there is no data movement until a request is made on the process.
Note: This exercise loads data using the Greenplum Database external table feature to move data from external data files into the database. Moving data between the database and external tables is a security consideration, so only superusers are permitted to use the feature. Therefore, you will run this exercise as the gpadmin database user.
  1. Execute gpfdist. Use the -d switch to set the “home” directory used to search for files in the faa directory. Use the -p switch to set the port, and background the process.
$ gpfdist -d ~/gpdb-sandbox-tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
  2. Check that gpfdist is running with the ps command:
$ ps -A | grep gpfdist
  3. View the contents of the gpfdist log:
$ more /tmp/gpfdist.log
  4. Start a psql session as gpadmin and execute the create_load_tables.sql script. This script creates two tables: the faa_otp_load table, into which gpfdist will load the data, and the faa_load_errors table, where load errors will be logged. (The faa_load_errors table may already exist. Ignore the error message.) The faa_otp_load table is structured to match the format of the input data from the FAA Web site.

$ psql -U gpadmin tutorial
tutorial=# \i create_load_tables.sql
CREATE TABLE
CREATE TABLE
  5. Create an external table definition with the same structure as the faa_otp_load table.
tutorial=# \i create_ext_table.sql
psql:create_ext_table.sql:5: NOTICE:  HEADER means that each one of the data files has a header row.
CREATE EXTERNAL TABLE
This is a pure metadata operation. No data has moved from the data files on the host to the database yet. The external table definition references files in the faa directory that match the pattern otp*.gz. There are two matching files, one containing data for December 2009, the other for January 2010.
  6. Move data from the external table to the faa_otp_load table.
tutorial=#  INSERT INTO faa.faa_otp_load SELECT * FROM faa.ext_load_otp;
NOTICE:  Found 26526 data formatting errors (26526 or more input rows).
Rejected related input data.
INSERT 0 1024552
Greenplum moves data from the gzip files into the load table in the database. In a production environment, you could have many gpfdist processes running, one on each host or several on one host, each on a separate port number.
  7. Examine the errors briefly. (The \x psql meta-command changes the display of the results to one line per column, which is easier to read for some result sets.)
tutorial=# \x
Expanded display is on.
tutorial=# SELECT DISTINCT relname, errmsg, count(*)
FROM faa.faa_load_errors GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg  | invalid input syntax for integer: "", column deptime
count   | 26526
  8. Exit the psql shell:
tutorial=# \q

123. How to load data with gpload?
Answer:    Greenplum provides a wrapper program for gpfdist called gpload that does much of the work of setting up the external table and the data movement. In this exercise, you reload the faa_otp_load table using the gpload utility.
1.         Since gpload executes gpfdist, you must first kill the gpfdist process you started in the previous exercise.
            [gpadmin@gpdb-sandbox faa]$ ps -A | grep gpfdist
            4035 pts/0    00:00:02 gpfdist
Your process id will not be the same, so kill the appropriate one with the kill command, or just use the simpler killall command:
  [gpadmin@gpdb-sandbox faa]$ killall gpfdist
  [1]+  Exit 1    gpfdist -d $HOME/gpdb-sandbox-tutorials/faa -p   8081 > /tmp/  gpfdist.log 2>&1
2.         Edit and customize the gpload.yaml input file. Be sure to set the correct path to the faa directory. Notice that the TRUNCATE: true preload instruction ensures that the data loaded in the previous exercise will be removed before the load in this exercise starts.
vi gpload.yaml
---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: localhost
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - gpdb-sandbox
         PORT: 8081
         FILE:
           - /home/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true
3.         Execute gpload with the gpload.yaml input file. (Include the -v flag if you want to see details of the loading process.)
$ gpload -f gpload.yaml -l gpload.log
2015-10-21 15:05:39|INFO|gpload session started 2015-10-21 15:05:39
2015-10-21 15:05:39|INFO|started gpfdist -p 8081 -P 8082 -f "/home/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz" -t 30
2015-10-21 15:05:58|WARN|26528 bad rows
2015-10-21 15:05:58|INFO|running time: 18.64 seconds
2015-10-21 15:05:58|INFO|rows Inserted          = 1024552
2015-10-21 15:05:58|INFO|rows Updated           = 0
2015-10-21 15:05:58|INFO|data formatting errors = 0
2015-10-21 15:05:58|INFO|gpload succeeded with warnings

124. How to perform data loading?

Answer:    The ability to load billions of rows quickly into the Greenplum database is one of its key features. Using “Extract, Load and Transform” (ELT) allows load processes to make use of the massive parallelism of the Greenplum system by staging the data (perhaps just the use of external tables) and then applying data transformations within Greenplum Database. Set-based operations can be done in parallel, maximizing performance.
With other loading mechanisms such as COPY, data is loaded through the master in a single process. This does not take advantage of the parallel processing power of the Greenplum segments. External tables provide a means of leveraging the parallel processing power of the segments for data loading. Also, unlike other loading mechanisms, you can access multiple data sources with one SELECT of an external table.
External tables make static data available inside the database. External tables can be defined with file:// or gpfdist:// protocols. gpfdist is a file server program that loads files in parallel. Since the data is static, external tables can be rescanned during a query execution.
External Web tables allow http:// protocol or an EXECUTE clause to execute an operating system command or script. That data is assumed to be dynamic—query plans involving Web tables do not allow rescanning because the data could change during query execution. Execution plans may be slower, as data must be materialized (I/O) if it cannot fit in memory.
The script or process to populate a table with external Web tables may be executed on every segment host. It is possible, therefore, to have duplication of data. This is something to be aware of and check for when using Web tables, particularly with SQL extract calls to another database.
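A minimal sketch of a readable external table served by gpfdist, similar to the ext_load_otp table used in the gpfdist exercise above (the table name, host, and port are assumptions; gpfdist must already be running and serving the files):

CREATE EXTERNAL TABLE faa.ext_otp_example (LIKE faa.faa_otp_load)
LOCATION ('gpfdist://gpdb-sandbox:8081/otp*.gz')
FORMAT 'csv' (HEADER);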

125. What is Greenplum Database?
Answer:    Greenplum Database is a massively parallel processing (MPP) database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads.

MPP (also known as a shared nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system and disks. Greenplum uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and can use all of a system’s resources in parallel to process a query.
Greenplum Database is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS). It is based on PostgreSQL 8.2.15, and in most cases is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality. Database users interact with Greenplum Database as they would a regular PostgreSQL DBMS.
The internals of PostgreSQL have been modified or supplemented to support the parallel structure of Greenplum Database. For example, the system catalog, optimizer, query executor, and transaction manager components have been modified and enhanced to be able to execute queries simultaneously across all of the parallel PostgreSQL database instances. The Greenplum interconnect (the networking layer) enables communication between the distinct PostgreSQL instances and allows the system to behave as one logical database.
Greenplum Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, Greenplum has added parallel data loading (external tables), resource management, query optimizations, and storage enhancements, which are not found in standard PostgreSQL.

126. Explain Greenplum Database Architecture?
Answer:    Greenplum Database stores and processes large amounts of data by distributing the data and processing workload across several servers or hosts. Greenplum Database is an array of individual databases based upon PostgreSQL 8.2 working together to present a single database image. The master is the entry point to the Greenplum Database system. It is the database instance to which clients connect and submit SQL statements. The master coordinates its work with the other database instances in the system, called segments, which store and process the data.
Figure 1. High-Level view of the Greenplum Database Architecture

Greenplum Master

The Greenplum Database master is the entry to the Greenplum Database system, accepting client connections and SQL queries, and distributing work to the segment instances.
Greenplum Database end-users interact with Greenplum Database (through the master) as they would with a typical PostgreSQL database. They connect to the database using client programs such as psql or application programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides. The global system catalog is the set of system tables that contain metadata about the Greenplum Database system itself. The master does not contain any user data; data resides only on the segments. The master authenticates client connections, processes incoming SQL commands, distributes workloads among segments, coordinates the results returned by each segment, and presents the final results to the client program.

Greenplum Segments

Greenplum Database segment instances are independent PostgreSQL databases that each store a portion of the data and perform the majority of query processing.
When a user connects to the database via the Greenplum master and issues a query, processes are created in each segment database to handle the work of that query.
User-defined tables and their indexes are distributed across the available segments in a Greenplum Database system; each segment contains a distinct portion of data. The database server processes that serve segment data run under the corresponding segment instances. Users interact with segments in a Greenplum Database system through the master.
Segments run on servers called segment hosts. A segment host typically executes from two to eight Greenplum segments, depending on the CPU cores, RAM, storage, network interfaces, and workloads. Segment hosts are expected to be identically configured. The key to obtaining the best performance from Greenplum Database is to distribute data and workloads evenly across a large number of equally capable segments so that all segments begin working on a task simultaneously and complete their work at the same time.

Greenplum Interconnect

The interconnect is the networking layer of the Greenplum Database architecture.
The interconnect refers to the inter-process communication between segments and the network infrastructure on which this communication relies. The Greenplum interconnect uses a standard 10-Gigabit Ethernet switching fabric.
By default, the interconnect uses User Datagram Protocol (UDP) to send messages over the network. The Greenplum software performs packet verification beyond what is provided by UDP. This means the reliability is equivalent to Transmission Control Protocol (TCP), and the performance and scalability exceed those of TCP. If the interconnect used TCP, Greenplum Database would have a scalability limit of 1000 segment instances. With UDP as the current default protocol for the interconnect, this limit is not applicable.

127. What is the Pivotal Query Optimizer?
Answer:    The Pivotal Query Optimizer brings a state-of-the-art query optimization framework to Greenplum Database that is distinguished from other optimizers in several ways:
  • Modularity. Pivotal Query Optimizer is not confined inside a single RDBMS. It is currently leveraged in both Greenplum Database and Pivotal HAWQ, but it can also be run as a standalone component to allow greater flexibility in adopting new backend systems and using the optimizer as a service. This also enables elaborate testing of the optimizer without going through the other components of the database stack.
  • Extensibility. The Pivotal Query Optimizer has been designed as a collection of independent components that can be replaced, configured, or extended separately. This significantly reduces the development costs of adding new features, and also allows rapid adoption of emerging technologies. Within the Query Optimizer, the representation of the elements of a query has been separated from how the query is optimized. This lets the optimizer treat all elements equally and avoids the issues with the imposed order of optimization steps of multi-phase optimizers.
  • Performance. The Pivotal Query Optimizer leverages a multi-core scheduler that can distribute individual optimization tasks across multiple cores to speed up the optimization process. This allows the Query Optimizer to apply all possible optimizations at the same time, which results in many more plan alternatives and a wider range of queries that can be optimized. For instance, when the Pivotal Query Optimizer was used with TPC-H Query 21 it generated 1.2 billion possible plans in 250 ms. This is especially important in Big Data Analytics where performance challenges are magnified by the volume of data that needs to be processed. A suboptimal optimization choice could very well lead to a query that just runs forever.

128. What are users and roles?

Answer:    Greenplum Database manages database access using roles. Initially, there is one superuser role—the role associated with the OS user who initialized the database instance, usually gpadmin. This user owns all of the Greenplum Database files and OS processes, so it is important to reserve the gpadmin role for system tasks only.
A role can be a user or a group. A user role can log in to a database; that is, it has the LOGIN attribute. A user or group role can become a member of a group.
Permissions can be granted to users or groups. Initially, of course, only the gpadmin role is able to create roles. You can add roles with the createuser utility command, CREATE ROLE SQL command, or the CREATE USER SQL command. The CREATE USER command is the same as the CREATE ROLE command except that it automatically assigns the role the LOGIN attribute.
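A small sketch (the role names and password are just examples; user1 matches the role used in the schema exercise below):

tutorial=# CREATE ROLE user1 WITH LOGIN PASSWORD 'changeme';
tutorial=# CREATE ROLE analysts;
tutorial=# GRANT analysts TO user1;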

129. How to create and prepare a database?

Answer:    Create a new database with the CREATE DATABASE SQL command in psql or the createdb utility command in a terminal. The new database is a copy of the template1 database, unless you specify a different template. To use the CREATE DATABASE command, you must be connected to a database. With a newly installed Greenplum Database system, you can connect to the template1 database to create your first user database. The createdb utility, entered at a shell prompt, is a wrapper around the CREATE DATABASE command.
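For example, either of the following creates the tutorial database used in these exercises (a sketch; run the first from a shell as gpadmin, the second from a psql session connected to template1):

$ createdb tutorial
template1=# CREATE DATABASE tutorial;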

130. How to create a schema and set a search path?

Answer:    A database schema is a named container for a set of database objects, including tables, data types, and functions. A database can have multiple schemas. Objects within the schema are referenced by prefixing the object name with the schema name, separated with a period. For example, the person table in the employee schema is written employee.person.
The schema provides a namespace for the objects it contains. If the database is used for multiple applications, each with its own schema, the same table name can be used in each schema: employee.person is a different table than customer.person. Both tables could be accessed in the same query as long as they are qualified with the schema name.
The database contains a schema search path, which is a list of schemas to search for object names that are not qualified with a schema name. The first schema in the search path is also the schema where new objects are created when no schema is specified. The default search path is "$user",public, so by default, each object you create belongs to a schema associated with your login name. In this exercise, you create an faa schema and set the search path so that it is the default schema.
1.         Change to the directory containing the FAA data and scripts:
$ cd ~/gpdb-sandbox-tutorials/faa
2.         Connect to the tutorial database with psql:
$ psql -U user1 tutorial
3.         Create the faa schema:
tutorial=# DROP SCHEMA IF EXISTS faa CASCADE;
tutorial=# CREATE SCHEMA faa;
4.         Add the faa schema to the search path:
tutorial=# SET SEARCH_PATH TO faa, public, pg_catalog, gp_toolkit;
5.         View the search path:
            tutorial=# SHOW search_path;
             search_path
            -------------------------------------
             faa, public, pg_catalog, gp_toolkit
            (1 row)
            The search path you set above is not persistent; you have to set it each time you connect to the database. You can associate a search path with the user role by using the ALTER ROLE command, so that each time you connect to the database with that role, the search path is restored:
tutorial=# ALTER ROLE user1 SET search_path TO faa, public, pg_catalog, gp_toolkit;

 

131. What is the Greenplum data distribution policy?

Answer:    The definition of a table includes the distribution policy for the data, which has great bearing on system performance. The goals for the distribution policy are to:
  • distribute the volume of data and query execution work evenly among the segments, and to
  • enable segments to accomplish the most expensive query processing steps locally.
The distribution policy determines how data is distributed among the segments. Defining an effective distribution policy requires an understanding of the data’s characteristics, the kinds of queries that will be run once the data is loaded into the database, and what distribution strategies best utilize the parallel execution capacity of the segments.
Use the DISTRIBUTED clause of the CREATE TABLE statement to define the distribution policy for a table. Ideally, each segment will store an equal volume of data and perform an equal share of work when processing queries. There are two kinds of distribution policies:
  • DISTRIBUTED BY (column, …) defines a distribution key from one or more columns. A hash function applied to the distribution key determines which segment stores the row. Rows that have the same distribution key are stored on the same segment. If the distribution keys are unique, the hash function ensures the data is distributed evenly. The default distribution policy is a hash on the primary key of the table, or the first column if no primary key is specified.
  • DISTRIBUTED RANDOMLY distributes rows in round-robin fashion among the segments.
When different tables are joined on the same columns that comprise the distribution key, the join can be accomplished at the segments, which is much faster than joining rows across segments. The random distribution policy makes this impossible, so it is best practice to define a distribution key that will optimize joins.
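A minimal sketch of both policies (the table and column names are assumptions):

CREATE TABLE faa.flights_example (
    flight_id   int,
    carrier     text,
    flight_date date
) DISTRIBUTED BY (flight_id);

CREATE TABLE faa.staging_example (LIKE faa.flights_example) DISTRIBUTED RANDOMLY;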

132.  Steps to migrate data from Oracle to Greenplum

Answer:   Step 1: Get information about the source schema.
Step 2: Generate DDL for the Greenplum schema from the Oracle schema.
Step 3: Generate CSV data dumps for the Oracle tables.
Step 4: Load the data using gpfdist.
Step 5: Validate the data.
A migration helper script typically exposes these steps as individual tasks, for example:

§  Test Oracle Database Connectivity
§  Oracle Database Information Report
§  Oracle Table Rows Count Report
§  Oracle Table Checksum Report
§  Generate Greenplum Schema Table DDL corresponding to Oracle Schema
§  Generate Greenplum External Table DDL corresponding to Oracle Schema
§  Generate Load data insert table scripts to insert data into Greenplum table
§  Generate Select count DML scripts to count no of rows in greenplum internal and external tables
§  Export Oracle Table Data in CSV Format consumed by Greenplum External Table
§  **Export very large partitioned tables data in parallel and store in different location
§  **Generate External table DDL of large partitioned tables
§  Test Greenplum Database Connectivity
§  Create table in Greenplum using DDL generated from option 5
§  Create external table in Greenplum using DDL generated from option 6 or option 10
§  Load Data in Greenplum
§  Generate table counts DML script
§  Create Checksum Report of Migrated data in Greenplum
§  Compare Oracle and Greenplum Checksum Report
§  Schema Manager - Rebuild, Upgrade, Clone Schema
§  Users and Roles Management - Create/Modify/Delete users
§  Grants and Privileges - Database Access Control