111. What is gpcrondump and how does it work?
Answer:
The
Greenplum Database parallel dump utility gpcrondump
backs up the Greenplum master instance and each active segment instance at the
same time.
By default, gpcrondump creates dump files
in the db_dumps subdirectory.
Several dump files are created for the
master, containing database information such as DDL statements, the Greenplum
system catalog tables, and metadata files.
gpcrondump creates one dump file for each
segment, which contains commands to recreate the data on that segment.
You can perform full or incremental
backups. To restore a database to its state when an incremental backup was
made, you will need to restore the previous full backup and all subsequent
incremental backups.
Each file created
for a backup begins with a 14-digit timestamp key that identifies the backup
set the file belongs to.
gpcrondump
can be run directly in a terminal on the master host, or you can add it to
crontab on the master host to schedule regular backups.
To run a full backup, use "gpcrondump -x database -u /path/for/backup -a". This will back up the entire database to the given directory without prompting the user.
$ gpcrondump -x tutorial -u /tmp -a -r
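If incremental backups are also scheduled, a later run might look like the following; this is a sketch that assumes the --incremental option is available in your gpcrondump release, which backs up only data changed since the previous backup:
$ gpcrondump -x tutorial --incremental -u /tmp -a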
112. Are indexes useful for performance in Greenplum?
Answer: Greenplum
Database does not depend upon indexes to the same degree as conventional data
warehouse systems. Because the segments execute table scans in parallel, with each
segment scanning a small portion of the table, the traditional performance
advantage from indexes is diminished.
Indexes consume large amounts of space and require considerable
CPU time to compute during data loads. There are, however, times when indexes
are useful, especially for highly selective queries.
When a query looks up a single row, an index can dramatically
improve performance.
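For example, a lookup of individual flights by tail number is highly selective and could benefit from a B-tree index. A minimal sketch, borrowing the FAA tables used later in this tutorial (the column name and value are illustrative):
tutorial=# CREATE INDEX otp_r_tailnum_idx ON faa.otp_r (tailnum);
tutorial=# SELECT * FROM faa.otp_r WHERE tailnum = 'N12345';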
113. What is row vs. column orientation?
Answer: Greenplum
Database offers the ability to store a table in either row or column
orientation. Both storage options have advantages, depending upon data
compression characteristics, the kinds of queries executed, the row length, and
the complexity and number of join columns.
As a general rule, very wide tables
are better stored in row orientation, especially if there are joins on many
columns.
Column orientation works well to
save space with compression and to reduce I/O when there is much duplicated
data in columns.
CREATE TABLE faa.otp_c (LIKE faa.otp_r) WITH (appendonly=true, orientation=column);
114. What is partitioning a table in Greenplum?
Answer: Partitioning
a table can improve query performance and simplify data administration. The
table is divided into smaller child files using a range or a list value, such
as a date range or a country code.
Partitions can improve query performance dramatically. When a
query predicate filters on the same criteria used to define partitions, the
optimizer can avoid searching partitions that do not contain relevant data.
Using the ALTER TABLE statement, an existing partition can be
dropped by removing its child file. This is much more efficient than scanning
the entire table and removing rows with a DELETE statement.
Partitions may also be subpartitioned. For example, a table could
be partitioned by month, and the month partitions could be subpartitioned by
week. Greenplum Database creates child files for the months and weeks. The
actual data, however, is stored in the child files created for the week
subpartitions—only child files at the leaf level hold data.
When a new partition is added, you can run ANALYZE on just the
data in that partition. ANALYZE can run on the root partition (the name of the
table in the CREATE TABLE statement) or on a child file created for a leaf
partition. If ANALYZE has already run on
the other partitions and the data is static, it is not necessary to run it
again on those partitions.
Greenplum Database supports:
- Range partitioning: division of
data based on a numerical range, such as date or price.
- List partitioning: division of
data based on a list of values, such as sales territory or product line.
- A combination
of both types.
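A minimal sketch of a range-partitioned table, loosely following the FAA example used elsewhere in this tutorial (the column name and date range are illustrative):
CREATE TABLE faa.otp_part (LIKE faa.otp_r)
PARTITION BY RANGE (flightdate)
(START (date '2009-12-01') INCLUSIVE
 END (date '2010-02-01') EXCLUSIVE
 EVERY (INTERVAL '1 month'));
Each month becomes a child file, and the optimizer can skip months that a query's predicate excludes.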
115. Explain Query Planning and Dispatch?
Answer: The master
receives, parses, and optimizes the query. The resulting query plan is either
parallel or targeted. The master dispatches parallel query plans to all
segments. Each segment is responsible for executing local database operations
on its own set of data.
Most database operations—such as table scans, joins, aggregations,
and sorts—execute across all segments in parallel. Each operation is performed
on a segment database independent of the data stored in the other segment
databases.
Understanding
Greenplum Query Plans
A query plan is the set of operations Greenplum Database will
perform to produce the answer to a query. Each node or step in the plan
represents a database operation such as a table scan, join, aggregation, or
sort. Plans are read and executed from bottom to top.
In addition to common database operations such as table scans,
joins, and so on, Greenplum Database has an additional operation type called motion. A motion operation involves
moving tuples between the segments during query processing.
To achieve maximum parallelism during query execution, Greenplum
divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on
independently. A query plan is sliced wherever a motion operation occurs in the
plan, with one slice on each side of the motion.
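Motion operations and slices are visible directly in the plan output. A minimal illustration (the query is arbitrary and borrows the FAA table used later; exact costs, row estimates, and segment counts depend on your system):
tutorial=# EXPLAIN SELECT origin, count(*) FROM faa.otp_r GROUP BY origin;
In the resulting plan, a Gather Motion node labeled with a slice number and segment count sits above the parallel scan and aggregation nodes; each motion marks a slice boundary.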
116. What is parallel query execution and how does it work?
Answer: Greenplum
creates a number of database processes to handle the work of a query. On the
master, the query worker process is called the query dispatcher (QD).
The QD is responsible for creating and dispatching the query plan. It also
accumulates and presents the final results. On the segments, a query worker
process is called a query executor
(QE). A QE is responsible for completing its portion of work and communicating
its intermediate results to the other worker processes.
There is at least one worker process assigned to each slice of the
query plan. A worker process works on its assigned portion of the query plan
independently. During query execution,
each segment will have a number of processes working on the query in parallel.
Related processes that are working
on the same slice of the query plan but on different segments are called gangs. As a portion of work is completed, tuples flow up the query plan
from one gang of processes to the next. This inter-process communication
between the segments is referred to as the interconnect component of Greenplum
Database.
117. What is MVCC?
Answer: Greenplum
uses Multiversion Concurrency Control (MVCC) to guarantee isolation, one of the
ACID properties of relational databases. MVCC enables multiple users of the
database to obtain consistent results for a query, even if the data is changing
as the query is being executed. There can be multiple versions of rows in the
database, but a query sees a snapshot of the database at a single point in
time, containing only the versions of rows that were valid at that point in
time. When a row is updated or deleted and no active transactions continue to
reference it, it can be removed.
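A minimal sketch of the effect, using two concurrent psql sessions (the table is the load table used later in this tutorial; the cancelled column is assumed for illustration):
-- Session 1: open a transaction that uses a single snapshot for all its reads
tutorial=# BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
tutorial=# SELECT count(*) FROM faa.faa_otp_load;
-- Session 2 (a separate psql connection) removes rows and commits:
tutorial=# DELETE FROM faa.faa_otp_load WHERE cancelled = 1;
-- Session 1: the same count is returned, because the transaction still sees
-- the row versions that were valid when its snapshot was taken.
tutorial=# SELECT count(*) FROM faa.faa_otp_load;
tutorial=# COMMIT;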
118. What are VACUUM, VACUUM FULL & ANALYZE?
Answer: The VACUUM
command removes older versions that are no longer needed, leaving free space
that can be reused. In a Greenplum database, normal OLTP operations do not
create the need for vacuuming out old rows, but loading data while tables are
in use may. It is a best practice to
VACUUM a table after a load. If the table is partitioned, and only a single
partition is being altered, then a VACUUM on that partition may suffice. The VACUUM FULL command behaves much
differently than VACUUM, and its use is not recommended in Greenplum databases.
It can be expensive in CPU and I/O, cause bloat in indexes, and lock data for
long periods of time. The ANALYZE command generates statistics about the
distribution of data in a table. In particular it stores histograms about the
values in each of the columns. The query optimizer depends on these statistics
to select the best plan for executing a query. For example, the optimizer can
use distribution data to decide on join orders. One of the optimizer’s goals in
a join is to minimize the volume of data that must be analyzed and potentially
moved between segments by using the statistics to choose the smallest result
set to work with first.
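For example, after the loads performed later in this tutorial, the loaded table could be maintained as follows:
tutorial=# VACUUM faa.faa_otp_load;
tutorial=# ANALYZE faa.faa_otp_load;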
119. What are explain plans?
Answer: An explain
plan explains the method the optimizer has chosen to produce a result set.
Depending on the query, there can be a variety of methods to produce a result
set. The optimizer calculates the cost
for each method and chooses the one with the lowest cost. In large queries,
cost is generally measured by the amount of I/O to be performed.
An explain plan does not do any actual query processing work.
Explain plans use statistics generated by the ANALYZE command, so plans
generated before and after running ANALYZE can be quite different. This is
especially true for queries with multiple joins, because the order of the joins
can have a tremendous impact on performance.
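A quick way to see the effect of statistics is to compare plans before and after ANALYZE. A sketch using the FAA tables from these exercises (the join columns are illustrative):
tutorial=# EXPLAIN SELECT o.origin, count(*)
           FROM faa.otp_r o JOIN faa.d_airports a ON o.origin = a.airport_code
           GROUP BY o.origin;
tutorial=# ANALYZE faa.otp_r;
tutorial=# ANALYZE faa.d_airports;
Running the same EXPLAIN again after ANALYZE may show a different join order and different cost and row estimates.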
120. How to change optimizers?
Answer: You may see “legacy query optimizer”
listed in the EXPLAIN output under “Optimizer status.”
- Check whether
the Pivotal Query Optimizer is enabled.
$ gpconfig -s optimizer
Values on all segments are consistent
GUC : optimizer
Master value: off
Segment value: off
- Enable the
Pivotal Query Optimizer
$ gpconfig -c optimizer -v on --masteronly
20151201:09:08:31:172854 gpconfig:gpdb-sandbox:gpadmin-[INFO]:-completed successfully
- Reload the
configuration on master and segment instances.
$ gpstop -u
121. How to load external data into Greenplum Database?
Answer: Loading external data into Greenplum Database tables can be
accomplished in different ways. We will use three methods to load the FAA data:
- The simplest
data loading method is the SQL INSERT statement. You can execute INSERT
statements directly with psql or another interactive client, run a script
containing INSERT statements, or run a client application with a database
connection. This is the least efficient method for loading large volumes
of data and should be used only for small amounts of data.
- You can use
the COPY command to load the data into a table when the data is in
external text files. The COPY command syntax allows you to define the
format of the text file so the data can be parsed into rows and columns. This
is faster than INSERT statements but, like INSERT statements, it is not a
parallel process.
The SQL COPY command requires that external files be accessible to the host where the master process is running. On a multi-node Greenplum Database system, the data files may be on a file system that is not accessible to the master process. In this case, you can use the psql \copy meta-command, which streams the data to the server over the psql connection.
- You can use a
pair of Greenplum utilities, gpfdist
and gpload, to load external
data into tables at high data transfer rates. In a large scale,
multi-terabyte data warehouse, large amounts of data must be loaded within
a relatively small maintenance window. Greenplum supports fast, parallel
data loading with its external tables feature. Administrators can also
load external tables in single row error isolation mode to filter bad rows
into a separate error table while continuing to load properly formatted
rows. Administrators can specify an error threshold for a load operation
to control how many improperly formatted rows cause Greenplum to abort the
load operation.
By using external tables in conjunction with Greenplum Database’s
parallel file server (gpfdist),
administrators can achieve maximum parallelism and load bandwidth from their
Greenplum Database system.
Another Greenplum utility, gpload,
runs a load task that you specify in a YAML-formatted control file. You
describe the source data locations, format, transformations required,
participating hosts, database destinations, and other particulars in the
control file and gpload executes the load. This allows you to describe a
complex task and execute it in a controlled, repeatable fashion.
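For comparison with the parallel methods above, a serial COPY load through the master might look like this (the file name and path are hypothetical):
tutorial=# COPY faa.faa_otp_load FROM '/home/gpadmin/data/otp200912.csv' WITH CSV HEADER;
or, when the file lives on the client machine rather than the master host:
tutorial=# \copy faa.faa_otp_load from 'otp200912.csv' with csv header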
122. What are the steps to load data with gpfdist?
Answer: For the FAA fact table, we will use an ETL
(Extract, Transform, Load) process to load data from the source gzip files into
a loading table, and then insert the data into a query and reporting table. For
the best load speed, use the gpfdist Greenplum utility to distribute the rows
to the segments. In a production system, gpfdist runs on the servers where the
data is located. With a single-node Greenplum Database instance, there is only
one host, and you run gpfdist on it. Starting gpfdist is like starting a file
server; there is no data movement until a request is made on the process.
Note: This exercise loads data using the
Greenplum Database external table feature to move data from external data files
into the database. Moving data between the database and external tables is a
security consideration, so only superusers are permitted to use the feature.
Therefore, you will run this exercise as the gpadmin database user.
- Execute gpfdist. Use the -d switch to set the "home" directory used to search for files (here, the faa directory). Use the -p switch to set the port, and background the process.
$ gpfdist -d
~/gpdb-sandbox-tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1 &
- Check
that gpfdist is running with the ps command:
$ ps -A | grep
gpfdist
- View
the contents of the gpfdist log.
more /tmp/gpfdist.log
- Start
a psql session as gpadmin and execute the create_load_tables.sql script.
This script creates two tables: the faa_otp_load table, into which gpfdist
will load the data, and the faa_load_errors table, where load errors will
be logged. (The faa_load_errors table may already exist. Ignore the error
message.) The faa_otp_load table is structured to match the format of the
input data from the FAA Web site.
$ psql -U gpadmin tutorial
tutorial=# \i
create_load_tables.sql
CREATE TABLE
CREATE TABLE
- Create an external table definition with the same structure as the faa_otp_load table.
tutorial=# \i
create_ext_table.sql
psql:create_ext_table.sql:5:
NOTICE: HEADER means that each one of
the data files has a header row.
CREATE EXTERNAL TABLE
This
is a pure metadata operation. No data has moved from the data files on the host
to the database yet. The external table definition references files in the faa
directory that match the pattern otp*.gz. There are two matching files, one
containing data for December 2009, the other for January 2010.
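The create_ext_table.sql script itself is not reproduced here; a minimal sketch of the kind of definition it contains might look like the following (the LIKE clause, error-table clause, and reject limit are assumptions based on the tables and host configured above):
CREATE EXTERNAL TABLE faa.ext_load_otp (LIKE faa.faa_otp_load)
LOCATION ('gpfdist://gpdb-sandbox:8081/otp*.gz')
FORMAT 'CSV' (HEADER)
LOG ERRORS INTO faa.faa_load_errors SEGMENT REJECT LIMIT 50000 ROWS;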
- Move data from the external table to the faa_otp_load table.
tutorial=# INSERT INTO faa.faa_otp_load SELECT * FROM
faa.ext_load_otp;
NOTICE: Found 26526 data formatting errors (26526 or
more input rows).
Rejected related input data.
INSERT 0 1024552
Greenplum
moves data from the gzip files into the load table in the database. In a
production environment, you could have many gpfdist processes running, one on
each host or several on one host, each on a separate port number.
- Examine the errors briefly. (The \x psql meta-command changes the display of the
results to one line per column, which is easier to read for some result sets.)
tutorial=# \x
Expanded display is on.
tutorial=# SELECT DISTINCT
relname, errmsg, count(*)
FROM faa.faa_load_errors
GROUP BY 1,2;
-[ RECORD 1 ]-------------------------------------------------
relname | ext_load_otp
errmsg | invalid input syntax for integer:
"", column deptime
count | 26526
Exit the psql shell:
tutorial=# \q
123. How to load data with gpload?
Answer: Greenplum provides
a wrapper program for gpfdist called gpload
that does much of the work of setting up the external table and the data
movement. In this exercise, you reload the faa_otp_load table using the gpload
utility.
1. Since gpload executes gpfdist, you must
first kill the gpfdist process you started in the previous exercise.
[gpadmin@gpdb-sandbox faa]$ ps -A | grep gpfdist
4035 pts/0 00:00:02 gpfdist
Your
process id will not be the same, so kill the appropriate one with the kill
command, or just use the simpler killall command:
[gpadmin@gpdb-sandbox faa]$ killall gpfdist
[1]+  Exit 1    gpfdist -d $HOME/gpdb-sandbox-tutorials/faa -p 8081 > /tmp/gpfdist.log 2>&1
2. Edit and customize the gpload.yaml input file. Be sure to set the correct path to
the faa directory. Notice that the TRUNCATE: true preload instruction ensures that the
data loaded in the previous exercise will be removed before the load in this exercise starts.
vi gpload.yaml

---
VERSION: 1.0.0.1
# describe the Greenplum database parameters
DATABASE: tutorial
USER: gpadmin
HOST: localhost
PORT: 5432
# describe the location of the source files
# in this example, the database master lives on the same host as the source files
GPLOAD:
   INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - gpdb-sandbox
        PORT: 8081
        FILE:
          - /Users/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz
    - FORMAT: csv
    - QUOTE: '"'
    - ERROR_LIMIT: 50000
    - ERROR_TABLE: faa.faa_load_errors
   OUTPUT:
    - TABLE: faa.faa_otp_load
    - MODE: INSERT
   PRELOAD:
    - TRUNCATE: true
3. Execute gpload with the gpload.yaml
input file. (Include the -v flag if you want to see details of the loading
process.)
$ gpload -f gpload.yaml -l gpload.log
2015-10-21 15:05:39|INFO|gpload session started 2015-10-21 15:05:39
2015-10-21 15:05:39|INFO|started gpfdist -p 8081 -P 8082 -f "/home/gpadmin/gpdb-sandbox-tutorials/faa/otp*.gz" -t 30
2015-10-21 15:05:58|WARN|26528 bad rows
2015-10-21 15:05:58|INFO|running time: 18.64 seconds
2015-10-21 15:05:58|INFO|rows Inserted          = 1024552
2015-10-21 15:05:58|INFO|rows Updated           = 0
2015-10-21 15:05:58|INFO|data formatting errors = 0
2015-10-21 15:05:58|INFO|gpload succeeded with warnings
124. How to perform data loading?
Answer:
The ability to load billions of rows quickly into the Greenplum
database is one of its key features. Using “Extract, Load and Transform” (ELT)
allows load processes to make use of the massive parallelism of the Greenplum
system by staging the data (perhaps just the use of external tables) and then
applying data transformations within Greenplum Database. Set-based operations
can be done in parallel, maximizing performance.
With other loading mechanisms such as COPY,
data is loaded through the master in a single process. This does not take
advantage of the parallel processing power of the Greenplum segments. External
tables provide a means of leveraging the parallel processing power of the
segments for data loading. Also, unlike other loading mechanisms, you can
access multiple data sources with one SELECT of an external table.
External tables make static data available
inside the database. External tables can be defined with file:// or gpfdist://
protocols. gpfdist is a file server program that loads files in parallel. Since
the data is static, external tables can be rescanned during a query execution.
External Web tables allow http:// protocol
or an EXECUTE clause to execute an operating system command or script. That
data is assumed to be dynamic—query plans involving Web tables do not allow
rescanning because the data could change during query execution. Execution
plans may be slower, as data must be materialized (I/O) if it cannot fit in
memory.
The script or process to populate a table
with external Web tables may be executed on every segment host. It is possible,
therefore, to have duplication of data. This is something to be aware of and
check for when using Web tables, particularly with SQL extract calls to another
database.
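A minimal sketch of an external web table that runs a script on each segment host (the script path and column list are hypothetical):
CREATE EXTERNAL WEB TABLE faa.log_output (logtime timestamp, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');
Because the EXECUTE clause runs on every segment host, the same rows can be returned more than once, which is the duplication issue described above.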
125. What is Greenplum Database?
Answer: Greenplum Database is a massively parallel
processing (MPP) database server with an architecture specially designed to
manage large-scale analytic data warehouses and business intelligence
workloads.
MPP (also known as a shared nothing architecture) refers to
systems with two or more processors that cooperate to carry out an operation,
each processor with its own memory, operating system and disks. Greenplum uses
this high-performance system architecture to distribute the load of
multi-terabyte data warehouses and can use all of a system’s resources in
parallel to process a query.
Greenplum Database is based on PostgreSQL open-source technology.
It is essentially several PostgreSQL database instances acting together as one
cohesive database management system (DBMS). It is based on PostgreSQL 8.2.15,
and in most cases is very similar to PostgreSQL with regard to SQL support,
features, configuration options, and end-user functionality. Database users
interact with Greenplum Database as they would a regular PostgreSQL DBMS.
The internals of PostgreSQL have been modified or supplemented to
support the parallel structure of Greenplum Database. For example, the system
catalog, optimizer, query executor, and transaction manager components have
been modified and enhanced to be able to execute queries simultaneously across
all of the parallel PostgreSQL database instances. The Greenplum interconnect (the networking layer) enables communication
between the distinct PostgreSQL instances and allows the system to behave as
one logical database.
Greenplum Database also includes features designed to optimize
PostgreSQL for business intelligence (BI) workloads. For example, Greenplum has
added parallel data loading (external tables), resource management, query
optimizations, and storage enhancements, which are not found in standard
PostgreSQL.
126. Explain Greenplum Database Architecture?
Answer: Greenplum Database stores and processes
large amounts of data by distributing the data and processing workload across
several servers or hosts. Greenplum Database is an array of individual
databases based upon PostgreSQL 8.2 working together to present a single
database image. The master is the entry
point to the Greenplum Database system. It is the database instance to
which clients connect and submit SQL statements. The master coordinates its
work with the other database instances in the system, called segments, which store and process the
data.
Figure 1.
High-Level view of the Greenplum Database Architecture
Greenplum Master
The Greenplum Database master is the entry
to the Greenplum Database system, accepting client connections and SQL queries,
and distributing work to the segment instances.
Greenplum Database end-users interact with
Greenplum Database (through the master) as they would with a typical PostgreSQL
database. They connect to the database using client programs such as psql or
application programming interfaces (APIs) such as JDBC or ODBC.
The master
is where the global system catalog resides.
The global system catalog is the set of system tables that contain metadata
about the Greenplum Database system itself. The master does not contain any
user data; data resides only on the segments. The master authenticates client
connections, processes incoming SQL commands, distributes workloads among
segments, coordinates the results returned by each segment, and presents the
final results to the client program.
Greenplum Segments
Greenplum Database segment instances are
independent PostgreSQL databases that each store a portion of the data and perform
the majority of query processing.
When a user connects to the database via
the Greenplum master and issues a query, processes are created in each segment
database to handle the work of that query.
User-defined tables and their indexes are
distributed across the available segments in a Greenplum Database system; each
segment contains a distinct portion of data. The database server processes that
serve segment data run under the corresponding segment instances. Users
interact with segments in a Greenplum Database system through the master.
Segments run on servers called segment hosts. A segment host typically
executes from two to eight Greenplum segments, depending on the CPU cores, RAM,
storage, network interfaces, and workloads. Segment hosts are expected to
be identically configured. The key to obtaining the best performance from
Greenplum Database is to distribute data and workloads evenly across a large
number of equally capable segments so that all segments begin working on a task
simultaneously and complete their work at the same time.
Greenplum Interconnect
The interconnect is the networking layer of
the Greenplum Database architecture.
The interconnect refers to the
inter-process communication between segments and the network infrastructure on
which this communication relies. The Greenplum interconnect uses a standard
10-Gigabit Ethernet switching fabric.
By default, the interconnect uses User
Datagram Protocol (UDP) to send messages over the network. The Greenplum software performs packet verification beyond what
is provided by UDP. This means the reliability is equivalent to Transmission
Control Protocol (TCP), and the performance and scalability exceeds TCP. If the
interconnect used TCP, Greenplum Database would have a scalability limit of
1000 segment instances. With UDP as the current default protocol for the
interconnect, this limit is not applicable.
127. What is the Pivotal Query Optimizer?
Answer:
The Pivotal Query
Optimizer brings a state of the art query optimization framework to Greenplum
Database that is distinguished from other optimizers in several ways:
- Modularity. Pivotal
Query Optimizer is not confined inside a single RDBMS. It is currently
leveraged in both Greenplum Database and Pivotal HAWQ, but it can also be
run as a standalone component to allow greater flexibility in adopting new
backend systems and using the optimizer as a service. This also enables
elaborate testing of the optimizer without going through the other components
of the database stack.
- Extensibility. The
Pivotal Query Optimizer has been designed as a collection of independent
components that can be replaced, configured, or extended separately. This
significantly reduces the development costs of adding new features, and
also allows rapid adoption of emerging technologies. Within the Query
Optimizer, the representation of the elements of a query has been
separated from how the query is optimized. This lets the optimizer treat
all elements equally and avoids the issues with the imposed order of
optimization steps of multi-phase optimizers.
- Performance. The
Pivotal Query Optimizer leverages a multi-core scheduler that can
distribute individual optimization tasks across multiple cores to speed up
the optimization process. This allows the Query Optimizer to apply all
possible optimizations at the same time, which results in many more plan
alternatives and a wider range of queries that can be optimized. For
instance, when the Pivotal Query Optimizer was used with TPC-H Query 21 it
generated 1.2 Billion possible plans in 250 ms. This is especially
important in Big Data Analytics where performance challenges are magnified
by the volume of data that needs to be processed. A suboptimal
optimization choice could very well lead to a query that just runs
forever.
128. What are users and roles?
Answer:
Greenplum Database manages database access using roles. Initially,
there is one superuser role—the role
associated with the OS user who initialized the database instance, usually gpadmin. This user owns all of the
Greenplum Database files and OS processes, so it is important to reserve the
gpadmin role for system tasks only.
A role can
be a user or a group. A user role can log in to a
database; that is, it has the LOGIN attribute. A user or group role can become
a member of a group.
Permissions can be granted to users or
groups. Initially, of course, only the
gpadmin role is able to create roles. You can add roles with the createuser utility command, CREATE ROLE SQL command, or the CREATE USER SQL command. The CREATE USER command is the same as the
CREATE ROLE command except that it automatically assigns the role the LOGIN
attribute.
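For example, a group role and a login user might be created as follows (the names and password are illustrative):
tutorial=# CREATE ROLE analysts;
tutorial=# CREATE USER analyst1 WITH PASSWORD 'changeme';
tutorial=# GRANT analysts TO analyst1;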
129. How to create and prepare a database?
Answer:
Create a new database with the CREATE DATABASE SQL
command in psql or the createdb utility command in a
terminal. The new database is a copy of
the template1 database, unless you specify a different template. To use the
CREATE DATABASE command, you must be connected to a database. With a newly
installed Greenplum Database system, you can connect to the template1 database
to create your first user database. The createdb utility, entered at a shell
prompt, is a wrapper around the CREATE DATABASE command.
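For example, the tutorial database used in these exercises could be created either way (assuming the gpadmin role and a default connection):
$ createdb tutorial
or, from a psql session connected to template1:
template1=# CREATE DATABASE tutorial;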
130. How to create a schema and set a search path?
Answer:
A database schema is a named container for a set of database
objects, including tables, data types, and functions. A database can have
multiple schemas. Objects within the schema are referenced by prefixing the
object name with the schema name, separated with a period. For example, the
person table in the employee schema is written employee.person.
The schema provides a namespace for the
objects it contains. If the database is used for multiple applications, each
with its own schema, the same table name can be used in each schema:
employee.person is a different table than customer.person. Both tables could be
accessed in the same query as long as they are qualified with the schema name.
The
database contains a schema search path, which is a list of schemas to search
for object names that are not qualified with a schema name. The first schema in the
search path is also the schema where new objects are created when no schema is
specified. The default search path is
user,public, so by default, each object you create belongs to a schema
associated with your login name. In this exercise, you create an faa schema
and set the search path so that it is the default schema.
1. Change to the directory containing the
FAA data and scripts:
$ cd ~/gpdb-sandbox-tutorials/faa
2. Connect to the tutorial database with
psql:
$ psql -U user1 tutorial
3. Create the faa schema:
tutorial=# DROP SCHEMA IF EXISTS faa CASCADE;
tutorial=# CREATE SCHEMA faa;
4. Add the faa schema to the search path:
tutorial=# SET SEARCH_PATH TO faa, public, pg_catalog,
gp_toolkit;
5. View the search path:
tutorial=#
SHOW search_path;
search_path
-------------------------------------
faa, public, pg_catalog, gp_toolkit
(1 row)
The search path you set above is not
persistent; you have to set it each time you connect to the database. You can
associate a search path with the user role by using the ALTER ROLE command, so
that each time you connect to the database with that role, the search path is
restored:
tutorial=# ALTER ROLE user1 SET search_path TO faa, public,
pg_catalog, gp_toolkit;
131. What is the Greenplum data distribution policy?
Answer:
The definition of a table includes the distribution policy for the
data, which has great bearing on system performance. The goals for the
distribution policy are to:
- distribute
the volume of data and query execution work evenly among the segments, and
to
- enable
segments to accomplish the most expensive query processing steps locally.
The distribution policy determines how data
is distributed among the segments. Defining an effective distribution policy
requires an understanding of the data’s characteristics, the kinds of queries
that will be run once the data is loaded into the database, and what
distribution strategies best utilize the parallel execution capacity of the
segments.
Use the DISTRIBUTED clause of the CREATE
TABLE statement to define the distribution policy for a table. Ideally, each segment will store an equal volume of data and
perform an equal share of work when processing queries. There are two kinds of
distribution policies:
- DISTRIBUTED BY (column, …) defines a
distribution key from one or more columns. A hash function applied to the
distribution key determines which segment stores the row. Rows that have
the same distribution key are stored on the same segment. If the
distribution keys are unique, the hash function ensures the data is
distributed evenly. The default
distribution policy is a hash on the primary key of the table, or the
first column if no primary key is specified.
- DISTRIBUTED RANDOMLY distributes
rows in round-robin fashion among the segments.
When different tables are joined on the
same columns that comprise the distribution key, the join can be accomplished
at the segments, which is much faster than joining rows across segments. The
random distribution policy makes this impossible, so it is best practice to
define a distribution key that will optimize joins.
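For example, a sketch based on the FAA tables used in this tutorial (the choice of key assumes queries frequently join on the carrier and flight number columns; a staging table with no obvious key can be distributed randomly):
tutorial=# CREATE TABLE faa.otp_by_flight (LIKE faa.otp_r)
           DISTRIBUTED BY (UniqueCarrier, FlightNum);
tutorial=# CREATE TABLE faa.otp_stage (LIKE faa.otp_r)
           DISTRIBUTED RANDOMLY;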
132. Steps to migrate data from Oracle to Greenplum
Answer:
Step 1: Get information about the source schema.
Step 2: Generate DDL for the Greenplum schema from the Oracle schema.
Step 3: Generate CSV data dumps for the Oracle tables.
Step 4: Load the database using gpfdist.
Step 5: Validate the data.
- Test Oracle Database Connectivity
- Oracle Database Information Report
- Oracle Table Rows Count Report
- Oracle Table Checksum Report
- Generate Greenplum schema table DDL corresponding to the Oracle schema
- Generate Greenplum external table DDL corresponding to the Oracle schema
- Generate load scripts to insert data into Greenplum tables
- Generate SELECT count DML scripts to count the number of rows in Greenplum internal and external tables
- Export Oracle table data in CSV format to be consumed by Greenplum external tables
- Export very large partitioned table data in parallel and store it in a different location
- Generate external table DDL for large partitioned tables
- Test Greenplum Database Connectivity
- Create tables in Greenplum using the DDL generated from option 5
- Create external tables in Greenplum using the DDL generated from option 6 or option 10
- Load data in Greenplum
- Generate table counts DML script
- Create a checksum report of the migrated data in Greenplum
- Compare the Oracle and Greenplum checksum reports
- Schema Manager - Rebuild, Upgrade, Clone Schema
- Users and Roles Management - Create/Modify/Delete users
- Grants and Privileges - Database Access Control
- Call a calculator
- Display files in the user directory
- Display Disk Usage
- Display CPU Utilization
- Display Memory Utilization
- Find and Replace
- Find files for a particular date (example: APR 19)
- List files in MB size