Manoj Simar - The Technology Expert: Interview Q and A for Greenplum DB Part

1. How to check distribution policy of a test table sales?

Answer: The Describe table sales shows the distribution details.

psql>\d sales

Table"public.sales"

Column | Type | Modifiers

--------+---------+-----------

id | integer |

date | date |

Distributed by: (id)

2. How many user schemas are there in the database?

Answer: Use"\dn" at psql prompt.

3.When is my tables last analyzed in Greenplum database?

Answer: In 4.x check pg_stat_operations for all actionname performed on any object.

For example,a sales table:

gpdb=# select objname,actionname,statime from pg_stat_operations where objname like 'sales';

objname | actionname | statime

--------+-----------+-------------------------------

sales | CREATE | 2010-10-01 12:18:41.996244-07

sales | ANALYZE | 2010-10-06 14:38:21.933281-07

sales | VACUUM | 2010-10-06 14:38:33.881245-07

4. How to check the size of a table?

Answer: Table Level: Replace schema.tablename with your search table.

psql> select pg_size_pretty(pg_relation_size('schema.tablename'));

Table and Index: Replace schema.tablename with your search table.

psql> select pg_size_pretty(pg_total_relation_size('schema.tablename'));

5. How to check the Schema size?

Answer: Schema Level: Replace SCHEMANAME with your schema name.

psql> select schemaname ,round(sum(pg_total_relation_size(schemaname||'.'||tablename))/1024/1024) "Size_MB" from pg_tables where schemaname='SCHEMANAME' group by 1;

6. How to check the database size?

Answer: To see size of specific database:

psql> select pg_size_pretty(pg_database_size('DATBASE_NAME'));

To see all database sizes:

psql> select datname,pg_size_pretty(pg_database_size(datname)) from pg_database;

7. How to check partitioned table size including indexes and partitions?

Answer: Table size with partitions:

The following SQL gives you employee_dailly table size, which includes partitions.

select schemaname,tablename,round(sum(pg_total_relation_size(schemaname || '.' || partitiontablename))/1024/1024) "MB" from pg_partitions where tablename='employee_daily' group by 1,2;

schemaname | tablename | MB

-----------+----------------+-----

public | employee_daily | 254

8. How do I get help on syntax to alter table?

Answer: In psql session type \h alter table which will display the syntax:

gpdb=# \h alter table

9. How to connect in utility mode?

Answer: From master host

PGOPTIONS='-c gp_session_role=utility' psql -p <port> -h <hostname>

Where:

port is segment/ master database port. hostname is segment/master hostname.

10. Where/How to find db logs?

Answer: Master:

Master gpdb logfile is located in the $MASTER_DATA_DIRECTORY/pg_log/ directory and the file name depends on the database "log_filename" parameter.

1. $MASTER_DATA_DIRECTORY/pg_log/gpdb-yyyy-mm-dd_000000.csv -->Log file format with default installation.

2. ~gpadmin/gpAdminLogs/ -->gpstart,gpstop,gpstate and other utility logs.

Segments:

1. primary segments run below SQL to see logfile location:

select dbid,hostname,datadir||'/pg_log' from gp_configuration where content not in (-1) and isprimary is true;

2. Miror Segments run below SQL to see logfile location:

select dbid,hostname,datadir||'/pg_log' from gp_configuration where content not in (-1) and isprimary is false;

11. How to see the list of available functions in Greenplum DB?

Answer: \df schemaname.functionname (schemaname and function name support wildcard characters)

test=# \df pub*.*test*

List of functions

Schema | Name | Result data type | Argument data types

--------+-------------+------------------+---------------------

public | bugtest | integer |

public | test | boolean | integer

public | test | void |

(3 rows)

12. How to check whether Greenplum server is up and running?

Answer: The gpstate is the utility to check gpdb status.

Use gpstate -Q to show a quick status. Refer to gpstate --help for more options.

Sample output:

[gpadmin@stinger2]/export/home/gpadmin>gpstate -Q

gpadmin-[INFO]:-Obtaining GPDB array type, [Brief], please wait...

gpadmin-[INFO]:-Quick Greenplum database status from Master instance only

gpadmin-[INFO]:----------------------------------------------------------

gpadmin-[INFO]:-GPDB fault action value = readonly

gpadmin-[INFO]:-Valid count in status view = 4

gpadmin-[INFO]:-Invalid count in status view = 0

gpadmin-[INFO]:----------------------------------------------------------

13. How to create a Database?

Answer: There are two ways to create gpdb database using psql session or the Greenplum createdb utility.

1. Using psql session:

gpdb=# \h create database

Command: CREATE DATABASE

Description: create a new database

Syntax:

CREATE DATABASE name

[ [ WITH ] [ OWNER [=] dbowner ]

[ TEMPLATE [=] template ]

[ ENCODING [=] encoding ]

[ TABLESPACE [=] tablespace ]

[ CONNECTION LIMIT [=] connlimit ] ]

2. Using createdb utility:

Usage: $GPHOME/bin/createdb --help

createdb [OPTION]... [DBNAME] [DESCRIPTION]

Options:

-D, --tablespace=TABLESPACE default tablespace for the database

-e, --echo show the commands being sent to the server

-E, --encoding=ENCODING encoding for the database

-O, --owner=OWNER database user to own the new database

-T, --template=TEMPLATE template database to copy

--help show this help, then exit

--version output version information, then exit

14. How do I get a list of databases in a greenplum cluster?

Answer: gpdb=# \ l (lowercase letter "l")

List of databases

Name | Owner | Encoding

------{}----------

gpdb | gpadmin | UTF8

gpperfmon | gpadmin | UTF8

postgres | gpadmin | UTF8

template0 | gpadmin | UTF8

template1 | gpadmin | UTF8

Check below SQL for more details on dbs.

gpdb=# select * from pg_database;

15. How to delete/drop an existing database in Greenplum?

Answer:

gpdb=# \h DROP Database

Command: DROP DATABASE

Description: remove a database

Syntax:DROP DATABASE [ IF EXISTS ] name

Also check dropdb utility:

$GPHOME/bin/dropdb --help

dropdb removes a PostgreSQL database.

Usage:

dropdb [OPTION]... DBNAME

16. Where can I get help on postgres psql commands?

Answer: In psql session

"\ ?" - for all psql session help

"\h <SQL Command> " For any SQL syntax help.

17. gpstart failed what should I do?

Answer: Check gpstart logfile in ~gpadmin/gpAdminLogs/gpstart_yyyymmdd.log

Take a look at the pg start log file for more details in $MASTER_DATA_DIRECTORY/pg_log/startup.log

18. Why do we need gpstop -m and gpstart -m?

Answer: The gpstart -m command allows you to start the master only and none of the data segments and is used primarily by support to get system level information/configuration. An end user would not regularly or even normally use it.

19. What is the procedure to get rid of mirror segments?

Answer: There are no utilities available to remove mirrors from Greenplum. You need to make sure all primary segments are good then you can remove the mirror configuration from gp_configuration in 3.x.

20. How to run gpcheckcat?

Answer: The gpcheckcat tool is used to check catalog inconsistencies between master and segments. It can be found in the $GPHOME/bin/lib directory:

Usage: gpcheckcat <option> [dbname]

-B parallel: number of worker threads

-g dir : generate SQL to rectify catalog corruption, put it in dir

-h host : DB host name

-p port : DB port number

-P passwd : DB password

-o : check OID consistency

-U uname : DB User Name

-v : verbose

Example: gpcheckcat gpdb >gpcheckcat_gpdb_logfile.log

21. What is gpdetective and how do I run it in Greenplum?

Answer: The gpdetective utility collects information from a running Greenplum Database system and creates a bzip2-compressed tar output file. This output file helps with the diagnosis of Greenplum Database errors or system failures. for more details check help.

gpdetective --help

22. How to delete a standby?

Answer: To remove the currently configured standby master host from your Greenplum Database system, run the following command in the master only:

# gpinitstandby -r

23. How to re-sync a standby?

Answer: Use this option if you already have a standby master configured, and just want to resynchronize the data between the primary and backup master host. The Greenplum system catalog tables will not be updated.

# gpinitstandby -n (resynchronize)

24. How to recover an invalid segment?

Answer: Use the gprecoverseg tool, which will recognize which segments need recovery and will initialize recovery.

3.3.x:

o Without "-F" option - First files will be compared, difference found and only different files will be synched (the first stage could last a long time if there are too many files in the data directory)

o With "-F" option - Entire data directory will be resynched.

4.0.x:

o Without "-F" option - The change tracking log will be sent and applied to the mirror.

o With "-F" option - Entire data directory will be resynched.

25. How to add mirrors to the array?

Answer: The gpaddmirrors utility configures mirror segment instances for an existing Greenplum Database system that was initially configured with primary segment instances only. For more details check help.

# gpaddmirrors --help

26. How to see primary to mirror mapping?

Answer: From database catalog following query list configuration on content ID, you can figure out primary and mirror for each content.

gpdb=# select * from gp_configuration order by content.

Note: starting from GPDB 4.x, gp_segment_configuration table is used instead.

gpdb=# select * from gp_segment_configuration order by dbid;

27. How to start/stop db in admin mode?

Answer:

Admin mode:

The gpstart with option (-R) is stands for Admin mode or restricted mode where only super users can connect to database when database opened using this option.

utility mode:

Utility mode allows you to connect to only individual segments when started using gpstart -m, for example< to connect to only master instance only:

PGOPTIONS='-c gp_session_role=utility' psql

28. How to run gpcheckperf IO/netperf?

Answer: Create a directory where you have free space and common in all hosts.

For network I/O test for each nic card:

gpcheckperf -f seg_host_file_nic-1 -r N -d /data/gpcheckperf > seg_host_file_nic_1.out

gpcheckperf -f seg_host_file_nic-2 -r N -d /data/gpcheckperf > seg_host_file_nic_2.out

For disk I/O:

gpcheckperf -f seg_host_file_nic-1 -r ds -D -d /data/gpdb_p1 -d /data/gpdb_p2 -d /data/gpdb_m1 -d /data/gpdb_m2

29. How to update postgresql.conf and reload it?

Answer: In GP 4.0 version check gpconfig utility to change postgres.conf parameters.

30. How to manage pg_hba.conf?

Answer: The pg_hba.conf file of the master instance controls client access and authentication to your Greenplum system.

31.How to add new user to the database?

Answer:

Use createuser utility to create users. See createuser --help for more details.

You can also use SQL commands in psql prompt to create users.

For example: CREATE USER or ROLE <ROLE_NAME> ....

32. How to create a password free trusted env b/w the all the segment hosts?

Answer:

Use gpssh-exkeys:

gpssh-exkeys -h hostname1 -h hostname2 .. -h hostnameN

33. How to check db version and version at init db?

Answer: To check version:

psql> select version();

postgres --gp-version

To check gp version at install:

psql> select * from gp_version_at_initdb;

34. How to see the value of GUC?

Answer: By connecting GPDB database using psql query catalog or do show parameter.

Example:

gpdb# select name,setting from pg_settings where name='GUC';

gpdb# show <GUC_NAME>;

35. What is the location of pg_hba/logfile/master_data_directory?

Answer: cd $MASTER_DATA_DIRECTORY - Master direcoty.

pg_hba.conf and postgres.conf location and other GPDB internal directories.

cd $MASTER_DATA_DIRECTORY/pg_logs -- Master database logfiles location.

36. How to find errors / fatal from log files?

Answer: grep for ERRORS, FATAL, SIGSEGV in pg_logs directory.

37. What is vacuum and when should I run this?

Answer: VACUUM reclaims storage occupied by deleted tuples. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table. They remain present on disk until a VACUUM is done. Therefore, it is necessary to do VACUUM periodically, especially on frequently-updated table.

38. What is difference between vacuum and vacuum full?

Answer: Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL.

VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance as it is counterproductive.

39. What is Analyze and how frequency should I run this?

Answer: ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the query planner to choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.

40. What is resource queues?

Answer: Resource queues are used to manager Greenplum database workload management. All user / queries can be prioritized using Resource queues.

41. What is gp_toolkit?

Answer: The gp_toolkit is a database schema, which has many tables, views and functions to better manage Greenplum Database when DB is up. All database users can access gp_toolkit to view and query the system log files and other system metrics.

42. How to generate DDL for a table?

Answer: Use pg_dump utility to generate DDL.

Example: pg_dump -t njonna.accounts -s -f ddl_accounts.sql

Where:

-f ddl_accounts.sql is output file.

-t njonna.accounts is table name with schema njonna.

-s dump only schema no data

43. What are the tools available in Greenplum to take backup and restores?

Answer: For non-parallel backups:

Use postgres utililities (pg_dump, pg_dumpall for backup, and pg_restore for restore).

Another useful command for getting data out of database is the COPY <TABLE> to <File>.

For parallel backups:

gp_dump and gpcrondump for backups and gp_restore for restore process.

44. How do I clone my production database to PreProd / QA environment?

Answer: If Prod and QA on same GPDB cluster, use CREATE database <Clone_DBname> template <Source_DB>.

If Prod and QA are on different clusters, use backup and restore utilities.

45. What is difference between pg_dump and gp_dump?

Answer:

pg_dump - Non-parallel backup utility, you need big file system where backup will be created in the master node only.

gp_dump - Parallel backup utility. Backup will be created in master and segments file system.

46. What is gpcrondump?

Answer: A wrapper utility for gp_dump, which can be called directly or from a crontab entry.

Example: gpcrondump -x <database_name>

47. What are the backup options available at OS level?

Answer: Solaris: zfs snapshots at file system level.

All OS: gpcrondump / gp_dump.

48. My SQL query is running very slow, it was running fine yesterday what should I do?

Answer:

A. Check that your connection to the Greenplum cluster is still good if you are using a remote client. You can do this by running the SQL locally to the GP cluster.

B. Check that the system tables and user tables involved are not bloated or skewed. Read jetpack or Greenplum toolkit documentation about how to do this.

C. Check with your DBA that the Greenplum interconnect is still performing correctly.

This can be done by checking for dropped packets on the interconnect "netstat -i" and by running gpcheckperf.

It is also possible that a segment is experiencing hardware problems, which can be found in the output of dmesg or in

/var/log/messages* (Linux) and /var/adm/messages* (Solaris).

49. How to turn on timing, and checking how much time a query takes to execute?

Answer:

A. You can turn in timing per session before you run your SQL with the \timing command.

B. You can run explain analyze against your SQL statement to get the timing.

50. How to trace child processes on segment server?

Answer: When session start in master and segments, all the child processes in segments will be identified with master session_id connection string (con+sess_id).

For example:

gpdb=# select * from pg_Stat_activity;

datid | datname | procpid | sess_id |.. ..

-------+---------+---------+---------+

16986 | gpdb | 18162 | 76134 | .. ..

In all segments child processes for session 76134:

[gpadmin@stinger2]/export/home/gpadmin/gp40>gpssh -f host_file /usr/ucb/ps -auxww |grep con76134

[stinger2] gpadmin 18162 1.7 6.0386000124480 ? S 09:57:55 0:04 postgres: port 4000, gpadmin gpdb [local] con76134 [local] cmd3 CREATE DATABASE.......................................

[stinger2] gpadmin 18625 0.3 2.726056455932 ? S 10:01:56 0:01 postgres: port 40000, gpadmin gpdb 10.5.202.12(18864) con76134 seg0 cmd4 MPPEXEC UTILITY...............................

[stinger2] gpadmin 18669 0.0 0.1 3624 752 pts/2 S 10:02:36 0:00 grep con76134

[stinger3] gpadmin 22289 0.8 9.4531860196404 ? S 09:36:20 0:05 postgres: port 40000, gpadmin gpdb 10.5.202.12(18866) con76134 seg1 cmd4 MPPEXEC UTILITY...............................

51. What kind of locks should we focus on MPP system when system is slow /hung?

Answer: Locks that are held for a very long time and multiple other queries are waiting for that lock also.

52. How do I monitor user activity history in Greenplum database ?

Answer: Use Greenplum performance monitor (gpperfmon), which has GUI to monitor and query performance history.

53. What is Greenplum performance monitor and how to install ?

Answer: It’s a monitoring tool that collects statistics on system and query performance and builds historical data.

54. when the client connects does he connect to the Master or segment node?

Answer: Master

55. Can you explain the process of data migration from Oracle to Greenplum?

Answer: There are many ways.

Simplest steps are Unload data into csv files, create tables in greenplum database corresponding to Oracle, Create external table, start gpfdist pointing to external table location, Load data into greenplum.

You can also use gpload utility. Gpload creates external table at runtime.

56. Which command would you use to backup a database?

Answer: gp_dump, gpcrondump, pg_dump, pg_dumpall, copy

57. When you restore from a backup taken from gp_dump, can you import a table?

Answer: NO. Yes if during the gp_dump you backed up one table only.

58. Can you open and view a dump file?

Answer: Yes

59. Which option would you use to export the ddl of the database or table?

Answer: -s (-s | --schema-only Dump only the object definitions (schema), not data.)

60. When a user submits a query, where does it run in Master or segment nodes?

Answer: Segment nodes

61. If you configure your with Master and Segment nodes, where would the data reside?

Answer: Segment nodes

62. How would go about query tuning? OR

What would you do when a user or users are complaining that a particular query is running slow?

Answer: 1) look at the query plan

2) Look at the stats of the table/tables in the query

3) look at the table distribution keys and joins in the query

4) look at the network performance

5) look at the resource queues

6) look at the interconnect performance

7) look at the join order of tables in the query

8) look at the query itself i.e. if it can be written in more efficient way

63. What would you do to gather statistics in the database? as well as reclaim the space?

Answer: VACUUM FULL , CTAS .

A VACUUM FULL will reclaim all expired row space, but is a very expensive operation and may take an unacceptably long time to finish on large, distributed Greenplum Database tables. If you do get into a situation where the free space map has overflowed, it may be more timely to recreate the table with a CREATE TABLE AS statement and drop the old table. A VACUUM FULL is not recommended in Greenplum Database.

64. How would you implement compression and explain possible the compression types?

Answer: There are two types of in-database compression available in the Greenplum Database for append-only tables:

1. Table-level compression is applied to an entire table.

2. Column-level compression is applied to a specific column. You can apply different column-level compression algorithms to different columns.

65. What are major differences between Oracle and Greenplum?

Answer: Oracle is relational database. Greenplum is MPP nature. Greenplum is shared nothing architecture. There are many other differences in terms of functionality and behavior.

66. What is good and bad about the Greenplum, compared to Oracle and Greenplum?

Answer: Greenplum is built on top of Postgresql . It is shared nothing, MPP architecture best for data warehousing environment. Good for big data analytics purpose.

Oracle is an all-purpose database.

67. How would troubleshoot an issue/error/problem, when there is no one available to help you or you are all by yourself?

Answer: Look at the log files. querying the gp perfmon and tools schema for various data and statistics.

68. Can you write stored procedures in Greenplum/Postgres?

Answer: No

69. Can you create stored functions and use them in Greenplum?

Answer: Yes

70. Can you do partitioning in Greenplum tables?

Answer: Yes

71. Which parameters can you use to manage workload in a Greenplum database?

Answer: workload management is done by creating resource queues and assigning various limits.

72. Tell me some of the aspects/implementations/configurations you have done in Greenplum?

Answer: Configuration like port change, Active directory authentication, pg_hba.conf changes, postgresql.conf changes, gpfdist etc

73. How to check if my session queries are running or waiting on locks?

Answer: Check "waiting" column in pg_stat_activity and "granted" column in pg_locks for any object level locks.

74. What is the maximum number of columns that can be present in the "Select" statement?

Answer: 1664

75. What is the maximum number of URL's that can be placed in the "Location" part of the external table creation?

Answer: Example:

CREATE EXTERNAL TABLE ext3 ( a int )

LOCATION ('gpfdist://mdw:8081/a1','gpfdist://mdw:8081/a2',.....,'gpfdist://mdw:8081/a999',) FORMAT 'TEXT' (DELIMITER '|');

There is no upper limit for the max number of URL's that can be added to the LOCATION clause , what restrict is the amount of size the location clause which is 32712 (32KB) , if you exceed you might hit with the error:

ERROR:  row is too big: size 42872, maximum size 32712

So if you planning to add many URL's in the Location clause , would recommend to move them on to a separate folder and start the gpfdist file server program in the background using

gpfdist -p 8081 -d <file_location> -l /home/gpadmin/log &

and write the create external command ( to the one below ) to read all the files in the location

CREATE EXTERNAL TABLE ext3 ( a int )

 LOCATION ('gpfdist://mdw:8081/*') FORMAT 'TEXT' (DELIMITER '|');

Also another limit is the FORMAT clause can take in only 8191 bytes , if you exceed you may end up with error

"ERROR: format options must be less than 8191 bytes in size".

76. What is the max text length of the SQL statement?

Answer: No limit , but Pivotal Greenplum do control the length of the text using the max_stack_depth parameter and OS control it using the stack limit defined by Unix stack size , so if the customer has a bigger text , then increase the stack size from the OS first and then alter the max_stack_depth in Greenplum and restart the database after each changes , since to take advantage of the new stack size , the green plum process needs to be recycled.

77. How to determine the disk/network speed between servers?

Answer: GPDB has a utility called "gpcheckperf" that can help to determine the speed of the Disk / Network / Memory Bandwidth

78. How to do I understand which process/query is consuming most of the segment memory?

Answer: Execute mem_watcher script available in"Pivotal Support Package" to track the memory consumption of the queries on the segments.

Review the output once an OOM is observed to identify possible queries or processes.

Follow the guide on to setup mem_watcher and guide to interpret the mem_watcher output.

79. How do i determine if my table in a query needs a analyze?

Answer: Check the "gp_stat_check" utility

80. What is an orphan process?

Answer: Orphan processes are postgres/greenplum process that exists in some segments and there is no information about its gang members on any other segments.

81. Will these orphan processes cause a performance issue?

Answer: Yes, they can if the process is holding locks on a relation on that segments and subsequent queries are waiting for this lock to be released.

82. What is data skew ?

Answer: Data skew happens if data is not equally distributed across all active segment instances.

83. what is Data directory ?

Answer: The data directory is the file system location on disk where database data is stored. The master data directory contains the global system catalog only — no user data is stored on the master. The data directory on the segment instances has user data for that segment plus a local copy of the system catalog. The data directory contains several subdirectories, control files, and configuration files as well.

84. What is Distribution key ?

Answer: In a Greenplum table that uses hash distribution, one or more columns are used as the distribution key, meaning those columns are used to divide the data among all of the segments. The distribution key should be the primary key of the table or a unique column or set of columns.

85. What is Distribution policy ?

Answer: The distribution policy determines how to divide the rows of a table among the Greenplum segments. Greenplum Database provides two types of distribution policy: hash distribution and random distribution.

86. What is Greenplum array ?

Answer: A Greenplum array aka Greenplum system, or simply Greenplum database, is a set of individual segments host designed to work together in parallel, and a master server to distribute queries across segments.

87. What is Heap tables ?

Answer: Whenever you create a table without specifying a storage structure, the default is a heap storage structure. In a heap structure, the table is an unordered collection of data that allows multiple copies or versions of a row. Heap tables have row-level versioning information and allow updates and deletes.

88. What is Master host / node ?

Answer: The master is a dedicated processing unit with its own CPU, disks, memory, and network interface. It is responsible for accepting and distributing queries. It is the entry point into the Greenplum system and does not store data.

89. What is Mirror ?

Answer: A mirror segment is a standby segment typically used in high-availability environments that is activated should its corresponding primary segment no longer be available. Mirrors are optional.

90. What is motion node ?

Answer: A motion node is a portion of a query execution plan that indicates data movement between the various database instances of Greenplum Database (segments and the master). Some operations, such as joins, require segments to send and receive tuples to one another in order to satisfy the operation. A motion node can also indicate data movement from the segments back up to the master.

91. What is Primary segment ?

Answer: A primary segment stores a distinct portion of data and is responsible for handling queries.

92. What is Processing skew ?

Answer: If you distribute your data on a good column where the data is laid out well, but some segments end up working harder than others, you introduce processing skew.

93. what is pg_catalog ?

Answer: pg_catalog contains the system catalog tables, built-in data types, functions, and operators. It is always part of the schema search path, even if it is not explicitly named in the search path.

94. What is pg_toast ?

Answer: pg_toast stores large objects such as records that exceed the page size. This schema is used internally by the Greenplum Database system.

95. What is pg_bitmapindex ?

Answer: pg_bitmapindex stores bitmap index objects such as lists of values. This schema is used internally by the Greenplum Database system.

96. What is pg_aoseg ?

Answer: pg_aoseg stores append-only table objects. This schema is used internally by the Greenplum Database system.

97. What is Query dispatcher (QD) ?

Answer: The query dispatcher (QD) is a process that is initiated when users connect to the master and issue SQL commands. This process represents a user session and is responsible for sending the query plan to the segments and coordinating the results it gets back. The query dispatcher process spawns one or more query executor processes to assist in the execution of SQL commands.

98. What is Query executor (QE) ?

Answer: A query executor process (QE) is associated with a query dispatcher (QD) process and operates on its behalf. Query executor processes run on the segment instances and execute their slice of the query plan on a segment.

99. What is Query plan (QP) ?

Answer: A query plan is the set of operations that Greenplum Database will perform to produce the answer to a given query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top. Greenplum Database supports an additional plan node type called a motion node.

100. What is Random Distribution ?

Answer: With random distribution, table rows are sent to the segments as they come in, cycling across the segments in a round-robin fashion. Rows with columns having the same values will not necessarily be located on the same segment. Although a random distribution ensures even data distribution, there are performance advantages to choosing a hash distribution policy whenever possible.

101. What is Resource queue ?

Answer: A resource queue has attributes that limit the size and/or total number of queries that can be executed by the users (or roles) in that queue.

102. Redistribute motion ?

Answer: To perform a local join on the segment instance, matching rows must be located together on the same segment instance. In the case where data was not distributed on the join key, a dynamic redistribution of the needed rows form one of the tables to another segment instance will be performed.

103. What is segment ?

Answer: A segment represents a portion of data in a Greenplum database. User-defined tables and their indexes are distributed across the available number of segment instances in the Greenplum Database system. Each segment instance contains a distinct portion of the user data. A primary segment instance and its mirror both store the same segment of data.

104. What is Segment instance ?

Answer: A Postgres database server process, simply called a segment.

105. What is Segment host / node ?

Answer: A node or segment host, in a Greenplum system is a dedicated processing unit with its own CPU, disks, memory, and network interface. The segment host runs database server processes. It supports 1-n segments depending on the number of processors / cores.

106. What is slice ?

Answer: In order to achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that can be worked on independently at the segment level. A query plan is sliced wherever a motion node occurs in the plan, one slice on each side of the motion. Plans that do not require data movement (such as catalog lookups on the master) are known as single-slice plans.

107. What is Shared-nothing architecture ?

Answer: In contrast to a shared-everything architecture in which resources are used by many different applications, a shared nothing architecture grants usage to resources to the same application throughout all the time.

108. What is System catalog ?

Answer: The global system catalog resides on the master and denotes the set of system tables that contain metadata about the Greenplum Database system itself.

109. What is Scatter / Gather technique ?

Answer: Scattering denotes the process of getting all data from all source systems to all segment instances in the Greenplum system. Each segment instance in the system gathers the data it is responsible for.

110. What is gpdbrestore and how to work ?

Answer: The Greenplum Database parallel restore utility gpdbrestore takes the timestamp key generated by gpcrondump, validates the backup set, and restores the database objects and data into a distributed database in parallel. Parallel restore operations require a complete backup set created by gpcrondump, a full backup and any required incremental backups.

The Greenplum Database gpdbrestore utility provides flexibility and verification options for use with the automated backup files produced by gpcrondump or with backup files moved from the Greenplum array to an alternate location.

gpdbrestore -T faa.otp_r -s tutorial -u /tmp -a

Manoj Simar - The Technology Expert

About Me

Friday, 16 November 2018

Interview Q and A for Greenplum DB Part - 1

No comments:

Post a Comment