Thursday, 7 June 2018

Interview Q and A for PostgreSQL Part - 9

189. What is the difference among the backup options?
Backup/Recovery options:

  Option A: SQL dump to an archive file (pg_dump -F c)
  Option B: SQL dump to a script file (pg_dump -F p or pg_dumpall)
  Option C: Filesystem backup using pg_start_backup

                                          Option A            Option B            Option C
Backup type                               Logical             Logical             Physical
Recover to point in time?                 No                  No                  Yes
Backup all databases?                     One at a time       Yes (pg_dumpall)    Yes
All databases backed up at same time?     No                  No                  Yes
Selective backup?                         Yes                 Yes                 No (Note 3)
Incremental backup?                       No                  No                  Possible (Note 4)
Selective restore?                        Yes                 Possible (Note 1)   No (Note 5)
DROP TABLE recovery                       Yes                 Yes                 Possible (Note 6)
DROP TABLESPACE recovery                  Possible (Note 2)   Possible (Note 6)   Possible (Note 6)
Compressed backup files?                  Yes                 Yes                 Yes
Backup is multiple files?                 No                  No                  Yes
Parallel backup possible?                 No                  No                  Yes
Parallel restore possible?                Yes                 No                  Yes
Restore to later release?                 Yes                 Yes                 No
Standalone backup?                        Yes                 Yes                 Yes (Note 7)
Allows DDL during backup                  No                  No                  Yes
1. If you've generated a script with pg_dump or pg_dumpall and need to restore just a single object, then you're going to need to go deep. You will need to write a Perl script (or similar) to read the file and extract out the parts you want. It's messy and time-consuming, but probably faster than restoring the whole thing to a second server, and then extracting just the parts you need with another pg_dump.
2. See recipe Recovery of a dropped/damaged tablespace.
3. Selective backup with physical backup is possible, though it will cause problems later when you try to restore. See note 6.
4. See Incremental/Differential backup.
5. Selective restore with physical backup isn't possible with currently supplied utilities.
6. See recipe for Recovery of a dropped/damaged tablespace.
7. See recipe for Standalone hot physical backup.

190. Why use pg_dump rather than pg_dumpall for backup?
There are four reasons why you shouldn't use the pg_dumpall utility, which are as follows:
  • If you use pg_dumpall, then the only output produced is a script file. Script files can't use the parallel restore feature of pg_restore, so by taking your backup in this way you will be forcing the restore to be slower than it needs to be.
  • pg_dumpall produces dumps of each database, one after another. This means that:
      - pg_dumpall is slower than running multiple pg_dump tasks in parallel, one against each database.
      - The dumps of the individual databases are not consistent to a single point in time. If you start the dump at 04:00 and it ends at 07:00, then you can't be sure exactly which time the dump relates to; it is somewhere between 04:00 and 07:00.
  • Options for pg_dumpall are similar in many ways to pg_dump, though not all of them exist, so some things aren't possible.
  • pg_dump does not dump the roles (such as users/groups) and tablespaces. Those two things are only dumped by pg_dumpall.

In summary, pg_dumpall is slower to back up, slower to restore, and gives you less control over the dump. I suggest you don't use it for those reasons.
If you have multiple databases, then I suggest you take your backup by doing one of the following:
  • Dump the global information for the database server using pg_dumpall -g, then dump all databases in parallel using a separate pg_dump for each database, taking care to check for errors if they occur.
  • Use the physical database backup technique instead.

192. Explain pg_dump and pg_dumpall
pg_dump
pg_dump is a utility for backing up a PostgreSQL database. It makes consistent backups even if
the database is being used concurrently. pg_dump runs by executing SQL statements against the database to unload data. When PostgreSQL runs an SQL statement we take a "snapshot" of currently running transactions, which freezes our viewpoint of the database. We can't (yet) share that snapshot across multiple sessions, so we cannot run an exactly consistent pg_dump in parallel in one database, nor across many databases.
The time of the snapshot is the only time we can recover to—we can't recover to a time either before or after that time. Note that the snapshot time is the start of the backup, not the end.
pg_dump produces a single output file, which can be separated into multiple pieces with the split(1) command, if required.
pg_dump into custom format is lightly compressed by default. Compression can be removed or made more aggressive.
pg_dump does not block other users accessing the database (readers or writers). When pg_dump runs, it holds the very lowest kind of lock on the tables being dumped. Those locks are designed to prevent DDL from running against the tables while the dump takes place. If a dump is started while other DDL statements are already running against those tables, then the dump will sit and wait. If you want to limit the waiting time, you can do that by setting the --lock-wait-timeout option.
pg_dump allows you to make a selective backup of tables. The -t option also allows you to specify views and sequences. There's no way to dump other object types individually using pg_dump.
pg_dump works against earlier releases of PostgreSQL, so it can be used to migrate data between releases.
pg_dump doesn't generally handle included modules very well. pg_dump isn't aware of additional tables that have been installed as part of an additional package, such as PostGIS or Slony, so it will dump those objects as well. That can cause difficulties if you then try to restore from the backup, as the additional tables may have been created as part of the software installation process in an empty server.
The snapshot for a pg_dump is taken at the beginning of a run. The file modification time will tell you when the dump finished. The dump is consistent at the time of the snapshot, so you may want to know that time. If you are making a script dump, you can do a dump verbose as follows:
pg_dump -v
which then adds the time to the top of the script. Custom dumps store the start time as well and that can be accessed using the following:
pg_restore --schema-only -v dumpfile | head | grep Started
-- Started on 2010-06-03 09:05:46 BST
Notes
If your database cluster has any local additions to the template1 database, be careful to restore the output of pg_dump into a truly empty database; otherwise you are likely to get errors due to duplicate
definitions of the added objects. To make an empty database without any local additions, copy from
template0 not template1, for example:
CREATE DATABASE foo WITH TEMPLATE template0;
When a data-only dump is chosen and the option --disable-triggers is used, pg_dump emits commands to disable triggers on user tables before inserting the data, and then commands to re-enable
them after the data has been inserted. If the restore is stopped in the middle, the system catalogs might
be left in the wrong state.
Members of tar archives are limited to a size less than 8 GB. (This is an inherent limitation of the
tar file format.) Therefore this format cannot be used if the textual representation of any one table
exceeds that size. The total size of a tar archive and any of the other output formats is not limited,
except possibly by the operating system.
The dump file produced by pg_dump does not contain the statistics used by the optimizer to make
query planning decisions. Therefore, it is wise to run ANALYZE after restoring from a dump file to ensure optimal performance. The dump file also does not contain any ALTER DATABASE ... SET commands; these settings are dumped by pg_dumpall, along with database users and other installation-wide settings.
Because pg_dump is used to transfer data to newer versions of PostgreSQL, the output of pg_dump
can be expected to load into PostgreSQL server versions newer than pg_dump’s version. pg_dump can
also dump from PostgreSQL servers older than its own version. (Currently, servers back to version 7.0
are supported.) However, pg_dump cannot dump from PostgreSQL servers newer than its own major version; it will refuse to even try, rather than risk making an invalid dump. Also, it is not guaranteed
that pg_dump’s output can be loaded into a server of an older major version — not even if the dump
was taken from a server of that version. Loading a dump file into an older server may require manual
editing of the dump file to remove syntax not understood by the older server.
Examples
To dump a database called mydb into a SQL-script file:
$ pg_dump mydb > db.sql
To reload such a script into a (freshly created) database named newdb:
$ psql -d newdb -f db.sql
To dump a database into a custom-format archive file:
$ pg_dump -Fc mydb > db.dump
To dump a database into a directory-format archive:
$ pg_dump -Fd mydb -f dumpdir

To dump a database into a directory-format archive in parallel with 5 worker jobs:
$ pg_dump -Fd mydb -j 5 -f dumpdir
To reload an archive file into a (freshly created) database named newdb:
$ pg_restore -d newdb db.dump
To dump a single table named mytab:
$ pg_dump -t mytab mydb > db.sql
To dump all tables whose names start with emp in the detroit schema, except for the table named
employee_log:
$ pg_dump -t 'detroit.emp*' -T detroit.employee_log mydb > db.sql
To dump all schemas whose names start with east or west and end in gsm, excluding any schemas
whose names contain the word test:
$ pg_dump -n 'east*gsm' -n 'west*gsm' -N '*test*' mydb > db.sql
The same, using regular expression notation to consolidate the switches:
$ pg_dump -n '(east|west)*gsm' -N '*test*' mydb > db.sql
To dump all database objects except for tables whose names begin with ts_:
$ pg_dump -T ’ts_*’ mydb > db.sql
To specify an upper-case or mixed-case name in -t and related switches, you need to double-quote
the name; else it will be folded to lower case (see Patterns). But double quotes are special to the
shell, so in turn they must be quoted. Thus, to dump a single table with a mixed-case name, you need
something like
$ pg_dump -t "\"MixedCaseName\"" mydb > mytab.sql


[postgres@db ~]$ pg_dump -?
pg_dump dumps a database as a text file or to other formats.

Usage:
  pg_dump [OPTION]... [DBNAME]

General options:
  -f, --file=FILENAME          output file or directory name
  -F, --format=c|d|t|p         output file format (custom, directory, tar,
                               plain text (default))
  -v, --verbose                verbose mode
  -V, --version                output version information, then exit
  -Z, --compress=0-9           compression level for compressed formats
  --lock-wait-timeout=TIMEOUT  fail after waiting TIMEOUT for a table lock
  -?, --help                   show this help, then exit

Options controlling the output content:
  -a, --data-only              dump only the data, not the schema
  -b, --blobs                  include large objects in dump
  -c, --clean                  clean (drop) database objects before recreating
  -C, --create                 include commands to create database in dump
  -E, --encoding=ENCODING      dump the data in encoding ENCODING
  -n, --schema=SCHEMA          dump the named schema(s) only
  -N, --exclude-schema=SCHEMA  do NOT dump the named schema(s)
  -o, --oids                   include OIDs in dump
  -O, --no-owner               skip restoration of object ownership in
                               plain-text format
  -s, --schema-only            dump only the schema, no data
  -S, --superuser=NAME         superuser user name to use in plain-text format
  -t, --table=TABLE            dump the named table(s) only
  -T, --exclude-table=TABLE    do NOT dump the named table(s)
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --binary-upgrade             for use by upgrade utilities only
  --column-inserts             dump data as INSERT commands with column names
  --disable-dollar-quoting     disable dollar quoting, use SQL standard quoting
  --disable-triggers           disable triggers during data-only restore
  --exclude-table-data=TABLE   do NOT dump data for the named table(s)
  --inserts                    dump data as INSERT commands, rather than COPY
  --no-security-labels         do not dump security label assignments
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data
  --quote-all-identifiers      quote all identifiers, even if not key words
  --section=SECTION            dump named section (pre-data, data, or post-data)
  --serializable-deferrable    wait until the dump can run without anomalies
  --use-set-session-authorization
                               use SET SESSION AUTHORIZATION commands instead of
                               ALTER OWNER commands to set ownership

Connection options:
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump

If no database name is supplied, then the PGDATABASE environment
variable value is used.

Report bugs to <pgsql-bugs@postgresql.org>.

pg_dumpall
pg_dumpall is a utility for writing out (“dumping”) all PostgreSQL databases of a cluster into one
script file. The script file contains SQL commands that can be used as input to psql to restore the
databases. It does this by calling pg_dump for each database in a cluster. pg_dumpall also dumps
global objects that are common to all databases. (pg_dump does not save these objects.) This currently includes information about database users and groups, tablespaces, and properties such as access permissions that apply to databases as a whole.
Since pg_dumpall reads tables from all databases you will most likely have to connect as a database
superuser in order to produce a complete dump. Also you will need superuser privileges to execute
the saved script in order to be allowed to add users and groups, and to create databases.
The SQL script will be written to the standard output. Use the -f/--file option or shell operators to
redirect it into a file.
pg_dumpall needs to connect several times to the PostgreSQL server (once per database). If you use
password authentication it will ask for a password each time. It is convenient to have a ~/.pgpass
file in such cases.
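A ~/.pgpass entry has the format hostname:port:database:username:password, one connection per line, and the file must be readable only by its owner (chmod 0600 ~/.pgpass). For example (the values below are only placeholders):
localhost:5432:*:postgres:MySecretPassword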
Notes
Since pg_dumpall calls pg_dump internally, some diagnostic messages will refer to pg_dump.
Once restored, it is wise to run ANALYZE on each database so the optimizer has useful statistics. You
can also run vacuumdb -a -z to analyze all databases.
pg_dumpall requires all needed tablespace directories to exist before the restore; otherwise, database
creation will fail for databases in non-default locations.
Examples
To dump all databases:
$ pg_dumpall > db.out
To reload database(s) from this file, you can use:
$ psql -f db.out postgres

[postgres@db ~]$ pg_dumpall -?
pg_dumpall extracts a PostgreSQL database cluster into an SQL script file.
                                         
Usage:
  pg_dumpall [OPTION]...

General options:
  -f, --file=FILENAME          output file name
  -V, --version                output version information, then exit
  --lock-wait-timeout=TIMEOUT  fail after waiting TIMEOUT for a table lock
  -?, --help                   show this help, then exit

Options controlling the output content:
  -a, --data-only              dump only the data, not the schema
  -c, --clean                  clean (drop) databases before recreating
  -g, --globals-only           dump only global objects, no databases
  -o, --oids                   include OIDs in dump
  -O, --no-owner               skip restoration of object ownership
  -r, --roles-only             dump only roles, no databases or tablespaces
  -s, --schema-only            dump only the schema, no data
  -S, --superuser=NAME         superuser user name to use in the dump
  -t, --tablespaces-only       dump only tablespaces, no databases or roles
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --binary-upgrade             for use by upgrade utilities only
  --column-inserts             dump data as INSERT commands with column names
  --disable-dollar-quoting     disable dollar quoting, use SQL standard quoting
  --disable-triggers           disable triggers during data-only restore
  --inserts                    dump data as INSERT commands, rather than COPY
  --no-security-labels         do not dump security label assignments
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data
  --quote-all-identifiers      quote all identifiers, even if not key words
  --use-set-session-authorization
                               use SET SESSION AUTHORIZATION commands instead of
                               ALTER OWNER commands to set ownership

Connection options:
  -h, --host=HOSTNAME      database server host or socket directory
  -l, --database=DBNAME    alternative default database
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump

If -f/--file is not used, then the SQL script will be written to the standard
output.

Report bugs to <pgsql-bugs@postgresql.org>.
192. Hot logical backup of all databases
My recommendation is to do exactly what you did for a single database, once for each database in your cluster.
You can run those individual dumps in parallel if you want to speed things up.
Once complete, dump the global information also, using the following:
pg_dumpall -g
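As a minimal sketch of the whole procedure (db1, db2, db3 and the file names are only placeholders, and error handling is kept to a bare minimum):
pg_dumpall -g > globals.sql                # global objects only: roles and tablespaces

for db in db1 db2 db3                      # placeholder database names
do
    pg_dump -F c -f "$db".dump "$db" &     # one custom-format dump per database, run in parallel
done
wait                                       # wait for all dumps to finish; check exit codes and logs for errors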

193. How to take backups of large databases
Some operating systems have maximum file size limits that cause problems when creating large
pg_dump output files. Fortunately, pg_dump can write to the standard output, so you can use standard Unix tools to work around this potential problem. There are several possible methods:
Use compressed dumps. You can use your favorite compression program, for example gzip:
pg_dump dbname | gzip > filename.gz
Reload with:
gunzip -c filename.gz | psql dbname
or:
cat filename.gz | gunzip | psql dbname
Use split. The split command allows you to split the output into smaller files that are acceptable
in size to the underlying file system. For example, to make chunks of 1 megabyte:
pg_dump dbname | split -b 1m - filename
Reload with:
cat filename* | psql dbname
Use pg_dump’s custom dump format. If PostgreSQL was built on a system with the zlib compression library installed, the custom dump format will compress data as it writes it to the output file. This will produce dump file sizes similar to using gzip, but it has the added advantage that tables can be restored selectively. The following command dumps a database using the custom dump format:
pg_dump -Fc dbname > filename
A custom-format dump is not a script for psql, but instead must be restored with pg_restore, for
example:
pg_restore -d dbname filename
For very large databases, you might need to combine split with one of the other two approaches.
Use pg_dump’s parallel dump feature. To speed up the dump of a large database, you can use
pg_dump’s parallel mode. This will dump multiple tables at the same time. You can control the degree of parallelism with the -j parameter. Parallel dumps are only supported for the "directory" archive format.
pg_dump -j num -F d -f out.dir dbname
You can use pg_restore -j to restore a dump in parallel. This will work for any archive of either
the "custom" or the "directory" archive mode, whether or not it has been created with pg_dump -j.

194. Hot logical backup of all tables in a tablespace
Sometimes we may wish to make a dump of tables and data in a tablespace. Unfortunately, there isn't a simple command to do this, so we need to write some reusable scripts.
It is possible for a tablespace to contain objects from more than one database, so run the following query to see which databases you need to dump from:
SELECT datname FROM pg_database WHERE oid IN ( SELECT pg_tablespace_databases(ts.oid) FROM pg_tablespace ts WHERE spcname = 'mytablespacename');
The following procedure allows you to dump all tables that reside on one tablespace and within one database only.
Create a file named onets.sql that contains the following SQL, which builds a single pg_dump command line covering all tables in a tablespace:
SELECT 'pg_dump '
    || string_agg('-t ' || n.nspname || '.' || t.relname, ' ')
    || ' -F c > dumpfile'  -- dumpfile is the name of the output file
FROM pg_class t
JOIN pg_tablespace ts ON t.reltablespace = ts.oid AND ts.spcname = :TSNAME
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE t.relkind = 'r';     -- ordinary tables only
Execute the query to build the pg_dump script:
psql -t -v TSNAME="'mytablespace'" -f onets.sql > get_my_ts
Then execute the generated script to dump the tables in the tablespace, including data and definitions. The output file is named "dumpfile", as set at the end of the command built in the first step.
chmod 755 get_my_ts
./get_my_ts

pg_dump allows you to specify more than one table on the command line, so it's possible to generate a list of tables directly from the database.
We use the named parameter in psql to create a parameterized script, which we then execute to create a dump.

195. Backup of database object definitions
Sometimes it's useful to get a dump of the object definitions that make up a database. This is useful for comparing what's in the database against the definitions in a data or object-modeling tool. It's also useful to make sure you can recreate objects in exactly the correct schema, tablespace, and database with the correct ownership and permissions.

The basic command to dump the definitions only is to use the following:
pg_dumpall --schema-only > myscriptdump.sql
This includes all objects: roles, tablespaces, databases, schemas, tables, indexes, triggers, constraints, views, functions, ownership, and privileges.
If you want to dump PostgreSQL role definitions, you can use the following:
pg_dumpall --roles-only > myroles.sql
If you want to dump PostgreSQL tablespace definitions, you can use the following:
pg_dumpall --tablespaces-only > mytablespaces.sql
Or if you want to dump both roles and tablespaces, then you can use the following:
pg_dumpall --globals-only > myglobals.sql
The output is a human-readable script file that can be re-executed to re-create each of the databases.
In PostgreSQL, the word "schema" is also used to describe a set of related database objects similar to a directory, also known as a "namespace". Be careful that you don't confuse what is happening here. The --schema-only option makes a backup of the "database schema" - the definitions of all objects in the database (and in all namespaces). To make a backup of the data and definitions in just one namespace, use the -n option. Or, to make a backup of only the definitions in just one namespace, use both -n and --schema-only together.
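For example, assuming a namespace (schema) named mynamespace in database mydb:
$ pg_dump -n mynamespace mydb > mynamespace.sql
$ pg_dump -n mynamespace --schema-only mydb > mynamespace_definitions.sql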

196. Why is VACUUM required with MVCC?
Under MVCC, UPDATE and DELETE leave behind dead row versions. VACUUM removes those dead row versions from tables and indexes and marks the space as available for future reuse, and it also updates the Free Space Map. VACUUM FULL goes further: it rewrites the table and returns the freed space to the operating system.
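For example (mytable is a placeholder table name):
VACUUM VERBOSE mytable;   -- removes dead row versions and records the free space for reuse within the table
VACUUM FULL mytable;      -- rewrites the table and returns the freed space to the operating system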

197. How can you know that your operations are spilling to disk?
Run EXPLAIN ANALYZE for your query; if you see a line like Sort Method: external merge Disk: 88342kB, your operation is spilling to disk.
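For example (big_table and some_column are placeholders; the exact plan output varies by version and query):
EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
...
   Sort Method: external merge  Disk: 88342kB
...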

198. Is spilling to disk a bad operation?
Yes. It requires extra CPU and I/O for the disk operation, and it is one of the biggest causes of poor query performance.

199. What is the solution?
The simple solution is to review the configuration of the work_mem parameter. If you do a lot of complex sorts and have a lot of memory, then increasing the work_mem parameter allows PostgreSQL to do larger in-memory sorts which, unsurprisingly, will be faster than disk-based equivalents.

200. How can you configure work_mem?
A work_mem of at least 16MB would keep the intermediate data in memory and likely improve the query response time.
You can set it at the server level for general use (in postgresql.conf).
You can also set it at the session or transaction level, as in the sample below. Increase the setting until you see Memory instead of Disk in the EXPLAIN output.
BEGIN;
SET LOCAL work_mem = '50MB';
SELECT * FROM Table_Name;
COMMIT;

201. What is REINDEX?
REINDEX rebuilds one or more indexes from scratch using the data in their tables, which is useful when an index has become bloated or corrupted.
You can execute the REINDEX command on a database, a table, or a single index.
You can connect to PostgreSQL using pgAdmin or psql to execute this command.
Reindex all indexes of a database:
REINDEX DATABASE Database_Name;
Reindex a particular index:
REINDEX INDEX Index_Name;
Reindex a particular table:
REINDEX TABLE Table_Name;

202. What is BRIN (Block Range Index)?
BRIN is a newer database indexing technique, mainly designed for very large tables.
We know about the table partitioning concept, which we usually have to implement on very large tables.
A BRIN index works somewhat like horizontal table partitioning, by dividing the table into different blocks of data.
If we are using BRIN, we may not need to implement table partitioning explicitly.
The BRIN index divides the table into block ranges and records the minimum and maximum values for each range.
Every block range has a minimum and a maximum value associated with it, which are used to identify the related block of data.
A block range is a group of pages that are physically adjacent in the table; for each block range, some summary information is stored by the index.
A BRIN index is very lightweight because it stores only a summary of each data block. Unlike a B-Tree index, it does not have to store every key value in its index pages.
Because of this compact form, even an entire BRIN index can easily be kept in memory, which reduces disk scanning.
Because only block summaries are stored, it is also much faster (reportedly up to 10 times) than a B-Tree for INSERT and UPDATE operations.
Creating a BRIN index on a large table is also much faster than creating a B-Tree index.
The efficiency of a BRIN index also depends on the physical order of the table data.
BRIN index scans work through a bitmap and return only those block ranges whose summary information matches the query, so scanning large parts of the table is avoided.
B-Tree indexes are still superior for random searches; BRIN indexes are best for the specific types of queries that are common in big data, reporting, and data warehouse workloads.
PostgreSQL 9.5 introduced the BRIN index, which for these workloads performs much faster than a regular B-Tree index.
The two most important points about BRIN are: it stores only the minimum and maximum value per block range, so it requires very little space, and for extremely large tables it can run faster than other index types.
Create a BRIN index on the TransactionDate column:
CREATE INDEX idx_tbl_ItemTransactions_TransactionDate
ON tbl_ItemTransactions
USING BRIN (TransactionDate);
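A typical query that benefits from this index is a range scan on the indexed column, for example (the date range is only an illustration):
SELECT COUNT(*)
FROM tbl_ItemTransactions
WHERE TransactionDate BETWEEN '2018-01-01' AND '2018-03-31';
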
203. Using PSQL, Disable Autocommit Globally
Log in with the psql tool and execute the command below (psql variables are case-sensitive, so AUTOCOMMIT must be written in upper case):
\set AUTOCOMMIT off
To enable it again, execute:
\set AUTOCOMMIT on

204. Which statistics tables are used by Postgres?
The PostgreSQL database server mainly relies on two statistics catalogs/views:
SELECT * FROM pg_class;
SELECT * FROM pg_stats;
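For example, to look at the planner statistics collected for one table (mytable is a placeholder):
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'mytable';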

205. What is difference between MySQL and PostgreSQL?

MySQL is controlled by Oracle, whereas Postgres is available under an open-source license from the PostgreSQL Global Development Group.
PostgreSQL is closer to the ANSI SQL standard, whereas MySQL is closer to the ODBC standard.
One of the most notable differences is that MySQL has long had limitations on nested subqueries/subselects, and it doesn't support CTEs, window functions, full outer joins, or arrays.
Compared to PostgreSQL, MySQL has its own large user community and a wide variety of learning materials.
MySQL uses the nonstandard '#' to begin a comment line, whereas PostgreSQL uses '--'.
MySQL provides different table storage engines such as InnoDB, MyISAM, and Memory; PostgreSQL does not provide multiple table storage engines.
PostgreSQL is case-sensitive for string comparisons: the field "DbRnd" is not the same as the field "dbrnd". This is a big change for many users coming from MySQL and other smaller database systems.
PostgreSQL has some nice features such as generate_series, custom aggregate functions, and arrays; on the other hand, MySQL also has some nice features such as session variables in queries and FORCE INDEX.
In PostgreSQL, when you are inside a transaction almost any operation can be undone. There are some irreversible operations (like creating or destroying a database or tablespace), but normal table modifications can be backed out by issuing a ROLLBACK via its Write-Ahead Log design.
MySQL doesn’t support any sort of rollback when using MyISAM. With InnoDB, the server has an implicit commit that occurs even if the normal auto-commit behavior is turned off. This means that any single table alteration or similar change is immediately committed.
Postgres obliterates MySQL in all manner of query-level goodness. We can use Postgres for data mining and batch analytics databases, where we run lots of very large, hand-coded queries that can take advantage of Postgres' advantages in join methods (MySQL only supports Nested Loop and Nested Loop over Index, while Postgres supports numerous methods), its generally richer querying capabilities, and other good stuff, such as partial indexes, which can be used extensively.
PostgreSQL uses a robust locking model called MVCC that limits situations where individual clients interfere with each other. A short summary of the main benefit of MVCC would be “readers are never blocked by writers”.
MySQL’s InnoDB implements MVCC using a rollback segment and InnoDB databases supports all four SQL standard transaction isolation levels.
PostgreSQL uses cost-based query optimization methods to get good performance for many different types of joins. Query costs are assessed based on planner statistics collected when tables are analyzed combined with adjustable planner costs and advanced features such as the Genetic Query Optimizer.
MySQL doesn’t have this level of sophistication in its planner, and the tunables for Controlling Query Optimizer Performance are crude. Developers instead do things like explicitly provide index hints to make sure joins are executed correctly.
PostgreSQL has different replication options than MySQL. MySQL has no point-in-time recovery.
One operation that PostgreSQL is known to perform slowly is a full count of rows in a table, typically using this SQL:
SELECT COUNT(*) FROM your_table_name;
 The reason why this is slow is related to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for “COUNT(*)” to summarize data across the whole table; PostgreSQL must walk through all rows, in some sense.
In MySQL, MyISAM tables cache the row count, making this type of count operation almost instant. But if you're using InnoDB instead, this is no longer the case, and you can't assume that a full row count will be fast.

206. How to Increase the Performance of Bulk Data Load

Tuning checkpoint parameters can increase the performance of bulk data loads.

If our PostgreSQL server is set up for heavy bulk operations, then in addition to the memory parameters we should also look at the checkpoint-related parameters.

Note that from PostgreSQL 9.5 onwards you can use min_wal_size and max_wal_size instead of checkpoint_segments.

207. What is a Checkpoint?
In simple words, a checkpoint writes all dirty pages from memory to disk, so that the pages held in the shared_buffers area are clean again.
It does not require any system-wide lock, and it makes sure that the data has been written to disk.
If our Postgres server crashes, everything that happened between the last checkpoint and the time PostgreSQL stopped has to be replayed from the WAL, so the checkpoint position also determines how much work crash recovery has to redo.

208. Why Checkpoint is an Important parameter?
If we have a 1 GB database and 15 GB of memory, we can easily keep the whole database in memory, which gives the best performance because pages rarely need to be written out to, or read back from, disk.
Now let's say the PostgreSQL checkpoint has not run for the last 48 hours and we have been working only with data pages that are in memory.
If the system then crashes because of a power failure, a lot of work has not been written to the data files on disk, so the next time Postgres starts it takes a long time to come up because it has to recover the last 48 hours of changes from the WAL segments.
That's why the proper configuration of the checkpoint parameters is very important.

209. What are the different Checkpoint Parameters?

checkpoint_segments (integer): The default value is 3, and every WAL segment has a 16 MB capacity for storing log information.
checkpoint_timeout (integer): The default is 5min, and it is the maximum time between automatic WAL checkpoints. If a checkpoint runs every minute, only the pages dirtied during that minute have to be written to disk; every 5 minutes, more pages; every hour, even more pages.
checkpoint_completion_target (floating point): The default is 0.5, and it is the target for checkpoint completion, expressed as a fraction of the time between checkpoints.
checkpoint_warning (integer): Writes a message to the server log if checkpoints caused by filling WAL segments happen closer together than this amount of time.

210. How we can improve the performance for Bulk data operations?
We should increase the values of checkpoint_segments, checkpoint_completion_target, and checkpoint_timeout.
When we increase a parameter such as checkpoint_timeout from 5min to 20min, it reduces disk I/O because the same pages are no longer written out again and again.
Whenever we have a large shared_buffers area, we should increase the checkpoint parameters accordingly.
I suggest applying these changes only on systems intended for large bulk data load operations.
For a regular load, the default PostgreSQL settings are fine.
Example of parameters for PostgreSQL optimization:
shared_buffers = 512MB (default: 32MB)
effective_cache_size = 1024MB (default: 128MB)
checkpoint_segments = 32 (default: 3)
checkpoint_completion_target = 0.9 (default: 0.5)
default_statistics_target = 1000 (default: 100)
work_mem = 100MB (default: 1MB)
maintenance_work_mem = 256MB (default: 16MB)
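On PostgreSQL 9.5 and later, checkpoint_segments has been replaced by min_wal_size and max_wal_size, so set those instead; for example (illustrative values only):
min_wal_size = 512MB (default: 80MB)
max_wal_size = 2GB (default: 1GB)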

211: How to increase the performance of Bulk Insertion (INSERT)

We should optimize bulk insertion because it can also consume a lot of system resources.
A few steps to increase the performance of bulk insertion:
  • If it is a one-time data load, drop all indexes and recreate them after the INSERT. For regular bulk insertions, review the indexes and try to optimize them.
  • Take a backup of all the triggers and drop them for the duration of the load.
  • Use the COPY command instead of INSERT (see the sketch after this list).
  • If you cannot use COPY and have to use multi-row INSERT, do not put more than about 1,000 value rows in a single INSERT; all the values of a multi-row INSERT are held in memory, which may decrease INSERT performance.
  • Disable other database logging processes.
  • If the bulk insertion runs in a single transaction, you can also disable all FOREIGN KEY constraints and re-enable them afterwards.
  • Set synchronous_commit = off to reduce the cost of committing.
  • Change the default checkpoint configuration: increase the checkpoint_segments value.
  • Increase the size of shared_buffers, effective_cache_size, and work_mem.
  • Prepare statements on the server side.
  • Disable autovacuum on the table.
  • Disable automatic database backup processes during the load.
  • Increase system resources (memory, disk, CPU) if required.
  • Use SSD storage for faster disk I/O.
Note: Whatever you changed for the bulk insertion, do not forget to revert those settings after the bulk insertion completes.
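As a minimal sketch combining several of these points (the table name, column list, and file path are only placeholders):
BEGIN;
SET LOCAL synchronous_commit = off;    -- cheaper commits for this transaction only
COPY tbl_ItemTransactions (TransactionDate, ItemID, Quantity)
     FROM '/tmp/item_transactions.csv'
     WITH (FORMAT csv, HEADER true);   -- COPY is much faster than row-by-row INSERTs
COMMIT;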
