189. What is the difference among backup options?
Backup/Recovery options compared. The three options are:
  (A) SQL dump to an archive file: pg_dump -F c
  (B) SQL dump to a script file: pg_dump -F p or pg_dumpall
  (C) Filesystem backup using pg_start_backup

                                      | (A)               | (B)               | (C)
Backup type                           | Logical           | Logical           | Physical
Recover to point in time?             | No                | No                | Yes
Backup all databases?                 | One at a time     | Yes (pg_dumpall)  | Yes
All databases backed up at same time? | No                | No                | Yes
Selective backup?                     | Yes               | Yes               | No (Note 3)
Incremental backup?                   | No                | No                | Possible (Note 4)
Selective restore?                    | Yes               | Possible (Note 1) | No (Note 5)
DROP TABLE recovery                   | Yes               | Yes               | Possible (Note 6)
DROP TABLESPACE recovery              | Possible (Note 2) | Possible (Note 6) | Possible (Note 6)
Compressed backup files?              | Yes               | Yes               | Yes
Backup is multiple files?             | No                | No                | Yes
Parallel backup possible?             | No                | No                | Yes
Parallel restore possible?            | Yes               | No                | Yes
Restore to later release?             | Yes               | Yes               | No
Standalone backup?                    | Yes               | Yes               | Yes (Note 7)
Allows DDL during backup              | No                | No                | Yes
1. If you've generated a script with pg_dump or
pg_dumpall and
need to restore just a single object, then you're going to need to go deep. You
will need to write a Perl script (or similar) to read the file and extract out
the parts you want. It's messy and time-consuming, but probably faster than
restoring the whole thing to a second server, and then extracting just the
parts you need with another pg_dump.
2. See recipe Recovery of a
dropped/damaged tablespace.
3. Selective backup with physical backup is possible, though it will cause problems later when you try to restore. See note 6.
4. See Incremental/Differential backup.
5. Selective restore with physical backup
isn't possible with currently supplied utilities.
6. See recipe for Recovery of a
dropped/damaged tablespace.
7. See recipe for Standalone hot physical backup.
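For reference, a minimal sketch of the three backup styles compared above (the database name mydb, the backup paths, and the data directory location are assumptions; the physical backup also assumes WAL archiving has already been configured):
pg_dump -F c mydb > /backup/mydb.dump             # (A) custom-format archive of one database
pg_dump -F p mydb > /backup/mydb.sql              # (B) plain SQL script for one database...
pg_dumpall > /backup/cluster.sql                  #     ...or a script for the whole cluster
psql -c "SELECT pg_start_backup('nightly');"      # (C) physical backup: mark the start,
tar czf /backup/base.tar.gz -C /var/lib/pgsql data  #    copy the data directory,
psql -c "SELECT pg_stop_backup();"                #     then mark the end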
190. Why use pg_dump rather than
pg_dumpall for backup
I have four reasons why you shouldn't use the pg_dumpall utility, which are as follows:
- If you use pg_dumpall, the only output produced is a script file. Script files can't use the parallel restore feature of pg_restore, so by taking your backup this way you are forcing the restore to be slower than it needs to be.
- pg_dumpall produces dumps of each database, one after another. This means that:
  - pg_dumpall is slower than running multiple pg_dump tasks in parallel, one against each database.
  - The dumps of individual databases are not consistent to a particular point in time. If you start the dump at 04:00 and it ends at 07:00, then you can't be sure exactly when the dump relates to; it is sometime between 04:00 and 07:00.
- The options for pg_dumpall are similar in many ways to pg_dump's, though not all of them exist, so some things aren't possible.
- pg_dump does not dump the roles (such as users/groups) and tablespaces. Those two things are only dumped by pg_dumpall.
In summary, pg_dumpall is slower to back up, slower to restore, and gives you less control over the dump. I suggest you don't use it for those reasons.
If you have multiple databases, then I suggest you take your backup by doing either of the following:
- Dump the global information for the database server using pg_dumpall -g, then dump all databases in parallel using a separate pg_dump for each database, taking care to check for errors if they occur.
- Use the physical database backup technique instead.
191. Explain PG_DUMP and PG_DUMPALL?
PG_DUMP
pg_dump is a utility
for backing up a PostgreSQL database. It makes consistent backups even if
the database is being used concurrently. pg_dump
runs by executing SQL statements against the database to unload data. When
PostgreSQL runs an SQL statement we take a "snapshot" of currently
running transactions, which freezes our viewpoint of the database. We can't
(yet) share that snapshot across multiple sessions, so we cannot run an exactly
consistent pg_dump in parallel in one database, nor across many databases.
The time of the
snapshot is the only time we can recover to—we can't recover to a time either
before or after that time. Note that the snapshot time is the start of the
backup, not the end.
pg_dump produces a single output file. You can use the split(1) command on the output file to separate it into multiple pieces, if required.
pg_dump into custom
format is lightly compressed by default. Compression can be removed or made
more aggressive.
pg_dump does not block other users accessing the database (readers or writers). While pg_dump runs, it holds the very lowest kind of lock on the tables being dumped. Those locks are designed to prevent DDL from running against the tables while the dump takes place. If a dump is run at the point that other DDL is already running, then the dump will sit and wait. If you want to limit the waiting time, you can do that by setting the --lock-wait-timeout option.
pg_dump allows
you to make a selective backup of tables. The -t option also allows you to specify views and sequences. There's no way to
dump other object types individually using pg_dump.
pg_dump works against
earlier releases of PostgreSQL, so it can be used to migrate data between
releases.
pg_dump doesn't
generally handle included modules very well. pg_dump isn't aware of additional
tables that have been installed as part of an additional package, such as PostGIS
or Slony, so it will dump those objects as well. That can cause difficulties if
you then try to restore from the backup, as the additional tables may have been
created as part of the software installation process in an empty server.
The snapshot for a pg_dump is taken at the beginning of a run. The file modification time will tell you when the dump finished. The dump is consistent at the time of the snapshot, so you may want to know that time. If you are making a script dump, you can run the dump in verbose mode as follows:
pg_dump -v
which then adds the time to the top of the script. Custom dumps store the start time as well, and it can be accessed using the following:
pg_restore --schema-only -v dumpfile | head | grep Started
-- Started on 2010-06-03 09:05:46 BST
Notes
If your database cluster has any local additions to the template1 database, be careful to restore the output of pg_dump into a truly empty database; otherwise you are likely to get errors due to duplicate definitions of the added objects. To make an empty database without any local additions, copy from template0 not template1, for example:
CREATE DATABASE foo WITH TEMPLATE template0;
When a data-only dump is chosen and the option --disable-triggers is used, pg_dump emits commands to disable triggers on user tables before inserting the data, and then commands to re-enable them after the data has been inserted. If the restore is stopped in the middle, the system catalogs might be left in the wrong state.
Members of tar archives are limited to a size less than 8 GB. (This is an inherent limitation of the tar file format.) Therefore this format cannot be used if the textual representation of any one table exceeds that size. The total size of a tar archive and any of the other output formats is not limited, except possibly by the operating system.
The dump file produced by pg_dump does not contain the statistics used by the optimizer to make query planning decisions. Therefore, it is wise to run ANALYZE after restoring from a dump file to ensure optimal performance. The dump file also does not contain any ALTER DATABASE ... SET commands; these settings are dumped by pg_dumpall, along with database users and other installation-wide settings.
Because pg_dump is used to transfer data to newer versions of PostgreSQL, the output of pg_dump can be expected to load into PostgreSQL server versions newer than pg_dump's version. pg_dump can also dump from PostgreSQL servers older than its own version. (Currently, servers back to version 7.0 are supported.) However, pg_dump cannot dump from PostgreSQL servers newer than its own major version; it will refuse to even try, rather than risk making an invalid dump. Also, it is not guaranteed that pg_dump's output can be loaded into a server of an older major version, not even if the dump was taken from a server of that version. Loading a dump file into an older server may require manual editing of the dump file to remove syntax not understood by the older server.
Examples
To dump a database called mydb into a SQL-script file:
$ pg_dump mydb > db.sql
To reload such a script into a (freshly created) database named newdb:
$ psql -d newdb -f db.sql
To dump a database into a custom-format archive file:
$ pg_dump -Fc mydb > db.dump
To dump a database into a directory-format archive:
$ pg_dump -Fd mydb -f dumpdir
To dump a database into a directory-format archive in parallel with 5 worker jobs:
$ pg_dump -Fd mydb -j 5 -f dumpdir
To reload an archive file into a (freshly created) database named newdb:
$ pg_restore -d newdb db.dump
To dump a single table named mytab:
$ pg_dump -t mytab mydb > db.sql
To dump all tables whose names start with emp in the detroit schema, except for the table named employee_log:
$ pg_dump -t 'detroit.emp*' -T detroit.employee_log mydb > db.sql
To dump all schemas whose names start with east or west and end in gsm, excluding any schemas whose names contain the word test:
$ pg_dump -n 'east*gsm' -n 'west*gsm' -N '*test*' mydb > db.sql
The same, using regular expression notation to consolidate the switches:
$ pg_dump -n '(east|west)*gsm' -N '*test*' mydb > db.sql
To dump all database objects except for tables whose names begin with ts_:
$ pg_dump -T 'ts_*' mydb > db.sql
To specify an upper-case or mixed-case name in -t and related switches, you need to double-quote the name; else it will be folded to lower case (see Patterns). But double quotes are special to the shell, so in turn they must be quoted. Thus, to dump a single table with a mixed-case name, you need something like:
$ pg_dump -t "\"MixedCaseName\"" mydb > mytab.sql
[postgres@db ~]$ pg_dump -?
pg_dump dumps a database as a text file or to other formats.

Usage:
  pg_dump [OPTION]... [DBNAME]

General options:
  -f, --file=FILENAME          output file or directory name
  -F, --format=c|d|t|p         output file format (custom, directory, tar, plain text (default))
  -v, --verbose                verbose mode
  -V, --version                output version information, then exit
  -Z, --compress=0-9           compression level for compressed formats
  --lock-wait-timeout=TIMEOUT  fail after waiting TIMEOUT for a table lock
  -?, --help                   show this help, then exit

Options controlling the output content:
  -a, --data-only              dump only the data, not the schema
  -b, --blobs                  include large objects in dump
  -c, --clean                  clean (drop) database objects before recreating
  -C, --create                 include commands to create database in dump
  -E, --encoding=ENCODING      dump the data in encoding ENCODING
  -n, --schema=SCHEMA          dump the named schema(s) only
  -N, --exclude-schema=SCHEMA  do NOT dump the named schema(s)
  -o, --oids                   include OIDs in dump
  -O, --no-owner               skip restoration of object ownership in plain-text format
  -s, --schema-only            dump only the schema, no data
  -S, --superuser=NAME         superuser user name to use in plain-text format
  -t, --table=TABLE            dump the named table(s) only
  -T, --exclude-table=TABLE    do NOT dump the named table(s)
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --binary-upgrade             for use by upgrade utilities only
  --column-inserts             dump data as INSERT commands with column names
  --disable-dollar-quoting     disable dollar quoting, use SQL standard quoting
  --disable-triggers           disable triggers during data-only restore
  --exclude-table-data=TABLE   do NOT dump data for the named table(s)
  --inserts                    dump data as INSERT commands, rather than COPY
  --no-security-labels         do not dump security label assignments
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data
  --quote-all-identifiers      quote all identifiers, even if not key words
  --section=SECTION            dump named section (pre-data, data, or post-data)
  --serializable-deferrable    wait until the dump can run without anomalies
  --use-set-session-authorization
                               use SET SESSION AUTHORIZATION commands instead of
                               ALTER OWNER commands to set ownership

Connection options:
  -h, --host=HOSTNAME          database server host or socket directory
  -p, --port=PORT              database server port number
  -U, --username=NAME          connect as specified database user
  -w, --no-password            never prompt for password
  -W, --password               force password prompt (should happen automatically)
  --role=ROLENAME              do SET ROLE before dump

If no database name is supplied, then the PGDATABASE environment variable value is used.

Report bugs to <pgsql-bugs@postgresql.org>.
pg_dumpall
pg_dumpall is a utility for writing out
(“dumping”) all PostgreSQL databases of a cluster into one
script file. The script file contains SQL
commands that can be used as input to psql to restore the
databases. It does this by calling pg_dump
for each database in a cluster. pg_dumpall also dumps
global objects that are common to all
databases. (pg_dump does not save these objects.) This currently includes
information about database users and groups, tablespaces, and properties such
as access permissions that apply to databases as a whole.
Since pg_dumpall reads tables from all
databases you will most likely have to connect as a database
superuser in order to produce a complete
dump. Also you will need superuser privileges to execute
the saved script in order to be allowed to
add users and groups, and to create databases.
The SQL script will be written to the standard output. Use the -f/--file option or shell operators to redirect it into a file.
pg_dumpall needs to connect several times
to the PostgreSQL server (once per database). If you use
password authentication it will ask for a
password each time. It is convenient to have a ~/.pgpass
file in such cases.
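A ~/.pgpass entry has the form hostname:port:database:username:password, and the file must be readable only by you (chmod 0600). The values below are placeholders:
# ~/.pgpass
# hostname:port:database:username:password
localhost:5432:*:postgres:mysecretpassword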
Notes
Since pg_dumpall calls pg_dump internally, some diagnostic
messages will refer to pg_dump.
Once restored, it is wise to run ANALYZE
on each database so the optimizer has useful statistics.
You
can also run vacuumdb -a
-z to analyze all databases.
pg_dumpall requires all needed tablespace directories to
exist before the restore; otherwise, database
creation will fail for databases in non-default locations.
Examples
To dump all databases:
$ pg_dumpall > db.out
To reload database(s) from this file, you can use:
$ psql -f db.out postgres
[postgres@db ~]$ pg_dumpall -?
pg_dumpall extracts a PostgreSQL database cluster into an SQL script file.

Usage:
  pg_dumpall [OPTION]...

General options:
  -f, --file=FILENAME          output file name
  -V, --version                output version information, then exit
  --lock-wait-timeout=TIMEOUT  fail after waiting TIMEOUT for a table lock
  -?, --help                   show this help, then exit

Options controlling the output content:
  -a, --data-only              dump only the data, not the schema
  -c, --clean                  clean (drop) databases before recreating
  -g, --globals-only           dump only global objects, no databases
  -o, --oids                   include OIDs in dump
  -O, --no-owner               skip restoration of object ownership
  -r, --roles-only             dump only roles, no databases or tablespaces
  -s, --schema-only            dump only the schema, no data
  -S, --superuser=NAME         superuser user name to use in the dump
  -t, --tablespaces-only       dump only tablespaces, no databases or roles
  -x, --no-privileges          do not dump privileges (grant/revoke)
  --binary-upgrade             for use by upgrade utilities only
  --column-inserts             dump data as INSERT commands with column names
  --disable-dollar-quoting     disable dollar quoting, use SQL standard quoting
  --disable-triggers           disable triggers during data-only restore
  --inserts                    dump data as INSERT commands, rather than COPY
  --no-security-labels         do not dump security label assignments
  --no-tablespaces             do not dump tablespace assignments
  --no-unlogged-table-data     do not dump unlogged table data
  --quote-all-identifiers      quote all identifiers, even if not key words
  --use-set-session-authorization
                               use SET SESSION AUTHORIZATION commands instead of
                               ALTER OWNER commands to set ownership

Connection options:
  -h, --host=HOSTNAME          database server host or socket directory
  -l, --database=DBNAME        alternative default database
  -p, --port=PORT              database server port number
  -U, --username=NAME          connect as specified database user
  -w, --no-password            never prompt for password
  -W, --password               force password prompt (should happen automatically)
  --role=ROLENAME              do SET ROLE before dump

If -f/--file is not used, then the SQL script will be written to the standard output.

Report bugs to <pgsql-bugs@postgresql.org>.
192. Hot logical backup of all databases
My recommendation is that you do exactly what you did for one
database on each database in your cluster.
You can run those individual dumps in
parallel if you want to speed things up.
Once complete, dump the global information also, using the
following:
pg_dumpall -g
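A minimal shell sketch of this approach (the file names and the /backup directory are assumptions):
pg_dumpall -g > /backup/globals.sql
for DB in $(psql -At -c "SELECT datname FROM pg_database WHERE datistemplate = false"); do
    pg_dump -F c -f "/backup/$DB.dump" "$DB" &    # one pg_dump per database, run in parallel
done
wait                                              # wait for all dumps, then check logs/exit status for errors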
193. How to take backups of large databases
Some operating
systems have maximum file size limits that cause problems when creating large
pg_dump output
files. Fortunately, pg_dump can write to the standard output, so you can use
standard Unix tools to work around this potential problem. There are several
possible methods:
Use compressed
dumps. You can use your favorite compression program, for example
gzip:
pg_dump
dbname | gzip > filename.gz
Reload with:
gunzip
-c filename.gz | psql dbname
or:
cat
filename.gz | gunzip | psql dbname
Use split.
The
split
command
allows you to split the output into smaller files that are acceptable
in size to the underlying
file system. For example, to make chunks of 1 megabyte:
pg_dump
dbname | split -b 1m - filename
Reload with:
cat
filename* | psql dbname
Use pg_dump’s
custom dump format. If PostgreSQL was built on a system
with the zlib compression library installed, the custom dump format will
compress data as it writes it to the output file. This will produce dump file
sizes similar to using gzip, but it has the added advantage that
tables can be restored selectively. The following command dumps a database using
the custom dump format:
pg_dump
-Fc dbname > filename
A custom-format
dump is not a script for psql, but instead must be restored with pg_restore,
for
example:
pg_restore
-d dbname filename
For very large
databases, you might need to combine split with one of the
other two approaches.
Use pg_dump’s
parallel dump feature. To speed up the dump of a large
database, you can use
pg_dump’s parallel
mode. This will dump multiple tables at the same time. You can control the
degree of parallelism with the -j parameter. Parallel dumps are only
supported for the "directory" archive format.
pg_dump
-j num -F d -f out.dir dbname
You can use pg_restore
-j to restore a dump in parallel. This will work for any
archive of either
the "custom" or the "directory" archive
mode, whether or not it has been created with pg_dump -j.
194. Hot logical
backup of all tables in a tablespace
Sometimes we may wish
to make a dump of tables and data in a tablespace. Unfortunately, there isn't a
simple command to do this, so we need to write some reusable scripts.
It is possible for a tablespace to contain objects from more than one database, so run the following query to see which databases you need to dump from:
SELECT datname
FROM pg_database WHERE oid IN ( SELECT pg_tablespace_databases(ts.oid) FROM
pg_tablespace ts WHERE spcname = 'mytablespacename');
The following
procedure allows you to dump all tables that reside on one tablespace and
within one database only.
Create a file named onets.sql
that contains the following SQL, which
extracts the list of tables in a tablespace:
SELECT 'pg_dump'
UNION ALL
SELECT '-t ' || n.nspname || '.' || t.relname
FROM pg_class t
JOIN pg_tablespace ts ON t.reltablespace = ts.oid AND ts.spcname = :TSNAME
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE t.relkind = 'r'
UNION ALL
SELECT '-F c > dumpfile'; -- dumpfile is the name of the output file
Execute the query to
build the pg_dump script:
psql -t -v
TSNAME="'mytablespace'" -f onets.sql > get_my_ts
From the recovered database server, dump the tables in the tablespace, including data and definitions. The output file is named "dumpfile", as set by the last line of the script in the first step.
chmod 755
get_my_ts
./get_my_ts
pg_dump allows you to specify
more than one table on the command line, so it's possible to generate a list of
tables directly from the database.
We use the named
parameter in psql to create a parameterized script, which we then execute to
create a dump.
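For illustration, the command the generated script runs amounts to something like this (myschema, orders, and order_lines are hypothetical names; depending on how psql wraps its output you may need to join the generated lines into a single command before executing):
pg_dump -t myschema.orders -t myschema.order_lines -F c > dumpfile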
195. Backup of database object
definitions
Sometimes it's useful to get a dump of the object
definitions that make up a database. This is useful for comparing what's in the
database against the definitions in a data or object-modeling tool. It's also
useful to make sure you can recreate objects in exactly the correct schema,
tablespace, and database with the correct ownership and permissions.
The basic command to dump only the object definitions is the following:
pg_dumpall --schema-only > myscriptdump.sql
This includes all objects: roles, tablespaces, databases, schemas, tables, indexes, triggers, constraints, views, functions, ownership, and privileges.
If you want to dump PostgreSQL role definitions, you can use the following:
pg_dumpall --roles-only > myroles.sql
If you want to dump PostgreSQL tablespace definitions, you can use the following:
pg_dumpall --tablespaces-only > mytablespaces.sql
Or if you want to dump both roles and tablespaces, then you can use the following:
pg_dumpall --globals-only > myglobals.sql
The output is a human-readable script file that can be re-executed
to re-create each of the databases.
In PostgreSQL, the word "schema" is also used to
describe a set of related database objects similar to a directory, also known
as a "namespace". Be careful that you don't confuse what is happening
here. The ––schema-only option makes a backup of the "database
schema" - the definitions of all objects in the database (and in all
namespaces). To make a backup of the data and definitions in just one
namespace, use the –s option. Or, to make a backup of only the definitions, in
just one namespace, use both –s and ––schema-only together.
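For example (assuming a schema named myschema in a database named mydb):
pg_dump -n myschema mydb > myschema_data_and_defs.sql
pg_dump -n myschema --schema-only mydb > myschema_defs_only.sql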
196. Why is VACUUM required in MVCC?
VACUUM removes dead row versions in tables and marks the space as available for future reuse. The VACUUM process also updates the free space map; with VACUUM FULL we can additionally return the freed space to the operating system by rewriting the table.
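For example (mytable is a placeholder name):
VACUUM VERBOSE mytable;   -- reclaim dead row versions and update the free space map
VACUUM FULL mytable;      -- rewrite the table to return freed space to the OS (takes an exclusive lock)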
197. How can you know that your
operations are spilling to disk?
When you run EXPLAIN ANALYZE for your query and you see a line like "Sort Method: external merge Disk: 88342kB", your operation is spilling to disk.
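For example, a query like the following (big_table and some_column are placeholders) shows the sort method in its plan output:
EXPLAIN ANALYZE SELECT * FROM big_table ORDER BY some_column;
-- look for a line such as:  Sort Method: external merge  Disk: 88342kB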
198. Is spilling to disk a bad operation?
Yes. A disk-based sort or hash requires much more I/O and CPU than an in-memory one, and it is one of the biggest causes of poor query performance.
199. What is the solution?
The simple solution is to validate the configuration of the work_mem parameter. If you do a lot of complex sorts and have a lot of memory, then increasing the work_mem parameter allows PostgreSQL to do larger in-memory sorts which, unsurprisingly, will be faster than disk-based equivalents.
200. How can you configure work_mem?
A work_mem of at least 16MB would keep the intermediate data in memory and likely improve the query response time.
You can set it at the server level for general use (in postgresql.conf).
You can also set it at the session level using the sample query below. Increase the setting until you see Memory instead of Disk in the EXPLAIN output.
BEGIN;
SET LOCAL work_mem = '50MB';
SELECT * FROM Table_Name;
COMMIT;
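To change it at the server level without editing postgresql.conf by hand, a sketch assuming PostgreSQL 9.4 or later (the 16MB value is only an example; remember that work_mem is allocated per sort/hash operation, so keep the global value modest):
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();   -- reload so the new setting takes effect for new sessions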
201. What is REINDEX?
PostgreSQL lets you rebuild indexes with the REINDEX command (the reindexdb command-line utility is a wrapper around the same functionality).
Using the REINDEX command:
You can execute the REINDEX command on a database, a table, or an index.
You can connect to PostgreSQL using pgAdmin or psql to execute this command.
Reindex
all indexes of Database:
REINDEX DATABASE
Database_Name;
Reindex
on particular Index:
REINDEX INDEX Index_Name;
Reindex
on particular Table:
REINDEX TABLE
Table_Name;
202. What is BRIN (Block Range Index)?
BRIN is a database indexing technique, mainly designed for very large tables.
We know about the table partitioning concept, which we normally need to implement on a very large table.
A BRIN index also works somewhat like horizontal table partitioning, by creating different blocks of data.
If we are using BRIN, we do not need to implement table partitioning explicitly.
The BRIN index creates different blocks of data and defines a range for each one using min/max values.
Every block has one Min value and one Max value associated with it, which is used to identify the related block of data.
A block range is a group of pages that are physically adjacent in the table; for each block range, some summary information is stored by the index.
A BRIN index is very lightweight because it stores only a summary of each data block. Unlike B-Tree indexes, it does not need to store every key value in its index pages.
Even an entire BRIN index can easily fit in memory because of its compact form, which reduces disk scanning.
Because it stores only a summary per block, it is also around 10 times faster than a B-Tree for INSERT and UPDATE operations.
Creating a BRIN index on a large table is also much faster than creating a B-Tree index.
The efficiency of a BRIN index also depends on the physical order of the table data.
BRIN index scans return tuples via a bitmap, visiting only those blocks whose summary information matches the query, which avoids scanning large parts of the table.
B-Tree indexes are still superior for random searches; BRIN indexes are best for the kinds of queries typically used in big data, reporting, and data warehouse workloads.
PostgreSQL 9.5 introduced the powerful BRIN index, which can perform much faster than a regular B-Tree index for these workloads.
The two most important points about BRIN are: it stores only a minimum and maximum value per block, so it does not require much space, and for extremely large tables it can run faster than other index types.
Create BRIN index on TransactionDate
Column:
CREATE INDEX
idx_tbl_ItemTransactions_TransactionDate
ON tbl_ItemTransactions
USING BRIN (TransactionDate);
203. Using PSQL, Disable Autocommit
Globally
Log in with psql and execute the command below (psql variables are case-sensitive, so AUTOCOMMIT must be upper case):
\set AUTOCOMMIT off
To enable it again, execute:
\set AUTOCOMMIT on
To make the setting apply to every psql session, add the same line to your ~/.psqlrc file.
204. Which statistical tables are used by Postgres?
The PostgreSQL database server mainly uses two statistics tables:
SELECT * FROM pg_class;
SELECT * FROM pg_stats;
205. What is difference between MySQL and
PostgreSQL?
MySQL
is controlled by Oracle, whereas Postgres is available under an open-source
license from the PostgreSQL Global Development Group.
PostgreSQL
is closer to the ANSI SQL standard, MySQL is closer to the ODBC standard.
One of the most notable differences between MySQL and PostgreSQL is that MySQL is far more limited in its handling of nested subqueries and subselects. MySQL also doesn't support CTEs, window functions, full outer joins, or arrays.
Compared to PostgreSQL, MySQL has its own large user community and a wide variety of learning materials.
MySQL uses the nonstandard '#' to begin a comment line, whereas PostgreSQL uses '--' to begin a comment line.
MySQL provides different types of table storage engines, such as InnoDB, MyISAM, and Memory. PostgreSQL does not provide different types of table storage engines.
PostgreSQL is case-sensitive for string comparisons: the value "DbRnd" is not the same as the value "dbrnd". This is a big change for many users coming from MySQL and other small database systems.
PostgreSQL has some nice features like generate_series, custom aggregate functions, arrays, etc.; on the other hand, MySQL also has some nice features like session variables in queries, FORCE INDEX, etc.
In
PostgreSQL, when you are inside a transaction almost any operation can be
undone. There are some irreversible operations (like creating or destroying a
database or tablespace), but normal table modifications can be backed out by
issuing a ROLLBACK via its Write-Ahead Log design.
MySQL
doesn’t support any sort of rollback when using MyISAM. With InnoDB, the server
has an implicit commit that occurs even if the normal auto-commit behavior is
turned off. This means that any single table alteration or similar change is
immediately committed.
Postgres
obliterates MySQL in all manner of query-level goodness, and we can use
Postgres for data mining and batch analytics databases, where we run lots of
very large, hand-coded queries that can take advantage of Postgres’ advantages
in join methods (MySQL only supports Nested Loop and Nested Loop over Index,
while Postgres supports numerous methods), generally richer querying
capabilities, and other good stuff like partial indexes can be used
extensively.
PostgreSQL
uses a robust locking model called MVCC that limits situations where individual
clients interfere with each other. A short summary of the main benefit of MVCC
would be “readers are never blocked by writers”.
MySQL’s
InnoDB implements MVCC using a rollback segment and InnoDB databases supports
all four SQL standard transaction isolation levels.
PostgreSQL
uses cost-based query optimization methods to get good performance for many
different types of joins. Query costs are assessed based on planner statistics
collected when tables are analyzed combined with adjustable planner costs and
advanced features such as the Genetic Query Optimizer.
MySQL
doesn’t have this level of sophistication in its planner, and the tunables for
Controlling Query Optimizer Performance are crude. Developers instead do things
like explicitly provide index hints to make sure joins are executed correctly.
PostgreSQL
has different replication options than MySQL. MySQL has no point-in-time
recovery.
One operation that PostgreSQL is known to perform slowly is a full count of the rows in a table, typically using this SQL:
SELECT COUNT(*) FROM your_table_name;
The reason why this is slow is related to the
MVCC implementation in PostgreSQL. The fact that multiple transactions can see
different states of the data means that there can be no straightforward way for
“COUNT(*)” to summarize data across the whole table; PostgreSQL must walk
through all rows, in some sense.
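If an approximate count is acceptable, a common workaround is to read the planner's estimate instead (your_table_name as above):
SELECT reltuples::bigint AS approximate_row_count
FROM pg_class
WHERE relname = 'your_table_name';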
In MySQL, MyISAM tables cache the row count information, making this type of count operation almost instant. But if you're using InnoDB instead, this is no longer the case, and you can't assume that a full row count will be fast.
206. How to Increase the Performance of
Bulk Data Load
Tuning the checkpoint parameters can increase the performance of a bulk data load.
From PostgreSQL 9.5 onwards you can use min_wal_size and max_wal_size instead of checkpoint_segments.
If our PostgreSQL server is set up for heavy bulk operations, then after setting the memory parameters above we should also look into the checkpoint-related parameters.
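A sketch of the relevant postgresql.conf settings on a 9.5+ server (the values are only examples for a bulk-load workload):
max_wal_size = 4GB                     # replaces checkpoint_segments from 9.5 onwards
min_wal_size = 1GB
checkpoint_timeout = 20min
checkpoint_completion_target = 0.9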
207. What is a Checkpoint?
In simple terms, a checkpoint writes all dirty pages from memory to disk and cleans the shared_buffers area.
It does not require any system-wide lock, and it makes sure that the data has been written to disk.
If our Postgres server crashes, recovery starts from the last checkpoint and replays the WAL written between that checkpoint and the time PostgreSQL stopped, so the checkpoint information is what we use to recover the system.
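You can trigger a checkpoint manually and watch checkpoint activity from SQL, for example (in older releases the counters live in pg_stat_bgwriter):
CHECKPOINT;                                  -- force an immediate checkpoint (superuser only)
SELECT checkpoints_timed, checkpoints_req    -- time-driven vs. forced checkpoints since stats reset
FROM pg_stat_bgwriter;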
208. Why is Checkpoint an important parameter?
If we have a 1 GB database and 15 GB of memory, we can easily hold the whole database in memory, which gives the best performance because it does not require any physical page swapping to disk.
Now let's say a PostgreSQL checkpoint has not been taken for the last 48 hours and we are working only with data pages that are in memory.
If the system now crashes because of a power failure, a lot of work has not been written to disk, so the next time Postgres starts it takes a long time to come up, because it has to recover the last 48 hours of data from the WAL segments.
That's why the proper configuration of the checkpoint parameters is very important.
209. What are the different Checkpoint
Parameters?
checkpoint_segments (integer): The default value is 3, and every segment has a 16 MB capacity to store WAL data.
checkpoint_timeout (integer): The default is 5min; it is the maximum time between automatic WAL checkpoints. If a checkpoint runs every minute, only the pages dirtied in that minute have to be written to disk; at 5 minutes, more pages; at 1 hour, even more pages.
checkpoint_completion_target (floating point): The default is 0.5; it is the fraction of the checkpoint interval over which the checkpoint writes are spread.
checkpoint_warning (integer): If checkpoints caused by filling the WAL segments happen closer together than this, a warning is written to the server log.
210. How can we improve performance for bulk data operations?
We should increase the values of checkpoint_segments, checkpoint_completion_target, and checkpoint_timeout.
When we increase a parameter like checkpoint_timeout from 5min to 20min, it reduces disk I/O by avoiding writing the same pages again and again.
Whenever we have a large shared_buffers area, we should also increase the checkpoint parameters.
I suggest applying these changes only to systems that run large bulk data load operations.
For a regular load, the default PostgreSQL settings are fine.
Example of parameters for PostgreSQL optimization:
shared_buffers = 512MB (default: 32MB)
effective_cache_size = 1024MB (default: 128MB)
checkpoint_segments = 32 (default: 3)
checkpoint_completion_target = 0.9 (default: 0.5)
default_statistics_target = 1000 (default: 100)
work_mem = 100MB (default: 1MB)
maintenance_work_mem = 256MB (default: 16MB)
211. How to increase the performance of Bulk Insertion (INSERT)
We should optimize bulk insertion operations because they also utilize a lot of system resources.
A few steps to increase the performance of bulk insertion:
- If it is a one-time data load, drop all indexes and recreate them after the INSERT finishes. For regular bulk insertion, review the indexes and try to optimize them.
- Take a backup of all trigger definitions and drop the triggers.
- Use the COPY command instead of INSERT (see the sketch after this list).
- If you cannot use COPY and must use multi-row INSERTs, do not put more than about 1000 rows in a single INSERT; all of the values are held in memory, which can decrease INSERT performance.
- Disable other database logging processes.
- If the bulk insertion runs in a single transaction, you can also disable all FOREIGN KEYs and re-enable them afterwards.
- Set synchronous_commit = off to reduce the cost of committing.
- Change the default checkpoint configuration; increase the checkpoint_segments value.
- Increase the size of shared_buffers, effective_cache_size, and work_mem.
- Prepare statements on the server side.
- Disable autovacuum on the table.
- Disable the automatic database backup process.
- Increase system resources (memory, disk, CPU) if required.
- Use SSD disks for faster storage.
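A minimal sketch of the COPY plus synchronous_commit approach (the table name and the server-side file path are assumptions; from psql you can use \copy instead if the file lives on the client):
BEGIN;
SET LOCAL synchronous_commit = off;    -- cheaper commits, for this transaction only
COPY tbl_ItemTransactions FROM '/tmp/itemtransactions.csv' WITH (FORMAT csv);
COMMIT;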
Note: Whatever you changed for the bulk insertion, please do not forget to revert it after the bulk insertion completes.