Manoj Simar - The Technology Expert: Interview Q and A for PostgreSQL Part

51. Can PostgreSQL be embedded?

PostgreSQL is designed as a client/server architecture, which requires separate processes for each client and server, and various helper processes. Many embedded architectures can support such requirements. However, if your embedded architecture requires the database server to run inside the application process, you cannot use Postgres and should select a lighter-weight database solution.

52. What interfaces are available for PostgreSQL?

The core PostgreSQL source code includes only the C and embedded C interfaces. All other interfaces are independent projects that are downloaded separately; being separate allows them to have their own release schedule and development teams.

Many PostgreSQL installers bundle language client interfaces like PgJDBC, Npgsql, the pg Ruby gem, psycopg2 for Python, DBD::Pg for Perl, etc., into the PostgreSQL installer or offer to download them for you. Additionally, some programming language runtimes come with PostgreSQL client libraries pre-installed.

On Linux systems you can generally just install language bindings like psycopg2 using your package manager.

53. When installing from source code, how do I install PostgreSQL somewhere other than /usr/local/pgsql?

Specify the --prefix option when running configure. If you forgot to do that, you can edit Makefile.global and change POSTGRESDIR accordingly, or create a Makefile.custom and define POSTGRESDIR there.

54. How do I control connections from other hosts?

By default, PostgreSQL only allows connections from the local machine using Unix domain sockets or TCP/IP connections. Other machines will not be able to connect unless you modify listen_addresses in the postgresql.conf file, enable host-based authentication by modifying the $PGDATA/pg_hba.conf file, and restart the database server.

55. How do I tune the database engine for better performance?

There are three major areas for potential performance improvement:

Query Changes

This involves modifying queries to obtain better performance:

Creation of indexes, including expression and partial indexes
Use of COPY instead of multiple INSERTs
Grouping of multiple statements into a single transaction to reduce commit overhead
Use of CLUSTER when retrieving many rows from an index
Use of LIMIT for returning a subset of a query's output
Use of prepared queries
Use of ANALYZE to maintain accurate optimizer statistics
Regular use of VACUUM or autovacuum
Dropping of indexes during large data changes

Server Configuration

A number of postgresql.conf settings affect performance.

Hardware Selection

The effect of hardware on performance is detailed in http://www.powerpostgresql.com/PerfList/ and http://momjian.us/main/writings/pgsql/hw_performance/index.html.

56. What debugging features are available?

There are many log_* server configuration variables at http://www.postgresql.org/docs/current/interactive/runtime-config-logging.html that enable printing of query and process statistics which can be very useful for debugging and performance measurements.

57. Why do I get "Sorry, too many clients" when trying to connect?

You have reached the default limit of 100 database sessions. See "Number of database connections" for advice on whether you should raise the connection limit or add a connection pooler.

58. Will PostgreSQL handle recent daylight saving time changes in various countries?

PostgreSQL releases 8.0 and up depend on the widely-used tzdata database (also called the zoneinfo database or the Olson timezone database) for daylight savings information. To deal with a DST law change that affects you, install a new tzdata file set and restart the server.

All PostgreSQL update releases include the latest available tzdata files, so keeping up-to-date on minor releases for your major version is usually sufficient for this.

On platforms that receive regular software updates including new tzdata files, it may be more convenient to rely on the system's copy of the tzdata files. This is possible as a compile-time option. Most Linux distributions choose this approach for their pre-built versions of PostgreSQL.

PostgreSQL releases before 8.0 always relied on the operating system's timezone information.

59. How does PostgreSQL use CPU resources?

The PostgreSQL server is process-based (not threaded), and uses one operating system process per database session. Multiple sessions are automatically spread across all available CPUs by your operating system. Client applications can easily use threads and create multiple database connections from each thread.

In releases before 9.6, a single database session (connection) cannot use more than one CPU, so a single complex and CPU-intensive query is processed on one CPU. The OS may still be able to use others for disk I/O etc., but you won't see much benefit from more than one spare core. PostgreSQL 9.6 and later can run some queries in parallel using additional worker processes.

60. Why does PostgreSQL have so many processes, even when idle?

PostgreSQL is process based, so it starts one postgres (or postgres.exe on Windows) instance per connection. The postmaster (which accepts connections and starts new postgres instances for them) is always running. In addition, PostgreSQL generally has one or more "helper" processes like the stats collector, background writer, autovacuum daemon, walsender, etc, all of which show up as "postgres" instances in most system monitoring tools.

Despite the number of processes, they actually use very little in the way of real resources.

61. Why does PostgreSQL use so much memory?

Despite appearances, this is absolutely normal, and there's actually nowhere near as much memory being used as tools like top or the Windows process monitor say PostgreSQL is using.

Tools like top and the Windows process monitor may show many postgres instances, each of which appears to use a huge amount of memory. Often, when added up, the amount the postgres instances use is many times the amount of memory actually installed in the computer!

This is a consequence of how these tools report memory use. They generally don't understand shared memory very well, and show it as if it was memory used individually and exclusively by each postgres instance. PostgreSQL uses a big chunk of shared memory to communicate between its backends and cache data. Because these tools count that shared memory block once per postgres instance instead of counting it once for all postgres instances, they massively over-estimate how much memory PostgreSQL is using.

Furthermore, many versions of these tools don't report the entire shared memory block as being used by an individual instance immediately when it starts, but rather count the number of shared pages it has touched since starting. Over the lifetime of an instance, it will inevitably touch more and more of the shared memory until it has touched every page, so that its reported usage will gradually rise to include the entire shared memory block. This is frequently misinterpreted to be a memory leak; but it is no such thing, only a reporting artifact.
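The over-counting can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the figures (a 4096 MB shared memory block, 20 backends, 10 MB of private memory each) are made-up assumptions, not measurements.

```python
def naive_total_mb(shared_mb, private_mb, backends):
    """Add up per-process figures, counting the shared block once per process."""
    return backends * (shared_mb + private_mb)


def actual_total_mb(shared_mb, private_mb, backends):
    """Count the shared block once, plus each backend's private memory."""
    return shared_mb + backends * private_mb


# Assumed figures, for illustration only.
shared_mb, private_mb, backends = 4096, 10, 20
print(naive_total_mb(shared_mb, private_mb, backends))   # 82120
print(actual_total_mb(shared_mb, private_mb, backends))  # 4296
```

With these assumed numbers the naive sum is roughly nineteen times the memory really in use.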

62. How do I SELECT only the first few rows of a query? A random row?

To retrieve only a few rows, if you know the number of rows needed at the time of the SELECT, use LIMIT. If an index matches the ORDER BY, it is possible that the entire query does not have to be executed. If you don't know the number of rows at SELECT time, use a cursor and FETCH.

To SELECT a random row, use:

SELECT col FROM tab ORDER BY random() LIMIT 1;

See also this blog entry by Andrew Gierth that has more information on this topic.

63. How do I find out what tables, indexes, databases, and users are defined? How do I see the queries used by psql to display them?

Use the \dt command to see tables in psql. For a complete list of commands inside psql you can use \?. Alternatively, you can read the source code for psql in the file pgsql/src/bin/psql/describe.c; it contains the SQL commands that generate the output for psql's backslash commands. You can also start psql with the -E option so that it prints the queries it uses to execute the commands you give. PostgreSQL also provides an SQL-compliant INFORMATION SCHEMA interface you can query to get information about the database.

There are also system tables beginning with pg_ that describe these too.

Running psql -l lists all databases.

Also try the file pgsql/src/tutorial/syscat.source. It illustrates many of the SELECTs needed to get information from the database system tables.

64. How do you change a column's data type?

Changing the data type of a column can be done easily in 8.0 and later with ALTER TABLE ALTER COLUMN TYPE.

In earlier releases, do this:

BEGIN;
ALTER TABLE tab ADD COLUMN new_col new_data_type;
UPDATE tab SET new_col = CAST(old_col AS new_data_type);
ALTER TABLE tab DROP COLUMN old_col;
COMMIT;

You might then want to do VACUUM FULL tab to reclaim the disk space used by the expired rows.

65. What is the maximum size for a row, a table, and a database?

These are the limits:

Maximum size for a database: unlimited (32 TB databases exist)
Maximum size for a table: 32 TB
Maximum size for a row: 400 GB
Maximum size for a field: 1 GB
Maximum number of rows in a table: unlimited
Maximum number of columns in a table: 250-1600 depending on column types
Maximum number of indexes on a table: unlimited

Of course, these are not actually unlimited, but limited to available disk space and memory/swap space. Performance may suffer when these values get unusually large.

The maximum table size of 32 TB does not require large file support from the operating system. Large tables are stored as multiple 1 GB files so file system size limits are not important.

The maximum table size, row size, and maximum number of columns can be quadrupled by increasing the default block size to 32k. The maximum table size can also be increased using table partitioning.
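The 32 TB table limit and its quadrupling can be reproduced with simple arithmetic if one assumes, as this sketch does, that blocks within a table are addressed by a 32-bit block number, so a table holds at most about 2^32 blocks. The function name is hypothetical and the model deliberately ignores any reserved block numbers.

```python
def max_table_size_tb(block_size_bytes, block_number_bits=32):
    """Approximate table size limit in TB (2**40 bytes), assuming a 32-bit block counter."""
    return (2 ** block_number_bits) * block_size_bytes / 2 ** 40


print(max_table_size_tb(8 * 1024))   # 32.0  (default 8k block size)
print(max_table_size_tb(32 * 1024))  # 128.0 (32k block size)
```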

One limitation is that indexes cannot be created on columns longer than about 2,000 characters. Fortunately, such indexes are rarely needed. Uniqueness is best guaranteed by a function index of an MD5 hash of the long column, and full text indexing allows for searching of words within the column.

66. How much database disk space is required to store data from a typical text file?

A PostgreSQL database may require up to five times the disk space to store data from a text file.

As an example, consider a file of 100,000 lines with an integer and text description on each line. Suppose the text string averages twenty bytes in length. The flat file would be 2.8 MB. The size of the PostgreSQL database file containing this data can be estimated as 5.2 MB:

 24 bytes: each row header (approximate)
 24 bytes: one int field and one text field
+ 4 bytes: pointer on page to tuple
----------------------------------------
 52 bytes per row

The data page size in PostgreSQL is 8192 bytes (8 KB), so:

8192 bytes per page
-------------------  =  157 rows per database page (rounded down)
  52 bytes per row

 100000 data rows
------------------  =  637 database pages (rounded up)
 157 rows per page

637 database pages * 8192 bytes per page  =  5,218,304 bytes (5.2 MB)
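The same estimate can be scripted. The sketch below is an illustration only: the function and parameter names are made up here, it ignores the per-page header, and it uses floor division for rows per page (8192 // 52 is 157) and rounds the page count up.

```python
import math


def estimate_table_bytes(rows, row_header=24, row_data=24, item_pointer=4,
                         page_size=8192):
    """Return (bytes_per_row, rows_per_page, pages, total_bytes) for a rough estimate."""
    bytes_per_row = row_header + row_data + item_pointer
    rows_per_page = page_size // bytes_per_row   # rounded down
    pages = math.ceil(rows / rows_per_page)      # rounded up
    return bytes_per_row, rows_per_page, pages, pages * page_size


print(estimate_table_bytes(100_000))  # (52, 157, 637, 5218304)
```

Changing `row_data` or `page_size` shows how quickly the estimate moves with wider rows or a different block size.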

Indexes do not require as much overhead, but do contain the data that is being indexed, so they can be large also.

NULLs are stored as bitmaps, so they use very little space.

Note that long values may be compressed transparently.

See also this presentation on the topic: File:How Long Is a String.pdf.

67. Why are my queries slow? What makes SQL slow, and what are the solutions?

A SQL statement can be slow for many reasons. The following is a short list of common causes, with a way of recognizing some of them:

Stale table and index statistics
Too much data is processed: run the query with EXPLAIN ANALYZE to see how much data is processed to complete the query
Too little of the data fits in memory: if not enough of the data fits in shared buffers, the same data is re-read many times
The query returns too much data: sometimes lazy programmers write a query that returns a lot more rows than needed
Locking problems
Not enough CPU power or disk I/O capacity for the current load
Table and index bloat

Solutions:

Reducing the number of rows returned

Typical cases:

A full text search returns 10,000 documents, but only the first 20 are displayed to the user
An application requests all products for a branch office to run a complex calculation over them
An application runs a huge number of small lookup queries

Simplifying complex SQL

Moving part of the query into a view
Using the WITH clause instead of a separate view
Using temporary tables for parts of the query
Using materialized views (long-living temp tables)
Using set-returning functions for some parts of queries

Speeding up queries without rewriting them

Providing better information to the optimizer
Adding a multi-column index tuned specifically for that query
Adding a special conditional index
Clustering tables on specific indexes
Using table partitioning and constraint exclusion
In case of many updates, setting fillfactor on the table

Rewriting the schema, a more radical approach

Vacuuming and updating table and index statistics

68. How do I see how the query optimizer is evaluating my query?

This is done with the EXPLAIN command; see Using EXPLAIN.

69. How do I change the sort ordering of textual data?

PostgreSQL sorts textual data according to the ordering that is defined by the current locale, which is selected during initdb. (In 8.4 and up it is possible to select a different locale when creating a new database.) If you don't like the ordering then you need to use a different locale. In particular, most locales other than "C" sort according to dictionary order, which largely ignores punctuation and spacing. If that's not what you want then you need "C" locale.

70. How do I perform regular expression searches and case-insensitive regular expression searches? How do I use an index for case-insensitive searches?

The ~ operator does regular expression matching, and ~* does case-insensitive regular expression matching. The case-insensitive variant of LIKE is called ILIKE.

Case-insensitive equality comparisons are normally expressed as:

SELECT * FROM tab WHERE lower(col) = 'abc';

This will not use a standard index on "col". However, if you create an expression index on "lower(col)", it will be used:

CREATE INDEX tabindex ON tab (lower(col));

If the above index is created as UNIQUE, then the column can store upper and lowercase characters, but it cannot contain identical values that differ only in case. To force a particular case to be stored in the column, use a CHECK constraint or a trigger.

In PostgreSQL 8.4 and later, you can also use the contributed CITEXT data type, which internally implements the "lower()" calls, so that you can effectively treat it as a fully case-insensitive data type. CITEXT is also available for 8.3, and an earlier version that treats only ASCII characters case-insensitively on 8.2 and earlier is available on pgFoundry.

71. In a query, how do I detect if a field is NULL? How do I concatenate possible NULLs? How can I sort on whether a field is NULL or not?

You can test the value with IS NULL or IS NOT NULL, like this:

SELECT * FROM tab WHERE col IS NULL;

Concatenating a NULL with something else produces another NULL. If that's not what you want, you can replace the NULL(s) using COALESCE(), like this:

SELECT COALESCE(col1, '') || COALESCE(col2, '') FROM tab;
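The NULL propagation rule and the COALESCE() workaround can be mimicked outside the database. The sketch below is a toy Python model, not PostgreSQL code: None stands in for NULL, and the helper names `sql_concat` and `coalesce` are made up for this illustration.

```python
def sql_concat(a, b):
    """Model of || on text values: a NULL (None) operand makes the result NULL."""
    if a is None or b is None:
        return None
    return a + b


def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE()."""
    for value in values:
        if value is not None:
            return value
    return None


print(sql_concat("abc", None))                              # None
print(sql_concat(coalesce("abc", ""), coalesce(None, "")))  # abc
```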

To sort by the NULL status, use an IS NULL or IS NOT NULL test in your ORDER BY clause. Things that are true will sort higher than things that are false, so the following will put NULL entries at the front of the output:

SELECT * FROM tab ORDER BY (col IS NOT NULL), col;
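The same two-level ordering can be mimicked in Python, where False also sorts before True. This is only an analogy for the ORDER BY above, with None standing in for NULL and a made-up list of values.

```python
rows = [3, None, 1, None, 2]

# The key mirrors ORDER BY (col IS NOT NULL), col: the first element is False
# for None, so those entries come first; the second element orders the rest.
ordered = sorted(rows, key=lambda v: (v is not None, 0 if v is None else v))
print(ordered)  # [None, None, 1, 2, 3]
```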

In PostgreSQL 8.3 and up, you can also control sort ordering of NULLs using the recently-standardized NULLS FIRST/NULLS LAST modifiers, like this:

SELECT * FROM tab ORDER BY col NULLS FIRST;

72. What is the difference between the various character types?

Type	Internal Name	Notes
VARCHAR(n)	varchar	size specifies maximum length, no padding
CHAR(n)	bpchar	blank-padded to the specified fixed length
TEXT	text	no specific upper limit on length
BYTEA	bytea	variable-length byte array (null-byte safe)
"char" (with the quotes)	char	one byte

You will see the internal name when examining system catalogs and in some error messages.

The first four types above are "varlena" types (i.e., the field length is explicitly stored on disk, followed by the data). Thus the actual space used is slightly greater than the expected size. However, long values are also subject to compression, so the space on disk might also be less than expected.

VARCHAR(n) is best when storing variable-length strings if a specific upper limit on the string length is required by the application. TEXT is for strings of "unlimited" length (though all fields in PostgreSQL are subject to a maximum value length of one gigabyte).

CHAR(n) is for storing strings that are all the same length. CHAR(n) pads with blanks to the specified length, while VARCHAR(n) only stores the characters supplied. BYTEA is for storing binary data, particularly values that include zero bytes. All these types have similar performance characteristics, except that the blank-padding involved in CHAR(n) requires additional storage and some extra runtime.

The "char" type (the quotes are required to distinguish it from CHAR(n)) is a specialized datatype that can store exactly one byte. It is found in the system catalogs but its use in user tables is generally discouraged.

73. How do I create a serial/auto-incrementing field?

PostgreSQL supports a SERIAL data type. Actually, this isn't quite a real type. It's a shorthand for creating an integer column that is fed from a sequence.

For example, this:

CREATE TABLE person (
    id SERIAL,
    name TEXT
);

is automatically translated into this:

CREATE SEQUENCE person_id_seq;

CREATE TABLE person (
    id INTEGER NOT NULL DEFAULT nextval('person_id_seq'),
    name TEXT
);

The automatically created sequence is named table_serialcolumn_seq, where table and serialcolumn are the names of the table and SERIAL column, respectively.

There is also BIGSERIAL, which is like SERIAL except that the resulting column is of type BIGINT instead of INTEGER. Use this type if you think that you might need more than 2 billion serial values over the lifespan of the table.

Note that sequences may contain "holes" or "gaps" as a normal part of operation. It is entirely normal for generated keys to go 1, 4, 5, 6, 9, ... . See the FAQ entry on sequence gaps.

74. How do I get the value of a SERIAL insert?

The simplest way is to retrieve the assigned SERIAL value with RETURNING. Using the example table in the previous question, it would look like this:

INSERT INTO person (name) VALUES ('Blaise Pascal') RETURNING id;

You can also call nextval() and use that value in the INSERT, or call currval() after the INSERT.

75. Doesn't currval() lead to a race condition with other users?

No. currval() returns the latest sequence value assigned by your session, independently of what is happening in other sessions.
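A toy model may make the session-local behavior easier to picture. The class below is a hypothetical Python sketch, not how PostgreSQL implements sequences: it keeps one shared counter and remembers, per session, the last value that session was given.

```python
class ToySequence:
    """Toy model: one shared counter plus the last value handed to each session."""

    def __init__(self):
        self._last_value = 0
        self._session_last = {}

    def nextval(self, session):
        self._last_value += 1
        self._session_last[session] = self._last_value
        return self._last_value

    def currval(self, session):
        # Raises KeyError if this session has not called nextval yet, loosely
        # mirroring the error PostgreSQL reports in that situation.
        return self._session_last[session]


seq = ToySequence()
seq.nextval("session A")         # session A is given 1
seq.nextval("session B")         # session B is given 2
print(seq.currval("session A"))  # 1
print(seq.currval("session B"))  # 2
```

Session A still sees 1 even though session B advanced the counter afterwards.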

76. Why are there gaps in the numbering of my sequence/SERIAL column? Why aren't my sequence numbers reused on transaction abort?

To improve concurrency, sequence values are given out to running transactions on-demand; the sequence object is not kept locked but is immediately available for another transaction to get another sequence value. This causes gaps in numbering from aborted transactions, as documented in the NOTE section for the nextval() function.
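How aborted transactions leave gaps can be shown with a small simulation. This is a hypothetical Python sketch under simplifying assumptions (each transaction takes exactly one value, nothing runs concurrently); the names `ToySequence` and `insert_rows` are made up for the example.

```python
class ToySequence:
    """Toy counter: a value, once handed out, is never given back."""

    def __init__(self):
        self.last_value = 0

    def nextval(self):
        self.last_value += 1
        return self.last_value


def insert_rows(seq, outcomes):
    """Run one single-row transaction per outcome ('commit' or 'abort')."""
    committed_ids = []
    for outcome in outcomes:
        value = seq.nextval()          # consumed even if the transaction aborts
        if outcome == "commit":
            committed_ids.append(value)
    return committed_ids


print(insert_rows(ToySequence(), ["commit", "abort", "abort", "commit", "commit"]))
# [1, 4, 5]
```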

Additionally, an unclean server shutdown will cause sequences to increment on recovery, because PostgreSQL keeps a cache of sequence numbers to hand out and in an unclean shutdown it isn't sure which of those cached numbers has already been used. Since sequences are allowed to have gaps anyway it takes the safe option and increments the sequence.

Another cause for gaps in sequence is the use of the CACHE clause in CREATE SEQUENCE.

In general, you should not rely on SERIAL keys or SEQUENCEs being gapless, nor should you make assumptions about their order; it is not guaranteed that id n+1 was inserted after id n except when both were generated within the same transaction. Compare synthetic keys for equality and only for equality.

Gapless sequences are possible, but are very bad for performance. At most one transaction at a time can be inserting rows from a gapless sequence. There is no built-in SERIAL or SEQUENCE equivalent for gapless sequences, but one is trivial to implement. Information on gapless sequence implementations can be found in the mailing list archives, on Stack Overflow, and in this useful article. Avoid using a gapless sequence unless it is an absolute business requirement. Consider dynamically generating the gapless numbering on demand for display, using the row_number() window function, or adding it in a batch process that runs periodically.

77. What is an OID?

If a table is created WITH OIDS, each row includes an OID column that is automatically filled in during INSERT. OIDs are sequentially assigned 4-byte integers. Initially they are unique across the entire installation. However, the OID counter wraps around at 4 billion, and after that OIDs may be duplicated.

It is possible to prevent duplication of OIDs within a single table by creating a unique index on the OID column (but note that the WITH OIDS clause doesn't by itself create such an index). The system checks the index to see if a newly generated OID is already present, and if so generates a new OID and repeats. This works well so long as no OID-containing table has more than a small fraction of 4 billion rows.
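The generate-check-retry behavior can be sketched as follows. This is a toy Python model and not the server's code: the function name is hypothetical, the set `used` stands in for the unique index on the OID column, and any reserved OID ranges are ignored. A tiny modulus is used in the example so that the wraparound is easy to see.

```python
def next_free_oid(counter, used, modulus=2 ** 32):
    """Advance a wrapping counter until it yields a value that is not in `used`."""
    if len(used) >= modulus:
        raise ValueError("no free OIDs left")
    while True:
        counter = (counter + 1) % modulus   # wraps around at the modulus
        if counter not in used:
            return counter


# With a modulus of 8, the counter wraps from 7 back to 0.
used = {5, 6, 7, 0, 1}
print(next_free_oid(6, used, modulus=8))  # 2
```

The more values are already in use, the more retries the loop needs, which is why the approach works well only while tables hold a small fraction of the 4 billion possible values.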

PostgreSQL uses OIDs for object identifiers in the system catalogs, where the size limit is unlikely to be a problem.

To uniquely number rows in user tables, it is best to use SERIAL rather than an OID column, or BIGSERIAL if the table is expected to have more than 2 billion entries over its lifespan.

78. What is a CTID?

CTIDs identify specific physical rows by their block and offset positions within a table. They are used by index entries to point to physical rows. A logical row's CTID changes when it is updated, so the CTID cannot be used as a long-term row identifier. But it is sometimes useful to identify a row within a transaction when no competing update is expected.

79. Why do I get the error "ERROR: Memory exhausted in AllocSetAlloc()"?

You probably have run out of virtual memory on your system, or your kernel has a low limit for certain resources. Try this before starting the server:

ulimit -d 262144

limit datasize 256m

Depending on your shell, only one of these may succeed, but it will set your process data segment limit much higher and perhaps allow the query to complete. This command applies to the current process, and all subprocesses created after the command is run. If you are having a problem with the SQL client because the backend is returning too much data, try it before starting the client.

80. How do I create a column that will default to the current time?

Use CURRENT_TIMESTAMP:

CREATE TABLE test (x int, modtime TIMESTAMP DEFAULT CURRENT_TIMESTAMP );

81. How do I perform an outer join?

PostgreSQL supports outer joins using the SQL standard syntax. Here are two examples:

SELECT *
FROM t1 LEFT OUTER JOIN t2 ON (t1.col = t2.col);

SELECT *
FROM t1 LEFT OUTER JOIN t2 USING (col);

These identical queries join t1.col to t2.col, and also return any unjoined rows in t1 (those with no match in t2). A RIGHT join would add unjoined rows of t2. A FULL join would return the matched rows plus all unjoined rows from t1 and t2. The word OUTER is optional and is assumed in LEFT, RIGHT, and FULL joins. Ordinary joins are called INNER joins.
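What "unjoined rows in t1" means can be seen in a small Python model of the USING (col) form. It is a sketch for illustration only: tables are lists of dicts, None stands in for NULL, and the function name and sample data are made up.

```python
def left_outer_join(t1, t2, col):
    """Join two lists of dicts on `col`; unmatched t1 rows get None for t2's columns."""
    t2_columns = [c for c in (t2[0] if t2 else {}) if c != col]
    result = []
    for left in t1:
        # A NULL (None) join key never matches anything.
        matches = [r for r in t2 if r[col] is not None and r[col] == left[col]]
        if matches:
            for right in matches:
                result.append({**left, **{c: right[c] for c in t2_columns}})
        else:
            result.append({**left, **{c: None for c in t2_columns}})
    return result


t1 = [{"col": 1, "a": "x"}, {"col": 2, "a": "y"}]
t2 = [{"col": 1, "b": "p"}]
print(left_outer_join(t1, t2, "col"))
# [{'col': 1, 'a': 'x', 'b': 'p'}, {'col': 2, 'a': 'y', 'b': None}]
```

The row with col = 2 has no partner in t2, so it is kept with b filled in as None; an inner join would have dropped it.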

82. How do I perform queries using multiple databases?

There is no way to directly query a database other than the current one. Because PostgreSQL loads database-specific system catalogs, it is uncertain how a cross-database query should even behave.

The SQL/MED support in PostgreSQL allows a "foreign data wrapper" to be created, linking tables in a remote database to the local database. The remote database might be another database on the same PostgreSQL instance or a database halfway around the world; it doesn't matter. postgres_fdw is built in to PostgreSQL 9.3 and includes read/write support; a read-only version for 9.2 can be compiled and installed as a contrib module.

contrib/dblink allows cross-database queries using function calls and is available for much older PostgreSQL versions. Unlike postgres_fdw it can't "push down" conditions to the remote server, so it will often end up fetching a lot more data than you need.

Of course, a client can also make simultaneous connections to different databases and merge the results on the client side.

83. How do I return multiple rows or columns from a function?

It is easy using set-returning functions; see "Return more than one row of data from PL/pgSQL functions".

84. Why do I get "relation with OID ##### does not exist" errors when accessing temporary tables in PL/PgSQL functions?

In PostgreSQL versions < 8.3, PL/PgSQL caches function scripts, and an unfortunate side effect is that if a PL/PgSQL function accesses a temporary table, and that table is later dropped and recreated, and the function called again, the function will fail because the cached function contents still point to the old temporary table. The solution is to use EXECUTE for temporary table access in PL/PgSQL. This will cause the query to be reparsed every time.

This problem does not occur in PostgreSQL 8.3 and later.

85. What replication solutions are available?

Though "replication" is a single term, there are several technologies for doing replication, with advantages and disadvantages for each. Our documentation contains a good introduction to this topic at http://www.postgresql.org/docs/current/static/high-availability.html and a grid listing replication software and features is at Replication, Clustering, and Connection Pooling.

Master/slave replication allows a single master to receive read/write queries, while slaves can only accept read/SELECT queries. The most popular freely available master-slave PostgreSQL replication solution is Slony-I.

Multi-master replication allows read/write queries to be sent to multiple replicated computers. This capability also has a severe impact on performance due to the need to synchronize changes between servers. PGCluster is the most popular such solution freely available for PostgreSQL.

There are also proprietary and hardware-based replication solutions available supporting a variety of replication models.

86. Is it possible to create a shared-storage PostgreSQL server cluster?

PostgreSQL does not support clustering using shared storage on a SAN, SCSI backplane, iSCSI volume, or other shared media. Such "RAC-style" clustering isn't supported. Only replication-based clustering is currently supported.

See Replication, Clustering, and Connection Pooling information for details.

Shared-storage 'failover' is possible, but it is not safe to have more than one postmaster running and accessing the data store at the same time. Heartbeat and STONITH or some other hard-disconnect option are recommended.

87. Why are my table and column names not recognized in my query? Why is capitalization not preserved?

The most common cause of unrecognized names is the use of double-quotes around table or column names during table creation. When double-quotes are used, table and column names (called identifiers) are stored case-sensitive, meaning you must use double-quotes when referencing the names in a query. Some interfaces, like pgAdmin, automatically double-quote identifiers during table creation. So, for identifiers to be recognized, you must either:

Avoid double-quoting identifiers when creating tables
Use only lowercase characters in identifiers
Double-quote identifiers when referencing them in queries
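The folding rule behind these three options can be sketched in a few lines of Python. This is a simplified model, not PostgreSQL's parser: the function name is hypothetical, and it ignores escaped quotes inside quoted identifiers and identifier length limits.

```python
def fold_identifier(identifier):
    """Return the name that is looked up for an identifier as written in SQL."""
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        return identifier[1:-1]   # quoted: case is preserved
    return identifier.lower()     # unquoted: folded to lowercase


print(fold_identifier('MyTable'))     # mytable
print(fold_identifier('"MyTable"'))   # MyTable
print(fold_identifier('"MyTable"') == fold_identifier('MyTable'))  # False
```

A table created as "MyTable" is therefore not found by a query that writes MyTable without quotes, while an all-lowercase name matches either way.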

88. I lost the database password. What can I do to recover it?

You can't. However, you can reset it to something else. To do this:

Edit pg_hba.conf to allow trust authorization temporarily
Reload the config file (pg_ctl reload)
Connect and issue ALTER ROLE / PASSWORD to set the new password
Edit pg_hba.conf again and restore the previous settings
Reload the config file again

89. Does PostgreSQL have stored procedures?

PostgreSQL doesn't. However, PostgreSQL has very powerful built-in and user-defined function capabilities that can do most things that other RDBMS stored routines (procedures and functions) can do, and in many cases more.

These functions can be of different types and can be implemented in several programming languages. (Refer to the User-Defined Functions section of the documentation for more details.)

PostgreSQL functions can be invoked in many ways. If you want to invoke a function as you would call a stored procedure in another RDBMS (typically a function with side effects whose result you don't care about, for example because it returns void), one option is to use the PL/pgSQL language for your procedure and the PERFORM command. Example:

PERFORM theNameOfTheFunction(arg1, arg2);

Note that invoking instead:

SELECT theNameOfTheFunction(arg1, arg2);

would produce a result even if the function returns void (this result would be one row containing a void value).

PERFORM can thus be used to discard this useless result.

The main limitations on Pg's stored functions - as compared to true stored procedures - are:

inability to return multiple result sets
no support for autonomous transactions (BEGIN, COMMIT and ROLLBACK within a function)
no support for the SQL-standard CALL syntax, though the ODBC and JDBC drivers will translate calls for you.

90. Why don't BEGIN, ROLLBACK and COMMIT work in stored procedures/functions?

PostgreSQL doesn't support autonomous transactions in its stored functions. Like all PostgreSQL queries, stored functions always run in a transaction and cannot operate outside a transaction.

If you need a stored procedure to manage transactions, you can look into the dblink interface or do the work from a client-side script instead. In some cases you can do what you need to using exception blocks in PL/PgSQL, because each BEGIN/EXCEPTION/END block creates a subtransaction.

Wednesday, 27 September 2017

Interview Q and A for PostgreSQL Part - 3