91. Why is "SELECT count(*) FROM bigtable;" slow?
In 9.2 and above it generally isn't, thanks to index-only scans. For more information on older versions, see Slow Counting.
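A quick way to check on your own system (a sketch; bigtable is a placeholder for any indexed table):
EXPLAIN SELECT count(*) FROM bigtable;
-- On 9.2+ the plan should show an "Index Only Scan" node, provided an
-- index exists and the table has been VACUUMed recently (index-only
-- scans rely on the visibility map, which VACUUM maintains). A plain
-- "Seq Scan" means the older, slower whole-table count is being used.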
92. Why is my query much slower when run as a prepared query?
In PostgreSQL 9.2 and above this issue is rare, because PostgreSQL can decide to use a generic or value-optimised plan on a per-execution basis. The planner might still make the wrong choice, so the following remains somewhat relevant:
When PostgreSQL has the full query with all parameters known by planning time, it can use the statistics gathered for the table to find out if the values used in the query are very common or very uncommon in a column. This lets it change the way it fetches the data to be more efficient, as it knows to expect lots or very few results from a certain part of the query. For example, it might choose a sequential scan instead of an index scan if you search for 'active=y' and it knows that 99% of the records in the table have 'active=y', because in this case a sequential scan will be much faster.
In a prepared query, PostgreSQL doesn't have the value of all parameters when it's creating the plan. It has to try to pick a "safe" plan that should work fairly well no matter what value you supply as the parameter when you execute the prepared query. Unfortunately, this plan might not be very appropriate if the value you supply is vastly more common, or vastly less common, than a typical value in the table.
If you suspect this issue is affecting you, start by using the EXPLAIN command to compare the slow and fast queries. Look at the output of EXPLAIN SELECT query... and compare it to the result of PREPARE query...; EXPLAIN EXECUTE query... to see if the plans are notably different. EXPLAIN ANALYZE may give you more information, such as estimated and actual row counts.
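As a sketch (the users table and its active column are hypothetical):
EXPLAIN SELECT * FROM users WHERE active = 'y';
PREPARE q(text) AS SELECT * FROM users WHERE active = $1;
EXPLAIN EXECUTE q('y');
-- If the direct query shows (say) a Seq Scan while the prepared version
-- shows an Index Scan, or vice versa, the generic plan chosen for the
-- prepared statement is the likely culprit.
DEALLOCATE q;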
Usually people having this problem are trying to use prepared queries as a security measure to prevent SQL injection, rather than as a performance tuning option for expensive-to-plan queries frequently executed with a variety of different parameters. Those people should consider using client-side prepared statements if their client interface (e.g. PgJDBC) supports it. The PostgreSQL protocol supports parameterised queries without server-side persistent prepared statements, and most client drivers support using that via their client-side prepared statement interfaces.
At present, PostgreSQL does not offer a way to request re-planning of a prepared statement using a particular set of parameter values; however, 9.2 and above may do so automatically where statistics indicate that it is desirable.
93. Why is my query much slower when run in a function than standalone?
Queries in PL/PgSQL functions are prepared and cached, so they execute in much the same way as if you'd PREPAREd and then EXECUTEd the query yourself.
If you're having really severe issues with this that improving the table statistics or adjusting your query don't help with, you can work around it by forcing PL/PgSQL to re-prepare your query at every execution. To do this, use the EXECUTE ... USING statement in PL/PgSQL to supply your query as a textual string, as in the sketch below. Alternatively, the quote_literal or quote_nullable functions may be used to escape parameters substituted into query text.
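A minimal sketch, assuming a hypothetical orders table with a status column:
CREATE OR REPLACE FUNCTION count_orders(p_status text)
RETURNS bigint AS $$
DECLARE
    result bigint;
BEGIN
    -- Supplying the query as a string forces a fresh plan on every call;
    -- USING substitutes the parameter safely, so no manual quoting is needed.
    EXECUTE 'SELECT count(*) FROM orders WHERE status = $1'
        INTO result
        USING p_status;
    RETURN result;
END;
$$ LANGUAGE plpgsql;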
94. Why do my strings sort incorrectly?
First, make sure you are using the locale you want to be using. Use SHOW lc_collate to show the database-wide locale in effect. If you are using per-column collations, check those too. If everything is how you want it, then read on.
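A sketch of both checks (the table name is a placeholder):
SHOW lc_collate;   -- database-wide collation locale
SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'mytable' AND collation_name IS NOT NULL;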
PostgreSQL uses the C library's locale facilities for sorting strings. So if the sort order of the strings is not what you expect, the issue is likely in the C library. You can verify the C library's idea of sorting using the sort utility on a text file, e.g., LC_COLLATE=xx_YY.utf8 sort testfile.txt. If this results in the same order that PostgreSQL gives you, then the problem is outside of PostgreSQL.
PostgreSQL deviates from the libc behavior insofar as it breaks ties by sorting strings in byte order. This should rarely make a difference in practice, and is usually not the source of the problem when users complain about the sort order, but it could affect cases where, for example, combining and precomposed Unicode characters are mixed.
If the problem is in the C library, you will have to take it up with your operating system maintainers. Note, however, that while actual bugs in locale definitions of C libraries have been known to exist, it is more likely that the C library is correct, where "correct" means it follows some recognized international or national standard. Possibly, you are expecting one of multiple equally valid interpretations of a language's sorting rules.
Common complaint patterns include:
- Spaces and special characters: The sorting algorithm normally works in multiple passes. First, all the letters are compared, ignoring spaces and punctuation. Then, spaces and punctuation are compared to break ties. (This is a simplification of what actually happens.) It's not possible to change this without changing the locale definitions themselves (and even then it's difficult). You might want to restructure your data slightly to avoid this problem. For example, if you are sorting a name field, you could split the field into first and last name fields, avoiding the space in between.
- Upper/lower case: Locales other than the C locale generally sort upper and lower case letters together. So the order will be something like a A b B c C ... instead of the A B C ... a b c ... that a sort based on ASCII byte values will give. That is correct.
- German locale: sort order of ä as a or ae. Both of these are valid (see http://de.wikipedia.org/wiki/Alphabetische_Sortierung), but most C libraries only provide the first one. Fixing this would require creating a custom locale. This is possible, but will take some work.
- It is not in ASCII/byte order. No, it's not, and it's not supposed to be. ASCII is an encoding, not a sort order. If you want this, you can use the C locale, but then you lose the ability to sort non-ASCII characters meaningfully.
That said, if you are on Mac OS X or a BSD-family operating system, and you are using UTF-8, then give up. The locale definitions on those operating systems are broken.
95. What is Infinite Cache?
Infinite Cache allows you to increase the amount of data maintained in an in-memory cache by distributing the cache across existing inexpensive commodity hardware farms. Compression of the cached data allows entire databases to be stored in cache, in a durable fashion, for lightning-fast performance. By keeping more or all of the data in a RAM-based cache, far fewer of the much slower disk operations are needed. Infinite Cache is also completely transparent to applications, so application developers need no special caching code to safeguard the durability of write operations.
96. How does Infinite Cache compare to caching everything in the application server?
While the architecture of caching everything at the application tier has traditionally been effective, maintaining such a cache has been difficult to code and burdensome. Additional development is necessary in the application to properly synchronize the cache with the data in the database, and it limits overall data flexibility by restricting how the data in the backend database can be changed.
97. I have a transactional system; will Infinite Cache still help?
Yes. Infinite Cache is most effective in a read-mostly environment, but the performance benefits are still tangible as transactions are mixed into the workload. For example, even in more extreme transactional cases such as a 50-50 read vs. write mix, Infinite Cache still provides a 2X performance gain.
98. What is pgpool?
pgpool is a connection pool server for PostgreSQL-based databases. pgpool runs between Postgres clients (front ends) and servers (back ends) and caches a collection of established connections to the database, so clients can connect quickly, improving performance. A Postgres client connects to pgpool as if it were a standard Postgres Plus Advanced Server.
99. What is Migration Toolkit?
Migration Toolkit comes with Advanced Server and provides migration from Oracle, Sybase, and SQL Server databases to Postgres. It migrates Oracle schema, data, triggers, stored procedures, packages, synonyms, functions, etc. For SQL Server and Sybase it migrates schema and data.
100. When I start the postmaster, I get a Bad System Call or core dumped message. Why?
It could be a variety of problems, but first check to see that you have System V extensions installed in your kernel. PostgreSQL requires kernel support for shared memory and semaphores.
101. When I try to start the postmaster, I get IpcMemoryCreate errors. Why?
You either do not have shared memory configured properly in your kernel or you need to enlarge the shared memory available in the kernel. The exact amount you need depends on your architecture and how many buffers and backend processes you configure for the postmaster. For most systems, with default numbers of buffers and processes, you need a minimum of ~1MB.
102. When I try to start the postmaster, I get IpcSemaphoreCreate errors. Why?
If the error message is IpcSemaphoreCreate: semget failed (No space left on device), then your kernel is not configured with enough semaphores. Postgres needs one semaphore per potential backend process. A temporary solution is to start the postmaster with a smaller limit on the number of backend processes. Use -N with a parameter less than the default of 32. A more permanent solution is to increase your kernel's SEMMNS and SEMMNI parameters.
If the error message is something else, you might not have semaphore support configured in your kernel at all.
103. How do I prevent other hosts from accessing my PostgreSQL database?
By default, PostgreSQL only allows connections from the local machine using Unix domain sockets. Other machines will not be able to connect unless you add the -i flag to the postmaster and enable host-based authentication by modifying the file $PGDATA/pg_hba.conf accordingly. This will allow TCP/IP connections.
104. Why can't I connect to my database from another machine?
The default configuration allows only Unix domain socket connections from the local machine. To enable TCP/IP connections, make sure the postmaster has been started with the -i option, and add an appropriate host entry to the file pgsql/data/pg_hba.conf.
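A sketch of such a host entry (the subnet is a placeholder, and the exact column layout varies between versions):
host    all    all    192.168.1.0/24    md5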
105. Why can't I access the database as the root user?
You should not create database users with user id 0 (root). They will be unable to access the database. This is a security precaution, because of the ability of users to dynamically link object modules into the database engine.
106. All my servers crash under concurrent table access. Why?
This problem can be caused by a kernel that is not configured to support semaphores.
107. What are the pg_sorttempNNN.NN files in my database directory?
They are temporary files generated by the query executor. For example, if a sort needs to be done to satisfy an ORDER BY, and the sort requires more space than the backend's -S parameter allows, then temporary files are created to hold the extra data.
The temporary files should be deleted automatically, but might not be if a backend crashes during a sort. If you have no backends running at the time, it is safe to delete the pg_sorttempNNN.NN files.
108. Why is the system confused about commas, decimal points, and date formats?
Check your locale configuration. PostgreSQL uses the locale setting of the user that ran the postmaster process. There are postgres and psql SET commands to control the date format. Set those accordingly for your operating environment.
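For example (a sketch; pick the style and locale names your environment actually provides):
SET datestyle TO 'ISO, DMY';        -- date input/output format for this session
SET lc_monetary TO 'en_US.UTF-8';   -- locale used for currency formatting
SHOW lc_numeric;                    -- locale governing decimal points and digit grouping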
109. Why do I get the error "FATAL: palloc failure: memory exhausted"?
It is possible you have run out of virtual memory on your system, or your kernel has a low limit for certain resources. Try this before starting the postmaster:
ulimit -d 65536
limit datasize 64m
Depending on your shell, only one of these may succeed, but it will set your process data segment limit much higher and perhaps allow the query to complete. This command applies to the current process and all subprocesses created after the command is run. If you are having a problem with the SQL client because the backend is returning too much data, try it before starting the client.
110. Explain about concurrency with the help of MVCC?
Multi-version concurrency control is used to manage concurrency. This feature is very useful because changes made in the database will not be visible to other users until the transaction is completed. This removes the need for read locks, and it also reinforces the ACID principles.
111. Explain about string constants?
A string constant is a sequence of characters bounded by single quotes. This feature is used when inserting a character value or passing character data to database objects. PostgreSQL also allows single quotes to be embedded in a string by escaping them with a C-style backslash, which is important when parsing data.
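For example, both of these sketches embed a single quote in a string constant:
SELECT 'O''Reilly';    -- standard SQL: double the embedded quote
SELECT E'O\'Reilly';   -- PostgreSQL escape-string syntax: C-style backslash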
112. Explain about tokens?
Tokens can also contain several special character symbols. A token can be a keyword, a constant, an identifier, or a quoted identifier. Keywords have predefined SQL meanings and include the SQL commands. Variable names, such as those of tables and columns, are represented by identifiers.
113. Explain about pgAdmin?
pgAdmin is a graphical front-end administration tool. It is available as free software released under the Artistic License. pgAdmin III is the newest database administration tool in the series, also released under the Artistic License.
114. Explain about database administration tools?
There are various database administration tools; they include:
1) psql
2) pgAdmin
3) phpPgAdmin
Most of these tools are front-end administration tools and web-based interfaces. Of these, phpPgAdmin is the most popular one.
115. Explain about the command enable debug?
This option, passed to configure when building PostgreSQL, enables compilation of all libraries and applications with debugging symbols. This generally slows down the system and also increases the binary file size, but the debugging symbols present can assist developers in noticing bugs and problems associated with their script.
116. State some of the advanced features of PostgreSQL?
The following features are present in PostgreSQL:
1) Object-relational database
2) Extensibility and support for SQL
3) Database validation and flexible API
4) Procedural languages and MVCC
5) Client/server architecture and WAL
117. Explain about Write-Ahead Logging?
This feature increases the reliability of the database by logging changes before any change or update is written to the database. This provides a log of database activity in case of a database crash, and helps work resume from the point where it was discontinued.
118. Explain Multi-version concurrency control (MVCC)?
PostgreSQL implements MVCC (Multi-Version Concurrency Control), which allows users to read data at the same time as writers make changes. MVCC is used to avoid unnecessary locking of the database, removing the delay a user would otherwise experience when someone else is working on the same content. All the transactions are kept as a record.
This is an important feature for concurrency in database applications, as it can allow the following:
- Better performance because of fewer locks
- Greatly reduced deadlocking
- Simplified application design and management
MVCC is a core part of PostgreSQL and cannot be turned off, nor would you really want it to be. The internals of MVCC have some implications for the DBA that need to be understood. The price for these benefits is that the SQL UPDATE command can cause tables and indexes to grow in size, because it leaves behind dead row versions. DELETEs and aborted INSERTs also take up space that must be reclaimed by garbage collection. VACUUM is the mechanism by which we reclaim space, though there is also another internal feature named HOT, which does much of this work automatically for us. The sketch below shows dead rows accumulating and being reclaimed.
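A minimal sketch using a hypothetical throwaway table (statistics are updated asynchronously, so the dead-row count may lag by a moment):
CREATE TABLE mvcc_demo (id int, val text);
INSERT INTO mvcc_demo SELECT g, 'x' FROM generate_series(1, 1000) g;
UPDATE mvcc_demo SET val = 'y';   -- leaves 1000 dead row versions behind
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'mvcc_demo';
VACUUM mvcc_demo;                 -- reclaims the dead rows' space for reuse
DROP TABLE mvcc_demo;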
119. How can I limit what resources a query or user uses (prioritizing users, queries, or databases)?
PostgreSQL has no facilities to limit what resources a particular user, query, or database consumes, or correspondingly to set priorities such that one user/query/database gets more resources than others. It's necessary to use operating system facilities to achieve what limited prioritization is possible.
There are three main resources that PostgreSQL users, queries, and databases will contend for:
- Memory
- CPU
- Disk I/O
Of these, disk I/O is commonly a bottleneck for database applications, but that's not always the case. Some schema designs and queries are particularly CPU heavy. Others really benefit from having lots of memory to work with, typically for sorting.
Before struggling too much with prioritizing your queries/users/databases, it's worthwhile to optimize your queries and tune your database. You may find that you can get perfectly acceptable performance without playing with priorities or taking extreme measures, using techniques such as:
- Improving your queries
- Tuning autovacuum to reduce bloat
- Generally polishing your cluster's performance
- Avoiding use of VACUUM FULL, which can lead to bloated indexes that eat lots of memory and take forever to scan, wasting disk I/O bandwidth. See the wiki page on VACUUM FULL for more information.
People often complain of pegged (100%) CPU and assume that's the cause of database slowdowns. That's not necessarily the case: a system may show an apparent 100% CPU use, but in fact be mainly limited by I/O bandwidth.
Prioritizing CPU
For adjusting the CPU priority of PostgreSQL processes, you can use "renice" (on UNIX systems), but it's a bit clumsy to do, since you need to "renice" the backend of interest, not the client program connected to that backend. You can get the backend process id using the SQL query "SELECT pg_backend_pid()" or by looking at the pg_stat_activity view, as in the sketch below.
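A sketch (column names are those of the 9.2-era view; older releases call the pid column procpid):
SELECT pg_backend_pid();                           -- PID of your own backend
SELECT pid, usename, query FROM pg_stat_activity;  -- PIDs of all backends
-- then, as root on the database server, renice the PID of interest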
One significant limitation of "renice", or any approach based on the setpriority() call, is that on most UNIX-like platforms one must be root to lower the numerical priority value (i.e. schedule the process to run more urgently) of a process.
Increasing the priority of important backends, via a root user's call to "renice", instead of lowering the priority of unimportant ones, may be more effective.
Prioritizing I/O
I/O is harder. Some operating systems offer I/O priorities for processes, like Linux's ionice, and you'd think you could use these in a similar way to how you use 'nice'. Unfortunately, that won't work particularly well, because a lot of the work PostgreSQL does - especially disk writes - is done via a separate background writer process working from memory shared by all backends. Similarly, the write-ahead logs are managed by their own process via shared memory. Because of those two, it's very hard to effectively give one user priority over another for writes. ionice should be moderately effective for reads, though.
As with "nice", effective control on a per-connection level will require the addition of appropriate helper functions, and user co-operation is required to achieve per-user priorities.
Better separation of I/O workloads will require cluster separation, which has its own costs and is only effective on the per-database level.
Prioritizing memory
PostgreSQL does have some tunable parameters for memory use that are per-client, particularly work_mem and maintenance_work_mem. These may be set within a given connection to allow that backend to use more than the usual amount of memory for things like sorts and index creation. You can set these to conservative, low values in postgresql.conf, then use the SET command to assign higher values to them for a particular backend, e.g. SET work_mem = '100MB';
You can set different values for work_mem and maintenance_work_mem using per-user GUC variables. For example: ALTER USER myuser SET work_mem = '50MB';
You cannot affect the shared memory allocation done with settings like shared_buffers this way; that value is fixed at database startup time and can't be changed without restarting the server.
There's no easy way in most operating systems to prioritize memory allocations, so that, for example, the OS would prefer to swap one backend's memory out instead of another's.
120. How can admin tasks log in to the database without a password?
Most access to the database, including the psql program, goes through the libpq library. This library includes a feature where, if you create a file called .pgpass (or the file referenced by PGPASSFILE), you can put the password needed to connect as a user in there. This allows automating routine administration tasks through mechanisms like cron.
The format of the .pgpass file is the following:
hostname:port:database:username:password
The character '*' can match any value in any of the fields (except password).
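For example (every value here is a placeholder):
localhost:5432:mydb:admin:secretpassword
*:*:*:backup:anotherpassword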
N.B.: if the environment variable PGPASSWORD is set, then the ~/.pgpass file is not read.
Example PGPASSFILE value for a path with spaces on Windows 7 64-bit:
set PGPASSFILE=C:\Program Files\someapp\pgpass.conf
Note that the environment variable value must not use " (double quotes).