Monday, 30 October 2017

Interview Q and A for PostgreSQL Part - 5

121. PostgreSQL 9.0 Architecture
PostgreSQL Architecture includes sevaral things memory, process and storage file system



PostgreSQL 9.0 Architecture Overview
PostgreSQL instance consists of set of Process and Memory. PostgreSQL uses a simple "process per-user" client/server model. PostgreSQL types of process.
  • The 'postmaster', supervisory daemon process, 'postmaster' is attached to shmmem segment but refrains from accessing to it.
  • Utility processes (bgwriter, walwriter, syslogger, archiver, statscollector and autovacuum lancher) and
  • User Backend process (postgres process itself, Server Process)
When a client request for connection to the database, firstly request is hit to Postmaster daemon process after performing Authentication and authorization it spawns one new backend server process(postgres). From that point on, the frontend process and the backend server communicate without intervention by the postmaster. Hence, the postmaster is always running, waiting for connection requests, whereas frontend and backend processes come and go. The libpq library allows a single frontend to make multiple connections to backend processes.
However, each backend process is a single-threaded process that can only execute one query at a time; so, the communication over any one frontend-to-backend connection is single-threaded. Postmaster and postgres servers run with the user ID of the PostgreSQL "superuser". 
One Postgres process exists for every open database session. Once authenticated with user connection, it directly connects (with who and for what purpose) with shared memory. 
Memory:
Shared Buffers
WAL Buffers
clog Buffers
Other Buffers
Following links will brief more.
 Utility Process:
   Mandatory process: These processes are not having an option of Enable/Disable.
  • BGWriter 
  • WAL Writer
   Optional Process:  These processes are having an option of Enable/Disable.

  • Stats-collector
  • Autovacuum launcher
  • Archiver 
  • Syslogger
  • WAL Sender
  • WAL Receiver

122. How to change all objects ownership in a particular schema in PostgreSQL

No special code included in a script, I basically picked the technique suggested and simplified the implementation method via script. Actually, REASSIGN OWNED BY command does most of the work smoothly, however, it changes database-wide objects ownership regardless of any schema. Two eventualities, where you may not use REASSIGN OWNED BY:
1. If the user by mistake creates all his objects with super-user(postgres), and later intend to change to other user, then REASSIGN OWNED BY will not work and it merely error out as:
  1. postgres=# reassign owned by postgres to user1;  
  2. ERROR:  cannot reassign ownership of objects owned by role postgres because they are required by the database system  
2. If user wish to change just only one schema objects ownership.
Either cases of changing objects, from "postgres" user to other user or just changing only one schema objects, we need to loop through each object by collecting object details from pg_catalog's & information_schema and calling ALTER TABLE / FUNCTION / AGGREGATE / TYPE etc.
I liked the technique of tweaking pg_dump output using OS commands(sed/egrep), because it known that by nature the pg_dump writes ALTER .. OWNER TO of every object (TABLES / SEQUENCES / VIEWS / FUNCTIONS / AGGREGATES / TYPES) in its output. Grep'ing those statements from pg_dump stdout by replacing new USER/SCHEMANAME with sed and then passing back those statements to psql client will fix the things even if the object owned by Postgres user. I used same approach in script and allowed user to pass NEW USER NAME and SCHEMA NAME, so to replace it in ALTER...OWNER TO.. statement.
Script usage and output:
sh change_owner.sh  -n new_rolename -S schema_name  
  
-bash-4.1$ sh change_owner.sh -n user1 -S public  
  
 Summary:  
        Tables/Sequences/Views : 16  
        Functions              : 43  
        Aggregates             : 1  
        Type                   : 2  
You can download the script from here, and there's also README to help you on the usage.

123. Duplicate Rows in a primary key Table.

"ERROR: could not create unique index
DETAIL: Table contains duplicated values."
This error is thrown out by Postgres when it encounters duplicate rows in a primary key table by failing any of these command REINDEX or CREATE UNIQUE INDEX.
Why duplicate rows exists in a table ?
Two thing to my mind.
Firstly, it might be delayed index creation or if you have shared sequences in a database, sharing on two different Primary key Tables might be the cause while restoring the data into table (pg_restore). Secondly, if any huge transaction is taking place on that table and at the backend someone has abruptly stopped the instance, which might also fail the index (primary key) to point to the right row.
How to fix it?
Well, as common practice, when we encounter some duplicate rows in a table (despite of any reason), we first filter the duplicate rows and delete them, and later by doing REINDEX should fix the issue.
Query for finding duplicate rows:
  1. select count(*),primary_column from table_name group by primary_column having count(*) > 1;  
Even after deleting the duplicate rows REINDEX or CREATE UNIQUE INDEX fails, it means your index is not cleaned properly. Above query might not be giving 100% result oriented output what you are expecting, because the query is going to pick the index which is already corrupted with duplicate rows. See the explain plan below.
postgres=# explain select count(*),id from duplicate_test group by id having count(*) > 1;  
                                              QUERY PLAN  
-------------------------------------------------------------------------------------------------------  
 GroupAggregate  (cost=0.00..5042.90 rows=99904 width=4)  
   Filter: (count(*) > 1)  
   ->  Index Scan using duplicate_test_pkey on duplicate_test  (cost=0.00..3044.82 rows=99904 width=4)  
(3 rows)  
We need to catch CTID of duplicate rows from the main table and delete with conditional statement as CTID + PRIMARY KEY VALUE.

I have played a bit with pg_catalogs to voilate Primary Key Table to reproduce the scenario with similar error. (Please don't it)
postgres=# create unique index idup on duplicate_test(id);  
ERROR:  could not create unique index "idup"  
DETAIL:  Key (id)=(10) is duplicated.  
My Table Definition & Data:
postgres=# \d duplicate_test  
Table "public.duplicate_test"  
 Column |  Type   | Modifiers  
--------+---------+-----------  
 id     | integer | not null  
 name   | text    |  
Indexes:  
    "duplicate_test_pkey" PRIMARY KEY, btree (id)  
  
postgres=# select * from duplicate_test ;  
 id |  name  
----+---------  
 10 | Raghav    ---Duplicate  
 20 | John H  
 30 | Micheal  
 10 | Raghav    ---Duplicate  
(4 rows)  
Now, lets fix this....

Step 1. Create a new table from effected table by pulling only two column values CTID and PRIMARY KEY.
postgres=# CREATE TABLE dupfinder AS SELECT ctid AS tid, id FROM duplicate_test;  
SELECT 4  
Step 2. Now, lets run the duplicate finder query with CTID to get the exact duplicates.
postgres=# select * from dupfinder x where exists (select 1 from dupfinder y where x.id = y.id and x.tid != y.tid);  
  tid  | id  
-------+----  
 (0,1) | 10  
 (0,5) | 10  
(2 rows)  
Step 3. On above result, now you can delete one row from main table(effected table) with CTID.
postgres=# delete from duplicate_test where ctid='(0,5)' and id=10;  
DELETE 1  
Step 4. Now, your REINDEX or CREATE UNIQUE INDEX will be successful.
postgres=# create unique index idup on duplicate_test(id);  
CREATE INDEX  
 postgres=# select * from duplicate_test ;  
 id |  name  
----+---------  
 10 | Raghav  
 20 | John H  
 30 | Micheal  
(3 rows)  
Step 5. Don't forget to do immediate VACUUM ANALYZE on the table to update the system catalogs as well CTID movement.


124. PostgreSQL 9.0 Memory & Processes

Every PostgreSQL Instance startup, there will be a set of utilty process (including mandatory and optional process) and memory. Two mandatory process (BGWRITER and WAL Writer) and four optional process (Autovacuum launcher, stats collector, syslogger, and Archiver). You can check it out with the command 'ps -ef | grep postgres' given below in figure 


Overview of the Process and memory.



                               

Above figure shows the processes attached to the PostgreSQL Shared memory.

BGWriter/Writer Process:
BGWRITER or WRITER process is a mandotary process.
All PostgreSQL server process reads data from disk and moves them into Shared Buffer Pool. Shared Buffer pool uses ARC algorithm or LRU(least-recently used) mechanism to select the page it evicts from the pool. BGWRITER spends much of its time sleeping, but every time it wakes, it searches through the shared buffer pool looking for modified pages. After each search, the BGWRITER chooses some number of modified pages, writes them to disk, and evicts those pages from the shared buffer pool. BGWRITER process can be controlled with three parameters BGWRITER_DELAY,BGWRITER_LRU_PERCENT and BGWRITER_LRU_MAXPAGES.

WAL Writer Process:
WAL writer process is a mandatory process.
WAL writer process writes and fsync WAL at convenient Intervals. WAL buffers hold the changes made to the database in the transaction logs, in order to guarantee transaction security. WAL buffers are written out to the disk at every transaction commit, as WAL writer process is responsible to write on to the disk. WAL_WRITER_DELAY parameter for invoking the WAL Writer Process, however there are other parameters which also keeps the WAL Writer busy. Follow below link.

Stats Collector Process:
Stats collecotr process is optional process, default is ON. 
Stats collector process will collect the information about the server activity. It counts number of access to the tables and indexes in both disk-block and individual row items. It also tracks the total number of rows in each table, and information about VACUUM and ANALYZE actions for each table. Collection of statistics adds some overhead to query execution, whether to collect or not collect information. Some of the parameter in the postgresql.conf file will control the collection activity of the stats collector process. Following link will brief more about the stats collector process and its related parameters.

Autovacuum Launcher Process:
Autovacuuming is an optional Process, default is ON.
For automating the execution of VACUUM and ANALYZE command, Autovacuum Launcher is a daemon process consists of multiple processes called autovacuum workers. Autovacuum launcher is a charge of starting autovacuum worker processes for all databases. Launcher will distribute the work across time, attempting to start one worker on each database for every interval, set by the parameter autovacuum_naptime. One worker will be launched for each database, set by the parameter autovacuum_max_workers. Each worker process will check each table within its database and execute VACUUM or ANALYZE as needed. Following will brief about the AUTOVACUUM LAUNCHER PROCESS parameters.

Syslogger Process / Logger Process:



Logging is an optional process, default is OFF.
As per the figure, it is clearly understood that all the utility process + User backends + Postmaster Daemon attached to syslogger process for logging the information about their activities. Every process information is logged under $PGDATA/pg_log with the file .log. 
Note: If the data directory is created with INITDB command, then their wont be pg_log directory under it. Explicit creation is needed.
Debugging more on the process information will cause overhead on the Server. Minimul tunning is always recommended, however, increasing the debug level when required. Link below will brief on logging parameters. 

Archiver Process:



Achiver process is optional process, default is OFF.
Above Figure is made from my observation on the Archiving process in PostgreSQL. Setting up the database in Archive mode means, to capture the WAL data of each segment file once it is filled, and save that data somewhere before the segment file is recycled for reuse. 
Diagrammatical explination on Numbering tags.  
1. On Database Archivelog mode, once the WAL data is filled in the WAL Segment, that filled segment named file is created under $PGDATA/pg_xlog/archive_status by the WAL Writer naming the file as ".ready". File naming will be "segment-filename.ready".
2. Archiver Process triggers on finding the files which are in ".ready" state created by the WAL Writer process. Archiver process picks the 'segment-file_number' of .ready file and copies the file from $PGDATA/pg_xlog location to its concerned Archive destination given in 'archive_command' parameter(postgresql.conf). 
3. On successful completion of copy from source to destination, archiver process renames the "segment-filename.ready" to "segment-filename.done". This completes the archiving process. 
It is understood that, if any files named "segement-filename.ready" found in $PGDATA/pg_xlog/archive_status are the pending files still to be copied to Archive destination. 
For more information on parameters and Archiving, see the below link.

125. Database File Layout

All the data needed for a database cluster is stored within the cluster's data directory, commonly referred to as PGDATA (after the name of the environment variable that can be used to define it). A common location for PGDATA is /var/lib/pgsql/data. Multiple clusters, managed by different server instances, can exist on the same machine.
The PGDATA directory contains several subdirectories and control files. In addition to these required items, the cluster configuration files postgresql.conf, pg_hba.conf, and pg_ident.conf are traditionally stored in PGDATA (although in PostgreSQL 8.0 and later, it is possible to keep them elsewhere).
Table Contents of PGDATA
Item
Description
PG_VERSION
A file containing the major version number of PostgreSQL
base
Subdirectory containing per-database subdirectories
global
Subdirectory containing cluster-wide tables, such as pg_database
pg_clog
Subdirectory containing transaction commit status data
pg_multixact
Subdirectory containing multitransaction status data (used for shared row locks)
pg_notify
Subdirectory containing LISTEN/NOTIFY status data
pg_serial
Subdirectory containing information about committed serializable transactions
pg_snapshots
Subdirectory containing exported snapshots
pg_stat_tmp
Subdirectory containing temporary files for the statistics subsystem
pg_subtrans
Subdirectory containing subtransaction status data
pg_tblspc
Subdirectory containing symbolic links to tablespaces
pg_twophase
Subdirectory containing state files for prepared transactions
pg_xlog
Subdirectory containing WAL (Write Ahead Log) files
postmaster.opts
A file recording the command-line options the server was last started with
postmaster.pid
A lock file recording the current postmaster process ID (PID), cluster data directory path, postmaster start timestamp, port number, Unix-domain socket directory path (empty on Windows), first valid listen_address (IP address or *, or empty if not listening on TCP), and shared memory segment ID (this file is not present after server shutdown)
For each database in the cluster there is a subdirectory within PGDATA/base, named after the database's OID in pg_database. This subdirectory is the default location for the database's files; in particular, its system catalogs are stored there.
Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode. But for temporary relations, the file name is of the form tBBB_FFF, where BBB is the backend ID of the backend which created the file, and FFF is the filenode number. In either case, in addition to the main file (a/k/a main fork), each table and index has a free space map, which stores information about free space available in the relation. The free space map is stored in a file named with the filenode number plus the suffix _fsm. Tables also have a visibility map, stored in a fork with the suffix _vm, to track which pages are known to have no dead tuples. The visibility map is described further. Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init.
Caution
Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID. Avoid assuming that filenode and table OID are the same. Also, for certain system catalogs including pg_class itself, pg_class.relfilenode contains zero. The actual filenode number of these catalogs is stored in a lower-level data structure, and can be obtained using the pg_relation_filenode() function.
When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. The first segment's file name is the same as the filenode; subsequent segments are named filenode.1, filenode.2, etc. This arrangement avoids problems on platforms that have file size limitations. (Actually, 1 GB is just the default segment size. The segment size can be adjusted using the configuration option --with-segsize when building PostgreSQL.) In principle, free space map and visibility map forks could require multiple segments as well, though this is unlikely to happen in practice.
A table that has columns with potentially large entries will have an associated TOAST table, which is used for out-of-line storage of field values that are too large to keep in the table rows proper. pg_class.reltoastrelid links from a table to its TOAST table, if any.
Tablespaces make the scenario more complicated. Each user-defined tablespace has a symbolic link inside the PGDATA/pg_tblspc directory, which points to the physical tablespace directory (i.e., the location specified in the tablespace's CREATE TABLESPACE command). This symbolic link is named after the tablespace's OID. Inside the physical tablespace directory there is a subdirectory with a name that depends on the PostgreSQL server version, such as PG_9.0_201008051. (The reason for using this subdirectory is so that successive versions of the database can use the same CREATE TABLESPACE location value without conflicts.) Within the version-specific subdirectory, there is a subdirectory for each database that has elements in the tablespace, named after the database's OID. Tables and indexes are stored within that directory, using the filenode naming scheme. The pg_default tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/base. Similarly, the pg_global tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/global.
The pg_relation_filepath() function shows the entire path (relative to PGDATA) of any relation. It is often useful as a substitute for remembering many of the above rules. But keep in mind that this function just gives the name of the first segment of the main fork of the relation — you may need to append a segment number and/or _fsm or _vm to find all the files associated with the relation.
Temporary files (for operations such as sorting more data than can fit in memory) are created within PGDATA/base/pgsql_tmp, or within a pgsql_tmp subdirectory of a tablespace directory if a tablespace other than pg_default is specified for them. The name of a temporary file has the form pgsql_tmpPPP.NNN, where PPP is the PID of the owning backend and NNN distinguishes different temporary files of that backend.