Thursday, 7 June 2018

Interview Q and A for PostgreSQL Part - 10

212: What if I forgot my Postgres user password?
If you have forgotten your postgres password, you can log in without a password by changing "md5" to "trust" in the pg_hba.conf file and restarting the PostgreSQL server.
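For illustration, the relevant pg_hba.conf entry might look like this (the exact lines depend on your installation, so treat this as an assumption):
# pg_hba.conf: temporarily change the authentication method to trust
local   all   postgres   trust   # was: md5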
After logging in, you can ALTER the postgres user:
ALTER USER postgres PASSWORD 'MyNewPassword';
Now change your pg_hba.conf entry back from "trust" to "md5" and restart the PostgreSQL server.

213: How can we get the first record for each group?
In PostgreSQL, we can get the first record for each group using several different options (a DISTINCT ON example follows the list):
  • Using DISTINCT ON
  • Using LATERAL
  • CTE with ROW_NUMBER()
  • CTE with LATERAL
  • Subquery with ROW_NUMBER()
  • Using array_agg()
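As a minimal sketch, assuming a hypothetical orders table, DISTINCT ON returns the latest order per customer:
SELECT DISTINCT ON (customer_id)
       customer_id, order_date, amount
FROM orders
ORDER BY customer_id, order_date DESC;
The ORDER BY must start with the DISTINCT ON expressions; the remaining sort keys decide which row is kept for each group.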

214: Why can a new user access all databases without any grants?

When you create a new database user in PostgreSQL, it has the CONNECT privilege by default.
It cannot access any existing tables or data, but it can create new tables in any database.
To prevent a new user from connecting to an existing database, we should run the REVOKE command for that particular user or role.
Revoke the CONNECT privilege on a database:
REVOKE CONNECT ON DATABASE Database_Name FROM User_Name;
Revoke all privileges on a database:
REVOKE ALL PRIVILEGES ON DATABASE Database_Name FROM User_Name;
Then GRANT access only to the required databases:
GRANT CONNECT ON DATABASE Database_Name TO User_Name;

215: What is the SKIP LOCKED feature in PostgreSQL?
PostgreSQL 9.5 introduced a powerful feature called SKIP LOCKED, which skips over rows that are locked by other, not-yet-committed transactions.
Let me explain with a practical example.
In our project we implemented a thread mechanism: different threads run and each takes a fixed set of records from the database.
These threads were implemented to perform reverse geocoding on the stored latitude and longitude values.
I am giving 10000 records to each thread in each batch.
The average execution time of each thread is 5 to 7 seconds.
While one batch of 10000 records is being processed by thread_1 for those 5 to 7 seconds, the same records should not be taken by any other thread, because they are already being processed by thread_1.
Thread_2 should take the next 10000 records, the ones not taken by thread_1.
With PostgreSQL 9.5, we do not require any additional locking mechanism or isolation level to achieve this.
We can achieve it very easily using the FOR UPDATE SKIP LOCKED option, as sketched below. When we SELECT data using this option, rows locked by running (uncommitted) transactions are skipped, and only the rows that are free to be processed are returned.
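A minimal sketch of such a worker query, assuming a hypothetical geo_points queue table:
BEGIN;
SELECT id, latitude, longitude
FROM geo_points
WHERE processed = false
ORDER BY id
LIMIT 10000
FOR UPDATE SKIP LOCKED;
-- ...perform reverse geocoding here, then UPDATE the selected rows as processed...
COMMIT;
Rows locked by another thread's open transaction are silently skipped, so each thread receives a disjoint batch without any extra coordination.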

216: How to force Autovacuum to run more aggressively?

On one of our Postgres reporting servers, autovacuum was not running as frequently as we expected.

The VACUUM command removes dead tuples from disk, which improves the overall performance of the PostgreSQL server.
When the autovacuum-related parameters are enabled, vacuuming is performed automatically whenever the database load allows it and the tables are not exclusively locked.
The default autovacuum settings are fine for many systems, but my reporting system is heavily loaded with tons of bulk operations.
In this situation autovacuum will not run frequently, so we should change the default values of the autovacuum-related parameters to make it more aggressive.
Below are the changes required to force autovacuum to run more frequently.
First, enable logging for the autovacuum process:
log_autovacuum_min_duration = 0

Increase the number of workers and make them check tables more often:
autovacuum_max_workers = 6
autovacuum_naptime = 15s

Decrease the thresholds and scale factors so that vacuum and analyze trigger sooner:
autovacuum_vacuum_threshold = 25
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_threshold = 10
autovacuum_analyze_scale_factor = 0.05

Reduce the cost-based throttling so that autovacuum does more work before sleeping:
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_cost_limit = 1000

217: How is a query internally processed in PostgreSQL?
There are several stages involved in the PostgreSQL query execution process; here is a short note on each.
Connection: A connection has to be established from the application program to the PostgreSQL server.
Parser: It checks the syntax of the query sent by the application and creates a query tree.
Rewrite System: It takes the query tree from the parser and applies any rewrite rules. For example, if the query references a VIEW, it rewrites the query to access the base tables underlying that view.
Planner or Optimizer: It creates all possible query paths and then chooses the cheapest one for execution.
For example, if it finds two paths, a sequential scan without an index and an index scan using an index, it estimates the cost of each and executes the cheaper one. You can inspect the chosen path with EXPLAIN, as shown below.
Executor: It takes the plan from the planner and extracts the required set of rows. Internally it works as a pipeline: it walks the query plan and delivers the required rows one by one.
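A quick way to see which path the planner picked, using a hypothetical table:
EXPLAIN SELECT * FROM mytable WHERE id = 10;
-- shows e.g. an Index Scan if an index on id exists, otherwise a Seq Scan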

218: Should we create multiple databases, or multiple schemas in one database?

I have observed that many new PostgreSQL users create multiple databases in a PostgreSQL server.
This is not wrong; it is common practice to create multiple databases for the different purposes of a project.
But in PostgreSQL we should create multiple schemas instead of multiple databases.
I recommend the one-database, many-schemas approach for PostgreSQL because cross-database queries are very costly in PostgreSQL.
In Microsoft SQL Server we create different databases and can also execute queries across them by specifying a database name, but this is not possible in PostgreSQL.
Schemas in PostgreSQL are a lot like packages in Oracle, and a number of schemas is more lightweight than a number of databases.
We then need to back up only a single database, including all its schemas, and we can include or exclude a particular schema during a backup or restore operation.
We can also create different permission groups and roles for each schema to restrict database users, as sketched below.
Especially for PostgreSQL, we should go with multiple schemas in one database, because this is the more flexible and usable approach.
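A minimal sketch of the multi-schema approach, with hypothetical schema, table, and role names:
CREATE SCHEMA sales;
CREATE SCHEMA hr;
CREATE TABLE sales.orders (id serial PRIMARY KEY, amount numeric);
-- Restrict a role to a single schema:
GRANT USAGE ON SCHEMA sales TO reporting_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO reporting_role;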

219: PostgreSQL: Important Parameters for better Performance

Here, I have listed some important parameters. You can set all of these parameters in the postgresql.conf file.
max_connections:
We can configure the maximum number of client connections, and this is very important because every connection requires memory and other resources.
PostgreSQL can handle a few hundred connections, but if we are planning for thousands of connections, we should use a connection pooling mechanism (such as PgBouncer) to reduce the per-connection overhead.
shared_buffers:
PostgreSQL uses a separate process for each database connection, but all of these processes share one dedicated memory area that PostgreSQL uses for caching data pages, so it provides a lot of performance gain in your system.
If we have more load, we can set larger values for shared_buffers.
Generally, we should set shared_buffers to about 1/4 of the main memory.
You can calculate it using this formula:
(0.25 * Main Memory)
wal_buffers:
This parameter defines how much memory is used for caching write-ahead log entries before they are flushed to disk. The value is quite small, but it is worth increasing on heavily loaded servers.
Recommended value: 8MB.
work_mem:
This is also one of the most important parameters; it sets the private work space used for sort and hash operations by each connection.
shared_buffers is a dedicated buffer shared by all connections, but with work_mem we set the memory available to each query operation.
If it is configured wrongly, it can create a big problem, because a big, complex query that runs many times in parallel can consume a lot of memory.
We should configure this parameter based on the number of connections and the type of queries we run on the server.
We can calculate a starting point using this method:
Memory (MB) / max_connections
If the result comes out very high, divide by 2 again.
The result should be at least 16MB; otherwise, we should buy more RAM.
maintenance_work_mem:
This is similar to the work_mem parameter, but it is used only for maintenance operations such as VACUUM and CLUSTER.
Generally, we do maintenance when the query load is not too high.
I suggest setting a big value for this parameter so that maintenance tasks complete quickly.
The default is 16MB; recommended is (Memory / 8).
effective_cache_size:
This parameter is dedicated to the PostgreSQL query planner, which is responsible for choosing the best optimized execution path for a query.
It tells the planner how much memory is available for caching data (shared_buffers plus the operating system cache); it does not allocate any memory itself, but higher values make index scans look cheaper to the planner.
The default is 65536; recommended is (Memory (MB) * 0.75).
autovacuum:
This is also one of the important parameters because of PostgreSQL's MVCC nature.
It is really required, and it solves lots of performance-related issues by removing dead tuples.
Whenever a table is idle, autovacuum vacuums it, so any manual VACUUM afterwards executes fast because it has very little to remove.
synchronous_commit:
This parameter controls whether a COMMIT waits until the WAL has been safely flushed to disk before success is reported to the client.
Whether it should be on or off is up to our choice: when we set it off, bulk insertion is faster, but if the server crashes, the most recently committed transactions may be lost (the data itself is not corrupted); when we keep it on, bulk insertion may be slower, but the chance of losing committed transactions is very small.
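As an illustration only, applying the formulas above to a dedicated server with 16GB of RAM might give values like these in postgresql.conf (all numbers are assumptions; tune them for your workload):
max_connections = 100
shared_buffers = 4GB            # 0.25 * 16384MB
effective_cache_size = 12GB     # 0.75 * 16384MB
maintenance_work_mem = 2GB      # 16384MB / 8
work_mem = 80MB                 # ~16384MB / 100 connections, halved per the rule above
wal_buffers = 8MB               # recommended value above
autovacuum = on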

220: What are stalls or waits in the system?
The first wait event type is the lightweight lock (LWLock), which is used to protect a particular data structure in shared memory.
The second wait event type is the named lightweight lock tranche, which indicates that the server process is waiting for one of a group of related lightweight locks.
The third wait event type is the heavyweight lock, which is used primarily to protect SQL-visible objects such as tables.
The fourth wait event type is BufferPin, where the server process waits to access a data buffer during a period when no other process may be examining that buffer. For a detailed explanation, refer to the PostgreSQL documentation at http://www.postgresql.org/docs/devel/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW
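Assuming PostgreSQL 9.6 or later, where pg_stat_activity exposes the wait_event_type and wait_event columns, the current waits can be inspected like this:
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;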

221: What types of stats are there in PostgreSQL?

There are two kinds of stats in PostgreSQL (a query contrasting the two follows the lists below):

Data distribution stats

  • statistics describing the distribution of the data, collected by ANALYZE / autovacuum
  • used by the planner/optimizer when planning queries
  • stored in the database, in pretty much regular tables protected by WAL
  • no official way to reset this kind of statistics
  • problems here usually result in choosing poor query plans

Monitoring stats

  • statistics tracking the operation of the database system itself
  • used for monitoring purposes and by autovacuum (to identify objects that need maintenance)
  • stored outside the database, in a binary file (pgstat.stat) or a collection of per-database files (since 9.3)
  • this is what gets reset by pg_stat_reset()
  • the most common problem is high I/O load when the pgstat.stat file gets large (due to tracking many database objects)
  • resetting the stats is not a solution: it won't make the problem go away, and it negatively impacts autovacuum (possibly causing I/O load far worse than what it solved)
  • pre-9.3 solution: move the pgstat.stat file to a tmpfs filesystem, or consider upgrading to 9.3
  • 9.3 and beyond: you shouldn't really have problems with this (unless you're using a single database with many schemas)
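A quick way to look at both kinds, using a hypothetical table name:
-- Data distribution stats (per column), collected by ANALYZE:
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'mytable';

-- Monitoring stats (per table), consumed by autovacuum:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'mytable';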

222: Which tools/utilities are used for PostgreSQL scalability?
The PostgreSQL community offers free extensions such as PL/Proxy (from Skype) that allow distributing data across database clusters, as well as separate clustering solutions based on PostgreSQL, such as Postgres-XC and Postgres-XL.

223: Why is the below warning in the log, and how do we fix it?
               WARNING: pgstat wait timeout
This essentially means that a backend process asked the statistics collector to write fresh data, but the collector was unable to write it fast enough (usually because of overloaded I/O, or perhaps because the filesystem is full).

tmpfs

The one universal solution that works pretty well is a RAM-based filesystem. Linux has tmpfs; most other systems have something similar. Do something like this:
mkdir /mnt/pg_stats_tmp
mount -t tmpfs -o size=1G none /mnt/pg_stats_tmp
You'll get a tmpfs filesystem that may grow up to 1GB (the space is not reserved all the time). Then change stats_temp_directory in postgresql.conf so that it points to the new path:
stats_temp_directory = '/mnt/pg_stats_tmp'
After a restart, PostgreSQL will write the statistics files to the new location (and copy them back when it is stopped).
Be careful when setting the filesystem size: you need at least 2x the size of the file, because the file is not overwritten in place; a new copy is written and then renamed. So if your pgstat.stat file is ~200MB, you need at least ~400MB of space. Also, if the file has grown to ~200MB, there is no reason why it shouldn't grow to 300MB, so add some spare space. The tmpfs size is not dedicated (if you don't use the RAM, it can be used by other processes), so setting an alert on free space is probably a good idea.

224: Important Parameters to enable Log for all Queries

A database log is necessary for finding different statuses, errors, bad queries and any changes on the database server.
A database administrator can also log the executed queries and analyze them for performance tuning.
The following are a few relevant parameters to enable PostgreSQL logging for all queries.
Please open your postgresql.conf file and set the log-related parameter values as mentioned here.
  • log_directory = 'pg_log' (default directory name)
  • log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' (default file name structure)
  • log_statement = 'all' (value 'all' is recommended; options are none, ddl, mod, all)
  • logging_collector = on
  • log_line_prefix = '%t %c %u ' (time, session id, user)
  • log_destination = 'stderr' (options are stderr, syslog, csvlog, eventlog)
  • log_rotation_size = 15MB
  • log_rotation_age = 1d (create a new log file every day)

225: How to perform VACUUM FULL without disk space?
We require VACUUM and VACUUM FULL in PostgreSQL because of its MVCC architecture.

VACUUM FULL reclaims the space occupied by dead rows and returns it to the operating system by rewriting the table.
This method requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete.
On one of our Postgres database servers we required VACUUM FULL on a table of 13GB, and we didn't have that much free space on the disk.
I worked around this problem and came up with the following solution.
Given the nature of VACUUM FULL, we need additional storage, such as a network drive or a portable hard disk.
Please do not forget that VACUUM FULL requires an exclusive lock on the table, so during this operation the table cannot be accessed.
Now, add the new disk and use it as a tablespace.
Create new tablespace:                                          
CREATE TABLESPACE temptablespace LOCATION '/path/../';

Check your table current tablespace:
SELECT tablespace FROM pg_tables WHERE tablename = 'mybigtable';

If it is NULL, it has a default tablespace. Move table to new tablespace:
ALTER TABLE mybigtable SET TABLESPACE temptablespace;

Perform VACUUM FULL:
VACUUM FULL mybigtable;

Move the table back to its old tablespace (here pg_default):
ALTER TABLE mybigtable SET TABLESPACE pg_default;

Drop the temporary tablespace:
DROP TABLESPACE temptablespace;

226. Why is a multi-column index not used by a query?
A multi-column index can only optimize queries that reference the indexed columns starting from the leading column, while multiple single-column indexes provide performance improvements to a larger number of queries.
However, there are cases where a multi-column index clearly makes sense. An index on columns (a, b) can be used by queries containing WHERE a = x AND b = y, or queries using WHERE a = x only, but it will not be used by a query using WHERE b = y, as the sketch below shows. So if this matches the query patterns of your application, the multi-column index approach is worth considering.
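A minimal sketch with a hypothetical table t(a, b):
CREATE INDEX idx_t_a_b ON t (a, b);

EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;  -- can use idx_t_a_b
EXPLAIN SELECT * FROM t WHERE a = 1;            -- can use idx_t_a_b
EXPLAIN SELECT * FROM t WHERE b = 2;            -- cannot; the leading column a is missing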

227. How to tune Parallel Sequential Scans?

Three new GUC parameters have been added to tune the usage of this feature.
max_parallel_degree - This is used to set the maximum number of workers that can be used for an individual parallel operation (it was renamed to max_parallel_workers_per_gather in the final PostgreSQL 9.6 release). It is quite possible that the requested number of workers is not available at execution time. Parallel workers are taken from the pool of processes established by max_worker_processes, which means that the value of max_parallel_degree should be less than max_worker_processes.
It is usually not useful to set the value of this parameter higher than the number of CPU cores on your system.
parallel_tuple_cost - This is used by the planner to estimate the cost of transferring a tuple from a parallel worker process to the master backend. The default is 0.1. The more tuples that need to be passed from the worker processes to the master backend, the higher this cost, and the higher the overall cost of the parallel sequential scan plan.
parallel_setup_cost - This is used by the planner to estimate the cost of launching parallel worker processes and setting up the dynamic shared memory they use to communicate. The default is 1000.
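A quick way to try it, assuming a hypothetical big_table (on the released 9.6 and later the first parameter is called max_parallel_workers_per_gather):
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT count(*) FROM big_table;
-- Look for "Gather" and "Parallel Seq Scan" nodes in the resulting plan.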

228. How to use functions in parallel queries?
Let us discuss the usage of functions in parallel queries. A new PARALLEL clause was added to the CREATE FUNCTION statement. There are three valid values that can be used with this clause.
1. PARALLEL UNSAFE - This indicates that the function can't be executed in parallel mode, and the presence of such a function in a SQL statement forces a serial execution plan.
2. PARALLEL RESTRICTED - This indicates that the function can be executed in parallel mode, but the execution is restricted to the parallel group leader. As of now, if the qualification for any particular relation has anything that is parallel restricted, that relation won't be chosen for parallelism.
3. PARALLEL SAFE - This indicates that the function is safe to run in parallel mode without restriction.

The default for a function is PARALLEL UNSAFE.
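A minimal sketch of marking a function as safe for parallel execution (the function name and body are hypothetical):
CREATE FUNCTION add_tax(amount numeric) RETURNS numeric AS
$$ SELECT amount * 1.18 $$
LANGUAGE sql IMMUTABLE PARALLEL SAFE;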

229. Script to Stop all Connections and Force to Drop the Database

As Database Administrators, we are also responsible for managing and removing this kind of unwanted object or database. These are the steps to perform the action:
First, make sure that no one can connect to this database:
UPDATE pg_database SET datallowconn = 'false' WHERE datname = 'DatabaseName';
Second, execute this script to terminate all connections to the database (for Postgres < 9.2):
SELECT pg_terminate_backend(procpid) FROM pg_stat_activity WHERE datname = 'DatabaseName';
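For PostgreSQL 9.2 and above, the column was renamed from procpid to pid; excluding our own backend avoids terminating our own session:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'DatabaseName' AND pid <> pg_backend_pid();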
Drop now:
DROP DATABASE DatabaseName;
Note: Steps one and two require superuser permission, and only the database owner or a superuser can drop the database.

230. What is the initdb command and how do we use it?
To create a PostgreSQL database cluster you have to use initdb; there is no other possibility. When you take a look at the options you can provide to initdb, there are not too many. The only mandatory parameter is "-D" or "--pgdata", which tells initdb where you want the files created on disk.
Of course, you can override the default character set by passing any of the supported encodings to initdb with the "--encoding" parameter.
You can control how sorting and the display of numbers, money, and so on happen by specifying the various "--lc" parameters.
PostgreSQL comes with built-in full-text search, and the default text search configuration is English. You can override this as well with the "--text-search-config" parameter of initdb.
Maybe the most important message in the initdb output is: "Data page checksums are disabled." This means that PostgreSQL will not use checksums to detect silent data corruption. Enabling checksums introduces some overhead, but your data usually is important, isn't it? You can enable them by using the "--data-checksums" switch of initdb, and this cannot be changed afterwards.

For example, combining several of these options:
initdb -D test --encoding=LATIN1 --locale=de_DE --lc-messages=en_US --lc-monetary=de_DE
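Similarly, to enable data page checksums at cluster creation time:
initdb -D test --data-checksums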

231. Can the 'alter system' command be used to change parameter values, and how does it work?
Yes. Starting with PostgreSQL 9.4, you can change parameters in postgresql.conf with the "alter system" command in psql. In principle, every parameter can be changed that way, but not every parameter is applied immediately: some require a reload, others require a complete restart of the instance.
The first thing to know is that parameter changes triggered by "alter system" are not written to postgresql.conf, which is the parameter file of the instance. Instead, they are written to postgresql.auto.conf.
When the server starts, it reads postgresql.conf, and in addition the values from postgresql.auto.conf are applied; the values in postgresql.auto.conf override those in postgresql.conf. This is important to understand: if you change a parameter in postgresql.conf but the same parameter is set in postgresql.auto.conf, your change will never take effect.
Conclusion: parameters set in postgresql.conf are overridden if the same parameter is set in postgresql.auto.conf.
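A minimal sketch of the workflow (work_mem is just an example parameter):
ALTER SYSTEM SET work_mem = '64MB';   -- written to postgresql.auto.conf, not postgresql.conf
SELECT pg_reload_conf();              -- applies parameters that only need a reload
SHOW work_mem;
ALTER SYSTEM RESET work_mem;          -- removes the entry from postgresql.auto.conf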

