Wednesday 30 August 2017

Interview Questions and Answers for Kafka

1. What Is an ISR?
An ISR is an in-sync replica: a follower whose log is fully caught up with the leader's. If a leader fails, a new leader is picked from the ISR set.

2. How Does Kafka Scale Consumers?
Kafka scales consumers by partition such that each consumer gets its share of partitions. A consumer can be assigned more than one partition, but a partition can only be consumed by one consumer in a consumer group at a time. If you only have one partition, then you can only have one active consumer per group.
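
A minimal Java sketch of such a consumer (broker address, group id, and topic name are illustrative, not from any real setup); run two copies of this program and Kafka splits the topic's partitions between them:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "my-group");                 // same group.id => members share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));  // illustrative topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }
}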

3. What Are Leaders & Followers?
Leaders perform all reads and writes to a particular topic partition. Followers replicate leaders.

4. How Does Kafka Perform Failover for Consumers?
If a consumer in a consumer group dies, the partitions assigned to that consumer are divided up amongst the remaining consumers in that group.

5. How Does Kafka Perform Failover for Brokers?
If a broker dies, then Kafka divides up leadership of its topic partitions to the remaining brokers in the cluster.

6. Can producers occasionally write faster than consumers?
Yes. A producer could have a burst of records, and a consumer does not have to keep pace with the producer; since Kafka persists records, the consumer can catch up later.

7. What is the default partition strategy for producers without using a key?
Round-Robin

8. What is the default partition strategy for Producers using a key?
Records with the same key get sent to the same partition.

9. What picks which partition a record is sent to?
The Producer picks which partition a record goes to.
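
A hedged Java sketch of both cases from questions 7-9 (broker address and topic name are illustrative): a keyed record always hashes to the same partition, while an unkeyed record is spread across partitions by the default partitioner.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a key: the partition is chosen from a hash of the key,
            // so "user-42" always lands on the same partition (ordering per key preserved).
            producer.send(new ProducerRecord<>("my-topic", "user-42", "login"));

            // Without a key (null): the default partitioner spreads records
            // across partitions (round-robin in clients of this era).
            producer.send(new ProducerRecord<>("my-topic", null, "heartbeat"));
        }
    }
}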

10. Why is Kafka so fast?
Kafka is fast because it avoids copying buffers in memory (zero copy) and writes data sequentially to immutable logs instead of using random access.

11. How is Kafka being used?
Kafka is used to feed data lakes like Hadoop, and to feed real-time analytics systems like Flink, Storm and Spark Streaming.

12. How does Kafka relate to real-time analytics?
Kafka feeds data to real-time analytics systems like Storm, Spark Streaming, Flink, and Kafka Streams.

13. How does Kafka decouple streams of data?
It decouples streams of data by allowing multiple consumer groups, each of which controls its own position in the topic partitions. The producers don't know about the consumers. Since the Kafka broker delegates the log partition offset (where the consumer is in the record stream) to the clients (consumers), message consumption is flexible. This allows you to feed your high-latency daily or hourly data analysis in Spark and Hadoop while at the same time feeding microservices real-time messages, sending events to your CEP system, and feeding data to your real-time analytics systems.
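
Because the client, not the broker, owns its offset, a consumer can rewind and replay a topic at will. A hedged fragment, continuing the GroupConsumer sketch from question 2 (names illustrative):

// Continues the GroupConsumer sketch above: replay the topic from the start,
// since the position is owned by the client, not the broker.
consumer.subscribe(Collections.singletonList("my-topic"));
consumer.poll(Duration.ofMillis(0));               // join the group and receive partition assignments
consumer.seekToBeginning(consumer.assignment());   // rewind every assigned partition
// Subsequent poll() calls now re-read each partition's full history.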

14. What is a consumer group?
A consumer group is a group of related consumers that perform a task, like putting data into Hadoop or sending messages to a service. Consumer groups each have unique offsets per partition. Different consumer groups can read from different locations in a partition.

15. Does each consumer group have its own offset?
Yes. Each consumer group maintains its own offset for every partition in the topic, independent of the offsets of other consumer groups.

16. When can a consumer see a record?
A consumer can see a record once it has been committed, that is, fully replicated to all in-sync replicas.

17. What happens if there are more consumers than partitions?
The extra consumers remain idle until another consumer in the group dies (or more partitions are added).

18. What happens if you run multiple consumers in many threads in the same JVM?
Each thread manages a share of partitions for that consumer group.
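
KafkaConsumer instances are not thread-safe, so the usual pattern is one consumer instance per thread, all sharing a group.id. A rough sketch (broker, group, and topic names are illustrative):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThreadedConsumers {
    static Properties config() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "my-group");                 // shared group: the threads split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            pool.submit(() -> {
                // One KafkaConsumer per thread: the instance itself is not thread-safe.
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config())) {
                    consumer.subscribe(Collections.singletonList("my-topic"));
                    while (true)
                        consumer.poll(Duration.ofMillis(500))
                                .forEach(r -> System.out.printf("thread=%s partition=%d offset=%d%n",
                                        Thread.currentThread().getName(), r.partition(), r.offset()));
                }
            });
        }
    }
}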


Wednesday 23 August 2017

PostgreSQL DB Link

dblink is a PostgreSQL contrib module that can be found in the folder contrib/dblink. It is treated as an extension. The goal of this module is to provide simple functions to connect to and interact with remote database servers from a given PostgreSQL server, the one to which your client application or driver is connected.
Here we are using two different servers.
Host servers -- test_db01 (tables live here) & test_db02 (dblink runs here)
Some data will be inserted on test_db01, and the goal is to fetch this data from test_db02 using dblink.
1 - (By root user) Confirm the contrib rpm for postgres on both servers
-bash-4.1$ rpm -qa postgresql9*
postgresql94-contrib-9.4.4-1PGDG.rhel6.x86_64

2 - Let's first prepare test_db01 and create some data on it.
-bash-4.1$ psql testdb
psql (9.4.4)
Type "help" for help.
testdb=# create table tab (a int, b varchar(3));
CREATE TABLE
testdb=# insert into tab values (1, 'aaa'), (2,'bbb'), (3,'ccc');
INSERT 0 3

3 - (test_db02) The sources of dblink have been installed, but they are not yet active on test_db02. dblink is treated as an extension, a mechanism introduced in PostgreSQL 9.1. In order to activate a new extension module, here dblink, on a PostgreSQL server, the following commands are necessary.
postgres=# CREATE EXTENSION dblink;
CREATE EXTENSION
postgres=# \dx
                                 List of installed extensions
  Name   | Version |   Schema   |                         Description
---------+---------+------------+--------------------------------------------------------------
 dblink  | 1.1     | public     | connect to other PostgreSQL databases from within a database
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

On both servers, pg_hba.conf should have an entry allowing connections from the other server, followed by a reload/restart of the postgres cluster.
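For example, on test_db01, a line like the following allows test_db02 to connect (the IP is a placeholder; adapt the database, user, network, and authentication method to your setup):

host    testdb    postgres    192.168.1.20/32    md5

Then reload the configuration:

-bash-4.1$ psql -c "select pg_reload_conf();"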

4 - (test_db02) Now let's fetch the data from test_db01 with dblink while connected to test_db02. The function dblink can be invoked to fetch data; as its return type is "SETOF record", the function has to be called in the FROM clause.
-bash-4.1$ psql
psql (9.4.4)
Type "help" for help.
postgres=# select * from dblink('host=test_db01 port=5432 dbname=testdb', 'select * from tab') as t1 (a int, b varchar(3));
 a |  b
---+-----
 1 | aaa
 2 | bbb
 3 | ccc
(3 rows)

Note :- Do not forget to use an alias with a column definition list in the FROM clause, to avoid errors of the following type:
postgres=# select * from dblink('host=test_db01 port=5432 dbname=testdb', 'select * from tab');
ERROR:  a column definition list is required for functions returning "record"

It is also possible to do fancier things with the dblink functions. dblink_connect allows you to create a permanent, named connection to a remote server. This avoids opening a new connection to the remote server each time dblink is invoked, saving time by keeping the connection alive. To reuse such a connection, simply pass its name to the dblink functions.
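For example, with 'myconn' as an arbitrary connection name, reusing the connection string to test_db01 from above:

postgres=# select dblink_connect('myconn', 'host=test_db01 port=5432 dbname=testdb');
 dblink_connect
----------------
 OK
(1 row)

postgres=# select * from dblink('myconn', 'select * from tab') as t1 (a int, b varchar(3));
 a |  b
---+-----
 1 | aaa
 2 | bbb
 3 | ccc
(3 rows)

postgres=# select dblink_disconnect('myconn');
 dblink_disconnect
-------------------
 OK
(1 row)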
Execution of other queries, like DDL or DML, can be done with the function dblink_exec.
 
postgres=# select dblink_exec('port=5432 dbname=postgres', 'create table aa (a int, b int)');
 dblink_exec  
--------------
 CREATE TABLE
(1 row)
 
*****************************************************************