Monday 21 August 2017

Interview Q & A for Oracle Golden Gate Part - 3

51. What parameters can be used to configure Oracle GoldenGate to extract data from Archived Redo log files only?
Use the TRANLOGOPTIONS ARCHIVEDLOGONLY option in the parameter file.
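For example, a minimal Extract parameter file sketch (the group name, credentials, trail, and schema below are illustrative assumptions, not values from this article):
  EXTRACT earch
  USERID ggadmin, PASSWORD ****
  TRANLOGOPTIONS ARCHIVEDLOGONLY
  EXTTRAIL ./dirdat/ar
  TABLE HR.*;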

52. I have a one-way replication setup. The system administration team wants to apply an OS patch to both the OGG source host and the target servers. Provide the sequence of steps that you will carry out before and after applying this patch.
Procedure:
  • Check to make sure that the Extract has processed all the records in the data source (Online Redo/archive logs)
    GGSCI> send extract <extract_name>, logend
    (The above command should print YES)
  • Verify that the Extract, pump, and Replicat have zero lag.
    GGSCI> send extract <extract_name>, getlag
    GGSCI> send extract <pump_name>, getlag
    GGSCI> send replicat <replicat_name>, getlag
    (The above commands should print "At EOF, no more records to process.")
  • Stop all application and database activity.
  • Make sure that the primary extract is reading the end of the redolog and that there is no LAG at all for the processes.
  • Now proceed with stopping the processes:
Source:
  1. Stop the primary extract
  2. Stop the pump extract
  3. Stop the manager process
  4. Make sure all the processes are down.
Target:
  1. Stop replicat process
  2. Stop mgr
  3. Make sure that all the processes are down.
  4. Proceed with the maintenance
  5. After the maintenance, proceed with starting up the processes:
Source:
  1. Start the manager process
  2. Start the primary extract
  3. Start the pump extract
    (Or simply start all the Extract processes with GGSCI> start extract *)
  4. Make sure that all the processes are up.
Target:
  1. Start the manager process
  2. Start the replicat process.
  3. Make sure that all the processes are up.
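As a quick reference, the stop/start sequence above maps to GGSCI commands along these lines (group names are illustrative):
  -- Source, before the patch
  GGSCI> STOP EXTRACT ext1
  GGSCI> STOP EXTRACT pmp1
  GGSCI> STOP MANAGER
  GGSCI> INFO ALL
  -- Target, before the patch
  GGSCI> STOP REPLICAT rep1
  GGSCI> STOP MANAGER
  GGSCI> INFO ALL
  -- After the patch, on each side
  GGSCI> START MANAGER
  GGSCI> START EXTRACT *        (source)
  GGSCI> START REPLICAT rep1    (target)
  GGSCI> INFO ALL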

53. What are the basic resources required to configure Oracle GoldenGate high availability solution with Oracle Clusterware?
There are 3 basic resources required:
  • Virtual IP
  • Shared storage
  • Action script
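A minimal registration sketch, assuming a hypothetical action script /u01/ogg/gg_action.sh, a shared mount for the GoldenGate home, and an illustrative VIP address (all names and addresses below are assumptions):
  # Register an application VIP (run with root privileges)
  appvipcfg create -network=1 -ip=192.168.10.100 -vipname=gg_vip -user=oracle
  # Register GoldenGate as a Clusterware resource driven by the action script
  crsctl add resource goldengate_app -type cluster_resource -attr "ACTION_SCRIPT=/u01/ogg/gg_action.sh,CHECK_INTERVAL=30,START_DEPENDENCIES='hard(gg_vip)',STOP_DEPENDENCIES='hard(gg_vip)'"
  crsctl start resource goldengate_app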

54. Does Oracle GoldenGate replicate DDL statements or DML for Oracle Database?
Oracle GoldenGate replicates DML and DDL operations for Oracle Database and Teradata.

55. Does GoldenGate support SQL Server 2012 extraction? In the past only apply was supported.
Yes, starting with the new 12c release, GoldenGate captures from SQL Server 2012 in addition to providing delivery capabilities.

56. Which RDBMS does GoldenGate 12c support?
GoldenGate supports all major RDBMS. For a full list of supported databases and versions, refer to the Oracle GoldenGate certification matrix on Oracle's support site.

57. How can GoldenGate be better than database triggers for pushing data into custom tables?
Triggers can cause high CPU overhead, in some cases almost double compared to reading from redo or transaction logs. In addition, they are intrusive to the application and create management overhead as the application changes. Oracle GoldenGate's log-based change data capture is not only low-impact in terms of CPU utilization, but also non-intrusive to the application, with low maintenance requirements.

58. Does GoldenGate 12c support compressed objects for replication? Also, does it support BLOB/CLOB columns?
Yes, GoldenGate 12c and GoldenGate 11gR2 both support compressed objects. GoldenGate has been supporting BLOB/CLOB columns since version 10.

59. Is Oracle Database 11.2.0.4 mandatory to use GoldenGate 12c Integrated Delivery? Are earlier versions not supported?
Yes. To use GoldenGate 12c's Integrated Delivery, Oracle Database 11.2.0.4 or later is required for the target environment (a GGSCI sketch follows).
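A minimal GGSCI sketch of creating an integrated Replicat (credentials, group, and trail names are illustrative):
  GGSCI> DBLOGIN USERID ggadmin, PASSWORD ****
  GGSCI> ADD REPLICAT rfin, INTEGRATED, EXTTRAIL ./dirdat/rt
  GGSCI> START REPLICAT rfin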

60. We have had an Oracle Streams implementation for more than 5 years. We would like to migrate to GoldenGate; however, older versions of GoldenGate did not support filtering individual transactions. Is this supported in GoldenGate 12c?
Yes, it is supported in GoldenGate 12c.

61. How to Resync a Single Table with Minimum Impact to Another Tables' Replication?
 1. Stop your Replicat on the target side (most likely it has already abended); then go to step 2.
GGSCI>STOP REPLICAT <NAME>
2. Comment out the MAP statement of that table in your Replicat parameter file
3. Restart your Replicat, so all other tables are still being replicated.
4. Use the LAG command to check the LAG of the Replicat, wait till the Replicat catches up.
GGSCI> LAG REPLICAT <name>
How to tell when it has caught up? The LAG result shows "At EOF, no more records to process."
5. Stop both the target Replicat and source extract.
6. In source extract parameter file, add:
  FETCHOPTIONS FETCHPKUPDATECOLS

N.B. FETCHOPTIONS FETCHPKUPDATECOLS is not valid for NSK Guardian extracts.  Omit it.
7. Restart the Extract.
8. Reload the table from the source to the target using your initial-load method, after deleting the existing rows in the target table.
9. Wait till the initial load is finished, add the table back into your Replicat parameter file, and add the HANDLECOLLISIONS parameter.
10. Restart your Replicat
11. Wait till the Replicat catches up, then stop the source Extract and the target Replicat.
12. Comment out HANDLECOLLISIONS in the Replicat parameter file, and comment out "FETCHOPTIONS FETCHPKUPDATECOLS" in the Extract parameter file.
13. Restart the Extract and the Replicat.
This procedure applies to all platforms and all databases (see the parameter sketch below).
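For reference, a sketch of the Replicat parameter entries around steps 2 and 9, assuming a hypothetical table HR.EMPLOYEES:
  -- Step 2: the MAP statement for the table being resynced is commented out
  -- MAP HR.EMPLOYEES, TARGET HR.EMPLOYEES;
  -- Step 9: the table is added back with collision handling during catch-up
  HANDLECOLLISIONS
  MAP HR.EMPLOYEES, TARGET HR.EMPLOYEES;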

62. What Causes The 'Bad Column Index(xxxx)' Error in Replicat?
There are some potential reasons and workarounds for this error:
1) Column mismatch between source and target
The GGS ERROR 160 Bad Column Index(xxxx) error in Replicat is raised when the source column index "xxxx" is greater than the number of columns in the source table definitions file supplied to the Replicat process, or, when the ASSUMETARGETDEFS parameter is used and the source and target tables do not have the same structure, when the source table has more columns than the target table.
Example
GGS ERROR 160 Bad column index(129) specified for table {table name}, max columns = 127
Explanation
The source table trail record has an index and data for column number 129, but only 127 columns are defined in the source table definitions file, or, when the ASSUMETARGETDEFS parameter is used in the Replicat parameter file, the target table contains only 127 columns.
This is generally caused by changes to the source or target table (i.e. columns have been added or deleted) without a new source definitions file being created to reflect the source table structure matching the trail records that Replicat is trying to process.
To resolve this error, run DEFGEN on the source system for the table causing the Replicat abend and copy the resulting definitions file to the target system. Add this SOURCEDEFS file to the Replicat parameter file and restart the Replicat process (a DEFGEN sketch follows below).
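A minimal DEFGEN sketch, assuming a hypothetical table HR.EMPLOYEES and illustrative file paths:
  -- dirprm/defgen.prm on the source system
  DEFSFILE ./dirdef/hr_employees.def
  USERID ggadmin, PASSWORD ****
  TABLE HR.EMPLOYEES;
  -- run DEFGEN from the GoldenGate home, then copy the output file to the target
  $ ./defgen paramfile dirprm/defgen.prm
  -- in the Replicat parameter file on the target
  SOURCEDEFS ./dirdef/hr_employees.def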
2) Encryption and DECRYPTTRAIL
  • Check whether the source trail is encrypted; if it is, add the DECRYPTTRAIL parameter to the Replicat to eliminate the error.
  • If the Extract does not use encryption but the Replicat has DECRYPTTRAIL enabled, the Replicat will read an unencrypted number and translate it to a column index that does not conform to the proper index number. In that case, remove the DECRYPTTRAIL parameter from the Replicat.
  • Note: The ENCRYPTTRAIL/DECRYPTTRAIL parameters must be set before the RMTTRAIL parameter for correct encryption/decryption behavior.

63. Delete extract/replicat forcefully

To delete an Extract/Replicat when a simple DELETE command is not working, you can do it forcibly by adding an exclamation mark (!) at the end.
1. How to forcefully stop (kill) an Extract/Replicat?
kill extract group_name
kill replicat group_name
2. How to delete an extract / replicat
delete extract group_name [!]
delete replicat group_name [!]
3. How to cleanup (stop first)
cleanup extract group_name
cleanup replicat group_name
Error:
GGSCI (XXXn1) 27> delete replicat r_XXX
ERROR: Could not delete DB checkpoint for REPLICAT R_XXX (SQL operation failed: Deleting from checkpoint table GGATE.CHECKPOINT, group R_XXX, key 2648753342 (0x9de0c4be), SQL SQL Error 972: ORA-00972: identifier is too long).
GGSCI (xxn1) 35> delete replicat r_XXX !
ERROR: REPLICAT R_XXX does not exist.

64. How to Handle Tables without Primary Keys or Unique Indexes with Oracle GoldenGate
In order to maintain data integrity when replicating transactional data, Oracle GoldenGate will use primary key columns or unique index columns to uniquely identify a row when issuing update or delete statements against the target database. If no primary keys or unique indexes exist on the table being replicated, Oracle GoldenGate will use all columns to uniquely identify a row.
It is perfectly acceptable to use all columns to uniquely identify a row under the following conditions:
  • A logical key column cannot be defined for the table using the KEYCOLS parameter.
  • No duplicate rows exist in the table
  • Table contains a small number of rows, so full table lookups on the target database are minimal
  • Table DML activity is very low, so "all column" table supplemental log groups do not negatively impact the source database redo logs
If the table being replicated does not meet all of the conditions listed above, it is recommended to add a column with a SYS_GUID default value to the table to uniquely identify the row (see the sketch below).
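A hedged sketch using a hypothetical table HR.EVENT_LOG: add a surrogate key column populated by SYS_GUID() and use it as the logical key:
  -- on the database (source and target)
  ALTER TABLE hr.event_log ADD (row_id RAW(16) DEFAULT SYS_GUID() NOT NULL);
  -- in the Extract and Replicat parameter files
  TABLE HR.EVENT_LOG, KEYCOLS (ROW_ID);
  MAP HR.EVENT_LOG, TARGET HR.EVENT_LOG, KEYCOLS (ROW_ID);
  -- make sure supplemental logging covers the key column
  GGSCI> ADD TRANDATA HR.EVENT_LOG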

65. Why Does GoldenGate Report the Lag Is Unknown or a Timestamp Mismatch?
The first thing to check is that the time zone (TZ) environment variable is set correctly. Here are the steps to set the correct TZ environment variable.
Example
        $ > env | grep TZ
        TZ=PST9PDT
        $ > date
        Mon Mar 12 12:40:57 PDT 2007
        $ > export TZ=PST8PDT
        $ > date
        Mon Mar 12 13:41:08 PDT 2007

        You must re-start the Manager and all other GoldenGate groups.
If the TZ environment variable is not the issue, there is another condition under which the "timestamp mismatch" message may be displayed: when the Extract or Replicat is completely caught up and a LAG command is issued. The LAG command tries to compute the difference between the current timestamp on the target and the actual timestamp of the processed record (which comes from the source). When the difference is a negative value, the "timestamp mismatch" message is displayed. If these times are very close, a negative lag (target timestamp - record timestamp) can be calculated.
The following can be done from SQL*Plus to compare the database and OS times on the source and target:
sqlplus ggadmin@FILCRDP1
ALTER SESSION set NLS_DATE_FORMAT='MM/DD/YY HH24:Mi:SS';
select sysdate from dual;
host  date
sqlplus ggadmin@FILCRDP2
ALTER SESSION set NLS_DATE_FORMAT='MM/DD/YY HH24:Mi:SS';
select sysdate from dual;
If the TZ difference is in the other direction, the related Extract and Replicat may display an unexpected lag. If the process is at EOF, the lag may appear intermittently; when it does, it shows several hours of lag (the TZ difference).

66. GoldenGate Extract Abends with "Unable to lock file" Error For Trail File

Cause

The trail files cannot be exclusively locked for writes by the Server/Collector process running on the target. As of v10.4, the Server/Collector locks the trail file to prevent multiple processes from writing to the same trail file, so new Server/Collector processes are unable to lock the trail files.
Network outages that last longer than the time the TCP/IP stack is configured to retransmit unacknowledged packets may result in "orphan" TCP/IP connections on the RMTHOST system. Since the local system has closed the connections and the "RST" packets were lost due to the network outage, no packets (data or "control") will ever be sent for these connections.
Since the RST packets were not delivered to the RMTHOST, the TCP/IP stack does not present an error to the Server/Collector process. The Server/Collector process will continue to wait, passively, forever, for new data that will never arrive, because the Extract process on the other system is no longer running.
A second cause for this symptom is that the remote server was rebooted and the Network-Attached Storage (NAS) device where the target trails reside did not detect and was not notified of the reboot, so the locks acquired prior to the reboot are still considered to be in force.

Solution

1) Investigate why a Server/Collector process is still running when a new Server/Collector process is started to access the same trail. You can kill the orphan Server/Collector process to resolve the immediate issue.
2) You can disable the trail-locking check by using the RMTHOST UNLOCKEDTRAILS option. Use this option with CAUTION as it can cause trail corruption. You should still investigate why the trails are locked by another Server/Collector, or kill those Server/Collector processes.

NOTE that if an Extract pump is stopped normally, the Server/Collector process stops immediately. Current versions (11.1/11.2 onwards) have a default timeout of 5 minutes. One can override this value using the RMTHOST TIMEOUT option. Example: setting the timeout to 40 seconds:
RMTHOST 192.168.10.1, MGRPORT 7809, PARAMS TIMEOUT 40
This tells the Server/Collector to terminate if it doesn't receive any checkpoint information for more than 40 seconds. DO NOT set too low a value, since TCP/IP communication performance varies throughout the day.
Other notes:
NAS related issue:
In the case where the NAS was unaware that the system had been rebooted, the best long-term solution is to contact the NAS vendor, who might be able to provide a utility program that can be run early in the system startup process to notify the NAS that it should release all locks owned by this system. The following procedure might offer a short-term workaround:
  1. Stop all REPLICAT processes that read the trail file.
  2. Stop the target MGR process.
  3. Copy trail file xx000000 to xx000000.bk.
  4. Delete trail file xx000000.
  5. Move xx000000.bk back to xx000000.
  6. Repeat steps 3-5 for each trail file that can't be locked.
  7. From the shell, kill the server (collector) process that was writing to the trail, i.e. check at the OS level for orphan processes, e.g. on UNIX-style OSs: ps -ef | grep server
  8. If any such orphan servers exist, e.g.:
     oracle   27165     1  0 11:20 ?        00:00:00  ./server -p 7840 -k -l /opt/oracle/gg/ggserr.log
     then kill 27165 (or kill -9 27165) in this particular case.
  9. Start MGR.
  10. Start the REPLICAT processes.
  11. Restart the Extract that abended and gave this error message.
Note that this may not work, depending on the NAS and the way it keeps track of advisory file locks acquired using fcntl( F_GETLK ).
Cluster failover:
When a system fails over to another node, the GoldenGate processes should be stopped, typically by using the GGSCI > stop * and > stop mgr commands; however, processes such as Server/Collectors may remain running. Stop the Extract pumps manually or kill the processes. You should check that no processes are running from the GoldenGate directory before switching GoldenGate to run on another node (see the sketch below).
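A simple OS-level check before switching nodes, assuming an illustrative GoldenGate home of /u01/app/ogg:
  $ ps -ef | grep -E 'mgr|extract|replicat|server' | grep -v grep
  $ fuser -u /u01/app/ogg/dirdat/*     # lists any process still holding the trail files open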

67. Does GoldenGate Support Tables With An UNUSED Column?
Yes. Starting in version 9.5, GoldenGate supports tables with unused columns, but the support is disabled by default and Extract abends on them. You can use the DBOPTIONS parameter with the ALLOWUNUSEDCOLUMN option to force Extract to generate a warning and continue processing (see the parameter sketch below). When using ALLOWUNUSEDCOLUMN, either the same unused column must exist in the target table, or a source definitions file must be created for Replicat with the DEFGEN utility. You can include the appropriate ALTER TABLE...SET UNUSED statements in a DDL replication configuration.
Prior to version 9.5, GoldenGate does not support tables with UNUSED columns.  The Extract will abend with an error:
        table Schema.TABLENAME;
        2007-02-21 13:58:08 GGS ERROR 118 unable to validate table
        Schema.TABLENAME 57148 due to unused column.
The only solution if running a pre 9.5 version of GoldenGate is to drop the unused columns.

68. Extract abends with: missing trailing blocks, unable to queue I/O, I/O beyond file size

Symptoms

An Extract that reads archived logs on ASM hits the following error.
The problem occurs with a single-Extract configuration. Extract abends with:
SQL <BEGIN dbms_diskgroup.read(:handle, :offset, :length, :buffer); END;>: (27091) ORA-27091: unable to queue I/O
212146455571218590 Redo Thread 2: ORA-17510: Attempt to do i/o beyond file size
212146455571218590 Redo Thread 2: ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 124
212146455571218590 Redo Thread 2: ORA-06512: at line 1 )-> 500
Cause
There can be 2 causes for this error:
1. The ASM diskgroup is frequently mounted and dismounted, which makes the ASM diskgroup temporarily unavailable.
2. When Extract is processing an archived log, it waits for the log to be completely written out before it starts processing it. However, if Extract cannot validate this via a check on the block header of the very last block, and the file size is not increasing, then Extract abends after a wait of about 10 seconds.

Solution

Based on the cause, the following solution/workaround may be used:
1. Check the database alert logs. If there are many mount and dismount messages for the ASM diskgroup, this is likely a known issue, particularly in an environment where multiple databases are supported: when an archive log is written, the diskgroup is dismounted, and if another environment tries to access the diskgroup while it is being dismounted, errors occur. A small dummy tablespace may be created in the related diskgroup. This prevents the diskgroup from constantly being mounted and dismounted and eliminates this as a possible cause.
e.g. CREATE TABLESPACE dummy DATAFILE '+FLASH01' SIZE 10M;
In this way there is always a file open in the diskgroup and therefore it will not be repeatedly mounted and dismounted.

2. If the above is not the cause of the problem, the following parameter may be used:
    "TRANLOGOPTIONS COMPLETEARCHIVEDLOGTIMEOUT 600 "
When Extract is processing an archived log, it waits for the log to be completely written out before it starts processing it. However, if Extract cannot validate this via a check on the block header of the very last block, and the file size is not increasing, then Extract abends after a wait of about 10 seconds. The COMPLETEARCHIVEDLOGTIMEOUT parameter is followed by an integer indicating the number of seconds to wait.

69. When Replicat is Running but Not Applying Records
Occasionally, Replicat will show a status of running but a report will show that no records have been applied. There are four common causes for this.

Troubleshooting Steps

These causes are:
1) The wrong trail is being read
2) The wrong trail file is being read
3) The table name format is incorrectly specified
4)  Wrong table name specified 
Each of these cases will show a status of running. A review of the report will show that no records have been replicated.
Case 1: The wrong trail is being read
Do an INFO DETAIL of the Extract or look at the RMTTRAIL statement in the Extract parameters.
Do an INFO DETAIL of the Replicat or review the Replicat report.
The trail file being read by the Replicat must have the same name, machine, and location as the trail the Extract is writing.
Common errors are specifying the wrong target machine or the wrong subdirectory for the trail.
Erroneous Extract parameter files and ADD EXTTRAIL or ADD REPLICAT statements are often the cause.
This is a setup error and needs to be reconfigured.
Case 2: The wrong trail file is being read
Do an info detail of the replicat or review the replicat report
Examine the trail files (ls -l) on the target.
Does the trail file being read exist? 
Is the Replicat at the end of a trail file, not advancing to the next available file?
The solution is to issue GGSCI> ALTER REPLICAT <repname>, EXTSEQNO <NNNN>, EXTRBA 0 to cause the Replicat to advance to the next file, and then restart the Replicat.
This is commonly caused by doing an ETROLLOVER on the source, which generally requires altering all downstream processes. This is an administrative error.
Case 3: The table name format is incorrectly specified
The schema.table name in the Replicat parameter file must match the schema.table name in the trail.
Do a logdump on a trail file and run a COUNT command with DETAIL turned on,
or use logdump to look directly at a row to be applied with GHDR turned on.
The schema.table in the Replicat must match that in the trail file. On NSK, this includes matching any system name occurrence or absence.
Additionally, changes may be made on the source side that change the table specification.
This is commonly a setup error that is easily corrected. Adjust the Replicat specification, repoint the Replicat to the beginning of the data, and rerun.
Case 4: Wrong table name specified
This could be a case where the source table name specified is not the one that was intended.
Check the parameter file and correct the table name. Once that is done, alter the Replicat back to the earlier sequence number or to a particular RBA and restart it so that the transactions are applied.
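For Case 3, a hedged logdump sketch for checking the schema.table name as it is recorded in the trail (the trail file name is illustrative); compare what the record header shows with the MAP statements in the Replicat parameter file:
  $ ./logdump
  Logdump 1> OPEN ./dirdat/rt000000
  Logdump 2> GHDR ON
  Logdump 3> DETAIL ON
  Logdump 4> COUNT
  Logdump 5> NEXT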

70. How to recover from Extract ERROR 180 encountered commit SCN that is not greater than the highest SCN already processed
How to recover from Extract ERROR 180 encountered commit SCN<n.xxx> that is not greater than the highest SCN already processed (GG Version 10)

Solution

This applies only to Oracle GoldenGate Version 10 and above, using a Version 10 formatted trail and NOT using the RMTTRAIL or EXTTRAIL FORMAT RELEASE parameter.
Description:

This error occurs in an Oracle RAC environment after a transaction is written to the idle node but does not yet appear in the redo log when the current transaction is processed. This transaction will have a higher SCN than the previous transaction.
Solution Overview:
Verify that the Extract pump and Replicat have completely processed the records in the trail, then alter the Extract pump and Replicat to process the new trail created by the ETROLLOVER from the main Extract.
Solution Details:
1. Do an ETROLLOVER on Extract, and take note of the new sequence number of the trail file.
ALTER EXTRACT [name], ETROLLOVER
2. Start extract
START EXTRACT [name]
3. Send PUMP, LOGEND, to see if it's at the end of the previous trail.
SEND EXTRACT [pump_name], LOGEND

4. Once it is at the end of the trail file, you must stop the pump and do an ETROLLOVER for it too. Take note of the new trail file sequence number that is created by this step.
STOP EXTRACT [pump_name]
ALTER EXTRACT [pump_name], ETROLLOVER
5. Alter the pump's EXTSEQNO to the new trail file created in step #1.
ALTER EXTRACT [pump_name], EXTSEQNO ##### EXTRBA 0
6. Restart pump
START EXTRACT [pump_name]
7. Send Replicat, LOGEND to make sure it has processed all the remaining data, and stop Replicat.
SEND REPLICAT [name], LOGEND
STOP REPLICAT [name]
8. If the Replicat is not at the end of the trail, generate a report and force-stop the Replicat:
SEND REPLICAT [name], REPORT
STOP REPLICAT [name]!
9. Add the following parameters to the Replicat parameter file to allow the Replicat to process each trail record as a single transaction, or set them to 1 if they are already present:
GROUPTRANSOPS 1
MAXTRANSOPS 1
10. Restart replicat
START REPLICAT [name]
11. Once replicat has completely processed the trail, stop the replicat
STOP REPLICAT [name]
12. Edit the replicat parameter file:
     - Add the HANDLECOLLISIONS parameter to the Replicat parameter file
     - Remove or comment out GROUPTRANSOPS and MAXTRANSOPS, or revert them to their original values.
Note: There are pre-conditions for using HANDLECOLLISIONS. There must be either
a. no PK updates, or
b. the Extract has "FETCHOPTIONS FETCHPKUPDATECOLS".
Also, all the tables should have a primary key or unique index to avoid data integrity issues when using HANDLECOLLISIONS.

13. Alter the Replicat's EXTSEQNO to the new trail file created in step #4.
ALTER REPLICAT [name], EXTSEQNO ###### EXTRBA 0
14. Start Replicat
START REPLICAT [name]
15. Once the Replicat has processed the out-of-order SCN operations, disable HANDLECOLLISIONS. You can also look for the CSN and wait for the Replicat to checkpoint past it.
SEND REPLICAT [name], NOHANDLECOLLISIONS
16. Edit the Replicat parameter file and comment out the HANDLECOLLISIONS parameter. You do not need to stop/restart the Replicat. This ensures that the parameter is disabled on any subsequent Replicat restarts.
Note:
If the out-of-order transactions happen to the SAME record in the SAME table, then the Replicat will probably abend in step 7 or step 10. If that is the case, look at the discard file to find that particular record (the primary key value). Then add HANDLECOLLISIONS and continue with the remaining steps. Later, once the Replicat has caught up, that particular record needs to be manually synced.
However, the chance of out-of-order transactions happening to the SAME record in the SAME table is rare. As an example, this would mean that the user application inserts record A from node 1, then immediately updates record A on node 2. Out-of-order transactions usually happen to different tables or to different records in the same table.
In other words, if you do see collisions being handled in this recovery procedure, it means you have to re-sync ONLY those particular records that have collisions.
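To be able to identify the colliding record, make sure the Replicat writes a discard file; a sketch of the parameter (file path and size are illustrative):
  DISCARDFILE ./dirrpt/rfin.dsc, APPEND, MEGABYTES 100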
