Performing a healthcheck of a live GoldenGate configuration
Like any other core technology, once implemented, GoldenGate becomes a key part of the enterprise infrastructure: it is responsible for keeping the data in two or more systems synchronized. For a GoldenGate administrator it is therefore important to know the status of the replication configuration, since any replication issue can affect the state of the data in the destination system. If the administrator finds a problem with the replication, they can then take corrective measures to fix it and bring the systems back into synchronization. In this recipe we will look at the various steps that you can follow to ensure that a GoldenGate configuration is running fine. This recipe does not cover the steps required to check the performance of a GoldenGate configuration.
The state of a GoldenGate replication can be checked by performing the following steps on both source and target systems:
- Check the GoldenGate event logger file `ggserr.log` for any error or warning messages. You can find these messages using the `egrep` command as shown in the following code:

```shell
cd /u01/app/ggate
egrep -i 'WARNING|ERROR' ggserr.log | more
```
- Using the GoldenGate command-line interface, check the status of the processes as follows:

```
GGSCI (prim1-ol6-112.localdomain) 1> status *
EXTRACT EGGTEST1: RUNNING
EXTRACT PGGTEST1: ABENDED
```
The status of all processes should be `RUNNING`.
- If the status of any of the processes is not `RUNNING`, check the Process Report files under `$GG_HOME/dirrpt/`. For example, for the `PGGTEST1` process that has an `ABENDED` status, you should look for the error in the `/u01/app/ggate/dirrpt/PGGTEST1.rpt` report file.
- If the process has abended due to any data issue, you will see an ORA- error in the Process Report file. GoldenGate writes the failed record information to the discard file.
- If all the processes are running, check the lag between the systems as shown in the following code:

```
GGSCI (prim1-ol6-112.localdomain) 1> lag *
Sending GETLAG request to EXTRACT EGGTEST1 ...
Last record lag: 5 seconds.
At EOF, no more records to process.
Sending GETLAG request to EXTRACT PGGTEST1 ...
Last record lag: 6 seconds.
At EOF, no more records to process.
```
- Sometimes there can be an issue with the underlying database that causes the GoldenGate processes to abend. The state of the database can be checked in the database alert log.
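The status check in the steps above can be partly scripted. The following is a minimal sketch, not part of the recipe itself: it filters `status *` output for any process that is not in the `RUNNING` state. The output is embedded here as sample text; against a live instance you would capture it from `ggsci` as shown in the comment.

```shell
#!/bin/sh
# Sketch: flag GoldenGate processes that are not RUNNING.
# In a real environment you would capture the output like this
# (assuming $GG_HOME points at the GoldenGate install directory):
#   status_output=$(echo "status *" | $GG_HOME/ggsci)
# Sample output is embedded below so the filter itself can be shown.
status_output='EXTRACT EGGTEST1: RUNNING
EXTRACT PGGTEST1: ABENDED
REPLICAT RGGTEST1: RUNNING'

# Keep only the names of processes whose status is not RUNNING.
not_running=$(printf '%s\n' "$status_output" | awk -F': ' '$2 != "RUNNING" {print $1}')

if [ -n "$not_running" ]; then
    echo "Processes needing attention: $not_running"
else
    echo "All processes are RUNNING"
fi
```

The same filter works on `info all` output with a small change to the field handling, since that command reports the status in a different column layout.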
As we saw in the previous section, there are mainly six steps to conduct a healthcheck of a GoldenGate configuration.
Every GoldenGate instance has an event logger file called `ggserr.log`. In the first step we see how we can query the error/warning messages from this file. GoldenGate writes all informational, warning, and error messages to this file in sequence, so it is a good starting point for checking the overall status of the replication.
The following is an example of the messages from `ggserr.log`:

```
2012-09-24 18:47:30  ERROR   OGG-00664  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  OCI Error during OCIServerAttach (status = 12162-ORA-12162: TNS:net service name is incorrectly specified).
2012-09-24 18:47:30  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  PROCESS ABENDING.
2012-09-24 18:47:30  INFO    OGG-00975  Oracle GoldenGate Manager for Oracle, mgr.prm:  client_start_er_prog failed.
2012-09-24 18:47:30  WARNING OGG-01742  Oracle GoldenGate Command Interpreter for Oracle:  Command sent to MGR MGR returned with an ERROR response.
2012-09-24 18:49:51  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (ggate): start egg*.
2012-09-24 18:49:51  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host localhost (START EXTRACT EGGTEST1 ).
2012-09-24 18:49:51  INFO    OGG-00992  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  EXTRACT EGGTEST1 starting.
2012-09-24 18:49:51  INFO    OGG-03035  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  Operating system character set identified as UTF-8. Locale: en_US, LC_ALL
```
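Before reading through a large `ggserr.log`, it can help to get a quick count of messages per severity. The following shell sketch does that with `awk`, assuming the severity is the third field of each message, as in the example above; the file contents are embedded as a few sample lines for illustration, while a real check would read `$GG_HOME/ggserr.log`.

```shell
#!/bin/sh
# Sketch: summarize ggserr.log by severity (INFO/WARNING/ERROR).
# Against a live instance you would run this on $GG_HOME/ggserr.log;
# a few sample lines stand in for the real file here.
log='2012-09-24 18:47:30  ERROR   OGG-00664  OCI Error during OCIServerAttach
2012-09-24 18:47:30  ERROR   OGG-01668  PROCESS ABENDING.
2012-09-24 18:47:30  INFO    OGG-00975  client_start_er_prog failed.
2012-09-24 18:47:30  WARNING OGG-01742  MGR returned with an ERROR response.'

# Field 3 of each message is the severity; count occurrences of each.
summary=$(printf '%s\n' "$log" | awk '{count[$3]++} END {for (s in count) print s, count[s]}' | sort)
echo "$summary"
```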
In the next step, we run the `ggsci` command to check the status of all the processes in this GoldenGate instance. A status other than `RUNNING` implies that there is an issue. If the status of any of the processes is not `RUNNING`, you can look into the report files for each process. All processes have a dedicated report file under the `dirrpt` folder. GoldenGate writes all the statistical information of a process to its report file. Also, whenever the status of a process changes, the report file is updated with the appropriate information, so this is a good place to find detailed information about the reason for a process failure. Every time you restart a process, the report files are rolled over and a new one is created. The following is an example of an error from the `PGGTEST1.rpt` file:
```
2013-02-20 20:00:43  ERROR   OGG-01232  Receive TCP params error: TCP/IP error 104 (Connection reset by peer).

*****************************************************************
*                 ** Run Time Statistics **                     *
*****************************************************************

Report at 2013-02-20 20:00:43 (activity since 2013-02-20 19:55:33)
Output to /u01/app/ggate/dirdat/rt:
From Table SCOTT.EMP:
       #   inserts:         0
       #   updates:       168
       #   befores:       168
       #   deletes:         0
       #  discards:         0
Last log location read:
     FILE:      /u01/app/ggate/dirdat/st000032
     SEQNO:     32
     RBA:       72281
     TIMESTAMP: Not Available
     EOF:       YES
     READERR:   400
```
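If you want to pick individual counters out of the Run Time Statistics section programmatically, a small `awk` filter is enough. The sketch below extracts the `updates` count from sample statistics lines like those shown above; a real check would read the actual report file under `$GG_HOME/dirrpt`.

```shell
#!/bin/sh
# Sketch: pull an operation count out of a process report file.
# A real check would read $GG_HOME/dirrpt/<PROCESS>.rpt; sample
# lines from the statistics section are embedded here instead.
report='From Table SCOTT.EMP:
       #   inserts:         0
       #   updates:       168
       #   befores:       168
       #   deletes:         0
       #  discards:         0'

# Statistics lines have the form "# <operation>: <count>".
updates=$(printf '%s\n' "$report" | awk '$2 == "updates:" {print $3}')
echo "updates processed: $updates"
```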
When a process abends due to an Oracle error or a data validation issue, the failing records are discarded and their details are written to the discard file. If data loss is not acceptable to the business, the information in the discard file is very critical: this is where you can see the details of the errors that occurred while processing the failed records. The following is an example of the discard file from the `RGGTEST1` Replicat. The error shown in this case is an instance where the Replicat could not find the record it was trying to update in the target database:

```
Oracle GoldenGate Delivery for Oracle process started, group RGGTEST1 discard file opened: 2013-02-21 18:58:35
Current time: 2013-02-21 18:58:41
Discarded record from action ABEND on error 1403
OCI Error ORA-01403: no data found, SQL <UPDATE "SCOTT"."EMP" SET "ENAME" = :a1,"JOB" = :a2,"MGR" = :a3,"HIREDATE" = :a4,"SAL" = :a5,"COMM" = :a6,"DEPTNO" = :a7 WHERE "EMPNO" = :b0>
Aborting transaction on ./dirdat/rt beginning at seqno 30 rba 14033
         error at seqno 30 rba 19750
Problem replicating SCOTT.EMP to SCOTT.EMP
Record not found
Mapping problem with compressed update record (target format)...
EMPNO = 7902
ENAME = FORD
JOB = ANALYST
MGR = 7566
HIREDATE = 1981-12-03 00:00:00
SAL = 3100.00
COMM = NULL
DEPTNO = 20
```
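A quick way to triage a discard file is to pull out the Oracle error code and the key of the failed record. The following sketch does this against embedded sample lines modeled on the output above; the actual discard file name and location are whatever the Replicat's `DISCARDFILE` parameter specifies, and the key column (`EMPNO` here) is specific to this example table.

```shell
#!/bin/sh
# Sketch: summarize a Replicat discard file - the Oracle error
# and the key of the record that failed. The sample below mirrors
# the RGGTEST1 example; a real check would read the file named by
# the Replicat's DISCARDFILE parameter.
discard='Discarded record from action ABEND on error 1403
OCI Error ORA-01403: no data found
Problem replicating SCOTT.EMP to SCOTT.EMP
Record not found
EMPNO = 7902'

# First ORA- code in the file is usually the root error.
ora_error=$(printf '%s\n' "$discard" | grep -o 'ORA-[0-9]*' | head -1)
# Key column of this example table; adjust for your own schema.
failed_key=$(printf '%s\n' "$discard" | awk '$1 == "EMPNO" {print $3}')
echo "error=$ora_error key=EMPNO:$failed_key"
```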
A `RUNNING` status of the GoldenGate processes does not necessarily mean that the replication is working efficiently. Sometimes the process status is `RUNNING` but, due to some issue, the speed of data replication drops. To check whether the records generated in the database are being processed efficiently by the replication, we use the `LAG` command. The `LAG` command reports the difference, in seconds, between the time the records were generated and the time they were processed by the GoldenGate process.
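The `LAG` output can also be checked automatically. The sketch below parses sample `lag *` output and warns when any reported lag exceeds a threshold (30 seconds here, an arbitrary illustrative value); in a live setup the output would come from `ggsci` as shown in the comment.

```shell
#!/bin/sh
# Sketch: warn when any GoldenGate process lags beyond a threshold.
# Live output would come from: echo "lag *" | $GG_HOME/ggsci
lag_output='Sending GETLAG request to EXTRACT EGGTEST1 ...
Last record lag: 5 seconds.
Sending GETLAG request to EXTRACT PGGTEST1 ...
Last record lag: 45 seconds.'

threshold=30
# "Last record lag: N seconds." -> N is the 4th field; track the maximum.
worst=$(printf '%s\n' "$lag_output" | awk '/Last record lag/ {if ($4 > max) max = $4} END {print max+0}')

if [ "$worst" -gt "$threshold" ]; then
    echo "WARNING: lag of ${worst}s exceeds ${threshold}s"
else
    echo "lag within threshold"
fi
```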