Wednesday, April 20, 2016

Performing a healthcheck of a live GoldenGate configuration

Like any technology, once implemented, GoldenGate forms a key part of the enterprise infrastructure. It is responsible for keeping the state of data synchronized between two or more systems. For a GoldenGate administrator, it is quite important to know the status of the replication configuration, since any replication failure directly affects the state of the data in the destination system. If the administrator finds any issue with the replication, they can then take corrective measures to fix it and bring the systems back into synchronization. In this recipe, we will look into the various steps that you can follow to ensure that a GoldenGate configuration is running fine. This recipe does not cover the steps required to check the performance of a GoldenGate configuration.

How to do it...

The state of a GoldenGate replication can be checked by performing the following steps on both source and target systems:
  1. Check the GoldenGate event logger file ggserr.log for any error or warning messages. You can find these messages using the egrep command as shown in the following code (a consolidated sketch of the first three checks follows this list):
    cd /u01/app/ggate
    egrep -i 'WARNING|ERROR' ggserr.log | more
    
  2. Using the GoldenGate command line interface, check the status of the processes as follows:
    GGSCI (prim1-ol6-112.localdomain) 1> status *
    EXTRACT EGGTEST1: RUNNING
    EXTRACT PGGTEST1: ABENDED
    
    The status of all processes should be RUNNING.
  3. If the status of any of the processes is not RUNNING, check the Process Report files under $GG_HOME/dirrpt/. For example, for the PGGTEST1 process that has an ABENDED status, you should look for the error in the /u01/app/ggate/dirrpt/PGGTEST1.rpt report file.
  4. If the process has abended due to a data issue, you will see an ORA- error in the Process Report file. GoldenGate writes the failed record information to the discard file.
  5. If all the processes are running, check the lag between the systems as shown in the following code:
    GGSCI (prim1-ol6-112.localdomain) 1> lag *
    Sending GETLAG request to EXTRACT EGGTEST1 ...
    Last record lag: 5 seconds.
    At EOF, no more records to process.
    Sending GETLAG request to EXTRACT PGGTEST1 ...
    Last record lag: 6 seconds.
    At EOF, no more records to process.
    
  6. Sometimes there can be an issue with the underlying database that causes the GoldenGate processes to abend. The state of the database can be checked from the database alert log.
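The first three checks lend themselves to a quick script. The following is a minimal bash sketch, not a definitive implementation: it scans ggserr.log, queries the process status through ggsci, and lists the most recently updated report files. The GoldenGate home /u01/app/ggate is taken from the examples above; everything else about your environment is an assumption:
    #!/bin/bash
    # Minimal healthcheck sketch; assumes GoldenGate is installed
    # in /u01/app/ggate as in the examples above.
    GG_HOME=/u01/app/ggate
    cd "$GG_HOME" || exit 1

    # Step 1: recent warnings/errors from the event logger file
    echo "== Last 10 WARNING/ERROR messages in ggserr.log =="
    egrep -i 'WARNING|ERROR' ggserr.log | tail -10

    # Step 2: status of all Manager/Extract/Replicat processes
    echo "== Process status =="
    echo "info all" | ./ggsci

    # Step 3: the most recently written report files; the first
    # place to look when a process is not RUNNING
    echo "== Recently updated report files =="
    ls -lt "$GG_HOME"/dirrpt/*.rpt | head -5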

How it works...

As we saw in the preceding section, there are six main steps to conduct a healthcheck of a GoldenGate configuration.
Every GoldenGate instance has an event logger file called ggserr.log. In the first step, we query the error and warning messages from this file. GoldenGate writes all informational, warning, and error messages to this file in sequence, so it is a good starting point for checking the overall status of the replication. GoldenGate records the following events in this file:
  • All the GGSCI commands that have been issued in this instance
  • GoldenGate process startups and shutdowns
  • All errors
  • Any warnings
  • Any informational messages
The following is an example of the messages from ggserr.log:
2012-09-24 18:47:30  ERROR   OGG-00664  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  OCI Error during OCIServerAttach (status = 12162-ORA-12162: TNS:net service name is incorrectly specified).
2012-09-24 18:47:30  ERROR   OGG-01668  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  PROCESS ABENDING.
2012-09-24 18:47:30  INFO    OGG-00975  Oracle GoldenGate Manager for Oracle, mgr.prm:  client_start_er_prog failed.
2012-09-24 18:47:30  WARNING OGG-01742  Oracle GoldenGate Command Interpreter for Oracle:  Command sent to MGR MGR returned with an ERROR response.
2012-09-24 18:49:51  INFO    OGG-00987  Oracle GoldenGate Command Interpreter for Oracle:  GGSCI command (ggate): start egg*.
2012-09-24 18:49:51  INFO    OGG-00963  Oracle GoldenGate Manager for Oracle, mgr.prm:  Command received from GGSCI on host localhost (START EXTRACT EGGTEST1 ).
2012-09-24 18:49:51  INFO    OGG-00992  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  EXTRACT EGGTEST1 starting.
2012-09-24 18:49:51  INFO    OGG-03035  Oracle GoldenGate Capture for Oracle, eggtest1.prm:  Operating system character set identified as UTF-8. Locale: en_US, LC_ALL
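For live monitoring during troubleshooting, you can simply follow the file as new messages arrive; the path assumes the same GoldenGate home as above:
    # Watch new event log messages as they are written
    tail -f /u01/app/ggate/ggserr.log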
In the next step, we run the ggsci command to check the status of all processes in this GoldenGate instance. A status other than RUNNING implies that there is an issue. If the status of any of the processes is not RUNNING, you can look into the report files for each process. All processes have a dedicated report file under the dirrpt folder. GoldenGate writes all statistical information about a process to its report file. Also, whenever the status of a process changes, the report file is updated with the appropriate information, so this is a good place to find detailed information about the reason for a process failure. Every time you restart a process, the report files are rolled over and a new one is created. The following is an example of an error from the PGGTEST1.rpt file:
2013-02-20 20:00:43  ERROR   OGG-01232  Receive TCP params error: TCP/IP error 104 (Connection reset by peer).
******************************************************************
*                   ** Run Time Statistics **                        
******************************************************************
Report at 2013-02-20 20:00:43 (activity since 2013-02-20 19:55:33)
Output to /u01/app/ggate/dirdat/rt:
From Table SCOTT.EMP:
    #             inserts:         0
    #             updates:       168
    #             befores:       168
    #             deletes:         0
    #             discards:        0
Last log location read:
    FILE:      /u01/app/ggate/dirdat/st000032
    SEQNO:     32
    RBA:       72281
    TIMESTAMP: Not Available
    EOF:       YES
    READERR:   400
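Since report files are rolled over on every restart, the history of previous failures is preserved in the aged copies. A quick way to list them, assuming the same /u01/app/ggate home and the PGGTEST1 group from the example above:
    # Current and aged report files for the PGGTEST1 group,
    # newest first; aged copies carry a numeric suffix.
    ls -lt /u01/app/ggate/dirrpt/PGGTEST1*.rpt
    # The current report can also be read from within GGSCI:
    #   GGSCI> view report PGGTEST1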
When a process abends due to an Oracle error or a data validation issue, the failed records are discarded and their details are written to the discard file. If data loss is not acceptable to the business, the information in the discard file is critical: this is where you can see the details of the errors that occurred while processing the failed records. The following is an example of the discard file from the RGGTEST1 replicat. The error shown in this case is an instance where the replicat couldn't find the record it was trying to update in the target database:
Oracle GoldenGate Delivery for Oracle process started, group RGGTEST1 discard file opened: 2013-02-21 18:58:35
Current time: 2013-02-21 18:58:41
Discarded record from action ABEND on error 1403
OCI Error ORA-01403: no data found, SQL <UPDATE "SCOTT"."EMP" SET "ENAME" = :a1,"JOB" = :a2,"MGR" = :a3,"HIREDATE" = :a4,"SAL" = :a5,"COMM" = :a6,"DEPTNO" = :a7 WHERE "EMPNO" = :b0>
Aborting transaction on ./dirdat/rt beginning at seqno 30 rba 14033 error at seqno 30 rba 19750
Problem replicating SCOTT.EMP to SCOTT.EMP
Record not found
Mapping problem with compressed update record (target format)...
EMPNO = 7902
ENAME = FORD
JOB = ANALYST
MGR = 7566
HIREDATE = 1981-12-03 00:00:00
SAL = 3100.00
COMM = NULL
DEPTNO = 20
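The location and name of the discard file are set with the DISCARDFILE parameter in the process parameter file. A minimal sketch to track it down; the parameter file name rggtest1.prm is an assumption based on the group name used above:
    # Find where the RGGTEST1 replicat writes its discard file
    grep -i discardfile /u01/app/ggate/dirprm/rggtest1.prm
    # Hypothetical output; the actual path is whatever your
    # parameter file specifies:
    #   DISCARDFILE ./dirrpt/rggtest1.dsc, APPEND, MEGABYTES 50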
A RUNNING status of the GoldenGate processes does not necessarily mean that the replication is working efficiently. Sometimes the process status is RUNNING but, due to some issue, the speed of data replication drops. To check whether the records generated in the source database are being processed promptly by the replication, we use the LAG command. The LAG command reports the difference, in seconds, between the time a record was generated and the time it was processed by the GoldenGate process.
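Lag can also be checked non-interactively, which is handy for scheduled checks. The following one-liner is a sketch under the same assumption about the GoldenGate home; it simply feeds the LAG command from step 5 into ggsci:
    # Report lag for all processes without an interactive session
    cd /u01/app/ggate && echo "lag *" | ./ggsci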
If there is an issue with the underlying database, you can check its alert log. Core database issues and alert log scanning are beyond the scope of this book.
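For a quick scan, you can grep the tail of the alert log for ORA- errors. The path below follows the standard 11g ADR layout; the database and instance names are assumptions:
    # Assumed alert log location; substitute your own database
    # and instance names.
    ALERT_LOG=/u01/app/oracle/diag/rdbms/prim/prim1/trace/alert_prim1.log
    grep 'ORA-' "$ALERT_LOG" | tail -20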

See also

  • See the next recipe, Script to perform a regular scheduled healthcheck of a live GoldenGate configuration
