Sunday, 2 March 2014

Troubleshooting CUCM Database Replication

CUCM uses IBM Informix as its database. We have no native access to it (unless you are a Cisco TAC engineer who can gain temporary root access).
If DB replication breaks, we see many symptoms in our IPT network: phones registered to one subscriber are unable to call phones registered to another subscriber, users are unable to log in to Extension Mobility, etc.

We can use any one of these methods to check replication status:
  • Open RTMT and click on: Call Manager --> Service --> Database Summary
  • Open a web browser and go to https://<IP address>:8443/cucreports/, then enter your authorized username and password.
    Go to: Database Status Report
  • Using PuTTY, SSH to the CUCM CLI and run this command: utils dbreplication runtimestate --> REPL. LOOP? is the current state.

Here is what each replication state means:
0 - Initialization state : Replication is in the process of trying to set up. Being in this state for longer than an hour could indicate a failure in setup.

1 - Number of replicates not correct : This state is rarely seen and can indicate that replication is still in the setup process. Being in this state for longer than an hour could indicate a failure in setup.

2 - Replication is good : All is well in paradise :)

3 - Tables are suspect : Logical connections have been established, but we are unsure whether the tables match. This can happen because the other servers are unsure whether an update to a user-facing feature has not yet been passed from that sub to the other devices in the cluster.

4 - Setup failed / Dropped : The server no longer has an active logical connection over which to receive database tables. No replication is occurring in this state.
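
As a quick reference, the state codes above can be kept in a small lookup table. This is only a sketch: the names REPLICATION_STATES, describe_state and is_healthy are made up for illustration and are not part of CUCM or RTMT.

```python
# Replication state codes and their meanings, exactly as listed above.
REPLICATION_STATES = {
    0: "Initialization state",
    1: "Number of replicates not correct",
    2: "Replication is good",
    3: "Tables are suspect",
    4: "Setup failed / Dropped",
}


def describe_state(code: int) -> str:
    """Return a one-line description for a replication state code."""
    meaning = REPLICATION_STATES.get(code, "Unknown state")
    return f"{code} - {meaning}"


def is_healthy(code: int) -> bool:
    """Only state 2 means replication is good."""
    return code == 2


if __name__ == "__main__":
    for code in range(5):
        print(describe_state(code), "| healthy:", is_healthy(code))
```

Feed it the state number you read from RTMT or from the runtimestate output to get the matching description.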

If the state is anything other than 2, check:
  • Server & cluster connectivity : Check that the required TCP/UDP ports are open on the network. To get the port list for your CUCM version, just google : CUCM <your version> ports
  • Configuration files (if on older CUCM - extremely rare) :
  1. /etc/hosts ---> local resolution of hostnames to IP addresses
  2. /home/informix/.rhosts ---> hosts trusted to make database connections
  3. $INFORMIXDIR/etc/sqlhosts ---> full list of CCM servers for replication
  • Check if server times are correct and synced (NTP working fine)
  • DNS not configured properly (forward/reverse lookup)
  • NTP not reachable/time drift between servers
  • A Cisco DB Replicator service not running/not working
  • Cisco Database Layer Monitor (DbMon) hung/stopped
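
The forward/reverse DNS item in the checklist above can be checked from any workstation that uses the same DNS servers as the cluster. Below is a minimal sketch that assumes you know each node's hostname and expected IP address. The function name check_dns and the sample hostname and IP are hypothetical, and the script runs on your own machine, not on CUCM.

```python
import socket
from typing import Callable, List, Tuple


def check_dns(
    hostname: str,
    expected_ip: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], Tuple[str, list, list]] = socket.gethostbyaddr,
) -> List[str]:
    """Return a list of problems; an empty list means both lookups agree."""
    problems = []

    # Forward lookup: hostname -> IP address.
    try:
        ip = forward(hostname)
        if ip != expected_ip:
            problems.append(f"forward: {hostname} resolved to {ip}, expected {expected_ip}")
    except OSError as exc:
        problems.append(f"forward: lookup of {hostname} failed ({exc})")

    # Reverse lookup: IP address -> hostname (compare short names only).
    try:
        name = reverse(expected_ip)[0]
        if name.split(".")[0].lower() != hostname.split(".")[0].lower():
            problems.append(f"reverse: {expected_ip} resolved to {name}, expected {hostname}")
    except OSError as exc:
        problems.append(f"reverse: lookup of {expected_ip} failed ({exc})")

    return problems


if __name__ == "__main__":
    # Hypothetical node; replace with your own publisher/subscriber details.
    for problem in check_dns("cucm-pub", "10.0.0.1"):
        print(problem)
```

The resolver functions are passed in as parameters only so that the check can be exercised without a live DNS server; by default it uses the standard socket calls.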

Useful Commands:

  • utils dbreplication setrepltimeout -- The default value is 300 seconds. You can validate this by running "show tech repltimeout". This is the timer used to put multiple servers into one run of the data sync. In other words, it is the "batching" timer. It affects when the broadcast realize template and data sync will fire (n seconds from the end of the first defined server). With Clustering over WAN (CoW), long delays can make the data sync process take exponentially longer. Try to sync the local servers first.
 
  • utils dbreplication repair -- In CUCM 5.x, this command meant a reset of the replication, whereas in CUCM 6.x and higher versions it means a repair of the data. It runs a repair process on all tables in the replication for all servers that are included in the command. Run this command when RTMT = 2, not when RTMT = 0 or 3.
 
  • utils dbreplication repairtable / repairreplicate -- This command essentially does the same thing as the repair command, but runs on only one table / replicate, hence making the process much faster. It fixes the out of sync data for that table / replicate. You can verify by running "utils dbreplication status" to see if there are any mismatches or errors found. It is particularly useful on large CUCM clusters. Run this command when RTMT = 2, not when RTMT = 0 or 3.
 
  • utils dbreplication stop -- You should only run this if you want to stop replication setup. The only way to recover from a stop is with a reset. This command removes the set-up indicator file (i.e. the dbmonpreflightcheck file) and kills the currently running replication commands. It pauses for the duration of the repltimeout timer, so if you run replication commands soon after running a stop, it could kill those commands as well. Run this command when RTMT = 0, not when RTMT = 3.
 
  • utils dbreplication reset -- This command causes replication to be torn down and then set-up. You should run this command when RTMT = 4 or when you have issued stop. Successful completion of this process results in RTMT = 2.
 
  • utils dbreplication clusterreset -- Avoid running this command. It is for debugging replication set-up problems. It bypasses the RTMT settings, cluster requirements and normal CUCM set-up. It causes services to go out of sync with the database because it syncs data without change notification. The services need to be restarted when this command is run, no exceptions!
 
  • utils dbreplication dropadmindb -- Run this command when there is a looping attempt to define a server in replication. It is usually not the failing server itself that is at fault. More often, what is corrupted is either the pub (as a result of an earlier attempt) or the sub that attempted set-up just before the current one.
 
  • utils dbreplication forcedatasyncsub -- This command takes a backup of the publisher, restores it to the subscriber(s) and sets replication up again. It requires a services restart on the subscriber so that the services pick up the new values.
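
The "run this when RTMT = ..." notes in the command list above can be summarized in a small lookup. This is a sketch of those notes only, not official Cisco guidance. The name suggest_commands and the stop_issued flag are made up for illustration, and states 1 and 3 return nothing because the notes above give no explicit command for them.

```python
from typing import List

# Commands the notes above tie to each RTMT replication state.
# States 1 and 3 have no explicit guidance above, so they map to an
# empty list rather than a guess.
COMMANDS_BY_STATE = {
    0: ["utils dbreplication stop"],
    1: [],
    2: [
        "utils dbreplication repair",
        "utils dbreplication repairtable",
        "utils dbreplication repairreplicate",
    ],
    3: [],
    4: ["utils dbreplication reset"],
}


def suggest_commands(state: int, stop_issued: bool = False) -> List[str]:
    """Return the commands the notes above associate with this state."""
    # After a stop, the only way to recover is a reset.
    if stop_issued:
        return ["utils dbreplication reset"]
    return list(COMMANDS_BY_STATE.get(state, []))


if __name__ == "__main__":
    for state in range(5):
        print(state, "->", suggest_commands(state) or "no explicit guidance")
```

Treat the result as a reminder of the notes, and remember that the repair commands at state 2 are only worth running when a status check has reported mismatches.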

New Commands and Database Improvements in CUCM 9.x:
  • Re-engineered CLI forcedatasyncsuball (lightning fast) -- This command can now restore a larger cluster in a shorter period of time!

  • New CLI rebuild is a stop, drop and reset all in one (and faster) -- Rebuild is multi-threaded, so the total operation time is much shorter than executing three different CLI commands (stop / drop / reset). Rebuild is a master command that will stop, delete and trigger the replication setup signal across the cluster automatically and in parallel:
  1. Stop DB Replication – stop the current replication setup process, if one exists
  2. Remove server from database – remove replication from the network by either “cdr delete”, dropping the syscdr database or renaming the syscdr database remotely
  3. Trigger DbMon on the subscriber to submit a replication setup request to the publisher
 
  • New CLI utils dbreplication status table/replicate -- The "utils dbreplication status" command takes a long time to run. If only one table is suspect, you still have to wait for all the tables to be checked. Being able to check a single table or replicate speeds up replication checks.

 
  • Better Log Collection -- "utils create report database" collects all the database logs in one go. Also, the ercollect.sh script is now embedded in the server for IBM root cause cases, so there is no need to transfer it and change its permissions. It is accessible via root access only.
 
  • Faster and more accurate Runtimestate CLI -- This command is now multithreaded, making it much faster. The output will also be logged for historical RCA. If there are any unreachable servers in the cluster, this command will no longer hang. Some additional information will be included in it such as repltimeout and IDS server number.

- Abhinay Mylavarapu
