Pingdom Health Checks Guide v1.1

Added on:  06/01/18     Updated on:  03/24/20

Introduction


Pingdom is a service used for monitoring the state of UniPay/UniBroker nodes. A set of pages provides health checks that track whether the servers and particular services work properly. The environment is scanned for errors and deviations. If an issue is identified, a corresponding notification is sent to the system administrators.


Intended Audience


This guide is intended for product administrators who manage servers where UniPay/UniBroker are deployed.


Pingdom Check Pages


To perform regular server health monitoring, UniPay contains a separate Pingdom module that provides pages which can be opened in any browser (Mozilla Firefox, Microsoft Internet Explorer, Google Chrome, etc.).

All pages have two default responses that can be retrieved depending on the system state:
OK - the health check was performed successfully.
FAIL - the health check was not performed successfully.
Responses specific to a particular page are listed below.
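
The sketch below shows one way a monitoring script might interpret the body returned by a status page. It assumes the page returns the status as plain text; the function name interpret_response and the returned labels are hypothetical and not part of UniPay.

```python
def interpret_response(body: str) -> str:
    """Map the body of a status page to a coarse monitoring state.

    Assumption: the page body contains only the status string.
    """
    status = body.strip()
    if status == "OK":
        return "healthy"
    if status == "FAIL":
        return "failed"
    # Anything else is treated as a response specific to the page.
    return "page-specific: " + status
```

For example, interpret_response("DB_FAIL") is reported as a page-specific response rather than as one of the two default ones.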

The following status pages are used with UniPay service:

your_unipay_server_address/pingdom/audit.jsp - this page is used for a health check that verifies conditions that are not critical for system execution but suggest a potential error that may lead to a shutdown. The verification is performed every 15 minutes, and alert emails are sent if any issue is detected.

This page contains the following checks:
1) checkSystemAuditLogLastExecutionDate
2) checkFreeDiskSpace
3) checkJobMessagePendingCount
4) checkMemoryLeak
5) checkQueryPlanCache
6) checkJobObjectActiveStateDifference

To learn more about these health checks, review the Pingdom Health Checks section.

your_unipay_server_address/pingdom/jboss.jsp - this page is designated for use by load balancers to verify whether a specific node is active and can process requests. The checks are exactly the same as those executed by index.jsp. The difference, however, is that the result of the verification is cached for 15 seconds. Any subsequent request within the next 15 seconds receives the previous result.
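
The caching behavior described for jboss.jsp can be illustrated with a small time-based cache. This is a minimal sketch of the idea only, not the UniPay implementation; the class name CachedCheck and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Caches the result of a health check for a fixed number of seconds."""

    def __init__(self, check: Callable[[], str], ttl_seconds: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._result: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Run the real check only when nothing is cached or the entry expired.
        if self._result is None or now >= self._expires_at:
            self._result = self._check()
            self._expires_at = now + self._ttl
        return self._result
```

With such a cache, a load balancer that polls the page every few seconds triggers the underlying checks at most once per 15-second window.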

your_unipay_server_address/pingdom/index.jsp - this page is used for a health check that verifies the critical stability of the system, so it is recommended to run this check every 5 minutes. Alert SMS messages are sent if any issue is detected because immediate action is required.

This page contains the following checks:
1) checkDBHealth
2) checkTokenization
3) checkResources
4) checkModules
5) checkNodes

To learn more about these health checks, review the Pingdom Health Checks section.

your_unipay_server_address/pingdom/connection.jsp - this diagnostics page is designated to check the connection to the various protocols and services (HTTP, HTTPS, FTP, SFTP, FTPS, LUNA, SMTP, SMTPS, STRONGAUTH, TCP, PING). To execute the connection verification, the following parameters should be submitted in the request:
1) type of a protocol/service that is going to be checked;
2) token - hardcoded password used for access;
3) username - UniPay name of a user;
4) password - UniPay password of a user;
5) host - host that is going to be checked;
6) port - port that is going to be checked;
7) domainId - domain sequence number (used for StrongAuth connection check only).

Possible outcomes:
CONNECTION_FAILED – connection to a specified protocol/service is not present.
CONNECTION_SUCCEEDED – connection to a specified protocol/service is present.
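
As an illustration, the request to connection.jsp could be assembled as shown below. This sketch assumes that the parameters are passed in the query string and that the first parameter is named type; neither is confirmed by this guide. The function name and the example values are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def build_connection_check_url(base_url: str, check_type: str, token: str,
                               username: str, password: str, host: str,
                               port: int, domain_id: Optional[int] = None) -> str:
    """Build a connection.jsp URL from the parameters listed above."""
    params = {
        "type": check_type,  # assumed parameter name for the protocol/service
        "token": token,
        "username": username,
        "password": password,
        "host": host,
        "port": str(port),
    }
    # domainId is used for the StrongAuth connection check only.
    if domain_id is not None:
        params["domainId"] = str(domain_id)
    return base_url.rstrip("/") + "/pingdom/connection.jsp?" + urlencode(params)
```

Because urlencode escapes reserved characters, values such as passwords containing @ or & do not break the request.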

your_unipay_server_address/pingdom/audit-detail.jsp - this page is designated for manual diagnostics. It lets you execute index.jsp on all nodes within the cluster and get an aggregated audit result for all UniPay and UniBroker nodes in a single request, in XML format. The check result is not critical for real-time UniPay/UniBroker efficiency.

Possible outcomes:
NOT_SPECIFIED_COUNT - the number of nodes specified in the request is incorrect.
NOT_OK - one or more nodes work incorrectly. To identify which node has an issue, pay attention to the format of the status that will look similar to the following - “[3;2];name of the node(status);name of the node(status)”, where [3;2] means 3 nodes of UniBroker and 2 nodes of UniPay.
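
The status format above can be split into its parts with a short parser. This is a sketch that assumes the string follows the pattern exactly as shown and that a healthy node carries the status OK in parentheses; the function names are hypothetical.

```python
import re
from typing import List, Tuple

_HEADER = re.compile(r"^\[(\d+);(\d+)\];?(.*)$")
_ENTRY = re.compile(r"^(.+)\(([^()]*)\)$")


def parse_audit_detail(status: str) -> Tuple[int, int, List[Tuple[str, str]]]:
    """Return (UniBroker node count, UniPay node count, [(node name, status)])."""
    header = _HEADER.match(status.strip())
    if header is None:
        raise ValueError("unexpected status format: " + status)
    nodes = []
    for entry in filter(None, header.group(3).split(";")):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError("unexpected node entry: " + entry)
        nodes.append((match.group(1), match.group(2)))
    return int(header.group(1)), int(header.group(2)), nodes


def failing_nodes(status: str) -> List[str]:
    """Names of the nodes whose status is anything other than OK (assumed)."""
    _, _, nodes = parse_audit_detail(status)
    return [name for name, node_status in nodes if node_status != "OK"]
```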


The following status pages are used with UniBroker service:

your_unibroker_server_address/pingdom/index.jsp - this page is used for a health check that verifies the critical stability of the system, so it is recommended to run this check every 5 minutes. Alert SMS messages are sent if any issue is detected because immediate action is required.

This page contains the following checks:
1) checkDBHealth
2) checkTokenization
3) checkResources
4) checkModules

To learn more about these health checks, review the Pingdom Health Checks section.

your_unibroker_server_address/pingdom/jboss.jsp - this page is designated for use by load balancers to verify whether a specific node is active and can process requests. The checks are exactly the same as those executed by your_unipay_server_address/pingdom/index.jsp. The difference, however, is that the result of the verification is cached for 15 seconds. Any subsequent request within the next 15 seconds receives the previous result.

your_unibroker_server_address/pingdom/system.jsp - this page is designated for reviewing the characteristics of a specific UniBroker node.

The result provides the following information:
1) current application version (for example, 7.2.d55d54ea3b-b20180820);
2) branch (for example, test/dev);
3) code of the node (for example, 1);
4) application host (for example, testgateway.local).


The following status page is used both with UniPay and UniBroker:

your_unibroker_server_address/pingdom/node.jsp - this diagnostic page is designated for manual verification when problems are suspected. It lets you verify, from a specific node, whether that node can reach and is connected to the other nodes in the cluster. To identify which nodes are connected to the node, pay attention to the format of the status, which looks similar to the following - 2[name of the node-1,name of the node-2,name of the node-3], where 2 is the code of the accessed node.
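
A status in this format could be parsed as sketched below. The sketch assumes that the node code is numeric and that node names do not contain commas; the function name parse_node_status is hypothetical.

```python
import re
from typing import List, Tuple

_NODE_STATUS = re.compile(r"^(\d+)\[(.*)\]$")


def parse_node_status(status: str) -> Tuple[int, List[str]]:
    """Return (code of the accessed node, names of the connected nodes)."""
    match = _NODE_STATUS.match(status.strip())
    if match is None:
        raise ValueError("unexpected status format: " + status)
    names = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return int(match.group(1)), names
```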


All these pages can be used for unit testing as well.


Pingdom Health Checks


All health checks have two default responses that can be retrieved depending on the system state:
OK - the health check was performed successfully.
FAIL - the health check was not performed successfully.
Responses specific to a particular health check are listed below.

To perform server monitoring, the following health checks can be used:

1) checkDBHealth - checks the availability of the connection to the database.

Possible outcomes:
DB_FAIL - the database is unavailable.

2) checkTokenization - checks whether tokenization/detokenization services are active (this check currently works only for cases when StrongAuth appliance is used).

Possible outcomes:
TOKENIZATION_FAIL - tokenization service is inactive.
DETOKENIZATION_FAIL - detokenization service is inactive.

3) checkResources - Key UniPay folders (app-home, work-home, and resources-home) are required to have at least 200MB of free space. This check verifies whether there is enough free space in these folders.

Possible outcomes:
FREE_DISK_SPACE_PROBLEM - free space in the folder(s) is less than 200MB.
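
The logic of this check can be sketched with the standard library as follows. This is an illustration only, not the UniPay code; it assumes that 200MB means 200 * 1024 * 1024 bytes, and the function name and folder arguments are hypothetical.

```python
import shutil
from typing import Callable, Iterable

MIN_FREE_BYTES = 200 * 1024 * 1024  # assumed interpretation of "200MB"


def check_resources(folders: Iterable[str], min_free_bytes: int = MIN_FREE_BYTES,
                    usage: Callable = shutil.disk_usage) -> str:
    """Return FREE_DISK_SPACE_PROBLEM if any folder has too little free space."""
    for folder in folders:
        # shutil.disk_usage returns a (total, used, free) tuple in bytes.
        if usage(folder).free < min_free_bytes:
            return "FREE_DISK_SPACE_PROBLEM"
    return "OK"
```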

4) checkModules - UniPay consists of various modules. Modules that should be active on a particular server are specified in unipay.system.module-key property. This check verifies whether all required modules are available.

Possible outcomes:
MODULE_PROBLEM - one or more modules are unavailable.

6) checkSystemAuditLogLastExecutionDate - There are two system audit processes within the system that run on daily, hourly and quarter-hourly cycles. They audit various internal conditions, such as user errors, internal business process failures, etc. Every time such a procedure completes, its execution is recorded in the database. This check verifies that the respective audit procedure has been recorded during the last 30-minute period.

Possible outcomes:
SYSTEM_AUDIT_PROBLEM – no records were added to the database within the last 30 minutes (this suggests that the audit process has stopped running and human intervention is needed).
DB_FAIL – database access or any related problem is present.
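
The 30-minute rule can be expressed as a simple comparison of timestamps. The sketch below assumes that the time of the last recorded audit execution has already been read from the database; the function name and its arguments are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

AUDIT_WINDOW = timedelta(minutes=30)


def check_system_audit_log(last_execution: Optional[datetime], now: datetime,
                           window: timedelta = AUDIT_WINDOW) -> str:
    """Return SYSTEM_AUDIT_PROBLEM if no audit run was recorded within the window."""
    if last_execution is None or now - last_execution > window:
        return "SYSTEM_AUDIT_PROBLEM"
    return "OK"
```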

7) checkFreeDiskSpace - The directory where the UniPay application is located is required to have at least 1GB of free space. This check verifies whether there is enough free space in this directory.

Possible outcomes:
FREE_DISK_SPACE_PROBLEM – free space on the disk where UniPay is located is less than 1GB.

8) checkJobMessagePendingCount - Within the system, the execution of system processes, including those that are run by timer, is done using Apache Camel. Tasks are carried out by executing jobs. As a rule, each job is processed within 1.5 hours. If a job is still not complete and remains in pending status after this period, there is generally an issue with its execution. This check verifies whether there are any jobs with a potential processing issue (over 1.5 hours in pending status).

Possible outcomes:
CAMEL_PENDING - jobs in pending status that were not completed in 1.5 hours are present.
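
The rule can be illustrated by filtering a list of jobs by status and age. This is a sketch under assumptions: the (job id, status, start time) tuple and the PENDING status label are hypothetical and do not reflect the actual UniPay data model.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

MAX_PENDING_AGE = timedelta(hours=1, minutes=30)

Job = Tuple[str, str, datetime]  # hypothetical (job_id, status, started_at)


def stuck_jobs(jobs: Iterable[Job], now: datetime) -> List[str]:
    """Ids of jobs that have been pending for longer than 1.5 hours."""
    return [job_id for job_id, status, started_at in jobs
            if status == "PENDING" and now - started_at > MAX_PENDING_AGE]


def check_job_message_pending_count(jobs: Iterable[Job], now: datetime) -> str:
    """Return CAMEL_PENDING if at least one job is stuck in pending status."""
    return "CAMEL_PENDING" if stuck_jobs(jobs, now) else "OK"
```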

9) checkMemoryLeak - As a rule, 10% of the memory allocated to the Java Virtual Machine has to be free. This check verifies the amount of RAM used by the JVM. The required share of free RAM (10% by default) can be defined using the validateMemoryLeakUsed parameter.

Possible outcomes:
MEMORY_LEAK_PROBLEM - less than 10% of the RAM allocated to JVM is available.
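
The threshold comparison can be sketched as follows. The argument min_free_percent stands in for the validateMemoryLeakUsed parameter; how that parameter is actually passed to the check is not covered here, and the function name is hypothetical.

```python
def check_memory_leak(free_bytes: int, allocated_bytes: int,
                      min_free_percent: float = 10.0) -> str:
    """Return MEMORY_LEAK_PROBLEM if the free share of JVM memory is too low."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    # Compare cross-multiplied values to avoid dividing before the comparison.
    if free_bytes * 100 < allocated_bytes * min_free_percent:
        return "MEMORY_LEAK_PROBLEM"
    return "OK"
```

For example, with 1000 bytes allocated and 50 bytes free, only 5% is available, so the sketch reports MEMORY_LEAK_PROBLEM.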

10) checkQueryPlanCache – To optimize query execution, queries are stored in a query plan cache. By default, it can store 2048 queries. If a system issue occurs, this cache can become overloaded and the number of stored queries can reach its maximum. Such a situation can cause RAM problems. This check verifies that the query plan cache load rate is within limits. By default, failure is reported at 100% load; however, this rate can be altered using the occupancyQueryPlanCache parameter.

Possible outcomes:
QUERY_PLAN_PROBLEM - the query plan cache load has reached the limit (100% by default).

11) checkJobObjectActiveStateDifference – Within the system, the execution of system processes, including those that are run by timer, is done using Apache Camel. Some jobs and timers should always be active. In some situations, some of these jobs need to be deactivated temporarily, and there is a chance that they will not be reactivated (due to human error). This check verifies that all tasks that should be active by default are currently running (the isActiveDefault and isActive parameters of the job have the same value).

Possible outcomes:
CAMEL_CONFIG – one or more jobs that should be active by default are currently deactivated.
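
The comparison of the two flags can be sketched as shown below. The job records are hypothetical dictionaries with a name key added for illustration. Following the outcome description, the sketch flags only jobs that are active by default but currently inactive; whether the real check also flags the opposite case is not stated in this guide.

```python
from typing import Iterable, List, Mapping


def deactivated_default_jobs(jobs: Iterable[Mapping]) -> List[str]:
    """Names of jobs that should be active by default but are not active now."""
    return [job["name"] for job in jobs
            if job["isActiveDefault"] and not job["isActive"]]


def check_job_object_active_state_difference(jobs: Iterable[Mapping]) -> str:
    """Return CAMEL_CONFIG if any default-active job is currently deactivated."""
    return "CAMEL_CONFIG" if deactivated_default_jobs(jobs) else "OK"
```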

12) checkNodes - checks whether Camel requests are balanced between the nodes correctly. It verifies that Camel requests are sent to the nodes where Camel is installed and job processing is allowed by the node configuration.

Possible outcomes:
FAIL – Camel job processing is either impossible or prohibited.