Service monitoring other services' activities

Clément · October 11, 2024

Hi,

I'm using Delphi 12 for this one.

There are several Windows service applications working on a lot of different tasks.
For example:
A Schedule service, with thread managers and worker threads that execute tasks at specific times.
A Communication service, with several thread managers, each with its own set of worker threads that handle communication with TCP (or UDP) devices.
A Batch service, with several thread managers, each with a set of worker threads that handle batch...

Well, as the application grew over the years, there are now more and more thread managers handling more and more workers...

Sometimes sh*t hits the fan, and some threads just stop responding. Sometimes it's a worker, which might be replaced by the manager, but sometimes it's a manager thread that goes bananas.

I would like to write another service, a thread monitoring service, where I want to "send somehow" a heartbeat from each worker thread ( from all the other threads ).

I want to know when a worker thread goes bananas, but mainly when a manager thread goes bazinga.
Some of the errors we have detected: out of memory, out of disk space, file in use by another process (usually anti-virus), SQL query error (invalid customer data), SQL query error (invalid instruction), database server in maintenance mode, database not available (communication lost, disk full, backup taking too long), bad Windows Server patch, Windows Update, and the list goes on and on.

All of the above are actual problems that lead a worker or a manager to fail. Sometimes we can track down what happened and respond within our SLA. But sometimes it's just a nightmare. Nobody did anything and nobody changed anything...

I guess I want the safest IPC in this context. For this to work, the worker thread must not be able to freeze while sending a heartbeat.
For now, just knowing which thread stopped will be enough.
I suspect a lot of things, but even with a lot of logs it is sometimes very hard to track down what is happening, especially when the customer is eager to blame me.
At least, the idea is to detect a "worker strike" or a "manager riot" as early as possible.

Any tips?
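
One way to meet the "worker cannot freeze while sending a heartbeat" requirement is to make the heartbeat a single atomic counter increment, and leave all I/O to a separate reporter. The following is only a minimal sketch under assumptions: the slot array, MaxSlots, Beat, ReadBeat and Advanced are hypothetical names, and the counters live in ordinary process memory here. For a separate monitoring service to read them, the array would have to sit in shared memory or be forwarded by a reporter thread, which is not shown.

```delphi
program HeartbeatSlots;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

const
  MaxSlots = 64; // hypothetical upper bound on monitored threads

var
  // One counter per monitored thread; each thread only increments its own slot.
  Beats: array[0..MaxSlots - 1] of Int64;

// Called from a worker or manager loop: one atomic increment,
// no lock, no allocation, no I/O.
procedure Beat(Slot: Integer);
begin
  TInterlocked.Increment(Beats[Slot]);
end;

// Called by the reporter only: atomic 64-bit read of one slot.
function ReadBeat(Slot: Integer): Int64;
begin
  Result := TInterlocked.Read(Beats[Slot]);
end;

// Reporter-side check: a slot that did not advance during the window is suspect.
function Advanced(Slot: Integer; WindowMs: Cardinal): Boolean;
var
  Before: Int64;
begin
  Before := ReadBeat(Slot);
  Sleep(WindowMs);
  Result := ReadBeat(Slot) <> Before;
end;

var
  Worker: TThread;
begin
  Worker := TThread.CreateAnonymousThread(
    procedure
    var
      I: Integer;
    begin
      for I := 1 to 10 do
      begin
        Beat(0);    // slot 0 is assigned to this worker
        Sleep(50);  // stands in for one unit of real work
      end;
    end);
  Worker.FreeOnTerminate := False;
  Worker.Start;
  try
    Writeln('While running, advanced: ', Advanced(0, 200));
    Worker.WaitFor;
    Writeln('After it stopped, advanced: ', Advanced(0, 200));
  finally
    Worker.Free;
  end;
end.
```

The first check should report that the counter advanced and the second that it did not. Because the worker never waits on the monitor, a stuck or missing monitor cannot take a worker down with it.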

DelphiUdIT · October 11, 2024

I did something similar many years ago, then I abandoned this path because all the problems were solved and there was no longer any need for a similar approach.

What I did was use a program external to the application (not a service) using it as a "dumb" TCP server to collect information from all the other applications and their threads.

The server had to receive up-to-date data (for example the number of cycles performed, the status of the connections, the number of polls performed on all devices), and the external application hosting the TCP server ran a rough analysis on it and displayed any alerts. At the time I also took into account the revolutions counted by an encoder (a device that counts the rotations of a mechanical shaft) and matched them against the cycles performed; if they did not match, alarms were sent from all sides. (This should not be said out loud, but the application also sent the data privately via its own internal mail client connected to my company server, so that we had everything under control, similar to "analytics".)

Angus Robertson · October 11, 2024

ICS has a new Application Monitoring client and server system. I have it running on all my public servers, monitoring my web, FTP and proxy Windows Services, and restarting them if they halt, or on request if they experience critical errors. Have a read of: https://wiki.overbyte.eu/wiki/index.php/FAQ_ICS_Application_Monitoring

The client part just sends simple TCP PING packets; the hard part is knowing when to send those pings. My first attempt just used a timer, but that began firing before the server had started and never checked whether it ever did start. Things got better over the weeks.

The server is currently basic, running on the same machine since it needs to restart the Windows Services if they stop, but I'm going to add remote monitoring of that server with a websocket API so a remote PC can monitor several servers.

Angus

Remy Lebeau · October 11, 2024

2 hours ago, Clément said:

I would like to write another service, a thread monitoring service, where I want to "send somehow" a heartbeat from each worker thread ( from all the other threads ).

I want to know when a worker thread goes bananas, but mainly when a manager thread goes bazinga.

At my last company, I developed and maintained a central Windows service that all of our other products communicated with to 1) write entries in a centralized log file, 2) send out notifications, and 3) track heartbeats. I used a free-threaded ActiveX/COM object for the communication, and just about every thread of every product made use of this COM object. Each message identified which product and internal component it belonged to. Each product would register its heartbeats and then update them at regular intervals until shutdown. If a heartbeat ever timed out unexpectedly (because a thread had frozen or died, or was just running a task for too long) then the service would send out a notification containing those details to our tech support and/or server admins (usually an email, but other kinds of notifications were also supported).
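
The register/update/time-out cycle described above can be sketched on the monitor side roughly as follows. This is not the COM-based service from the post, just an illustration under assumptions: THeartbeatRegistry, its methods, the key format and the 200 ms timeout are all hypothetical, and the transport that delivers each heartbeat to the monitor is left out.

```delphi
program HeartbeatRegistry;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.SyncObjs,
  System.Generics.Collections;

type
  // Hypothetical monitor-side registry; names are illustrative only.
  THeartbeatRegistry = class
  private
    FLock: TCriticalSection;
    FLastSeen: TDictionary<string, UInt64>; // key -> tick (ms) of last beat
    FTimeoutMs: UInt64;
  public
    constructor Create(TimeoutMs: UInt64);
    destructor Destroy; override;
    procedure Beat(const Key: string);       // registers or refreshes a heartbeat
    procedure Unregister(const Key: string); // clean shutdown, no alert wanted
    function Overdue: TArray<string>;        // keys whose last beat is too old
  end;

constructor THeartbeatRegistry.Create(TimeoutMs: UInt64);
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FLastSeen := TDictionary<string, UInt64>.Create;
  FTimeoutMs := TimeoutMs;
end;

destructor THeartbeatRegistry.Destroy;
begin
  FLastSeen.Free;
  FLock.Free;
  inherited;
end;

procedure THeartbeatRegistry.Beat(const Key: string);
begin
  FLock.Enter;
  try
    FLastSeen.AddOrSetValue(Key, GetTickCount64);
  finally
    FLock.Leave;
  end;
end;

procedure THeartbeatRegistry.Unregister(const Key: string);
begin
  FLock.Enter;
  try
    FLastSeen.Remove(Key);
  finally
    FLock.Leave;
  end;
end;

function THeartbeatRegistry.Overdue: TArray<string>;
var
  Pair: TPair<string, UInt64>;
  NowMs: UInt64;
  Late: TList<string>;
begin
  Late := TList<string>.Create;
  try
    FLock.Enter;
    try
      NowMs := GetTickCount64;
      for Pair in FLastSeen do
        if NowMs - Pair.Value > FTimeoutMs then
          Late.Add(Pair.Key);
    finally
      FLock.Leave;
    end;
    Result := Late.ToArray;
  finally
    Late.Free;
  end;
end;

var
  Registry: THeartbeatRegistry;
  Key: string;
begin
  Registry := THeartbeatRegistry.Create(200); // 200 ms timeout, demo value only
  try
    Registry.Beat('Scheduler/Manager');
    Registry.Beat('Comm/Worker-1');
    Sleep(300);
    Registry.Beat('Comm/Worker-1'); // only this one keeps reporting
    for Key in Registry.Overdue do
      Writeln('No heartbeat from ', Key);
  finally
    Registry.Free;
  end;
end.
```

In this run only 'Scheduler/Manager' should be reported. In a real monitor, Overdue would be polled from a timer and its result handed to whatever sends the notifications. A thread that shuts down cleanly calls Unregister so that it does not raise a false alarm.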

Edited October 11, 2024 by Remy Lebeau
