bdw_nz20

Monitor Windows application and restart if needed

Just wondering what solutions, if any, people use to monitor a program and restart it if it locks up or raises an exception?

 


In the program to be monitored, add code that makes it answer a message (for example over a TCP socket, a pipe, or a Windows message).

Write a second program that launches the first, periodically connects to it, and checks whether it gets an answer.

If there is no answer, it kills the program and starts it again.

When the second program starts the first (CreateProcess), it gets a handle that it can wait on to detect when the program stops normally.
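
A minimal Python sketch of the supervisor described above. It assumes the monitored program answers `PING` with `PONG` on a TCP port; that protocol, the names `is_alive`, `run_once` and `supervise`, and the timing values are illustrative assumptions, not something from the post. `Popen.wait` plays the role of waiting on the process handle returned by CreateProcess.

```python
import socket
import subprocess


def is_alive(host, port, timeout=2.0):
    """Return True if the monitored program answers PING with PONG (assumed protocol)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(b"PING\n")
            return conn.recv(16).startswith(b"PONG")
    except OSError:  # refused, reset or timed out: treat as "no answer"
        return False


def run_once(cmd, host, port, interval=5.0, grace=10.0):
    """Start cmd and watch it; return 'exited' if it stopped by itself, 'killed' if it hung."""
    proc = subprocess.Popen(cmd)
    wait_s = grace  # the first wait is longer so the program can start listening
    while True:
        try:
            proc.wait(timeout=wait_s)  # waits on the process handle
            return "exited"
        except subprocess.TimeoutExpired:
            pass  # still running, so check that it still answers
        if not is_alive(host, port):
            proc.kill()
            proc.wait()
            return "killed"
        wait_s = interval


def supervise(cmd, host, port, interval=5.0, grace=10.0):
    """Restart cmd for as long as it keeps having to be killed."""
    while run_once(cmd, host, port, interval, grace) == "killed":
        pass
```

Whether a program that exits by itself should also be restarted is a policy choice; this sketch stops supervising in that case.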

 


Inside the program I have a thread that runs independently and monitors the timing of a cyclical event, let's call it the end of the processing sequence (a processing cycle normally lasts around 150 ms).

In all my programs I generally work with events: the application must react to a given event (reading data from a PLC, receiving an image from a camera, etc.). If the event does not arrive, the thread checks the device (or devices) from which input is expected to verify that it is online (if no "keep alive" is possible). Then, every so often, I launch a simulated event to verify that the processing chain is up and running.

 

All devices that have the capability (including software "devices", for example third-party applications) MUST send a "keep alive" to the application (and vice versa) to confirm that they are online and operating. The keep alive can travel over TCP or UDP, via physical inputs, or through various other mechanisms (for example, some third-party applications expose COM object events).
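
As a rough illustration of the keep-alive bookkeeping, here is a small Python sketch that records when each device last reported in and lists the ones that are overdue. The class name `KeepAliveMonitor`, the device names and the timeout are hypothetical; how the keep alive actually arrives (TCP, UDP, physical input, COM event) is left out.

```python
import time
from typing import Callable, Dict, List


class KeepAliveMonitor:
    """Tracks the last keep-alive time per device (illustrative sketch)."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def keep_alive(self, device: str) -> None:
        """Call this whenever a keep alive from the device is received."""
        self._last_seen[device] = self._clock()

    def overdue(self) -> List[str]:
        """Return the devices whose last keep alive is older than the timeout."""
        now = self._clock()
        return sorted(d for d, seen in self._last_seen.items()
                      if now - seen > self._timeout_s)
```

The monitoring thread would call `overdue()` on each cycle and treat any returned device as offline.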

 

In addition to all this, the thread also checks the memory used by the program to verify that it does not exceed certain limits (when a program runs for months without being shut down, even a single byte leaked "every now and then" becomes a problem), as well as the temperature of the CPU package. Where possible, if a high temperature (> 92 degrees) persists, it slows down the entire process so that the PC's heat dissipation can be checked for anomalies.

 

The application is launched from a ".CMD" batch file, which checks the application's return value and relaunches it if the value is different from $127 (for example).

The monitoring thread, or any unhandled exception, calls EXITPROCESS($0) or sets the EXITCODE to $0 (depending on how severe the detected anomaly is).

A normal exit of the program instead sets the EXITCODE to $127.

The same monitoring thread sends an email and writes a log entry to report the anomaly.
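
The relaunch loop of the batch file can be sketched in Python as follows. This is not the author's .CMD file; the function name `run_until_normal_exit` and the `restart_delay` and `max_restarts` parameters are assumptions added for illustration. The sketch uses decimal 127 as the "normal exit" value; if `$127` is meant as Delphi hex notation, the value would be 0x127 (295) instead.

```python
import subprocess
import time


def run_until_normal_exit(cmd, normal_exit_code=127, restart_delay=1.0, max_restarts=None):
    """Run cmd repeatedly until it returns normal_exit_code; return the restart count."""
    restarts = 0
    while True:
        code = subprocess.call(cmd)  # blocks until the application exits
        if code == normal_exit_code:
            return restarts  # deliberate shutdown: do not relaunch
        if max_restarts is not None and restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts (last exit code {code})")
        restarts += 1
        time.sleep(restart_delay)  # brief pause to avoid a tight crash loop
```

Any exit code other than the agreed "normal" value, including a crash, leads to a relaunch, which matches the convention described above.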


When I had an error that caused my threads to lock up, I simply created a watchdog thread until the issue was actually found and fixed. Its only job was to query the other threads and, if one was found unresponsive, dump its remaining work queue, force-close it, restart it and reload the work queue.

 

The signal was a simple boolean called “alive”. The watchdog set it to false for each thread, and each thread set it back to true while processing its queue. If the variable was still false after 5 seconds (processing an item took < 1 s), the thread was considered hung.
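
The "alive" flag scheme could look roughly like this in Python. The names `Heartbeat`, `watchdog` and `on_hang` are made up for the sketch, and since Python threads cannot be force-closed, it only detects a hung thread and reports it through a callback; dumping and reloading the work queue is left out.

```python
import threading
from typing import Callable, Dict


class Heartbeat:
    """Per-thread 'alive' flag shared between a worker and the watchdog."""

    def __init__(self) -> None:
        self.alive = True


def watchdog(heartbeats: Dict[str, Heartbeat], timeout_s: float,
             on_hang: Callable[[str], None], stop: threading.Event) -> None:
    """Clear every flag, wait timeout_s, then report each thread that did not set it again."""
    while not stop.is_set():
        for hb in heartbeats.values():
            hb.alive = False
        if stop.wait(timeout_s):  # returns True when shutdown was requested
            return
        for name, hb in heartbeats.items():
            if not hb.alive:
                on_hang(name)  # the worker never set its flag back: considered hung
```

Each worker sets `heartbeat.alive = True` once per processed item, and the watchdog runs in its own thread with a timeout comfortably longer than one item takes (5 seconds against under 1 second in the post).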

 

You can implement the same logic between applications too, using window messages, TCP or memory mapped files as your signal.

On 2/20/2024 at 6:56 AM, aehimself said:

The signal was a simple boolean called “alive”. The watchdog set this to false at each thread, and each thread set it to true within processing the queue. If the variable is false after 5 seconds (processing an item was < 1 s) it was considered hanging.

Yes, for one of our applications we do a similar thing, using TEvent to watchdog the driver threads, which works well for threads we create ourselves.

However, the problem is a lockup of the main thread, which also needs to be monitored, with the whole application restarted if it gets into that locked state.

As you suggest, this is to keep the program running while any such issues are resolved. It can take time to narrow down what's going wrong and where, hence the need for this type of watchdog on the main thread.

 

35 minutes ago, bdw_nz20 said:

It can take time to narrow down what's going wrong and where, hence the need for this type of watchdog on the main thread.

Sure; this is why I added that the same can be implemented between applications. You have a watchdog application which queries your main application via TCP, window messages, mapped files, etc.

 

 The logic is the same, but instead of a watchdog thread you have a separate executable, and instead of a Boolean / TEvent the signaling channel is different.

 

I personally would go with a window message which your main program has to reply to. Easy to implement on both sides: use SendMessageTimeout in the watchdog and one extra method in the main program.
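
A hedged sketch of the watchdog side, using Python and `ctypes` rather than Delphi. It pings the main window with `WM_NULL`, which only proves that the message loop of the main thread is still pumping; a custom message handled by an extra method in the main program, as suggested above, would be the closer match. The function name `is_window_responsive` and the 5 second default are assumptions, and the window handle has to be obtained separately (for example with `FindWindowW`).

```python
import ctypes
import sys

WM_NULL = 0x0000           # no-op message, used here as a ping
SMTO_ABORTIFHUNG = 0x0002  # return early if the target thread is considered hung


def is_window_responsive(hwnd: int, timeout_ms: int = 5000) -> bool:
    """Return True if the window's thread processed the message within timeout_ms."""
    if sys.platform != "win32":
        raise OSError("SendMessageTimeoutW is only available on Windows")
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    send = user32.SendMessageTimeoutW
    send.argtypes = [
        ctypes.c_void_p,                  # HWND
        ctypes.c_uint,                    # UINT Msg
        ctypes.c_size_t,                  # WPARAM
        ctypes.c_ssize_t,                 # LPARAM
        ctypes.c_uint,                    # fuFlags
        ctypes.c_uint,                    # uTimeout in milliseconds
        ctypes.POINTER(ctypes.c_size_t),  # PDWORD_PTR lpdwResult
    ]
    send.restype = ctypes.c_ssize_t       # LRESULT; zero means failure or timeout
    result = ctypes.c_size_t(0)
    ok = send(hwnd, WM_NULL, 0, 0, SMTO_ABORTIFHUNG, timeout_ms, ctypes.byref(result))
    return ok != 0
```

The watchdog would call this periodically and, after one or more failed checks, terminate and relaunch the main program.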

3 hours ago, aehimself said:

which queries your main application via TCP, window messages, mapped files, etc.

Ah right, I see what you meant now.
