Automatically killing a service when stuck

Thijs van Dien · September 28, 2020

I'm writing a Windows service, the continued operation of which is important. To deal with crashes, the OS can be configured to automatically restart it. What may also happen, however, is it just getting stuck somehow. I've already seen this happen when network calls go wrong; once OpenSSL got into a faulty state and blocked forever. Now I'm looking for the best way to terminate the process when it seems like it's no longer doing any work for too long. A few options I can think of:

1. Doing the work in a separate thread that is monitored from the main thread

2. Doing the work in a child process that is monitored by the parent

3. Doing the work in the parent process that is monitored by a child

4. Doing the work in a process that is scheduled to be killed by Windows

First of all, I'm not sure if I need process isolation or not. Can one thread cause the whole process to hang, i.e. all threads, including those it has no interaction with? Process isolation might be ideal, but it also brings complications (e.g. logging to the same file) that I could do without.

If 1 is not good enough, do I really need to go for 2 or 3, or could 4 work well enough? If a scheduled timer-queue timer fires to terminate the process when it is stuck, does the fact that it runs in a Windows-owned thread make it any different (in terms of reliability) from 1? Would it still work when the process is completely stuck? Is there another, better way to let Windows handle the situation?

Between 2 and 3, is there any meaningful difference?

Edited September 29, 2020 by Thijs van Dien

Darian Miller · September 29, 2020

I like using child processes. The main windows service process starts two processes: a task launcher and a server updater.

1) Windows service process:

Launcher thread simply runs the child task-launcher and optionally the server-updater processes.

Main process thread waits for service stop/shutdown signal. If detected, signal its launcher thread to terminate. Terminate the windows service process once launcher thread terminates (or timeout reached and child processes are forcibly terminated.)

2) Task-launcher process:

For each externally configured task worker process to be launched:

Look for an update of the task worker executable. If an update is available, install it (typically just a simple .exe copy). If there is no update, or after the update is completed, launch the task worker process and wait for it to terminate. If the task worker process terminates and the service isn't shutting down, repeat. (This gives auto-update of worker executables.)

If task-launcher is signalled to terminate, then signal all child applications to terminate (send a close message.)
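
As a rough illustration of the task-launcher loop described above (update, launch, wait, repeat), here is a minimal Python sketch. Python is used only for illustration; the function names `install_update_if_any` and `launch_loop`, the `shutting_down` event and the `max_runs` cap are assumptions made for this example and are not taken from the original design.

```python
import os
import shutil
import subprocess


def install_update_if_any(exe_path, new_version_dir):
    """Copy a pending new version over the current executable.

    Returns True if an update was found and installed, False otherwise.
    """
    candidate = os.path.join(new_version_dir, os.path.basename(exe_path))
    if not os.path.isfile(candidate):
        return False
    shutil.copy2(candidate, exe_path)  # "typically just a simple .exe copy"
    os.remove(candidate)               # the update has been consumed
    return True


def launch_loop(command, exe_path, new_version_dir, shutting_down, max_runs=None):
    """Update, launch, wait, repeat until `shutting_down` is set.

    `shutting_down` is any object with is_set(), e.g. a threading.Event.
    `max_runs` (hypothetical) only exists so the loop can be bounded in tests.
    Returns the number of times the worker was launched.
    """
    runs = 0
    while not shutting_down.is_set() and (max_runs is None or runs < max_runs):
        install_update_if_any(exe_path, new_version_dir)
        worker = subprocess.Popen(command)
        worker.wait()  # returns when the worker exits for any reason
        runs += 1
    return runs
```

Because the loop relaunches the worker whenever it exits, killing a stuck worker by any means is enough to get a fresh instance, as long as the shutdown event has not been set.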

3) Optional server-update task process:

Each task-worker executable is in its own directory. The job of server-update is to find/download any update to any task-worker executables. It simply puts the new executables in a predetermined newversion folder.

4) Every child process (including the task-launcher, server-update, and all custom task-worker executables)

Single task launcher thread simply looks for any configured task threads and launches child threads as needed.

Main thread also looks for new executable version, if found then signal task launcher thread to terminate. Automatically terminate the process once launcher thread terminates (or timeout reached and child threads forcibly terminated.)

Automatically terminate process if close message received after signaling task launcher thread to terminate.

These can be simple VCL applications with a main form that has a Start / Stop button (which is only used when running in debug mode within the IDE, or standalone outside of the IDE and outside the Windows service) and a timer that looks for updates.

Benefits:

The Windows service has very minimal code and never needs to be updated, which is nice because updating service processes is much more painful.

The task-launcher process also has very little code and rarely needs to be updated.

The optional server-update code is isolated to a single stand-alone executable and separated from task-worker custom code. (Could also move server-updates to a centralized management server and just copy new task-worker .exe to new version folder as needed.)

The much more frequently changing task-worker code is isolated to stand-alone executables, and these executables have no Windows service, process launching, or server-update plumbing code (other than knowing where to look for a new version that is ready to be installed.)

Each task-worker can be built, debugged and tested just like any standard VCL type application.

So if you have many middle-tier application Windows Server machines to manage, each machine can run identical windows-service, task-launcher, and server-update executables. You decide on which machines to run the various simple-to-create task-worker executables as needed. You can make changes on the fly by editing external configuration files (adding new task-worker executables, for example) and never have to restart the Windows service. This system worked great for years. The only real trick is that your worker threads should periodically check for Terminated. If you don't code worker threads in a responsive way, they can be forcibly aborted in the middle of a task if the server is being shut down and your timeout is exceeded (which is likely no different from a current problem that has to be managed somehow.) Given the stability of Windows Server these days, the only time you have to restart the machine is when a Windows Update requires it, so that's also the only time this Windows service will ever need to be restarted (it's a very good day indeed when you no longer have to stop a Windows service to do some sort of application update.)

You can also deal with 'stuck' task-worker executables by simply putting a copy of the executable into the newversion folder and the application will self-terminate and auto-restart based on your timeout preference. (Or simply use Windows task manager to kill the custom task-worker process and it will be automatically relaunched by the task-launcher.)

Kas Ob. · September 29, 2020

Darian gave a nice and detailed suggestion for child processes, but since you seem to prefer 1 over the others, I will explain that 1 might work nicely too, and that there is even an approach that is none of them.

7 hours ago, Thijs van Dien said:

1. Doing the work in a separate thread that is monitored from the main thread

This can work, and when that thread goes rogue you can just suspend it and leave it, as long as you contain its damage, that is, which handles it holds and what state they are in. I assume you are aware of the impact of this, not only on the handles but also on the memory that will be reported as leaks. On the other hand, this is the shorter way, as those leaks will be recorded no matter which approach you use.

So you can simply suspend it or kill it. Both suspending and killing are remedies for the symptoms and will not work for long. For example, you can't be sure which handles and internally allocated resources in OpenSSL are held by that thread, so both are really dangerous and might have unpredictable behaviour.

2, 3 and 4 all involve creating a process and restarting, and to work around the problems of 1 (explained above) you need to restart yourself too. So here is another approach; in my opinion it might enhance 1 a lot, and theoretically it is the shortest.

Use "Application Recovery and Restart"

https://docs.microsoft.com/en-us/windows/win32/recovery/using-application-recovery-and-restart

https://docs.microsoft.com/en-us/windows/win32/recovery/registering-for-application-recovery

And if on top of that you face the ghost window, when the system sees your application as not responding, you can add DisableProcessWindowsGhosting

https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-disableprocesswindowsghosting

Here you can either combine those with 1 or not: just register for recovery and restart, then handle the callback to do whatever you need for the new process. You can even start a new process from the callback, pass the data to it through shared memory, and then exit.

Just remember that these recovery and restart callbacks come from system threads, so make sure your main thread or the faulty thread is suspended, to avoid wasting time trying to understand what is going on in some bad, hard-to-reproduce cases.

FPiette · September 29, 2020

8 hours ago, Thijs van Dien said:

Doing the work in a child process that is monitored by the parent

I am using that solution with great success.

1) The main service launches a child process which does the work.

2) The main service is notified by Windows when its child exits (you can have a thread waiting for process termination).

3) The main service checks the child process's health by sending a request using IPC. In my case the child service is accessed by clients, so the main service uses the same code to connect to the child process. If the connection fails, the main service kills the child process.

4) When the child service terminates gracefully, it sets a flag so that the main service does not restart it but also closes. The flag is actually a file created by the main service and deleted by the child process when terminating properly. If the main service detects that the file is still there when the child process terminates, then the child process crashed.
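
The flag-file part of this scheme (steps 1, 2 and 4) could look roughly like the Python sketch below. It is only an illustration under assumptions: the names `run_child_once` and `supervise`, the `max_restarts` cap, and passing the flag path as the child's last command-line argument are all invented for the example, and the IPC health check of step 3 is left out.

```python
import os
import subprocess


def run_child_once(child_args, flag_path):
    """Run the child to completion and report how it ended.

    Returns "clean" if the child deleted the flag file before exiting,
    otherwise "crashed" (meaning the parent should restart it).
    """
    # The parent creates the flag; only a properly terminating child removes it.
    with open(flag_path, "w") as flag:
        flag.write("running")
    # Assumption: the child learns the flag location from its last argument.
    child = subprocess.Popen(list(child_args) + [flag_path])
    child.wait()  # in a real service a dedicated thread would block here
    if os.path.exists(flag_path):
        os.remove(flag_path)
        return "crashed"
    return "clean"


def supervise(child_args, flag_path, max_restarts=3):
    """Restart the child after each crash; stop once it exits cleanly.

    Returns the list of outcomes, one per run.
    """
    outcomes = []
    for _ in range(max_restarts + 1):
        outcome = run_child_once(child_args, flag_path)
        outcomes.append(outcome)
        if outcome == "clean":
            break
    return outcomes
```

A child that is killed by the parent after a failed health check never gets to delete the file, so it is treated exactly like a crash and restarted.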

Fr0sT.Brutal · September 29, 2020

1 hour ago, Kas Ob. said:

Use "Application Recovery and Restart"

https://docs.microsoft.com/en-us/windows/win32/recovery/using-application-recovery-and-restart

https://docs.microsoft.com/en-us/windows/win32/recovery/registering-for-application-recovery

AFAIU this mechanism only deals with GUI apps not services and requires user interaction anyway?

Kas Ob. · September 29, 2020

4 minutes ago, Fr0sT.Brutal said:

1 hour ago, Kas Ob. said:

Use "Application Recovery and Restart"

https://docs.microsoft.com/en-us/windows/win32/recovery/using-application-recovery-and-restart

https://docs.microsoft.com/en-us/windows/win32/recovery/registering-for-application-recovery

AFAIU this mechanism only deals with GUI apps not services and requires user interaction anyway?

No, they work everywhere, even in console applications. From the Remarks section of https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-registerapplicationrecoverycallback

Quote

A console application that can be updated uses the CTRL_C_EVENT notification to initiate recovery (for details, see the HandlerRoutine callback function). The timeout for the handler to complete is 30 seconds.

Notice that they are declared in the kernel, not in the user shell.

Kas Ob. · September 29, 2020

As for the ghost window: in the case of a Windows service it does not affect any window per se, but the process will be shown in Task Manager as "(not responding)" or something like that; I can't remember exactly.

Fr0sT.Brutal · September 29, 2020

3 hours ago, Kas Ob. said:

No, they work everywhere, even in console applications. From the Remarks section of https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-registerapplicationrecoverycallback

I've never seen a hung console app that couldn't be closed. And while a service could of course hang, the description of these functions always mentions the WER dialog and the user clicking a button.

Quote

If you register for restart and the application encounters an unhandled exception or is not responsive, the user is offered the opportunity to restart the application; the application is not automatically restarted without the user's consent

Edited September 29, 2020 by Fr0sT.Brutal

Thijs van Dien · September 29, 2020

Thank you for the surprisingly large number of responses so far. To clarify: my intention in every case is to let Windows restart the service, so cleaning up after handles and all should not be an issue. What I'm looking for, is the simplest solution that can be depended on to get that done. Using child processes for all the work means quite a big refactoring that I'm willing to do only if it is actually more robust than 1 or 4 (easy to implement), which I'm still not sure of. I don't really need any of the other benefits they provide.

A.M. Hoornweg · September 29, 2020

4 hours ago, FPiette said:

I am using that solution with great success.

1) The main service launches a child process which does the work.

2) The main service is notified by Windows when its child exits (you can have a thread waiting for process termination).

3) The main service checks the child process's health by sending a request using IPC. In my case the child service is accessed by clients, so the main service uses the same code to connect to the child process. If the connection fails, the main service kills the child process.

4) When the child service terminates gracefully, it sets a flag so that the main service does not restart it but also closes. The flag is actually a file created by the main service and deleted by the child process when terminating properly. If the main service detects that the file is still there when the child process terminates, then the child process crashed.

I do it slightly differently:

- The main process launches a child process for the work.

- The child process creates a mutex.

The service application knows if the child process is still running by periodically checking the mutex.

- The child process has a separate thread called tWatchdogThread.

- The main thread must call "tWatchdogThread.Notify" regularly to prove it's still alive.
- If that sign of life stays out for more than 10 seconds, the watchdog assumes the main thread has crashed and ends the process using TerminateProcess().

This mechanism has proved to be extremely reliable, I use it often.
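
The watchdog-thread idea can be sketched in a few lines. The version below is written in Python purely as an illustration; the class name and `notify` mirror the names in the post, while `poll_interval`, `on_timeout` and `stop` are assumptions added for the example, and `os._exit(1)` merely stands in for TerminateProcess(). Note also that in CPython a worker stuck inside a native call that holds the global interpreter lock would stall this thread too, so the sketch shows the shape of the mechanism rather than a guarantee.

```python
import os
import threading
import time


class WatchdogThread(threading.Thread):
    """Ends the process when notify() has not been called for `timeout` seconds."""

    def __init__(self, timeout=10.0, poll_interval=0.5, on_timeout=None):
        super().__init__(daemon=True)
        self._timeout = timeout
        self._poll_interval = poll_interval
        # Default action: leave immediately without running cleanup handlers.
        self._on_timeout = on_timeout or (lambda: os._exit(1))
        self._last_seen = time.monotonic()
        self._lock = threading.Lock()
        self._stopped = threading.Event()

    def notify(self):
        """Called regularly by the worker to prove it is still alive."""
        with self._lock:
            self._last_seen = time.monotonic()

    def stop(self):
        """Disarm the watchdog, e.g. during a normal shutdown."""
        self._stopped.set()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() was called.
        while not self._stopped.wait(self._poll_interval):
            with self._lock:
                silent_for = time.monotonic() - self._last_seen
            if silent_for > self._timeout:
                self._on_timeout()
                return
```

The worker would call `notify()` once per unit of work; the only state shared between the two threads is a timestamp behind a short-lived lock, which keeps the watchdog independent of whatever the worker is blocked on.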

Kas Ob. · September 29, 2020

30 minutes ago, Fr0sT.Brutal said:

I've never seen a hung console app that couldn't be closed.

You are missing the point here, let me explain

1) A console application can be running under a different user or in an unreachable console session, preventing the OS from ending that session or from exiting, for example when restarting or cleaning the user profile, etc.

2) Hanging is not the only problem that needs interaction from WER. On the contrary, WER will be invoked for many exceptions escaping from a running process (no matter what type of process it is). This is where those APIs are useful: they give some control over the WER behavior while giving the process a chance to clean up (like closing files) and exit, or maybe recover by terminating a long-running background calculation thread, restart, and so on.

Fr0sT.Brutal · September 29, 2020

2 hours ago, Kas Ob. said:

1) A console application can be running under a different user or in an unreachable console session, preventing the OS from ending that session or from exiting, for example when restarting or cleaning the user profile, etc.

From what I've read about these functions, there are always timeouts for close/system shutdown notifications. So the OS will do what it intended to, regardless of the app's reaction.

2 hours ago, Kas Ob. said:

2) Hanging is not the only problem that needs interaction from WER. On the contrary, WER will be invoked for many exceptions escaping from a running process (no matter what type of process it is). This is where those APIs are useful: they give some control over the WER behavior while giving the process a chance to clean up (like closing files) and exit, or maybe recover by terminating a long-running background calculation thread, restart, and so on.

Yes, this is clear. It's just not clear from the description that these functions could help with a non-interactive service without any user action.

Lars Fosdal · September 29, 2020

Here is a very simplistic mechanism we use in production to have an application or service restart itself. No second process.

We use it for our services that run 24/7/365 to do, for example, restarts for upgrades (rename the running .exe, copy the new .exe to the same name, prepare the restart and exit the service).

https://pastebin.com/YCiqiNAq

Thijs van Dien · September 29, 2020

@Lars Fosdal The issue isn't really the restarting itself, but making sure nothing (except for continued operation) will stop that code from running.

Kas Ob. · September 29, 2020

@Fr0sT.Brutal your insistence that those will not work made me doubt my information, so I built a small test app and tried it. It showed the WER dialog asking to close or wait, and only after clicking close did the recovery process start. But I was sure of what I knew, so I searched my backups for old projects and test projects, and found them along with my notes. So what is going on?

You are right!
Windows has changed over time. In my notes I wrote a reminder to disable the notification for Windows Error Reporting using the Group Policy editor, so I went there and found two settings instead of one in Windows 10 now: one is legacy, for Windows XP and Server 2003, and the other should work with all versions. Setting "Prevent display of the user interface for critical errors" to enabled makes the interface appear and close without interaction, and without calling or starting the recovery process! This should not happen, and it might be a bug in WER.
It should call the recovery process, and it should not display the error notification window and then close it without interaction. This behaviour has changed in Windows; I can't tell in which version, but I am sure of how I used it in the past. For many years now I have relied on EurekaLog.

Lars Fosdal · September 29, 2020

3 hours ago, Thijs van Dien said:

@Lars Fosdal The issue isn't really the restarting itself, but making sure nothing (except for continued operation) will stop that code from running.

What are the pitfalls that can cause the service to stop running? If the service stops hard, Windows can restart it, but if it becomes unresponsive - that is harder to deal with. What if it is partially unresponsive? Is it safe to kill it?

I'd prefer to know the risk factors here. Normally, healthy code does not suddenly stop running?

Thijs van Dien · September 29, 2020

It is always safe to kill, but that should only happen when it appears no work is being done for too long. As for the cause, I don't want to make any assumptions. In the worker thread, "anything" can happen. Part of my question is whether another thread (with which there is no interaction) is a safe enough place to monitor the worker and potentially kill the whole process, or whether I need stronger isolation. Could something nasty happening in one thread cause issues for the whole process (not crashing but effectively blocking it completely)? I don't know; that's why I'm asking. And if so, is a Windows-owned thread acting in the context of that same single process any safer? Again, I'm not familiar enough with Windows internals.

Edited September 29, 2020 by Thijs van Dien

Lars Fosdal · September 29, 2020

Quote

"anything" can happen

No, it can't. Only what you decide should happen, can happen.

Quote

Could something nasty happening in one thread cause issues for the whole process (not crashing but effectively blocking it completely)?

Only if it was badly written, or there were no measures in place to regulate resource consumption (number of threads, memory, limited resources).

On killing threads - often there is a risk to just kill a thread that is doing some sort of processing, particularly if that processing stores or sends data somewhere.

Killing it could lead to memory leaks or limited resource leaks as it would be terminated without cleaning up after itself.

A watchdog needs to be aware of what it is watching. It is kinda pointless to restart a process that is not processing, if there is nothing to process.

But since we have zero clue as to what kind of processing this mystery service will perform, there is little use in speculating on how to handle "anything nasty" that could happen.

Thijs van Dien · September 29, 2020

If the worker were known to be well-behaved, there would be no need for such monitoring to begin with. There's networking, third party libraries, DLLs and what not. I can't know exactly what could cause a freeze; only that it will be rare, and restarting will be an effective way to get on with life. Sort of the 'Let It Crash' philosophy. I want to treat the worker as a blackbox. The only requirement is it will regularly report that it's still functional, and if it doesn't for too long (because it is stuck for whatever reason) the whole process is to be killed—not just threads.

Edited September 29, 2020 by Thijs van Dien

Lars Fosdal · September 29, 2020

"Let it crash" is not on my list of stability strategies.

A.M. Hoornweg · September 30, 2020

9 hours ago, Lars Fosdal said:

"Let it crash" is not on my list of stability strategies.

I had situations where an ADO connection to MS SQL Server would just "lock up" whilst executing a SQL statement and the call never returned. No exception, no error, just a lockup. I suspect it had something to do with our IT department's scheduled backup job saving the SQL Server VM because it always happened around that time of night.

Since lockups like that are out of the programmer's control, all I could do was design a workaround and it turned out that a watchdog thread worked brilliantly.

Fr0sT.Brutal · October 6, 2020

On 9/29/2020 at 11:01 PM, Lars Fosdal said:

Only if it was badly written, or there were no measures in place to regulate resource consumption (number of threads, memory, limited resources).

For example, memory manager corruption caused by a background thread could make the whole app go insane.

On 9/29/2020 at 11:51 PM, Lars Fosdal said:

"Let it crash" is not on my list of stability strategies.

Full control over the whole stack is a utopia; you write code that runs 3rd party code that runs tons of RTL code that runs tons of OS DLL functions that run hardware drivers... And all these levels could cause a deadlock. It's frequently better to shut the service down and restart it, with proper logging of course, than to just hang waiting for a dev to come. Think of it as a network socket that couldn't connect. Just reconnect and things will be OK.

Lars Fosdal · October 6, 2020

15 minutes ago, Fr0sT.Brutal said:

Full control over the whole stack is a utopia

Absolutely. But nevertheless a goal. You need to KNOW the challenges with each component, and know how to work around them. Restarting is an option, but it should be a choice, not a result of a crash.
As for "A network socket that couldn't connect" - Do you have to restart your services for that reconnect to happen?

Thijs van Dien · October 6, 2020

I am missing the point of this discussion. As has been said, it is impossible to foresee all states the service could end up in. When it appears that things are not working correctly, the priority is to get back into a known good state. Terminating the service and having it restarted by the service manager (Windows) is a means to do so. Yes, it should be logged and investigated later to hopefully prevent it in the future, but some incidents are rare enough that dealing with them in this way is acceptable. If you disagree with that, go tell the Erlang people they got it all wrong. I'm just looking for the best way to implement it.

Although I still feel that the stronger isolation of processes might be desirable, for now I am going to count on threads being good enough. Here's how my dead man's switch looks right now: when the service is started, I use CreateTimerQueueTimer with a callback that the worker is supposed to continuously prevent from executing by means of ChangeTimerQueueTimer. If it does go off, the callback first sets an event to give the main thread (service loop) a chance to log what's happening and exit relatively cleanly. If that doesn't happen within 5 seconds, because apparently the main thread got into a faulty state as well, ExitProcess(1) is called. My assumption is that because this third thread only calls three "cheap" Windows APIs (SetEvent, Sleep, and ExitProcess) directly, things would have to be very, very bad for that to fail. And short of some accidental denial of service attack on the thread pool, I can't really think of anything that would keep it from running at all.
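
The two-stage escalation described above (signal the main loop first, force the exit only if it does not react) can be sketched as follows. This is an illustrative Python analogue under assumptions, not the actual implementation: `threading.Timer` stands in for CreateTimerQueueTimer / ChangeTimerQueueTimer, `os._exit(1)` stands in for ExitProcess(1), and the class name `DeadMansSwitch`, the `main_exited` event and the `hard_exit` hook are invented for the example.

```python
import os
import threading


class DeadMansSwitch:
    """Resettable timer: asks for a clean shutdown first, then forces an exit."""

    def __init__(self, timeout, grace=5.0, hard_exit=None):
        self._timeout = timeout
        self._grace = grace
        self._hard_exit = hard_exit or (lambda: os._exit(1))
        self.shutdown_requested = threading.Event()  # the main loop waits on this
        self.main_exited = threading.Event()         # the main loop sets this when done
        self._lock = threading.Lock()
        self._timer = None

    def reset(self):
        """Called by the worker; pushes the deadline `timeout` seconds out."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def cancel(self):
        """Disarm the switch, e.g. on a regular service stop."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None

    def _fire(self):
        # Stage 1: give the main loop a chance to log and exit on its own.
        self.shutdown_requested.set()
        # Stage 2: if it does not confirm within the grace period, force the exit.
        if not self.main_exited.wait(self._grace):
            self._hard_exit()
```

The worker calls `reset()` after each unit of work; the service loop waits on `shutdown_requested`, logs, and sets `main_exited` on its way out. If the service loop is itself stuck, the grace period expires and the hard exit runs from the timer thread.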

Edited October 6, 2020 by Thijs van Dien

Angus Robertson · October 6, 2020

While services should be stopped by the service manager, this won't work if the main thread in the service has stopped and messages are not being processed.

So you need a backup for when repeated stop attempts with timeouts fail: the equivalent of End Task in Task Manager, which is sometimes needed to stop non-responding applications. I've done this with TerminateProcess, which needs a process handle, which you can get from a process ID, which requires searching the process list to match exe names. I did this from a second service that monitored the first, making sure the message queue was working and a few other things. The second service also sent emails so this could be checked manually to make sure it restarted OK.

The service manager should already be set up to immediately restart a stopped service, so that part is easy once it stops.

Angus
