How do you stop a background process?

2024 April 2

While I was working on Redirectory last year, I often ran a Conan package server and tcpdump simultaneously while I reverse engineered the server API. (The API is undocumented, and I felt this would be faster than trying to investigate the code.) To help me, I wrote a script that would run them as background processes and then drop me into a shell to run interactive Conan commands. After the subshell exited, the script would stop the background processes before continuing. Or rather, it would try to stop them.

The pattern that I expected to work was to send the child processes SIGTERM and then wait for their exit. I would hit a series of wrinkles in this plan.
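A minimal sketch of that expected pattern, assuming bash and using sleep as a stand-in for the real server (the variable names are illustrative only):

```shell
#!/usr/bin/env bash
# Start a stand-in for a long-running server; `sleep` is only a placeholder.
sleep 300 &
child=$!

# ... the interactive subshell would run here ...

# Ask the child to terminate, then block until it has actually exited.
kill -TERM "$child"
status=0
wait "$child" || status=$?

# A process killed by a signal is reported as 128 plus the signal number.
echo "child exited with status $status"
```

With bash on Linux, where SIGTERM is signal 15, the reported status should be 143. This works for a single process with no children of its own, which is exactly the assumption the wrinkles below break.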

Killing a child

Wrinkle the first: killing a child process might not kill its children. The problem child was npm. I would run Redirectory via npm start. npm was my child process, and it would start a grandchild sh process that would start a great-grandchild node process that was the server. Sending SIGTERM to the npm process would terminate it and its child, but not the great-grandchild, which would hold onto the port. npm would completely ignore SIGINT. I could stop it with SIGKILL, but that would orphan its child.

I suspect that well-behaved programs have a general responsibility to propagate terminating signals to their children, and that npm was misbehaving in this regard, but I do not know for sure. Either way, npm used to propagate these signals, but it no longer did. Comments on the issue reporting this suggest that newer versions of npm have restored the old behavior, but I have not checked.

The fix I arrived at was to kill the process group of the npm process.
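Signaling a process group can be sketched as below. The helper name signal_group is hypothetical, and the sketch assumes bash, a ps that supports `-o pgid=`, and a child that is not in the caller's own process group:

```shell
#!/usr/bin/env bash
# Send a signal (default TERM) to the whole process group that PID $1 belongs to.
# `signal_group` is an illustrative name, not part of any tool.
signal_group() {
  local pid=$1 sig=${2:-TERM} pgid
  # `ps -o pgid=` prints only the process group ID, without a header.
  pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
  [ -n "$pgid" ] || return 1
  # A negative PID addresses a process group; `--` stops option parsing.
  kill -s "$sig" -- "-$pgid"
}

# Hypothetical usage: signal_group "$npm_pid" TERM
```

The signal reaches every process still in that group, so a great-grandchild like the node server receives it even when the processes in between do not forward it.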

Killing its family

Wrinkle the second: the process tree rooted by a child process might not share exactly one process group. The problem child this time was sudo, as in sudo tcpdump.

To be fair, it wasn't a problem in practice. sudo would launch its command in a different process group, but it would also relay SIGINT and SIGTERM signals to its children, as I expected a reasonable program to do, so that killing the group of the root process effectively killed its entire tree.

But it could have been a problem in theory. What if it ignored signals like npm? I was searching for a 100% reliable, fool-proof method for terminating all processes spawned directly and indirectly by a command, regardless of the specific command.

I guess I could have looked for a way to find the process groups of every process in the tree rooted by a child process, but for the moment I just stuck with killing the group of the child process.
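One possible way to collect those process groups is to take a snapshot of the process table and walk parent links. This is only a sketch: the name tree_pgids is hypothetical, it assumes bash, ps, and awk, and it misses any descendant whose parent has already exited (such a process is reparented and drops out of the tree). The snapshot can also go stale between the listing and the kill.

```shell
#!/usr/bin/env bash
# Print the distinct process group IDs found in the tree rooted at PID $1.
# `tree_pgids` is an illustrative name.
tree_pgids() {
  ps -A -o pid= -o ppid= -o pgid= | awk -v root="$1" '
    { ppid[$1] = $2; pgid[$1] = $3 }
    END {
      for (pid in ppid) {
        # Walk up the parent links until reaching the root, running out
        # of known parents, or hitting a step limit (a safety bound).
        p = pid
        n = 0
        while (p != root && (p in ppid) && n++ < 1000) p = ppid[p]
        if (p == root) seen[pgid[pid]] = 1
      }
      for (g in seen) print g
    }'
}

# Hypothetical usage: signal every group found under a child.
# for g in $(tree_pgids "$child_pid"); do kill -TERM -- "-$g"; done
```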

Killing its parents

Wrinkle the third: the child process might be in the same process group as the shell. When I killed a child process group in a terminal, it worked fine, but once I incorporated the technique into a script, it would kill the whole script.

This is how I learned about the monitor (-m) shell option. monitor enables job control. Beyond enabling built-ins such as fg and bg, it also puts each background job into its own process group. Interactive shells have monitor enabled by default, but non-interactive shells, like the ones that execute scripts, do not.

The fix here was to set the monitor option at the start of the script:

set -o monitor
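The effect can be observed by comparing process group IDs before and after setting the option. This sketch assumes bash and a ps that supports `-o pgid=`; the helper pgid_of and the variable names are illustrative, and sleep stands in for a real child.

```shell
#!/usr/bin/env bash
# Print the process group ID of PID $1.
pgid_of() { ps -o pgid= -p "$1" | tr -d ' '; }

script_pgid=$(pgid_of $$)

# Without monitor, a background child inherits the script's process group.
sleep 30 &
plain_pid=$!
plain_pgid=$(pgid_of "$plain_pid")
kill "$plain_pid"; wait "$plain_pid" 2>/dev/null || true

# With monitor, each background job becomes the leader of a new group,
# so its process group ID equals its own PID.
set -o monitor
sleep 30 &
job_pid=$!
job_pgid=$(pgid_of "$job_pid")
kill "$job_pid"; wait "$job_pid" 2>/dev/null || true

echo "script group:                $script_pgid"
echo "child group without monitor: $plain_pgid"
echo "child group with monitor:    $job_pgid (child PID $job_pid)"
```

In the first case, signaling the child's group also signals the script. In the second, the group contains only the job and whatever it spawned.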

Waiting for its children to die

Wrinkle the fourth: a shell cannot wait for its grandchildren, only its children. The wait built-in, when passed an ID for a process that is not a child, will act as though that process had already exited unsuccessfully. From the POSIX specification of wait:

If one or more pid operands are specified that represent unknown process IDs, wait shall treat them as if they were known process IDs that exited with exit status 127.
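This is easy to reproduce. The sketch below assumes bash and starts a grandchild through an intermediate sh, with sleep as a stand-in process:

```shell
#!/usr/bin/env bash
# Start a grandchild through an intermediate shell and capture its PID.
# Redirecting its output lets the command substitution return immediately.
grandchild=$(sh -c 'sleep 30 >/dev/null 2>&1 & echo $!')

# The grandchild is still running, but it is not a child of this shell,
# so wait returns at once instead of blocking.
status=0
wait "$grandchild" 2>/dev/null || status=$?
echo "wait returned $status"

# Clean up the stand-in process.
kill "$grandchild"
```

The status is 127 even though the process is alive, so the return value of wait says nothing about when a grandchild actually exits.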

I came across some ideas to get around this limitation, but I did not investigate any:

  • Get a file descriptor for the process with pidfd_open(), and wait for its exit with poll() (waitid() only works on child processes).
  • Periodically poll for the existence of the process with ps.
  • inotifywait a file descriptor opened by the process, e.g. stdout.
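The polling idea could look something like the sketch below. It is not something the post tested. The names is_running and wait_for_exit are hypothetical, and it assumes bash, a ps that supports `-o stat=` (procps and BSD ps do, POSIX does not require it), and a sleep that accepts fractional seconds.

```shell
#!/usr/bin/env bash
# Succeed while PID $1 exists and is not a zombie.
is_running() {
  local state
  state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
  case "$state" in
    ''|Z*) return 1 ;;  # no such process, or exited but not yet reaped
    *)     return 0 ;;
  esac
}

# Poll until PID $1 is gone, checking every 0.1 s for at most $2 seconds
# (default 10). Returns 1 if the process is still running at the deadline.
wait_for_exit() {
  local pid=$1 tries=$(( ${2:-10} * 10 ))
  while is_running "$pid"; do
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] || return 1
    sleep 0.1
  done
  return 0
}

# Hypothetical usage: wait_for_exit "$grandchild_pid" 5 || echo "still running"
```

Unlike `kill -0`, ps can see processes owned by another user, which matters for something started through sudo. Polling by PID is still racy: if the PID is reused by an unrelated process, the loop keeps waiting on the wrong one.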

Moving on

At the end of the day, the best method I found was to send SIGTERM to the process group of the child, and then wait on the child, but I'm not confident this will work reliably in all future cases. How do you do it? How are we supposed to do it?