Table of Contents

  1. Background
  2. Basic Concepts
  3. What Is Different in Docker
  4. Solution
  5. References

Background

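Our project uses a microservice architecture in which the components communicate through RESTful APIs. Each component runs in a Docker container. One of them uses Puppeteer to run various HTML files. Puppeteer is a Node.js library that provides an API for controlling Chromium headlessly (without a UI). Because this component creates many processes and many of them crash, we found a large number of zombie processes in its Docker container. This post summarizes the cause and the fix.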
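To confirm the symptom, you can look for processes in state Z. The sketch below is a hypothetical helper (parseStat and findZombies are our own names, not part of the original setup) that scans /proc on Linux and parses each /proc/<pid>/stat line, assuming the standard "pid (comm) state ppid ..." layout. It would be run inside the container, for example via docker exec.

```javascript
'use strict';
// Sketch: list zombie processes by reading /proc (Linux only).
const fs = require('node:fs');
const path = require('node:path');

// Parse one /proc/<pid>/stat line: "pid (comm) state ppid ...".
// comm may itself contain spaces or ')', so split on the LAST ')'.
function parseStat(line) {
  const open = line.indexOf('(');
  const close = line.lastIndexOf(')');
  if (open < 0 || close < open) return null;
  const rest = line.slice(close + 2).split(' ');
  return {
    pid: Number(line.slice(0, open).trim()),
    comm: line.slice(open + 1, close),
    state: rest[0],
    ppid: Number(rest[1]),
  };
}

// Return every process whose state is 'Z' (zombie), including its parent PID.
function findZombies(procRoot = '/proc') {
  const zombies = [];
  for (const entry of fs.readdirSync(procRoot)) {
    if (!/^\d+$/.test(entry)) continue; // only numeric directories are processes
    let stat;
    try {
      stat = fs.readFileSync(path.join(procRoot, entry, 'stat'), 'utf8');
    } catch {
      continue; // the process disappeared while we were scanning
    }
    const info = parseStat(stat);
    if (info && info.state === 'Z') zombies.push(info);
  }
  return zombies;
}

module.exports = { parseStat, findZombies };
```

The ppid field shows which process is failing to reap its children; ps -eo pid,ppid,stat,comm inside the container gives the same picture.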

Basic Concepts

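  • Zombie process (state Z): a state of a child process in which the child has exited but its parent has not called wait/waitpid to collect its status, so the child's process descriptor remains in the system.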

  • Orphan process: a state of a child process in which its parent has exited before it.

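  • The init process (PID 1): started by the kernel at the end of the bootstrap process. Orphan processes are adopted by init, which becomes the parent of every orphan. Can an orphan adopted by init become a zombie? No! init is written so that whenever one of its children terminates, it calls wait to fetch the child's termination status.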

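  • Why zombies are harmful: when a process exits, the kernel releases all of its resources (open files, memory, etc.) but keeps some information (PID, exit status, CPU time, etc.). That information is released only when the parent calls wait/waitpid; until then the PID stays in use. Since the number of PIDs the system can allocate is limited, a large number of zombies can prevent the system from creating new processes. A zombie is already dead, so it cannot receive any signal, and even kill cannot remove it. One possible fix is to kill the parent (with care...), so that the zombie becomes an "orphan", is adopted by init (PID 1), and is cleaned up when init calls wait on its zombie children.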
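The rules above (a child stays a zombie until its parent waits for it, orphans are reparented to PID 1, and a real init reaps its children) can be illustrated with a small toy model. This is not kernel code: ProcessTable and the initReaps flag are hypothetical names, and the model assumes PID 1 either reaps every terminated child (like a real init) or never reaps at all.

```javascript
'use strict';
// Toy model (NOT real kernel behaviour in every detail) of zombies, orphans
// and reaping by PID 1.
class ProcessTable {
  constructor({ initReaps }) {
    this.initReaps = initReaps; // does PID 1 behave like a real init?
    this.procs = new Map([[1, { ppid: 0, state: 'R' }]]);
    this.nextPid = 2;
  }

  spawn(ppid) {
    const pid = this.nextPid++;
    this.procs.set(pid, { ppid, state: 'R' });
    return pid;
  }

  // The process exits: resources are freed but its entry stays as a zombie.
  exit(pid) {
    this.procs.get(pid).state = 'Z';
    // Its children (running or zombie) become orphans and move to PID 1.
    for (const child of this.procs.values()) {
      if (child.ppid === pid) child.ppid = 1;
    }
    // A real init calls wait() whenever one of its children terminates.
    if (this.initReaps) this.reapChildrenOf(1);
  }

  // wait()/waitpid(): the parent collects exit status, removing zombie entries.
  reapChildrenOf(ppid) {
    for (const [pid, proc] of this.procs) {
      if (proc.ppid === ppid && proc.state === 'Z') this.procs.delete(pid);
    }
  }

  zombieCount() {
    let n = 0;
    for (const proc of this.procs.values()) if (proc.state === 'Z') n++;
    return n;
  }
}

module.exports = { ProcessTable };
```

For example, if PID 1 spawns a browser process whose two renderer children crash (the browser never waits for them) and the browser then crashes too, the model leaves three zombies when initReaps is false and none when it is true.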

What Is Different in Docker

Docker: by default a Docker container has no init process; the process with PID 1 is the container's main process (npm, Node.js, ...).

It turns out that Node.js is not able to receive signals and handle them appropriately if it runs as PID 1. By signals, I mean kernel signals such as SIGTERM, SIGINT, etc.
The following code wouldn't work at all if you run Node.js as PID 1:

process.on('SIGTERM', function onSigterm() {
  // the clean-up job belongs here, BEFORE exiting;
  // per the quoted article, it never runs when Node.js is PID 1
  process.exit(0);
});

As a result, the process does not exit on SIGTERM and is eventually terminated forcefully with SIGKILL (for example, when the grace period of docker stop runs out), which means that your clean-up code is never called at all.

In other words, a Node.js process that handles exit signals correctly when its PID is not 1 may fail to do so when it runs as PID 1.
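The snippet above calls process.exit(0) before any clean-up could happen. Once the process is no longer PID 1 (for example, when it is started under dumb-init as described in the solution below), a handler that runs the clean-up first might look like the sketch below. The cleanup callback is a hypothetical placeholder for your application's own shutdown work; exit and proc are injectable only so the logic can be exercised without real signals.

```javascript
'use strict';
// Sketch: run the clean-up FIRST, then exit.
function installShutdownHandlers({
  cleanup, // hypothetical async callback supplied by the application
  exit = (code) => process.exit(code),
  proc = process,
  signals = ['SIGTERM', 'SIGINT'],
}) {
  let shuttingDown = false;
  async function onSignal(signal) {
    if (shuttingDown) return; // ignore repeated signals while shutting down
    shuttingDown = true;
    try {
      await cleanup(signal); // e.g. close servers, flush logs
      exit(0);
    } catch (err) {
      console.error('clean-up failed:', err);
      exit(1);
    }
  }
  for (const sig of signals) proc.on(sig, onSignal);
  return onSignal;
}

module.exports = { installShutdownHandlers };
```

With an http.Server named server, a call could be installShutdownHandlers({ cleanup: () => new Promise((resolve) => server.close(resolve)) }).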

So, What Is Special About PID 1?

The process with PID 1 differs from the other processes in the following ways:
1. When the process with PID 1 dies for any reason, all other processes (in its PID namespace, i.e. the container) are killed with the KILL signal.
2. When any process that has children dies for any reason, its children are reparented to the process with PID 1.
3. Many signals whose default action is Term have no default action for PID 1.
In practice, the last one is the most inconvenient. For development purposes, it effectively means you can't stop the process by sending SIGTERM or SIGINT if the process has not installed a signal handler.
At the end of the day, all of the above means that most processes that were not explicitly designed to run as PID 1 (which is every application except supervisors) do not run well as PID 1.
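Point 3 can be checked at startup. The hypothetical helper below reports which termination signals have no listener while the process is PID 1, the situation in which, per the explanation above, SIGTERM and SIGINT are ignored. It relies only on process.pid and process.listenerCount, both standard Node.js APIs.

```javascript
'use strict';
// Sketch: when running as PID 1, report termination signals with no listener.
function unhandledSignalsAsPid1(proc = process, signals = ['SIGTERM', 'SIGINT']) {
  if (proc.pid !== 1) return [];
  return signals.filter((sig) => proc.listenerCount(sig) === 0);
}

// Example: warn once at startup (prints nothing unless we are PID 1).
const missing = unhandledSignalsAsPid1();
if (missing.length > 0) {
  console.warn(
    `Running as PID 1 without handlers for ${missing.join(', ')}; ` +
      'these signals will be ignored and docker stop will fall back to SIGKILL.'
  );
}

module.exports = { unhandledSignalsAsPid1 };
```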

Solution

dumb-init
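Use dumb-init as the container's initial process (PID 1), so that every other process descends from dumb-init. dumb-init forwards the signals it receives to its child and reaps zombie processes that get reparented to it. In the Dockerfile, install it with RUN apt-get update && apt-get install -y dumb-init=1.2.2-1.1 and make it the entrypoint, for example ENTRYPOINT ["/usr/bin/dumb-init", "--"] followed by your usual CMD.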
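To see why a dedicated init is preferable to a few lines of Node.js, here is a sketch of just one of dumb-init's jobs, signal forwarding. forwardSignals and runSupervised are hypothetical names. The caveat in the comments is an assumption based on our understanding of libuv: Node.js only waits for the PIDs it spawned itself, so processes reparented to it (such as orphaned browser helper processes) would plausibly still be left as zombies. Reaping those is what dumb-init adds on top.

```javascript
'use strict';
// Sketch of ONE of dumb-init's jobs (signal forwarding), NOT a replacement.
// Assumption: Node.js/libuv only waits on PIDs it spawned itself, so orphaned
// grandchildren reparented to this process would still become zombies.
const os = require('node:os');
const { spawn } = require('node:child_process');

// Relay termination signals received by `proc` to `child` while it runs.
function forwardSignals(child, proc = process, signals = ['SIGTERM', 'SIGINT', 'SIGHUP']) {
  for (const sig of signals) {
    proc.on(sig, () => {
      if (child.exitCode === null && child.signalCode === null) {
        child.kill(sig);
      }
    });
  }
}

// Spawn the real application and exit with its status when it exits.
function runSupervised(command, args, proc = process) {
  const child = spawn(command, args, { stdio: 'inherit' });
  forwardSignals(child, proc);
  child.on('exit', (code, signal) => {
    // 128 + signal number is the usual shell convention for "killed by signal".
    proc.exit(code !== null ? code : 128 + (os.constants.signals[signal] || 0));
  });
  return child;
}

module.exports = { forwardSignals, runSupervised };
```

A call such as runSupervised('node', ['server.js']) (server.js being a placeholder) would forward docker stop's SIGTERM to the app; for production, dumb-init itself remains the safer choice.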

References

https://www.elastic.io/nodejs-as-pid-1-under-docker-images
https://vagga.readthedocs.io/en/latest/pid1mode.html