Linux watchdog configuration (代码007)

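About this article: I have not written anything for a while. Recently I ran into an annoying problem: the Linux kernel appearing to hang. In short, the kernel suddenly enters an unknown state while running and stops working properly. There are two main kinds of watchdog: hardware-backed and purely software. Today we mainly cover the second kind, the pure software implementation.
1. The configuration is not particularly difficult, but it does take some study.
2. Note in particular that the reliability of a software watchdog is still an open question. Whether the watchdog itself can be killed by the kernel remains a mystery, and I think it is possible. Unfortunately I have not yet managed to reproduce a software watchdog being killed, so that question stays open for now.
3. I recommend combining software and hardware watchdogs as a precaution.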

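I have not written anything for a while. Recently I ran into an annoying problem: the Linux kernel appearing to hang. In short, the kernel suddenly enters an unknown state while running and stops working properly. There are two main kinds of watchdog: hardware-backed and purely software. Today we mainly cover the second kind, the pure software implementation.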

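A watchdog is essentially a timer circuit. It usually has one input and one output: the input is called feeding the dog, and the output is normally wired to the reset pin of another component, typically a microcontroller. The watchdog's job is to check periodically that the chip is still working: if something goes wrong and the dog is not fed before the timer runs out, it sends a reset signal to the chip. The watchdog has the highest priority among the program's interrupts.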
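To make the feed/reset relationship concrete, here is a minimal C sketch that models a watchdog as a countdown timer. The names `soft_wdt`, `wdt_feed` and `wdt_expired` are made up for this illustration and do not belong to any kernel or library API. Time is passed in explicitly so the logic is easy to follow.

```c
#include <stdbool.h>

/* Hypothetical model of a watchdog timer, not a real driver API. */
struct soft_wdt {
    long timeout_s;    /* how long the timer may run without being fed */
    long last_feed_s;  /* time stamp of the most recent feed */
};

/* Feeding the dog restarts the countdown. */
static void wdt_feed(struct soft_wdt *w, long now_s)
{
    w->last_feed_s = now_s;
}

/* True once the countdown has run out; real hardware would now pull the reset line. */
static bool wdt_expired(const struct soft_wdt *w, long now_s)
{
    return now_s - w->last_feed_s >= w->timeout_s;
}
```

As long as `wdt_feed` is called more often than `timeout_s`, `wdt_expired` never becomes true. A hung program stops feeding, and that is the whole trick.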

Note: a hardware watchdog is a circuit, a physical component, so it is essentially unaffected by whatever state the software is in.

watchdog – a software watchdog daemon

It is simply a background service that keeps looping over a list of checks. Let's go through it in detail.

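The examples use Ubuntu 18.04; for other systems you will have to adapt them yourself. My board is an Orange Pi 4, whose system image ships with a hardware watchdog. You can check for it with: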

ls /dev/watchdog

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <sys/time.h>
#include <time.h>
#include <getopt.h>
#include <signal.h>
#include <termios.h>

// These definitions mirror <linux/watchdog.h>; including that header instead also works.
struct watchdog_info {
    unsigned int options;           // options the card/driver supports
    unsigned int firmware_version;  // firmware version of the card
    unsigned char identity[32];     // identity of the board
};

#define WATCHDOG_IOCTL_BASE 'W'
#define WDIOC_GETSUPPORT _IOR(WATCHDOG_IOCTL_BASE, 0, struct watchdog_info)
#define WDIOC_SETTIMEOUT _IOWR(WATCHDOG_IOCTL_BASE, 6, int)
#define WDIOC_GETTIMEOUT _IOR(WATCHDOG_IOCTL_BASE, 7, int)
#define WDIOS_DISABLECARD 0x0001
#define WDIOS_ENABLECARD 0x0002
#define WDIOC_SETOPTIONS _IOR(WATCHDOG_IOCTL_BASE, 4, int)
#define WDIOC_KEEPALIVE _IOR(WATCHDOG_IOCTL_BASE, 5, int)

// Read one character from the terminal without echo; the key press is used to feed the dog
int Getch(void)
{
     int ch;
     struct termios oldt, newt;   // terminal settings
     tcgetattr(STDIN_FILENO, &oldt);   // save the current terminal attributes
     newt = oldt;
     newt.c_lflag &= ~(ECHO|ICANON);   // disable echo and line buffering
     tcsetattr(STDIN_FILENO, TCSANOW, &newt);  // apply the new attributes
     ch = getchar();   // read one key press
     tcsetattr(STDIN_FILENO, TCSANOW, &oldt);  // restore the original settings
     return ch;
}

// suspend for the given number of milliseconds
int zsleep(int millisecond)
{
     unsigned long usec;
     usec = 1000UL * millisecond;
     return usleep(usec); // usleep() takes microseconds, so zsleep(100) sleeps for 0.1 s
}
int Init(void)
{
     int fd;
     // open the watchdog device file
     fd = open("/dev/watchdog", O_RDWR);
     if (fd < 0)
     {
         printf("device open fail\n");
         return -1;
     }
     printf("open success\n");
     return fd;
}

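int main(int argc, char **argv)
{
     int fd;
     int i;
     int c;
     struct watchdog_info wi;

     if (argc != 2) {
         printf("Usage : ./watchdog 10\n");
         return -1;
     }
     fd = Init();  // open the watchdog device
     if (fd < 0)
         return -1;

     // enable the watchdog; WDIOC_SETOPTIONS expects a pointer to the flags
     i = WDIOS_ENABLECARD;
     ioctl(fd, WDIOC_SETOPTIONS, &i);

     // read the board information (rarely needed)
     if (ioctl(fd, WDIOC_GETSUPPORT, &wi) == 0)
         printf("options is %u,identity is %s\n", wi.options, (const char *)wi.identity);

     // read the current watchdog timeout
     printf("WDIOC_GETTIMEOUT returned %d (0 means success)\n", ioctl(fd, WDIOC_GETTIMEOUT, &i));
     printf("The old reset time is: %d\n", i);

     // disable
     i = WDIOS_DISABLECARD;  // WDIOS_DISABLECARD = 0x0001
     printf("WDIOC_SETOPTIONS returned %d (0 means success)\n", ioctl(fd, WDIOC_SETOPTIONS, &i));
     // enable
     i = WDIOS_ENABLECARD;   // WDIOS_ENABLECARD = 0x0002
     printf("WDIOC_SETOPTIONS returned %d (0 means success)\n", ioctl(fd, WDIOC_SETOPTIONS, &i));

     // set the new timeout (in seconds) from the command line
     i = atoi(argv[1]);
     printf("WDIOC_SETTIMEOUT returned %d (0 means success)\n", ioctl(fd, WDIOC_SETTIMEOUT, &i));
     // read back the new timeout
     printf("WDIOC_GETTIMEOUT returned %d (0 means success)\n", ioctl(fd, WDIOC_GETTIMEOUT, &i));
     printf("The new reset time is: %d\n", i);

     while (1)
     {
         zsleep(100);
         c = Getch();
         if (c != 27) {
             // any key other than ESC feeds the dog; if the dog is not fed,
             // the system reboots once the timeout expires
             printf("keep alive \n");
             ioctl(fd, WDIOC_KEEPALIVE, NULL);
             // write(fd, "\0", 1);     // writing to the device also feeds the dog
         }
     }
     close(fd);   // close the device (never reached, the loop above does not exit)
     return 0;
}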

If you are also on an Orange Pi 4 you can use this code as is; on other boards it may need small changes.

sudo apt update
sudo apt install watchdog

orangepi@orangepi4:~/wiringOP/examples$ cat /etc/default/watchdog
# Start watchdog at boot time? 0 or 1
#Whether to start watchdog at boot: 1 starts it, 0 does not
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
#This is the keepalive mechanism, enabled by default
run_wd_keepalive=1
# Load module before starting watchdog
#Leave as is
watchdog_module="none"
# Specify additional watchdog options here (see manpage).
#Startup options; the individual options are listed below
watchdog_options="-s -v -c /etc/watchdog.conf"

watchdog daemon options

watchdog    [-F|--foreground]    [-f|--force]    [-c    filename|--config-file   filename]
       [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]
       Available command line options are the following:

       -v, --verbose
              Set verbose mode. Only implemented if compiled with SYSLOG feature. This mode  will
              log each several infos in LOG_DAEMON with priority LOG_INFO.  This is useful if you
              want to see exactly what happened until the watchdog rebooted the system. Currently
              it  logs  the  temperature (if available), the load average, the change date of the
              files it checks and how often it went to sleep.

       -s, --sync
              Try to synchronize the filesystem every time the process is awake.  Note  that  the
              system is rebooted if for any reason the synchronizing lasts longer than a minute.

       -b, --softboot
              Soft-boot  the system if an error occurs during the main loop, e.g. if a given file
              is not accessible via the stat(2) call. Note  that  this  does  not  apply  to  the
              opening  of  /dev/watchdog and /proc/loadavg, which are opened before the main loop
              starts.

       -F, --foreground
              Run in foreground mode, useful for running under systemd (for example).

       -f, --force
              Force the usage of the interval given or the maximal  load  average  given  in  the
              config file. Without this option these values are sanity checked.

       -c config-file, --config-file config-file
              Use    config-file   as   the   configuration   file   instead   of   the   default
              /etc/watchdog.conf.

       -q, --no-action
              Do not reboot or halt the machine. This is for testing  purposes.  All  checks  are
              executed  and  the  results are logged as usual, but no action is taken.  Also your
              hardware card or the kernel software watchdog driver is  not  enabled.  Temperature
              checking is also disabled since this triggers the hardware watchdog on some cards.

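If the English is hard to follow, run it through Google Translate, and leave me a comment if you are still stuck. Being able to read English documentation is a skill every programmer needs.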

orangepi@orangepi4:~/wiringOP/examples$ cat /etc/watchdog.conf
#ping                   = 172.31.14.1
#ping                   = 172.26.1.255
interface               = eth0
#file                   = /var/log/syslog
#change                 = 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1             = 24
#max-load-5             = 18
#max-load-15            = 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory             = 1
#allocatable-memory     = 1

#repair-binary          = /usr/sbin/repair
#repair-timeout         = 60
#test-binary            =
#test-timeout           = 60

# The retry-timeout and repair limit are used to handle errors in a more robust
# manner. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout          = 60
#repair-maximum         = 1

watchdog-device = /dev/watchdog

# Defaults compiled into the binary
temperature-sensor      = /sys/class/thermal/thermal_zone0/temp
max-temperature         = 90

# Defaults compiled into the binary
admin                   = root
interval                = 1
logtick                 = 1
log-dir                 = /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime                = yes
priority                = 1

# Check if rsyslogd is still running by enabling the following line
pidfile         = /var/run/rsyslogd.pid
pidfile         = /var/run/inetd.pid
pidfile         = /var/run/sshd.pid
pidfile         = /var/run/crond.pid

watchdog-timeout        = 60

# set heartbeat setting
# heartbeat-file = /var/log/watchdog/heartbeat.log
# heartbeat-stamps = 300

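ping: can be given several times. The addresses are pinged in turn at intervals, and if any one of them is unreachable the system is rebooted. The check is retried many times rather than failing on the first miss. By default it is attempted every 4 seconds.

interface: monitors traffic on a network interface. It can be given several times, and the interfaces are checked in turn. If any interface passes no data for too long, the system is rebooted, again after many retries rather than one.

file: monitors a file. It can be given several times, and the files are checked in turn. If the state of any monitored file changes so that it can no longer be accessed (for example it was removed with rm), the system is rebooted, again after many retries rather than one. By default it is attempted every 20 seconds.

change: used together with file. The value is a time interval in seconds: the file named by the preceding file entry must have been modified within that interval.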
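The file and change pair boils down to a `stat()` call plus a comparison of the modification time. The sketch below is my own approximation of that idea, not code from the watchdog package; `file_check_ok` is a hypothetical name, and treating a `change_s` of 0 as "no modification check" is an assumption made for the example.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * The file must be reachable via stat() and, when change_s is greater
 * than 0, must have been modified within the last change_s seconds.
 */
static bool file_check_ok(const char *path, long change_s, time_t now)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return false;               /* file missing or not accessible */
    if (change_s > 0 && now - st.st_mtime > change_s)
        return false;               /* file has not changed recently enough */
    return true;
}
```

With the values from the sample configuration, `file_check_ok("/var/log/syslog", 1407, time(NULL))` would fail once syslog has gone quiet for more than 1407 seconds.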

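max-load-1: the 1-minute load average, the same value that top shows. If it exceeds the configured limit the system is rebooted. This is a fairly dangerous, performance-related parameter, so do not set it carelessly. Leaving the line commented out or setting the value to 0 disables the check.

max-load-5: the 5-minute load average, the same value that top shows. If it exceeds the configured limit the system is rebooted. This is a fairly dangerous, performance-related parameter, so do not set it carelessly. Leaving the line commented out or setting the value to 0 disables the check.

max-load-15: the 15-minute load average, the same value that top shows. If it exceeds the configured limit the system is rebooted. This is a fairly dangerous, performance-related parameter, so do not set it carelessly. Leaving the line commented out or setting the value to 0 disables the check.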
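The three load limits can be pictured as a comparison against the numbers in /proc/loadavg, the file the man page excerpt above mentions. This is only a sketch under my own naming (`read_load`, `load_too_high`); the daemon's real implementation may differ in detail.

```c
#include <stdbool.h>
#include <stdio.h>

/* Read the 1, 5 and 15 minute load averages from a /proc/loadavg style file. */
static int read_load(const char *path, double load[3])
{
    FILE *f = fopen(path, "r");
    int n;

    if (!f)
        return -1;
    n = fscanf(f, "%lf %lf %lf", &load[0], &load[1], &load[2]);
    fclose(f);
    return n == 3 ? 0 : -1;
}

/* A limit of 0 disables that particular check, as in watchdog.conf. */
static bool load_too_high(const double load[3], int max1, int max5, int max15)
{
    return (max1  > 0 && load[0] > max1) ||
           (max5  > 0 && load[1] > max5) ||
           (max15 > 0 && load[2] > max15);
}
```

Calling `read_load("/proc/loadavg", load)` and then `load_too_high(load, 24, 18, 12)` reproduces the commented-out limits from the sample configuration.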

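min-memory: the minimum number of free memory pages allowed. The value is a page count, not a size in MB or GB. If you are not sure what that means, leave it alone to avoid unexpected results.

allocatable-memory: the number of memory pages that must be available for allocation. Again the value is a page count, not a size in MB or GB. If you are not sure what that means, leave it alone to avoid unexpected results.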
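Since both memory options count pages, it helps to know how big a page is on your machine. The helper below is a small illustration of the conversion; `pages_to_bytes` is a name I made up.

```c
#include <unistd.h>

/* Convert a page count from watchdog.conf into bytes on this machine. */
static long long pages_to_bytes(long pages)
{
    long page_size = sysconf(_SC_PAGESIZE);   /* often 4096, but not guaranteed */

    if (page_size <= 0)
        return -1;
    return (long long)pages * page_size;
}
```

On the command line, `getconf PAGESIZE` prints the same page size.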

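watchdog-device: the hardware watchdog device to use. Comment this line out for the software-only setup.

temperature-sensor: the temperature sensor to monitor, used together with max-temperature. On Ubuntu, for example, the CPU temperature is at /sys/class/thermal/thermal_zone0/temp. watchdog reads that file at intervals and compares the reading with max-temperature. If the reading is higher it retries several times, and if the temperature does not come down it reboots. CPUs usually have their own hardware limits, such as thermal and power walls, so this option sits underneath those and is more or less optional, at least in my opinion.

max-temperature: used together with temperature-sensor. It sets the highest temperature allowed for the monitored sensor.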
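One detail worth knowing when you look at the sensor file yourself: the Linux thermal sysfs interface reports millidegrees Celsius, so a reading of 45000 means 45 degrees. The sketch below reads such a file and compares it with a limit. `read_temp_c` and `too_hot` are hypothetical helpers, and the error value -1000 is just a convention chosen for this example.

```c
#include <stdio.h>

/* Read a sysfs thermal file; return whole degrees Celsius, or -1000 on error. */
static int read_temp_c(const char *path)
{
    FILE *f = fopen(path, "r");
    long milli;
    int n;

    if (!f)
        return -1000;
    n = fscanf(f, "%ld", &milli);
    fclose(f);
    return n == 1 ? (int)(milli / 1000) : -1000;
}

/* Compare a reading with the configured maximum. */
static int too_hot(int temp_c, int max_temp_c)
{
    return temp_c > max_temp_c;
}
```

For the configuration above this would be `too_hot(read_temp_c("/sys/class/thermal/thermal_zone0/temp"), 90)`.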

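admin: an e-mail address. If any check configured in /etc/watchdog.conf triggers, you are sent an e-mail before the system reboots, which is a rather thoughtful touch. It requires a working mail service; setting that up is left to you.

interval: the interval between writes to the hardware watchdog device. It is not needed here, so look it up yourself if you are interested.

logtick: when logging is enabled, how often a log entry is written. Too small a value wastes system resources. It is not needed here, so look it up yourself if you are interested.

log-dir: the directory for log files, /var/log/watchdog by default. The program creates the directory automatically, so you do not need to create it by hand. I do not use it here, so I will not go into more detail.

realtime: keeps the watchdog daemon itself alive. With normal scheduling, resources tend to go to foreground programs first. With this option the daemon locks itself into memory so that it is never swapped out, which protects it from being starved or killed by the system. Always turn it on.

priority: the scheduling priority of the daemon, used together with realtime. The default of 1 is fine.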
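In system-call terms, realtime plus priority corresponds roughly to locking the process into memory and switching it to a realtime scheduling class. The sketch below shows that combination. `go_realtime` is a hypothetical helper, and the choice of `SCHED_RR` is my assumption about what the daemon does, not something I have verified in its source.

```c
#include <sched.h>
#include <sys/mman.h>

/* Returns 0 on success, -1 on failure (usually missing privileges). */
static int go_realtime(int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    /* keep all current and future pages resident so the process is never swapped out */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;
    /* realtime round-robin tasks run ahead of every normally scheduled task */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        return -1;
    return 0;
}
```

Both calls normally need root, which is one reason the daemon runs as a system service.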

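pidfile: this one is important. If you want to monitor daemons such as sshd, crond or telnetd, this option does the job, and it can be given several times. Take sshd as an example: while the sshd service is running there is a file sshd.pid under /var/run/, and when the service is gone the file disappears. The file is specified in the service's own configuration, and not every daemon has one; the ones I know of are sshd, telnetd (inetd.pid), crond and rsyslogd. In short, the option watches the running state of a service and reboots the system once the service has died. The check is retried several times rather than failing on the first miss.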
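A pid file check can be approximated in a few lines: read the PID from the file, then ask the kernel whether such a process exists. The sketch below does that with `kill(pid, 0)`. `pidfile_alive` is a hypothetical helper, and I am assuming the daemon performs a comparable test rather than quoting its code.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* True if the pid file exists and names a process that is still running. */
static bool pidfile_alive(const char *path)
{
    FILE *f = fopen(path, "r");
    long pid;
    int n;

    if (!f)
        return false;                   /* the pid file is gone */
    n = fscanf(f, "%ld", &pid);
    fclose(f);
    if (n != 1 || pid <= 0)
        return false;                   /* the file is empty or garbled */

    /* Signal 0 delivers nothing and only tests for existence.
       EPERM means the process exists but belongs to another user. */
    return kill((pid_t)pid, 0) == 0 || errno == EPERM;
}
```

For the configuration above the call would be `pidfile_alive("/var/run/sshd.pid")`.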

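watchdog-timeout: used together with watchdog-device. We are not using watchdog-device, so I will not go into it here; look it up yourself if you need it.

Others: that was exhausting to write in one go. I tried every one of these features myself. There are a few others I have not used, so please look into them on your own. If any of the descriptions are inaccurate, point it out in the comments so we can all improve. Thanks!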

sudo systemctl disable watchdog # do not start at boot
sudo systemctl start watchdog # start the service
systemctl status watchdog # show the current status
