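The Right Way to Use Linux Timers

Recently, an upper-layer subsystem that uses our timer wrapper found that its timers were inaccurate: they ran somewhat slower than real time. This article records how the problem was tracked down and fixed.

General steps for using a timer

The general steps for using a timer on Linux are as follows:

(1) Create the timer with timer_create()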

struct sigevent evp;

memset(&evp, 0, sizeof(struct sigevent));
evp.sigev_notify = SIGEV_SIGNAL;
/* Must be the same signal the timer thread waits for in step (3). */
evp.sigev_signo = SIGRTMAX - timer_no;

if (timer_create(CLOCK_REALTIME, &evp, &tTimer[timer_no].timer) < 0)
{
    return -1;
}

return 0;

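Here timer_no is the timer number and tTimer is a previously defined array of timer structures indexed by that number. A newly created timer does not start running on its own.

(2) Set the timer's expiration time with timer_settime()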

timer_settime(tTimer[timer_no].timer, 0, &tTimer[timer_no].timevalue, NULL);

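Once it has been set with a nonzero it_value, the timer starts running immediately.

(3) Wait for the timer to expire with sigwait() in the timer thread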

sigset_t tsigmask;
int      isigrcv, i;
int      ret;

sigemptyset(&tsigmask);
for (i = 0; i < NUM_OF_TIMERS; i++)
{
    sigaddset(&tsigmask, SIGRTMAX - i);
}

while (1)
{
    /* sigwait() returns 0 on success and a positive error number on failure. */
    ret = sigwait(&tsigmask, &isigrcv);
    if (ret == 0)
    {
        /* Call the timer callback function */
    }
}
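
One detail the loop above relies on but does not show: sigwait() only behaves reliably if the signals in the set are blocked in the calling thread, and, because timer signals are directed at the whole process, in every other thread as well. Otherwise the default action of a real-time signal terminates the process. The following is a minimal sketch, assuming the mask is installed in the main thread before any other thread is created so that new threads inherit it. The name block_timer_signals and its count parameter are illustrative and do not come from the original code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>

/*
 * Hypothetical helper: build the set {SIGRTMAX, SIGRTMAX - 1, ...} for
 * `count` timers and block it in the calling thread. Threads created
 * afterwards inherit the mask, so only the thread that calls sigwait()
 * ever consumes these signals.
 * Returns 0 on success or an error number on failure.
 */
int block_timer_signals(int count, sigset_t *set)
{
    int i;

    assert(set != NULL);

    /* Refuse counts that would fall outside the real-time signal range. */
    if (count <= 0 || count > SIGRTMAX - SIGRTMIN + 1)
        return EINVAL;

    sigemptyset(set);
    for (i = 0; i < count; i++)
        sigaddset(set, SIGRTMAX - i);

    /* pthread_sigmask() returns 0 or an error number; it does not set errno. */
    return pthread_sigmask(SIG_BLOCK, set, NULL);
}
```

A caller would invoke block_timer_signals(NUM_OF_TIMERS, &tsigmask) early in main() and then pass the same set to sigwait() in the timer thread.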

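Checking whether the kernel's date is accurate

Start at the source: verify that the kernel's real-time clock is accurate, using a soak test with the date command.

Run date once, let the system run for a while, then run date again. Comparing the difference between the two times printed by date with the difference between the two SecureCRT log timestamps confirms that the kernel's real-time clock is accurate.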

(11:46:30.614) $ date
(11:46:30.645) Mon Jul 17 11:46:30 CST 2017
(13:48:02.798) $ date
(13:48:02.798) Mon Jul 17 13:48:02 CST 2017
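
The comparison can be checked with a little arithmetic. The sketch below is only an illustration: the helper names hms_to_ms and elapsed_ms are made up for this example, and it assumes both timestamps fall on the same day and carry exactly three millisecond digits.

```c
#include <assert.h>
#include <stdio.h>

/* Convert an "hh:mm:ss.mmm" timestamp to milliseconds since midnight.
 * Returns -1 if the string does not match that layout. */
long hms_to_ms(const char *stamp)
{
    int h, m, s, ms;

    if (sscanf(stamp, "%d:%d:%d.%d", &h, &m, &s, &ms) != 4)
        return -1;
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

/* Difference between two same-day timestamps, in milliseconds. */
long elapsed_ms(const char *start, const char *end)
{
    long a = hms_to_ms(start);
    long b = hms_to_ms(end);

    assert(a >= 0 && b >= 0);
    return b - a;
}
```

For the log above, the two date outputs (11:46:30 and 13:48:02) are 7292 s apart, and the two terminal timestamps (11:46:30.614 and 13:48:02.798) are 7292.184 s apart, so the two clocks agree to within the one-second resolution of date.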

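Checking whether the user-space and kernel timer counts agree

First, add a counter to the callback function of the user-space timer thread and increment it by 1 on every timer expiration.

Then enable the kernel's CONFIG_TIMER_STATS option (available only on older kernels; it was removed together with /proc/timer_stats in Linux 4.11), rebuild and boot the kernel, and run the following command to turn on timer statistics: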

# echo 1 > /proc/timer_stats

After that, use the following command to view the counts and other information for all timers:

# cat /proc/timer_stats

The output looks something like this:

Timer Stats Version: v0.2
Sample period: 55521.903 s
...
14205534,  1501 linux.out        .common_timer_set (posix_timer_fn)
...
389290862 total events, 7011.482 events/sec

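The linux.out line is the timer used by the upper-layer subsystem, with an expiration count of 14205534. The counter in the application code, however, read 14205535, one more than the timer_stats count. After the timer was stopped and started once more (also by calling timer_settime()), the application counter was 2 ahead of the timer_stats count, and the gap grew by 1 with each further stop and start. This was the suspicious part.

Finding the problem in how the timer was used

Walking through the user-space timer code again revealed the problem.

After creating the timer with timer_create(), the call to timer_settime() assigned the timeout only to it_value in its third argument, new_value, and left it_interval set to 0.

See the description of it_value and it_interval in timer_settime(2):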

If new_value->it_value specifies a nonzero value (i.e., either subfield is nonzero), then timer_settime() arms (starts) the timer, setting it to initially expire at the given time. (If the timer was already armed, then the previous settings are overwritten.) If new_value->it_value specifies a zero value (i.e., both subfields are zero), then the timer is disarmed.

The new_value->it_interval field specifies the period of the timer, in seconds and nanoseconds. If this field is nonzero, then each time that an armed timer expires, the timer is reloaded from the value specified in new_value->it_interval. If new_value->it_interval specifies a zero value then the timer expires just once, at the time specified by it_value.

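In other words, it_value determines the first expiration and it_interval determines all subsequent ones. So how did the existing code make the timer fire repeatedly?

It turned out that the while loop in the timer thread called timer_settime() again after every expiration to set it_value anew: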

while (1)
{
    ret = sigwait(&tsigmask, &isigrcv);
    if (ret == 0)
    {
        /* Call the timer callback function */
        tTimer[timer_no].clkCallback(tTimer[timer_no].argCall);
        /* Re-arm the one-shot timer by hand: this is the source of the drift */
        timer_settime(tTimer[timer_no].timer, 0, &tTimer[timer_no].timevalue, NULL);
    }
}

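There is therefore a delay between each expiration and the next call to timer_settime() that restarts the timer: the signal delivery latency plus the time spent in the callback. Every cycle adds this error, and over a long run the errors accumulate, so the timer falls behind real time.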
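
To make the accumulation concrete: if each cycle loses a delay d between expiration and re-arm, then after N cycles the timer lags by roughly N × d. The sketch below reproduces the re-arm pattern and measures the total elapsed time. It is an illustration under stated assumptions, not the original wrapper: the function name run_rearmed_timer, its parameters, and the use of nanosleep() to stand in for callback work are all invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Monotonic time in nanoseconds, used only to measure elapsed time. */
static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical demo: let a one-shot timer expire `count` times, re-arming it
 * by hand after each expiration (it_interval stays zero). `work_ns` stands in
 * for the time spent in the callback before the timer is re-armed.
 * Returns the time from first arming to the last expiration in nanoseconds,
 * or -1 on error.
 */
long long run_rearmed_timer(long period_ns, int count, long work_ns)
{
    struct sigevent evp;
    struct itimerspec its;
    struct timespec work;
    sigset_t mask;
    timer_t timer;
    long long start, end;
    int sig, i;

    if (period_ns <= 0 || count <= 0 || work_ns < 0)
        return -1;

    /* Block the signal so that sigwait() can collect it synchronously. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMAX);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    memset(&evp, 0, sizeof(evp));
    evp.sigev_notify = SIGEV_SIGNAL;
    evp.sigev_signo = SIGRTMAX;
    if (timer_create(CLOCK_REALTIME, &evp, &timer) < 0)
        return -1;

    memset(&its, 0, sizeof(its));               /* it_interval = 0: one-shot */
    its.it_value.tv_sec = period_ns / 1000000000L;
    its.it_value.tv_nsec = period_ns % 1000000000L;
    work.tv_sec = work_ns / 1000000000L;
    work.tv_nsec = work_ns % 1000000000L;

    start = end = now_ns();
    timer_settime(timer, 0, &its, NULL);
    for (i = 0; i < count; i++)
    {
        if (sigwait(&mask, &sig) != 0)
            break;
        assert(sig == SIGRTMAX);
        end = now_ns();
        if (i + 1 < count)
        {
            nanosleep(&work, NULL);               /* simulated callback */
            timer_settime(timer, 0, &its, NULL);  /* next period starts only now */
        }
    }
    timer_delete(timer);
    return end - start;
}
```

With a 5 ms period, 10 expirations, and 2 ms of simulated work, the nominal time is 50 ms, but the measured time cannot be less than about 68 ms, because each of the 9 re-arms happens at least 2 ms late. On older glibc versions, timer_create() may require linking with -lrt.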

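This also explains why each stop and start of the timer leaves the user-space count one further ahead of the kernel count: the kernel timer has already expired and stopped, but the user-space callback still adds 1 for that final expiration.

Solution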

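When calling timer_settime(), set both it_value and it_interval, so that the timer reloads automatically and runs periodically.
In the timer thread's expiration handling, remove the call to timer_settime() that re-armed the timer, and keep only the call to the callback registered by the upper-layer subsystem.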
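
A minimal sketch of the periodic form described above, under the same caveats as any illustration: run_periodic_timer, its parameters, and the nanosleep() call that simulates callback work are assumptions made for this example and are not taken from the original wrapper. It also adds timer_getoverrun() so that expirations merged into a single signal are still counted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Monotonic time in nanoseconds, used only to measure elapsed time. */
static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical demo: arm a periodic timer once (it_value and it_interval both
 * set) and wait until at least `count` expirations have occurred. `work_ns`
 * stands in for the time spent in the callback.
 * Returns the time from arming to the last observed expiration in
 * nanoseconds, or -1 on error.
 */
long long run_periodic_timer(long period_ns, int count, long work_ns)
{
    struct sigevent evp;
    struct itimerspec its;
    struct timespec work;
    sigset_t mask;
    timer_t timer;
    long long start, end;
    int sig, over, fired = 0;

    if (period_ns <= 0 || count <= 0 || work_ns < 0)
        return -1;

    /* Block the signal so that sigwait() can collect it synchronously. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMAX);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    memset(&evp, 0, sizeof(evp));
    evp.sigev_notify = SIGEV_SIGNAL;
    evp.sigev_signo = SIGRTMAX;
    if (timer_create(CLOCK_REALTIME, &evp, &timer) < 0)
        return -1;

    its.it_value.tv_sec = period_ns / 1000000000L;
    its.it_value.tv_nsec = period_ns % 1000000000L;
    its.it_interval = its.it_value;     /* the kernel reloads the timer itself */
    work.tv_sec = work_ns / 1000000000L;
    work.tv_nsec = work_ns % 1000000000L;

    start = end = now_ns();
    if (timer_settime(timer, 0, &its, NULL) < 0)
    {
        timer_delete(timer);
        return -1;
    }
    while (fired < count)
    {
        if (sigwait(&mask, &sig) != 0)
            break;
        assert(sig == SIGRTMAX);
        end = now_ns();
        /* Expirations that occurred while the signal was still pending. */
        over = timer_getoverrun(timer);
        fired += 1 + (over > 0 ? over : 0);
        if (fired < count)
            nanosleep(&work, NULL);     /* simulated callback, no re-arm */
    }
    timer_delete(timer);
    return end - start;
}
```

As long as work_ns is shorter than the period, the returned time should stay close to count × period_ns, because the expiration schedule is fixed when the timer is armed and no longer depends on how quickly the thread gets back to sigwait(). On older glibc versions, timer_create() may require linking with -lrt.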

That is the right way to use Linux timers.

That's all.
