Background
In the current KVM implementation, when a vCPU executes the HLT instruction it triggers a VM exit, and KVM enters the kvm_vcpu_block function. This function used to schedule the vCPU thread out right away, but with halt_poll_ns configured it now busy-loops for a bounded while when the conditions are right. The code looks like this:
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
    ...
    do {
        /*
         * This sets KVM_REQ_UNHALT if an interrupt
         * arrives.
         */
        if (kvm_vcpu_check_block(vcpu) < 0) {
            ++vcpu->stat.halt_successful_poll;
            if (!vcpu_valid_wakeup(vcpu))
                ++vcpu->stat.halt_poll_invalid;
            goto out;
        }
        cur = ktime_get();
    } while (single_task_running() && ktime_before(cur, stop));
    ...
}
The busy loop exits when any of three conditions holds: the vCPU is no longer halted, more than one runnable task sits on the current pCPU, or the polling window expires. While the loop spins, CPU utilization inevitably shoots up to 100%. We did not want this time to be counted by top or by our pmcint tool (which samples via the PMU). The former can be handled by tweaking a few accounting spots in the kernel's cputime.c, but the latter is trickier: pmcint relies on PMU hardware sampling, so we cannot simply adjust the accounting. The approach we finally settled on was to rework the busy loop to enter the C1 state; the PMU counters pause in C1, so pmcint stops ticking. This both preserves vCPU wakeup performance and keeps users from seeing sky-high utilization.
Think about it: what does the busy loop actually buy us? Simply that, while the conditions hold, the vCPU thread is not switched out; otherwise the back-and-forth context switches would delay vCPU wakeups and hurt guest performance. Seen that way, calling HLT/MWAIT inside the busy loop achieves the same effect:
- With MWAIT, we can MONITOR the variable that terminates the busy loop, so that when it is time to stop, the MWAIT is broken and the flow continues just as before (see the sketch after this list).
- With HLT, we would need a backing hrtimer armed from stop to guarantee we always come back out of HLT. (This is clearly inferior to the code above, though: HLT's break events do not include memory monitoring, so it cannot observe changes to nr_running promptly, which may delay scheduling.)
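A minimal sketch of the MWAIT variant, assuming the kernel's __monitor/__mwait wrappers (shown in the MONITOR/MWAIT section below) and reusing the loop variables from the snippet above; the re-check between MONITOR and MWAIT closes the race against a wakeup that lands just before the monitor is armed. This is an illustration of the idea, not upstream KVM code:

do {
    if (kvm_vcpu_check_block(vcpu) < 0) {
        ++vcpu->stat.halt_successful_poll;
        goto out;
    }
    /* Arm the monitor on this thread's flags, then re-check before sleeping. */
    __monitor(&current_thread_info()->flags, 0, 0);
    if (kvm_vcpu_check_block(vcpu) >= 0 && !need_resched())
        __mwait(0x00, 0x01);    /* EAX=0x00: C1 hint; ECX[0]=1: wake on masked intr */
    cur = ktime_get();
} while (single_task_running() && ktime_before(cur, stop));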
[Update] It just occurred to me that swapping in HLT/MWAIT here has another upside: on a host with hyper-threading enabled, if vcpu1 on one hardware thread of a core is doing halt_poll_ns while vcpu2 on the sibling thread is running real work, this approach may beat upstream's, because once vcpu1 exits to execute HLT/MWAIT, the core's hardware resources become exclusively available to the sibling thread, whose performance improves accordingly.
CPU C-states
C-states were introduced by the ACPI specification; see the ACPI spec for details. The material below is adapted from the Chinese article "Haswell芯光大道之六:C-States十种状态解析".
Overview of the three common CPU working states
The common CPU working states are S-states, C-states, and P-states: S-states (sleeping states) are system sleep states, C-states (CPU power states) are CPU power states, and P-states (CPU performance states) are CPU performance states. Beyond these three there are also G-states (global states) and D-states (device states).
S-states are easy to understand: click "Sleep" manually, or let the machine idle past the timeout in the system's power-management settings, and it goes to sleep; S0 is normal operation. C-states and P-states look similar at first glance, since both adjust the processor's core voltage, current, and frequency, so they are often confused. Their difference is actually clear-cut, but first we need to sort out how these three kinds of state relate.
S0 among the S-states means not sleeping; it covers both normal operation and standby, which means C-states only exist within S0. Likewise, C0 is the normal working state, and P-states describe the processor while it is actually running, so P-states only exist within C0.
In short, P-states adjust the core voltage and frequency according to system load while the processor keeps running, whereas C-states change the state of the processor's individual parts (cores, caches, buses, and the various modules integrated in later), with the processor either working or idle. In everyday use the system switches among these states constantly to extend battery life and cut power consumption.
The individual C-states
The C-states currently comprise the following eleven entries:
C0: Normal operating mode; the CPU is in C0 whenever we are actively using the machine.
C1/C1E: Halt/standby. The processor's internal clocks are stopped by software (typically by issuing HLT); the enhanced C1E can additionally lower the multiplier and voltage. Watching the frequency and voltage drop in CPU-Z indicates the system has entered this state. A request arriving on the external bus briefly brings the CPU out of C1/C1E (taking only about 10 ns), and it drops back in once the request is handled. This state only imposes a latency requirement on the hardware, which current hardware generally meets.
C2/C2E: Similar to C1/C1E, except that C2/C2E is entered by hardware and waking up takes more than 100 ns.
C3: Deep sleep. The internal clocks are likewise stopped and the bus frequency is locked; on multi-core systems the cache contents are retained but writes are suspended, and the CPU cannot respond to important external bus requests. Entering C3 requires hardware support and having first entered C2. Wakeup takes around 50 µs.
C4: Deeper sleep. On top of C3, power is cut further by dropping the voltage below 1.0 V and shrinking the data held in the L2 cache. C3 must be entered first; wakeup takes no more than 1 s.
C4E: Also entered from C4, with the L2 cache drained to empty. Wakeup takes at least 200 µs.
C6: Deep power-down. The processor can flush all data from the L1 caches on the fly; with the microarchitectural state saved, the cores and the L2 cache are powered off, while the chipset keeps servicing memory transactions for I/O. Each core's power is managed more intelligently, and the voltage drops to half of C4's; wakeup, however, takes about 50% longer than C4.
C7: Deeper power-down, adding the flush of part or all of the L3 cache on top of C6.
C8: The L3 cache, the system agent (the part of the old northbridge that was integrated into the CPU), and the I/O power rails are all shut off; the external VR module drops to 1.2 V.
C9: The VR module voltage approaches 0 V.
C10: The VR module is switched off. (Unverified.)
The MONITOR/MWAIT instructions
The MONITOR instruction arms monitoring of a linear address (the monitored range is typically a cache line), so that a store to that address is detected. Per the SDM: The MONITOR instruction arms the address monitoring hardware using the address specified in EAX. The address range that the monitoring hardware will check for store operations can be determined by the CPUID instruction. The monitoring hardware will detect stores to an address within the address range and triggers the monitor hardware when the write is detected. The state of the monitor hardware is used by the MWAIT instruction.
Per the SDM, the MWAIT instruction provides hints to allow the processor to enter an implementation-dependent optimized state. There are two principal targeted usages: address-range monitor and advanced power management. A store to the address range armed by the MONITOR instruction, an interrupt, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, or the RESET# signal will exit the implementation-dependent-optimized state. In addition, an external interrupt causes the processor to exit the implementation-dependent-optimized state either (1) if the interrupt would be delivered to software (e.g., as it would be if HLT had been executed instead of MWAIT); or (2) if ECX[0] = 1. Software can execute MWAIT with ECX[0] = 1 only if CPUID.05H:ECX[bit 1] = 1. (Implementation-specific conditions may result in an interrupt causing the processor to exit the implementation-dependent-optimized state even if interrupts are masked and ECX[0] = 0.)
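As a side note, the monitor line size and the available MWAIT extensions can be queried from userspace. A minimal sketch using GCC's <cpuid.h>; the field layout follows CPUID leaf 05H in the SDM:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("smallest monitor line size: %u bytes\n", eax & 0xffff);
    printf("largest  monitor line size: %u bytes\n", ebx & 0xffff);
    printf("MWAIT extensions (ECX[0]):  %s\n", (ecx & 1) ? "yes" : "no");
    printf("intr break-event (ECX[1]):  %s\n", (ecx & 2) ? "yes" : "no");

    /* EDX[3:0], EDX[7:4], ... give the number of MWAIT sub-states for C0, C1, ... */
    for (int c = 0; c < 8; c++)
        printf("C%d sub-states: %u\n", c, (edx >> (4 * c)) & 0xf);
    return 0;
}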
Note: the above shows what makes MWAIT powerful compared with HLT (a usage sketch follows this list):
- it can enter a specified C-state;
- stores to the MONITORed address are among its wakeup events;
- even with interrupts disabled, it may (depending on CPU support) still be woken by an interrupt.
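For concreteness, here is a minimal sketch of the usual pairing. The __monitor/__mwait wrappers mirror the kernel's (arch/x86/include/asm/mwait.h); wait_on_flag is a hypothetical caller meant to run in kernel context, since MONITOR/MWAIT fault at CPL > 0:

static inline void __monitor(const void *eax, unsigned long ecx,
                             unsigned long edx)
{
    /* "monitor %eax, %ecx, %edx" */
    asm volatile(".byte 0x0f, 0x01, 0xc8"
                 :: "a" (eax), "c" (ecx), "d" (edx));
}

static inline void __mwait(unsigned long eax, unsigned long ecx)
{
    /* "mwait %eax, %ecx" */
    asm volatile(".byte 0x0f, 0x01, 0xc9"
                 :: "a" (eax), "c" (ecx));
}

/* Hypothetical: park the CPU until someone stores to *flag. */
static void wait_on_flag(volatile unsigned long *flag)
{
    while (!*flag) {
        __monitor((const void *)flag, 0, 0);
        if (*flag)              /* re-check: the store may already have landed */
            break;
        __mwait(0x00, 0x01);    /* EAX=0x00: C1 hint; ECX[0]=1: intr break-event */
    }
}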
Could this cause security problems?
Question 1: Under virtualization, if the guest kernel executes MONITOR on a GVA, can that affect a host CPU that is currently sitting in MWAIT? For example, would a store to that GVA by some vCPU, or a store by some pCPU to an HVA whose numeric value equals that GVA, break the MWAIT?
Resolution:
- Intel SDM vol 3, sections 26.3.3 / 27.5.6 say that VM entry/VM exit clears any address-range monitoring that "may be in effect". (Aside: what counts as "may be in effect"? [a] I would think all of them are, since the address range is a virtual address; with a 64-bit host and a 64-bit guest the linear address spaces are the same size, the hardware cannot tell them apart, so presumably everything gets cleared. [b] But then something does not add up: suppose a pCPU monitors hva1 and then enters the guest, at which point hva1 must be cleared; nothing says the monitor on hva1 is restored after VM exit, so would later stores to hva1 on that pCPU go unnoticed? [c] Unless an address range is keyed by at least the pair of virtual address + CPU mode, but in that case why wouldn't the SDM simply say "the guest's/non-root address ranges are cleared"? Well... I don't understand this part yet.)
[2017/03/22 update] After some discussion with KVM co-maintainer Radim, we concluded this is not a problem. The Intel SDM vol 2 description of MWAIT says:
If the preceding MONITOR instruction did not successfully arm an address range or if the MONITOR instruction has not been executed prior to executing MWAIT, then the processor will not enter the implementation-dependent-optimized state. Execution will resume at the instruction following the MWAIT.
In other words, if the guest takes a VM exit between MONITOR and MWAIT, the monitored address range is cleared, but after the next VM entry the MWAIT behaves like a NOP (presumably because MWAIT, finding no armed address range, treats that as "did not successfully arm an address range" and therefore "resume[s] at the instruction following the MWAIT").
Relevant discussion excerpt:
>> 2) According to the "Intel sdm vol3 ch26.3.3 & ch27.5.6", I think MONITOR in
>> guest mode can't work as perfect as in host sometimes.
>> For example, a vcpu MONITOR a address and then MWAIT, if a external-intr(suppose
>> this intr won't cause to inject any virtual events ) cause VMEXIT, the monitor
>> address will be cleaned, so the MWAIT won't be waken up by a store operation to
>> the monitored address any more.
>
> It's not as perfect, but should not cause a bug (well, there is a
> discussion with suspicious MWAIT behavior :]).
> MWAIT on all Intels I tested would just behave as a nop if exit happened
> between MONITOR and MWAIT, like it does if you skip the MONITOR (MWAIT
> instruction description):
>
> If the preceding MONITOR instruction did not successfully arm an
> address range or if the MONITOR instruction has not been executed
> prior to executing MWAIT, then the processor will not enter the
> implementation-dependent-optimized state. Execution will resume at the
> instruction following the MWAIT.
>
- Intel SDM vol 3, section 25.1.3 says that if the "MONITOR exiting" VM-execution control is 1, the MONITOR instruction causes a VM exit. KVM forces "MONITOR exiting" to 1 in setup_vmcs_config, so any MONITOR executed by a vCPU triggers a VM exit, and the exit handler handle_monitor simply skips the instruction, treating it as a NOP.
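For reference, the handler is roughly this (from arch/x86/kvm/vmx.c of that era):

static int handle_nop(struct kvm_vcpu *vcpu)
{
    /* Just advance RIP past the offending instruction. */
    skip_emulated_instruction(vcpu);
    return 1;
}

static int handle_monitor(struct kvm_vcpu *vcpu)
{
    printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
    return handle_nop(vcpu);
}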
[2017/03/22 update] Question 2: Intel SDM vol 3, chapter 25.3 says that under certain conditions the guest executes MWAIT normally. Does "normally" include entering the various deeper C-states? If it does, we have an ugly problem: some deeper sleep states flush caches (possibly L1, L2, and L3), and VMX provides no way to cap the guest's max C-state, so a vCPU could affect the caches of other pCPUs. Frightening...
Radim agreed with this point, but I still need to run an experiment to confirm that a guest really can enter the various deeper sleep states.
Relevant discussion excerpt:
>> 1) As "Intel sdm vol3 ch25.3" says, MWAIT operates normally (I think includes
>> entering deeper sleep) under certain conditions.
>> Some deeper sleep modes(such as C4E/C6/C7) will clear the L1/L2/L3 cache.
>> This is insecurity if we don't take other protective measures(such as limit the
>> guest's max-cstate, it's fortunately that power subsystem isn't supported by
>> QEMU, but we should be careful for some special-purpose in case). While HLT in
>> guest mode can't cause hardware into sleep.
>
> Good point. I'm not aware of any VMX capabilities to prevent deeper
> C-states, so we'd always hope that guests obey provided information.
>
The Linux idle task
The body of the idle task is the do_idle function; the fragment we care about here is:
void cpu_startup_entry(enum cpuhp_state state)
{
    ......
    while (1)
        do_idle();
}

/*
 * Generic idle loop implementation
 *
 * Called with polling cleared.
 */
static void do_idle(void)
{
    ......
    while (!need_resched()) {
        ......
        if (cpu_idle_force_poll || tick_check_broadcast_expired())
            cpu_idle_poll();
        else
            cpuidle_idle_call();
        ......
    }
    ......
    sched_ttwu_pending();          /* wake any tasks whose wakeups are pending */
    schedule_preempt_disabled();   /* switch to another runnable task via schedule() */
}
If idle=poll is configured, the cpu_idle_poll path is taken; it simply spins there, polling whether a reschedule is needed.
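Condensed from kernel/sched/idle.c (tracing and critical-timings calls elided), cpu_idle_poll looks like this; interrupts stay enabled and the CPU just executes cpu_relax() until a reschedule is pending:

static inline int cpu_idle_poll(void)
{
    rcu_idle_enter();
    local_irq_enable();
    ......
    while (!tif_need_resched() &&
           (cpu_idle_force_poll || tick_check_broadcast_expired()))
        cpu_relax();
    ......
    rcu_idle_exit();
    return 1;
}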
Otherwise the cpuidle_idle_call path is taken:
/**
 * cpuidle_idle_call - the main idle function
 *
 * NOTE: no locks or semaphores should be used here
 *
 * On archs that support TIF_POLLING_NRFLAG, is called with polling
 * set, and it returns with polling set. If it ever stops polling, it
 * must clear the polling bit.
 */
static void cpuidle_idle_call(void)
{
    struct cpuidle_device *dev = cpuidle_get_device();
    struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
    int next_state, entered_state;
    ......
    if (cpuidle_not_available(drv, dev)) {
        default_idle_call();
        goto exit_idle;
    }

    /*
     * Suspend-to-idle ("freeze") is a system state in which all user space
     * has been frozen, all I/O devices have been suspended and the only
     * activity happens here and in interrupts (if any). In that case bypass
     * the cpuidle governor and go straight for the deepest idle state
     * available. Possibly also suspend the local tick and the entire
     * timekeeping to prevent timer interrupts from kicking us out of idle
     * until a proper wakeup interrupt happens.
     */
    if (idle_should_freeze() || dev->use_deepest_state) {
        if (idle_should_freeze()) {
            entered_state = cpuidle_enter_freeze(drv, dev);
            if (entered_state > 0) {
                local_irq_enable();
                goto exit_idle;
            }
        }

        next_state = cpuidle_find_deepest_state(drv, dev);
        call_cpuidle(drv, dev, next_state);
    } else {
        /*
         * Ask the cpuidle framework to choose a convenient idle state.
         */
        next_state = cpuidle_select(drv, dev);
        entered_state = call_cpuidle(drv, dev, next_state);
        /*
         * Give the governor an opportunity to reflect on the outcome
         */
        cpuidle_reflect(dev, entered_state);
    }
    ......
}
If the system has no advanced power-management driver (e.g. acpi_idle or intel_idle), default_idle_call is invoked, which on x86 executes the HLT instruction.
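Roughly, that path bottoms out in safe_halt() (default_idle_call -> arch_cpu_idle -> default_idle), which on x86 without paravirt is:

static inline void native_safe_halt(void)
{
    /* Enable interrupts and halt; the STI interrupt shadow guarantees
     * no wakeup interrupt can slip in between the two instructions. */
    asm volatile("sti; hlt" : : : "memory");
}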
If there is such a driver, the C-state to enter is computed and then entered through the call_cpuidle function:
static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                        int next_state)
{
    ......
    /*
     * Enter the idle state previously returned by the governor decision.
     * This function will block until an interrupt occurs and will take
     * care of re-enabling the local interrupts
     */
    return cpuidle_enter(drv, dev, next_state);
}
From here the chain is cpuidle_enter -> cpuidle_enter_state -> entered_state = target_state->enter(dev, drv, index); the ->enter callback finally takes the CPU into the chosen C-state.
For the intel_idle driver this callback is the intel_idle function, which ultimately calls mwait_idle_with_hints to execute MWAIT.
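Condensed from drivers/idle/intel_idle.c: the MWAIT hint for each C-state is pre-encoded in the driver's state table, and ECX[0]=1 requests interrupt break-events even while interrupts are masked:

static __cpuidle int intel_idle(struct cpuidle_device *dev,
                                struct cpuidle_driver *drv, int index)
{
    unsigned long ecx = 1;    /* break on interrupt flag */
    struct cpuidle_state *state = &drv->states[index];
    unsigned long eax = flg2MWAIT(state->flags);    /* MWAIT hint from the state table */
    ......
    mwait_idle_with_hints(eax, ecx);
    ......
    return index;
}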
/*
 * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
 * which can obviate IPI to trigger checking of need_resched.
 * We execute MONITOR against need_resched and enter optimized wait state
 * through MWAIT. Whenever someone changes need_resched, we would be woken
 * up from MWAIT (without an IPI).
 *
 * New with Core Duo processors, MWAIT can take some hints based on CPU
 * capability.
 */
static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
{
    if (static_cpu_has_bug(X86_BUG_MONITOR) || !current_set_polling_and_test()) {
        if (static_cpu_has_bug(X86_BUG_CLFLUSH_MONITOR)) {
            mb();
            clflush((void *)&current_thread_info()->flags);
            mb();
        }

        __monitor((void *)&current_thread_info()->flags, 0, 0);
        if (!need_resched())
            __mwait(eax, ecx);
    }
    current_clr_polling();
}
It first arms MONITOR on the idle task's flags, then enters the chosen C-state via MWAIT.
The reason for monitoring flags is straightforward. When another CPU wakes a task onto this CPU (see ttwu_queue), it either sends a RES IPI or enqueues the task directly on this CPU's runqueue. The former is an IPI, so this CPU wakes immediately. In the latter case, one of the steps sets TIF_NEED_RESCHED in the flags of the idle task (assuming this CPU is currently running idle), and since this CPU has MONITORed those flags, it also wakes the moment the store lands.
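Condensed from kernel/sched/core.c of that era, the remote-wakeup side looks like this; the RES IPI is skipped exactly when the target's idle task advertises TIF_POLLING_NRFLAG, because the TIF_NEED_RESCHED store alone trips the armed MONITOR:

/* Set TIF_NEED_RESCHED on the remote idle task, but only if it is polling
 * (i.e. sitting in MWAIT monitoring its flags). */
static bool set_nr_if_polling(struct task_struct *p)
{
    struct thread_info *ti = task_thread_info(p);
    typeof(ti->flags) old, val = READ_ONCE(ti->flags);

    for (;;) {
        if (!(val & _TIF_POLLING_NRFLAG))
            return false;
        if (val & _TIF_NEED_RESCHED)
            return true;
        old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
        if (old == val)
            break;
        val = old;
    }
    return true;
}

static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
    struct rq *rq = cpu_rq(cpu);

    if (llist_add(&p->wake_entry, &rq->wake_list)) {
        if (!set_nr_if_polling(rq->idle))
            smp_send_reschedule(cpu);    /* fall back to the RES IPI */
    }
}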