Energy Aware Scheduling Microconference Notes
Welcome to Linux Plumbers Conference 2015
The structure will be short introductions to an issue or topic followed by a discussion with the audience.
A limit of 3 slides per presentation is enforced to ensure focus and allocate enough time for discussions.
SCHEDULE
Attendees: Around 45, about 10 active.
Overview of PM Changes in ACPI 6 (Rafael [Intel])
LPI Support in Linux (Sudeep [ARM])
How do we transform the hierarchical LPI tables into a flat table of states?
On x86, package states are controlled by hardware and don't have to be listed. The information from the core and package states is combined into one entry by discarding some of the information: the exit latency of the package state and the target residency of the core state are kept.
The only reason to present the states in a flat table is due to the way cpuidle works today?
And due to existing governors.
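Below is a minimal sketch, in plain C with made-up structures, of the flattening idea described above: the combined entry keeps the package state's exit latency and the core state's target residency. The struct and field names are illustrative, not the actual ACPI LPI or cpuidle definitions.

```c
/* Hypothetical sketch of flattening a hierarchical core + package LPI
 * description into a single flat entry, as discussed above.  Names and
 * fields are illustrative only. */
#include <stdio.h>

struct lpi_state {
	const char *name;
	unsigned int exit_latency_us;
	unsigned int min_residency_us;
};

static struct lpi_state flatten(const struct lpi_state *core,
				const struct lpi_state *pkg)
{
	struct lpi_state flat = {
		.name = pkg->name,
		/* exit latency taken from the package state */
		.exit_latency_us = pkg->exit_latency_us,
		/* target residency taken from the core state */
		.min_residency_us = core->min_residency_us,
	};
	return flat;
}

int main(void)
{
	struct lpi_state core = { "core-retention", 100, 500 };
	struct lpi_state pkg  = { "pkg-off", 1500, 5000 };
	struct lpi_state flat = flatten(&core, &pkg);

	printf("%s: exit %uus, residency %uus\n",
	       flat.name, flat.exit_latency_us, flat.min_residency_us);
	return 0;
}
```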
Sudeep: If there are different states for different clusters, do we need different drivers for each cluster?
The states are associated with the drivers, not the core. The drivers need to be changed to hand over different state tables for the clusters when the driver registers. Rafael: Possibly cpu-specific tables would be the solution.
Sudeep: Do we want OS-coordinated package states? Some vendors want it.
Len: Make them pay for it, we don't want it.
Kevin: The ACPI implementation might not consider peripherals, so we need OS-coordinated states to fix it.
Sudeep: The current LPI proposal builds on the existing processor idle driver.
Rafael: I would prefer if we could leave the existing ACPI driver untouched, but I will have a look.
One idle to rule them all? Idle management of CPUs & IO devices (Kevin Hilman [Linaro])
Kevin: CPU and IO devices are currently controlled by two separate frameworks, but essentially they are all devices that can be idled. Can we have one framework to rule them all? cpuidle isn't really fit for coupled idle-states. Can we use runtime PM for CPUs?
Len: If you have 100 cores you can't have any locks in the idle path.
Kevin, Mike: It would be beneficial to be able to model dependencies for cpu idle states like it is possible for IO devices today. You may have a GPU sharing a power rail with one cluster, while another cluster has its own. When the GPU is active the CPUs in the same power domain are preferred candidates for tasks, while the opposite is the case when the GPU is off. Currently we cannot model this dependency.
Len: Software coordinated coupled states is a nightmare. We don't want to call any notifiers.
Kevin: Systems are already built this way.
Peter: Have you looked at the idle path? The NOHZ code is already doing expensive timer operations in the idle path.
Kevin: To use genpd for cpu idling it has to be extended to support multiple levels. It currently only supports on/off.
Patches are already on the list for some of the changes necessary. See slide deck for pointers.
Lorenzo: We need a power domain hierarchy. We currently have PM notifiers in the idle path. In DT we need power domain bindings that can be used to remove the PM notifiers.
Rafael: Is the plan to follow the LPI approach?
It may not be possible to capture everything we need in ACPI, and we need a DT solution now. It may take a while to discuss the bindings.
Rafael: I suggest to start with a real use-case to figure out what we need.
Wakeup Sources Configuration and Management (Sudeep [ARM])
----------------------------------------------------------------------------------------
Friday, August 21
Energy Model Driven Scheduling (Morten)
5 RFCs released to LKML
Not all tracking bits are cpu-invariant or frequency-invariant
Would like to discuss the patchsets related to adding energy awareness to scheduling on the lists and the way forward to merging them.
This patch set doesn't include the arch backend; the arch will need to provide hooks to be called. Remaining discussion on providing a default CPUfreq backend for frequency-invariance support.
PeterZ doesn't find anything objectionable in the patches that add invariant load and utilisation metrics.
With the invariant load and utilisation metrics we can start making bits and pieces of the scheduler aware of capacity in the load balancing pathways.
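As a rough illustration of the invariance idea, here is a small standalone sketch that scales a raw utilization figure by the current/maximum frequency ratio and by the CPU's relative compute capacity. The helper and its arguments are made up for the example; this is not the kernel's load tracking code.

```c
/* Minimal sketch of frequency- and CPU-invariant utilization scaling, as
 * discussed above.  SCHED_CAPACITY_SCALE mirrors the kernel's fixed-point
 * scale; the helper below is illustrative, not the kernel implementation. */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long scale_util(unsigned long running_time,
				unsigned long period,
				unsigned long curr_freq,
				unsigned long max_freq,
				unsigned long cpu_capacity)
{
	unsigned long util = running_time * SCHED_CAPACITY_SCALE / period;

	/* scale by the frequency the CPU actually ran at ... */
	util = util * curr_freq / max_freq;
	/* ... and by the relative compute capacity of this CPU */
	return util * cpu_capacity / SCHED_CAPACITY_SCALE;
}

int main(void)
{
	/* 50% busy at half the max frequency on a little CPU (capacity 430) */
	printf("invariant util = %lu\n",
	       scale_util(5, 10, 600000, 1200000, 430));
	return 0;
}
```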
Morten is looking at calculate_imbalance and its history.
No major objection to cleaning up the load balancing code.
Dietmar - when do we switch between utilisation and load? Currently very conservative - when any cpu has no spare cycles. It's quite hard to figure out when to switch - questions related to preserving SMP niceness.
Rik - scheduler metric idle or instant idle? Use scheduler load, as CFS is built on the scheduler load metrics.
Ideas to increase range of energy aware scheduling welcome. Something to play with to figure out what the best balance is.
Noticed a load balancing issue - two threads on a hyperthreaded core. Rik to send the patch if he gets it working.
Ideally new_idle should prefer cores over siblings.
Dietmar: if we only consider a system that has no over-utilized cpus then the operational band for energy aware scheduling is small. We'll often not use it. Agreement that we'll need to play with it and perhaps change the periodic load balance code in the future.
The idea is for scheduler to control frequency and drive CPUFreq
When the scheduler controls frequency it'll be a bit like when userspace governor controls frequency.
How do you know how much energy a task is going to consume?
Can this model figure out when CPU and GPU are on the same bus in a two CPU system? It can be done if the energy model is updated dynamically.
Reasons to dynamically update the energy model - it can help with the above system. The other reason is thermal signalling. Not experimented with yet, but it could be done.
Static tables make the above hard. Depending on whether the GPU sharing the bus is on or off, we might want the capacity table to be updated to a high or low value.
What happens with the new task?
Cost of migration taken into account for moving tasks around - wake up cost, flushing caches, etc? Need to keep it simple as it lives in the scheduler. Was in previous version and should be added back if benefit is seen.
Where should the energy model live - thermal has its own, and we are adding one to the scheduler now. Should have a single source of information. Thermal uses it for modelling power.
Thermal should hook into the scheduler for energy model information - avoid maintaining multiple copies of the energy model.
The model is updated dynamically and that information will have to be passed to the scheduler as well. PeterZ: it shouldn't be too hard.
Summary: Capacity invariance stuff, no major objections. The next step for the EAS patchset is to split it up and start merging some of it. Getting utilization and capacity right would be a good step forward. The SchedTune patches posted earlier this week provide a knob to tune the performance/energy trade-off.
SchedTune - add margin to utilisation to influence P-state selection. Len: the knob already exists in intel_pstate. SchedTune moves it to the scheduler.
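A hedged sketch of the SchedTune knob mentioned above: utilization is inflated by a margin before being used for P-state selection. Taking the margin as a percentage of the remaining headroom is an assumption made for illustration, not a statement about the posted patches.

```c
/* Toy illustration of boosting utilization by a tunable margin so that
 * P-state selection picks a higher operating point.  The margin formula is
 * an assumption for the example, not the posted SchedTune code. */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

static unsigned long boosted_util(unsigned long util, unsigned int boost_pct)
{
	/* margin is a percentage of the remaining headroom (assumption) */
	unsigned long margin = (SCHED_CAPACITY_SCALE - util) * boost_pct / 100;

	return util + margin;	/* higher value -> higher P-state requested */
}

int main(void)
{
	printf("util 256 boosted 25%% -> %lu\n", boosted_util(256, 25));
	return 0;
}
```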
Discussion about how frequency invariance is achieved - Len thinks you can feed in the actual cycles instead of frequency.
Coordination between CPU and GPU - SchedTune can be used for this.
Scheduler Driven frequency selection
A new governor cpufreq-sched.c. It only works with CFS.
It's a shim layer that plugs into fair.c, doesn't implement policy (?). It's event driven; decisions are made when the scheduler enqueues tasks and similar events.
Problem: the governor knows about the CPU topology domain but sched fair does not. It would be better if CFS knew about frequency domains, so that this code doesn't live in the sched governor.
In EAS there's a SD_SHARED_CAP flag that indicates the sched_domains share capacity states/are in the same cpufreq domain.
Similar information is needed for other scheduling classes as well - PeterZ: CFS gives a max hint, SCHED_RT and SCHED_DEADLINE give a minimum hint.
Another problem: there is no async interface, no way to defer changing the frequency, as you can't change it from scheduler context. Currently there's a kernel thread running in parallel that does the change for you (?).
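The deferral described above can be sketched in userspace as follows: the fast-path hint only records the requested capacity and wakes a helper thread, which then performs the potentially slow frequency change. The function names (cpufreq_sched_hint, freq_worker) are invented for the example; this models the pattern, not the actual cpufreq-sched code.

```c
/* Userspace model of the deferral pattern: the hook called on scheduler
 * events records the request, and a separate worker thread performs the
 * (possibly slow) frequency change.  Not the real governor code. */
#include <pthread.h>
#include <stdio.h>

static unsigned long requested_capacity;
static int pending;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;

/* "fast path": record the requested capacity and kick the helper */
static void cpufreq_sched_hint(unsigned long capacity)
{
	pthread_mutex_lock(&lock);
	requested_capacity = capacity;
	pending = 1;
	pthread_cond_signal(&kick);
	pthread_mutex_unlock(&lock);
}

/* helper thread: allowed to sleep and talk to slow firmware interfaces */
static void *freq_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!pending)
		pthread_cond_wait(&kick, &lock);
	printf("setting frequency for capacity %lu\n", requested_capacity);
	pending = 0;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, freq_worker, NULL);
	cpufreq_sched_hint(512);	/* e.g. triggered by an enqueue event */
	pthread_join(tid, NULL);
	return 0;
}
```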
Other sched classes could have support for it; deadline is one such example.
If we are going down that route, why not include other devices that are affected by CPU performance, like the GPU? GPU and deadline support is future stuff, where Mike would like things to go.
The ideal would be to get rid of the shim layer and plug all the consumers into the governors themselves. That would mean multiple governors in cpufreq.
- Rafael: I would have to think about it.
Should cpuidle be thrown in the mix as well? Rafael - talking about the cpuidle governor not the core. Before trying to re-architect things, we should focus on what problems we are trying to solve.
- Mike: as we add more consumers, that shim layer will become thicker, is that what we want?
- Rafael: we would have to simplify locking and reorganize in cpufreq
- Mike: what structs need to stay in cpufreq?
- Rafael: Moving __ somewhere else makes sense
- Mike: Apart from locking, the async interface is another of the big problems that need solving
- PeterZ: is the ARM world moving to better ways of changing frequencies?
- Even if it's changing, we will still have devices with the "slow" interface
- Bobby: Changing frequency is not part of CPU architecture, nothing to do with v8. System architecture specifications exist and are being developed that should improve the situation.
- Mark: still, there are old devices that will have to continue to be supported
- Mike: the new RFC series introduces a new machine (?) flag that Rafael requested. Maturity level of the RFC is toy.
- Eduardo: What about thermal constraints?
- Mike: thermal will limit the maximum frequency. Also, we have talked about modelling; there are synergy opportunities there as well.
- Len: We are only scratching the surface of the interactions between CPU and GPU. Opportunistic states are now in both CPU and GPU, with the pmic choosing who gets them.
Driver that shares power between CPU and GPU - you want these to be both busy and idle at the same time.
- John: There isn't a good way of doing it in Windows, so we do it in firmware. It's kind of complicated there; the CPU and GPU firmwares need to coordinate.
- Len: the good thing about SoCs is that the CPU and GPU are in the same chip, so it's easier for firmware to coordinate them. What we don't know is when the GPU is waiting for the CPU and vice versa.
- Bobby: we could use sched deadline to improve the situation by having threads that are doing things for the GPU run at higher priorities and with deadline guarantees.
- Mike: we do have a lack of an expression of these dependencies. Is per-device QoS needed?
- Badhri: buffer goes from cpu to gpu, just by looking at idle time you go down to the lowest OPP, making the process take longer than it should. We fake the busy signals, by setting iobusy flags while the GPU is doing its renderscript bit.
- Len: We will solve this problem and it will work, but then we will hit a power budget and we will have to decide which one to throttle. Or thermal limit.
You can't always get what you want: P-state selection on Intel CPUs - Kristen Accardi
Maintainer of the intel pstate driver, working in the power group at Intel with Len and Rafael.
Power management: if you don't use it, turn it off.
We don't know how much power a P-state draws. Power is a function of temperature, frequency and voltage, and it varies with the part as well. There is uncertainty with regards to how much power a particular P-state is going to consume.
Are the thermal states exposed to the operating system? Varies with the system. Certain cooling devices can set the T-state. Under normal operation you should never go below Pn. From Pn down to LFM the states reduce frequency without reducing voltage. These are inefficient states, but they help reduce power and so help with the thermal situation.
P-states are like dimming a light bulb. It's not just changing frequency; typically it's a frequency and voltage combination. We call it a Performance state.
P-states != power. Power varies with thermal state, and even different parts consume different power for the same workload at the same P-state.
Pn is the guaranteed frequency, meaning you will always get it. From P0 to P1 is the turbo range. It's used because not all parts of the chip are active at the same time, and we can use that opportunity to increase the frequency beyond nominal in other parts of the system. P1 to Pn is visible to the OS.
Pn -> LFM is not available for P-state selection; it's only used for extreme thermal conditions. LFM is the Lowest Frequency Mode. Under normal operation you don't go below LFM.
If all cores are on, there's little room to use the Turbo states. If only one core is available, then higher Turbo states can be selected. Turbo can be turned off by the OS, you can choose what's the highest performance that you want to have.
Newer parts know when they are getting old (reliability meter), and the firmware might decide not to grant Turbo P-states based on age. Turbo is never guaranteed; you may or may not get it.
Slide on Power delivery
Ivy/Sandybridge cores share voltage regulator and so are in the same voltage domain.
Haswell has voltage per-core.
Different workloads benefit (or not) from sharing voltage domain (or not).
Hardware co-ordination of P-states.
The only way to know what P-state you get is to use APERF/MPERF to know the number of cycles and then calculate what frequency you actually got.
intel_pstate tells you the real frequency that each core got; acpi_cpufreq lies about it.
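For reference, a small sketch of the APERF/MPERF arithmetic: the average frequency over an interval is the nominal frequency scaled by the ratio of the two counter deltas. The sample values are made up; on real hardware the counters come from the IA32_APERF/IA32_MPERF MSRs.

```c
/* Sketch of computing the effective (average) frequency from APERF/MPERF
 * deltas, as described above.  Counter values are invented for the example. */
#include <stdio.h>
#include <stdint.h>

static uint64_t effective_khz(uint64_t nominal_khz,
			      uint64_t aperf_prev, uint64_t aperf_now,
			      uint64_t mperf_prev, uint64_t mperf_now)
{
	uint64_t da = aperf_now - aperf_prev;	/* cycles actually executed */
	uint64_t dm = mperf_now - mperf_prev;	/* reference cycles */

	return dm ? nominal_khz * da / dm : 0;
}

int main(void)
{
	/* ran ~25% above nominal on average, e.g. within the turbo range */
	printf("%llu kHz\n",
	       (unsigned long long)effective_khz(2400000,
						 1000000, 2250000,
						 1000000, 2000000));
	return 0;
}
```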
Client platforms have very low idle states - race to idle makes sense here. On servers, by contrast, increasing the P-state is very expensive, so race to idle is not the best solution.
Capacity/utilization is insufficient for determining whether to scale. KP - Can have a step-wise increase of frequency, see if utilisation goes up, and then scale down.
For big processors (Haswell, ..) you increase P-state after passing 97%, for smaller you do it at 60%.
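A toy sketch of that threshold heuristic, assuming per-platform utilization thresholds along the lines quoted above; the numbers and helper name are illustrative only.

```c
/* Illustrative threshold check: bump the P-state only once utilization
 * crosses a per-platform threshold (e.g. ~97% on a big core, ~60% on a
 * smaller one).  Values are examples, not the driver's actual tunables. */
#include <stdbool.h>
#include <stdio.h>

static bool should_bump_pstate(unsigned int util_pct, bool big_core)
{
	unsigned int threshold = big_core ? 97 : 60;

	return util_pct >= threshold;
}

int main(void)
{
	printf("big @80%%: %d, small @80%%: %d\n",
	       should_bump_pstate(80, true), should_bump_pstate(80, false));
	return 0;
}
```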
- Paul: have you thought about factoring L2 cache misses into the algorithm?
- Kristen: Andi Kleen told me not to do it, because nobody else would be able to use the performance counters.
- Lorenzo: can the intel_pstate driver be generalized?
- Kristen: I guess race to halt could be generalized, but there's a lot of hardware intelligence in there. The P algorithm could be made platform independent.
- ?: not all platforms may want to do race to idle. On Intel it is always better to get to idle as soon as possible; for some ARM platforms it may be better to execute at a lower OPP for longer
Hardware P-states (HWP): hardware is aware of a lot of things that the OS is not aware of, like the load of the GPU, whether it's a memory bound workload... and make more intelligent decisions.
When enabled in the driver, the driver just collects statistics. You can program minimum and maximum performance, so thermald can still do thermal control. You can specify your desired performance, and there is an Energy Performance Preference that's similar to the SchedTune knob presented earlier.
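A hedged sketch of composing an HWP request from those fields (minimum, maximum, desired performance and the Energy Performance Preference). The bit positions below follow my reading of the HWP request register layout and should be checked against the Intel SDM before use.

```c
/* Sketch of packing an HWP request value from the fields mentioned above.
 * The exact bit layout is an assumption to be verified against the SDM. */
#include <stdint.h>
#include <stdio.h>

static uint64_t hwp_request(uint8_t min_perf, uint8_t max_perf,
			    uint8_t desired, uint8_t epp)
{
	return (uint64_t)min_perf |
	       ((uint64_t)max_perf << 8) |
	       ((uint64_t)desired << 16) |
	       ((uint64_t)epp << 24);
}

int main(void)
{
	/* full performance range, no desired hint, balanced energy/perf */
	printf("HWP request = 0x%016llx\n",
	       (unsigned long long)hwp_request(1, 255, 0, 128));
	return 0;
}
```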
- Paul: which Intel parts have HWP support?
- Kristen: Future big Core parts. Also some Broadwell parts. Skylake as well.
- Punit: is intel going to get any benefit from EAS?
- Kristen: probably not, as we're moving this logic to the hw. Some task migration may be interesting for us.
- Len: we can guarantee performance using the min and the max.
- Kevin: is this moving to all processors? (I think that's what he asked)
- Samuel: yes
- Kristen: the HWP request register may not do anything, the min and max are just hints.
- Rafael: you should not report what you put in the register as running frequencies, you should count cycles to see what you get.
- Kristen: we tried not to report frequency and there was customer push back
- Paul: The scheduler would want to know how much of a task got executed.
- PeterZ: feedback is the APERF/MPERF. There are lots of other counters, but they are not fast enough to make scheduler decisions
- Kristen: P-state transitions happen very quickly. HWP is like an evolution of turbo.
- Len: some systems just run on turbo frequencies and barely have any "base" frequencies (guaranteed frequencies)
- Paul: this is a HW dream, but it's a RT nightmare
- PeterZ: if you really want nanosecond-accuracy hard RT, you have to disable the MMU and caches or use coprocessors
- Len: The trend is to more cores, that reduces the guaranteed frequencies. An RT system will probably have 4 cores, so more guaranteed frequencies.
Recent Power Management Core Changes in Linux - Rafael Wysocki
Suspend-to-idle: Instead of suspending to RAM, you suspend all devices and then you let all the cores run in idle waiting for an interrupt.
Why suspend to idle? No platform support required. You don't go through firmware, so it may be faster. You reduce the number of interrupts: the system won't get any interrupts, not even timers, until you hit a key on the keyboard. All devices are forced into low power states.
- Kevin: when you freeze userspace, all drivers should transition to idle
- Rafael: in theory yes, but with this we just force the drivers.
Quiescent mode, shipped in 4.0
enable_irq_wake() sets a flag for suspend to idle
There are a number of changes that have gone in since 3.15 related to that, mostly optimizations to get suspend/resume to work as fast as possible.
Framework for interrupt wakeups. If you know an interrupt is only there to wake up the system, you only have to call a function to mark it as such.
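A minimal kernel-style sketch of a driver using that interface: it flags its interrupt as a wakeup source during suspend with enable_irq_wake() and clears it on resume. The my_dev structure and callbacks are hypothetical, not taken from a real driver.

```c
/* Hypothetical driver suspend/resume pair marking its IRQ as a wakeup
 * source, as a sketch of the mechanism discussed above. */
#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/pm.h>
#include <linux/pm_wakeup.h>

struct my_dev {
	int irq;
	bool wakeup;
};

static int my_dev_suspend(struct device *dev)
{
	struct my_dev *md = dev_get_drvdata(dev);

	if (device_may_wakeup(dev)) {
		/* let this IRQ bring the system out of suspend-to-idle */
		enable_irq_wake(md->irq);
		md->wakeup = true;
	}
	return 0;
}

static int my_dev_resume(struct device *dev)
{
	struct my_dev *md = dev_get_drvdata(dev);

	if (md->wakeup) {
		disable_irq_wake(md->irq);
		md->wakeup = false;
	}
	return 0;
}

static SIMPLE_DEV_PM_OPS(my_dev_pm_ops, my_dev_suspend, my_dev_resume);
```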
- Paul: Have you thought about using the cgroup freezer to freeze part of the system?
- PeterZ: the cgroup freezer is very broken. The system-wide freezer works because it is done for the whole system; cgroup freezing has dependencies that haven't been properly solved.
- Paul: Is this due to locking and priority inversion?
- Paul: Is the problem just between userspace processes, or does it include kernel resources also?
- Rafael: some parts of userspace don't handle very well being frozen while other tasks are not frozen. When everything's frozen, you don't have to worry about priority inversion, for example. In Windows there's an API to participate in the way the system is suspended. For that, you need application writers to do the right thing.