This version of this document is no longer maintained. For the latest documentation, see http://www.qnx.com/developers/docs.
This chapter includes:
You typically use the adaptive partitioning scheduler to:
In either case, you need to configure the parameters for the adaptive partitioning scheduler with the whole system in mind. The basic decisions are:
It seems reasonable to put functionally-related software into the same adaptive partition, and frequently that's the right choice. However, adaptive partitioning scheduling is a structured way of deciding when not to run software. So the actual method is to separate the software into different adaptive partitions if it should be starved of CPU time under different circumstances.
For example, if the system is a packet router that:
it may seem reasonable to have two adaptive partitions: one for routing, and one for topology. Certainly, logging routing metrics is functionally related to packet routing.
However, when the system is overloaded, meaning there's more outstanding work than the machine can possibly accomplish, you need to decide what work to do slowly. In this example, when the router is overloaded with incoming packets, it's still important to route them. But you may decide that if you can't do everything, you'd rather route packets than collect the routing metrics. By the same analysis, you might conclude that route-topology protocols should still run, using much less of the machine than routing itself, but run quickly when they need to.
Such an analysis leads to three partitions:
In this case, we chose to separate the functionally-related components of routing and logging the routing metrics because we prefer to starve just one if we're forced to starve something. Similarly, we chose to group two functionally-unrelated components, the logging of routing metrics and the logging of topology metrics, because we want to starve them under the same circumstances.
The amount of CPU time that each adaptive partition tends to use under unloaded conditions is a good indication of the budget you should assign to it. If your application is a transaction processor, it may be useful to measure CPU consumption under a few different loads and construct a graph of offered load versus CPU consumed.
In general, the key to getting the right combination of partition budgets is to try them:
It's possible to set the budget of a partition to zero as long as the SCHED_APS_SEC_NONZERO_BUDGETS security flag isn't set; see the SCHED_APS_ADD_SECURITY command for SchedCtl().
Threads in a zero-budget partition run only in these cases:
When is it useful to set the budget of a partition to zero?
But in general, setting a partition's budget to zero is risky. (This is why the SCHED_APS_SEC_RECOMMENDED security setting doesn't permit partition budgets to be zero.) The main risk in placing code into a zero-budget partition is that it may run in response to a pulse or event (i.e. not a message), and hence not run in the sender's partition. So, when the system is loaded (i.e. there's no free time), those threads may simply not run; they might hang, or things might happen in the wrong order.
For example, it's hazardous to set the System partition's budget to zero. On a loaded machine with a System partition budget of zero, requests to procnto to create processes and threads may hang, for example, when MAP_LAZY is used.
If your system uses zero-budget partitions, you should carefully test it with all other partitions fully loaded with while(1) loops.
Ideally we'd like resource managers, such as filesystems, to run with a budget of zero. That way they'd always be billing time to their clients. However, sometimes device drivers find out too late which client a particular thread has been doing work for. Some device drivers may have background threads for audits or maintenance that require CPU time that can't be attributed to a particular client.
In those cases, you should measure the resource manager's background and unattributable loads and add that amount to its partition's budget.
You can set the size of the time-averaging window to be from 8 to 400 ms. This is the time over which the scheduler tries to balance adaptive partitions to their guaranteed CPU limits. Different choices of window sizes affect both the accuracy of load balancing and, in extreme cases, the maximum delays seen by ready-to-run threads.
Some things to consider:
A small window size means that an adaptive partition that opportunistically goes over budget might not have to pay the time back. If a partition sleeps for longer than the window size, it won't get the time back later. So load balancing won't be accurate over the long term when the system is loaded and some partitions sleep for longer than the window size.
In an underload situation, the scheduler doesn't delay ready-to-run threads, but the highest-priority thread might not run if the adaptive partitioning scheduler is balancing budgets.
In very unlikely cases, a large window size can cause some adaptive partitions to experience runtime delays, but these delays are always less than what would occur without adaptive partitioning scheduling. There are two cases where this can occur.
If an adaptive partition's budget is budget milliseconds, then the delay is never longer than:
window_size - smallest_budget + largest_budget
This upper bound is only ever reached when low-budget and low-priority adaptive partitions interact with two other adaptive partitions in a specific way, and then only when all threads in the system are ready to run for very long intervals. This maximum possible delay has an extremely low chance of occurring.
For example, let's suppose we have these adaptive partitions:
This delay happens if the following happens:
Note this scenario can't happen unless a high-priority partition wakes up exactly when a lower-priority partition just finishes paying back its opportunistic run time.
Still rare, but more common, is a delay of:
window_size - budget
milliseconds, which may occur for low-budget adaptive partitions whose priorities are, on average, equal to those of other partitions.
However, with a typical mix of thread priorities, each adaptive partition typically experiences a maximum delay, when ready to run, of much less than window_size milliseconds.
For example, let's suppose we have these adaptive partitions:
This delay happens if the following happens:
However, this pattern occurs only if the 10% application never suspends (which is exceedingly unlikely) and if there are no threads of other priorities (also exceedingly unlikely).
Because these scenarios are complicated, and the maximum delay time is a function of the partition shares, we approximate this rule by saying that the maximum ready-queue delay time is twice the window size.
If you change the tick size of the system at runtime, do so before defining the adaptive partitioning scheduler's window size. That's because Neutrino converts the window size from milliseconds to clock ticks for internal use.
The practical way to check that your scheduling delays are correct is to load your system with stress loads and use the IDE's System Profiler to study the delays. The aps command lets you change budgets dynamically, so you can quickly confirm that you have the right configuration of budgets.
The API allows a window size as short as 8 ms. However, practical window sizes may need to be larger. For example, in an eight-partition system with all partitions busy, to reasonably expect all eight to run during every window, the window needs to be at least 8 timeslices long, which for most systems is 32 ms.
There are cases where an adaptive partition can prevent other applications from being given their guaranteed percentage CPU:
However, time spent in interrupt threads (those that use InterruptAttachEvent()) is correctly charged to those threads' adaptive partitions.
By default, anyone on the system can add partitions and modify their attributes. We recommend that you use the SCHED_APS_ADD_SECURITY command to SchedCtl(), or the aps modify command to specify the level of security that suits your system.
Here are the main security options, in increasing order of security. This list shows the aps command and the corresponding SchedCtl() flag:
Unless you're testing the partitioning and want to change all parameters without needing to restart, you should set at least basic security.
After setting up the partitions, you can use SCHED_APS_SEC_LOCK_PARTITIONS to prevent further unauthorized changes. For example:
sched_aps_security_parms p;

APS_INIT_DATA( &p );
p.sec_flags = SCHED_APS_SEC_LOCK_PARTITIONS;
SchedCtl( SCHED_APS_ADD_SECURITY, &p, sizeof(p) );
Before you call SchedCtl(), make sure you initialize all the members of the data structure associated with the command. You can use the APS_INIT_DATA() macro to do this.
The security options listed above are composed of the following options (but it's more convenient to use the compound options):
Any thread can make itself critical, and any designer can make any sigevent critical (meaning that it will cause the eventual receiver to run as critical), but this isn't a security hole. That's because a thread marked as critical has no effect on the scheduler unless the thread is in a partition that has a critical budget. The adaptive partitioning scheduler has security options that control who may set or change a partition's critical budget.
For the system to be secure against possible critical thread abuse, it's important to: