Managing relative priorities for users and jobs
This is a brief description of how I have set up the scheduler policies and queue access controls in SGE. Please refer to the
SGE Administration Guide for details of the terminology. We are currently using
Functional Policy to determine the relative priority of groups of users, and
Urgency Policy to determine the relative priority of jobs submitted by users within the same group. Functional Policy and Urgency Policy both contribute to the calculation of the overall priority of a waiting job, which is represented by a number between 0 and 1. The normalised, relative priorities for all categories of priority can be seen with the
"qstat -pri" command. The weightings given to Functional Policy and Urgency Policy in this calculation are 0.2 and 0.1 respectively. This means that differences in priority resulting from application of the Functional Policy are twice as important as differences in priority resulting from application of the Urgency Policy. This ensures that jobs submitted by a member of an unprivileged user group can never have a priority greater than a job submitted by a privileged user. Note that the priority values of
waiting jobs usually take a long time to show up in
qstat listings, so it can sometimes appear that waiting jobs have a lower priority than they should have. However, priority only affects jobs that are
waiting to run, which means that the priority values for running jobs are basically irrelevant, and should not be compared to the priorities of waiting jobs. Furthermore, once a job has started running it has an equal status with respect to all other running jobs. Another thing to note about priority values is that they are
relative, which means that they are recalculated by SGE every time a job enters or leaves the system.
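As a rough illustration of how the two weightings combine, the following Python sketch treats the overall priority as a weighted sum of the normalised ticket and urgency values. The weightings 0.2 and 0.1 are the ones quoted above; the function name and the normalised input values are hypothetical, and the sketch ignores any other contribution to the priority.

```python
# Weightings quoted in the text above.
WEIGHT_FUNCTIONAL = 0.2
WEIGHT_URGENCY = 0.1


def overall_priority(norm_tickets, norm_urgency):
    """Weighted sum of two normalised contributions, each between 0 and 1."""
    return WEIGHT_FUNCTIONAL * norm_tickets + WEIGHT_URGENCY * norm_urgency


# Hypothetical extreme case: a privileged user's least urgent job against
# an unprivileged user's most urgent job.
privileged_job = overall_priority(norm_tickets=1.0, norm_urgency=0.0)
unprivileged_job = overall_priority(norm_tickets=0.0, norm_urgency=1.0)

print(privileged_job > unprivileged_job)  # True: department outweighs urgency
```

Under these assumptions the largest possible urgency contribution (0.1) is still smaller than the gap opened up by the functional contribution (0.2), which is why a difference in department cannot be overturned by urgency alone.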
The groups of users are SGE
departments, which are also doubling as
access control lists for queues. A department does not act like an access control list by default; the administrator must enable this dual role explicitly when creating a new department. Priority levels in the Functional Policy are defined by the relative share of functional tickets given to users, departments, projects and job types. I have given the departments (i.e. user groups) a 100% weighting compared to the others; in other words, membership of a particular department is the only factor that affects the relative importance of one user compared to another. At the time of writing there are two departments: "Privileged" and
"ClusterGrid". The
ClusterGrid department is for unprivileged users, and will eventually contain users submitting jobs via the Reading Campus Grid. The
ClusterGrid department has a smaller share of functional tickets than the Privileged department, which means that members of the
ClusterGrid department have a lower priority compared to members of the other department. This means that if there are two jobs waiting and one of them was submitted by a member of the
ClusterGrid department, the
ClusterGrid user's job will not be the first to run when the number of processors required for one job becomes available, even if it was the first to be submitted.
The actual values given to the various weightings and shares involved in the priority calculations can be viewed and altered in the "Policy Configuration" section of Qmon. The actual number of tickets does not matter, as long as there are enough to be distributed among all the running and waiting jobs that there are ever likely to be in the system at any one time; I have set the total number of tickets to one million, an arbitrarily large number. One important consideration when choosing the number of tickets and the distribution of functional shares is that functional tickets allocated to a department are
shared by all members of that department. Therefore, it is important to make sure that the number of tickets
per user in a privileged department is much larger than the number of tickets per user in an unprivileged department. At the time of writing the
shares of functional tickets given to the privileged and unprivileged departments are 1000 and 1 respectively. The privileged department as a whole therefore receives 1000 times as many functional tickets as the unprivileged department, and these are distributed among its members. Occasionally there will be minor differences between the normalised ticket priority values of waiting jobs submitted by members of the same department. These differences should always be negligible compared to the differences resulting from the Urgency Policy calculation, which is used to determine the relative priority of jobs submitted by members of the same department.
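The per-user consideration can be checked with a little arithmetic. The Python sketch below uses the total of one million tickets and the 1000:1 shares quoted above, but the membership counts are invented for illustration, and the even split of a department's tickets among its members is a simplification rather than a description of exactly how SGE distributes them.

```python
TOTAL_TICKETS = 1_000_000                        # total quoted in the text
SHARES = {"Privileged": 1000, "ClusterGrid": 1}  # shares quoted in the text
MEMBERS = {"Privileged": 20, "ClusterGrid": 200} # hypothetical head counts


def tickets_per_user(department):
    """Department's slice of the tickets, split evenly among its members."""
    department_tickets = TOTAL_TICKETS * SHARES[department] / sum(SHARES.values())
    return department_tickets / MEMBERS[department]


ratio = tickets_per_user("Privileged") / tickets_per_user("ClusterGrid")
print(f"Privileged users hold about {ratio:.0f} times more tickets each")
```

With these made-up head counts the per-user gap is even larger than the 1000:1 department gap. If the privileged department had many more members than the unprivileged one, the per-user gap would shrink, which is the situation to watch for.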
The Urgency Policy governs the relative priorities of two or more jobs submitted by members of the same department. Urgency values are associated with various queue attributes (which are referred to as
complex resource attributes, or simply
"complexes" in SGE), and are assigned on a per-slot basis. In other words, a job requesting a particular attribute will be given an Urgency value equal to the attribute's value multiplied by the number of slots (i.e. processors) requested. The number of slots is itself a queue attribute, which means that a job requesting a large number of slots is given a higher Urgency than a job requesting a small number of slots. The only other attributes that have an associated Urgency value are
high_priority (
hp) and
low_priority (
lp). Requesting the
high_priority attribute with "-l hp" in the
qsub command increases a job's Urgency. Similarly, requesting
low_priority decreases Urgency.
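The per-slot rule can be sketched in a few lines of Python. The urgency values in the table below are hypothetical placeholders (the real ones are defined in the complex configuration and are not given here); only the rule itself, attribute value multiplied by the number of slots requested, comes from the description above.

```python
# Hypothetical per-slot urgency values, for illustration only.
URGENCY = {"slots": 1000, "hp": 500, "lp": -500}


def resource_urgency(slots, requested=()):
    """Sum the per-slot urgency of 'slots' and of every requested attribute,
    then multiply by the number of slots requested."""
    per_slot = URGENCY["slots"] + sum(URGENCY[name] for name in requested)
    return per_slot * slots


print(resource_urgency(8))          # large job
print(resource_urgency(2))          # small job
print(resource_urgency(2, ["hp"]))  # small job requesting high_priority
print(resource_urgency(2, ["lp"]))  # small job requesting low_priority
```

With these placeholder values the eight-slot job outranks every two-slot variant, and among the two-slot jobs the order is hp, plain, lp.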
ClusterGrid users' batch jobs are restricted to the
ClusterGrid.q queue, which has a fixed run time limit of three hours. Members of other departments must set a run time limit for each job by specifying a value for the
h_rt queue attribute. The only interactive queue available for
ClusterGrid users is the compile queue.
The waiting time weighting has been set to 0.004. This allows the priority of waiting jobs to gradually increase over time. A weighting of 0.004 results in the urgency of a waiting job increasing by a value equivalent to four processors after about four hours, which is the recommended maximum resubmission interval for jobs running on the ESSC Cluster. The reasoning behind the waiting time weighting is as follows. Large jobs (i.e. those needing a large number of processors) start off with a higher urgency value than small jobs because they are generally harder to find space for. However, a small job can experience starvation in the situation where a sequence of large, dependent jobs has been submitted. Each time one of the large jobs finishes, the next in the sequence will be ready to run. If a small job is waiting to run at the same time, it will always be beaten by the large job because the large job has a higher urgency value. Increasing the urgency of the small job over time prevents it from waiting until the whole sequence of large jobs comes to an end. The value of the weighting has been chosen so that the small job should never have to wait longer than two resubmission intervals before being allowed to run.
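The growth of urgency with waiting time can be sketched as follows, assuming the waiting time contribution is simply the time waited in seconds multiplied by the weighting. The weighting of 0.004 and the four-hour interval are taken from the text; the function names are hypothetical. How many processors a given gain corresponds to depends on the per-slot urgency configured for slots, which is not stated here.

```python
WEIGHT_WAITING_TIME = 0.004  # weighting quoted in the text


def waiting_contribution(seconds_waited):
    """Urgency gained by a job that has waited the given number of seconds."""
    return WEIGHT_WAITING_TIME * seconds_waited


def seconds_to_overtake(urgency_gap):
    """Waiting time a job needs to close an urgency gap on a newly
    submitted job, whose own waiting time is still zero."""
    return urgency_gap / WEIGHT_WAITING_TIME


four_hours = 4 * 60 * 60  # one resubmission interval, in seconds
gain = waiting_contribution(four_hours)
print(f"Urgency gained after four hours: {gain:.1f}")
print(f"Hours needed to close a gap twice that size: "
      f"{seconds_to_overtake(2 * gain) / 3600:.1f}")
```

Because each job in a sequence of large, dependent jobs starts waiting afresh when it becomes ready, a small job that has been queued throughout keeps accumulating this contribution until it exceeds the large job's head start.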
--
DanBretherton - 17 Jul 2009