THIS FILE CONTAINS:

   o	Scheduler Controls for Interactive/Computational Bias
   o	Working Set Swapping
   o	Release 4 Scheduler Changes
   o	Scheduler Controls User Interface
   o	New Swapping Logic Implementation
   o	Release 6 Scheduler Changes

------------------



+---------------+
! d i g i t a l !   I N T E R O F F I C E  M E M O R A N D U M
+---------------+


TO:     TOPS20 Monitor Meeting List
   cc: Dave Braithwaite
        Norm Valentine
                                                DATE:   11 Apr 78
                                                FROM:   Dan Murphy
                                                DEPT:   LCG S.E.
                                                EXT:    6356
                                                LOC:    MR1-2/E37
                                                FILE:   SBIAS

SUBJ:   SCHEDULER CONTROLS FOR INTERACTIVE/COMPUTATIONAL BIAS

One of  the  administrative  scheduler  controls  which  has  been
discussed  for  TOPS20  is  a  way  to bias the system more toward
interactive jobs or compute-bound  jobs.   The  behaviour  of  the
system  in  this  regard  is  the  result  of a number of somewhat
arbitrary decisions as well as such basic limitations  as  machine
speed,  swap speed, job load etc.  The decisions which are made at
present are the result of nothing  significantly  more  scientific
than  a  coin flip, but generally based intuitively on the goal of
making the CPU resource distribution fair and  the  response  time
adequate  to  good.  Each release has a somewhat different feel of
responsiveness, although it  was  never  the  specific  intent  to
change it.

Therefore, recognizing that the choice of an  operating  point  on
the   continuum   between   maximum   responsiveness  and  maximum
computational  throughput   properly   rests   with   the   system
administrator,  we intend to provide a means of globally affecting
certain scheduling policies.  At the extremes, some  systems  wish
to ensure good response even in the presence of many compute-bound
jobs, and are willing to see the computational jobs get little  or
no  service  in  order  that the system remains responsive.  Other
sites consider the interactive jobs  (e.g.   program  development)
merely  background  and  wish  to  minimize  context switching and
swapping overhead.  Some sites may wish to run in different  modes
during different parts of the day.

This mechanism does not provide any absolute priorities;  rather,
biases  the  system  toward one class of jobs or the other.  It is
not a substitute for a class scheduling system, but can work  with
such  a system by affecting the handling of jobs within a class or
which do not otherwise fall within any class rules.

The most primitive level of control is effected by specifying  the
value  of  each of a set of flags which in turn affect a scheduler
decision.  The exact interface to the system administrator has not
been determined as yet, but the following are some possibilities:

    1. Document the typical effect of each of the flags and  allow
        the operator to specify an arbitrary combination.

    2. Define a set of numeric values representing  a  scale  from
        most interactive to most computational, one of which would
        be selected by the operator.

I suspect that the  latter  approach  is  better  suited  to  most
customers,  but  that  a few, perhaps with SWS help, would want to
determine their own setting.  Listed below are the current default
values and several typical values on either side.

The following flags and controls are  defined:   (note  that  a  1
value  is  always  the  more interactive setting regardless of the
default value)

        SKCYTF - bit 18 (default = 1)

This flag modifies the length of the two basic  scheduler  cycles.
Normally,  the  scheduler  has  a  20  ms  cycle  during  which it
processes terminal input and may switch processes if a page  fault
has  completed.   It  also  has  a  100 ms.  cycle during which it
checks various  queues  and  may  swap  processes  in  or  out  if
necessary  to conform to scheduling parameters.  Setting this flag
to 0 (more computational) will change these cycles to 80 ms.   and
400  ms.   respectively,  thus  reducing the context switching and
scheduling overhead, and giving  longer  uninterrupted  processing
time to compute-bound jobs.


        SKIOCF - bit 19 (default = 0)

This  flag  is  not  strictly  relevant   to   computational   vs.
interactive  jobs, but rather controls the effect of disk file IO.
Normally, a process  is  charged  a  certain  amount  against  its
current  quantum  (but  not  against its actual run time) for each
block caused by disk IO.  This tends to prevent an  IO-bound  job
from   remaining  on  high-priority  queues  for  too  long  while
generating high file traffic.  Setting this flag to  1  eliminates
the  quantum  charge;   hence the job is charged for only such CPU
time as it actually uses.  This favors IO-bound jobs whether  they
be short-lived (interactive) or long-lived (computational).


        SKHTF - bits 20-21 (default = 10)

This field allows one of four  possible  values  to  be  used  for
balance set hold time.  Balance set hold time is that quantum time
(as distinct from actual runtime) which must be used by a  process
when  once  loaded into the balance set.  If the hold time has not
been used, the process will not be preempted and  swapped  out  in
favor  of  a  higher  priority  process.  The default setting of 1
second is sufficient to prevent thrashing  without  causing  undue
degradation  of  response.   Less hold time will improve response,
while more will improve throughput, but only  under  circumstances
where  not  all  runnable  jobs  can  fit  in core simultaneously.
Defined values are:  00 = infinite hold time (i.e.   process  runs
to  completion);   01  =  5 seconds;  10 = 1 second;  11 = no hold
time.


        SKHQRF - bit 22 (default = 1)
        SKLQRF - bit 23 (default = 0)

These  two  flags  affect  the  operation  of  the  "balance   set
partitioning"  algorithm  of AJBALS which normally acts to prevent
either large computational jobs or interactive jobs from excluding
the  other.  This algorithm computes a "reserve" of pages for each
of the two classes as a function of the  number  of  processes  in
that  class,  up  to  a  maximum depending on the size of physical
core.  Setting SKHQRF  to  0  eliminates  the  high-queue  process
reserve  thus  favoring  large  computational  jobs, while setting
SKLQRF to 1 eliminates the low-queue process reserve thus favoring
interactive  jobs.   The  default values include both reserves and
thus cause the partitioning to be fully operative.  The complement
of  the  default  values  eliminates  all  effects  of balance set
partitioning.
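
As an illustration, the reserve computation might look like the C
sketch below.  Only the functional form comes from this memo;  the
per-process  page  constant  and the core-size-dependent cap are
invented names.

    /* Reserve for one class (high-queue or low-queue).  nprocs is
       the number of processes in the class; per_proc and cap are
       invented constants (the cap depends on the size of physical
       core).  The corresponding flag (SKHQRF or SKLQRF) can
       eliminate the reserve entirely. */
    unsigned class_reserve(unsigned nprocs, int reserve_enabled,
                           unsigned per_proc, unsigned cap)
    {
        if (!reserve_enabled)
            return 0;                   /* reserve eliminated by flag */
        unsigned r = nprocs * per_proc; /* grows with class population */
        return r < cap ? r : cap;       /* up to a core-dependent max */
    }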


        SKBQEF - bit 24 (default = 1)
        SKBQRF - bit 25 (default = 1)

These two flags affect the handling of processes  in  the  balance
set.   Scheduling  of  processes in the balance set is done on the
basis of two queues which are distinct  from  the  normal  process
priority  queue.   The  balance  set  scheduler always selects the
first runnable process found by scanning these queues in the order
head  to  tail,  higher  priority  to  lower priority.  A selected
process is run for a short quantum and then  requeued  within  the
balance  set.   A  low-queue  process  (one  on the lowest regular
priority queue) is always placed on the lower priority balance set
queue.   A  high-queue  process  is  normally placed on the higher
balance set queue when  entering  the  balance  set,  but  setting
SKBQEF  to 0 will cause it to be placed on the lower queue.  After
having completed a balance set quantum, a  high-queue  process  is
normally  placed  on the higher balance set queue.  Setting SKBQRF
to 0 will cause it to be placed on the lower queue.  The release 3
monitor  as  shipped  effectively  has the default values 10, i.e.
high-queue processes enter the balance set on the high balance set
queue,  but  are  moved  to  the lower balance set queue after the
first quantum.  At least one site has felt this to be  a  decrease
in  responsiveness  from  release  2,  hence I propose to make the
default be as indicated, i.e.  high-queue  processes  continue  to
get balance set priority until they drop to low queue.  
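
The selection and requeue rules can be condensed into a short  C
sketch.   The  types and names here are invented for this note;
the monitor's actual queue representation differs.

    #include <stdbool.h>
    #include <stddef.h>

    struct process {
        struct process *next;   /* next process on this balance set queue */
        bool runnable;
        bool low_queue;         /* on the lowest regular priority queue */
    };

    /* bsq[0] is the higher priority balance set queue, bsq[1] the lower. */
    struct process *bsq[2];

    /* Always pick the first runnable process, scanning head to tail,
       higher priority queue first. */
    struct process *select_from_balance_set(void)
    {
        for (int q = 0; q < 2; q++)
            for (struct process *p = bsq[q]; p != NULL; p = p->next)
                if (p->runnable)
                    return p;
        return NULL;
    }

    /* Queue choice on entry (pass SKBQEF) or after a quantum (pass
       SKBQRF): low-queue processes always go to the lower queue;
       high-queue processes go to the higher queue only when the
       flag is 1. */
    int balance_set_queue(const struct process *p, int flag)
    {
        return (p->low_queue || flag == 0) ? 1 : 0;
    }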


        SKRQ1F - bit 26 (default = 1)

Setting this flag to 0 eliminates any normal use of queue  0  thus
giving  the effect of 3-queue scheduling with quanta of 2 seconds,
2 seconds, and 4 seconds.  (Queue 0 is 0.3 seconds.)  Under  mixed
loads,  this is mildly favorable to computational jobs but has one
other property which is important in certain  applications.   With
this  modification,  a process will be given at least 2 seconds of
computation  after  a  wakeup  before  possible  rescheduling  and
swapout, rather than 0.3 seconds for the normal queue 0.  This has
been shown to significantly reduce overhead in systems  which  are
running  many  identical  jobs which are interactive but which may
require substantial compute time for some  interactions  (e.g.   a
data  base  query/update  application).  In general, the effect of
this modification is to increase the  response  time  for  trivial
interactions,  but  decrease  it for those requiring more than 0.3
seconds of compute time.


        SKTTPF - bit 27 (default = 1)

Normally, a process which blocks for terminal input for  1  second
or  more  will be placed on the highest queue regardless of system
load.  This means that properly timed  terminal  interactions  can
give  a job more than its fair share of the machine.  Setting this
flag to 0 eliminates any special consideration for terminal  input
blocks and causes the normal requeue algorithm to be used.


        SKWCF - bit 28 (default = 0)

A process unblocked after a low-speed  IO  wait  is  placed  on  a
higher  priority queue than it was on when it blocked.  The longer
the wait, the higher the queue, and a long enough wait  will  give
the  highest  queue.   (See discussion in SCHED.MAC before NEWST.)
This procedure tends to  maintain  fairness  among  compute-bound,
short  interaction, and long interaction processes.  Normally, the
actual process wait time is divided by the short-term load average
so  as  to  require longer waits for high queue when the system is
more heavily loaded.  Setting SKWCF to 1  eliminates  this  divide
and  thus  causes  interactive jobs to be favored, particularly as
the load average increases.
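
A sketch of this heuristic in C.  The thresholds are invented for
illustration (the real computation is in SCHED.MAC, before NEWST),
but the shape matches the description:  longer waits earn higher
queues, and SKWCF = 1 skips the load-average divide.

    /* Returns the priority queue (0 = highest) for a process
       unblocking after a low-speed IO wait.  Thresholds are
       illustrative only. */
    int wakeup_queue(double wait_sec, double load_avg, int skwcf)
    {
        double effective = wait_sec;
        if (skwcf == 0 && load_avg > 1.0)
            effective /= load_avg;      /* longer wait needed when loaded */

        if (effective >= 1.0) return 0; /* long wait: highest queue */
        if (effective >= 0.5) return 1;
        if (effective >= 0.1) return 2;
        return 3;                       /* short wait: stay low */
    }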


STANDARD VALUES

The following five standard values  should  provide  a  sufficient
degree  of  control  for  most  systems.  The system administrator
should be  able  to  select  an  integer  on  the  scale  0  (most
interactive)  to  4  (most computational) to produce the indicated
flag settings.


Value   B18-28 (binary)

0       101 111 111 111
1       101 011 111 110
2       101 010 111 100 (system default)
3       101 000 100 100
4       000 100 100 000
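
Taken as data, the standard settings are a five-entry lookup.  In
the C sketch below each setting is written in octal, one digit per
3-bit group of the table row;  how the bits pack into the actual
monitor flag word is left aside.

    #include <stdio.h>

    /* Standard bias settings, indexed 0 (most interactive) through
       4 (most computational); values transcribe the table above. */
    static const unsigned bias_flags[5] = {
        05777,  /* 0: 101 111 111 111 (most interactive) */
        05376,  /* 1: 101 011 111 110 */
        05274,  /* 2: 101 010 111 100 (system default) */
        05044,  /* 3: 101 000 100 100 */
        00440,  /* 4: 000 100 100 000 (most computational) */
    };

    int main(void)
    {
        for (int v = 0; v < 5; v++)
            printf("bias %d -> %04o\n", v, bias_flags[v]);
        return 0;
    }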

All of the flags described above have been implemented at present,
but  some  may  have to be deleted or added in the course of other
planned work on scheduler controls, e.g.  group scheduling.


+---------------+
! d i g i t a l !   I N T E R O F F I C E  M E M O R A N D U M
+---------------+


                                        DATE:   5-May-78
TO:     TOPS20 Meeting List
                                        FROM:   Dan Murphy
                                        DEPT:   LCG S.E.
                                        EXT:    6356
                                        LOC:    MR1-2/E37

SUBJ:   Working Set Swapping


This document describes the implementation of  "Working  Set
Swapping"  in  TOPS20.   "Working  Set  Swapping" means that
process working sets are treated as units to be swapped into
or  out of core as determined by scheduler decisions.  It is
in contrast to "Demand Page  Swapping",  wherein  a  process
working  set  will  be loaded into core page by page as each
page  is  referenced.    For   additional   discussion   and
background,  see "WORKING SET SWAPPING" D. Murphy 29-Jul-76,
attached.


EXISTING ALGORITHMS

TOPS20 at present (through Release 3) operates as follows:

  1. When a process is selected for the balance set  by  the
        scheduler,  an  amount of physical core equal to the
        estimated size  of  its  working  set  is  reserved.
        Initially, only the PSB, JSB, UPT, and (in release 3
        only) stack page are swapped in  from  the  swapping
        device.  No attempt is made to run the process until
        these pages have all been read into core.  Then, the
        process  context  is  set up and the process started.
        Obviously, the first  page  referenced  is  that  to
        which the PC points.  If that page is not already in
        core, a  page  fault  immediately  occurs,  and  the
        process  is  blocked  while  the  page  is  read.  A
        similar  process  occurs  on  subsequent  references
        until all of the pages needed are in core.

  2.  A process leaves  the  balance  set  upon  any  normal
        dismiss  or  when selected by the scheduler in favor
        of  a  higher  priority  process.   No   pages   are
        explicitly   swapped   out  at  that  time  however.
        Asynchronously, when the count of available physical
        core  pages (NRPLQ) drops below a certain value, the
        core status tables are scanned and pages are written
        out  (GCCOR).   Generally,  the  pages  selected for
        writing during this scan are those which  are  owned
        by  processes  no  longer  in  the balance set.  The
        process use bits in CST0 serve  to  prevent  a  page
        being  removed  from  core  if  at least one process
        using it is in the balance set.  The scan of the CST
        is  continued  only  until  a  sufficient  number of
        pages, typically less than 1/4  of  available  core,
        have been removed.  Note also that any selected page
        which has not been modified according  to  CORMB  in
        CST0  is  not  actually  written  but  rather placed
        immediately on the replaceable queue.

        The result of this is that  process  pages  tend  to
        stay  in  core  for  some time after the process has
        left the balance set.  Hence, if it  should  reenter
        the  balance  set  within  that  time, the pages are
        immediately available and need not  be  read  again.
        Even  pages which have been selected will spend some
        time in the disk write queues  and  some  additional
        time  in  the  replaceable  queue  and may be reused
        during this time.  This saves both the elapsed  time
        that  the swapin reads would otherwise take plus the
        compute time that would  otherwise  be  required  to
        scan the CST, process the writes and reads, etc.

  3. The  management  of  the  process  working  set  occurs
        asynchronously  to  the  above.   A routine (XGC) is
        invoked periodically in process time  to  check  all
        pages   assigned   to   the  running  process.   If,
        according to the age field in CST0, the page has not
        been  referenced in the recent past (the working set
        "window"), it is swapped out  and  the  working  set
        size estimate reduced.

        A 512-bit (15-word) mask is maintained to  represent
        the  actual process working set.  This mask maps the
        512-page  process  address  space,  and  a   1   bit
        indicates  that the corresponding page is present in
        the working set.  The mask is not otherwise used  by
        the  monitor,  and  working set pages in the monitor
        address space are not represented at all.


REPRESENTATION OF THE WORKING SET

The  first  problem  to  be  solved  is  to  find  a  viable
representation   of  the  process  working  set.   The  mask
approach  described  above  does  not  include  the  monitor
address  space,  nor  does  it  gracefully expand to support
extended addressing.  Merely to map 32 user  sections  would
take  480  words.   Any  alternate  representation  must  be
processor-efficient for two specific operations:

  1.  Adding specific pages to the data  base  as  they  are
  referenced.

  2. Scanning to produce a list of all pages in the  working
        set for swapin or swapout.

These two requirements, plus any reasonable size limitation,
quickly  eliminate  all  simple  structures.   The  proposed
representation is as follows:  (those familiar with the KL10
cache   and   page   table   organization   may   note  some
similarities)

A virtual address consists of 23 bits plus the  user/monitor
bit.   For purposes of page management, the low-order 9 bits
(the word-within-page) are uninteresting,  and  we  have  15
which must be remembered in the data base.

                --- -------------------------------------
                ! ! !             !                     !
                ! ! !             !                     !
                ! ! !             !                     !
                --- -------------------------------------
                 U       SECTION          PAGE

The low-order 7 bits are used as an index to select an entry
in  a  128-entry  table.  Each table entry is 9 bits, and is
used to hold the remaining 8 bits of the 15-bit virtual page
address plus a valid bit.  

                !                               !
  INDEX (7)     !                               !
        !       !-------------------------------!
        !       ! ! !                      !    !
        !------>! ! !                      !    !
                ! ! !                      !    !
                !-------------------------------!
                !V U      SECTION (5)        PG !
                !                            (2)!
                !                               !

Hence, a given address is in the  data  base  if  the  entry
selected  by  its low-order 7 bits is valid and equal to the
user, section, and high-order page bits.  The list of  pages
in  the data base may be produced by scanning the table and,
for each valid entry, concatenating the data bits  from  the
table with its index.

There is a potential conflict, however, in that  all  virtual
addresses  with  the same low-order 7 page bits will attempt
to use the same table entry.  To reduce this problem, we use
four parallel tables, each formatted as above.  A particular
virtual address may be represented in any of the four  entries
at  its  index;   hence,  up to four addresses with identical
low-order bits may be remembered without conflict.

This data base is implemented as a 128-word table, with each
word  holding  the  four entries for that index value.  This
table is located in the  second  PSB  page  along  with  the
monitor process stack.
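
A C model of this structure follows.  It is a sketch only;  on
the actual machine the four 9-bit entries for each index share a
single 36-bit word.

    #include <stdint.h>
    #include <stdbool.h>

    #define WS_SLOTS 128            /* index positions (low 7 bits) */
    #define WS_WAYS  4              /* associative entries per index */
    #define WS_VALID 0400           /* valid bit of a 9-bit entry */
    #define WS_TAG   0377           /* user + section + high page bits */

    static uint16_t ws_cache[WS_SLOTS][WS_WAYS];

    /* vpage: the 15-bit quantity = user bit, 5-bit section, 9-bit page. */
    static unsigned ws_index(unsigned vpage) { return vpage & 0177; }
    static unsigned ws_tag(unsigned vpage)   { return (vpage >> 7) & WS_TAG; }

    /* An address is in the data base if some entry at its index is
       valid and equal to its user/section/high-page bits. */
    bool ws_lookup(unsigned vpage)
    {
        unsigned i = ws_index(vpage), t = ws_tag(vpage);
        for (int w = 0; w < WS_WAYS; w++)
            if ((ws_cache[i][w] & WS_VALID) && (ws_cache[i][w] & WS_TAG) == t)
                return true;
        return false;
    }

    /* The working set page list: concatenate each valid entry's data
       bits with its index. */
    int ws_enumerate(unsigned pages[], int max)
    {
        int n = 0;
        for (unsigned i = 0; i < WS_SLOTS; i++)
            for (int w = 0; w < WS_WAYS; w++)
                if ((ws_cache[i][w] & WS_VALID) && n < max)
                    pages[n++] = ((ws_cache[i][w] & WS_TAG) << 7) | i;
        return n;
    }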

This structure supports a maximum of 32 user sections and 32
monitor  sections  as described.  The parameters may readily
be changed however to support more sections.   For  example,
the  full  12-bit  sections  field  could  be  supported  by
increasing the field width from 9 to  16  bits  to  hold  12
section  +  2  page  +  1 user + 1 valid bits.  The division
between the index and the associative  fields  can  also  be
moved  to  vary  the  table length and number of associative
entries.  These modifications are  invisible  to  users  and
have  no effect on any external data structures;  hence they
may be done at any time.

Even with four  associative  entries,  some  conflicts  will
occur.   As  described above, the data base can hold any set
of addresses in a single user (or monitor) section  with  no
conflicts.  The possibility of conflicts arises with sets of
addresses from multiple  sections,  e.g.   user  0,  200000,
400000, monitor 200000, 400000.  The action taken when there
is no free entry to store a new address is to  shift  the  4
entries at that index position left by one entry thus losing
the leftmost  entry,  and  store  the  new  address  in  the
rightmost  entry.   If this occurs several times, the effect
is to have the most recently used entries in the data  base.
The  lost entries represent pages which are presumably still
in the working set but are not represented in the data base.
The  process  may continue to use them, but they will not be
seen for swapout, swapin, or XGC operations.  This will tend
to  reduce  the  efficiency  of  these  operations,  but not
significantly if the number of lost entries is small, and it
has  no  effect  on  the behavior of programs in any respect
other than time.  We should occasionally monitor the  number
of lost entries as an indication of  the  effectiveness  of  the
working set cache.
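
Continuing the sketch above (it assumes the ws_cache declarations
from the previous block), insertion with this shift-on-conflict
rule looks like:

    /* Add a page to the working set cache.  If all four entries at
       the index are valid, shift them left one position (losing the
       leftmost, i.e. least recently inserted) and store the new
       address on the right, so the most recently used entries
       survive. */
    void ws_insert(unsigned vpage)
    {
        unsigned i = ws_index(vpage), t = ws_tag(vpage);
        for (int w = 0; w < WS_WAYS; w++) {
            if ((ws_cache[i][w] & WS_VALID) && (ws_cache[i][w] & WS_TAG) == t)
                return;                        /* already present */
            if (!(ws_cache[i][w] & WS_VALID)) {
                ws_cache[i][w] = WS_VALID | t; /* free entry: use it */
                return;
            }
        }
        for (int w = 0; w < WS_WAYS - 1; w++)  /* conflict: shift left */
            ws_cache[i][w] = ws_cache[i][w + 1];
        ws_cache[i][WS_WAYS - 1] = WS_VALID | t;
    }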


BASIC OPERATIONS

PAGE FAULTS - The virtual address of each page fault will be
added to the working set cache.

XGC - XGC will be completely revised.  Rather than  scanning
the  CST  to  find  physical core pages owned by the running
process, it will scan the working set cache to find  virtual
addresses in the working set.  It will trace and resolve the
page pointers, and check the  age  field  on  physical  core
pages  found.   "Old"  pages  will  be  swapped out, and any
virtual addresses which do not resolve to "current" assigned
physical core pages will be removed from the WS cache.
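
A sketch of the revised scan, again on top of the cache model
above.  The helpers resolve_page, page_age, and swap_out are
hypothetical stand-ins for the page-pointer trace, the CST0 age
check, and the swapout path;  none of these names come from the
monitor.

    extern int  resolve_page(unsigned vpage);  /* physical page, or -1 */
    extern unsigned page_age(int ppage);       /* CST0 age field       */
    extern void swap_out(int ppage);           /* queue a swap write   */

    void xgc_scan(unsigned window)
    {
        for (unsigned i = 0; i < WS_SLOTS; i++)
            for (int w = 0; w < WS_WAYS; w++) {
                if (!(ws_cache[i][w] & WS_VALID))
                    continue;
                unsigned vpage = ((ws_cache[i][w] & WS_TAG) << 7) | i;
                int ppage = resolve_page(vpage);
                if (ppage < 0)
                    ws_cache[i][w] = 0;   /* no current page: drop entry */
                else if (page_age(ppage) > window)
                    swap_out(ppage);      /* "old" page: swap it out */
            }
    }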

SETMPG, etc.  - SETMPG and  MSETMP  will  remove  the  given
addresses  from  the  WS  cache  if  no  new pages are being
mapped.  This should reduce the load on the  WS  cache  from
various  monitor  functions  such  as directory mapping.  It
does not appear to be necessary to do the same for user PMAP
operations.

SWAPOUT - When a process leaves the balance set, it will  be
placed  on a list of processes awaiting swapout.  When GCCOR
runs, it will select processes from  this  list  (FIFO)  and
swap  each working set out.  The swapout of a working set is
similar to XGC and is implemented with most of the same code
accessed  via  a  second  entry  point, WSSWPO.  During this
operation, the working set cache is scanned, and non-current
pages are swapped out as with XGC.  A list is constructed of
current pages which  need  writing,  one  or  more  sets  of
sequential  swap  addresses  are assigned, and the pages are
given to the swap device driver as a group.

Unmodified pages are placed immediately on  the  replaceable
queue  and  are  not  written.   This  appears  to  be  more
efficient than re-writing the  pages  with  sequential  swap
addresses even though it does  not  achieve  as  much  swap
device locality.

Note that pages are not removed from  core  unless  core  is
needed,  and  then  only  as many processes from the swapout
list  are  removed  as  may  be  necessary  to   bring   the
replaceable  queue  up to the desired level.  This maximizes
the probability that a process may return to the balance set
while  its pages are still in core and avoid the overhead of
swapping them out and in again.

During the swapout scan, some pages may be discovered in the
"read completed" state.  This is generally because the pages
were read at swapin but  not  subsequently  referenced.   If
significant  process  time  has elapsed since the swapin (or
last XGC), or if the process  is  leaving  the  balance  set
because of a dismiss, the pages are assumed to have left the
working set.  Otherwise, we assume that the process has just
not had time to reference them and they are retained.

SWAPIN - When a process  is  selected  for  entry  into  the
balance  set,  its  PSBs,  UPT,  and  JSB  are swapped in if
necessary.  The process is removed from the swapout list  if
present,  and the fact of its presence remembered.  When the
overhead pages are all in core, the process begins execution
immediately  if it was on the swapout list since its working
set is still in core.  If not, the preload routine is called
to scan the WS cache and initiate reads on all pages.  Since
the WS cache was scanned and checked at swapout, only  valid
current  pages  should be loaded during this operation.  The
scan will quickly place all  necessary  reads  in  the  swap
device queues, so the driver can order them  for  efficient
transfer.  No explicit test is made for the completion of these
reads;   the  process is merely allowed to run and will page
fault on any pages not  yet  in  core.   After  one  or  two
faults,   sufficient   time  should  have  elapsed  for  the
requested reads to have completed.

GCCOR - Part of the GCCOR operation has been described above
in connection with swapout.  That operation will remove from
core all pages which are in process working sets,  but  will
miss  those  which  are  not  in  the  WS  cache  because of
conflicts.  It will also miss others which are  assigned  to
no  process  but  are locked in core, etc.  Therefore, after
each swapout of a particular process, the contents of  FKWSP
(number  of  assigned pages in CST3) will be checked.  If it
is still non-0, a scan of CST3 (using FKGC) will be done  to
remove  the  remaining  pages.   Also, the traditional GCCOR
scan will be performed  occasionally  to  remove  unassigned
pages  and  update  the  balance set share count.  This scan
will not look at any pages assigned  to  processes  however,
since these should always be handled by WSSWPO or FKGC.

PAGE AGING - A swapout-swapin cycle will lose all of the age
information   for   the  working  set  pages,  although,  as
explained above, all non-current pages will  be  removed  at
swapout.  Since a swapout is equivalent to a full XGC cycle,
the next XGC will not occur until significant  process  time
after the subsequent swapin.  This time is comparable to the
working set window, so the process will have referenced  all
pages which are still in the working set.  The XGC scan will
remove any read-complete (but unreferenced) pages found.  In
principle,  it  would  be of some value to retain all of the
age information and write it onto the swap device along with
the working set pages, but in practice it does not appear to
be worth the processor time and storage.


+---------------+
! d i g i t a l !   I N T E R O F F I C E  M E M O R A N D U M
+---------------+

TO:     Dave Braithwaite
        Jim Flemming
        Dave Kiarsis
        Norm Valentine
        TOPS20 Monitor Memo List

                                       DATE: July 17, 1978

                                       FROM: Arnold Miller

                                       DEPT: DEC20 S. E. Dept.

                                       LOC:  MR1-2   EXT:6473  

                                       PDM: ASM-78-001-01



SUBJ: TOPS20 Release 4 Scheduler Changes

This is a brief description of the monitor changes that will be
made  to  support  the scheduler controls project.  The changes
described herein represent  a  baseline  from  which  to  begin
testing   and   tuning   of   the   scheduler   controls.   The
user-interface to the features, including EXEC commands and new
JSYSes, will be specified in a subsequent document.

                New scheduling queues


Two new queues have been added to the scheduler.   In  previous
releases  of  the monitor, the scheduler maintains four queues.
Every process resides on one of the  queues,  representing  its
activity  in  the  recent  past.   Queues zero to two represent
"interactive"   activity    and    queue    three    represents
"compute-bound" activity.  The transitions among the queues are
controlled by several factors, including the amount of continuous
computing, dismissing, and several heuristics.

A need has been identified for reserving two  scheduler  queues
representing  the  two  extremes  of priority.  A high priority
process needs to always be given special consideration so  that
its  response  and  service  is prompt and predictable.  By the
same token, a low priority process must not interfere with  the
response and service of other "normal" processes on the system.
The two new queues are zero and five.  All  "normal"  processes
reside  on  one  of the queues from one to four inclusive.  The
transition rules for these "normal" queues are identical to the
rules for the queues in previous monitor releases.

Queue zero is the "high priority" queue.   Besides  tending  to
reside at or near the top of the GOLST when runnable, processes
on queue zero always enter the balance set on the  balance  set
high  priority  queue.   Note that a process on queue zero will
not run to the exclusion of all other processes in the  system,
but  will tend to be run more often than the processes that are
selected by the round-robin scheduler.

Queue five is the low priority queue.  Processes on this  queue
tend  to  reside  at  or  near  the  bottom  of  the GOLST when
runnable.   When  a  low  priority  process  is  selected   for
inclusion in the balance set, it is treated as if it resided on
one of the "normal" queues.

The SJPRI and SPRIW JSYSes are used to set a job or  a  process
on one of the "special" queues.


                Class Scheduler

The class  scheduler  is  an  extension  to  the  "CPU  runtime
guarantee"  that  exists  in  previous  monitors.  The existing
feature allows the system administrator to specify that a given
job  is to receive a certain percentage of the processor.  This
mechanism is limited to specifying one job per  guarantee,  and
the  guarantee  is  over a short interval with no regard to the
history of the  job.   The  major  limitation  of  the  present
facility  is  that it is not possible to designate a "class" of
jobs as having a guarantee.  Normally, this is what is required
and cannot be achieved with the present capability.

The monitor support for the class scheduler allows  any  number
of  jobs  to be placed in a class.  The monitor will support at
least eight classes, and will allow the "guarantee"  per  class
to  be assigned by the user.  The guarantees are in percentages
just like the present mechanism.  Percentages can  be  assigned
in units of .5% from .5% to 100%;  however,  the  sum  of  all
assigned percentages may not exceed 100%.

The class scheduler code uses  the  assigned  class  and  class
percentage  to  influence  a  process's  priority.   That is, a
process that is ahead of its percentage will tend to  have  its
priority  reduced  and  a process that is behind its percentage
will  tend  to  have  its  priority  increased.   Therefore,  a
process's  relation  to its "target" percentage is a new metric
that becomes part of the heuristic that dictates the transition
rules among the scheduler's queues.  Following is a description
of this new metric and how it is employed.

The scheduler periodically computes a job's "utilization" as  a
decaying  average  of the CPU time used by the processes of the
job.  The computation performed is:

        U(I+1) = U(I)*e^(-t/C) + F*(1 - e^(-t/C))

where:

        U(I) is the current process utilization


        F is the fraction of the CPU used by the process in the
        recent interval, "t"


        t is the time since the last computation of U


        C  is  the  "decay"  interval.   After  time   C,   the
        utilization decays by a factor of e (to about 37%,  i.e.
        1/e, of its value).  This
        number represents the  amount  of  "history"  that  the
        monitor  uses in determining a process's behavior.  The
        number is chosen to insure as  uniform  a  behavior  as
        possible and may be set by the user.
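
In code, the update is a one-line decaying average.  A  C
transcription (u, f, t, and C as defined above):

    #include <math.h>

    /* U(I+1) = U(I)*e^(-t/C) + F*(1 - e^(-t/C)).  With t = C and
       F = 0, u falls by a factor of e, i.e. to about 37% of its
       value. */
    double update_utilization(double u, double f, double t, double C)
    {
        double decay = exp(-t / C);
        return u * decay + f * (1.0 - decay);
    }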

The computation is performed periodically for  all  jobs.   The
utilization  function for each of the classes is the sum of the
utilizations of the jobs in the class.  The class  utilization,
CU,  is  used  to  compute  the "distance" of each job from its
ideal or target utilization.  The "distance" is the metric that
is  used in adjusting process priorities.  The distance, CD, of
each job is expressed as:


                CD = (CP/N-CU)/(CP/N)

where:

        CD  is  the  "distance"  of  a  job  from  its   target
        utilization

        CP is the class's ideal utilization.  CP  is  specified
        by the user when the class is defined.

        N is the "population" of the class.  N is currently the
        number of logged-in jobs belonging to the class.

This  function  is  designed  to  be  nearly   exponential   in
character.    That  is,  CD  increases  rapidly  as  the  job's
utilization becomes insufficient and decreases rapidly  as  the
job's utilization increases.  With these characteristics, CD is
an accurate measure of the priority of each job.
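
Transcribed into C (the guard against an empty or zero-target
class is my addition):

    /* CD = (CP/N - CU)/(CP/N): positive when the job's class is
       behind target (priority raised), negative when ahead
       (priority lowered). */
    double class_distance(double cp, double cu, int n)
    {
        if (n <= 0 || cp <= 0.0)
            return 0.0;             /* empty or zero-target class */
        double target = cp / n;     /* per-job share of class target */
        return (target - cu) / target;
    }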

The priority of a process at any time  is  a  function  of  the
process's queue number and of its job's CD value.  The priority
is expressed as a single number whose value is used to position
the  process  on  the  GOLST.   Also, CD is used to compute the
process's balance set quantum.

The intention of the class scheduler is to create a  number  of
independent  scheduling  groups  that  behave  according to the
scheduler's micro-controls.  That is, within  each  class,  the
jobs  compete with one another as if they were running alone on
their own computer that is CP percent as  fast  as  the  actual
processor.   Each class is allowed to consume CP percent of the
processor by running its jobs in an  order  prescribed  by  the
scheduler's micro-controls.

In summary, the class scheduling action can be expressed  as  a
simple  set  of  changes  that  creates  individual  scheduling
partitions according to the class  guarantees.   These  changes
are:

        1.  The priority function of each process  is  affected
        by its job's CD function.  The application of CD to the
        priority is to raise a process's priority if its job is
        behind  its  "target"  and to lower the priority if its
        job is ahead of "target".

        2.  The balance set quantum assigned to  a  process  on
        the balance set queue is also affected by the process's
        associated CD.  The effect is to give  proportionately
        higher  quanta to processes who belong to behind target
        jobs  and  to  give  proportionately  lower  quanta  to
        processes  who  belong  to  ahead  of target jobs.  The
        result of this is to favor running processes  that  are
        behind  target  over  processes that are on or ahead of
        target.


As a final note on the class  scheduler,  something  should  be
said  about  "windfall".   Windfall  is  the  term  adopted  to
describe unused portions of the machine.  If a particular class
is  unrepresented  (i.e.   has  no  logged  in  jobs),  or  has
insufficient  jobs  to  use  its  share  of  the  machine,  the
remaining CPU is windfall.  The scheduler algorithm is designed
to distribute windfall proportionately over the active classes.
That  is, higher guarantee classes will receive proportionately
more of the available windfall than  lower  guarantee  classes.
It  is  assumed that a user-mode control program will be doling
out windfall according to administrative fiat and will not  rely
on the monitor's allocation of unused class shares.
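
A C sketch of the proportional rule (the eight-class count is
from this memo;  representing targets as fractions that sum to at
most 1.0 is my simplification):

    #define NCLASSES 8

    /* Distribute unclaimed share over active classes in proportion
       to their guarantees, so higher guarantee classes receive
       proportionately more windfall. */
    void distribute_windfall(const double target[NCLASSES],
                             const int active[NCLASSES],
                             double effective[NCLASSES])
    {
        double active_sum = 0.0;
        for (int c = 0; c < NCLASSES; c++) {
            effective[c] = 0.0;
            if (active[c])
                active_sum += target[c];
        }
        if (active_sum <= 0.0)
            return;                     /* no active classes */
        double windfall = 1.0 - active_sum;
        for (int c = 0; c < NCLASSES; c++)
            if (active[c])              /* target plus its share of windfall */
                effective[c] = target[c] * (1.0 + windfall / active_sum);
    }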


                Other Scheduler Changes

In trying to address the problems  expressed  by  customers  at
DECUS,  we  will be adding numerous data collecting routines to
the monitor in release 4.  A large  number  of  these  routines
will  be  in  the scheduler as the progress of a program in the
system depends heavily on its treatment by the scheduler.   The
model for these routines is a SNOOP program now being developed
that  gathers  information  from  release  3  and  release   3A
monitors.

Finally, the attendees of the session on the scheduler controls
at  DECUS  requested  that  the  scheduler  be  able to provide
guaranteed and consistent service.  This implies not only class
guarantees,  which  we have addressed with the class scheduler,
but also assurances that a user or class of users  receives  no
more  than  a certain level of service.  In fact, the consensus
was that it was better to throw away machine cycles rather than
allow  certain  programs  to  run beyond their class guarantee.
While this is contrary to our  beliefs  and  philosophies,  the
release  4 scheduler allows individual jobs to be designated as
not receiving more than their class share of the machine.   The
specific  algorithm  used  is  that  any  process  of  a job so
designated will not be run if its class is  now  ahead  of  its
target  utilization  and the process is compute-bound (i.e.  on
scheduler queues four or five).


+---------------+
! d i g i t a l !   I N T E R O F F I C E  M E M O R A N D U M
+---------------+

TO:     Monitor Memos List
        Don Allen (BBN)
        Dave Braithwaite
        Norm Valentine

                                       DATE: Sept. 15, 1978

                                       FROM: Arnold Miller

                                       DEPT: DEC20 S. E. Dept.

                                       LOC:  MR1-2   EXT: 6473 

                                       PDM: ASM-78-002-03


SUBJ: Scheduler Controls User Interface


                        Abstract

Two schemes for setting up a class scheduling  environment  are
presented.   The  first uses the accounting system of TOPS20 to
define the valid classes for users.  The second uses the  newly
implemented  GIVOK/GETOK mechanism to define the valid classes.
New commands for the n-CONFIG file are  presented  that  define
class percentages and that specify if class scheduling is to be
used.  New TOPS20 command language commands are  not  presented
in  this  document but are alluded to in several places.  A new
TOPS20 monitor call is defined for setting and reading the  new
scheduler controls parameters.


                        Glossary

Bias Controls           A set of indicators that influence  the
                        scheduler's decisions.  In general, the
                        bias controls make the scheduler  favor
                        either  compute  bound  or  interactive
                        programs.  The bias controls apply only
                        if  the  class  scheduler  is not being
                        used.

Class                   A newly defined unit of scheduling.   A
                        class  consists  of an ideal percentage
                        of the processor and  a  collection  of
                        users.   A class may also be identified
                        by a set of valid accounts.

Class percentage        The amount of the  processor  that  the
                        members  of  the class are entitled to.
                        This  is  also  known  as  the  "target
                        percentage".

"Dregs" queue           As described in reference [2], this  is
                        a  special  queue  whose  occupants are
                        allowed to run only  when  no  normally
                        scheduled  processes  are  available to
                        run.


Job Utilization         The average  amount  of  the  processor
                        that   the   job   has  accrued.   This
                        quantity is  computed  periodically  by
                        the  scheduler and is used to determine
                        the priority of the job.

Policy program          A user  mode  program  responsible  for
                        handling   GETOK   requests   from  the
                        monitor.  This  program  may  handle  a
                        wide range of requests and decisions at
                        the system manager's whim.  One set  of
                        requests  that  it  may  be directed to
                        handle is the assignment of classes  to
                        jobs.

Windfall                The amount of the processor not claimed
                        by  active classes.  Windfall occurs if
                        one or more classes has  no  logged  in
                        jobs  or  if one or more classes cannot
                        use  all  of  its   class   percentage.
                        Windfall  may  be  manifested either as
                        idle time or as excess class percentage
                        awarded to the active classes.

Windfall Policy program A program that uses a complex series of
                        input  data  to  determine which groups
                        are  to   receive   portions   of   the
                        available  windfall.   This program may
                        be distinct  from  the  Policy  program
                        that   handles   GETOK   requests  from
                        TOPS20.

0.0 References

1.  Scheduler Controls for Interactive/Computational Bias
Dan Murphy
April 11, 1978
Document available from author

2.  TOPS20 Release 4 Scheduler Changes
Arnold Miller
July 17, 1978
PDM:  ASM-78-001-01


1.0 Introduction

The TOPS20 class scheduler allows the system manager to  divide
his/her  computer  into several individually scheduled classes.
Each class  is  assigned  a  percentage  of  time  to  use  the
processor.  The scheduling of other system resources (e.g.  I/O
channels) is unaffected by the class scheduler.  The number  of
users in a class is also determined by the manager.

This scheme parallels the "project control"  concept  that  was
introduced  with  release  3  of  TOPS20.   The class scheduler
allows the system administrator to  allocate  portions  of  the
processor  to  individual  groups.   These  groups  can then be
administered and controlled  by  individual  project  managers.
Furthermore  it  allows  the  system  administrator to set up a
separate class for batch jobs so that they  can  be  controlled
and scheduled independently of time-sharing jobs.

The basic element in the class scheduler is a class.   A  class
consists  of a percentage (or target) and a number of users.  A
class's percentage indicates the ideal amount of the  processor
that  it  is to receive.  Many factors can affect the monitor's
ability to provide this level of service such as:

        1.  The class does not make sufficient demand  for  the
        processor.  This can be caused by two conditions.  One,
        the class  is  under-represented;   i.e.   its  current
        population  cannot  use  the  amount  of  the processor
        allocated to it.  Two, the class's jobs require another
        resource  (such  as  disk  I/O)  that is not available.
        This latter case might be solved by partitioning all of
        the  system's  resources (I/O, memory and the processor)
        according to the class targets.   Partitioning  all  the
        system  resources  by a common formula in this way is a
        consideration for a future release of TOPS20.

        2.  The class's demand exceeds its target and there are
        excess  shares available on the system.  This can arise
        in several ways:  the administrator has  not  allocated
        all of the processor to existing groups, or one or more
        groups is not using all  of  its  shares  (see  item  1
        above).   The  occurrence  of  unused  portions  of the
        processor is known as windfall.  In order  to  allocate
        windfall  in  a controlled fashion, a specially written
        program must exist to determine the amount of available
        windfall, determine each active group's needs (this may
        be a function of time)  and  distribute  the  available
        windfall  accordingly.   If no such program is present,
        the   monitor   will   either    distribute    windfall
        proportionately among the active classes or not at all.


2.0 The User Interface


2.1 New monitor call SKED

The monitor's data base can be read and modified by means of  a
new  monitor call, SKED.  This monitor call allows a program to
set and read all of the items affecting  the  class  scheduler.
Also,  it  allows  a program to set and read the "bias control"
settings that influence scheduling  decisions  when  the  class
scheduler is not being used.  The calling sequence is:



                SKED

AC1/    function
AC2/    address of function-dependent argument block where  the
        first word is the length of the block.


Returns:        +1 always
Illegal Instruction trap on error.

The   available   functions   include   both   privileged   and
non-privileged operations.

Functions (specific offsets and formats will  be  specified  in
the near future):

 .SKRBC         Read  bias  control  setting.   This   function
                returns  a single number indicating the current
                bias control setting.  AC2 points to a word  to
                receive the value.


 .SKSBC         Set bias control setting.  This  function  sets
                the  bias control to the specified value.  This
                function is privileged.

 .SKSCS         Set the percentage or target of a  class.   The
                arguments  are  a  floating-point number in the
                range 0 to 1.0 and a class number.  The sum  of
                the  defined  class  percentages may not exceed
                1.0.  This function is privileged.

 .SKRCS         Read class parameters.  This  function  returns
                the  class  percentage  and utilization for the
                designated class.

 .SKSCJ         Set the class of a job.  This function takes  a
                pair of numbers, the job to set and the desired
                class.  If setting the  class  of  the  calling
                job,  this  function  is  not  privileged.   If
                setting  the  class  of  another  job   it   is
                privileged.   In  either  case, the job must be
                allowed to reside in the selected  class.   The
                calling job may be designated as -1.

 .SKRJP         Read all of the class scheduler parameters  for
                a given job.  This function returns:  the class
                the job is in, the job's  current  utilization,
                the   job's  target  utilization,  the  class's
                percentage, and the class's utilization.

 .SKRRA         Read class run or load averages.   The  monitor
                maintains  the  load  averages  for each active
                class and will make them available through this
                function.  The returned data are four words per
                class:   the  class  number,  the  one   minute
                average,  the  five  minute  average,  and  the
                fifteen minute average.

 .SKBCR         Read the class setting for batch  jobs.   A  -1
                indicates  that  there  is no special class for
                batch jobs.

 .SKBCS         Set batch  class.   This  function  allows  the
                specification  of  the class in which all batch
                jobs will run.  A -1 indicates no special class
                for  batch  jobs.  If this is set, it overrides
                the valid classes for any user running a  batch
                job.

 .SKBBG         Run all batch jobs on the "dregs" queue.   This
                function applies only if the class scheduler is
                not being used.  The argument is either a  zero
                (clear)   or  a  non-zero  (set).   A  non-zero
                indicates that batch jobs should be run on  the
                "dregs" queue.

 .SKDDC         Set system class default.  The argument is  the
                class  to  be  used  for all users that are not
                otherwise assigned classes (either  within  the
                accounting  data  base  or  within  the  policy
                program's data base).

 .SKICS         Start or stop  the  class  scheduler.   If  the
                class scheduler is being started, this function
                also specifies the mode in which  class-to-user
                assignments are made and whether windfall is to
                be allocated to the active classes or  withheld
                from  the  active  classes.   See the following
                discussion for details.
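
To make the calling convention concrete, here is a C model of the
argument-block discipline.  Everything in it is a placeholder:
the memo defers the real offsets and formats, so the function
codes, block layout, and behavior below are assumptions for
illustration only.

    #include <stddef.h>

    enum sked_fn { SK_RBC, SK_SBC };    /* stand-ins for .SKRBC/.SKSBC */

    static int current_bias = 2;        /* system default bias setting */

    /* AC1 carries the function, AC2 the block address; word 0 of
       the block is its length.  The real call returns +1 always and
       traps on error; this model returns 0/-1 instead. */
    int sked(enum sked_fn fn, int argblk[])
    {
        if (argblk == NULL || argblk[0] < 2)
            return -1;                  /* block too short: error */
        switch (fn) {
        case SK_RBC:                    /* read bias control setting */
            argblk[1] = current_bias;
            return 0;
        case SK_SBC:                    /* set bias control (privileged) */
            current_bias = argblk[1];
            return 0;
        }
        return -1;
    }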



2.2 Setting a job's class

A job's class is set in one of two ways:   by  the  account  to
which  the  job  is charging its use of the processor or by the
GIVOK/GETOK policy program.  The selection  is  made  when  the
.SKICS  of  SKED  is  executed  to  begin class scheduling.  If
accounts are being used, then the accounting data base contains
the  default  class  for  each  account.  When a job selects an
account (either by logging in or by  executing  a  CACCT),  the
accounting  data base provides the class for the job.  There is
only one class for each account in the  accounting  data  base.
Therefore  a  job's  class  may  not be changed with the .SKSCJ
function of SKED.  Only by changing the account will the  job's
class be changed.

If the GIVOK/GETOK policy  program  is  being  used  to  select
classes,  then  each  time  the .SKSCJ function is executed the
monitor  will  request  that  the  policy  program  verify  the
specified class.  Furthermore, in this mode,  the  default  class
assigned at LOGIN must also be specified by the policy program.

Regardless of the manner in which class  assignments  are  made
and  verified,  each time a job changes its class a USAGE entry
will be made indicating the end of a session.

The two methods are  provided  so  as  to  be  as  flexible  as
possible.  In an installation using account validation, it will
be most convenient to associate accounts  with  classes.   This
way  one  static  data  base  suffices  for  all  of  the
administrative  requirements.   In  an  installation  in  which
account  validation  is  not  being  used  or in which the user
population changes, the GIVOK/GETOK  mechanism  may  be  called
for.


2.3 Establishing a policy data base

2.3.1 Using Account Validation

2.3.1.1 ACTGEN changes

Since  there  are  two  modes  of   establishing   job-to-class
correspondence,  there are two methods for establishing a class
scheduling data base.  If accounts are being used  to  identify
valid  classes,  then the accounting data base must contain the
valid class for each account.  This is done with a  new  switch
on the ACCOUNT command.  The switch, CLASS, specifies the class
that is valid for a particular account.  Therefore:

                ACCOUNT 341/CLASS:3

would declare that class  3  is  the  valid  class  when  using
account  341.   ACTGEN  will  place  this  information  in  the
system's accounting data base for use by the  class  scheduling
routines.

2.3.1.2 n-CONFIG commands

For any account that is not given a  valid  class,  the  system
default class (class 0) will serve as the valid class.

Once the accounting data base is properly  updated,  the  class
scheduler is invoked via the n-CONFIG file.  Three new commands
are provided:

BATCH-CLASS n
ENABLE CLASS-SCHEDULING (using) ACCOUNTS (windfall to be) m
CREATE (class) n (to have percentage) k


The first command specifies the class  to  be  used  for  batch
jobs.   This  may  be  omitted,  in  which  case batch jobs are
treated identically to time-sharing jobs.


The second command directs the system to do  class  scheduling.
The  argument  ACCOUNTS indicates that the accounting data base
contains the class assignments.  The last argument m is one  of
the   following   keywords:   "WITHHELD"  or  "ALLOCATED".   If
windfall is WITHHELD, no class will be  allowed  to  accumulate
more  than  its  target  percentage of the processor.  This may
imply that the processor will be idle even though some jobs are
ready to run.  The default is ALLOCATED.


The last command defines the class percentages and may  be  given
once per class.  Any class not specified in a CREATE  command  is
assumed to have a 0% share of the processor.  Jobs  in  such  a
class  will  occasionally be allowed to run, but their relative
priority is low (note that if windfall is WITHHELD, any job  in
a  0%  class  will  not  be able to run a compute-bound program
since all of the compute-bound use of the  processor  by  a  0%
class is derived from available windfall).

2.3.2 Using a policy program

The alternate way of assigning classes  to  jobs  is  with  the
GIVOK/GETOK  policy  program.   Using  this  scheme, the .SKSCJ
function of SKED will perform a GETOK function  to  verify  the
class  change.   Also,  the  LOGIN JSYS will execute a GETOK to
obtain the default class for the newly logged-in job.  As  with
the  scheme employing accounts, each class change will generate
a USAGE file entry.  Given the time constraints of  release  4,
it  is unlikely that we will produce a GIVOK/GETOK program that
implements a class assignment scheme.  However, this paper will
attempt to outline a method of establishing such a data base as
follows:

2.3.2.1 n-CONFIG commands

To begin using class scheduling, the  n-CONFIG  command  is  as
follows:

ENABLE CLASS-SCHEDULING (using) POLICY-PROGRAM (windfall to be)
m

If this method is selected and no policy program is running  on
the system, all jobs will be placed in the same class (class 0)
and no .SKSCJ functions will be allowed.  See  section  2.3.1.2
for a more complete description of the n-CONFIG commands.


3.0 Windfall Allocation

As noted earlier, in the absence  of  any  controls  the  class
scheduler    will    distribute    the    available    windfall
proportionately to the active  classes.   However,  it  may  be
desirable  to  alter  this  scheme so that windfall is given to
classes that are "needy" or "favored".  A "needy" class is  one
that has not used its share of the processor in the recent past
and that requires the additional percentage now.   A  "favored"
class  is  one singled out by the system administrator as being
the recipient of a large portion  of  the  available  windfall.
Both  of these examples have valid application.  The example of
the needy class would be a consideration in a system that sells
shares of its machine and wants to  ensure  that  over  one
billing period (say a month) each of its subscribers receives
its exact percentage of the available computing.  A
favored class might  be  one  that  consists  of  people  doing
critical  work  that  by its nature takes precedence over other
projects using the computer.

From the above description, the components of a windfall policy
maker  are:   medium  or  long-term  usage  data, favored class
status, and a set of heuristic functions  that  prescribe  the
desired  long-term behavior of the system.  Note that the class
scheduler  implements  relatively  short-term  policy  (several
minutes at most) and is oblivious to the needs and requirements
of  long-term  business  strategy.   Therefore  an  independent
windfall  policy  program  is  required  that  can  digest  the
components, apply the desired functions, and adjust the initial
class percentages so as to produce a "fair" scheduling policy.

At present this is considered beyond the scope  of  release  4.
However,   there   is   nothing   to   prevent  customers  from
implementing their own policy programs and  from  our  learning
from their experiences.

4.0 Additional Features

If the class scheduler is not being used, one may  still  elect
to limit the scheduling of batch jobs.  The n-CONFIG command:

                BATCH-BACKGROUND

specifies that all batch jobs will run on  the  "dregs"  queue.
If  it is required that batch jobs receive favorable treatment,
the class scheduler must be used.

Also, if the class  scheduler  is  not  being  used,  the  bias
control value may be set by the n-CONFIG command:

                BIAS-CONTROL n

where n is a decimal number representing  the  desired  setting
(see reference [1] for the values of n).


               THOUGHTS ON SWAPPING STRATEGY
                   Dan Murphy - 20-Oct-78

We traditionally think of forks as being  in  one  of  three
major states:  1.  blocked; 2.  on GOLST but not in  balset;
3.  on GOLST and in balset.  There is another  state  which
exists   but   which   the  scheduler  does  not  explicitly
recognize, that is, not in balset but working set  in  core.
Within  that  state, there are two sub-states, on GOLST and
BLOCKED.

This state exists in fact for some time after the  fork  has
nominally  been  removed  from  the  balset.   There  is the
complementary state, in balset but working set  not  yet  in
core, which may also be of interest.

The routines AJBALS, LOADBS, and REMBSJ/REMBSF  control  the
movement  of  forks  into  and  out  of  the  balset.   They
implicitly  assume  that  their  decisions  are  implemented
instantly.   E.g.,  AJBALS  will  remove  one  fork and load
another into the space "released" by the first in one  pass.
The  movement  from  the  state  of  BSBAR+WS  (BS = fork in
balset, BSBAR = fork not in balset, WS = fork working set in
core,  WSBAR  = fork working set not in core) to BSBAR+WSBAR
is implicitly controlled  by  GCCOR.   There  is  increasing
evidence that the decisions relative to this transition have
a  major  impact  on  system   performance   (response   and
throughput).   What  happens here determines the probability
that a fork coming into the balset will find some or all  of
its working set pages in core already.
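
The four combinations of these states,  in  the  notation
just defined, are:

    BS + WS          in balset, working set in core
    BS + WSBAR       in balset, working set not yet in core
    BSBAR + WS       out of balset, working set still in core
    BSBAR + WSBAR    out of balset and swapped out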

We know that this probability is  already  quite  high  from
available  data.   For  example, FRIDAY output for a typical
1-minute period on 2102  indicates  that  the  average  fork
working  set  size  was  54  pages  and  that there were 900
wakeups  during  the  interval.   Hence,  if  every   wakeup
required  that  the  full  working  set be swapped in, there
would have been 48,600 disk  reads  during  the  period,  or
about  810 pages/second.  The actual reported disk read rate
was 68 pages/second.
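
In other words, the implied read  traffic  was  only  about
68/810, or roughly 8 percent, of what  full  working-set
swapins on every wakeup would have produced;  most  pages
were found in core already.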

Over the years, various changes  to  GCCOR  have  been  made
which  improved  performance.  It appears that these all had
the effect of increasing this probability.

1.  Originally, GCCOR scanned all of the CST and  threw  out
everything  not needed.  It was changed (circa 1972) to stop
after collecting some reasonable number of pages.

2.  Early in the working set swapping project, I implemented
a  queue  of  forks which had left the balset.  When swapping
out working sets, GCCOR would  use  this  queue  on  a  FIFO
basis.

3.  Recently, I tried doing GCCOR on the basis of  a  global
value in the age field of CST0.  This produced a significant
improvement in  some  cases  relative  to  the  working  set
swapout list.  However, it tended to remove pages from forks
still in the balset which hadn't been run for a  while  thus
making it even more difficult for them to run.  

I am now looking for a procedure which will do a better  job
of picking the right fork to swap out (and swap in) while at
the  same  time  generally   reflecting   scheduler   policy
decisions and neither thrashing nor wasting memory, etc.  It
occurred to me at a seminar on paging techniques  last  week
that  one could use the same kinds of algorithms to select a
whole working set for removal  as  have  been  traditionally
used  for  page replacement, e.g.  LRU, FIFO, etc.  That is,
for any fork with a working set now in core, there  is  some
probability  that  that  fork  will  want to run in the next
interval T, and if we have to swap out  a  fork,  we  should
pick the one with the lowest probability.

Assuming we have a reasonable probability function, the next
question  is  how  to  make  use of it in relation to AJBALS
decisions.  In particular, what  happens  under  heavy  load
when there are always more forks wanting to run than can fit
in core.  At present under these conditions, GCCOR  is  busy
throwing  out  everything  possible  and  the probability of
finding a page still in core is non-0 only  because  of  the
finite time it takes the disk to write the changed pages and
the time for clean pages  to  percolate  up  the  replaceable
queue.

Here is the basic operation of this approach:

1.  A swapin/swapout manager (which I will call GCCOR  since
it  now  does  the  swapout  part  of  that  function) would
periodically perform the following algorithm (a code  sketch
appears after item 2 below):

     1.  IF there are no forks waiting for swapin THEN done,
     ELSE

     2.  IF there is sufficient  available  memory  for  the
     best  fork  (i.e.   fork  size  .le.   NRPLQ+IOIP) THEN
     swapin the fork and go back to step 1, ELSE

     3.  compare the probability value of the  working  sets
     in  memory  to  that  of the working sets waiting to be
     swapped in.

     4.  IF the minimum P in memory is less than the maximum
     P  not  in memory THEN swapout the working set with the
     minimum P and go back to step 1, ELSE done.

2.  The routine AJBALS  works  generally  as  it  does  now,
except  that it does not actually cause swaps into or out of
memory; it merely determines at any instant what the balance
set  should be.  The probability value for each fork working
set is then maintained by integrating BS over time (BS  =  1
if  fork  in balset, 0 if fork not in balset).  Keep in mind
that being in the balset or not does not instantly determine
whether  the  fork's  working  set  is in memory or not, nor
whether the fork can be selected for running.  A fork may be
selected for running IFF its working set is in  memory  and
it is runnable (on the GOLST).  Unlike at  present,  a  fork
may  be run even if it is not in the balset!  For example, a
fork which blocks for a short time and then wakes up  before
its  working set has been swapped out may be run immediately
even if AJBALS does not decide to put it  into  the  balset.
If,  because  of  scheduling  policy,  it remains out of the
balset, its probability value will  decrease  to  the  point
where GCCOR will swap it out and it will be unable to run.
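
The following is a minimal sketch in C of the swap  manager
loop and the probability integration described above.   The
fork table, the selector routines, and  free_pages()  are
illustrative assumptions; only the control flow follows the
algorithm given in items 1 and 2.

    #define NFORKS 128

    struct fork {
        int    ws_in_core;   /* working set now in memory */
        int    wants_swapin; /* on GOLST, working set out */
        int    in_balset;    /* BS: per AJBALS decision */
        int    ws_size;      /* working set size, pages */
        double p;            /* prob. of running in next T */
    };

    struct fork fk[NFORKS];

    extern int  free_pages(void);  /* i.e. NRPLQ+IOIP */
    extern void swap_in(int i);
    extern void swap_out(int i);

    /* Maintain the probability value by integrating BS over
     * time: p moves toward 1 while the fork is in the balset
     * and toward 0 while it is out.  The time constant K is
     * a policy choice; it determines how soon GCCOR actually
     * swaps after AJBALS changes its mind. */
    void update_probabilities(double dt)
    {
        const double K = 0.1;
        int i;
        for (i = 0; i < NFORKS; i++) {
            double bs = fk[i].in_balset ? 1.0 : 0.0;
            fk[i].p += K * dt * (bs - fk[i].p);
        }
    }

    static int best_waiting_fork(void)   /* max p, WS out */
    {
        int i, best = -1;
        for (i = 0; i < NFORKS; i++)
            if (fk[i].wants_swapin && !fk[i].ws_in_core &&
                (best < 0 || fk[i].p > fk[best].p))
                best = i;
        return best;
    }

    static int worst_resident_fork(void) /* min p, WS in */
    {
        int i, worst = -1;
        for (i = 0; i < NFORKS; i++)
            if (fk[i].ws_in_core &&
                (worst < 0 || fk[i].p < fk[worst].p))
                worst = i;
        return worst;
    }

    /* One pass of the swap manager, steps 1-4 of item 1. */
    void gccor_pass(void)
    {
        for (;;) {
            int in = best_waiting_fork();
            if (in < 0)
                return;                      /* step 1: done */
            if (fk[in].ws_size <= free_pages()) {
                swap_in(in);                 /* step 2 */
                continue;
            }
            int out = worst_resident_fork(); /* step 3 */
            if (out < 0 || fk[out].p >= fk[in].p)
                return;                      /* step 4: done */
            swap_out(out);
        }
    }

Note that with this integration the value for the 1/5  fork
of item 2 below settles near 0.2 while that  of  the  1/30
fork settles near 0.03, so  the  distinction  between  them
falls out automatically.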

It  seems  to  me  that  this  approach  has  the  following
characteristics:

1.  It eliminates the need for the balset hold quantum which
we now use to prevent thrashing when forks block and unblock
frequently.  With the new approach,  a  high  priority  fork
which  unblocked  would  not  immediately  force out a lower
priority fork, since the former fork would have initially  a
0 or small probability value while the latter one would have
a high value.  AJBALS would  immediately  declare  the  high
priority  fork  to  be  in the balset and the lower priority
fork  out,  but  the  time  constants  of  the   probability
calculation  would  determine  how soon GCCOR would actually
swap.

2.  Poorly behaved forks which wake up frequently appear  to
cause  considerable  swap activity and system degradation at
present.  Under this new approach, these forks would have  a
high  probability  value  and so would not be swapped out as
frequently.  In particular, we can  distinguish  a  fork  which
has run 1 second out of every 5 for the last minute  from  one
which has run 1 second out of every 30  for  the  last  minute,
even though both just blocked.  We will then swap out the  1/30
fork based on the probability that it will not  need  to  run
again as soon as the 1/5 fork.

3.  This algorithm would appear to be  of  particular  value
with working set swapping, since with WSS a decision to swap
in or out a fork may be much more costly in  terms  of  swap
channel capacity than under straight demand paging.

I  am  interested  in  comments  on  the  above,   including
pathological  cases,  other situations where it might win or
lose, any statistical analysis, etc.  One way or the  other,
I will probably try it experimentally in the next few weeks.

      New Swapping Logic Implementation (Preliminary)

            D. Murphy  8-Dec-78

The new swapping logic (along the lines of my 20-Oct memo)
has been merged into <4.MONITOR>.  Some of the traditional
things in the scheduler data base are now different.  Most
importantly:

1. There are no balance set tables (BALSET, NBW, etc.).  Data
	formerly contained therein now lives in fork tables or
	does not exist.

2. The LH of FKPT does not mean what it used to, i.e.,
	whether the fork is on the GOLST or WTLST.  That is
	now determined by bits in the new fork table FKSWP,
	described next.

3. There is a new fork table called FKSWP.  It contains
	various bits and fields relating to fork state,
	including:

  B0 (FKWSL) - "working set loaded", on if fork's PSB, etc.
	are locked in memory; off otherwise.  Much like
	what used to be "in the balance set".

  B1 (FKBLK) - "fork blocked", on if fork is blocked and on
	the waitlist; off if fork is on GOLST.

  B2 (FKIBS) - "in balance set", as indicated, keeping in mind
	however, that in balance set does not necessarily mean
	in memory.

  B3 (BSWTB) - "balset wait", on if fork is in short-term wait,
	e.g. on page fault.  Wait test is in FKPGST as before.

  B4 (BSNSK) - "no sked", on if fork is nosked but only when
	fork is not actually running.

  B5 (BSCRSK) - "critical section", on if fork is in critical
	section and fork is not actually running.

  B6 (FKIBSH) - "IBS hold", on if fork entered balset since
	last update to history.

  B18-35 (FKHST) - "history", value which is a function of
	fork's recent history, in particular whether fork
	has been in balset or not.

To a first approximation, determining the state of a fork from a
crash dump or with EDDT goes as follows:

  1. Look at FKPT to see if fork exists; B0 on means that it
	does not.

  2. Look at FKSWP to see if fork is blocked or runnable (B1);

  3. If blocked, check wait test (FKSTAT);

  4. If runnable, see if in memory (B0 of FKSWP);

  5. If in memory, see if waiting (B3 of FKSWP);

  6. If waiting, check wait test (FKPGST).
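
For illustration, the same procedure as a C sketch.  The bit
positions follow the list above (PDP-10 bit 0 is the most
significant bit of the 36-bit word); taking the FKPT and
FKSWP words as arguments stands in for chasing them down in
the dump or with EDDT.

    #include <stdint.h>

    /* PDP-10 bit n counts from the most significant end of
     * the 36-bit word. */
    #define B(n)   (1ULL << (35 - (n)))

    #define FKWSL  B(0)   /* working set loaded */
    #define FKBLK  B(1)   /* fork blocked, on waitlist */
    #define BSWTB  B(3)   /* balset (short-term) wait */

    const char *fork_state(uint64_t fkpt, uint64_t fkswp)
    {
        if (fkpt & B(0))                /* step 1 */
            return "fork does not exist";
        if (fkswp & FKBLK)              /* steps 2-3 */
            return "blocked; check wait test in FKSTAT";
        if (!(fkswp & FKWSL))           /* step 4 */
            return "runnable but not in memory";
        if (fkswp & BSWTB)              /* steps 5-6 */
            return "in memory but waiting; check FKPGST";
        return "runnable and in memory";
    }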

                  Design and internal description of the
                       Release 6.0 scheduler changes

                                 1-Oct-84

                                 Version 1


The following describes the changes to the scheduler for release 6.0.   The
main reasons for the changes were:

     1.  The scheduler can represent a substantial amount of overhead.

     2.  Response and interactive feel seem to have degraded since  release
         4.

     3.  Several minor bugs needed correction.

The following sections describe the major changes to the scheduler and  the
reasoning behind these changes.



1.0  RESCHEDULING AND WAKEUPS

The scheduler depends on certain flags to indicate what actions are  to  be
performed.  In particular, the flag PSKED indicates  that  the  current  job
should be dismissed and a rescheduling cycle should occur.  This flag  also
causes the null job (idle loop) to wake up and attempt to schedule a job.



1.1  Old Rules

The old formula for setting PSKED was as follows:

     1.  IF a fork was removed from a wait list THEN call APSKED

     2.  IF a fork did a disk read THEN call APSKED

     3.  IF a fork did a disk write AND the fork  identified  was  not  the
         "null fork" THEN call APSKED

     4.  APSKED:  - always set PSKED flag

     5.  APSKED:  - IF the fork identified has a higher  or  equal  priority
         (lower numerical queue number is higher priority) THEN set  SKEDF3
         AND immediately run the scheduler.




1.2  The Problems

There are several deficiencies with this algorithm, as follows:

     1.  APSKED always causes the current fork to be dismissed on the  next
         scheduler  cycle,  regardless  of  its relative priority.  This is
         caused by setting PSKED.

     2.  APSKED wakes the scheduler immediately  for  forks  with  the  same
         priority.

     3.  Queue number is not the sole determiner of  fork  priority;  APSKED
         should be more clever in waking forks (calling CORFCT?).

     4.  APSKED should be called in the "null fork" case (case 3 above), in
         case  the  fork caused a write by an unmap or close.  This allows
         faster wakeups.

     5.  TSKED (tty wakeups available) could cause faster wakeups, too, but
         does not.




1.3  The Changes

The following changes attempt to correct these problems:

     1.  Create a new flag PSKD1.  It causes null job wakeups but does  not
         force  a dismiss of the running fork.  This new flag is always set
         by APSKED.

     2.  Set this new flag in the TSKED code.

     3.  Always call APSKED after an I/O operation completes,  even  if  it
         is attributed to the "null fork".

     4.  Change APSKED to only consider forks of higher rather  than  equal
         or higher priority.

     5.  NOT YET IMPLEMENTED - Make APSKED correctly compute fork priority.
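
A minimal sketch in C of the resulting wakeup rules (changes 1, 3, and 4).
PSKED, PSKD1, and SKEDF3 are the monitor's flags; representing them as  C
variables, and the routines cur_queue and wake_scheduler, are  assumptions
made only for illustration.

    extern int  PSKED, PSKD1, SKEDF3;     /* scheduler flags */
    extern int  cur_queue(void);          /* queue of the running fork */
    extern void wake_scheduler(void);

    /* Called whenever a fork becomes runnable or an I/O operation
     * completes -- including, per change 3, the "null fork" case. */
    void apsked(int woken_queue)
    {
        PSKD1 = 1;             /* change 1: wake the null job, but do not
                                  force a dismiss of the running fork */
        if (woken_queue < cur_queue()) {
            /* change 4: strictly higher priority only (a lower queue
               number is a higher priority) */
            PSKED = 1;         /* dismiss current fork next cycle */
            SKEDF3 = 1;
            wake_scheduler();  /* run the scheduler immediately */
        }
    }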




2.0  JOB SCHEDULING

The scheduler often finds itself in the situation where it  must  select  a
fork  to  run.  The routine that does this, SKDJOB, scans the GOLST looking
for a runnable fork.



2.1  Old Rules

The algorithm used by SKDJOB was:

     1.  If all forks on the GOLST are in "balance set wait" (on GOLST  but
         blocked  for  fast event such as page fault or very short DISMS%),
         scan the GOLST and attempt to free up as  many  waiting  forks  as
         possible.

     2.  Scan the GOLST looking for a runnable fork.  A fork is runnable if

         1.  Its working set is loaded

         2.  It is not in balance set wait

         3.  It is the NOSKED fork or there is no NOSKED fork


     3.  IF a fork is found THEN run it ELSE GOTO BGND1 (discussed later)




2.2  The Problems

The major problems arose when many forks were in balance set wait  and  all
were  ready  to  be freed.  The scheduler spent a lot of time freeing forks
that it could not run (it can only run 1).  This was especially bad because
the  code  that  removed  a  fork from the GOLST would cause the list to be
rescanned, from the beginning, after each removal.

Time could also be saved by special casing the code for a NOSKED fork.



2.3  The Changes


     1.  SKDJOB was changed to treat NOSKED forks as a special case

     2.  In the normal case, the GOLST is scanned.  If a fork is in balance
         set  wait, it is tested.  If the test fails, return immediately and
         continue  scanning.   (Previously,  the  code  could  do   complex
         removals at this point.)

     3.  If the test succeeds, run the fork and do not check any others  at
         this time.

     4.  Never reinit the list scan from the beginning.
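
A sketch in C of the revised scan.  The list representation,  the  NOSKED
pointer, and the helper routines are illustrative assumptions; the  point
is the control flow:  one forward pass,  an  early  exit  on  the  first
runnable fork, and no removals or rescans from the beginning.

    struct gfork {
        int ws_loaded;       /* working set loaded */
        int bs_wait;         /* in balance set wait */
    };

    extern struct gfork *golst[];      /* forks on the GOLST, in order */
    extern int           ngolst;
    extern struct gfork *nosked_fork;  /* non-NULL if a fork is NOSKED */
    extern int  wait_test(struct gfork *f); /* nonzero if wait satisfied */
    extern void run_fork(struct gfork *f);
    extern void bgnd1(void);           /* the background process */

    void skdjob(void)
    {
        int i;

        if (nosked_fork) {             /* change 1: NOSKED special case */
            if (nosked_fork->ws_loaded &&
                (!nosked_fork->bs_wait || wait_test(nosked_fork)))
                run_fork(nosked_fork);
            else
                bgnd1();
            return;
        }
        for (i = 0; i < ngolst; i++) { /* change 4: never restart scan */
            struct gfork *f = golst[i];
            if (!f->ws_loaded)
                continue;
            if (f->bs_wait && !wait_test(f))
                continue;              /* change 2: just keep scanning */
            run_fork(f);               /* change 3: first winner runs */
            return;
        }
        bgnd1();                       /* no runnable fork found */
    }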




3.0  BGND1

If SKDJOB finds no fork to run, it calls  BGND1,  the  background  process.
BGND1  performs  various "background" tasks and attempts to find a runnable
fork on one of the system wait lists.  It was  rewritten  to  perform  more
tasks  and to offload other background processes (i.e., adjust balance set,
AJBALS) when the system is idle.

4.0  QUEUE MOVEMENT AFTER BLOCKING - WAIT CREDIT

The most extensive and potentially beneficial changes were to  the  routine
NEWST.   This  routine is called after a process blocks.  It is responsible
for giving the process a higher run priority, based on the length  of  time
the  process  was  blocked.   This  algorithm was changed extensively after
release 4 of TOPS-20.  Some customers are running a modified 5.1 SCHED with
NEWST  replaced  with its release 4 counterpart.  The changes to NEWST make
it, to some extent, resemble the release 4 version.   For  that  reason  we
include the release 4 algorithm in the following sections.



4.1  Terminology

The following review of queue structure may be helpful in understanding the
algorithms presented below.


HIGHQ     is queue 0.  It is for special high-priority processes.

LOWQ      is the dregs queue.  It is for low-priority  processes.   It  was
          queue 5 in releases 5 and 5.1.  In releases 4 and 6 it  is  queue
          6.

INTQ0     is the highest priority interactive queue.  It is always queue 1.

INTQ1     is the lowest priority interactive queue.  It  is  queue  3  in
          release 6 but was queue 2 in releases 4 and 5.

MAXQ      is  the  compute-bound  queue.   It  has  the  lowest   priority
          (excepting the special LOWQ).  In releases 5 and 5.1 it was queue
          4.  In release 6 (and in 4) it is queue 5.

Forks move down in priority by exhausting the queue  quantum  for  a  queue
without  blocking;  they  then move into the next numerically higher queue.
They stay in MAXQ when they reach it.

The queue quanta for each queue in milliseconds (bias scheduler) are:

   QUEUE       R4        R5.1     R6 (old)   R6 (new)
   -----      -----      -----    --------   --------
     0          300        300        300        300
     1          200        300        100        200
     2          400       1000        400        400
     3         1000       3000       1600       1500
     4         2000       3000       5000       3000
     5         3000      10000       3000       5000
     6        10000         --      10000      10000

4.2  Code Comments

NEWST is preceded by a monument, that is, a comment section describing  the
algorithm  and  reasoning.   While  this monument is somewhat outdated, the
reasoning is still sound.  It is included here as well.
HEURISTIC FOR ADJUSTING QUEUE LEVEL AFTER I/O WAIT

THIS ROUTINE IS THE PRINCIPAL CONTROL OVER THE EXTENT TO WHICH
'INTERACTIVE' OR 'COMPUTE-BOUND' JOBS ARE FAVORED.  IT GIVES
PRIORITY 'CREDIT' TO A FORK AS A RESULT OF WAITING.  THE MORE
CREDIT GIVEN FOR A CERTAIN LENGTH WAIT (OR THE SHORTER THE WAIT
REQUIRED TO BECOME HIGH-Q), THE MORE THE SYSTEM WILL FAVOR
INTERACTIVE FORKS, AND THE GREATER THE CHANCE THAT FREQUENT OR
WELL-TIMED INTERACTIONS WILL GIVE A PROCESS AN EXCESSIVELY LARGE
SHARE OF COMPUTE TIME.  IT HAS BEEN DEMONSTRATED HOWEVER, THAT
A COMPLETELY 'FAIR' ALGORITHM HERE, I.E. ONE WHICH PREVENTS AN
INTERACTIVE FORK FROM GETTING ANY GREATER SHARE OF THE MACHINE
THAN A COMPUTE-BOUND FORK, IS HIGHLY UNSATISFACTORY TO INTERACTIVE
USERS UNDER MEDIUM AND HEAVY LOADS (AND ALL USERS ARE INTERACTIVE
SOMETIMES), AND RESULTS IN EXPONENTIALLY INCREASING LEVELS OF
FRUSTRATION, CURSING AND BEATING OF TERMINALS, ETC.  THEREFORE
THIS ROUTINE IS GENUINELY A HEURISTIC, MODIFIED AS A RESULT OF
PRESSURES BROUGHT TO BEAR ON SYSTEM PROGRAMMERS.

THE FOLLOWING DESCRIBES THE CURRENT PRACTICE:
 1. TTY INPUT WAITS OF .GE. 1 SEC GIVE HIGH-Q.  GREATLY REDUCES
    USER FRUSTRATION LEVEL.
 2. WAITS BY FORKS ON QUEUE 0 RESULT IN NO CHANGE TO Q VALUE
 3. FORKS ON QUEUES 1 TO MAXQ-1 WILL BE HIGH-Q IF WAITING TIME IS
    LONGER THAN LAST RUNTIME AS IMPLIED BY Q LEVEL.
 4. FORKS ON MAXQ WILL BE HIGH-Q IF WAITING TIME IS LONGER THAN
    THE MAXQ QUANTUM, AND WILL BE MOVED UP TO MAXQ-1 IF WAITING
    TIME IS LONGER THAN SOME 'MINIMAL' TIME (500 MS)

'WAITING TIME' ABOVE MEANS ACTUAL ELAPSED WAITING TIME
DIVIDED BY THE 1-MINUTE LOAD AVERAGE.  THIS KEEPS 'WELL-TIMED'
INTERACTIONS FROM USING MORE THAN ABOUT 1/LDAV OF THE CPU.


     Another code comment is found above the queue quanta tables.  It is:

HEURISTIC: A JOB SHOULD GET AT LEAST A FULL LOWEST QUEUE QUANTUM
BEFORE FALLING TO LOW QUEUE AND THEREBY YIELDING TO ALL THE COMPUTE
BOUND JOBS.  HENCE, THE SUM OF QUANTA FOR QUEUES 0 TO MAXQ-1
SHOULD BE .GE. THE QUANTUM FOR MAXQ
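
As a check against the quanta table in section 4.1:  with the new release 6
values and MAXQ = 5, the sum of the quanta for queues 0 through 4 is  300 +
200 + 400 + 1500 + 3000 = 5400 ms, which satisfies the heuristic since  the
MAXQ quantum is 5000 ms.  The old release 6 values, in which  the  queue  4
quantum (5000 ms) exceeded the MAXQ quantum (3000  ms),  are  taken  up  in
section 4.6 below.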



4.3  NEWST Release 4


     1.  IF wait time is .LT.  100ms THEN return doing nothing.

     2.  IF wait credit is proportional to  load  average  (SK%WCF  set  in
         SCHFLG)  AND  load  average .GT.  5 THEN compute proportional wait
         credit (PWC) = wait time / (load average -  5)  ELSE  PWC  =  wait
         time.

     3.  IF SK%TTP (high priority to TTY waits) AND wait was TTY input wait
         for a real TTY AND PWC .GT.  1 sec THEN requeue into queue  1  and
         RETURN.

     4.  ELSE add the fork's remaining time on its current queue to PWC; if
         this is .LT.  the full queue quantum for this queue then store  it
         as  the  new  quantum and RETURN.  That is, add the PWC to the
         remaining queue quantum.

     5.  If the PWC  +  remaining  quantum  is  greater  than  the  original
         quantum, repeatedly subtract the full queue quantum and reduce the
         queue number until the remaining sum is less than the  full  queue
         quantum, or the queue number reaches 0.

     6.  Check the queue number to ensure that it is no lower than 1, store
         it, and RETURN.




4.4  NEWST Release 5.1 And 6


     1.  IF wait time is .LT.  100ms THEN return doing nothing.

     2.  IF SK%TTP (high priority to TTY waits) AND wait was TTY input wait
         for a real TTY AND PWC .GT.  1 sec THEN requeue into queue  1  and
         RETURN.

     3.  IF SK%TOP (TTY output priority) THEN apply the above rule for  TTY
         output waits.

     4.  IF wait credit is proportional to  load  average  (SK%WCF  set  in
         SCHFLG) AND load average is .GT.  2 then compute PWC = wait time /
         load average.

     5.  If on queue MAXQ (4) move to queue MAXQ - 1 (3).

     6.  If PWC + remaining quantum .GT.  full quantum move down  1  queue.
         Else  credit  remaining  quantum,  but  do  not  move  to a higher
         priority queue than INTQ1+1 (3).




4.5  Release 6 NEWST

NEWST in release 6 is very similar to 5.1.  The only real difference is  in
queue values.  MAXQ is queue 5 in release 6 and INTQ1 is queue 3.

4.6  The Problems

The major problem with NEWST is that, except in the case of a TTY I/O wait,
a  process  cannot  move into a higher priority queue unless it is on MAXQ.
Furthermore, forks that are on MAXQ are almost guaranteed to  move  into  a
higher  priority  queue.   The  interactive feel of the machine is lost as
forks that are well behaved (block frequently) are forced to compete  head
to  head  with forks that are very poorly behaved (block infrequently).  In
both of these cases, the fork cannot do better  than  moving  to  queue  4
(MAXQ-1  or  INTQ1+1).   The queue quanta for release 6 are very strange in
that the queue 4 quantum is 5 secs but the MAXQ (5) quantum is only 3 secs.
If  we  believe  the comments above the quanta tables, then this is clearly
wrong.



4.7  The Changes

In an attempt to correct these problems, NEWST  was  changed  to  follow  a
modified  release 4 algorithm.  NEWST is also called for balance set waits;
that is, waits where the fork does not leave the GOLST.   For  balance  set
waits,  the  process  receives  only  1/2 credit since there is significant
overhead associated with balance set waits.  The algorithm is:

     1.  IF wait time is .LT.  100ms THEN return doing nothing.

     2.  IF SK%TTP (high priority to TTY waits) AND wait was TTY input wait
         for a real TTY AND PWC .GT.  1 sec THEN requeue into queue  1  and
         RETURN.

     3.  IF SK%TOP (TTY output priority) THEN apply the above rule for  TTY
         output waits.

     4.  IF wait credit is proportional to  load  average  (SK%WCF  set  in
         SCHFLG) AND load average is .GT.  2 THEN compute PWC = wait time /
         (load average - 2).

     5.  If the fork is on INTQ1 (3) or higher AND the PWC .GT.   the  full
         queue  quantum for the next highest queue THEN requeue to queue 1.
         This provides a big bonus for forks that wait a long time.

     6.  ELSE add the fork's remaining time on its current queue to PWC; if
         this is .LT.  the full queue quantum for this queue then store  it
         as  the  new  quantum and RETURN.  That is, add the PWC to the
         remaining queue quantum.

     7.  If the PWC + remaining quantum is greater than the original quantum
         and  the fork is on INTQ1 (3) or lower, requeue with a full quantum
         on the same queue.  If the  queue  is  higher  than  INTQ1  (4  or
         higher),  repeatedly subtract the full queue quantum and reduce the
         queue number until the remaining sum is less than the  full  queue
         quantum, or the queue number reaches 0.  Ensure that the fork does
         not get a higher priority than INTQ1 (3) and requeue with  a  full
         quantum.

     8.  If all else fails and the fork is on MAXQ (5) and  has  waited  at
         least the load average * 500 ms, requeue to queue 4.
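
A sketch in C of the revised algorithm, in the same illustrative  style  as
the release 4 sketch in section 4.3 (R6 (new) quanta; INTQ1 = 3, MAXQ = 5).
The TTY input and output cases are merged into one flag,  the  half  credit
for balance set waits is modeled by halving the wait, and the placement  of
step 8 is a plausible reading; none of this is confirmed monitor code.

    static const int quan6[7] = {300, 200, 400, 1500, 3000, 5000, 10000};
    #define INTQ1 3
    #define MAXQ  5

    int newst_r6(int q, int rem, int wait, double ldav,
                 int sk_wcf, int tty_wait, int bs_wait, int *newrem)
    {
        int pwc, sum;

        if (wait < 100) {                   /* step 1 */
            *newrem = rem;
            return q;
        }
        if (bs_wait)                        /* balset wait: half credit */
            wait /= 2;
        pwc = wait;
        if (tty_wait && pwc > 1000) {       /* steps 2-3 */
            *newrem = quan6[1];
            return 1;
        }
        if (sk_wcf && ldav > 2.0)           /* step 4 */
            pwc = (int)(wait / (ldav - 2.0));
        if (q >= 1 && q <= INTQ1 && pwc > quan6[q - 1]) {
            *newrem = quan6[1];             /* step 5: big bonus */
            return 1;
        }
        sum = pwc + rem;                    /* step 6 */
        if (sum < quan6[q]) {
            if (q == MAXQ && wait >= (int)(ldav * 500.0)) {
                *newrem = quan6[MAXQ - 1];  /* step 8: "all else fails" */
                return MAXQ - 1;
            }
            *newrem = sum;
            return q;
        }
        if (q <= INTQ1) {                   /* step 7: full quantum, same
                                               queue for interactive forks */
            *newrem = quan6[q];
            return q;
        }
        while (q > 0 && sum >= quan6[q])    /* step 7: climb the queues */
            sum -= quan6[q--];
        if (q < INTQ1)                      /* but no higher than INTQ1 */
            q = INTQ1;
        *newrem = quan6[q];
        return q;
    }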




5.0  MISCELLANEOUS

The calculation of the load average was changed to  not  include  forks  in
balance  set  wait.   This  will  decrease the load average on systems with
heavy swapping and file activity.  Forks that are in balance set  wait  are
not runnable and are not competing for the CPU.