TITLE(«
The cluster makes it possible to do, in half an hour, tasks
which were completely unnecessary to do before. -- Unknown
», __file__)
SECTION(«Terminology»)
- *Cluster* (often also called *grid*): a group of machines cooperating
to do some work
- *Job* : what you want the cluster to do for you
- *Nodes* (also called exec hosts): The machines that actually do
the work
- *Master*: machine that knows all jobs, all nodes, their capabilities
and current load
- *SGE* (Sun Grid Engine): software that runs on all relevant machines
- *Submit host*: takes a list of jobs to be executed and sends them
to the master (ilm). The master then distributes them across the
available nodes
- *Slot*: a single computing unit (one CPU core)
Users (circles) submit their jobs (squares) from the submit hosts
(white) to the master (yellow). The master assigns each job to a
suitable execution host (grey), on which the job is then scheduled.
EXERCISES()
- Read this
XREFERENCE(«https://web.archive.org/web/20160506102715/https://blogs.oracle.com/templedf/entry/sun_grid_engine_for_dummies»,
«introduction to SGE»)
SECTION(«Cluster hardware and setup»)
- 48/64 core AMD (Opteron and Epyc), 512G-2T RAM, 25Gbit ethernet
- separate network (no internet, limited campus services)
- NFS root, local /tmp, two global temp file systems
- SGE
EXERCISES()
- Look at XREFERENCE(«http://ilm.eb.local/ganglia/», «web
frontend») of the ganglia monitoring system.
- Run the CMD(qhost), CMD(lscpu), CMD(free), CMD(w) and CMD(htop)
commands to list the nodes, print CPU details, show the available
memory and swap, and check the load average.
- Examine all columns of the CMD(«q-charge --no-joblist
--no-globals») output.
- Open two terminals and ssh into two different cluster nodes
(note: the CMD(qhost) command prints the names of all nodes),
run CMD(touch ~/foo-$LOGNAME) on one of them to create a
file in your home directory. Check whether the file exists on
the other node by executing CMD(«ls -l ~/foo-$LOGNAME»). Do
the same with CMD(touch /tmp/foo-$LOGNAME).
- Read the section on the accounting system of the
XREFERENCE(«http://ilm.eb.local/clusterdoc/The-Accounting-System.html#The-Accounting-System»,
«cluster documentation») to learn how charges are computed.
HOMEWORK(«
Find three different ways to determine how many CPU cores
the cluster has.
», «
- Log in to any cluster node and read the message of the day.
- Run CMD(«qhost») and add up the third column.
- Run CMD(«nproc»), CMD(«lscpu») or CMD(«cat /proc/cpuinfo») on each
node and sum up the results.
- Run CMD(«qconf -se <node>») for each node and
sum up the values shown as CMD(«num_proc»).
- Run CMD(«q-gstat -s») and add the slot counts.
- Read the first sentence on the
XREFERENCE(«http://ilm.eb.local/clusterdoc/», «cluster documentation
main page»).
- Visit the XREFERENCE(«http://ilm.eb.local/ganglia/», «ganglia»)
page and subtract from the number shown as "CPUs Total" the CPU count
of the (two) servers which are not cluster nodes.
»)
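The CMD(«qhost») based method can also be scripted with CMD(«awk»). A
sketch which sums the NCPU column (field 3); the sample lines merely
stand in for real CMD(«qhost») output, and the hostnames and core
counts are made up:

```shell
# Sum the NCPU column of qhost-like output. On the cluster, replace
# the printf with a real call to qhost. The two header lines and the
# "global" line carry no core count and are skipped.
printf '%s\n' \
    'HOSTNAME   ARCH     NCPU' \
    '----------------------------' \
    'global     -        -' \
    'node401    lx-amd64 64' \
    'node402    lx-amd64 48' |
awk 'NR > 2 && $3 ~ /^[0-9]+$/ { total += $3 } END { print total }'
# prints 112
```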
HOMEWORK(«
Read the CMD(«q-charge») manual page and learn about the
CMD(«--no-joblist») option. Write a config file for CMD(«q-charge»)
to activate this option automatically. Hand in your config file.
», «
Simply create the file named CMD(«.q-chargerc») in the home directory
which contains a single line CMD(«no-joblist»). However, with this
file in place, there is no easy way to EMPH(«enable») the job list.
»)
SECTION(«Submitting and Monitoring»)
- interactive and non-interactive (batch) jobs
- CMD(«qsub»): submitting job scripts
- CMD(«qstat»): monitoring job state
- CMD(«h_vmem»), CMD(«h_rt»): Specify memory and running time
- CMD(«qdel»): removing running or waiting jobs
EXERCISES()
- Execute CMD(«qlogin -l h_rt=60») to get a shell on a
random(?) cluster node.
- Create a file called REFERENCE(«testscript.sh»,
«testscript.sh») with the content shown below.
- Look at the CMD(«qsub») man page to tell which of the following
options of CMD(«qsub») might be useful to set. CMD(«-l h_vmem -l
h_rt -cwd -j -V -N -pe»).
- Submit CMD(«testscript.sh») with CMD(«qsub -cwd testscript.sh»).
- Quick! Type CMD(qstat). Depending on the current cluster load you
will either see your job in the queue (waiting), running, or no longer
there (already finished).
- After your job has finished, find out whether it was successful using
CMD(«qacct -j <jobID>»). If you can't remember the job ID, look at the
files that were created.
- How much memory did your job use?
- Let's see what our CMD(«testscript.sh») did. Where and what is
the output of the three commands?
- Submit the same job script again and remove it with CMD(«qdel»)
while the job is running or waiting.
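The CMD(«qsub») options discussed above can also be embedded in the
job script itself as CMD(«#$») directives, so a plain CMD(«qsub
testscript.sh») picks them up automatically. A minimal sketch; the job
name and the limit values are arbitrary examples, not requirements:

```shell
#!/bin/sh
#$ -cwd             # run in the directory the job was submitted from
#$ -j y             # merge stderr into stdout
#$ -N testjob       # job name shown by qstat (arbitrary example)
#$ -l h_vmem=500M   # request 500M of memory
#$ -l h_rt=0:10:0   # request ten minutes of running time
date                # the payload: print the start time
```

Since the CMD(«#$») lines are ordinary comments to the shell, the
script also runs unchanged outside the cluster.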
HOMEWORK(«
Write a submit script which prints out the host it is running
on. Submit the script and request a running time of one minute, and
500M of memory. Hand in the script, the command you specified for
submitting the script, and the output.
», «
The script only needs to contain the single line CMD(«hostname»). In
particular, the shebang (CMD(«#!/bin/sh»)) may be omitted.
»)
SECTION(«Array jobs and parallel jobs»)
- array job: a single job with many subjobs. Equivalent to a set of
jobs which all run the same job script.
- parallel job: jobs that use more than one slot (CPU core)
EXERCISES()
- Run CMD(«mkdir array_job_dir») and create 20 files in that
directory called CMD(«input-1») to CMD(«input-20») (hint: example
from last week).
- Create REFERENCE(«array_job.sh», «array_job.sh») and discuss
what the script does.
- Submit an array job to the cluster using CMD(«qsub -t 1-20
array_job.sh»). Once all array tasks have finished, you'll find that
all your files were renamed.
- You might want to check whether the jobs succeeded. Use CMD(qacct)
to check the exit codes of all tasks. Pipes and the commands
CMD(sort), CMD(uniq) and CMD(grep) make this easier.
- Run CMD(«echo stress -c 2 | qsub -l h_rt=100») to submit a job.
Use CMD(«qstat») to find the node on which the job is running. Run
CMD(«ssh -t <node> htop») and check how many stress processes
are running and the share of CPU time they get. Repeat, but this
time submit a parallel job by adding CMD(«-pe parallel 2») to the
options for CMD(«qsub»).
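The file creation and the suggested exit-code check can be sketched in
shell. The three CMD(«exit_status») lines below are made up and stand
in for the output of a real CMD(«qacct -j <jobID>») call:

```shell
# Create the 20 input files for the array job.
mkdir -p array_job_dir && cd array_job_dir
for i in $(seq 1 20); do
    touch "input-$i"
done
ls input-* | wc -l    # prints 20

# Count how often each exit status occurred. On the cluster, feed
# the real qacct output into the same pipeline.
printf 'exit_status  0\nexit_status  0\nexit_status  1\n' |
    grep exit_status | sort | uniq -c
```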
HOMEWORK(«
Discuss when it makes sense to restrict the number of simultaneously
running jobs.
», «
One reason is to be nice to others: if you limit the number of your
jobs, you don't block other users by occupying the whole cluster. This
is only important for long-running jobs, though, as the SGE software
tries to balance jobs between users. Another reason is to avoid
overloading the file server in case your jobs do heavy I/O.
»)
HOMEWORK(«
Submit the REFERENCE(«array_job.sh», «array_job.sh») script
again as an array job, but make sure that at most two of the
20 tasks are going to run simultaneously. Hand in the corresponding
CMD(«qsub») command.
», «
The command CMD(«qsub -t 1-20 -tc 2 array_job.sh») will run at most
two of the 20 tasks simultaneously.
»)
SECTION(«Job running time and memory consumption»)
- Default: hard limit of 1G RAM, killed after one day
- Q: How long will my job run? How much memory does it need? A:
CMD(«qacct»)
- Long job waiting times for high requests
- Short queue
EXERCISES()
- If a job needs much memory, the default of 1G might not be
enough. Find out how much memory one terminated job of yours actually
needed by running CMD(«qacct -j <jobID>»). In particular,
look at CMD(«exit status») (not zero if something went wrong)
and CMD(«maxvmem») (actual memory consumption of your process).
- Submit the job script again, but this time specify CMD(«-l
h_vmem») to request more memory. Once the job is complete, compare
the CMD(«maxvmem») field of the CMD(«qacct») output and the value
specified with CMD(-l h_vmem).
- Jobs may also run longer than the default limit (one day)
allows. Use CMD(«-l h_rt») to request a longer running time. Run a
test job with default settings or a rough estimate and see whether it
fails (CMD(«qacct»), exit status not zero). Look at the start and end
times and compare their difference with the CMD(-l h_rt) value. Adjust
CMD(«-l h_rt») and run the job again. Repeat until your job runs
successfully.
- If your job is very short, you can set CMD(«-l h_rt») below one
hour to enter the short queue, for example CMD(«-l h_rt=0:30:0»)
for a maximum run time of 30 minutes. By specifying a small value
for CMD(«-l h_rt») you make use of this resource and may get your
job scheduled earlier than with the default values. The command
CMD(«qconf -sql») lists the names of all queues, and CMD(«qconf
-sq <queue> | grep "^._rt"») shows the soft and the hard running
time limits. See the section on resource limits of the
CMD(«queue_conf») manual page to learn more about the two types
of limits.
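When comparing the requested CMD(«-l h_rt») value with the actual
start and end times from CMD(«qacct»), it helps to convert the
CMD(«h_rt») triple to seconds. A small sketch; the helper name
CMD(«hrt_to_seconds») is made up:

```shell
# Convert an h_rt value of the form hours:minutes:seconds into a
# plain number of seconds.
hrt_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
hrt_to_seconds 0:30:0    # prints 1800
hrt_to_seconds 24:0:0    # prints 86400 (the one-day default)
```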
SECTION(«Queues, Queue Instances»)
A queue is a named description of the requirements a job must meet to
be started on one of the nodes, such as the maximal running time or the
number of slots. The queue descriptions are organized in plain text
files called queue configurations which are managed by the
qmaster and which can be modified by privileged users by means of the
qconf(1) command.
Among other configuration parameters, a queue configuration always
contains the list of execution hosts. On each node of this
list one realization of the queue, a queue instance, runs
as part of the execution daemon sge_execd(8).
The list is usually described in terms of hostgroups
where each hostgroup contains execution hosts which are similar in
one aspect or another. For example, one could define the hostgroup
@core64 to contain all nodes which have 64 CPU cores.
The diagram to the left tries to illustrate these concepts.
While a running job is always associated with one queue instance,
it is recommended to not request a specific queue at job submission
time, but to let the qmaster pick a suitable queue for the job.
An execution host can host more than one queue instance, and queues
can be related to each other to form a subordination tree.
Jobs in the superordinate queue can suspend jobs in the subordinated
queue, but suspension always takes place at the queue instance level.
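To make the hostgroup idea concrete, the definition printed by
CMD(«qconf -shgrp @core64») might look like the following sketch;
the group and host names are made up:

```
group_name @core64
hostlist node401 node402 node403
```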
EXERCISES()
- Run CMD(«qconf -sql») to see the list of all defined queues. Pick a
queue and run CMD(«qconf -sq ») to show the parameters of
the queue. Consult the CMD(«queue_conf(5)») manual page for details.
- Read the CMD(«prolog») section in CMD(«queue_conf(5)») manual
page. Examine the CMD(«/usr/local/sbin/prolog») file on the nodes and
try to understand what it actually does. See commit CMD(«0e44011d»)
in the cluster repository for the answer.
- Run CMD(«echo stress -c 2 | qsub») to submit a job which starts
two CPU-bound processes. Determine the node on which the job is
running, log in to this node and examine the CPU utilization of
your job.
SECTION(«Accounting»)
- accounting file contains one record for each EMPH(«finished») job
- plain text, one line per job, entries separated by colons
- qacct: scans accounting file
- summary or per-job information
- buggy
- easy to parse "by hand"
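Since every record is one colon-separated line, standard text tools
are all that is needed. A sketch which extracts the owner (field 4)
and the exit status (field 13, see CMD(«accounting(5)»)) from a
single made-up record:

```shell
# A made-up accounting record, abbreviated to its first 13 fields;
# real lines contain many more. Field 4 is the job owner, field 13
# the exit status.
line='short:node401:users:jdoe:testjob:4711:sge:0:1600000000:1600000010:1600000110:0:0'
echo "$line" | awk -F: '{ print "owner:", $4, "exit_status:", $13 }'
# prints: owner: jdoe exit_status: 0
```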
EXERCISES()
- Run CMD(«qacct -o») to see the full user summary and CMD(«qacct
-o $LOGNAME -d 90») to see the summary for your own user, including
only the jobs of the last 3 months.
- Check the CMD(«accounting(5)») manual page to learn more about
the fields stored in the accounting records.
- Submit a cluster job with CMD(«echo sleep 100 | qsub -l h_vmem=200M
-l h_rt=10»), wait until it completes, then check the accounting
record for your job with CMD(«qacct -j <jobID>»). In particular,
examine the CMD(«failed») and CMD(«maxvmem») fields. Compare
the output with CMD(«print_accounting_record.bash <jobID>»),
where the CMD(«print_accounting_record.bash») script is shown
REFERENCE(«print_accounting_record.bash», «below»).
- Check out the XREFERENCE(«http://ilm.eb.local/stats/», «statistics
page»). Tell which histograms were created from the accounting file.
- Search for CMD(«com_stats») in the
XREFERENCE(«http://ilm.eb.local/gitweb/?p=cluster;a=blob;f=scripts/admin/cmt;hb=HEAD»,
«cluster management tool») and examine how these statistics are
created.
SECTION(«Complex Attributes»)
- used to manage limited resources
- requested via CMD(«-l»)
- global, or attached to a host or queue
- predefined or user defined
- each attribute has a type and a relational operator
- requestable and consumable
EXERCISES()
- Run CMD(«qconf -sc») to see the complex configuration.
- Check the contents of
CMD(«/var/lib/gridengine/default/common/sge_request»).
- Run CMD(«qconf -se node444») to see the complex configuration
attached to node444.
- Discuss whether it would make sense to introduce additional complex
attributes for controlling I/O per file system.
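For the discussion, such a user-defined consumable could be sketched
as one additional line in the CMD(«qconf -sc») table. The attribute
name is made up; the columns are name, shortcut, type, relop,
requestable, consumable, default and urgency:

```
#name        shortcut  type  relop  requestable  consumable  default  urgency
scratch_io   sio       INT   <=     YES          YES         0        0
```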
SECTION(«Tickets and Projects»)
- tickets: functional/share/override
- project: (name, oticket, fshare, acl)
- jobs can be submitted to projects (CMD(«qsub -P»))
EXERCISES()
- Read the CMD(«sge_project») manual page to learn more about SGE
projects.
- Examine the output of CMD(«qconf -ssconf») with respect to the three
types of tickets and their weights.
- Check the CMD(«sge_priority(5)») manual page to learn more about the
three types of tickets.
- Discuss whether the SGE projects concept is helpful with respect
to accounting issues and grants (e.g., ERC).
- Discuss whether introducing override or functional share tickets
for projects is desirable.
SECTION(«Scheduler Configuration»)
- fair share: heavy users get reduced priority
- share tree: assign priorities based on historical usage
- reservation and backfilling
EXERCISES()
- Run CMD(«qstat -s p -u "*"») to see all pending jobs. Examine
the order and the priority of the jobs.
- Run CMD(«qconf -ssconf») to examine the scheduler configuration. In
particular, look at the CMD(«policy_hierarchy») entry. Consult
the CMD(«sched_conf(5)») and CMD(«share_tree(5)») manual pages
for details.
- Discuss the various scheduling policies described in this
XREFERENCE(«http://gridscheduler.sourceforge.net/howto/geee.html»,
«document»).
- Discuss the pros and cons to schedule preferentially to hosts which
are already running a job. That is, should CMD(«load_formula»)
be CMD(«np_load_avg») (the default) or CMD(«slots»)? See
XREFERENCE(«http://arc.liv.ac.uk/SGE/howto/sge-configs.html»,
«sge-configs») and CMD(«sched_conf(5)») for details.
SUPPLEMENTS()
SUBSECTION(«testscript.sh»)
#!/bin/sh
sleep 100 # wait to give us time to look at the job status
echo "This is my output" > ./outputfile
echo "Where does this go?"
ls ./directorythatdoesnotexisthere
SUBSECTION(«array_job.sh»)
#!/bin/sh
# Lines beginning with #$ are parsed by qsub and treated as options
# to the command. By the way, you don't need to write such lines
# into "testscript.sh" ;)
#$ -cwd
#$ -j y
#$ -l h_rt=0:1:0
mv input-$SGE_TASK_ID ./output-$SGE_TASK_ID