The cluster makes it possible to do, in half an hour, tasks
which were completely unnecessary to do before. -- Unknown
- *Cluster* (often also called *grid*): a group of machines cooperating
to get work done
- *Job*: what you want the cluster to do for you
- *Nodes* (also called exec hosts): the machines that actually do
the work
- *Master*: machine that knows all jobs, all nodes, and their capabilities
- *SGE* (Sun gridengine): software that runs on all relevant machines
- *Submit host*: takes a list of jobs to be executed and sends them
to the master (ilm). The master then distributes them across the
available nodes.
- *Slot*: a single computing unit (one CPU core)
width="300" height="150"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
<!-- Upper exec host -->
fill="none" stroke="black" stroke-width="1"
x="6" y="8" width="45" height="40" rx="5"
<rect fill="white" height="10" width="34" x="12" y="13" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="17" y="15"/>
<rect fill="#aaa" height="6" width="2" x="22" y="15"/>
<rect fill="#aaa" height="6" width="2" x="27" y="15"/>
<rect fill="#aaa" height="6" width="2" x="32" y="15"/>
<circle cx="38" cy="16" fill="#c00" r="2"/>
<circle cx="43" cy="16" fill="#0a5" r="2"/>
<!-- Users and jobs -->
fill="#ff0" stroke="black"
x="9" y="31" width="10" height="10" rx="10"
fill="lightgreen" stroke="black"
x="15" y="38" width="5" height="5"
fill="#aa0" stroke="black"
x="22" y="31" width="10" height="10" rx="10"
fill="orange" stroke="black"
x="28" y="38" width="5" height="5"
fill="#ae3" stroke="black"
x="35" y="31" width="10" height="10" rx="10"
fill="red" stroke="black"
x="42" y="38" width="5" height="5"
<!-- Lower exec host -->
fill="none" stroke="black" stroke-width="1"
x="6" y="56" width="45" height="40" rx="5"
<rect fill="white" height="10" width="34" x="12" y="61" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="17" y="63"/>
<rect fill="#aaa" height="6" width="2" x="22" y="63"/>
<rect fill="#aaa" height="6" width="2" x="27" y="63"/>
<rect fill="#aaa" height="6" width="2" x="32" y="63"/>
<circle cx="38" cy="64" fill="#c00" r="2"/>
<circle cx="43" cy="64" fill="#0a5" r="2"/>
fill="green" stroke="black"
x="12" y="79" width="10" height="10" rx="10"
fill="white" stroke="black"
x="19" y="86" width="5" height="5"
fill="cyan" stroke="black"
x="32" y="79" width="10" height="10" rx="10"
fill="black" stroke="black"
x="39" y="86" width="5" height="5"
fill="none" stroke="black"
x="71" y="13" width="45" height="65" rx="5"
<rect fill="gold" height="10" width="34" x="76" y="17" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="81" y="19"/>
<rect fill="#aaa" height="6" width="2" x="86" y="19"/>
<rect fill="#aaa" height="6" width="2" x="91" y="19"/>
<rect fill="#aaa" height="6" width="2" x="96" y="19"/>
<circle cx="102" cy="20" fill="#c00" r="2"/>
<circle cx="107" cy="20" fill="#0a5" r="2"/>
fill="lightgreen" stroke="black"
x="83" y="49" width="5" height="5"
fill="red" stroke="black"
x="94" y="42" width="5" height="5"
fill="orange" stroke="black"
x="93" y="62" width="5" height="5"
fill="white" stroke="black"
x="81" y="60" width="5" height="5"
fill="black" stroke="black"
x="98" y="52" width="5" height="5"
c 10,-10 20,-10 25,-5
fill="none" stroke="black"
x="146" y="10" width="45" height="25" rx="5"
<rect fill="grey" height="10" width="34" x="152" y="13" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="15"/>
<rect fill="#aaa" height="6" width="2" x="162" y="15"/>
<rect fill="#aaa" height="6" width="2" x="167" y="15"/>
<rect fill="#aaa" height="6" width="2" x="172" y="15"/>
<circle cx="178" cy="16" fill="#c00" r="2"/>
<circle cx="183" cy="16" fill="#0a5" r="2"/>
fill="orange" stroke="black"
x="159" y="27" width="5" height="5"
fill="lightgreen" stroke="black"
x="176" y="27" width="5" height="5"
fill="none" stroke="black"
x="146" y="39" width="45" height="25" rx="5"
<rect fill="grey" height="10" width="34" x="152" y="42" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="44"/>
<rect fill="#aaa" height="6" width="2" x="162" y="44"/>
<rect fill="#aaa" height="6" width="2" x="167" y="44"/>
<rect fill="#aaa" height="6" width="2" x="172" y="44"/>
<circle cx="178" cy="45" fill="#c00" r="2"/>
<circle cx="183" cy="45" fill="#0a5" r="2"/>
fill="white" stroke="black"
x="167" y="55" width="5" height="5"
fill="none" stroke="black"
x="146" y="68" width="45" height="25" rx="5"
<rect fill="grey" height="10" width="34" x="152" y="71" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="73"/>
<rect fill="#aaa" height="6" width="2" x="162" y="73"/>
<rect fill="#aaa" height="6" width="2" x="167" y="73"/>
<rect fill="#aaa" height="6" width="2" x="172" y="73"/>
<circle cx="178" cy="74" fill="#c00" r="2"/>
<circle cx="183" cy="74" fill="#0a5" r="2"/>
fill="red" stroke="black"
x="159" y="85" width="5" height="5"
fill="black" stroke="black"
x="176" y="85" width="5" height="5"
<p> Users (circles) submit their jobs (squares) from the submit hosts
(white) to the master (yellow). The master assigns to each job a
suitable execution host (grey) on which the job is scheduled. </p>
XREFERENCE(«https://web.archive.org/web/20160506102715/https://blogs.oracle.com/templedf/entry/sun_grid_engine_for_dummies»,
«introduction to SGE»)
SECTION(«Cluster hardware and setup»)
- 48/64 core AMD (Opteron and Epyc), 512G-2T RAM, 25Gbit ethernet
- separate network (no internet, limited campus services)
- NFS root, local /tmp, two global temp file systems
- Look at the XREFERENCE(«http://ilm.eb.local/ganglia/», «web
frontend») of the ganglia monitoring system.
- Run the CMD(«qhost»), CMD(«lscpu»), CMD(«free»), CMD(«w») and
CMD(«htop») commands to list the nodes, print the CPUs, the available
memory and swap, and the load average.
- Examine all columns of the CMD(«q-charge --no-joblist
--no-globals») output.
- Open two terminals and ssh into two different cluster nodes
(note: the CMD(«qhost») command prints the names of all nodes).
Run CMD(«touch ~/foo-$LOGNAME») on one of them to create a
file in your home directory. Check whether the file exists on
the other node by executing CMD(«ls -l ~/foo-$LOGNAME»). Do
the same with CMD(«touch /tmp/foo-$LOGNAME»).
- Read the section on the accounting system of the
XREFERENCE(«http://ilm.eb.local/clusterdoc/The-Accounting-System.html#The-Accounting-System»,
«cluster documentation») to learn how charges are computed.
Find three different ways to determine how many CPU cores the
cluster has in total.
- Log in to any cluster node and read the message of the day.
- Run CMD(«qhost») and add up the third column (see the sketch after
this list).
- Run CMD(«nproc»), CMD(«lscpu») or CMD(«cat /proc/cpuinfo») on each
node and sum up the results.
- Run CMD(«qconf -se <nodexxx>») for each node and
sum up the values shown as CMD(«num_proc»).
- Run CMD(«q-gstat -s») and add up the slot counts.
- Read the first sentence on the
XREFERENCE(«http://ilm.eb.local/clusterdoc/», «cluster
documentation») page.
- Visit the XREFERENCE(«http://ilm.eb.local/ganglia/», «ganglia»)
page and subtract from the number shown as "CPUs Total" the CPU count
of the (two) servers which are not cluster nodes.
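For the CMD(«qhost») variant, a small sketch: the filter on numeric
values skips the header lines and the «global» pseudo host, which
show no CPU count (the exact output format of the local CMD(«qhost»)
is an assumption).

<pre>
# sum up the NCPU column (third column) of the qhost output
qhost | awk '$3 ~ /^[0-9]+$/ {sum += $3} END {print sum}'
</pre>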
Read the CMD(«q-charge») manual page and learn about the
CMD(«--no-joblist») option. Write a config file for CMD(«q-charge»)
to activate this option automatically. Hand in your config file.
Simply create a file named CMD(«.q-chargerc») in your home directory
which contains the single line CMD(«no-joblist»). However, with this
file in place, there is no easy way to EMPH(«enable») the job list.
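Assuming CMD(«q-charge») reads its per-user configuration from
CMD(«~/.q-chargerc») as described above, the file can be created with
a single command:

<pre>
# create the one-line config file in the home directory
echo no-joblist > ~/.q-chargerc
</pre>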
SECTION(«Submitting and Monitoring»)
- interactive and non-interactive (batch) jobs
- CMD(«qsub»): submitting job scripts
- CMD(«qstat»): monitoring job state
- CMD(«h_vmem»), CMD(«h_rt»): specifying memory and running time
- CMD(«qdel»): removing running or waiting jobs
- Execute CMD(«qlogin -l h_rt=60») to get a shell on a
random(?) cluster node.
- Create a file called REFERENCE(«testscript.sh»,
«testscript.sh») with the content shown below.
- Look at the CMD(«qsub») man page and tell which of the following
options of CMD(«qsub») might be useful to set: CMD(«-l h_vmem»),
CMD(«-l h_rt»), CMD(«-cwd»), CMD(«-j»), CMD(«-V»), CMD(«-N»),
CMD(«-pe») (see the example after this list).
- Submit the script with CMD(«qsub -cwd testscript.sh»).
- Quick! Type CMD(«qstat»). Depending on the current cluster load you
will either see your job in the queue (waiting), running, or no longer
there (already finished).
- After your job has finished, find out whether it was successful by
running CMD(«qacct -j <jobid>»). If you can't remember the job ID,
look at the files that were created in your working directory.
- How much memory did your job use?
- Let's see what our CMD(«testscript.sh») did. Where and what is
the output of the three commands following the CMD(«sleep»)?
- Submit the same job script again and remove it with CMD(«qdel»)
while it is running or waiting.
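To illustrate how the options above combine, here is one plausible
invocation; the job name and the limits are arbitrary choices, not
required values:

<pre>
# -cwd: run in the current directory, -j y: join stderr into stdout,
# -N: set the job name, -l: request resource limits
qsub -cwd -j y -N testjob -l h_vmem=500M -l h_rt=600 testscript.sh
</pre>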
Write a submit script which prints the host it is running
on. Submit the script, requesting a running time of one minute and
500M of memory. Hand in the script, the command you used for
submitting it, and the output.
The script only needs to contain the single line CMD(«hostname»). In
particular, the shebang (CMD(«#!/bin/sh»)) may be omitted.
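Spelled out, one possible solution (the script name is an arbitrary
choice):

<pre>
# hostname.sh: print the name of the execution host
hostname
</pre>

Submit it with CMD(«qsub -cwd -l h_rt=60 -l h_vmem=500M hostname.sh»)
and look for the host name in the CMD(«hostname.sh.o<jobid>»)
output file.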
SECTION(«Array jobs and parallel jobs»)
- array job: a single job with many subjobs. Equivalent to a set of
jobs which all run the same job script.
- parallel job: a job that uses more than one slot (CPU core)
- Run CMD(«mkdir array_job_dir») and create 20 files in that
directory called CMD(«input-1») to CMD(«input-20») (hint: see the
sketch after this list).
- Create REFERENCE(«array_job.sh», «array_job.sh») and discuss
what the script does.
- Submit an array job to the cluster using CMD(«qsub -t 1-20
array_job.sh»). Once all array tasks have finished, you'll find that
all your files have been renamed.
- You might want to check whether the tasks succeeded. Use
CMD(«qacct») to check the exit codes of all tasks. Pipes and the
commands CMD(«sort»), CMD(«uniq») and CMD(«grep») make this easier
(again, see the sketch after this list).
- Run CMD(«echo stress -c 2 | qsub -l h_rt=100») to submit a job.
Use CMD(«qstat») to find the node on which the job is running. Run
CMD(«ssh -t <nodeX> htop») and check how many stress processes
are running and the share of CPU time they get. Repeat, but this
time submit a parallel job by adding CMD(«-pe parallel 2») to the
options for CMD(«qsub»).
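Two sketches for the steps above. The input files can be created with
a shell loop, and the exit codes of all tasks can be summarized with a
short pipeline; replace CMD(«<jobid>») with the ID of your array job:

<pre>
# create input-1 .. input-20
mkdir array_job_dir && cd array_job_dir
for i in $(seq 1 20); do touch input-$i; done

# after all tasks have finished: count the distinct exit codes
qacct -j <jobid> | grep exit_status | sort | uniq -c
</pre>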
Discuss when it makes sense to restrict the number of simultaneously
running tasks of an array job.
One reason is to be nice to others: if you limit the number of your
jobs you don’t block other users by occupying the whole cluster. This
is only important for long-running jobs though, as the SGE software
tries to balance jobs between users. Another reason is to not overload
the file server in case your jobs do heavy I/O.
Submit the REFERENCE(«array_job.sh», «array_job.sh») script
again as an array job, but make sure that at most two of the
20 tasks are going to run simultaneously. Hand in the corresponding
command.
The command CMD(«qsub -t 1-20 -tc 2 array_job.sh») will run at most
two of the 20 tasks simultaneously.
SECTION(«Job running time and memory consumption»)
- default: hard limit of 1G RAM, jobs killed after one day
- Q: How long will my job run? How much memory does it need? A:
- long waiting times for jobs with high resource requests
- If a job needs a lot of memory, the default of 1G might not be
enough. Find out how much memory one terminated job of yours actually
needed by running CMD(«qacct -j <jobname>»). In particular, look
at CMD(«exit_status») (not zero if something went wrong) and
CMD(«maxvmem») (the actual memory consumption of your process).
- Submit the job script again, but this time specify CMD(«-l
h_vmem») to request more memory (see the example after this
list). Once the job has completed, compare the CMD(«maxvmem») field
of the CMD(«qacct») output with the value specified with CMD(«-l
h_vmem»).
- Jobs may also need to run much longer than the default limit (one
day) allows. Use CMD(«-l h_rt») to request a longer running time. Run
a test job with default settings or a rough estimation and see if it
fails (CMD(«qacct»), exit status not zero). Look at the start and end
times and compare their difference with the CMD(«-l h_rt») value.
Adjust CMD(«-l h_rt») and run the job again. Iterate until your job
completes successfully.
- If your job is very short, you can set CMD(«-l h_rt») below
1h to enter the short queue, for example CMD(«-l h_rt=0:30:0»)
for 30 minutes of maximum running time. By specifying a small value
for CMD(«-l h_rt») your job may get scheduled earlier than with
the default values. The command CMD(«qconf -sql») lists the names
of all queues, and CMD(«qconf -sq <queuename> | grep "^._rt"»)
shows the soft and the hard running time limit of a queue. See the
section on resource limits of the CMD(«queue_conf») manual page to
learn more about the two types of limits.
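For reference, a combined request might look as follows; the values
are placeholders, and CMD(«h_rt») accepts either plain seconds or the
CMD(«hours:minutes:seconds») format:

<pre>
# request 4G of memory and a running time of two hours
qsub -cwd -l h_vmem=4G -l h_rt=2:0:0 testscript.sh
</pre>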
SECTION(«Queues, Queue Instances»)
<p> A queue is a named description of the requirements a job must
meet to be started on one of the nodes, like the maximal running time
or the number of slots. The queue descriptions are organized in plain
text files called <em> queue configurations </em> which are managed
by the qmaster and which can be modified by privileged users by means
of the <code> qconf(1)</code> command. </p>
<!-- The viewBox scales the coordinates to the specified range -->
width="400" height="264"
viewBox="0 0 167 110"
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
<!-- vertical boxes -->
fill="none" stroke="black"
x="60" y="10" width="15" height="85" rx="5"
fill="none" stroke="black"
x="85" y="10" width="15" height="85" rx="5"
fill="none" stroke="black"
x="110" y="10" width="15" height="85" rx="5"
fill="none" stroke="black"
x="135" y="10" width="15" height="85" rx="5"
<!-- horizontal boxes -->
fill="none" stroke="black"
x="55" y="20" width="100" height="14" rx="5"
fill="none" stroke="black"
x="80" y="45" width="75" height="14" rx="5"
fill="none" stroke="black"
x="55" y="70" width="100" height="14" rx="5"
fill="yellow" stroke="black"
x="63" y="22" width="10" height="10" rx="10"
fill="yellow" stroke="black"
x="88" y="22" width="10" height="10" rx="10"
fill="yellow" stroke="black"
x="113" y="22" width="10" height="10" rx="10"
fill="yellow" stroke="black"
x="138" y="22" width="10" height="10" rx="10"
fill="cyan" stroke="black"
x="88" y="47" width="10" height="10" rx="10"
fill="cyan" stroke="black"
x="113" y="47" width="10" height="10" rx="10"
fill="cyan" stroke="black"
x="138" y="47" width="10" height="10" rx="10"
fill="orange" stroke="black"
x="63" y="72" width="10" height="10" rx="10"
fill="orange" stroke="black"
x="88" y="72" width="10" height="10" rx="10"
fill="orange" stroke="black"
x="113" y="72" width="10" height="10" rx="10"
fill="orange" stroke="black"
x="138" y="72" width="10" height="10" rx="10"
<text x="5" y="10" font-size="5">
<!-- Queue instances lines -->
x1="50" y1="4" x2="142" y2="27"
x1="50" y1="7" x2="117" y2="27"
x1="50" y1="10" x2="92" y2="27"
x1="50" y1="13" x2="67" y2="27"
<text x="5" y="47" font-size="5">
<text x="5" y="52" font-size="5">
<text x="5" y="57" font-size="5">
<!-- Cluster Queue lines -->
x1="47" y1="49" x2="57" y2="26"
x1="48" y1="52" x2="83" y2="52"
x1="47" y1="55" x2="57" y2="77"
<text x="5" y="100" font-size="5">
x1="25" y1="99" x2="143" y2="99"
x1="68" y1="99" x2="68" y2="92"
x1="93" y1="99" x2="93" y2="92"
x1="118" y1="99" x2="118" y2="92"
x1="143" y1="99" x2="143" y2="92"
<p> Among other configuration parameters, a queue configuration always
contains the list of execution hosts. On each node of this
list one realization of the queue, a <em> queue instance</em>, is
running as part of the execution daemon <code> sge_execd(8)</code>.
The list is usually described in terms of <em> hostgroups</em>,
where each hostgroup contains execution hosts which are similar in
one aspect or another. For example, one could define the hostgroup
<code> @core64 </code> to contain all nodes which have 64 CPU cores.
The diagram to the left illustrates these concepts. </p>
<p> While a running job is always associated with one queue instance,
it is recommended to not request a specific queue at job submission
time, but to let the qmaster pick a suitable queue for the job. </p>
<p> An execution host can host more than one queue instance, and queues
can be related to each other to form a <em> subordination tree</em>.
Jobs in the superordinate queue can suspend jobs in the subordinated
queue, but suspension always takes place at the queue instance
level. </p>
- Run CMD(«qconf -sql») to see the list of all defined queues. Pick a
queue and run CMD(«qconf -sq <queue_name>») to show the parameters of
the queue. Consult the CMD(«queue_conf(5)») manual page for details.
- Read the CMD(«prolog») section of the CMD(«queue_conf(5)») manual
page. Examine the CMD(«/usr/local/sbin/prolog») file on the nodes and
try to understand what it actually does. See commit CMD(«0e44011d»)
in the user-info repository for the answer.
- Run CMD(«echo stress -c 2 | qsub») to submit a job which starts
two CPU-intensive processes. Determine the node on which the job
is running, log in to this node and examine the CPU utilization of
your job.
SECTION(«Accounting»)
- the accounting file contains one record for each _finished_ job
- plain text, one line per job, fields separated by colons
- CMD(«qacct»): scans the accounting file
- prints summary or per-job information
- easy to parse "by hand" (see the sketch below)
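Because of this simple format, standard text tools go a long
way. For example, assuming the usual CMD(«accounting(5)») field
layout where the fourth field is the job owner, the following counts
the finished jobs per user:

<pre>
# skip comment lines, extract the owner field, count occurrences
grep -v '^#' /var/lib/gridengine/default/common/accounting |
	awk -F: '{print $4}' | sort | uniq -c | sort -rn
</pre>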
- Run CMD(«qacct -o») to see the full user summary and CMD(«qacct
-o $LOGNAME -d 90») to see the summary for your own user, including
only the jobs of the last three months.
- Check the CMD(«accounting(5)») manual page to learn more about
the fields stored in the accounting records.
- Submit a cluster job with CMD(«echo sleep 100 | qsub -l h_vmem=200M
-l h_rt=10»), wait until it completes, then check the accounting
record for your job with CMD(«qacct -j <jobid>»). In particular,
examine the CMD(«failed») and CMD(«maxvmem») fields. Compare
the output with CMD(«print_accounting_record.bash <jobid>»),
where the CMD(«print_accounting_record.bash») script is shown
REFERENCE(«print_accounting_record.bash», «below»).
- Check out the XREFERENCE(«http://ilm.eb.local/stats/», «statistics
page»). Tell which histograms were created from the accounting file.
- Search for CMD(«com_stats») in the
XREFERENCE(«http://ilm.eb.local/gitweb/?p=user-info;a=blob;f=scripts/admin/cmt;hb=HEAD»,
«cluster management tool») and examine how these statistics are
generated.
SECTION(«Complex Attributes»)
- used to manage limited resources
- requested via CMD(«-l») (see the example below)
- global, or attached to a host or queue
- predefined or user defined
- each attribute has a type and a relational operator
- requestable and consumable
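For illustration, a request might combine a predefined attribute with
a user-defined one; CMD(«scratch») is a hypothetical consumable here,
not necessarily defined on this cluster:

<pre>
# request 2G of memory plus 50G of the hypothetical consumable
# "scratch" on whatever host ends up running the job
qsub -l h_vmem=2G -l scratch=50G job.sh
</pre>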
- Run CMD(«qconf -sc») to see the complex configuration.
- Check the contents of
CMD(«/var/lib/gridengine/default/common/sge_request»).
- Run CMD(«qconf -se node444») to see the complex configuration
of this node.
- Discuss whether it would make sense to introduce additional complex
attributes for controlling I/O per file system.
SECTION(«Tickets and Projects»)
- tickets: functional/share/override
- project: (name, oticket, fshare, acl)
- jobs can be submitted to projects (CMD(«qsub -P»))
- Read the CMD(«sge_project») manual page to learn more about SGE
projects (see also the commands after this list).
- Examine the output of CMD(«qconf -ssconf») with respect to the three
types of tickets and their weights.
- Check the CMD(«sge_priority(5)») manual page to learn more about the
three types of tickets.
- Discuss whether the SGE project concept is helpful with respect
to accounting issues and grants (e.g., ERC).
- Discuss whether introducing override or functional share tickets
for projects is desirable.
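The project-related CMD(«qconf») options are a good starting point
for the two discussion items; «someproject» is a made-up name:

<pre>
qconf -sprjl                # list the names of all defined projects
qconf -sprj someproject     # show name, oticket, fshare and acl
qsub -P someproject job.sh  # submit a job to this project
</pre>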
SECTION(«Scheduler Configuration»)
- fair share: heavy users get reduced priority
- share tree: assign priorities based on historical usage
- reservation and backfilling
- Run CMD(«qstat -s p -u "*"») to see all pending jobs. Examine
the order and the priority of the jobs.
- Run CMD(«qconf -ssconf») to examine the scheduler configuration. In
particular, look at the CMD(«policy_hierarchy») entry. Consult
the CMD(«sched_conf(5)») and CMD(«share_tree(5)») manual pages
for details.
- Discuss the various scheduling policies described in this
XREFERENCE(«http://gridscheduler.sourceforge.net/howto/geee.html»,
«howto»).
- Discuss the pros and cons of scheduling jobs preferentially on hosts
which are already running a job. That is, should CMD(«load_formula»)
be CMD(«np_load_avg») (the default) or CMD(«slots»)? See
XREFERENCE(«http://arc.liv.ac.uk/SGE/howto/sge-configs.html»,
«sge-configs») and CMD(«sched_conf(5)») for details.
SUBSECTION(«testscript.sh»)
sleep 100 # wait to give us time to look at the job status
echo "This is my output" > ./outputfile
echo "Where does this go?"
ls ./directorythatdoesnotexisthere
SUBSECTION(«array_job.sh»)
# Lines beginning with #$ are parsed by qsub and treated as additional
# command line options. By the way, you don't need to write this
# line into "testscript.sh" ;)
#$ -cwd
mv input-$SGE_TASK_ID ./output-$SGE_TASK_ID
SUBSECTION(«print_accounting_record.bash»)
(($# != 1)) && exit 1
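# print the record whose sixth field (the job number) matches the
# given job ID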
awk -F: "{if (\$6 == $1) print \$0}" /var/lib/gridengine/default/common/accounting