TITLE(«

The cluster makes it possible to do, in half an hour, tasks
which were completely unnecessary to do before. -- Unknown

», __file__)

SECTION(«Terminology»)

- *Cluster* (often also called *grid*): a group of machines cooperating
to do some work
- *Job*: what you want the cluster to do for you
- *Nodes* (also called exec hosts): the machines that actually do
the work
- *Master*: machine that knows all jobs, all nodes, their capabilities
and current load
- *SGE* (Sun gridengine): software that runs on all relevant machines
- *Submit host*: takes a list of jobs to be executed and sends them
to the master (ilm). The master then distributes them across the
available nodes
- *Slot*: a single computing unit (one CPU core)

<div>
<svg
	width="300" height="150"
	viewBox="0 0 200 100"
	xmlns="http://www.w3.org/2000/svg"
	xmlns:xlink="http://www.w3.org/1999/xlink"
>
<!-- Upper exec host -->
<rect
	fill="none" stroke="black" stroke-width="1"
	x="6" y="8" width="45" height="40" rx="5"
/>
<rect fill="white" height="10" width="34" x="12" y="13" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="17" y="15"/>
<rect fill="#aaa" height="6" width="2" x="22" y="15"/>
<rect fill="#aaa" height="6" width="2" x="27" y="15"/>
<rect fill="#aaa" height="6" width="2" x="32" y="15"/>
<circle cx="38" cy="16" fill="#c00" r="2"/>
<circle cx="43" cy="16" fill="#0a5" r="2"/>

<!-- Users and jobs -->
<rect
	fill="#ff0" stroke="black"
	x="9" y="31" width="10" height="10" rx="10"
/>
<rect
	fill="lightgreen" stroke="black"
	x="15" y="38" width="5" height="5"
/>
<rect
	fill="#aa0" stroke="black"
	x="22" y="31" width="10" height="10" rx="10"
/>
<rect
	fill="orange" stroke="black"
	x="28" y="38" width="5" height="5"
/>
<rect
	fill="#ae3" stroke="black"
	x="35" y="31" width="10" height="10" rx="10"
/>
<rect
	fill="red" stroke="black"
	x="42" y="38" width="5" height="5"
/>

<!-- Lower exec host -->
<rect
	fill="none" stroke="black" stroke-width="1"
	x="6" y="56" width="45" height="40" rx="5"
/>
<rect fill="white" height="10" width="34" x="12" y="61" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="17" y="63"/>
<rect fill="#aaa" height="6" width="2" x="22" y="63"/>
<rect fill="#aaa" height="6" width="2" x="27" y="63"/>
<rect fill="#aaa" height="6" width="2" x="32" y="63"/>
<circle cx="38" cy="64" fill="#c00" r="2"/>
<circle cx="43" cy="64" fill="#0a5" r="2"/>

<rect
	fill="green" stroke="black"
	x="12" y="79" width="10" height="10" rx="10"
/>
<rect
	fill="white" stroke="black"
	x="19" y="86" width="5" height="5"
/>
<rect
	fill="cyan" stroke="black"
	x="32" y="79" width="10" height="10" rx="10"
/>
<rect
	fill="black" stroke="black"
	x="39" y="86" width="5" height="5"
/>

<!-- left arrow -->
<path
	d="
		M 46 52
		l 20 0
		m 0 0
		l -4 -3
		l 0 6
		l 4 -3
		z
	"
	stroke-width="2"
	stroke="black"
	fill="black"
/>

<!-- Master -->
<rect
	fill="none" stroke="black"
	x="71" y="13" width="45" height="65" rx="5"
/>
<rect fill="gold" height="10" width="34" x="76" y="17" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="81" y="19"/>
<rect fill="#aaa" height="6" width="2" x="86" y="19"/>
<rect fill="#aaa" height="6" width="2" x="91" y="19"/>
<rect fill="#aaa" height="6" width="2" x="96" y="19"/>
<circle cx="102" cy="20" fill="#c00" r="2"/>
<circle cx="107" cy="20" fill="#0a5" r="2"/>
<rect
	fill="lightgreen" stroke="black"
	x="83" y="49" width="5" height="5"
/>
<rect
	fill="red" stroke="black"
	x="94" y="42" width="5" height="5"
/>
<rect
	fill="orange" stroke="black"
	x="93" y="62" width="5" height="5"
/>
<rect
	fill="white" stroke="black"
	x="81" y="60" width="5" height="5"
/>
<rect
	fill="black" stroke="black"
	x="98" y="52" width="5" height="5"
/>
<path
	d="
		M 80,45
		c 10,-10 20,-10 25,-5
		c 10,15 10,15 -7,32
		c -5,2 -10,5 -17,0
		c -2,-2 -3,-1 -5,-10
		c -3,-4 -3,-10 4,-17
		z
	"
	stroke-width="2"
	stroke="black"
	fill="none"
/>
<!-- right arrow -->
<path
	d="
		M 121 52
		l 20 0
		m 0 0
		l -4 -3
		l 0 6
		l 4 -3
		z
	"
	stroke-width="2"
	stroke="black"
	fill="black"
/>
<!-- exec hosts -->
<rect
	fill="none" stroke="black"
	x="146" y="10" width="45" height="25" rx="5"
/>
<rect fill="grey" height="10" width="34" x="152" y="13" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="15"/>
<rect fill="#aaa" height="6" width="2" x="162" y="15"/>
<rect fill="#aaa" height="6" width="2" x="167" y="15"/>
<rect fill="#aaa" height="6" width="2" x="172" y="15"/>
<circle cx="178" cy="16" fill="#c00" r="2"/>
<circle cx="183" cy="16" fill="#0a5" r="2"/>
<rect
	fill="orange" stroke="black"
	x="159" y="27" width="5" height="5"
/>
<rect
	fill="lightgreen" stroke="black"
	x="176" y="27" width="5" height="5"
/>
<rect
	fill="none" stroke="black"
	x="146" y="39" width="45" height="25" rx="5"
/>
<rect fill="grey" height="10" width="34" x="152" y="42" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="44"/>
<rect fill="#aaa" height="6" width="2" x="162" y="44"/>
<rect fill="#aaa" height="6" width="2" x="167" y="44"/>
<rect fill="#aaa" height="6" width="2" x="172" y="44"/>
<circle cx="178" cy="45" fill="#c00" r="2"/>
<circle cx="183" cy="45" fill="#0a5" r="2"/>
<rect
	fill="white" stroke="black"
	x="167" y="55" width="5" height="5"
/>
<rect
	fill="none" stroke="black"
	x="146" y="68" width="45" height="25" rx="5"
/>
<rect fill="grey" height="10" width="34" x="152" y="71" rx="3"/>
<rect fill="#aaa" height="6" width="2" x="157" y="73"/>
<rect fill="#aaa" height="6" width="2" x="162" y="73"/>
<rect fill="#aaa" height="6" width="2" x="167" y="73"/>
<rect fill="#aaa" height="6" width="2" x="172" y="73"/>
<circle cx="178" cy="74" fill="#c00" r="2"/>
<circle cx="183" cy="74" fill="#0a5" r="2"/>
<rect
	fill="red" stroke="black"
	x="159" y="85" width="5" height="5"
/>
<rect
	fill="black" stroke="black"
	x="176" y="85" width="5" height="5"
/>
</svg>
</div>

<p> Users (circles) submit their jobs (squares) from the submit hosts
(white) to the master (yellow). The master assigns each job to a
suitable execution host (grey), on which the job is then scheduled. </p>

<br>

EXERCISES()
- Read this
XREFERENCE(«https://web.archive.org/web/20160506102715/https://blogs.oracle.com/templedf/entry/sun_grid_engine_for_dummies»,
«introduction to SGE»)

SECTION(«Cluster hardware and setup»)
- 48/64 core AMD (Opteron and Epyc), 512G-2T RAM, 25Gbit ethernet
- separate network (no internet, limited campus services)
- NFS root, local /tmp, two global temp file systems
- SGE

EXERCISES()

- Look at the XREFERENCE(«http://ilm.eb.local/ganglia/», «web
frontend») of the ganglia monitoring system.
- Run the CMD(«qhost»), CMD(«lscpu»), CMD(«free»), CMD(«w») and
CMD(«htop») commands to list the nodes and to print the CPUs, the
available memory and swap, and the load average.
- Examine all columns of the CMD(«q-charge --no-joblist
--no-globals») output.
- Open two terminals and ssh into two different cluster nodes
(note: the CMD(«qhost») command prints the names of all nodes).
Run CMD(«touch ~/foo-$LOGNAME») on one of them to create a
file in your home directory. Check whether the file exists on
the other node by executing CMD(«ls -l ~/foo-$LOGNAME»). Do
the same with CMD(«touch /tmp/foo-$LOGNAME»).
- Read the section on the accounting system of the
XREFERENCE(«http://ilm.eb.local/clusterdoc/The-Accounting-System.html#The-Accounting-System»,
«cluster documentation») to learn how charges are computed.

HOMEWORK(«
Find three different ways to determine how many CPU cores
the cluster has.
», «
- Log in to any cluster node and read the message of the day.
- Run CMD(«qhost») and add up the third column.
- Run CMD(«nproc»), CMD(«lscpu») or CMD(«cat /proc/cpuinfo») on each
node and sum up the results.
- Run CMD(«qconf -se <nodexxx>») for each node and
sum up the values shown as CMD(«num_proc»).
- Run CMD(«q-gstat -s») and add up the slot counts.
- Read the first sentence on the
XREFERENCE(«http://ilm.eb.local/clusterdoc/», «cluster documentation
main page»).
- Visit the XREFERENCE(«http://ilm.eb.local/ganglia/», «ganglia»)
page and subtract from the number shown as "CPUs Total" the CPU count
of the (two) servers which are not cluster nodes.
»)
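
The CMD(«qhost») variant above can be scripted. A sketch, assuming
the usual CMD(«qhost») layout (two header lines, a «global»
pseudo-host, NCPU in the third column); canned sample text stands in
for a live CMD(«qhost») call:

```shell
# Sum the NCPU column of qhost output. The first two lines are
# headers, and the "global" pseudo-host reports "-" for NCPU, so
# only numeric third fields are added up. The sample text below
# stands in for the output of a real qhost invocation.
qhost_output='HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-----------------------------------------------------------------
global     -              -     -       -       -       -       -
node401    lx26-amd64    64  0.01  504.8G    1.2G    8.0G     0.0
node402    lx26-amd64    48  3.99  504.8G   10.7G    8.0G     0.0'
echo "$qhost_output" | awk 'NR > 2 && $3 ~ /^[0-9]+$/ { sum += $3 } END { print sum }'
```

For the sample above this prints 112.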


HOMEWORK(«
Read the CMD(«q-charge») manual page and learn about the
CMD(«--no-joblist») option. Write a config file for CMD(«q-charge»)
to activate this option automatically. Hand in your config file.
», «
Simply create the file named CMD(«.q-chargerc») in the home directory
which contains the single line CMD(«no-joblist»). However, with this
file in place, there is no easy way to EMPH(«enable») the job list.
»)

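The config file from the answer can be created with a single
redirection. A minimal sketch; HOME is pointed at a scratch directory
here so that the example does not clobber a real config file:

```shell
# Write the single option line into .q-chargerc. In real use, drop
# the HOME override so the file lands in your actual home directory.
HOME=$(mktemp -d)
echo no-joblist > "$HOME/.q-chargerc"
cat "$HOME/.q-chargerc"
```
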
SECTION(«Submitting and Monitoring»)

- interactive and non-interactive (batch) jobs
- CMD(«qsub»): submitting job scripts
- CMD(«qstat»): monitoring job state
- CMD(«h_vmem»), CMD(«h_rt»): specifying memory and running time limits
- CMD(«qdel»): removing running or waiting jobs

EXERCISES()

- Execute CMD(«qlogin -l h_rt=60») to get a shell on a
random(?) cluster node.
- Create a file called REFERENCE(«testscript.sh»,
«testscript.sh») with the content shown below.
- Look at the CMD(«qsub») man page to tell which of the following
options of CMD(«qsub») might be useful to set: CMD(«-l h_vmem -l
h_rt -cwd -j -V -N -pe»).
- Submit CMD(«testscript.sh») with CMD(«qsub -cwd testscript.sh»).
- Quick! Type CMD(«qstat»). Depending on the current cluster load you
will either see your job in the queue (waiting), running, or no longer
there (already finished).
- After your job has finished, find out whether it was successful using
CMD(«qacct -j <jobid>»). If you can't remember the job ID, look at
the files that were created.
- How much memory did your job use?
- Let's see what our CMD(«testscript.sh») did. Where and what is
the output of the three commands?
- Submit the same job script again and remove it with CMD(«qdel»)
while the job is running or waiting.

HOMEWORK(«
Write a submit script which prints out the host it is running on.
Submit the script, requesting a running time of one minute and
500M of memory. Hand in the script, the command you specified for
submitting the script, and the output.
», «
The script only needs to contain the single line CMD(«hostname»). In
particular, the shebang (CMD(«#!/bin/sh»)) may be omitted.
»)

SECTION(«Array jobs and parallel jobs»)

- array job: a single job with many subjobs, equivalent to a set of
jobs which all run the same job script
- parallel job: a job that uses more than one slot (CPU core)

EXERCISES()
- Run CMD(«mkdir array_job_dir») and create 20 files in that
directory called CMD(«input-1») to CMD(«input-20») (hint: example
from last week).
- Create REFERENCE(«array_job.sh», «array_job.sh») and discuss
what the script does.
- Submit an array job to the cluster using CMD(«qsub -t 1-20
array_job.sh»). Once all array tasks have finished, you'll find that
all your files were renamed.
- You might want to check whether the jobs succeeded. Use CMD(«qacct»)
to check the exit codes of all jobs. Think about pipes and the commands
CMD(«sort»), CMD(«uniq») and CMD(«grep») to make this easier.
- Run CMD(«echo stress -c 2 | qsub -l h_rt=100») to submit a job.
Use CMD(«qstat») to find the node on which the job is running. Run
CMD(«ssh -t <nodeX> htop») and check how many stress processes
are running and the share of CPU time they get. Repeat, but this
time submit a parallel job by adding CMD(«-pe parallel 2») to the
options for CMD(«qsub»).

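The exit-code check suggested above can be done with one pipeline
over the accounting output. A sketch; canned sample lines stand in
for the output of a real CMD(«qacct -j <jobid>») call:

```shell
# Tally the exit_status lines that qacct prints, one per array task.
# A fully successful array job shows only "exit_status  0" entries.
# The sample text below stands in for real qacct output.
qacct_output='exit_status  0
exit_status  0
exit_status  1
exit_status  0'
echo "$qacct_output" | grep '^exit_status' | sort | uniq -c
```

For the sample this reports three tasks with exit status 0 and one
with exit status 1, so the failing task is easy to spot.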
HOMEWORK(«
Discuss when it makes sense to restrict the number of simultaneously
running jobs.
», «
One reason is to be nice to others: if you limit the number of your
jobs you don’t block other users by occupying the whole cluster. This
is only important for long running jobs though, as the SGE software
tries to balance jobs between users. Another reason is to not overload
the file server in case your jobs do heavy I/O.
»)

HOMEWORK(«
Submit the REFERENCE(«array_job.sh», «array_job.sh») script
again as an array job, but make sure that at most two of the
20 tasks are going to run simultaneously. Hand in the corresponding
CMD(«qsub») command.
», «
The command CMD(«qsub -t 1-20 -tc 2 array_job.sh») will run at most
two of the 20 tasks simultaneously.
»)

SECTION(«Job running time and memory consumption»)

- Default: hard limit of 1G RAM, killed after one day
- Q: How long will my job run? How much memory does it need? A:
CMD(«qacct»)
- Long job waiting times for high requests
- Short queue

EXERCISES()

- If a job needs much memory, the default of 1G might not be
enough. Find out how much memory one terminated job of yours actually
needed by running CMD(«qacct -j <jobname>»). In particular,
look at CMD(«exit_status») (not zero if something went wrong)
and CMD(«maxvmem») (actual memory consumption of your process).
- Submit the job script again, but this time specify CMD(«-l
h_vmem») to request more memory. Once the job is complete, compare
the CMD(«maxvmem») field of the CMD(«qacct») output with the value
specified with CMD(«-l h_vmem»).
- Jobs can also run much longer than the default value allows (one
day). Use CMD(«-l h_rt») to request a longer running time. Run a
test job with default settings or a rough estimate and see if it
fails (CMD(«qacct»), exit status not zero). Look at the start and end
times and compare them with the CMD(«-l h_rt») value. Adjust CMD(«-l
h_rt») and run the job again. Reevaluate until your job runs
successfully.
- If your job is very short, you can set CMD(«-l h_rt») below
1h to enter the short queue, for example CMD(«-l h_rt=0:30:0»)
for a maximum running time of 30 minutes. By setting a small value
for CMD(«-l h_rt») you make use of this resource and possibly get
your job scheduled earlier than with default values. The command
CMD(«qconf -sql») lists the names of all queues, and CMD(«qconf
-sq <queuename> | grep "^._rt"») shows you the soft and the hard
limit of the running time. See the section on resource limits of the
CMD(«queue_conf») manual page to learn more about the two types
of limits.

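When comparing the start and end times reported by CMD(«qacct»)
with the requested running time, it helps to convert the H:M:S value
given to CMD(«-l h_rt») into seconds. A small sketch (the one-liner
is an illustration, not an SGE tool):

```shell
# Convert an h_rt specification of the form hours:minutes:seconds
# into a plain number of seconds.
h_rt='0:30:0'
echo "$h_rt" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
```

For CMD(«h_rt=0:30:0») this prints 1800.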
SECTION(«Queues, Queue Instances»)

<p> A queue is a named description of the requirements a job must meet
to be started on one of the nodes, like the maximal running time or the
number of slots. The queue descriptions are organized in plaintext
files called <em> queue configurations </em> which are managed by the
qmaster and which can be modified by privileged users by means of the
<code> qconf(1)</code> command. </p>

<div>

<!-- The viewBox scales the coordinates to the specified range -->
<svg
	width="400" height="264"
	viewBox="0 0 167 110"
	xmlns="http://www.w3.org/2000/svg"
	xmlns:xlink="http://www.w3.org/1999/xlink"
>
<!-- vertical boxes -->
<rect
	fill="none" stroke="black"
	x="60" y="10" width="15" height="85" rx="5"
/>
<rect
	fill="none" stroke="black"
	x="85" y="10" width="15" height="85" rx="5"
/>
<rect
	fill="none" stroke="black"
	x="110" y="10" width="15" height="85" rx="5"
/>
<rect
	fill="none" stroke="black"
	x="135" y="10" width="15" height="85" rx="5"
/>
<!-- horizontal boxes -->
<rect
	fill="none" stroke="black"
	x="55" y="20" width="100" height="14" rx="5"
/>
<rect
	fill="none" stroke="black"
	x="80" y="45" width="75" height="14" rx="5"
/>
<rect
	fill="none" stroke="black"
	x="55" y="70" width="100" height="14" rx="5"
/>
<!-- circles -->
<rect
	fill="yellow" stroke="black"
	x="63" y="22" width="10" height="10" rx="10"
/>
<rect
	fill="yellow" stroke="black"
	x="88" y="22" width="10" height="10" rx="10"
/>
<rect
	fill="yellow" stroke="black"
	x="113" y="22" width="10" height="10" rx="10"
/>
<rect
	fill="yellow" stroke="black"
	x="138" y="22" width="10" height="10" rx="10"
/>
<rect
	fill="cyan" stroke="black"
	x="88" y="47" width="10" height="10" rx="10"
/>
<rect
	fill="cyan" stroke="black"
	x="113" y="47" width="10" height="10" rx="10"
/>
<rect
	fill="cyan" stroke="black"
	x="138" y="47" width="10" height="10" rx="10"
/>
<rect
	fill="orange" stroke="black"
	x="63" y="72" width="10" height="10" rx="10"
/>
<rect
	fill="orange" stroke="black"
	x="88" y="72" width="10" height="10" rx="10"
/>
<rect
	fill="orange" stroke="black"
	x="113" y="72" width="10" height="10" rx="10"
/>
<rect
	fill="orange" stroke="black"
	x="138" y="72" width="10" height="10" rx="10"
/>
<text x="5" y="10" font-size="5">
	Queue Instances
</text>
<!-- Queue instances lines -->
<line
	stroke="black"
	x1="50" y1="4" x2="142" y2="27"
/>
<line
	stroke="black"
	x1="50" y1="7" x2="117" y2="27"
/>
<line
	stroke="black"
	x1="50" y1="10" x2="92" y2="27"
/>
<line
	stroke="black"
	x1="50" y1="13" x2="67" y2="27"
/>
<text x="5" y="47" font-size="5">
	Cluster Queue:
</text>
<text x="5" y="52" font-size="5">
	Set of Queue
</text>
<text x="5" y="57" font-size="5">
	Instances
</text>
<!-- Cluster Queue lines -->
<line
	stroke="black"
	x1="47" y1="49" x2="57" y2="26"
/>
<line
	stroke="black"
	x1="48" y1="52" x2="83" y2="52"
/>
<line
	stroke="black"
	x1="47" y1="55" x2="57" y2="77"
/>
<text x="5" y="100" font-size="5">
	Hosts
</text>
<!-- Hosts lines -->
<line
	stroke="black"
	x1="25" y1="99" x2="143" y2="99"
/>
<line
	stroke="black"
	x1="68" y1="99" x2="68" y2="92"
/>
<line
	stroke="black"
	x1="93" y1="99" x2="93" y2="92"
/>
<line
	stroke="black"
	x1="118" y1="99" x2="118" y2="92"
/>
<line
	stroke="black"
	x1="143" y1="99" x2="143" y2="92"
/>
</svg>
</div>

<p> Among other configuration parameters, a queue configuration always
contains the list of execution hosts. On each node of this list one
realization of the queue, a <em> queue instance</em>, runs as part of
the execution daemon <code> sge_execd(8)</code>. The list is usually
described in terms of <em> hostgroups</em>, where each hostgroup
contains execution hosts which are similar in one aspect or another.
For example, one could define the hostgroup <code> @core64 </code>
to contain all nodes which have 64 CPU cores. The diagram to the
left tries to illustrate these concepts. </p>

<p> While a running job is always associated with one queue instance,
it is recommended to not request a specific queue at job submission
time, but to let the qmaster pick a suitable queue for the job. </p>

<p> An execution host can host more than one queue instance, and queues
can be related to each other to form a <em> subordination tree</em>.
Jobs in the superordinate queue can suspend jobs in the subordinated
queue, but suspension always takes place at the queue instance level.
</p>

<br>

EXERCISES()

- Run CMD(«qconf -sql») to see the list of all defined queues. Pick a
queue and run CMD(«qconf -sq <queue_name>») to show the parameters of
the queue. Consult the CMD(«queue_conf(5)») manual page for details.
- Read the CMD(«prolog») section in the CMD(«queue_conf(5)») manual
page. Examine the CMD(«/usr/local/sbin/prolog») file on the nodes and
try to understand what it actually does. See commit CMD(«0e44011d»)
in the cluster repository for the answer.
- Run CMD(«echo stress -c 2 | qsub») to submit a job which starts two
threads. Determine the node on which the job is running, log in to
this node and examine the CPU utilization of your job.

SECTION(«Accounting»)

- accounting file contains one record for each EMPH(«finished») job
- plain text, one line per job, entries separated by colons
- CMD(«qacct»): scans the accounting file
- summary or per-job information
- buggy
- easy to parse "by hand"

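Because each line of the accounting file is a colon-separated record,
summaries are indeed easy to compute by hand. A sketch summing the
wallclock time per owner, assuming the standard CMD(«accounting(5)»)
layout (owner in field 4, ru_wallclock in field 14); two fabricated
records stand in for the real file under
CMD(«/var/lib/gridengine/default/common/accounting»):

```shell
# Sum wallclock seconds (field 14) per job owner (field 4).
# The two records below are fabricated stand-ins for real
# accounting lines.
records='all.q:node401:users:alice:sleep:1001:sge:0:0:0:0:0:0:100:0:0
all.q:node402:users:alice:sleep:1002:sge:0:0:0:0:0:0:50:0:0'
echo "$records" | awk -F: '{ wall[$4] += $14 } END { for (u in wall) print u, wall[u] }'
```

For the fabricated records this prints "alice 150".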
EXERCISES()

- Run CMD(«qacct -o») to see the full user summary and CMD(«qacct
-o $LOGNAME -d 90») to see the summary for your own user, including
only the jobs of the last three months.
- Check the CMD(«accounting(5)») manual page to learn more about
the fields stored in the accounting records.
- Submit a cluster job with CMD(«echo sleep 100 | qsub -l h_vmem=200M
-l h_rt=10»), wait until it completes, then check the accounting
record for your job with CMD(«qacct -j <jobid>»). In particular,
examine the CMD(«failed») and CMD(«maxvmem») fields. Compare
the output with CMD(«print_accounting_record.bash <jobid>»),
where the CMD(«print_accounting_record.bash») script is shown
REFERENCE(«print_accounting_record.bash», «below»).
- Check out the XREFERENCE(«http://ilm.eb.local/stats/», «statistics
page»). Tell which histograms were created from the accounting file.
- Search for CMD(«com_stats») in the
XREFERENCE(«http://ilm.eb.local/gitweb/?p=cluster;a=blob;f=scripts/admin/cmt;hb=HEAD»,
«cluster management tool») and examine how these statistics are
created.

SECTION(«Complex Attributes»)

- used to manage limited resources
- requested via CMD(«-l»)
- global, or attached to a host or queue
- predefined or user defined
- each attribute has a type and a relational operator
- requestable and consumable

EXERCISES()

- Run CMD(«qconf -sc») to see the complex configuration.
- Check the contents of
CMD(«/var/lib/gridengine/default/common/sge_request»).
- Run CMD(«qconf -se node444») to see the complex configuration
attached to node444.
- Discuss whether it would make sense to introduce additional complex
attributes for controlling I/O per file system.

SECTION(«Tickets and Projects»)

- tickets: functional/share/override
- project: (name, oticket, fshare, acl)
- jobs can be submitted to projects (CMD(«qsub -P»))

EXERCISES()

- Read the CMD(«sge_project») manual page to learn more about SGE
projects.
- Examine the output of CMD(«qconf -ssconf») with respect to the three
types of tickets and their weights.
- Check the CMD(«sge_priority(5)») manual page to learn more about the
three types of tickets.
- Discuss whether the SGE projects concept is helpful with respect
to accounting issues and grants (e.g., ERC).
- Discuss whether introducing override or functional share tickets
for projects is desirable.

SECTION(«Scheduler Configuration»)

- fair share: heavy users get reduced priority
- share tree: assign priorities based on historical usage
- reservation and backfilling

EXERCISES()

- Run CMD(«qstat -s p -u "*"») to see all pending jobs. Examine
the order and the priority of the jobs.
- Run CMD(«qconf -ssconf») to examine the scheduler configuration. In
particular, look at the CMD(«policy_hierarchy») entry. Consult
the CMD(«sched_conf(5)») and CMD(«share_tree(5)») manual pages
for details.
- Discuss the various scheduling policies described in this
XREFERENCE(«http://gridscheduler.sourceforge.net/howto/geee.html»,
«document»).
- Discuss the pros and cons of preferentially scheduling to hosts which
are already running a job. That is, should CMD(«load_formula»)
be CMD(«np_load_avg») (the default) or CMD(«slots»)? See
XREFERENCE(«http://arc.liv.ac.uk/SGE/howto/sge-configs.html»,
«sge-configs») and CMD(«sched_conf(5)») for details.

SUPPLEMENTS()

SUBSECTION(«testscript.sh»)

<pre>
#!/bin/sh
sleep 100 # wait to give us time to look at the job status
echo "This is my output" > ./outputfile
echo "Where does this go?"
ls ./directorythatdoesnotexisthere
</pre>

SUBSECTION(«array_job.sh»)

<pre>
#!/bin/sh
# Lines beginning with #$ are read by qsub and treated as if the
# rest of the line had been given as options on the command line.
# By the way, you don't need to write such lines in "testscript.sh" ;)
#$ -cwd
#$ -j y
#$ -l h_rt=0:1:0
mv input-$SGE_TASK_ID ./output-$SGE_TASK_ID
</pre>

SUBSECTION(«print_accounting_record.bash»)

<pre>
#!/bin/bash
(($# != 1)) && exit 1
awk -F: "{if (\$6 == $1) print \$0}" /var/lib/gridengine/default/common/accounting
</pre>