TITLE(« The cluster makes it possible to do, in half an hour, tasks which were completely unnecessary to do before. -- Unknown », __file__)

SECTION(«Terminology»)

- *Cluster* (often also called *grid*): a group of machines cooperating to do some work
- *Job*: what you want the cluster to do for you
- *Nodes* (also called exec hosts): the machines that actually do the work
- *Master*: machine that knows all jobs, all nodes, their capabilities and current load
- *SGE* (Sun gridengine): software that runs on all relevant machines
- *Submit host*: takes a list of jobs to be executed and sends them to the master (ilm). The master then distributes them across the available nodes
- *Slot*: a single computing unit (one CPU core)

Users (circles) submit their jobs (squares) from the submit hosts (white) to the master (yellow). The master assigns each job to a suitable execution host (grey), on which the job is then scheduled.
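To make the terminology concrete, here is a minimal sketch of the life of a job. The script name CMD(«myjob.sh») is a made-up example:

	# on a submit host: create a trivial job script and submit it
	echo hostname > myjob.sh
	qsub -cwd myjob.sh
	# the master schedules the job on a node with a free slot; after
	# completion, the output appears in myjob.sh.o<jobid> in this directory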


EXERCISES()

- Read this XREFERENCE(«https://web.archive.org/web/20160506102715/https://blogs.oracle.com/templedf/entry/sun_grid_engine_for_dummies», «introduction to SGE»)

SECTION(«Cluster hardware and setup»)

- 48/64 core AMD (Opteron and Epyc), 512G-2T RAM, 25Gbit ethernet
- separate network (no internet, limited campus services)
- NFS root, local /tmp, two global temp file systems
- SGE

EXERCISES()

- Look at the XREFERENCE(«http://ilm.eb.local/ganglia/», «web frontend») of the ganglia monitoring system.
- Run the CMD(«qhost»), CMD(«lscpu»), CMD(«free»), CMD(«w») and CMD(«htop») commands to list the nodes, print the CPUs, the available memory and swap, and the load average.
- Examine all columns of the CMD(«q-charge --no-joblist --no-globals») output.
- Open two terminals and ssh into two different cluster nodes (note: the CMD(«qhost») command prints the names of all nodes). Run CMD(«touch ~/foo-$LOGNAME») on one of them to create a file in your home directory. Check whether the file exists on the other node by executing CMD(«ls -l ~/foo-$LOGNAME»). Do the same with CMD(«touch /tmp/foo-$LOGNAME»).
- Read the section on the accounting system of the XREFERENCE(«http://ilm.eb.local/clusterdoc/The-Accounting-System.html#The-Accounting-System», «cluster documentation») to learn how charges are computed.

HOMEWORK(«
Find three different ways to determine how many CPU cores the cluster has.
», «
- Log in to any cluster node and read the message of the day.
- Run CMD(«qhost») and add up the third column.
- Run CMD(«nproc»), CMD(«lscpu») or CMD(«cat /proc/cpuinfo») on each node and sum up the results.
- Run CMD(«qconf -se <node>») for each node and sum up the values shown as CMD(«num_proc»).
- Run CMD(«q-gstat -s») and add up the slot counts.
- Read the first sentence on the XREFERENCE(«http://ilm.eb.local/clusterdoc/», «cluster documentation main page»).
- Visit the XREFERENCE(«http://ilm.eb.local/ganglia/», «ganglia») page and subtract from the number shown as "CPUs Total" the CPU count of the (two) servers which are not cluster nodes.
»)

HOMEWORK(«
Read the CMD(«q-charge») manual page and learn about the CMD(«--no-joblist») option. Write a config file for CMD(«q-charge») which activates this option automatically. Hand in your config file.
», «
Simply create the file named CMD(«.q-chargerc») in the home directory which contains the single line CMD(«no-joblist»). However, with this file in place, there is no easy way to EMPH(«enable») the job list.
»)

SECTION(«Submitting and Monitoring»)

- interactive and non-interactive (batch) jobs
- CMD(«qsub»): submit job scripts
- CMD(«qstat»): monitor job state
- CMD(«h_vmem»), CMD(«h_rt»): specify memory and running time
- CMD(«qdel»): remove running or waiting jobs

EXERCISES()

- Execute CMD(«qlogin -l h_rt=60») to get a shell on a random(?) cluster node.
- Create a file called REFERENCE(«testscript.sh», «testscript.sh») with the content shown below.
- Look at the CMD(«qsub») manual page to tell which of the following options of CMD(«qsub») might be useful to set: CMD(«-l h_vmem -l h_rt -cwd -j -V -N -pe»).
- Submit CMD(«testscript.sh») with CMD(«qsub -cwd testscript.sh»).
- Quick! Type CMD(«qstat»). Depending on the current cluster load you will either see your job in the queue (waiting), running, or no longer there (already finished).
- After your job has finished, find out whether it was successful by running CMD(«qacct -j <jobid>»). If you can't remember the job ID, look at the files that were created.
- How much memory did your job use?
- Let's see what our CMD(«testscript.sh») did. Where and what is the output of the three commands?
- Submit the same job script again and remove it with CMD(«qdel») while the job is running or waiting.

HOMEWORK(«
Write a submit script which prints out the host it is running on. Submit the script and request a running time of one minute, and 500M of memory. Hand in the script, the command you specified for submitting the script, and the output.
», «
The script only needs to contain the single line CMD(«hostname»). In particular, the shebang (CMD(«#!/bin/sh»)) may be omitted.
»)

SECTION(«Array jobs and parallel jobs»)

- array job: a single job with many subjobs. Equivalent to a set of jobs which all run the same job script.
- parallel job: a job that uses more than one slot (CPU core)

EXERCISES()

- Run CMD(«mkdir array_job_dir») and create 20 files in that directory called CMD(«input-1») to CMD(«input-20») (hint: example from last week).
- Create REFERENCE(«array_job.sh», «array_job.sh») and discuss what the script does.
- Submit an array job to the cluster using CMD(«qsub -t 1-20 array_job.sh»). Once all array tasks have finished, you'll find that all your files were renamed.
- You might want to check whether the tasks succeeded. Use CMD(«qacct») to check the exit codes of all tasks. Think about pipes and the commands CMD(«sort»), CMD(«uniq») and CMD(«grep») to make this easier.
- Run CMD(«echo stress -c 2 | qsub -l h_rt=100») to submit a job. Use CMD(«qstat») to find the node on which the job is running. Run CMD(«ssh -t <node> htop») and check how many stress processes are running and the share of CPU time they get. Repeat, but this time submit a parallel job by adding CMD(«-pe parallel 2») to the options for CMD(«qsub»).

HOMEWORK(«
Discuss when it makes sense to restrict the number of simultaneously running jobs.
», «
One reason is to be nice to others: if you limit the number of your jobs you don't block other users by occupying the whole cluster. This is only important for long running jobs though, as the SGE software tries to balance jobs between users. Another reason is to not overload the file server in case your jobs do heavy I/O.
»)

HOMEWORK(«
Submit the REFERENCE(«array_job.sh», «array_job.sh») script again as an array job, but make sure that at most two of the 20 tasks are going to run simultaneously. Hand in the corresponding CMD(«qsub») command.
», «
The command CMD(«qsub -t 1-20 -tc 2 array_job.sh») will run at most two of the 20 tasks simultaneously.
»)

SECTION(«Job running time and memory consumption»)

- default: hard limit of 1G RAM, killed after one day
- Q: How long will my job run? How much memory does it need? A: CMD(«qacct»)
- long waiting times for jobs with high resource requests
- short queue

EXERCISES()

- If a job needs much memory, the default of 1G might not be enough. Find out how much memory one terminated job of yours actually needed by running CMD(«qacct -j <jobid>»). In particular, look at CMD(«exit_status») (not zero if something went wrong) and CMD(«maxvmem») (actual memory consumption of your process).
- Submit the job script again, but this time specify CMD(«-l h_vmem») to request more memory. Once the job is complete, compare the CMD(«maxvmem») field of the CMD(«qacct») output with the value specified with CMD(«-l h_vmem»).
- Jobs may also need more time than the default limit (one day) allows. Use CMD(«-l h_rt») to request a longer running time. Run a test job with the default settings or a rough estimate and see whether it fails (CMD(«qacct»), exit status not zero). Look at the start and end times and compare them with the CMD(«-l h_rt») value. Adjust CMD(«-l h_rt») and run the job again.
Reevaluate until your job runs successfully, as sketched below.
- If your job is very short, you can set CMD(«-l h_rt») to less than one hour to enter the short queue, for example CMD(«-l h_rt=0:30:0») for a maximum running time of 30 minutes. By specifying a small value for CMD(«-l h_rt») you can use this resource and possibly get your job scheduled earlier than with the default values. The command CMD(«qconf -sql») lists the names of all queues, and CMD(«qconf -sq <queue> | grep "^._rt"») shows the soft and the hard running time limits of a queue. See the section on resource limits of the CMD(«queue_conf») manual page to learn more about the two types of limits.
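The following is a minimal sketch of one round of this submit-and-check cycle. The script name CMD(«myjob.sh») and the job ID are placeholders, and the requested values are examples only:

	# request 2G of memory and two hours of running time (example values)
	qsub -cwd -l h_vmem=2G -l h_rt=2:0:0 myjob.sh
	# while the job exists: check its state (qw = waiting, r = running)
	qstat
	# after the job has finished: compare requests with actual consumption
	qacct -j <jobid> | grep -E 'exit_status|maxvmem|ru_wallclock'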

SECTION(«Queues, Queue Instances»)

A queue is a named description of the requirements a job must meet to be started on one of the nodes, like the maximum running time or the number of slots. The queue descriptions are organized in plain text files called queue configurations, which are managed by the qmaster and which can be modified by privileged users by means of the qconf(1) command.

Diagram: cluster queues, queue instances, and hosts. A cluster queue is a set of queue instances, one on each host of the queue's host list.

Among other configuration parameters, a queue configuration always contains the list of execution hosts. On each node of this list one realization of the queue, a queue instance, is running as part of the execution daemon sge_execd(8). The list is usually described in terms of hostgroups, where each hostgroup contains execution hosts which are similar in one aspect or another. For example, one could define the hostgroup @core64 to contain all nodes which have 64 CPU cores. The diagram above illustrates these concepts.
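For illustration, the relevant part of a queue configuration as printed by CMD(«qconf -sq») could look as follows. The queue name, hostgroup and limits are made up, and only a few of the many parameters described in CMD(«queue_conf(5)») are shown:

	qname                 long.q
	hostlist              @core64
	slots                 64
	h_rt                  168:0:0
	h_vmem                INFINITY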

While a running job is always associated with one queue instance, it is recommended to not request a specific queue at job submission time, but to let the qmaster pick a suitable queue for the job.

An execution host can run more than one queue instance, and queues can be related to each other to form a subordination tree. Jobs in the superordinate queue can suspend jobs in the subordinated queue, but suspension always takes place at the queue instance level.
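Subordination is declared in the configuration of the superordinate queue. In the hypothetical example below (made-up queue name and threshold), the instance of CMD(«idle.q») on a host is suspended as soon as two or more slots of the superordinate queue are occupied on that host; see CMD(«queue_conf(5)») for the exact semantics:

	subordinate_list      idle.q=2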


EXERCISES()

- Run CMD(«qconf -sql») to see the list of all defined queues. Pick a queue and run CMD(«qconf -sq <queue>») to show the parameters of the queue. Consult the CMD(«queue_conf(5)») manual page for details.
- Read the CMD(«prolog») section of the CMD(«queue_conf(5)») manual page. Examine the CMD(«/usr/local/sbin/prolog») file on the nodes and try to understand what it actually does. See commit CMD(«0e44011d») in the user-info repository for the answer.
- Run CMD(«echo stress -c 2 | qsub») to submit a job which starts two threads. Determine the node on which the job is running, log in to this node and examine the CPU utilization of your job.

SECTION(«Accounting»)

- accounting file contains one record for each EMPH(«finished») job
- plain text, one line per job, entries separated by colons
- qacct: scans accounting file
- summary or per-job information
- buggy
- easy to parse "by hand"

EXERCISES()

- Run CMD(«qacct -o») to see the full user summary and CMD(«qacct -o $LOGNAME -d 90») to see the summary for your own user, including only the jobs of the last three months.
- Check the CMD(«accounting(5)») manual page to learn more about the fields stored in the accounting records.
- Submit a cluster job with CMD(«echo sleep 100 | qsub -l h_vmem=200M -l h_rt=10»), wait until it completes, then check the accounting record for your job with CMD(«qacct -j <jobid>»). In particular, examine the CMD(«failed») and CMD(«maxvmem») fields. Compare the output with CMD(«print_accounting_record.bash <jobid>»), where the CMD(«print_accounting_record.bash») script is shown REFERENCE(«print_accounting_record.bash», «below»).
- Check out the XREFERENCE(«http://ilm.eb.local/stats/», «statistics page»). Tell which histograms were created from the accounting file.
- Search for CMD(«com_stats») in the XREFERENCE(«http://ilm.eb.local/gitweb/?p=user-info;a=blob;f=scripts/admin/cmt;hb=HEAD», «cluster management tool») and examine how these statistics are created.

SECTION(«Complex Attributes»)

- used to manage limited resources
- requested via CMD(«-l»)
- global, or attached to a host or queue
- predefined or user defined
- each attribute has a type and a relational operator
- requestable and consumable

EXERCISES()

- Run CMD(«qconf -sc») to see the complex configuration.
- Check the contents of CMD(«/var/lib/gridengine/default/common/sge_request»).
- Run CMD(«qconf -se node444») to see the complex configuration attached to node444.
- Discuss whether it would make sense to introduce additional complex attributes for controlling I/O per file system.

SECTION(«Tickets and Projects»)

- tickets: functional/share/override
- project: (name, oticket, fshare, acl)
- jobs can be submitted to projects (CMD(«qsub -P»))

EXERCISES()

- Read the CMD(«sge_project») manual page to learn more about SGE projects.
- Examine the output of CMD(«qconf -ssconf») with respect to the three types of tickets and their weights.
- Check the CMD(«sge_priority(5)») manual page to learn more about the three types of tickets.
- Discuss whether the SGE projects concept is helpful with respect to accounting issues and grants (e.g., ERC).
- Discuss whether introducing override or functional share tickets for projects is desirable.

SECTION(«Scheduler Configuration»)

- fair share: heavy users get reduced priority
- share tree: assign priorities based on historical usage
- reservation and backfilling

EXERCISES()

- Run CMD(«qstat -s p -u "*"») to see all pending jobs. Examine the order and the priority of the jobs.
- Run CMD(«qconf -ssconf») to examine the scheduler configuration.
In particular, look at the CMD(«policy_hierarchy») entry. Consult the CMD(«sched_conf(5)») and CMD(«share_tree(5)») manual pages for details.
- Discuss the various scheduling policies described in this XREFERENCE(«http://gridscheduler.sourceforge.net/howto/geee.html», «document»).
- Discuss the pros and cons of scheduling preferentially to hosts which are already running a job. That is, should CMD(«load_formula») be CMD(«np_load_avg») (the default) or CMD(«slots»)? See XREFERENCE(«http://arc.liv.ac.uk/SGE/howto/sge-configs.html», «sge-configs») and CMD(«sched_conf(5)») for details.

SUPPLEMENTS()

SUBSECTION(«testscript.sh»)
	#!/bin/sh
	sleep 100 # wait to give us time to look at the job status
	echo "This is my output" > ./outputfile
	echo "Where does this go?"
	ls ./directorythatdoesnotexisthere
SUBSECTION(«array_job.sh»)
	#!/bin/sh
	# Lines beginning with #$ are interpreted by SGE as if their content
	# had been passed as options to the qsub command. By the way, you
	# don't need to write such lines into "testscript.sh" ;)
	#$ -cwd
	#$ -j y
	#$ -l h_rt=0:1:0
	# SGE sets SGE_TASK_ID to the index of the current array task
	mv input-$SGE_TASK_ID ./output-$SGE_TASK_ID
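A typical way to submit this script as an array job with 20 tasks, and one possible pipe to summarize the exit codes of all tasks afterwards (the job ID is a placeholder):

	qsub -t 1-20 array_job.sh
	# count how often each distinct exit status occurred among the tasks
	qacct -j <jobid> | grep '^exit_status' | sort | uniq -c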
SUBSECTION(«print_accounting_record.bash»)
	#!/bin/bash
	# exactly one argument, the job number, is required
	(($# != 1)) && exit 1
	# field 6 of an accounting record is the job number
	awk -F: -v id="$1" '$6 == id' /var/lib/gridengine/default/common/accounting
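Assuming the accounting file exists at the path above, the script could be used as follows, where 12345 is an example job ID; piping the colon-separated record through CMD(«tr») prints one field per line:

	./print_accounting_record.bash 12345 | tr ':' '\n'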