TITLE(« Free yourself from all the confusion «...» Forget about desktop decor, senseless glitter, and animations «....» Say goodbye «...» to the rodent and welcome the ultimate interface to your computer. -- Ratpoison propaganda », __file__)

SECTION(«Essential Command Line Tools»)

- man, apropos
- cd, pwd, ls, stat
- rm, cp, mv, ln
- echo, printf, cat
- less
- tar
- file
- find, xargs
- cut, sort, uniq, wc
- mkdir, rmdir
- df, du
- nice, ionice
- ps, kill, top, htop, jobs
- ping, wget

SECTION(«CMD(«gzip»), CMD(«bzip2») and CMD(«xz»)»)

EXERCISES()

- Create the file CMD(reads.fq) from the supplement using CMD(cat > reads.fq). Discuss the usage of CMD(Ctrl-d) versus CMD(Ctrl-c) to end the CMD(cat) process.
- Run CMD(file) on CMD(reads.fq). Read it using CMD(less). Compress the file using CMD(gzip). Run CMD(file) again. Try reading it.

HOMEWORK(«
Reading from the special file CMD(«/dev/urandom») returns random data. Explain what the command CMD(«head -c 10000000 /dev/urandom | base64 > foo») does. Execute CMD(«cp foo bar») and CMD(«cp foo baz») to create two copies of CMD(«foo»). Use the CMD(«gzip»), CMD(«bzip2») and CMD(«xz») commands to compress the three files using three different compression algorithms. Hand in the running time of each command, the sizes of the compressed files, and your conclusion about which algorithm/program should be preferred.
», «
The CMD(«head | base64») command reads 10M of random data and encodes it with the base64 algorithm, which encodes each group of three input bytes into four output bytes, each representing six input bits as a character from the set CMD(«{a-z, A-Z, 0-9, +, /}»). The result is written to the file named CMD(«foo»). The commands CMD(«gzip foo»), CMD(«bzip2 bar») and CMD(«xz baz») each compress one of the files. The running times were 0.9s, 3.1s and 8.9s, respectively. It is surprising how much the three running times differ, although it is to be expected that the most "modern" program, CMD(«xz»), takes the longest. At first sight it might also be surprising that the sizes of the compressed files are almost identical (all three were almost exactly 10M, the size of the unencoded data). But taking into account that we are dealing with random data, it is clear that no algorithm can do better than undo the base64 expansion, i.e., shrink each file to about 3/4 of its size. For non-random data the file sizes would have differed much more, and which algorithm is best depends on the use case. For example, if the compressed file will be downloaded many times from an FTP server, it might be worthwhile to spend extra CPU time during compression to make the file as small as possible in order to save bandwidth.
»)
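
A minimal sketch of how the numbers asked for above can be collected, assuming CMD(«foo»), CMD(«bar») and CMD(«baz») have been created as described:

	ls -l foo bar baz               # sizes before compression (all identical)
	time gzip foo                   # replaces foo with foo.gz
	time bzip2 bar                  # replaces bar with bar.bz2
	time xz baz                     # replaces baz with baz.xz
	ls -l foo.gz bar.bz2 baz.xz     # sizes after compression
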
SECTION(«CMD(«sed»), CMD(«grep»), CMD(«tr»)»)

EXERCISES()

- Create a file with three lines (CMD(printf "one\ntwo\nthree\n" > foo)). Use CMD(sed) or CMD(tr) to replace all newline characters with spaces. Discuss cases where it is not an option to open the file in an editor.
- Unpack (or recreate) the file CMD(reads.fq) from the previous exercise.
- Extract only read2 from CMD(reads.fq) using CMD(grep). To do that, check CMD(man grep) for the CMD(-A) and CMD(-B) options.
- Use CMD(sed) to extract only the lines containing the ID. What are the dangers of doing it the intuitive way using grep?
- A little more advanced: use sed to write a very short, highly efficient FastQ to FastA converter.

HOMEWORK(«
Find a command which prints the usage of all file systems which are NFS-mounted from a specific server (e.g., neckar, or arc).
», «
This exercise shows that things can be trickier than they look at first glance. The complications are (a) the string "CMD(«arc»)" may well be part of an unrelated mount point, and (b) the server name can be specified as either CMD(«arc») or CMD(«arc.eb.local»). Hence the simple CMD(«df -t nfs | grep arc») is not robust enough, at least for scripts for which subtle future breakage (when another file system is mounted) should be avoided. A better variant is df -t nfs | grep '^arc\(\.eb\.local\)*:'
»)

SECTION(«CMD(«chmod»), CMD(«chgrp»)»)

- Change permissions or group.

EXERCISES()

- Discuss the implications of project directories with mode 777.
- Use the -group option to CMD(find) to locate files and directories whose GID is ebio. Discuss whether changing the GID to e.g. abt6 with CMD(chgrp -R) would trigger a backup of these files. For the CMD(chgrp) command, which flags besides -R would you specify in order to handle symbolic links correctly?
- Discuss why CMD(chown) is root-only.

HOMEWORK(«
Come up with a CMD(find | xargs chmod) command that turns on the group-executable bit for all files which are executable by their owner, but leaves non-executable files unchanged. Does your command work with filenames containing spaces? Does it work if a filename contains a newline character?
», «
CMD(«find ~ -perm /u+x -not -perm /g+x -print0 | xargs -0 chmod g+x»). The CMD(«-not -perm /g+x») part is not strictly necessary, but it may speed up the command and it preserves the ctime of those files which are already group-executable. The CMD(«-print0») option is the only way to make this type of command safe, because every character except the NUL byte can be part of a path. So it is strongly recommended to use it whenever possible. Unfortunately this feature is not part of the POSIX standard. It is supported on Linux and FreeBSD (hence MacOS) though.
»)
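
To see why the answer above insists on CMD(«-print0») and CMD(«-0»), the following sketch (using a made-up scratch directory) creates file names containing a space and a newline and compares the two pipelines:

	mkdir -p /tmp/print0-demo && cd /tmp/print0-demo
	touch plain 'with space' "$(printf 'with\nnewline')"
	find . -type f | xargs ls -l              # xargs splits names at whitespace, so ls complains
	find . -type f -print0 | xargs -0 ls -l   # NUL-separated names are passed through intact
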
SECTION(«The LDAP user database»)

- LDAP: campus-wide user database
- stores your personal information: name, password, phone number,...
- read-only access w/o auth
- write access through web app
- ldapsearch, id, finger, last, w

EXERCISES()

- Run the command CMD(finger $LOGNAME) and discuss the meaning of all output fields.
- Run the command CMD(id). What's the meaning of the terms CMD(uid) and CMD(gid) shown in the output?
- Show your own LDAP entry: CMD(ldapsearch -x uid=$LOGNAME). Use a similar command to show the entry of somebody who left the institute. How can one tell whether an account is active?
- List all abt6 users: CMD(ldapsearch -x cn=abt6)
- Use CMD(ldapsearch -x) to find other people at our institute with the same surname ("sn") as you.
- Use CMD(id) to figure out what groups you are in.

HOMEWORK(«
Find all former members of your department or group.
», «
Example for department 1: ldapsearch -x homeDirectory | grep -A 1 vault | grep '^homeDirectory: /ebio/abt1'
»)

SECTION(«CMD(«rsync»)»)

- file-copying tool by Andrew Tridgell (1996)
- remote copying via ssh
- synchronization
- performant, sends only differences
- aim: know the most common options

EXERCISES()

- Preparation: Create a scratch directory in /tmp and store ten 10M text files there: CMD(«for i in $(seq 10); do (head -c 7500000 /dev/urandom | base64 > $RANDOM &) ; done»).
- Create a copy (also in CMD(«/tmp»)) with CMD(«rsync») and measure the running time: CMD(«time rsync -ax $src $dest»). Check the rsync manual for the meaning of these two options.
- Modify and remove some files in the source directory, then run rsync again, this time with the CMD(«-ax --delete») options to propagate your changes to the destination.
- Suppose files whose names contain a "1" are generated, and you don't want to synchronize these. Find the correct argument for rsync's CMD(«--exclude») option. Use the CMD(«--dry-run -v») options to check.
- Find a way to use rsync for a non-local copy where the remote ssh server listens on a port different from 22. To try this, forward the port 24865 + $UID to port 22 with CMD(«ssh -NfL $((24865 + $UID)):localhost:22 localhost») so you can ssh into localhost through this port.
- Look at the XREFERENCE(http://people.tuebingen.mpg.de/maan/dss/, dyadic snapshot scheduler).
- Suppose source and destination directories are on different hosts and contain slightly different versions of a single huge file. For example, somewhere near the beginning a small part of the file was deleted, and another small part was appended to the end of the file. Suppose further that the file is so large (or the network so slow) that copying all of it would take days. Think about an algorithm that finds the above differences without transferring the whole file.

HOMEWORK(«
The XREFERENCE(https://en.wikipedia.org/wiki/MD4, Wikipedia page) on the CMD(«MD4») message-digest algorithm states that the security of CMD(«MD4») has been severely compromised. The CMD(«MD5») algorithm has also been known to be broken for years. Yet these two algorithms are an essential part of rsync. Is this a problem for the correctness of CMD(«rsync»)?
», «
The fact that one can find hash collisions for CMD(«MD4») and CMD(«MD5») is usually not a problem for the integrity of CMD(«rsync»). First, the weak hash algorithms are not used for any type of authentication, since CMD(«rsync») relies on SSH for this. Second, why would a malicious CMD(«rsync») user want to modify the files on the source or destination to create a hash collision if he already has access to these files? On the other hand, for non-manipulated files, the probability of a hash collision is so small that other types of failures are much more likely. Finally, the CMD(«MD4») and CMD(«MD5») algorithms are used in combination with other checksums, so modifying a file while keeping its CMD(«MD5») hash the same is not enough to fool CMD(«rsync»).
»)

HOMEWORK(«
Describe the idea of rolling checksums and how they are used in CMD(«rsync»).
»)

SECTION(«The Cron service»)

- cron daemon (CMD(«crond»)) executes scheduled commands
- CMD(«crontab») (file): table used to drive crond
- CMD(«crontab») (program): command to install, update, or list tables
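
The crontab file format is worth knowing before doing the exercises: each non-comment line consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run. An illustrative entry (the script path is made up):

	# minute hour day-of-month month day-of-week command
	# run the backup script every Monday at 03:00
	0 3 * * 1 /usr/local/bin/backup.sh
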
EXERCISES()

- Preparation: Set the CMD(«EDITOR») variable to your favorite editor.
- Run CMD(«crontab -l») to see your current crontab.
- Set up a cron job which runs every minute and appends the current date to a file.
- Note that there are two CMD(«crontab») manual pages. The command CMD(«man crontab») gives the manual for the crontab program. Run CMD(«man 5 crontab») to see the manual page for the configuration file. Look at the examples at the end.
- Write a script which checks if a device is mounted and, if so, executes some operations.

HOMEWORK(«
Discuss the difference between a cron job and a script which does something like CMD(«while :; do run_experiment; sleep $TIMEOUT; done»). Describe what happens when
- the user logs out or closes the terminal,
- the server gets rebooted,
- CMD(«run_experiment») runs for more than CMD(«$TIMEOUT») seconds.
», «
There are many differences:
- A cron job is started by the cron daemon and is executed in a different environment, where some variables like CMD(«PATH») might have a different value.
- The cron daemon is started automatically during boot, so a cron job will still be run after a reboot, while the command loop will not be restarted automatically in this case.
- If the user logs out or closes the terminal, the command loop might be killed. This depends on whether it was started in the background and whether CMD(«stdin»), CMD(«stdout») and CMD(«stderr») were redirected.
- The cron job might be started a second time while the previous invocation is still running. This can happen if a file system is slow or if the job runs for longer than the scheduling period.
- The timing of the command loop will drift over time because the running time of the script is not taken into account. The cron job, on the other hand, will always run at the specified times.
»)

SECTION(«Regular Expressions»)

- regex: sequence of characters that forms a search pattern
- Kleene (1956)
- compilation: RE -> finite automaton
- Chomsky hierarchy
- basic, extended, perl

EXERCISES()

- Understand the difference between basic, extended and perl regular expressions. Discuss which ones should be preferred in various cases.
- In a web service, is it safe to match a user-specified expression against a known input? Does it help if both regex and input are bounded by, say, 1000 characters? Read the XREFERENCE(«https://en.wikipedia.org/wiki/ReDoS», «ReDoS») Wikipedia page for the answer.
- Match the "evil" extended regex CMD(«(a|aa)+$») against a long string of "CMD(«a»)" characters followed by a single "CMD(«b»)" (the failing match is what forces the excessive backtracking). Try various regex implementations (CMD(«sed»), CMD(«grep»), CMD(«awk»), CMD(«perl»), CMD(«python»)).
- Discuss the consequences of the fact that perl regular expressions do EMPH(«not») necessarily correspond to regular languages in the Chomsky hierarchy.
- Is it possible to construct a regular expression that matches a line of text if and only if the line is a syntactically correct perl or C program?

HOMEWORK(«
Find a regular expression which matches any number of "CMD(«a»)" characters, followed by the EMPH(same) number of "CMD(«b»)" characters. Alternatively, prove that no such regular expression exists.
», «
If there were a regular expression that matched any number of "CMD(«a»)'s" followed by the same number of "CMD(«b»)'s", there would also exist a finite automaton accepting exactly these strings. Such an automaton would have to remember how many "CMD(«a»)'s" it has seen so far in order to compare that count against the number of "CMD(«b»)'s", which requires a separate state for each possible count, i.e., infinitely many states. This contradicts the finiteness of the automaton, hence no such regular expression exists. Another way to prove this is to apply the so-called EMPH(«pumping lemma»). The take-away of this exercise is (a) to get a feeling for what can be described by regular expressions, and (b) that understanding the underlying theory can help you to avoid wasting a lot of effort on trying to do provably impossible things.
»)

SECTION(«CMD(«awk»)»)

- general purpose text utility
- Aho, Weinberger, Kernighan (1977)
- search files for certain patterns, perform actions on matching lines
- specified in POSIX, here: gawk (GNU awk)
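
An CMD(«awk») program is a list of CMD(«pattern { action }») pairs which are applied to every input line; the special CMD(«BEGIN») and CMD(«END») patterns run before and after the input is processed. A small illustrative example (the file name is made up):

	# print the first field of every line containing "error",
	# then report the total number of input lines
	awk '/error/ { print $1 } END { print NR, "lines read" }' logfile
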
EXERCISES()

- What does the following awk program do? CMD(«ls -l | awk '{x += $5} END {print x}'»)
- Check the awk uses in the XREFERENCE(«http://ilm.eb.local/gitweb/?p=cluster;a=blob;f=scripts/admin/cmt;hb=HEAD», «cluster management tool»).
- Write an CMD(«awk») program which prints the length of the longest input line.
- Write an awk program which uses an associative array to print how often different words appear in the input.
- Check the XREFERENCE(«http://www.gnu.org/software/gawk/manual/gawk.html», «gawk manual») to get an idea of the features and the complexity of CMD(«awk»).
- Sometimes it is useful to initialize an awk variable from a shell environment variable. Learn to use the CMD(«--assign») option to achieve this.
- Discuss what the CMD(«--traditional») option to CMD(«awk») does and when it should be used.

HOMEWORK(«
On any system, determine the average file system size and the average percentage of used space.
», «
The command CMD(«df -P») prints a header line, followed by one line per file system with the size and the percentage of used space in the second and fifth columns. The following pipeline skips the header and prints the desired averages: df -P | awk 'NR > 1 {s2 += $2; s5 += $5} END {print s2 / (NR - 1), s5 / (NR - 1)}'
»)

SECTION(«CMD(«screen»)»)

- terminal multiplexer
- Laumann, Bormann (1987), GNU
- sessions, detach, attach
- multiuser sessions

EXERCISES()

- Start screen, run CMD(«ls»), detach the session, log out, log in, re-attach the session.
- In a screen session with multiple windows, type CMD(«CTRL+a "») to see the window list.
- The key to move the cursor to the beginning of the command line is mapped to the same character as the screen escape key, CMD(«CTRL+a»). Learn how to work around this ambiguity.
- Type CMD(«CTRL+a :title») to set the window title.
- Put the following line into your .screenrc: CMD(«caption always "%{cb} «%{wb}Act: %-w%{cb}%>%n(%t)%{-}%+w%<%{cb}»%{-}"») and see what it does.
- Learn to copy and paste from the screen scrollback buffer (CMD(«CTRL+a ESCAPE»)). Increase the size of the scrollback buffer in your .screenrc file.
- Learn how to use CMD(«split») and CMD(«split -v»).
- To allow your colleague CMD(«nubi») to attach your screen session in read-only mode, create a suitable REFERENCE(«.screenrc.multi», «multiuser screenrc») file and start a session with this config file: CMD(«screen -C ~/.screenrc.multi -S multi»). Ask CMD(«nubi») to run CMD(«screen -x $OWNER/multi»), where CMD(«OWNER») is your username. Read the section on the CMD(«aclchg») command in the screen manual for how to allow write access. Discuss the security implications.

SECTION(«CMD(«adu»)»)

- advanced disk usage
- creates a database, so it is only slow "once"
- produces a summary or a list of the largest directories

EXERCISES()

- Create an adu database from your project directory: CMD(«adu -C -d $DB -b $BASE»), where CMD(«$DB») is the (non-existing) directory to create for the database, and CMD(«$BASE») is an existing (sub-)directory of your storage project.
- List the 10 largest directories: CMD(«adu -S -d $DB -s "-m global_list -l 10"»), then the top-10 directories with respect to file count: CMD(«adu -S -d $DB -s "-m global_list -l 10 -s f"»).
- Print the 10 largest directories, but consider only those which contain the letter "CMD(«a»)".
- Read the adu manual page to learn how to customize the output with format strings.

SECTION(«CMD(«make»)»)

- most often used to build software, useful also in other situations
- Stuart Feldman (1976), POSIX, GNU
- regenerates dependent files with minimal effort
- keeps generated files up-to-date without running your entire workflow
- Makefile abstracts out dependency tracking
- rule, recipe, target, prerequisites
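
To illustrate the terminology from the list above with a minimal hand-written example (not part of the course material): CMD(«out.txt») is the EMPH(«target»), CMD(«in.txt») is its EMPH(«prerequisite»), the indented CMD(«tr») command is the EMPH(«recipe»), and the whole block forms one EMPH(«rule»).

	# make runs the recipe only if out.txt is missing or older than in.txt
	out.txt: in.txt
		tr a-z A-Z < in.txt > out.txt
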
EXERCISES()

- Look at this XREFERENCE(«http://ilm.eb.local/gitweb/?p=user-info;a=blob;f=backup/Makefile;hb=HEAD», «simple Makefile») which creates the bareos configuration files from a (public) template file and a (secret) file that contains passwords. Identify targets, recipes, rules, prerequisites.
- Create a couple of text files with CMD(«head -c 100 /dev/urandom | base64 -w $(($RANDOM / 1000 + 1)) > $RANDOM.txt»). Write a Makefile which creates for each CMD(«.txt») file in the current directory a corresponding CMD(«.wc») file which contains the number of lines in the text file. Extend the Makefile by a rule which reads the CMD(«.wc») files to create an additional file CMD(«sum»), which contains the sum of the line counts (CMD(«cat *.wc | awk '{s += $1} END {print s}'»)). Draw the dependency graph which is described in your Makefile. Modify some files (CMD(«echo hello >> $NUMBER.txt»)) and run CMD(«make») again. Discuss the time it takes to compute the new sum. Add a CMD(«clean») target which lets you remove all derived files (CMD(«.wc»), CMD(«sum»)) with CMD(«make clean»).
- There are two flavors of CMD(«make») variables. Read the XREFERENCE(«http://www.gnu.org/software/make/manual/make.html», «make documentation») to understand the difference.
- Look at the list of automatic CMD(«make») variables in the section on implicit rules of the XREFERENCE(«http://www.gnu.org/software/make/manual/make.html», «make manual»).

HOMEWORK(«
Explain the difference between simply and recursively expanded make variables. Discuss under which circumstances either of the two should be used in a Makefile.
»)

SECTION(«CMD(«autoconf»)»)

- creates shell scripts that configure software packages
- makes life easier for the EMPH(users) of your software
- Mackenzie (1991), GNU, written in perl, uses m4
- idea: test for features, not for versions
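
To make the "test for features, not for versions" idea concrete: a generated configure script probes the system for what the package needs, conceptually like the following plain-shell sketch (the probed program is taken from the exercises below):

	# check whether a required program is available, regardless of its version
	if command -v sha2sum > /dev/null 2>&1; then
		echo "sha2sum found: $(command -v sha2sum)"
	else
		echo "error: sha2sum is required but was not found" >&2
		exit 1
	fi
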
EXERCISES()

- Preparation: Assume for simplicity that your software package consists only of a single script REFERENCE(«s1l», «s1l») which prints the sha1sum of each file in the given directory. Create this script in a scratch directory and add the REFERENCE(«configure.ac», «configure.ac») and REFERENCE(«Makefile.in», «Makefile.in») files as well.
- Run CMD(«autoconf») to create the configure script, then run CMD(«./configure -h») to see the automatically generated options. Run CMD(«./configure --prefix $HOME»), CMD(«make») and CMD(«make install») to install the "package". Notice how the value specified to the CMD(«--prefix») option propagates from the command line to CMD(«Makefile») (the REFERENCE(«Generated Makefile», «generated Makefile») supplement shows an example of the result).
- Draw a graph which illustrates how the generated files depend on each other. Compare your graph with the diagram on the XREFERENCE(«https://en.wikipedia.org/wiki/Autoconf», «autoconf Wikipedia page»). In this larger diagram, identify those parts of the graph which are present in the minimal CMD(«s1l») example.
- Suppose you put a lot of hard work into improving your program to use the much more secure sha2 checksum instead of sha1 (CMD(«sed -i s/sha1/sha2/g *; mv s1l s2l»)). To give your customers the best possible experience, you'd like the configuration step to fail on systems where the CMD(«sha2sum») program is not installed (notice the mantra: check for features, not versions). Add this REFERENCE(«AC_PATH_PROG», «test») to CMD(«configure.ac») and run CMD(«autoconf») and CMD(«./configure») again to see it fail gracefully at configure time (rather than at runtime).
- Discuss the pros and cons of configure-time checks vs. run-time checks.
- Change CMD(«configure.ac») so that it no longer fails but creates a Makefile which installs either s1l or s2l, depending on whether CMD(«sha2sum») is installed.
- Read the "Making configure Scripts" and "Writing configure.ac" sections of the XREFERENCE(«http://www.gnu.org/software/autoconf/manual/autoconf.html», «autoconf manual»).
- Notice that CMD(«configure.ac») is in fact written in the m4 macro language. Look at the XREFERENCE(«http://www.gnu.org/software/m4/manual/m4.html», «m4 manual») to get an overview.

SUPPLEMENTS()

SUBSECTION(«FastQ File»)
  @read1
  ATGCCAGTACA
  +
  DDDDDDDDDDD
  @read2
  ATCGTCATGCA
  +
  DDDDDDDDDDD

SUBSECTION(«.screenrc.multi»)
	multiuser on
	umask ?-wx
	acladd nubi

SUBSECTION(«configure.ac»)

	AC_INIT(«MPI sha1 list», «0.0.0», «me@s1l.org», «s1l», «http://s1l.mpg.de/»)
	AC_CONFIG_FILES(«Makefile»)
	AC_OUTPUT

SUBSECTION(«Makefile.in»)
	all:
		@echo "run make install to install s1l in @prefix@"
	install:
		install -m 755 s1l @prefix@/bin/

SUBSECTION(«s1l»)

	#!/bin/sh
	find "$1" -type f -print0 | xargs -0 sha1sum

SUBSECTION(«AC_PATH_PROG»)
	AC_PATH_PROG(«SHA2SUM», «sha2sum»)
	test -z "$SHA2SUM" && AC_MSG_ERROR(sha2sum required)
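
SUBSECTION(«Generated Makefile»)

For illustration (referenced from the autoconf exercises): roughly what the generated CMD(«Makefile») looks like after CMD(«./configure --prefix $HOME») has been run by a user whose home directory is CMD(«/home/nubi») (an assumed path). The point is that every CMD(«@prefix@») of CMD(«Makefile.in») has been replaced by the configured value.

	all:
		@echo "run make install to install s1l in /home/nubi"
	install:
		install -m 755 s1l /home/nubi/bin/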