On compute power.

From time to time, the cluster load spikes and I get questions about what is happening. I'd like to address the acute and the long-term issues with cluster load and consequent availability in this message. Read Section 1 to learn how to figure out "what's going on?" Read Section 2 for more background, and some concerns I have for the future of the cluster. I especially hope that lab leaders will read it all, despite the length of this treatise.

Section 1: The Acute.

You can generally figure out why the load on a given machine is high by looking at the output of the top command. When executed, top displays a continuously updated list of the processes running on the computer, initially in descending order of the percentage of available CPU time each process is consuming. You can also see per-process memory use in this listing. Another command, htop, offers a bit more flexibility in sorting and has a nice bar-graph display that shows the utilization of each processor core on the computer. So for example, if I log into ba1.mit.edu right now and run top, I see something like the following:
top - 10:35:54 up 3 days, 18:10, 46 users,  load average: 25.31, 28.19, 27.52
Tasks: 1155 total,  11 running, 1142 sleeping,   2 stopped,   0 zombie
Cpu(s): 64.2%us, 35.2%sy,  0.0%ni,  0.0%id,  0.2%wa,  0.1%hi,  0.2%si,  0.0%st
Mem:  16465856k total, 15218828k used,  1247028k free,     6176k buffers
Swap: 39511828k total, 25003684k used, 14508144k free,   130420k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                               
11345 uuuuuuu   20   0  476m 231m 1648 R   98  1.4 648:52.74 ipython                                                                                 
16381 mmm       20   0 1308m 340m 2412 S   97  2.1 596:09.67 MATLAB                                                                                 
22595 mmm       20   0 1275m 382m 2412 S   95  2.4 583:21.80 MATLAB                                                                                 
15173 mmm       20   0 1236m 297m 2412 S   84  1.8 597:42.07 MATLAB                                                                                 
23370 mmm       20   0 1228m 316m 2416 S   79  2.0 596:11.77 MATLAB                                                                               
27240 bbb       20   0  555m 251m  896 R   50  1.6  70:03.98 mris_make_surfa                                                                         
11820 uuuuuuu   20   0 1031m 447m 1652 R   50  2.8 658:25.15 ipython                                                                                 
18216 mmm       20   0 1253m 361m 2412 S   49  2.2 606:46.50 MATLAB                                                                                 
17338 mmm       20   0 1337m 355m 2412 S   48  2.2 603:48.62 MATLAB                                                                                 
20285 mmm       20   0 1253m 338m 2412 S   48  2.1 592:00.26 MATLAB                                                             
I've truncated the output here, since it goes on for quite a while. I've also changed the usernames. The output shows the processes sorted from most-CPU-consuming down to least. Just in the rows you can see above, the percentages of CPU time sum to well over 100% - so you might be wondering about that. Each server in our cluster has eight processor cores, so technically the maximum is 800%. (top sometimes shows peaks that exceed even this, but that's another story.) If you see that any one person is using a disproportionate share of resources - which is not easy to do, thanks to quotas enforced on the system, though those quotas can be circumvented - it's almost assuredly not due to malice, and simply talking with the person can usually remedy the situation. More on automatic enforcement of resource quotas, "elastic computing", and robotic policing of servers in a bit.
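If you'd rather see a per-user summary than top's scrolling per-process view, the same information can be aggregated from ps. This is a minimal sketch, assuming a standard Linux (procps) ps; the column widths are arbitrary:

```shell
# Sum %CPU and %memory per user across all of that user's processes,
# then sort with the heaviest consumers first.
ps -eo user:16,pcpu,pmem --no-headers \
  | awk '{cpu[$1] += $2; mem[$1] += $3}
         END {for (u in cpu)
                printf "%-16s %6.1f %%cpu %6.1f %%mem\n", u, cpu[u], mem[u]}' \
  | sort -k2 -rn
```

On one of our eight-core machines the %cpu column can legitimately total 800 across all users, for the same reason top's per-process figures can exceed 100%.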

Section 2: The Long-Term.

Probably more important than the actual CPU power is the amount of RAM - primary memory - that the system has available. RAM is the very fast working memory of a computer. Whenever you are fitting a model, reconstructing a brain, or running an algorithm, parts of those processes must be loaded ("paged") into RAM for computation. When a system uses up all of its working memory - well, what does a human do when all of her cognitive processing power is in use? Not operate very efficiently. When all RAM is in heavy use, one of two things can happen, depending on how the system is configured:
  1. Start killing off processes and releasing their memory back into the allocatable pool;
  2. Start using hard drive space - called "swap" - as an emergency alternative to RAM.
By the way, even systems that are configured to use swap will begin killing off processes once all the swap gets consumed, too. So one might think, "Ah! Just add tons and tons of swap space. Make a terabyte hard drive into a big ol' swap drive and go to town." There is a major flaw in this logic, however. Accessing RAM happens very quickly - access times are typically measured in nanoseconds; hard drive access times are much greater, measured in milliseconds. That may sound too abstract to seem important, but there are a million nanoseconds in one millisecond... in short, you really, really do not want your computer relying on swap, ever. Some sysadmins disable it entirely, which means that sometimes things won't run at all, but the things that do run, run quickly. Swap is an emergency resource at best.
Sometimes people have jobs to run that consume large amounts of memory, and possibly also large amounts of CPU. Most of the common neuroimaging and analysis packages - FSL, SPM (Matlab), Freesurfer, AFNI, and so forth - run with a fairly conservative memory footprint. Matlab gets knocked a lot for being a memory hog, which it is, but frequently home-grown programs written in Python can consume vast quantities of memory and CPU. This is not necessarily a bad thing, since the humans running those programs usually have deadlines to meet and papers to write, so the faster the job runs, the better. But this logic breaks down in a communal setting like our shared computing resource, the mindhive cluster.
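You can check for yourself whether a machine is dipping into swap, straight from the kernel's own accounting. A minimal sketch (Linux-specific; it only reads /proc/meminfo, so no special tools are needed):

```shell
# Report RAM and swap usage in MB. Large and growing swap use is the
# warning sign described above; swap near zero is the healthy state.
awk '/^MemTotal|^MemFree|^SwapTotal|^SwapFree/ {v[$1] = $2}
     END {
       printf "RAM  used: %d of %d MB\n",
              (v["MemTotal:"] - v["MemFree:"]) / 1024, v["MemTotal:"] / 1024
       if (v["SwapTotal:"] > 0)
         printf "Swap used: %d of %d MB\n",
                (v["SwapTotal:"] - v["SwapFree:"]) / 1024, v["SwapTotal:"] / 1024
       else
         print "Swap: none configured"
     }' /proc/meminfo
```

One caveat: "free" RAM understates what is really available, because Linux uses spare RAM for disk caching (the "buffers" and "cached" figures in top's header) and releases it on demand.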

What is a cluster anyway?

The definition of a compute cluster varies a lot from site to site. Often, a cluster is a resource heavily guarded by system admins, running a strict hierarchical system that allows certain users to submit jobs through a narrow interface. These jobs are constrained on all sides by memory, disk, and CPU quotas, with their processing time a function of the user's political importance and the computational complexity of the job. Preparing a job and submitting it to the cluster almost invariably requires the user to have advanced knowledge of the system and of the way it requires the job to be laid out. Interaction with the job is usually limited to pre-defined "checkpoints" and a final output at the end - which, in effect, is the same process model that scientists had back in the days of punchcards and white-lab-coat-clad sysadmins who were the priests in the temple of computation. The only real differences are the medium on which the job is handed into the processing queue and the number of FLOPS [1]. In fact, this model is pretty close to what I had envisioned when I left MIT CSAIL to come work at the McGovern Institute. What I quickly discovered, however, is that this Draconian model of use did not fit the ethnographic evidence here at all; that the average user at MIBR did not have an intensely CS-centric background and the requisite understanding of queueing systems and their Procrustean constraints; and that even if I were able to force the users into thinking the way the computer wanted, the applications that most people use at MIBR were never designed to run in such an environment. Live, iterative QA interaction with processes tends to trump sterile batch-mode processing here.
It was clear that a traditional cluster system would not be adopted by the users here, so I had to work towards a hybrid system that could address the needs of the handful of highly computer-savvy users and the majority of people who use computers as more of a necessary evil.

The Donut Hole Means People Have to Be Good

There is a "donut hole" in the means by which a sysadmin can impose quotas on users in a multi-user computing community. At the server level, most modern operating systems have only a handful of ways to limit use of memory, disk, and CPU resources - and these methods are not very granular, nor do they provide any means other than arbitrary fixed thresholds. For example, I can say that no single user can ever consume more than two gigabytes of memory on any server [2]. However, this is at best arbitrary, and at worst could prevent a perfectly valid job from executing. So, at one end of the user-friendliness spectrum, we have emaciated and insufficient quota systems that still let people interact with a compute server mostly as they would with their own workstations - log in, fire up some programs, go to work. Jump way ahead to clusters that use complicated queueing systems, and you can control anything and everything. You can create schedules for processing - no one can submit a job larger than some threshold between the hours of 9am and 5pm, for example. You can let one person use as much memory as he wants for certain types of jobs on certain servers, but only a fraction of that memory on other servers. And so on - but this power comes at the steep additional cost of complexity for the end user.
In the middle? Not so much. The only thing for it is the same thing that has allowed humans to coexist and flourish for ages - cooperation. It's not fancy; it's not technical; and, unfortunately for those who live and breathe the mythical MIT credo of not being social animals, it does require interacting with others and being considerate. Almost all people are, and want to be, good citizens. Virtually no one who is using more than his or her fair share of resources needs to be fitted into an iron maiden of technical quotas; a simple, respectful word will do.
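To make the "arbitrary threshold" point concrete: the blunt server-level controls I am describing live in files like /etc/security/limits.conf. A sketch of what a hypothetical two-gigabyte cap might look like - the group name is invented, and as note [2] says, deciding which kind of memory to cap is half the problem:

```
# /etc/security/limits.conf (illustrative, not our actual policy)
# <domain>        <type>  <item>  <value>
# cap address space at 2 GB (value in KB), and cap process count at 64
@mindhive-users   hard    as      2097152
@mindhive-users   hard    nproc   64
```

Note that limits like these apply per process and per login session, not to a user's aggregate consumption across the cluster - which is exactly why they sit at the unsatisfying end of the donut hole.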
If you're going to run an ipython job on all of the cluster nodes that you know is likely to bog things down, check first to see whether anyone is working on a deadline. I believe that extending the use of this mailing list, mindhive@mit.edu, to include community-centric communication of this nature is perfectly warranted.

Elastic Computing, Unicorns and Such

"But, Mark, I hear the words 'Elastic Computing' all the time - shouldn't you just be able to make the servers magically resize themselves to deal with users? Isn't that what a cluster is supposed to do?" Amazon.com found themselves in a unique situation. They had enormous numbers of computers and resources - one of the world's largest clusters, to be sure - but they only needed most of them during the Christmas season. So they came up with the idea of subletting their servers to anyone who would pay for the service - and they coined the name "elastic computing cloud", partially because it sort of describes what it does, but largely because it sounds cool and would make waves in the media. And elastic does stretch - but not beyond its maximum capacity. And folks, we are hitting our heads on the maximum capacity of the cluster on a fairly regular basis. We are in desperate need of more servers so that we can do things like DTI and automatic surface reconstructions, as well as support the important, but resource-ravenous, algorithms that a growing minority of users have been throwing at the cluster. We have a small number of servers - 7 servers, each with 8 cores - compared to other sites, like the one at Harvard (>2000 cores) or even the old system at MGH, which has much older technology but trumps us in sheer number of processor cores (~200). Of course, there is no money to be had anywhere, especially not for inanimate objects like computers. So I am not sure how to provide you with the additional resources you so clearly need.
How about using EC2? Cloud computing? I did get a grant from Amazon.com to use their EC2 system for research. After setting up a complete mindhive-in-a-box for EC2 and running a set of 10 subjects through bedpostx, the "cost" against our research credits was a sixth of the whole balance. And it required me to understand exactly how to instantiate a virtual cluster on EC2 running Sun Grid Engine... not particularly a user-friendly experience. So it was expensive in both dollars and human time. It also required transferring data out of mindhive and onto EC2, and then back again... not seamless at all. It is not the panacea I have been looking for. There has been talk for ages about MIT creating a research data center out in western Mass, where labs can buy time for computation. This would obviate the high cost of data transfer - we might even be able to mount our filesystems directly on any such system. But it's years down the road, if it happens at all.
What we do have, however, is Sun Grid Engine [3], a batch system that allows the submission of jobs through a fairly easy interface. In fact, if you are using any of the FSL tools that "speak" SGE, like bedpostx, you have already been using it; it has been set up by default on everyone's account for about 8 months now. SGE can run on systems that permit normal, interactive login, and it spreads jobs evenly across all of the cluster nodes while the queue master prevents runaway jobs from crashing the system or consuming all available resources. Just about any sort of process can be run via SGE. It tends to work best with processes that run unattended for a long time, like reconstructions or DTI processing. I am happy to meet with you one-on-one or in small groups to help you learn how to use SGE.
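For the curious, submitting your own job to SGE is mostly a matter of wrapping the command in a small shell script. Here is a minimal sketch - the file name, job name, and the commented-out recon-all command are purely illustrative:

```shell
#!/bin/bash
# recon_job.sh - a hypothetical SGE job script. Lines beginning with
# "#$" are directives read by qsub; bash treats them as ordinary
# comments, so the script also runs fine on its own.
#$ -N recon_subj01
#$ -cwd
#$ -j y
# -N sets the job name shown in qstat; -cwd runs the job from the
# submission directory; -j y merges stdout and stderr into one log.
echo "Running on $(hostname)"
# The long-running command itself would go here, for example:
# recon-all -s subj01 -all
```

You would submit it with "qsub recon_job.sh" and check on it with "qstat"; SGE picks a node with free capacity and writes the job's output to a log file in the submission directory.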

What is the Takeaway?

We are experiencing growing pains. Most of the users of mindhive are not computer scientists or techies by nature, and tend to use the cluster in a way more typical of a workstation. There is a growing, but still small, group of power users who can rapidly overwhelm the cluster with homegrown programs, and the donut hole of enforcement means that there is no good way to control this. However, jobs that are submitted through SGE benefit from being batched across available systems. Power users - those who are savvy enough to run potentially hazardous jobs - need to learn to use SGE. Third-party "elastic" and "cloud" computing are generally not going to be useful for our needs due to cost and the burden on the user. We have a total of 56 compute "cores" available; most systems of a similar nature have in excess of 512. Even with cooperation from all users, however, we need additional servers to serve the growing population. It remains to be seen how or when this will happen, so in the meantime we have to be frugal and considerate of other users. To that end, the mindhive@mit.edu list should be used judiciously.

References
[1] FLOPS, from Wikipedia: http://en.wikipedia.org/wiki/FLOPS
[2] Not exactly true, since one must decide on how said memory should be partitioned - rss? stack? wired memory? See http://linux.die.net/man/5/limits.conf for more confusion.
[3] A good intro to using SGE: http://web.njit.edu/topics/HPC/basement/sge/SGE.html

If you've read this far, thanks! You are now a level 6 ninja.