Putting the "Shared" in "Shared Computing"
The term "cluster", as it is generally used in computing circles, means a whole bunch of computers that are all tied together with a fast backbone and that all act together to solve huge computational problems. Typical uses for a traditional cluster in this sense include things like nuclear fusion simulations, massive search algorithms, encryption-cracking routines, and many other very nerdworthy goals. However, neuroscience hasn't quite caught up yet with some of the rest of the computing world, and many programs we use are inherently interactive in nature - they require a human to run some commands, check the output, make adjustments, rerun the commands, view images, think about it all, and so forth. So a traditional cluster isn't really terribly useful to most of our users.
A newer breed of users has come along, though - users who both need the higher power afforded by traditional computer clusters and have the savvy to write their jobs in the Procrustean way that clusters enforce on users. Putting both sorts of users into the same computation fishbowl is not generally a good idea!
But funding is always scarce, so in an attempt to create a dual-purpose system, what we have is an unusual combination of interactive-login servers and a thing called a batch system (known as Sun Grid Engine, or SGE for short - though this will soon be called something else, since Oracle acquired Sun and will be rebranding everything as they go).
The mindhive cluster is shared amongst many research groups at MIT. Some system administrators have extremely tight control over what sorts of jobs can be run on a system, what kinds of software can be used, and so on. Often, systems that have Draconian admins go unused since no one can get any work done! Mindhive is more of a communal system that requires that people who share it are mindful of one another.
Although we've put many checks and balances in place to keep any one person from dominating the entire cluster, it's still possible to do so. You can read the hairy details at "On Compute Power".
So this section is dedicated to cluster etiquette, and how to be a good citizen. Almost everyone who uses the cluster is considerate. Those who are not, however, after repeated polite requests to shape up, will be ejected from the cluster. 'Nuff said.
- Do check the load of the servers to see which is the least-used when you want to log in. Use the cluster load widget on the home page of this web site. Lower numbers mean less load.
- Do use the Sun Grid Engine (SGE) system to submit multiple jobs - SGE is designed to keep the servers from melting down under extreme load and also to prevent memory exhaustion.
- Do proactively monitor the amount of memory your jobs are using by running the htop program - if a server begins to run out of physical memory, it will start to use the hard drive as supplementary working memory, and this is very, very bad: disk is far slower than RAM, so every job on that server slows to a crawl.
- Do let the rest of the cluster users know if your group has a major deadline coming up and will need extra cycles on the cluster. You can send mail to the mindhive mailing list to do so.
- Don't run jobs on all the servers simultaneously UNLESS you are using SGE to submit the jobs. Running homegrown batch jobs or using iPython or brute force to commandeer the entire cluster is not OK.
- Don't consume vast amounts of disk space. Each lab has its own allocation of disk space, which is finite. If you fill up your own lab's share, your labmates might hunt you down. Some of the older disk shares like /users, /u2, and /g2 are shared by everyone - so if you overindulge, you can fill up the filesystem and cause other labs' analyses to crash. Hell hath no fury like a PI whose lab is stalled during a deadline crunch due to a full filesystem...
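
For the curious, here is a minimal Python sketch of two of the checks recommended above: picking the least-loaded server and spotting swap use. It is not part of the mindhive tooling. The host names are made up, the load figures would have to be copied by hand from the cluster load widget, and reading /proc/meminfo assumes a Linux server. htop remains the recommended way to watch memory interactively.

```python
import os


def least_loaded(loads):
    """Return the host name with the lowest load value.

    `loads` maps host name -> load figure. The host names used with this
    function are hypothetical placeholders, not real mindhive servers.
    """
    if not loads:
        raise ValueError("no load figures supplied")
    return min(loads, key=loads.get)


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    # Load averages (1, 5 and 15 minutes) of the machine this runs on.
    print("load averages:", os.getloadavg())
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
        # Anything well above zero means the server has started swapping.
        print("swap in use (kB):", swap_used_kb(info))
    except FileNotFoundError:
        print("/proc/meminfo is not available on this system")
```

Run on the server you are logged in to, it prints that machine's load averages and swap use. To compare servers, pass least_loaded a dictionary such as {"node-a": 3.2, "node-b": 0.4}, where the names and numbers are placeholders for what the load widget shows.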