Inferno Risk Analysis
The Inferno Grid is a new system for harnessing the unused power of desktop machines, similar in many respects to Condor. However it is much less well-established than Condor. This page contains a discussion of the potential risks involved in using Inferno in the Reading Campus Grid.
This page will make some direct comparisons between Condor and Inferno, and may unfairly ignore other solutions. This is not intended to be just "Condor vs Inferno"; it just happens that Condor is a technology with which many readers will be familiar.
For more information on Inferno, see VitaNuovaInferno. For a PowerPoint "tutorial" on the Inferno Grid, see http://www.vitanuova.com/solutions/grid/Computation_Grid_Demo.ppt.
Brief intro
The Inferno Grid is based on the Inferno operating system; both systems are distributed by Vita Nuova, a York-based company. Inferno is a lightweight operating system that can run as a hosted application under most major operating systems (Windows, Solaris, Linux, MacOSX). It is designed from the ground up for distributed computing. Applications in Inferno are written in its own C-like language (Limbo) and are guaranteed to run identically on all platforms. Inferno can also run applications in the host operating system, so Inferno Grid users do not have to rewrite applications in Limbo! They can simply submit their compiled code to the Inferno Grid; Inferno is just used as the middleware to select the machine on which to execute the code, then retrieve the results.
Although they achieve the same effect, the Inferno Grid operates in a different manner to Condor. Each "worker node" in an Inferno Grid makes requests to a "scheduler" for work. If the scheduler has work that is suitable for the worker node in question, the worker node will download the job and run it. This "worker-pull" model is rather like a labour exchange and means that the worker nodes do not run any server processes. With Condor, on the other hand, each worker node runs a server daemon process that listens for instructions from the scheduler to start a piece of work ("scheduler-push").
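The "worker-pull" exchange described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Inferno Grid code (which is written in Limbo): the class names, the `platform` matching rule and the `request_work` method are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    platform: str  # hypothetical suitability rule, e.g. "linux" or "windows"


@dataclass
class Scheduler:
    """Holds a queue of jobs. It never contacts a worker itself."""
    queue: list = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def request_work(self, worker_platform: str):
        """Called by a worker. Returns the first suitable job, or None."""
        for index, job in enumerate(self.queue):
            if job.platform == worker_platform:
                return self.queue.pop(index)
        return None


@dataclass
class Worker:
    name: str
    platform: str
    completed: list = field(default_factory=list)

    def poll(self, scheduler: Scheduler) -> bool:
        """The worker initiates the exchange, so it needs no server process."""
        job = scheduler.request_work(self.platform)
        if job is None:
            return False
        self.completed.append(job.job_id)  # stand-in for running the job
        return True


scheduler = Scheduler()
for i, platform in enumerate(["linux", "windows", "linux"]):
    scheduler.submit(Job(i, platform))

linux_box = Worker("w1", "linux")
windows_box = Worker("w2", "windows")
# Keep polling until neither worker is offered any work.
while linux_box.poll(scheduler) or windows_box.poll(scheduler):
    pass
```

Note that every call goes from worker to scheduler; in a "scheduler-push" design the direction of `request_work` would be reversed and each worker would have to listen for incoming connections.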
Feature Risks
(i.e. "What if the Inferno Grid doesn't do what we want?") This is, of course, rather difficult to answer until we know exactly what our users' requirements are. However, the Inferno Grid has been used successfully by other groups in industry and academia (at the time of writing, Evotec OAI and York University's biology department). Their requirements could be different to ours but are unlikely to differ greatly. The Inferno Grid is open-source (for non-commercial purposes), so it would theoretically be possible to add new features ourselves, given an adequate knowledge of the Limbo language. Also, we have a good relationship with Vita Nuova, which means that it might be possible to request new features (or help in adding them), if Vita Nuova see an overall commercial benefit in doing so.
It should be noted that Inferno does not have all of the features of Condor (e.g. checkpointing, MPI, migration of jobs). Such features may or may not appear in the future. In the case of checkpointing, it is my understanding that many users prefer to make their own checkpointing arrangements anyway, ignoring Condor's own mechanism.
Technical Risks
Any Inferno Grid would have to operate in the context of the Campus setup and we must bear in mind how it will sit in the network as a whole. There are several issues to consider:
Installation issues
The most recent version of the Inferno Grid has easy-to-use installation programs for Windows (2000 and XP). I am not sure how well-served other platforms are by installation programs, but I have managed to install an older version of the Inferno Grid under Linux with a little help from Vita Nuova (the new version may be much easier). At worst I think this is a matter of finding the right method for our systems, then documenting it.
Cross-domain behaviour
Problems have been encountered (and surmounted) with Condor when trying to "flock" Condor pools across domains (ESSC, for example, has a different domain to the rest of the campus). I see no reason why similar problems should be encountered with the Inferno Grid. The fact that worker nodes run no server processes makes this problem much simpler. As long as the worker nodes can "see" the scheduler (i.e. they are able to initiate a TCP connection) the Inferno Grid can cross boundaries effortlessly.
Firewall issues
The Inferno Grid is very "firewall-friendly" since only the scheduler machine runs a server process. The worker nodes can be literally anywhere on the Internet, as long as they can connect to the scheduler. This should remove many potential problems with firewalls, and opens up the (possibly) attractive future prospect of involving off-campus machines (e.g. home computers) in the Inferno Grid. This might encourage more people to get involved in helping with computational science at Reading, much like ClimatePrediction.net. In my own brief tests, I have successfully connected my home computer to a small Inferno Grid on campus.
Links with data stores
It is probable that we shall wish to operate a central data store for the Inferno Grid (even if it is only used for scratch purposes). IanBland has written a report on how his group successfully interfaced Condor with a Storage Resource Broker. (SRB allows distributed storage resources to be brought together and appear as a single resource.) The interfacing of an Inferno Grid with SRB has not been tried as far as I know, but should not be any more difficult than under Condor. Each job could run the command-line SRB utilities from a script to get and put data from the data store or use the API.
An alternative solution, which is a subject for another report, is that the Inferno operating system could be used to create a distributed data store. The creation of a basic distributed data store is an absolutely trivial task for Inferno, but the more advanced features of SRB (metadata catalogues, replication etc) might take some effort to implement (backups would, however, be much easier). It may be that an Inferno data store could be used internally, with an SRB solution employed in transactions with the outside world. Note that Inferno is being used as a Data Grid by Rutgers University, New Jersey: http://www.vitanuova.com/solutions/grid/news/rutgers.pdf.
Maintenance Risks
("It's new software and not many people know how it works. Will we have the skills to keep it running, particularly if key people leave the University?") The Inferno Grid is a new system (although the Inferno OS is much more mature) and has a very small code base (albeit in an unfamiliar language). The code base is small partly because the Inferno operating system makes it very easy to build such distributed applications. This reduces the maintenance effort.
The documentation associated with the system used to be rather lacking, but has been greatly improved recently. More work is needed to evaluate this properly. In any case, we will have to write our own documentation on our particular system, but since the Inferno Grid is probably less thoroughly documented than Condor, we may have to make some extra effort here.
Compatibility Risks
We may well have a situation where some departments prefer to run their own solutions, whatever we recommend. We will certainly ultimately want to include dedicated clusters in the Campus Grid. To what extent will any Inferno Grid have to communicate with these other resources, and will this be possible?
Perhaps the most important consideration here is sign-on and authentication (see also "Security Risks" below). Inferno is secure, but it uses an authentication mechanism that is different from the GSI security infrastructure that other, more "standard" systems use. However, there are a few possibilities to work around this. Firstly (the easiest but messiest solution), we could use some kind of proxy authentication; a user logs on to a GSI resource, using their Globus certificate. The GSI resource possesses an Inferno certificate that is used to access the Inferno Grid. If we wanted user-level control, this might entail creating an Inferno certificate for each potential user on the GSI resource.
A neater solution would be to modify the Inferno Grid to allow GSI authentication. This is theoretically possible, either by editing the source code ourselves or by persuading Vita Nuova that this is a valuable thing to do. The work required to do this is unknown.
If any other type of interaction between the Inferno Grid and other resources (e.g. a central Campus Grid Job Manager) is required, we can use C or Java libraries to interact with the Grid (monitor status, submit jobs etc). Estimating the time required to do this would require more knowledge about the design of the whole system.
Interaction with the National Grid Service
I believe that it is unlikely that the National Grid Service would need to interact directly with an Inferno Grid (or indeed a Condor pool) on Campus; it would more likely interact with a central campus facility, or with dedicated resources such as clusters. However, GSI-based authentication for the Inferno Grid would be very desirable if any direct interaction were necessary (or if we ever wished to connect an Inferno Grid with an Inferno Grid or Condor pool in another institution).
Security Risks
The Inferno Grid, being based upon the Inferno OS, is inherently secure. Inferno uses an authentication mechanism based on public key certificates, so access to the Inferno Grid can be limited to trusted machines only. Unfortunately these certificates aren't the same as the X.509 certificates used by GSI (see "Compatibility Risks" above). If desired, all communication between machines on the Inferno Grid can be encrypted using a variety of algorithms, although it is most common to use authenticated but unencrypted connections. I believe that "admin privileges" are required to administer the Inferno Grid, with less restrictive privileges for users. This is all controlled by Inferno's permissions-based security system (everything in Inferno is a file, and every file has its access permissions, just like Unix).
Inferno requires no processes to run as root. By contrast I believe that the Condor master daemon must run as root (although I believe Cambridge are trying to work around this).
We must also consider the security of the worker nodes. In this case, Condor and the Inferno Grid have the same issues, namely that they must allow arbitrary, compiled code to run on worker machines. I believe that it is possible with both systems to run jobs as heavily-restricted users and the security of this will depend on the features of the underlying OS (Windows, Linux etc).
Usability Risks
The usability of the system is key, and will depend on several factors:
Performance
At this stage it is difficult to establish the relative performance (i.e. throughput) of an Inferno Grid compared with an equivalent Condor pool (or other solution). I have heard that the Inferno Grid's "worker-pull" method gives theoretically higher throughput than a "scheduler-push" method, and makes fewer demands upon the scheduler. In the Inferno Grid, if the last few unfinished jobs are taking a long time (perhaps they have ended up on heavily-loaded machines), the jobs will be repeated on worker nodes that have become free. When the first instance of a given job is complete, the other duplicate instances are terminated. This means that a set of jobs is not held up by a few "stragglers".
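The handling of "stragglers" can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the class and method names are invented for the example, and the real Inferno Grid scheduler may choose which job to duplicate, and when, quite differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobState:
    running_on: set = field(default_factory=set)  # workers holding an instance
    done_by: Optional[str] = None                 # first worker to finish


class Scheduler:
    def __init__(self, job_ids):
        self.jobs = {job_id: JobState() for job_id in job_ids}

    def request_work(self, worker):
        """Hand out an unstarted job if one exists, else duplicate a straggler."""
        unfinished = [(j, s) for j, s in self.jobs.items() if s.done_by is None]
        fresh = [(j, s) for j, s in unfinished if not s.running_on]
        duplicates = [(j, s) for j, s in unfinished if worker not in s.running_on]
        candidates = fresh or duplicates
        if not candidates:
            return None
        job_id, state = candidates[0]
        state.running_on.add(worker)
        return job_id

    def report_done(self, worker, job_id):
        """Record the first completion; return workers whose copies to terminate."""
        state = self.jobs[job_id]
        if state.done_by is not None:
            return set()  # another instance already finished this job
        state.done_by = worker
        to_cancel = state.running_on - {worker}
        state.running_on.clear()
        return to_cancel


sched = Scheduler(["a", "b"])
first = sched.request_work("slow")         # "slow" takes job "a"
second = sched.request_work("fast")        # "fast" takes job "b"
sched.report_done("fast", "b")
duplicate = sched.request_work("fast")     # nothing fresh left, so "a" is repeated
cancelled = sched.report_done("fast", "a") # the copy on "slow" is now redundant
```

The cost of this approach is some wasted CPU time on the duplicated instances, which is acceptable when the workers would otherwise be idle.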
The performance from the end users' point of view will obviously depend greatly on the speed of the worker nodes, independent of the scheduling system. My own feeling is that the performance will not differ greatly between the systems (if it can be meaningfully measured at all), and factors such as reliability and user-friendliness are more important to its overall success.
Reliability
In the absence of direct comparisons, I can only speculate as to the reliability of the Inferno Grid compared with other systems. The continued popularity of Condor is presumably an indication of high reliability, as is the success of the Inferno Grid in an industrial setting (Evotec OAI). The "worker-pull" model of Inferno means that the scheduling machine (the "master") should not become heavily loaded as the scheduler does not have to maintain up-to-date information about the workers. The Inferno Grid takes steps to save the status of the scheduler at regular intervals, ensuring that if the scheduler fails it can be brought back up again in its previous state. The workers are unaffected by a scheduler failure.
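To make the idea of saving the scheduler's status concrete, here is a hedged Python sketch of periodic snapshotting. I do not know how the Inferno Grid actually stores its state; the JSON layout, the file name and the function names below are assumptions for illustration only.

```python
import json
import os
import tempfile

EMPTY_STATE = {"pending": [], "running": {}, "done": []}  # hypothetical layout


def save_state(state: dict, path: str) -> None:
    """Write to a temporary file, then atomically replace the old snapshot.

    A crash in mid-write therefore leaves the previous snapshot intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_state(path: str) -> dict:
    """Return the last snapshot, or an empty state if none has been saved."""
    if not os.path.exists(path):
        return dict(EMPTY_STATE)
    with open(path) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    snapshot = os.path.join(workdir, "scheduler_state.json")
    before = {"pending": [3, 4], "running": {"2": "worker-7"}, "done": [1]}
    save_state(before, snapshot)
    restored = load_state(snapshot)  # what a restarted scheduler would read
    missing = load_state(os.path.join(workdir, "no_such_file.json"))
```

A real scheduler would call something like `save_state` on a timer; anything that changed after the last snapshot would have to be recovered from the workers when they next report in.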
If a worker node loses its connection to the scheduler, this doesn't matter; it doesn't need a connection while the job is running. When the job has finished, the worker will wait until it has a connection to the scheduler before giving its status as "complete". (This allowed Evotec to run jobs on laptops that are disconnected and taken home for the evening. The laptops did work offline, then synchronised with the server when reconnected to the network.)
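The offline behaviour described above amounts to buffering completion reports until the scheduler is reachable. A minimal Python sketch, with invented names and a `report` callback standing in for the real network call:

```python
class OfflineCapableWorker:
    """Buffers completed job IDs until the scheduler can be reached."""

    def __init__(self):
        self.unreported = []

    def finish_job(self, job_id):
        # Finishing a job needs no connection; just remember the result.
        self.unreported.append(job_id)

    def sync(self, connected, report):
        """Send buffered completions via report() if connected; return count sent."""
        if not connected:
            return 0
        sent = 0
        while self.unreported:
            report(self.unreported.pop(0))
            sent += 1
        return sent


laptop = OfflineCapableWorker()
laptop.finish_job("job-1")
laptop.finish_job("job-2")

received = []  # stands in for the scheduler's record of completed jobs
sent_offline = laptop.sync(connected=False, report=received.append)
sent_online = laptop.sync(connected=True, report=received.append)
```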
If a job fails on a worker node (the criteria for successful job completion are settable), the node retries it a few times. A node that fails on several jobs in succession is "blacklisted". In this way, unreliable or unsuitable machines are automatically excluded from the Grid.
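The blacklisting rule can be sketched as a counter of consecutive failures per worker. The threshold of three and all the names below are assumptions for the example; I do not know the actual values or data structures used by the Inferno Grid.

```python
from collections import defaultdict

MAX_CONSECUTIVE_FAILURES = 3  # hypothetical threshold


class Blacklist:
    def __init__(self, limit=MAX_CONSECUTIVE_FAILURES):
        self.limit = limit
        self.failures = defaultdict(int)  # consecutive failures per worker
        self.excluded = set()

    def record(self, worker, succeeded):
        """Update the worker's failure streak after each job outcome."""
        if succeeded:
            self.failures[worker] = 0  # a success resets the streak
            return
        self.failures[worker] += 1
        if self.failures[worker] >= self.limit:
            self.excluded.add(worker)

    def may_request_work(self, worker):
        return worker not in self.excluded


blacklist = Blacklist()
for outcome in [False, False, True, False, False]:
    blacklist.record("flaky", outcome)   # never three failures in a row
for outcome in [False, False, False]:
    blacklist.record("broken", outcome)  # three failures in succession
```

Counting only consecutive failures means that a machine with occasional problems stays in the Grid, while one that cannot run the jobs at all is excluded quickly.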
For these reasons, it seems that in theory, the Inferno Grid should be reliable, but the only way to find out for sure will be to try it!
User-friendliness
By default, users and administrators currently interact with the Inferno Grid via a GUI application (written in Limbo) running under the Inferno OS. This provides an overview of the status of each worker node, with usage statistics. It is also the job submission mechanism, and custom job types can be created to make it easier to submit frequently-used jobs. This application seems reasonable but further testing with end-users will be required to test its suitability.
If this application turns out not to be suitable, we will be able to write our own front-end application(s), which might include web-based submission and/or command-line tools. This is possible because of a key principle of the Inferno operating system: everything is a file. Therefore interacting with any Inferno system (including the Inferno Grid) is a matter of interacting with a set of shared files across the network. The protocol used to share these files is called Styx, and Styx libraries exist for C and Java, and can be used (by wrapping the C libraries) in other languages such as Python and PHP. So it is possible to write custom front-ends but the work involved will depend heavily on the requirements.
Scalability risks
("What happens if the Campus Grid grows to thousands rather than hundreds of nodes?") It is known that huge Condor pools can be created (UCL have a very large one, I can't remember how many nodes), and there is a well-known mechanism for joining ("flocking") Condor pools together. This is more of an unknown quantity with the Inferno Grid. I am unaware of any established flocking mechanism, but I don't think this would be hard to achieve (Vita Nuova might be persuaded to do it, and if not, I think it's within our capabilities to do it ourselves). I believe that Inferno Grids exist with a single scheduler controlling several hundred nodes (recall that the system does not put heavy demands on the scheduler). More investigation of the Inferno Grid is required here.
Impact risks
Condor provides a variety of methods to ensure that the users of the worker nodes are not inconvenienced (e.g. jobs only run when the keyboard has not been used for an hour, etc). By default, Inferno simply runs jobs at the lowest priority (this approach is also used by ClimatePrediction.net), which appears to be sufficient for current customers. However, this behaviour could probably be changed. Recall that, in the Inferno Grid, worker nodes request work from the scheduler, rather than vice-versa. It could be arranged that worker nodes simply do not request work when they are busy; in this way, each worker node has control over its participation in the Grid. However, a job that has already been started on a worker node does not get kicked off and so users could return to their machines to find an Inferno Grid job hogging their CPU or (worse) memory or hard drive. More investigation would be required to find out how jobs could be paused in these circumstances.
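The suggestion that worker nodes simply do not request work when they are busy could look something like the following Python sketch. The thresholds and names are assumptions, and how the idle time and CPU load would be measured on each host OS is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    min_idle_seconds: float = 3600.0  # hypothetical: no keyboard use for an hour
    max_cpu_load: float = 0.25        # hypothetical: fraction of CPU in use


def should_request_work(idle_seconds, cpu_load, policy=Policy()):
    """Decide locally, on the worker, whether to ask the scheduler for a job."""
    return (idle_seconds >= policy.min_idle_seconds
            and cpu_load <= policy.max_cpu_load)


owner_at_desk = should_request_work(idle_seconds=30, cpu_load=0.10)
machine_idle = should_request_work(idle_seconds=7200, cpu_load=0.05)
busy_overnight = should_request_work(idle_seconds=7200, cpu_load=0.90)
relaxed = should_request_work(30, 0.10, Policy(min_idle_seconds=0))
```

This only controls whether new work is requested. It does nothing for a job that is already running when the owner returns, which is the open question noted above.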
Financial risks
As Vita Nuova is a commercial company, we would have to be aware that their policies, licensing models etc could change (of course, this is also true for non-commercial projects). They could decide to radically change the Inferno Grid in a direction that we do not agree with, or stop supporting it altogether. (It should be noted that there is a distinct possibility of commercialisation of Globus too. I haven't heard of any similar plans for Condor, but there is a potential similar risk in every direction.)
Change of licensing model
I.e. Vita Nuova start charging for Inferno Grid academic licences. I believe this is very unlikely, but even if it does happen, we can still continue to use our existing software (we don't have a support contract and our licence won't expire). We would, however, probably not be able to benefit from the latest advances and bug fixes, but this wouldn't stop us from modifying the code ourselves (as long as we didn't try to sell it commercially). Another possibility is that the academic licence would continue to be free, but it could become a closed-source product.
Abandonment of product
I.e. Vita Nuova decide that it is no longer commercially viable to pursue the Inferno Grid and stop supporting it (or they change it so drastically that we can't use it). This is a risk that must be considered, but we must also realise that other non-commercial products have the same risk; Globus has stopped supporting the Globus Toolkit 3 (GT4 is very different and incompatible). Furthermore, the security mechanism in Globus is also likely to change (from an emphasis on transport-level security to message-level security). However, Condor is unlikely to change drastically in the near future as it has a very large user base. If the Inferno Grid were abandoned or drastically changed, we could still use our existing version.
Future-proofing
There is a risk that future operating systems will not run the Inferno operating system (and hence the Inferno Grid). As long as Vita Nuova stay in business and continue to develop Inferno this will almost certainly not be an issue. If the worst were to happen to Vita Nuova there could be problems here. Of course, any drastic changes to operating systems would also affect all other software (Globus, Condor etc) but with larger user bases, these would probably be less affected.
--
JonBlower - 08 Dec 2004