Today, many important applications run on multiple machines, both for fault-tolerance and to handle additional load. The most common examples are web applications, which typically involve load balancers, web servers, application servers, and database servers. Sadly, there is no standard infrastructure for starting, stopping, and updating these applications across multiple machines. In many cases, deploying the application requires manually configuring multiple servers, which is tedious and error-prone. There are many tools that attempt to address these problems, but as a developer of distributed applications, I find none of them ideal. I want a system where I can declaratively state "here is my application and the resources it needs," and have the system allocate machines, start the application, keep it running, and let me monitor it.
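To make that concrete, here is a minimal sketch of the kind of declarative description I have in mind, written as a Python data structure. The schema and the deploy() call are entirely hypothetical; they are not any existing tool's API:

    # Hypothetical declarative description of an application and its
    # resources. The field names and deploy() are invented for
    # illustration; no real tool uses them.
    APP = {
        "name": "myservice",
        "binary": "./myservice",    # any executable: C++, Python, ...
        "instances": 3,             # run three copies for fault-tolerance
        "resources": {"memory_mb": 512, "ports": 1},
        "restart_on_failure": True,
    }

    # An ideal system would take this description, find machines with
    # enough free resources, start the instances, and monitor them:
    # deploy(APP)

Note that nothing in this description names a specific machine; choosing machines should be the system's job, not mine.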
This may sound like I want Amazon EC2, RedHat oVirt, Eucalyptus, or one of the other virtual machine management systems. However, I think operating system virtualization is too low-level. It provides infrastructure that can run existing, unmodified applications, which is why people are excited about it and why there are so many products out there. As an application developer, I am willing to modify my application to support certain tools, so this advantage is not compelling for me. Most importantly, I don't want the additional work required to configure a complete operating system for each application, nor the extra overhead that implies.
An ideal system does need some sort of isolation, so that what one application is doing does not affect other applications. This isolation can be efficiently provided by container solutions such as Solaris Zones, lxc, OpenVZ, or vservers. I think it is more appropriate to virtualize and isolate applications at the system call interface, which is effectively what these solutions do. Each application is provided a private view of the operating system's resources, without needing its own copy of the operating system and its configuration. Being able to selectively share or not share resources is also useful. For example, for the kind of applications I develop, I don't need my applications to each have their own unique IP address, as long as I can assign them unique ports.
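To sketch what that could look like from the application's side, suppose the deployment system assigns each instance a unique port and passes it in through a PORT environment variable (a convention I am inventing here for illustration, though some hosting platforms work this way):

    import os
    import socket

    # Hypothetical convention: the deployment system assigns this
    # instance a unique port and passes it in via the environment.
    port = int(os.environ.get("PORT", "8000"))

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", port))  # shares the machine's IP address
    server.listen(5)

Every instance shares the machine's IP address with everything else running there; the only thing that needs to be unique is the port number.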
There are two projects I know of that manage the allocation of containers. PlanetLab is a widely used research infrastructure built on Linux vservers. Unfortunately, it is designed so users create containers on specific machines of their choosing, and an allocated container looks like a bare-bones Linux system. Thus, it is effectively whole operating system virtualization, and is also too low-level. However, there are third-party PlanetLab tools that seem similar to what I want. For example, the PlanetLab Application Manager is close, but it doesn't help find and allocate resources.
Sun's Project Caroline looks to be very similar to what I want. Unfortunately, it appears to require Solaris, the machines must have access to a shared ZFS storage pool, and it only deploys Java applications. This means I don't have the hardware to use it, and it won't run my applications, which are written in C++ and Python. However, none of these are fundamental problems. If it were extended to use local disk, run on Linux, and support arbitrary binaries, I could probably use it.
There are a ton of tools for managing the configuration of a large number of machines. Notable tools include cfengine, bcfg2, lcfg, and Puppet. These tools are very useful, but seem to be orthogonal to what I want: they are designed to automate deploying identical software and configuration across pools of machines. They don't address the "resource allocation" problem that I don't want to think about, nor do they help with monitoring. While I could use these tools, they don't quite solve the whole problem.
Two bare-bones tools for automating the deployment of applications across many machines are Capistrano and Fabric. As far as I can tell, their utility is that they let you write scripts that run a set of commands via ssh across a bunch of machines. You still need to write the scripts and allocate resources yourself. These are sort of like the configuration management tools, but even lower-level and simpler, so they have the same limitations.
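For example, here is roughly what a Fabric script looks like (a sketch against Fabric 1's API; the host names and commands are placeholders I made up):

    # fabfile.py -- running "fab deploy" executes these commands over
    # ssh on every host listed below. Hosts and commands are placeholders.
    from fabric.api import env, put, run

    env.hosts = ["web1.example.com", "web2.example.com"]

    def deploy():
        put("myservice.tar.gz", "/tmp/myservice.tar.gz")
        run("tar -C /opt -xzf /tmp/myservice.tar.gz")
        run("/opt/myservice/restart.sh")

Notice that I still had to list the hosts myself: nothing here finds machines with spare capacity, or restarts the application if a machine dies.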
I think what I really want is what some "cloud providers" are calling Platform as a Service (PaaS). The allocation, management, and monitoring parts of what I want are exactly what products from companies like RightScale and Stax Networks claim to provide. My hope is that some tool becomes freely available, widely used, and "good enough" that for most applications there is an obvious choice. It will likely take many years for people to figure out the best ways of deploying applications on clusters of computers, but I expect that eventually some "typical" infrastructure will emerge. I just want it today.