Styx Grid Services Tutorial: Workflow and scripting

The earlier sections of this tutorial have shown how remote Styx Grid Services can be executed exactly as if they were local programs. This means that we can link SGSs together to form a distributed application (or "workflow") just as easily as we can link local programs together to achieve a goal. Styx Grid Services, like local programs, can be linked together with simple shell scripts (or batch files under Windows). This section describes how Styx Grid Services can be used in this way.

A simple workflow

Let us create a very simple distributed application (or workflow) from two of the Styx Grid Services that we have already met: HelloWorld and Reverse. We are going to use the HelloWorld SGS to output the string "Hello World" and the Reverse SGS to reverse that string.

We can achieve this by piping the output from the HelloWorld SGS to the input of the Reverse SGS, just as if they were local programs:

SGSRun localhost 9092 helloworld | SGSRun localhost 9092 reverse

The output from this simple workflow should be "dlroW olleH". If we were to create wrapper scripts called helloworld and reverse (as discussed earlier in this tutorial) we could simply write:

helloworld | reverse

In the above example, both SGSs were running on the same server. If you are able, try running the SGS server on two different machines and performing the same workflow again, for example:

SGSRun machine1 9092 helloworld | SGSRun machine2 9092 reverse
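
The two-machine command above can be wrapped in a small shell function so that the host names are not hard-coded. This is only a sketch: the function name hello_reverse and the SGSRUN variable are illustrative and are not part of the SGS software. The port number and service names are the ones used above.

```shell
#!/bin/sh
# Sketch only: hello_reverse and SGSRUN are illustrative names, not part of SGS.
# SGSRUN names the client program, so it can be overridden if SGSRun is
# installed under a different name or path.
SGSRUN=${SGSRUN:-SGSRun}

# hello_reverse HOST1 HOST2 [PORT]
# Runs helloworld on HOST1 and pipes its output to reverse on HOST2.
# PORT defaults to 9092, the port used throughout this tutorial.
hello_reverse() {
    host1=$1
    host2=$2
    port=${3:-9092}
    "$SGSRUN" "$host1" "$port" helloworld | "$SGSRUN" "$host2" "$port" reverse
}
```

With this function defined, "hello_reverse machine1 machine2" should run the same workflow as the command above.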

Using input and output files

The above example demonstrated the use of the pipe operator to send the data between the two SGSs. You could of course send the data to an intermediate file and use the reverse2 SGS, which reads input from a file rather than from its standard input:

SGSRun localhost 9092 helloworld > temp.txt
SGSRun localhost 9092 reverse2 -i temp.txt -o reversed.txt

The file reversed.txt should now contain the string "dlroW olleH".

Reading files from other servers

One of the strengths of the SGS system lies in the fact that you can pass input files by reference. In other words, instead of specifying an actual input file, you can specify a URL to a file on a different server.

For example, let's run the reverse2 Styx Grid Service, using input data from the Web:

SGSRun localhost 9092 reverse2 -i readfrom: -o output.txt

When this finishes, open output.txt and verify that it contains the contents of the Google home page (in HTML), with the characters of each line of text reversed.

IMPORTANT: You must use the syntax "-i readfrom:URL" rather than just "-i URL". There is a good reason for this, which we won't go into now.

Let's have a quick look in more detail at what has happened in this example:

  1. The SGSRun program connected to the server and created a new instance of the reverse2 service.
  2. The URL was sent to the SGS server.
  3. The server downloaded the data from that URL into a temporary file.
  4. The server passed that file into the reverse2 program.

It's important to note that the server must be able to "see" the data at the URL you specify. If the server cannot make a connection to that URL, an error will be raised.
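
Because the "readfrom:" prefix is easy to forget, a script that accepts either a local file name or a URL could add the prefix automatically. The helper below is a minimal sketch: the name readfrom_arg is hypothetical and not part of SGS, and it simply assumes that any argument containing "://" is a URL.

```shell
#!/bin/sh
# Sketch only: readfrom_arg is an illustrative name, not part of SGS.
# readfrom_arg INPUT
# Prints INPUT in the form expected after "-i":
#   - unchanged if it already starts with "readfrom:"
#   - prefixed with "readfrom:" if it looks like a URL (contains "://")
#   - unchanged otherwise (treated as a local file name)
readfrom_arg() {
    case "$1" in
        readfrom:*) printf '%s\n' "$1" ;;
        *://*)      printf 'readfrom:%s\n' "$1" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}
```

It might be used as follows, where http://www.example.com/ is only a placeholder URL that the server would need to be able to reach:

SGSRun localhost 9092 reverse2 -i "$(readfrom_arg http://www.example.com/)" -o output.txt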

Streaming data between Styx Grid Services

Let's create a silly workflow of two Styx Grid Services. We're going to reverse the contents of a file, then do the same again so that the contents of the final result are identical to the original file:

SGSRun localhost 9092 reverse2 -i input.txt -o output1.txt
SGSRun localhost 9092 reverse2 -i output1.txt -o output2.txt

If you run this with some input file (or you could pass in data from a URL as above) you should be able to verify that input.txt and output2.txt have the same contents.

Let's pretend that we were working with large files and that we weren't interested in the intermediate file (output1.txt). We have wasted time and bandwidth by downloading output1.txt to our local machine and then immediately uploading it to the second service in the above workflow.

We can be more efficient by downloading (and then uploading) a reference to the intermediate file, with a small change to the workflow. We just add a .sgsref extension to any output file that we want to get a reference to. Then we can upload that reference exactly as if it were the file itself:

SGSRun localhost 9092 reverse2 -i input.txt -o output1.txt.sgsref
SGSRun localhost 9092 reverse2 -i output1.txt.sgsref -o output2.txt

You should be able to verify that this has the same overall effect as the previous workflow. If you examine the contents of the output1.txt.sgsref file you will find that it contains the string "readfrom:styx://.../reverse2/instances/.../outputs/outputfile". This is a reference to the output file that was produced by the first SGS.
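
Since a .sgsref file holds a reference rather than data, a script may want to check which kind of file it has before treating the contents as real output. The sketch below relies only on the observation above that a reference file contains a string beginning with "readfrom:". The function name is_sgs_ref is hypothetical and not part of SGS.

```shell
#!/bin/sh
# Sketch only: is_sgs_ref is an illustrative name, not part of SGS.
# is_sgs_ref FILE
# Succeeds (exit status 0) if the first line of FILE starts with "readfrom:",
# and fails (exit status 1) otherwise, including when FILE is missing or empty.
is_sgs_ref() {
    [ -f "$1" ] || return 1
    # read fails on a final line without a newline but still fills the
    # variable, so only give up if nothing at all was read.
    IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 1
    case "$first_line" in
        readfrom:*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

For example, "is_sgs_ref output1.txt.sgsref && echo reference" would print "reference" for the intermediate file produced above.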

Streaming data using the pipe operator

Let's go back to the first example in this section of the tutorial. We printed the string "Hello World" then reversed it using two SGSs:

SGSRun localhost 9092 helloworld | SGSRun localhost 9092 reverse

What happened behind the scenes was this: the standard output from the helloworld service was sent back to our local machine. Instead of being printed in the console window, it was piped immediately to the remote reverse service. In other words, the data made an unnecessary trip to our client machine and back out again.

As above, we can arrange for the data to be passed directly between the two services. However, this time we have no filename to which we can append the magic ".sgsref" extension, so what do we do? You can find out by using the help system: enter SGSRun localhost 9092 helloworld --sgs-verbose-help (see Getting help). There is a command-line switch, --sgs-ref-stdout, which causes a reference to the output data to be printed to the console window instead of the data themselves. It is this reference that is passed to the reverse service:

SGSRun localhost 9092 helloworld --sgs-ref-stdout | SGSRun localhost 9092 reverse

The string "Hello World" has been passed directly between the two services.

Obtaining the error code

By now it should be clear that you can create shell scripts (or batch files) that tie Styx Grid Services together to produce distributed applications. The SGSRun program behaves exactly like the program that has been wrapped as a Styx Grid Service. It even captures the error code from the remotely running executable and returns this error code when it finishes. Therefore, you can trap this error code to see whether the remote executable has finished successfully.
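
Trapping the error code in a shell script could look like the sketch below. The helper name run_step is hypothetical and not part of SGS. It runs any command, reports a non-zero exit code on standard error, and returns that code to the caller.

```shell
#!/bin/sh
# Sketch only: run_step is an illustrative name, not part of SGS.
# run_step COMMAND [ARGS...]
# Runs the command and returns its exit code. If the code is non-zero,
# a message naming the failed command is written to standard error.
run_step() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "Step failed with exit code $status: $*" >&2
    fi
    return "$status"
}
```

A two-step workflow could then stop as soon as one step fails:

run_step SGSRun localhost 9092 reverse2 -i input.txt -o output1.txt || exit 1
run_step SGSRun localhost 9092 reverse2 -i output1.txt -o output2.txt || exit 1

Note that for a pipeline the shell by default reports only the exit status of the last command, so checking the status of "helloworld | reverse" would not reveal a failure in helloworld.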

Disadvantages of the SGS system

The SGS system is a very quick and easy way to create workflows that are based on remote services. We have seen how data can be passed directly between services. However, unlike other workflow systems (e.g. Web Service-based ones), the units of information that are being passed around are files. In other systems, these units might be strings, integers or perhaps objects. This means that it is up to each individual service in an SGS workflow to verify that its input files are valid (the inputs and outputs are very weakly typed). Exactly the same problem is of course faced when using shell scripts to tie together local programs.
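
As an illustration of the kind of check a wrapped program might perform for itself, the sketch below rejects a missing or empty input file before any processing starts. The function name check_input and its return codes are hypothetical, and what counts as "valid" will of course depend on the service.

```shell
#!/bin/sh
# Sketch only: check_input is an illustrative name, not part of SGS.
# check_input FILE
# Returns 0 if FILE is a readable, non-empty regular file, 1 if it is
# missing or unreadable, and 2 if it is empty. Problems are reported on
# standard error.
check_input() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "Input file '$1' is missing or unreadable" >&2
        return 1
    fi
    if [ ! -s "$1" ]; then
        echo "Input file '$1' is empty" >&2
        return 2
    fi
    return 0
}
```

A program could call "check_input input.txt || exit $?" as its first step, so that a bad input surfaces as an error code that SGSRun returns to the caller.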