Gowlab   Web Pages as Web Services

As everybody knows, the richest pool of information resources is available as ordinary web pages, static or dynamic. They are mostly designed to be accessed by humans and their clicking fingers. However, their abundance is tempting. The Gowlab project accesses them programmatically using the same API as the rest of Soaplab, so that their richness can be combined with command-line analysis tools to build powerful workflows.

Of course, there is no free lunch. Web pages are far from standardized and they tend to change often. This puts more demands on the providers running Gowlab-based services, but the Gowlab system tries to help as much as possible with creating and plugging in new HTML parsers. The reward is that end users suddenly get a vast amount of new resources without changing anything in their client programs.

  Executive summary

This is an executive summary of what a Gowlab service provider needs to do:

  • Create a metadata description of a web resource (preferably using an ACD format, see details below).
  • Create a Java class extracting useful information from the web page (or use a default one if the page is simple).
  • Deploy the Gowlab service (just by typing a single command).

After these simple steps, your Tomcat servlet engine serves a web service that internally connects to the remote, original web page, extracts data from it, and provides them in the unified Soaplab manner.

Here is an example of a simple ACD file defining a Gowlab service that goes to the SRS at EBI and fetches a MEDLINE citation from there:

appl: Medline [
  documentation: "Get MEDLINE citation (in XML)"
  groups: "Testing"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
  comment: "method -e [MEDLINE:'$pmid'] -ascii"
]
string: pmid  [
  parameter: "Y"
]
outfile: result  [
]
If you put this file into the right directory (in your local Soaplab copy), you may just type:
build.sh gowlab -Dga=medline
and a new service will be deployed, together with its twin, a derived web service for those preferring a strongly typed API.

The advantage of this approach is not in getting the data (that can be done directly in your browser with the URL supplied in the ACD file above), but in getting the data in a standardized way that can be combined with other services into workflows or other integrating applications.

  Acknowledgment

Thanks to Robert Stevens from the BioHealth Informatics Group at the University of Manchester, UK, who came up with the idea to extend Soaplab in this way. And thanks to all service providers and their developers who will write suitable parsers for accessing various web resources. In the future, we may share their efforts in a contrib directory.

The implementation uses a very helpful third-party package HTTPClient, developed by Ronald Tschalär, and distributed under the GNU LESSER GENERAL PUBLIC LICENSE.

It may also use the HTML Tidy package, distributed under W3C LICENSE. Or, it may use its Java port called jTidy.

  Origin of the name

What does Gowlab mean?

Well, the lab part is historical and comes from the AppLab and Soaplab names. The gow part is perhaps more interesting. Accessing web resources on the Internet has become an addiction for many of us, and even though the resources are not always suitable for programmatic access, we still tend to grab them and use them. Does it sound like an addiction to a drug? Therefore, here is what the Oxford dictionary says about the term "gow":

gow
   A drug; spec. opium. Hence gowster, a drug addict.
 
   1922 Dialect Notes V. v. 182 Terms for opium. Gow, a Chinese word
which..is meaningless unless allied with another Chinese word. 1926
J. BLACK You can't Win xii. 159 You're in with what gow I've got. 1933
Amer. Speech VIII. II. 27/1 When one has contracted the [drug]
habit..he is..hitting the gow. 1942 BERREY & VAN DEN BARK
Amer. Thes. Slang Gowster,..esp. an opium addict.

Etymology:
   [Shortening and adaptation of Chinese yao-kao (Mathews),
opium. f. yao drug + kao an oily, fatty substance, esp. an unguent.]

Pronunciation:
   (gau) 
However, you can also think about "gow" simply as "Go Web".

 Gowlab   Gory details

There are no special details for clients. They can use existing Soaplab client software without any changes.

The only minor issue is how to name the list web service (also called the factory web service) when the same endpoint serves both the original web services accessing command-line tools and the Gowlab-based services. Of course, service providers can choose any name they like, but if they use the default values, the names will be:

  • AnalysisFactory for the original web services, and
  • GowlabFactory for the new services

The client needs to know it when asking for a list of available services. For example, when using a testing command-line client (a part of the Soaplab distribution) the client needs to add a new option -s with the service name:

   run/run-list -s GowlabFactory ...
otherwise the client will get only the original services (because AnalysisFactory is the default name).

The rest below describes what service providers must do.

  How to start

Using Gowlab is easy, easier than other Soaplab services, mainly because you do not actually provide a real service; you merely redirect client requests to third-party web resources.

You need to download Soaplab, install Tomcat, and optionally install a MySQL database for storing the results of asynchronous, non-blocking calls.

The only real added value (sometimes not so easy to provide) is creating ACD files for the web resources that you wish to present as Soaplab web services, and possibly writing a few Java plug-in classes for parsing the data coming from these web resources.

See the download page for what to download and how to install Soaplab. The rest is described below.

  ACD files for Gowlab

Each Soaplab service (which means also each Gowlab service) must be described by its metadata. The Soaplab run-time reads metadata in an XML format, but service providers usually create metadata in a more human-friendly format: ACD.

The general description of how to create ACD files is in the ACD Guide for Soaplab. Here are the details specific to Gowlab:

Let's create, step by step, an ACD file for a testing service that just echoes what is sent to it and that has all types of HTML form elements (such a web page really exists, at the URL given below, and its full ACD file is distributed with Soaplab).

The first part describes the web page/service as a whole:

appl: Echo [
  documentation: "A testing application for Soaplab/Gowlab"
  groups: "Gowlab"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://www.ebi.ac.uk/~senger/cgi-bin/echo.cgi"
]
The supplier is mandatory - it contains the URL of the page that you want to access.

The launcher comment specifies how the page is fetched, most commonly whether it accepts requests using the HTTP GET or POST method. The syntax is:

comment: "launcher get"
comment: "launcher post"
comment: "launcher <class-name>"
comment: "launcher <XSLT-filename>"
comment: "launcher <external-program-name>"
Default is launcher get.

The launcher <class-name> is used to specify your own plug-in Java class that will fetch the page. Details about plug-ins follow.

The launcher <XSLT-filename> form is recognized if the given string ends with .xsl. In that case, it should contain the name of a file with an XSLT stylesheet that is used to transform the fetched page. Details about XSLT plug-ins follow.

The launcher <external-program-name> form is recognized if the given string ends with .sh, .pl, or .external. In that case, it should contain the name of a program (the .external extension is removed by Gowlab) that is invoked to fetch data. Details about the external program plug-ins follow.
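
The recognition rules for the launcher comment can be summarized in a short sketch. The class, enum and method names below are hypothetical and are not part of Soaplab; only the suffix rules and the default come from the text above, and treating an empty value as "get" is an assumption.

```java
public class LauncherKind {

    /** Hypothetical categories; the names are illustrative only. */
    enum Kind { GET, POST, XSLT, EXTERNAL, JAVA_CLASS }

    /** Classifies the value that follows the word "launcher" in the comment. */
    static Kind classify(String launcher) {
        String value = (launcher == null) ? "" : launcher.trim();
        if (value.isEmpty() || value.equals("get")) {
            return Kind.GET;                 // "launcher get" is the default
        }
        if (value.equals("post")) {
            return Kind.POST;
        }
        if (value.endsWith(".xsl")) {
            return Kind.XSLT;
        }
        if (value.endsWith(".sh") || value.endsWith(".pl") || value.endsWith(".external")) {
            return Kind.EXTERNAL;
        }
        return Kind.JAVA_CLASS;              // anything else is taken as a class name
    }

    /** Removes the ".external" marker, as the text says Gowlab does. */
    static String programName(String launcher) {
        String suffix = ".external";
        return launcher.endsWith(suffix)
                ? launcher.substring(0, launcher.length() - suffix.length())
                : launcher;
    }

    public static void main(String[] args) {
        System.out.println(classify("yeast_grid.xsl"));      // XSLT
        System.out.println(programName("fetch.external"));   // fetch
    }
}
```
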

What was described above covers the common cases where the remote page takes data as HTML form elements (using GET or POST). However, some pages expect data in a so-called query string (a string attached directly to the URL, but without form separators such as & and = characters). The query string can have input data (very often data for a database query, hence the name) anywhere, so a neat set of name/value pairs is not enough. For such cases, Gowlab has the method comment. Here is a complete ACD file for getting an EMBL nucleotide sequence using an SRS page:

appl: Embl [
  documentation: "Get DNA sequence from the EMBL"
  groups: "Testing"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
  comment: "method -e [EMBLRELEASE-AccNumber:'$acc_number'] -ascii"
]
string: acc_number  [
  parameter: "Y"
]
outfile: result  [
]
The method comment is a template where you can use names defined as data types elsewhere in the same ACD file, prefixed by a dollar sign.
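
To make the template idea concrete, here is a minimal sketch of how such a substitution could work. It is not Gowlab's actual code: the class and method names are hypothetical, the accession number X12345 is a made-up example value, and both the URL-encoding step and leaving unknown names untouched are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MethodTemplate {

    // A dollar sign followed by an identifier, e.g. $acc_number
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    /** Replaces each $name with the matching input value; unknown names are left unchanged. */
    static String expand(String template, Map<String, String> inputs) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = inputs.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Attaches the expanded template to the supplier URL (encoding is an assumption). */
    static String buildUrl(String supplier, String template, Map<String, String> inputs) {
        try {
            return supplier + "?" + URLEncoder.encode(expand(template, inputs), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> inputs = new HashMap<>();
        inputs.put("acc_number", "X12345");   // made-up example value
        System.out.println(expand("-e [EMBLRELEASE-AccNumber:'$acc_number'] -ascii", inputs));
    }
}
```
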

The remaining part of the ACD file describes individual HTML form elements that are recognized by the web page:

TEXT or TEXTAREA form element
is represented in an ACD file as:
string: name  [
]
       
Actually, the data type string can be replaced by other ACD basic data types (such as float) to better indicate the expected contents of such input.

As with any other ACD data type, you may use parameter to indicate that this is a mandatory input, and default to specify the default value that is used when the client does not specify any value with the given name. Here is an example of an ACD definition for the InParanoid page (Database of pairwise orthologs):

string: id  [
  parameter: "Y"
  information: "Swissprot ID... (e.g. EFTU_YEAST...)"
]

float: conf [
  default: 0.05
  information: "Include paralogs with this confidence value or higher"
]
       

SELECT with OPTIONs form element, with just-one selection allowed,
is represented in an ACD file as:
list: color  [
  values: "red;green;blue;chartreuse"
  min: 1
  max: 1
]
       
The Soaplab input data for the example above will be named color and allowed values will be red, green, blue and chartreuse.

RADIO form element
is represented in the same way as a just-one SELECT:
list: radio [
  values: "a;b;c"
  min: 1
  max: 1
]
       
The difference between SELECT and RADIO has only a visual effect on the web page; from Gowlab's perspective they are the same.

SELECT with OPTIONs form element, with multiple selections,
is represented in an ACD file as:
list: words  [
  max: "4"
  values: "ee:eenie;me:meenie;mi:minie;moe"
  prompt: "some words"
]
       
Notice that now the allowed values have two parts separated by a colon (this is a standard ACD feature, nothing special for Gowlab). The first part will become a part of the name the end user will use to make this selection; the second part is the real value that will be sent to the remote web page. For example, a client will send boolean inputs named words_mi, words_me, etc. If there is no colon (as in the last value), the value is used for both (so a client specifies a boolean input named words_moe and the value sent to the server is moe).
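
The naming rule can be illustrated with a small sketch that derives the boolean input names and the values sent to the page from a values string. The class and method names are hypothetical; only the colon and semicolon conventions and the underscore-joined names come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ListValues {

    /** Maps each boolean input name (listName_key) to the value sent to the page. */
    static Map<String, String> inputNames(String listName, String values) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String item : values.split(";")) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue;                       // ignore empty items such as a trailing ';'
            }
            int colon = trimmed.indexOf(':');
            // Without a colon the same text serves as both the name part and the value
            String key = (colon < 0) ? trimmed : trimmed.substring(0, colon);
            String value = (colon < 0) ? trimmed : trimmed.substring(colon + 1);
            result.put(listName + "_" + key, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {words_ee=eenie, words_me=meenie, words_mi=minie, words_moe=moe}
        System.out.println(inputNames("words", "ee:eenie;me:meenie;mi:minie;moe"));
    }
}
```
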

CHECKBOX form element
is represented in an ACD file as:
boolean: checked [
  qualifier: okay
]
       
The tag qualifier is used to specify what value should be sent if this input datum is wanted (in this example okay will be sent). Note, however, that this is not by any means a value provided by the end user. The end user just says yes/no, or true/false, or 1/0, when sending a boolean input named (in this example) checked. The default qualifier is on (which may not be suitable for some web sites).
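
As a rough illustration of the qualifier rule, the following sketch adds a checkbox to the form data only when the client asked for it, using the qualifier (or the default on) as the value. The class and method names are hypothetical, and how Gowlab itself parses the client's yes/no answer is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class CheckboxSketch {

    static final String DEFAULT_QUALIFIER = "on";

    /** Interprets the client's yes/no, true/false or 1/0 answer (parsing is an assumption). */
    static boolean isWanted(String answer) {
        String a = answer.trim().toLowerCase(Locale.ROOT);
        return a.equals("yes") || a.equals("true") || a.equals("1");
    }

    /** Adds name=qualifier to the form data only when the checkbox is wanted. */
    static void addCheckbox(Map<String, String> form, String name,
                            String qualifier, boolean wanted) {
        if (wanted) {
            form.put(name, (qualifier != null) ? qualifier : DEFAULT_QUALIFIER);
        }
        // An unchecked HTML checkbox sends nothing, so nothing is added otherwise
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        addCheckbox(form, "checked", "okay", isWanted("yes"));
        System.out.println(form);   // prints {checked=okay}
    }
}
```
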

UPLOAD form element
is represented in an ACD file as:
infile: upload [
]
       
If it has no data comments, as in the example above, two input names will be created (as with any other Soaplab service for input files): upload_direct_data and upload_url (note that the word 'upload' is taken from the ACD example above, but it can be anything).

If you specify:

infile: upload [
   comment: "data direct"
]
       
then only the name upload will be available, and the client will use it to provide the input data.

If you specify:

infile: upload [
   comment: "data filename"
]
       
then again only the name upload will be available, and the client will use it to provide a URL where the input data are stored. This way a client, sitting at location A, can send data located at B to a Gowlab service run at location C. By the way, this is not a new Gowlab feature; it is the same with any other Soaplab service.

There is one restriction, however, for this URL: it cannot be a URL pointing to a local file (using the protocol file:/). Obviously, that would be very insecure.
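
A check of this kind could look like the sketch below. It is only an illustration with hypothetical names; refusing malformed or relative values as well is an assumption that goes beyond the restriction stated above.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class InputUrlCheck {

    /** Returns true if the URL may be fetched; file: URLs are always refused. */
    static boolean isAllowed(String url) {
        try {
            String scheme = new URI(url.trim()).getScheme();
            // Assumption: values without a scheme (relative paths) are refused too
            return scheme != null && !scheme.equalsIgnoreCase("file");
        } catch (URISyntaxException e) {
            return false;   // assumption: malformed values are refused
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://example.org/data.txt"));  // true
        System.out.println(isAllowed("file:/etc/passwd"));             // false
    }
}
```
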

Don't be confused: Gowlab uses the upload feature of the HTTP protocol because the remote web page expects to get data this way (more precisely: using multipart/form-data encoding). But it has nothing to do with the way end users (clients) provide their data: they can provide them as a direct string, or as a URL.

SUBMIT button form element
is represented in an ACD file as one or more results:
outfile: result  [
]
       
Usually a Gowlab service presents the resulting data as one result - the notation above is sufficient for that.

Often, however, the data presented by a web resource are not in a format suitable for sharing with other components of an integrated suite. You may need to write a special Java class and provide it as an adaptor to convert page data to a real result. The name of such a class is given in an output_adaptor comment line:

outfile: result  [
   comment: "output_adaptor my.post_process.class"
]
       
See the details below on how such a class (another plug-in mechanism) should be written.

Also, the same page can be converted to more than one result. Just specify more plug-in classes and name more results. For example:

outfile: result1  [
   comment: "output_adaptor my.post_process.class"
]
outfile: result2  [
   comment: "output_adaptor my.post_process.second.class"
]
       

  Java plug-ins for Gowlab

Web resources are very diverse. Therefore, it is expected that service providers will need to write specialized, but usually quite simple, Java classes to deal with individual resources. Simple pages (for example pages providing data in a well-known format, such as MEDLINE citations exposed in the XML format) can be treated by default classes that are included in the Soaplab distribution, but others will use plug-in mechanisms.

There are two Java interfaces that can be implemented to fetch pages and to adapt resulting data:

  • RunPlugIn

    This interface is used to fetch data from a remote web resource. Its default implementation (used when nothing special is defined in the ACD file) just gets the page pointed to. This is often fine, but there may be cases where you need more. For example:

    • when the remote page uses frames and you need to fetch just one particular frame, or
    • when you are interested (only) in images included in the page - so you need to make additional fetching, or
    • when you need to go to several subsequent pages and create an integrated result from all of them.

    When you write one of these classes you need to tell Gowlab about it. Put the class name in the launcher comment. For example:

    appl: ApplName [
      documentation: "..."
      groups: "..."
      nonemboss: "Y"
      comment: "launcher my.special.plugin"
      supplier: "..."
    ]
    
    The default class (which you do not need to specify in the launcher) is RunPlugInDefaultImpl, which you may find helpful as a template when you write your own plug-in.

    Note that this interface is not meant to be used for post-processing of results. It can do that, as well, but if you need just to parse resulting data and convert them into something else, implement rather the following:

  • DataAdaptPlugIn

    It is a very simple interface for adapting (changing, filtering) data coming from the web resources. You may use it together with the RunPlugIn or independently. There may be more classes implementing this interface in the same service if you want to produce more results from the same page.

    The way to tell Gowlab about this plug-in is to put the class name in the output_adaptor comment of the result that this class is going to create. For example:

    outfile: result  [
       comment: "output_adaptor my.post_process.class"
    ]
           
    In the Soaplab distribution, there is an example class testing.LengthAdapt that can be used as a template for writing your own adapt plug-ins.
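
The signature of DataAdaptPlugIn is not shown in this document, so the sketch below uses a hypothetical stand-in interface (PageAdaptor) purely to illustrate the idea of producing several results from one fetched page. None of the names below are Soaplab API, and the two adaptors are invented examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AdaptorSketch {

    /** Hypothetical stand-in for an adaptor plug-in; NOT the real DataAdaptPlugIn signature. */
    interface PageAdaptor {
        String adapt(String page);
    }

    /** Reports the number of characters in the fetched page. */
    static class LengthAdaptor implements PageAdaptor {
        public String adapt(String page) {
            return String.valueOf(page.length());
        }
    }

    /** Strips anything that looks like a tag, leaving the text content. */
    static class TextOnlyAdaptor implements PageAdaptor {
        public String adapt(String page) {
            return page.replaceAll("<[^>]*>", "").trim();
        }
    }

    /** Applies one adaptor per named result to the same fetched page. */
    static Map<String, String> produceResults(String page, Map<String, PageAdaptor> adaptors) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, PageAdaptor> e : adaptors.entrySet()) {
            results.put(e.getKey(), e.getValue().adapt(page));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, PageAdaptor> adaptors = new LinkedHashMap<>();
        adaptors.put("result1", new LengthAdaptor());
        adaptors.put("result2", new TextOnlyAdaptor());
        // prints {result1=12, result2=hello}
        System.out.println(produceResults("<p>hello</p>", adaptors));
    }
}
```
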

Regarding plug-ins, there are some "to-do" things:

  • Because the fetched HTML pages may change frequently, it would be more robust if the implementation could automatically check whether the returned page is a "real" page or just an error page and, even if it is a real one, whether it has the wanted data (for example, a database search can return an empty result). Therefore I expect to introduce (soon?) a new "comment" in the ACD file with, probably, a regular expression checking for an error. Something like this:
    appl: Gowlab [
      ...
      comment: "error_regexp /SRS Error/"  
    ]
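
Since this feature is only planned, the following is a speculative sketch of what such a check might do, not a description of Gowlab behaviour. The class and method names are hypothetical, and treating the surrounding slashes as delimiters is an assumption based on the example above.

```java
import java.util.regex.Pattern;

public class ErrorPageCheck {

    /** Removes the surrounding slashes of a /.../ style expression, if present. */
    static String stripSlashes(String expression) {
        String t = expression.trim();
        return (t.length() >= 2 && t.startsWith("/") && t.endsWith("/"))
                ? t.substring(1, t.length() - 1)
                : t;
    }

    /** Returns true if the fetched page matches the provider's error pattern. */
    static boolean looksLikeError(String page, String errorRegexp) {
        return Pattern.compile(stripSlashes(errorRegexp)).matcher(page).find();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>SRS Error</h1></body></html>";
        System.out.println(looksLikeError(page, "/SRS Error/"));   // true
    }
}
```
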
           

  XSLT plug-ins for Gowlab

There is also another way to convert the fetched data to something more useful. Gowlab comes with a ready-to-use plug-in that converts the fetched page using an XSLT stylesheet provided by the Gowlab service provider.

When you write your own XSLT stylesheet you need to tell Gowlab about it. Put the stylesheet filename in the launcher comment. For example:

appl: ApplName [
  documentation: "..."
  groups: "..."
  nonemboss: "Y"
  comment: "launcher yeast_grid.xsl"
  supplier: "..."
]
When Gowlab spots such a definition (recognized by the .xsl ending), it does the following:
  1. It uses the class RunPlugInXSLTImpl to fetch the desired HTML page (this class inherits from RunPlugInDefaultImpl, which does the fetching).
  2. It uses jTidy classes to convert the fetched HTML page into a valid XML document.
  3. It locates your XSLT stylesheet:
    • By its fully qualified name (not recommended because an ACD file with a fully qualified filename would not be usable at other sites).
    • By looking for the given file in the directory specified by a property soaplab.xslt.dir during the service deployment (recommended way because the XSLT stylesheets can be changed even without restarting Tomcat server).
    • By looking for the given file in all directories and jar files defined by the current CLASSPATH (it is not a bad option, especially if you pack things together and distribute them, but if you want to change such XSLT stylesheets you usually need to restart your Tomcat server after each change).
  4. It uses the XSLT stylesheet to transform the fetched and tidied HTML page. If the transformation does not work as you expect, try setting the gowlab.debug property (see service deployment). It puts the tidied HTML page into the report result so you can try your XSLT stylesheet off-line.

    Some XSLT stylesheets are collected and distributed with Soaplab.
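
For trying a stylesheet off-line, the transformation step can be reproduced with the standard javax.xml.transform API, as in the sketch below. It assumes the page has already been tidied into well-formed XML (the jTidy step is not shown), and it is not the code of RunPlugInXSLTImpl. The class name, sample page and sample stylesheet are invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Invented sample data: a tidied page and a stylesheet listing all table cells
    static final String SAMPLE_PAGE =
            "<html><body><table>"
          + "<tr><td>YAL001C</td></tr><tr><td>YAL002W</td></tr>"
          + "</table></body></html>";

    static final String SAMPLE_STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/\">"
          + "<xsl:for-each select=\"//td\">"
          + "<xsl:value-of select=\".\"/><xsl:text>;</xsl:text>"
          + "</xsl:for-each>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Applies an XSLT stylesheet to an already tidied (well-formed) page. */
    static String transform(String tidiedPage, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(tidiedPage)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_PAGE, SAMPLE_STYLESHEET));   // YAL001C;YAL002W;
    }
}
```
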

      External plug-ins for Gowlab

    There is yet another way to fetch, and possibly convert, data from a remote web resource. Gowlab comes with a plug-in that invokes an external program (which runs on the same machine as the Tomcat server serving Gowlab services).

    THIS OPTION IS NOT YET IMPLEMENTED.

      Deploying Gowlab services

    Deploying Gowlab services (as any other web services, actually) means two things:

    1. Copying compiled classes to the Tomcat directory, and
    2. Informing Tomcat about the names and classes representing your services.

    The first step (usually) requires restarting Tomcat, but the second step does not. Gowlab services need to know about metadata files, but they cache them sensibly: every time you create a new version of your metadata (which happens when you are developing and testing a service) and send it to Tomcat, the services re-read the new metadata automatically, without the need to restart Tomcat. This makes deployment of Gowlab services quite straightforward.

    Also, Gowlab services are served only by Tomcat; they do not need another server running (such as an AppLab server for other Soaplab services).

    Before showing what commands you need to type to deploy Gowlab services, let's say the same again, in more detail: Deploying Gowlab services means seven things:

    1. Copying compiled classes to the Tomcat directory:
      1. Copying general Soaplab classes from a Soaplab distribution. This is done just once, unless you are a Soaplab developer.
      2. Copying your own plug-ins when they are ready.
      3. Copying classes of the so-called derived services after they have been generated.

    2. Informing Tomcat about the names representing your services, and about the parameters they need. Tomcat keeps this information in its own deployment descriptor file, where you can check it manually when something does not work as expected (the deployment file is located in <tomcat-home>/webapps/axis/WEB-INF/server-config.wsdd):
      1. Inform about general configuration parameters.
      2. Inform about a List service that shows what Gowlab services are available in this Tomcat.
      3. Inform about all other services.
      4. Inform about derived services.

    Here is a list of the recognized deployment parameters. I put it here more or less for documentation purposes because by default they all are created automatically - as you will find at the end of this chapter.

    The general configuration parameters are shared by all services. For example, Gowlab keeps there the location of a logfile and the access information for the local database with results. Keeping them in one place makes it easier to correct them manually. But Gowlab services do not mind where they get configuration parameters from: they simply try the global parameters first, and then their own parameters.
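
The lookup order described above (global parameters first, then the service's own) can be sketched as follows. The class and method names are hypothetical, the parameter values in the example are made up, and the real implementation may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigLookup {

    /** Looks in the global parameters first, then in the service's own; may return null. */
    static String lookup(String name, Map<String, String> global, Map<String, String> own) {
        String value = global.get(name);
        return (value != null) ? value : own.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> global = new HashMap<>();
        global.put("jdbc.user", "soaplab");

        Map<String, String> own = new HashMap<>();
        own.put("jdbc.user", "someone_else");        // shadowed by the global value
        own.put("analysis_metadata", "echo_al.xml"); // made-up example value

        System.out.println(lookup("jdbc.user", global, own));          // soaplab
        System.out.println(lookup("analysis_metadata", global, own));  // echo_al.xml
    }
}
```
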

    Global configuration parameters

    Name                 Default value                   Description
    soaplab.results.dir  ....work/data                   Directory for the fetched results (before they go to the database)
    soaplab.logfile      .../logs/soplab.log             Log file
    soaplab.xslt.dir     ....src/plugins/gowlab/xslt     Directory where Gowlab looks for the XSLT stylesheets
    gowlab.debug         false                           Helps with debugging your XSLT stylesheet
    jdbc.url             jdbc:mySQL://localhost/soaplab  JDBC URL to your database with results
    jdbc.user            soaplab                         Database user name
    jdbc.passwd          (none)                          Password to access the database
    jdbc.maxconn         15                              Maximum number of open connections to the database
    jdbc.driver          org.gjt.mm.mysql.Driver         Class name representing a JDBC driver

    The List service is named GowlabFactory, and it uses the following configuration parameters:

    Configuration parameters for a list service

    Name              Default value                                                  Description
    className         org.embl.ebi.SoaplabServer.gowlab.AnalysisFactoryWSGowlabImpl  Class representing this service
    factory_metadata  .../work/metadata/Gowlab.xml                                   An XML file containing links to all Gowlab services served from this server.

    All other Gowlab services (both normal and derived ones) use the following configuration parameters:

    Configuration parameters for each service

    Name               Default value                                           Description
    className          org.embl.ebi.SoaplabServer.gowlab.AnalysisWSGowlabImpl  Class representing this service
    analysis_metadata  .../work/metadata/<category>/<service-name>_al.xml      An XML file describing this service.

    Finally, how to deploy a service, or, more often, a set of services? It depends on the Soaplab distribution you are using. The same applies as for other Soaplab services: see the Binary distribution Guide and the Build from CVS Guide.

    The binary distribution has a script ws/deploy-web-services for deploying services. More likely, however, you will be using the Soaplab CVS module, because you will often need to write your own Java plug-in, and for that the CVS module has a building and developing environment.

    In summary, here is a list of actions you need to do when deploying a new service:

    1. Create an ACD file for a new service, name it <service-name>.acd, and put it in the src/etc/acd/gowlab/ directory.
    2. Type: ./build-dev.sh gowlab
    3. That's it...

    If you wish to deploy only some services (from the ACD files located in the src/etc/acd/gowlab/ directory), use property ga with a space separated list of names. You may also want to create a different name for the file with list metadata (default is Gowlab.xml) - use property gl. Note that these properties have intentionally shorter names so they can be used easily from the command line, without bothering to add them to your build.properties file. For example:

        ./build-dev.sh '-Dga=echo emblsrs' gowlab
    

    Martin Senger
    Last modified: Thu Mar 3 11:35:03 2005