Vol. 9, Issue 3, 277-281, March 1999
METHODS
ABSTRACT
DNA sequence chromatograms (traces) are the primary data source for all large-scale genomic and expressed sequence tag (EST) sequencing projects. Access to the sequencing trace assists many later analyses, for example contig assembly and polymorphism detection, but obtaining and using traces is problematic: traces are not collected and published centrally, they are much larger than the base calls derived from them, and viewing them requires the interactivity of a local graphical client with local data. To provide efficient global access to DNA traces, we developed a client/server system based on flexible Java components integrated into other applications, including an applet for use in a WWW browser and a stand-alone trace viewer. Client/server interaction is facilitated by CORBA middleware, which provides a well-defined interface, a naming service, and location independence.
[The software is packaged as a Jar file available from the following URL: http://www.ebi.ac.uk/~jparsons. Links to working examples of the trace viewers can be found at http://corba.ebi.ac.uk/EST. All the Washington University mouse EST traces are available for browsing at the same URL.]
INTRODUCTION
Biological Information Distribution
The Internet is host to an increasingly diverse
range of mechanisms for biological data distribution. Two of the latest
are the World Wide Web (WWW) standards set by the W3C
(http://www.w3.org/), which are already well established among
biologists, and the Object Management Group (OMG) Common Object Request
Broker Architecture (CORBA), which is relatively new in this field
(http://www.omg.org). The existing WWW standards have the advantage of
simplicity and broad availability; however, frequent extensions to HTML
and additions such as JavaScript, XML, and dynamic HTML (see Box 1 for
an aid to definitions; these are also all described at
http://www.hotwired.com/webmonkey/collections/crash_courses.html) have
tested the ability of both WWW browser developers and users to keep up.
The incorporation of Java applets (http://java.sun.com) into HTML
documents has further tested the maintenance of common standards, as
Java itself has undergone rapid change. However, Java's basic
combination of security, portability, and desirability as a programming
language has ensured the inclusion of a Java Virtual Machine (JVM)
in the major WWW browsers, where it greatly increases the potential for
client interactivity. In 1996, the ease with which client/server
object-oriented applications could be written, distributed, and
supported across the Internet increased when Netscape
(http://www.netscape.com/) announced (Orfali and Harkey 1997) that
its browsers were all going to include a CORBA Object Request Broker
(ORB), and when it subsequently decided to distribute its browsers for free.
The use of CORBA in a biological context was introduced by Hu et al.
(1998) and Lijnzaad et al. (1998), who explained that CORBA can be a
good solution to the problem of creating applications for distributed
heterogeneous environments. The Internet is the extreme example of both
distribution and heterogeneity and is described by Orfali and Harkey
(1998) as being host to the Object Web, in which CORBA and Java
complement each other's abilities to create globally accessible
interactive objects. Principal among the benefits that CORBA brings to
biological data distribution and interaction are the Interface
Definition Language (IDL) to define interfaces between objects,
scalability (including language and operating system independence),
state preservation across invocations, and a rich set of 15 object
services, for example, the naming service.
Sequence Chromatograms
DNA sequence chromatograms are interpreted to produce nucleotide
sequences (base-calling) and corresponding base-call quality estimates.
Although these derived views are used more commonly, the traces
remain the ultimate reference source for any queries about that
particular sequencing reaction. All commonly used sequence assembly
packages (for example, Bonfield et al. 1995) include proprietary trace
browsers to help the user (finisher) distinguish poor-quality data from
good and so work backwards to recreate a representation of the original
sequence. Furthermore, in regions with either trace artifacts specific
to a particular sequencing chemistry or general background
contamination, an experienced finisher might be able to diagnose
the underlying problem correctly and provide a better base call when
provided with a suitable view of the original trace. As with contig
assembly, trace availability can increase the success rate of STS
development from ESTs by enabling an optimal estimation of the possible
positions of base-calling errors. Mott (1998) explored more of the
direct uses of sequence traces through trace alignment, including the
identification of vector sequence (better than other automated
methods) and the detection and analysis of polymorphisms/mutations.
Examples of existing trace viewers and editors include Ted (Gleeson and
Hillier 1991), Consed (Gordon et al. 1998), and Trev
(http://www.mrc-lmb.cam.ac.uk/pubseq/manual/trev_toc.html).
A poignant example of the special role of sequencing traces as
the ultimate sequencing reaction reference has emerged from the
combination of the recent release of the Phred base-calling program
(Ewing and Green 1998) and the status of the largest section of the
public nucleotide databases: the EST sequences. ESTs are unusual in
that they are submitted and published in a raw state with limited
quality control (Hillier et al. 1996) and without the error detection
and correction process intrinsic to normal shotgun assembly. Hillier et
al. (1996) stressed the need for traces to be available online globally
and have therefore maintained, since the beginning of their EST
sequencing, an ftp site from which traces can be downloaded. Now that
Ewing and Green (1998) have released Phred with its improved base-calling
(estimated to make 50% fewer errors than the original ABI base-caller),
~250,000 entries in the GenBank (Benson et al. 1998) and EMBL
(Stoesser et al. 1998) nucleotide databases may be considered to be out
of date and ripe for replacement, whereas the original chromatograms
remain available online and ready for reinterpretation at the
originating laboratory.
Overall, the Internet is enabling decentralization within all areas of biological data access via the simplicity and low cost of HTML, the code portability of Java, and now the global middleware of CORBA. Though DNA sequence traces are collectively large and scattered globally, they remain important and are following the same trends as other types of biological data: originally accessible via ftp, and now by Java applet over either HTTP or the OMG's Internet Inter-ORB Protocol (IIOP), as described in this paper.
RESULTS
A Java trace-viewing applet originally written by E. Buehler (see Fig. 1) has been developed into a set of trace-viewing tools, with each component filling a different software niche. The tools work with different versions of the Java Virtual Machine; are packaged as Java applets, applications, and Java Beans; and operate as either CORBA client/server systems or stand-alone applications.
Design
The design choices were influenced by many factors including, most importantly, the fact that the majority of DNA sequence traces are normally stored in individual files in one of only three formats: ABI, SCF V2, or SCF V3 (see Table 1). Most sequencing machines' proprietary formats are convertible to the Standard Chromatogram File (SCF) formats (Dear and Staden 1992), a process helped by the Staden group's provision of freely available SCF libraries (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/src/). This lack of flat-file diversity enabled both the HTTP-based and CORBA-based trace-viewing clients to share much of the same code and also allowed a focus on scalability and download speed for the server design.
The move from the original applet to the client/server design offers many benefits, including an object abstraction; the use of separately named trace stores, each with its own description; a choice of compressed or uncompressed traces; and, most importantly, the opportunity to hide implementation details, such as where and how a particular trace is stored, behind a common database interface (see Box 2). The separation of client and server, communicating through an agreed interface, is the cornerstone of CORBA distributed software design: it allows concurrent use of different languages and operating systems, and lets both clients and servers improve their implementations and add new features independently of each other. If the original specification is eventually found to be restrictive, new interfaces can be written and implemented while still supporting the old ones (unlike database schemas). IDL also allows inheritance, so simple IDL specifications like that in Box 2 can be extended to create more complex derived interfaces.
Downloading, parsing, and displaying a trace can take less than two seconds (from genome.wustl.edu in the USA to ebi.ac.uk in England) but may take more than five times longer when the Internet is congested (data not shown). Extra time is needed for an initial transfer of a Java ORB (if one is needed by the client). ABI format files are the largest and take the longest to arrive. SCF format trace files (either version) are already many times smaller than the original ABI format trace, and SCF version 3 chromatograms can be compressed by gzip to <7% of the original file size. The SCF version 3 format was designed specifically to be compressed easily; see http://www.mrc-lmb.cam.ac.uk/pubseq/manual/formats_2.html.
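The interface abstraction described in this section (separately named trace stores, a compressed or uncompressed choice, and hidden storage details) can be illustrated in plain Java. This is only a sketch: the names `TraceStore`, `InMemoryTraceStore`, and `fetchTrace` are hypothetical and are not taken from the IDL in Box 2, and an in-memory map stands in for whatever directory hierarchy or database a real server would use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical interface: method names are illustrative, not those of Box 2.
interface TraceStore {
    String name();
    String description();
    // Returns the trace bytes, gzipped when compressed is true.
    byte[] fetchTrace(String traceId, boolean compressed);
}

// One possible back end. A file- or database-backed store could implement
// the same interface without any change to client code.
class InMemoryTraceStore implements TraceStore {
    private final String name;
    private final String description;
    private final Map<String, byte[]> traces = new HashMap<>();

    InMemoryTraceStore(String name, String description) {
        this.name = name;
        this.description = description;
    }

    void put(String traceId, byte[] rawTrace) {
        traces.put(traceId, rawTrace.clone());
    }

    @Override public String name() { return name; }
    @Override public String description() { return description; }

    @Override
    public byte[] fetchTrace(String traceId, boolean compressed) {
        byte[] raw = traces.get(traceId);
        if (raw == null) {
            throw new IllegalArgumentException("No such trace: " + traceId);
        }
        if (!compressed) {
            return raw.clone();
        }
        // Compress on demand; a real server might store the gzipped form instead.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Because clients depend only on the interface, the storage layout can change on the server side without redeploying the viewer.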
The software can be downloaded as either the compiled-class Jar files referenced from within the applet tags of any example applet pages (use "view page source" in Netscape), or as Java and IDL source Jar files as specified above. As an example requiring neither Java or IDL compilation nor a local ORB, one could use the local trace viewer to display the gzipped trace file mr32b07.r1.gz in the current directory with the command "java embl.ebi.trace.TraceView mr32b07.r1.gz" after the jar file containing all the chromatogram viewer classes has been downloaded from an applet page and specified directly in the user's local CLASSPATH environment variable. This jar file is typically called CorbaChromatogramApplet.jar and includes the TraceView.class file and all of its supporting classes. Thus, setting the environment for a UNIX csh session would require some version of a command such as "setenv CLASSPATH /home/myclasses/CorbaChromatogramApplet.jar".
DISCUSSION
Currently, there are a few problems in deploying CORBA-based applications over the Internet. These problems include old firewalls blocking the IIOP protocol, the need to download ORB classes to clients because of the lack of a guaranteed local ORB, and a lack of support for multiple applet signing that would allow applets to follow object references to objects on computers other than the original applet's host. There are already solutions to all these problems, but their degree of irritation should decrease with the release of JDK 1.2 from Javasoft (http://www.javasoft.com/) with its high-performance class libraries and built-in Java ORB. When all operating systems support this rich environment, which includes OMG CORBA support, distributed computing may move further out of the browser and directly into more of a user's normal molecular biology application set.
CORBA may appear to be overkill for this simple interface specification (relative to sockets, for example), but as more biological software components are written to CORBA standards, the extra effort of installing each individual server is reduced. The EMBL outstation, the European Bioinformatics Institute (EBI), is working toward standards for such components along with other members of the OMG's Life Science Research (LSR) Domain Task Force (DTF) (http://lsr.ebi.ac.uk/). Java RMI would have been an interesting CORBA alternative but was not investigated because of the lack of relevant biological standards efforts, frameworks, language independence, services, and local support.
Future Options
The trace viewer is limited by its isolation: only when more CORBA servers are developed to support applications such as EST clustering and sequence assembly will the synergies of CORBA-wrapped data become obvious. As soon as practical, the CORBA trace server will move to the new CORBA 3 standard, which supports server code that is fully portable between different vendors' ORBs. The client should benefit from extra interactive features such as quality value display and editing, external trace view positioning interfaces, and multiple trace views.
METHODS
All the software is written in Java and compiled using Sun Microsystems/Javasoft's Javac Java compilers (http://www.javasoft.com/). The simplest applet (also the first written applet) complies with the Java 1.0 standard but the remainder of the code requires Java 1.1 class libraries. The IDL interface specifications were compiled by Object Oriented Concepts' (http://www.ooc.com/) ORBacus IDL to Java compiler. Many ORBs and IDL compilers are available free (http://industry.ebi.ac.uk/corba/) as are Sun's Java compilers. Documentation is distributed throughout the code in Javadoc comments.
Implementation
The three trace formats are parsed by subclasses of an abstract chromatogram class. The chromatogram class visualization code is in a separate ChromatogramCanvas class to keep display and user-interactivity methods separate from the basic chromatogram object model. The client canvas uses double buffering to reduce flicker when scrolling. The chromatogram display can be switched to display ASCII base-calls, or the ABI sequencing machine's comments field.
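The arrangement of one abstract chromatogram class with a subclass per format can be sketched as follows. The class names are hypothetical and the parsing itself is omitted; only format detection is shown. The sketch assumes the usual file signatures: ABI files begin with the ASCII bytes `ABIF`, and SCF files begin with `.scf` and carry a four-character version string (for example `3.00`) at byte offset 36 of the header. Some ABI files carry an extra 128-byte prefix, which this sketch does not handle.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class names; the authors' own classes are not reproduced here.
abstract class Chromatogram {
    protected final byte[] data;

    Chromatogram(byte[] data) {
        this.data = data;
    }

    abstract String formatName();

    // Picks the subclass by inspecting the file signature.
    static Chromatogram fromBytes(byte[] data) {
        String magic = ascii(data, 0, 4);
        if (magic.equals("ABIF")) {
            return new AbiChromatogram(data);
        }
        if (magic.equals(".scf")) {
            String version = ascii(data, 36, 4); // assumed header layout
            if (version.startsWith("3")) {
                return new ScfV3Chromatogram(data);
            }
            if (version.startsWith("2")) {
                return new ScfV2Chromatogram(data);
            }
        }
        throw new IllegalArgumentException("Unrecognised trace format");
    }

    // Returns an empty string when the array is too short to hold the field.
    private static String ascii(byte[] data, int offset, int length) {
        if (data.length < offset + length) {
            return "";
        }
        return new String(data, offset, length, StandardCharsets.US_ASCII);
    }
}

class AbiChromatogram extends Chromatogram {
    AbiChromatogram(byte[] data) { super(data); }
    @Override String formatName() { return "ABI"; }
}

class ScfV2Chromatogram extends Chromatogram {
    ScfV2Chromatogram(byte[] data) { super(data); }
    @Override String formatName() { return "SCF V2"; }
}

class ScfV3Chromatogram extends Chromatogram {
    ScfV3Chromatogram(byte[] data) { super(data); }
    @Override String formatName() { return "SCF V3"; }
}
```

Keeping display code out of this hierarchy, as the text describes, means a viewer class only ever needs a `Chromatogram` reference and never the concrete format.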
The client/server CORBA system wraps the client classes inside a CORBA adapter class. This adapter translates GUI-generated trace load requests into CORBA method calls on a particular trace database server implementation, located via a CORBA naming service. The trace file parsing is done on the client because CPU cycles are plentiful there and the Java code transfer overhead is small (~25% of the size of the smallest compressed trace). To optimize scalability and speed, the server can store and transfer traces as gzipped files, which are handled easily by the java.util.zip package in Java 1.1.
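Client-side handling of gzipped traces with `java.util.zip` might look like the sketch below. The class and method names are hypothetical; the only assumption about the data is that gzipped content starts with the standard gzip signature bytes 0x1f 0x8b, so uncompressed traces pass through unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the published software.
final class TraceCompression {
    private TraceCompression() {}

    // True when the data starts with the gzip signature bytes.
    static boolean isGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Decompresses gzipped data; returns other data unchanged.
    static byte[] gunzipIfNeeded(byte[] data) {
        if (!isGzipped(data)) {
            return data;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Server-side counterpart, used here to build test input.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The decompressed bytes can then be handed to whichever format parser applies, so the parsers never need to know whether the trace travelled compressed.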
The CORBA trace server has all the implementation-specific methods for loading a trace (from a particular directory hierarchy or database) in a single class that can be overridden. The server configuration details including database names and descriptions are parsed from a simple text file that is read once when the server starts up. Multiple servers in different locations can register with a common naming service.
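The two ideas in this paragraph, a text configuration file read once at startup and a single overridable class for locating traces, can be sketched as below. The paper does not give the configuration file format, so the `name = description` layout with `#` comments is purely an assumption, as are all class and method names and the two-character subdirectory scheme in the subclass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for an assumed "name = description" file format.
class TraceServerConfig {
    static Map<String, String> parse(Reader source) {
        Map<String, String> databases = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                int eq = line.indexOf('=');
                if (eq < 1) {
                    throw new IllegalArgumentException("Bad config line: " + line);
                }
                databases.put(line.substring(0, eq).trim(),
                              line.substring(eq + 1).trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return databases;
    }
}

// Default storage layout: every trace of a database in one directory.
class TraceLocator {
    protected String locate(String database, String traceId) {
        return database + "/" + traceId;
    }
}

// A site with a different layout overrides only this one method.
class PrefixedDirectoryLocator extends TraceLocator {
    @Override
    protected String locate(String database, String traceId) {
        return database + "/" + traceId.substring(0, 2) + "/" + traceId;
    }
}
```

Confining site-specific knowledge to one overridable method keeps the rest of the server identical across installations.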
The original applet, being written to the older Java standard and using an ordinary http daemon as its download server, is well suited to general Internet deployment, in which browser versions may be out of date, and to small sequencing centers in which there are few traces for display and no local programming expertise. The implementation of Java 1.1 in Netscape Communicator 4.5 supports all the code described.
ACKNOWLEDGMENTS
We are grateful to Tom Flores for helping to start the CORBA element of this project. Rodger Staden's group, especially James Bonfield, helped with advice and support. This work was funded by European Union grant BIO 4 CT 960346.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
4 Corresponding author.
E-MAIL jparsons{at}ebi.ac.uk; FAX 44 1223 494468.
Received October 6, 1998; accepted in revised form January 20, 1999.