
Programming Environments for Multidisciplinary Grid Communities
Naren Ramakrishnan, Layne T. Watson, Dennis G. Kafura, Calvin J. Ribbens, and Clifford A. Shaffer
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
July 20, 2001

Abstract

Rapid advances in technological infrastructure as well as the emphasis on application support systems have signaled the maturity of grid computing. Today's grid computing environments (GCEs) extend the notion of a programming environment beyond the compile-schedule-execute paradigm to include functionality such as networked access, information services, data management, and collaborative application composition. In this article, we present GCEs in the context of supporting multidisciplinary communities of scientists and engineers. We present a high-level design framework for building GCEs and a space of characteristics that help identify requirements for GCEs for multidisciplinary communities. By describing integrated systems for five different multidisciplinary communities, we outline the unique responsibility (and opportunity) for GCEs to exploit the larger context of the scientific or engineering application, defined by the ongoing activities of the pertinent community. Finally, we describe several core systems support technologies that we have developed to support multidisciplinary GCE applications.

Contents

1 Introduction
  1.1 Multidisciplinary Grid Communities: Scenarios
  1.2 Multidisciplinary Grid Communities: Themes
  1.3 GCEs for Multidisciplinary Grid Communities: Characteristics
  1.4 GCEs for Multidisciplinary Grid Communities: A High-Level Architecture
2 Motivating Applications
  2.1 WBCSim
  2.2 VizCraft
  2.3 L2W
  2.4 S4W
  2.5 Expresso
3 Systems Support for Multidisciplinary GCE Applications
  3.1 Representations in a GCE
  3.2 BSML: A Binding Schema Markup Language
  3.3 Format Conversions and Change Management
  3.4 Executing Simulations = Querying
  3.5 Reasoning and Problem Solving
  3.6 Sieve: A Collaborative Component Composition Workspace
  3.7 Symphony: Managing Remote Legacy Resources
4 Discussion

1 Introduction
Grid computing environments (GCEs) have matured significantly in the past few years. Advances in technological infrastructure as well as a better awareness of the needs of application scientists and engineers have been the primary motivating factors. In particular, the shift in emphasis from low-level application scheduling and execution to high-level problem-solving [16] signals that grid computing will become increasingly important as a way of doing science. We use the term GCE to broadly denote any facility by which a scientist or engineer utilizes grid services and resources to solve computational problems. Our definition thus includes facilities ranging from high-performance scientific software libraries [55] augmented with grid access primitives to domain-specific problem-solving environments (PSEs) [24, 71] that provide targeted access to applications software. GCEs extend the notion of a programming environment beyond the compile-schedule-execute paradigm to include functionality such as networked access [15], information services, data management, and collaborative application composition. This is especially true when designing such systems to support multidisciplinary grid communities, the focus of this paper.

Our emphasis at Virginia Tech has been on exploiting the high-level problem-solving context of such 'virtual organizations' [35] and building on the grid architectures, services, and toolkits (e.g., [8, 33, 79, 77]) being developed by the grid community. In working with large, concerted groups of scientists and engineers in various applications (aircraft design, watershed assessment, and wireless communications system design, to name a few), we have identified several recurring themes important for supporting and sustaining such communities. Our goal in this paper is to document these themes, present a high-level design framework for building GCEs [36], and describe some solutions we are working on to address the concomitant needs.

In the remainder of this section, we present usage scenarios from multidisciplinary communities that will help us characterize requirements for GCEs. We also describe a high-level framework for building and organizing programming environments for such communities. In Section 2, we describe PSEs that we have built for five different multidisciplinary communities. Section 3 discusses a variety of issues pertaining to software systems support for GCEs; in particular, our semistructured data management facility plays a central role in exploiting the rich problem-solving context of multidisciplinary grid communities. Two other elements of our GCE framework are described in Section 3: Sieve (a collaborative component composition workspace) and Symphony (a framework for managing remote legacy resources). We conclude with a brief discussion of future directions in Section 4.

1.1 Multidisciplinary Grid Communities: Scenarios
We begin by describing some scenarios to illustrate the needs typical of multidisciplinary grid communities. We posit that there are fundamental differences in the usage patterns for a single researcher (or even a group of collaborators) working on a relatively homogeneous problem as compared to the usage patterns found in the communities we have in mind. For example, how does a grid community for solving matrix eigenvalue problems differ from, say, one for aircraft design or wireless communications? We identify three scenarios that are suggestive of the distinctions we would like to make.


Scenario 1: A specialist in ray tracing, a channel modeler, and a computer scientist are addressing the problem of determining the placement of wireless base stations in a square-mile area of a large city such that the coverage is optimal [51]. In terms of execution, this problem involves a computation expressed as a digraph of components, written in multiple languages (C, Matlab, and FORTRAN), and enclosed in an optimization loop (see Fig. 1, left). Notice that information is exchanged between executions in three different languages and is streamed between the optimizer and the simulation. In addition, a variety of intermediate results are produced, not all of which are direct performance data. Such results are typically cached to improve performance, visualized at different stages of the execution, or simply saved for later inspection. Furthermore, the components (codes) are developed at different times by different researchers and many are still under active development. Their I/O specifications hence cannot be enumerated in advance to achieve matching of components. Further, the possibilities of how components could be cascaded and combined can themselves evolve over time. How can a programming environment be designed that allows the binding of problem specifications to arbitrary codes and allows their arbitrary composition?

Figure 1: (left) Compositional modeling for designing a wireless communications system. The components include triangulation, space partitioning, a ray tracer, a propagation model (C with MPI), channel model construction and filtering (Matlab), and an optimizer (FORTRAN). (right) A slice of an aircraft configuration design space through three design points.


Scenario 2: A team of aircraft design engineers and numerical analysts are attempting to minimize the take-off gross weight (TOGW) for an aircraft configuration design involving 29 design variables with 68 constraints [42] (see Fig. 1, right). High-fidelity codes dealing with aerodynamics, mechanics, and geometry determine how changes in design variables affect the TOGW. This application domain is characterized not by an abundance of data, but rather by a scarcity of data (owing to the cost and time involved in conducting simulations). Consequently, the solution methodology involves a combination of high-accuracy computations, surrogate modeling (to provide response surface approximations for unsampled design regions [56]), and a robust data management system to help focus data collection on the most promising regions. As a result, evaluating a design point might involve executing a high-fidelity computation, using low-fidelity approximations to obtain an estimate of the TOGW, and/or querying a database to look up previously conducted simulations. In addition, the resources for computing and data storage could be geographically distributed. How can a single environment provide unified access to such diverse facilities, and what programming abstractions are available that allow its efficient and effective use?

Scenario 3: A group of computer scientists, nuclear physicists, and performance engineers are modeling Sweep3D [57, 59], a complex ASCI benchmark for discrete ordinates neutron transport. They concur that efficient modeling of this application (at the desired accuracy level) requires analytical modeling, simulation, and actual system execution paradigms simultaneously [2]! They use a metasystem infrastructure to combine these various models in a unified manner. However, they are undecided over when to switch codes during the computation — do they use a low-level simulator for some fraction of the available time and then switch to analytic models, or can they be confident of extrapolating using analytical models even earlier? What if excessive multi-threading on a machine leads to too many fluctuations in their estimates? What system architectures are available that enable compositional modeling when information about component choices is obtained during the computation (rather than before)?


1.2 Multidisciplinary Grid Communities: Themes
The dominant theme in these scenarios, and one that is well accepted as an integral aspect of grid computing, is the ability to do compositional modeling [14, 54]. In the context of problem-solving, Forbus [30] defines this term as 'combining representations for different parts of a [computation] to create a representation of the [computation] as a whole.' In this paper, we employ this term to convey merely an approach to problem-solving; its use is not meant to imply an implementation technology, such as distributed object components (although that is one of the common ways of providing the functionality). For instance, a scientist explicitly moving input and output files across multiple program executables can be viewed as performing compositional modeling (albeit in a very primitive manner). Thus, a component could be any piece of software, executable, model fragment, or even a set of equations that helps the scientist to formalize the process of modeling a computation.

A second aspect (again, one whose assertion will hardly be controversial) is collaboration. By definition, a GCE for a grid community must support groups of scientists and engineers, not just single investigators. These users rely on each other's codes and data, contribute results to the total effort, communicate in a variety of ways, and organize themselves around subproblems in ways that are hard to predict. They may need to collaborate in real time on a given simulation, but they are often at physically separate locations. Collaborative workspaces are fundamental to the way multidisciplinary research is conducted.

While the above two aspects are underscored in many grid projects, GCEs for multidisciplinary communities have a unique responsibility (and opportunity) to exploit the larger context of the scientific or engineering application, defined by the ongoing activities of the pertinent community. Typical GCEs only deal with one simulation at a time. The larger context we allude to here may include previous scientific results, which can be used to improve the efficiency of current simulations or avoid computation altogether if a desired result is already available. The context may denote the fact that a simulation is being run as part of a higher-level problem-solving strategy, e.g., involving optimization or recommendation. Context also implies previous computational experience or performance, e.g., grid resources may be assigned more intelligently if the performance of previous similar simulations is known. A final example of context is the fact that a given simulation is often part of an ensemble of simulations; recognizing this aspect can help in creating more sophisticated simulation management tools.

As we will show below, the synergy resulting from consideration of all of the above three aspects (compositional modeling, collaboration, context) poses a unique set of research issues pertinent for multidisciplinary communities. An important goal of our approach is to maximize the synergy between grid computing on the one hand and multidisciplinary scientific problem-solving on the other. Thus, we are trying to answer two questions: (i) 'How can a multidisciplinary community setting be exploited to better use a grid?'; and (ii) 'How can the grid setting be exploited to better serve a scientific problem-solving community?'

1.3 GCEs for Multidisciplinary Grid Communities: Characteristics
Abstracting from the scenarios described above, and reflecting on the three themes just discussed, where do we locate multidisciplinary grid communities in the ‘space’ of computational grid users? To answer that question, we find that the following three dimensions are useful in characterizing GCEs for multidisciplinary communities. These dimensions should not be viewed as a one-to-one translation of the above themes into features; rather, they are the most pertinent forms of distinctions that will help us identify requirements for GCEs for multidisciplinary communities.


Emphasis on component coding effort versus component composition effort
Traditional programming environments emphasize either the coding of components (influenced by an implicit composition style) or the aspect of connecting them together (to prototype complex computations). For instance, when coding effort is paramount and composition is implemented in a distributed objects system (e.g., [22, 39]), techniques such as inheritance and templates can be used to create new components. Other implementations involving parallel programming [12, 19, 31] or multi-agent coordination [25, 26] provide comparable facilities (typically APIs) for creating new components. Component composition effort, on the other hand, emphasizes the modeling of a computation as a process of, say, graphically laying out a network of components (e.g., [52]). By providing a sufficiently rich vocabulary and database of primitive components, emphasis is shifted to composition rather than coding. Design decisions made about component implementation and composition style indirectly influence the options available for composition and coding, respectively. This dimension distinguishes programming environments based on how they 'carve up' compositional modeling: which of these efforts do they emphasize more, and what forms of restrictions and assumptions does each place on the other? In a multidisciplinary setting (e.g., Scenario 1), programming environments are required to emphasize both efforts with almost equal importance. The needs of the underlying application (in this example, wireless communications) render typical assumptions on both coding and composition style untenable.


Cognitive discordance among components
An indirect consequence of typical compositional modeling solutions is that they commit the scientist to an implementation (and representation) vocabulary. For example, components in LSA [39] (and most object-based implementations) are required to be high-performance C++ objects, instantiated from class definitions. This is not a serious constraint for typical grid communities since there is usually substantial agreement over the methodology of computation. The only sources of discordance here involve format conversions and adherence to standards (e.g., matrices in CSR format versus matrices in CSC format). In multidisciplinary grid communities (see Scenarios 1 and 2), there are huge differences in vocabulary (e.g., biologists, civil engineers, and economists using a watershed assessment PSE have almost no common terminology) and fundamental misunderstandings and disagreements about the way computations should be organized and modeled (e.g., aerodynamicists, control engineers, and structural engineers model an aircraft in qualitatively different ways). Furthermore, composition in such a setting typically involves multiple legacy codes in native languages, and requires the ability to adjust to changing data formats and data sources (e.g., user-supplied, accessed through grid information services, or streamed from another module). Cognitive discordance is a serious issue here, one that is impossible to address by committing to a standard vocabulary for implementing components. Such messiness should be viewed not as a limiting bottleneck, but as a fundamental aspect of how multidisciplinary research is conducted.


Sophistication of simulation management
Traditional GCEs make a simple-minded distinction between the representation of a component and its implementation, suitable for execution on the grid. Representation is usually intended to imply naming conventions and association of features (e.g., "is it gcc-2.0.8 compliant?") to help in execution. Once again, this has not proved a serious constraint since grid services have traditionally focused more on executing computations (single runs) and less on high-level problem solving. The sophistication of simulation management is directly related to the representational adequacy of components in the GCE. For situations such as those described in Scenarios 2 and 3, the scientist would like to say "Conduct the same simulation as done on Friday, but update the response surface modeling to use the new numbers collected by Mark." Or perhaps, "Collect data from parameterized sweeps of all performance models of the Sweep3D code where the MPI simulation fragment occupies no more than a given fraction of the total time." Simulation management can be viewed as a facility both for high-level specification of runs and for seamlessly mixing computations and retrievals from a database of previously conducted simulations (see Scenario 2). This implies that data management facilities should be provided not as a separate layer of service, but as a fundamental mode by which the simulation environment in a GCE can be managed.

The recent NSF-ITR-funded GriPhyN project [7] and the Sequoia data management system [74], both multidisciplinary endeavors, are motivated by similar goals. Simulation management also serves as a way of documenting the 'history' of computational runs and experiments. For example, in conducting parameterized sweeps [17], knowing that certain particular choices have been executed elsewhere on the grid allows flexibility in load balancing and farming out computations to distributed resources.

Figure 2: Layers of functionality needed to support multidisciplinary grid communities: a user interface (PSEs, portals, etc.) sits above model definition, parameter definition, and simulation definition layers, which in turn rest on grid services and the computational grid.

1.4 GCEs for Multidisciplinary Grid Communities: A High-Level Architecture
Finally, by way of introduction, we present a high-level architecture or design framework for organizing and building GCEs for multidisciplinary grid communities (see Fig. 2). We believe that programming capabilities improve by recognizing modeling assumptions and explicitly factoring them out in a system design architecture. Fig. 2 does not describe an architecture in the full sense of the word, e.g., with precisely defined interfaces between layers. However, it does separate out the various functions or modes that must be represented in a powerful and effective multidisciplinary community GCE. The functional framework of the Grid summarized in Fig. 2 is complementary to ones that are based on protocol layering (see [35]) and commodity computing (see [37]).

Model: A model is a directed graph of specific executable pieces defining the control-flow and data-flow in a computation, e.g., the digraph in Fig. 1 (left). We distinguish between a model and its representation in a GCE; the representation might involve just the model's name or it might involve opening up the boxes (nodes in the digraph) and representing them in a more sophisticated fashion. Although models consist of ready-to-run pieces of code, these pieces may be parameterized.

Model Instance: A model instance is a model with all parameters specified. Note that some of these parameters may not be specified until runtime. Thus, while there might not exist a static conversion from models to model instances, the distinction between model instances and models is still useful. For example, using two different input data sets with the same model corresponds to two different model instances, and a parameter sweep tool can be used to generate such model instances.


Simulation: A simulation is a model instance assigned to and run on a particular computational resource on the grid. It is useful to distinguish between a model instance and a simulation because, for example, a single model instance can be run (and re-run) many times using different computational resources or different random number sequences; each of these would be a new simulation by our conventions.

Given these definitions, the framework summarized in Fig. 2 can be used to organize the various functions which should be supported in a GCE for a typical multidisciplinary grid community. The model definition layer is where users who need to create or modify models find tools to support this activity. Users who simply use existing models require only trivial support from this layer. In the parameter definition layer we locate those activities that associate model instances with models. Examples include tools that generate parameter sweeps [17] or other types of model instance ensembles, as well as the use of problem-oriented scripting languages to generate multiple model instances. (Note that we are using 'parameter' in a very broad sense here, making no specific assumptions about exactly how these parameters are defined or what they include.) Another activity that is naturally found at the parameter definition level is a 'database query' mode, in which results from previous simulations are accessed, perhaps instead of doing new computations. The next layer, simulation definition, is where a model instance (or set of model instances) is assigned to grid resources. In the simplest case, a user simply chooses some subset of available grid resources to which the model instance should be mapped. More interesting, however, are the possibilities for simulation-management tools which take a set of model instances and assign them to the grid, perhaps with sophisticated load balancing strategies or leveraging performance summaries from previous simulations. The lowest two levels appearing in Fig. 2, grid services and computational grid, correspond to the software and hardware resources (e.g., Globus, networks, machines) that make computational grids possible. As mentioned earlier, our emphasis has been on high-level, application-specific issues. We omit further discussion of the architecture, protocols, and services being developed elsewhere for these levels (e.g., see [35]).

Note that not all services or activities fit neatly into the categories shown. For example, in computational steering [52], model parameters may be modified and computational resources re-assigned at runtime, so the parameter and simulation definition services are interleaved with execution in this setting. Other important aspects of an effective GCE are not explicitly represented in Fig. 2. For example, support for collaboration is implicit throughout. However, this high-level view of required layers of functionality helps organize and orthogonalize our efforts. In keeping with the typical end-to-end design philosophy of the Grid [34], we have attempted to provide support for these new services as layers of abstraction over traditional low-level grid scheduling and resource management facilities. In addition, our resulting high-level architecture 'teases out' typically blurred layers into distinct levels at which various services can be provided.
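To make these distinctions concrete, the sketch below (Python, with hypothetical names; the paper does not prescribe any particular implementation) treats a model as a digraph of executable components, a model instance as a model with bound parameters, and a simulation as a model instance assigned to a grid resource; a simple parameter sweep then generates model instances from one model.

from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical sketch of the model / model instance / simulation layers.
@dataclass
class Model:
    name: str
    components: List[str]                                        # nodes: ready-to-run pieces
    edges: List[Tuple[str, str]] = field(default_factory=list)   # data/control flow between them

@dataclass
class ModelInstance:
    model: Model
    parameters: Dict[str, float]          # all parameters bound (possibly only at runtime)

@dataclass
class Simulation:
    instance: ModelInstance
    resource: str                         # grid resource the instance is assigned to

def parameter_sweep(model: Model, sweep: Dict[str, List[float]]) -> List[ModelInstance]:
    """Generate one model instance per point in the Cartesian product of parameter values."""
    names = list(sweep)
    return [ModelInstance(model, dict(zip(names, values)))
            for values in product(*(sweep[n] for n in names))]

# Example: two instances of a (hypothetical) propagation model, each mapped to a resource.
propagation = Model("propagation", ["ray_tracer", "channel_model", "optimizer"],
                    [("ray_tracer", "channel_model"), ("channel_model", "optimizer")])
instances = parameter_sweep(propagation, {"tx_power_dbm": [30.0, 33.0]})
sims = [Simulation(inst, resource=f"cluster-node-{i}") for i, inst in enumerate(instances)]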
Three of our specific contributions to this architecture are (i) a lightweight data management system that supports compositional modeling (at the model definition level), helps view experiment evaluation as querying (at the parameter definition level), and provides bindings and semistructured representations (for all levels); (ii) a collaborative component composition workspace (Sieve) for model definition; and (iii) a framework for distributed resource control (Symphony) that provides core support for parameter and simulation definition and management. We describe these efforts in more detail in Section 3.

2 Motivating Applications
This section briefly describes five PSEs that are variously situated along the grid community characteristic axes (see Section 1.3). These examples highlight the diversity of multidisciplinary communities that a unifying GCE architecture must support.


Figure 3: (left) Input interface to the CMA model in the WBCSim PSE [43]. (right) Wireframe model of a wood-based composite showing failed layers (gray) and active layers (black), and the orientation of fibers in each layer. In this figure, the second layer has failed.

2.1 WBCSim
WBCSim is a prototype PSE that is intended to increase the productivity of wood scientists conducting research on wood-based composite materials, by making legacy file-based FORTRAN programs, which solve scientific problems in the wood-based composites domain, widely accessible and easy to use. WBCSim currently provides Internet access to command-line driven simulations developed by the Wood-Based Composites (WBC) Program at Virginia Tech. WBCSim leverages the accessibility of the Web to make the legacy-code simulations available to scientists and engineers away from their laboratories. WBCSim integrates simulation codes with a graphical front end, an optimization tool, and a visualization tool. The system converts output from the simulations to the Virtual Reality Modeling Language (VRML) for visualizing simulation results. WBCSim has two design objectives: (1) to increase the productivity of the WBC research group by improving their software environment, and (2) to serve as a prototype for the design, construction, and evaluation of larger scale PSEs. The simulation codes used as test cases are written in FORTRAN 77 and have limited user interaction. All the data communication is done with specially formatted files, which makes the codes difficult to use. WBCSim hides all this behind a server and allows users to supply the input data graphically, execute the simulation remotely, and view the results in both textual and graphical formats.

WBCSim contains four simulation models of interest to scientists studying wood-based composite materials manufacturing — rotary dryer simulation (RDS), radio-frequency pressing (RFP), composite material analysis (CMA), and particle mat formation (MAT). The rotary dryer simulation model was developed as a tool to assist in the design of drying systems for wood particles, such as those used in the manufacture of particleboard and strandboard products. The rotary dryer is used in about 90 percent of these processes. The radio-frequency pressing model was developed to simulate the consolidation of wood veneer into a laminated composite, where the energy needed to cure the adhesive is supplied by a high-frequency electric field. The composite material analysis model was developed to assess the strength properties of laminated fiber-reinforced materials, such as plywood. The mat formation model is used to calculate material properties of wood composites, modeling the mat formation process as wood flakes are deposited and then compressed into a mat. This model is crucial for all other manufacturing process models, as they require material properties as input.

The software architecture for WBCSim is three-tiered: (i) the legacy simulations and various visualization and optimization tools, perhaps running on remote computers; (ii) the user interface; and (iii) the middleware that coordinates requests from the user to the legacy simulations and tools, and the resulting output. These three tiers are referred to as the developer layer, the client layer, and the server layer, respectively. The developer layer consists primarily of the legacy codes on which WBCSim is based. The server layer expects a program in the developer layer to communicate its data (input and output) in a certain format. Thus, legacy programs are 'wrapped' with custom Perl scripts, and each legacy program must have its own wrapper. The client layer consists of Java applets and is responsible for the user interface (see Fig. 3, left). It also handles communication with the server layer, is the only layer that is visible to end-users, and typically will be the only layer running on the user's local machine.
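The actual WBCSim wrappers are custom Perl scripts, one per legacy code, as described above. Purely as an illustration of the wrapping pattern (the file names, input format, and executable below are hypothetical, not WBCSim's), a minimal wrapper could look like this:

import subprocess
from pathlib import Path

def run_legacy_simulation(params: dict, workdir: Path) -> str:
    """Hypothetical wrapper: write the specially formatted input file a legacy
    FORTRAN code expects, run it, and return its raw text output."""
    workdir.mkdir(parents=True, exist_ok=True)
    infile = workdir / "sim.in"
    # The legacy code reads fixed-width key/value lines; this format is illustrative only.
    infile.write_text("".join(f"{k:<20}{v}\n" for k, v in params.items()))
    result = subprocess.run(["./rds_legacy.exe", str(infile)],   # hypothetical executable name
                            cwd=workdir, capture_output=True, text=True, check=True)
    (workdir / "sim.out").write_text(result.stdout)              # keep raw output for later viewing
    return result.stdout

# Example use (hypothetical parameter names):
# output = run_legacy_simulation({"inlet_temp_C": 200, "particle_moisture": 0.4}, Path("runs/rds_001"))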


The server layer is the core of WBCSim as a system distinct from its legacy code simulations and associated data viewers. The server layer is responsible for managing execution of the simulations and for communicating with the user interface contained in the client layer. WBCSim applications require sophisticated management of the execution environment; the server layer, written in Java, directs execution of multiple simulations, accepts multiple requests from clients concurrently, and captures and processes messages that signify major milestones in the execution (such as the computation of an intermediate value). Graphical results from the simulations are communicated to the clients using an HTTP server (see Fig. 3, right).

Figure 4: (left) VizCraft design view window showing aircraft geometry and cross sections [42]. (right) Visualizing 156 aircraft design points in 29 dimensions with a careful assignment of variables to color drivers reveals an interesting association.

2.2 VizCraft
VizCraft [42] is a PSE that aids aircraft designers during the conceptual design stage. At this stage, an aircraft design is defined by a vector of 10 to 30 parameters. The goal is to find a vector that minimizes a performance-based objective function while meeting a series of constraints. VizCraft integrates simulation codes to evaluate a design with visualizations for analyzing a design individually or in contrast to other designs. VizCraft allows the designer to easily switch between the view of a design in the form of a parameter set, and a visualization of the corresponding aircraft geometry. The user can easily see which, if any, constraints are violated. VizCraft also allows the user to view a database of designs using the parallel coordinates visualization technique.

VizCraft is a design tool for the conceptual phase of aircraft design whose goal is to provide an environment in which visualization and computation are combined. The designer is encouraged to think in terms of the overall task of solving a problem, not simply using the visualization to view the results of the computation. VizCraft provides a menu-driven graphical user interface to the high-speed civil transport (HSCT) design code that uses 29 variables and 68 realistic constraints. This code is a large (million-line) collection of C and FORTRAN routines that calculate the aircraft geometry in 3-D, the design constraint values, and the take-off gross weight (TOGW) value, among other things. VizCraft displays the HSCT planform (a top view), cross sections of the airfoil at the root, leading edge break, and tip of the wing, and color-coded (red, yellow, green) constraint violation information. To help manage the large number of constraints, they are grouped conceptually as aerodynamic, geometric, and performance constraints. Design points, and their corresponding TOGW, are displayed via active parallel coordinates. The parallel coordinates are also color coded, and they can be individually scaled, reordered, brushed, zoomed, and colored. A parallel coordinate display for the constraints can be similarly manipulated. While the integration of the legacy multidisciplinary HSCT code into a PSE is nontrivial, the strength and uniqueness of VizCraft lie in its support for visualization of high-dimensional data (see Fig. 4).


Figure 5: Front-end decision maker interface to the L2W PSE [70], depicting landuse segmentation of the Upper Roanoke River Watershed in Southwest Virginia.

2.3 L2W
Landscapes to Waterscapes (L2W) is a PSE for landuse change analysis and watershed management. L2W organizes and unifies the diverse collection of software typically associated with ecosystem models (hydrological, economic, and biological), providing a web-based interface for potential watershed managers and other users to explore meaningful alternative land development and management scenarios and view their hydrological, ecological, and economic impacts. Watershed management is a broad concept entailing the plans, policies, and activities used to control water and related resources and processes in a given watershed. The fundamental drivers of change are modifications to landuse and settlement patterns, which affect surface and ground waterflows, water quality, wildlife habitat, and the economic value of the land and infrastructure (directly due to the change itself, such as building a housing development, and indirectly due to the effects of the change, such as increased flooding), and which cause economic effects on municipalities (taxes raised versus services provided). The ambitious goal of L2W is to model the effects of landuse and settlement changes by, at a minimum, integrating codes/procedures related to surface and subsurface hydrology, economics, and biology.

The development of L2W raises issues far beyond the technical software details, since the cognitive discordance between computer scientists (developing the PSE), civil engineers (surface and subsurface hydrology), economists (land value, taxes, public services), and biologists (water quality, wildlife habitat, species survival) is enormous. The disparity between scientific paradigms in a multidisciplinary engineering design project involving, say, fluid dynamicists, structural engineers, and control engineers is not nearly as significant as that between computer scientists, civil engineers, economists, and biologists. A further compounding factor is that L2W should also be usable by governmental planners and public officials, yet another different set of users.

The architecture of the L2W PSE is based on leveraging existing software tools for hydrology, economic, and biological models into one integrated system. Geographic information system (GIS) data and techniques merge both the hydrologic and economic models with an intuitive web-based user interface. Incorporation of the GIS techniques into the PSE produces a more realistic, site-specific application where a user can create a landuse change scenario based on local spatial characteristics (see Fig. 5). Another advantage of using a GIS with the PSE is that the GIS can obtain necessary parameters for hydrologic and other modeling processes through analysis of terrain, land cover, and other features.


Of all the PSEs described here, L2W is unique in that it is centered around a GIS. Currently, L2W integrates surface hydrology codes and economic models for assessing the effect of introducing settlement patterns. Wildlife and fisheries biologists were involved in the L2W project, but their data and models are not fully integrated as of this writing. The biological models include the effect of development on riparian vegetation, water quality, and fish and wildlife species.

Figure 6: (left) Example outdoor environment for designing a wireless communications system in the S4W PSE [51]. (right) Propagation coverage prediction around the region of interest in the environment.

2.4 S4W
S4W ('Site-Specific System Simulator for Wireless Communications') is a collaborative PSE for the design and analysis of wideband wireless communications systems. In contrast to the projects described above, the S4W project is occurring in parallel with the development of high-fidelity propagation and channel models; this poses a unique set of requirements for software system design and implementation (ref. Scenario 1 in the introduction) [78]. S4W has the ability to import a 3-dimensional database representing a specific site (see Fig. 6, left), and permits a wide range of radio propagation models to be used for practical communications scenarios [51]. For example, in a commercial wireless deployment, there is a need to budget resources, such as radio channel assignments and the number of transmitters. S4W allows wireless engineers to automatically drive the simulation models to maximize coverage or capacity, or to minimize cost. Furthermore, unlike existing tools, S4W permits the user to import measured radio data from the field, and to use these data to improve the models used in the simulation. A knowledge-based recommender system [65] provides improved modeling capability as the software corrects the environment model and the parameters in the propagation model, based on measured data. Finally, the ability to optimize the location of particular wireless portals in an arbitrary environment is a fundamental breakthrough for wireless deployment, and S4W has the ability to perform optimization based on a criterion of coverage, QoS, or cost (see Fig. 6, right).

While primitive software tools exist for cellular and PCS system design, none of these tools include models adequate to simulate broadband wireless systems, nor do they model the multipath effects due to buildings and other man-made objects. Furthermore, currently available tools do not adequately allow the inclusion of new models into the system, visualization of results produced by the models, integration of optimization loops around the models, validation of models by comparison with field measurements, or management of the results produced by a large series of experiments. One of the major contributions of S4W is a lightweight data management subsystem [78] that supports the experiment definition, data acquisition, data analysis, and inference processes in wireless system design. In particular, this facility helps manage the execution environment, binds representations to appropriate implementations in a scientific computing language, and aids in reasoning about models and model instances.

Supported by a $1M grant from the NSF Next Generation Software program, S4W is designed to enhance three different kinds of performance — software, product, and designer. Superior software performance is addressed in this project by (i) developing fundamentally better wireless communication models, (ii) constructing better simulation systems composed from the component wireless models via the recommender, and (iii) the transparent use of parallel high-performance computing hardware via the composition environment's access to distributed resources. Superior product performance (the actual deployed wireless systems) is addressed by using optimization to design optimal rather than merely feasible systems. Superior designer performance is directly addressed by the synergy resulting from the integrated PSE, whose purpose is to improve designer performance and productivity.

Figure 7: An example microarray design in Expresso [6] to study gene expression in Loblolly pine clones. (left) The microarray is printed in four sub-quadrants, one of which is shown here. Figure courtesy of Y.-H. Sun (NCSU). (right) Expresso output (log calibrated ratio versus clone ID) depicting 265 clones (out of a total of 768) that responded to three cycles of mild drought stress.

2.5 Expresso
The Expresso project [6] addresses the entire lifecycle of microarray bioinformatics, an area where ‘computing tools coupled with sophisticated engineering devices [can] facilitate discovery in specialized areas [such as genetics, environment, and drug design]’ [45]. Microarrays (sometimes referred to as DNA chips) are a relatively new technique in bioinformatics, inspired by miniaturization trends in micro-electronics. Microarray technology is an experimental approach to study all the genes in a given organism simultaneously; it has rapidly emerged as a major tool of investigation in experimental biology. The basic idea is to ‘print’ DNA templates (targets), for all available genes that can be expressed in a given organism, onto a high-density 2D array in a very small area on a solid surface. The goal then is to determine the genes that are expressed when cells are exposed to experimental conditions, such as drought, stress, or toxic chemicals. To accomplish this, RNA molecules (probes) are extracted from the exposed cells and ‘transcribed’ to form complementary DNA (cDNA) molecules. These molecules are then allowed to bind (hybridize) with the targets on the microarray and will adhere only with the locations on the array corresponding to their DNA templates. Typically such cDNA molecules are tagged with fluorescent dyes, so the expression pattern can be readily visualized as an image. Intensity differences in spots will then correspond to differences in expression levels for particular genes. Using this approach, one can ‘measure transcripts from thousands of genes in a single afternoon’ [45]. Microarrays thus constitute an approach of great economic and scientific importance, one whose methodologies are continually evolving to achieve higher value and to fit new uses.


The Expresso PSE [6] is designed to support all microarray activities including experiment design, data acquisition, image processing, statistical analysis, and data mining. Expresso’s design incorporates models of biophysical and biochemical processes (to drive experiment management). Sophisticated codes from robotics, physical chemistry, and molecular biology are ‘pushed’ deeper into the computational pipeline. Once designs for experiments are configured, Expresso continually adapts the various stages of a microarray experiment, monitoring their progress, and using runtime information to make recommendations about the continued execution of various stages. Currently, prototypes of the latter three stages of image processing, statistical analysis, and data mining are completely automated and integrated within our implementation. Expresso’s design underscores the importance of modeling both physical and computational flows through a pipeline to aid in biological model refinement and hypothesis generation. It provides for a constantly changing scenario (in terms of data, schema, and the nature of experiments conducted). The ability to provide expressive and high performance access to objects and streams (for experiment management) with minimal overhead (in terms of traditional database functionality such as transaction processing and integrity maintenance) [44] is thus paramount in Expresso. The design, analysis, and data mining activities in microarray analysis are strongly interactive and iterative. Expresso thus utilizes a lightweight data model to intelligently ‘close the loop’ and address both experiment design and data analysis. The system organizes a database of problem instances and simulations dynamically, and uses data mining to better focus future experimental runs based on results from similar situations. Expresso also uses inductive logic programming (ILP), a relational data mining technique, to model interactions among genes and to evaluate and refine hypothesized gene regulatory networks. One complete instance of the many stages in Expresso has been utilized to study gene expression patterns in Loblolly pine [46], in a joint project with the Forest Biotechnology group of North Carolina State University.

3 Systems Support for Multidisciplinary GCE Applications
This section describes several core systems support technologies useful for developing GCEs (and currently employed in the applications outlined so far). Many of these tools and frameworks rely on the notion of representations of components; we begin by motivating this idea.

3.1 Representations in a GCE
One of the main research issues in GCEs is modeling the fundamental processes by which knowledge about scientific models is created, validated, and communicated. As mentioned in Section 1 and illustrated in the many example systems of Section 2, the expressiveness with which a scientist can interact with a GCE is directly related to the adequacy of representation provided by the system. While it is true that there is no universal representation that is ideal for all purposes, traditional approaches employed in grid projects (for representing models, model instances, and simulations) are very restrictive. Recall that we defined a model to denote a directed graph of specific computational codes or executables. The notion of the 'representation of a model' is open to many interpretations and intensely debated in the modeling literature (see for instance [29]); we will not attempt to settle this debate here. Instead, we adopt an operational definition for the representation of a model, namely that it is an abstraction of the model that permits useful problem-solving capabilities that would not be possible with the model alone. The abstraction could refer to the functional behavior of the model (e.g., a signature), the structural constituents of the model (e.g., a digraph of model fragments), a profile of its performance (to aid in design and analysis), its relationships to other models, and/or information about how it fits within the larger computational context of the GCE and the activities conducted within it.


Consider two extremes of representing a single computational component (the simplest model) in a GCE. A black-box representation would be one where just the name of the component serves as its representation. Such a solution will definitely aid in referring to the component over the grid (e.g., 'run the XYZ software'), but doesn't help in any further sophisticated reasoning about the component (e.g., 'is XYZ an iterative algorithm?'). At the other extreme, a white-box representation is one where the component itself serves as its representation (for example, mathematical descriptions of scientific phenomena). Usually in such cases, the representation is accompanied by a domain theory and algorithms for reasoning in that representation. For instance, the QSIM system [58] is a facility where once a component (e.g., one for the cooling of liquid in a cup [30]) is defined, it is possible to reason about the properties and performance of that component (e.g., when are the contents of the cup drinkable?). While extremely powerful, such designs work well only within restrictive (and sometimes artificial) applications. An intermediate solution is to annotate components with (feature, value) pairs describing attribute-based properties. For instance, annotations might involve directives, flags, and hints for compiling the code on a specific platform.

These issues are amplified when we consider the model to be a digraph of computational components. While many projects distinguish between models and representations, two main approaches can be distinguished here. In the first category, representations are motivated by the need to manage the execution environment (e.g., 'schedule this graph on the grid, taking care to ensure that data files are properly matched'). Examples here are the Linear System Analyzer (LSA) [40] and the Component Architecture Toolkit (CAT) [13] at Indiana University, the ZOO desktop experiment management environment at the University of Wisconsin [50], the Application Visualization System of Advanced Visual Systems, Inc. [76], and the SCIRun computational steering system at the University of Utah [52]. Projects in the second category are based on AI research in compositional modeling [28, 62, 69] and are motivated by formal methods for reasoning with (approximate and qualitative) representations. The modeling and performance requirements in a multidisciplinary GCE mean that both approaches are too restrictive.

With the advent of XML and the emergence of the Web as a large-scale semistructured data source, interest in semistructured representations has expanded into the GCE/PSE community. A plethora of markup languages, XML-based formats, and OO coding templates have been proposed for representing aspects of domain-specific scientific codes (e.g., SIDL [21]). In addition, a variety of formats have been proposed recently (e.g., SOX [38]) for defining metadata associated with various stages of computational experiments [38]. A major advantage of such solutions is that the ensuing representations can be cataloged and managed by database technology.
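As a concrete (and purely illustrative) picture of the intermediate option, a component record might carry (feature, value) annotations alongside its name; the particular features below are invented for the example, not a schema defined by the paper:

# Illustrative only: a component annotated with (feature, value) pairs, sitting between
# a black-box representation (name only) and a white-box one (a full mathematical model).
black_box = {"name": "XYZ"}

annotated = {
    "name": "XYZ",
    "features": {
        "algorithm_class": "iterative",      # supports queries like "is XYZ an iterative algorithm?"
        "language": "FORTRAN 77",
        "gcc_2.0.8_compliant": False,        # the kind of compilation hint mentioned in Section 1.3
        "input_format": "CSR matrix",
        "typical_runtime_minutes": 45,
    },
}

def is_iterative(component: dict) -> bool:
    # A black-box record silently fails this query; an annotated one can answer it.
    return component.get("features", {}).get("algorithm_class") == "iterative"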
Our goal is to investigate representations that (i) allow the binding of problem specifications to models (and model instances), without insisting on an implementation vocabulary (for the models); (ii) can help us to reason both about the models being executed as well as data produced from such simulations; and (iii) help design facilities such as change management, high-level specification of simulations, recommendation, optimization, and reasoning (about models and model instances).

3.2 BSML: A Binding Schema Markup Language
Akin to other GCE projects, our emphasis here will be on semistructured representations for models. However, we view markup languages such as XML as less of a data format, programming convention, or even a high-level abstraction of a programming environment. Rather, we view them as a vehicle to define bindings from representations to models in a GCE. Binding refers to the process of converting XML data to an appropriate encoding in a scientific computing language (the reverse process is fairly straightforward). There are several forms of bindings in a GCE — binding of values to language variables, converting an XML format to some native format that can be read directly by the model, and/or generating source code for a stub that contains embedded data and a call to the appropriate language function using these data as parameters. Notice that we do not make a distinction between invoking a component procedurally in a scientific computing language, generating code that invokes a component, or executing a program with command line arguments. All of these are bindings from one representation to various assumptions on the execution environment (which is presumably being handled by the existing computational setup). Our lack of any stringent assumptions on the computational codes or methods of invoking models is fundamental to multidisciplinary research. From the viewpoint of the GCE, a single representation can be stored that nevertheless allows all of these forms of binding to be performed.

<element name='pdp'>
  <sequence>
    <element name='rmsDelaySpread' type='double'/>
    <element name='meanExcessDelay' type='double'/>
    <element name='peakPower' type='double'/>
    <code component="optimizer">
      <bind>print "$peakPower\n"</bind>
    </code>
  </sequence>
  <repetition>
    <sequence>
      <element name='time' type='double'/>
      <element name='power' type='double'/>
      <code component="chtts1|chttm">
        <bind>print " $time $power\n"</bind>
      </code>
    </sequence>
  </repetition>
  <code component="chtts1|chttm">
    <begin>print "M = [\n"</begin>
    <end>print "];\n"</end>
  </code>
</element>

Figure 8: BSML descriptions for a class of XML documents pertaining to power delay profiles (PDPs) in the S4W PSE. Sample bindings for MATLAB are shown by the bind tags.

A full description of our BSML (Binding Schema Markup Language) is beyond the scope of this article (for more details, see [78]). We briefly mention that BSML associates user-specified blocks of code with user-specified blocks of an XML file. 'Blocks' can be primitive datatypes, sequences, selections, and/or repetitions. Intuitively, primitive datatypes denote single values, such as double precision numbers; sequences denote structures; selections denote multiple choices of conveying the same information; and repetitions denote lists. While not particularly expressive, this notation is meaningful to GCE component developers, simple and efficient to implement, and general enough to allow the building of more complex data representations. Consider, for example, representing a power delay profile (PDP) from the S4W project in XML. A PDP is a two-column table that describes the power received at a particular location during a specified time interval. Statistical aggregates derived from power delay profiles are used, for example, to optimize transmitter placement in S4W. We can use BSML to define bindings between PDPs and all applicable models in S4W. Applying a parser generated from such a BSML document (see Fig. 8 for an example) to a PDP will yield native code in the underlying execution environment (in this case, an executable Matlab script that contains embedded data). For a different native execution environment, a different binding could be defined (for the same data type). Hence, our representation is truly semistructured. Notice that we can rapidly prototype new model instances with this technique. Similarly, we can use the same BSML source to provide bindings for an optimizer (for locating base stations; see Scenario 1 in Section 1). The feedback will be a sequence of peak powers, one number per line. Some twenty-five lines of BSML source can therefore take care of data interchange problems for three components. Storing these PDPs in a database is also facilitated.


Figure 9: A facility for creating model instances from specifications represented in a Binding Schema Markup Language (BSML).

From a system point of view, the schemas are the metadata, and the software that translates and converts schemas is the parser generator. Figure 9 shows a typical configuration. Both the data and the metadata are stored in a database. A parser is lazily generated for each combination of a model's input port and the actual BSML schema whose data instance is connected to that port. Model descriptions can also be stored in the database. They consist of a model id, a description, the schemas of its input and output ports, various execution parameters, and annotations, such as relations to other models (see Section 3.5). We do not provide any tools for checking the consistency of the output data with the output schemas because, unlike in Web or business domains, this is rarely an issue. A GCE model's output schema is rigid and does not depend on the actual input schema.

The execution environment manager (see Fig. 9) glues the generated parsers to the components. For full-featured languages like FORTRAN, it will simply link the parser with the model's code. Prototyping for languages like Matlab requires more attention. The output of the parser for such languages is typically source code that needs to be supplied to the interpreter. The exact way the parsers are linked to a model is specified by the model's execution parameters. From this point, each model together with a set of generated parsers looks like a program that takes a number of XML streams as inputs and produces a number of XML streams as outputs. This is an appropriate representation for the management of the execution environment. Our goal is similar to those in [4, 27, 41] in that the common motivation is management of the execution environment; at the same time, our concern with high-level problem-solving (see the next three sections) shifts the emphasis from a unifying programming environment to one that allows a data-driven approach to managing large-scale multidisciplinary codes.

Finally, it is relatively straightforward to store any resulting data from such processes in a database system. If an RDBMS is used, we can reuse BSML to generate a number of SQL update statements in the same manner we used it to generate a Matlab script in Fig. 8. One of these 'models' will then connect to the database and execute these statements. This is no different from other format conversions happening in the GCE.
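To make the binding idea concrete, the following hand-written sketch (Python) mimics what a parser generated from the BSML schema of Fig. 8 might produce: it reads a PDP expressed in XML and emits an executable Matlab fragment for the channel models plus a peak-power line for the optimizer. The <sample> grouping, the numeric values, and the function names are illustrative assumptions, not the actual generated code.

import xml.etree.ElementTree as ET

PDP_XML = """<pdp>
  <rmsDelaySpread>1.2e-7</rmsDelaySpread>
  <meanExcessDelay>8.5e-8</meanExcessDelay>
  <peakPower>-63.4</peakPower>
  <sample><time>0.0e0</time><power>-63.4</power></sample>
  <sample><time>1.0e-8</time><power>-71.2</power></sample>
</pdp>"""

def bind_to_matlab(pdp_xml: str) -> str:
    """Emit a Matlab script embedding the PDP table, mirroring the chtts1/chttm binding of Fig. 8."""
    root = ET.fromstring(pdp_xml)
    rows = [f"  {s.findtext('time')} {s.findtext('power')}" for s in root.findall("sample")]
    return "M = [\n" + "\n".join(rows) + "\n];\n"

def bind_to_optimizer(pdp_xml: str) -> str:
    """Emit the single peak-power value consumed by the optimizer component."""
    return ET.fromstring(pdp_xml).findtext("peakPower") + "\n"

print(bind_to_matlab(PDP_XML))      # Matlab matrix literal with the embedded time/power rows
print(bind_to_optimizer(PDP_XML))   # one peak power per line, as described for the feedback loop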


3.3 Format Conversions and Change Management
One of the benefits of semistructured data representations is automatic format conversion. This feature is useful in the following situations: (i) a model changes over time, but data corresponding to older versions has already been recorded in the database system. An example from S W is the evolution of the space partitioning parameters in the ray tracer: after we realized that placing polygons at the internal nodes of the octree could improve space usage by an order of magnitude, more parameters were added to the space partitioning step; (ii) several components need essentially the same parameters but are not truly plug-and-play interchangeable, so minor massaging is necessary to make their I/O specifications match. We model the following changes: insertions, deletions, replacements, and unit conversions. Insertions and deletions correspond to additions and removals of parameters. For example, a moving channel builder takes the same inputs as a static one, plus the velocity of the receiver. Thus, any input to a moving channel builder can be converted to an input to a static one by projecting out the receiver’s velocity. Replacements represent changes in parameter representation, such as a conversion between spherical and rectangular coordinates. Unit conversions are a special case that is quite common and easily automated, such as converting between watts and decibel milliwatts; unit conversion can be performed by equation and constraint solvers [30]. In our XML representation, insertions are handled by requiring default values for new parameters, and deletions amount to removing the old values. Replacements and unit conversions require user-supplied or automatically generated conversion filters. The modeling literature abounds in such conversions, but it is important to realize that conversion facilities are ad hoc by nature and therefore work only for small changes in the schema. Typically, it is not necessary to find a globally optimal conversion sequence. A thorough treatment of change detection can be found in [20]. Change management can also be used to realize any problem-solving feature that involves transforming between semistructured representations. For example, consider the possibility that two students configure a GCE independently with different choices for various stages in a computational pipeline and arrive at contradictory results. They could then query the database with ‘What is different between the experiments that produced the data in directory A and those that produced the data in directory B?’ and obtain responses such as ‘The only difference is the calibration threshold used in B versus A,’ computed by automatically analyzing the XML descriptions [1]. Change detection and processing are crucial in several GCE projects, such as Expresso (see Section 2.5), where objects of interest change formats, stations, and schemas rapidly.
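The following is a minimal sketch of such conversion filters operating on parameter dictionaries extracted from the XML data: an insertion that supplies a default receiver velocity, a deletion that projects it out, and a dBm-to-watts unit conversion. The parameter names are hypothetical.

# A sketch of conversion filters; all parameter names here are illustrative.
def static_to_moving(params):
    # Insertion: add the new parameter with a default value.
    return {**params, "receiver_velocity": params.get("receiver_velocity", 0.0)}

def moving_to_static(params):
    # Deletion: project out the parameter the static channel builder does not take.
    return {k: v for k, v in params.items() if k != "receiver_velocity"}

def dbm_to_watts(p_dbm):
    # Unit conversion: decibel milliwatts to watts (30 dBm equals 1 W).
    return 10 ** ((p_dbm - 30.0) / 10.0)

params = {"tx_power_dbm": 30.0, "frequency_hz": 2.4e9}
print(static_to_moving(params))
print(dbm_to_watts(params["tx_power_dbm"]))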



3.4 Executing Simulations = Querying
Recall that we defined a simulation as a model instantiated with inputs, along with an assigned computational resource. This captures the notion of applying multiple models to multiple inputs to generate a database of simulation results and performance data. In this section, we describe how our semistructured representations of models and bindings can support even higher-level problem-solving facilities, in particular the ‘parameter sweep’ tool [17] and the ‘database query’ mode (found in the parameter definition layer of Fig. 2). The facilities described in this section (i) produce model instances and (ii) associate the data generated from simulations back with the corresponding model instances. We gloss over how the simulations corresponding to a model instance are actually executed, since that is addressed in Section 3.7; when we refer to executing a simulation, we imply that some assignment of computational resources to model instances has already been made by the simulation definition layer (see Fig. 2).

In the database paradigm, a model instance can be represented as a view. Executing the simulation corresponds to materializing the view, and the query behind the view is a join over models and data. To be meaningful, a simulation must further satisfy some syntactic and/or semantic constraints. Syntactic constraints ensure that the simulation can indeed be executed: each simulation run must be given enough data, and the data must conform to the appropriate schemas.
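As a deliberately simplified illustration of the view/join idea, the sketch below uses an in-memory SQLite database with hypothetical tables models and datasets; the view pairs each model with every dataset whose schema matches the model’s input port, and ‘executing’ a simulation would amount to materializing rows of this view. None of the table or column names are part of the actual GCE schema.

# A minimal sketch; table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE models   (model_id TEXT, input_schema TEXT);
CREATE TABLE datasets (data_id  TEXT, schema TEXT, payload TEXT);

-- A model instance is a row of this view: a join over models and data,
-- restricted by the syntactic constraint that schemas must match.
CREATE VIEW model_instances AS
SELECT m.model_id, d.data_id, d.payload
FROM models m JOIN datasets d ON d.schema = m.input_schema;
""")
conn.execute("INSERT INTO models VALUES ('channel_builder', 'pdp')")
conn.execute("INSERT INTO datasets VALUES ('run42', 'pdp', '<pdp>...</pdp>')")

for row in conn.execute("SELECT * FROM model_instances"):
    print(row)   # each row is a simulation to be scheduled and executed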


<experiment id='diff. prop.'>
WHERE <environment id='$id'>
        <meta><type>urban</type></meta>
      </environment> CONTENT_AS $env IN "envs"
CONSTRUCT <experiment id='diff. prop.: $id'>
            <model>...</model>
            <inputs>
              <input>$env</input>
              ...
            </inputs>
            <outputs>...</outputs>
          </experiment>
</experiment>

Figure 10: Constructing new XML data (to recalculate PDPs in S W) using the XML-QL query notation. WHERE ... CONSTRUCT is the format for expressing queries in this language. Notice that the query is parameterized by the $env variable, whose type is restricted to be urban.

Semantic constraints ensure that the models are meaningful in the specific problem domain; we describe them in the next section. In our framework, users can also impose custom constraints, such as ‘use only the datasets from last week.’ Specifying a model instance therefore maps naturally onto a database query. This mapping also supports iteration, aggregation, and composition operators with minimal implementation overhead: compositions can be achieved by relational joins, aggregation by user-defined VIEWs, and iteration by ‘index striding’ over domain-specific records. Consider the follo
