EDSS and Models-3 I/O:
REQUIREMENTS, DESIGN, AND IMPLEMENTATION

Carlie J. Coats, Jr., Ph.D.
Environmental Programs,
MCNC North Carolina Supercomputing Center
Copyright 1992-2001 MCNC

INTRODUCTION

EDSS is a system designed to support decision making and related activities (such as modeling) for both regulation and research concerning environmental issues. That means that EDSS is more than just a model: more, even, than a family of models together with a graphics and analysis package attached to them. Models-3 is the EPA's National Environmental Research Laboratory's interoperable air quality counterpart, which shares many models, parts, etc., with EDSS. Within both of these, however, lies a common thread: tools which can create and/or get to the data and distill it in ways useful to decision making, and do so in a timely fashion. Decisions need to be based upon facts -- upon environmental data which may be the result of observations or the result of modeling (whether air quality modeling, economic modeling, or some other kind). With the initial emphasis on air quality issues, EDSS will initially contain at least a family of air quality models, together with meteorology and emissions models to support them, impact models (economic, ecosystem, etc.) to assess their effects, and analysis tools to examine their results. Undistilled environmental data is so voluminous, however, that just "looking at" the numbers (or any substantial subset thereof) is not a profitable enterprise for the decision maker. As Hamming so aptly put it, "The purpose of computing is insight, not numbers."

The voluminous data associated with environmental issues leads us to two different but related subjects which need to be handled by different levels of the EDSS system:

data management; and
data access.

Data management is concerned with operations which affect data sets as a whole (indexing, archiving, and migration), while data access is concerned with the ways that tools access data extracted from data sets for their own use, whether for modeling or for analysis aimed at attaining insight. This paper is concerned with the second issue, data access; data management in a way which supports the needs of EDSS is the subject of a separate paper. Three issues come up in regard to data access:

How do programs refer to data sets?
What operations need to be performed, and therefore what subroutines perform these operations?
What data structures (and other common assumptions) are needed for the interfaces to these subroutines?

Specifying these three items cleanly and in a modular, re-usable fashion leads to a data access interface for programs, what current computer jargon calls an input/output applications programming interface, or I/O API. This document is concerned with describing the requirements, design, and implementation of the Models-3/EDSS I/O API. To some extent, this description will use the language of object-oriented programming, although this treatment is not thoroughgoing in that regard. (Some would say that we are taking an object-based rather than an object-oriented view.) In O-O terms, the combination of operations, or methods, to be performed on data with the structures for storing that data is said to form a class. From the point of view of a user of a class, what is important is what the operations do, rather than how they do it. This concept, called encapsulation, emphasizes the distinction between externally visible or public interfaces and internal or private implementations. The idea of using a generalized class as a foundation for more specific subclasses specialized for a particular purpose is called inheritance. It should be realized that the present implementations may not scale to the long term, since it is unclear at present what the shape of computing will be when it is dominated by massively parallel systems with many thousands of processors. The one thing which seems clear at present is that parallel systems researchers are still groping for the "right" way to exploit such parallelism. With sufficient forethought, however, the requirements analysis and interface design may survive even if the implementation does not.

I. REQUIREMENTS

A: Objective

The objective of the Models-3/EDSS I/O API is to provide a single generalized I/O structure class (with subclasses) with the data structures and data access operations to fulfill the needs of Models-3 and EDSS. Initially, this class needs to serve the needs of meteorology, emissions, and air quality models, preprocessors, postprocessors, analysis and visualization programs, and such other computational tools as are used within Models-3 and EDSS, for both regulatory and research applications.

B: Systems Requirements

The I/O API must be callable from at least FORTRAN and C. It must be stable over time and should be upward compatible with successive versions of EDSS models and framework, and capable of fulfilling the needs of Models-3 and EDSS for the foreseeable future. It must be compatible with at least the following platforms, operating systems, and other software systems:

POSIX
Cray / UNICOS
UNIX
TCP/IP/FTP
NFS and AFS

The data must be portable across machines; ideally, they should be transparently accessible across a heterogeneous distributed network using remote-mounted file systems such as NFS and AFS. Files must contain sufficient self-description that they can "stand alone." In particular, reading them should not depend upon the availability of additional external "grid description" or "file dictionary" software.

The I/O API must both be friendly for research modelers and have adequate integrity to serve for regulatory applications. In particular, it must maintain a "chain of custody" for the data it manipulates that is adequate to stand up in court. The Models-3 Coding Standards Document gives some of the coding standards needed to achieve this level of integrity.

C: Data Types Supported

For purposes of further analysis, we assume that data is organized into logical files which are assumed to be multiple-variable data sets having a common origin and common data structure type (common time-stepping assumptions, grid dimensions, etc.). On the other hand, emissions data shows that variables of different basic data types -- INTEGER, REAL, and DOUBLE PRECISION -- may be required in the same file.

Preliminary investigation suggested that at least 60 variables per file may be necessary (later experience has expanded this, so that the current limit is 120 for versions through 3.0, and 2048 for versions 3.1 and later), and that it is useful for files to be able to hold data for periods of at least a year in duration. Whether the logical files are implemented as physical files or as tables in a scientific database or by some other means is an implementation issue which may in fact change over time; what is important for analysis here is the nature of the interface to these logical files. A systems analysis of meteorology, emissions, and air quality modeling finds a number of particular types of data which must be supported. These particular types of data can be fitted into several generalized data types which need to be supported by the I/O API, in keeping with the object-based methodology we are employing. These particular data types are:

terrain-height data
land-use data
demographic data
meteorology observations:
- surface
- upper-air profiles
- satellite (this is a future item not currently in use)
- radar (also a future item)
air quality monitor observations
deposition monitor observations
emissions input data:
- point-source
- area-source
- mobile-source
- natural-source (biogenic, lightning strike)
processed (modeled) meteorology data
processed (modeled) emissions data
initial condition data
boundary condition data
geospatial transform (sparse) matrices for emissions and other modeling
air quality concentration and deposition data
intermediate-stage modeling data
- diagnostic
- other research
Added Nov. 2001: Additional types of data that it may be useful to support are the following (either of which may be time-independent or time-dependent):
- variables defined on geospatial coverages
- variables defined on finite element (unstructured) grids
Note that the usual GIS formats and access methods do NOT provide efficient access to time-stepped variables defined on geospatial coverages.

During the process of both input (of emissions) and analysis, we are frequently concerned with aggregated data, data which has been combined using such operations as averages, maxima, and minima applied to particular subsets of the data. The I/O API must support the results of such aggregations. Two important types of aggregation operations are:

temporally aggregated data (daily-maximum ozone concentrations or annual-total sulfate depositions, for example)
geographically aggregated data (such as state or county totals)

Analysis of the types of data arising and the operations applied to them shows that the system must support the following:

Read and write operations must do the status checking necessary to ensure whether they are successful, and maintain audit trails of all the relevant operations as they do so. In particular, failing and/or incomplete read and write operations must be flagged as unsuccessful, and the nature of this failure logged.
both time-independent data and time-stepped data, with time step granularity ranging from very small (~ 1 second) to very large (~ 1 year). Dates for the year 2000 and beyond must be supported correctly. Climate modeling requires support for dates 1970 and before. Source-attribution modeling may require writing data in other than chronological order.
multiple layers of multiple (at least 60 for full-chemistry aerosol modeling) variables per data set.
at least the following general data structure types:
- gridded (2-D and 3-D) data
- boundary-condition data
- ID-referenced data (this is a generalization which manages both geographically-scattered data such as most observational data, and geographically aggregated data such as state or county totals)
- vertical-profile rawinsonde meteorology data (different enough to be its own special case)
- nested-grid and multiple-grid data
- sparse-matrix data
- data with user-defined structure (for extensibility in case the analysis missed something; the system treats the data as a BLOB of one of the basic data types (REAL, DOUBLE, or INTEGER); the user imposes any additional structure he desires).
Added Nov. 2001: An additional datatype that we have prototyped, and that may be useful is that of geospatial-element cell complexes (GECC), which is a type efficiently supporting both (time-stepped and time-independent) geospatial coverages and finite element data, and modeled after the data structures used in the (pure) mathematical field of geometric topology.

All files must contain all the information necessary to access the data contained in them. This is important both for analysis, where it permits unified tools supporting a variety of files, and for sharing data with others: the only data needing to be transported to a colleague's system is the file itself, and not a collection of auxiliary files as with some current models. At least the following sorts of descriptive information are required:

file description (text about this type of file)
file type (gridded, grid-boundary, ID-referenced, profile, gridnest, sparse matrix, or custom)
time step (or 0 for time-independent)
starting date and time (relevant if time step nonzero)
number of variables, their names, units designations, and descriptions
update description (date and time of update; name of updating program; text concerning the computational model run which supplied the data in this file)
coordinate system type and specifications (e.g., "Lambert, with defining angles 30N, 60N, 90W, and center at 40N, 90W").
data structure dimensionality (depends upon file type) -- e.g., for gridded:
- number of columns
- number of rows
- number of levels
horizontal grid geometry, if relevant:
- location (X,Y) of the grid origin (SW corner)
- cell-size (DX, DY)
Coordinate system and horizontal grid geometry specifications must be sufficiently precise to support (the ill-conditioned arithmetic in) geospatial transforms for very-high-resolution (e.g., 10-meter) modeling. In particular, REAL*4 representation is not adequate.
Coordinate system and horizontal grid geometry specifications must be coded so as to support questions like "Is this grid a properly implemented nest into that one?"
vertical grid geometry, if relevant (if number of levels is greater than 1):
- type of vertical coordinate system:
  - hydrostatic sigma-P
  - nonhydrostatic sigma-P
  - sigma-Z
  - pressure
  - height above sea level
  - height above ground
  - other
- array of layer surfaces
- (for sigma-coordinates only:) the model-top

D: Functional Requirements

The I/O API must support the access needs of both the environmental models used to simulate situations of interest to decision makers and also the analysis and visualization tools used to distill insight from the model inputs and outputs. Of particular concern is the fact that Models-3 and EDSS will contain families of air quality models with interchangeable-part science modules which implement the simulation of the various relevant physical processes -- horizontal or vertical advection, convective mixing, deposition, chemistry, etc. -- at a variety of scales. Supporting both the model structure and the analysis tools suggests that the view of the data presented to the programmer by the I/O API should be selective random access in terms compatible with model usage (i.e., access by file, variable, layer, date and time, with possible further selection by grid location). The I/O API should automate routine activities such as the logging of I/O transactions to the extent feasible. Examination of the initial air quality model prototypes has added two further requirements: time interpolation to a particular date and time should be provided as an additional operation, and the I/O API should verify consistency between the file structure as requested by the caller and the file structure as recorded in the file itself, e.g., by means of an additional buffer-size argument. The desired operations are the following:

start up the system;
create a new file, according to a caller-supplied specification;
open an existing file, either for input-only or for input/output;
get the description of an (existing) file;
read data from a file, with at least the following variants:
- read the data for a specified date and time, variable, and layer (where "all" is a valid layer or variable specifier);
- read a subrectangle of the grid of gridded data for a specified variable and range of dates and times;
- time-interpolate gridded, boundary, or custom data for a specified variable to a specified date and time (time interpolation of ID-referenced data might not be well-defined if, for example, differing sets of sites occur at adjacent time steps);
write data for a specified variable, date and time to a file; and
shut the system down, flushing all data to disk.

E: Performance Requirements

In order to support both analysis and modeling, the system must support operations on multiple (at least 20) simultaneously open files. Model nesting and model intercomparison imply that the system must support simultaneous access to files for different domains (something not possible with the current (1990's) generation of some models). Access should be by meaningful name or meaningful value rather than by arbitrary index values (so that the caller asks for "O3" by name for example, rather than needing to know whether ozone is variable # 17, and requesting that). As used by calling programs, file names themselves should be "logical names" in the sense that they are properties of the program using them, do not depend upon particular physical file names in the file system, and permit simultaneous and independent execution of different instances of the same program on the same machine without interference with each other (so that different runs of the same air quality model might be executing simultaneously on the same machine, for example). Using only the globally-visible namespace provided by the file system makes this impossible -- or difficult, at best -- in many instances.

II. DESIGN

The design of the Models-3/EDSS I/O API is given here in terms of its externally visible properties, i.e., in terms of the conventions used, the public INCLUDE-file interfaces, and the function-call interfaces for the public routines in the I/O API. This section documents these externally visible properties from the FORTRAN programmer's point of view; the C programmer's point of view is documented in a separate section.

A: Conventions

There are a number of data structuring and manipulation conventions that are used consistently throughout the Models-3 and EDSS systems and that affect the I/O API. Among these are the representations of object names, grids, dates, times, and time-deltas. Object names are (blank-padded) FORTRAN CHARACTER strings of length at most 16. Case is significant.

Horizontal coordinate systems are named entities, with map projections taken from a short list of types: Lat-Lon, Lambert conformal, Mercator, and Stereographic. Because of the ill-conditioned nature of arithmetic relating to coordinate transformations, descriptive parameters which completely specify the coordinate systems are kept in 8-byte REALs. For all of these except Lat-Lon (for which the parameters are ignored), three parameters determine the map projection, and two additional parameters specify the coordinate-system origin relative to that projection.

Horizontal grids are named entities, for purposes of unambiguous identification. For many models, it suffices to deal with regular grids, which are completely characterized by the specification of a horizontal coordinate system and four additional parameters which specify the grid origin (lower-left corner) and the cell-size. Irregular grids are specified by grid-geometry files, which are gridded files specifying cell location and extent on a cell-by-cell basis.
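As an illustration of how the origin and cell-size parameters characterize a regular grid, the following is a minimal sketch in Python (used here purely for illustration; it is not part of the I/O API). The class and field names (`RegularGrid`, `xorig`, `yorig`, `xcell`, `ycell`) are hypothetical, and the grid dimensions are carried along for bounds checking. Cell indices are 1-based with rows counted from south to north, following the FORTRAN conventions used in this document; Python floats are 8-byte values, in keeping with the precision requirement.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularGrid:
    """Regular horizontal grid; all field names are hypothetical."""
    xorig: float  # X coordinate of the lower-left (SW) corner
    yorig: float  # Y coordinate of the lower-left (SW) corner
    xcell: float  # cell size in the X direction
    ycell: float  # cell size in the Y direction
    ncols: int    # number of columns (west to east)
    nrows: int    # number of rows (south to north)

    def cell_of(self, x, y):
        """Return the 1-based (col, row) of the cell containing (x, y),
        or None if the point lies outside the grid."""
        col = math.floor((x - self.xorig) / self.xcell) + 1
        row = math.floor((y - self.yorig) / self.ycell) + 1
        if 1 <= col <= self.ncols and 1 <= row <= self.nrows:
            return col, row
        return None

    def cell_center(self, col, row):
        """Return the (x, y) coordinates of the center of cell (col, row)."""
        return (self.xorig + (col - 0.5) * self.xcell,
                self.yorig + (row - 0.5) * self.ycell)
```

For example, with the origin at (-1000, -500) and 100-unit cells, the point (-1000, -500) falls in cell (1, 1), whose center is (-950, -450).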

Vertical grids are presumed to be irregularly-spaced and are characterized by the following:

vertical coordinate type, from a short list:
- hydrostatic sigma-P
- nonhydrostatic sigma-P
- sigma-Z
- Z (m above ground)
- H (m above sea level)
- eta
- specified by a geometry file
- other
value of model-top (sigma-coordinates only)
number NLEVS of levels
array VLEVEL( 0:NLEVS ) of values for the levels
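To make the role of the level array concrete, here is a minimal Python sketch (illustrative only, not part of the I/O API). It assumes that the NLEVS+1 entries of VLEVEL are the surfaces bounding NLEVS layers, so that layer L lies between VLEVEL(L-1) and VLEVEL(L); that reading, and the function name `layer_containing`, are assumptions rather than statements from the specification. The sketch works whether the values increase upward (heights) or decrease upward (sigma-P).

```python
def layer_containing(vlevel, value):
    """Return the 1-based index of the first layer whose bounding surfaces
    enclose `value`, or None if `value` lies outside the vertical grid.

    `vlevel` plays the role of VLEVEL(0:NLEVS): vlevel[L-1] and vlevel[L]
    are taken (by assumption) to bound layer L.
    """
    for layer in range(1, len(vlevel)):
        low, high = sorted((vlevel[layer - 1], vlevel[layer]))
        if low <= value <= high:
            return layer
    return None
```

With made-up sigma surfaces `[1.0, 0.98, 0.93, 0.8, 0.0]`, the value 0.95 falls in layer 2, counted from the bottom.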

Dates and times are stored as integers coding the Julian date and (24-hour) time using the formulas

    JDATE = 1000 * YEAR  +  DAY
    JTIME = 100 * (100 * HOUR  +  MINUTE)  +  SECOND
          = 10000 * HOUR  +  100 * MINUTE  +  SECOND

where the year is specified using all four digits, the day number is between 1 and 365 or 366 (depending upon leap year), hour is between 0 and 23, and minutes and seconds are between 0 and 59. For example, the date Feb. 2, 1993 is coded as the integer 1993033, and the time 3:46:53 PM as 154653. When finer-grained resolution is required, this two-integer representation is supplemented by a third component which is a REAL between 0.0 and 1.0 representing fractions of a second. This representation satisfies the granularity requirement of one-second resolution, gives exact and machine-independent calculation of record numbers within datasets, etc., correctly handles dates and times both before 1970 and after 2000, and is easy for modelers to interpret and manipulate within, e.g., a debugger. Time-deltas are stored using the same conventions as times, except that they may have arbitrarily large hours-fields, and may be either positive or negative (in the latter case, all three fields are negative or zero: -333 means a time step backwards by three minutes and thirty-three seconds). A variety of utility routines are available for manipulating dates, times, and time deltas; these routines handle arbitrary time deltas correctly.
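The encoding formulas above can be sketched as follows in Python (used purely for illustration; these are not the I/O API's own utility routines, and the function names are hypothetical). The sketch reproduces the two worked examples from the text and the sign convention for time-deltas.

```python
import datetime


def encode_jdate(year, month, day):
    """Return JDATE = 1000 * YEAR + DAY, where DAY is the day of the year."""
    day_of_year = datetime.date(year, month, day).timetuple().tm_yday
    return 1000 * year + day_of_year


def encode_jtime(hour, minute, second):
    """Return JTIME = 10000 * HOUR + 100 * MINUTE + SECOND."""
    return 10000 * hour + 100 * minute + second


def decode_jtime(jtime):
    """Split a time or signed time-delta into (hours, minutes, seconds).

    For a negative time-delta all three fields come back negative or zero.
    """
    sign = -1 if jtime < 0 else 1
    hours, rest = divmod(abs(jtime), 10000)
    minutes, seconds = divmod(rest, 100)
    return sign * hours, sign * minutes, sign * seconds


def delta_to_seconds(delta):
    """Convert a time-delta (possibly with a large hours field) to seconds."""
    hours, minutes, seconds = decode_jtime(delta)
    return 3600 * hours + 60 * minutes + seconds
```

Here `encode_jdate(1993, 2, 2)` gives 1993033 and `encode_jtime(15, 46, 53)` gives 154653, matching the examples above, while `delta_to_seconds(-333)` gives -213, i.e. three minutes and thirty-three seconds backwards.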

We recommend that the convention be adopted that all times are given in GMT; however, this policy is by no means required by the system.

B: Files -- Logical Names and Physical Names

Rather than forcing the programmer to deal with hard-coded file names or hard-coded unit numbers, the I/O API introduces the concept of logical file names. The modeler can define his or her own logical names, which then become properties of the program. Then at run-time the EDSS process manager (or the user who writes his own shell-scripts) uses the UNIX setenv command (or the VMS ASSIGN command) to connect the logical names to the physical file name of any "real" file desired. For programming purposes, the significant facts are that the names should not contain blanks (except as padding at the end: 'foo ' is OK; 'f oo' is not), and that when they are used in subroutine calls they are FORTRAN character strings at most 16 characters long.
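The lookup from logical name to physical file name can be sketched in Python as follows (an illustration only, not the I/O API's implementation; the function name `resolve_logical_name` is hypothetical). It applies the two naming rules just stated and then consults the process environment, which is where setenv places the binding. Returning None for an undefined name is a choice made for this sketch; the text does not say how the I/O API reports that case.

```python
import os


def resolve_logical_name(logical_name, environ=None):
    """Map a (possibly blank-padded) logical file name to a physical path.

    Returns None when the logical name is not defined in the environment.
    Raises ValueError when the name is empty, has embedded blanks, or is
    longer than 16 characters once the trailing padding is removed.
    """
    env = os.environ if environ is None else environ
    name = logical_name.rstrip(" ")  # trailing blanks are only padding
    if not name or len(name) > 16 or " " in name:
        raise ValueError("invalid logical name: %r" % logical_name)
    return env.get(name)
```

Because the binding lives in each process's environment, two simultaneous runs of the same program can map the same logical name to different physical files without interfering with each other.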

C: Data Structures for Input and Output

Each logical file has header attributes describing itself, and a sequence of time steps divided into logical data records accessed by variable and layer. Dates and times and time-steps are represented as indicated in the preceding section. All layers of all variables are assumed to have the same time-step and data type (gridded, boundary, etc.) structure. There are three categories of time step structure presently in use:

time-independent files have time step = 0; the date and time arguments to access functions are ignored when these access functions are applied to time-independent files;
time-stepped files have time step > 0 with the time step indicated;
restart or circular-buffer files, which have time step < 0 with actual time step the absolute value of the time step indicated, store exactly two active time steps of data (the "even step" and the "odd step") and may be used either for communications buffers or as restart-data files, at a considerable savings in space over a normally time-stepped file used for the same purpose.
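The three categories can be illustrated with a short Python sketch (illustrative only; the function names are hypothetical and this is not the I/O API's record-addressing code). Given a file's starting date and time and its time step, it classifies the file by the sign of the time step and computes which step a requested date and time corresponds to. The 0-based step index, and the mapping of a circular-buffer step to slot `index mod 2` (even or odd), are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def to_datetime(jdate, jtime):
    """Convert YYYYDDD and HHMMSS integers to a datetime."""
    year, day = divmod(jdate, 1000)
    hour, rest = divmod(jtime, 10000)
    minute, second = divmod(rest, 100)
    return datetime(year, 1, 1) + timedelta(
        days=day - 1, hours=hour, minutes=minute, seconds=second)


def step_seconds(tstep):
    """Length in seconds of the actual time step |tstep| (HHMMSS-coded)."""
    hours, rest = divmod(abs(tstep), 10000)
    minutes, seconds = divmod(rest, 100)
    return 3600 * hours + 60 * minutes + seconds


def locate_step(sdate, stime, tstep, jdate, jtime):
    """Return (category, index) for the requested date and time.

    Returns None if the request precedes the starting date and time or
    does not fall exactly on a time step.
    """
    if tstep == 0:
        return ("time-independent", 0)  # date and time are ignored
    delta = to_datetime(jdate, jtime) - to_datetime(sdate, stime)
    elapsed = delta.days * 86400 + delta.seconds  # exact integer seconds
    if elapsed < 0:
        return None
    index, remainder = divmod(elapsed, step_seconds(tstep))
    if remainder != 0:
        return None
    if tstep > 0:
        return ("time-stepped", index)
    return ("circular-buffer", index % 2)  # assumed even/odd slot mapping
```

All of the arithmetic is done on integers, which is what makes the record-number calculation exact and machine-independent.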

There are currently eight types of data structure supported, although the system is designed to permit the addition of extra types in an upward-compatible fashion. The present grid-nest type was actually implemented as a test of this extensibility. Each type except dictionary has additional layer structure and array dimensionality structure as well. Indexes for these are subscripted according to FORTRAN conventions (i.e., starting with 1). Layers are counted from bottom to top vertically; rows are counted from bottom (south) to top (north) and columns are counted from left (west) to right (east) horizontally. The data structure types are identified by "magic number" parameters defined in INCLUDE-file PARMS3.EXT. Together with the magic-number values, the types are:

Type -1: custom. User-defined REAL data with one logical record per variable, layer, and time step, with structure interpreted by the user. Record size (in words) is stored as the number of columns. This type of file may be used to handle situations otherwise unanticipated by the present requirements analysis.
Type 0: dictionary. The "reusable" portions of a file description, with a named-record structure (mapping onto the variables referenced by READ3()) to index the file descriptions. This type should be considered as a tentative prototype step in file type management rather than a complete and lasting solution. The fields in such a description are:
- file type ID (custom, dictionary, etc.)
- time step
- number of variables
- number of layers
- number of rows or maximum number of ID-referenced data sites
- number of columns or custom words per record or maximum number of profile levels
- boundary thickness in cells (used for boundary files only)
- coordinate type ID (lat-lon, Lambert, Mercator, etc.)
- coordinate specification parameters
- grid name
- grid specification parameters
- file description
- list of variable names
- list of units designations for variables
- list of variable descriptions
Type 1: gridded. Gridded data (usually on a regular grid) with one logical record per time step, variable, and layer, with memory layout as in the FORTRAN declaration
```
        REAL ARRAY( NCOLS, NROWS )
```
Type 2: boundary. Boundary data has one logical record per time step, variable, and layer. Its structure is defined in terms of a thickened grid perimeter proceeding counterclockwise from the SW (1,1) corner. The array size for one layer of data is computed in terms of the dimensions of the corresponding gridded data grid and the additional thickness parameter NTHIK according to the following formula
```
        2 * ABS( NTHIK ) * ( NCOLS + NROWS + 2*NTHIK )
```
where NTHIK > 0 indicates an external boundary and NTHIK < 0 indicates an internal boundary. It has component subarrays along each edge of the grid, each layer of which is structured as follows:
```
        REAL SOUTH( NCOLS + NTHIK, NTHIK )
        REAL EAST ( NTHIK, NROWS + NTHIK )
        REAL NORTH( NCOLS + NTHIK, NTHIK )
        REAL WEST ( NTHIK, NROWS + NTHIK )
```
Type 3: iddata. ID-referenced data has one logical record per time step. Note that such data as county-aggregation files may be treated as a special case by encoding the site ID with, for example, the FIPS codes. Note also that location parameters must be explicitly treated as variables if they are stored in such a file. The data records are structured as follows (where MAX is the file attribute maximum number of sites):
- number of actual sites: INTEGER NSITES
- array of site ID's: INTEGER ID( MAX )
- array of data: REAL DATA( MAX, NLAYS, NVARS )
Type 4: profile. For geographically scattered vertical profile arrays of rawinsonde data referenced by ID or by location. Note that location is DOUBLE PRECISION and is treated as potentially time-dependent (to match the behavior of rawinsonde profiles, which the NWS moves around from time to time). The data has one logical record per time step, structured as indicated below (where MAX is the maximum number of sites and MXLVL is the maximum number of vertical levels):
- number of actual sites: INTEGER NSITES
- array of site ID's: INTEGER ID( MAX )
- array of site level counts: INTEGER NLVL( MAX )
- array of site X-locations: DOUBLE PRECISION X( MAX )
- array of site Y-locations: DOUBLE PRECISION Y( MAX )
- array of site Z-locations: DOUBLE PRECISION Z( MAX )
- array of data: REAL DATA( MXLVL, MAX, NLAYS, NVARS )
Type 5: grid-nest or multiple-grid is a data type implemented largely as a test of how extensible the system was in terms of new data types. Its structure is somewhat similar to profile, except that each time step has a potentially varying number of regular grids, each of which has a time-dependent 2-D dimensionality, location, and cell size. The description of the storage order (which is quite tedious) is omitted here for the sake of brevity.
Type 6: sparse matrix. Uses so-called "skyline-transpose representation" to store sparse matrices for use by the new emissions model (and possibly other programs that need it). The data has one logical record per time step, as indicated below, where MXROW is the number of rows in the matrix and MXCOL is the maximum number of active columns per row.
- number of active cols per row: INTEGER NC( MXROW )
- subscripts for active cols: INTEGER IC( MXCOL, MXROW )
- coefficients for active cols: REAL CC( MXCOL, MXROW )
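Two of the layouts above lend themselves to a short worked sketch, given here in Python purely for illustration (the function names are hypothetical and none of this is I/O API code). The first part evaluates the boundary record-size formula and, for an external boundary, the four edge-subarray shapes, whose sizes sum to that formula. The second part applies a sparse matrix stored as NC, IC, and CC to a vector; it assumes that the entries of IC are 1-based subscripts into the input vector, and it uses nested Python lists indexed as `ic[row][k]` in place of the FORTRAN arrays IC( MXCOL, MXROW ) and CC( MXCOL, MXROW ).

```python
def boundary_record_size(ncols, nrows, nthik):
    """Cells in one layer of boundary data: 2*|NTHIK|*(NCOLS+NROWS+2*NTHIK)."""
    return 2 * abs(nthik) * (ncols + nrows + 2 * nthik)


def boundary_edge_shapes(ncols, nrows, nthik):
    """Shapes of the four edge subarrays for an external boundary."""
    if nthik <= 0:
        raise ValueError("this sketch covers external boundaries only")
    return {
        "SOUTH": (ncols + nthik, nthik),
        "EAST": (nthik, nrows + nthik),
        "NORTH": (ncols + nthik, nthik),
        "WEST": (nthik, nrows + nthik),
    }


def apply_sparse_matrix(nc, ic, cc, x):
    """Return y with y[row] = sum of cc[row][k] * x[ic[row][k] - 1]
    over the nc[row] active columns of each row."""
    y = []
    for row, active in enumerate(nc):
        total = 0.0
        for k in range(active):
            total += cc[row][k] * x[ic[row][k] - 1]  # IC assumed 1-based
        y.append(total)
    return y
```

For a 10-column by 8-row grid, an external boundary one cell thick has 40 cells per layer, while an internal boundary (NTHIK = -1) has 32. Unused slots beyond the active count in each row of `ic` and `cc` are ignored by the matrix application.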

D: Public Include-file Structures

There are three public INCLUDE files in the FORTRAN interface to the I/O API. They are the following:

PARMS3.EXT contains dimensioning parameters and the standard file-type, coordinate-system-type, "All Layers", etc., token values for the FORTRAN interface to the I/O API.
FDESC3.EXT contains FORTRAN data structures (COMMONs) for a Models-3/EDSS I/O API file description, and is used to give name syntax for passing file description data between routines OPEN3 and DESC3 and their callers. Requires PARMS3.EXT for dimensioning.
IODECL3.EXT contains declarations and usage comments for the public routines in the FORTRAN I/O API.

E: Public Call Interfaces and Specifications

Except for INIT3(), which is an INTEGER function, the routines in the I/O API are LOGICAL functions which return .TRUE. exactly when they succeed (and .FALSE. otherwise). In the examples below, the names (FNAME for logical file name, VNAME for variable name, PNAME for program name, CNAME for calling-routine's name) are CHARACTER*(*) of length at most 16, STATUS is INTEGER, ARRAY is the output buffer for data access routines, dates and times follow Models-3/EDSS conventions described above, and LOGDEV is the INTEGER FORTRAN unit number for the program's log file. From the functional point of view there are four groups of routines.

INIT3(), OPEN3(), and DESC3() are related to initialization,
READ3(), XTRACT3(), and INTERP3() are related to data retrieval,
WRITE3() is related to data storage, and
SHUT3() is related to system shutdown.

Note that for time-independent files, the date and time arguments are ignored by the data access routines. Data sets are "stateless" in the sense that access operations may be done in any (meaningful) order -- a given time step of a variable may be read many times, time steps may be read or written in reverse (or even random) order, etc.

Integer function INIT3() initializes the entire state for the I/O API, and returns the unit number for the log-file (which will be attached to the file whose logical name is 'LOGFILE' if one exists, and to standard output otherwise). INIT3() may (should) be called multiple times by application routines and programs in order to get the log-file's unit number. A typical call to INIT3() might look like the following:

    LOGDEV = INIT3()
    IF ( LOGDEV .LT. 0 ) THEN
    ...(can't proceed; probably couldn't open the log
    ... file.  Stop the program.)
    END IF

Logical function OPEN3 opens files according to the requested status, and writes a file summary to the program log. For those files opened for writing, it sets the update info in the file header. It may be called multiple times with multiple files; if called repeatedly for a file already open, it returns .TRUE. unless the request is for READ/WRITE and the file is already open for READONLY. Legal values for STATUS are given in PARMS3.EXT: 1 for READONLY, 2 for READ/WRITE/UPDATE of existing files, 3 for READ/WRITE for new files, and 4 for READ/WRITE of unknown (whether new or old) files. A typical call looks like:

    IF( .NOT. OPEN3( FNAME, STATUS, PNAME ) ) THEN
    ...process the error:  OPEN3 failed.
    END IF

Logical function DESC3 puts all the descriptive data for the specified file into the standard file description data structures in FDESC3.EXT. A typical call looks like:

    IF( .NOT. DESC3( FNAME ) ) THEN
    ...process the error:  DESC3 failed.
    END IF

Logical function INTERP3 provides encapsulated read-and-time-interpolate functionality for gridded and boundary data to EDSS programs. It reads enough data from the specified file to interpolate all layers of the single specified variable to the specified date and time, after checking that the specified record-size is correct for that file. Internally it uses its own data buffers to optimize the read-operations. Note that for time-independent data, "interpolate" is taken to mean "copy" and the date and time are irrelevant. A typical call looks like:

    IF( .NOT. INTERP3( FNAME, VNAME, CNAME, DATE, TIME,
    &                  RECSIZE, ARRAY ) ) THEN
    ...process the error:  INTERP3 failed.
    END IF
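The read-and-time-interpolate idea can be sketched in Python as follows (an illustration only, not the INTERP3 implementation; the function name is hypothetical). The sketch assumes linear interpolation between the two records that bracket the requested time, which the text above does not state explicitly, and it takes times as plain seconds for simplicity. It also mirrors the record-size check and the "interpolate means copy" rule for time-independent data.

```python
def interpolate_in_time(t0, rec0, t1, rec1, t, recsize):
    """Interpolate two records to time t (linear weighting is an assumption).

    t0, t1  -- times (in seconds) of the bracketing records
    rec0/1  -- the records themselves, as flat sequences of numbers
    recsize -- the record size the caller expects, checked before use
    """
    if len(rec0) != recsize or len(rec1) != recsize:
        raise ValueError("record size does not match the caller's buffer size")
    if t0 == t1:
        return list(rec0)  # time-independent case: "interpolate" means "copy"
    if not (t0 <= t <= t1):
        raise ValueError("requested time is not bracketed by the two records")
    weight = (t - t0) / (t1 - t0)
    return [(1.0 - weight) * a + weight * b for a, b in zip(rec0, rec1)]
```

For hourly records holding `[0.0, 10.0]` and `[10.0, 30.0]`, a request a quarter of the way through the hour yields `[2.5, 15.0]`.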

Logical function READ3 reads data from the specified file for the specified date and time, variable, and layer. If the file is a dictionary file, the variable name is used as the dictionary-entry index. Tokens ALLAYS3 and ALLVAR3 from PARMS3.EXT may be used to read all layers or all variables for the time step, respectively. A typical call looks like:

    IF( .NOT. READ3( FNAME, VNAME, LAYER, DATE, TIME, ARRAY  ) ) THEN
    ...process the error:  READ3 failed.
    END IF

Logical function XTRACT3 reads data from the specified gridded file for the specified date and time, variable, and ranges of rows, columns, and layers. The row, column, and layer ranges may be shrunk down as far as a single cell, or may be expanded to include the entire 3-D grid (although reading the entire grid this way may be less efficient than using READ3). Token ALLVAR3 from PARMS3.EXT may be used to read all variables for the time step. A typical call looks like:

    IF( .NOT. XTRACT3( FNAME, VNAME, LAY0, LAY1, ROW0, ROW1,
    &       COL0, COL1, DATE, TIME, ARRAY ) ) THEN
    ...process the error:  XTRACT3 failed.
    END IF
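
To make the windowing concrete, the following C sketch computes how many values a layer/row/column window holds (and hence how large ARRAY must be) and copies such a window out of a full grid held in memory. It is an illustration only: the function names are hypothetical, the index ranges are taken to be 1-based and inclusive as in the Fortran call, and the grid is assumed to be stored with the column index varying fastest, then row, then layer.

```c
#include <assert.h>
#include <stddef.h>

/* Length of an inclusive, 1-based index range. */
static size_t span(int first, int last)
{
    assert(first >= 1 && last >= first);
    return (size_t)(last - first + 1);
}

/* Number of values in the requested window. */
size_t window_size(int lay0, int lay1, int row0, int row1, int col0, int col1)
{
    return span(lay0, lay1) * span(row0, row1) * span(col0, col1);
}

/* Copy the window out of a full ncols x nrows x nlays grid.
   Assumed layout: column fastest, then row, then layer.
   Returns the number of values written to out. */
size_t extract_window(const float *grid, int ncols, int nrows, int nlays,
                      int lay0, int lay1, int row0, int row1,
                      int col0, int col1, float *out)
{
    assert(lay0 >= 1 && row0 >= 1 && col0 >= 1);
    assert(lay1 <= nlays && row1 <= nrows && col1 <= ncols);
    size_t n = 0;
    for (int l = lay0; l <= lay1; l++)
        for (int r = row0; r <= row1; r++)
            for (int c = col0; c <= col1; c++)
                out[n++] = grid[((size_t)(l - 1) * nrows + (r - 1)) * ncols
                                + (c - 1)];
    return n;
}
```

A single-cell request gives window_size equal to 1; a request covering every layer, row, and column gives the size of the whole 3-D grid.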

Logical function WRITE3 writes either an individual variable (for GRIDDED, BOUNDARY, or CUSTOM files only), or an entire time step (all variables, all layers) of data for the specified date and time to the specified file. To write an entire time step, VNAME should be 'ALL'. A typical call looks like:

    IF( .NOT. WRITE3( FNAME, VNAME, DATE, TIME, ARRAY ) ) THEN
    ...process the error:  WRITE3 failed.
    END IF

Logical function SHUT3 flushes all open files to disk and then closes them. Failure probably indicates some unrecoverable file-system error, but the user should at least be notified when that happens. A typical call looks like:

    IF( .NOT. SHUT3( ) ) THEN
    ...process the error:  SHUT3 failed.
    END IF

III. IMPLEMENTATION

The first two Models-3/EDSS I/O API implementations are built on top of UCAR's netCDF library. The I/O API is largely a modeler-oriented wrapper around netCDF calls, and constructs files with a particular structure defined in terms of sets of attributes, as indicated above. For the most part, the implementation is written in FORTRAN, and uses a number of lower-level subroutines to manage the details of its operation. There is a matching set of C routines, which are for the most part wrappers around the FORTRAN routines. The interface consists of 65 FORTRAN-77 routines, 5 FORTRAN INCLUDE-files, 26 C routines, and three C include files, with about 14000 lines of code. In three particular places it was necessary to do multi-language programming for the FORTRAN bindings. First, it was necessary to write wrappers callable from FORTRAN around the getenv() and time() system calls in order to evaluate logical names and to get the current wall-clock time. In addition, because INTERP3 needs dynamic memory allocation for its buffers (which, it should be noted, requires a more general notion of dynamic allocation than that available in Fortran 90), it is implemented as one module written in two parts -- a FORTRAN part responsible for managing the file name and variable-name interface, and a C part responsible for buffer management and interpolation.

We have also implemented C interfaces with semantics matching the FORTRAN interface for use by graphics and analysis programs. (Presently, some EDSS visualization programs use a C module which directly calls the netCDF C API in order to read EDSS data sets -- a potential source of inconsistency as EDSS expands and develops further.)

IV. LIMITATIONS

A major limitation of the present implementation is imposed by 32-bit addressing within most UNIX file systems. Model management and data indexing within Models-3 and EDSS would both be far easier if it were possible to keep the outputs of entire episodes within single files, rather than being forced to "chunk" the episodes into shorter segments just to fit within the 2 GB limits of most file systems (or the even more stringent necessity of fitting within "small" (less than 1 GB) physical devices). Consider that the primary output file for a single ozone episode might have the following dimensions:

    30 days, at
    24 (hourly) time steps per day, for
    60 variables, on a grid with
    100 columns
    100 rows
    25 layers, for a total data volume of
    43.2 GB, assuming single precision (4 bytes per number) storage.

NOTE: For hydrological applications, the I/O API has been used for much larger data sets than these, and appropriately designed I/O API based analysis and visualization tools were routinely used with them:

    33 years, at
     4 (6-hourly) time steps per day, for
     8 variables, on a grid with
  2760 columns
  3320 rows
     1 layer, for a total data volume of
     3.53 TB
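
The arithmetic behind these two estimates can be checked with a short C sketch. The function names are hypothetical, the 2 GB limit is taken as 2^31 bytes, file header overhead is ignored, and a year is taken as 365.25 days (1461 six-hourly steps); all of these are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stored values: time steps x variables x columns x rows x layers. */
uint64_t value_count(uint64_t steps, uint64_t vars,
                     uint64_t cols, uint64_t rows, uint64_t layers)
{
    return steps * vars * cols * rows * layers;
}

/* Number of files needed when each file may hold at most limit_bytes
   (rounded up; header overhead is ignored in this sketch). */
uint64_t files_needed(uint64_t total_bytes, uint64_t limit_bytes)
{
    assert(limit_bytes > 0);
    return (total_bytes + limit_bytes - 1) / limit_bytes;
}
```

For the ozone episode, value_count(30 * 24, 60, 100, 100, 25) gives 10,800,000,000 values, or 43.2 GB at 4 bytes each, which would need at least 21 files under a 2^31-byte limit. For the hydrological case, value_count(33 * 1461, 8, 2760, 3320, 1) gives about 3.53 x 10^12 values, which matches the quoted figure when it is read as a count of values; at 4 bytes per value this would be about 14.1 TB. The text does not state the storage width used for that data set.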

Another major limitation has to do with massively parallel supercomputers, for which the "correct" I/O semantics is a matter of research as of this writing, rather than a matter of settled practice. NOTE added OCT. 24, 1997: Various prototypes for domain-decomposition data-parallel models have been implemented and we are evaluating them as part of the MCNC Environmental Programs Practical Parallel Project.

V. FUTURE EXTENSIONS

A: Data Types

One obvious kind of future extension is in the set of data types supported. There are several candidates, none of which is yet sufficiently developed that we can specify them in detail. A first candidate is new data types designed to better structure emissions data in connection with EDSS improvements to emissions modeling. A second candidate is a data type designed to deal with finite-element or finite-volume data on unstructured meshes. A third candidate is exchange-flux matrices to support air quality models incorporating the results of generalized-chemistry research being performed by Prof. Harvey Jeffries of the University of North Carolina at Chapel Hill.

B: Communication for Parallel Computing

NOTE added OCT. 24, 1997: The following has been implemented and we are evaluating it as part of the Practical Parallel Project.

Another possible kind of future extension is in structuring communication and coordination for parallel programs. If the I/O API had two modes -- a communications mode in addition to the existing file storage mode -- it could be used to structure well-engineered coupled models and parallel models in the following fashion: In the communications mode, the read operations must be selective by simulation-time (as they are now), and must block (i.e., suspend the execution of their calling process) until the data for the time requested becomes available. One would then construct coupled or parallel models by building an ordinary program for each component, capable of execution as a stand-alone model when the I/O API is used in file storage mode. When the programs are executed at the same time, the coupled models would use the communications mode of the I/O API to exchange data. The scheduling for coupled models is performed implicitly by the operating system (using the blocking nature of the read operations to determine the order of execution), without the developer having to construct an explicit scheduler for the processes being simulated. This methodology for constructing coupled models requires the right sort of underlying interprocess communications tools upon which to build, and does incur the corresponding communications overheads (which, one hopes, are small in comparison to the computational overheads of the component models themselves). However, it does seem to offer several advantages:

It supports good software engineering principles (modularization and encapsulation), since each of the components must deal with only a single sort of simulation.
It makes for easier re-use of code, since each component is a functioning environmental model in its own right.
It leads to smaller and simpler software systems, since scheduling is supplied by the operating system (and its interaction with the I/O API), and the developer need not worry about interactions between the component simulations.
It provides for the decomposition of the modeling system into explicitly parallel components (which may possibly be distributed to different host machines, if the underlying communications layer permits it). Hence it provides one approach to the use of MIMD massively parallel machines.
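
How blocking reads can determine the order of execution is illustrated by the C sketch below. It is a single-process simulation, not the communications mode itself: a read that "would block" simply returns false, and a small loop stands in for the operating system by running whichever component can proceed. All names (channel, met_step, aq_step, run_coupled) and the data values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NSTEPS 4

/* One-variable communication channel: step k is readable once written. */
struct channel { double data[NSTEPS]; int written; };

/* Stand-in for a blocking read: false means "the caller would block". */
bool chan_read(const struct channel *ch, int k, double *value)
{
    if (k >= ch->written)
        return false;
    *value = ch->data[k];
    return true;
}

void chan_write(struct channel *ch, int k, double value)
{
    assert(k == ch->written && k < NSTEPS);   /* steps arrive in order */
    ch->data[k] = value;
    ch->written = k + 1;
}

struct model { int step; };   /* each component tracks its own time step */

/* "Meteorology" component: needs no input, writes one value per step. */
bool met_step(struct model *m, struct channel *out)
{
    if (m->step >= NSTEPS)
        return false;
    chan_write(out, m->step, 10.0 + m->step);
    m->step++;
    return true;
}

/* "Air quality" component: must read its current step before advancing. */
bool aq_step(struct model *m, const struct channel *in, double *sum)
{
    double v;
    if (m->step >= NSTEPS || !chan_read(in, m->step, &v))
        return false;                          /* blocked or finished */
    *sum += v;
    m->step++;
    return true;
}

/* Run both components until neither can advance.  The consumer is tried
   first on purpose: it still never runs ahead of the producer. */
double run_coupled(void)
{
    struct channel ch = { {0}, 0 };
    struct model met = {0}, aq = {0};
    double sum = 0.0;
    bool progress = true;
    while (progress) {
        progress = false;
        if (aq_step(&aq, &ch, &sum)) progress = true;
        if (met_step(&met, &ch))     progress = true;
    }
    return sum;
}
```

Neither component contains any scheduling logic of its own; the ordering falls out of the rule that a read cannot complete before the matching write. In a real communications mode the blocked process would be suspended by the operating system rather than polled.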

One important potential application is the construction of nested atmospheric models, possibly several levels of nesting deep. In such a nest-model system, there is an explicit nest interaction science-process module in all except the highest resolution models, which is responsible at every model time step for aggregating nest results over the model's grid, and then broadcasting boundary conditions for all the models nested within it. The remainder of the science process modules (and the remainder of the individual models themselves) are otherwise unchanged. The one requirement for synchronization is that the high resolution models' time steps divide exactly into the time step of the parent in which they are nested. If this approach is used, the same component models could be used for both one-way and two-way nesting (they need not even know whether they are operating one-way or two-way!).
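
The synchronization requirement stated above amounts to a divisibility check, sketched here in C. Time steps are expressed in plain seconds and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the nested (child) time step divides the parent's exactly. */
bool nest_steps_compatible(int parent_step_sec, int child_step_sec)
{
    assert(parent_step_sec > 0 && child_step_sec > 0);
    return parent_step_sec % child_step_sec == 0;
}

/* Number of child steps taken between two parent synchronization points. */
int child_steps_per_parent_step(int parent_step_sec, int child_step_sec)
{
    assert(nest_steps_compatible(parent_step_sec, child_step_sec));
    return parent_step_sec / child_step_sec;
}
```

A 900-second parent step with a 300-second nested step passes the check (three nested steps per parent step), whereas a 400-second nested step does not.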

Another family of applications is the coupling of different types of environmental models -- perhaps meteorology, emissions, and air quality at first -- possibly at high-resolution time scales that are impractical otherwise because the data volume would overwhelm all available disk space if the data were stored there. If, however, the meteorology data volume is kept in temporary communication memory rather than on disk, the problem is avoided.

Another application of a communication mode of the I/O API might be to use it to achieve domain-decomposition parallelism for the distributed execution of environmental models. First, decompose the geographic domain into subdomains. On each of the subdomains, run a copy of the environmental model. In addition, run a master modeling program whose task is to assemble the results from the subdomains into a coherent whole on the entire domain, and then to broadcast boundary conditions to each of the subdomain models. This may well be the paradigm by which we get air quality models to efficiently use the resources of MPP machines while at the same time writing well-engineered, maintainable systems.
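
The first step, decomposing the domain, can be sketched in C for the simplest case of splitting the grid into horizontal bands of rows. The text does not prescribe a decomposition scheme; the band layout, the struct row_band, and the function subdomain_rows are assumptions made for illustration.

```c
#include <assert.h>

/* Inclusive, 1-based range of grid rows owned by one subdomain. */
struct row_band { int row0; int row1; };

/* Split nrows grid rows into nsub contiguous bands of near-equal size and
   return the band for subdomain k (0-based).  Any remainder rows are given
   one each to the first bands. */
struct row_band subdomain_rows(int nrows, int nsub, int k)
{
    assert(nsub > 0 && nsub <= nrows && k >= 0 && k < nsub);
    int base  = nrows / nsub;
    int extra = nrows % nsub;
    int start = k * base + (k < extra ? k : extra);   /* 0-based first row */
    int count = base + (k < extra ? 1 : 0);
    struct row_band band = { start + 1, start + count };
    return band;
}
```

For a 100-row grid and three subdomains this yields rows 1-34, 35-67, and 68-100. Each subdomain model would then exchange the rows along its band edges with the master program as boundary conditions.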

Added 1997: This extension, the coupling mode of the I/O API, has been developed under the aegis of the MCNC Environmental Programs Practical Parallel Computing Strategies Project, a project partially funded by US EPA. It has proved very useful for constructing coupled modeling systems, such as that used for MCNC numerical air quality forecasting and for coupled hydrological-meteorological modeling.

C: Other Extensions

At some point, it might be worthwhile to implement C++ interfaces with a full-blown class structure for files, variables, layers, dates and times, etc., which fully supports the structure of the data. Since the requirements analysis and the design were object based (with inheritance implemented in terms of call hierarchy and "cut, paste, and edit" instead of the implementation language), it should be possible to do so. It would, however, be a nontrivial task :-).

Added Nov. 2001: MCNC Environmental Modeling Center has prototyped a geospatial-element cell complex (GECC) datatype that efficiently supports both (time-stepped and time-independent) geospatial coverages and finite element data, on cell complexes with either time-stepped or time-independent node coordinates.
