PARALLEL_NOTES                                                3 Sep 2003

The following are some observations related to running the CMAQ CCTM in a Linux MPICH cluster environment in which the computational nodes share data via the Network File System (NFS).

Running the CCTM for the first part of the tutorial on a parallel cluster with 10 processors took ~16 minutes of wall time, whereas running it on a single processor took ~8 minutes. The parallel run took longer because of the small size of the problem (domain) and the relative inefficiency of NFS, which cannot move packets across the local network fast enough to keep the CPUs on the participating nodes busy. The single-processor job used ~95% of its CPU, whereas the parallel job used only ~15%. With larger domain sizes the computation-to-communication ratio increases, and if the problem size grows sufficiently, the communication overhead is partly amortized, yielding faster turnaround for parallel than for serial runs. Clusters that have high-bandwidth local interconnects or are isolated from general network traffic (e.g., a Scyld Beowulf cluster) suffer much less from this problem; even relatively small problems will still run faster in parallel than in serial.

We infrequently encountered a situation in which a run on 10 processors (one local, plus 4 remote dual-CPU Intel Xeon boxes) would start and then hang; re-launching the run succeeded. This may be due to network latency.

We also had difficulties running in the MPICH cluster associated with the standard practice of automounting directories. It was necessary to "hard mount" the data directory to which we wrote the CMAQ outputs.

Finally, to make our MPICH Linux cluster work, we had to list all the machines we wanted to use in the ~/.rhosts file; otherwise, mpirun would fail with "permission denied".
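The setup described above can be sketched as follows. All host names, user names, and paths here are hypothetical placeholders, and the file locations should be checked against your own system and run script; this is a configuration sketch, not a prescribed procedure.

```shell
# /etc/fstab entry on each compute node: hard-mount the NFS data
# directory (hypothetical server and path) instead of relying on the
# automounter. "hard" makes NFS I/O retry until the server responds;
# "intr" allows a hung mount to be interrupted.
#
#   fileserver:/data/cmaq   /data/cmaq   nfs   hard,intr   0 0

# ~/.rhosts on every node: list each machine participating in the run;
# otherwise mpirun fails with "permission denied".
#
#   node1 myuser
#   node2 myuser

# Launching with MPICH's mpirun on 10 processors, with the participating
# hosts listed in a machines file (names are placeholders):
mpirun -np 10 -machinefile $M3HOME/machines $BASE/cctm.exe
```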
One operational note: in addition to the main log file, which is produced by the "processor 0" task, a run produces an ancillary log file for each of the other tasks. If you re-execute a run that would write logs with the same names as these existing ancillary files, the run will probably hang. Dispose of the ancillary log files before re-launching.
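A minimal cleanup sketch follows. The log-file name pattern CTM_LOG_* is an assumption; check the run script's log-file settings for the naming convention your build actually uses.

```shell
# Remove stale per-task ancillary log files before re-launching a run,
# so the new run does not hang on existing files of the same name.
# RUNDIR is the directory where the previous run wrote its logs
# (hypothetical; defaults to the current directory here).
RUNDIR=${RUNDIR:-.}
rm -f "$RUNDIR"/CTM_LOG_*
```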