Solution Number: 1001
Title: Running COMSOL in parallel on clusters
Platform: Windows, Linux
Applies to: All Products
Versions: All versions
Categories: Solver, Mesh
Keywords: solver memory parallel cluster

Problem Description

This solution describes how you enable distributed parallelization (cluster jobs) in COMSOL.

Solution

COMSOL supports two modes of parallel operation: shared-memory parallelism and distributed-memory parallelism, which includes cluster support. This solution is dedicated to distributed-memory parallel operations. For shared-memory parallel operations, see Solution 1096.

COMSOL can distribute computations on compute clusters using the MPI model. One large problem can be distributed across many compute nodes. Parametric sweeps can also be distributed, with individual parameter cases assigned to different cluster nodes.
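For example, a distributed batch job can be started from the Linux command line. The following is a minimal sketch, assuming a plain-text hostfile that lists one host name per line; the node count and the file names hostfile, in.mph, and out.mph are placeholders:

# minimal sketch: run one distributed batch job on the hosts listed in hostfile
comsol -nn 4 -f hostfile batch -inputfile in.mph -outputfile out.mph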

Cluster computing is supported on Windows (Windows HPC Server 2008 / R2) and Linux, including common schedulers like LSF, PBS, and Sun Grid Engine (SGE, also known as Oracle Grid Engine). As of version 4.3, COMSOL by default uses Hydra to initialize the MPI environment on Linux.

NOTE: to use COMSOL on a compute cluster, you need the Floating Network License (FNL) option.

The quick guides at the bottom of this page explain how to get started with cluster computing and where to find more information.

Some useful tips and troubleshooting guides are provided below.

Fundamentals

The following terms occur frequently when describing the hardware for cluster computing and shared memory parallel computing:

  • Compute node: The compute nodes are where the distributed computing occurs. A COMSOL server resides on each compute node and communicates with the other compute nodes using MPI (Message Passing Interface).
  • Host: The host is a physical machine (hardware) with a network adapter and a unique network address. The host is part of the cluster and is sometimes referred to as a physical node.
  • Core: The core is a processor core, used for shared-memory parallelism by a compute node running on a host with one or more multicore processors.

The number of hosts used and the number of compute nodes are usually the same. For some special problem types, such as very small problems with many parameters, it can be beneficial to run more than one compute node on each host, as in the sketch below.
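The command below is a sketch of that case: four compute nodes spread over two hosts (two compute nodes per host), each node using four cores, for a distributed parametric sweep. The node counts are assumptions for illustration, and hostfile, sweep.mph, and sweep_out.mph are placeholder names:

# sketch: 4 compute nodes on 2 hosts (2 per host), 4 cores per compute node
comsol -nn 4 -nnhost 2 -np 4 -f hostfile batch -inputfile sweep.mph -outputfile sweep_out.mph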

Cluster distribution, Windows and Linux

Example models for cluster testing are included in the Model Library:

COMSOL_Multiphysics/Tutorial_Models/micromixer_cluster

COMSOL_Multiphysics/Tutorial_Models/thermal_actuator_jh_distributed
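For a quick check that distributed runs work at all, one of these models can be run in batch mode from the command line. The sketch below assumes a Linux installation under /usr/local/comsol50 and that the model files follow the same on-disk layout as in the license check example further down; the node count, hostfile, and output file name are placeholders:

# sketch: distributed batch run of the micromixer cluster tutorial model
comsol -nn 2 -f hostfile batch \
    -inputfile /usr/local/comsol50/multiphysics/models/COMSOL_Multiphysics/Tutorial_Models/micromixer_cluster.mph \
    -outputfile micromixer_out.mph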

Troubleshooting

Your first stop is to make sure you have the latest release installed; it can be downloaded from the COMSOL website. Also select Help > Check for Updates to install the latest software updates, which can likewise be downloaded from the COMSOL website.

Error messages due to communication problems between Linux nodes

If you get error messages, make sure that the compute nodes can reach each other over TCP/IP and that all nodes can access the license manager in order to check out licenses. If you use the ssh protocol between the hosts on a Linux cluster, you need to pre-generate the keys so that the nodes do not prompt each other for passwords when communication is initiated:

# generate the keys
ssh-keygen -t dsa
ssh-keygen -t rsa
# copy the public keys to the other machine
ssh-copy-id -i ~/.ssh/id_rsa.pub user@hostname
ssh-copy-id -i ~/.ssh/id_dsa.pub user@hostname
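To verify basic connectivity before launching a job, you can check passwordless ssh between the nodes and TCP access to the license manager. The following is a sketch only: node01, node02, and licserver are placeholder host names, and the ports 1718 (lmgrd) and 1719 (LMCOMSOL) are the default license manager ports, which may differ on your installation:

# sketch: check passwordless ssh between nodes and TCP access to the license manager
for host in node01 node02; do
    ssh -o BatchMode=yes $host hostname || echo "passwordless ssh to $host failed"
done
# 1718 (lmgrd) and 1719 (LMCOMSOL) are the default license manager ports
nc -zv licserver 1718
nc -zv licserver 1719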

Check that the nodes can access the license manager

Linux: Log in to each node and run the command

comsol batch -inputfile /usr/local/comsol50/multiphysics/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph -outputfile out.mph

The command above should be issued on one line. /usr/local/comsol50 is assumed to be your COMSOL installation directory. The /usr/local/comsol50/multiphysics/bin directory, where the comsol script is located, is assumed to be included in the system PATH. Make sure you have write permissions for ./out.mph. If any error messages are produced, you may have a license manager connectivity problem.
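Instead of logging in to each node manually, the same check can be scripted from one shell. This sketch assumes passwordless ssh (see above), the installation directory /usr/local/comsol50, and placeholder node names:

# sketch: run the license check on every node from one shell
for host in node01 node02 node03; do
    echo "--- checking $host"
    ssh $host "comsol batch -inputfile /usr/local/comsol50/multiphysics/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph -outputfile /tmp/out_$host.mph"
done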

Windows HPCS: Log in to each node with remote desktop and start the COMSOL Desktop GUI. No error messages should be displayed.

Issues with InfiniBand-based Linux clusters

Update the InfiniBand drivers to the latest software version. If you cannot update at this time, add the command line option -mpifabrics shm:tcp or -mpifabrics tcp, which forces TCP for communication between nodes, as in the example below.
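This is a sketch of how the option can be appended to an ordinary distributed batch command; in.mph and out.mph are placeholder file names:

# sketch: force TCP instead of InfiniBand for inter-node communication
comsol -clustersimple -mpifabrics shm:tcp batch -inputfile in.mph -outputfile out.mph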

For more advice on how to troubleshoot InfiniBand issues, please refer to the section Troubleshooting Distributed COMSOL and MPI in the COMSOL Multiphysics Reference Manual.

Problems with the Cluster Computing feature in the model tree

If you get the error message "Process status indicates that process is running", do the following:

  • Cancel any running jobs in the Windows HPCS Job manager or other scheduler that you use.
  • In COMSOL, go to the External Process page at the bottom right corner of the COMSOL Desktop.
  • Click the Clear Status button.

Cloud computing

COMSOL 4.3a introduced support for cloud computing through Amazon Elastic Compute Cloud™ (Amazon EC2™). See the PDF guide Running COMSOL on the Amazon Cloud for further information.

Hardware Recommendations

See the knowledgebase solution on Selecting hardware for clusters.

See Also

See also COMSOL and Multithreading.

Example of LSF job submission script

#!/bin/sh
# Rerun process if node goes down, but not if job crashes
# Cannot be used with interactive jobs.
#BSUB -r

# Job name
#BSUB -J comsoltest

# Number of processes.
#BSUB -n 20

# Redirect screen output to output.txt
#BSUB -o output.txt
rm -rf output.txt

# Launch the COMSOL batch job
comsol -clustersimple batch -inputfile in.mph -outputfile out.mph
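The script is submitted to LSF by piping it to bsub so that the #BSUB directives are read; the file name comsol_lsf.sh is a placeholder:

# submit the LSF job script
bsub < comsol_lsf.sh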

Example of PBS job submission script

#!/bin/bash
# ##############################################################################
#
export nn=2
export np=8
export inputfile="simpleParametricModel.mph"
export outputfile="outfile.mph"
#
qsub -V -l nodes=${nn}:ppn=${np} <<'__EOF__'
#
#PBS -N COMSOL
#PBS -q dp48
#PBS -o $HOME/cluster/job_COMSOL_$$.log
#PBS -e $HOME/cluster/job_COMSOL_$$.err
#PBS -r n
#PBS -m a -M email@domain.com
#
echo "------------------------------------------------------------------------------"
echo "--- Starting job at: `date`"
echo
#
cd ${PBS_O_WORKDIR}
echo "--- Current working directory is: `pwd`"
#
np=$(wc -l < $PBS_NODEFILE)
echo "--- Running on ${np} processes (cores) on the following nodes:"
cat $PBS_NODEFILE
#
cat $PBS_NODEFILE | uniq > comsol_nodefile
echo "--- parallel COMSOL RUN"
comsol -clustersimple -f comsol_nodefile batch -mpiarg -rmk -mpiarg pbs -inputfile $inputfile -outputfile $outputfile -batchlog batch_COMSOL_$$.log
echo
echo "--- Job finished at: `date`"
echo "------------------------------------------------------------------------------"
#
__EOF__
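Unlike the LSF example, this script calls qsub itself, so it is executed directly (for example with sh ./submit_comsol.sh, where the file name is a placeholder).

Example of SGE job submission script

For Sun Grid Engine (SGE), a corresponding minimal job script might look like the sketch below. The parallel environment name mpi is site-specific and an assumption, as are the slot count and the file names:

#!/bin/sh
# sketch of an SGE job script (submit with: qsub comsol_sge.sh)
#$ -N comsoltest
#$ -cwd
#$ -pe mpi 20
#$ -o output.txt
#$ -j y

comsol -clustersimple batch -inputfile in.mph -outputfile out.mph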

Related Files

cluster_install_linux_44.pptx 773 KB
cluster_install_linux_44.pdf 1.0 MB
cluster_install_win_44.pptx 1.5 MB
cluster_install_win_44.pdf 859 KB

Disclaimer

COMSOL makes every reasonable effort to verify the information you view on this page. Resources and documents are provided for your information only, and COMSOL makes no explicit or implied claims to their validity. COMSOL does not assume any legal liability for the accuracy of the data disclosed. Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark details.