## MPI Applications Support for Intel VTune and Inspector

Intel® Cluster Studio helps exploit scalable parallelism of a modern cluster at all levels of hybrid parallel or sequential computing for a Fortran/C/C++ MPI application: message passing, threading and SIMD / data levels. The Intel® MPI Library is used at the process messaging level. The Intel OpenMP* library, Intel® Threading Building Blocks (Intel® TBB), and Intel® Cilk™ Plus extensions can be used for thread parallelism. The Intel® Math Kernel Library (Intel® MKL) can be used to automatically exploit threading, message passing through ScaLAPACK, and SIMD data parallelism capabilities of Intel hardware.

This section describes the workflow and capabilities for doing performance and correctness analysis of MPI applications using the Intel® VTune™ Amplifier and the Intel® Inspector tools available in Intel Cluster Studio.

## MPI Analysis Workflow

To analyze the performance and correctness of an MPI application at the inter-process level, use the Intel® Trace Analyzer and Collector tool (located at <installdir>/itac directory after installation). The Intel Trace Analyzer and Collector attaches to the application through linkage (statically, dynamically, also through LD_PRELOAD or via the Intel Compiler -tcollect and -tcollect-filter options), or by using the itcpin tool. The tools collect information about events at the MPI level between processes and allow analyzing the performance and correctness of the MPI calls, deadlock detection, data layout errors, as well as risky or incorrect MPI constructs. The Intel Trace Analyzer and Collector data is correlated and aggregated across all processes and all nodes that participated in the execution run.

Beyond the inter-process level of MPI parallelism, the processes that make up the applications on a modern cluster often also use fork-join threading through OpenMP and Intel TBB. This is where the VTune Amplifier and the Intel Inspector should respectively be used to analyze the performance and correctness of an MPI application.

At the high level the analysis workflow consists of three steps:

1. Use the amplxe-cl and inspxe-cl command-line tools to collect data about an application. By default, all processes are analyzed, but it is possible (and sometimes required for VTune Amplifier - there are certain collection technology limitations) to filter the data collection to limit it to a subset of processes. An individual result directory is created for each spawned MPI application process that was analyzed with MPI process rank value captured.

2. Post-process the result, which is also called finalization or symbol resolution. This is done automatically for each result directory once the collection has finished.

3. Open the content of each result directory through the GUI standalone viewer to analyze the data for the specific process. The GUI viewers are independent: VTune Amplifier and Intel Inspector have their own user-interfaces.

### Note

The file system contents should be the same on all nodes to make sure that the modules referenced in the collected data are available automatically on the host where the collection was initiated. This limitation can be overcome by manual copying of the modules for analysis from the nodes and adjusting the VTune Amplifier / Intel Inspector project search directories to make the modules found.

For VTune Amplifier the CPU model and stepping should be the same on all nodes so that the hardware Event-based sampling operates with the same Performance Monitoring Unit (PMU) type on all nodes.

## Collecting MPI Performance/Correctness Data

To collect performance or correctness data for an MPI application with the VTune Amplifier / Intel Inspector on a Windows or Linux OS, the following command should be used:

$mpirun -n <N> <abbr>-cl -r my_result -collect <analysis type> my_app [my_app_ options] where <abbr> is amplxe or inspxe respectively. The list of analysis types available can be viewed using amplxe-cl -help collect command. As a result of using the collection commands, a number of result directories are created in the current directory, named as my_result.0 - my_result.3. The numeric suffix is the corresponding MPI process rank that is detected and captured by the collector automatically. The usage of the suffix makes sure that multiple amplxe-cl / inspxe-cl instances launched in the same directory on different nodes do not overwrite the data of each other and can work in parallel. So, a separate result directory is created for each analyzed process in the job. Sometimes it is necessary to collect data for a subset of the MPI processes in the workload. In this case the per-host syntax of mpirun/mpiexec* should be used to specify different command lines to execute for different processes. When launching the collection on Windows OS, we recommend passing the -genvall option to the mpiexec tool to make sure that the user environment variables are passed to all instances of the profiled process. Otherwise, by default the processes are launched in the context of a system account and some environment variables (USERPROFILE, APPDATA) do not point where the tools expect them to point to. There are also some specialties about stdout / stdin behavior in MPI jobs profiled with the tools: It is recommended to pass the -quiet / -q option to amplxe-cl / inspxe-cl to avoid diagnostic output like progress messages being spilled to the console by every tool process in the job. The user may want to use the -l option for mpiexec/mpirun to get stdout lines marked with MPI rank. ### Example The most reasonable analysis type to start with for the VTune Amplifier is hotspots, so an example of full command line for collection would be: $ mpirun -n 4 amplxe-cl -r my_result -collect hotspots -- my_app [my_app_options]

A similar command line for the Intel Inspector and its ti1/mi1 analysis types (the lowest overhead threading and memory correctness analysis types respectively) would look like:

$mpirun -n 4 inspxe-cl -r my_result -collect mi1 -- my_app [my_app_options]$ mpirun -n 4 inspxe-cl -r my_result -collect ti1 -- my_app [my_app_options]

Here is an example where there are 16 processes in the job distributed across the hosts and hotspots data should be collected for only two of them:

$mpirun -host myhost -n 14 ./a.out : -host myhost -n 2 amplxe-cl -r foo -c hotspots ./a.out As a result, two directories will be created in the current directory: foo.14 and foo.15 (given that process ranks 14 and 15 were assigned to the last 2 processes in the job). As an alternative to specifying the command line above, it is possible to create a configuration file with the following content: # config.txt configuration file -host myhost -n 14 ./a.out -host myhost -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out and run the data collection as: $ mpirun -configfile ./config.txt

to achieve the same result as above (foo.14 and foo.15 result directories will be created). Similarly, you can use specific host names to control where the analyzed processes are executed:

# config.txt configuration file
-host myhost1 -n 14 ./a.out
-host myhost2 -n 2 amplxe-cl -quiet -collect hotspots -r foo ./a.out

When the host names are mentioned, consecutive MPI ranks are allocated to the specified hosts. In the case above, ranks 0 to 13, inclusive, will be assigned to myhost1, the remaining ranks 14 and 15 will be assigned to myhost2. On Linux, it is possible to omit specifying the exact hosts, in which case the distribution of the processes between the hosts will be done in round-robin fashion. That is, myhost1 will get MPI ranks 0, 2, and 4 thru 15, while myhost2 will get MPI ranks 1 and 3. The latter behavior may change in the future.

### Note

In the examples this reference uses the mpirun command as opposed to mpiexec and mpiexec.hydra while real-world jobs might use the mpiexec\* ones. mpirun is a higher-level command that dispatches to mpiexec or mpiexec.hydra depending on the current default and options passed. All the examples listed in the paper work for the mpiexec\* commands as well as the mpirun command.

## MPI Analysis Limitations

There are certain limitations in the current MPI profiling support provided by the VTune Amplifier / Intel Inspector:

1. MPI dynamic processes are not supported by the VTune Amplifier / Intel Inspector. An example of dynamic process API is MPI_Comm_spawn

2. The data collections that use the hardware event-based sampling collector are limited to only one such collection allowed at a time on a system. When the VTune Amplifier is used to profile an MPI application, it is the responsibility of the user to make sure that only one SEP data collection session is launched on a given host. Common ways to achieve this is using the host syntax and distribute the ranks running under the tool over different hosts.

## Support of Non-Intel MPI Implementations

The examples in this section assume the usage of the Intel MPI library implementation but the workflow will work with other MPI implementations, if the following is kept in mind:

1. VTune Amplifier and Intel Inspector tools extract the MPI process rank from the environment variables PMI_RANK or PMI_ID (whichever is set) to detect that the process belongs to an MPI job and to capture the rank in the result directory name. If the alternative MPI implementation does not set those environment variables, the tools do not capture the rank in the result directory name and a usual automatic naming of result directories should be used. Default value for the -result-dir option is r@@@{at}, which results in sequence of result directories like r000hs, r001hs, and so on.

2. The function/module patterns used for classification of time spent inside of the Intel MPI Library as system one may not cover all of modules and functions in the used MPI implementation. This may result in displaying some internal MPI functions and modules by default.

3. The command-line examples in this section may need to be adjusted to work - especially when it comes to specifying different command lines to execute for different process ranks to limit the amount of processes in the job being analyzed.

4. The MPI implementation needs to operate in cases when there is a tool process between the launcher process (mpirun/mpiexec) and the application process. This essentially implies that the communication information should be passed using environment variables, as most MPI implementations do. The tools would not work on an MPI implementation that tried to pass communication information from its immediate parent process. Intel is unaware of any implementations that have this limitation.