==========================
OpenMP-Aware Optimizations
==========================

LLVM, since `version 11 <https://releases.llvm.org/download.html#11.0.0>`_ (12
Oct 2020), supports an :ref:`OpenMP-aware optimization pass <OpenMPOpt>`. The
pass uses OpenMP-specific domain knowledge to optimize the module. It is enabled
by default at high optimization levels (O2 / O3) when compiling with OpenMP
support enabled.

.. _OpenMPOpt:

OpenMPOpt
=========

.. contents::
   :local:
   :depth: 1

OpenMPOpt contains several OpenMP-aware optimizations. The pass runs early on
the entire module and later on the entire call graph. Most optimizations done
by OpenMPOpt support remarks, which can be enabled by compiling with the
following flags.

.. code-block:: console

  $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt

OpenMP Runtime Call Deduplication
---------------------------------

The OpenMP runtime library contains several functions used to implement features
of the OpenMP standard. Several of these runtime calls return the same value
throughout a parallel region. A common optimization is to replace invariant code
with a single reference, but here the compiler sees only an opaque call into the
runtime library. To get around this, OpenMPOpt maintains a list of OpenMP
runtime functions that are known to be constant and deduplicates calls to them
itself.
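
As a minimal sketch of the pattern this targets, the hypothetical function
``count_consistent_threads`` below queries the team size three times inside one
parallel region. It is an assumption of this example that
``omp_get_num_threads`` is among the calls OpenMPOpt treats as constant; the
serial stand-in exists only so that the sketch also builds without
``-fopenmp``.

.. code-block:: c++

  #ifdef _OPENMP
  #include <omp.h>
  #else
  // Serial stand-in (assumption for this sketch) used when OpenMP is disabled.
  static int omp_get_num_threads() { return 1; }
  #endif

  // Counts the threads for which three back-to-back team-size queries agree.
  int count_consistent_threads() {
    int consistent = 0;
  #pragma omp parallel reduction(+ : consistent)
    {
      // The team size cannot change inside the region, so these three opaque
      // runtime calls return the same value and two of them are redundant.
      int first = omp_get_num_threads();
      int second = omp_get_num_threads();
      int third = omp_get_num_threads();
      if (first == second && second == third)
        consistent += 1;
    }
    return consistent;
  }

Every thread in the team should report agreement, so the result equals the team
size. After deduplication, the intent is that each thread is left with a single
call whose result is reused.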

Globalization
-------------

The OpenMP standard requires that data can be shared between different threads.
This requirement poses a unique challenge when offloading to GPU accelerators.
By default, data cannot be shared between the threads in a GPU; to be shared, it
must be placed in either global or shared memory. To produce correct OpenMP
programs, this has to be done every time a variable may potentially be shared.
Unfortunately, this has significant performance implications and is not needed
in the majority of cases. For example, when Clang generates code for the
offloading region below, it will see that the variable ``x`` escapes and is
potentially shared. This requires globalizing the variable, which means it
cannot reside in registers on the device.

.. code-block:: c++

  void use(void *) { }

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

In many cases, this transformation is not actually necessary but still carries a
significant performance penalty. Because of this, OpenMPOpt can perform an
inter-procedural optimization that scans each known use of the globalized
variable and determines whether it is potentially captured and shared by another
thread. If it is not actually captured, it can safely be moved back to fast
register memory.
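
The distinction between a variable whose address is merely passed along and one
that is captured can be illustrated on the host. This is only a sketch of the
idea, not OpenMPOpt's analysis; ``published``, ``read_only``, ``publish``,
``not_captured`` and ``captured`` are hypothetical names introduced for the
example.

.. code-block:: c++

  // Hypothetical global slot through which an address could reach other threads.
  static int *published = nullptr;

  // Only reads through the pointer; the address is not kept after the call.
  static int read_only(const int *p) { return *p + 1; }

  // Stores the address in a global, so the pointee counts as captured.
  static void publish(int *p) { published = p; }

  // `x` is passed by address but never captured, so it could stay in registers.
  int not_captured() {
    int x = 41;
    return read_only(&x);
  }

  // The address of `x` escapes into `published`, so `x` needs addressable storage.
  int captured() {
    int x = 7;
    publish(&x);
    int seen = *published; // `x` is still alive here
    published = nullptr;   // do not leave a dangling pointer behind
    return seen;
  }

Because both helpers are defined in the same translation unit, an
inter-procedural analysis can inspect their bodies and tell the two cases apart.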

Another case is memory that is intentionally shared between the threads, but is
shared from one thread to all the others. Such variables can be moved to shared
memory at compile time without going through the runtime library. This allows
users to declare shared memory on the device with confidence, without needing
custom OpenMP allocators or relying on the runtime.

.. code-block:: c++

  static void share(void *);

  static void foo() {
    int x[64];
  #pragma omp parallel
    share(x);
  }

  int main() {
  #pragma omp target
    foo();
  }

These optimizations can have very large performance implications. Both rely
heavily on inter-procedural analysis. Because of this, offloading applications
should ideally be contained in a single translation unit, and functions should
not be externally visible unless needed. If remarks are enabled, OpenMPOpt will
inform the user of any globalization calls that remain. These should be treated
as a defect in the program.
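
A minimal sketch of the layout recommended above, assuming hypothetical helpers
``scale`` and ``fill`` and a hypothetical entry point ``run_kernel`` (none of
them come from LLVM): the helpers have internal linkage, so every call site is
visible within the translation unit, and only the entry point is exported.
Without ``-fopenmp`` the pragma is ignored and the code runs serially on the
host.

.. code-block:: c++

  namespace {
  // Internal linkage: every caller is visible in this translation unit.
  int scale(int value) { return 2 * value; }
  } // namespace

  // `static` gives the same internal linkage as the unnamed namespace above.
  static void fill(int *out, int n) {
    for (int i = 0; i < n; ++i)
      out[i] = scale(i);
  }

  // The only externally visible function in this sketch.
  int run_kernel() {
    int data[8];
  #pragma omp target map(from : data)
    fill(data, 8);
    return data[7];
  }

Calling ``run_kernel()`` fills ``data`` with ``0, 2, ..., 14`` and returns the
last element. Compiling such a file with the remark flags shown earlier is one
way to check whether any globalization calls remain.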

Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30