Intel® VTune™ Profiler

Cookbook

ID 766316
Date 3/22/2024

Compile a Portable Optimized Binary with the Latest Instruction Set

Learn how to compile a binary with the latest instruction set while maintaining portability.

Content expert: Jeffrey Reinemann

Modern Intel® processors support instruction set extensions like the different versions of Intel® Advanced Vector Extensions (Intel® AVX):

  • AVX
  • AVX2
  • AVX-512

When you compile your application, consider these options based on the purpose of your application:

  • Generic binary: Compile an application for the generic x86 instruction set. The application runs on all x86 processors, but may not utilize a newer processor to its full potential.
  • Native binary: Compile an application for the specific processor. The application utilizes all features of the target processor but does not run on older processors.
  • Portable binary: Compile a portable optimized binary with multiple versions of functions. Each version is targeted for different processors using compiler options and function attributes. The resulting binary has the performance characteristics of an application compiled for a specific processor (native binary) and can run on older processors.

This recipe demonstrates how you can compile a portable binary with the performance characteristics of a native binary, while still maintaining portability of a generic binary. In this recipe, you compile both the generic and native binaries first to determine if the resulting performance improvement is large enough to justify the increase in binary size.

This recipe covers the Intel® oneAPI DPC++/C++ Compiler (icx) and the GNU* Compiler Collection (GCC).


Ingredients

This section lists the systems and tools used in the creation of this recipe:

  • Processor: Intel® Core™ i7-6700 processor @ 3.40 GHz (code named Skylake)
  • Operating System: Linux OS (Ubuntu 22.04.3 LTS with kernel version 6.2.0-35-generic)
  • Compilers:
    • Intel® oneAPI DPC++/C++ Compiler 2024.0
    • GCC version 11.4.0
  • Analysis Tool: Intel® VTune™ Profiler version 2024.0 or newer

Sample Application

Save this code to a source file named fma.c:

// fma.c
#include <stdio.h>
#include <stdlib.h>

// Initialize the arrays with a repeating, deterministic pattern.
void init(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++)
    {
        a[i] = (float) (i % 10);
        b[i] = a[i] * 1.1f;
        c[i] = a[i] * 1.2f;
    }
}

// Multiply-accumulate loop: a candidate for auto-vectorization and FMA instructions.
void my_fma(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++)
    {
        c[i] += a[i]*b[i];
    }
}

#define ITERATIONS 10000000
#define SIZE 2048

int main()
{
    float *a = malloc(SIZE*sizeof(float));
    float *b = malloc(SIZE*sizeof(float));
    float *c = malloc(SIZE*sizeof(float));

    for (int i = 0; i < ITERATIONS; i++)
    {
        init(a, b, c, SIZE);
        my_fma(a, b, c, SIZE);
    }
    printf("%f", c[5]); // use the data

    free(a);
    free(b);
    free(c);
    return 0;
}

Compile Generic Optimized Binary

Compile the binary following the recommendations in the VTune Profiler User Guide (separate recommendations are available for Windows*).

Intel® DPC++/C++ Compiler

Compile the binary with debug information and -O3 optimization level:

icx -g -O3 -debug inline-debug-info fma.c -o fma_generic

GNU Compiler Collection

Compile the binary with debug information and -O2 optimization level:

gcc -g -O2 fma.c -o fma_generic_O2

To check if the code was vectorized, use the HPC Performance Characterization analysis in VTune Profiler:

vtune -c hpc-performance -r fma_generic_O2_hpc ./fma_generic_O2

The output of this command includes information about vectorization:

Vectorization: 0.0% of Packed FP Operations
Instruction Mix
  SP FLOPs: 16.4% of uOps
    Packed: 0.0% from SP FP
      128-bit: 0.0% from SP FP
      256-bit: 0.0% from SP FP
    Scalar: 100.0% from SP FP
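
If you want to review this summary later without repeating the collection, you can generate it again from the stored result directory with the VTune Profiler command-line report; the same report works for the other result directories created in this recipe:

vtune -report summary -r fma_generic_O2_hpc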

Open the result in the VTune Profiler GUI:

vtune-gui fma_generic_O2_hpc

Once you open the analysis result, examine the Top Loops/Functions with FPU Usage by CPU Time section in the Summary tab.

The FP Ops: Scalar value of 100% and the empty Vector Instruction Set column indicate that GCC does not vectorize the code at the -O2 optimization level.

Use the -O2 -ftree-vectorize or -O3 options to enable vectorization.
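
For example, you could keep the -O2 optimization level and enable the vectorizer explicitly; the output name fma_generic_O2_vec below is only an illustrative choice:

gcc -g -O2 -ftree-vectorize fma.c -o fma_generic_O2_vec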

Compile the fma_generic binary with -O3 optimization level:

gcc -g -O3 fma.c -o fma_generic

Collect the HPC Performance Characterization analysis data for the generic binary:

vtune -c hpc-performance -r fma_generic_hpc ./fma_generic

The output of this analysis includes the following information:

Vectorization: 100.0% of Packed FP Operations
Instruction Mix
  SP FLOPs: 8.3% of uOps
    Packed: 100.0% from SP FP
      128-bit: 100.0% from SP FP
      256-bit: 0.0% from SP FP
    Scalar: 0.0% from SP FP

When you open the analysis result in the VTune Profiler GUI, you can find the same vectorization information in the Summary tab.

Compile Native Binary

Compile the native binary with the Intel® DPC++/C++ Compiler

The -xHost option instructs the compiler to generate instructions for the highest instruction set available on the processor performing the compilation. Alternatively, the -x{Arch} option, where {Arch} is the architecture codename, instructs the compiler to target processor features of a specific architecture.
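
For example, to target the AVX2 feature set explicitly instead of the build machine, you could pass one of the architecture codes accepted by the -x option (CORE-AVX2 also appears in the -ax list later in this recipe; the output name fma_avx2 is illustrative):

icx -g -O3 -debug inline-debug-info -xCORE-AVX2 fma.c -o fma_avx2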

Compile the fma_native binary with the -xHost flag:

icx -g -O3 -debug inline-debug-info -xHost fma.c -o fma_native

Compile the native binary with the GNU Compiler Collection

Compile the fma_native binary with the -march=native flag:

gcc -g -O3 -march=native fma.c -o fma_native

If your processor supports the AVX-512 instruction set extension, consider experimenting with the -mprefer-vector-width=512 option.
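
For example, the following build asks GCC to prefer 512-bit vectors; the output name fma_native_512 is only an illustrative choice:

gcc -g -O3 -march=native -mprefer-vector-width=512 fma.c -o fma_native_512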

Next, collect HPC data for the native binary:

vtune -c hpc-performance -r fma_native_hpc ./fma_native

The output of this analysis includes the following information:

Vectorization: 100.0% of Packed FP Operations
Instruction Mix
  SP FLOPs: 14.2% of uOps
    Packed: 100.0% from SP FP
      128-bit: 0.0% from SP FP
      256-bit: 100.0% from SP FP
    Scalar: 0.0% from SP FP

When you open the analysis result in the VTune Profiler GUI, you can find the same vectorization information in the Summary tab.

Compare Generic and Native Binaries

To compare the HPC data collected for the generic and native binaries, run this command:

vtune-gui fma_generic_hpc fma_native_hpc

In the VTune Profiler GUI, switch to the Bottom-Up tab. Set Loop Mode to Functions only.

Switch to the Summary tab and scroll down to the Top Loops/Functions with FPU Usage by CPU Time section:

Observe the CPU Time and Vector Instruction Set columns.

Consider the performance difference between the generic and the native binary. Decide whether it makes sense to compile a portable binary with multiple code paths.

NOTE:

This sample application was auto-vectorized by the compiler. To investigate vectorization opportunities in your application in depth, use Intel® Advisor.

Compile Portable Binary

If the comparison between the generic and native binaries shows a performance improvement, for example, reduced CPU Time, consider compiling a portable binary.

Compile the portable binary with the Intel® DPC++/C++ Compiler

Use the -ax (/Qax for Windows) option to instruct the compiler to generate multiple feature-specific auto-dispatch code paths for Intel processors.

Compile the fma_portable binary with the -ax option:

icx -g -O3 -debug inline-debug-info -axCOMMON-AVX512,CORE-AVX2,AVX,SSE4.2,TREMONT,ICELAKE-SERVER fma.c -o fma_portable

Refer to the -ax option help page for the list of supported architectures.

Compile the portable binary with the GNU Compiler Collection

Compare the results for generic and native binaries. If the CPU Time was improved and an additional Vector Instruction Set was utilized for a specific function in the native binary result, then add the target_clones attribute to this function.

If the function calls other functions, consider adding the flatten attribute to force inlining, since the target_clones attribute does not propagate to the functions that the cloned function calls.
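
The following sketch illustrates the idea; the helper function scale and the function my_fma_scaled are hypothetical and not part of fma.c. Without flatten, only my_fma_scaled would be cloned per target, while the helper would be compiled once unless the compiler happened to inline it on its own:

// Hypothetical example: scale() and my_fma_scaled() are not part of fma.c.
static inline float scale(float x)
{
    return 1.1f * x;
}

// flatten inlines scale() into each clone, so the helper code is also
// compiled for every target listed in target_clones.
__attribute__((flatten, target_clones("default,avx2,arch=skylake-avx512")))
void my_fma_scaled(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++)
    {
        c[i] += scale(a[i]) * b[i];
    }
}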

Copy the contents of the fma.c source file to a new file, fma_portable.c, and add the TARGET_CLONE preprocessor macro:

#define TARGET_CLONES __attribute__((flatten,target_clones("default,sse4.2,avx,"\
    "avx2,avx512f,arch=skylake,arch=tremont,arch=skylake-avx512,"\
    "arch=cascadelake,arch=cooperlake,arch=tigerlake,arch=icelake-server")))

Refer to the x86 Options page of the GCC manual for the list of supported architectures.

Multiple versions of a function will increase the binary size. Consider the trade-off between performance improvement for each target and code size. Collecting and comparing VTune Profiler results enables you to make data-driven decisions to apply the TARGET_CLONES macro only to the functions that will run faster with new instructions.

Add the TARGET_CLONES macro before the definitions of the my_fma and init functions, then save the changes to fma_portable.c:

TARGET_CLONES
void my_fma(float *a, float *b, float *c, int size)

Compile the fma_portable binary:

gcc -g -O3 fma_portable.c -o fma_portable
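
To confirm that GCC emitted the per-target clones, you can list the symbols of the resulting binary. The exact suffixes appended to the clone names depend on the GCC version, so treat the output as illustrative:

nm fma_portable | grep my_fma

You can also compare the sizes of the fma_generic, fma_native, and fma_portable binaries to quantify the code-size cost of the additional code paths.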

Compare Portable and Native Binaries

To compare the performance of the portable and native binaries, collect the HPC Performance Characterization data for the fma_portable binary:

vtune -c hpc-performance -r fma_portable_hpc ./fma_portable

The output of this analysis includes the following data:

Vectorization: 100.0% of Packed FP Operations
Instruction Mix
  SP FLOPs: 6.9% of uOps
    Packed: 100.0% from SP FP
      128-bit: 66.7% from SP FP
      256-bit: 33.3% from SP FP
    Scalar: 0.0% from SP FP

Open the comparison in VTune Profiler GUI:

vtune-gui fma_portable_hpc fma_native_hpc

As a result, the portable binary uses the highest instruction set extension available on the target system and shows performance characteristics comparable to the native binary.