## **Data Parallel C++ (DPC++) for Heterogeneous Architectures**

#### **Aiichiro Nakano**

Collaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Quantitative & Computational Biology University of Southern California

Email: anakano@usc.edu

Goal: Unified low-level programming of both CPU & various accelerators including GPU

# **Open Programming Models**

OpenCL (Open Computing Language)
 Open standard for programming heterogeneous devices
 <u>https://www.khronos.org/opencl/</u>

• SYCL

High-level programming standard (or abstraction layer) for single-source C++ based language on heterogeneous computer architectures

https://www.khronos.org/sycl/

• Data parallel C++ (DPC++)

**Extension of C++ programming language, incorporating SYCL & other features, initially created by Intel; an open-source compiler is available on GitHub** 

https://intel.github.io/llvm-docs/index.html

## **Platform Model**

• DPC++ unifies programming of central processing unit (CPU, scalar computation), graphics processing unit (GPU, vector computation), artificial-intelligence accelerator (AI, matrix or tensor) and field-programmable gate array (FPGA, spatial computation)



<u>Data Parallel C++</u>, B. Ashbaugh *et al*. (Apress, 2021); sample codes at <u>https://github.com/Apress/data-parallel-CPP</u>

### **Host & Device Codes**

- Various accelerators (e.g., GPU & FPGA) are referred to as devices
- DPC++ program can be a single source, *i.e.*, the same file contains both the host code to run on CPU and device kernels that run on devices



C++ in a nutshell

- **Class:** User-defined data type that contains both member variables & member functions to work on them
- **Object: Instance of a class**

Many C++ tutorials like: <u>http://www.cplusplus.com/doc/tutorial</u>

## Queue

- Queue: Abstraction to which work is submitted for execution on a single device (defined in SYCL as sycl::queue class)
- A queue is bound to a device upon construction of the queue object
- Selection of a device is achieved using sycl::device\_selector class



#### **Built-in selectors:**

cpu\_selector
gpu\_selector
Intel::fpga\_selector

CPU as device (useful for debugging) GPU FPGA

## **Binding a Queue to a Device**

#### get\_queue.cpp



#### How to compile & run on DevCloud: \$ dpcpp -o get device get device.cpp

```
$ apcpp -0 get_device get_device.cpp
$ qsub -I -l nodes=1:gpu:ppn=2
$ ./get_device
Device: Intel(R) Gen9 HD Graphics NEO
```

## **Host & Device Memory**

• Host & device have separate memories



• Data needed by a device kernel must be transferred from host memory to device memory prior to kernel execution, and results of kernel computation must be transferred back from device memory to host memory upon termination of kernel execution



## **Data Management: Buffer**

- Buffer class: Abstraction of data object (not specific memory addresses)
- A buffer object can be created from existing data on the host; data is copied during buffer construction from the existing host allocation into the buffer object
- Range class: Represents one-, two- or three-dimensional range

of an array object

## **Data Management: Accessor**

• Accessor class: Abstraction of reading & writing operations on buffer objects; usually created by get\_access() method in the buffer class



| Access mode | Description                     |
|-------------|---------------------------------|
| read        | Read-only access by device code |
| write       | Device code will write into it  |
| read_write  | Read & write access             |

### **Device Code**

- Device code is submitted to a queue using submit() function of a queue object
- Argument to submit() is a command group function object in the form of lambda expression (*i.e.*, function with no name): [access mode to caller's variables] (argument list) {function body}
- The argument of the passed function is a handler to access the command group, which will be created by a runtime system and passed to the user through the argument

```
queue q(gpu selector{});
q.submit([&](handler &h){ Command group });
```

Access by address

### **Parallelization Construct**

- Device code can be parallelized using parallel\_for() function, which takes a range of a loop index and a function as arguments
- Argument of the function is a loop index, which is of id class (index in a one-, two or three-dimensional range)
- Loop indices are distributed among multiple threads on device for parallel execution

```
#define NTRD 512
range<1> sizeBuf{NTRD}; Access by value
h.parallel_for(sizeBuf, [=](id<1> index) { Code for each index });
Index in one-dim. range
```

## **Example: Computing the Value of** $\pi$

• Numerical integration

$$\int_{0}^{1} \frac{4}{1+x^2} \, dx = \pi$$

• Discretization:

 $\Delta = 1/N$ : step = 1/NBIN  $x_i = (i+0.5)\Delta \ (i = 0,...,N-1)$ 

$$\sum_{i=0}^{N-1} \frac{4}{1+x_i^2} \Delta \cong \pi$$

#define NBIN 1000000

```
float sum = 0.0f;
float step = 1.0f/NBIN;
for (int i=0; i<NBIN; i++) {
  float x = (i+0.5f)*step;
   sum += 4.0f/(1.0f+x*x);
}
float pi = sum*step;
```



#### **Multithreading & Data Privatization**

- Multithreading: Interleaved assignment of bins *i* among *NTHRD* threads, where thread ID  $tid \in [0, NTHRD 1]$
- Data privatization: Provide each thread a dedicated accumulator to avoid a race condition (*i.e.*, nondeterministic result depending on the timing of read & write operations on a shared variable by multiple threads)

```
for (int i=tid; i<NBIN; i+=NTHRD) {
  float x = (i+0.5)*step;
  sum[tid] += 4.0/(1.0+x*x);
}</pre>
```

• Interthread reduction: After all partial summations have been executed by multiple threads, the total sum must be computed by a single thread

```
float pi = 0.0f
for (int i=0; i<NTHRD; i++)
    pi += sum[i];
Pi *= step;</pre>
```



#### Computing $\pi$ on a Device

#### From pi.cpp

```
q.submit([&](handler &h){
  auto sumAccessor =
  sumBuf.get_access<access::mode::read_write>(h);
  h.parallel_for(sizeBuf, [=](id<1> tid) {
    for (int i=tid; i<NBIN; i+=NTRD) {
      float x = (i+0.5f)*step;
      sumAccessor[tid] += 4.0f/(1.0f+x*x);
    }
  }); // End parallel_for
}); // End queue submit</pre>
```

```
Compile & run on devcloud

u49162@login-2:~$ dpcpp -o pi pi.cpp

u49162@login-2:~$ qsub -I -l nodes=1:gpu:ppn=2

GPU-accelerated node has been allocated, and automatically logged in

u49162@s001-n181:~$ ./pi

Running on: Intel(R) Gen9 HD Graphics NEO

Pi = 3.14159
```

## **Synchronization**

• Synchronization between host & device can be achieved by buffer destruction

```
std::array<float, NTRD> sum;
    Buffer is created in a separate scope
{
  queue q(gpu selector{});
  range<1> sizeBuf{NTRD};
  buffer<float,1> sumBuf(sum.data(),sizeBuf); Buffer now takes
                                                    ownership of sum array
  q.submit([&](handler &h){
    auto sumAccessor =
    sumBuf.get access<access::mode::read write>(h);
    h.parallel for(sizeBuf, [=](id<1> tid) {
       for (int i=tid; i<NBIN; i+=NTRD) {</pre>
         float x = (i+0.5f)*step;
         sumAccessor[tid] += 4.0f/(1.0f+x*x);
    }); // End parallel for
                                                       Buffer relinquishes the
  }); // End queue submit
                                                       ownership of data &
    Buffer destructor is invoked when exiting from the scope ---->
}
                                                       copies its contents back
float pi=0.0f;
                                                       to host memory
for (int i=0; i<NTRD; i++)</pre>
  pi += sum[i];
pi *= step;
std::cout << "Pi = " << pi << std::endl;</pre>
```

## **DPC++ Program Pattern**

```
#include <CL/sycl.hpp>
#include <iostream>
#include <array>
using namespace cl::sycl;
#define NBIN 1000000 // # of bins for guadrature
#define NTRD 512 // # of threads
int main() {
  float step = 1.0f/NBIN;
  std::array<float, NTRD> sum;
  for (int i=0; i<NTRD; ++i) sum[i] = 0.0f;</pre>
    queue q(gpu selector{});
    std::cout << "Running on: " <<</pre>
      q.get device().get info<info::device::name>() << std::endl;</pre>
    range<1> sizeBuf{NTRD};
    buffer<float, 1> sumBuf(sum.data(), sizeBuf);
                                                                Create Buffer
    q.submit([&](handler &h){
      auto sumAccessor =
                                                                Copy to Device
      sumBuf.get access<access::mode::read write>(h);
      h.parallel for(sizeBuf, [=](id<1> tid) {-
                                                               Execute Kernel
        for (int i=tid; i<NBIN; i+=NTRD) {</pre>
          float x = (i+0.5f)*step;
          sumAccessor[tid] += 4.0f/(1.0f+x*x);
        }
      }); // End parallel for
    }); // End queue submit
                                                              Copy Back to Host
  float pi=0.0f;
 for (int i=0; i<NTRD; i++) // Thread reduction
    pi += sum[i];
  pi *= step; // Multiply bin width to complete integration
  std::cout << "Pi = " << pi << std::endl;</pre>
  return 0;
}
```