Automatic Data Placement into GPU On-Chip Memory Resources

Chao Li# Yi Yang* Zhen Lin# Huiyang Zhou#
# Department of Electrical and Computer Engineering, North Carolina State University
*Department of Computer Systems Architecture, NEC Labs
#{cli17, zlin4, hzhou}@ncsu.edu; *yyang@nec-labs.com

Abstract

Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on-chip memory resources vary among different GPU generations, performance portability has become a daunting challenge.

In this paper, we tackle this problem with compiler-driven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Among 12 benchmarks in our study, our proposed compiler algorithm improves the performance by 1.76x on average on Nvidia GTX480, and by 1.61x on average on GTX680.

1. Introduction

Throughput-oriented architecture, such as graphic processing units (GPUs), has been widely used to accelerate many general-purpose computation workloads. Although general purpose computation on GPUs (GPGPU) achieves high throughput mainly by employing a large bundle of threads to overlap computations with long-latency memory accesses, off-chip memory bandwidth and latency remain a performance as well as energy-efficiency bottleneck. Furthermore, the current trend of GPGPU evolution scales the computational throughput much faster than off-chip memory access bandwidth. For example, Nvidia GTX480 GPUs based on the FERMI architecture [17] have an arithmetic throughput of 1.35 TFLOPS with the memory bandwidth of 178 GB/s. In comparison, GTX680 GPUs based on the KEPLER architecture [18] have an arithmetic throughput 3.09 TFLOPS (2.3X increase over GTX480) with the memory bandwidth of 192 GB/s (7.8% increase over GTX480). To alleviate the off-chip bandwidth bottleneck and reduce memory access latency, GPUs are equipped with a multiple-level on-chip memory hierarchy including register files, L1 data caches (D-cache), shared memory, and L2 caches. As expected, how to effectively utilize such on-chip memory resources has a significant impact on application performance. However, it is non-trivial for application developers to explicitly manage these on-chip memory resources as the trade-offs among these resources are complex and sometimes non-intuitive [14]. More importantly, as on-chip resources have been changing significantly for different generations of GPUs, an optimized kernel upon one generation becomes suboptimal on another one. Thus performance portability is a daunting challenge for application developers.

In this paper, we propose compiler-driven automatic data placement as our solution. We focus on GPGPU programs that have already been reasonably optimized either manually by programmers or automatically by some compiler tools. In other words, our input programs already employ classical loop optimizations such as tiling and allocate important data, either for communication among threads or for data reuses, in shared memory. Our proposed compiler algorithm refines these programs by revising data placement across different types of GPU on-chip memory resources.

Our compiler algorithm places data into different types of on-chip memory resources using the following systematic way. First, it analyzes the usage patterns of all shared memory variables in an input kernel program and tries to promote those shared memory variables into registers if they are not used for communication among threads. Second, if the shared memory usage becomes the bottleneck for thread-level parallelism (TLP), it checks whether it is profitable to move some shared memory variables into either global or local memory so as to implicitly exploit the L1 D-cache. Third, it detects redundant accesses to both global memory and shared memory across different threads. Then, it aims to reduce such redundant accesses by compacting multiple threads into one, thus converting redundant shared/global memory accesses among threads into data sharing/reuse of registers. To find the most profitable data (re)placements, an auto
tuning process is used to select the optimal parameters in the optimization process. The first two steps of our compiler algorithm focus on replacing shared memory variables with registers or global/local memory variables. The key reason is due to the evolution trend of GPU on-chip memory resources. In early generations such as the Nvidia G80 and GT200 architecture, the ratio of the register file size to the shared memory size is 2 and 4, respectively. In comparison, in the FERMI and KEPLER architecture, the ratio becomes 2.7 and 5.3, respectively. As a result, the code optimized for G80 or FERMI tends to over-utilize shared memory while underutilizing the register file when it runs on GT200 or KEPLER GPUs. As a result of such underutilization, it is proposed in prior works [1] to turn off significant portions of the register file to reduce static power consumption.

We evaluate our proposed automatic data replacement algorithm using a diverse set of applications from different GPGPU benchmark suites that have been manually optimized. Our results show that our compiler algorithm improves the performance by up to 4.14X and an average 1.76X on the FERMI architecture, and by up to 3.30X and an average of 1.61X on the KEPLER architecture.

The remainder of the paper is organized as follows. Section 2 presents a brief background on GPU architecture with an emphasis on on-chip memory resources. Section 3 presents in detail our proposed automatic data placement algorithm. Section 4 and 5 discuss our experimental methodology and the experimental results. Section 6 addresses the related work. Section 7 concludes our paper.

2. Background and Motivation

2.1 GPGPU Architecture and Programming Model

State-of-art GPUs employs many-core architecture, on which the cores are organized in a two-level hierarchy. Each GPU contains a cluster of streaming multiprocessors (SM) in Nvidia GPUs, or computing units in AMD GPUs. Each SM in turn consists of multiple streaming processors (SPs). To amortize the overhead of instruction fetch and decode, an array of SPs executes one scalar program in the single-instruction multiple-data (SIMD) manner. A group of threads running on such an array of SPs and sharing the same program counter (PC) is referred to as a warp of threads. Multiple warps of threads are grouped into a thread block (TB) and a number of thread blocks are organized into a thread grid.

2.2 GPU Memory Resources

The GPU off-chip memory space consists of texture memory, constant memory, local memory, and global memory. Texture memory and constant memory are for read-only data which can be accessed by all threads. Global memory can be read or written by all threads in a kernel. In contrast, local memory is private to each thread.

In order to reduce the latency and improve the bandwidth of off-chip memory accesses, three types of on-chip memory including shared memory, data caches, and a register file, are introduced in each SM. Texture caches and constant caches are also on-chip memory but they are used for read-only data and not our focus in this study.

Among three types of on-chip memory, the register file has the shortest access latency and highest throughput. Furthermore, the register file is larger than the L1 D-cache and shared memory as shown in Table 1. The register file is private to each thread, which means data in registers can only be accessed by a single thread, except for the latest Nvidia KEPLER architecture, which introduces a new instruction “__shfl” [18] to enable a thread to access the registers in other threads within the same warp. The maximum number of registers per thread is ISA-dependent and varies in different architectures. Exceedingly heavy usage of registers per thread will result in register spills into its local memory, which may be captured in L1 D-cache.

Compared to register files, shared memory has lower throughput and smaller capacity. As shown in Table 1, a GTX 680 GPU has a 256KB register file and 48KB shared memory. As shared memory is accessible to all threads in a TB and has low access latency, prior works have been focused on using shared memory to achieve memory coalescing, to provide data communication, and to store data for temporal reuses. L1 D-cache shares the same hardware resource as shared memory on FERMI or KEPLER architecture. In contrast to shared memory, which is explicitly managed by kernel code, L1 D-caches are hidden from developers and are implicitly managed by hardware to keep the most recently accessed data. Furthermore, while the intensive usage of shard memory or registers can limit the number of threads running on each SM, the usage of L1 D-cache does not. However, too many threads in a SM would compete with each other for the limited L1 D-cache capacity, which may result in poor performance due to cache contention [10].

2.3 Architecture Evolutions

GPUs evolve at a fast pace. Taking Nvidia GPUs as an example, from the first generations of unified shader G80 to the state-of-art KEPLER architecture. A comparison of them is shown in Table 1. Several observations can be made from the table. First, there is a higher increase in computational throughput than off-chip memory bandwidth. For example, from the FERMI architecture to the KEPLER architecture, the computation throughput increases by up to 229% while the memory bandwidth increases by only 8.3%. As a result, we need to more carefully manage on-chip resource to effectively utilize the computational resources. Second, among GPU on-chip memory resources, the register file size and D-cache/shared memory have been changing across different generations.
For example, from G80 to GT200, the register file size is doubled while the shared memory capacity remains the same. The same trend is present when comparing the FERMI architecture and the KEPLER architecture. Consequently, the code optimized for early GPU generations tend to use shared memory more heavily. This leads to a serious challenge for performance portability for such optimized code running on different GPUs.

Table 1. A comparison of hardware characteristics across different GPU generations.

<table>
<thead>
<tr>
<th></th>
<th>G80 (GTX 8800)</th>
<th>GT200 (GTX 280)</th>
<th>FERMI (GTX 480)</th>
<th>KEPLER (GTX 680)</th>
<th>KEPLER (K20c)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Arithmetic throughput (Gflops/S)</td>
<td>504</td>
<td>933</td>
<td>1345</td>
<td>3090</td>
<td>3950</td>
</tr>
<tr>
<td>Memory Bandwidth (GB/s)</td>
<td>57</td>
<td>141</td>
<td>177</td>
<td>192</td>
<td>250</td>
</tr>
<tr>
<td>Shared memory size (KB)</td>
<td>16</td>
<td>16</td>
<td>48</td>
<td>48</td>
<td>48</td>
</tr>
<tr>
<td>Register file size (KB)</td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
<td>256</td>
</tr>
</tbody>
</table>

In summary, the main challenges for application developers to manually manage the on-chip memory resources include: 1) GPUs have three types on-chip memory and, although critical to performance, it is difficult to decide the proper on-chip resource for a particular data element in an application, and 2) the resource evolution is not linear across different GPU generations, and optimal on-chip resource usage varies for different GPU generations.

3. Automatic Data Placement into On-chip Memory Resources

To automatically manage on-chip memory resources and achieve performance portability, in this section, we describe in detail our proposed compiler algorithm for automatic data placement. We first present our analysis of possible data placement patterns among different types of on-chip memory resources. Then, we construct our compiler algorithm using the profitable patterns.

3.1 Data Placement Patterns

As discussed in Section 2, we focus on three types of on-chip memory: register files, shared memory, and L1 D-caches. We propose to move data from one type of on-chip memory to another to achieve optimal resource utilization. As shown in Figure 1, there are six possible directions of moving data variables or six ways of data (re)placement. Data placement between register variables and local memory variables, i.e., direction 3 and 6, is determined by the compiler of the GPU vendors. With the Nvidia GPU compiler NVCC [3], we determine that the array variables accessed with non-constant indices, e.g., A[k] where k is a run-time variable, are allocated in local memory. Both scalars and array variables with constant indices are candidates for register allocation. Moving data from register files and D-caches (i.e., local/global variables) into shared memory, i.e., direction 4 and 5, requires significant code changes besides synchronization. Also, the current trend of GPU evolution is that the register files are much larger than shared memory and the existing compiler tools already can make use of shared memory for data reuse and communication. Therefore, we focus on placement 1, 2, and 3, and leave further investigation on placement 4 and 5 as future work.

3.1.1 Pattern 1: Promote variables from shared memory to registers

Shared memory can be used to exchange data among threads in a TB. Also, as a low-latency on-chip resource, many applications use shared memory as software-managed cache to hold important (private) data for each thread. There are three reasons why it may be profitable to promote a shared memory variable into registers. First, the shared memory usage may limit the number of concurrent TBs on an SM, i.e., TLP, and promoting shared memory variables into registers can alleviate the pressure on this critical resource. Second, shared memory has longer access latency and lower bandwidth than register files. Third, accessing shared memory variables is associated with instruction overhead for address computations. Therefore, higher performance may be expected when promoting

![Figure 1. Data placements among three types of on-chip memory.](image1)

![Figure 2. A code example of PathFinder.](image2)
shared memory variables into registers.

We show a benchmark, PathFinder, as an example, in Figure 2. Path-Finder makes use of two shared memory arrays, ‘prev’ and ‘result’, as shown in Figure 2. Its TB dimension is 256x1 and its thread grid size is 19x1. As a result, the sizes of these two shared-memory arrays are small (256x4=1kB) and such shared memory usage is actually not a bottleneck for the number of concurrent TBs on each SM. For the shared-memory array ‘prev’, its accesses in the kernel code, ‘prev[tx-1]’ and ‘prev[tx+1]’ indicate that the data in this array are indeed shared among different threads. As shown in line 7 in Figure 2a, the ‘result’ array is accessed by each thread multiple times in a loop. As each thread only accesses the array result using its own thread id as shown in line 8 and line 9 in the figure, there is no communication using the ‘result’ array across threads. Since each thread only accesses its individual part of the array, it is safe to simply replace ‘result[tx]’ with a register. Further, as the variable is only defined and used in the same thread, we can safely remove the synchronization instruction ‘__syncthread()’ after the statement updating the variable ‘result’ (line 7). The resulting code is shown in Figure 2b.

In our study, we found that shared memory is used very often in many benchmarks. Therefore, there are usually multiple shared memory arrays that can be replaced with registers. In this case, we may not have enough registers to promote all the shared memory arrays, and need to decide which shared memory array should be replaced with registers to maximize the performance benefits. Our framework handles this problem by counting the references of each shared memory array, and gives higher priority to the one with larger reference counts (Section 3.3).

3.1.2 Pattern 2: Promote variables from shared memory into L1 D-caches

As discussed above, the register file cannot be used for an array variable with a dynamically determined index (e.g., \( A[x] \)) and intensive usage of registers for shared memory promotion can also limit TLP. The local memory or global memory, which implicitly utilizes the L1 D-cache to achieve the high performance, does not have such drawbacks. Therefore, promoting variables from shared memory into local memory / global memory (L1 D-cache) is a better choice when (1) replacing shared memory arrays with dynamic indices or (2) the shared memory array to be promoted has a large size (e.g., an array of structures). Furthermore, if a shared memory variable is used for communication among threads, a global memory variable can be used to replace it since global memory is visible for all threads.

Figure 3a, using the benchmark Matching Cube (MC), from CUDA SDK [19] as an example, shows that two shared memory arrays ‘vertlist’ and ‘normlist’ are used in the kernel. Each thread only accesses part of these two arrays, and the total size of these two arrays is 9216 bytes for each TB. As a result, each SM can run 5 TBs concurrently even when the shared memory is configured to be 48KB. As we can see from the figure, the value of variable ‘edge’ in line 11 of Figure 3a can only be determined in the runtime, and therefore the array ‘vertlist’ cannot be allocated in the registers. We choose to promote these two arrays into local memory instead of global memory to minimize the code change since for global memory, we have to modify the CPU code to allocate a global memory array and insert it as a parameter of the kernel invocation. The resulting code is shown in Figure 3b. Since the code in Figure 3b does not use shared memory any more, each SM can run up to 16 TBs in the KEPLER GPUs and 8 TBs in the FERMI GPUs. Such improved TLP leads to higher performance for MC. In many cases, an application may intensively use shared memory to communicate among threads. Then, the global memory has to be used to replace the shared memory variables to maintain such communication so that we can both overcome the TLP bottleneck imposed by shared memory usage and keep inter-thread data communication.

Note that although promoting variables from shared memory into L1 D-cache can significantly improve the TLP (or occupancy) otherwise limited by shared memory capacity, it doesn’t mean that more TLP will always lead to higher performance. In some scenarios, more concurrent TBs may increase cache and/or network contentions and adversely affect the performance [10]. Thus, our compiler
the index variable 'index_s' as shown in line 8 of Figure 4a.  We can see that the index variable 'index_s' is dependent on tx, bx (i.e., blockIdx.x) and by (i.e., blockIdx.y) but not on ty. It means that when the 8 warps of a TB actually load the same block of global memory data, there are 7 redundant global memory accesses in each TB since all the warps share the same tx, bx, and by, i.e., the same memory reference index.

All three types of on-chip memory can be potentially used to reduce the overhead of such redundant global memory accesses across warps. First, the L1 D-cache is utilized implicitly when redundant global memory accesses hit in the L1 D-cache but such data reuse cannot be assured as the data may be evicted by other data requests. Second, we can choose to let only the first warp load the data into shared memory, and other warps then access the data from shared memory. However, this way incurs overhead due to operations moving data from/into register into/from shared memory [14]. Additional control flow is also needed to ensure that the global memory data are loaded only once and a synchronization is necessary to eliminate potential data races. Third, although the register file has a large size and the lowest latency, it cannot be shared among warps. In order to take advantage of the register file, we need to first compact multiple warps/threads into a single warp/thread, and then promote shared/global memory variables into registers. This way, the register variables after thread compaction can be shared among different threads before compaction. Such thread compaction is also referred to as thread merge [26] and thread coarsening [11]. Compared to the prior works on thread merge/coarsening/fusion [26][15][22], our approach specifically leverages this optimization technique for register tiling, i.e., use register reuse to eliminate redundant shared/global memory accesses. A key question for such register tiling is how many threads to be compacted so as to maximize register reuse while restricting the register pressure on TLP. We introduce the compaction factor C_Factor in our compiler algorithm to determine the most profitable version of data placement using automatic tuning.

The optimized code after compaction is shown in Figure 4b. The number of original threads/warps to be compacted is defined as a run-time constant, C_Factor. First, the thread block dimension is adjusted from <16,16> to <16,16/C_Factor>. Second, the global memory read accesses on line 6 of Figure 4a are replaced with a single global memory access on line 6 of Figure 4b, which loads the data from global memory to the register variable 'tmp_1'. Third, since multiple threads/warps of Figure 4a are compacted into a single thread/warp in Figure 4b, we can reuse the register 'tmp_1' as shown in line 11. Similarly, the memory access of 'c_cuda' under the conditional statement (line 8 of Figure 4a) can be processed in the same way by introducing another register 'tmp_2' as shown in Figure 4b. The if statement in line 7 of Figure 4a sometimes may also need to be replicated to guard this 'c_cuda' access to avoid potential out-of-bound accesses.

Figure 4. A code segment of the benchmark SRAD. (a) The global memory version, (b) The register version. The auto-tuning process to determine (1) how many variables should be promoted and (2) whether they are promoted into local memory or global memory, so as to achieve optimal data placement in balancing trade-offs between TLP and network/memory pressure.

3.1.3 Pattern 3: Promote variables from shared memory / global memory into registers to achieve register tiling

A common side effect of single-program multiple-data (SPMD) parallelization is redundant computations and memory accesses. In GPU kernels, there often exist redundant accesses to either shared memory data or global memory data across different threads. This redundant shared/global memory reference can be promoted into register usage to further save bandwidth.

We use the benchmark SRAD as an example to illustrate this behaviour. Figure 4a shows a code segment from the SRAD kernel code. The TB dimension of the SRAD kernel is configured as <16,16>, i.e., 256 threads per TB. Therefore, tx (i.e., threadIdx.x) ranges from 0 to 15 for all the warps in a TB; and ty (i.e., threadIdx.y) will be 0~1 for the first warp, 2~3 for the second warp, and so on. Before computation, a tile of data will be loaded from the global memory array 'c_cuda' into the shared memory array 'south_c' as shown in line 8 of Figure 4a. We can see that the index variable 'index_s' is dependent on tx, bx (i.e., blockIdx.x) and by (i.e., blockIdx.y) but not on ty. It means that when the 8 warps of a TB actually load the same block of global memory data, there are 7 redundant global memory accesses in each TB since all the warps share the same tx, bx, and by, i.e., the same memory reference index.

All three types of on-chip memory can be potentially used to reduce the overhead of such redundant global memory accesses across warps. First, the L1 D-cache is utilized implicitly when redundant global memory accesses hit in the L1 D-cache but such data reuse cannot be assured as the data may be evicted by other data requests. Second, we can choose to let only the first warp load the data into shared memory, and other warps then access the data from shared memory. However, this way incurs overhead due to operations moving data from/into register into/from shared memory [14]. Additional control flow is also needed to ensure that the global memory data are loaded only once and a synchronization is necessary to eliminate potential data races. Third, although the register file has a large size and the lowest latency, it cannot be shared among warps. In order to take advantage of the register file, we need to first compact multiple warps/threads into a single warp/thread, and then promote shared/global memory variables into registers. This way, the register variables after thread compaction can be shared among different threads before compaction. Such thread compaction is also referred to as thread merge [26] and thread coarsening [11]. Compared to the prior works on thread merge/coarsening/fusion [26][15][22], our approach specifically leverages this optimization technique for register tiling, i.e., use register reuse to eliminate redundant shared/global memory accesses. A key question for such register tiling is how many threads to be compacted so as to maximize register reuse while restricting the register pressure on TLP. We introduce the compaction factor C_Factor in our compiler algorithm to determine the most profitable version of data placement using automatic tuning.

The optimized code after compaction is shown in Figure 4b. The number of original threads/warps to be compacted is defined as a run-time constant, C_Factor. First, the thread block dimension is adjusted from <16,16> to <16,16/C_Factor>. Second, the global memory read accesses on line 6 of Figure 4a are replaced with a single global memory access on line 6 of Figure 4b, which loads the data from global memory to the register variable 'tmp_1'. Third, since multiple threads/warps of Figure 4a are compacted into a single thread/warp in Figure 4b, we can reuse the register 'tmp_1' as shown in line 11. Similarly, the memory access of 'c_cuda' under the conditional statement (line 8 of Figure 4a) can be processed in the same way by introducing another register 'tmp_2' as shown in Figure 4b. The if statement in line 7 of Figure 4a sometimes may also need to be replicated to guard this 'c_cuda' access to avoid potential out-of-bound accesses.

```c
__global__ void srad_kernel(int *c_cuda) {
    int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx +
                 + cols * BLOCK_SIZE + tx; // BLOCK_SIZE = 16;
    __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
    ....
    south_c[ty][tx] = c_cuda[index_s]
    if (by == gridDim.y - 1)
        south_c[ty][tx] = c_cuda[cols * BLOCK_SIZE *
                           (gridDim.y - 1) + BLOCK_SIZE * bx +
                           cols * ( BLOCK_SIZE - 1 ) + tx];
    ...
    __syncthreads();
}
...

__global__ void srad_kernel(int *c_cuda) {
    int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx +
                 + cols * BLOCK_SIZE + tx; // BLOCK_SIZE = 16;
    __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
    ....
    int tmp_1 = c_cuda[index_s];
    if (by == gridDim.y - 1)
        tmp_2 = c_cuda[cols * BLOCK_SIZE * (gridDim.y - 1) +
                       + BLOCK_SIZE * bx + cols * ( BLOCK_SIZE - 1 ) + tx];
    ...
    __syncthreads();
    ...
```

(a)
3.2 Compiler Algorithms and Implementation

Although the data placement patterns discussed in Section 3.1 can be used to guide programmers to manually optimize their GPU programs, it will quickly become un-manageable if a non-trivial number of data variables are to be analyzed. In this section, we present our source-to-source compiler framework which implements these three data placement patterns using an automatic compiler optimization algorithm. The goal of the compiler algorithm is to generate the code which utilizes on-chip resource efficiently without effort from application developers. The key feature is that the compiler framework can intelligently re-assign the memory types of variables of a GPU program to maximize the benefit of on-chip resources. Our compiler algorithm has two passes: one for data placement pattern 1 and pattern 2 and the other one for data placement pattern 3.

Either compiler pass has three stages: the identifying stage, the processing stage, and the auto-tuning stage, as detailed in Figures 5 and 6. The identifying stage will scan all the memory variables, and generate a list of candidate variables which can be promoted by collecting the architecture features and analyzing the memory accesses of the target kernel. The processing stage implements the data placement patterns by revising the data types and their access indices of these candidate variables. The auto-tuning stage constructs the search spaces, decides which variables to be processed and selects the optimal code versions.

3.2.1 Compiler pass 1

The algorithm of the compiler pass for promoting shared memory variables to register files/local memory/global memory is shown in Figure 5. The identifying stage (line 5–15) collects all shared memory variables through their ‘__shared__’ keyword. For shared memory variables, we mark an access as a combination of the array name and the access index. The compiler checks access indices to determine (a) whether an access is across different threads or private to a single thread, and (b) whether an index has to be determined at the runtime. Meanwhile, the reference count of the variable is also recorded. If an access is inside a loop, we weight this access number by timing a loop count in line 10. In some cases, the loop count in a one-level loop or multiple loop counts in nested loops may associate with a run-time value, leading to some unknown reference counts. In such cases, we resort to either profiling or simple heuristics (Section 3.2.4).

For all the candidate variables in arrays, the processing stage (line 18 ~24) applies data placement patterns by first selecting the shared memory variable with the largest reference counts. Then, if a shared memory variable is not shared across threads and is not accessed with run-time determined indices, it is promoted to the register file. Otherwise, it is replaced with a global memory variable if the code which utilizes on-chip resource efficiently without effort from application developers.

3.2.2 Compiler pass 2

The second compiler pass implements the third data placement pattern, i.e. promoting redundant shared/global memory accesses into register accesses, as shown in Figure 6. In the identifying stage (line 5–13), the compiler analyzes each shared or global memory array. It checks whether an array index is independent upon the thread id in either the X or Y dimension. If it is independent upon both dimensions, it sets the flag is_redundant_2d. Otherwise, if it is independent upon one direction, it sets the flag is_redundant_1d. During each index check, the compiler also inserts the expressions associated with the index into the exprs list, which will be used in the processing stage. After the identification stage, it outputs exprs, the list of candidate expressions that exhibit data access redundancy, and the corresponding flags that indicates the type of redundancy type, i.e., one-dimension or two-dimension.

In the processing stage (line 16–27), the compiler first adjusts the thread block dimension for each different
Figure 6. The compiler algorithm to promote shared or global memory to registers to be shared among threads.

```c
Kernel shared_or_global_access_to_register (Kernel kernel) {
  Kernel best_kernel = kernel;
  float exe_time = eval(kernel);
  /**<Identification Stage**/
  List exprs;
  bool is_redundant_1d = false, is_redundant_2d = false;
  for (each shared/global memory array smo in kernel) {
    for (each access acc of array smo in expression expr) {
      if (acc is independent of one thread dimension)
        is_redundant_1d = true, exprs.add (expr);
      else if (is_redundant_1d && acc is independent of the other
                thread dimension in expression expr)
        is_redundant_2d = true, exprs.add (expr);
    }
    for (each_C_Factor_in_search_spaces) {
      /**<Processing Stage**/
      Adjust Thread Block Dimension.
      if(is_redundant_1d) {
        construct a one-loop with loop bound C_Factor to
        perform the workload for compacted threads
        convert expr in exprs to from inter-thread memory usage
        into register array.
      } else if(is_redundant_2d) {
        construct an 2-level loop with loop bound C_Factor . x,
        and C_Factor . y to perform the workload
        for compacted threads
        convert expr in exprs to from inter-thread memory
        usage into register array usage;
      }
      /**<Auto-tuning Stage**/
      generate a new kernel nkernel from best_kernel;
      if(exe_time1< exe_time) { // the new kernel is better
        best_kernel = nkernel;
        exe_time = exe_time1;
      } else
        return best_kernel; } // end for
```
condition. If the loop bound and the condition can only be determined at run-time, we choose to either let the user to provide such information through profiling or use the following simple heuristics. We assume that for a nested loop in a kernel, each level has a loop count of 4 and the condition is true half of the times. The reason for such a default loop count is that our observation from the benchmarks shows that when a nested loop is parallelized into GPU threads, the levels with large loop counts are used to generate thread grids and the thread body typically contains loops with smaller counts. Lastly, the preprocessor collects the data structure declaration and annotate array accesses with the data type. For the vector data type such as int2, float4, the memory access index is processed the same as the scalar data type. For the struct type, the array index and the addresses of its elements are identified separately.

4. Experimental Methodology

We implemented our compiler algorithms using Cetus [13], a source-to-source compiler infrastructure for C programs. The CUDA syntax support is ported from MCUDA [21].

Table 2. Parameters used in experiments.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>GTX480</th>
<th>GTX680</th>
<th>K20c</th>
</tr>
</thead>
<tbody>
<tr>
<td>&lt;Shared memory size, L1 D-cache size&gt;</td>
<td>16kB, 48kB, 8kB</td>
<td>16kB, 48kB, 8kB</td>
<td>16kB, 48kB, 8kB</td>
</tr>
<tr>
<td>Register file size</td>
<td>128kB</td>
<td>256kB</td>
<td>256kB</td>
</tr>
<tr>
<td>Max number of threads per SM</td>
<td>512</td>
<td>1024</td>
<td>1536</td>
</tr>
<tr>
<td>Max number of registers per thread</td>
<td>64</td>
<td>64</td>
<td>256</td>
</tr>
<tr>
<td>Computation Factor</td>
<td>2.4,8,16</td>
<td>2.4,8,16</td>
<td>2.4,8,16</td>
</tr>
</tbody>
</table>

To evaluate our proposed compiler optimizations, we perform our experiments on Nvidia GTX480 (FERMI) GPUs, GTX680 (KEPLER) GPUs, and Telsa K20c GPUs. The parameters are presented in Table 2.

Most of the benchmark kernels used in our experiments are from Rodinia [4] and CUDA SDK [19] since they have already been manually optimized. Among them, HotSpot, Back Propagation, SRAD, Pathfinder, B+tree, LU Decomposition are from the latest Rodinia suite. Matrix Multiplication and Marching Cubes are from the CUDA SDK. NQU is from GPGPU-sim benchmark suite [2]. As Back Propagation, SRAD and B+tree, contain two GPU kernels, we use BackPropagation1, BackPropagation2, SRAD1, SRAD2, B+tree1, B+tree2 to differentiate them. In Table 3, from left to right, we show the benchmark name, the input, as well as the resource usage including the number of registers per thread and the size of shared memory (bytes) per SM on GTX 480, GTX 680, and Telsa K20c, respectively. We use the default input released with the code. For each benchmark, the shared memory usage is the same for different GPUs because it is determined by programmer’s explicit definition. The register usage is statically allocated and the maximum available registers per thread vary on different GPUs.

5. Experimental Results

In our first experiment, we measure the execution time of both the original kernel and the optimized kernel generated from our compiler algorithm on GTX480, GTX680 and Telsa K20c separately. On each GPU, we tried all different shared memory/L1 D-cache configurations and selected the one with the best performance for the original kernels. Also, for each optimized kernel, the compiler will generate the best data placement to accommodate the specific architecture so as to achieve optimization portability. Each benchmark has been run one-hundred times to obtain the stable execution times. Figure 7 shows performance comparisons between original kernels and our optimized ones across different GPUs.

![Figure 7. Performance speedups achieved by automatic data placement for all benchmarks on three different GPUs.](image-url)
exchanges among threads, thus these shared memory variables are candidates for register promotion. In NQU, there are five shared memory variables and four of them are promoted, leading to a high performance speed-up of 3.3x on GTX680. For PF, _syncthreads() can be safely removed. However, even though it is not removed, the optimized code (e.g., on GTX480) can still achieve 7% performance improvement. For MC, shared memory variables holding two on-chip working sets can be promoted into local memory arrays so as to remove the resources limitation on the number of concurrent TBs, thereby achieving higher performance. Overall, using the geometric mean as an average, the kernels optimized on GTX480 can achieve up to 4.14x speedup and an average of 1.76x speedup compared to the original benchmarks, up to 3.30x speedup on GTX680, and up to 2.44x speedup and an average of 1.48x speedup on K20c.

In our second experiment, we first breakdown the effectiveness of each placement pattern. Figure 8a and 8b shows the benchmarks that can be applicable to compiler pass 1 and pass 2. Among them, only HS benefits from both pattern 1 and pattern 3 (the total improvement of 64.2%: breakdown into 4.8% from pattern 1 and 59.4% from pattern 3), while other benchmarks only benefit from one in three patterns: MC benefits from pattern 2; PF, NQU benefit from pattern 1; and others benefit from pattern 3.

Figure 8. Auto-tuning of our automatic data-placement for all benchmarks on GTX680 (Performance normalized to original kernel).

We further evaluate the effectiveness of our auto-tuning process for each benchmark. As shown in Figure 8a, the benchmarks NQU, PF, HS and MC benefit from promoting shared memory arrays into register/local/global memory. The search space is how many shared memory variables can be promoted into registers or L1 D-cache using our compiler pass 1 in Section 3.2.1. For all the cases, promoting more shared memory variables into registers or L1 D-cache will lead to higher performance. For the benchmark kernels benefiting from reduced redundant shared/global memory accesses, Figure 8b shows the impact of the search parameter C_Factor in our compiler pass 2 in Section 3.2.2. From Figure 8b, we can see that the best C_Factor varies across different benchmarks. For SR1, the best performing version is achieved when C_Factor is 16. However for BP1, the best performing one is obtained when C_Factor is 2, and further increasing C_Factor to 4 degrades the performance as it reduces the number of active warps in a thread block. Such reduced TLP subsequently degrades the latency hidden ability for off-chip memory accesses, offsetting the profit from reducing redundant accesses. Therefore, auto-tuning is stopped when such a performance drop is observed. We can see that if C_Factor is increased to 8 for BP1, the performance will degrade even more. This validates the effectiveness of our auto-tuning policy, which searches C_Factor in an incremental manner. The same scenario has also been observed in the compiler pass 1 from Figure 8a when searching for the appropriate shared memory variables to be promoted in MC.

Figure 9. The optimal parameter, the number of shared memory array to be promoted and the C-Factor, determined for different GPUs.

Third, in Figure 9, we present the optimal parameters determined by our auto-tuning process on the different GPUs. For PF, NQU and MC, the y-axis means how many variables should be promoted while for others, the y-axis denote the optimal C_Factor values on different GPUs. Our compiler can intelligently generate the optimized kernel for specific architecture to achieve optimization portability. We can see that the different architecture features of these GPUs lead to different optimal parameters. For example, NQU achieves best performance when its four shared memory variables are promoted on

### Table 4. The auto-tuning time on GTX 680

<table>
<thead>
<tr>
<th></th>
<th>Original search space</th>
<th>Pruned search space</th>
<th>Auto-tuning time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>HS</td>
<td>48</td>
<td>8</td>
<td>42.873</td>
</tr>
<tr>
<td>BP1</td>
<td>16</td>
<td>3</td>
<td>11.361</td>
</tr>
<tr>
<td>BP2</td>
<td>16</td>
<td>4</td>
<td>15.755</td>
</tr>
<tr>
<td>SR1</td>
<td>16</td>
<td>5</td>
<td>24.133</td>
</tr>
<tr>
<td>SR2</td>
<td>16</td>
<td>5</td>
<td>21.941</td>
</tr>
<tr>
<td>MM</td>
<td>32</td>
<td>5</td>
<td>210.876</td>
</tr>
<tr>
<td>PF</td>
<td>1</td>
<td>1</td>
<td>8.88</td>
</tr>
<tr>
<td>NQU</td>
<td>45</td>
<td>12</td>
<td>48.124</td>
</tr>
<tr>
<td>MC</td>
<td>9</td>
<td>6</td>
<td>23.986</td>
</tr>
<tr>
<td>BT1</td>
<td>3</td>
<td>3</td>
<td>12.183</td>
</tr>
<tr>
<td>BT2</td>
<td>3</td>
<td>3</td>
<td>14.343</td>
</tr>
<tr>
<td>LUD</td>
<td>16</td>
<td>4</td>
<td>129.531</td>
</tr>
</tbody>
</table>
Fourth, our auto-tuning process has a low overhead on searching the optimal parameters. We report the cost of the auto-tuning function in Table 4. From the left to right, we report the search space, i.e., the number of all possible values to be tried, in the original search if there is no pruning strategy in searching, the search space after applying our pruning strategies in our compiler passes, and the total execution time of our auto-tuning process for generating the optimal kernel for each benchmark. We can see that the search space is reduced significantly by our pruning strategy. We also validated that the optimized one from our pruned space can achieve the same performance as the one from the original search space.

Finally, besides the kernel code itself, we also consider how the problem input of a workload affects our proposed optimization process. For our first compiler pass, the shared memory array sizes are fixed with constants or macro variables which are independent of input sizes. The reason is that the benchmark code has been already optimized to process the inputs as tiled working sets. For the second pass, the input size will impact on the number of thread blocks in a grid and each thread block usually has a pre-defined size to work on a tile of input elements. Thus, the variation of the input size will not affect the steps of our compiler analysis and optimizations. Provided that the performance is in general correlated with the input size, our performance improvement will higher when the problem size is larger. Because the larger inputs will often lead to more frequent on-chip memory resource accesses to process them and our optimized kernel will in turn benefit more from the optimized access patterns. Figure 10 shows the effect of increased input size, from 8K Voxels to 512K Voxels, on Marching Cubes. As the input problem size increases, the performance improvement of our optimized kernel from compiler also increases from 1.179x to 1.446x.

### 6. Related work

In recent years, GPUs have been widely used for general-purpose computation due to their high computational throughput. However, achieving high performance on GPUs is not easy, and one of reasons is the intricate on-chip memory resources. Among on-chip resources, shared memory is controlled by users, and many highly optimized applications or algorithms on GPUs utilize shared memory carefully [12][23][24][27] so as to enjoy the low-access access latency and high bandwidth. Besides them, [12] analyses the upper performance bound of SGEMM on GPUs and optimizes the kernel through register blocking by reusing data in registers as much as possible for maximal throughput. However, none of these works considers the overhead of intensive usage of shared memory and the impacts of varying on-chip resources across different GPU generations.

To relieve the burden of optimizing GPU programs from the programmers, many auto-tuning frameworks [15][16][22][25][26] have been developed to automatically optimize the GPU programs to achieve high performance. For example, a polyhedral model is used in [16] for optimizing global memory accesses. In [27], the shared memory is time multiplexed to reduce the pressure on limited shared memory capacity. In [25], language and compiler support are proposed to leverage nested parallelism inside the GPU programs. However, most of these works focuses on optimizing memory accesses and managing thread-level parallelism using compiler techniques. Management of different types of on-chip memory, especially the varying on-chip memory across different GPU generations, has not been the focus. To the best of our knowledge, our work is the first compiler algorithm to automatically optimize data placement across different on-chip memory resources in a systematic way.

We also observed that vendor’s compiler may promote the variables in shared memory to register file. The way to avoid such an optimization is to use the ‘`volatile`’ keyword when declaring a shared memory array. However, as we verified from the assembly codes, we found that the vendor’s compiler does not apply such optimizations on the benchmarks used in our work.

Current studies on on-chip memory resources mainly focus on identifying resources limitation and boosting the performance by improving architecture design [6][7] or compiler support [9][27]. On-chip data cache may lead to cache contention and [9] proposes a compiler algorithm to automatically turn on/off the D-cache by predicting how cache will affect the performance. The register usage pattern is studied in [6] and the register file accesses are reduced by proposing a register file cache. However, these works target on optimizing one specific resource to conquer their limitations instead of balancing on-chip resources.

The trade-offs between software-managed shared memory and hardware-managed D-cache on GPUs have been studied in [14]. Gebhart et al. [7] made the observation that different applications have different needs for various memory resources. They proposed unified local memory that can dynamically change the partition among registers, cache, and shared memory according to each application’s needs. Hayes and Zhang [8] proposed unified on-chip memory allocation which uses shared memory to
offload register pressure. In comparison, our work focuses on re-assigning data across all on-chip memory resources.

7. Conclusions

Judicious utilization of the on-chip memory resources has a significant impact on application performance. However, how to manage these intricate on-chip memory resources is non-trivial for application developers. More importantly, the varying on-chip resource across different GPU generations makes performance portability a daunting challenge. In this paper, we propose compiler-driven automatic data placement as our solution. We focus on GPGPU programs that have already been reasonably optimized either manually by programmers or automatically by existing compiler tools. Our proposed compiler algorithms refine these programs by altering data placement among different on-chip resources to achieve both performance enhancement and performance portability. In particular, we leverage three data placement patterns. First, we explore shared memory variables to promote them into registers. Second, we explore the opportunities to utilize the L1 D-cache by promoting variables from shared memory into global/local memory if shared memory is a resource bottleneck. Third, we eliminate redundant shared/global memory accesses across different threads. To achieve performance portability, our compiler performs auto-tuning on different GPUs to achieve optimal performance. Among the benchmarks in our study, our proposed compiler algorithms significantly improve the performance by up to 4.14x and 1.76x on average on Nvidia GTX480 (i.e., FERMI) GPUs, and by up to 3.30x and 1.61x on average on GTX680 (i.e., KEPLER) GPUs, and up to 2.44x speedup and an average of 1.48x speedup on K20c GPUs. Our compiler-optimized kernel can also save up to 74.3% energy and save an average of 40.3% energy overall measured on GTX680 GPUs.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their insightful comments to improve our paper. This work is supported by an NSF project 1216569 and a gift fund from AMD Inc.

References