U-Net Cost Analysis Using Roofline Model

One of the main challenges affecting U-Net performance is the design of its components and the choice of a suitable hardware device for training on labelled datasets. Convolution is the operation with the highest computation and memory costs, so these costs need to be minimized. One suitable option is to change the type of convolution. Other suggested solutions are to reduce the image size, the number of bits, the number of filters and the batch size, or to adjust the stride value. In this paper, the roofline model is used as a performance guide for analyzing the FLOPs and memory-bandwidth bounds of a U-Net model under different configurations. The cost is assessed against the limits of three computing devices: the MX230, 940MX and RTX2060 SUPER GPUs. A dataset of 128 × 128 images is used for the U-Net cost-performance evaluation. The evaluation results show that the solution that balances memory and computation is to implement the U-Net model in parallel on the RTX2060 SUPER card with a batch size of 16, an image size of 128 × 128, 32-bit data and shared memory management.


INTRODUCTION
The ability of deep learning models to produce accurate estimates in a given application depends on the size of the data and on the type of algorithm or model. However, many factors limit the performance of an implementation, so implementation details must be taken into account, especially when the hardware has limited resources. An algorithm's performance depends on how it is implemented and on optimizing its components for the least complexity. Implementation here means running the algorithm on suitable high-performance computing (HPC) hardware. The floating-point throughput of the HPC device and the memory bandwidth of its DRAM determine the performance cost and depend on the manufacturer's specifications. However, these specifications alone may not show what performance the user can actually achieve.
The U-Net architecture [1], a convolutional neural network (CNN), is one example that has been used in many applications with different configurations [2]. The basic structure of U-Net is composed of several types of layers with different operations, but convolution remains the core operation. A complex U-Net design and a large labelled dataset improve accuracy but demand more memory, while a small model has the opposite effect. Implementing U-Net variants on general-purpose processors (CPUs) may therefore be inappropriate because of performance bottlenecks caused by their limited parallelism. Today, many hardware options give users flexibility for training and testing. These options include GPUs [3][4], TPUs [5][6], FPGAs [7][8], heterogeneous systems [9] and servers [10].
There are other solutions, but they depend on the user's own efforts and proposals. Different frameworks based on parallelism strategies have been introduced to tackle memory limitations, such as TensorFlow Large Model Support (TFLMS) and Mesh-TensorFlow [5][11].
Other attempts are based on reducing the number of bits and quantizing the model. A helpful solution is to change the type of convolution, for example by employing depthwise separable convolution as an alternative to standard convolution [12]. Another suggestion is to perform the convolution with a stride greater than 1 [13]. The roofline model is a useful tool that has been applied to different architectures in several studies [14][15][16][17][18][19].
In this paper, the performance cost of the U-Net architecture is modeled with different parameters using the roofline model. The roofline model is employed as an efficient analysis approach for avoiding errors before the network models are developed. It can help find an appropriate balance between different network configurations and select a suitable hardware environment.
The rest of this paper is organized as follows: section 2 explains the roofline model, section 3 describes the U-Net structure, and section 4 analyzes the U-Net cost with different configurations. Finally, section 5 summarizes the main conclusions and presents future work.

ROOFLINE MODEL
The roofline model is a theoretical step carried out before programming to avoid errors that may appear at runtime, such as out-of-memory (OOM) errors. It is an abstract architectural model that determines the bounds on performance throughput from the peak performance and the peak bandwidth. The model is usually presented as a logarithmic plot of performance versus arithmetic intensity (AI), as illustrated in figure (1) [14]. FLOPs/s is a useful measurement because the higher it is, the faster the algorithm can process data. The achieved FLOPs/s depends on the hardware, the algorithm and the implementation. The peak bandwidth, on the other hand, is the fastest rate at which the processor can load data and is measured in bytes/second. Arithmetic intensity (AI) is the number of operations performed per byte loaded from or stored to memory.

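$$\mathrm{AI} = \frac{\mathrm{FLOPs}}{\text{Read bytes} + \text{Write bytes}} \qquad (1)$$
For any platform, the AI at which its compute and bandwidth limits meet is referred to as AImax. Its value depends on the platform's limits, namely the peak performance divided by the peak bandwidth.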
Roofline curves help to better understand how an application behaves on a given architecture.
The maximum theoretical performance P of an algorithm/model under the roofline model is determined according to equation (2):
$$P = \min(\text{peak performance},\ \mathrm{AI} \times \text{peak bandwidth}) \qquad (2)$$
When the AI of an application is greater than AImax, the maximum theoretical performance P is limited by the computational throughput (FLOPs/s), i.e. the application is compute bound. Otherwise, P is memory bound and equals the AI multiplied by the peak bandwidth. The best use of the platform resources is achieved at the ridge point, where the AI is equal to AImax.
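As an illustration of equations (1) and (2), the following minimal Python sketch evaluates the roofline bound for a few arithmetic intensities. The device figures (1 TFLOP/s, 50 GB/s) are hypothetical placeholders chosen for round numbers, not the specifications of the cards in table 1, and the function names are introduced here only for illustration.

```python
def arithmetic_intensity(flops, read_bytes, write_bytes):
    """Equation (1): FLOPs per byte moved to or from memory."""
    return flops / (read_bytes + write_bytes)


def attainable_performance(ai, peak_flops, peak_bandwidth):
    """Equation (2): P = min(peak performance, AI x peak bandwidth)."""
    return min(peak_flops, ai * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """AImax: the intensity at which the compute and memory roofs meet."""
    return peak_flops / peak_bandwidth


# Hypothetical device, NOT one of the cards listed in table 1.
PEAK_FLOPS = 1.0e12      # FLOP/s
PEAK_BANDWIDTH = 50.0e9  # bytes/s

ai_max = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
for ai in (5.0, 20.0, 80.0):
    p = attainable_performance(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    bound = "memory" if ai < ai_max else "compute"
    print(f"AI={ai:5.1f} FLOPs/byte -> P={p:.2e} FLOP/s ({bound} bound)")
```

Substituting the peak performance and bandwidth of a real card would place the ridge point for that card; layers whose AI falls to the left of it are memory bound.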

U-NET MODEL
In this section, a typical forward pass of the U-Net model is implemented. The U-Net model consists of an encoder and a decoder, which are connected via short-long connections to form a U shape, as shown in figure (2). Both the encoder and the decoder have repeated levels of blocks, where each block includes two successive 3×3 "same" convolutions with rectified linear unit (ReLU) activation functions. In each level of the encoder, the spatial size is halved by a 2×2 max-pooling operation while the number of filters is doubled. The reverse happens in the decoder, where the spatial size is doubled using nearest-neighbor upsampling and the number of filters is halved. The final layer applies a 1×1 convolution and a sigmoid activation function to each 64-component feature vector, mapping it to four classes.
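The halving and doubling pattern described above can be traced with a short Python sketch that lists the feature-map shape at each level. The number of levels (4) and the starting values (128×128 input, 32 filters) are assumptions made for illustration; the function name is hypothetical.

```python
def unet_feature_shapes(image_size=128, base_filters=32, levels=4):
    """Return (stage, height, width, channels) along one forward pass.

    `levels` is the number of 2x2 max-pool steps; 4 is an assumption.
    """
    shapes = []
    size, filters = image_size, base_filters
    # Encoder: 'same' convolutions keep H x W, each max-pool halves it,
    # and the number of filters doubles at the next level.
    for level in range(levels):
        shapes.append((f"enc{level}", size, size, filters))
        size //= 2
        filters *= 2
    shapes.append(("bottleneck", size, size, filters))
    # Decoder: upsampling doubles H x W and the number of filters halves.
    for level in reversed(range(levels)):
        size *= 2
        filters //= 2
        shapes.append((f"dec{level}", size, size, filters))
    return shapes


for stage, h, w, c in unet_feature_shapes():
    print(f"{stage:10s} {h:4d} x {w:4d} x {c:4d}")
```

Listing the shapes this way makes it easy to see which levels carry the largest activations (the outermost ones) and which carry the most weights (the innermost ones).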

PREDICTION RESULTS
This section predicts the computational performance and memory bottlenecks of the U-Net model. U-Net is configured with 32 filters in the first level and an image size of 128×128 to predict four classes.
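For a single convolution layer in this configuration, the FLOPs and memory traffic can be estimated with a sketch like the one below. It assumes "same" padding, counts one multiply-accumulate as two FLOPs, ignores biases, and assumes that the input, the weights and the output each cross the memory interface exactly once. These counting conventions are assumptions and may differ from those used for the figures in this paper.

```python
import math


def conv2d_cost(h, w, c_in, c_out, k=3, stride=1, batch=1, bits=32):
    """Return (FLOPs, bytes moved) for one 'same' convolution layer."""
    h_out, w_out = math.ceil(h / stride), math.ceil(w / stride)
    macs = batch * h_out * w_out * c_out * k * k * c_in
    flops = 2 * macs  # one multiply and one add per MAC
    elements = (batch * h * w * c_in               # input read
                + k * k * c_in * c_out             # weights read
                + batch * h_out * w_out * c_out)   # output write
    return flops, elements * bits // 8


flops, mem_bytes = conv2d_cost(128, 128, 32, 32)
print(f"FLOPs={flops}, bytes={mem_bytes}, AI={flops / mem_bytes:.2f}")
```

A stride greater than 1 reduces both the output size and the FLOPs in this model, which is one of the tuning options mentioned in the introduction.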
The cost is measured and analyzed against the three graphics processors of the three machines specified in table 1. The FLOPs and total memory accesses of the different layer types are shown in figures (3) and (4). One observation to take into account is that the convolution layers have the largest FLOPs and memory costs compared with the other operations. This is due to the additions and multiplications over a large number of parameters. A log scale is used to reduce the large disparity between the cost of the convolution layers and that of the remaining layers. Figure (5) shows the AI per layer compared with the graphics cards used. Layers with a smaller AI lie close to the memory bound, while layers with a larger AI are compute bound. Layers with small values are neglected when estimating the computation of the designed models. The observed issue is found at the convolution layer with input and output size 128×128×32.
Fig. 5 The arithmetic intensity for each layer of the U-Net
As is evident in the roofline model, the arithmetic intensity is a significant factor to be considered. However, a complex U-Net improves accuracy but demands more memory, while a small model has the opposite effect. Therefore, for an image size of 128×128 and 32 starting filters, several estimations are carried out, with tuning options and solutions to tackle the computation and memory limitations, as follows:

Impact of convolution type
Convolution is the essential linear operator of the convolutional layer. It extracts the important features of the input channels by repeatedly sliding a learnable filter over the input with a given stride, applying an elementwise multiply-accumulate between the filter and the corresponding input window, and generating the neurons that form the output feature maps, as illustrated in figure (6).

Fig. 6 Standard convolution (SC) application in convolutional layer
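The sliding-window multiply-accumulate of a standard convolution can be written out explicitly. The following pure-Python sketch is intended only to make the loop structure visible; it uses no padding ("valid" convolution) for brevity, whereas the U-Net blocks use "same" convolutions, and it is far too slow for real workloads. The function name and data layout are assumptions.

```python
def conv2d(x, w, stride=1):
    """Naive standard convolution without padding.

    x: nested lists indexed [H][W][C_in]
    w: nested lists indexed [K][K][C_in][C_out]
    """
    h, wd, c_in = len(x), len(x[0]), len(x[0][0])
    k, c_out = len(w), len(w[0][0][0])
    h_out = (h - k) // stride + 1
    w_out = (wd - k) // stride + 1
    y = [[[0.0] * c_out for _ in range(w_out)] for _ in range(h_out)]
    for i in range(h_out):              # slide the filter vertically
        for j in range(w_out):          # slide the filter horizontally
            for o in range(c_out):      # one output map per filter
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        for c in range(c_in):
                            acc += (x[i * stride + di][j * stride + dj][c]
                                    * w[di][dj][c][o])
                y[i][j][o] = acc
    return y


# 4x4 single-channel input of ones, one 3x3 filter of ones.
x = [[[1.0] for _ in range(4)] for _ in range(4)]
w = [[[[1.0]] for _ in range(3)] for _ in range(3)]
print(conv2d(x, w))
```

The six nested loops correspond directly to the factors in the usual multiply-accumulate count: output height, output width, output channels, filter height, filter width and input channels.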
To observe the performance (AI) of one layer when the convolution type is changed, the most important types of convolution [20] are listed in table (2). Figure (7) shows the arithmetic intensity obtained for an input layer (128×128×32) and output layer (128×128×32) with a batch size of 1. Evidently, the AI increases as the convolution type gets closer to pointwise convolution. According to table (2), the performance is dominated by memory bandwidth rather than by computational throughput. The roofline model maps the suggested networks only onto NVIDIA's MX230, 940MX and RTX2060 SUPER graphics cards, generating three results.
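To show how the convolution type changes the FLOPs and memory traffic of one layer, the sketch below compares standard, depthwise, pointwise and depthwise separable convolutions for the 128×128×32 layer. The choice of these four types and the counting conventions (two FLOPs per multiply-accumulate, no bias, each tensor moved once) are assumptions; they do not reproduce table (2), so the resulting values may differ from figure (7).

```python
def layer_cost(kind, h, w, c_in, c_out, k=3, bits=32):
    """Return (FLOPs, bytes moved) for one convolution of the given kind."""
    act_in, act_out = h * w * c_in, h * w * c_out
    if kind == "standard":
        macs = h * w * c_out * k * k * c_in
        weights = k * k * c_in * c_out
    elif kind == "depthwise":
        # One k x k filter per input channel; output keeps c_in channels.
        macs = h * w * c_in * k * k
        weights = k * k * c_in
        act_out = h * w * c_in
    elif kind == "pointwise":
        # 1 x 1 convolution that only mixes channels.
        macs = h * w * c_out * c_in
        weights = c_in * c_out
    else:
        raise ValueError(f"unknown convolution type: {kind}")
    return 2 * macs, (act_in + weights + act_out) * bits // 8


def separable_cost(h, w, c_in, c_out, k=3, bits=32):
    """Depthwise convolution followed by a pointwise convolution."""
    dw = layer_cost("depthwise", h, w, c_in, c_in, k, bits)
    pw = layer_cost("pointwise", h, w, c_in, c_out, k, bits)
    return dw[0] + pw[0], dw[1] + pw[1]


for kind in ("standard", "depthwise", "pointwise"):
    f, b = layer_cost(kind, 128, 128, 32, 32)
    print(f"{kind:10s} FLOPs={f:>11d} bytes={b:>8d} AI={f / b:6.2f}")
f, b = separable_cost(128, 128, 32, 32)
print(f"{'separable':10s} FLOPs={f:>11d} bytes={b:>8d} AI={f / b:6.2f}")
```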
The first result in figure (9) indicates that the networks are bounded more by the computing performance of the MX230 GPU architecture. The third result in figure (11) shows that the networks are compute bound. Therefore, the 940MX cannot be used. It is better to choose the machine with the RTX2060 SUPER processor, as it is superior to the MX230 processor in terms of specifications and performance, as shown in table (1).
Figure (12) shows the roofline model based on the accumulative arithmetic intensity when the batch size is adjusted from 1 to 16. It can be seen that the larger the batch size, the higher the accumulative arithmetic intensity, by a small percentage, which helps to avoid the memory limitation problem. Note also that the effect of changing the network structure on the accumulative arithmetic intensity is balanced, because both the number of computations and the number of memory accesses change. In this case, only the 940MX GPU is the best choice for implementing all the networks at all batch sizes, but at the cost of a longer run time. Thus, to reduce the time consumed, it is better to [...]
Figure (13) shows the roofline model based on the accumulative arithmetic intensity when the number of bits is changed from 8 to 64 by a factor of 2.
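One way to see why the batch size raises the accumulative arithmetic intensity only slightly is the sketch below. It sums FLOPs and bytes over a small, hypothetical subset of encoder convolutions and assumes that the weights are read once per batch while the activations scale with the batch size. The layer list and counting conventions are assumptions, not the configuration evaluated in figure (12).

```python
# Hypothetical subset of encoder convolutions: (spatial size, C_in, C_out).
LAYERS = [(128, 32, 32), (64, 32, 64), (64, 64, 64)]


def accumulative_ai(batch, layers=LAYERS, k=3, bits=32):
    """Total FLOPs divided by total bytes moved over all listed layers."""
    total_flops = 0
    total_bytes = 0
    for size, c_in, c_out in layers:
        macs = batch * size * size * c_out * k * k * c_in
        # Input read plus output write, both proportional to the batch size.
        activations = batch * size * size * (c_in + c_out)
        # Weights are assumed to be read once for the whole batch.
        weights = k * k * c_in * c_out
        total_flops += 2 * macs
        total_bytes += (activations + weights) * bits // 8
    return total_flops / total_bytes


for b in (1, 2, 4, 8, 16):
    print(f"batch={b:2d}  accumulative AI={accumulative_ai(b):.2f}")
```

In this model the gain comes only from amortizing the weight traffic over more images, which is small because the activations dominate the memory traffic of these layers.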

Impact of number of bits
It is noticed that the higher the number of bits, the lower the accumulative arithmetic intensity, so the networks approach the memory limits and not all processors can handle 64-bit data. It is therefore better to use 16 bits so that all networks can be implemented on all processors, but at the expense of accuracy. To increase the accuracy within a suitable time, 32 bits can be used with the networks.
Fig. 13 Impact of adjusting number of bits on the performance (accumulative arithmetic intensity)
Figure (14) shows the roofline model based on the accumulative arithmetic intensity when the image size is increased from 128×128 to 192×192. It is noticed that the larger the image size, the lower the accumulative arithmetic intensity, so the networks approach the memory limits and not all processors can handle an image size of 192×192. It is therefore better to use an image size of 128×128, although the accuracy of the networks may be affected depending on the network structure. To implement the networks quickly, the RTX2060 SUPER is the best processor because it has 2176 cores.
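The effect of the bit width can be illustrated with a short sketch: the FLOP count of a layer does not depend on the number of bits, while the bytes moved scale linearly with it, so doubling the bits halves the arithmetic intensity. The second helper only reports how the activation footprint of one feature map grows with the image size; it does not attempt to reproduce figure (14). Function names and counting conventions are assumptions.

```python
def conv_ai(size, c_in, c_out, bits, k=3):
    """Arithmetic intensity of one 'same' convolution at a given bit width."""
    flops = 2 * size * size * c_out * k * k * c_in
    elements = size * size * (c_in + c_out) + k * k * c_in * c_out
    return flops / (elements * bits / 8)


def activation_bytes(size, channels, bits):
    """Memory footprint of one size x size x channels feature map."""
    return size * size * channels * bits // 8


for bits in (8, 16, 32, 64):
    print(f"{bits:2d}-bit  AI={conv_ai(128, 32, 32, bits):7.2f}")

for size in (128, 192):
    print(f"{size}x{size}x32 at 32-bit: {activation_bytes(size, 32, 32)} bytes")
```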

Impact of CUDA memory management
Understanding the memory hierarchy provided by the CUDA architecture helps in optimizing GPU kernels and thus in alleviating the problem of limited memory resources. As shown in figure (15), the CUDA architecture contains different classes of memory: global, constant, shared, registers and local. Global memory is the simplest type of memory available on a GPU. It is the memory that the host usually accesses when transferring data to the device. Global memory provides the largest storage on a GPU, typically a few gigabytes, but it has the disadvantage of slow read and write operations because it is situated off-chip, away from the streaming multiprocessors.
Constant memory is a fast type of read-only memory. It is called "constant" because only host code can write to it. It is therefore useful when the kernel needs to access read-only data. This type of memory is much smaller than global memory, only 64 KB [1]. Although it is also off-chip, it can be accessed much faster than global memory because it is cached on the chip.
A small portion of shared memory (48-163 KB) is allocated to each thread block and can be read and written only by that block. Because shared memory is located on the chip, it is much faster to access, which makes it useful for storing intermediate values in the kernel or data that needs to be accessed repeatedly.
The GPU architecture also contains registers and local memory, which are private to each thread. Registers provide storage for variables or arrays defined by threads in the kernel and are the fastest to access, but they are limited in number. The CUDA compiler decides which data is placed in registers and which is placed in local memory. Local memory is called local because of its scope, not its location: it is off-chip, so it takes longer to access, but it is local to each thread because it can only be accessed by that thread [21].
The type of memory affects the implementation [22]. Figure (16) compares global, constant and shared memory, as the main memory types, for an input layer (128×128×32) and output layer (128×128×32) with a batch size of 1. The results show that the arithmetic intensity increases when moving towards shared memory, which improves memory management and moves away from the problem of limited memory.
Fig. 16 Impact of memory type on the performance (arithmetic intensity)

CONCLUSION
Accurate U-Net performance reflects the complexity of the design and the size of the labelled training data. It is therefore necessary to use a compatible platform for the implementation; consequently, the dataset distribution and batch size are governed by the achievable performance. In this paper, various structures of the convolutional layer were estimated. The roofline model was also exploited as a throughput performance model to reduce the time the developer spends running the U-Net architecture. The prediction results show that the RTX2060 SUPER platform is the most appropriate selection among the platforms used for achieving a good balance between FLOPs and memory bandwidth. For future work, the roofline model will be extended to take into account other modeling parameters and variables and their effects, e.g. on implementation time.