Document Type: Research Paper


1 Computer Engineering Department, College of Engineering, University of Mosul, Mosul, Iraq

2 Computer Engineering Department, College of Engineering, University of Mosul, Mosul, Iraq


One of the main challenges affecting the performance of the U-Net architecture is the design of its components and the choice of a suitable hardware computing device for training on labelled datasets. Convolution is the most demanding operation in terms of computation and memory cost, so this cost needs to be minimized. One suitable option is to change the type of convolution. Other suggested solutions are to reduce the image size, the number of bits, and the stride value, as well as the number of filters and the image batch size. In this paper, the roofline model is therefore used as a performance guide for analyzing the FLOPs and memory bandwidth bounds of a U-Net model under different configurations. The cost is assessed against the limits of three computing devices: the MX230, 940MX, and RTX 2060 Super GPUs. A dataset of 128 × 128 images was used in the U-Net cost-performance evaluation. The evaluation results show that the solution that balances memory and computation is to implement the U-Net model in parallel on the RTX 2060 Super card with a batch size of 16, an image size of 128 × 128, 32-bit data, and shared memory management.
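The roofline reasoning described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `conv2d_cost` and `roofline`, the "same" padding assumption, the simplification that every value is moved to or from memory exactly once, and the peak figures in the usage lines are all hypothetical placeholders, not measured values for any of the GPUs named in the abstract.

```python
def conv2d_cost(h, w, c_in, c_out, k, stride=1, batch=1, bytes_per_value=4):
    """Estimate (flops, bytes_moved) for one 2D convolution layer.

    Assumes 'same' padding and that inputs, weights, and outputs each
    cross the memory interface exactly once (no cache reuse modeled).
    """
    h_out = -(-h // stride)  # ceiling division
    w_out = -(-w // stride)
    macs = batch * h_out * w_out * c_out * c_in * k * k
    flops = 2 * macs  # one multiply and one add per MAC
    values = (
        batch * h * w * c_in                # input feature maps
        + k * k * c_in * c_out              # filter weights
        + batch * h_out * w_out * c_out     # output feature maps
    )
    return flops, values * bytes_per_value


def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Return (arithmetic intensity, attainable FLOP/s, limiting resource)."""
    intensity = flops / bytes_moved            # FLOPs per byte
    memory_roof = intensity * peak_bandwidth   # bandwidth-limited ceiling
    attainable = min(peak_flops, memory_roof)
    bound = "memory" if memory_roof < peak_flops else "compute"
    return intensity, attainable, bound


# Usage: a first U-Net style layer on 128 x 128 single-channel images,
# batch size 16, 32-bit values. The peak numbers are placeholders only.
flops, moved = conv2d_cost(128, 128, c_in=1, c_out=64, k=3,
                           batch=16, bytes_per_value=4)
intensity, attainable, bound = roofline(flops, moved,
                                        peak_flops=1e12,
                                        peak_bandwidth=1e11)
print(f"intensity={intensity:.2f} FLOP/byte, bound={bound}")
```

Under these assumptions, changing `bytes_per_value`, `batch`, `stride`, or `c_out` shifts the layer's arithmetic intensity, which is how the configuration choices listed in the abstract move a layer relative to a device's memory and compute ceilings.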


