The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development

Tyler Sorensen                      Alastair F. Donaldson
Imperial College London             Imperial College London
t.sorensen15@imperial.ac.uk         alastair.donaldson@imperial.ac.uk

ABSTRACT
One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted to execute across a variety of GPU platforms, via OpenCL. We report on the portability issues we encountered, where applications would execute successfully on one platform but fail on another. We classify issues into three groups: (1) framework bugs, where the vendor-provided OpenCL framework fails; (2) specification limitations, where the OpenCL specification is unclear and where different GPU platforms exhibit different behaviours; and (3) programming bugs, where non-portability arises due to the program exercising behaviours that are incorrect or undefined according to the OpenCL specification. The issues we encountered slowed the development process associated with our sets of applications, but we view the issues as providing exciting motivation for future testing and verification efforts to improve the state of OpenCL portability; we conclude with a discussion of these.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
IWOCL ’16, April 19-21, 2016, Vienna, Austria
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4338-1/16/04...$15.00
DOI: http://dx.doi.org/10.1145/2909437.2909440

[Figure 1 (image not captured). Pie chart: % of papers that evaluate OpenCL implementations from 1, 2, and 3 GPU vendors: 1 vendor 58% (29), 2 vendors 36% (18), 3 vendors 6% (3). Histogram: number of papers that evaluate an OpenCL GPU implementation from each vendor; bar values 39, 23, 8, 3, 1 (vendor labels not captured).]
Figure 1: The number of vendors whose OpenCL GPU implementations are evaluated in 50 recent papers listed at http://hgpu.org

chip           vendor   CUs   type         abbr.    OCL
GTX980         Nvidia   16    discrete     980      1.1
Quadro K5200   Nvidia   12    discrete     K5200    1.1
Iris 6100      Intel    47    integrated   6100     2.0
HD5500         Intel    24    integrated   5500     2.0
Radeon R9      AMD      28    discrete     R9       2.0
Radeon R7      AMD       8    integrated   R7       2.0
Mali-T628      ARM       4    integrated   T628-4   1.2
Mali-T628      ARM       2    integrated   T628-2   1.2
Table 1: The GPUs we consider, spanning designs from four vendors

benchmark   app. name   description                   GPU architecture          source language
Pannotia    p-sssp      single source shortest path   AMD Radeon HD 7000        OpenCL 1.0
Pannotia    p-mis       maximal independent set       AMD Radeon HD 7000        OpenCL 1.0
Pannotia    p-colour    graph colouring               AMD Radeon HD 7000        OpenCL 1.0
Pannotia    p-bc        betweenness centrality        AMD Radeon HD 7000        OpenCL 1.0
Lonestar    ls-mst      minimum spanning tree         Nvidia Kepler and Fermi   CUDA 7
Lonestar    ls-dmr      delaunay mesh refinement      Nvidia Kepler and Fermi   CUDA 7
Lonestar    ls-bfs      breadth first search          Nvidia Kepler and Fermi   CUDA 7
Lonestar    ls-sssp     single source shortest path   Nvidia Kepler and Fermi   CUDA 7
Table 2: The applications we consider

1. INTRODUCTION
Open Computing Language (OpenCL) is a general-purpose parallel programming model, designed to be implementable on a range of devices including CPUs, GPUs, and FPGAs [17]. Much like mainstream programming languages (e.g. C and Java), the OpenCL specification describes abstract semantics. Concrete platforms that support OpenCL are then responsible for providing a framework that successfully executes applications according to the abstract specification. This contract between programming model and platform enables portability; that is, a programmer can develop programs based on the specification and then execute the program on any platform that supports the programming model.

As discussed in Sec. 3, we focus on GPU platforms in this study. Many GPU vendors provide implementations of OpenCL for their respective platforms. In principle, this means that programs adhering to the OpenCL specification should be executable across these platforms. However, in our experience many GPU applications (especially in the research literature) target platforms from a single vendor. To quantify this claim, we manually examined the 50 most recent papers listed on the GPU aggregate website http://hgpu.org (retrieved 25 Jan. 2016) that feature evaluation of OpenCL applications on GPU platforms (we exclude papers that exclusively report results for CPUs and/or FPGAs). Our findings are summarised in Fig. 1. The pie chart shows that over half (58%) of the papers evaluated GPUs from one vendor only. Only three papers (6%) evaluated
on GPUs from three vendors, and no paper presented experiments from more than three vendors. The figure also shows a histogram counting the number of papers that conducted evaluation on a GPU from each vendor. Nvidia and AMD are by far the most popular, even though other major vendors (e.g. ARM, Imagination, Qualcomm) all provide OpenCL support for their GPUs. Our investigation suggests that insufficient effort has been put into assessing the guarantees of portability that OpenCL aims to provide.

In this paper, we discuss our experiences with porting and running several open source applications across eight GPUs spanning four vendors, detailed in Tab. 1. For each chip we give the full GPU name, vendor, number of compute units (CUs), specify whether the GPU is discrete or integrated, provide a short name that we use throughout the paper for brevity, and indicate which version of OpenCL the GPU supports (OCL). As Tab. 1 shows, we consider GPUs of different sizes (based on number of compute units), and consider both integrated and discrete chips. We also attempt to diversify the intra-vendor chips. For Nvidia the 980 and K5200 are from different Nvidia architectures (Maxwell and Kepler, respectively). For Intel the 6100 is part of the higher end Iris product line, while the 5500 is part of the consumer HD series. The applications we consider (which are summarised in Tab. 2) are taken from two benchmark suites, Pannotia [9] and Lonestar [8]. For each application we give the benchmark suite it is associated with, a short description, the GPU architecture family the application was evaluated on, and the original source language of the application. We describe the benchmark suites and our motivation for choosing these applications in more detail in Sec. 3.

This report serves to assess the current state of portability for OpenCL applications across a range of GPUs, by detailing the issues that blocked portability of the applications we studied. In this work, we consider semantic portability rather than performance portability; that is, the issues we document deal with the functional behaviour of applications rather than runtime performance. Prior work has examined and addressed the issue of performance portability for OpenCL programs on CPUs and GPUs (for example [25, 26, 2]); however, we encountered these issues when simply attempting to run the applications across GPUs, without any attempt to optimise runtime per platform. We report on these semantic portability issues in detail, classifying them into three main categories:

    • Framework bugs, where a vendor-provided OpenCL implementation behaves incorrectly according to the OpenCL specification.

    • Specification limitations, where the OpenCL specification is unclear and where different GPU implementations exhibit different behaviours.

    • Program bugs, where the original program contains a bug that we observe to be dormant when the program is executed on the originally-targeted platform, but which appears when the program is executed on different platforms.

Several recent works have raised reliability concerns in relation to GPU programming. Compiler fuzzing has revealed many bugs in OpenCL compilers [19], targeted litmus tests have shown surprising hardware behaviours with respect to relaxed memory [1], and program analysis tools for OpenCL have revealed correctness issues, such as data races, when used to scrutinise open source benchmark suites [3, 10]. In contrast to this prior work, which specifically set out to expose bugs, either through engineered synthetic programs [19, 1], or by searching for defects that might arise under rare conditions [3, 10], we report here on portability issues that we encountered “in the wild”. These issues arose without provocation when attempting to run open source applications. In fact, as discussed further in Section 3, the porting effort that led to this study was undertaken as part of a separate, ongoing research project; to make progress on that research project we were hoping that we would not encounter such issues. We believe that the “real-world” nature of the issues experienced may be closer to what GPU application developers encounter day-to-day, compared with the issues exposed by targeted testing and formal verification.

Our hope is that this report will make the following contributions to the OpenCL community. For software engineers endeavouring to develop portable OpenCL applications, it can serve as a hazard map for issues to be aware of, and suggestions for working around such issues. For vendors, it can serve to identify areas in OpenCL frameworks that would benefit from more robust examination and testing. For researchers, the issues we report on may serve as motivational case-studies for new verification and testing methods.

Despite the challenges we faced, in most cases we were able to find a work-around, and overall we consider our experience a success: OpenCL application portability can be achieved with effort, and this effort will diminish as vendor implementations improve, aspects of the specification are clarified, and better analysis tools become available.

The structure of the paper is as follows: Sec. 2 contains an overview of OpenCL and common elements of a GPU OpenCL framework. The applications we ported are described in Sec. 3. Section 4 documents the issues we classified as framework bugs. Section 5 documents the issues we classified as specification limitations. Section 6 documents the issues we classified as programming bugs. We then suggest ways that we believe the state of portability of OpenCL GPU programs could be improved in Sec. 7. Finally, we conclude in Sec. 8.
2. BACKGROUND ON OPENCL
OpenCL Programming. An OpenCL application consists of two parts: host code, usually executed on a CPU, and device code, which is executed on an accelerator device; in this paper we consider GPU accelerators. The host code is usually written in C or C++ (although wrappers for other languages now exist) and is compiled using a standard C/C++ compiler (e.g. gcc or MSVC). The OpenCL framework is accessed through library calls that allow for the set-up and execution of a supported device. The API for the OpenCL library is documented in the OpenCL specification [17], and it is up to the vendor to provide a conforming implementation that the host code can link to.

The device code is written in OpenCL C [14] (similar to C99). The code is written in an SIMT (single instruction multiple thread) manner, such that all threads execute the same code, but have access to unique thread identifiers. The device code must contain one or more entry functions where execution begins; these functions are called kernels.

OpenCL supports a hierarchical execution model that mirrors features common to some of the specialised hardware that OpenCL kernels are expected to execute on, in particular features common to many GPU architectures. Threads are partitioned into disjoint, equally-sized sets called workgroups. Threads within the same workgroup can use OpenCL primitives for efficient communication. For example, each workgroup has a disjoint region of local memory; only threads in the same workgroup can communicate using local memory. OpenCL also provides an intra-workgroup execution barrier. On reaching a barrier a thread waits until all threads in its workgroup have reached the barrier. Barriers can be used for deterministic communication. To aid in finer-grained and intra-device communication, OpenCL provides a set of atomic read-modify-write instructions where threads can atomically access, modify and store a value to memory. All device threads have access to a region of global memory.

Newer GPUs provide support for the OpenCL 2.0 memory model [17, pp. 35-53], which is similar to the C++11 memory model [13, pp. 1112-1129]. In this model, synchronisation memory locations must be declared with special atomic types (e.g. atomic_int). Accesses to these memory locations can be annotated with a memory order indicating the extent to which the access will synchronise with other accesses (e.g. release, acquire), and a scope in the OpenCL hierarchy to indicate with which other threads in the concurrency hierarchy the access should communicate (e.g. a scope can be intra-workgroup or inter-workgroup). If no memory order is provided, a default memory order of sequentially consistent is used [14, p. 103]. Rules on the orderings provided by these annotations are given both in the standard and (more formally) in recent academic work [5].

While support in OpenCL 2.0 facilitates finer-grained interactions between the host and device, traditionally the host and device interact at a coarse level of granularity, and this is the case for the applications we consider in this paper. The host and device do not share a memory region, thus the host must explicitly transfer any input data the kernel needs to the device through the OpenCL API. The host is responsible for then setting the kernel arguments and finally launching the kernel, again all using the OpenCL API.

A similar language for programming GPUs is CUDA [21]. This language is Nvidia-specific and thus not portable across GPU vendors.

Components of an OpenCL Environment. To enable OpenCL support for a given device, a vendor must provide a compiler for OpenCL C that targets the instruction set of the device, and a runtime capable of coordinating interaction between the host and the specific device. It is the role of the OpenCL specification to define requirements that the compiler and runtime must adhere to in order to successfully execute valid applications. It is the vendor’s job to ensure that these requirements are met in practice, and clarity in the OpenCL specification is essential to achieving this.

The device, compiler and runtime comprise a complete OpenCL environment. Issues in any one of these components can cause the contract between the OpenCL specification and the vendor-provided environment to be violated.

3. EVALUATED APPLICATIONS
This experience report is a by-product of an ongoing project that explores using the OpenCL 2.0 relaxed memory model [17, pp. 35-53] to design custom synchronisation constructs for GPUs. For that project, we sought benchmarks that might benefit from the use of fine-grained communication idioms. We discovered that applications containing irregular parallelism over dynamic workloads provided a good fit for our goals. With this in mind, we found two suites of open source benchmarks to experiment with: Pannotia [9] and Lonestar [8]. The applications are summarised in Tab. 2. The short names of Pannotia and Lonestar applications are prefixed with “p” and “l”, respectively.

The Pannotia benchmarks were originally developed to examine fine-grained performance characteristics of irregular parallelism on GPUs, such as cache hit rate and data transfer time. The benchmarks were written in OpenCL 1.0, and evaluated using AMD GPUs. There are six applications in the benchmark suite in total, of which we consider four. The two applications we did not consider were structured in a way such that we could not easily see how to apply our experimental custom synchronisation constructs (recall that applying these constructs was what motivated us to evaluate these benchmarks across GPUs from a range of vendors).

The Lonestar applications were originally written in CUDA and evaluated using Nvidia GPUs; we ported these applications to OpenCL. Like the Pannotia applications, the Lonestar applications measure various performance characteristics of irregular applications, including control flow divergence between threads.

The Lonestar applications use non-portable, Nvidia-specific constructs, including single dimensional texture memory, warp-aware operations (e.g. warp shuffle commands), and a device-level barrier. For each, we attempted to provide portable OpenCL alternatives, changing texture memory to global memory, rewriting warp-aware idioms to use workgroup synchronisation, and using the OpenCL 2.0 memory model to write a device-level barrier. There are seven applications in the Lonestar benchmark suite, of which we consider four. Similar to the Pannotia benchmarks, the three applications we did not consider were structured in a way that we could not easily see how to apply our custom synchronisation constructs.

Both benchmark suites contain an sssp application; however, they are fundamentally different. The Lonestar version (ls-sssp) uses shared task queues to manage the dynamic workload. The Pannotia version (p-sssp) is implemented
by iterating over common linear algebra methods. We thus consider them as two distinct applications.

4. FRAMEWORK BUGS
Here we outline three issues that we believe, to the best of our knowledge and debugging efforts, to be framework bugs. We experienced these issues when experimenting with custom synchronisation constructs in the applications of Tab. 2 across the chips of Tab. 1.

For each bug, we give a brief summary that includes a short description of the bug, the platforms on which we observed the bug, the status of the bug (indicating whether we have reported the issue and if so whether it is under investigation) and, if applicable, a work-around. We additionally give each issue a label for ease of reference in the text.

After the summary, we elaborate on how we came across the issue and our debugging attempts. Where we have not reported the issues, this is due to exposure of the issue requiring use of our custom synchronisation constructs, the fruits of an ongoing and as-yet-unpublished project. Once we publish these constructs, we will report the issues.

Framework bug 1: compiler crash
  Summary: The OpenCL kernel compiler crashes nondeterministically.
  Platforms: 5500 and 6100 (Intel)
  Status: Unreported
  Workaround: Add preprocessor directives to reduce the number of kernels passed to the compiler
  Label: FB-CC

We encountered this error when experimenting with custom synchronisation constructs in the p-sssp application. The original application contained four kernel functions. Using our synchronisation construct, we implemented three new kernel functions, each of which performed some or all of the original computation using different approaches (e.g. by varying the number and location of synchronisation operations). For convenience, we located all seven kernel functions in a single source file.

We noticed that when we executed scripts to benchmark the different kernels, the application would crash roughly one in ten times with an unknown error, producing an output that looks like a memory dump. Our debugging efforts showed that the application was crashing when the OpenCL C compiler was invoked via the OpenCL API function clBuildProgram.

In an attempt to find the root cause of this issue, we tried to reduce the size of the OpenCL source file. We were able to reduce the problem to a kernel file that contained only two large kernel functions. At this point, when either of the kernel functions was removed, the error disappeared. Our hypothesis is that the error is due to the OpenCL kernel file containing multiple large kernel functions. We were able to work around this issue by surrounding the kernel functions in the kernel file with preprocessor conditionals. We then used the -D compiler flag to exclude all kernels except the one we were currently benchmarking.

We have not yet reported this issue as the kernels which cause the compiler to crash contain our custom synchronisation constructs.

Framework bug 2: deadlock with break-terminating loops
  Summary: Loops without bounds (using break statements to exit) lead to kernel deadlock
  Platforms: K5200 (Nvidia), R7, R9 (AMD)
  Status: Unreported
  Workaround: Re-write loop as a for loop with an over-approximated iteration bound
  Label: FB-BTL

When experimenting with the Pannotia benchmarks, we found it natural to write the applications using an unbounded loop which breaks when a terminating condition is met (e.g. when there is no more work to process). The following code snippet illustrates this idiom:

    while (1) {
       terminating_condition = true;

       // do computation, setting terminating_condition
       // to false if there is more work to do

       if (terminating_condition) {
          break;
       }
    }

On K5200, R7 and R9, we discovered that this idiom can deterministically cause non-termination of the kernel. Our debugging attempts led us to substitute the infinite loop with a finite loop with large bounds (keeping the break statements). We began with a loop bound of INT_MAX. After this change, the applications correctly terminated. To determine if threads were actually executing the loop INT_MAX times, we tracked how many times each of the threads executed the loop. We observed that no thread actually executed the loop for INT_MAX iterations. That is, each thread terminated early through the break statement.

Given this, we believe that the non-termination in the original code with the infinite loop is due to a framework bug (e.g. a compiler bug). The work-around is to replace the while(1) loop header with a for loop header that uses a large over-approximation of the number of iterations of the loop that will actually be executed.

As with FB-CC, we have not yet reported the issue because this example uses our currently unpublished synchronisation constructs. While we do not believe that the issue is related specifically to the new synchronisation constructs, it does seem that a suitably complex kernel is required to cause this behaviour; our attempts to reduce the issue to a significantly smaller example caused the problem to disappear.

Framework bug 3: defunct processes
  Summary: GPU applications become defunct and unresponsive when run with a Linux host
  Platforms: R7 and R9 (AMD)
  Status: Known
  Workaround: Change host OS to Windows
  Label: FB-DP

In experimenting with new synchronisation constructs in the Pannotia applications we generated kernels that could potentially have high runtimes (around 30 seconds). Most systems we experimented with employed a GPU watchdog daemon (see Sec. 5) which catches and terminates kernels
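
The loop rewrite described for FB-BTL (replacing the while(1) header with a for header bounded by INT_MAX, keeping the break) can be sketched in plain host-side C, so that it compiles and runs without an OpenCL device. This is only an illustration under stated assumptions: the helper process_round and the functions run_unbounded and run_bounded are hypothetical names that do not come from the benchmarks, and the "work" is a simple counter standing in for the real computation. Each function returns the number of loop iterations executed, mirroring the per-thread iteration tracking described in the text.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for one round of computation: consumes one unit
 * of work (if any) and reports whether more work remains afterwards. */
static bool process_round(int *remaining) {
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining > 0;
}

/* Original idiom: unbounded loop that exits only through break. */
int run_unbounded(int work_items) {
    int remaining = work_items;
    int iterations = 0;
    while (1) {
        bool terminating_condition = true;
        if (process_round(&remaining)) {
            terminating_condition = false; /* more work to do */
        }
        iterations++;
        if (terminating_condition) {
            break;
        }
    }
    return iterations;
}

/* Work-around sketch: identical body, but the loop header carries a
 * large over-approximated bound (INT_MAX); the break is kept. */
int run_bounded(int work_items) {
    int remaining = work_items;
    int iterations = 0;
    for (int i = 0; i < INT_MAX; i++) {
        bool terminating_condition = true;
        if (process_round(&remaining)) {
            terminating_condition = false; /* more work to do */
        }
        iterations++;
        if (terminating_condition) {
            break;
        }
    }
    return iterations;
}
```

In this sketch both variants leave through the break after the same number of iterations, so the INT_MAX bound is never reached; this matches the observation reported for the real kernels, where no thread executed the loop INT_MAX times.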