Source: imomath.com


             Introduction to Parallel Programming Using OpenCL
                                  and C++
                                  Ivan Matic
                                1. INTRODUCTION
             1.1. GPGPU. The abbreviation GPGPU refers to general-purpose programming
             on graphics processing units. The graphic cards of modern computers can be used
             to do serious scientific calculations. They contain thousands of computing cores
             (that we will call processing elements), and as such they are ideal for parallel
             programming. The processing elements can be thought of as small CPUs that are
             numerous, but not as advanced as the central processors. High-end CPUs can have
             clock speeds in the range of 4 GHz, but a quad-core system will have only four of
             these units. On the other hand, a graphic card can have 4000 processing elements,
             each of which runs at 1 GHz. In addition, not all operations are permitted and not
             all data structures are available to the processing elements of a typical graphic card.
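The comparison above can be made concrete with a back-of-the-envelope calculation. The sketch below only multiplies the unit counts and clock rates quoted in the text; the function names are hypothetical, and raw cycle counts say nothing about memory bandwidth or about the operations that processing elements do not support.

```cpp
#include <cassert>
#include <cstdint>

// Raw clock cycles per second summed over all units: units * clock rate.
std::uint64_t aggregateCyclesPerSecond(std::uint64_t units,
                                       std::uint64_t hertzPerUnit) {
    return units * hertzPerUnit;
}

// Ratio of the two totals for the figures quoted in the text:
// a quad-core CPU at 4 GHz versus 4000 processing elements at 1 GHz.
std::uint64_t gpuToCpuCycleRatio() {
    const std::uint64_t cpu = aggregateCyclesPerSecond(4, 4000000000ULL);
    const std::uint64_t gpu = aggregateCyclesPerSecond(4000, 1000000000ULL);
    assert(cpu > 0);  // guard the division below
    return gpu / cpu;
}
```

With these figures the graphic card has 250 times as many raw cycles per second available, which is an upper bound on the benefit and not a measured speedup.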
             1.2. Class GraphicCard. The interaction between the GPU and the host platform is
             fairly complex for beginners. This tutorial introduces the class GraphicCard,
             written in C++, which does most of the work associated with memory management
             on the GPU. Using this class removes the need to understand completely how
             platforms, contexts, devices, kernels, workgroups, and memory objects are
             related to each other.
               Needless to say, the usage of this class is limited to building basic codes only.
             It is unlikely that the class will be suitable for projects that need to harness the
             full power of parallel programming. Comprehensive coverage of OpenCL is
             provided in [1, 3].
                       2. CPU, GPU, RAM, AND DEVICE MEMORY
             While introducing the necessary terminology we will frequently refer to the first
             problem we have the intention to solve:
             Problem 2.1. Read the sequence $\vec{a}$ = (a[0], a[1], ..., a[n−1]) from a file on the
             hard disk and update it to obtain a[k] := 2∗a[k] for each k ∈ {0,1,...,n−1}.
               The main idea in solving this problem is to assign each term of the sequence $\vec{a}$ to
             a specific processing element on the graphic card. This way one processing element
             will perform the operation a[0] := 2∗a[0]. At the same time another element will
             perform a[1] := 2∗a[1], and so on.
               The programmer may always assume that the number of processing elements is
             as large as necessary. When a shortage of resources occurs, a single processing
             element is assigned multiple tasks, which are then not performed in parallel. The
             programmer does not have to be aware of that.
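What it means for one processing element to be assigned several tasks can be illustrated on the CPU. The sketch below is not OpenCL and does not reflect how a real runtime schedules work; the function name and the strided assignment (element e handles indices e, e + elements, e + 2·elements, and so on) are assumptions chosen only to make the idea visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only illustration: `elements` hypothetical processing elements share
// a.size() doubling tasks. On real hardware the elements would run at the
// same time; here the outer loop simply visits them one after another.
// Returns how many tasks each element ended up performing.
std::vector<int> tasksPerElement(std::vector<int>& a, std::size_t elements) {
    assert(elements > 0);
    std::vector<int> handled(elements, 0);
    for (std::size_t e = 0; e < elements; ++e) {
        // The tasks of one element are performed sequentially.
        for (std::size_t k = e; k < a.size(); k += elements) {
            a[k] = 2 * a[k];
            ++handled[e];
        }
    }
    return handled;
}
```

For a sequence of 7 terms and 3 elements, the elements perform 3, 2, and 2 tasks respectively, and every term is still doubled exactly once.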
                 FIGURE 1. The relationship between the host and the device.
            2.1. Organization of the hardware. For the purposes of this tutorial the computer
            hardware can be thought to consist of the following two units:
              (1) Host – a traditional computing system that consists of CPU and RAM memory.
                Programs that are executed on CPUs will be written in C++.
              (2) Device – a graphic card that has GPU consisting of thousands of processing
                elements, and its own memory that we will call device memory. Programs
                that can be run on processing elements are called kernels and are written in
                OpenCL.
             Processing elements have access to the device memory only. Specifically, they
            cannot access the RAM memory on the host. On the other hand, the CPU can access
            its RAM memory, and can perform basic copying of elements from RAM to
            device memory and vice versa. However, one should always keep in mind that
            communication between CPU and device memory is not as fast as the communication
            between the processing elements and the device memory.
             Traditional programs written in C++ organize complicated data structures within
            the RAM memory. In contrast, sequences are the only data structures that the class
            GraphicCard permits on the device memory.
            2.2. Organization of the program. Each program that uses graphic cards has the
            following main components: Kernels, which are run on GPU processing elements
            and are written in OpenCL; and Host, which is run on CPU and is written in C++.
             Solving our first problem consists of the following tasks:
                 (1) Reading the sequence $\vec{a}$ from a file. This task must be performed by host.
                     The processing elements do not have the ability to access the hard disk on
                     which the file is located.
                 (2) Copying the sequence $\vec{a}$ from RAM to device memory. This task has to be
                     performed by host as well.
                 (3) Executing the kernel. The host will specify how many processing elements
                     will be deployed in executing the kernel, and will synchronize the execution.
                     The processing elements work in parallel on the same kernel code and have
                     access to the same sequence on the device memory. The only differentiating
                     factor between the processing elements is their ID number. Each processing
                     element will receive a different ID and our kernel program will be designed
                     to use this ID as an index of the sequence. This way the processing element
                     with ID 27 will perform the task a[27] = 2∗a[27], while the one with the
                     ID 17 will perform the task a[17] = 2∗a[17].
                 (4) Reading the result from the device memory to the host memory. This operation
                     will be performed by the host.
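The four tasks above can be mimicked entirely on the CPU to make the data flow visible. This is a sketch under loud assumptions: it uses neither OpenCL nor the class GraphicCard, plain vectors stand in for RAM and device memory, and the names doubleTermKernel and runOnSimulatedDevice are hypothetical. Task (1) is assumed to have already produced the host sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the kernel: a simulated processing element receives its ID
// and uses it as an index into the sequence held in simulated device memory.
void doubleTermKernel(std::vector<int>& deviceMemory, std::size_t id) {
    assert(id < deviceMemory.size());
    deviceMemory[id] = 2 * deviceMemory[id];
}

// Tasks (2) to (4) for a sequence that task (1) has already read into RAM.
std::vector<int> runOnSimulatedDevice(const std::vector<int>& hostSequence) {
    // (2) Copy the sequence from RAM to device memory.
    std::vector<int> deviceMemory = hostSequence;

    // (3) Execute the kernel once per ID. A GPU would run these invocations
    //     in parallel; this loop runs them one after another.
    for (std::size_t id = 0; id < deviceMemory.size(); ++id) {
        doubleTermKernel(deviceMemory, id);
    }

    // (4) Copy the result from device memory back to RAM.
    std::vector<int> result = deviceMemory;
    return result;
}
```

The host sequence is left untouched, which mirrors the fact that processing elements never see RAM: they only ever modify the copy on the device.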
                             3. READING SEQUENCES FROM FILES
                 Our goal is to get to OpenCL programming as soon as possible, and in order to
                do this we will use a pre-made program that can read the sequence $\vec{a}$ from the file
                input00.txt and store it in the RAM memory. The code in
                            generatingSequenceFromFile.cpp
                contains two functions: readSequenceFromFile and printToFile.
                 Please go over the simple code from the file example00.cpp. It explains how
                the sequence is read from the input file, stored in the memory, and then written to the
                hard disk using printToFile.
                 The input file input00.txt contains the sequence that is to be read by the
                program. Integers (and minus signs) are treated as input, while everything else is
                ignored. The first number in the file is the length n of the sequence. The numbers
                that follow are the elements, until the number −9 is reached. The number −9 itself
                is not included in the sequence; the terms still missing after it (there has to be
                a total of n terms) are generated at random.
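The input format described above can be sketched as a small parser. This is not the code of readSequenceFromFile, which is not shown here; the names extractIntegers and buildSequence, the seed parameter, and the range 0 to 99 for the random terms are all assumptions made for illustration. The sketch works on a string that holds the file contents and does not guard against integer overflow.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Collects every integer in `text`. A minus sign counts only when it is
// directly followed by a digit; all other characters are ignored.
std::vector<int> extractIntegers(const std::string& text) {
    std::vector<int> numbers;
    std::size_t i = 0;
    while (i < text.size()) {
        bool negative = false;
        if (text[i] == '-' && i + 1 < text.size() &&
            std::isdigit(static_cast<unsigned char>(text[i + 1]))) {
            negative = true;
            ++i;  // now positioned on the first digit
        }
        if (std::isdigit(static_cast<unsigned char>(text[i]))) {
            int value = 0;
            while (i < text.size() &&
                   std::isdigit(static_cast<unsigned char>(text[i]))) {
                value = 10 * value + (text[i] - '0');
                ++i;
            }
            numbers.push_back(negative ? -value : value);
        } else {
            ++i;  // skip anything that is not part of an integer
        }
    }
    return numbers;
}

// First number: length n. Following numbers: terms, up to the sentinel -9.
// Terms still missing after the sentinel are drawn at random (assumed range).
std::vector<int> buildSequence(const std::string& text, unsigned seed = 0) {
    const std::vector<int> numbers = extractIntegers(text);
    assert(!numbers.empty() && numbers[0] >= 0);
    const std::size_t n = static_cast<std::size_t>(numbers[0]);

    std::vector<int> a;
    for (std::size_t i = 1;
         i < numbers.size() && numbers[i] != -9 && a.size() < n; ++i) {
        a.push_back(numbers[i]);
    }

    std::mt19937 generator(seed);
    std::uniform_int_distribution<int> randomTerm(0, 99);
    while (a.size() < n) {
        a.push_back(randomTerm(generator));
    }
    return a;
}
```

Under these assumptions, an input such as "n = 5; terms: 3, -4, 7, -9" yields a sequence of five terms that starts with 3, -4, 7 and ends with two random values.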
                         4. MULTIPLYING EACH TERM OF SEQUENCE BY 2
                 In this section we will develop our first program that solves Problem 2.1. We will
                multiply each term of the sequence $\vec{a}$ by 2.
                4.1. Designing the kernel. Our intention is for each processing element to work
                on one term of the sequence. get_global_id(0) is an OpenCL command
                that tells the processing element which ID number is
                assigned to it. Once this information is obtained the task is obvious, and we may
                summarize this with the following code:
                int index = get_global_id(0);
                a[index] = 2 * a[index];
                 There is one unpleasant surprise coming from the design of the GPU hardware.
                The host program (run on CPU) cannot request an arbitrary number of processing
                elements. The architecture of GPU groups these elements and the members of each
                group have to be invoked together. Typically, NVIDIA hardware has groups of size
                32 while AMD has groups of size 64. These particular numbers are something that
                is supposed to be ignored by those wishing to write elegant programs. However, it
                is dangerous to forget the fact that the host is almost always going to receive more
                processing elements than it has asked for.
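The number of processing elements the host actually receives can be estimated by rounding the request up to the next multiple of the group size. The helper below is a hypothetical illustration of that arithmetic, not part of OpenCL or of the class GraphicCard, and the group sizes 32 and 64 are the typical values quoted in the text.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of groupSize that is greater than or equal to requested.
std::size_t roundUpToGroup(std::size_t requested, std::size_t groupSize) {
    assert(groupSize > 0);
    return ((requested + groupSize - 1) / groupSize) * groupSize;
}
```

A request for 57 elements becomes 64 with groups of either size, since 64 is the next multiple of both 32 and 64, while a request that is already a multiple of the group size is left unchanged.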
                 In particular, the previous two lines of code may result in a disaster: If the
                sequence $\vec{a}$ has 57 terms and the host asks for the kernel to be run on 57 processing
                elements, the host will actually receive 64 processing elements. One unfortunate
                processing element will receive the ID 60 and consequently will try to access a[60],
                which is not a memory location that should be accessed. One way to prevent the described
                difficulty is to supply each processing element with the information on the length
                of the sequence, and the code becomes:
               int index = get_global_id(0);
               if(index
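The effect of supplying the length of the sequence can be checked with a CPU-side simulation. The sketch below is not a completion of the kernel code above; it is a plain C++ stand-in, with hypothetical names, in which every simulated processing element compares its ID against the length before touching the sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Guarded stand-in for the kernel: an element whose ID is at or beyond
// `length` does nothing. Returns true if the element performed a write.
bool guardedDoubleKernel(std::vector<int>& a, std::size_t length,
                         std::size_t id) {
    assert(length <= a.size());
    if (id < length) {
        a[id] = 2 * a[id];
        return true;
    }
    return false;
}

// Runs `launched` simulated elements over the sequence and returns how many
// of them stayed idle because their ID fell outside the sequence.
std::size_t launchGuarded(std::vector<int>& a, std::size_t launched) {
    std::size_t idle = 0;
    for (std::size_t id = 0; id < launched; ++id) {
        if (!guardedDoubleKernel(a, a.size(), id)) {
            ++idle;
        }
    }
    return idle;
}
```

With a sequence of 57 terms and 64 launched elements, the elements with IDs 57 through 63 stay idle, so 7 elements do nothing and no access outside the sequence takes place.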