Introduction to Parallel Programming Using OpenCL and C++

Ivan Matic

1. INTRODUCTION

1.1. GPGPU. The abbreviation GPGPU refers to general-purpose programming on graphics processing units. The graphic cards of modern computers can be used to do serious scientific calculations. They contain thousands of computing cores (which we will call processing elements), and as such they are ideal for parallel programming. The processing elements can be thought of as small CPUs that are numerous but not as advanced as the central processors. High-end CPUs can have speeds in the range of 4 GHz, but a quad-core system will have only four of these units. On the other hand, a graphic card can have 4000 processing elements, each of which runs at 1 GHz. In addition, not all operations are permitted and not all data structures are available to the processing elements of a typical graphic card.

1.2. Class GraphicCard. The interaction between the GPU and the host platform is fairly complex for beginners. This tutorial introduces the class GraphicCard, written in C++, which does most of the work associated with memory management on the GPU. The usage of this class removes the necessity of completely understanding the mechanisms of how platforms, contexts, devices, kernels, workgroups, and memory objects are related to each other. Needless to say, the usage of this class is limited to building basic codes only. It is unlikely that the class will be suitable for projects that need to harness the full power of parallel programming. A comprehensive coverage of OpenCL is provided in [1, 3].

2. CPU, GPU, RAM, AND DEVICE MEMORY

While introducing the necessary terminology we will frequently refer to the first problem we intend to solve:

Problem 2.1. Read the sequence a = (a[0], a[1], ..., a[n−1]) from a file on the hard disk and update it to obtain a[k] := 2∗a[k] for each k ∈ {0, 1, ..., n−1}.
The main idea in solving this problem is to assign each term of the sequence a to a specific processing element on the graphic card. This way one processing element will perform the operation a[0] := 2∗a[0]. At the same time another element will perform a[1] := 2∗a[1], and so on. The programmer may always assume that the number of processing elements is as large as necessary. When a shortage of resources occurs, a single processing element gets assigned multiple tasks that will not be performed in parallel. The programmer does not have to be aware of that.

FIGURE 1. The relationship between the host and the device.

2.1. Organization of the hardware. For the purposes of this tutorial the computer hardware can be thought of as consisting of the following two units:

(1) Host – a traditional computing system that consists of a CPU and RAM memory. Programs that are executed on CPUs will be written in C++.
(2) Device – a graphic card that has a GPU consisting of thousands of processing elements, and its own memory that we will call device memory. Programs that can be run on processing elements are called kernels and are written in OpenCL.

Processing elements have access to the device memory only. Specifically, they cannot access the RAM memory on the host. On the other hand, the CPU can access its RAM memory, and can perform basic copying of elements from RAM to device memory and vice versa. However, one should always keep in mind that communication between the CPU and device memory is not as fast as the communication between the processing elements and the device memory.

Traditional programs written in C++ organize complicated data structures within the RAM memory. In contrast, sequences are the only data structures that the class GraphicCard permits on the device memory.

2.2. Organization of the program. Each program that uses graphic cards has the following main components: kernels, which are run on GPU processing elements and are written in OpenCL; and the host, which is run on the CPU and is written in C++.
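Before any OpenCL appears, the host/device division above can be made concrete with a CPU-only C++ sketch: a second std::vector plays the role of the device buffer, and an ordinary loop plays the role of the kernel, with the loop index standing in for a processing element's ID. The names here (runKernelSimulation, doubleSequence) are illustrative and are not part of the GraphicCard class.

```cpp
#include <cstddef>
#include <vector>

// CPU-only stand-in for the doubling kernel: the loop index plays the
// role of the processing element's ID, so iteration "id" performs
// a[id] := 2 * a[id].
void runKernelSimulation(std::vector<int>& deviceBuffer) {
    for (std::size_t id = 0; id < deviceBuffer.size(); ++id) {
        deviceBuffer[id] = 2 * deviceBuffer[id];
    }
}

// Simulates the round trip: copy host -> device, execute the kernel,
// copy device -> host.
std::vector<int> doubleSequence(const std::vector<int>& hostData) {
    std::vector<int> deviceBuffer(hostData);  // "copy RAM to device memory"
    runKernelSimulation(deviceBuffer);        // "execute the kernel"
    return deviceBuffer;                      // "read the result back to host"
}
```

On a real device the loop body would run on many processing elements at once; the simulation only mirrors the data movement and the per-index work.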
Solving our first problem consists of the following tasks:

(1) Reading the sequence a from a file. This task must be performed by the host. The processing elements do not have the ability to access the hard disk on which the file is located.
(2) Copying the sequence a from RAM to device memory. This task has to be performed by the host as well.
(3) Executing the kernel. The host will specify how many processing elements will be deployed in executing the kernel, and will synchronize the execution. The processing elements work in parallel on the same kernel code and have access to the same sequence on the device memory. The only differentiating factor between the processing elements is their ID number. Each processing element will receive a different ID, and our kernel program will be designed to use this ID as an index of the sequence. This way the processing element with ID 27 will perform the task a[27] = 2∗a[27], while the one with the ID 17 will perform the task a[17] = 2∗a[17].
(4) Reading the result from the device memory to the host memory. This operation will be performed by the host.

3. READING SEQUENCES FROM FILES

Our goal is to get to OpenCL programming as soon as possible, and in order to do this we will use a pre-made program that can read the sequence a from the file input00.txt and store it in the RAM memory. The code in generatingSequenceFromFile.cpp contains two functions: readSequenceFromFile and printToFile. Please go over the simple code from the file example00.cpp. It explains how the sequence is read from the input file, stored in the memory, and then written to the hard disk using printToFile.

The input file input00.txt contains the sequence that is to be read by the program. Integers (and minus signs) are treated as input, while everything else is ignored. The first number in the file is the length n of the sequence. The remaining numbers are the elements, until the number −9 is reached.
This number −9 is not included in the sequence; all the remaining terms (and there have to be a total of n terms) are generated at random.

4. MULTIPLYING EACH TERM OF THE SEQUENCE BY 2

In this section we will develop our first program that solves Problem 2.1. We will multiply each term of the sequence a by 2.

4.1. Designing the kernel. Our intention is for each processing element to work on one term of the sequence. get_global_id(0) is an OpenCL command that provides the processing element with the information on which ID number is assigned to it. Once this information is obtained the task is obvious, and we may summarize this with the following code:

int index = get_global_id(0);
a[index] = 2 * a[index];

There is one unpleasant surprise coming from the design of the GPU hardware. The host program (run on the CPU) cannot request an arbitrary number of processing elements. The architecture of the GPU groups these elements, and the members of each group have to be invoked together. Typically, NVIDIA hardware has groups of size 32 while AMD has groups of size 64. These particular numbers are something that is supposed to be ignored by those wishing to write elegant programs. However, it is dangerous to forget the fact that the host is almost always going to receive more processing elements than it has asked for.

In particular, the previous two lines of code may result in a disaster: if the sequence a has 57 terms and the host asks for the kernel to be run on 57 processing elements, the host will actually receive 64 processing elements. One unfortunate processing element will receive the ID 60 and consequently will try to access a[60], which is memory that should not be accessed. One way to prevent the described difficulty is to supply each processing element with the information on the length of the sequence, and the code becomes:

int index = get_global_id(0);
if (index < n) {
    a[index] = 2 * a[index];
}
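The rounding-up behavior and the bounds check can be simulated in C++ on the host side, which makes the 57-versus-64 example above easy to verify. A group size of 64 is assumed here for illustration (as for the AMD hardware mentioned above); the function names are hypothetical and not part of the GraphicCard class.

```cpp
#include <cstddef>
#include <vector>

// The device rounds the requested number of work items up to a multiple
// of the group size (64 is assumed here for illustration).
std::size_t roundUpToGroupSize(std::size_t requested, std::size_t groupSize) {
    return ((requested + groupSize - 1) / groupSize) * groupSize;
}

// CPU simulation of the guarded kernel: every launched "processing
// element" runs, but only IDs below the sequence length touch memory.
void runGuardedKernel(std::vector<int>& a) {
    std::size_t launched = roundUpToGroupSize(a.size(), 64);
    for (std::size_t index = 0; index < launched; ++index) {
        if (index < a.size()) {       // the bounds check from the kernel
            a[index] = 2 * a[index];
        }                             // IDs >= n do nothing
    }
}
```

With 57 requested work items, roundUpToGroupSize returns 64, and the seven extra iterations fall through the guard without touching the sequence.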