Introduction to Parallel Programming Using OpenCL and C++

Ivan Matic

1. INTRODUCTION

1.1. GPGPU. The abbreviation GPGPU refers to general-purpose programming on graphics processing units. The graphic cards of modern computers can be used to do serious scientific calculations. They contain thousands of computing cores (which we will call processing elements), and as such they are ideal for parallel programming. The processing elements can be thought of as small CPUs that are numerous but not as advanced as the central processors. High-end CPUs can have speeds in the range of 4 GHz, but a quad-core system will have only four of these units. On the other hand, a graphic card can have 4000 processing elements, each of which runs at 1 GHz. In addition, not all operations are permitted and not all data structures are available to the processing elements of a typical graphic card.

1.2. Class GraphicCard. The interaction between the GPU and the host platform is fairly complex for beginners. This tutorial introduces the class GraphicCard, written in C++, which does most of the work associated with memory management on the GPU. The usage of this class removes the necessity of completely understanding the mechanisms of how platforms, contexts, devices, kernels, workgroups, and memory objects are related to each other. Needless to say, the usage of this class is limited to building basic codes only. It is unlikely that the class will be suitable for projects that need to harness the full power of parallel programming. A comprehensive coverage of OpenCL is provided in [1, 3].

2. CPU, GPU, RAM, AND DEVICE MEMORY

While introducing the necessary terminology we will frequently refer to the first problem we intend to solve:

Problem 2.1. Read the sequence a = (a[0], a[1], ..., a[n−1]) from a file on the hard disk and update it to obtain a[k] := 2∗a[k] for each k ∈ {0, 1, ..., n−1}.
The main idea in solving this problem is to assign each term of the sequence a to a specific processing element on the graphic card. This way one processing element will perform the operation a[0] := 2∗a[0]. At the same time another element will perform a[1] := 2∗a[1], and so on. The programmer may always assume that the number of processing elements is as large as necessary. When a shortage of resources occurs, a single processing element gets assigned multiple tasks that will not be performed in parallel. The programmer does not have to be aware of that.

FIGURE 1. The relationship between the host and the device.

2.1. Organization of the hardware. For the purposes of this tutorial the computer hardware can be thought of as consisting of the following two units:

(1) Host – a traditional computing system that consists of a CPU and RAM memory. Programs that are executed on CPUs will be written in C++.
(2) Device – a graphic card that has a GPU consisting of thousands of processing elements, and its own memory that we will call device memory. Programs that can be run on processing elements are called kernels and are written in OpenCL.

Processing elements have access to the device memory only. Specifically, they cannot access the RAM memory on the host. On the other hand, the CPU can access its RAM memory, and can perform basic copying of elements from RAM to device memory and vice versa. However, one should always keep in mind that communication between the CPU and device memory is not as fast as the communication between the processing elements and the device memory.

Traditional programs written in C++ organize complicated data structures within the RAM memory. In contrast, sequences are the only data structures that the class GraphicCard permits on the device memory.

2.2. Organization of the program. Each program that uses graphic cards has the following main components: kernels, which are run on GPU processing elements and are written in OpenCL; and the host, which is run on the CPU and is written in C++.
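Before any OpenCL appears, the host/device division above can be made concrete with a CPU-only C++ sketch: a second std::vector plays the role of the device buffer, and an ordinary loop plays the role of the kernel, with the loop index standing in for a processing element's ID. The names here (runKernelSimulation, doubleSequence) are illustrative and are not part of the GraphicCard class.

```cpp
#include <cstddef>
#include <vector>

// CPU-only stand-in for the doubling kernel: the loop index plays the
// role of the processing element's ID, so iteration "id" performs
// a[id] := 2 * a[id].
void runKernelSimulation(std::vector<int>& deviceBuffer) {
    for (std::size_t id = 0; id < deviceBuffer.size(); ++id) {
        deviceBuffer[id] = 2 * deviceBuffer[id];
    }
}

// Simulates the round trip: copy host -> device, execute the kernel,
// copy device -> host.
std::vector<int> doubleSequence(const std::vector<int>& hostData) {
    std::vector<int> deviceBuffer(hostData);  // "copy RAM to device memory"
    runKernelSimulation(deviceBuffer);        // "execute the kernel"
    return deviceBuffer;                      // "read the result back to host"
}
```

On a real device the loop body would run on many processing elements at once; the simulation only mirrors the data movement and the per-index work.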
Solving our first problem consists of the following tasks:

(1) Reading the sequence a from a file. This task must be performed by the host. The processing elements do not have the ability to access the hard disk on which the file is located.
(2) Copying the sequence a from RAM to device memory. This task has to be performed by the host as well.
(3) Executing the kernel. The host will specify how many processing elements will be deployed in executing the kernel, and will synchronize the execution. The processing elements work in parallel on the same kernel code and have access to the same sequence on the device memory. The only differentiating factor between the processing elements is their ID number. Each processing element will receive a different ID, and our kernel program will be designed to use this ID as an index of the sequence. This way the processing element with ID 27 will perform the task a[27] = 2∗a[27], while the one with the ID 17 will perform the task a[17] = 2∗a[17].
(4) Reading the result from the device memory to the host memory. This operation will be performed by the host.

3. READING SEQUENCES FROM FILES

Our goal is to get to OpenCL programming as soon as possible, and in order to do this we will use a pre-made program that can read the sequence a from the file input00.txt and store it in the RAM memory. The code in generatingSequenceFromFile.cpp contains two functions: readSequenceFromFile and printToFile. Please go over the simple code from the file example00.cpp. It explains how the sequence is read from the input file, stored in the memory, and then written to the hard disk using printToFile.

The input file input00.txt contains the sequence that is to be read by the program. Integers (and minus signs) are treated as input, while everything else is ignored. The first number in the file is the length n of the sequence. The remaining numbers are the elements, until the number −9 is reached.
This number −9 is not included in the sequence; all the remaining terms (and there have to be a total of n terms) are generated at random.

4. MULTIPLYING EACH TERM OF THE SEQUENCE BY 2

In this section we will develop our first program that solves Problem 2.1. We will multiply each term of the sequence a by 2.

4.1. Designing the kernel. Our intention is for each processing element to work on one term of the sequence. get_global_id(0) is an OpenCL command that provides the processing element with the information on which ID number is assigned to it. Once this information is obtained the task is obvious, and we may summarize this with the following code:

int index = get_global_id(0);
a[index] = 2 * a[index];

There is one unpleasant surprise coming from the design of the GPU hardware. The host program (run on the CPU) cannot request an arbitrary number of processing elements. The architecture of the GPU groups these elements, and the members of each group have to be invoked together. Typically, NVIDIA hardware has groups of size 32 while AMD has groups of size 64. These particular numbers are something that is supposed to be ignored by those wishing to write elegant programs. However, it is dangerous to forget the fact that the host is almost always going to receive more processing elements than it has asked for.

In particular, the previous two lines of code may result in a disaster: if the sequence a has 57 terms and the host asks for the kernel to be run on 57 processing elements, the host will actually receive 64 processing elements. One unfortunate processing element will receive the ID 60 and consequently will try to access a[60], which is memory that should not be accessed. One way to prevent the described difficulty is to supply each processing element with the information on the length of the sequence, and the code becomes:

int index = get_global_id(0);
if (index < n) {
    a[index] = 2 * a[index];
}
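The rounding-up behavior and the bounds check can be simulated in C++ on the host side, which makes the 57-versus-64 example above easy to verify. A group size of 64 is assumed here for illustration (as for the AMD hardware mentioned above); the function names are hypothetical and not part of the GraphicCard class.

```cpp
#include <cstddef>
#include <vector>

// The device rounds the requested number of work items up to a multiple
// of the group size (64 is assumed here for illustration).
std::size_t roundUpToGroupSize(std::size_t requested, std::size_t groupSize) {
    return ((requested + groupSize - 1) / groupSize) * groupSize;
}

// CPU simulation of the guarded kernel: every launched "processing
// element" runs, but only IDs below the sequence length touch memory.
void runGuardedKernel(std::vector<int>& a) {
    std::size_t launched = roundUpToGroupSize(a.size(), 64);
    for (std::size_t index = 0; index < launched; ++index) {
        if (index < a.size()) {       // the bounds check from the kernel
            a[index] = 2 * a[index];
        }                             // IDs >= n do nothing
    }
}
```

With 57 requested work items, roundUpToGroupSize returns 64, and the seven extra iterations fall through the guard without touching the sequence.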