Basic MPI - Few Paradigms

From CSI 702

Jump to: navigation, search

Contents

  1. Parallel Programming in a Nutshell
    1. Definition of Terms
    2. Scalability
    3. Speedup
    4. Communication
    5. Message Passing
  2. MIMD Machines
    1. Early MIMD Machine Problems
    2. Limits of Early Parallel Computers
    3. Cluster Based Computing
    4. Performance Characteristics of Beowolf Clusters
    5. Message Passing Libraries
  3. MPI Overview
    1. MPI Documentation
    2. MPICH vs LAM
  4. Parallel Computing on Workstations
    1. MPI Configuration- Setting Environmental Variables
    2. MPI Configuration- Initialize Compilers
    3. Using SSH for Communication
    4. Public- Private Keys
    5. Adding a Key in SSH
    6. Generating the keys
    7. Installing Keys on the System
    8. Installing the private key
    9. Putting the phase phrase in RAM - Not needed for MPI
    10. MPI Configuration - Add workstations to the list of known hosts
    11. MPI configuration - Creating mpd configuration file
    12. MPI configuration - Create a host file
    13. MPI configuration - Start the MPI daemon
    14. MPI configuration - Creating mpd configuration file
    15. Compile an Example Code
    16. Run the Program
    17. Example MPI Code
    18. MPI on the COS Workstations
    19. General Message Passing
    20. The ESSENTIAL Truths
    21. Example 1
    22. Example 2
    23. Very Basic MPI - Fortran
    24. Very Basic MPI - C
    25. Very Basic MPI - C++
  5. Nodal Operations
    1. Ranks and Sizes in MPI
    2. Global Operations
    3. Synchronizing Nodes
    4. Global Broadcasts
    5. MPI Broadcasts
  6. Parallel Reductions
    1. MPI Reductions
    2. Point-to-Point Communication
  7. Example
    1. Blocking Sends and Receives
    2. MPI Sends - Fortran
    3. Non-Blocking Sends and Receives
  8. MPI Overview
  9. Examples
  10. Exchanging Data: Processor Sending - P1
    1. Exchanging Data: Processor Receiving - P2

1 Parallel Programming in a Nutshell

Parallel programming is a programming methodology with the objective of maximizing performance by processing multiple portions of a program at the same time. Maximizing this performance for non-trivial programs is difficult, since to do so you need to ensure that the work is evenly distributed, but this requires a lot of communication that will ultimately slow down the problem considerably. This is the eternal problem in parallel computing, load balancing vs communication.

The basic approaches to this problem are placed into two different groups, data partitioning, and task partitioning. Data partitioning involves the distribution of the data across several nodes, while task partitioning involves separating tasks across several nodes.

1.1 Definition of Terms

  • node - a box usually containing a processor, local memory, and a disk
  • cluster - a group of nodes held together by networking

1.2 Scalability

In parallel computing, the maximum performance that can be gained is equal to the sum of the nodes, e.g. four nodes would run a program four times faster than one node. However this rarely occurs for non-trivial programs, this is due to costs of communication, difficulty in distributing data and work, or parts of programs that are simply not distributable such as writing to disk or other devices. This difficulty is described through Amdahl's law:

S = \frac{1}{\alpha +(1-\alpha)/p}

where α is the portion of the code which cannot be parallelized and p is the number of processors. This is law implies that speedup is limited by the slowest portion of the code.

1.3 Speedup

Under the constraints of Amdahl's law, there are a number of ways to speed up performance:

  • Add Nodes:

Adding nodes may be the easiest method to increase a problem, but as the number of nodes increase the cost for communication and the bounds due to the inherent serial portion of the code will eventually make performance gains negligible.

  • Increase Problem Size:

If the parallel portion of the problem scales in a better than linear method, simply increasing the problem size may yield significant increases in performance, since the non-parallel portion of the code generally scales linearly.

1.4 Communication

The time it takes to access most modern forms of memory is measured in nanoseconds, while network latency is generally measured in milliseconds. Because of this communications between nodes is an expensive process, typically we can do thousands of computations in the time it takes to pass the simplest message. On top of the time it takes to send a message, the amount of time it takes to send a certain amount of data is much less within a node, than between nodes. We can define the total cost of sending a message as below with latency, l and bandwidth b:

t = s / b + l

seconds. (Assuming b, l, and s are in consistent units.)

1.5 Message Passing

Most current parallel computers use message passing. The general paradigms are classified as:

  • Single Program Multiple Data - SPMD (One program on multiple processors taking in multiple inputs)
  • Multiple Programs Multiple Data - MIMD (Different programs on multiple processors taking in multiple inputs)

In almost all cases, we use a single program for message passing. However, on each node, different subroutines may be running at the time time. Communications are handled through a message passing library.

2 MIMD Machines

During most of the 1990's, MIMD machines were sold by a few large vendors.

  • IBM SP2
  • Thinking Machines CM-5
  • Cray T3D/T3E
  • Intel Paragon
  • SGI Origin 2000

Most of these vendors no long produce machines. The expense of developing big machines did not justify the revenue from their sales.

2.1 Early MIMD Machine Problems

  • proprietary operating system
  • proprietary programming languages and libraries
  • proprietary communications hardware

2.2 Limits of Early Parallel Computers

Literally, the machines were built from scratch. The compilers and libraries were not standardized, and the interfaces changed from machine to machine. This led to two big problems:

  • cost - everything was developed in-house
  • user acceptance - moving between machines was very expensive

2.3 Cluster Based Computing

The new direction for parallel computing is in cluster based computing. The essential characteristics are:

  • multiple commercial off the shelf (COTS) machines (usually Intel-based)
  • no special operating system or compilers (usually linux)
  • high speed but COTS networking cards and routers connecting separate boxes
  • standard message passing library handing communications through RSH or SSH The common name for these machines is Beowolf clusters.

2.4 Performance Characteristics of Beowolf Clusters

These are general characteristics of Beowlf clusters. Consider them rules of thumb rather than certainties.

  • individual nodes are moderate end PC boxes
  • communications are usually routed through a single router
  • most communications are fairly slow with high to moderate latency

2.5 Message Passing Libraries

With the creation of MIMD machines, message passing libraries had to be created to allow general communication between computational nodes. The original libraries used were proprietary, and sold only with parallel machines. Every company wanted to have "new and better" features than every other company. This competition led to completely machine dependent programming. Every new parallel machine required a complete rewrite of the message passing sections. It is not very surprising that parallel computing has been a commercial failure... at least until recently.

Message Passing Library Implementations:

  • OpenMPI
  • MPICH
  • LAM/MPI
  • PVM

3 MPI Overview

  • startup / shutdown
  • local information - node id, number of nodes
  • global operations
    • broadcasts
    • reductions
    • synchronization
  • sends and receives
    • block vs non-blocking
  • domain decomposition tools
    • new comm types
    • virtual domains
  • data type creation tools

3.1 MPI Documentation

There are good resources on-line
http://www.mpi-forum.org/docs/docs.html
http://www.netlib.org/mpi/index.html
http://www.lam-mpi.org/tutorials/
http://web.umr.edu/~ercal/387/MPI/qref.html
http://www.lam-mpi.org/tutorials/nd/
You can compile a version of MPI on your home PC using a version called "mpich". The code is free and available on the web. Most MPI implementations use rsh or ssh for interprocessor communication. These can be altered if faster options are available.

3.2 MPICH vs LAM

There are several different versions of MPI libraries available for us.

  • On the COS cluster, we use the "mpich" libraries. They are complied to work with the Intel compilers.
  • LAM is another version of MPI that is sometimes used.
  • Both libraries work fine for most applications.

4 Parallel Computing on Workstations

  • Since modern OS can run several processes at once, we can debug parallel code by running mpi on a serial machine.
  • We will NOT get any speed-up, unless we are using a multicore machine AND there are not other things running in the background.
  • NEVER debug a code on more than one workstation. It helps to prevent hung processes from spreading across all the nodes.

4.1 MPI Configuration- Setting Environmental Variables

setenv MPICH /usr/local/mpich
set path=($MPICH/bin $MPICH/sbin $path)

MPICH=/usr/local/mpich
export MPICH
PATH=$MPICH/bin:$MPICH/sbin:$PATH
export PATH

4.2 MPI Configuration- Initialize Compilers

source /usr/local/intel/fc/10.0.026/bin/ifortvars.csh
source /usr/local/intel/cc/10.0.026/bin/iccvars.csh
source /usr/local/intel/idb/10.0.026/bin/idbvars.csh

setenv MPICH_F90 /usr/local/intel/fc/10.0.026/bin/ifort
setenv MPICH_F77 /usr/local/intel/fc/10.0.026/bin/ifort
setenv MPICH_CC /usr/local/intel/cc/10.0.026/bin/icc
setenv MPICH_CXX /usr/local/intel/cc/10.0.026/bin/icc

source /usr/local/intel/fc/10.0.026/bin/ifortvars.sh
source /usr/local/intel/cc/10.0.026/bin/iccvars.sh
source /usr/local/intel/idb/10.0.026/bin/idbvars.sh

set MPICH_F90=/usr/local/intel/fc/10.0.026/bin/ifort
set MPICH_F77=/usr/local/intel/fc/10.0.026/bin/ifort
set MPICH_CC=/usr/local/intel/cc/10.0.026/bin/icc
set MPICH_CXX=/usr/local/intel/cc/10.0.026/bin/icc

export MPICH_F90
export MPICH_F77
export MPICH_CC
export MPICH_CXX

4.3 Using SSH for Communication

SSH and SFTP can be used for secure communications between nodes for parallel computing. The protocols used are a bit slow, but allow you to experiment with parallel codes on secured clusters. The difficulty with SSH in parallel communication is the need for a password for every transaction. Of course, this is exactly what makes the system secure in the first place. There is a provision in SSH for using keys and for having automatic identification.

4.4 Public- Private Keys

Public/private keys are a way to securely access data within a server. Generally, there are three parts to this system

  • a public key that is readable on the target machine you are trying to access
  • a private key that is readable on the machine you are using to access the target machine
  • a pass phrase that you will use instead of a password

It is important that you delete the private keys after using them for this class. Leaving them on the system might make it vulnerable to hacking.

4.5 Adding a Key in SSH

There are four basic steps in creating a set of keys for a Linux machine that is running ssh.

  • Generate the keys
  • Install them in the system
  • Set up an agent to hold the pass phrase in RAM
  • Add the pass phrase to the machine

4.6 Generating the keys

cd ~/sh
ssh-keygen -t rsa -f tmpkey
wk03 [~/.ssh] (35) % ssh-keygen -t rsa -f tmpkey
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again: [TYPE PASSPHRASE]
Your identification has been saved in tmpkey.
Your public key has been saved in tmpkey.pub.
The key fingerprint is:
31:e0:bd:27:3d:df:12:06:1d:2c:a2:e1:3f:5c:a4:05
jwallin@wk03.cos.gmu.edu
wk03 [~/.ssh] (36)

I have typed in passphrase in the prompt. DO NOT USE A CR ONLY!

4.7 Installing Keys on the System

You now need to append the public key to the authorized keys file.

wk03 [~/.ssh] (40) % cat tmpkey.pub >>
authorized_keys

From now on, you could access the workstations in room 249 using a passphrase. The tmpkey file contains a private key.

wk03 [~/.ssh] (41) % ssh wk03 -i ~/.ssh/tmpkey
Enter passphrase for key '/Users/jwallin/.ssh/tmpkey':

4.8 Installing the private key

You can TEMPORARILY install the private key by making a link between tmpkey and a file named identity in the .ssh directory.

cd ~/.ssh
ln -s tmpkey identity

Now, it will automatically prompt you for your passphrase.

wk03 [~/.ssh] (6) % ssh wk09
Enter passphrase for key
'/Users/jwallin/.ssh/identity':
wk09 [~] (7) % 

4.9 Putting the phase phrase in RAM - Not needed for MPI

To put the pass phrase in RAM for a particular window, you need to do two steps using eval `ssh-agent` and ssh-add.

k03 [~/.ssh] (7) % eval `ssh-agent`
Agent pid 7712
wk03 [~/.ssh] (8) % ssh-add
Enter passphrase for /Users/jwallin/.ssh/identity:
Identity added: /Users/jwallin/.ssh/identity
(/Users/jwallin/.ssh/identity)
wk03 [~/.ssh] (9) % ssh wk09
wk09 [~] (1) %

At this point, you can transfer messages in a simple way. Note...this will not work well between multiple computers with two way connections. The pass phrase lives in memory on one machine, so you can't ssh to wk08 and then ssh back to wk03 without providing a pass phrase.

4.10 MPI Configuration - Add workstations to the list of known hosts

  • ssh to the workstations you want to run on with mpi
  • make sure to they are added to the list of known hosts
wk13% ssh wk10
The authenticity of host 'wk10 (129.174.119.19)' can't be eRSA key fingerprint is 16:af:05:8a:e1:b7:e3:ff:79:29:5c:45:Are you sure you want to continue connecting (yes/no)?
Warning: Permanently added 'wk10,129.174.119.19' (RSA) to tjwallin@wk10's password:
wk10%

4.11 MPI configuration - Creating mpd configuration file

% cd $HOME
% touch .mpd.conf
% chmod 6000 .mpd.conf

Add a line with a "secret word"

MPD_SECRETWORD=secretword

4.12 MPI configuration - Create a host file

vi hostfile
wk10
wk11
wk12
wk13
wk16
wk17

4.13 MPI configuration - Start the MPI daemon

mpdtrace

4.14 MPI configuration - Creating mpd configuration file

This is not always needed... but

mpdhelp
mpdtrace
mpdallexit
mpdboot -n 2 -f /Users/jwallin/c702f07/mpich/hostfile -v
mpdtrace

2 is the number of notes in the host file.

4.15 Compile an Example Code

mpif90 fpi.f
mpicc cpi.c

4.16 Run the Program

% mpirun -np 3 a.out
[deep meaning appears here]
%

4.17 Example MPI Code

program testmpi

include "mpif.h"

integer :: ierr
integer :: nproc, my_id

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
print*,my_id , ' node of ',nproc, ' is on-line'

call MPI_FINALIZE(ierr)

end program testmpi

4.18 MPI on the COS Workstations

wk03 [~/MPISTUFF] (15) % mpif77 -o hellof77 \
-I/usr/include/ hello.f
WARNING: mpif77 expected to find liblammpi++.* in
WARNING: MPI C++ API support will be disabled
WARNING: mpif77 expected to find liblammpio.* in
WARNING: MPI-2 IO support will be disabled
wk03 [~/MPISTUFF] (16) %
wk03 [~/MPISTUFF] (16) % mpirun -np 3 hellof77
0 node of 3 is on-line
2 node of 3 is on-line
1 node of 3 is on-line
wk03 [~/MPISTUFF] (17) %

4.19 General Message Passing

There are only a few things any parallel message passing library can do:

  • initialize the nodes for computation
  • provide basic nodal and nodal array information
  • global operations
  • point-to-point communication between nodes
  • aid in domain decomposition
  • aid in performance benchmarking
  • provide parallel io

4.20 The ESSENTIAL Truths

EACH NODE IS RUNNING THE PROGRAM SEPARATELY ALL DATA NEEDED MUST BE EXPLICITLY PASSED BETWEEN PROCESSORS

4.21 Example 1

Imagine you have to read a data file before the program can executes. You have two choices:

  • read the data from the root node and pass information to all other nodes
  • copy the data file to all other nodes, and let them read it locally

Both of these have disadvantages and advantages,depending on your application.

4.22 Example 2

Imagine that you have completed some local calculation, but other nodes need to have this updated data.YOU MUST EXPLICITLY COMMUNICATE THE DATA TO THE OTHER PROCESSORS.

4.23 Very Basic MPI - Fortran

All Fortran and Fortran 90 MPI codes have four calls in common.

include "mpif.h"
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

do something useful

call MPI_FINALIZE(ierr)
  • The include statement defines the variable MPI COMM WORLD and the MPI prototypes.
  • MPI INIT and MPI FINALIZE start and stop the MPI communication library.
  • ierr is an integer variable which is assigned a value of zero when the call is completed successfully.

4.24 Very Basic MPI - C

#include "mpi.h"
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);

do useful stuff

MPI_Finalize();

functions return a value of zero if successful.

4.25 Very Basic MPI - C++

#include "mpi.h"
MPI::Init(argc, argv);
size = MPI::COMM_WORLD.Get_size();
rank = MPI::COMM_WORLD.Get_rank();

do useful stuff

MPI::Finalize();

5 Nodal Operations

Basic nodal operations include knowing what node the program is running on and knowing the number (and addresses) of other nodes.

int MPI Comm rank(MPI Comm comm, int * my node)
int MPI Comm size(MPI Comm comm, int *num nodes)

5.1 Ranks and Sizes in MPI

  • Sizes are given by the number of processor in the MPI COMM
  • ranks range from 0 to size-1 in all languages,including Fortran
  • sizes and ranks are defined only for each processor group
  • The variable MPI COMM WORLD is a global defined the full set of processors.Subgroups of processors can also be used if they are properly defined.

5.2 Global Operations

Global operations include synchronization, broadcasts, and reductions.

  • Synchronization is necessary to cause all processors to stop when they reach a particular point within a program.
  • Broadcasts transmit data from a single node to the rest of the computational nodes.
  • Reductions include sums, sorts, maximums and minimums within a distributed array.

5.3 Synchronizing Nodes

int MPI Barrier(MPI Comm comm, int ierr);

5.4 Global Broadcasts

Global broadcasts are often used to communicate input data to the rest of the nodes. The syntax is

int MPI Bcast( <type> buf, int count, MPI Datatype type, int root, MPI Comm comm, int ierr);

5.5 MPI Broadcasts

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

MPI::COMM_WORLD.Bcast(&n, 1, MPI::INT, 0);

call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
  • n value or array to be broadcast
  • number of elements to be broadcast
  • MPI type of elements to be broadcast
  • origin node of broadcast
  • COMM group of broadcast
  • error on call

6 Parallel Reductions

Reductions include a set of parallel operations on a data set which is spread across multiple nodes. Summation is perhaps the easiest and most obvious.
int MPI Reduce( void *sendbuf, void *recbuf, int cnt,
MPI Datatype datatype, MPI Op op, int root, MPI Comm comm);

op includes MPI SUM

6.1 MPI Reductions

MPI::COMM_WORLD.Reduce(&mypi, &pi,1, MPI::DOUBLE,MPI::SUM, 0);

call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,& MPI_SUM,0,& MPI_COMM_WORLD,ierr)

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,MPI_COMM_WORLD);
  • local variable(s) used as input for reduction
  • variable reduction is completed on
  • number of elements to be reduced
  • MPI data type
  • target process where reduction is completed
  • MPI comm type
  • error

6.2 Point-to-Point Communication

There are three general types of point to point communication which may be available in parallel machines.

  • synchronous or blocking - the nodes (receiving and perhaps sending) halt until the communication is complete
  • asynchronous or non-blocking - the nodes send and forget the message, check it when ever desired
  • interrupt driven communication - not available in MPI

7 Example

This is an example of broadcasting input from a source:

/* **************************************
Program: cpi.c
 
Author:
   Argornne National Library
   http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
 
Modified By:
   Jonathan Lisic
*************************************** */
 
#include "mpi.h" 
#include <stdio.h> 
#include <math.h> 
int main( int argc, char *argv[] ) 
{ 
    int n, myid, numprocs, i; 
    double PI25DT = 3.141592653589793238462643; 
    double mypi, pi, h, sum, x; 
    MPI_Init(&argc,&argv); 
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs); 
    MPI_Comm_rank(MPI_COMM_WORLD,&myid); 
    while (1) {
 
        /* Check if we are the first node */
        if (myid == 0) { 
            printf("Enter the number of intervals: (0 quits) "); 
            scanf("%d",&n); 
        } 
 
        /* Broadcast n */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
 
        /* if n is 0 then stop */ 
        if (n == 0) { 
            break; 
        }
        /* else, calculate pi for our a predetermined slice slice */
        else { 
            h   = 1.0 / (double) n; 
            sum = 0.0; 
            for (i = myid + 1; i <= n; i += numprocs) { 
                x = h * ((double)i - 0.5); 
                sum += (4.0 / (1.0 + x*x)); 
            } 
            mypi = h * sum; 
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, 
		       MPI_COMM_WORLD); 
            if (myid == 0)  
                printf("pi is approximately %.16f, Error is %.16f\n", 
                       pi, fabs(pi - PI25DT)); 
        } 
    } 
    MPI_Finalize(); 
    return 0; 
}

7.1 Blocking Sends and Receives

Blocking the execution of a routine until it has received a new data set can be used to synchronize the computational results between nodes. In many cases, blocking communications are desirable. There is a "ring toss" program which illustrates this communication fairly well.

Node 1 Node 1 Node 1 Node 1
Node 2 Node 2 Node 2 Node 2
Node 3 Node 3 Node 3 Node 3
Node 4 Node 4 Node 4 Node 4

The communication is passed from one processor to the final processor via blocking sends and receives. The syntax for blocking sends is

 
int MPI Send( void *buf, int count, MPI Datatype datatype, int dest, int type, MPI Comm comm);
int MPI Recv( void *buf, int count, MPI Datatype datatype, int source, int type, MPI Comm comm, MPI Status status);

7.2 MPI Sends - Fortran

integer :: stats(MPI_STATUS_SIZE)

if (my_id == source) then
 call MPI_SEND(msg, lngth, MPI_INTEGER,
& dest, tag, MPI_COMM_WORLD, ierr)
endif

if (my_id == dest) then
 call MPI_RECV( msg, lngth, MPI_INTEGER, &
 source, tag, MPI_COMM_WORLD, stats, ierr)
endif
call MPI_SEND(msg, lngth, MPI_INTEGER,
& dest, tag, MPI_COMM_WORLD, ierr)
  • msg - message array
  • lngth - length or array
  • MPI type
  • dest/source - destination or message source
  • tag - message tag - must same on send and recv
  • MPI comm
  • stats - information about the receive
  • error flag

7.3 Non-Blocking Sends and Receives

Non-blocking sends and receives are very much like email. You send and forget the message. The receiving processor checks for "mail" when it feels like it. Neither process halts for message checking if no message is present. Probe command are used to check for message receipt.The syntax for non-blocking sends and receives is usually IDENTICAL to blocking sends and receives, except, the command name is changed slightly. For example:

int MPI_Send( void *buf, int count, MPI_Datatype datatype,int dest, int type, MPI_Comm comm);
int MPI_Recv( void *buf, int count, MPI_Datatype datatype,int source, int type, MPI_Comm comm, MPI_Status status);

becomes

int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int type, MPI_Comm comm);
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int type, MPI_Comm comm, MPI_Status status);
int MPI_Iprobe(int source, int type, MPI_Comm comm,
MPI_Status status, boolean flag, int status[], int err);

8 MPI Overview

  • startup / shutdown
  • local information - node id, number of nodes
  • global operations
    • broadcasts
    • reductions
    • synchronization
  • sends and receives
    • block vs non-blocking
  • domain decomposition tools
    • new comm types
    • virtual domains
  • data type creation tools


9 Examples

Numerical Integration

Consider integration for the following function:
\pi =\int_{-1/2}^{1/2}\frac{4.0}{1+x^2}

We can approximate this integral using Simpson's algorithm:
 \int_{a}^{b} f(x) \, dx \approx \frac{b-a}{6}\left[f(a) + 4f\left(\frac{a+b}{2}\right)+f(b)\right][1]

Serial Integration:

  • input the number of partitions to be used
  • divide the domain into n partitions
  • evaluate the function at each partition
  • multiply the function evaluation times the width of the function to find a differential area
  • add the differential areas together
  • output the result

Parallel Integration:

  • on processor zero, input the number of partitions
  • broadcast the user input information to all processors
  • determine the number of processors - m
  • divide the domain into n/m partitions on each processor
  • evaluate the function at each partition
  • multiply the function evaluation times the width of the function to find a differential area
  • add the differential areas together across all the processors
  • on processor zero, output the result

Parallel Sorting

  • distribute an equal number of elements to each processor
  • on each processor, sort the local array
  • do a global exchange to find the minimum and maximum over all elements
  • on each processor, put the data into a set of evenly spaced bins (the number of bins should be larger than the number of processors)
  • use a global summation to find the number of elements in each bin
  • broadcast the results of the summation to all processors
  • using the summation data, the processor id, and the number of processors, determine what data should be shipped to which processor
  • on each processor, create a set of coordinated sends and receives to ship the data to the correct processor
  • re-sort the data locally

10 Exchanging Data: Processor Sending - P1

Exchanging data can be a bit complex. Mostly, this is just a set of bookkeeping exercises that need to be done. Assume we are shipping some part of an array on processor p1 to processor p2. On p1, we need to:

  • determine the data that needs to be sent from p1
  • copy the data into a temporary array on p1
  • sending the data from p1
  • deleting the data from the original p1 array

10.1 Exchanging Data: Processor Receiving - P2

on p2, we need to

  • prepare a temporary array to receive data from p1
  • set up a receive to accept the data
  • integrate the data from the temporary array into p2's array
    • make sure there is enough space to receive the new data
    • re-index the new data, if needed
    • copy the new data into p2's array
  • delete the temporary array
Personal tools