|   | @node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top | ||
|  | @chapter Distributed-memory FFTW with MPI | ||
|  | @cindex MPI | ||
|  | 
 | ||
|  | @cindex parallel transform | ||
|  | In this chapter we document the parallel FFTW routines for parallel | ||
|  | systems supporting the MPI message-passing interface.  Unlike the | ||
|  | shared-memory threads described in the previous chapter, MPI allows | ||
|  | you to use @emph{distributed-memory} parallelism, where each CPU has | ||
|  | its own separate memory, and which can scale up to clusters of many | ||
|  | thousands of processors.  This capability comes at a price, however: | ||
|  | each process only stores a @emph{portion} of the data to be | ||
|  | transformed, which means that the data structures and | ||
|  | programming-interface are quite different from the serial or threads | ||
|  | versions of FFTW. | ||
|  | @cindex data distribution | ||
|  | 
 | ||
|  | 
 | ||
|  | Distributed-memory parallelism is especially useful when you are | ||
|  | transforming arrays so large that they do not fit into the memory of a | ||
|  | single processor.  The storage per-process required by FFTW's MPI | ||
|  | routines is proportional to the total array size divided by the number | ||
|  | of processes.  Conversely, distributed-memory parallelism can easily | ||
|  | pose an unacceptably high communications overhead for small problems; | ||
|  | the threshold problem size for which parallelism becomes advantageous | ||
|  | will depend on the precise problem you are interested in, your | ||
|  | hardware, and your MPI implementation. | ||
|  | 
 | ||
|  | A note on terminology: in MPI, you divide the data among a set of | ||
|  | ``processes'' which each run in their own memory address space. | ||
|  | Generally, each process runs on a different physical processor, but | ||
|  | this is not required.  A set of processes in MPI is described by an | ||
|  | opaque data structure called a ``communicator,'' the most common of | ||
|  | which is the predefined communicator @code{MPI_COMM_WORLD} which | ||
|  | refers to @emph{all} processes.  For more information on these and | ||
|  | other concepts common to all MPI programs, we refer the reader to the | ||
|  | documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home | ||
|  | page}. | ||
|  | @cindex MPI communicator | ||
|  | @ctindex MPI_COMM_WORLD | ||
|  | 
 | ||
|  | 
 | ||
|  | We assume in this chapter that the reader is familiar with the usage | ||
|  | of the serial (uniprocessor) FFTW, and focus only on the concepts new | ||
|  | to the MPI interface. | ||
|  | 
 | ||
|  | @menu | ||
|  | * FFTW MPI Installation:: | ||
|  | * Linking and Initializing MPI FFTW:: | ||
|  | * 2d MPI example:: | ||
|  | * MPI Data Distribution:: | ||
|  | * Multi-dimensional MPI DFTs of Real Data:: | ||
|  | * Other Multi-dimensional Real-data MPI Transforms:: | ||
|  | * FFTW MPI Transposes:: | ||
|  | * FFTW MPI Wisdom:: | ||
|  | * Avoiding MPI Deadlocks:: | ||
|  | * FFTW MPI Performance Tips:: | ||
|  | * Combining MPI and Threads:: | ||
|  | * FFTW MPI Reference:: | ||
|  | * FFTW MPI Fortran Interface:: | ||
|  | @end menu | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Installation | ||
|  | 
 | ||
|  | All of the FFTW MPI code is located in the @code{mpi} subdirectory of | ||
|  | the FFTW package.  On Unix systems, the FFTW MPI libraries and header | ||
|  | files are automatically configured, compiled, and installed along with | ||
|  | the uniprocessor FFTW libraries simply by including | ||
|  | @code{--enable-mpi} in the flags to the @code{configure} script | ||
|  | (@pxref{Installation on Unix}). | ||
|  | @fpindex configure | ||
|  | 
 | ||
|  | 
 | ||
|  | Any implementation of the MPI standard, version 1 or later, should | ||
|  | work with FFTW.  The @code{configure} script will attempt to | ||
|  | automatically detect how to compile and link code using your MPI | ||
|  | implementation.  In some cases, especially if you have multiple | ||
|  | different MPI implementations installed or have an unusual MPI | ||
|  | software package, you may need to provide this information explicitly. | ||
|  | 
 | ||
|  | Most commonly, one compiles MPI code by invoking a special compiler | ||
|  | command, typically @code{mpicc} for C code.  The @code{configure} | ||
|  | script knows the most common names for this command, but you can | ||
|  | specify the MPI compilation command explicitly by setting the | ||
|  | @code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}. | ||
|  | @fpindex mpicc | ||
|  | 
 | ||
|  | 
 | ||
|  | If, instead of a special compiler command, you need to link a certain | ||
|  | library, you can specify the link command via the @code{MPILIBS} | ||
|  | variable, as in @samp{./configure MPILIBS=-lmpi ...}.  Note that if | ||
|  | your MPI library is installed in a non-standard location (one the | ||
|  | compiler does not know about by default), you may also have to specify | ||
|  | the location of the library and header files via @code{LDFLAGS} and | ||
|  | @code{CPPFLAGS} variables, respectively, as in @samp{./configure | ||
|  | LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}. | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI | ||
|  | @section Linking and Initializing MPI FFTW | ||
|  | 
 | ||
|  | Programs using the MPI FFTW routines should be linked with | ||
|  | @code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision, | ||
|  | @code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on | ||
|  | (@pxref{Precision}). You will also need to link with whatever library | ||
|  | is responsible for MPI on your system; in most MPI implementations, | ||
|  | there is a special compiler alias named @code{mpicc} to compile and | ||
|  | link MPI code. | ||
|  | @fpindex mpicc | ||
|  | @cindex linking on Unix | ||
|  | @cindex precision | ||
|  | 
 | ||
|  | 
 | ||
|  | @findex fftw_init_threads | ||
|  | Before calling any FFTW routines except possibly | ||
|  | @code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling | ||
|  | @code{MPI_Init}, you should call the function: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_mpi_init(void); | ||
|  | @end example | ||
|  | @findex fftw_mpi_init | ||
|  | 
 | ||
|  | If, at the end of your program, you want to get rid of all memory and | ||
|  | other resources allocated internally by FFTW, for both the serial and | ||
|  | MPI routines, you can call: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_mpi_cleanup(void); | ||
|  | @end example | ||
|  | @findex fftw_mpi_cleanup | ||
|  | 
 | ||
|  | which is much like the @code{fftw_cleanup()} function except that it | ||
|  | also gets rid of FFTW's MPI-related data.  You must @emph{not} execute | ||
|  | any previously created plans after calling this function. | ||
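
For illustration, a minimal program skeleton showing where these calls fit
relative to MPI's own setup and teardown might look like the following
(this is only a sketch, not a complete program):

@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{... create plans, execute transforms, destroy plans ...} */

    fftw_mpi_cleanup(); /* @r{after all plans have been destroyed} */
    MPI_Finalize();
    return 0;
@}
@end example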
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI | ||
|  | @section 2d MPI example | ||
|  | 
 | ||
|  | Before we document the FFTW MPI interface in detail, we begin with a | ||
|  | simple example outlining how one would perform a two-dimensional | ||
|  | @code{N0} by @code{N1} complex DFT.  | ||
|  | 
 | ||
|  | @example | ||
|  | #include <fftw3-mpi.h> | ||
|  | 
 | ||
|  | int main(int argc, char **argv) | ||
|  | @{ | ||
|  |     const ptrdiff_t N0 = ..., N1 = ...; | ||
|  |     fftw_plan plan; | ||
|  |     fftw_complex *data; | ||
|  |     ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | ||
|  | 
 | ||
|  |     MPI_Init(&argc, &argv); | ||
|  |     fftw_mpi_init(); | ||
|  | 
 | ||
|  |     /* @r{get local data size and allocate} */ | ||
|  |     alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD, | ||
|  |                                          &local_n0, &local_0_start); | ||
|  |     data = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  |     /* @r{create plan for in-place forward DFT} */ | ||
|  |     plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD, | ||
|  |                                 FFTW_FORWARD, FFTW_ESTIMATE);     | ||
|  | 
 | ||
|  |     /* @r{initialize data to some function} my_function(x,y) */ | ||
|  |     for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j) | ||
|  |        data[i*N1 + j] = my_function(local_0_start + i, j); | ||
|  | 
 | ||
|  |     /* @r{compute transforms, in-place, as many times as desired} */ | ||
|  |     fftw_execute(plan); | ||
|  | 
 | ||
|  |     fftw_destroy_plan(plan); | ||
|  | 
 | ||
|  |     MPI_Finalize(); | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||
|  | As can be seen above, the MPI interface follows the same basic style | ||
|  | of allocate/plan/execute/destroy as the serial FFTW routines.  All of | ||
|  | the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead | ||
|  | of @samp{fftw_}.  There are a few important differences, however: | ||
|  | 
 | ||
|  | First, we must call @code{fftw_mpi_init()} after calling | ||
|  | @code{MPI_Init} (required in all MPI programs) and before calling any | ||
|  | other @samp{fftw_mpi_} routine. | ||
|  | @findex MPI_Init | ||
|  | @findex fftw_mpi_init | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, when we create the plan with @code{fftw_mpi_plan_dft_2d}, | ||
|  | analogous to @code{fftw_plan_dft_2d}, we pass an additional argument: | ||
|  | the communicator, indicating which processes will participate in the | ||
|  | transform (here @code{MPI_COMM_WORLD}, indicating all processes). | ||
|  | Whenever you create, execute, or destroy a plan for an MPI transform, | ||
|  | you must call the corresponding FFTW routine on @emph{all} processes | ||
|  | in the communicator for that transform.  (That is, these are | ||
@emph{collective} calls.)  Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan}
routines (on the other hand, there are MPI-specific new-array execute
functions documented below).
|  | @cindex collective function | ||
|  | @findex fftw_mpi_plan_dft_2d | ||
|  | @ctindex MPI_COMM_WORLD | ||
|  | 
 | ||
|  | 
 | ||
|  | Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments | ||
|  | instead of @code{int} as for the serial FFTW.  @code{ptrdiff_t} is a | ||
|  | standard C integer type which is (at least) 32 bits wide on a 32-bit | ||
|  | machine and 64 bits wide on a 64-bit machine.  This is to make it easy | ||
|  | to specify very large parallel transforms on a 64-bit machine.  (You | ||
|  | can specify 64-bit transform sizes in the serial FFTW, too, but only | ||
|  | by using the @samp{guru64} planner interface.  @xref{64-bit Guru | ||
|  | Interface}.) | ||
|  | @tindex ptrdiff_t | ||
|  | @cindex 64-bit architecture | ||
|  | 
 | ||
|  | 
 | ||
|  | Fourth, and most importantly, you don't allocate the entire | ||
|  | two-dimensional array on each process.  Instead, you call | ||
|  | @code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the | ||
|  | array resides on each processor, and how much space to allocate. | ||
|  | Here, the portion of the array on each process is a @code{local_n0} by | ||
|  | @code{N1} slice of the total array, starting at index | ||
|  | @code{local_0_start}.  The total number of @code{fftw_complex} numbers | ||
|  | to allocate is given by the @code{alloc_local} return value, which | ||
|  | @emph{may} be greater than @code{local_n0 * N1} (in case some | ||
|  | intermediate calculations require additional storage).  The data | ||
|  | distribution in FFTW's MPI interface is described in more detail by | ||
|  | the next section. | ||
|  | @findex fftw_mpi_local_size_2d | ||
|  | @cindex data distribution | ||
|  | 
 | ||
|  | 
 | ||
|  | Given the portion of the array that resides on the local process, it | ||
is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the end
|  | of the program you may want to output the data somehow, but | ||
|  | synchronizing this output is up to you and is beyond the scope of this | ||
|  | manual.  (One good way to output a large multi-dimensional distributed | ||
|  | array in MPI to a portable binary file is to use the free HDF5 | ||
|  | library; see the @uref{http://www.hdfgroup.org/, HDF home page}.) | ||
|  | @cindex HDF5 | ||
|  | @cindex MPI I/O | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI | ||
|  | @section MPI Data Distribution | ||
|  | @cindex data distribution | ||
|  | 
 | ||
|  | The most important concept to understand in using FFTW's MPI interface | ||
|  | is the data distribution.  With a serial or multithreaded FFT, all of | ||
|  | the inputs and outputs are stored as a single contiguous chunk of | ||
|  | memory.  With a distributed-memory FFT, the inputs and outputs are | ||
|  | broken into disjoint blocks, one per process. | ||
|  | 
 | ||
|  | In particular, FFTW uses a @emph{1d block distribution} of the data, | ||
|  | distributed along the @emph{first dimension}.  For example, if you | ||
|  | want to perform a @twodims{100,200} complex DFT, distributed over 4 | ||
|  | processes, each process will get a @twodims{25,200} slice of the data. | ||
|  | That is, process 0 will get rows 0 through 24, process 1 will get rows | ||
|  | 25 through 49, process 2 will get rows 50 through 74, and process 3 | ||
|  | will get rows 75 through 99.  If you take the same array but | ||
|  | distribute it over 3 processes, then it is not evenly divisible so the | ||
|  | different processes will have unequal chunks.  FFTW's default choice | ||
|  | in this case is to assign 34 rows to processes 0 and 1, and 32 rows to | ||
|  | process 2. | ||
|  | @cindex block distribution | ||
|  | 
 | ||
|  | 
 | ||
|  | FFTW provides several @samp{fftw_mpi_local_size} routines that you can | ||
|  | call to find out what portion of an array is stored on the current | ||
|  | process.  In most cases, you should use the default block sizes picked | ||
|  | by FFTW, but it is also possible to specify your own block size.  For | ||
|  | example, with a @twodims{100,200} array on three processes, you can | ||
|  | tell FFTW to use a block size of 40, which would assign 40 rows to | ||
|  | processes 0 and 1, and 20 rows to process 2.  FFTW's default is to | ||
|  | divide the data equally among the processes if possible, and as best | ||
|  | it can otherwise.  The rows are always assigned in ``rank order,'' | ||
|  | i.e. process 0 gets the first block of rows, then process 1, and so | ||
|  | on.  (You can change this by using @code{MPI_Comm_split} to create a | ||
|  | new communicator with re-ordered processes.)  However, you should | ||
|  | always call the @samp{fftw_mpi_local_size} routines, if possible, | ||
|  | rather than trying to predict FFTW's distribution choices. | ||
|  | 
 | ||
|  | In particular, it is critical that you allocate the storage size that | ||
|  | is returned by @samp{fftw_mpi_local_size}, which is @emph{not} | ||
|  | necessarily the size of the local slice of the array.  The reason is | ||
|  | that intermediate steps of FFTW's algorithms involve transposing the | ||
|  | array and redistributing the data, so at these intermediate steps FFTW | ||
|  | may require more local storage space (albeit always proportional to | ||
|  | the total size divided by the number of processes).  The | ||
|  | @samp{fftw_mpi_local_size} functions know how much storage is required | ||
|  | for these intermediate steps and tell you the correct amount to | ||
|  | allocate. | ||
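
For illustration, a short sketch for the @twodims{100,200} complex example
above, in which each process queries its slice and allocates exactly what
FFTW requests, might read as follows (a fragment only; it assumes MPI and
FFTW have already been initialized and that @code{<stdio.h>} is included):

@example
    ptrdiff_t local_n0, local_0_start, alloc_local;
    fftw_complex *data;
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local); /* @r{not local_n0 * 200} */
    printf("process %d owns rows %td to %td\n", rank,
           local_0_start, local_0_start + local_n0 - 1);
@end example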
|  | 
 | ||
|  | @menu | ||
|  | * Basic and advanced distribution interfaces:: | ||
|  | * Load balancing:: | ||
|  | * Transposed distributions:: | ||
|  | * One-dimensional distributions:: | ||
|  | @end menu | ||
|  | 
 | ||
|  | @node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution | ||
|  | @subsection Basic and advanced distribution interfaces | ||
|  | 
 | ||
|  | As with the planner interface, the @samp{fftw_mpi_local_size} | ||
|  | distribution interface is broken into basic and advanced | ||
|  | (@samp{_many}) interfaces, where the latter allows you to specify the | ||
|  | block size manually and also to request block sizes when computing | ||
|  | multiple transforms simultaneously.  These functions are documented | ||
|  | more exhaustively by the FFTW MPI Reference, but we summarize the | ||
|  | basic ideas here using a couple of two-dimensional examples. | ||
|  | 
 | ||
|  | For the @twodims{100,200} complex-DFT example, above, we would find | ||
|  | the distribution by calling the following function in the basic | ||
|  | interface: | ||
|  | 
 | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | ||
|  |                                  ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | ||
|  | @end example | ||
|  | @findex fftw_mpi_local_size_2d | ||
|  | 
 | ||
|  | Given the total size of the data to be transformed (here, @code{n0 = | ||
|  | 100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this | ||
|  | function provides three numbers. | ||
|  | 
 | ||
|  | First, it describes the shape of the local data: the current process | ||
|  | should store a @code{local_n0} by @code{n1} slice of the overall | ||
|  | dataset, in row-major order (@code{n1} dimension contiguous), starting | ||
|  | at index @code{local_0_start}.  That is, if the total dataset is | ||
|  | viewed as a @code{n0} by @code{n1} matrix, the current process should | ||
|  | store the rows @code{local_0_start} to | ||
|  | @code{local_0_start+local_n0-1}.  Obviously, if you are running with | ||
|  | only a single MPI process, that process will store the entire array: | ||
|  | @code{local_0_start} will be zero and @code{local_n0} will be | ||
|  | @code{n0}.  @xref{Row-major Format}. | ||
|  | @cindex row-major | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, the return value is the total number of data elements (e.g., | ||
|  | complex numbers for a complex DFT) that should be allocated for the | ||
|  | input and output arrays on the current process (ideally with | ||
|  | @code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal | ||
|  | alignment).  It might seem that this should always be equal to | ||
|  | @code{local_n0 * n1}, but this is @emph{not} the case.  FFTW's | ||
|  | distributed FFT algorithms require data redistributions at | ||
|  | intermediate stages of the transform, and in some circumstances this | ||
|  | may require slightly larger local storage.  This is discussed in more | ||
|  | detail below, under @ref{Load balancing}. | ||
|  | @findex fftw_malloc | ||
|  | @findex fftw_alloc_complex | ||
|  | 
 | ||
|  | 
 | ||
|  | @cindex advanced interface | ||
|  | The advanced-interface @samp{local_size} function for multidimensional | ||
|  | transforms returns the same three things (@code{local_n0}, | ||
|  | @code{local_0_start}, and the total number of elements to allocate), | ||
|  | but takes more inputs: | ||
|  | 
 | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, | ||
|  |                                    ptrdiff_t howmany, | ||
|  |                                    ptrdiff_t block0, | ||
|  |                                    MPI_Comm comm, | ||
|  |                                    ptrdiff_t *local_n0, | ||
|  |                                    ptrdiff_t *local_0_start); | ||
|  | @end example | ||
|  | @findex fftw_mpi_local_size_many | ||
|  | 
 | ||
|  | The two-dimensional case above corresponds to @code{rnk = 2} and an | ||
|  | array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}. | ||
|  | This routine is for any @code{rnk > 1}; one-dimensional transforms | ||
|  | have their own interface because they work slightly differently, as | ||
|  | discussed below. | ||
|  | 
 | ||
|  | First, the advanced interface allows you to perform multiple | ||
|  | transforms at once, of interleaved data, as specified by the | ||
@code{howmany} parameter.  (@code{howmany} is 1 for a single
transform.)
|  | 
 | ||
|  | Second, here you can specify your desired block size in the @code{n0} | ||
|  | dimension, @code{block0}.  To use FFTW's default block size, pass | ||
@code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}.  Otherwise, on
@code{P} processes, FFTW will return @code{local_n0} equal to
@code{block0} on the first @code{n0 / block0} processes (rounded down),
return @code{local_n0} equal to @code{n0 - block0 * (n0 / block0)} on
the next process, and @code{local_n0} equal to zero on any remaining
processes.  In general, we recommend using the default block size
(which corresponds to @code{n0 / P}, rounded up).
|  | @ctindex FFTW_MPI_DEFAULT_BLOCK | ||
|  | @cindex block distribution | ||
|  | 
 | ||
|  | 
 | ||
|  | For example, suppose you have @code{P = 4} processes and @code{n0 = | ||
|  | 21}.  The default will be a block size of @code{6}, which will give | ||
|  | @code{local_n0 = 6} on the first three processes and @code{local_n0 = | ||
|  | 3} on the last process.  Instead, however, you could specify | ||
|  | @code{block0 = 5} if you wanted, which would give @code{local_n0 = 5} | ||
|  | on processes 0 to 2, @code{local_n0 = 6} on process 3.  (This choice, | ||
|  | while it may look superficially more ``balanced,'' has the same | ||
|  | critical path as FFTW's default but requires more communications.) | ||
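
As a sketch of the advanced call, the @twodims{100,200} case with two
interleaved transforms (@code{howmany = 2}) and the default block size
might be handled as follows (a fragment only; the usual declarations and
MPI/FFTW initialization are assumed):

@example
    const ptrdiff_t n[2] = @{100, 200@};
    ptrdiff_t local_n0, local_0_start, alloc_local;
    fftw_complex *data;

    /* @r{howmany = 2: two interleaved transforms} */
    alloc_local = fftw_mpi_local_size_many(2, n, 2,
                                           FFTW_MPI_DEFAULT_BLOCK,
                                           MPI_COMM_WORLD,
                                           &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local); /* @r{covers both transforms} */
@end example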
|  | 
 | ||
|  | @node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution | ||
|  | @subsection Load balancing | ||
|  | @cindex load balancing | ||
|  | 
 | ||
|  | Ideally, when you parallelize a transform over some @math{P} | ||
|  | processes, each process should end up with work that takes equal time. | ||
|  | Otherwise, all of the processes end up waiting on whichever process is | ||
|  | slowest.  This goal is known as ``load balancing.''  In this section, | ||
|  | we describe the circumstances under which FFTW is able to load-balance | ||
|  | well, and in particular how you should choose your transform size in | ||
|  | order to load balance. | ||
|  | 
 | ||
|  | Load balancing is especially difficult when you are parallelizing over | ||
heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
|  | Pentium more work to do than the 486 since the latter is much slower. | ||
|  | FFTW does not deal with this problem, however---it assumes that your | ||
|  | processes run on hardware of comparable speed, and that the goal is | ||
|  | therefore to divide the problem as equally as possible. | ||
|  | 
 | ||
|  | For a multi-dimensional complex DFT, FFTW can divide the problem | ||
|  | equally among the processes if: (i) the @emph{first} dimension | ||
|  | @code{n0} is divisible by @math{P}; and (ii), the @emph{product} of | ||
|  | the subsequent dimensions is divisible by @math{P}.  (For the advanced | ||
|  | interface, where you can specify multiple simultaneous transforms via | ||
|  | some ``vector'' length @code{howmany}, a factor of @code{howmany} is | ||
|  | included in the product of the subsequent dimensions.) | ||
|  | 
 | ||
|  | For a one-dimensional complex DFT, the length @code{N} of the data | ||
|  | should be divisible by @math{P} @emph{squared} to be able to divide | ||
|  | the problem equally among the processes. | ||
|  | 
 | ||
|  | @node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution | ||
|  | @subsection Transposed distributions | ||
|  | 
 | ||
|  | Internally, FFTW's MPI transform algorithms work by first computing | ||
|  | transforms of the data local to each process, then by globally | ||
|  | @emph{transposing} the data in some fashion to redistribute the data | ||
|  | among the processes, transforming the new data local to each process, | ||
|  | and transposing back.  For example, a two-dimensional @code{n0} by | ||
|  | @code{n1} array, distributed across the @code{n0} dimension, is | ||
transformed by: (i) transforming the @code{n1} dimension, which is
local to each process; (ii) transposing to an @code{n1} by @code{n0}
|  | array, distributed across the @code{n1} dimension; (iii) transforming | ||
|  | the @code{n0} dimension, which is now local to each process; (iv) | ||
|  | transposing back. | ||
|  | @cindex transpose | ||
|  | 
 | ||
|  | 
 | ||
|  | However, in many applications it is acceptable to compute a | ||
|  | multidimensional DFT whose results are produced in transposed order | ||
|  | (e.g., @code{n1} by @code{n0} in two dimensions).  This provides a | ||
|  | significant performance advantage, because it means that the final | ||
|  | transposition step can be omitted.  FFTW supports this optimization, | ||
|  | which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT} | ||
|  | to the planner routines.  To compute the inverse transform of | ||
|  | transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell | ||
|  | it that the input is transposed.  In this section, we explain how to | ||
|  | interpret the output format of such a transform. | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_OUT | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_IN | ||
|  | 
 | ||
|  | 
 | ||
Suppose you are transforming multi-dimensional data with (at
|  | least two) dimensions @ndims{}.  As always, it is distributed along | ||
|  | the first dimension @dimk{0}.  Now, if we compute its DFT with the | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored | ||
|  | with the first @emph{two} dimensions transposed: @ndimstrans{}, | ||
|  | distributed along the @dimk{1} dimension.  Conversely, if we take the | ||
|  | @ndimstrans{} data and transform it with the | ||
|  | @code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the | ||
|  | original @ndims{} array. | ||
|  | 
 | ||
|  | There are two ways to find the portion of the transposed array that | ||
|  | resides on the current process.  First, you can simply call the | ||
|  | appropriate @samp{local_size} function, passing @ndimstrans{} (the | ||
|  | transposed dimensions).  This would mean calling the @samp{local_size} | ||
|  | function twice, once for the transposed and once for the | ||
|  | non-transposed dimensions.  Alternatively, you can call one of the | ||
|  | @samp{local_size_transposed} functions, which returns both the | ||
|  | non-transposed and transposed data distribution from a single call. | ||
|  | For example, for a 3d transform with transposed output (or input), you | ||
|  | might call: | ||
|  | 
 | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_3d_transposed( | ||
|  |                 ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm, | ||
|  |                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | @end example | ||
|  | @findex fftw_mpi_local_size_3d_transposed | ||
|  | 
 | ||
|  | Here, @code{local_n0} and @code{local_0_start} give the size and | ||
|  | starting index of the @code{n0} dimension for the | ||
|  | @emph{non}-transposed data, as in the previous sections.  For | ||
|  | @emph{transposed} data (e.g. the output for | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and | ||
|  | @code{local_1_start} give the size and starting index of the @code{n1} | ||
|  | dimension, which is the first dimension of the transposed data | ||
|  | (@code{n1} by @code{n0} by @code{n2}). | ||
|  | 
 | ||
|  | (Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to | ||
|  | performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two | ||
|  | dimensions to the planner in reverse order, or vice versa.  If you | ||
|  | pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the | ||
|  | first two dimensions passed to the planner and passing @emph{neither} | ||
|  | flag.) | ||
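
For illustration, a sketch of a 3d complex DFT planned with transposed
output (assuming @code{n0}, @code{n1}, and @code{n2} are @code{ptrdiff_t}
dimensions declared elsewhere) might look like:

@example
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start, alloc_local;
    fftw_complex *data;
    fftw_plan plan;

    alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
                                                    &local_n0, &local_0_start,
                                                    &local_n1, &local_1_start);
    data = fftw_alloc_complex(alloc_local);

    /* @r{input: local_n0 x n1 x n2 slab; output: local_n1 x n0 x n2 slab} */
    plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD,
                                FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
@end example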
|  | 
 | ||
|  | @node One-dimensional distributions,  , Transposed distributions, MPI Data Distribution | ||
|  | @subsection One-dimensional distributions | ||
|  | 
 | ||
|  | For one-dimensional distributed DFTs using FFTW, matters are slightly | ||
|  | more complicated because the data distribution is more closely tied to | ||
|  | how the algorithm works.  In particular, you can no longer pass an | ||
|  | arbitrary block size and must accept FFTW's default; also, the block | ||
sizes may be different for input and output.  Moreover, the data
|  | distribution depends on the flags and transform direction, in order | ||
|  | for forward and backward transforms to work correctly. | ||
|  | 
 | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm, | ||
|  |                 int sign, unsigned flags, | ||
|  |                 ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | ||
|  |                 ptrdiff_t *local_no, ptrdiff_t *local_o_start); | ||
|  | @end example | ||
|  | @findex fftw_mpi_local_size_1d | ||
|  | 
 | ||
|  | This function computes the data distribution for a 1d transform of | ||
|  | size @code{n0} with the given transform @code{sign} and @code{flags}. | ||
|  | Both input and output data use block distributions.  The input on the | ||
|  | current process will consist of @code{local_ni} numbers starting at | ||
|  | index @code{local_i_start}; e.g. if only a single process is used, | ||
|  | then @code{local_ni} will be @code{n0} and @code{local_i_start} will | ||
|  | be @code{0}.  Similarly for the output, with @code{local_no} numbers | ||
|  | starting at index @code{local_o_start}.  The return value of | ||
|  | @code{fftw_mpi_local_size_1d} will be the total number of elements to | ||
|  | allocate on the current process (which might be slightly larger than | ||
|  | the local size due to intermediate steps in the algorithm). | ||
|  | 
 | ||
|  | As mentioned above (@pxref{Load balancing}), the data will be divided | ||
|  | equally among the processes if @code{n0} is divisible by the | ||
|  | @emph{square} of the number of processes.  In this case, | ||
|  | @code{local_ni} will equal @code{local_no}.  Otherwise, they may be | ||
|  | different. | ||
|  | 
 | ||
|  | For some applications, such as convolutions, the order of the output | ||
|  | data is irrelevant.  In this case, performance can be improved by | ||
|  | specifying that the output data be stored in an FFTW-defined | ||
|  | ``scrambled'' format.  (In particular, this is the analogue of | ||
|  | transposed output in the multidimensional case: scrambled output saves | ||
|  | a communications step.)  If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in | ||
|  | the flags, then the output is stored in this (undocumented) scrambled | ||
|  | order.  Conversely, to perform the inverse transform of data in | ||
|  | scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag. | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_OUT | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_IN | ||
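
For illustration, a sketch of a forward 1d transform that leaves its
output in the scrambled order (assuming @code{n0} is declared elsewhere)
might be:

@example
    ptrdiff_t local_ni, local_i_start, local_no, local_o_start, alloc_local;
    fftw_complex *in, *out;
    fftw_plan fwd;

    alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                         FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                                         &local_ni, &local_i_start,
                                         &local_no, &local_o_start);
    in = fftw_alloc_complex(alloc_local);
    out = fftw_alloc_complex(alloc_local);

    /* @r{forward transform whose output is left in FFTW's scrambled order} */
    fwd = fftw_mpi_plan_dft_1d(n0, in, out, MPI_COMM_WORLD,
                               FFTW_FORWARD,
                               FFTW_MEASURE | FFTW_MPI_SCRAMBLED_OUT);
@end example

The corresponding inverse plan would pass @code{FFTW_MPI_SCRAMBLED_IN}
(and @code{FFTW_BACKWARD}) instead.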
|  | 
 | ||
|  | 
 | ||
|  | In MPI FFTW, only composite sizes @code{n0} can be parallelized; we | ||
|  | have not yet implemented a parallel algorithm for large prime sizes. | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI | ||
|  | @section Multi-dimensional MPI DFTs of Real Data | ||
|  | 
 | ||
|  | FFTW's MPI interface also supports multi-dimensional DFTs of real | ||
|  | data, similar to the serial r2c and c2r interfaces.  (Parallel | ||
|  | one-dimensional real-data DFTs are not currently supported; you must | ||
|  | use a complex transform and set the imaginary parts of the inputs to | ||
|  | zero.) | ||
|  | 
 | ||
|  | The key points to understand for r2c and c2r MPI transforms (compared | ||
to the MPI complex DFTs or the serial r2c/c2r transforms) are:
|  | 
 | ||
|  | @itemize @bullet | ||
|  | 
 | ||
|  | @item | ||
|  | Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real | ||
|  | data to/from @ndimshalf{} complex data: the last dimension of the | ||
|  | complex data is cut in half (rounded down), plus one.  As for the | ||
|  | serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and | ||
|  | @samp{plan_dft_c2r} are the @ndims{} dimensions of the real data. | ||
|  | 
 | ||
|  | @item | ||
|  | @cindex padding | ||
|  | Although the real data is @emph{conceptually} @ndims{}, it is | ||
|  | @emph{physically} stored as an @ndimspad{} array, where the last | ||
|  | dimension has been @emph{padded} to make it the same size as the | ||
|  | complex output.  This is much like the in-place serial r2c/c2r | ||
|  | interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that | ||
|  | in MPI the padding is required even for out-of-place data.  The extra | ||
|  | padding numbers are ignored by FFTW (they are @emph{not} like | ||
|  | zero-padding the transform to a larger size); they are only used to | ||
|  | determine the data layout. | ||
|  | 
 | ||
|  | @item | ||
|  | @cindex data distribution | ||
|  | The data distribution in MPI for @emph{both} the real and complex data | ||
|  | is determined by the shape of the @emph{complex} data.  That is, you | ||
|  | call the appropriate @samp{local size} function for the @ndimshalf{} | ||
|  | complex data, and then use the @emph{same} distribution for the real | ||
|  | data except that the last complex dimension is replaced by a (padded) | ||
|  | real dimension of twice the length. | ||
|  | 
 | ||
|  | @end itemize | ||
|  | 
 | ||
For example, suppose we are performing an out-of-place r2c transform of
|  | @threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}], | ||
|  | resulting in @threedims{L,M,N/2+1} complex data.  Similar to the | ||
|  | example in @ref{2d MPI example}, we might do something like: | ||
|  | 
 | ||
|  | @example | ||
|  | #include <fftw3-mpi.h> | ||
|  | 
 | ||
|  | int main(int argc, char **argv) | ||
|  | @{ | ||
|  |     const ptrdiff_t L = ..., M = ..., N = ...; | ||
|  |     fftw_plan plan; | ||
|  |     double *rin; | ||
|  |     fftw_complex *cout; | ||
|  |     ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k; | ||
|  | 
 | ||
|  |     MPI_Init(&argc, &argv); | ||
|  |     fftw_mpi_init(); | ||
|  | 
 | ||
|  |     /* @r{get local data size and allocate} */ | ||
|  |     alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD, | ||
|  |                                          &local_n0, &local_0_start); | ||
|  |     rin = fftw_alloc_real(2 * alloc_local); | ||
|  |     cout = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  |     /* @r{create plan for out-of-place r2c DFT} */ | ||
|  |     plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD, | ||
|  |                                     FFTW_MEASURE); | ||
|  | 
 | ||
|  |     /* @r{initialize rin to some function} my_func(x,y,z) */ | ||
    for (i = 0; i < local_n0; ++i)
       for (j = 0; j < M; ++j)
         for (k = 0; k < N; ++k)
           rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
|  | 
 | ||
|  |     /* @r{compute transforms as many times as desired} */ | ||
|  |     fftw_execute(plan); | ||
|  | 
 | ||
|  |     fftw_destroy_plan(plan); | ||
|  | 
 | ||
|  |     MPI_Finalize(); | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||
|  | @findex fftw_alloc_real | ||
|  | @cindex row-major | ||
|  | Note that we allocated @code{rin} using @code{fftw_alloc_real} with an | ||
|  | argument of @code{2 * alloc_local}: since @code{alloc_local} is the | ||
|  | number of @emph{complex} values to allocate, the number of @emph{real} | ||
|  | values is twice as many.  The @code{rin} array is then | ||
|  | @threedims{local_n0,M,2(N/2+1)} in row-major order, so its | ||
|  | @code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) + | ||
k} (@pxref{Multi-dimensional Array Format}).
|  | 
 | ||
|  | @cindex transpose | ||
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
As for the complex transforms, improved performance can be obtained by
specifying that the output is the transpose of the input or vice versa
(@pxref{Transposed distributions}).  In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
the input would be a padded @threedims{L,M,2(N/2+1)} real array
distributed over the @code{L} dimension, while the output would be a
@threedims{M,L,N/2+1} complex array distributed over the @code{M}
dimension.  To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
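
Continuing this example, the forward and inverse plans with these flags
might be created as follows (a sketch reusing @code{rin}, @code{cout},
@code{L}, @code{M}, and @code{N} from the earlier listing):

@example
    fftw_plan plan, iplan;

    /* @r{forward r2c with transposed M x L x N/2+1 complex output} */
    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                    FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
    /* @r{inverse c2r taking that transposed layout as its input} */
    iplan = fftw_mpi_plan_dft_c2r_3d(L, M, N, cout, rin, MPI_COMM_WORLD,
                                     FFTW_MEASURE | FFTW_MPI_TRANSPOSED_IN);
@end example

With transposed layouts, the local sizes for allocation would be obtained
from @code{fftw_mpi_local_size_3d_transposed} on the
@threedims{L,M,N/2+1} complex dimensions, as described in
@ref{Transposed distributions}.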
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI | ||
@section Other Multi-dimensional Real-data MPI Transforms
|  | 
 | ||
|  | @cindex r2r | ||
|  | FFTW's MPI interface also supports multi-dimensional @samp{r2r} | ||
|  | transforms of all kinds supported by the serial interface | ||
|  | (e.g. discrete cosine and sine transforms, discrete Hartley | ||
|  | transforms, etc.).  Only multi-dimensional @samp{r2r} transforms, not | ||
|  | one-dimensional transforms, are currently parallelized. | ||
|  | 
 | ||
|  | @tindex fftw_r2r_kind | ||
|  | These are used much like the multidimensional complex DFTs discussed | ||
|  | above, except that the data is real rather than complex, and one needs | ||
|  | to pass an r2r transform kind (@code{fftw_r2r_kind}) for each | ||
|  | dimension as in the serial FFTW (@pxref{More DFTs of Real Data}). | ||
|  | 
 | ||
For example, one might perform a two-dimensional @twodims{L,M} r2r
transform that is an REDFT10 (DCT-II) in the first dimension and an
RODFT10 (DST-II) in the second dimension with code like:
|  | 
 | ||
|  | @example | ||
|  |     const ptrdiff_t L = ..., M = ...; | ||
|  |     fftw_plan plan; | ||
|  |     double *data; | ||
|  |     ptrdiff_t alloc_local, local_n0, local_0_start, i, j; | ||
|  | 
 | ||
|  |     /* @r{get local data size and allocate} */ | ||
|  |     alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD, | ||
|  |                                          &local_n0, &local_0_start); | ||
|  |     data = fftw_alloc_real(alloc_local); | ||
|  | 
 | ||
|  |     /* @r{create plan for in-place REDFT10 x RODFT10} */ | ||
|  |     plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD, | ||
|  |                                 FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE); | ||
|  | 
 | ||
|  |     /* @r{initialize data to some function} my_function(x,y) */ | ||
|  |     for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j) | ||
|  |        data[i*M + j] = my_function(local_0_start + i, j); | ||
|  | 
 | ||
|  |     /* @r{compute transforms, in-place, as many times as desired} */ | ||
|  |     fftw_execute(plan); | ||
|  | 
 | ||
|  |     fftw_destroy_plan(plan); | ||
|  | @end example | ||
|  | 
 | ||
|  | @findex fftw_alloc_real | ||
|  | Notice that we use the same @samp{local_size} functions as we did for | ||
|  | complex data, only now we interpret the sizes in terms of real rather | ||
|  | than complex values, and correspondingly use @code{fftw_alloc_real}. | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Transposes | ||
|  | @cindex transpose | ||
|  | 
 | ||
FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
|  | multidimensional transforms work by transforming along some | ||
|  | dimensions, then transposing to make the first dimension local and | ||
|  | transforming that, then transposing back.  Because global | ||
|  | transposition of a block-distributed matrix has many other potential | ||
|  | uses besides FFTs, FFTW's transpose routines can be called directly, | ||
|  | as documented in this section.  | ||
|  | 
 | ||
|  | @menu | ||
|  | * Basic distributed-transpose interface:: | ||
|  | * Advanced distributed-transpose interface:: | ||
|  | * An improved replacement for MPI_Alltoall:: | ||
|  | @end menu | ||
|  | 
 | ||
|  | @node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes | ||
|  | @subsection Basic distributed-transpose interface | ||
|  | 
 | ||
|  | In particular, suppose that we have an @code{n0} by @code{n1} array in | ||
|  | row-major order, block-distributed across the @code{n0} dimension.  To | ||
|  | transpose this into an @code{n1} by @code{n0} array block-distributed | ||
|  | across the @code{n1} dimension, we would create a plan by calling the | ||
|  | following function: | ||
|  | 
 | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | ||
|  |                                   double *in, double *out, | ||
|  |                                   MPI_Comm comm, unsigned flags); | ||
|  | @end example | ||
|  | @findex fftw_mpi_plan_transpose | ||
|  | 
 | ||
|  | The input and output arrays (@code{in} and @code{out}) can be the | ||
|  | same.  The transpose is actually executed by calling | ||
|  | @code{fftw_execute} on the plan, as usual. | ||
|  | @findex fftw_execute | ||
|  | 
 | ||
|  | 
 | ||
|  | The @code{flags} are the usual FFTW planner flags, but support | ||
|  | two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or | ||
|  | @code{FFTW_MPI_TRANSPOSED_IN}.  What these flags indicate, for | ||
|  | transpose plans, is that the output and/or input, respectively, are | ||
|  | @emph{locally} transposed.  That is, on each process input data is | ||
|  | normally stored as a @code{local_n0} by @code{n1} array in row-major | ||
|  | order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is | ||
|  | stored as @code{n1} by @code{local_n0} in row-major order.  Similarly, | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by | ||
|  | @code{local_n1} instead of @code{local_n1} by @code{n0}. | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_OUT | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_IN | ||
|  | 
 | ||
|  | 
 | ||
|  | To determine the local size of the array on each process before and | ||
|  | after the transpose, as well as the amount of storage that must be | ||
|  | allocated, one should call @code{fftw_mpi_local_size_2d_transposed}, | ||
|  | just as for a 2d DFT as described in the previous section: | ||
|  | @cindex data distribution | ||
|  | 
 | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_2d_transposed | ||
|  |                 (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | ||
|  |                  ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                  ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | @end example | ||
|  | @findex fftw_mpi_local_size_2d_transposed | ||
|  | 
 | ||
|  | Again, the return value is the local storage to allocate, which in | ||
|  | this case is the number of @emph{real} (@code{double}) values rather | ||
|  | than complex numbers as in the previous examples. | ||
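
Putting these pieces together, a sketch of an in-place transpose of an
@code{n0} by @code{n1} array of @code{double} values (with @code{n0} and
@code{n1} assumed to be declared elsewhere) might read:

@example
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start, alloc_local;
    double *a;
    fftw_plan tplan;

    alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                    &local_n0, &local_0_start,
                                                    &local_n1, &local_1_start);
    a = fftw_alloc_real(alloc_local); /* @r{count of doubles, per above} */

    /* @r{... fill the local_n0 x n1 block starting at row local_0_start ...} */

    tplan = fftw_mpi_plan_transpose(n0, n1, a, a, MPI_COMM_WORLD, FFTW_MEASURE);
    fftw_execute(tplan); /* @r{a now holds a local_n1 x n0 block of the transpose} */
    fftw_destroy_plan(tplan);
@end example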
|  | 
 | ||
|  | @node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes | ||
|  | @subsection Advanced distributed-transpose interface | ||
|  | 
 | ||
|  | The above routines are for a transpose of a matrix of numbers (of type | ||
|  | @code{double}), using FFTW's default block sizes.  More generally, one | ||
|  | can perform transposes of @emph{tuples} of numbers, with | ||
|  | user-specified block sizes for the input and output: | ||
|  | 
 | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_many_transpose | ||
|  |                 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | ||
|  |                  ptrdiff_t block0, ptrdiff_t block1, | ||
|  |                  double *in, double *out, MPI_Comm comm, unsigned flags); | ||
|  | @end example | ||
|  | @findex fftw_mpi_plan_many_transpose | ||
|  | 
 | ||
|  | In this case, one is transposing an @code{n0} by @code{n1} matrix of | ||
|  | @code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers). | ||
|  | The input is distributed along the @code{n0} dimension with block size | ||
|  | @code{block0}, and the @code{n1} by @code{n0} output is distributed | ||
|  | along the @code{n1} dimension with block size @code{block1}.  If | ||
|  | @code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW | ||
|  | uses its default block size.  To get the local size of the data on | ||
|  | each process, you should then call @code{fftw_mpi_local_size_many_transposed}. | ||
|  | @ctindex FFTW_MPI_DEFAULT_BLOCK | ||
|  | @findex fftw_mpi_local_size_many_transposed | ||
|  | 
 | ||
|  | @node An improved replacement for MPI_Alltoall,  , Advanced distributed-transpose interface, FFTW MPI Transposes | ||
|  | @subsection An improved replacement for MPI_Alltoall | ||
|  | 
 | ||
|  | We close this section by noting that FFTW's MPI transpose routines can | ||
be thought of as a generalization of the @code{MPI_Alltoall} function
|  | (albeit only for floating-point types), and in some circumstances can | ||
|  | function as an improved replacement. | ||
|  | @findex MPI_Alltoall | ||
|  | 
 | ||
|  | 
 | ||
|  | @code{MPI_Alltoall} is defined by the MPI standard as: | ||
|  | 
 | ||
|  | @example | ||
|  | int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,  | ||
|  |                  void *recvbuf, int recvcnt, MPI_Datatype recvtype,  | ||
|  |                  MPI_Comm comm); | ||
|  | @end example | ||
|  | 
 | ||
|  | In particular, for @code{double*} arrays @code{in} and @code{out}, | ||
|  | consider the call: | ||
|  | 
 | ||
|  | @example | ||
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
|  | @end example | ||
|  | 
 | ||
|  | This is completely equivalent to: | ||
|  | 
 | ||
|  | @example | ||
|  | MPI_Comm_size(comm, &P); | ||
|  | plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE); | ||
|  | fftw_execute(plan); | ||
|  | fftw_destroy_plan(plan); | ||
|  | @end example | ||
|  | 
 | ||
|  | That is, computing a @twodims{P,P} transpose on @code{P} processes, | ||
|  | with a block size of 1, is just a standard all-to-all communication. | ||
|  | 
 | ||
|  | However, using the FFTW routine instead of @code{MPI_Alltoall} may | ||
|  | have certain advantages.  First of all, FFTW's routine can operate | ||
|  | in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only | ||
|  | operate out-of-place. | ||
|  | @cindex in-place | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, even for out-of-place plans, FFTW's routine may be faster, | ||
|  | especially if you need to perform the all-to-all communication many | ||
|  | times and can afford to use @code{FFTW_MEASURE} or | ||
|  | @code{FFTW_PATIENT}.  It should certainly be no slower, not including | ||
|  | the time to create the plan, since one of the possible algorithms that | ||
|  | FFTW uses for an out-of-place transpose @emph{is} simply to call | ||
|  | @code{MPI_Alltoall}.  However, FFTW also considers several other | ||
|  | possible algorithms that, depending on your MPI implementation and | ||
|  | your hardware, may be faster. | ||
|  | @ctindex FFTW_MEASURE | ||
|  | @ctindex FFTW_PATIENT | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Wisdom | ||
|  | @cindex wisdom | ||
|  | @cindex saving plans to disk | ||
|  | 
 | ||
|  | FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can | ||
|  | be used to save MPI plans as well as to save uniprocessor plans. | ||
|  | However, for MPI there are several unavoidable complications. | ||
|  | 
 | ||
|  | @cindex MPI I/O | ||
|  | First, the MPI standard does not guarantee that every process can | ||
|  | perform file I/O (at least, not using C stdio routines)---in general, | ||
|  | we may only assume that process 0 is capable of I/O.@footnote{In fact, | ||
|  | even this assumption is not technically guaranteed by the standard, | ||
|  | although it seems to be universal in actual MPI implementations and is | ||
|  | widely assumed by MPI-using software.  Technically, you need to query | ||
|  | the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with | ||
|  | @code{MPI_Attr_get}.  If this attribute is @code{MPI_PROC_NULL}, no | ||
|  | I/O is possible.  If it is @code{MPI_ANY_SOURCE}, any process can | ||
|  | perform I/O.  Otherwise, it is the rank of a process that can perform | ||
|  | I/O ... but since it is not guaranteed to yield the @emph{same} rank | ||
|  | on all processes, you have to do an @code{MPI_Allreduce} of some kind | ||
|  | if you want all processes to agree about which is going to do I/O. | ||
|  | And even then, the standard only guarantees that this process can | ||
|  | perform output, but not input. See e.g. @cite{Parallel Programming | ||
|  | with MPI} by P. S. Pacheco, section 8.1.3.  Needless to say, in our | ||
|  | experience virtually no MPI programmers worry about this.} So, if we | ||
|  | want to export the wisdom from a single process to a file, we must | ||
|  | first export the wisdom to a string, then send it to process 0, then | ||
|  | write it to a file. | ||
|  | 
 | ||
|  | Second, in principle we may want to have separate wisdom for every | ||
|  | process, since in general the processes may run on different hardware | ||
|  | even for a single MPI program.  However, in practice FFTW's MPI code | ||
|  | is designed for the case of homogeneous hardware (@pxref{Load | ||
|  | balancing}), and in this case it is convenient to use the same wisdom | ||
|  | for every process.  Thus, we need a mechanism to synchronize the wisdom. | ||
|  | 
 | ||
|  | To address both of these problems, FFTW provides the following two | ||
|  | functions: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | ||
|  | void fftw_mpi_gather_wisdom(MPI_Comm comm); | ||
|  | @end example | ||
|  | @findex fftw_mpi_gather_wisdom | ||
|  | @findex fftw_mpi_broadcast_wisdom | ||
|  | 
 | ||
|  | Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom} | ||
|  | will broadcast the wisdom from process 0 to all other processes. | ||
|  | Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all | ||
|  | processes onto process 0.  (If the plans created for the same problem | ||
|  | by different processes are not the same, @code{fftw_mpi_gather_wisdom} | ||
|  | will arbitrarily choose one of the plans.)  Both of these functions | ||
|  | may result in suboptimal plans for different processes if the | ||
|  | processes are running on non-identical hardware.  Both of these | ||
|  | functions are @emph{collective} calls, which means that they must be | ||
|  | executed by all processes in the communicator. | ||
|  | @cindex collective function | ||
|  | 
 | ||
|  | 
 | ||
|  | So, for example, a typical code snippet to import wisdom from a file | ||
|  | and use it on all processes would be: | ||
|  | 
 | ||
|  | @example | ||
|  | @{ | ||
|  |     int rank; | ||
|  | 
 | ||
|  |     fftw_mpi_init(); | ||
|  |     MPI_Comm_rank(MPI_COMM_WORLD, &rank); | ||
|  |     if (rank == 0) fftw_import_wisdom_from_filename("mywisdom"); | ||
|  |     fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD); | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||
|  | (Note that we must call @code{fftw_mpi_init} before importing any | ||
|  | wisdom that might contain MPI plans.)  Similarly, a typical code | ||
|  | snippet to export wisdom from all processes to a file is: | ||
|  | @findex fftw_mpi_init | ||
|  | 
 | ||
|  | @example | ||
|  | @{ | ||
|  |     int rank; | ||
|  | 
 | ||
|  |     fftw_mpi_gather_wisdom(MPI_COMM_WORLD); | ||
|  |     MPI_Comm_rank(MPI_COMM_WORLD, &rank); | ||
|  |     if (rank == 0) fftw_export_wisdom_to_filename("mywisdom"); | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI | ||
|  | @section Avoiding MPI Deadlocks | ||
|  | @cindex deadlock | ||
|  | 
 | ||
|  | An MPI program can @emph{deadlock} if one process is waiting for a | ||
|  | message from another process that never gets sent.  To avoid deadlocks | ||
|  | when using FFTW's MPI routines, it is important to know which | ||
|  | functions are @emph{collective}: that is, which functions must | ||
|  | @emph{always} be called in the @emph{same order} from @emph{every} | ||
|  | process in a given communicator.  (For example, @code{MPI_Barrier} is | ||
|  | the canonical example of a collective function in the MPI standard.) | ||
|  | @cindex collective function | ||
|  | @findex MPI_Barrier | ||
|  | 
 | ||
|  | 
 | ||
|  | The functions in FFTW that are @emph{always} collective are: every | ||
|  | function beginning with @samp{fftw_mpi_plan}, as well as | ||
|  | @code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}. | ||
|  | Also, the following functions from the ordinary FFTW interface are | ||
|  | collective when they are applied to a plan created by an | ||
|  | @samp{fftw_mpi_plan} function: @code{fftw_execute}, | ||
|  | @code{fftw_destroy_plan}, and @code{fftw_flops}. | ||
|  | @findex fftw_execute | ||
|  | @findex fftw_destroy_plan | ||
|  | @findex fftw_flops | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Performance Tips, Combining MPI and Threads, Avoiding MPI Deadlocks, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Performance Tips | ||
|  | 
 | ||
|  | In this section, we collect a few tips on getting the best performance | ||
|  | out of FFTW's MPI transforms. | ||
|  | 
 | ||
|  | First, because of the 1d block distribution, FFTW's parallelization is | ||
|  | currently limited by the size of the first dimension. | ||
|  | (Multidimensional block distributions may be supported by a future | ||
|  | version.) More generally, you should ideally arrange the dimensions so | ||
|  | that FFTW can divide them equally among the processes. @xref{Load | ||
|  | balancing}. | ||
|  | @cindex block distribution | ||
|  | @cindex load balancing | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, if it is not too inconvenient, you should consider working | ||
|  | with transposed output for multidimensional plans, as this saves a | ||
|  | considerable amount of communications.  @xref{Transposed distributions}. | ||
|  | @cindex transpose | ||
|  | 
 | ||
|  | 
 | ||
|  | Third, the fastest choices are generally either an in-place transform | ||
|  | or an out-of-place transform with the @code{FFTW_DESTROY_INPUT} flag | ||
|  | (which allows the input array to be used as scratch space).  In-place | ||
|  | is especially beneficial if the amount of data per process is large. | ||
|  | @ctindex FFTW_DESTROY_INPUT | ||
|  | 
 | ||
|  | 
 | ||
|  | Fourth, if you have multiple arrays to transform at once, rather than | ||
|  | calling FFTW's MPI transforms several times, it is usually faster to | ||
|  | interleave the data and use the advanced interface.  (This groups the | ||
|  | communications together instead of requiring separate messages for | ||
|  | each transform.) | ||
|  | 
 | ||
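|  | For example, here is a sketch (with hypothetical sizes and variable | ||
|  | names) of transforming two arrays with a single plan, by storing them | ||
|  | as interleaved pairs and passing @code{howmany = 2} to the advanced | ||
|  | interface described in the reference section below: | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t n[2] = @{1024, 1024@};   /* hypothetical dimensions */ | ||
|  | ptrdiff_t alloc_local, local_n0, local_0_start; | ||
|  | fftw_complex *data; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | /* element i of array k is stored at data[2*i + k] */ | ||
|  | alloc_local = fftw_mpi_local_size_many(2, n, 2, FFTW_MPI_DEFAULT_BLOCK, | ||
|  |                                        MPI_COMM_WORLD, | ||
|  |                                        &local_n0, &local_0_start); | ||
|  | data = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  | /* one plan transforms both arrays, sharing the communication steps */ | ||
|  | plan = fftw_mpi_plan_many_dft(2, n, 2, | ||
|  |                               FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK, | ||
|  |                               data, data, MPI_COMM_WORLD, | ||
|  |                               FFTW_FORWARD, FFTW_MEASURE); | ||
|  | @end example | ||
|  | 
 | ||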
|  | @c ------------------------------------------------------------ | ||
|  | @node Combining MPI and Threads, FFTW MPI Reference, FFTW MPI Performance Tips, Distributed-memory FFTW with MPI | ||
|  | @section Combining MPI and Threads | ||
|  | @cindex threads | ||
|  | 
 | ||
|  | In certain cases, it may be advantageous to combine MPI | ||
|  | (distributed-memory) and threads (shared-memory) parallelization. | ||
|  | FFTW supports this, with certain caveats.  For example, if you have a | ||
|  | cluster of 4-processor shared-memory nodes, you may want to use | ||
|  | threads within the nodes and MPI between the nodes, instead of MPI for | ||
|  | all parallelization. | ||
|  | 
 | ||
|  | In particular, it is possible to seamlessly combine the MPI FFTW | ||
|  | routines with the multi-threaded FFTW routines (@pxref{Multi-threaded | ||
|  | FFTW}). However, some care must be taken in the initialization code, | ||
|  | which should look something like this: | ||
|  | 
 | ||
|  | @example | ||
|  | int threads_ok; | ||
|  | 
 | ||
|  | int main(int argc, char **argv) | ||
|  | @{ | ||
|  |     int provided; | ||
|  |     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); | ||
|  |     threads_ok = provided >= MPI_THREAD_FUNNELED; | ||
|  | 
 | ||
|  |     if (threads_ok) threads_ok = fftw_init_threads(); | ||
|  |     fftw_mpi_init(); | ||
|  | 
 | ||
|  |     ... | ||
|  |     if (threads_ok) fftw_plan_with_nthreads(...); | ||
|  |     ... | ||
|  |      | ||
|  |     MPI_Finalize(); | ||
|  | @} | ||
|  | @end example | ||
|  | @findex fftw_mpi_init | ||
|  | @findex fftw_init_threads | ||
|  | @findex fftw_plan_with_nthreads | ||
|  | 
 | ||
|  | First, note that instead of calling @code{MPI_Init}, you should call | ||
|  | @code{MPI_Init_thread}, which is the initialization routine defined | ||
|  | by the MPI-2 standard to indicate to MPI that your program will be | ||
|  | multithreaded.  We pass @code{MPI_THREAD_FUNNELED}, which indicates | ||
|  | that we will only call MPI routines from the main thread.  (FFTW will | ||
|  | launch additional threads internally, but the extra threads will not | ||
|  | call MPI code.)  (You may also pass @code{MPI_THREAD_SERIALIZED} or | ||
|  | @code{MPI_THREAD_MULTIPLE}, which requests additional multithreading | ||
|  | support from the MPI implementation, but this is not required by | ||
|  | FFTW.)  The @code{provided} parameter returns the level of thread | ||
|  | support actually provided by your MPI implementation; this | ||
|  | @emph{must} be at least @code{MPI_THREAD_FUNNELED} if you want to call | ||
|  | the FFTW threads routines, so we define a global variable | ||
|  | @code{threads_ok} to record this.  You should only call | ||
|  | @code{fftw_init_threads} or @code{fftw_plan_with_nthreads} if | ||
|  | @code{threads_ok} is true.  For more information on thread safety in | ||
|  | MPI, see the | ||
|  | @uref{http://www.mpi-forum.org/docs/mpi-20-html/node162.htm, MPI and | ||
|  | Threads} section of the MPI-2 standard. | ||
|  | @cindex thread safety | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, we must call @code{fftw_init_threads} @emph{before} | ||
|  | @code{fftw_mpi_init}.  This is critical for technical reasons having | ||
|  | to do with how FFTW initializes its list of algorithms. | ||
|  | 
 | ||
|  | Then, if you call @code{fftw_plan_with_nthreads(N)}, @emph{every} MPI | ||
|  | process will launch (up to) @code{N} threads to parallelize its transforms. | ||
|  | 
 | ||
|  | For example, in the hypothetical cluster of 4-processor nodes, you | ||
|  | might wish to launch only a single MPI process per node, and then call | ||
|  | @code{fftw_plan_with_nthreads(4)} on each process to use all | ||
|  | processors in the nodes. | ||
|  | 
 | ||
|  | This may or may not be faster than simply using as many MPI processes | ||
|  | as you have processors, however.  On the one hand, using threads | ||
|  | within a node eliminates the need for explicit message passing within | ||
|  | the node.  On the other hand, FFTW's transpose routines are not | ||
|  | multi-threaded, and this means that the communications that do take | ||
|  | place will not benefit from parallelization within the node. | ||
|  | Moreover, many MPI implementations already have optimizations to | ||
|  | exploit shared memory when it is available, so adding the | ||
|  | multithreaded FFTW on top of this may be superfluous. | ||
|  | @cindex transpose | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Reference, FFTW MPI Fortran Interface, Combining MPI and Threads, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Reference | ||
|  | 
 | ||
|  | This chapter provides a complete reference to all FFTW MPI functions, | ||
|  | datatypes, and constants.  See also @ref{FFTW Reference} for information | ||
|  | on functions and types in common with the serial interface. | ||
|  | 
 | ||
|  | @menu | ||
|  | * MPI Files and Data Types:: | ||
|  | * MPI Initialization:: | ||
|  | * Using MPI Plans:: | ||
|  | * MPI Data Distribution Functions:: | ||
|  | * MPI Plan Creation:: | ||
|  | * MPI Wisdom Communication:: | ||
|  | @end menu | ||
|  | 
 | ||
|  | @node MPI Files and Data Types, MPI Initialization, FFTW MPI Reference, FFTW MPI Reference | ||
|  | @subsection MPI Files and Data Types | ||
|  | 
 | ||
|  | All programs using FFTW's MPI support should include its header file: | ||
|  | 
 | ||
|  | @example | ||
|  | #include <fftw3-mpi.h> | ||
|  | @end example | ||
|  | 
 | ||
|  | Note that this header file includes the serial-FFTW @code{fftw3.h} | ||
|  | header file, and also the @code{mpi.h} header file for MPI, so you | ||
|  | need not include those files separately. | ||
|  | 
 | ||
|  | You must also link to @emph{both} the FFTW MPI library and to the | ||
|  | serial FFTW library.  On Unix, this means adding @code{-lfftw3_mpi | ||
|  | -lfftw3 -lm} at the end of the link command. | ||
|  | 
 | ||
|  | @cindex precision | ||
|  | Different precisions are handled as in the serial interface: | ||
|  | @xref{Precision}.  That is, @samp{fftw_} functions become | ||
|  | @code{fftwf_} (in single precision) etcetera, and the libraries become | ||
|  | @code{-lfftw3f_mpi -lfftw3f -lm} etcetera on Unix.  Long-double | ||
|  | precision is supported in MPI, but quad precision (@samp{fftwq_}) is | ||
|  | not, due to the lack of MPI support for this type. | ||
|  | 
 | ||
|  | @node MPI Initialization, Using MPI Plans, MPI Files and Data Types, FFTW MPI Reference | ||
|  | @subsection MPI Initialization | ||
|  | 
 | ||
|  | Before calling any other FFTW MPI (@samp{fftw_mpi_}) function, and | ||
|  | before importing any wisdom for MPI problems, you must call: | ||
|  | 
 | ||
|  | @findex fftw_mpi_init | ||
|  | @example | ||
|  | void fftw_mpi_init(void); | ||
|  | @end example | ||
|  | 
 | ||
|  | @findex fftw_init_threads | ||
|  | If FFTW threads support is used, however, @code{fftw_mpi_init} should | ||
|  | be called @emph{after} @code{fftw_init_threads} (@pxref{Combining MPI | ||
|  | and Threads}).  Calling @code{fftw_mpi_init} additional times (before | ||
|  | @code{fftw_mpi_cleanup}) has no effect. | ||
|  | 
 | ||
|  | 
 | ||
|  | If you want to deallocate all persistent data and reset FFTW to the | ||
|  | pristine state it was in when you started your program, you can call: | ||
|  | 
 | ||
|  | @findex fftw_mpi_cleanup | ||
|  | @example | ||
|  | void fftw_mpi_cleanup(void); | ||
|  | @end example | ||
|  | 
 | ||
|  | @findex fftw_cleanup | ||
|  | (This calls @code{fftw_cleanup}, so you need not call the serial | ||
|  | cleanup routine too, although it is safe to do so.)  After calling | ||
|  | @code{fftw_mpi_cleanup}, all existing plans become undefined, and you | ||
|  | should not attempt to execute or destroy them.  You must call | ||
|  | @code{fftw_mpi_init} again after @code{fftw_mpi_cleanup} if you want | ||
|  | to resume using the MPI FFTW routines. | ||
|  | 
 | ||
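|  | For reference, here is a minimal program skeleton (a sketch only) | ||
|  | illustrating this ordering: | ||
|  | 
 | ||
|  | @example | ||
|  | #include <fftw3-mpi.h> | ||
|  | 
 | ||
|  | int main(int argc, char **argv) | ||
|  | @{ | ||
|  |     MPI_Init(&argc, &argv); | ||
|  |     fftw_mpi_init(); | ||
|  | 
 | ||
|  |     /* ... create, execute, and destroy MPI plans ... */ | ||
|  | 
 | ||
|  |     fftw_mpi_cleanup();   /* optional: resets FFTW, invalidating plans */ | ||
|  |     MPI_Finalize(); | ||
|  |     return 0; | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||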
|  | @node Using MPI Plans, MPI Data Distribution Functions, MPI Initialization, FFTW MPI Reference | ||
|  | @subsection Using MPI Plans | ||
|  | 
 | ||
|  | Once an MPI plan is created, you can execute and destroy it using | ||
|  | @code{fftw_execute}, @code{fftw_destroy_plan}, and the other functions | ||
|  | in the serial interface that operate on generic plans (@pxref{Using | ||
|  | Plans}).   | ||
|  | 
 | ||
|  | @cindex collective function | ||
|  | @cindex MPI communicator | ||
|  | The @code{fftw_execute} and @code{fftw_destroy_plan} functions, applied to | ||
|  | MPI plans, are @emph{collective} calls: they must be called for all processes | ||
|  | in the communicator that was used to create the plan. | ||
|  | 
 | ||
|  | @cindex new-array execution | ||
|  | You must @emph{not} use the serial new-array plan-execution functions | ||
|  | @code{fftw_execute_dft} and so on (@pxref{New-array Execute | ||
|  | Functions}) with MPI plans.  Such functions are specialized to the | ||
|  | problem type, and there are specific new-array execute functions for MPI plans: | ||
|  | 
 | ||
|  | @findex fftw_mpi_execute_dft | ||
|  | @findex fftw_mpi_execute_dft_r2c | ||
|  | @findex fftw_mpi_execute_dft_c2r | ||
|  | @findex fftw_mpi_execute_r2r | ||
|  | @example | ||
|  | void fftw_mpi_execute_dft(fftw_plan p, fftw_complex *in, fftw_complex *out); | ||
|  | void fftw_mpi_execute_dft_r2c(fftw_plan p, double *in, fftw_complex *out); | ||
|  | void fftw_mpi_execute_dft_c2r(fftw_plan p, fftw_complex *in, double *out); | ||
|  | void fftw_mpi_execute_r2r(fftw_plan p, double *in, double *out); | ||
|  | @end example | ||
|  | 
 | ||
|  | @cindex alignment | ||
|  | @findex fftw_malloc | ||
|  | These functions have the same restrictions as those of the serial | ||
|  | new-array execute functions.  They are @emph{always} safe to apply to | ||
|  | the @emph{same} @code{in} and @code{out} arrays that were used to | ||
|  | create the plan.  They can only be applied to new arrays if those | ||
|  | arrays have the same types, dimensions, in-placeness, and alignment as | ||
|  | the original arrays; the best way to ensure the same alignment | ||
|  | is to use FFTW's @code{fftw_malloc} and related allocation functions | ||
|  | for all arrays (@pxref{Memory Allocation}).  Note that distributed | ||
|  | transposes (@pxref{FFTW MPI Transposes}) use | ||
|  | @code{fftw_mpi_execute_r2r}, since they count as rank-zero r2r plans | ||
|  | from FFTW's perspective. | ||
|  | 
 | ||
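|  | For example, assuming @code{plan} was created for an in-place complex | ||
|  | DFT and @code{alloc_local} is the value returned by the corresponding | ||
|  | @samp{local_size} call (both names are hypothetical here), a second | ||
|  | array of the same size and alignment can be transformed with the same | ||
|  | plan: | ||
|  | 
 | ||
|  | @example | ||
|  | fftw_complex *data2 = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  | /* ... fill data2 with new input ... */ | ||
|  | 
 | ||
|  | /* same size, same (in-place) layout, fftw_malloc alignment: | ||
|  |    safe to execute the existing plan on the new array */ | ||
|  | fftw_mpi_execute_dft(plan, data2, data2); | ||
|  | @end example | ||
|  | 
 | ||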
|  | @node MPI Data Distribution Functions, MPI Plan Creation, Using MPI Plans, FFTW MPI Reference | ||
|  | @subsection MPI Data Distribution Functions | ||
|  | 
 | ||
|  | @cindex data distribution | ||
|  | As described above (@pxref{MPI Data Distribution}), in order to | ||
|  | allocate your arrays, @emph{before} creating a plan, you must first | ||
|  | call one of the following routines to determine the required | ||
|  | allocation size and the portion of the array locally stored on a given | ||
|  | process.  The @code{MPI_Comm} communicator passed here must be | ||
|  | equivalent to the communicator used below for plan creation. | ||
|  | 
 | ||
|  | The basic interface for multidimensional transforms consists of the | ||
|  | functions: | ||
|  | 
 | ||
|  | @findex fftw_mpi_local_size_2d | ||
|  | @findex fftw_mpi_local_size_3d | ||
|  | @findex fftw_mpi_local_size | ||
|  | @findex fftw_mpi_local_size_2d_transposed | ||
|  | @findex fftw_mpi_local_size_3d_transposed | ||
|  | @findex fftw_mpi_local_size_transposed | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | ||
|  |                                  ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | ||
|  | ptrdiff_t fftw_mpi_local_size_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                  MPI_Comm comm, | ||
|  |                                  ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | ||
|  | ptrdiff_t fftw_mpi_local_size(int rnk, const ptrdiff_t *n, MPI_Comm comm, | ||
|  |                               ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | ||
|  | 
 | ||
|  | ptrdiff_t fftw_mpi_local_size_2d_transposed(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm, | ||
|  |                                             ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                                             ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | ptrdiff_t fftw_mpi_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                             MPI_Comm comm, | ||
|  |                                             ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                                             ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | ptrdiff_t fftw_mpi_local_size_transposed(int rnk, const ptrdiff_t *n, MPI_Comm comm, | ||
|  |                                          ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                                          ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | @end example | ||
|  | 
 | ||
|  | These functions return the number of elements to allocate (complex | ||
|  | numbers for DFT/r2c/c2r plans, real numbers for r2r plans), whereas | ||
|  | the @code{local_n0} and @code{local_0_start} outputs give the portion | ||
|  | (@code{local_0_start} to @code{local_0_start + local_n0 - 1}) of the | ||
|  | first dimension of an @ndims{} array that is stored on the local | ||
|  | process.  @xref{Basic and advanced distribution interfaces}.  For | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT} plans, the @samp{_transposed} variants | ||
|  | are useful in order to also return the local portion of the first | ||
|  | dimension in the @ndimstrans{} transposed output.   | ||
|  | @xref{Transposed distributions}.   | ||
|  | The advanced interface for multidimensional transforms is: | ||
|  | 
 | ||
|  | @cindex advanced interface | ||
|  | @findex fftw_mpi_local_size_many | ||
|  | @findex fftw_mpi_local_size_many_transposed | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | ||
|  |                                    ptrdiff_t block0, MPI_Comm comm, | ||
|  |                                    ptrdiff_t *local_n0, ptrdiff_t *local_0_start); | ||
|  | ptrdiff_t fftw_mpi_local_size_many_transposed(int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | ||
|  |                                               ptrdiff_t block0, ptrdiff_t block1, MPI_Comm comm, | ||
|  |                                               ptrdiff_t *local_n0, ptrdiff_t *local_0_start, | ||
|  |                                               ptrdiff_t *local_n1, ptrdiff_t *local_1_start); | ||
|  | @end example | ||
|  | 
 | ||
|  | These differ from the basic interface in only two ways.  First, they | ||
|  | allow you to specify block sizes @code{block0} and @code{block1} (the | ||
|  | latter for the transposed output); you can pass | ||
|  | @code{FFTW_MPI_DEFAULT_BLOCK} to use FFTW's default block size as in | ||
|  | the basic interface.  Second, you can pass a @code{howmany} parameter, | ||
|  | corresponding to the advanced planning interface below: this is for | ||
|  | transforms of contiguous @code{howmany}-tuples of numbers | ||
|  | (@code{howmany = 1} in the basic interface). | ||
|  | 
 | ||
|  | The corresponding basic and advanced routines for one-dimensional | ||
|  | transforms (currently only complex DFTs) are: | ||
|  | 
 | ||
|  | @findex fftw_mpi_local_size_1d | ||
|  | @findex fftw_mpi_local_size_many_1d | ||
|  | @example | ||
|  | ptrdiff_t fftw_mpi_local_size_1d( | ||
|  |              ptrdiff_t n0, MPI_Comm comm, int sign, unsigned flags, | ||
|  |              ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | ||
|  |              ptrdiff_t *local_no, ptrdiff_t *local_o_start); | ||
|  | ptrdiff_t fftw_mpi_local_size_many_1d( | ||
|  |              ptrdiff_t n0, ptrdiff_t howmany, | ||
|  |              MPI_Comm comm, int sign, unsigned flags, | ||
|  |              ptrdiff_t *local_ni, ptrdiff_t *local_i_start, | ||
|  |              ptrdiff_t *local_no, ptrdiff_t *local_o_start); | ||
|  | @end example | ||
|  | 
 | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_OUT | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_IN | ||
|  | As above, the return value is the number of elements to allocate | ||
|  | (complex numbers, for complex DFTs).  The @code{local_ni} and | ||
|  | @code{local_i_start} arguments return the portion | ||
|  | (@code{local_i_start} to @code{local_i_start + local_ni - 1}) of the | ||
|  | 1d array that is stored on this process for the transform | ||
|  | @emph{input}, and @code{local_no} and @code{local_o_start} are the | ||
|  | corresponding quantities for the @emph{output}.  The @code{sign} | ||
|  | (@code{FFTW_FORWARD} or @code{FFTW_BACKWARD}) and @code{flags} must | ||
|  | match the arguments passed when creating a plan.  Although the inputs | ||
|  | and outputs have different data distributions in general, it is | ||
|  | guaranteed that the @emph{output} data distribution of an | ||
|  | @code{FFTW_FORWARD} plan will match the @emph{input} data distribution | ||
|  | of an @code{FFTW_BACKWARD} plan and vice versa; similarly for the | ||
|  | @code{FFTW_MPI_SCRAMBLED_OUT} and @code{FFTW_MPI_SCRAMBLED_IN} flags. | ||
|  | @xref{One-dimensional distributions}. | ||
|  | 
 | ||
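|  | For example, here is a sketch of setting up a one-dimensional | ||
|  | transform (the length @code{N} is hypothetical; recall that only | ||
|  | composite sizes are currently supported): | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t N = 1048576;   /* hypothetical (composite) length */ | ||
|  | ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start; | ||
|  | fftw_complex *data; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | /* sign and flags must match those passed to the planner below */ | ||
|  | alloc_local = fftw_mpi_local_size_1d(N, MPI_COMM_WORLD, | ||
|  |                                      FFTW_FORWARD, FFTW_MEASURE, | ||
|  |                                      &local_ni, &local_i_start, | ||
|  |                                      &local_no, &local_o_start); | ||
|  | data = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  | plan = fftw_mpi_plan_dft_1d(N, data, data, MPI_COMM_WORLD, | ||
|  |                             FFTW_FORWARD, FFTW_MEASURE); | ||
|  | 
 | ||
|  | /* element j of the local input is global index local_i_start + j */ | ||
|  | /* ... initialize data[0 ... local_ni-1] ... */ | ||
|  | 
 | ||
|  | fftw_execute(plan); | ||
|  | @end example | ||
|  | 
 | ||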
|  | @node MPI Plan Creation, MPI Wisdom Communication, MPI Data Distribution Functions, FFTW MPI Reference | ||
|  | @subsection MPI Plan Creation | ||
|  | 
 | ||
|  | @subsubheading Complex-data MPI DFTs | ||
|  | 
 | ||
|  | Plans for complex-data DFTs (@pxref{2d MPI example}) are created by: | ||
|  | 
 | ||
|  | @findex fftw_mpi_plan_dft_1d | ||
|  | @findex fftw_mpi_plan_dft_2d | ||
|  | @findex fftw_mpi_plan_dft_3d | ||
|  | @findex fftw_mpi_plan_dft | ||
|  | @findex fftw_mpi_plan_many_dft | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_dft_1d(ptrdiff_t n0, fftw_complex *in, fftw_complex *out, | ||
|  |                                MPI_Comm comm, int sign, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_2d(ptrdiff_t n0, ptrdiff_t n1, | ||
|  |                                fftw_complex *in, fftw_complex *out, | ||
|  |                                MPI_Comm comm, int sign, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                fftw_complex *in, fftw_complex *out, | ||
|  |                                MPI_Comm comm, int sign, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft(int rnk, const ptrdiff_t *n,  | ||
|  |                             fftw_complex *in, fftw_complex *out, | ||
|  |                             MPI_Comm comm, int sign, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_many_dft(int rnk, const ptrdiff_t *n, | ||
|  |                                  ptrdiff_t howmany, ptrdiff_t block, ptrdiff_t tblock, | ||
|  |                                  fftw_complex *in, fftw_complex *out, | ||
|  |                                  MPI_Comm comm, int sign, unsigned flags); | ||
|  | @end example | ||
|  | 
 | ||
|  | @cindex MPI communicator | ||
|  | @cindex collective function | ||
|  | These are similar to their serial counterparts (@pxref{Complex DFTs}) | ||
|  | in specifying the dimensions, sign, and flags of the transform.  The | ||
|  | @code{comm} argument gives an MPI communicator that specifies the set | ||
|  | of processes to participate in the transform; plan creation is a | ||
|  | collective function that must be called for all processes in the | ||
|  | communicator.  The @code{in} and @code{out} pointers refer only to a | ||
|  | portion of the overall transform data (@pxref{MPI Data Distribution}) | ||
|  | as specified by the @samp{local_size} functions in the previous | ||
|  | section.  Unless @code{flags} contains @code{FFTW_ESTIMATE}, these | ||
|  | arrays are overwritten during plan creation as for the serial | ||
|  | interface.  For multi-dimensional transforms, any dimensions @code{> | ||
|  | 1} are supported; for one-dimensional transforms, only composite | ||
|  | (non-prime) @code{n0} are currently supported (unlike the serial | ||
|  | FFTW).  Requesting an unsupported transform size will yield a | ||
|  | @code{NULL} plan.  (As in the serial interface, highly composite sizes | ||
|  | generally yield the best performance.) | ||
|  | 
 | ||
|  | @cindex advanced interface | ||
|  | @ctindex FFTW_MPI_DEFAULT_BLOCK | ||
|  | @cindex stride | ||
|  | The advanced-interface @code{fftw_mpi_plan_many_dft} additionally | ||
|  | allows you to specify the block sizes for the first dimension | ||
|  | (@code{block}) of the @ndims{} input data and the first dimension | ||
|  | (@code{tblock}) of the @ndimstrans{} transposed data (at intermediate | ||
|  | steps of the transform, and for the output if | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT} is specified in @code{flags}).  These must | ||
|  | be the same block sizes as were passed to the corresponding | ||
|  | @samp{local_size} function; you can pass @code{FFTW_MPI_DEFAULT_BLOCK} | ||
|  | to use FFTW's default block size as in the basic interface.  Also, the | ||
|  | @code{howmany} parameter specifies that the transform is of contiguous | ||
|  | @code{howmany}-tuples rather than individual complex numbers; this | ||
|  | corresponds to the same parameter in the serial advanced interface | ||
|  | (@pxref{Advanced Complex DFTs}) with @code{stride = howmany} and | ||
|  | @code{dist = 1}. | ||
|  | 
 | ||
|  | @subsubheading MPI flags | ||
|  | 
 | ||
|  | The @code{flags} can be any of those for the serial FFTW | ||
|  | (@pxref{Planner Flags}), and in addition may include one or more of | ||
|  | the following MPI-specific flags, which improve performance at the | ||
|  | cost of changing the output or input data formats. | ||
|  | 
 | ||
|  | @itemize @bullet | ||
|  | 
 | ||
|  | @item | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_OUT | ||
|  | @ctindex FFTW_MPI_SCRAMBLED_IN | ||
|  | @code{FFTW_MPI_SCRAMBLED_OUT}, @code{FFTW_MPI_SCRAMBLED_IN}: valid for | ||
|  | 1d transforms only, these flags indicate that the output/input of the | ||
|  | transform are in an undocumented ``scrambled'' order.  A forward | ||
|  | @code{FFTW_MPI_SCRAMBLED_OUT} transform can be inverted by a backward | ||
|  | @code{FFTW_MPI_SCRAMBLED_IN} (times the usual 1/@i{N} normalization). | ||
|  | @xref{One-dimensional distributions}. | ||
|  | 
 | ||
|  | @item | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_OUT | ||
|  | @ctindex FFTW_MPI_TRANSPOSED_IN | ||
|  | @code{FFTW_MPI_TRANSPOSED_OUT}, @code{FFTW_MPI_TRANSPOSED_IN}: valid | ||
|  | for multidimensional (@code{rnk > 1}) transforms only, these flags | ||
|  | specify that the output or input of an @ndims{} transform is | ||
|  | transposed to @ndimstrans{}.  @xref{Transposed distributions}. | ||
|  | 
 | ||
|  | @end itemize | ||
|  | 
 | ||
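|  | For example, here is a sketch (with hypothetical sizes) of creating a | ||
|  | 3d transform with transposed output, pairing the @samp{_transposed} | ||
|  | local-size call with the @code{FFTW_MPI_TRANSPOSED_OUT} flag: | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t n0 = 256, n1 = 256, n2 = 256;   /* hypothetical sizes */ | ||
|  | ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start; | ||
|  | fftw_complex *data; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD, | ||
|  |                                                 &local_n0, &local_0_start, | ||
|  |                                                 &local_n1, &local_1_start); | ||
|  | data = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  | /* output is an n1 x n0 x n2 array, distributed over its n1 dimension */ | ||
|  | plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD, | ||
|  |                             FFTW_FORWARD, | ||
|  |                             FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT); | ||
|  | @end example | ||
|  | 
 | ||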
|  | @subsubheading Real-data MPI DFTs | ||
|  | 
 | ||
|  | @cindex r2c | ||
|  | Plans for real-input/output (r2c/c2r) DFTs (@pxref{Multi-dimensional | ||
|  | MPI DFTs of Real Data}) are created by: | ||
|  | 
 | ||
|  | @findex fftw_mpi_plan_dft_r2c_2d | ||
|  | @findex fftw_mpi_plan_dft_r2c_3d | ||
|  | @findex fftw_mpi_plan_dft_r2c | ||
|  | @findex fftw_mpi_plan_dft_c2r_2d | ||
|  | @findex fftw_mpi_plan_dft_c2r_3d | ||
|  | @findex fftw_mpi_plan_dft_c2r | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_dft_r2c_2d(ptrdiff_t n0, ptrdiff_t n1,  | ||
|  |                                    double *in, fftw_complex *out, | ||
|  |                                    MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_r2c_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                    double *in, fftw_complex *out, | ||
|  |                                    MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_r2c(int rnk, const ptrdiff_t *n, | ||
|  |                                 double *in, fftw_complex *out, | ||
|  |                                 MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_c2r_2d(ptrdiff_t n0, ptrdiff_t n1,  | ||
|  |                                    fftw_complex *in, double *out, | ||
|  |                                    MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_c2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                    fftw_complex *in, double *out, | ||
|  |                                    MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_dft_c2r(int rnk, const ptrdiff_t *n, | ||
|  |                                 fftw_complex *in, double *out, | ||
|  |                                 MPI_Comm comm, unsigned flags); | ||
|  | @end example | ||
|  | 
 | ||
|  | Similar to the serial interface (@pxref{Real-data DFTs}), these | ||
|  | transform logically @ndims{} real data to/from @ndimshalf{} complex | ||
|  | data, representing the non-redundant half of the conjugate-symmetry | ||
|  | output of a real-input DFT (@pxref{Multi-dimensional Transforms}). | ||
|  | However, the real array must be stored within a padded @ndimspad{} | ||
|  | array (much like the in-place serial r2c transforms, but here for | ||
|  | out-of-place transforms as well). Currently, only multi-dimensional | ||
|  | (@code{rnk > 1}) r2c/c2r transforms are supported (requesting a plan | ||
|  | for @code{rnk = 1} will yield @code{NULL}).  As explained above | ||
|  | (@pxref{Multi-dimensional MPI DFTs of Real Data}), the data | ||
|  | distribution of both the real and complex arrays is given by the | ||
|  | @samp{local_size} function called for the dimensions of the | ||
|  | @emph{complex} array.  Similar to the other planning functions, the | ||
|  | input and output arrays are overwritten when the plan is created | ||
|  | except in @code{FFTW_ESTIMATE} mode. | ||
|  | 
 | ||
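|  | For example, here is a sketch (with hypothetical sizes) of an | ||
|  | out-of-place 3d r2c plan: the data distribution comes from the | ||
|  | @emph{complex} dimensions, and the real array is padded in its last | ||
|  | dimension to @code{2*(n2/2+1)}: | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t n0 = 128, n1 = 128, n2 = 128;   /* hypothetical sizes */ | ||
|  | ptrdiff_t alloc_local, local_n0, local_0_start; | ||
|  | double *rin; | ||
|  | fftw_complex *cout; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | /* local size of the n0 x n1 x (n2/2+1) complex output array */ | ||
|  | alloc_local = fftw_mpi_local_size_3d(n0, n1, n2/2 + 1, MPI_COMM_WORLD, | ||
|  |                                      &local_n0, &local_0_start); | ||
|  | rin  = fftw_alloc_real(2 * alloc_local);   /* padded real array */ | ||
|  | cout = fftw_alloc_complex(alloc_local); | ||
|  | 
 | ||
|  | /* local real element (i,j,k) is rin[(i*n1 + j) * (2*(n2/2+1)) + k] */ | ||
|  | plan = fftw_mpi_plan_dft_r2c_3d(n0, n1, n2, rin, cout, | ||
|  |                                 MPI_COMM_WORLD, FFTW_MEASURE); | ||
|  | @end example | ||
|  | 
 | ||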
|  | As for the complex DFTs above, there is an advanced interface that | ||
|  | allows you to manually specify block sizes and to transform contiguous | ||
|  | @code{howmany}-tuples of real/complex numbers: | ||
|  | 
 | ||
|  | @findex fftw_mpi_plan_many_dft_r2c | ||
|  | @findex fftw_mpi_plan_many_dft_c2r | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_many_dft_r2c | ||
|  |               (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | ||
|  |                ptrdiff_t iblock, ptrdiff_t oblock, | ||
|  |                double *in, fftw_complex *out, | ||
|  |                MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_many_dft_c2r | ||
|  |               (int rnk, const ptrdiff_t *n, ptrdiff_t howmany, | ||
|  |                ptrdiff_t iblock, ptrdiff_t oblock, | ||
|  |                fftw_complex *in, double *out, | ||
|  |                MPI_Comm comm, unsigned flags);                | ||
|  | @end example | ||
|  | 
 | ||
|  | @subsubheading MPI r2r transforms | ||
|  | 
 | ||
|  | @cindex r2r | ||
|  | There are corresponding plan-creation routines for r2r | ||
|  | transforms (@pxref{More DFTs of Real Data}), currently supporting | ||
|  | multidimensional (@code{rnk > 1}) transforms only (@code{rnk = 1} will | ||
|  | yield a @code{NULL} plan): | ||
|  | 
 | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_r2r_2d(ptrdiff_t n0, ptrdiff_t n1, | ||
|  |                                double *in, double *out, | ||
|  |                                MPI_Comm comm, | ||
|  |                                fftw_r2r_kind kind0, fftw_r2r_kind kind1, | ||
|  |                                unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_r2r_3d(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, | ||
|  |                                double *in, double *out, | ||
|  |                                MPI_Comm comm, | ||
|  |                                fftw_r2r_kind kind0, fftw_r2r_kind kind1, fftw_r2r_kind kind2, | ||
|  |                                unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_r2r(int rnk, const ptrdiff_t *n, | ||
|  |                             double *in, double *out, | ||
|  |                             MPI_Comm comm, const fftw_r2r_kind *kind,  | ||
|  |                             unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_many_r2r(int rnk, const ptrdiff_t *n, | ||
|  |                                  ptrdiff_t iblock, ptrdiff_t oblock, | ||
|  |                                  double *in, double *out, | ||
|  |                                  MPI_Comm comm, const fftw_r2r_kind *kind,  | ||
|  |                                  unsigned flags); | ||
|  | @end example | ||
|  | 
 | ||
|  | The parameters are much the same as for the complex DFTs above, except | ||
|  | that the arrays are of real numbers (and hence the outputs of the | ||
|  | @samp{local_size} data-distribution functions should be interpreted as | ||
|  | counts of real rather than complex numbers).  Also, the @code{kind} | ||
|  | parameters specify the r2r kinds along each dimension as for the | ||
|  | serial interface (@pxref{Real-to-Real Transform Kinds}).  @xref{Other | ||
|  | Multi-dimensional Real-data MPI Transforms}. | ||
|  | 
 | ||
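|  | For example, here is a sketch of a 2d r2r plan (sizes and kinds chosen | ||
|  | arbitrarily; here a DCT-II, @code{FFTW_REDFT10}, along both | ||
|  | dimensions): | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t n0 = 512, n1 = 512;   /* hypothetical sizes */ | ||
|  | ptrdiff_t alloc_local, local_n0, local_0_start; | ||
|  | double *data; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | /* for r2r plans, alloc_local counts real numbers */ | ||
|  | alloc_local = fftw_mpi_local_size_2d(n0, n1, MPI_COMM_WORLD, | ||
|  |                                      &local_n0, &local_0_start); | ||
|  | data = fftw_alloc_real(alloc_local); | ||
|  | 
 | ||
|  | plan = fftw_mpi_plan_r2r_2d(n0, n1, data, data, MPI_COMM_WORLD, | ||
|  |                             FFTW_REDFT10, FFTW_REDFT10, FFTW_MEASURE); | ||
|  | @end example | ||
|  | 
 | ||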
|  | @subsubheading MPI transposition | ||
|  | @cindex transpose | ||
|  | 
 | ||
|  | FFTW also provides routines to plan a transpose of a distributed | ||
|  | @code{n0} by @code{n1} array of real numbers, or an array of | ||
|  | @code{howmany}-tuples of real numbers with specified block sizes | ||
|  | (@pxref{FFTW MPI Transposes}): | ||
|  | 
 | ||
|  | @findex fftw_mpi_plan_transpose | ||
|  | @findex fftw_mpi_plan_many_transpose | ||
|  | @example | ||
|  | fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1, | ||
|  |                                   double *in, double *out, | ||
|  |                                   MPI_Comm comm, unsigned flags); | ||
|  | fftw_plan fftw_mpi_plan_many_transpose | ||
|  |                 (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany, | ||
|  |                  ptrdiff_t block0, ptrdiff_t block1, | ||
|  |                  double *in, double *out, MPI_Comm comm, unsigned flags); | ||
|  | @end example | ||
|  | 
 | ||
|  | @cindex new-array execution | ||
|  | @findex fftw_mpi_execute_r2r | ||
|  | These plans are used with the @code{fftw_mpi_execute_r2r} new-array | ||
|  | execute function (@pxref{Using MPI Plans}), since they count as (rank | ||
|  | zero) r2r plans from FFTW's perspective. | ||
|  | 
 | ||
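|  | For example, here is a sketch (with hypothetical sizes) of planning | ||
|  | and executing a distributed transpose of an @code{n0} by @code{n1} | ||
|  | real array: | ||
|  | 
 | ||
|  | @example | ||
|  | const ptrdiff_t n0 = 1000, n1 = 2000;   /* hypothetical sizes */ | ||
|  | ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start; | ||
|  | double *in, *out; | ||
|  | fftw_plan plan; | ||
|  | 
 | ||
|  | /* input is distributed over n0, transposed output over n1 */ | ||
|  | alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD, | ||
|  |                                                 &local_n0, &local_0_start, | ||
|  |                                                 &local_n1, &local_1_start); | ||
|  | in  = fftw_alloc_real(alloc_local); | ||
|  | out = fftw_alloc_real(alloc_local); | ||
|  | 
 | ||
|  | plan = fftw_mpi_plan_transpose(n0, n1, in, out, MPI_COMM_WORLD, FFTW_MEASURE); | ||
|  | 
 | ||
|  | /* ... fill the local_n0 x n1 local rows of in[] ... */ | ||
|  | fftw_execute(plan);   /* or fftw_mpi_execute_r2r(plan, in, out) */ | ||
|  | @end example | ||
|  | 
 | ||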
|  | @node MPI Wisdom Communication,  , MPI Plan Creation, FFTW MPI Reference | ||
|  | @subsection MPI Wisdom Communication | ||
|  | 
 | ||
|  | To facilitate synchronizing wisdom among the different MPI processes, | ||
|  | we provide two functions: | ||
|  | 
 | ||
|  | @findex fftw_mpi_gather_wisdom | ||
|  | @findex fftw_mpi_broadcast_wisdom | ||
|  | @example | ||
|  | void fftw_mpi_gather_wisdom(MPI_Comm comm); | ||
|  | void fftw_mpi_broadcast_wisdom(MPI_Comm comm); | ||
|  | @end example | ||
|  | 
 | ||
|  | The @code{fftw_mpi_gather_wisdom} function gathers all wisdom in the | ||
|  | given communicator @code{comm} to the process of rank 0 in the | ||
|  | communicator: that process obtains the union of all wisdom on all the | ||
|  | processes.  As a side effect, some other processes will gain | ||
|  | additional wisdom from other processes, but only process 0 will gain | ||
|  | the complete union. | ||
|  | 
 | ||
|  | The @code{fftw_mpi_broadcast_wisdom} does the reverse: it exports | ||
|  | wisdom from process 0 in @code{comm} to all other processes in the | ||
|  | communicator, replacing any wisdom they currently have. | ||
|  | 
 | ||
|  | @xref{FFTW MPI Wisdom}. | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node FFTW MPI Fortran Interface,  , FFTW MPI Reference, Distributed-memory FFTW with MPI | ||
|  | @section FFTW MPI Fortran Interface | ||
|  | @cindex Fortran interface | ||
|  | 
 | ||
|  | @cindex iso_c_binding | ||
|  | The FFTW MPI interface is callable from modern Fortran compilers | ||
|  | supporting the Fortran 2003 @code{iso_c_binding} standard for calling | ||
|  | C functions.  As described in @ref{Calling FFTW from Modern Fortran}, | ||
|  | this means that you can directly call FFTW's C interface from Fortran | ||
|  | with only minor changes in syntax.  There are, however, a few things | ||
|  | specific to the MPI interface to keep in mind: | ||
|  | 
 | ||
|  | @itemize @bullet | ||
|  | 
 | ||
|  | @item | ||
|  | Instead of including @code{fftw3.f03} as in @ref{Overview of Fortran | ||
|  | interface}, you should @code{include 'fftw3-mpi.f03'} (after | ||
|  | @code{use, intrinsic :: iso_c_binding} as before).  The | ||
|  | @code{fftw3-mpi.f03} file includes @code{fftw3.f03}, so you should | ||
|  | @emph{not} @code{include} them both yourself.  (You will also want to | ||
|  | include the MPI header file, usually via @code{include 'mpif.h'} or | ||
|  | similar, though this is not needed by @code{fftw3-mpi.f03} | ||
|  | @i{per se}.)  (To use the @samp{fftwl_} @code{long double} extended-precision routines in supporting compilers, you should include @code{fftw3l-mpi.f03} in @emph{addition} to @code{fftw3-mpi.f03}. @xref{Extended and quadruple precision in Fortran}.) | ||
|  | 
 | ||
|  | @item | ||
|  | Because of the different storage conventions between C and Fortran, | ||
|  | you reverse the order of your array dimensions when passing them to | ||
|  | FFTW (@pxref{Reversing array dimensions}).  This is merely a | ||
|  | difference in notation and incurs no performance overhead.  However, | ||
|  | it means that, whereas in C the @emph{first} dimension is distributed, | ||
|  | in Fortran the @emph{last} dimension of your array is distributed. | ||
|  | 
 | ||
|  | @item | ||
|  | @cindex MPI communicator | ||
|  | In Fortran, communicators are stored as @code{integer} types; there is | ||
|  | no @code{MPI_Comm} type, nor is there any way to access a C | ||
|  | @code{MPI_Comm}.  Fortunately, this is taken care of for you by the | ||
|  | FFTW Fortran interface: whenever the C interface expects an | ||
|  | @code{MPI_Comm} type, you should pass the Fortran communicator as an | ||
|  | @code{integer}.@footnote{Technically, this is because you aren't | ||
|  | actually calling the C functions directly. You are calling wrapper | ||
|  | functions that translate the communicator with @code{MPI_Comm_f2c} | ||
|  | before calling the ordinary C interface.  This is all done | ||
|  | transparently, however, since the @code{fftw3-mpi.f03} interface file | ||
|  | renames the wrappers so that they are called in Fortran with the same | ||
|  | names as the C interface functions.} | ||
|  | 
 | ||
|  | @item | ||
|  | Because you need to call the @samp{local_size} function to find out | ||
|  | how much space to allocate, and this may be @emph{larger} than the | ||
|  | local portion of the array (@pxref{MPI Data Distribution}), you should | ||
|  | @emph{always} allocate your arrays dynamically using FFTW's allocation | ||
|  | routines as described in @ref{Allocating aligned memory in Fortran}. | ||
|  | (Coincidentally, this also provides the best performance by | ||
|  | guaranteeing proper data alignment.) | ||
|  | 
 | ||
|  | @item | ||
|  | Because all sizes in the MPI FFTW interface are declared as | ||
|  | @code{ptrdiff_t} in C, you should use @code{integer(C_INTPTR_T)} in | ||
|  | Fortran (@pxref{FFTW Fortran type reference}). | ||
|  | 
 | ||
|  | @item | ||
|  | @findex fftw_execute_dft | ||
|  | @findex fftw_mpi_execute_dft | ||
|  | @cindex new-array execution | ||
|  | In Fortran, because of the language semantics, we generally recommend | ||
|  | using the new-array execute functions for all plans, even in the | ||
|  | common case where you are executing the plan on the same arrays for | ||
|  | which the plan was created (@pxref{Plan execution in Fortran}). | ||
|  | However, note that in the MPI interface these functions are changed: | ||
|  | @code{fftw_execute_dft} becomes @code{fftw_mpi_execute_dft}, | ||
|  | etcetera. @xref{Using MPI Plans}. | ||
|  | 
 | ||
|  | @end itemize | ||
|  | 
 | ||
|  | For example, here is a Fortran code snippet to perform a distributed | ||
|  | @twodims{L,M} complex DFT in-place.  (This assumes you have already | ||
|  | initialized MPI with @code{MPI_Init} and have also performed | ||
|  | @code{call fftw_mpi_init}.) | ||
|  | 
 | ||
|  | @example | ||
|  |   use, intrinsic :: iso_c_binding | ||
|  |   include 'fftw3-mpi.f03' | ||
|  |   integer(C_INTPTR_T), parameter :: L = ... | ||
|  |   integer(C_INTPTR_T), parameter :: M = ... | ||
|  |   type(C_PTR) :: plan, cdata | ||
|  |   complex(C_DOUBLE_COMPLEX), pointer :: data(:,:) | ||
|  |   integer(C_INTPTR_T) :: i, j, alloc_local, local_M, local_j_offset | ||
|  | 
 | ||
|  | !   @r{get local data size and allocate (note dimension reversal)} | ||
|  |   alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, & | ||
|  |                                        local_M, local_j_offset) | ||
|  |   cdata = fftw_alloc_complex(alloc_local) | ||
|  |   call c_f_pointer(cdata, data, [L,local_M]) | ||
|  | 
 | ||
|  | !   @r{create MPI plan for in-place forward DFT (note dimension reversal)} | ||
|  |   plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, & | ||
|  |                               FFTW_FORWARD, FFTW_MEASURE) | ||
|  | 
 | ||
|  | ! @r{initialize data to some function} my_function(i,j) | ||
|  |   do j = 1, local_M | ||
|  |     do i = 1, L | ||
|  |       data(i, j) = my_function(i, j + local_j_offset) | ||
|  |     end do | ||
|  |   end do | ||
|  | 
 | ||
|  | ! @r{compute transform (as many times as desired)} | ||
|  |   call fftw_mpi_execute_dft(plan, data, data) | ||
|  | 
 | ||
|  |   call fftw_destroy_plan(plan) | ||
|  |   call fftw_free(cdata) | ||
|  | @end example | ||
|  | 
 | ||
|  | Note that we called @code{fftw_mpi_local_size_2d} and | ||
|  | @code{fftw_mpi_plan_dft_2d} with the dimensions in reversed order, | ||
|  | since a @twodims{L,M} Fortran array is viewed by FFTW in C as a | ||
|  | @twodims{M, L} array.  This means that the array was distributed over | ||
|  | the @code{M} dimension, the local portion of which is a | ||
|  | @twodims{L,local_M} array in Fortran.  (You must @emph{not} use an | ||
|  | @code{allocate} statement to allocate an @twodims{L,local_M} array, | ||
|  | however; you must allocate @code{alloc_local} complex numbers, which | ||
|  | may be greater than @code{L * local_M}, in order to reserve space for | ||
|  | intermediate steps of the transform.)  Finally, we mention that | ||
|  | because C's array indices are zero-based, the @code{local_j_offset} | ||
|  | argument can conveniently be interpreted as an offset in the 1-based | ||
|  | @code{j} index (rather than as a starting index as in C). | ||
|  | 
 | ||
|  | If instead you had used the @code{ior(FFTW_MEASURE, | ||
|  | FFTW_MPI_TRANSPOSED_OUT)} flag, the output of the transform would be a | ||
|  | transposed @twodims{M,local_L} array, associated with the @emph{same} | ||
|  | @code{cdata} allocation (since the transform is in-place), and which | ||
|  | you could declare with: | ||
|  | 
 | ||
|  | @example | ||
|  |   complex(C_DOUBLE_COMPLEX), pointer :: tdata(:,:) | ||
|  |   ... | ||
|  |   call c_f_pointer(cdata, tdata, [M,local_L]) | ||
|  | @end example | ||
|  | 
 | ||
|  | where @code{local_L} would have been obtained by changing the | ||
|  | @code{fftw_mpi_local_size_2d} call to: | ||
|  | 
 | ||
|  | @example | ||
|  |   alloc_local = fftw_mpi_local_size_2d_transposed(M, L, MPI_COMM_WORLD, & | ||
|  |                            local_M, local_j_offset, local_L, local_i_offset) | ||
|  | @end example |