283 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
		
		
			
		
	
	
			283 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| 
								 | 
							
								@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
							 | 
						||
| 
								 | 
							
								@chapter Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex parallel transform
							 | 
						||
| 
								 | 
							
								In this chapter we document the parallel FFTW routines for
							 | 
						||
| 
								 | 
							
								shared-memory parallel hardware.  These routines, which support
							 | 
						||
| 
								 | 
							
								parallel one- and multi-dimensional transforms of both real and
							 | 
						||
| 
								 | 
							
								complex data, are the easiest way to take advantage of multiple
							 | 
						||
| 
								 | 
							
								processors with FFTW.  They work just like the corresponding
							 | 
						||
| 
								 | 
							
								uniprocessor transform routines, except that you have an extra
							 | 
						||
| 
								 | 
							
								initialization routine to call, and there is a routine to set the
							 | 
						||
| 
								 | 
							
								number of threads to employ.  Any program that uses the uniprocessor
							 | 
						||
| 
								 | 
							
								FFTW can therefore be trivially modified to use the multi-threaded
							 | 
						||
| 
								 | 
							
								FFTW.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								A shared-memory machine is one in which all CPUs can directly access
							 | 
						||
| 
								 | 
							
								the same main memory, and such machines are now common due to the
							 | 
						||
| 
								 | 
							
								ubiquity of multi-core CPUs.  FFTW's multi-threading support allows
							 | 
						||
| 
								 | 
							
								you to utilize these additional CPUs transparently from a single
							 | 
						||
| 
								 | 
							
								program.  However, this does not necessarily translate into
							 | 
						||
| 
								 | 
							
								performance gains---when multiple threads/CPUs are employed, there is
							 | 
						||
| 
								 | 
							
								an overhead required for synchronization that may outweigh the
							 | 
						||
| 
								 | 
							
								computatational parallelism.  Therefore, you can only benefit from
							 | 
						||
| 
								 | 
							
								threads if your problem is sufficiently large.
							 | 
						||
| 
								 | 
							
								@cindex shared-memory
							 | 
						||
| 
								 | 
							
								@cindex threads
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@menu
							 | 
						||
| 
								 | 
							
								* Installation and Supported Hardware/Software::
							 | 
						||
| 
								 | 
							
								* Usage of Multi-threaded FFTW::
							 | 
						||
| 
								 | 
							
								* How Many Threads to Use?::
							 | 
						||
| 
								 | 
							
								* Thread safety::
							 | 
						||
| 
								 | 
							
								@end menu
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@c ------------------------------------------------------------
							 | 
						||
| 
								 | 
							
								@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								@section Installation and Supported Hardware/Software
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								All of the FFTW threads code is located in the @code{threads}
							 | 
						||
| 
								 | 
							
								subdirectory of the FFTW package.  On Unix systems, the FFTW threads
							 | 
						||
| 
								 | 
							
								libraries and header files can be automatically configured, compiled,
							 | 
						||
| 
								 | 
							
								and installed along with the uniprocessor FFTW libraries simply by
							 | 
						||
| 
								 | 
							
								including @code{--enable-threads} in the flags to the @code{configure}
							 | 
						||
| 
								 | 
							
								script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use
							 | 
						||
| 
								 | 
							
								@uref{http://www.openmp.org,OpenMP} threads.
							 | 
						||
| 
								 | 
							
								@fpindex configure
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex portability
							 | 
						||
| 
								 | 
							
								@cindex OpenMP
							 | 
						||
| 
								 | 
							
								The threads routines require your operating system to have some sort
							 | 
						||
| 
								 | 
							
								of shared-memory threads support.  Specifically, the FFTW threads
							 | 
						||
| 
								 | 
							
								package works with POSIX threads (available on most Unix variants,
							 | 
						||
| 
								 | 
							
								from GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which
							 | 
						||
| 
								 | 
							
								are supported in many common compilers (e.g. gcc) are also supported,
							 | 
						||
| 
								 | 
							
								and may give better performance on some systems.  (OpenMP threads are
							 | 
						||
| 
								 | 
							
								also useful if you are employing OpenMP in your own code, in order to
							 | 
						||
| 
								 | 
							
								minimize conflicts between threading models.)  If you have a
							 | 
						||
| 
								 | 
							
								shared-memory machine that uses a different threads API, it should be
							 | 
						||
| 
								 | 
							
								a simple matter of programming to include support for it; see the file
							 | 
						||
| 
								 | 
							
								@code{threads/threads.c} for more detail.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								You can compile FFTW with @emph{both} @code{--enable-threads} and
							 | 
						||
| 
								 | 
							
								@code{--enable-openmp} at the same time, since they install libraries
							 | 
						||
| 
								 | 
							
								with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as
							 | 
						||
| 
								 | 
							
								described below).  However, your programs may only link to @emph{one}
							 | 
						||
| 
								 | 
							
								of these two libraries at a time.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Ideally, of course, you should also have multiple processors in order to
							 | 
						||
| 
								 | 
							
								get any benefit from the threaded transforms.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@c ------------------------------------------------------------
							 | 
						||
| 
								 | 
							
								@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								@section Usage of Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Here, it is assumed that the reader is already familiar with the usage
							 | 
						||
| 
								 | 
							
								of the uniprocessor FFTW routines, described elsewhere in this manual.
							 | 
						||
| 
								 | 
							
								We only describe what one has to change in order to use the
							 | 
						||
| 
								 | 
							
								multi-threaded routines.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex OpenMP
							 | 
						||
| 
								 | 
							
								First, programs using the parallel complex transforms should be linked
							 | 
						||
| 
								 | 
							
								with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp
							 | 
						||
| 
								 | 
							
								-lfftw3 -lm} if you compiled with OpenMP. You will also need to link
							 | 
						||
| 
								 | 
							
								with whatever library is responsible for threads on your system
							 | 
						||
| 
								 | 
							
								(e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag
							 | 
						||
| 
								 | 
							
								enables OpenMP (e.g. @code{-fopenmp} with gcc).
							 | 
						||
| 
								 | 
							
								@cindex linking on Unix
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Second, before calling @emph{any} FFTW routines, you should call the
							 | 
						||
| 
								 | 
							
								function:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								int fftw_init_threads(void);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_init_threads
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								This function, which need only be called once, performs any one-time
							 | 
						||
| 
								 | 
							
								initialization required to use threads on your system.  It returns zero
							 | 
						||
| 
								 | 
							
								if there was some error (which should not happen under normal
							 | 
						||
| 
								 | 
							
								circumstances) and a non-zero value otherwise.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Third, before creating a plan that you want to parallelize, you should
							 | 
						||
| 
								 | 
							
								call:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								void fftw_plan_with_nthreads(int nthreads);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_plan_with_nthreads
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The @code{nthreads} argument indicates the number of threads you want
							 | 
						||
| 
								 | 
							
								FFTW to use (or actually, the maximum number).  All plans subsequently
							 | 
						||
| 
								 | 
							
								created with any planner routine will use that many threads.  You can
							 | 
						||
| 
								 | 
							
								call @code{fftw_plan_with_nthreads}, create some plans, call
							 | 
						||
| 
								 | 
							
								@code{fftw_plan_with_nthreads} again with a different argument, and
							 | 
						||
| 
								 | 
							
								create some more plans for a new number of threads.  Plans already created
							 | 
						||
| 
								 | 
							
								before a call to @code{fftw_plan_with_nthreads} are unaffected.  If you
							 | 
						||
| 
								 | 
							
								pass an @code{nthreads} argument of @code{1} (the default), threads are
							 | 
						||
| 
								 | 
							
								disabled for subsequent plans.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								You can determine the current number of threads that the planner can
							 | 
						||
| 
								 | 
							
								use by calling:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								int fftw_planner_nthreads(void);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_planner_nthreads
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex OpenMP
							 | 
						||
| 
								 | 
							
								With OpenMP, to configure FFTW to use all of the currently running
							 | 
						||
| 
								 | 
							
								OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the
							 | 
						||
| 
								 | 
							
								@code{OMP_NUM_THREADS} environment variable), you can do:
							 | 
						||
| 
								 | 
							
								@code{fftw_plan_with_nthreads(omp_get_max_threads())}. (The @samp{omp_}
							 | 
						||
| 
								 | 
							
								OpenMP functions are declared via @code{#include <omp.h>}.)
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex thread safety
							 | 
						||
| 
								 | 
							
								Given a plan, you then execute it as usual with
							 | 
						||
| 
								 | 
							
								@code{fftw_execute(plan)}, and the execution will use the number of
							 | 
						||
| 
								 | 
							
								threads specified when the plan was created.  When done, you destroy
							 | 
						||
| 
								 | 
							
								it as usual with @code{fftw_destroy_plan}.  As described in
							 | 
						||
| 
								 | 
							
								@ref{Thread safety}, plan @emph{execution} is thread-safe, but plan
							 | 
						||
| 
								 | 
							
								creation and destruction are @emph{not}: you should create/destroy
							 | 
						||
| 
								 | 
							
								plans only from a single thread, but can safely execute multiple plans
							 | 
						||
| 
								 | 
							
								in parallel.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								There is one additional routine: if you want to get rid of all memory
							 | 
						||
| 
								 | 
							
								and other resources allocated internally by FFTW, you can call:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								void fftw_cleanup_threads(void);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_cleanup_threads
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								which is much like the @code{fftw_cleanup()} function except that it
							 | 
						||
| 
								 | 
							
								also gets rid of threads-related data.  You must @emph{not} execute any
							 | 
						||
| 
								 | 
							
								previously created plans after calling this function.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								We should also mention one other restriction: if you save wisdom from a
							 | 
						||
| 
								 | 
							
								program using the multi-threaded FFTW, that wisdom @emph{cannot be used}
							 | 
						||
| 
								 | 
							
								by a program using only the single-threaded FFTW (i.e. not calling
							 | 
						||
| 
								 | 
							
								@code{fftw_init_threads}).  @xref{Words of Wisdom-Saving Plans}.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Finally, FFTW provides a optional callback interface that allows you to
							 | 
						||
| 
								 | 
							
								replace its parallel threading backend at runtime:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								void fftw_threads_set_callback(
							 | 
						||
| 
								 | 
							
								    void (*parallel_loop)(void *(*work)(void *), char *jobdata, size_t elsize, int njobs, void *data),
							 | 
						||
| 
								 | 
							
								    void *data);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_threads_set_callback
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								This routine (which is @emph{not} threadsafe and should generally be called before creating
							 | 
						||
| 
								 | 
							
								any FFTW plans) allows you to provide a function @code{parallel_loop} that executes
							 | 
						||
| 
								 | 
							
								parallel work for FFTW: it should call the function @code{work(jobdata + elsize*i)} for
							 | 
						||
| 
								 | 
							
								@code{i} from @code{0} to @code{njobs-1}, possibly in parallel.  (The `data` pointer
							 | 
						||
| 
								 | 
							
								supplied to @code{fftw_threads_set_callback} is passed through to your @code{parallel_loop}
							 | 
						||
| 
								 | 
							
								function.)   For example, if you link to an FFTW threads library built to use POSIX threads,
							 | 
						||
| 
								 | 
							
								but you want it to use OpenMP instead (because you are using OpenMP elsewhere in your program
							 | 
						||
| 
								 | 
							
								and want to avoid competing threads), you can call @code{fftw_threads_set_callback} with
							 | 
						||
| 
								 | 
							
								the callback function:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								void parallel_loop(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data)
							 | 
						||
| 
								 | 
							
								@{
							 | 
						||
| 
								 | 
							
								#pragma omp parallel for
							 | 
						||
| 
								 | 
							
								    for (int i = 0; i < njobs; ++i)
							 | 
						||
| 
								 | 
							
								        work(jobdata + elsize * i);
							 | 
						||
| 
								 | 
							
								@}
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The same mechanism could be used in order to make FFTW use a threading backend
							 | 
						||
| 
								 | 
							
								implemented via Intel TBB, Apple GCD, or Cilk, for example.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@c ------------------------------------------------------------
							 | 
						||
| 
								 | 
							
								@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								@section How Many Threads to Use?
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex number of threads
							 | 
						||
| 
								 | 
							
								There is a fair amount of overhead involved in synchronizing threads,
							 | 
						||
| 
								 | 
							
								so the optimal number of threads to use depends upon the size of the
							 | 
						||
| 
								 | 
							
								transform as well as on the number of processors you have.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								As a general rule, you don't want to use more threads than you have
							 | 
						||
| 
								 | 
							
								processors.  (Using more threads will work, but there will be extra
							 | 
						||
| 
								 | 
							
								overhead with no benefit.)  In fact, if the problem size is too small,
							 | 
						||
| 
								 | 
							
								you may want to use fewer threads than you have processors.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								You will have to experiment with your system to see what level of
							 | 
						||
| 
								 | 
							
								parallelization is best for your problem size.  Typically, the problem
							 | 
						||
| 
								 | 
							
								will have to involve at least a few thousand data points before threads
							 | 
						||
| 
								 | 
							
								become beneficial.  If you plan with @code{FFTW_PATIENT}, it will
							 | 
						||
| 
								 | 
							
								automatically disable threads for sizes that don't benefit from
							 | 
						||
| 
								 | 
							
								parallelization.
							 | 
						||
| 
								 | 
							
								@ctindex FFTW_PATIENT
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@c ------------------------------------------------------------
							 | 
						||
| 
								 | 
							
								@node Thread safety,  , How Many Threads to Use?, Multi-threaded FFTW
							 | 
						||
| 
								 | 
							
								@section Thread safety
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@cindex threads
							 | 
						||
| 
								 | 
							
								@cindex OpenMP
							 | 
						||
| 
								 | 
							
								@cindex thread safety
							 | 
						||
| 
								 | 
							
								Users writing multi-threaded programs (including OpenMP) must concern
							 | 
						||
| 
								 | 
							
								themselves with the @dfn{thread safety} of the libraries they
							 | 
						||
| 
								 | 
							
								use---that is, whether it is safe to call routines in parallel from
							 | 
						||
| 
								 | 
							
								multiple threads.  FFTW can be used in such an environment, but some
							 | 
						||
| 
								 | 
							
								care must be taken because the planner routines share data
							 | 
						||
| 
								 | 
							
								(e.g. wisdom and trigonometric tables) between calls and plans.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The upshot is that the only thread-safe routine in FFTW is
							 | 
						||
| 
								 | 
							
								@code{fftw_execute} (and the new-array variants thereof).  All other routines
							 | 
						||
| 
								 | 
							
								(e.g. the planner) should only be called from one thread at a time.  So,
							 | 
						||
| 
								 | 
							
								for example, you can wrap a semaphore lock around any calls to the
							 | 
						||
| 
								 | 
							
								planner; even more simply, you can just create all of your plans from
							 | 
						||
| 
								 | 
							
								one thread.  We do not think this should be an important restriction
							 | 
						||
| 
								 | 
							
								(FFTW is designed for the situation where the only performance-sensitive
							 | 
						||
| 
								 | 
							
								code is the actual execution of the transform), and the benefits of
							 | 
						||
| 
								 | 
							
								shared data between plans are great.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Note also that, since the plan is not modified by @code{fftw_execute},
							 | 
						||
| 
								 | 
							
								it is safe to execute the @emph{same plan} in parallel by multiple
							 | 
						||
| 
								 | 
							
								threads.  However, since a given plan operates by default on a fixed
							 | 
						||
| 
								 | 
							
								array, you need to use one of the new-array execute functions (@pxref{New-array Execute Functions}) so that different threads compute the transform of different data.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								(Users should note that these comments only apply to programs using
							 | 
						||
| 
								 | 
							
								shared-memory threads or OpenMP.  Parallelism using MPI or forked processes
							 | 
						||
| 
								 | 
							
								involves a separate address-space and global variables for each process,
							 | 
						||
| 
								 | 
							
								and is not susceptible to problems of this sort.)
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								The FFTW planner is intended to be called from a single thread.  If you
							 | 
						||
| 
								 | 
							
								really must call it from multiple threads, you are expected to grab
							 | 
						||
| 
								 | 
							
								whatever lock makes sense for your application, with the understanding
							 | 
						||
| 
								 | 
							
								that you may be holding that lock for a long time, which is undesirable.
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								Neither strategy works, however, in the following situation.  The
							 | 
						||
| 
								 | 
							
								``application'' is structured as a set of ``plugins'' which are unaware
							 | 
						||
| 
								 | 
							
								of each other, and for whatever reason the ``plugins'' cannot coordinate
							 | 
						||
| 
								 | 
							
								on grabbing the lock.  (This is not a technical problem, but an
							 | 
						||
| 
								 | 
							
								organizational one.  The ``plugins'' are written by independent agents,
							 | 
						||
| 
								 | 
							
								and from the perspective of each plugin's author, each plugin is using
							 | 
						||
| 
								 | 
							
								FFTW correctly from a single thread.)  To cope with this situation,
							 | 
						||
| 
								 | 
							
								starting from FFTW-3.3.5, FFTW supports an API to make the planner
							 | 
						||
| 
								 | 
							
								thread-safe:
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								@example
							 | 
						||
| 
								 | 
							
								void fftw_make_planner_thread_safe(void);
							 | 
						||
| 
								 | 
							
								@end example
							 | 
						||
| 
								 | 
							
								@findex fftw_make_planner_thread_safe
							 | 
						||
| 
								 | 
							
								
							 | 
						||
| 
								 | 
							
								This call operates by brute force: It just installs a hook that wraps a
							 | 
						||
| 
								 | 
							
								lock (chosen by us) around all planner calls.  So there is no magic and
							 | 
						||
| 
								 | 
							
								you get the worst of all worlds.  The planner is still single-threaded,
							 | 
						||
| 
								 | 
							
								but you cannot choose which lock to use.  The planner still holds the
							 | 
						||
| 
								 | 
							
								lock for a long time, but you cannot impose a timeout on lock
							 | 
						||
| 
								 | 
							
								acquisition.  As of FFTW-3.3.5 and FFTW-3.3.6, this call does not work
							 | 
						||
| 
								 | 
							
								when using OpenMP as threading substrate.  (Suggestions on what to do
							 | 
						||
| 
								 | 
							
								about this bug are welcome.)  @emph{Do not use
							 | 
						||
| 
								 | 
							
								@code{fftw_make_planner_thread_safe} unless there is no other choice,}
							 | 
						||
| 
								 | 
							
								such as in the application/plugin situation.
							 |