|   | @node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top | ||
|  | @chapter Multi-threaded FFTW | ||
|  | 
 | ||
|  | @cindex parallel transform | ||
|  | In this chapter we document the parallel FFTW routines for | ||
|  | shared-memory parallel hardware.  These routines, which support | ||
|  | parallel one- and multi-dimensional transforms of both real and | ||
|  | complex data, are the easiest way to take advantage of multiple | ||
|  | processors with FFTW.  They work just like the corresponding | ||
|  | uniprocessor transform routines, except that you have an extra | ||
|  | initialization routine to call, and there is a routine to set the | ||
|  | number of threads to employ.  Any program that uses the uniprocessor | ||
|  | FFTW can therefore be trivially modified to use the multi-threaded | ||
|  | FFTW. | ||
|  | 
 | ||
|  | A shared-memory machine is one in which all CPUs can directly access | ||
|  | the same main memory, and such machines are now common due to the | ||
|  | ubiquity of multi-core CPUs.  FFTW's multi-threading support allows | ||
|  | you to utilize these additional CPUs transparently from a single | ||
|  | program.  However, this does not necessarily translate into | ||
|  | performance gains---when multiple threads/CPUs are employed, there is | ||
|  | an overhead required for synchronization that may outweigh the | ||
|  | computational parallelism.  Therefore, you can only benefit from | ||
|  | threads if your problem is sufficiently large. | ||
|  | @cindex shared-memory | ||
|  | @cindex threads | ||
|  | 
 | ||
|  | @menu | ||
|  | * Installation and Supported Hardware/Software:: | ||
|  | * Usage of Multi-threaded FFTW:: | ||
|  | * How Many Threads to Use?:: | ||
|  | * Thread safety:: | ||
|  | @end menu | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW | ||
|  | @section Installation and Supported Hardware/Software | ||
|  | 
 | ||
|  | All of the FFTW threads code is located in the @code{threads} | ||
|  | subdirectory of the FFTW package.  On Unix systems, the FFTW threads | ||
|  | libraries and header files can be automatically configured, compiled, | ||
|  | and installed along with the uniprocessor FFTW libraries simply by | ||
|  | including @code{--enable-threads} in the flags to the @code{configure} | ||
|  | script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use | ||
|  | @uref{http://www.openmp.org,OpenMP} threads. | ||
|  | @fpindex configure | ||
|  | 
 | ||
|  | 
 | ||
|  | @cindex portability | ||
|  | @cindex OpenMP | ||
|  | The threads routines require your operating system to have some sort | ||
|  | of shared-memory threads support.  Specifically, the FFTW threads | ||
|  | package works with POSIX threads (available on most Unix variants, | ||
|  | from GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which | ||
|  | are supported in many common compilers (e.g. gcc), are also supported, | ||
|  | and may give better performance on some systems.  (OpenMP threads are | ||
|  | also useful if you are employing OpenMP in your own code, in order to | ||
|  | minimize conflicts between threading models.)  If you have a | ||
|  | shared-memory machine that uses a different threads API, it should be | ||
|  | a simple matter of programming to include support for it; see the file | ||
|  | @code{threads/threads.c} for more detail. | ||
|  | 
 | ||
|  | You can compile FFTW with @emph{both} @code{--enable-threads} and | ||
|  | @code{--enable-openmp} at the same time, since they install libraries | ||
|  | with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as | ||
|  | described below).  However, your programs may only link to @emph{one} | ||
|  | of these two libraries at a time. | ||
|  | 
 | ||
|  | Ideally, of course, you should also have multiple processors in order to | ||
|  | get any benefit from the threaded transforms. | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW | ||
|  | @section Usage of Multi-threaded FFTW | ||
|  | 
 | ||
|  | Here, it is assumed that the reader is already familiar with the usage | ||
|  | of the uniprocessor FFTW routines, described elsewhere in this manual. | ||
|  | We only describe what one has to change in order to use the | ||
|  | multi-threaded routines. | ||
|  | 
 | ||
|  | @cindex OpenMP | ||
|  | First, programs using the parallel transforms should be linked | ||
|  | with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp | ||
|  | -lfftw3 -lm} if you compiled with OpenMP. You will also need to link | ||
|  | with whatever library is responsible for threads on your system | ||
|  | (e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag | ||
|  | enables OpenMP (e.g. @code{-fopenmp} with gcc). | ||
|  | @cindex linking on Unix | ||
|  | 
 | ||
|  | 
 | ||
|  | Second, before calling @emph{any} FFTW routines, you should call the | ||
|  | function: | ||
|  | 
 | ||
|  | @example | ||
|  | int fftw_init_threads(void); | ||
|  | @end example | ||
|  | @findex fftw_init_threads | ||
|  | 
 | ||
|  | This function, which need only be called once, performs any one-time | ||
|  | initialization required to use threads on your system.  It returns zero | ||
|  | if there was some error (which should not happen under normal | ||
|  | circumstances) and a non-zero value otherwise. | ||
|  | 
 | ||
|  | Third, before creating a plan that you want to parallelize, you should | ||
|  | call: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_plan_with_nthreads(int nthreads); | ||
|  | @end example | ||
|  | @findex fftw_plan_with_nthreads | ||
|  | 
 | ||
|  | The @code{nthreads} argument indicates the number of threads you want | ||
|  | FFTW to use (or actually, the maximum number).  All plans subsequently | ||
|  | created with any planner routine will use that many threads.  You can | ||
|  | call @code{fftw_plan_with_nthreads}, create some plans, call | ||
|  | @code{fftw_plan_with_nthreads} again with a different argument, and | ||
|  | create some more plans for a new number of threads.  Plans already created | ||
|  | before a call to @code{fftw_plan_with_nthreads} are unaffected.  If you | ||
|  | pass an @code{nthreads} argument of @code{1} (the default), threads are | ||
|  | disabled for subsequent plans. | ||
|  | 
 | ||
|  | You can determine the current number of threads that the planner can | ||
|  | use by calling: | ||
|  | 
 | ||
|  | @example | ||
|  | int fftw_planner_nthreads(void); | ||
|  | @end example | ||
|  | @findex fftw_planner_nthreads | ||
|  | 
 | ||
|  | @cindex OpenMP | ||
|  | With OpenMP, to configure FFTW to use all of the currently running | ||
|  | OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the | ||
|  | @code{OMP_NUM_THREADS} environment variable), you can do: | ||
|  | @code{fftw_plan_with_nthreads(omp_get_max_threads())}. (The @samp{omp_} | ||
|  | OpenMP functions are declared via @code{#include <omp.h>}.) | ||
|  | 
 | ||
|  | @cindex thread safety | ||
|  | Given a plan, you then execute it as usual with | ||
|  | @code{fftw_execute(plan)}, and the execution will use the number of | ||
|  | threads specified when the plan was created.  When done, you destroy | ||
|  | it as usual with @code{fftw_destroy_plan}.  As described in | ||
|  | @ref{Thread safety}, plan @emph{execution} is thread-safe, but plan | ||
|  | creation and destruction are @emph{not}: you should create/destroy | ||
|  | plans only from a single thread, but can safely execute multiple plans | ||
|  | in parallel. | ||
|  | 
 | ||
|  | There is one additional routine: if you want to get rid of all memory | ||
|  | and other resources allocated internally by FFTW, you can call: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_cleanup_threads(void); | ||
|  | @end example | ||
|  | @findex fftw_cleanup_threads | ||
|  | 
 | ||
|  | which is much like the @code{fftw_cleanup()} function except that it | ||
|  | also gets rid of threads-related data.  You must @emph{not} execute any | ||
|  | previously created plans after calling this function. | ||
|  | 
 | ||
|  | We should also mention one other restriction: if you save wisdom from a | ||
|  | program using the multi-threaded FFTW, that wisdom @emph{cannot be used} | ||
|  | by a program using only the single-threaded FFTW (i.e. not calling | ||
|  | @code{fftw_init_threads}).  @xref{Words of Wisdom-Saving Plans}. | ||
|  | 
 | ||
|  | Finally, FFTW provides an optional callback interface that allows you to | ||
|  | replace its parallel threading backend at runtime: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_threads_set_callback( | ||
|  |     void (*parallel_loop)(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data), | ||
|  |     void *data); | ||
|  | @end example | ||
|  | @findex fftw_threads_set_callback | ||
|  | 
 | ||
|  | This routine (which is @emph{not} threadsafe and should generally be called before creating | ||
|  | any FFTW plans) allows you to provide a function @code{parallel_loop} that executes | ||
|  | parallel work for FFTW: it should call the function @code{work(jobdata + elsize*i)} for | ||
|  | @code{i} from @code{0} to @code{njobs-1}, possibly in parallel.  (The @code{data} pointer | ||
|  | supplied to @code{fftw_threads_set_callback} is passed through to your @code{parallel_loop} | ||
|  | function.)   For example, if you link to an FFTW threads library built to use POSIX threads, | ||
|  | but you want it to use OpenMP instead (because you are using OpenMP elsewhere in your program | ||
|  | and want to avoid competing threads), you can call @code{fftw_threads_set_callback} with | ||
|  | the callback function: | ||
|  | 
 | ||
|  | @example | ||
|  | void parallel_loop(void *(*work)(char *), char *jobdata, size_t elsize, int njobs, void *data) | ||
|  | @{ | ||
|  | #pragma omp parallel for | ||
|  |     for (int i = 0; i < njobs; ++i) | ||
|  |         work(jobdata + elsize * i); | ||
|  | @} | ||
|  | @end example | ||
|  | 
 | ||
|  | The same mechanism could be used in order to make FFTW use a threading backend | ||
|  | implemented via Intel TBB, Apple GCD, or Cilk, for example. | ||
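
As a further illustration of the callback contract, here is a minimal
sketch of a @code{parallel_loop} built directly on POSIX threads.  It is
not part of FFTW.  The names @code{pthread_parallel_loop}, @code{job_arg},
@code{run_job}, @code{square_job}, and @code{demo_sum_of_squares} are
hypothetical, and the callback uses the same @code{work(char *)} signature
as the OpenMP example above.  The sketch starts one thread per job, runs a
job inline if a thread cannot be started, and returns only after every job
has finished.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread argument: the work function and one job slot. */
typedef struct {
    void *(*work)(char *);
    char *job;
} job_arg;

/* pthread entry point: run a single job. */
static void *run_job(void *p)
{
    job_arg *a = (job_arg *)p;
    return a->work(a->job);
}

/* Sketch of a parallel_loop callback (not part of FFTW): one POSIX
   thread per job, with an inline fallback if a thread cannot start. */
void pthread_parallel_loop(void *(*work)(char *), char *jobdata,
                           size_t elsize, int njobs, void *data)
{
    pthread_t *tid;
    job_arg *args;
    int i, nstarted = 0;

    (void)data; /* the user pointer is unused in this sketch */
    if (njobs <= 0)
        return;
    tid = malloc((size_t)njobs * sizeof *tid);
    args = malloc((size_t)njobs * sizeof *args);

    for (i = 0; i < njobs; ++i) {
        char *job = jobdata + elsize * (size_t)i;
        if (tid && args) {
            args[nstarted].work = work;
            args[nstarted].job = job;
            if (pthread_create(&tid[nstarted], NULL, run_job,
                               &args[nstarted]) == 0) {
                ++nstarted;
                continue;
            }
        }
        work(job); /* fallback: run this job in the calling thread */
    }
    /* Wait for every started job before returning to the caller. */
    for (i = 0; i < nstarted; ++i)
        pthread_join(tid[i], NULL);
    free(tid);
    free(args);
}

/* Stand-in job used only to exercise the loop: squares one int in place. */
static void *square_job(char *p)
{
    int *v = (int *)p;
    *v = *v * *v;
    return NULL;
}

/* Runs the loop over a small array and sums the squared values. */
int demo_sum_of_squares(void)
{
    int vals[4] = {1, 2, 3, 4};
    int i, sum = 0;
    pthread_parallel_loop(square_job, (char *)vals, sizeof vals[0], 4, NULL);
    for (i = 0; i < 4; ++i)
        sum += vals[i];
    return sum;
}
```

To install such a callback you would call
@code{fftw_threads_set_callback(pthread_parallel_loop, NULL)} before
creating any plans.  Starting a new thread for every job is expensive, so
a practical backend would more likely hand the jobs to an existing thread
pool.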
|  | 
 | ||
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW | ||
|  | @section How Many Threads to Use? | ||
|  | 
 | ||
|  | @cindex number of threads | ||
|  | There is a fair amount of overhead involved in synchronizing threads, | ||
|  | so the optimal number of threads to use depends upon the size of the | ||
|  | transform as well as on the number of processors you have. | ||
|  | 
 | ||
|  | As a general rule, you don't want to use more threads than you have | ||
|  | processors.  (Using more threads will work, but there will be extra | ||
|  | overhead with no benefit.)  In fact, if the problem size is too small, | ||
|  | you may want to use fewer threads than you have processors. | ||
|  | 
 | ||
|  | You will have to experiment with your system to see what level of | ||
|  | parallelization is best for your problem size.  Typically, the problem | ||
|  | will have to involve at least a few thousand data points before threads | ||
|  | become beneficial.  If you plan with @code{FFTW_PATIENT}, the planner will | ||
|  | automatically disable threads for sizes that don't benefit from | ||
|  | parallelization. | ||
|  | @ctindex FFTW_PATIENT | ||
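
The rules of thumb above can be sketched as a small helper that picks a
starting value for @code{fftw_plan_with_nthreads}.  This is only an
illustration: the function @code{choose_nthreads} and the constant
@code{MIN_POINTS_PER_THREAD} are hypothetical, and the value 4096 is an
assumed placeholder rather than anything FFTW defines.  You would still
need to measure on your own system.

```c
#include <stddef.h>

/* Assumed tuning constant (not an FFTW value): the minimum number of
   data points each thread should handle before adding another thread. */
#define MIN_POINTS_PER_THREAD 4096

/* Hypothetical helper: never more threads than processors, and fewer
   threads (down to one) when the transform is small. */
int choose_nthreads(size_t npoints, int nprocs)
{
    size_t by_size = npoints / MIN_POINTS_PER_THREAD;

    if (nprocs < 1)
        nprocs = 1;
    if (by_size < 1)
        return 1;       /* too small to benefit from threads */
    if (by_size > (size_t)nprocs)
        return nprocs;  /* cap at the processor count */
    return (int)by_size;
}
```

The result would be passed to @code{fftw_plan_with_nthreads} before
creating the plan for a transform of that size.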
|  | 
 | ||
|  | @c ------------------------------------------------------------ | ||
|  | @node Thread safety,  , How Many Threads to Use?, Multi-threaded FFTW | ||
|  | @section Thread safety | ||
|  | 
 | ||
|  | @cindex threads | ||
|  | @cindex OpenMP | ||
|  | @cindex thread safety | ||
|  | Users writing multi-threaded programs (including OpenMP) must concern | ||
|  | themselves with the @dfn{thread safety} of the libraries they | ||
|  | use---that is, whether it is safe to call routines in parallel from | ||
|  | multiple threads.  FFTW can be used in such an environment, but some | ||
|  | care must be taken because the planner routines share data | ||
|  | (e.g. wisdom and trigonometric tables) between calls and plans. | ||
|  | 
 | ||
|  | The upshot is that the only thread-safe routine in FFTW is | ||
|  | @code{fftw_execute} (and the new-array variants thereof).  All other routines | ||
|  | (e.g. the planner) should only be called from one thread at a time.  So, | ||
|  | for example, you can wrap a semaphore lock around any calls to the | ||
|  | planner; even more simply, you can just create all of your plans from | ||
|  | one thread.  We do not think this should be an important restriction | ||
|  | (FFTW is designed for the situation where the only performance-sensitive | ||
|  | code is the actual execution of the transform), and the benefits of | ||
|  | shared data between plans are great. | ||
|  | 
 | ||
|  | Note also that, since the plan is not modified by @code{fftw_execute}, | ||
|  | it is safe to execute the @emph{same plan} in parallel by multiple | ||
|  | threads.  However, since a given plan operates by default on a fixed | ||
|  | array, you need to use one of the new-array execute functions | ||
|  | (@pxref{New-array Execute Functions}) so that different threads compute | ||
|  | the transform of different data. | ||
|  | 
 | ||
|  | (Users should note that these comments only apply to programs using | ||
|  | shared-memory threads or OpenMP.  Parallelism using MPI or forked processes | ||
|  | involves a separate address-space and global variables for each process, | ||
|  | and is not susceptible to problems of this sort.) | ||
|  | 
 | ||
|  | The FFTW planner is intended to be called from a single thread.  If you | ||
|  | really must call it from multiple threads, you are expected to grab | ||
|  | whatever lock makes sense for your application, with the understanding | ||
|  | that you may be holding that lock for a long time, which is undesirable. | ||
|  | 
 | ||
|  | Neither strategy works, however, in the following situation.  The | ||
|  | ``application'' is structured as a set of ``plugins'' which are unaware | ||
|  | of each other, and for whatever reason the ``plugins'' cannot coordinate | ||
|  | on grabbing the lock.  (This is not a technical problem, but an | ||
|  | organizational one.  The ``plugins'' are written by independent agents, | ||
|  | and from the perspective of each plugin's author, each plugin is using | ||
|  | FFTW correctly from a single thread.)  To cope with this situation, | ||
|  | starting from FFTW-3.3.5, FFTW supports an API to make the planner | ||
|  | thread-safe: | ||
|  | 
 | ||
|  | @example | ||
|  | void fftw_make_planner_thread_safe(void); | ||
|  | @end example | ||
|  | @findex fftw_make_planner_thread_safe | ||
|  | 
 | ||
|  | This call operates by brute force: It just installs a hook that wraps a | ||
|  | lock (chosen by us) around all planner calls.  So there is no magic and | ||
|  | you get the worst of all worlds.  The planner is still single-threaded, | ||
|  | but you cannot choose which lock to use.  The planner still holds the | ||
|  | lock for a long time, but you cannot impose a timeout on lock | ||
|  | acquisition.  As of FFTW-3.3.5 and FFTW-3.3.6, this call does not work | ||
|  | when using OpenMP as the threading substrate.  (Suggestions on what to do | ||
|  | about this bug are welcome.)  @emph{Do not use | ||
|  | @code{fftw_make_planner_thread_safe} unless there is no other choice,} | ||
|  | such as in the application/plugin situation. |