Transposed distributions (FFTW 3.3.10)

Internally, FFTW’s MPI transform algorithms work by first computing transforms of the data local to each process, then globally transposing the data in some fashion to redistribute it among the processes, transforming the new data local to each process, and transposing back. For example, a two-dimensional n0 by n1 array, distributed across the n0 dimension, is transformed by: (i) transforming the n1 dimension, which is local to each process; (ii) transposing to an n1 by n0 array, distributed across the n1 dimension; (iii) transforming the n0 dimension, which is now local to each process; (iv) transposing back.

However, in many applications it is acceptable to compute a multidimensional DFT whose results are produced in transposed order (e.g., n1 by n0 in two dimensions). This provides a significant performance advantage, because the final transposition step (iv) can then be omitted. FFTW supports this optimization, which you specify by passing the flag FFTW_MPI_TRANSPOSED_OUT to the planner routines. To compute the inverse transform of transposed output, you specify FFTW_MPI_TRANSPOSED_IN to tell the planner that the input is transposed. In this section, we explain how to interpret the output format of such a transform.

Suppose you are transforming multi-dimensional data with (at least two) dimensions n₀ × n₁ × n₂ × … × n_{d-1}. As always, it is distributed along the first dimension n₀. Now, if we compute its DFT with the FFTW_MPI_TRANSPOSED_OUT flag, the resulting output data are stored with the first two dimensions transposed: n₁ × n₀ × n₂ × … × n_{d-1}, distributed along the n₁ dimension. Conversely, if we take the n₁ × n₀ × n₂ × … × n_{d-1} data and transform it with the FFTW_MPI_TRANSPOSED_IN flag, then the format goes back to the original n₀ × n₁ × n₂ × … × n_{d-1} array.
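Only the first two indices change places; the trailing dimensions keep their order. The sketch below shows the index arithmetic, assuming row-major (C-order) storage and collapsing n₂ × … × n_{d-1} into a single count `rest`: element (k₀, k₁, r) sits at offset (k₀·n₁ + k₁)·rest + r in the normal layout and at (k₁·n₀ + k₀)·rest + r in the transposed one. The helper names are hypothetical, `double` stands in for the element type, and global (whole-array) offsets are used for clarity; this is not how FFTW performs the transposition internally.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of (k0, k1, r) in an n0 x n1 x rest row-major array. */
static size_t offset_normal(size_t k0, size_t k1, size_t r,
                            size_t n1, size_t rest)
{
    return (k0 * n1 + k1) * rest + r;
}

/* Offset of the same element in the n1 x n0 x rest (transposed) array. */
static size_t offset_transposed(size_t k0, size_t k1, size_t r,
                                size_t n0, size_t rest)
{
    return (k1 * n0 + k0) * rest + r;
}

/* Copy an n0 x n1 x rest array into n1 x n0 x rest order (out of place).
   Calling it again with n0 and n1 swapped restores the original layout,
   mirroring how FFTW_MPI_TRANSPOSED_IN undoes FFTW_MPI_TRANSPOSED_OUT. */
static void swap_first_two_dims(const double *in, double *out,
                                size_t n0, size_t n1, size_t rest)
{
    assert(in != out);
    for (size_t k0 = 0; k0 < n0; ++k0)
        for (size_t k1 = 0; k1 < n1; ++k1)
            for (size_t r = 0; r < rest; ++r)
                out[offset_transposed(k0, k1, r, n0, rest)] =
                    in[offset_normal(k0, k1, r, n1, rest)];
}
```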

There are two ways to find the portion of the transposed array that resides on the current process. First, you can simply call the appropriate ‘local_size’ function, passing n₁ × n₀ × n₂ × … × n_{d-1} (the transposed dimensions). This would mean calling the ‘local_size’ function twice, once for the transposed and once for the non-transposed dimensions. Alternatively, you can call one of the ‘local_size_transposed’ functions, each of which returns both the non-transposed and the transposed data distribution from a single call. For example, for a 3d transform with transposed output (or input), you might call:

ptrdiff_t fftw_mpi_local_size_3d_transposed(
                ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                ptrdiff_t *local_n1, ptrdiff_t *local_1_start);

Here, local_n0 and local_0_start give the size and starting index of the n0 dimension for the non-transposed data, as in the previous sections. For transposed data (e.g. the output for FFTW_MPI_TRANSPOSED_OUT), local_n1 and local_1_start give the size and starting index of the n1 dimension, which is the first dimension of the transposed data (n1 by n0 by n2).
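As a mental model of these four outputs, the sketch below computes both slabs for a given rank and maps a flat offset in the transposed local block back to the global frequency (k₀, k₁, k₂). It assumes a simple block distribution with ⌈n/P⌉ rows per process (trailing processes get fewer or none); that assumption is for illustration only, and real code should always use the values returned by fftw_mpi_local_size_3d_transposed. The model also ignores that function's return value, the number of elements to allocate, which may exceed both local_n0·n1·n2 and local_n1·n0·n2. All names here (`block_slab`, `model_local_size_3d_transposed`, `transposed_out_freq`, `struct freq3`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block distribution: n rows over nprocs ranks, ceil(n/nprocs)
   rows each, trailing ranks get fewer or none. */
static void block_slab(ptrdiff_t n, int nprocs, int rank,
                       ptrdiff_t *local_n, ptrdiff_t *local_start)
{
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    ptrdiff_t block = (n + nprocs - 1) / nprocs;
    ptrdiff_t start = block * rank;
    if (start >= n) {                 /* this rank owns no rows */
        *local_n = 0;
        *local_start = 0;
    } else {
        *local_n = (n - start < block) ? n - start : block;
        *local_start = start;
    }
}

/* Model of the two slabs: rows of n0 before the transform (non-transposed)
   and rows of n1 after it (FFTW_MPI_TRANSPOSED_OUT). n2 is never split. */
static void model_local_size_3d_transposed(ptrdiff_t n0, ptrdiff_t n1,
        int nprocs, int rank,
        ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
        ptrdiff_t *local_n1, ptrdiff_t *local_1_start)
{
    block_slab(n0, nprocs, rank, local_n0, local_0_start);
    block_slab(n1, nprocs, rank, local_n1, local_1_start);
}

struct freq3 { ptrdiff_t k0, k1, k2; };

/* Global frequency of the element at flat offset `offset` in this rank's
   row-major local_n1 x n0 x n2 block of transposed output. */
static struct freq3 transposed_out_freq(ptrdiff_t offset, ptrdiff_t n0,
        ptrdiff_t n2, ptrdiff_t local_n1, ptrdiff_t local_1_start)
{
    assert(offset >= 0 && offset < local_n1 * n0 * n2);
    struct freq3 f;
    f.k2 = offset % n2;                          /* last index unchanged */
    f.k0 = (offset / n2) % n0;                   /* second index is k0 */
    f.k1 = local_1_start + offset / (n2 * n0);   /* local row of the n1 slab */
    return f;
}
```

For example, with n0 = 4, n2 = 5, local_n1 = 2 and local_1_start = 3, offset 33 corresponds to local index (1, 2, 3) and hence to frequency (k₀, k₁, k₂) = (2, 4, 3).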

(Note that FFTW_MPI_TRANSPOSED_IN is completely equivalent to performing FFTW_MPI_TRANSPOSED_OUT and passing the first two dimensions to the planner in reverse order, or vice versa. If you pass both the FFTW_MPI_TRANSPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flags, it is equivalent to swapping the first two dimensions passed to the planner and passing neither flag.)
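These equivalences can be checked with a small model that tracks only the global shapes of the first two dimensions of the input and output arrays. The sketch is hypothetical: `MODEL_TRANSPOSED_IN` and `MODEL_TRANSPOSED_OUT` are stand-in bits rather than FFTW's flag values, and `model_plan` merely encodes the layout rules stated in this section; it does not compute any DFT.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits for this model only; FFTW defines its own values. */
enum { MODEL_TRANSPOSED_IN = 1u << 0, MODEL_TRANSPOSED_OUT = 1u << 1 };

/* First two dimensions of an array (rows x cols), distributed over rows. */
struct shape2 { long rows, cols; };
struct layouts { struct shape2 in, out; };

/* Input and output shapes for planner dimensions n0 x n1 and the flags. */
static struct layouts model_plan(long n0, long n1, unsigned flags)
{
    assert(n0 > 0 && n1 > 0);
    struct shape2 normal = { n0, n1 }, swapped = { n1, n0 };
    struct layouts l;
    l.in  = (flags & MODEL_TRANSPOSED_IN)  ? swapped : normal;
    l.out = (flags & MODEL_TRANSPOSED_OUT) ? swapped : normal;
    return l;
}

static bool same_layouts(struct layouts a, struct layouts b)
{
    return a.in.rows == b.in.rows && a.in.cols == b.in.cols &&
           a.out.rows == b.out.rows && a.out.cols == b.out.cols;
}
```

For instance, model_plan(8, 5, MODEL_TRANSPOSED_IN) and model_plan(5, 8, MODEL_TRANSPOSED_OUT) both describe a 5 × 8 input and an 8 × 5 output.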
