/*! \page volk_guide Instructions for using Volk in GNU Radio

\section volk_intro Introduction

Volk is the Vector-Optimized Library of Kernels. It is a library that
contains kernels of hand-written SIMD code for different mathematical
operations. Because SIMD architectures differ greatly and compilers do
not yet vectorize code reliably or efficiently, Volk approaches the
problem differently. For each
architecture or platform that a developer wishes to vectorize for, a
new proto-kernel is added to Volk. At runtime, Volk will select the
correct proto-kernel. In this way, the users of Volk call a kernel for
performing the operation that is platform/architecture agnostic. This
allows us to write portable SIMD code.

Volk kernels are always defined with a 'generic' proto-kernel, which
is written in plain C. The generic proto-kernel makes the kernel
portable to any platform. Kernels are then extended by adding
proto-kernels for each new platform where an optimized version is
wanted.

A good example of a Volk kernel with multiple proto-kernels is
volk_32f_s32f_multiply_32f_a. This kernel multiplies a vector of
floating point numbers by a scalar (each item in the vector is
multiplied by the same value). It has proto-kernels defined for
'generic', 'sse', 'avx', and 'orc':

    void volk_32f_s32f_multiply_32f_a_generic
    void volk_32f_s32f_multiply_32f_a_sse
    void volk_32f_s32f_multiply_32f_a_avx
    void volk_32f_s32f_multiply_32f_a_orc

These proto-kernels mean that on platforms with AVX support, Volk can
select the AVX or the SSE proto-kernel, depending on which is faster.
On other platforms, the ORC SIMD compiler might provide a solution. If
all else fails, Volk can fall back on the generic proto-kernel, which
will always work.
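
The selection idea can be sketched in a few lines of C++. This is only an illustration of choosing between proto-kernels through a function pointer, not Volk's actual dispatcher: the names multiply_fn, multiply_generic, multiply_unrolled4 and select_multiply are hypothetical, and the has_wide_simd flag stands in for whatever CPU detection a real library performs.

```cpp
#include <cstddef>

// Signature shared by every proto-kernel of this hypothetical kernel.
using multiply_fn = void (*)(float* out, const float* in,
                             float scalar, std::size_t num_points);

// Portable fallback, analogous to a 'generic' proto-kernel.
void multiply_generic(float* out, const float* in,
                      float scalar, std::size_t num_points)
{
    for (std::size_t i = 0; i < num_points; ++i)
        out[i] = in[i] * scalar;
}

// Stand-in for an architecture-specific proto-kernel. It handles four
// items per iteration in plain C++ (no real SIMD intrinsics), then
// finishes the remaining items one at a time.
void multiply_unrolled4(float* out, const float* in,
                        float scalar, std::size_t num_points)
{
    std::size_t i = 0;
    for (; i + 4 <= num_points; i += 4) {
        out[i] = in[i] * scalar;
        out[i + 1] = in[i + 1] * scalar;
        out[i + 2] = in[i + 2] * scalar;
        out[i + 3] = in[i + 3] * scalar;
    }
    for (; i < num_points; ++i)
        out[i] = in[i] * scalar;
}

// Pick a proto-kernel once; callers then use the returned pointer
// without caring which implementation is behind it.
// has_wide_simd is an assumed capability flag, not a Volk API.
multiply_fn select_multiply(bool has_wide_simd)
{
    return has_wide_simd ? multiply_unrolled4 : multiply_generic;
}
```

A caller would write something like select_multiply(flag)(out, in, 2.0f, n) and get the same result from either implementation, which is the property that lets a kernel name stay platform agnostic.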

A note on ORC: ORC is a SIMD compiler library that uses a generic
assembly-like language for SIMD commands. Based on the SIMD
architecture available on a system, it tries to compile a good
solution. Tests show that ORC proto-kernels are generally faster than
the generic versions but often not as fast as the hand-tuned
proto-kernels for a specific SIMD architecture. This is to be
expected, and ORC provides a useful intermediate step toward better
performance until a hand-tuned proto-kernel can be written for a
given platform.

See <a
href="http://gnuradio.org/redmine/projects/gnuradio/wiki/Volk">Volk on
gnuradio.org</a> for details on the Volk naming scheme.

\section volk_alignment Setting and Using Memory Alignment Information

For Volk to perform at its best, we want to use memory-aligned SIMD
calls, which means we need a way to know and control the alignment of
the buffers passed to a gr_block's work function. We set the alignment
requirement for SIMD aligned memory calls with:

  const int alignment_multiple =
    volk_get_alignment() / output_item_size;

The Volk function 'volk_get_alignment' returns the alignment, in
bytes, required by the machine architecture. We express the alignment
requirement as a number of output items, so we divide the number of
alignment bytes by the number of bytes in an output item
(sizeof(float), sizeof(gr_complex), etc.). This value is then set per
block with the 'set_alignment' function.
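
As a worked example of this division, the sketch below computes the multiple for a few item sizes. The helper name alignment_multiple is hypothetical, its alignment_bytes parameter stands in for the value returned by volk_get_alignment(), and the clamp to 1 is an added assumption that covers items larger than the alignment, for which the integer division would otherwise yield 0.

```cpp
#include <algorithm>
#include <cstddef>

// Number of output items that span one alignment boundary.
// alignment_bytes stands in for the result of volk_get_alignment();
// item_size is the size of one output item in bytes (assumed nonzero).
int alignment_multiple(std::size_t alignment_bytes, std::size_t item_size)
{
    const int multiple = static_cast<int>(alignment_bytes / item_size);
    // Assumption: never report fewer than one item per multiple.
    return std::max(1, multiple);
}
```

With a 16-byte alignment, 4-byte floats give a multiple of 4 and 8-byte complex floats give a multiple of 2; with a 32-byte alignment, floats give a multiple of 8.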

Because the scheduler tries to optimize throughput, the number of
items available per call to work changes and depends on the
availability of the read and write buffers. This means that the
scheduler sometimes cannot produce a buffer that is properly memory
aligned. This is an inevitable consequence of the scheduler
system. Instead of requiring alignment, the scheduler maintains it
whenever it can, and when a buffer becomes unaligned, the scheduler
works to realign it. While a block's buffers are unaligned, the
scheduler sets a flag to indicate this so that the block can decide
what to do. The next section discusses the use of the
aligned/unaligned information in a gr_block's work function.

\section volk_work Using Alignment Properties in Work()

The buffers passed to work/general_work in a gr_block are not
guaranteed to be aligned, but the scheduler keeps them aligned
whenever possible. When they are not aligned, the 'is_unaligned()'
flag is set, so a block can tell whether its buffers are aligned and
choose the appropriate code path. This looks like:

    int
    gr_some_block::work (int noutput_items,
                         gr_vector_const_void_star &input_items,
                         gr_vector_void_star &output_items)
    {
      const float *in = (const float *) input_items[0];
      float *out = (float *) output_items[0];

      if(is_unaligned()) {
        // Buffers are unaligned. Either handle the items manually or
        // call an unaligned Volk function.
        volk_32f_something_32f_u(out, in, noutput_items);
      }
      else {
        // Buffers are aligned; can call the aligned Volk function.
        volk_32f_something_32f_a(out, in, noutput_items);
      }

      return noutput_items;
    }
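
To make concrete what "aligned" means for a buffer address, here is a minimal sketch of an alignment test on a raw pointer. The helper is_aligned_to is hypothetical and is not how the scheduler computes the 'is_unaligned()' flag; it only shows the underlying arithmetic, with alignment_bytes again standing in for the value returned by volk_get_alignment().

```cpp
#include <cstddef>
#include <cstdint>

// True if ptr sits on an address that is a multiple of alignment_bytes.
// alignment_bytes is assumed to be nonzero.
bool is_aligned_to(const void* ptr, std::size_t alignment_bytes)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment_bytes == 0;
}
```

For a float buffer that starts on a 32-byte boundary, the address of item 0 passes the test, the address of item 1 (4 bytes further on) fails it, and the address of item 8 (32 bytes further on) passes again. This is why the alignment requirement can be expressed as a multiple of output items.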

\section volk_tuning Tuning Volk Performance

VOLK comes with a profiler that builds a config file recording the
best SIMD architecture for your processor. Run volk_profile, which is
installed into $PREFIX/bin. This program tests all known VOLK kernels
for each architecture supported by the processor. When finished, it
writes the best architecture for each VOLK function to
$HOME/.volk/volk_config. This file is read when a function is used to
determine the best version of the function to execute.

\subsection volk_hand_tuning Hand-Tuning Performance

If you know a particular architecture works best for your processor,
you can specify the particular architecture to use in the VOLK
preferences file: $HOME/.volk/volk_config

The file looks like:

    FUNCTION_NAME ARCHITECTURE

Here "FUNCTION_NAME" is the function whose default you want to
override and "ARCHITECTURE" is the VOLK SIMD architecture to use
(generic, sse, sse2, sse3, avx, etc.). For example, the following
config file tells VOLK to use SSE3 for the aligned and unaligned
versions of a function that multiplies two complex streams together.

    volk_32fc_x2_multiply_32fc_a sse3
    volk_32fc_x2_multiply_32fc_u sse3
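
The two-field line format is simple enough to read with a few lines of C++. The sketch below is not VOLK's own parser; the volk_pref struct and parse_pref_line function are hypothetical names used only to show how a line splits into a function name and an architecture, and rejecting lines with extra fields is an added assumption.

```cpp
#include <sstream>
#include <string>

// One entry of the preferences file (hypothetical type).
struct volk_pref {
    std::string function;     // e.g. volk_32fc_x2_multiply_32fc_a
    std::string architecture; // e.g. sse3
};

// Split a line into exactly two whitespace-separated fields.
// Returns false if a field is missing or if extra fields follow.
bool parse_pref_line(const std::string& line, volk_pref& pref)
{
    std::istringstream fields(line);
    std::string extra;
    if (!(fields >> pref.function >> pref.architecture))
        return false;
    return !(fields >> extra);
}
```

Parsing the line "volk_32fc_x2_multiply_32fc_a sse3" this way gives the function name volk_32fc_x2_multiply_32fc_a and the architecture sse3.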

\b Tip: when benchmarking GNU Radio blocks, it can be useful to have a
volk_config file that sets all architectures to 'generic' in order to
compare the vectorized and non-vectorized implementations.
