GSoC 2014: Using Hardware Based Co-processors in GNU Radio

Student: Alfredo Muniz
Mentor: Philip Balister

Paper
Abstract: GNU Radio, as a digital signal processing program, must execute numerous mathematical operations quickly and repeatedly. Faster signal processing enables projects with tight timing constraints, such as channel sounding. Past co-processor projects involving GPUs, FPGAs, and DSPs do not scale well to new devices. This summer we would like to implement a purely open-source, three-step approach that would improve the state of GNU Radio co-processing for years to come.

High Level Details

This section summarizes the project and previous work. We will be using the XTCIEVMK2H:
- This is a special board by TI currently only available to academics
- The TCI6630K2L also contains the coprocessors and should work the same way

Overview of Deliverables

The steps needed to accomplish the project are summarized below (not necessarily in order):
- Modify gnuradio-runtime to allow blocks to use the write_pointer without a need for a kernel copy
- Install GNU Radio on the Keystone2 and run a test on a co-processor
- Test the implementation of gnuradio-runtime on a co-processor

Previous Work

Previous work was done at VT with the keystone2 platform and GNU Radio. It focuses on making keystone2 development entirely open source, and it was helpful for setting up the board for netboot and getting the U-Boot parameters correct.

Keystone2 Details

The XTCIEVMK2H Keystone2 has lots of documentation available. Additional information on the coprocessor drivers and API becomes available after installing CCSv5 and MCSDK v3.0.4, under $(TI_PDK_INSTALL_DIR)\packages\ti\drv\$(COPROC)\docs\

Setting Up the Development Environment

Development on the keystone2 is done by booting from a TFTP server and running the root file system (rootfs) from a network file system (NFS). We have to configure both our host machine and the board to allow netboot and NFS.

The most similar setup I have seen on the wiki is on this page

Gathering the Files

Philip built the images for GNU Radio and for UHD with OpenEmbedded (OE). They are available on Dropbox.

We need 5 files in our boot folder, which will be served by our TFTP server and for these tutorials is /tftpboot:
- Boot Monitor = mon.bin
- Boot Loader = u-boot.bin
- ??? = u-boot.img
- Kernel Image = uImage-k2hk-evm.bin
- Device Tree = uImage-k2hk-evm.dtb

We need gnuradio-dev-image-k2hk-evm.tar.gz to go in our rootfs folder, which will be exported over NFS. I created /var/www/share as my NFS export and extracted the package there.

Setting up a TFTP Server

For Ubuntu 12.04 Precise there are lots of ways to get TFTP, but the only one I found to work was ATFTP. This setup requires that our host computer be connected through Ethernet to the same network as the board.

Below is a copy of my configuration file located at /etc/default/atftpd


USE_INETD=false
OPTIONS="--tftpd-timeout 300 --retry-timeout 5 --port=69 --bind-address=10.16.32.74 --maxthread 100 --verbose=7 /tftpboot" 

Notice that bind-address is the address of the host we are using, which can be found by running ifconfig and looking at the inet addr field.

Once the configuration is set up, we need to create a folder to hold the files we wish to share with the board. In this case, I chose /tftpboot. We need to set the permissions on /tftpboot to allow copying and access from the board.

We can then start the TFTP server using the command

sudo service atftpd restart

We can then create a file called test in /tftpboot and connect to our TFTP server from our host machine:

tftp 10.16.32.74
get test

The file test should appear in the folder we called tftp from.

Setting up the Network File System

The NFS can be set up using nfs-kernel-server. We have to install that package and add this line to the /etc/exports file:

/var/www/share *(rw,no_subtree_check,no_root_squash,sync)

Then we can run these commands to ensure it works:

sudo exportfs -a
sudo exportfs -v

I chose to put my rootfs in /var/www/share and ran these commands to give the board access to the files:

sudo chown -R nobody:nogroup /var/www/share
sudo chmod 777 -R /var/www/share

Lastly, we can restart the NFS server:

sudo service nfs-kernel-server restart

Now we can log in to the keystone2 by connecting the USB cables, Ethernet cable, and power cable.

Configuring the Keystone Boot Environment

Now that we have a TFTP server, the board can fetch the files it needs to boot. There are a couple of configurations we need to change. Again, 10.16.32.74 is the IP of my host machine. The IP of the keystone should be assigned automatically unless there is a problem with the network or firewall, in which case we want to configure it by following these steps

Once we connect to the serial port of our board, we want to interrupt the boot countdown by hitting any key. We can then set the U-Boot environment:

env default -f -a 
setenv serverip 10.16.32.74
setenv tftp_root /tftpboot
setenv bootargs console=ttyS0,115200n8 rootwait=1 earlyprintk root=/dev/nfs nfsroot=10.16.32.74:/var/www/share,v3,tcp rw ip=dhcp
setenv _1 tftpboot 0c5f0000 mon.bin
setenv _2 mon_install 0x0c5f0000
setenv _3 tftpboot 87000000 uImage-k2hk-evm.dtb
setenv _4 tftpboot 88000000 uImage-k2hk-evm.bin
setenv _5 bootm 88000000 - 87000000
setenv _ run _1 _2 _3 _4 _5
saveenv

From now on, we can boot from our TFTP server and from our NFS by running

run _

That should be enough to get us up and running. We should be able to log in as root and run the test for GNU Radio:

gnuradio-config-info -v

Co-Processors

There are six different co-processors on the board that we have access to
Little Documentation: More Documentation:

For now I think the FFTC is a good choice, as it comes with examples, is well documented, and should be straightforward to test in GNU Radio. If the FFTC works, we can test others such as the TCP3d, which can be very beneficial to GNU Radio.

FFTC - Fast Fourier Transform Coprocessor

I'll describe the steps of using the co-processor briefly and will update with further details once I get it working. We first need to go through an initialization sequence that is described on page 26 of the FFTC SDS. Then we need to send a TX request using the fftc_txgetrequestbuffer() function, which outputs a pointer to where our input data should be stored. Once we fill that buffer with our data from GNU Radio, we can go through the RX stage by calling fftc_rxgetresult(), which returns a pointer to the raw result and a pointer to the length of the buffer. The interface should work well with our get_user_pages function described below.

We decided to figure out how the system works on the register level so that we can use the co-processors properly.

TeraNet - System Interconnect

The interconnect on the tci6638k2k is called the TeraNet or Eagle's Nest. It is divided into Data Space to show the transfer of data and Configuration Space for access to peripheral configuration registers. The Data Space and Configuration Space are connected through bridge_12, bridge_13, bridge_14 and the Tnet_msmc_sys. The TeraNet is shown as TeraNet_3_x for data and TeraNet_3P_x for configuration.

- Data Interconnect
- Configuration Interconnect

To get data across the board, we need to learn about the different systems and see how their registers are configured. Then we can try to make sense of the TI software.

MSMC - Multicore Shared Memory Controller

MSMC is pronounced "mizmick".

AXI - ARM CorePac

We need to figure out how to get data from the ARM to the MSMC.

System Register Map

All the registers available.

Programming the Keystone

Developing on the keystone2 can be divided into two major sections: ARM (TI calls it Linux) and DSP (TI calls it DSP/BIOS). For this project we want to develop on the ARM side, since that is where GNU Radio will be running. Unfortunately, most of the documentation and examples focus on developing applications for the DSP side, with the ARM side turned off using TI's debugger. However, TI's software is multilayered and the DSP compiler is somewhat similar to the ARM compiler, so porting code from the DSP to the ARM shouldn't be too difficult. This section describes the software in more detail and explains a method of writing code for the ARM side using the libraries provided by TI.

Useful TI pages for communication between the ARM and DSP and CoProcessors
- System Management
- Transports
- Msgcom or MessageQ
- ARM+DSP Shared Memory
- CMEM and MessageQ

Multicore Software Development Kit (MCSDK)

First we need to download the MCSDK, which contains the libraries, drivers, documentation, examples, tests, and toolchains needed for using the device. It is possible to work without the MCSDK by programming the registers manually with the Linaro toolchain, but that gets complex quickly because the systems depend on each other and need to be set up in a certain order. We then install the appropriate MCSDK for our device (mine is the TCI6638K2K) on our host machine. This is the only software we need to install in order to use TI's libraries and documentation. The MCSDK comes with many different parts/folders; the notable ones for this project are the PDK and MCSDK_LINUX.

MCSDK_LINUX

The first thing we need to do is cross-compile the linux-devkit. This can be done in MCSDK_LINUX/linux-devkit by running the script:
arago-2013.12-cortexa15-linux-gnueabi-mcsdk-sdk-i686.sh

This produces the supporting binaries we need for the peripherals. We can simply copy (cp -n) the contents of sysroots/cortexa15hf---- into the rootfs of the keystone2, which for me is /var/www/share/.

Programmers Development Kit (PDK)

We need to set up the PDK.

The PDK comes with the Low Level Drivers (LLDs) needed to run the many peripherals. It will be our working directory, as it contains the code we need to compile. If we go into the PA folder, PDK/packages/ti/drv/pa/, we can see the following folders:
- src - Files to generate the libraries
- test - Files to generate tests
- example - Files to generate examples

We will first generate the libraries for the ARM. We do this by modifying the makefile at PDK/packages/makefile.

Once we point to the makefile_armv7 in our peripheral directory, we can edit that makefile to build the libs. The makefile for the libs is in PDK/packages/ti/drv/PERIPHERAL/build/armv7/

We need to modify that makefile to include the appropriate include directories and to build the appropriate files in the src folder.

Once that is done, we can run the following in the PDK/packages folder:
source armv7setupenv.sh
make lib

If all is successful, we should see no errors and find our new libraries in the ../bin folder.

Creating a Test

To create the test, we first need to create our makefile in the drv/tests/k2k/armv7/linux/build folder. There we specify our test files and include directories so that the test builds properly.

Resource Manager

The resource manager (RM) can be found in the PDK/packages/ti/drv/rm directory. It allows communication between cores: ARM<==>DSP and DSP<==>DSP. Because all of the coprocessor examples are written for the DSP, this lets us run those programs on the DSP while still communicating with the ARM.

ARM+DSP Communication

There are a number of ways to get data between the ARM and the DSP: Msgcom, MessageQ, mpm_mailbox, and CMEM. The current plan is to use CMEM to allocate contiguous memory for holding the log-likelihood ratios and the resulting hard decisions. CMEM can translate the virtual addresses of the buffers into physical pointers that the DSP can use, since the DSP doesn't have a memory management unit (MMU). We can pass the physical pointers to the DSP through the MessageQ system. Msgcom is being deprecated and mpm_mailbox isn't as well documented. We'll then run a test to verify that this works and then move on to getting data out of GNU Radio.

We can look at the files in filetestdemo, specifically options 2 and 3 (cmem_cached_test and cmem_uncached_test), for examples of how to use the CMEM API. When allocating a buffer, we need to remember that addresses are 36 bits instead of 32.
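To illustrate why the 36-bit width matters, here is a minimal sketch of carrying a physical address as two 32-bit words, as one might when a message field or register is only 32 bits wide. The function names and the example address are hypothetical and do not come from the CMEM API.

```python
PHYS_ADDR_BITS = 36  # from the text: addresses are 36 bits, not 32


def split_phys_addr(addr):
    """Split a 36-bit physical address into (high, low) 32-bit words."""
    if not 0 <= addr < (1 << PHYS_ADDR_BITS):
        raise ValueError("address does not fit in 36 bits")
    return addr >> 32, addr & 0xFFFFFFFF


def join_phys_addr(high, low):
    """Rebuild the full physical address from its two words."""
    return (high << 32) | low


if __name__ == "__main__":
    addr = 0x820000000  # hypothetical address above the 4 GiB boundary
    high, low = split_phys_addr(addr)
    print(hex(high), hex(low))
    # Keeping only the low word silently loses the top 4 bits.
    print(hex(addr & 0xFFFFFFFF))
```

Storing such an address in a 32-bit variable would drop the top four bits, so the DSP would be handed the wrong location.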

Configuring Interrupts

The MessageQ exchange only needs to happen once, because afterwards the DSP knows where the memory is. However, the ARM doesn't know when the result is ready from the Turbo Decoder. Sending another MessageQ message would slow things down (20us), so interrupts are the way to go. The TCP3d generates an interrupt to the DSP that tells it when the data is ready. Perhaps we can use this to interrupt the ARM so that we can get the result faster!

- Configuring Interrupts
- IPC HW Interrupts
- Chip Interrupt Controller

It may be possible to use the request_irq function in the Linux kernel, since we can see the interrupt in /proc/interrupts, but that requires writing a kernel module and spending time in kernel space. It is much easier to use a busy-wait algorithm for a quick test in the meantime. We noted that the ARM doesn't recognize when the DSP writes to the address, possibly because the ARM is caching the memory. We need to figure out how to keep the ARM from caching the memory region we choose.
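The busy-wait idea can be sketched as a polling loop with a timeout. This is only an illustration: a bytearray stands in for the shared buffer and a timer thread stands in for the DSP, and the name busy_wait_for_flag is hypothetical. On the real board the flag would have to live in uncached memory for the ARM to see the DSP's write.

```python
import threading
import time


def busy_wait_for_flag(buf, offset, expected, timeout_s):
    """Poll buf[offset] until it equals expected or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if buf[offset] == expected:
            return True
    return False


if __name__ == "__main__":
    shared = bytearray(4)  # stand-in for a shared result buffer

    def fake_dsp():
        # Stand-in for the DSP signalling that the result is ready.
        shared[0] = 1

    writer = threading.Timer(0.01, fake_dsp)
    writer.start()
    print(busy_wait_for_flag(shared, 0, 1, timeout_s=2.0))
    writer.join()
```

The timeout keeps a missed write from hanging the test forever, which is useful while the caching question is still open.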

GNU Radio Buffers with Zero Copy

Details on this can be found on the runtime page

Direct IO

The goal here is to avoid copying data from user space to kernel space. That copy is typically done to separate the user from the device, which makes programming easier and improves security. Direct IO allows us to take large amounts of data and operate on them from user space. The disadvantage is the time it takes to set up the direct IO, which involves faulting in and setting up the pages. The advantage is that we can tell the accelerators where the data from GNU Radio is without the need for an extra copy (time) into a kernel buffer. There are two methods that we will explore: contiguous buffers and scatter-gather lists. The contiguous buffer method is easy on the keystone2 since there is a module that supports it. The scatter-gather list method is a little more complex and will be explored further down the line once I figure out more about the Linux kernel DMA API.
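The difference between the two methods can be sketched by merging a list of physical page frame numbers into runs of adjacent pages. A contiguous buffer collapses into a single run, while a scattered buffer needs one scatter-gather entry per run. The function name and the frame numbers are hypothetical; this is not the kernel DMA API.

```python
def coalesce_frames(pfns):
    """Merge page frame numbers into (start_pfn, page_count) runs.

    Frames are taken in buffer order; only frames that are adjacent
    in both buffer order and physical memory are merged.
    """
    runs = []
    for pfn in pfns:
        if runs and runs[-1][0] + runs[-1][1] == pfn:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            runs.append((pfn, 1))
    return runs


if __name__ == "__main__":
    # Contiguous buffer: a single entry is enough.
    print(coalesce_frames([5, 6, 7]))
    # Scattered buffer: one entry per physically contiguous run.
    print(coalesce_frames([10, 11, 12, 40, 41, 7]))
```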

Deliverables

- Create a block in GNU Radio that prints the page numbers of a GNU Radio buffer using get_user_pages
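As a user-space starting point for this deliverable, the sketch below computes the virtual page numbers that a buffer spans. It is not get_user_pages itself, which runs in the kernel and works with the underlying physical pages. The name virtual_page_numbers and the anonymous mmap used as a stand-in for a GNU Radio buffer are assumptions for illustration.

```python
import ctypes
import mmap


def virtual_page_numbers(buf, page_size=mmap.PAGESIZE):
    """Return the virtual page numbers spanned by a writable buffer."""
    length = len(buf)
    if length == 0:
        return []
    # from_buffer exposes the buffer's user-space virtual address.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    first = addr // page_size
    last = (addr + length - 1) // page_size
    return list(range(first, last + 1))


if __name__ == "__main__":
    # Anonymous mapping used as a stand-in for a GNU Radio buffer.
    buf = mmap.mmap(-1, 3 * mmap.PAGESIZE)
    print(virtual_page_numbers(buf))
    buf.close()
```

A kernel module would receive the same start address and length and pin the corresponding pages.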

Work in Progress

- An example of how to make loadable kernel modules and an example of get_user_pages are now available on GitHub with instructions. The next step before proceeding further is to pass the write buffer to the block constructor, which is an animal of its own.

Resources

- Performing Direct IO
- Lockless Page Cache
- Better IO Scalability
- Zero Copy User Space Access
- User Mode Perspective
- Example with Video Streams

gnuradio-runtime

In order to make efficient use of co-processors in GNU Radio, we want to be able to perform direct IO, as explained above, with the GNU Radio write buffers. This requires us to modify a couple of files in gnuradio-runtime so that we can pass the write buffer pointer to the blocks' constructors. The goal is to then create co-processor blocks that perform get_user_pages on the write buffers so that we can perform direct IO to and from the user-space pages without an extra copy to the kernel. This technique should be portable to the majority of co-processors.

In-Place Buffers

We call buffers that are to be modified outside of the circbuf factory in-place buffers. Essentially, they have history=1, relative_rate=1, a single input, and a single output.
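The four conditions above can be written as a simple predicate. This is a sketch only: the BlockSignature class and can_use_in_place_buffer function are hypothetical names, not part of gnuradio-runtime.

```python
from dataclasses import dataclass


@dataclass
class BlockSignature:
    """Hypothetical summary of the block properties named in the text."""
    history: int
    relative_rate: float
    num_inputs: int
    num_outputs: int


def can_use_in_place_buffer(sig):
    """True when the block meets all four in-place conditions."""
    return (
        sig.history == 1
        and sig.relative_rate == 1.0
        and sig.num_inputs == 1
        and sig.num_outputs == 1
    )


if __name__ == "__main__":
    print(can_use_in_place_buffer(BlockSignature(1, 1.0, 1, 1)))
    # A block that looks back over earlier samples does not qualify.
    print(can_use_in_place_buffer(BlockSignature(32, 1.0, 1, 1)))
```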

See work on dissecting GNU Radio Runtime