At the GPU Technology Conference (GTC) 2015, there was a group of presentations, tutorials and posters centered around embedded computing using the Tegra K1 and the Jetson TK1 Development Kit. The Tegra K1 is a System on a Chip (SoC) with a built-in GPU (192 CUDA cores) in one package. Since there were at least 20 talks at the conference on the subject, here are a couple of links to let you forage through them.
There is some overlap in the two listings, but here are the direct links:
Tegra K1 Talks, Panels and Posters.
Jetson TK1 Development Kit Talks, Panels, and Posters.
I will note here that there were over 500 presentations in total on all types of subjects at the conference. Many of them are archived and available through this link to presentations at the conference.
Here’s a sampling of the talks that I have listened to and enjoyed:
Automotive
Matthias Rudolph, Head of Architecture Driver Assistance Systems, Audi AG, gave a talk titled ZFAS – The Brain of Piloted Driving at Audi. Like most auto manufacturers, Audi has been working in this area for over a decade. If you think this is a “solved” area given all the media attention, listen to this talk. Dr. Rudolph gives a very good explanation of what it takes in terms of sensor architecture and integration to get a complex system like this up and running, especially taking into consideration remnants of legacy systems. ZFAS is implemented on Tegra K1.
Sensors Visualization and Usage
In Mobile 3D Mapping With Tegra K1, Karol Majek from the Institute of Mathematical Machines presents work on 3D mapping algorithms implemented on the Tegra K1. Implementing the processing pipeline in parallel using CUDA allows people to replace traditional laptops with embedded Tegra K1 systems.
Massimiliano Fatica from the Tesla HPC Performance Group at NVIDIA talked about Synthetic Aperture Radar on Jetson TK1. There is a big advantage to being able to perform SAR analysis onboard, and Dr. Fatica shares some of the implementation challenges of processing large image sets on the Jetson.
A hybrid Tegra K1 and FPGA solution is discussed in Creating Dense Mixed GPU and FPGA Systems With Tegra K1 Using OpenCL. In the talk, given by Lance Brown from Colorado Engineering Inc, one observation is that combining capable GPU systems with special purpose FPGAs lets one board replace several discrete boards and improve performance at the same time. It takes a little while for the talk to hit its stride, but the content is interesting.
Tutorial
The tutorial Application Development for Mobile Devices: A Case Study for the Tegra K1 (Presented by ArrayFire) gave insight into how to get better performance on the Tegra K1 using zero-copy capability, along with a case study of JIT (Just In Time) performance of the ArrayFire library. Also, performance results for image processing and computer vision algorithms are discussed.
Speech
Most of the time when people talk about “Deep Learning” they are referring to image processing. However, another area where convolutional networks are being used is speech recognition. In his presentation Memory-Efficient Heterogeneous Speech Recognition Hybrid in the GPU-Equipped Mobile Devices, Alexei V. Ivanov from Verbumware Inc introduces a GPU-based technique called “Phonetic Lattice Generation”. By using the properties of decoding graphs, GPU-based phonetic decoding stages are complemented with CPU-based lexical decoding. On the Tegra K1, phonetic decoding and lexical decoding can be executed twice as fast as a person can speak!
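To put "twice as fast as a person can speak" in numbers, speech systems are often described by a real-time factor: processing time divided by audio duration. The short Python sketch below only illustrates that arithmetic. The function name and the 10 second and 5 second figures are made up for the example and do not come from the talk.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: 10 seconds of speech decoded in 5 seconds.
rtf = real_time_factor(5.0, 10.0)
speedup = 1.0 / rtf
print(f"real-time factor = {rtf:.2f}, speedup over real time = {speedup:.1f}x")
```

A real-time factor of 0.5 corresponds to decoding at twice the speed of speech, and anything below 1.0 keeps up with a live speaker.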
Image Processing
There were several fun presentations on image processing. A really fun one, Achieving Real-Time Performances on Facial Motion Capture and Animation on Mobile GPUs, showed Emiliano Gambaretto from Mixamo making facial expressions which were mimicked by an on-screen 3D character. The talk discusses how the original 15 FPS code was brought to 27 FPS by porting to CUDA and simplifying the least squares problem using random sub-sampling. An alternative deep learning approach to the process, which delivers faster speed but requires a large training set and regularization strategies, was also presented.
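As a rough illustration of the random sub-sampling idea, here is a minimal Python sketch that fits a straight line by least squares, once on all data points and once on a random subset. This is not the Mixamo implementation, which solves a much larger facial-tracking problem on the GPU. The function names, the synthetic data and the subset size are all assumptions made for the example.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def fit_line_subsampled(points, sample_size, rng):
    """Solve the same problem on a random subset to cut the work per solve."""
    subset = rng.sample(points, min(sample_size, len(points)))
    return fit_line(subset)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Synthetic data (hypothetical): y = 2x + 1 plus a little Gaussian noise.
points = [(i / 100, 2.0 * (i / 100) + 1.0 + rng.gauss(0.0, 0.1))
          for i in range(2000)]

full = fit_line(points)
approx = fit_line_subsampled(points, 200, rng)
print("all 2000 points:", full)
print("200-point subset:", approx)
```

The subset solve touches a tenth of the data and still lands close to the full solution when the samples are representative, which is the trade-off between speed and accuracy that sub-sampling makes.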
Yangdong Deng gave a presentation on A Feasibility Study of Ray Tracing on Mobile GPUs, which compared desktop and mobile GPU implementations. Three systems were compared: a desktop with a GTX780, a PowerVR SGX 544-MP3 (Odroid) and a Jetson TK1. The Jetson constructs a 1M-triangle scene in ~100 ms, and performs traversal at a throughput of 70 million rays per second. This is considerably faster than the Odroid, but much slower than the GTX780. However, the Jetson is still 100x faster than a mobile CPU-based ray tracer. An interesting study.
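To get a feel for what 70 million rays per second means, the Python sketch below converts ray throughput into an upper bound on frame rate. The 1280x720 resolution and the one-ray-per-pixel setting are assumptions chosen for illustration and are not figures from the talk. The calculation also ignores scene construction and shading time.

```python
def max_frames_per_second(rays_per_second, width, height, rays_per_pixel=1):
    """Upper bound on frame rate if traversal were the only cost."""
    rays_per_frame = width * height * rays_per_pixel
    return rays_per_second / rays_per_frame


# 70 million rays/s is the Jetson TK1 figure quoted in the talk.
# The resolution and rays per pixel below are hypothetical.
fps_primary_only = max_frames_per_second(70e6, 1280, 720)
fps_four_rays = max_frames_per_second(70e6, 1280, 720, rays_per_pixel=4)
print(f"1 ray per pixel:  about {fps_primary_only:.0f} frames per second")
print(f"4 rays per pixel: about {fps_four_rays:.0f} frames per second")
```

Under these assumptions, primary rays alone fit comfortably within an interactive budget, while secondary rays for shadows or reflections eat into it quickly.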
Conclusion
In most of the performance analyses of the Tegra K1, the main bottleneck appears to be L2 cache speed and, more generally, memory bandwidth. Despite this, and the 2GB memory implementation, the Tegra K1 enables a whole new level of computing power in embedded systems and will serve as a new baseline of performance for years to come.