Back in February, we installed Caffe on the TX1. At the time, the TX1 was running a 32-bit version of L4T 23.1. With the advent of the 64-bit L4T 24.2, this seems like a good time to do a performance comparison of the two. The TX1 can now do an image recognition in about 8 ms! For the install and test, Looky Here:
Background
As you recall, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
The L4T 23.1 Operating System release was a 64-bit kernel supporting a 32-bit user space. For the L4T 24.2 release, both the kernel and the user space are 64-bit.
Caffe Installation
A script is available in the JetsonHacks GitHub repository which installs the dependencies for Caffe, downloads the source files, configures the build system, compiles Caffe, and then runs a suite of tests. Passing the tests indicates that Caffe is installed correctly.
This installation demonstration is for an NVIDIA Jetson TX1 running L4T 24.2, an Ubuntu 16.04 variant. L4T 24.2 was installed using JetPack 2.3, which also installs OpenCV4Tegra, CUDA 8.0, and cuDNN 5.1.
Before starting the installation, you may want to set the CPU and GPU clocks to maximum by running the script:
$ sudo ./jetson_clocks.sh
The script is in the home directory, and is also included in the installCaffeJTX1 repository for convenience.
In order to install Caffe:
$ git clone https://github.com/jetsonhacks/installCaffeJTX1.git
$ cd installCaffeJTX1
$ ./installCaffe.sh
Installation should not require intervention; in the video, installing the dependencies and compiling Caffe took about 10 minutes. Running the unit tests takes about 45 minutes. While not strictly necessary, running the unit tests helps ensure that the installation is correct.
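If you want to re-run the test suite later, the standard Caffe make targets can be invoked directly from the source tree. A sketch, assuming the install script checked the sources out to `~/caffe` (adjust the path to wherever `installCaffe.sh` placed them):

```shell
# Re-run the Caffe unit tests from the source tree.
# Note: ~/caffe is an assumed checkout location.
cd ~/caffe
make -j4 all     # rebuild anything that is out of date
make runtest     # builds and runs the gtest suite (~45 minutes on the TX1)
```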
Test Results
At the end of the video, there are a couple of timed tests which can be compared against the Jetson TK1 and the previous installation:
Jetson TK1 vs. Jetson TX1 Caffe GPU Example Comparison (10 iterations, times in milliseconds)

| Machine | Average FWD | Average BACK | Average FWD-BACK |
|---|---|---|---|
| Jetson TK1 (32-bit OS) | 234 | 243 | 478 |
| Jetson TX1 (32-bit OS) | 179 | 144 | 324 |
| Jetson TX1 with cuDNN support (32-bit OS) | 103 | 117 | 224 |
| Jetson TX1 (64-bit OS) | 110 | 122 | 233 |
| Jetson TX1 with cuDNN support (64-bit OS) | 80 | 119 | 200 |
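These numbers come from Caffe's built-in benchmarking tool. A sketch of the invocation, assuming a `~/caffe` checkout and the stock AlexNet model that ships with the Caffe sources:

```shell
# Benchmark the stock AlexNet model on the GPU for 10 iterations.
# ~/caffe is an assumed checkout location.
cd ~/caffe
./build/tools/caffe time \
    --model=models/bvlc_alexnet/deploy.prototxt \
    --gpu=0 \
    --iterations=10
# The report includes Average Forward pass, Average Backward pass,
# and Average Forward-Backward times in milliseconds.
```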
There is definitely a performance improvement between the 32-bit and 64-bit releases, with a couple of contributing factors. One is the change from a 32-bit to a 64-bit operating system. Another is the improvement of the deep learning libraries, CUDA and cuDNN, between the releases. Considering that the tests run on exactly the same hardware, the performance boost is impressive. Using cuDNN provides a particularly large gain in the forward pass tests.
The tests run 50 iterations of the recognition pipeline, and each iteration analyzes 10 different crops of the input image, so take the 'Average Forward pass' time and divide by 10 to get the time per recognition result. For the 64-bit version, that means an image recognition takes about 8 ms.
NVCaffe
It is worth mentioning that NVCaffe is a special branch of Caffe for the TX1 that includes FP16 support. The tests above use FP32. In many cases FP16 gives results very similar to FP32, but runs faster. For example, with FP16 the Average Forward Pass test finishes in about 60 ms, a result of 6 ms per image recognition!
Conclusion
Deep learning is in its infancy, and as people explore its potential, the Jetson TX1 seems well positioned to take the lessons learned and deploy them in the embedded computing ecosystem. There are several different deep learning platforms in development; the improvement in Caffe on the Jetson Dev Kits over the last couple of years is quite impressive.
Notes
The installation in this video was done directly after flashing L4T 24.2 onto the Jetson TX1 with CUDA 8.0, cuDNN r5.1, and OpenCV4Tegra. Git was then installed:
$ sudo apt-get install git
The latest Caffe commit used in the video is: 80f44100e19fd371ff55beb3ec2ad5919fb6ac43
14 Responses
Thanks for the great manual!
Build worked flawlessly; however, the tests get stuck.
Sometimes after just a few tests, sometimes after a few tens, but they always get stuck. No errors.
The board is brand new, full clean L4T using latest JetPack. jetson_clocks.sh executed.
Any suggestions on how to debug the issue?
What version of L4T are you using?
Also, do you have any idea how much memory is being used? You can try to open up the System Monitor while it is running and make sure that memory pressure isn’t causing the issue.
Another thing to try is to turn off the jetson_clocks.sh script, some people have reported issues there.
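One way to watch memory while the tests run, without opening the System Monitor GUI, is to poll `free` from a second terminal; L4T also ships a `tegrastats` script (in the home directory on this release, path assumed) that reports RAM and GPU utilization:

```shell
# Poll memory usage once per second while the tests run.
watch -n 1 free -m

# Alternatively, the tegrastats script (assumed to be in the home
# directory on L4T 24.x) reports RAM, EMC, and GPU utilization:
sudo ~/tegrastats
```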
ubuntu@tegra-ubuntu:~/caffe$ head -n 1 /etc/nv_tegra_release
# R24 (release), REVISION: 2.1, GCID: 8028265, BOARD: t210ref, EABI: aarch64, DATE: Thu Nov 10 03:51:59 UTC 2016
When it gets stuck, I seem to have plenty of memory:
ubuntu@tegra-ubuntu:~$ free
total used free shared buff/cache available
Mem: 4090604 2197844 521628 42648 1371132 2154284
Swap: 0 0 0
Tried disabling jetson_clocks.sh and running the test without -j4.
Same result.
One more data point: it freezes on different tests, but always on some variant of gradient testing…
[ RUN ] CuDNNConvolutionLayerTest/0.TestGradientGroupCuDNN
or
[ RUN ] ConvolutionLayerTest/2.Test1x1Gradient
or
[ RUN ] InnerProductLayerTest/2.TestGradientTranspose
or
[ RUN ] RNNLayerTest/3.TestGradientNonZeroContBufferSize2
etc.
Any way to see what it gets stuck on exactly?
Unfortunately I don’t see any obvious error, or have any idea how to go about fixing your issue. I haven’t encountered anything along those lines.
You might try the NVCaffe version:
https://github.com/dusty-nv/jetson-inference/blob/master/docs/building-nvcaffe.md
and see if that works.
nvcaffe got quite a few issues:
1) I had to fix include path
2) Had to create links to some of the libraries, so it finds them at linking
3) Some of tests fail
4) One of the tests runs out of memory and aborts “make runtest”…
So I am back to your version and trying to debug it.
According to GDB, the hanging tests are spinning around “usleep” inside
cuMemcpy, probably waiting for the transfer to finish. Forever.
Are there any non-caffe, Cuda tests that I can run on TX1?
Maybe it's just that this specific board/CPU is faulty and I should RMA it..
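On the question of non-Caffe CUDA tests: the CUDA samples installed by JetPack can serve as a basic sanity check of the GPU and of host/device memory transfers. A sketch, assuming CUDA 8.0 in its default install location:

```shell
# Build and run two stock CUDA samples as a GPU sanity check.
# /usr/local/cuda-8.0/samples is the default JetPack install location.
cp -r /usr/local/cuda-8.0/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/deviceQuery
make
./deviceQuery       # reports device properties; should end with Result = PASS

cd ../bandwidthTest
make
./bandwidthTest     # exercises host<->device copies, the path the tests hang in
```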
Hi Max,
I don’t have enough experience with your issue to be of much help. It’s worth asking the same question on the Jetson TX1 forum: https://devtalk.nvidia.com/default/board/164/jetson-tx1/
The NVCaffe implementation should work out of the box; it's probably worth filing issues on GitHub describing the problems you encountered.
Running sudo ./jetson_clocks.sh gives: Can't access Fan!
Which version of L4T are you using?
Hello, I have met the same problem. Have you solved it?
The latest version, R24.3.
This article and the code are for L4T 24.2. You should try the jetson_clocks.sh file located in the home directory that a JetPack install provides.
OK, I will try it later. Thx!
Has anyone tried to install Caffe on the TX1 with the latest JetPack 3.1 and L4T 28.2?
I keep getting errors about a missing cblas.h, and I am unable to install libopenblas-dev via either apt-get or aptitude, for whatever reason.
I am not very keen on doing another full reset of the TX1, probably back to some older JetPack version.
Any advice is welcome.