Note: This article has been updated to use Caffe with cuDNN. cuDNN is an NVIDIA-provided, GPU-accelerated library for deep neural networks which can more than double performance. For 64-bit L4T please visit: 64-bit Caffe. For 32-bit L4T please visit: 32-bit Caffe.
In an earlier article on running the Caffe Deep Learning Framework on the Jetson TK1, the example AlexNet timing demonstration showed an image recognition taking about 27ms. How does the TX1 fare? Looky here:
Caffe Background
Just as a reminder, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Caffe Installation
A script is available in the JetsonHacks GitHub repository which installs the dependencies for Caffe, then downloads the source code and compiles it on the Jetson TX1. To install:
$ git clone https://github.com/jetsonhacks/installCaffeJTX1.git
$ cd installCaffeJTX1
$ ./installCaffe.sh
There are a couple of points of interest in the installCaffe.sh script. The first point is the following:
# Dec. 7, 2015; This only appears in once place currently
# This is a 32 bit OS LMDB_MAP_SIZE needs to be reduced from
# 1099511627776 to 536870912
git grep -lz 1099511627776 | xargs -0 sed -i 's/1099511627776/536870912/g'
# Change the comment too
git grep -lz "// 1TB" | xargs -0 sed -i 's:// 1TB:// 1/2TB:g'
In the current version of Caffe, there is an issue with running on a 32-bit OS like L4T 23.1 on the Jetson TX1. From Aaron Schumacher’s article, The NVIDIA Jetson TK1 with Caffe on MNIST:
Unfortunately master has a really large value for LMDB_MAP_SIZE in src/caffe/util/db.cpp, which confuses our little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached. Caffe GitHub issue #1861 has some discussion about this and maybe it will be fixed eventually, but for the moment if you manually adjust the value from 1099511627776 to 536870912, you’ll be able to run all the Caffe tests successfully.
Corey Thompson also added:
To get the LMDB portion of tests to work, make sure to also update examples/mnist/convert_mnist_data.cpp as well:
examples/mnist/convert_mnist_data.cpp:89:56: warning: large integer implicitly truncated to unsigned type [-Woverflow]
CHECK_EQ(mdb_env_set_mapsize(mdb_env, 1099511627776), MDB_SUCCESS) // 1TB
^
adjust the value from 1099511627776 to 536870912.
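To see why the original constant is a problem on a 32-bit system, it helps to look at the numbers. The sketch below is plain shell arithmetic and calls nothing from Caffe or LMDB. The variable names are made up for illustration, and it assumes a shell with 64-bit arithmetic, such as bash.

```shell
#!/bin/bash
# Illustrative arithmetic only; all variable names are hypothetical.
one_tb=$((1 << 40))              # the original LMDB_MAP_SIZE value
max_32bit=$(( (1 << 32) - 1 ))   # largest value a 32-bit unsigned type can hold
new_size=$((1 << 29))            # the replacement value

echo "original map size: $one_tb"
echo "32-bit maximum:    $max_32bit"
echo "replacement size:  $new_size ($((new_size / 1024 / 1024)) MB)"

# The original value cannot be represented in 32 bits, hence the
# "implicitly truncated" compiler warning quoted above.
if [ "$one_tb" -gt "$max_32bit" ]; then
  echo "original value does not fit in 32 bits"
fi
```

The replacement value is 2^29 bytes, which fits comfortably in a 32-bit type.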
The JetsonHacks script uses the 'git grep' command to round up the usual suspects in the source code that use the 1TB marker, 1099511627776, and then uses the 'sed' command to replace it with 536870912. The script also tries to fix the associated comment (note that the replacement comment reads "1/2TB", although 536870912 bytes is actually 512MB). You will notice that the 'sed' command is slightly unconventional in the latter case: the colon (':') after the s tells sed to use the colon as the delimiter instead of the default slash ('/'). This is useful for dealing with those pesky .cpp style comments, which themselves contain slashes.
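The delimiter trick is easy to try in isolation. The following is a small sketch that runs sed on a made-up line of text (the `line` variable and its contents are hypothetical, not taken from the Caffe source) and compares the colon-delimited form with the equivalent slash-delimited form, where every slash in the pattern has to be escaped.

```shell
#!/bin/bash
# A made-up sample line; not copied from the Caffe source.
line='set_mapsize(env, 1099511627776);  // 1TB'

# ':' as the delimiter: slashes in the pattern need no escaping.
with_colon=$(printf '%s\n' "$line" | sed 's:// 1TB:// 1/2TB:g')

# Default '/' delimiter: every slash in the pattern must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/\/\/ 1TB/\/\/ 1\/2TB/g')

echo "$with_colon"
echo "$with_slash"
```

Both commands produce the same text. The script additionally passes -i so that sed edits the files in place rather than printing to standard output.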
The second point in the script that is a little unusual is the line:
make -j 3 all
Normally one would expect the command to tell the machine to use all available cores, i.e. '-j 4' in the case of the Jetson TX1, but instead we instruct the system to use only 3 of the available 4. Why? Because if 4 is specified, the system crashes with all CPUs pegged at 100%. At this point I’ll give it the benefit of the doubt and assume that it is because of a niggle somewhere in the new L4T 23.1 release on the Jetson TX1.
Installation should not require intervention. In the video, installing the dependencies and compiling took about 20 minutes. Running the unit tests takes about 45 minutes. While not strictly necessary, running the unit tests makes sure that the installation is correct.
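The same "leave one core free" idea can be written so that it adapts to the machine. This is only a sketch and not what installCaffe.sh does (the script hardcodes 3). It assumes the `nproc` utility from GNU coreutils is available, the variable names are made up, and it prints the make command rather than running it.

```shell
#!/bin/bash
# Hypothetical generalization of 'make -j 3 all'; the real script hardcodes 3.
cores=$(nproc)          # number of available cores (GNU coreutils)
jobs=$((cores - 1))     # leave one core free

# Never go below one job, e.g. on a single-core machine.
if [ "$jobs" -lt 1 ]; then
  jobs=1
fi

echo "make -j $jobs all"
```

On the four-core Jetson TX1 this would print the same command the script uses.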
Test Results
At the end of the video, there are a couple of timed tests which can be compared with the Jetson TK1:
Jetson TK1 vs. Jetson TX1 Caffe GPU Example Comparison (10 iterations, times in milliseconds)

| Machine | Average FWD | Average BACK | Average FWD-BACK |
|---|---|---|---|
| Jetson TK1 | 274 | 278 | 555 |
| Jetson TK1 max GPU clock | 234 | 243 | 478 |
| Jetson TX1 | 179 | 144 | 324 |
This is an image recognition in about 27ms for the TK1 and 18ms for the TX1. In another test from the TK1 video, the GPU clocks were maxed out, which resulted in the TK1 scoring 24ms. Maxing out the GPU clock on the Jetson TX1 still needs work; those results will be forthcoming.
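For a quick sense of scale, the GPU numbers above can be turned into speedup ratios. The sketch below simply divides the stock TK1 times by the TX1 times from the table; the `speedup` helper is a made-up name, and awk is used because shell arithmetic is integer-only.

```shell
#!/bin/bash
# speedup SLOWER FASTER: prints SLOWER / FASTER to two decimal places.
# Hypothetical helper for illustration.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'
}

# Stock Jetson TK1 time vs. Jetson TX1 time, in milliseconds, from the GPU table.
fwd=$(speedup 274 179)
back=$(speedup 278 144)
fwd_back=$(speedup 555 324)

echo "forward:          ${fwd}x"       # 1.53x
echo "backward:         ${back}x"      # 1.93x
echo "forward-backward: ${fwd_back}x"  # 1.71x
```

The same helper can be pointed at any other pair of timings, such as the max GPU clock row.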
For completeness, the TK1 video also had results for using just 1 CPU core:
Jetson TK1 vs. Jetson TX1 Caffe CPU Example Comparison (1 CPU core, 10 iterations, times in milliseconds)

| Machine | Average FWD | Average BACK | Average FWD-BACK |
|---|---|---|---|
| Jetson TK1 | 5872 | 5562 | 11435 |
| Jetson TK1 max GPU clock | 5370 | 5701 | 10472 |
| Jetson TX1 | 4360 | 4011 | 8372 |
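For readers who want to reproduce timings like these, Caffe ships a `time` tool. The sketch below only assembles and prints the commands rather than running them, so it works without Caffe installed. It assumes the commands are run from the top of a built Caffe source tree and that the model is the stock `models/bvlc_alexnet/deploy.prototxt`; the exact invocation used in the video is not given in this article, and the `caffe_time_cmd` helper is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: prints a 'caffe time' command line for GPU or CPU mode.
# Assumes the current directory is the top of a built Caffe tree.
caffe_time_cmd() {
  cmd="build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --iterations=10"
  # Without --gpu, 'caffe time' runs on the CPU.
  if [ "$1" = "gpu" ]; then
    cmd="$cmd --gpu=0"
  fi
  printf '%s\n' "$cmd"
}

gpu_cmd=$(caffe_time_cmd gpu)
cpu_cmd=$(caffe_time_cmd cpu)

echo "GPU run: $gpu_cmd"
echo "CPU run: $cpu_cmd"
```

The tool reports average forward, backward, and forward-backward times, which correspond to the columns in the tables above.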
An interesting thing to note about the results is that the Jetson TK1 and the Jetson TX1 both have about the same power footprint (around 10 watts during these tests), so the TX1 delivers quite a performance gain while still sipping the same amount of energy.
Next Steps
There is still some testing work to do here. First up is to max out the CPU and GPU clocks on the Jetson TX1 and see what kind of performance gain that brings. The second step is to use cuDNN, which should also save some time. The release of cuDNN shipped with L4T 23.1 for the Jetson TX1 is R4, which causes some hiccups with the standard Caffe release. Hopefully this issue will be rectified shortly.
Notes
The installation in this video was done directly after flashing L4T 23.1 onto the Jetson TX1 with CUDA 7.0 and OpenCV4Tegra. Git was then installed:
$ sudo apt-get install git
The Caffe commit used in the video is: 9c9f94e18a8909580a6b94c44dbb1e46f0ee8eb8