Image recognition in about 10ms on a Jetson TX1! In an earlier article on running Caffe on the Jetson TX1, we stated that when cuDNN library version 4 support was added, we would revisit the installation and run some benchmarks. That time is now. Looky here:
Caffe Background
Just as a reminder, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Caffe Installation
A script is available in the JetsonHacks GitHub repository which will install the dependencies for Caffe, and then download the source code and compile it on the Jetson TX1. To install:
$ git clone https://github.com/jetsonhacks/installCaffeJTX1.git
$ cd installCaffeJTX1
$ ./installCaffeCuDNN.sh
The name of the script is installCaffeCuDNN.sh.
There are a few points of interest in the installCaffeCuDNN.sh script. The first point is the following:
# Dec. 7, 2015; This only appears in one place currently
# This is a 32 bit OS LMDB_MAP_SIZE needs to be reduced from
# 1099511627776 to 536870912
git grep -lz 1099511627776 | xargs -0 sed -i 's/1099511627776/536870912/g'
# Change the comment too
git grep -lz "// 1TB" | xargs -0 sed -i 's:// 1TB:// 1/2TB:g'
In the current version of Caffe, there is an issue with running on a 32-bit OS like L4T 23.1 on the Jetson TX1. From Aaron Schumacher’s article, The NVIDIA Jetson TK1 with Caffe on MNIST:
Unfortunately master has a really large value for LMDB_MAP_SIZE in src/caffe/util/db.cpp, which confuses our little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached. Caffe GitHub issue #1861 has some discussion about this and maybe it will be fixed eventually, but for the moment if you manually adjust the value from 1099511627776 to 536870912, you’ll be able to run all the Caffe tests successfully.
Corey Thompson also added:
To get the LMDB portion of tests to work, make sure to also update examples/mnist/convert_mnist_data.cpp as well:
examples/mnist/convert_mnist_data.cpp:89:56: warning: large integer implicitly truncated to unsigned type [-Woverflow]
CHECK_EQ(mdb_env_set_mapsize(mdb_env, 1099511627776), MDB_SUCCESS) // 1TB
^
adjust the value from 1099511627776 to 536870912.
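To see why the original value trips up a 32-bit build, it helps to put the two numbers side by side. Here is a small sketch using shell arithmetic. The 32-bit width of size_t is an assumption about the toolchain on this OS, and the variable names are made up for illustration:

```shell
# Sizes discussed above, in bytes.
old_map_size=1099511627776   # original LMDB_MAP_SIZE (2^40, i.e. 1TB)
new_map_size=536870912       # replacement value (2^29, i.e. 512MB)

# Assumption: size_t is 32 bits wide on the 32-bit OS,
# so only the low 32 bits of a value survive.
truncated=$(( old_map_size & 0xFFFFFFFF ))

echo "old map size: $(( old_map_size >> 30 )) GB"
echo "new map size: $(( new_map_size >> 20 )) MB"
echo "old value truncated to 32 bits: ${truncated}"
```

Under that assumption the old value collapses to 0 once it is squeezed into 32 bits, which fits with the -Woverflow warning quoted above. The replacement value fits comfortably.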
The JetsonHacks script uses the ‘git grep’ command to round up the usual suspects in the source code that use the 1TB marker, 1099511627776, and then uses the ‘sed’ command to replace it with 536870912, which is 512MB. The script also tries to fix the associated comment, though the replacement text ‘1/2TB’ overstates the new value. You will notice that the ‘sed’ command is slightly unconventional in the latter case: the colon (‘:’) after the s tells sed to use the colon as the delimiter instead of the default slash (‘/’). This is useful for those pesky .cpp style comments, which contain slashes themselves.
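The delimiter trick is easy to try on its own without touching the Caffe sources. In this sketch the sample line is made up for the demonstration and is not copied from Caffe. It runs the same substitution twice, once with the colon delimiter and once with the default slash:

```shell
# Hypothetical source line, made up for this demonstration.
sample='const size_t LMDB_MAP_SIZE = 1099511627776;  // 1TB'

# Colon as delimiter: slashes in the pattern need no escaping.
with_colon=$(printf '%s\n' "$sample" | sed 's:// 1TB:// 1/2TB:g')

# Default slash delimiter: every literal slash must be escaped.
with_slash=$(printf '%s\n' "$sample" | sed 's/\/\/ 1TB/\/\/ 1\/2TB/g')

printf '%s\n' "$with_colon"
printf '%s\n' "$with_slash"
```

Both commands produce the same line. The colon version is just a lot easier on the eyes.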
The second point in the script that is a little unusual is the line:
make -j 3 all
Normally one would expect the command to tell the machine to use all available cores, i.e. ‘-j 4’ in the case of the Jetson TX1, but instead we instruct the system to use only 3 of the available 4. Why? Because if 4 is specified, the system crashes with all CPUs pegged at 100%. At this point I’ll give it the benefit of the doubt and assume that it is because of a niggle somewhere in the new L4T 23.1 release on the Jetson TX1.
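The script hardcodes the job count, but the same "leave one core free" idea can be worked out from the machine itself. This is only a sketch and not what the JetsonHacks script does. The variable names are made up, and it prints the make command rather than running it:

```shell
# Count the CPU cores visible to this process (nproc is part of GNU coreutils).
cores=$(nproc)

# Leave one core free, but never go below one job.
if [ "$cores" -gt 1 ]; then
    build_jobs=$(( cores - 1 ))
else
    build_jobs=1
fi

# Print the command rather than running it, so it can be inspected first.
echo "make -j ${build_jobs} all"
```

On a machine that reports 4 cores, this arrives at the same ‘-j 3’ that the script uses.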
Installation should not require intervention. In the video, installation of dependencies and compilation took about 20 minutes, and running the unit tests took about 45 minutes. While not strictly necessary, running the unit tests makes sure that the installation is correct.
Test Results
At the end of the video, there are a couple of timed tests which can be compared with the Jetson TK1, and the previous installation:
Jetson TK1 vs. Jetson TX1 Caffe GPU Example Comparison (10 iterations, times in milliseconds)
Machine | Average FWD | Average BACK | Average FWD-BACK |
---|---|---|---|
Jetson TK1 | 274 | 278 | 555 |
Jetson TK1 max GPU clock | 234 | 243 | 478 |
Jetson TX1 | 179 | 144 | 324 |
Jetson TX1 with cuDNN support | 103 | 117 | 224 |
This works out to an image recognition in about 27ms for the TK1, and around 10ms for the TX1 using cuDNN support. In another test from the TK1 video, the GPU and CPU clocks were maxed out, which resulted in the TK1 scoring 24ms. The Jetson TX1 numbers in the table above were taken with the GPU and CPU clocks maxed out. As you can see, performance of the Jetson TX1 is about 2X that of the Jetson TK1.
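The per-image figures and the roughly 2X claim can be checked against the table with a little arithmetic. This sketch assumes that each timed forward pass covers 10 images, which is what dividing the Average FWD column by 10 implies. The helper names are made up for illustration:

```shell
# Per-image time: average forward pass (ms) divided by 10.
# Assumption: each timed forward pass covers 10 images.
per_image_ms() {
    awk -v fwd="$1" 'BEGIN { printf "%.1f\n", fwd / 10 }'
}

# Ratio of two timings, e.g. TK1 time over TX1 time.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

echo "TK1:            $(per_image_ms 274) ms per image"
echo "TX1 with cuDNN: $(per_image_ms 103) ms per image"
echo "FWD speedup, TK1 max GPU clock vs TX1 cuDNN:      $(speedup 234 103)x"
echo "FWD-BACK speedup, TK1 max GPU clock vs TX1 cuDNN: $(speedup 478 224)x"
```

The ratios come out at 2.27 for the forward pass and 2.13 for forward plus backward, so "about 2X" holds up when both boards have their clocks maxed.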
For completeness, the TK1 video also had results for using just 1 CPU core:
Jetson TK1 vs. Jetson TX1 Caffe CPU Example Comparison (1 CPU core, 10 iterations, times in milliseconds)
Machine | Average FWD | Average BACK | Average FWD-BACK |
---|---|---|---|
Jetson TK1 | 5872 | 5562 | 11435 |
Jetson TK1 max GPU clock | 5370 | 5701 | 10472 |
Jetson TX1 | 4360 | 4011 | 8372 |
An interesting thing to note about the results is that the Jetson TK1 and the Jetson TX1 have about the same power footprint (around 10 watts during these tests), so on the TX1 there’s quite a performance difference while still sipping the same amount of energy.
Next Steps
Though the Jetson TX1 has a 64-bit processor, the current operating system is 32-bit. Over the next few months, the OS will be upgraded to 64-bit. It will be interesting to see the amount of performance improvement (if any) from the switchover. Update: Here’s an article using Caffe on 64-bit L4T on the Jetson TX1.
Notes
The installation in this video was done directly after flashing L4T 23.1 onto the Jetson TX1 with CUDA 7.0, cuDNN r4 and OpenCV4Tegra. Git was then installed:
$ sudo apt-get install git
The latest Caffe commit used in the video is: 4541f8900588a335f2d9387a5b03460deba68678
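If a later Caffe commit gives you trouble, one option is to pin an existing clone to the commit above before compiling. This is a sketch only. The helper name and the clone path are hypothetical, and the install script itself does not do this:

```shell
# Commit hash quoted above.
caffe_commit=4541f8900588a335f2d9387a5b03460deba68678

# Hypothetical helper: check out that commit in an existing Caffe clone.
# $1 is the path to the clone (an assumption; adjust to your setup).
checkout_caffe_commit() {
    git -C "$1" checkout "$caffe_commit"
}

# Example (not run here): checkout_caffe_commit "$HOME/caffe"
```

Git will leave the clone in a detached HEAD state after the checkout, which is fine for building.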