Note: This article has been updated to use Caffe with cuDNN. cuDNN is an NVIDIA-provided GPU-accelerated library for deep neural networks that can more than double performance. For 64-bit L4T please visit: 64-bit Caffe. For 32-bit L4T please visit: 32-bit Caffe.
In an earlier article on running the Caffe Deep Learning Framework on the Jetson TK1, the example AlexNet timing demonstration showed an image recognition taking place in about 27 ms on the Jetson TK1. How does the TX1 fare? Looky here:
Caffe Background
Just as a reminder, Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors.
Caffe Installation
A script is available in the JetsonHacks GitHub repository which will install the dependencies for Caffe, and then download the source code and compile it on the Jetson TX1. In order to install:
$ git clone https://github.com/jetsonhacks/installCaffeJTX1.git
$ cd installCaffeJTX1
$ ./installCaffe.sh
There are a couple of points of interest in the installCaffe.sh script. The first point is the following:
# Dec. 7, 2015; This only appears in one place currently
# This is a 32-bit OS; LMDB_MAP_SIZE needs to be reduced from
# 1099511627776 to 536870912
git grep -lz 1099511627776 | xargs -0 sed -i 's/1099511627776/536870912/g'
# Change the comment too
git grep -lz "// 1TB" | xargs -0 sed -i 's:// 1TB:// 1/2TB:g'
In the current version of Caffe, there is an issue with running on a 32-bit OS like L4T 23.1 on the Jetson TX1. From Aaron Schumacher’s article: The NVIDIA Jetson TK1 with Caffe on MNIST
Unfortunately master has a really large value for LMDB_MAP_SIZE in src/caffe/util/db.cpp, which confuses our little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached. Caffe GitHub issue #1861 has some discussion about this and maybe it will be fixed eventually, but for the moment if you manually adjust the value from 1099511627776 to 536870912, you’ll be able to run all the Caffe tests successfully.
Corey Thompson also added:
To get the LMDB portion of tests to work, make sure to also update examples/mnist/convert_mnist_data.cpp as well:
examples/mnist/convert_mnist_data.cpp:89:56: warning: large integer implicitly truncated to unsigned type [-Woverflow]
CHECK_EQ(mdb_env_set_mapsize(mdb_env, 1099511627776), MDB_SUCCESS) // 1TB
^
adjust the value from 1099511627776 to 536870912.
The JetsonHacks script uses the 'git grep' command to round up the usual suspects in the source code that use the 1TB marker, 1099511627776, and then uses the 'sed' command to replace it with half of that value. The script also tries to fix the associated comment. You will notice that the 'sed' command is slightly unconventional in the latter case: the colon (':') after the s tells sed to use the colon as the delimiter instead of the default slash ('/'). This is useful for helping with those pesky C++ style comments, which contain slashes of their own.
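As a minimal sketch, here is the same substitution applied to just the db.cpp file mentioned above; with ':' as the delimiter, the slashes inside the comment text do not need to be escaped:

$ sed -i 's:// 1TB:// 1/2TB:g' src/caffe/util/db.cpp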
The second point in the script that is a little unusual is the line:
make -j 3 all
Normally one would expect the command to tell the machine to use all available cores, i.e. '-j 4' in the case of the Jetson TX1, but instead we instruct the system to use only 3 of the 4 available cores. Why? Because if 4 is specified, the system crashes with all CPUs pegged at 100%. At this point I'll give it the benefit of the doubt and assume that it is because of a niggle somewhere in the new L4T 23.1 release on the Jetson TX1.
Installation should not require intervention; in the video, installation of the dependencies and compilation took about 20 minutes. Running the unit tests takes about 45 minutes. While not strictly necessary, running the unit tests makes sure that the installation is correct.
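If you want to run (or re-run) the unit tests yourself after the script finishes, the standard Caffe make targets look roughly like this; the directory is wherever the script cloned the Caffe source, and the reduced job count matches the script:

$ cd caffe
$ make -j 3 test
$ make runtest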
Test Results
At the end of the video, there are a couple of timed tests which can be compared with the Jetson TK1:
Jetson TK1 vs. Jetson TX1 Caffe GPU Example Comparison (10 iterations, times in milliseconds)

| Machine | Average FWD | Average BACK | Average FWD-BACK |
|---|---|---|---|
| Jetson TK1 | 274 | 278 | 555 |
| Jetson TK1 max GPU clock | 234 | 243 | 478 |
| Jetson TX1 | 179 | 144 | 324 |
This works out to an image recognition in about 27 ms for the TK1 and 18 ms for the TX1. In another test from the TK1 video, the GPU clocks were maxed out, which resulted in the TK1 scoring 24 ms. Speeding up the GPU clock on the Jetson TX1 still needs work; those results will be forthcoming.
For completeness, the TK1 video also had results for using just 1 CPU core:
Jetson TK1 vs. Jetson TX1 Caffe CPU Example Comparison (1 CPU core, 10 iterations, times in milliseconds)

| Machine | Average FWD | Average BACK | Average FWD-BACK |
|---|---|---|---|
| Jetson TK1 | 5872 | 5562 | 11435 |
| Jetson TK1 max GPU clock | 5370 | 5701 | 10472 |
| Jetson TX1 | 4360 | 4011 | 8372 |
An interesting thing to note about the results is that the Jetson TK1 and the Jetson TX1 have about the same power footprint (around 10 watts during these tests), so the TX1 delivers quite a performance difference while still sipping the same amount of energy.
Next Steps
There is still some work to do here for testing. First up is to max out the CPU and GPU clocks on the Jetson TX1 and see what type of performance gain that brings. The second step is to use cuDNN, which should also save some time. The release of cuDNN shipped with the Jetson TX1 L4T 23.1 is R4, which causes some hiccups with the standard Caffe release. Hopefully this issue will be rectified shortly.
Notes
The installation in this video was done directly after flashing L4T 23.1 onto the Jetson TX1 with CUDA 7.0 and OpenCV4Tegra. Git was then installed:
$ sudo apt-get install git
The Caffe commit used in the video is: 9c9f94e18a8909580a6b94c44dbb1e46f0ee8eb8
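If you want to build against the exact revision shown in the video, you can check it out after cloning; a sketch, run from inside the Caffe source directory:

$ git checkout 9c9f94e18a8909580a6b94c44dbb1e46f0ee8eb8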
20 Responses
Have you tried to compile pycaffe? I get errors whenever I try to make it.
I haven’t tried it. Is it version mismatch type of stuff?
Incorrect testing speeds
On the TK1 and the TX1 I agree that each image takes about 23 ms and 18 ms respectively during batch testing. However, suppose I use a JPEG image and test it through a Python script or the Caffe command line interface; I'm clocking close to 0.7–1 s.
For example, take this test script below for a single image:
./build/examples/cpp_classification/classification.bin \
  models/bvlc_reference_caffenet/deploy.prototxt \
  models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
  data/ilsvrc12/imagenet_mean.binaryproto \
  data/ilsvrc12/synset_words.txt \
  examples/images/cat.jpg
Testing takes 1 s if you time it (excluding all the labels and printing). Do you know why this happens?
I'm getting around 3 seconds running the classification.bin executable on my image. Did you just run the time command on the execution line?
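For reference, a minimal sketch of timing the whole invocation with the shell's time builtin; note that this measures the entire run, including loading the model and preprocessing the image, not just the forward pass:

$ time ./build/examples/cpp_classification/classification.bin \
    models/bvlc_reference_caffenet/deploy.prototxt \
    models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
    data/ilsvrc12/imagenet_mean.binaryproto \
    data/ilsvrc12/synset_words.txt \
    examples/images/cat.jpg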
Thanks for the great article. I needed
export LD_LIBRARY_PATH=/usr/local/cuda-7.0/targets/armv7-linux-gnueabihf/lib
before the tests, but other than that things worked perfectly.
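If you want that setting to persist across terminal sessions, one way (a sketch) is to append it to your ~/.bashrc:

$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-7.0/targets/armv7-linux-gnueabihf/lib' >> ~/.bashrc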
Thanks for the kind words, and thanks for reading!
Great clip thanks! What is the background music in the video?
Hi Steve,
I’m glad you liked the video! The music is just a couple of friends and I jamming on a Saturday afternoon. Thanks for reading.
I have an NVIDIA Jetson TX1 and I want to implement AlexNet on my own data (104,000 training and 20,000 validation images). I have tried many ways, but each time training is killed, or it fails while saving snapshots because of a cache size problem or an open files limitation. All are memory problems. I want to know if it is possible to do such a task with this GPU or not?
Have you tried doing this with swap memory turned on?
As a side note, typically training is done on a desktop/cloud GPU and inferencing or running a trained model is done on the TX1.
I am trying to train my model on my data using the Jetson TX1. Then when I get the weights, I use a desktop (CPU) to test the trained model on my test data.
Besides, I am training the model from scratch and I am not using a trained model for that!
So you mean I cannot train my model using this GPU?
I meant:
First, have you tried turning on swap memory? If you are having memory pressure issues this should help.
Second, if you have a large model (especially one that takes a long time to train), a lot of people will train on a larger machine, and then bring the trained model to run on the Jetson TX1. The term “side note” simply means additional information.
Third, since I have no specific knowledge of your model, it is difficult for me to determine whether it can be trained on the Jetson TX1.
Could you please tell me how I can use the swap memory? or direct me to get information about that.
Thanks.
There are two answers here. First, if you're using L4T 24.X, you can follow these directions:
https://www.digitalocean.com/community/tutorials/how-to-add-swap-space-on-ubuntu-16-04
If you’re using L4T 23.X, it’s much harder, because you have to modify the kernel to turn swap on, and then follow the above instructions.
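For reference, the usual Ubuntu swap-file recipe that the linked tutorial walks through looks roughly like this; the 4G size is only an example, and on some filesystems you may need dd instead of fallocate:

$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# Make the swap file persistent across reboots:
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab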
Thanks for sharing. I wonder what kinds of frameworks are available on the TX1?
Only Caffe?
Is it possible to run TensorFlow from Google on the TX1?
I’ve heard some people have had success with TensorFlow on the TX1. You can ask that question in the NVIDIA Jetson TX1 Forum:
https://devtalk.nvidia.com/default/board/164/jetson-tx1/
and see if you can get a better answer. Thanks for reading!
Thanks! :) By the way, I have never used Caffe. Do you think it is great?
I recently became aware of a company called Nervana Systems in San Diego. They have a Deep Neural Network engine called Neon, which they claim is 2x faster than Caffe. Here is a link for a SegNet network: https://www.nervanasys.com/industry-focus-serving-the-automotive-industry-with-the-nervana-platform/
The article briefly discusses a port of Neon for the TX1. I don't have a TX1 yet, but I would be interested to hear what you have to say about the product.
Great site, and great posts. Thanks
Thanks for the kind words. I don’t have any experience with Nervana. In reference to the article, which came out in May, NVIDIA has since released the TensorRT libraries and updated cuDNN. Because deep learning is moving so fast, it can be difficult to figure out performance differences between competing systems.
As with most large software environments, each system has strong and weak points. Marketing people learned long ago how to compare them to make a competitor look weak in comparison. Experienced users and developers know that there is a learning curve and an investment to be made to fully understand how good any given system is at a particular task. In the case of deep learning in general, training a large network is a major investment, even if you count just setting it up and the compute cycles involved.
Being owned by Intel, one would think that Nervana would be able to keep pace with the fast-moving deep learning sector. I saw this the other day:
“Matlab is so 2012. Caffe is so 2013. Theano is so 2014. Torch is so 2015. TensorFlow is so 2016”
Is Nervana’s Neon 2017? Dunno. But here’s the thing. These systems are tools. It’s the people wielding the tools that can make them sing. It’s the old Lucius Annaeus Seneca quote, “It’s not the sword that kills; it is a tool in the killer’s hand”.