This is the start of a new set of articles which provide a deeper explanation of the concepts around the NVIDIA Jetson family of products. In many of the previous articles on JetsonHacks, we have been concentrating on “how to” do a particular task.
“In Depth” will focus more on how different subsystems work and can be combined together for specific tasks such as vision processing and machine learning. Articles in this series will cover the Jetson hardware itself, external devices, plus the software that binds everything together.
Here we focus on digital video cameras. As cameras provide the images for vision and machine learning analysis, understanding how a camera gathers and distributes those images is important.
The intent here is to familiarize you with (or, if you're like me, refresh your memory on) the different terms and concepts associated with digital streaming video cameras. You can use this overview as a jumping off point to dive deeper into the subjects presented.
Digital video cameras are ubiquitous. Billions of people have smartphones or tablets with built-in cameras, and hundreds of millions have webcams attached to their computers.
Digital video has a brief history. The first semiconductor image sensor, the CCD, was invented in 1969 at Bell Laboratories. A second type, the CMOS sensor, was invented at the Jet Propulsion Laboratory down the street here in Pasadena, California in 1993. It was in the early 1990s that a convergence of technology allowed digital video to be streamed into consumer-level computers. The first popular consumer webcam, the Connectix QuickCam, was introduced in 1994 for $100. It offered 320×240 resolution with 16 shades of gray. 'Twas amazing at the time.
CMOS technology is now used in the vast majority of sensors in consumer digital video products. Over time the resolution of the sensors has improved, while adding a myriad of capabilities.
Even with a short history, there is a forest of abbreviations and acronyms to navigate to understand what people are talking about in a given context. Hard to talk about something if you don’t know the right name.
Here we will concentrate on cameras that we attach to a Jetson, though these same cameras can be attached to lesser machines. Just one example, here’s a 4K camera:
You can think of a camera as several different parts. First is the image sensor, which gathers light and digitizes it. The second part is the optics, which focuses light on the sensor and provides a shutter. Then there is the supporting electronic circuitry, which interfaces with the sensor, gathers the images and transmits them.
There are two types of image sensors predominantly in use today: CMOS and CCD. CMOS is dominant in most lower-cost applications. The raw sensors provide monochrome (grayscale) images.
Here’s an image of an image sensor, the Sony IMX477:
There are different ways to get color images from these sensors. By far the most common is to use a Bayer filter mosaic, a type of color filter array. The mosaic arranges color filters on the pixel array of the image sensor. The filter pattern is half green, one quarter red, and one quarter blue. The human eye is most sensitive to green, which is why there's extra green in the filter pattern.
Each filter passes a particular band of light wavelengths through to its sensor pixel. For example, a blue filter makes the sensor pixel sensitive to blue light. The pixel emits a signal proportional to how many photons it sees, in this case how much blue light.
There are other variations on this color filter array approach. The Bayer method was patented, so some manufacturers worked around it. Alternatives include CYGM (Cyan, Yellow, Green, Magenta) and RGBE (Red, Green, Blue, Emerald).
In the Bayer filter, the colors may be arranged in different patterns: you may see BGGR (Blue, Green, Green, Red), RGBG, GRBG and RGGB. Demosaicing algorithms use this pattern to interpolate a full color image.
The raw output of Bayer-filter cameras is called a Bayer pattern image. Remember that each pixel is filtered to record only one of three colors. The demosaicing algorithm examines each pixel and its surrounding neighbors to estimate a full Red Green Blue (RGB) color for that pixel. That’s why it’s important to know the arrangement of the colors in the filter.
These algorithms can be simple or complex, depending on computational elements onboard the camera. As you can imagine, this is quite the problem. The algorithms make tradeoffs and assumptions about the scene that they are capturing and take into account the time allowed to calculate the color values. There can be artifacts in the final color image depending on the scene and algorithms chosen.
Time is an important factor when you are trying to estimate the color of each pixel in real time. Let's say you are streaming data at 30 frames per second. That means you have about 33 milliseconds between frames. Your image better be done and gone before the next one arrives! If you have a couple of million pixels to demosaic per frame, you have your work cut out for you! Accurate color estimation can be the enemy of speed, depending on which algorithm is in use.
Sensor modules contain just the image sensor. The Raspberry Pi V2 Camera (IMX219) and the High Quality version (IMX477) are two such modules that work on the Jetson Nano and Xavier NX. These sensors transmit the raw Bayer pattern image over the Camera Serial Interface (CSI) bus. The Jetson then uses onboard Image Signal Processors (ISPs) to perform a variety of tasks on the images. The configurable Tegra ISP hardware handles demosaicing, auto white balance, downscaling and so on. Check out Image Processing and Management for an expanded overview.
On the other hand, camera modules include the smarts onboard to perform those tasks. Some of these modules have a CSI interface, but they typically appear in cameras with other interfaces, such as USB. While some of these modules transmit raw Bayer pattern images, the most likely case you will encounter is an encapsulated video stream, carrying either raw color images or compressed video.
The Bayer filter is transparent to infrared light. Many image sensors can detect near infrared wavelengths. Most color cameras add an infrared filter on the lens to help with better color estimation.
However, sometimes it is useful to look at a scene that is illuminated by infrared light! Security “night vision” systems typically combine an IR emitter with a camera whose image sensor has no infrared filter. This allows the camera to “see in the dark”. One example is the Raspberry Pi NoIR Camera Module V2. This Jetson-compatible camera is the same as the previously mentioned V2 IMX219 RPi camera, but with the infrared filter removed.
The optics for a digital video camera consist of the lens and the shutter. Most inexpensive cameras use a plastic lens, and provide limited manual focus control. There are also fixed-focus lenses which have no provision for adjustment. Other cameras have glass lenses, and some have interchangeable lenses.
You will hear lenses classified in several ways. Typically a lens is specified by its focal length. The focal length can be fixed or variable; a lens with a variable focal length is called a zoom lens.
Another classification is the aperture, denoted by an f-number, e.g. f/2.8. A lens can have a fixed aperture or a variable one. The size of the aperture determines how much light can hit the sensor: the larger the aperture, the more light is allowed through the lens. The larger the aperture, the smaller the f-number.
The lens Field of View (FoV) is also important. Typically this is expressed in degrees, both in the horizontal and the vertical dimension, or diagonally, with the center of the lens being the midpoint of both of the angles.
The fourth classification is the mount type, for cameras that have interchangeable lenses. Interchangeable lenses allow for much more flexibility when capturing images. In the Jetson world, you may hear of an M12 mount, which uses a metric M12 thread with 0.5mm pitch; this is also known as an S-mount. Another common term is the C or CS lens mount. These may attach directly to the PCB of the sensor. The Raspberry Pi High Quality camera uses this type of mount.
The shutter for the camera may be mechanical or electronic. The shutter exposes the sensor for a predetermined amount of time. There are two main exposure methods that shutters use. The first is a rolling shutter, which scans across the sensor progressively, either horizontally or vertically. The second is a global shutter, which exposes the whole sensor at the same instant. The rolling shutter is most common as it tends to be less expensive to implement on a CMOS device, though it may produce image artifacts, like smearing, for fast moving objects in a scene.
For scenes that do not have any fast moving objects, a rolling shutter can be a good choice. However, for other applications this may be unacceptable. For example, a mobile robot, which is an inherently shaky platform to begin with, may not be able to produce good enough images for visualization if the images are smeared. In that case, a global shutter is more appropriate.
The electronic circuitry of the digital video camera controls image acquisition, interpolation and routing of the images to the awaiting world. Some cameras have this circuitry on the sensor die (many phone cameras do this to save space), others have external circuitry to handle the task.
Camera sensors, on the other hand, simply interface with a host that handles the data acquisition directly. The Jetsons have multiple Tegra ISPs to handle this task.
Data compression is an important task. Video data streams can be very large. Most inexpensive webcams have a built-in ASIC to do image interpolation and video compression.
Newer to the market ‘smart’ cameras may have additional circuitry to process the video data stream. This includes more complicated tasks such as computer vision or depth image processing. These specialty cameras may combine more than one sensor in the camera.
For example, a RGBD camera (Red, Green, Blue, Depth) may have two sensors for calculating depth, and another sensor for grabbing color images. Some of these cameras use infrared illuminators to help the depth sensors in low light situations.
The electronic circuitry transmits the video data from the camera to a host device. This can be through one of several physical paths. On the Jetson, this is the MIPI Camera Serial Interface (MIPI CSI) or through the familiar USB. Third parties offer GMSL (Gigabit Multimedia Serial Link) connectors on Jetson carrier boards. GMSL allows longer transmission distances than the typical CSI ribbon cables by serializing/deserializing the video data stream with buffers. For example, you may see these types of connections in use in robots or automobiles.
Data Compression and Transmission
Here’s where it starts to get interesting for us. Data is coming across the wire, how do we interpret it?
We talked about creating full color images. Typically we think about these as three channels of Red, Green and Blue (RGB). The number of bits in each of these channels determines how many “true” colors can be displayed. 8 bits per channel is common; you may see 10 bits. In professional video, you will see higher numbers. The more bits, the more colors you can represent.
Let’s say it’s 8 bits per color channel, so that’s 24 bits (3 bytes) per pixel. If an image is 1920×1080 pixels, that’s 2,073,600 pixels × 3 bytes = 6,220,800 bytes per frame. At 30 frames per second, that comes to 186,624,000 bytes (roughly 187 MB) per second. Of course, if you are using 4K video then you get 4× that amount. Now, we love our pixel friends, but we don’t want to drown in them.
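The back-of-the-envelope arithmetic for an uncompressed 1080p30 stream is easy to check in a few lines (no compression or chroma subsampling assumed):

```python
# Raw data rate of an uncompressed 1080p, 8-bit-per-channel RGB video stream.
width, height = 1920, 1080
bytes_per_pixel = 3              # 8 bits each for R, G and B
fps = 30

frame_bytes = width * height * bytes_per_pixel
stream_bytes_per_second = frame_bytes * fps
frame_budget_ms = 1000 / fps     # time available to process each frame

print(frame_bytes)               # 6220800   (~6.2 MB per frame)
print(stream_bytes_per_second)   # 186624000 (~187 MB per second)
print(round(frame_budget_ms, 1)) # 33.3
```

That 33 millisecond budget is the same deadline the demosaicing discussion ran into earlier: every per-frame stage has to fit inside it.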
As you may have noticed, we took a Bayer pattern image and expanded it. Couldn’t we just transmit the raw image itself, along with an identifier indicating which pattern of colors is on the sensor? Of course we can! However, this forces the receiver to do the color conversion, which may not be an optimal solution.
Types of Data Compression
There are many ways to reduce the amount of image data being transmitted from a video stream. Generally this is done by:
- Color space conversion
- Lossless Compression
- Lossy Compression
- Temporal Compression
We won’t go too deeply into this subject here. Subsequent articles will cover the highlights as we go down the road. Entire industries are devoted to these subjects. However, if you have used cameras in the past, you are probably already familiar with some of the names here.
In color space conversion, YUV coding converts the RGB signal to an intensity component (Y) that ranges from black to white plus two other components (U and V) which code the color. This can be either a lossless or lossy approach. Lossless means that we can convert the image back to the original without any loss, lossy means that we will lose some of the data.
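As a sketch, here is one common form of that conversion, using the BT.601 coefficients (the "analog" YUV form, with inputs normalized to the 0..1 range):

```python
def rgb_to_yuv(r, g, b):
    """BT.601 RGB -> YUV, with r, g, b each in the range 0..1."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: black-to-white intensity
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return y, u, v

print(rgb_to_yuv(0.0, 0.0, 0.0))   # black -> (0.0, 0.0, 0.0)
```

Note that a pure gray pixel (r == g == b) has essentially zero chroma. Since the eye is far less sensitive to U and V detail than to Y, codecs can subsample the chroma channels (the 4:2:2 and 4:2:0 schemes mentioned later) for easy, mostly invisible savings.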
Then there is image compression. You are probably familiar with the PNG file format, which uses lossless bitmap compression. A JPEG file uses a lossy compression method based on the discrete cosine transform. In general, you can get up to a ~4× size reduction using lossless compression, whereas lossy compression can go much higher. The quality of the lossy compressed image may suffer, of course.
Temporal compression measures and encodes differences in the video stream images over time. Generally a frame is set as the key (keyframe), and differences are measured between subsequent frames from there. That way, you only need to send the one keyframe and then the differences. New keyframes are usually generated after a given interval, or generated on a scene change. For mostly static scenes, the size savings can be quite dramatic.
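The keyframe-plus-differences idea can be sketched with a toy encoder. This is illustrative only; real codecs use motion estimation and entropy coding, not simple per-pixel subtraction:

```python
def encode(frames):
    """Toy temporal encoder: the first frame is the keyframe, the rest are
    per-pixel differences from the previous frame."""
    key = frames[0]
    diffs = [[c - p for p, c in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]
    return key, diffs

def decode(key, diffs):
    """Rebuild the original frames from the keyframe and the differences."""
    frames = [key]
    for d in diffs:
        frames.append([p + delta for p, delta in zip(frames[-1], d)])
    return frames

# Three flattened "frames" of a mostly static scene: one pixel changes, then nothing.
f0 = [1, 2, 3, 4]
f1 = [1, 9, 3, 4]
f2 = [1, 9, 3, 4]

key, diffs = encode([f0, f1, f2])
print(diffs)                               # [[0, 7, 0, 0], [0, 0, 0, 0]]
print(decode(key, diffs) == [f0, f1, f2])  # True
```

The difference frames for a static scene are almost entirely zeros, and long runs of zeros compress extremely well. That is where the dramatic size savings come from.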
There are a wide variety of algorithms for this task, which is called encoding. The names of these encoders include H.264, H.265, VP8, VP9 and MJPEG. A matching decoder on the receiving end reconstructs the video.
A four character identifier (fourcc) identifies how the video data stream is encoded. This is a throwback to the old Macintosh days where QuickTime built upon the Apple File Manager idea of defining containers with four characters. The four characters conveniently fit in a 32 bit word. Audio uses this method too.
Some of the fourcc codes are easy to guess, such as ‘H264’ and ‘H265’. ‘MJPG’ means that each image is JPEG encoded. Others are not so easy: ‘YUYV’ is fairly common, a packed format with ½ horizontal chroma resolution, also known as YUV 4:2:2. Some of this confusion exists because manufacturers can register these format names. Also, over the years the same code may have gained an alias on a different platform; for example, on Windows ‘YUYV’ is known as ‘YUY2’.
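The "fits in a 32-bit word" part is easy to demonstrate: the four ASCII characters are simply packed into consecutive bytes. The little-endian packing below matches how OpenCV's `VideoWriter_fourcc` computes the value:

```python
def fourcc(code):
    """Pack a four-character code like 'YUYV' into a 32-bit integer."""
    assert len(code) == 4
    value = 0
    for i, ch in enumerate(code):
        value |= (ord(ch) & 0xFF) << (8 * i)   # byte i holds character i
    return value

def fourcc_to_str(value):
    """Unpack a 32-bit fourcc integer back into its four characters."""
    return ''.join(chr((value >> (8 * i)) & 0xFF) for i in range(4))

print(hex(fourcc('MJPG')))            # 0x47504a4d
print(fourcc_to_str(fourcc('YUYV')))  # YUYV
```

Tools like `v4l2-ctl --list-formats` on Linux report these same codes when you ask a camera what it can output.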
This has been an overview of cameras. There are multiple books and research articles on each of the subjects here. Hopefully this gives you a starting point for exploring when you dig deeper into the subject.
In the next article, we will go over how to actually get the video stream into the Jetson!
References
Debayering Demystified by Craig Stark, PhD
How does YUV Color Encoding Work?
Figure 1: Image Creator: Interiot at English Wikipedia, CC BY-SA 3.0, via Wikimedia Commons
Figure 2, 3: CC By-SA 3.0 from Optical Camera Communications
Figure 4: Raspberry Pi HQ Camera, CS Mount. Image courtesy raspberrypi.org