This article is part of our coverage of the latest in AI research.
A new machine learning technique developed by researchers at Edge Impulse, a platform for creating ML models for the edge, makes it possible to run real-time object detection on devices with very little computing and memory capacity. Called Faster Objects, More Objects (FOMO), the new deep learning architecture can unlock new machine vision applications.
Most object detection deep learning models have memory and computation requirements that exceed the capabilities of small processors. FOMO, on the other hand, requires only a few hundred kilobytes of memory, which makes it a good fit for TinyML, a subfield of machine learning focused on running ML models on microcontrollers and other memory-constrained devices that have limited or no internet connectivity.
Image Classification vs. Object Detection
TinyML has come a long way in image classification, where the machine learning model only has to predict whether a certain type of object is present in an image. Object detection, on the other hand, requires the model to identify more than one object, as well as the bounding box of each instance.
Object detection models are much more complex than image classification networks and require more memory.
“We added computer vision support to Edge Impulse in 2020, and we’ve seen a huge uptake of applications (40 percent of our projects are computer vision applications),” Jan Jongboom, CTO of Edge Impulse, told TechTalks. “But with current state-of-the-art models, you could only classify images on microcontrollers.”
Image classification is very useful for many applications. For example, a security camera can use TinyML image classification to determine whether or not there is a person in the frame. However, much more can be done.
“It was a huge hassle that you were limited to these very basic classification tasks. There is a lot of value in seeing ‘there are three people here’ or ‘this label is in the top left corner.’ For example, counting things is one of the biggest requests we see in the market today,” says Jongboom.
Previous object detection ML models had to process the input image multiple times to locate the objects, making them slow and computationally expensive. Newer models such as YOLO (You Only Look Once) use single-shot detection to provide near real-time object detection. But their memory requirements are still large. Even models designed for edge applications are difficult to run on small devices.
“YOLOv5 or MobileNet SSD are incredibly large networks that never fit on MCUs and barely fit on Raspberry Pi-like devices,” says Jongboom.
Also, these models are poor at detecting small objects and need a lot of data. For example, YOLOv5 recommends more than 10,000 training instances per object class.
The idea behind FOMO is that not all object detection applications require the high-precision output that state-of-the-art deep learning models provide. By finding the right trade-off between accuracy, speed, and memory, you can reduce your deep learning models to very small sizes and keep them useful.
Instead of detecting bounding boxes, FOMO predicts the center of the object. This is because many object detection applications are only interested in the location of objects in the frame and not their sizes. Centroid detection is much more computationally efficient than bounding box prediction and requires less data.
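To make the difference concrete, here is a minimal sketch of how a bounding-box label collapses into a centroid label. The `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions chosen for illustration, not Edge Impulse's actual data format or code.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_to_centroid(box: Box) -> Point:
    """Collapse a bounding box to its center point, discarding its size."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def boxes_to_centroids(boxes: List[Box]) -> List[Point]:
    """Convert a list of box labels into centroid labels."""
    return [box_to_centroid(b) for b in boxes]


boxes = [(10, 20, 30, 60), (50, 50, 70, 90)]
centroids = boxes_to_centroids(boxes)
print(centroids)  # [(20.0, 40.0), (60.0, 70.0)]
```

Each label shrinks from four numbers to two, and the model no longer has to regress width and height, which is the part of the task the article describes as unnecessary for many applications.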
Redefining Object Detection Deep Learning Architectures
FOMO also applies a major structural change to traditional deep learning architectures.
Single-shot object detectors are composed of a set of convolutional layers that extract features and several fully connected layers that predict the bounding box. Convolution layers extract visual features in a hierarchical fashion. The first layer detects simple things like lines and edges in different directions. Each convolutional layer is typically coupled with a pooling layer, which reduces the size of the layer’s output while keeping the salient features in each area.
The output of the pooling layer is then sent to the next convolutional layer, which extracts higher-level features such as corners, arcs, and circles. As more convolutional and pooling layers are added, each feature map covers a larger area of the image and can detect complex things such as faces and objects.
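The pooling step described above can be sketched in a few lines. The article does not say which pooling operation is used, so this example assumes 2x2 max pooling with a stride of 2, one common choice; the function name and the sample values are illustrative only.

```python
from typing import List

Grid = List[List[float]]


def max_pool_2x2(feature_map: Grid) -> Grid:
    """2x2 max pooling with stride 2. Assumes even height and width."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            # Keep only the strongest activation in each 2x2 window.
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


fm = [
    [1, 3, 0, 2],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(fm))  # [[4, 2], [2, 8]]
```

The 4x4 map becomes 2x2, so every value in the next layer summarizes a larger patch of the original image.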
Finally, the fully connected layers flatten the output of the final convolution layer and try to predict the class and bounding box of the objects.
FOMO removes the fully connected layers and the last few convolution layers. This turns the output of the neural network into a scaled-down version of the image, where each output value represents a small patch of the input image. The network is then trained on a special loss function so that each output unit predicts the class probabilities for the corresponding patch in the input image. The output effectively becomes a heatmap for the object types.
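As a rough illustration of how such a heatmap could be read, the sketch below maps every confident output cell back to the center of its patch in the input image. The 8-pixel patch size, the 0.5 threshold, and the function name are assumptions made for this example; a production decoder would likely also merge neighboring cells that belong to the same object, which this sketch does not attempt.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # per-cell probability for ONE object class


def decode_heatmap(
    heatmap: Heatmap, cell_size: int, threshold: float = 0.5
) -> List[Tuple[float, float, float]]:
    """Return (x, y, score) for each cell at or above the threshold.

    x and y are the center of the cell's patch in input-image pixels.
    """
    detections = []
    for row, scores in enumerate(heatmap):
        for col, score in enumerate(scores):
            if score >= threshold:
                x = (col + 0.5) * cell_size
                y = (row + 0.5) * cell_size
                detections.append((x, y, score))
    return detections


# Hypothetical 3x3 output for a 24x24 input, i.e. 8x8-pixel patches.
heatmap = [
    [0.1, 0.0, 0.9],
    [0.0, 0.2, 0.0],
    [0.8, 0.0, 0.1],
]
detections = decode_heatmap(heatmap, cell_size=8)
print(detections)       # [(20.0, 4.0, 0.9), (4.0, 20.0, 0.8)]
print(len(detections))  # 2
```

Counting objects, the use case Jongboom highlights, then reduces to taking the length of the detection list.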
There are several key benefits of this approach. First, FOMO is compatible with existing architectures. For example, FOMO can be applied to MobileNetV2, a popular deep learning model for image classification on edge devices.
Furthermore, by greatly reducing the size of the neural network, FOMO reduces the memory and computation requirements of object detection models. According to Edge Impulse, it’s 30 times faster than MobileNet SSD and can run on devices that have less than 200 kilobytes of RAM.
For example, the following video shows a FOMO neural network detecting objects at 30 frames per second on an Arduino Nicla Vision with just over 200 kilobytes of memory. On a Raspberry Pi 4, FOMO can detect objects at 60 fps compared to MobileNet SSD’s 2 fps performance.
Jongboom told me that FOMO was inspired by the work that Mat Kelcey, principal engineer at Edge Impulse, did on the architecture of neural networks for counting bees.
“Traditional object detection algorithms (YOLOv5, MobileNet SSD) are bad for these kinds of problems (objects of similar size, many very small objects), so we designed a custom architecture that optimizes for these problems,” he said.
The granularity of the FOMO output can be configured on a per-application basis and can detect many instances of objects in a single image.
The benefits of FOMO do not come without tradeoffs. It works best when the objects are the same size. The output is like a grid of equal-sized squares, each of which detects an object. So if there is a very large object in the foreground and a lot of small objects in the background, it won’t work as well.
Also, when objects are too close to each other or overlapping, they will occupy the same grid square, reducing the accuracy of the object detector (see video below). You can overcome this limit to some extent by reducing the FOMO cell size or increasing the resolution of the image.
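The grid-collision effect is easy to reproduce with plain arithmetic. The sketch below assigns object centroids to grid cells and counts how many objects are lost because they share a cell with another object. The helper names, coordinates, and cell sizes are hypothetical values picked for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def cells_for(centroids: List[Point], cell_size: int) -> Dict[Cell, int]:
    """Map each occupied (row, col) grid cell to the number of centroids in it."""
    counts: Dict[Cell, int] = {}
    for x, y in centroids:
        cell = (int(y // cell_size), int(x // cell_size))
        counts[cell] = counts.get(cell, 0) + 1
    return counts


def merged_objects(centroids: List[Point], cell_size: int) -> int:
    """Number of objects that cannot be told apart at this cell size."""
    return len(centroids) - len(cells_for(centroids, cell_size))


# Two objects 4 pixels apart, plus one that is far away from both.
close_pair = [(10.0, 10.0), (14.0, 10.0), (40.0, 40.0)]
print(merged_objects(close_pair, cell_size=16))  # 1
print(merged_objects(close_pair, cell_size=4))   # 0
```

With 16-pixel cells the two neighboring objects fall into the same square and are counted once; with 4-pixel cells they land in adjacent squares and are detected separately.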
FOMO is especially useful when the camera is in a fixed location, for example scanning objects on a conveyor belt or counting cars in a parking lot.
The Edge Impulse team plans to expand on this work in the future, including shrinking the model below 100 kilobytes and making it better at transfer learning.
This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the downside of technology, the darker implications of new technology, and what to watch out for. You can read the original article here.