SAPPHIRE: An Always-on Context-aware Computer Vision System for Portable Devices

Swagath Venkataramani, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN
Victor Bahl, Xian-Sheng Hua, Jie Liu, Jin Li, Matthai Philipose, Bodhi Priyantha, Mohammed Shoaib, Microsoft Research, One Microsoft Way, Redmond, WA

Abstract

Being aware of objects in the ambient environment provides a new dimension of context awareness. Towards this goal, we present a system that exploits powerful computer vision algorithms in the cloud by collecting data through always-on cameras on portable devices. To reduce communication-energy costs, our system allows client devices to continually analyze streams of video and distill out the frames that contain objects of interest. Through a dedicated image-classification engine, SAPPHIRE, we show that if an object is found in 5% of all frames, we end up selecting only about 30% of them while still detecting the object 90% of the time: a 70% data reduction on the client device at a cost of 6 mW of power (45 nm ASIC). By doing so, we demonstrate system-level energy reductions of 2-4×. Thanks to multiple levels of pipelining and parallel vector-reduction stages, SAPPHIRE consumes only 3 mJ/frame and 38 pJ/op, estimated to be 11.4× lower than a 45 nm GPU, while delivering a slightly higher level of peak performance (29 GFLOPS). Further, compared to a parallelized software implementation on a mobile CPU, it provides a processing speed-up of up to 235× (1.81 s vs. 7.7 ms/frame), which is necessary to meet the real-time processing needs of an always-on context-aware system.

Keywords

Always-on, portable devices, computer vision, object recognition, hardware acceleration, energy efficiency

I. Introduction

Emerging mobile applications require persistent context awareness [1]. One important form of awareness concerns what things exist in the vicinity of a device. In this paper, we present the end-to-end design of a system that provides this form of context awareness.

Devices can achieve ambient object-awareness by analyzing data from embedded sensors. The most promising of these is the camera, because images and video carry richer light-field information than other types of sensor data. Keeping cameras always on therefore allows portable devices to be constantly object-aware, which can enable several interesting applications. For instance, if flying drones can detect obstacles in their path, they can use robust navigation algorithms for efficient delivery. If dashboard or rooftop cameras in automobiles can detect pedestrians, traffic, and the like, they allow the use of the complex decision algorithms that are necessary for (semi-)autonomous navigation.

System-level challenges. Although they provide rich data, always-on cameras on portable devices present several system-level design challenges [2]. Most of these arise from the limited energy and computational resources available on portable devices. In the systems we consider, the eventual goal is to achieve context awareness through image understanding, which requires several computer vision algorithms. Foremost among these are the algorithms used for visual (object) recognition, and we focus on this specific task in our work.

Fig. 1. Traditional system models are limited by communication energy. We propose to exploit lightweight local data filtering to overcome this limitation. (Annotations in the figure assume a 1.8 mW OmniVision VGA sensor and 45 nJ/bit for Wi-Fi at 15 Mbps, 802.11a/g.)
Recent results have shown that deep neural networks have the potential to provide state-of-the-art accuracy in visual recognition [3]. These algorithms employ dynamic decision models that require large memories, high-bandwidth data links, and compute capacities of up to several giga operations per second (OPS). With their enormous potential parallelism, such algorithms are ideal workloads for acceleration on high-performance cloud clusters. They thus enable the system model shown at the top of Fig. 1 for visual object recognition, in which portable devices constantly stream camera data over a wireless link to the cloud. This approach, however, is undesirable due to the high energy cost of communication [4]. To enable real-time ambient object-awareness, there is thus a need to reduce the data transmitted from portable devices to the cloud.

Our design approach. Data reduction on portable devices can be achieved either by lowering the frame rate of always-on cameras or by performing object recognition on the device itself. Both approaches have limitations: naively reducing frame rates leads to a loss of information, while the modest capabilities of portable devices rule out full local object recognition. In this paper, we propose the improved system model shown at the bottom of Fig. 1, which overcomes both limitations. We perform low-complexity image classification on the device, while exploiting the power of the cloud for full-scale object recognition. Specifically, for the local computations, we employ a well-known algorithm for binary image classification, which allows substantial local data filtering at an energy cost that is well within the capacity of portable devices [5].

We motivate our approach through results from a well-known video-based object-recognition dataset, shown in Fig. 2(a) [6]. The figure shows that frames-of-interest (FoI), i.e., those that contain relevant objects, typically comprise only a small percentage of all frames. Thus, if we can use an algorithm X to isolate those frames, we can achieve substantial energy savings at the system level as long as the algorithm's energy, E_filter(X), is not very high. Fig. 2(b) illustrates two such potential algorithms, A and B. Compared to B, algorithm A achieves better precision and recall at the cost of higher processing energy. Suppose the FoI for a particular object is 5%. Ideally, we would want the frames transmitted (FT) to also equal 5%. Thanks to its high accuracy, suppose A is able to achieve this goal. We see from the figure that, because E_filter(A) is high, the overall energy savings are nevertheless not maximized. In fact, as E_filter(A) increases, the energy savings go down substantially. In this paper, we show how to bias a low-energy algorithm B so that it has high recall at the cost of slightly lower precision. The resulting algorithm B* allows us to obtain higher energy savings than A despite having a higher FT percentage.

Fig. 2. (a) For most objects, FoI ≤ 5% and objects stay in view for at least 4.4% of contiguous frames. (b) When E_filter is low, FT ≈ FoI can give us substantial energy savings at the system level.
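To make the trade-off in Fig. 2(b) concrete, the following sketch estimates the system-level energy savings from the FoI, the filter's recall and precision, its per-frame energy E_filter, and the per-frame transmission energy. The numbers in the example are illustrative assumptions, not measured values from our experiments.

```python
def system_energy_savings(foi, recall, precision, e_filter_mj, e_tx_mj):
    """Estimate fractional system-level energy savings from on-device filtering.

    Baseline: every frame is compressed and transmitted (e_tx_mj per frame).
    Proposed: every frame is filtered locally (e_filter_mj per frame) and only
    the frames flagged as interesting (FT) are transmitted.
    FT = true positives + false positives = foi * recall / precision (precision > 0).
    Sensing energy is common to both models and omitted for simplicity.
    """
    ft = foi * recall / precision            # fraction of frames transmitted
    e_baseline = e_tx_mj                     # per-frame cost when everything is sent
    e_proposed = e_filter_mj + ft * e_tx_mj
    return 1.0 - e_proposed / e_baseline

# Example: FoI = 5%, a biased filter with 90% recall and 15% precision,
# a 3 mJ/frame local filter, and a 66 mJ/frame transmission cost (illustrative).
print(system_energy_savings(foi=0.05, recall=0.9, precision=0.15,
                            e_filter_mj=3.0, e_tx_mj=66.0))   # ~0.65, i.e., ~2.9x reduction
```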
Given the short length of this paper, we restrict ourselves to a few key aspects of the system. The following are our major contributions:

- For the first time, we present a flexible system architecture that enables general-purpose visual (object) recognition through always-on cameras on portable devices.
- We demonstrate total system energy reductions of 2-4× (depending on FoI) by exploiting powerful algorithms in the cloud along with low-complexity classification algorithms running locally on the device.
- We present an efficient approach to building configurable biased classifiers that achieve high (≥90%) recall and modest (≈5%) precision in image classification. Our system is thus able to achieve very high (≈92%) end-to-end recognition accuracy [3].
- We present a novel hierarchically pipelined hardware architecture for image classification that is able to operate at 136 fps (depending on frame size and data complexity), substantially faster than a parallelized software implementation on a mobile CPU.
- We present synthesis results for the engine in a 45 nm SOI process, which show an average energy cost of 3 mJ/frame and 38 pJ/op, 11.4× lower than a 45 nm GPU at a comparable level of FLOPS.
- We present a new barrel convolution engine as a sub-component of the image-classification datapath. This flexible engine performs two-level vector reductions through a systolic-array operation and can be configured for different stride lengths, directions, and kernel matrices. It also lowers the memory bandwidth requirement by maximizing data reuse within the computation engine.

The rest of the paper is organized as follows. In Sec. II, we present related work. In Sec. III, we describe the classification algorithm, along with results from a software implementation on a mobile CPU that motivate the need for hardware specialization. In Sec. IV, we present the hardware architecture of SAPPHIRE, the low-energy image-classification engine that we propose. In Sec. V, we describe our experimental and evaluation methodology, followed by results at the system, architecture, and circuit levels in Sec. VI. Finally, we conclude in Sec. VII.

II. Related Work

Past research in this area has considered three distinct directions. The first addresses communication energy by exploiting compression [7]. The second optimizes the sensing energy, either by designing low-energy image sensors [8], [9] or by tuning quality parameters such as the frequency and sampling rates of existing sensors [10]. Both of these directions attempt to trade image quality for energy efficiency. The final set of approaches avoids fully offloading recognition to the cloud by performing partial computations on the device [11]; it is the most similar to our design model. In these systems, intermediate algorithmic variables (including features) are transmitted to the cloud, where the rest of the algorithm is run to completion. In contrast to these approaches, SAPPHIRE leverages the input data characteristics [low FoI and high persistence (presence of the object in the field of view); see Fig. 2(a)] to completely eliminate continuous data transmission to the cloud: data is transmitted only sporadically. We next describe details of the algorithm that enables SAPPHIRE to achieve this characteristic.
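Before diving into the algorithm, the on-device filtering loop implied by the proposed system model can be summarized in a short sketch. This is purely illustrative: the camera, classifier, and uplink objects and their methods are hypothetical placeholders, not interfaces defined in this paper.

```python
def always_on_filtering_loop(camera, classifier, uplink):
    """Client-side loop for the proposed system model (Fig. 1, bottom): every frame
    is analyzed locally; only frames flagged as interesting reach the cloud."""
    for frame in camera.frames():                  # always-on stream of frames
        score = classifier.decision_score(frame)   # lightweight image classification
        if score >= classifier.biased_threshold:   # threshold biased for high recall
            uplink.send(frame)                     # compression and Wi-Fi transmission
        # all other frames are discarded on the device, saving communication energy
```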
III. Algorithm for Low-energy Image Classification

When selecting an algorithm for image classification, it is important to note that our objective is only to use lightweight operations that can help isolate (but not recognize details of) potentially interesting frames. It is also useful if the classifier is programmable to detect any object of interest. We therefore chose an algorithm that not only performed reasonably well in the ILSVRC competition, but also had a lower computational complexity [5]. The basic algorithm is illustrated in Fig. 3. It comprises four major computational blocks, which we detail next.

Fig. 3. Lightweight algorithm used for image classification on the portable device: Harris IPD (gradients, covariance, non-maximum suppression), Daisy feature extraction (G, T, S, and N blocks) over P×P patches around interest points, Fisher-vector feature representation based on a K-cluster GMM (mean µ, variance σ, weight π), and SVM classification via a dot-product kernel against the support vectors.

Interest-point detection (IPD). This block identifies pixels in a frame that potentially contain informative features such as edges, corners, blobs, and ridges. In our case, we use the Harris-Noble algorithm, which detects corners in an image [12]. The algorithm iterates through every pixel in the image and considers the pixels within a small neighborhood (e.g., 3×3) to determine an attribute score called the corner measure. This measure is obtained from the trace and determinant of a covariance matrix derived from the image gradients of all pixels in the window of interest. A pixel is deemed a corner if its measure is the largest among all abutting pixels and if it is above a pre-specified threshold; this check is called non-maximum suppression. The process is computationally efficient and invariant to lighting, translation, and rotation.

Feature extraction. The feature-extraction step extracts low-level features from the pixels around the interest points. Typical classification algorithms use histogram-based feature-extraction methods such as SIFT, HoG, and GLOH. Since we aim for high flexibility in the classifier, we chose the Daisy feature-extraction algorithm, which allows one computation engine to represent most other feature-extraction methods depending on tunable algorithmic parameters that can be set at run time [13]. As shown in Fig. 3, this algorithm comprises four computation sub-blocks:

G-Block: The image patch is smoothed by convolving it with a Gaussian filter.

T-Block: At each pixel, gradients along the horizontal and vertical directions are computed. The magnitude of the gradient vector is then apportioned into k bins (k = 4 in T1 mode and 8 in T2 mode), resulting in an output array of k feature maps, each of size P×P.

S-Block: The feature maps from the T-block are pooled along a grid that is foveated (1r6/8s, 2r6/8s), rectangular (rect), or polar. Each spatially pooled section produces one scalar value. Thus, if there are N points on the pooling grid, an N-dimensional vector is produced for each T-block feature map. When concatenated, these vectors lead to the final S-block feature output of dimensionality D (= kN).

N-Block: The S-block features are normalized (non-)iteratively using the l2 norm to produce the final Daisy feature vector of dimensionality D.
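To make the four sub-blocks concrete, the following NumPy sketch computes a Daisy-style descriptor for a single P×P patch. It is a simplified reference model, not the engine's implementation: the Gaussian width, the k = 4 orientation bins, and the rectangular pooling grid are assumptions chosen for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def daisy_like_descriptor(patch, k=4, grid=4, sigma=1.0):
    """G/T/S/N-style local feature for one P x P patch (simplified sketch)."""
    # G-block: Gaussian smoothing of the patch.
    smoothed = gaussian_filter(patch.astype(np.float64), sigma=sigma)

    # T-block: per-pixel gradients; apportion gradient magnitude into k orientation bins.
    gy, gx = np.gradient(smoothed)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    bin_idx = np.minimum((ang / (2 * np.pi) * k).astype(int), k - 1)
    fmap = np.zeros((k,) + patch.shape)              # k feature maps, each P x P
    for b in range(k):
        fmap[b] = np.where(bin_idx == b, mag, 0.0)

    # S-block: spatial pooling over a rectangular grid of N = grid*grid cells.
    P = patch.shape[0]
    cells = np.array_split(np.arange(P), grid)
    pooled = [fmap[b][np.ix_(r, c)].sum()
              for b in range(k) for r in cells for c in cells]
    feat = np.asarray(pooled)                        # dimensionality D = k * N

    # N-block: l2 normalization of the concatenated feature vector.
    return feat / (np.linalg.norm(feat) + 1e-12)

# Example: descriptor for a random 16 x 16 patch -> D = 4 * 16 = 64.
print(daisy_like_descriptor(np.random.rand(16, 16)).shape)   # (64,)
```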
Feature representation. This block aggregates the feature vectors from all image patches to produce a vector of constant dimensionality. There are several algorithmic options for high-level feature representation, including bag-of-visual-words and Fisher vectors (FV) [14]. We chose the FV representation since it provides better classification performance, thanks to a richer Gaussian mixture model (GMM) based representation of the visual vocabulary. The GMM has K centroids, each with a parameter set {µ, σ, π} corresponding to its mean, standard deviation, and mixture proportion, respectively. To represent a frame, the gradient of the log-likelihood is computed with respect to the parameters of the model. The FV is the concatenation of these partial derivatives and describes the direction in which the parameters of the model should be modified to best fit the data. This block thus produces a global feature vector of size 2KD.

SVM classification. A simple margin-based classifier [a support vector machine (SVM), in this case] is used to detect relevant frames based on a model that is learnt offline using training data. In SVMs, a set of vectors, called support vectors, determines the decision boundary. During online classification, the global feature vector is used to compute a distance score that represents the probability with which the input belongs to a specific class. Modifying the decision boundary is thus key to biasing the classifier towards high recall, which is necessary for the reliable operation of SAPPHIRE.

A. Software implementation of the algorithm

We implemented a parallelized version of the algorithm in C# using the Task Parallel Library (TPL) provided by the .NET 4.5 framework. We evaluated the algorithm using four popular object-recognition datasets: [15], [16], PASCAL VOC [17], and CamVid [6]. For these datasets, Fig. 4 shows the trends in FT vs. FoI. The results are shown at two levels of coverage, a term we use synonymously with recall (the fraction of interesting frames that were detected). The error bars in the figure represent the variance across different objects of interest. The dotted line along the diagonal indicates the ideal value of FT (= FoI), at 90% and 79% coverage for the left and right sub-figures, respectively. We observe that without any on-device classification, FT is always 100%. With local classification, we are able to filter out 70% of the frames (averaged over all datasets) at FoI = 5%. This is quite substantial, as it translates directly into system-level energy savings (as we show in Sec. VI). We bias the SVM classifier to achieve different levels of coverage, which lets us explore the reduction in FT at lower coverage. From the figure, we see that at 79% coverage, we are able to filter out 73% of the frames at 5% FoI.

Fig. 4. FT, which is > FoI at higher coverage values, begins to approach FoI as we relax the coverage constraint on the algorithm.
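Biasing the classifier to a coverage target amounts to moving the SVM decision threshold. The following sketch picks, on held-out validation scores, the largest threshold that still meets a target recall and reports the resulting FT; it is an illustrative procedure, not the exact calibration used in our system.

```python
import numpy as np

def biased_threshold(scores, labels, target_recall=0.9):
    """Pick the largest decision threshold that still meets a target recall.

    scores: SVM decision scores on a validation set (higher = more likely FoI)
    labels: 1 for frames-of-interest, 0 otherwise
    Illustrative sketch of classifier biasing, not the paper's exact procedure.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos_scores = np.sort(scores[labels == 1])[::-1]           # descending
    n_keep = int(np.ceil(target_recall * len(pos_scores)))    # positives we must keep
    thr = pos_scores[n_keep - 1]                               # keeps at least n_keep positives
    ft = np.mean(scores >= thr)                                # fraction of frames transmitted
    return thr, ft

# Example with synthetic scores: 5% FoI, positives score higher on average (hypothetical).
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.05).astype(int)
s = rng.normal(loc=2.0 * y, scale=1.0)
thr, ft = biased_threshold(s, y, target_recall=0.9)
print(f"threshold={thr:.2f}, frames transmitted={ft:.1%}")
```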
TABLE I. Software implementation of image classification incurs a large processing delay that is unacceptable for real-time context-aware applications. [Columns: dataset (PASCAL VOC, CamVid), image size, MOPS, and time/frame (sec.); the table values are not reproduced here.]

Table I shows how the algorithm complexity varies with frame size, the number of interest points, the classifier model size, and so on. Across all datasets, we find that the mean complexity is quite low: 5 MOPS. However, the software runtime on both a desktop CPU (Core i7) and a mobile CPU (Snapdragon 800) exceeds 1.8 sec./frame on average. This latency arises because we are unable to fully exploit the inherent parallelism of the algorithm. Since such latency is unacceptable for real-time context-aware applications, we propose to accelerate the image-classification algorithm through hardware specialization, which we describe next.

IV. SAPPHIRE: Hardware Architecture

In this section, we propose a hardware-specialized engine for accelerating image classification on portable devices. We exploit the computation patterns of the algorithm to achieve significant processing efficiency. A key feature of our micro-architecture is that it can be configured to obtain different power and performance points for a given application. Thus, SAPPHIRE can easily be scaled to cater to both the performance constraints of the application and the energy constraints of the device. We next outline a few key architectural innovations in SAPPHIRE.

Fig. 5. Pipelining in SAPPHIRE: an inter-pixel pipeline (Harris IPD and the Daisy G/T/S blocks), an inter-patch pipeline (Daisy N-block and FV block), and an inter-picture pipeline (feature computation and classification).

Fig. 6. G-block datapath and SVM lanes: control registers, input FIFOs, multiply-accumulate columns operating against the Gaussian kernel, and dot-product units (DPUs) operating on the support vectors.

A. Exploiting Data Parallelism

The image-classification algorithm provides abundant opportunity for parallel processing. Since SAPPHIRE operates on a stream of frames, it is throughput limited; we therefore exploit the data-level parallelism through pipelining. An interesting feature of the algorithm, however, is that the pipelined parallelism is not available at any single level, but is instead buried hierarchically across multiple levels of the design. To exploit it, we develop a novel three-tiered, hierarchically pipelined architecture, shown in Fig. 5. A brief description of each level follows.
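To show where the three levels of pipelined parallelism live, the following conceptual sketch marks them as nested stages of the dataflow. It models only the data dependencies, not the hardware; all function arguments are placeholders.

```python
def classify_stream(frames, detect_interest_points, extract_local_feature,
                    fisher_vector, classifiers):
    """Conceptual dataflow showing SAPPHIRE's three pipeline levels (placeholders only)."""
    for frame in frames:                                    # inter-picture pipeline:
        # feature computation for frame i can overlap classification of frame i-1
        local_features = []
        for patch in detect_interest_points(frame):         # inter-patch pipeline:
            # G/T/S processing of patch j can overlap N-block/FV work on patch j-1
            local_features.append(extract_local_feature(patch))  # inter-pixel pipeline inside
        global_feature = fisher_vector(local_features)
        yield [clf(global_feature) for clf in classifiers]   # one score per object class
```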