29 Mar We Got 99 Video Analysis Problems but a Lack of Content Ain’t One
Seriously Though, It’s Paramount We Differentiate Between Them
Just like the oft-confused difference between Strong and Weak AI, the video problems within the greater intelligence community (and anywhere else video is used to enhance decision making) complicate the successful analysis (automated or manual) of that video and, while related, are completely separate from each other. Improving Alexa’s ability to understand speech and improving Computer Vision algorithms to identify stop signs in self-driving cars are two separate things. They may be addressed simultaneously, but the problem sets presented by each challenge must be tackled individually. The same rules apply to analyzing video. In a single dataset, there is a plethora of problems the engineering teams must consider when fulfilling a customer need, and each one must be engaged individually. There are definitely AI solutions on the horizon to address, in whole or in part, each of these problems, but the longer we fail to separate the root issues of these pain points from one another, the more difficult it will be to correctly solve any problem related to extracting actionable intelligence from video or any other unstructured dataset. Consider the problems of resolution, location, and object identification. Solving these three would save the most time and money in video analysis, but each must be approached differently.
Using “resolution”, a loaded and misunderstood term in and of itself, as a lead-off to explain “video analytics” simply underscores the point that the state of the industry does not match the societal understanding of what is possible. This leads to great confusion amongst any decision-makers or businesses attempting to get their hands around video-related AI analysis tools. Resolution is variously used to describe the number of pixels per given area of the camera capturing the video, the bitrate (the amount of data flowing over a given time period), or a limitation imposed by the bandwidth available for internet-based streaming. The more appropriate term for use in intelligence gathering is “interpretability.” Interpretability is a measurement of what can be observed in the video. It depends not only on the clarity, sharpness, and zoom level of the camera (traditional measures of resolution), but also on the frame rate captured. Sometimes, a picture is worth a thousand words. A single, perfectly zoomed still from a security camera showing a suspect committing a crime can be the critical piece of evidence needed for a conviction. In that case frame rate does not matter; the single image is all the information needed. Other times, a constant measurement of fuzzy dots of two different sizes moving along a highlighted route is all that is needed to detect people (the smaller fuzzy dots) and vehicles (the larger fuzzy dots) traveling from one area to another along a road (the highlighted route). You don’t need to “see” it to know from context and motion what actions are taking place. Without a steadily updating stream of frames, all the observer would see is pictures of fuzzy dots. With a source whose frame rate is sufficient to show flow, it doesn’t matter if you can’t tell the different car models apart. The frame rate, not the resolution, reveals the action taking place.
Depending on the question being asked, analysts need to consult different video sources that are applicable to their problem sets. Herein lies problem 1: there is no way to scale the system currently used to apply interpretability ratings.
The second problem is one of location. Only in fixed security cameras and in the highest-priority drone flights are the latitude and longitude of the footage known within the video file. The first issue with video location data is that it is difficult to extract and all but impossible to index. Most times, the drone-related information (location, heading, speed, tail number, pilot name, etc.) is hidden in a KLV track. This Key-Length-Value track runs concurrently with the video track in any video file. Much like the audio track, the KLV track “plays” alongside the video track in time, but there is no commercially available speaker that can “play” the KLV track the way one plays the audio track. Additionally, the work in this area is further complicated by the fact that there is no standardized way across any industry to structure a KLV track, making most solutions unable to scale between videos. The other issue with location is that unstructured data is not indexed. If any data scientists are reading this, they are scoffing right now, thinking, “No kidding! That’s what makes it unstructured!!!” The bottom line is there’s no way to search for a scene in a movie that hasn’t been rendered searchable manually by a human being. Given a video file (any .mp4 or other format) containing a known location or object of interest, there’s no way to search for that location or thing without first manually applying a searchable database model of some sort to the video. Even then, the search results only try to show the timestamp within the video at which the desired location or object appears. The user must then fast-forward or rewind to that spot within the video where the desired content, hopefully, appears. The inability to search for relevant location-based footage is hugely detrimental to the intelligence community. The current stores of video are similar to a giant cardboard box marked “Afghanistan Movies” containing poorly named video files such as “DCIM123.mp4”.
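As a rough sketch of what reading a KLV track involves, the Python below walks a byte buffer of key-length-value triplets. It assumes fixed 16-byte keys and BER-encoded lengths, a layout associated with SMPTE ST 336. That layout, the function names `read_ber_length` and `iter_klv`, and the premise that the track has already been pulled out of the video file into a `bytes` buffer are all assumptions made for illustration; a real feed may be structured differently, which is exactly the scaling problem described above.

```python
from typing import Iterator, Tuple

KEY_SIZE = 16  # assumption: every key is a fixed 16-byte label


def read_ber_length(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a BER length starting at pos; return (length, next_pos)."""
    if pos >= len(buf):
        raise ValueError("missing length field")
    first = buf[pos]
    if first < 0x80:
        # Short form: the byte itself is the length (0..127).
        return first, pos + 1
    # Long form: the low 7 bits say how many length bytes follow.
    count = first & 0x7F
    if count == 0:
        raise ValueError("indefinite lengths are not handled in this sketch")
    end = pos + 1 + count
    if end > len(buf):
        raise ValueError("truncated length field")
    return int.from_bytes(buf[pos + 1:end], "big"), end


def iter_klv(buf: bytes) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, value) pairs from a buffer of back-to-back KLV triplets."""
    pos = 0
    while pos < len(buf):
        if pos + KEY_SIZE > len(buf):
            raise ValueError("truncated key")
        key = buf[pos:pos + KEY_SIZE]
        length, pos = read_ber_length(buf, pos + KEY_SIZE)
        if pos + length > len(buf):
            raise ValueError("truncated value")
        yield key, buf[pos:pos + length]
        pos += length
```

Each yielded value is still raw bytes. Turning it into a latitude, heading, or tail number requires knowing how that particular producer packed its fields, and that is the part that does not transfer from one video source to the next.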
When an analyst wants video of a particular bridge in a particular city in a particular province of the country, there’s not even a clean way to find out which file is the right place to begin the search. Being from Seattle, I find a fitting metaphor in searching online maps for a Starbucks without the ability to specify “nearby” and having to figure out which of the over 250 Seattle-based shops fulfills my requirement. Without the ability to narrow the field based on location, every search returns an impossible number of results to analyze. This is the issue enveloping problem 2: almost any sane analyst will ignore the historical database as completely unusable and a waste of time.
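To illustrate what “nearby” buys you once each clip carries a location, here is a minimal Python sketch that filters a catalog by great-circle distance. The `Clip` record, the `nearby` function, the file names, and the coordinates are all hypothetical; the sketch assumes a single latitude/longitude has already been attached to each file, which is precisely the step that is missing today.

```python
import math
from typing import List, NamedTuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


class Clip(NamedTuple):
    filename: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(catalog: List[Clip], lat: float, lon: float, radius_km: float) -> List[Clip]:
    """Return clips within radius_km of (lat, lon), closest first."""
    scored = [(haversine_km(lat, lon, c.lat, c.lon), c) for c in catalog]
    scored.sort(key=lambda pair: pair[0])
    return [clip for dist, clip in scored if dist <= radius_km]


# Made-up catalog entries for illustration only.
catalog = [
    Clip("DCIM123.mp4", 34.5, 69.2),
    Clip("DCIM124.mp4", 31.6, 65.7),
]
hits = nearby(catalog, 34.52, 69.18, radius_km=25.0)
```

A linear scan like this is fine for a shoebox of files; a real archive would want a spatial index, but the point stands that the filter is trivial once the location exists as structured data.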
The final problem of using video is related to the location issue outlined above but stands alone as a separate issue entirely. The problem of finding specific content within the video file is exhausting by itself, much less trying to address the separate problem of returning an analyst to previously identified content. To simplify: analysts aren’t directed to the place in the video where the information is located; they are given a text answer because it’s easier. It boils down to the difficulties in indexing unstructured data. For example, when interacting with a DVD, the scene selector menu is a hard-coded solution applicable only to that one particular DVD. The scene breaks are catalogued and mapped to specific I-frames within the video, and in/out points are then mapped to those I-frames. If a user wants to jump to a particular scene, they are essentially jumping to a known I-frame in the video and pressing the play button from that point in time. At present, it is very difficult to scale this functionality to anything resembling the specificity required by the business or intelligence communities hoping to use video as an intelligence source. Very few companies even attempt to deliver to-the-frame indexing/labeling of video data, and the vast majority of those that do provide a database of the information contained in the video rather than the video data itself. It is simply more feasible to return a report of text, an easily searchable document format, than to return the user to the video itself. This is how most analysts work with video today. While watching a video, they keep a text/chat file open where they note the activity observed and the time within the video at which it took place. Typically, the information in the chat/notes is all that is used in making assessments, as it is too difficult to return to the point within the video where the action occurred, even if the video evidence would be more compelling.
Put plainly, this third problem is one of content. Specifically, the problem is one of desired content. To continue the above coffee example, the ability to search through the library for all video containing a Starbucks storefront is something analysts don’t possess in any meaningful, scalable way, so they don’t even attempt it. At present, the state of the art is a report showing there is a Starbucks storefront at location X and possibly a still frame of the building. Even if it would be more powerful for decision makers to see video of the location(s) in question, the difficulty of actually presenting video from the past precludes this in most use cases.
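The DVD scene-selector idea described above can be sketched in a few lines of Python: given the sorted timestamps of a video’s I-frames, snap each scene start (or each timestamp from an analyst’s notes) to the closest I-frame at or before it. The function names and the sample timestamps are hypothetical, and the sketch assumes the I-frame times have already been extracted from the file by some other tool.

```python
from bisect import bisect_right
from typing import Dict, List


def keyframe_at_or_before(keyframes: List[float], t: float) -> float:
    """Return the latest keyframe time <= t. keyframes must be sorted ascending."""
    i = bisect_right(keyframes, t)
    if i == 0:
        raise ValueError("no keyframe at or before the requested time")
    return keyframes[i - 1]


def build_scene_menu(scenes: Dict[str, float], keyframes: List[float]) -> Dict[str, float]:
    """Map each scene name to the keyframe playback should start from."""
    return {name: keyframe_at_or_before(keyframes, start) for name, start in scenes.items()}


# Hypothetical I-frame times (seconds) and analyst-noted events.
keyframes = [0.0, 2.0, 4.0, 6.0, 8.0]
notes = {"vehicle stops at bridge": 5.3, "two people exit": 7.9}
menu = build_scene_menu(notes, keyframes)
```

The lookup itself is cheap. What makes the DVD menu hard to scale is that someone, or something, still has to produce the `notes` mapping for every video in the archive.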
These three problems are all related, but they all stand alone. Marking groupings of video as useful or not (resolution) is different from marking each video as useful or not (based on location), which is also different from finding the useful demarcations (content-based objects of interest) within the video. All joking aside, it is a matter of developing a scalable way to index the entire file based on interpretability rating, developing a scalable way to index sections within each file based on location data, and developing a scalable way to index objects of interest down to the frame level. While a system meeting all these requirements would be a video analytics capability unicorn, as industry moves forward it is important to address each of these challenges individually.
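One way to picture the three separate indexes is as three levels of one record: a rating on the whole file, locations on sections of the file, and labels on individual frames. The Python sketch below is only an illustration of that separation; the class names, field names, and the integer rating are hypothetical and do not describe any existing system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Section:
    start_frame: int
    end_frame: int  # inclusive
    lat: float
    lon: float


@dataclass
class ObjectHit:
    label: str
    frame: int


@dataclass
class VideoRecord:
    filename: str
    interpretability: int  # file-level rating; the scale is left unspecified here
    sections: List[Section] = field(default_factory=list)   # location index
    objects: List[ObjectHit] = field(default_factory=list)  # frame-level index

    def section_for(self, frame: int) -> Optional[Section]:
        """Return the located section containing frame, if any."""
        for s in self.sections:
            if s.start_frame <= frame <= s.end_frame:
                return s
        return None

    def frames_with(self, label: str) -> List[int]:
        """Return the sorted frame numbers where label was marked."""
        return sorted(h.frame for h in self.objects if h.label == label)


# Made-up record for illustration only.
record = VideoRecord(
    filename="DCIM123.mp4",
    interpretability=5,
    sections=[Section(0, 899, 34.5, 69.2), Section(900, 1799, 34.6, 69.3)],
    objects=[ObjectHit("storefront", 1200), ObjectHit("storefront", 450)],
)
```

Each level can be filled in, queried, and improved independently of the other two, which is the practical meaning of treating them as separate problems.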
To see how Orions Systems is tackling these, and many other, video-related challenges, contact Gabe Harris or visit http://www.orionssystems.com. Gabe is the VP of Product Development at Orions Systems, a pioneer in the development of smart vision systems for government, sports, law enforcement, and anyone attempting to use unstructured data as a first-class data type.