23 Apr Ground Truth Is Greater Than AI
AI as a term is more than 60 years old, and now that industry has finally obtained the computational power required to take advantage of it, the terminology surrounding AI is increasingly misused and misunderstood. According to Splunk, the differences between Weak AI (Siri or Alexa) and Strong AI (systems that learn and adapt) are deliberately muddied to insinuate that advancements in the easier technology reflect advancements in the more difficult one. The number of people needed to support algorithmic systems (Amazon’s Next Gen Stats football analytics system, for example) is grossly underreported, if it is mentioned at all. And while you no longer need an army of Ph.D. candidates to operate many AI systems, the algorithm development industry has a history of creating programs that are difficult both to get customer data into and to get trusted answers out of. Industry (government and other) needs actionable intelligence in order to take action. We need better tools to help produce actionable intelligence from the oceans of data we’re swimming in. Actionable intelligence is nothing more than knowing the implication of a particular set of data points and assessing how best to move forward. The whole assessment process starts with the data.
With data collection sensor development outpacing the world’s ability to review the data it collects, discerning the actionable intelligence from the useless is more difficult than ever before. If we’re to focus on finding the correct data so correct meaning can be assessed, we’re going to need to develop automation tools to help sort through the mountain looming before us. Using Machine Learning (ML) as a method to help analyze data is a really good place to start, especially if the problem set you’re attempting to analyze involves unstructured data (pictures and video). Don’t get me wrong: any foray into AI may be fraught with money pits and roadblocks, but there are things the educated customer can do to minimize those problems. In its most basic form, the ML process is one where you take your data and look at it (label), teach a computer to look at it (algorithm), have it recognize a pattern in it (model), and then have the computer apply that pattern to the rest of your data (operation), giving you metrics and analytics on what the computer found. It sounds pretty easy, but as a whole, the AI development industry is only paying lip service to the first part: labeling the data. A cottage industry of data labeling service providers has sprung up to support the developers and researchers designing algorithms, but nothing exists to support the customer buying the ML. Customers are expected to buy the algorithm made by the developer and hope that it works on their data set. If the ML model is only as good as the ground truth used to train it, the customer has a huge stake in requiring the AI developer to use the customer’s own data during development. At present, algorithm developers are using whatever labeled data sets they can find during development. At Orions, we’ve developed a way for customers to label their own unstructured data sets to facilitate smarter AI development.
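To make the four steps concrete, here is a minimal sketch of label, algorithm, model, and operation in plain Python. It is a toy, not how any production system (ours included) is built: the two numeric features per clip, the "delivery" and "bird" labels, and the nearest-centroid method are all assumptions chosen only to keep the example short.

```python
from collections import Counter, defaultdict
from math import dist

def train_centroids(labeled_examples):
    """Algorithm step: average the feature vectors that share a label."""
    groups = defaultdict(list)
    for features, label in labeled_examples:
        groups[label].append(features)
    # The returned dict of label -> mean feature vector is the "model".
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }

def predict(model, features):
    """Operation step: pick the label whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], features))

# 1. Label: a person looks at the data and tags it.
#    Hypothetical features: (object size, time spent in frame), scaled 0..1.
ground_truth = [
    ((0.9, 0.8), "delivery"),
    ((0.8, 0.9), "delivery"),
    ((0.1, 0.2), "bird"),
    ((0.2, 0.1), "bird"),
]

# 2 and 3. Algorithm and model: learn a pattern from the labeled data.
model = train_centroids(ground_truth)

# 4. Operation: apply the pattern to the rest of the data and summarize.
unlabeled = [(0.85, 0.75), (0.15, 0.05)]
predictions = [predict(model, features) for features in unlabeled]
summary = Counter(predictions)
print(predictions, dict(summary))
```

Everything the model knows comes from the four labeled rows in `ground_truth`. Swap in labels from someone else's data and the centroids move with them, which is the point of this whole piece.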
One of the more basic AI tools can be applied to many use cases and can be used by many different customers. Take motion detection in a video camera, for instance. It’s pretty simple and it doesn’t need to be specifically trained on my own data to work for me. A simple computer program monitors the pixels in my doorbell camera feed, and if the pixels change, that must indicate motion, and I get an alarm. Some simple filters can be applied to make it look at groups of pixels over time (so it doesn’t count the slow movement of shadows as the sun tracks across the sky as motion) to make it more applicable, but that’s about it. Very simple, and practically useless. Any security expert will tell you that systems such as these are full of false positives. For instance, if I want to be notified by my camera automatically when someone is leaving a package at my door, I set a motion alarm. The same alarm sends me an alert whenever a bird hops in my front yard. There are things I can do to reduce the number of edge cases (false alarms on hopping birds) I encounter, but with every (expensive) change I make to the system, I am more likely to increase the number of false negatives (failures to detect the event I actually care about). Typically, using generic data to build AI ends up producing a system that doesn’t fill either requirement satisfactorily. It doesn’t catch all the deliveries and it doesn’t ignore all the birds. If I pay a ton more for the AI developer to capture and label my specific doorbell camera feed and use that data to refine the generic AI I bought in the first place, I will see some improvements, but it’s often not worth the price. There are simply too many edge cases when trying to turn my motion detection algorithm into a package-delivery-detection algorithm. Eventually, I turn the notifications off, thinking to myself that the technology just isn’t ready, and use my system only to investigate history (stolen packages from my front porch).
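The trade-off is easy to reproduce. The sketch below is a bare-bones frame-differencing detector run against three made-up 10x10 grayscale "frames". The frame contents, the `pixel_delta` of 30, and the three thresholds are illustrative assumptions, not values from any real doorbell camera.

```python
def changed_fraction(prev, curr, pixel_delta=30):
    """Fraction of pixels whose brightness moved by more than pixel_delta."""
    total = changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total

def motion_alarm(prev, curr, min_fraction):
    """Alarm when at least min_fraction of the frame has changed."""
    return changed_fraction(prev, curr) >= min_fraction

def with_blob(frame, top, left, size, value=200):
    """Copy of frame with a bright size x size square drawn on it."""
    out = [row[:] for row in frame]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = value
    return out

background = [[100] * 10 for _ in range(10)]

# (frame, is_delivery) pairs: hypothetical events seen by the camera.
events = [
    (with_blob(background, 4, 4, 3), False),  # bird close to the lens: 9% of pixels
    (with_blob(background, 0, 0, 2), True),   # courier at the far gate: 4% of pixels
    (with_blob(background, 2, 2, 6), True),   # courier at the door: 36% of pixels
]

def evaluate(min_fraction):
    """Count false positives and false negatives at one alarm threshold."""
    false_pos = false_neg = 0
    for frame, is_delivery in events:
        alarm = motion_alarm(background, frame, min_fraction)
        if alarm and not is_delivery:
            false_pos += 1
        if not alarm and is_delivery:
            false_neg += 1
    return false_pos, false_neg

results = {t: evaluate(t) for t in (0.02, 0.05, 0.10)}
for threshold, (fp, fn) in results.items():
    print(f"threshold={threshold:.2f} false_positives={fp} false_negatives={fn}")
```

Because the nearby bird changes more pixels than the distant courier, no single threshold separates the two. Raising it silences the bird only after it has already started missing deliveries, and pixel counts alone cannot tell the detector which event matters to me.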
What I have found is this truth: as you increase the specificity and nuance required of any AI system, better training data is required from a methodology that scales. That training data, or Ground Truth Library (GTL), must be based on a customer’s own data and must be labeled in a way that matters to them.
Turning the AI development model over to allow businesses to organize, label, and index their own data for GTL production constitutes a shift in the way automation tools are currently developed. When purchasing an algorithm, a customer isn’t helping to train it or working with their technology developer hand-in-hand to define the model. They just want to buy something that works. If the world of customized AI development were perfect, a customer would give their data, in whatever form they have it, to their technical development partner, who would analyze it to determine the best model development pathway. The tech partner would label the customer data appropriately, using terminology relevant to that customer’s pain points, and use the resulting GTL to train a model that could then be applied to future datasets. The GTL would be retained for refinement of the edge cases met during operation, could be used for other purposes such as training or ML development competitions, and would be forever owned by the customer. Since the AI would be developed from the customer’s own data, there would be no interoperability issues in putting new data through the model. While the world isn’t perfect, acknowledging that the operators, not data scientists, are the ones with the knowledge to label these libraries is a step toward understanding the next toolset to develop.
AI City is Orions Systems’ toolset designed to help the operator solve their own data problems. A platform-based approach to facilitating the use of unstructured data, it is the world’s first patented AI Ops solution specializing in the usage of video to accelerate the development of customer-specific AI capabilities. The platform serves as an underlying “operating system” upon which tools can be built to solve specific challenges, providing the infrastructure needed to render ground truth training data from video and provide it for algorithm development and testing. Without Ground Truth Libraries there can be no ML, so it is of the utmost importance you develop your ground truth correctly! If you would like to learn more about how to add value to your data, contact us.
Gabe is the VP of Product Development at Orions Systems, a pioneer in the development of smart vision systems for government, sports, law enforcement, and anyone attempting to use unstructured data as a first-class data type. To see how Orions Systems is tackling these, and many other, video-related challenges, visit http://www.orionssystems.com.