Technology•Jun 20, 2019
Lessons Learned From Prototyping With Augmented Reality and Machine Learning
A few months ago, we had a meeting with the AR/VR Special Interest Group at Credera to discuss potential projects that could be relevant for our clients. While we recognized that a typical enterprise use case for augmented reality (AR) is displaying a virtual catalog in the user’s space, we wanted to do something different. With that in mind, we decided to prototype an iOS app for the energy industry that would be able to:
Detect energy-related objects through machine learning (e.g., light switches, lamps, and electronics).
Place an augmented reality icon near the detected object.
Allow the user to tap the icon and pull up energy-saving tips related to the object.
For example, if users pointed their phones at a computer, an icon would display that, when tapped, would indicate that turning off a computer when it’s not being used is an easy way to save on energy costs. Ultimately, our app could help users lower energy costs and usage.
The project was both ambitious and challenging, and we learned a lot about the abilities and limitations of ARKit and Apple’s machine learning libraries.
How We Approached the Problem
To kickstart our efforts, we used an open-source project that combined Apple’s Core ML and ARKit as a template. Since both Core ML and ARKit are well-established, we didn’t need to modify much. Core ML, Apple’s native machine learning library, can readily run a model supplied to it, and displaying an object in augmented reality is a solved problem that ARKit already handles well. The interesting decisions left for us were:
Which machine learning model to use?
Where and when to place the icons in augmented reality?
Which Machine Learning Model to Use:
A key component in the success of any app using machine learning is the quality of the model. If the dataset isn’t representative of the predictions that need to be made, you could end up with inaccurate predictions. Knowing this, we wanted to make sure the model we chose made sense for our use case of detecting energy-related household objects.
VGG16: For our prototype, we initially selected VGG16, a pre-trained model that Apple had converted to the Core ML model format. While this model was effective at detecting objects, it was about 550 MB in size. Because the model had to be packaged with the app, the app also ended up being several hundred megabytes, much too large for a simple iOS app. Additionally, the pre-trained model could detect hundreds more objects than our app needed.
Google’s Cloud Vision API: We also considered using Google’s Cloud Vision API to analyze the user’s view and detect objects. While this would have significantly decreased the size of the packaged app, we decided that the latency and data costs associated with remotely analyzing a stream of images were too high.
Custom Model: Since predefined models were too large and remote processing would be too slow, we decided to create our own model. We chose to supply the model only with images of the objects we were interested in detecting, enough to give the system a high degree of confidence without bloating the size of the app. We took pictures of the objects we wanted to identify from around our office and trained the model using Apple’s Create ML.
We then polled the Core ML framework to identify objects in the scene using our custom-trained model. If Core ML identified an object with a confidence greater than 90%, we placed an energy-saving tip in the scene automatically. This way, objects were detected automatically instead of requiring the user to tap the screen to scan the environment.
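The 90% rule above can be illustrated with a minimal sketch. It is written in Python purely as a language-neutral illustration, not as the Swift code the prototype used. The `Prediction` type and the `best_confident_prediction` function are hypothetical names introduced here, and the sketch assumes the classifier reports a label and a confidence between 0.0 and 1.0 for each frame.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.90  # the 90% cutoff described in the article


@dataclass(frozen=True)
class Prediction:
    """Hypothetical container for one classifier result."""
    label: str         # class name reported by the classifier
    confidence: float  # score between 0.0 and 1.0


def best_confident_prediction(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Return the highest-scoring prediction if it is strictly above the
    threshold; otherwise return None so that no icon is placed."""
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p.confidence)
    return top if top.confidence > threshold else None


# Example: results for one polled camera frame (made-up values).
frame = [Prediction("lamp", 0.95), Prediction("laptop", 0.03)]
match = best_confident_prediction(frame)
if match is not None:
    print(f"Place a tip icon for: {match.label}")
```

Returning `None` for low-confidence frames keeps the polling loop simple: the caller only acts when there is something worth showing.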
Where and When to Place the Icons in Augmented Reality:
There were also a number of questions we had to answer when determining where to place an icon in augmented reality:
If there were multiple detected objects close together, how much room should there be between their respective icons?
If we detected another instance of the same object, should we show the icon again or suppress it?
If we suppress it, how long before we show the icon again?
To us, these were design questions; although they affected usability, the implementation effort was similar no matter which option we chose. For our prototype, we decided to place icons one meter apart from each other and only show an icon once per instance of an object.
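One possible reading of these two rules can be sketched as follows, in Python as a language-neutral illustration rather than the prototype's actual Swift code. The `IconPlacer` class is a hypothetical name, positions are assumed to be (x, y, z) coordinates in meters, and the sketch makes a simplifying assumption: a new detection that lands within one meter of an existing icon is treated as the same instance (or as too crowded) and is suppressed.

```python
import math

MIN_ICON_SPACING_M = 1.0  # icons are kept one meter apart, per the article


class IconPlacer:
    """Hypothetical helper that remembers where icons were placed."""

    def __init__(self, min_spacing=MIN_ICON_SPACING_M):
        self.min_spacing = min_spacing
        self.icons = []  # list of (label, (x, y, z)) tuples, in meters

    def try_place(self, label, position):
        """Record an icon at `position` unless an existing icon is closer
        than `min_spacing`. Returns True if the icon was placed."""
        for _, existing in self.icons:
            if math.dist(existing, position) < self.min_spacing:
                return False  # same instance seen again, or too crowded
        self.icons.append((label, tuple(position)))
        return True


placer = IconPlacer()
placer.try_place("lamp", (0.0, 0.0, -1.0))    # placed
placer.try_place("lamp", (0.2, 0.0, -1.0))    # suppressed: 0.2 m away
placer.try_place("laptop", (1.5, 0.0, -1.0))  # placed: 1.5 m away
```

Folding both rules into a single distance check keeps the bookkeeping small, at the cost of not distinguishing two different objects that sit less than a meter apart.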
Machine Learning (Core ML)
After taking pictures and creating a model, we found that we obtained reasonable results with a small dataset. The resulting file size was 115 kB, much better than the roughly 550 MB for the pre-trained model. Because of our small dataset, however, false positives were an issue. Objects were being identified as something else entirely, and with a high degree of confidence. The system was trying to identify everything it saw, but only had a small sample of objects to choose from. One possible solution that we didn’t have time to implement was to train the model with images of objects that we weren’t interested in recognizing. We could allow the system to identify an object and simply not place an icon for that object. While this would increase the size of the app, it would still be much smaller than using the pre-trained model.
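The unimplemented idea of training on uninteresting objects and then ignoring them could look roughly like the sketch below. It is Python used as a language-neutral illustration, not code from the prototype. The `TIPS` table and `tip_for` function are hypothetical, the computer tip paraphrases the example given earlier in this article, and background labels such as "chair" are invented for illustration.

```python
CONFIDENCE_THRESHOLD = 0.90  # same 90% cutoff the prototype used

# Only labels we care about have a tip. Background classes that the model
# was also trained on (e.g. a hypothetical "chair") are deliberately absent.
TIPS = {
    "computer": "Turn off your computer when it is not being used.",
}


def tip_for(label, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Return the tip for a confident detection of an object of interest,
    or None for low-confidence results and ignored background classes."""
    if confidence <= threshold:
        return None
    return TIPS.get(label)


print(tip_for("computer", 0.97))  # the computer tip
print(tip_for("chair", 0.99))     # None: recognized, but no icon is shown
```

The point of the extra classes is that a chair can now be classified as a chair instead of being forced into the nearest energy-related label.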
Additionally, although our model could correctly identify an object, ARKit didn’t provide us with information about where the object was in the scene. If the model detected a laptop in the corner of the display and we wanted to show an icon, we had no choice but to place the icon in the middle of the display. This made results look worse than they were: correctly identified objects appeared to be misidentified because the icon was placed poorly. More robust APIs, such as Google Cloud Vision, can identify where objects are within a picture, but they are designed to handle static images rather than a stream of images.
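To make the placement limitation concrete, here is a small sketch of the difference location data would make. It is Python used as a language-neutral illustration. The `icon_screen_point` function is hypothetical, and the bounding box is an assumption: the prototype had no such data, so it always took the fallback branch. The box is assumed to be normalized (x, y, width, height) with values between 0 and 1 and the origin at the top-left of the view.

```python
def icon_screen_point(viewport_size, bounding_box=None):
    """Return the (x, y) screen point, in pixels, at which to anchor an icon.

    With no location information, the only available choice is the center
    of the view. With a normalized bounding box, the icon can be anchored
    at the center of the detected object instead.
    """
    width, height = viewport_size
    if bounding_box is None:
        return (width / 2, height / 2)  # what the prototype had to do
    x, y, w, h = bounding_box
    return ((x + w / 2) * width, (y + h / 2) * height)


viewport = (400, 800)
print(icon_screen_point(viewport))                        # center of the view
print(icon_screen_point(viewport, (0.5, 0.5, 0.5, 0.5)))  # lower-right object
```

A screen point is only the first step: it would still need to be projected into the 3D scene, which is where the depth issues described in the next section come in.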
Augmented Reality (ARKit)
We found that the hardware the app was running on significantly changed the accuracy of the placement and tracking of the AR icon. On older devices, icons would often be placed farther or closer than the object they were supposed to be on top of and would sometimes float away in space as you stared at them. Newer devices, particularly those with two cameras, maintained the position of the icons more reliably, even when turning away and looking back at them.
Without the additional sensors built into more advanced hardware like the Microsoft HoloLens, mobile phones have a limited ability to understand the physical space around them. Phones with two cameras gave better results because they can measure depth more reliably and view the scene from different angles. Lighting conditions in the room also greatly affected the ability of ARKit to correctly place icons in 3D space.
Given only the phone’s camera image as input, ARKit did not reliably return the correct distance at which to place objects, and they would often be placed much farther or closer than the real-world object they were supposed to sit on top of. ARKit also had trouble maintaining the state of the scene when it was not visible: placed objects would often disappear, fly away, or move significantly when not directly in view.
What stood out to us the most about working with Apple’s ARKit framework is how robust the tooling and capabilities are and how relatively easy it was to get started with development. With the ability to toggle statistical and debug options such as feature point and world origin scene icons, it’s relatively easy to troubleshoot scene positioning issues and identify object inconsistencies.
ARKit is supported on iOS devices all the way back to the iPhone 6s and 6s Plus, so several generations of iPhone users can engage in AR experiences. However, while the framework is developer friendly and most consumers have access to these features, the true challenge in creating a successful AR experience lies in finding a creative way to engage these users through a new medium.
How Do We Recommend Using the Technology?
We recommend focusing on the Core ML framework to create an augmented reality experience through image recognition rather than relying heavily on 3D rendered objects in a scene. For our use case, ARKit lacked reliable and consistent tracking for an AR scene across a variety of surfaces. If AR scene placement is a critical part of your next application, we recommend limiting usage to well-defined, flat-plane tracking. For example, “Amazon AR View” and “IKEA Place” use AR flat-plane tracking to place a 3D rendering of a product in a real-world space, enabling users to manipulate and walk around the object.
Where Can We Take This Technology in the Future?
As more and better cameras are added to iOS and Android devices, detection and tracking of multiple surfaces will improve and allow for more capability. Improvements in scene depth computation and accelerometers will lead to more realistic interactions in the environment through object occlusion, that is, hiding part of a virtual object behind a real-world one.
Additionally, we can look forward to improvements to augmentation of real-world objects. Some of these concepts already exist, such as movie posters animating off of the wall into 3D space and applying apparel to your body. However, many of these applications are relatively low quality in terms of graphics and lighting, so as devices improve in computational rendering power, so will the realism of these experiences.
Lastly, large-scale object placement with robust stabilization and tracking is a continued need for multi-user AR applications. Examples would be room size collaboration spaces and environments that completely surround the user with a jitter-free experience.
Although support for high-quality AR applications is minimal at the moment, we are hopeful that the devices we carry will soon improve enough to provide more immersive and seamless experiences that bring the digital world into the physical world.
If you have an AR/VR or machine learning idea that you are hoping to prototype, we would love to help talk it through with you. Feel free to reach out to us at firstname.lastname@example.org.