Leveraging Machine Vision to Evaluate Social Distancing

Matt Patterman

As the world plunged into isolation amid fears of a global pandemic, countries were quick to implement the now ever-present policy of ‘social distancing.’ This term describes the various methods we have at our disposal to create space between ourselves and those around us. While the general population is now aware of these methods, a growing number of individuals question whether the average person actually practices social distancing. In a recent study, 39.8% of participants reported that they did not comply with self-isolation practices. To measure the true effectiveness of social distancing, we need a way to accurately and quickly count the number of people in a space. Machine vision, a technology that provides imaging-based automatic inspection and analysis, is a compelling, efficient solution to this problem.

In this blog post, I will discuss how machine vision can be used to count crowds and gauge social distancing. I’ll cover what you need to get started with machine vision, examine machine vision shortcomings, explain how crowd counting is done in the industry today, and provide an example of an effective crowd counting model used to evaluate social distancing.

Figure 1: Our model measurements of the number of people in select images of Times Square over five months.

Getting Started: Gathering Images to Train Models

To use machine vision to measure social distancing, the first step is to gather images you can use to train a model. For this task, I gathered images of crowds from several sources, including this sample image of Times Square.

You can find sample images by checking online resources such as EarthCam or downloading academic image sets used to train models for research papers. Once you’ve gathered images, you can then test different methodologies for your task.

Machine Vision Shortcomings: Applying Facial Detection to Crowd Counting

One of the most public and easy-to-use examples of machine vision is facial detection. A quick Google search turns up hundreds of code samples and video tutorials that can help someone set up a facial detection or recognition system in minutes. Tools such as OpenCV and YOLO allow users to detect and recognize faces using pre-trained, high-fidelity models and achieve near state-of-the-art results.

So why not construct a crowd counting model using one of these models? While these models can be extremely good at detecting various objects and human faces, our use case, in which we need to count the number of bodies in a space, introduces limiting factors that these facial detection solutions do not account for.

Accurate crowd counts require a model very different from the one behind the facial recognition software on your phone. Facial and object detection systems generally require well-lit, front-facing, relatively high-definition images of faces to function. Here is an example of a common model, trained to recognize full human bodies in the same way models are trained to recognize faces, trying to count the subjects in Times Square.

Figure 2: Our image of Times Square with an OpenCV full body detection model. The model fails to correctly identify almost every person in the scene and gets a total count of 12 people. This model is effectively useless in our scenario.

As shown in this image, the model fails to identify the majority of people in Times Square. It identifies 12 subjects and only one of these is correct. This is due to poor lighting conditions, people being obstructed by others or their environment, and low resolution. In short, our model was not designed to perform well in our scenario. While facial or object recognition is not an effective method to evaluate social distancing, other machine vision models can provide better results.
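For readers who want to reproduce this kind of baseline, here is a minimal sketch. The post does not say which OpenCV detector produced Figure 2, so the bundled full-body Haar cascade used here is an assumption, as are the function names and the `scaleFactor` and `minNeighbors` values. The `precision` helper simply restates the result above: 1 correct detection out of 12.

```python
def detect_people(image_path):
    """Return (x, y, w, h) boxes from OpenCV's bundled full-body Haar cascade.

    Assumption: Haar cascade detector; the original experiment may have used
    a different OpenCV model.
    """
    # Imported lazily so the precision helper below works without OpenCV installed.
    import cv2  # provided by the opencv-python package

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # Haar cascades operate on single-channel images.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_fullbody.xml"
    )
    # Illustrative parameter values, not tuned for crowd scenes.
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    return [tuple(int(v) for v in box) for box in boxes]


def precision(correct, detected):
    """Fraction of detections that were actually people."""
    return correct / detected if detected else 0.0


# The Figure 2 result: 12 detections, 1 of them correct.
figure_2_precision = precision(correct=1, detected=12)
```

Calling `len(detect_people("times_square.jpg"))` (a hypothetical file name) would give the detector's head count for an image, which can then be compared against a manual count.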

Crowd Counting in the Industry Today: Selecting the Best Model for the Job

In order to perform well when counting crowds, we need to make various considerations to better tailor our model development to the task at hand. The model we select will need to function well with a large number of subjects, various lighting conditions, and a low resolution per subject. With these considerations in mind, we should now filter and select a methodology that will produce a model that can perform well in this scenario.

The ideas behind most crowd counting methodologies are head counting and context. Instead of recognizing faces and bodies, most modern approaches to crowd counting rely on the accurate detection of heads. While faces tend to be unique and recognizable, counting the tops of heads requires considerable accuracy, as it is easy to mistake a series of shadows or objects for a head. As such, models need to be specifically trained to recognize the tops of heads and disregard everything else. Historically, this has been achieved by combining multiple neural network architectures. For example, a large network might be used to target fine-grained features such as the head itself, while a smaller network attempts solely to identify the areas of an image that contain heads to count. The combination of these networks allows a model to recognize heads with as little information as possible, often a handful of pixels, while avoiding overcounting by omitting areas with few or no human subjects.
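The two-network idea can be illustrated with a toy sketch. This is not code from any particular paper: the names `head_density` and `crowd_mask`, and the values in them, are hypothetical stand-ins for the outputs of the fine-grained head network and the region network.

```python
def masked_count(head_density, crowd_mask):
    """Sum head responses only where the region network says people are present."""
    total = 0.0
    for density_row, mask_row in zip(head_density, crowd_mask):
        for value, keep in zip(density_row, mask_row):
            if keep:
                total += value
    return total


# Hypothetical per-cell head responses from the fine-grained network.
# The right-hand column stands in for shadows mistaken for heads.
head_density = [
    [0.5, 0.5, 0.25],
    [0.0, 1.0, 0.25],
]

# Hypothetical region mask: 1 where the smaller network sees a crowd, 0 elsewhere.
crowd_mask = [
    [1, 1, 0],
    [1, 1, 0],
]

raw_count = sum(sum(row) for row in head_density)        # counts the shadows too
filtered_count = masked_count(head_density, crowd_mask)  # shadows are ignored
```

In this toy case the mask removes the spurious responses in the right-hand column, so the filtered count is lower than the raw count.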

By choosing the right methodology for the job at hand, we can create and train a much more accurate model for counting people in Times Square.

An Effective Crowd Counting Model

To count crowds effectively, we will use the code and procedure described in the 2019 paper by Weizhe Liu, Mathieu Salzmann, and Pascal Fua, “Context-Aware Crowd Counting.” In this paper, Liu, Salzmann, and Fua describe a system that “adaptively encodes the scale of the contextual information in an image required to accurately predict crowd density.” Their modifications allow the network to achieve higher accuracy by breaking the image into four smaller sections. We can see the results of their architecture here.

Figure 3: Heat map of CVPR with a total count of 186. This algorithm divides the image into four sectors.

From the images above we can see how much better our model performed once tailored to the specific use case. Our model estimates the number of people in the image at 186 while the true number is about 220 subjects.
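Density-based models generally produce a heat map whose values sum to the estimated count, so the total can be recovered by adding up the map, sector by sector if the image was split. The sketch below assumes that convention and uses a made-up 4x4 density map; the helper names are hypothetical. It also works out the relative error for the 186 versus 220 result above.

```python
def split_into_quadrants(density):
    """Split a 2D density map (a list of rows) into four sectors."""
    mid_row = len(density) // 2
    mid_col = len(density[0]) // 2
    top, bottom = density[:mid_row], density[mid_row:]
    return [
        [row[:mid_col] for row in top],     # top-left
        [row[mid_col:] for row in top],     # top-right
        [row[:mid_col] for row in bottom],  # bottom-left
        [row[mid_col:] for row in bottom],  # bottom-right
    ]


def count_from_density(density):
    """The estimated head count is the sum of all density values."""
    return sum(sum(row) for row in density)


def relative_error(predicted, actual):
    """Absolute error as a fraction of the true count."""
    return abs(predicted - actual) / actual


# Made-up density map for illustration only.
toy_density = [
    [0.0, 0.5, 0.5, 0.0],
    [0.25, 0.25, 1.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5],
]

sector_counts = [count_from_density(q) for q in split_into_quadrants(toy_density)]
total_count = sum(sector_counts)

# Model estimate of 186 against roughly 220 people actually present.
error = relative_error(predicted=186, actual=220)
```

With an estimate of 186 against a true count of about 220, the relative error is 34/220, or roughly 15%.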

To truly evaluate social distancing, we must identify whether the number of people within the space exceeds a threshold set by social distancing guidelines. If we know our subject space is X square feet, we divide X by the minimum of 36 square feet per person (6 feet apart in all directions) to get the maximum number of occupants allowed in the space. If we assume our Times Square space is about 4,000 square feet, the maximum number of people that should be allowed is 4,000 / 36, or about 111. For comparison with a standard day in Times Square, we can examine what Times Square looks like today.
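The threshold arithmetic is simple enough to express in a few lines. This is a minimal sketch using the figures from this post (36 square feet per person and an assumed 4,000-square-foot space); the function names are illustrative.

```python
import math

SQ_FT_PER_PERSON = 36  # 6 ft x 6 ft of space per person


def max_occupancy(area_sq_ft, sq_ft_per_person=SQ_FT_PER_PERSON):
    """Largest whole number of people that fits under the per-person minimum."""
    return math.floor(area_sq_ft / sq_ft_per_person)


def is_within_guideline(people_counted, area_sq_ft):
    """True if the counted crowd does not exceed the occupancy limit."""
    return people_counted <= max_occupancy(area_sq_ft)


times_square_limit = max_occupancy(4000)  # 4000 / 36 = 111.1..., rounded down
```

Any count produced by the crowd counting model can then be passed to `is_within_guideline` along with the area of the space being monitored.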

Figure 4: Image of Times Square on June 25, 2020

Using an image from June 25, 2020, the model counts 30 people, which is substantially fewer than our 111-person maximum.

Moving Forward With Machine Vision

In light of the continued impact of COVID-19, using machine vision to measure social distancing is one very timely application of this technology. This same model could be applied to surveying attendance at large events, monitoring foot traffic, or any number of other crowd counting scenarios. Machine vision tools like the one outlined here are extremely useful in a wide variety of applications.

Machine vision is already leveraged by tech-capable companies to drive both efficiency and performance. Machine vision models are used to automate inspection of machined parts and assemblies. Not only can these models verify the construction of a part, they can also predict when the part will fail based on minuscule deviations from the design specification. In business environments, the same type of model is used to automate data entry. Leveraging machine vision, businesses are able to automatically process thousands of scanned documents for important information in seconds.

Machine vision is a rapidly developing technology with enormous potential to automate tasks across a wide range of industries. At the same time, ethical concerns about the use cases of machine vision require its users to have a significant amount of knowledge in the field before moving a model to production. At Credera, we help our clients leverage technology to innovate and compete in today’s market. Whether you need to measure social distancing, automate a process, or count customers in a space, we’d love to help! Feel free to reach out to us at marketing@credera.com to learn how your organization can leverage machine vision.