Social Distancing Technique For Covid-19 Using Yolo V5 & CNN For Fast Object Detection and Better Accuracy

Volume 8, Issue 5, May – 2023 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
Baibhav Pathy
Abstract:- The world has fallen into the clutches of a deadly virus that has caused a pandemic and recessions all over the world. It has forced even superpowers such as the USA and Russia into lockdown, decreasing the gross domestic product (GDP) of their economies. To prevent the further spread of the virus, public awareness was required until a vaccine giving immunity against every variant of COVID-19 became available. One way to stop the further spread of the virus is social distancing. In this paper, we implement a deep-learning algorithm along with YOLOv5. The project combines OpenCV, deep learning, computer vision, and YOLOv5 to monitor social distancing between people through surveillance cameras: the system constantly analyzes the video input fed to it by the cameras and notifies the authorities if any social distancing violation occurs. The proposed method can help save money and protect the authorities who are required to keep people socially distanced from becoming infected with COVID-19. It can also substantially reduce COVID-19 deaths.

I. INTRODUCTION

One Chinese city has deployed 2.58 million CCTV cameras covering 15.35 million individuals [1] to keep a record of people and make the pursuit of illegal activity easier. As a result, there is roughly one camera for every six people. The cameras are used to monitor the facial features of individuals. All of this is achievable because of deep neural networks [2]. Deep learning is the process of learning features by extracting many levels of abstraction from data. Since its inception, this computational model has been employed in a wide range of applications, from recognizing faults in production processes to accurately identifying celestial objects that would take a long time, or be impossible, to discover manually.

Since its emergence in 2019 [17], COVID-19 has caused a pandemic that has to date killed approximately 5.43 million individuals and infected 283.20 million people (Worldometer, 2021) [3], [18]. Due to the lack of a vaccine, the World Health Organization (WHO) recommends using hand sanitizer and maintaining safe social distancing to reduce the virus's spread throughout the globe.

The goal of this study is to use a deep convolutional neural network to recognize individuals in photos or video feeds and then use that information to estimate the distance between them. Although much research has already been done in this field, we revisit it with the help of a new object identification framework, YOLOv5 [5]. YOLOv5, a cutting-edge object detection algorithm, is extremely powerful and fast, making it suitable for use with surveillance cameras.

II. ABOUT DATASET

Most of the images and videos that we used were taken by our student volunteers. Other sources of data are Google and YouTube videos with an open license. A Google search returns a large number of photos from many sources, making image-gathering tasks quicker and easier.

We have also used the MS COCO dataset [19], released by Microsoft, which is a large-scale object detection, classification, and labelling dataset. The collection was built with the objective of improving image recognition; COCO stands for Common Objects in Context. The COCO dataset offers challenging, high-quality visual data for object recognition, aimed mostly at state-of-the-art neural networks. COCO, for example, is frequently used to benchmark the performance of real-world detection systems. Modern neural network libraries can read the COCO dataset's format directly.

III. METHODOLOGY

Identifying individuals requires a deep learning model trained with images of numerous individuals in various scenarios. The detection procedure is divided into four phases: data collection, data categorization, model training, and validation of the model with testing, as shown in Fig 1.
IJISRT23MAY850 www.ijisrt.com 1062

Fig 1 Flowchart of the Method to be followed while Executing the Model

• Data Collection
Data collection consisted of gathering the prerequisite data from external open-license sources. The data used are videos and photos taken by the author with his smartphone and DSLR during various activities in college and in nearby localities.

• Data Categorization
Because we are using supervised deep learning, we must supply the expected outputs (labels) for the dataset during training and testing. Convolution layers are used to categorize pictures in deep learning for object recognition. Below are some examples of how the photographs are labelled. To label our training examples, we used Labeling [8] for Mac.

• Model Training and Validation of the Model with Testing
Once the data has been labelled, the classifier is trained. The classifier for person identification was trained using YOLOv5. Configuring the hyper-parameters, which strongly influence the performance and reliability of the detection mechanism, is a large part of fitting the network. Training was run on a machine for a thousand epochs. We perform the evaluation with the CNN model and show the results in a graph, as shown in Fig 2.

Fig 2 Output After Running the Model

IV. ABOUT YOLO V5

To locate an object inside an image, all the previous object detection techniques use regions. The system does not examine the entire picture, only the portions of the image with a high likelihood of containing the object. You Only Look Once, or YOLO, is an object detection framework that differs significantly from such region-based techniques. In YOLO, a single neural network predicts the bounding boxes and the class probabilities for these boxes.

YOLO works by splitting a picture into an S x S grid, as shown in Fig 3, and creating m bounding boxes inside each grid cell. For each bounding box, the model outputs a class probability and offset values. The bounding boxes with a class probability greater than or equal to a threshold value are selected and used to locate the object inside the picture. YOLO is considerably faster than other object detection techniques (45 frames per second). The YOLO algorithm's drawback is that it has trouble detecting small objects in images; for example, it could have trouble with a tightly packed group of people, as shown in Fig 4. This is owing to the algorithm's spatial constraints.

Fig 3 Class Probability Map
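The grid assignment and thresholding steps described for YOLO can be sketched in a few lines of Python. This is a minimal illustration and not the YOLOv5 implementation: the `Box` record, the normalised centre coordinates, the grid size S = 7, and the 0.5 threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A candidate detection (hypothetical record, coordinates in [0, 1])."""
    cx: float      # box centre x, as a fraction of image width
    cy: float      # box centre y, as a fraction of image height
    w: float       # box width, as a fraction of image width
    h: float       # box height, as a fraction of image height
    score: float   # class probability for "person"


def grid_cell(box, s):
    """Return the (row, col) of the S x S grid cell containing the box centre."""
    col = min(int(box.cx * s), s - 1)  # clamp so that cx == 1.0 stays in range
    row = min(int(box.cy * s), s - 1)
    return row, col


def filter_by_threshold(boxes, threshold):
    """Keep boxes whose class probability is at least the threshold."""
    return [b for b in boxes if b.score >= threshold]


# Three made-up candidate boxes.
boxes = [
    Box(0.10, 0.20, 0.05, 0.15, 0.91),
    Box(0.55, 0.50, 0.06, 0.18, 0.30),
    Box(0.99, 1.00, 0.04, 0.12, 0.50),
]
kept = filter_by_threshold(boxes, threshold=0.5)
cells = [grid_cell(b, s=7) for b in kept]
```

In this sketch the second box is dropped because its score falls below the assumed threshold, and each remaining box is attributed to the grid cell that contains its centre.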

Fig 4 Feature Selection

V. PROPOSED MODEL

Figure 5 depicts a high-level view of the proposed social distancing surveillance system. A dense architecture was built on top of the YOLOv5 model for effective feature reuse and representation. The bounding box (Bbox) can better match the shape of the target, allowing for more exact localisation. Furthermore, the Bbox allows a more accurate IoU to be calculated between predictions, which is critical in the NMS process and therefore improves the detection results. The proposed model is named YOLO-social distancing. A flowchart of the YOLO-social distancing training and detection procedure is shown in Figure 5.

Fig 5 A flowchart of training and detection process of YOLO-Face

VI. IMPLEMENTATION

Inverse Perspective Mapping is the method of calculating the top-down (bird's-eye) view of images captured from a certain angle. The OpenCV [8] module makes finding the inverse perspective mapping of any picture simple. By selecting four points in the image and mapping them to their actual positions in the top view, we can obtain the projection of the entire picture onto the top view. The detailed structure of the picture acquired using this mapping approach is shown in Figure 6.

Fig 6 Flowchart of Implementation Method Layer by Layer

Fig 7 Final Output Footage

The four dots on both sides of the frame mark the visual reference points used to obtain the bird's-eye view. This calibration was performed once, and the resulting linear transformation matrix was then used to calculate distances, as shown in Fig 7. The scale of the image is essential for determining the true distance between objects. To compute the distance between objects in this research, we used an approximate scale.
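The calibration and distance steps of the implementation can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the OpenCV-based code used in the project: the four calibration points, the ground rectangle of 4 m by 6 m, the 2 m limit, and the helper names (`solve_linear`, `homography`, `to_top_view`, `violations`) are all hypothetical.

```python
import math
from itertools import combinations


def solve_linear(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """3x3 matrix mapping four image points `src` onto four ground points `dst`."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]  # the last matrix entry is fixed to 1
    return [h[0:3], h[3:6], h[6:9]]


def to_top_view(h, point):
    """Project an image point into the top (bird's-eye) view."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def violations(h, feet, min_dist):
    """Index pairs of people whose top-view distance is below min_dist."""
    pts = [to_top_view(h, p) for p in feet]
    return [(i, j) for i, j in combinations(range(len(pts)), 2)
            if math.dist(pts[i], pts[j]) < min_dist]


# Hypothetical calibration: four image points (pixels) marking a ground
# rectangle assumed to be 4 m wide and 6 m deep.
image_pts = [(200, 300), (440, 300), (640, 480), (0, 480)]
ground_pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 6.0), (0.0, 6.0)]
H = homography(image_pts, ground_pts)

# Bottom-centre points of three detected person boxes (made-up values).
feet = [(300, 400), (340, 400), (600, 470)]
too_close = violations(H, feet, min_dist=2.0)
```

Because the matrix is computed once from the four reference points, each new frame only needs the cheap `to_top_view` projection and a pairwise distance check. With OpenCV, `cv2.getPerspectiveTransform` and `cv2.perspectiveTransform` play the roles of `homography` and `to_top_view`.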

Fig 8 Dense four-layer block. All previous feature maps are used as input for every layer, which in turn contributes to all subsequent layers. H_i denotes the operation BN-ReLU-Conv1×1-BN-ReLU-Conv3×3.

Fig 9 Depth Analysis of Each Layer

Extracting image features and applying convolution are the two elements of Single Shot Detector (SSD) object recognition. SSD is often described as performing detection from a single layer. In fact, it detects objects independently using multiple layers (multi-scale feature maps), as shown in Fig 8. As the CNN progressively reduces the spatial dimension, the resolution of the extracted feature maps decreases. SSD uses the lower-resolution layers to detect larger-scale objects; the 4x4 feature maps, for instance, are used for the largest objects. After VGG16, SSD adds six auxiliary convolution layers, as shown in Fig 9. Five of these layers are used for object recognition. In three of those layers, six predictions are made instead of four. In total, SSD makes 8732 predictions using six layers.

We then use the R-CNN neural network for model evaluation. To get around the problem of selecting a very large number of regions in the image, Ross Girshick et al. devised an approach in which selective search is used to extract only a limited number of regions from the image, which they call region proposals [9]. As a result, rather than attempting to classify a very large number of regions, the method works with this limited set only. The selective search technique is used to generate these region proposals. The candidate region proposals are warped into a square and fed into a convolutional neural network (CNN), which outputs a 4096-dimensional feature vector, as shown in Fig 10. The CNN acts as a feature extractor, and the output dense layer contains the features extracted from the image, which are fed into an SVM to classify the presence of the object inside the candidate region proposal. In addition to predicting the presence of an object inside the proposed region, the method predicts offset values that improve the precision of the bounding box.

Fig 10 Neural Network Overall Representation

Fig 11 Overall Representation of Bounding Box Suggesting Presence of an Object Inside the Zone

Algorithm 1 The Pseudo-Code

Input: B = {b1, ..., bN}, C = {C1, ..., CN}, Ωnms
    B is the list of initial detection boxes
    C contains the corresponding detection confidences
    Ωnms is the NMS threshold
Output: List of final detection boxes D
    D ← {}
    while B ≠ ∅ do
        m ← argmax C
        D ← D ∪ bm; B ← B − bm; C ← C − Cm
        for bi ∈ B do
            if IoU(bm, bi) ≥ Ωnms then
                B ← B − bi; C ← C − Ci
            end if
        end for
    end while
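The pseudo-code of Algorithm 1 can be turned into a short runnable sketch. The version below is a plain-Python illustration of greedy NMS with an axis-aligned IoU, not the routine shipped with YOLOv5; the `(x1, y1, x2, y2)` box format, the sample boxes, and the 0.5 threshold are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, confidences, threshold):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Candidates sorted by confidence, best first (the argmax of Algorithm 1).
    remaining = sorted(range(len(boxes)),
                       key=lambda i: confidences[i], reverse=True)
    kept = []
    while remaining:
        m = remaining.pop(0)  # most confident remaining box
        kept.append(m)
        # Drop every box that overlaps the chosen one too strongly.
        remaining = [i for i in remaining
                     if iou(boxes[m], boxes[i]) < threshold]
    return kept


# Two heavily overlapping detections of one person and one separate detection.
boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 20, 140, 100)]
confidences = [0.90, 0.75, 0.80]
kept = nms(boxes, confidences, threshold=0.5)
```

Sorting once up front replaces the repeated argmax of the pseudo-code; the result is the same because removing boxes never changes the relative order of the remaining confidences.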

VII. MATHEMATICAL MODEL

A. Network Architecture

• Yolo Backbone
The overall design of the improved YOLOv5s is shown in Figure 3; it comprises the backbone, the detection neck, and the detection head. First, a newly designed backbone termed CSPNet is employed. We update it with a new block called CBS, consisting of a convolution layer, a BN layer, and a SiLU activation [10]. Furthermore, a stem block is used to replace the Focus layer in YOLOv5s. A C3 block is used to replace the original CSP block; it has two branches. One passes through a CBS block, a number of bottleneck blocks, and a Conv layer, while the other consists of a single convolution layer. The two paths are then joined by a concatenation, followed by a CBS block. We also change the SPP block [11] to improve face detection performance. In this block, the sizes of the three kernels are changed to smaller kernels.

• Detection Neck
The architecture of the detection neck is also depicted in Figure 3; it comprises a conventional feature pyramid network (FPN) [12] and a path aggregation network (PAN) [3]. However, we adjust the details of several modules, such as the CS module and the CBS block, as shown in Fig 12.

• Detection Head
Through the multilayer perceptron architecture and the path aggregation network [13], the front part of the network fully fuses low-level and high-level features to build rich feature maps, which can detect the most frequently occurring instances. However, for low-resolution images, feature fusion cannot enhance the original information of the image, and after layers of iteration, the prior information of small faces is still lacking. To improve the detection rate of small faces in low-resolution images, super-resolution (SR) is fused into the detection head of the network. For the grid to be computed, the region information is fed into SRGAN to carry out high-level feature reconstruction, and face detection is performed again using its coordinate information. Finally, the output of the two-stage.

• Loss Function
In detection systems, the IoU metric is commonly used. Most anchor-based [14] approaches use it not only to determine positive and negative samples but also to measure the distance between the predicted box's position and the ground truth. The study suggests that the following factors be taken into account: overlap area, centre-point distance, and aspect ratio, all of which have attracted wide attention. Researchers continue to propose better-performing variants, such as IoU, GIoU, DIoU, and CIoU. In this study, we propose replacing GIoU with CIoU in YOLOv5s, combined with non-maximum suppression (NMS). Our bounding box regression loss function is defined as:

$$ L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b')}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w'}{h'} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v} \qquad \text{(Eq. 1)} $$

where b and b′ are the centre points of the predicted box and the target box, ρ denotes the Euclidean distance, c is the diagonal length of the smallest enclosing rectangular box, and w, h (w′, h′) are the width and height of the predicted (target) box. Face targets are not only numerous but also overlapping in surveillance images [15], resulting in several targets for each grid cell. However, basing decisions on a single criterion frequently results in low precision and recall [16]. Thanks to the loop structure created by combining CIoU and NMS, the candidate boxes in the same grid cell can be evaluated and checked over multiple rounds, effectively avoiding missed detections.

Fig 12 Block System of how the Model Takes Data and Processes in Different Layer of the Model

VIII. DISCUSSION

Given the amount of data we were working with, the results appear to be sufficient. The results of the detection algorithm are evaluated using the Accuracy, Recall, Precision, and mAP indicators. The graph below depicts the evolution of these metrics over the training epochs.

Fig 13 Graph of Accuracy, Precision & Recall for Yolo V5 Face Mask Detection

Fig 14 Comparison in Different Yolo Version

With the public dataset, we obtained a Recall of 98 per cent and a Precision of 92 per cent after a thousand epochs. The mAP@0.6 and mAP@0.6-0.95 are 0.96 and 0.59 respectively, as read from the graphs in Fig 13 and Fig 14.

As expected, our algorithm recognises objects by category and assigns each detected object a label and a confidence percentage on the output image. Given the pixel location of an object along the x and y axes of the image, we can, based on our experimental findings, detect objects more accurately and identify them individually. This investigation also covers a number of different approaches to object detection and classification, as well as an assessment of each approach's efficiency.

IX. CONCLUSION

YOLOv5, a unified framework that is faster than earlier two-stage detectors, was able to recognise humans accurately. We were also able to use image transformation to turn the image into a bird's-eye view and calculate the distance between people.

Visual recognition systems can help with security cameras, face recognition, defect monitoring, text classification, and other applications. The purpose of this work is to develop object recognition software that can distinguish between people and determine their relative distances. The performance of an object recognition system is determined by the features employed and the recognition algorithm used. The goal of this research is to offer a novel feature extraction method for obtaining global features and local features from the region of interest. In addition, the study aims to combine classical classifiers to recognise the object.

The algorithm was able to recognise persons in the video stream and estimate the proximity between them. We were also able to use red boundary lines to flag persons who did not observe social distancing.

FUTURE SCOPE

A conventional webcam was used for object recognition and target tracking. The principle may be used in a multitude of scenarios, including robotics, driver-assisted automobiles, security improvement to recognise suspicious behaviour as well as weapons, detection of anomalous enemy movements on the border with the aid of target acquisition cameras, and many more.

REFERENCES
[1]. The Guardian. Big brother is watching. https://www.theguardian.com/cities/2019/dec/02/big-brother-is-watching-chinese-city-with-26m-cameras-is-worlds-most-hea Online; accessed 11 Nov 2020.
[2]. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[3]. Worldometer. Coronavirus update. https://www.worldometers.info/coronavirus/. Online; accessed 29 Nov 2021.
[4]. https://covid19.who.int/
[5]. Ultralytics. Yolov5. https://github.com/ultralytics. Online; accessed 11 Nov 2020.
[6]. https://cocodataset.org/#home
[7]. https://doi.org/10.1038/nature14539
[8]. G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[9]. R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587, doi: 10.1109/CVPR.2014.81.
[10]. S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018.
[11]. K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[12]. T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125, Honolulu, HI, United States, 2017.

[13]. H. Bai, J. Cheng, X. Huang, S. Liu, and C. Deng,
“HCANet: a hierarchical context aggregation network
for semantic segmentation of high-resolution remote
sensing images,” IEEE Geoscience and Remote
Sensing Letters., pp. 1–5, 2021.
[14]. B. Yu and D. Tao, “Anchor cascade for efficient face
detection,” IEEE Transactions on Image Processing,
vol. 28, no. 5, pp. 2490–2501, 2019
[15]. C. Ding and D. Tao, “Trunk-branch ensemble
convolutional neural networks for video-based face
recognition,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 40, no. 4, pp. 1002–
1014, 2018.
[16]. Z. Tang, G. Zhao, and T. Ouyang, “Two-phase deep
learning model for short-term wind direction
forecasting,” Renewable Energy, vol. 173, pp. 1005–
1016, 2021.
[17]. Qian M, Jiang J. COVID-19 and social distancing. Z
Gesundh Wiss. 2022;30(1):259-261. doi:
10.1007/s10389-020-01321-z. Epub 2020 May 25.
PMID: 32837835; PMCID: PMC7247774.
[18]. Kumar P, Sah AK, Tripathi G, Kashyap A, Tripathi
A, Rao R, Mishra PC, Mallick K, Husain A, Kashyap
MK. Role of ACE2 receptor and the landscape of
treatment options from convalescent plasma therapy
to the drug repurposing in COVID-19. Mol Cell
Biochem. 2021 Feb;476(2):553-574. doi:
10.1007/s11010-020-03924-2. Epub 2020 Oct 7.
PMID: 33029696; PMCID: PMC7539757.
[19]. arXiv:1405.0312
