What is the difference between yolort and yolov5¶
yolort now adopts the same model structure as the official YOLOv5. The main difference between yolort and YOLOv5 lies in the pre-processing and post-processing strategies. We adopted a strategy different from the official one so that the pre-processing and post-processing modules are jit traceable and scriptable. This gives us an end-to-end graph for inference on LibTorch, ONNX Runtime and TVM.
For pre-processing, YOLOv5 uses letterbox (padding) resizing, which maintains the aspect ratio. The numerical error arises from the interpolation operator used in the resizing step. YOLOv5 applies cv2.resize to the input uint8 [0-255] images, but the OpenCV operators are neither traceable nor scriptable, so yolort uses torch.nn.functional.interpolate instead. Fortunately, PyTorch’s interpolation operator is aligned with that of OpenCV; however, PyTorch’s interpolate currently only supports floating-point data types, so we have to cast the images to float before resizing, which introduces small numerical errors.
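As a minimal, hedged illustration of this point (the image shape and the comparison below are illustrative only and not part of the notebook), the two resize paths can be compared directly:

import cv2
import numpy as np
import torch

# cv2.resize works directly on the uint8 image.
img_u8 = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
resized_cv2 = cv2.resize(img_u8, (320, 240), interpolation=cv2.INTER_LINEAR)

# torch.nn.functional.interpolate only accepts floating-point tensors, so we cast first.
img_fp = torch.from_numpy(img_u8).permute(2, 0, 1).unsqueeze(0).float()
resized_pt = torch.nn.functional.interpolate(
    img_fp, size=(240, 320), mode="bilinear", align_corners=False
)
resized_pt = resized_pt.squeeze(0).permute(1, 2, 0).numpy()

# The two results should agree up to the rounding of the float values back to uint8.
print(np.abs(resized_cv2.astype(np.float32) - resized_pt).max())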
YOLOv5 provides a very powerful function for post-processing, of which we implement only the non-agnostic version, but the accuracy should remain consistent with the original version. See our documentation for more details.
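As a small sketch of the difference between class-agnostic and non-agnostic NMS (using torchvision’s operators and made-up boxes, purely for illustration):

import torch
from torchvision.ops import batched_nms, nms

# Two heavily overlapping boxes that belong to two different classes.
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0]])
scores = torch.tensor([0.9, 0.8])
labels = torch.tensor([0, 1])

# Non-agnostic NMS (the flavour yolort implements) is applied per class,
# so boxes of different classes never suppress each other.
print(batched_nms(boxes, scores, labels, iou_threshold=0.45))  # keeps both boxes

# Class-agnostic NMS ignores the labels, so the lower-scoring box is removed.
print(nms(boxes, scores, iou_threshold=0.45))  # keeps only the first box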
Prepare environment, image and model weights to test¶
[1]:
import os
import torch
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
[2]:
import cv2
from yolort.models import YOLOv5
from yolort.utils import cv2_imshow, read_image_to_tensor
from yolort.utils.image_utils import color_list, plot_one_box, parse_single_image
from yolort.v5 import load_yolov5_model, letterbox, non_max_suppression, scale_coords, attempt_download
from yolort.v5.utils.downloads import safe_download
[3]:
img_size = 640
stride = 64
score_thresh = 0.35
iou = 0.45
fixed_shape = None
[4]:
# img_source = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/zidane.jpg"
img_source = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/bus.jpg"
img_path = "bus.jpg"
safe_download(img_path, img_source)
Downloading https://huggingface.co/spaces/zhiqwang/assets/resolve/main/bus.jpg to bus.jpg...
[5]:
# yolov5n6.pt is downloaded from 'https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5n6.pt'
model_path = 'yolov5n6.pt'
checkpoint_path = attempt_download(model_path)
Load the model with ultralytics and run inference¶
YOLOv5 provides an input-robust model wrapper named AutoShape for passing cv2/np/PIL/torch inputs, which bundles pre-processing, inference and post-processing (NMS). This wrapper is currently only valid for PyTorch inference. The sketch below shows roughly what the AutoShape path looks like; to peel back what happens inside, we then call yolov5 through a vanilla interface instead, starting with the pre-processing part.
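For reference only (not executed in this notebook), a minimal sketch of the AutoShape path via torch.hub; the "ultralytics/yolov5" hub entry point and the "yolov5n6" tag are assumptions here:

import torch

# Load yolov5n6 through torch.hub; AutoShape wrapping is applied by default.
model_autoshape = torch.hub.load("ultralytics/yolov5", "yolov5n6", pretrained=True)
model_autoshape.conf = 0.35  # confidence threshold (0-1)
model_autoshape.iou = 0.45   # NMS IoU threshold (0-1)

# AutoShape accepts file paths, numpy arrays, PIL images or torch tensors and
# runs pre-processing, the forward pass and NMS in a single call.
results = model_autoshape("bus.jpg")
results.print()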
[6]:
# Preprocess
img_raw = cv2.imread(img_path)
image = letterbox(
    img_raw,
    new_shape=(img_size, img_size),
    stride=stride,
    auto=not fixed_shape,
)[0]
image = read_image_to_tensor(image)
image = image.to(device)
image = image[None]
[7]:
vis = parse_single_image(image[0])
Let’s also visualize the letterboxed image.
[8]:
cv2_imshow(vis, imshow_scale=0.75, convert_bgr_to_rgb=False)
[9]:
model_yolov5 = load_yolov5_model(checkpoint_path, autoshape=False, verbose=False)
model_yolov5 = model_yolov5.to(device)
model_yolov5.conf = score_thresh # confidence threshold (0-1)
model_yolov5.iou = iou # NMS IoU threshold (0-1)
model_yolov5 = model_yolov5.eval()
from n params module arguments
0 -1 1 1760 yolort.v5.models.common.Conv [3, 16, 6, 2, 2]
1 -1 1 4672 yolort.v5.models.common.Conv [16, 32, 3, 2]
2 -1 1 4800 yolort.v5.models.common.C3 [32, 32, 1]
3 -1 1 18560 yolort.v5.models.common.Conv [32, 64, 3, 2]
4 -1 2 29184 yolort.v5.models.common.C3 [64, 64, 2]
5 -1 1 73984 yolort.v5.models.common.Conv [64, 128, 3, 2]
6 -1 3 156928 yolort.v5.models.common.C3 [128, 128, 3]
7 -1 1 221568 yolort.v5.models.common.Conv [128, 192, 3, 2]
8 -1 1 167040 yolort.v5.models.common.C3 [192, 192, 1]
9 -1 1 442880 yolort.v5.models.common.Conv [192, 256, 3, 2]
10 -1 1 296448 yolort.v5.models.common.C3 [256, 256, 1]
11 -1 1 164608 yolort.v5.models.common.SPPF [256, 256, 5]
12 -1 1 49536 yolort.v5.models.common.Conv [256, 192, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 8] 1 0 yolort.v5.models.common.Concat [1]
15 -1 1 203904 yolort.v5.models.common.C3 [384, 192, 1, False]
16 -1 1 24832 yolort.v5.models.common.Conv [192, 128, 1, 1]
17 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
18 [-1, 6] 1 0 yolort.v5.models.common.Concat [1]
19 -1 1 90880 yolort.v5.models.common.C3 [256, 128, 1, False]
20 -1 1 8320 yolort.v5.models.common.Conv [128, 64, 1, 1]
21 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
22 [-1, 4] 1 0 yolort.v5.models.common.Concat [1]
23 -1 1 22912 yolort.v5.models.common.C3 [128, 64, 1, False]
24 -1 1 36992 yolort.v5.models.common.Conv [64, 64, 3, 2]
25 [-1, 20] 1 0 yolort.v5.models.common.Concat [1]
26 -1 1 74496 yolort.v5.models.common.C3 [128, 128, 1, False]
27 -1 1 147712 yolort.v5.models.common.Conv [128, 128, 3, 2]
28 [-1, 16] 1 0 yolort.v5.models.common.Concat [1]
29 -1 1 179328 yolort.v5.models.common.C3 [256, 192, 1, False]
30 -1 1 332160 yolort.v5.models.common.Conv [192, 192, 3, 2]
31 [-1, 12] 1 0 yolort.v5.models.common.Concat [1]
32 -1 1 329216 yolort.v5.models.common.C3 [384, 256, 1, False]
33 [23, 26, 29, 32] 1 164220 yolort.v5.models.yolo.Detect [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [64, 128, 192, 256]]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Model Summary: 355 layers, 3246940 parameters, 3246940 gradients, 4.6 GFLOPs
[10]:
with torch.no_grad():
    dets_yolov5 = model_yolov5(image)[0]
    dets_yolov5 = non_max_suppression(dets_yolov5, score_thresh, iou, agnostic=False)[0]
Then restore the coordinates to the original scale of the image.
[11]:
boxes_yolov5 = scale_coords(image.shape[2:], dets_yolov5[:, :4], img_raw.shape[:-1])
labels_yolov5 = dets_yolov5[:, 5].to(dtype=torch.int64)
scores_yolov5 = dets_yolov5[:, 4]
Now we can visualize the inference results after completing the post-processing.
[12]:
# Get label names
import requests
label_path = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/coco.names"
response = requests.get(label_path)
names = response.text
LABELS = []
for label in names.strip().split('\n'):
    LABELS.append(label)
COLORS = color_list()
[13]:
for box, label in zip(boxes_yolov5.tolist(), labels_yolov5.tolist()):
    img_raw = plot_one_box(box, img_raw, color=COLORS[label % len(COLORS)], label=LABELS[label])
[14]:
cv2_imshow(img_raw, imshow_scale=0.5)
At this point we have completed the whole inference process of yolov5.
Use yolort’s approach to inference¶
yolort now supports loading models trained with yolov5. It also provides an end-to-end inference pipeline that supports both jit tracing and scripting modes. The model can be used to export ONNX and TorchScript graphs and to run inference on the ONNX Runtime, LibTorch and TVM VirtualMachine backends (a small scripting sketch follows the model summary below).
[15]:
model_yolort = YOLOv5.load_from_yolov5(
    checkpoint_path,
    score_thresh=score_thresh,
    nms_thresh=iou,
    size_divisible=stride,
    fixed_shape=fixed_shape,
)
model_yolort = model_yolort.eval()
model_yolort = model_yolort.to(device)
from n params module arguments
0 -1 1 1760 yolort.v5.models.common.Conv [3, 16, 6, 2, 2]
1 -1 1 4672 yolort.v5.models.common.Conv [16, 32, 3, 2]
2 -1 1 4800 yolort.v5.models.common.C3 [32, 32, 1]
3 -1 1 18560 yolort.v5.models.common.Conv [32, 64, 3, 2]
4 -1 2 29184 yolort.v5.models.common.C3 [64, 64, 2]
5 -1 1 73984 yolort.v5.models.common.Conv [64, 128, 3, 2]
6 -1 3 156928 yolort.v5.models.common.C3 [128, 128, 3]
7 -1 1 221568 yolort.v5.models.common.Conv [128, 192, 3, 2]
8 -1 1 167040 yolort.v5.models.common.C3 [192, 192, 1]
9 -1 1 442880 yolort.v5.models.common.Conv [192, 256, 3, 2]
10 -1 1 296448 yolort.v5.models.common.C3 [256, 256, 1]
11 -1 1 164608 yolort.v5.models.common.SPPF [256, 256, 5]
12 -1 1 49536 yolort.v5.models.common.Conv [256, 192, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 8] 1 0 yolort.v5.models.common.Concat [1]
15 -1 1 203904 yolort.v5.models.common.C3 [384, 192, 1, False]
16 -1 1 24832 yolort.v5.models.common.Conv [192, 128, 1, 1]
17 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
18 [-1, 6] 1 0 yolort.v5.models.common.Concat [1]
19 -1 1 90880 yolort.v5.models.common.C3 [256, 128, 1, False]
20 -1 1 8320 yolort.v5.models.common.Conv [128, 64, 1, 1]
21 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
22 [-1, 4] 1 0 yolort.v5.models.common.Concat [1]
23 -1 1 22912 yolort.v5.models.common.C3 [128, 64, 1, False]
24 -1 1 36992 yolort.v5.models.common.Conv [64, 64, 3, 2]
25 [-1, 20] 1 0 yolort.v5.models.common.Concat [1]
26 -1 1 74496 yolort.v5.models.common.C3 [128, 128, 1, False]
27 -1 1 147712 yolort.v5.models.common.Conv [128, 128, 3, 2]
28 [-1, 16] 1 0 yolort.v5.models.common.Concat [1]
29 -1 1 179328 yolort.v5.models.common.C3 [256, 192, 1, False]
30 -1 1 332160 yolort.v5.models.common.Conv [192, 192, 3, 2]
31 [-1, 12] 1 0 yolort.v5.models.common.Concat [1]
32 -1 1 329216 yolort.v5.models.common.C3 [384, 256, 1, False]
33 [23, 26, 29, 32] 1 164220 yolort.v5.models.yolo.Detect [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [64, 128, 192, 256]]
Model Summary: 355 layers, 3246940 parameters, 3246940 gradients, 4.6 GFLOPs
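As mentioned above, the whole pipeline is scriptable. A minimal sketch of what that looks like (the output file name here is just an example):

# Script the end-to-end yolort model (pre-processing + model + post-processing).
model_scripted = torch.jit.script(model_yolort)
model_scripted.save("yolov5n6_yolort_scripted.pt")

# The saved graph can then be reloaded in Python with torch.jit.load,
# or from C++ via torch::jit::load in LibTorch.
reloaded = torch.jit.load("yolov5n6_yolort_scripted.pt")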
Its interface is also very easy to use.
[16]:
with torch.no_grad():
    dets_yolort = model_yolort.predict(img_path)
[17]:
boxes_yolort = dets_yolort[0]['boxes']
labels_yolort = dets_yolort[0]['labels']
scores_yolort = dets_yolort[0]['scores']
Verify the detection results between yolort and ultralytics¶
We print out the results of both inference pipelines.
[18]:
print(f"Detection boxes with yolov5:\n{boxes_yolov5}\n")
print(f"Detection boxes with yolort:\n{boxes_yolort}")
Detection boxes with yolov5:
tensor([[ 32.51723, 225.12900, 810.00000, 741.03424],
[ 50.41119, 387.52475, 241.58034, 897.60645],
[219.00005, 386.05475, 345.78729, 869.04047],
[678.08923, 374.60596, 809.77881, 874.63422]], device='cuda:0')
Detection boxes with yolort:
tensor([[ 32.27846, 225.15259, 811.47729, 740.91077],
[ 50.42178, 387.48911, 241.54393, 897.61035],
[219.03334, 386.14346, 345.77686, 869.02582],
[678.05023, 374.65341, 809.80341, 874.80621]], device='cuda:0')
[19]:
print(f"Detection scores with yolov5:\n{scores_yolov5}\n")
print(f"Detection scores with yolort:\n{scores_yolort}")
Detection scores with yolov5:
tensor([0.88235, 0.84495, 0.72589, 0.70359], device='cuda:0')
Detection scores with yolort:
tensor([0.88238, 0.84486, 0.72629, 0.70077], device='cuda:0')
[20]:
print(f"Detection labels with yolort:\n{labels_yolov5}\n")
print(f"Detection labels with yolort:\n{labels_yolort}")
Detection labels with yolov5:
tensor([5, 0, 0, 0], device='cuda:0')
Detection labels with yolort:
tensor([5, 0, 0, 0], device='cuda:0')
[21]:
# Testing boxes
torch.testing.assert_allclose(boxes_yolort, boxes_yolov5, rtol=1e-2, atol=1e-7)
# Testing scores
torch.testing.assert_allclose(scores_yolort, scores_yolov5, rtol=1e-3, atol=1e-2)
# Testing labels
torch.testing.assert_allclose(labels_yolort, labels_yolov5)
print("Exported model has been tested, and the result looks good!")
Exported model has been tested, and the result looks good!
As you can see from these results, there are still small differences in the boxes, but the scores and labels match closely.
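As a quick, hedged way to quantify the remaining box differences (reusing the tensors computed above), one could simply print the largest absolute deviation:

# Largest absolute coordinate difference between the two pipelines (in pixels).
print((boxes_yolort - boxes_yolov5).abs().max())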
View this document as a notebook: https://github.com/zhiqwang/yolov5-rt-stack/blob/main/notebooks/comparison-between-yolort-vs-yolov5.ipynb