What is the difference between yolort and yolov5

yolort adopts the same model structure as the official YOLOv5. The main difference between yolort and YOLOv5 lies in the pre-processing and post-processing strategy. We adopted a different strategy from the official one in order to make the pre-processing and post-processing modules jit traceable and scriptable, which gives us an end-to-end graph for inference on LibTorch, ONNX Runtime and TVM.

For pre-processing, YOLOv5 uses letterbox (padded) resizing that maintains the aspect ratio, and the numerical error arises from the interpolation operator used in that resizing. YOLOv5 calls cv2.resize on the input uint8 [0-255] images, but OpenCV operators are neither traceable nor scriptable, so yolort uses torch.nn.functional.interpolate instead. Fortunately, PyTorch's interpolation is aligned with OpenCV's; however, PyTorch's interpolate currently only supports floating-point data types, so the images have to be cast to float before resizing, which introduces small errors.
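The sketch below illustrates this difference (the image shape and resize target are only illustrative): OpenCV resizes uint8 images directly, while torch.nn.functional.interpolate requires a float tensor, so the image has to be cast to float first.

import cv2
import numpy as np
import torch
import torch.nn.functional as F

img_uint8 = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

# OpenCV resizes uint8 images directly (bilinear here)
resized_cv2 = cv2.resize(img_uint8, (320, 256), interpolation=cv2.INTER_LINEAR)

# PyTorch's interpolate needs a float NCHW tensor, so we cast before resizing
img_tensor = torch.from_numpy(img_uint8).permute(2, 0, 1)[None].float()
resized_torch = F.interpolate(img_tensor, size=(256, 320), mode="bilinear", align_corners=False)

# The two results are close but not bit-identical because of the float cast
diff = np.abs(resized_torch[0].permute(1, 2, 0).numpy() - resized_cv2.astype(np.float32))
print(f"max abs difference: {diff.max():.3f}")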

For post-processing, YOLOv5 provides a very powerful NMS function. yolort implements only a non-agnostic (per-class) version of it, but the accuracy should be consistent with the original version. See our documentation for more details.
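To clarify the terminology, the difference between class-agnostic and non-agnostic (per-class) NMS can be sketched with torchvision.ops.nms; the class-offset trick below is a common way to run per-class NMS in a single call and is only an illustration, not yolort's exact implementation.

import torch
from torchvision.ops import nms

def agnostic_nms(boxes, scores, iou_thresh=0.45):
    # Class-agnostic: all boxes compete with each other regardless of class
    return nms(boxes, scores, iou_thresh)

def non_agnostic_nms(boxes, scores, labels, iou_thresh=0.45):
    # Per-class: shift boxes by a large class-dependent offset so that boxes
    # of different classes never overlap and cannot suppress each other
    offsets = labels.to(boxes) * (boxes.max() + 1)
    return nms(boxes + offsets[:, None], scores, iou_thresh)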

Prepare the environment, test image and model weights

[1]:
import os
import torch

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
[2]:
import cv2
from yolort.models import YOLOv5
from yolort.utils import cv2_imshow, read_image_to_tensor
from yolort.utils.image_utils import color_list, plot_one_box, parse_single_image
from yolort.v5 import load_yolov5_model, letterbox, non_max_suppression, scale_coords, attempt_download
from yolort.v5.utils.downloads import safe_download
[3]:
img_size = 640
stride = 64
score_thresh = 0.35
iou = 0.45
fixed_shape = None
[4]:
# img_source = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/zidane.jpg"
img_source = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/bus.jpg"
img_path = "bus.jpg"
safe_download(img_path, img_source)
Downloading https://huggingface.co/spaces/zhiqwang/assets/resolve/main/bus.jpg to bus.jpg...

[5]:
# yolov5n6.pt is downloaded from 'https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5n6.pt'
model_path = 'yolov5n6.pt'
checkpoint_path = attempt_download(model_path)

Load the model via the ultralytics interface and run inference

YOLOv5 provides an input-robust model wrapper named AutoShape that accepts cv2/np/PIL/torch inputs and bundles pre-processing, inference and post-processing (NMS). This wrapper is currently only valid for PyTorch inference. To see what actually happens inside it, we call YOLOv5 through its vanilla interface here instead, starting with the pre-processing step.
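For comparison, the AutoShape path would look roughly like this (a minimal sketch assuming the ultralytics torch.hub entry point; it is not used in the rest of this notebook):

import torch

# AutoShape bundles pre-processing, inference and NMS in a single call
model = torch.hub.load("ultralytics/yolov5", "yolov5n6", pretrained=True)
model.conf = 0.35  # confidence threshold
model.iou = 0.45   # NMS IoU threshold

results = model("bus.jpg")  # accepts a path, URL, PIL/np image or torch tensor
results.print()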

[6]:
# Preprocess
img_raw = cv2.imread(img_path)

image = letterbox(
    img_raw,
    new_shape=(img_size, img_size),
    stride=stride,
    auto=not fixed_shape,
)[0]
image = read_image_to_tensor(image)
image = image.to(device)
image = image[None]
[7]:
vis = parse_single_image(image[0])

Let's also visualize the letterboxed image.

[8]:
cv2_imshow(vis, imshow_scale=0.75, convert_bgr_to_rgb=False)
../_images/notebooks_comparison-between-yolort-vs-yolov5_10_0.png
[9]:
model_yolov5 = load_yolov5_model(checkpoint_path, autoshape=False, verbose=False)
model_yolov5 = model_yolov5.to(device)
model_yolov5.conf = score_thresh  # confidence threshold (0-1)
model_yolov5.iou = iou  # NMS IoU threshold (0-1)
model_yolov5 = model_yolov5.eval()

                 from  n    params  module                                  arguments
  0                -1  1      1760  yolort.v5.models.common.Conv            [3, 16, 6, 2, 2]
  1                -1  1      4672  yolort.v5.models.common.Conv            [16, 32, 3, 2]
  2                -1  1      4800  yolort.v5.models.common.C3              [32, 32, 1]
  3                -1  1     18560  yolort.v5.models.common.Conv            [32, 64, 3, 2]
  4                -1  2     29184  yolort.v5.models.common.C3              [64, 64, 2]
  5                -1  1     73984  yolort.v5.models.common.Conv            [64, 128, 3, 2]
  6                -1  3    156928  yolort.v5.models.common.C3              [128, 128, 3]
  7                -1  1    221568  yolort.v5.models.common.Conv            [128, 192, 3, 2]
  8                -1  1    167040  yolort.v5.models.common.C3              [192, 192, 1]
  9                -1  1    442880  yolort.v5.models.common.Conv            [192, 256, 3, 2]
 10                -1  1    296448  yolort.v5.models.common.C3              [256, 256, 1]
 11                -1  1    164608  yolort.v5.models.common.SPPF            [256, 256, 5]
 12                -1  1     49536  yolort.v5.models.common.Conv            [256, 192, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 8]  1         0  yolort.v5.models.common.Concat          [1]
 15                -1  1    203904  yolort.v5.models.common.C3              [384, 192, 1, False]
 16                -1  1     24832  yolort.v5.models.common.Conv            [192, 128, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 6]  1         0  yolort.v5.models.common.Concat          [1]
 19                -1  1     90880  yolort.v5.models.common.C3              [256, 128, 1, False]
 20                -1  1      8320  yolort.v5.models.common.Conv            [128, 64, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 4]  1         0  yolort.v5.models.common.Concat          [1]
 23                -1  1     22912  yolort.v5.models.common.C3              [128, 64, 1, False]
 24                -1  1     36992  yolort.v5.models.common.Conv            [64, 64, 3, 2]
 25          [-1, 20]  1         0  yolort.v5.models.common.Concat          [1]
 26                -1  1     74496  yolort.v5.models.common.C3              [128, 128, 1, False]
 27                -1  1    147712  yolort.v5.models.common.Conv            [128, 128, 3, 2]
 28          [-1, 16]  1         0  yolort.v5.models.common.Concat          [1]
 29                -1  1    179328  yolort.v5.models.common.C3              [256, 192, 1, False]
 30                -1  1    332160  yolort.v5.models.common.Conv            [192, 192, 3, 2]
 31          [-1, 12]  1         0  yolort.v5.models.common.Concat          [1]
 32                -1  1    329216  yolort.v5.models.common.C3              [384, 256, 1, False]
 33  [23, 26, 29, 32]  1    164220  yolort.v5.models.yolo.Detect            [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [64, 128, 192, 256]]
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 355 layers, 3246940 parameters, 3246940 gradients, 4.6 GFLOPs

[10]:
with torch.no_grad():
    dets_yolov5 = model_yolov5(image)[0]
    dets_yolov5 = non_max_suppression(dets_yolov5, score_thresh, iou, agnostic=False)[0]

Then restore the coordinates to the original scale of the image.

[11]:
boxes_yolov5 = scale_coords(image.shape[2:], dets_yolov5[:, :4], img_raw.shape[:-1])
labels_yolov5 = dets_yolov5[:, 5].to(dtype=torch.int64)
scores_yolov5 = dets_yolov5[:, 4]
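Roughly speaking, scale_coords undoes the letterbox transform: it subtracts the padding, divides by the resize gain and clips to the original image bounds. A simplified sketch of the idea (not the exact YOLOv5 implementation):

def rescale_boxes(boxes, letterboxed_shape, original_shape):
    # Both shapes are (height, width); boxes are xyxy in letterboxed coordinates
    gain = min(letterboxed_shape[0] / original_shape[0],
               letterboxed_shape[1] / original_shape[1])
    pad_w = (letterboxed_shape[1] - original_shape[1] * gain) / 2
    pad_h = (letterboxed_shape[0] - original_shape[0] * gain) / 2

    boxes = boxes.clone()
    boxes[:, [0, 2]] = ((boxes[:, [0, 2]] - pad_w) / gain).clamp(0, original_shape[1])  # x1, x2
    boxes[:, [1, 3]] = ((boxes[:, [1, 3]] - pad_h) / gain).clamp(0, original_shape[0])  # y1, y2
    return boxes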

Now we can visualize the inference results after completing the post-processing.

[12]:
# Get label names
import requests

label_path = "https://huggingface.co/spaces/zhiqwang/assets/resolve/main/coco.names"
response = requests.get(label_path)
names = response.text

LABELS = []

for label in names.strip().split('\n'):
    LABELS.append(label)

COLORS = color_list()
[13]:
for box, label in zip(boxes_yolov5.tolist(), labels_yolov5.tolist()):
    img_raw = plot_one_box(box, img_raw, color=COLORS[label % len(COLORS)], label=LABELS[label])
[14]:
cv2_imshow(img_raw, imshow_scale=0.5)
../_images/notebooks_comparison-between-yolort-vs-yolov5_18_0.png

At this point we have completed the whole YOLOv5 inference process.

Use yolort's approach to inference

yolort now supports loading models trained with YOLOv5. It also provides an end-to-end inference pipeline, and this pipeline supports both jit tracing and scripting modes. The model can be exported to ONNX and TorchScript graphs and run on the ONNX Runtime, LibTorch and TVM VirtualMachine backends.
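For instance, scripting the whole pipeline for LibTorch deployment is straightforward (a minimal sketch; see the deployment documentation for the full export workflow):

import torch
from yolort.models import YOLOv5

# Load the checkpoint the same way as in the next cell
model = YOLOv5.load_from_yolov5("yolov5n6.pt", score_thresh=0.35, nms_thresh=0.45)
model = model.eval()

# Pre-processing, model and post-processing are all scriptable, so the
# whole pipeline can be saved as a single TorchScript graph
scripted_model = torch.jit.script(model)
scripted_model.save("yolov5n6_scripted.pt")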

[15]:
model_yolort = YOLOv5.load_from_yolov5(
    checkpoint_path,
    score_thresh=score_thresh,
    nms_thresh=iou,
    size_divisible=stride,
    fixed_shape=fixed_shape,
)

model_yolort = model_yolort.eval()
model_yolort = model_yolort.to(device)

                 from  n    params  module                                  arguments
  0                -1  1      1760  yolort.v5.models.common.Conv            [3, 16, 6, 2, 2]
  1                -1  1      4672  yolort.v5.models.common.Conv            [16, 32, 3, 2]
  2                -1  1      4800  yolort.v5.models.common.C3              [32, 32, 1]
  3                -1  1     18560  yolort.v5.models.common.Conv            [32, 64, 3, 2]
  4                -1  2     29184  yolort.v5.models.common.C3              [64, 64, 2]
  5                -1  1     73984  yolort.v5.models.common.Conv            [64, 128, 3, 2]
  6                -1  3    156928  yolort.v5.models.common.C3              [128, 128, 3]
  7                -1  1    221568  yolort.v5.models.common.Conv            [128, 192, 3, 2]
  8                -1  1    167040  yolort.v5.models.common.C3              [192, 192, 1]
  9                -1  1    442880  yolort.v5.models.common.Conv            [192, 256, 3, 2]
 10                -1  1    296448  yolort.v5.models.common.C3              [256, 256, 1]
 11                -1  1    164608  yolort.v5.models.common.SPPF            [256, 256, 5]
 12                -1  1     49536  yolort.v5.models.common.Conv            [256, 192, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 8]  1         0  yolort.v5.models.common.Concat          [1]
 15                -1  1    203904  yolort.v5.models.common.C3              [384, 192, 1, False]
 16                -1  1     24832  yolort.v5.models.common.Conv            [192, 128, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 6]  1         0  yolort.v5.models.common.Concat          [1]
 19                -1  1     90880  yolort.v5.models.common.C3              [256, 128, 1, False]
 20                -1  1      8320  yolort.v5.models.common.Conv            [128, 64, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 4]  1         0  yolort.v5.models.common.Concat          [1]
 23                -1  1     22912  yolort.v5.models.common.C3              [128, 64, 1, False]
 24                -1  1     36992  yolort.v5.models.common.Conv            [64, 64, 3, 2]
 25          [-1, 20]  1         0  yolort.v5.models.common.Concat          [1]
 26                -1  1     74496  yolort.v5.models.common.C3              [128, 128, 1, False]
 27                -1  1    147712  yolort.v5.models.common.Conv            [128, 128, 3, 2]
 28          [-1, 16]  1         0  yolort.v5.models.common.Concat          [1]
 29                -1  1    179328  yolort.v5.models.common.C3              [256, 192, 1, False]
 30                -1  1    332160  yolort.v5.models.common.Conv            [192, 192, 3, 2]
 31          [-1, 12]  1         0  yolort.v5.models.common.Concat          [1]
 32                -1  1    329216  yolort.v5.models.common.C3              [384, 256, 1, False]
 33  [23, 26, 29, 32]  1    164220  yolort.v5.models.yolo.Detect            [80, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], [64, 128, 192, 256]]
Model Summary: 355 layers, 3246940 parameters, 3246940 gradients, 4.6 GFLOPs

Its interface is also very easy to use.

[16]:
with torch.no_grad():
    dets_yolort = model_yolort.predict(img_path)
[17]:
boxes_yolort = dets_yolort[0]['boxes']
labels_yolort = dets_yolort[0]['labels']
scores_yolort = dets_yolort[0]['scores']

Verify the detection results between yolort and ultralytics

We print out the results of both inference pipelines.

[18]:
print(f"Detection boxes with yolov5:\n{boxes_yolov5}\n")
print(f"Detection boxes with yolort:\n{boxes_yolort}")
Detection boxes with yolov5:
tensor([[ 32.51723, 225.12900, 810.00000, 741.03424],
        [ 50.41119, 387.52475, 241.58034, 897.60645],
        [219.00005, 386.05475, 345.78729, 869.04047],
        [678.08923, 374.60596, 809.77881, 874.63422]], device='cuda:0')

Detection boxes with yolort:
tensor([[ 32.27846, 225.15259, 811.47729, 740.91077],
        [ 50.42178, 387.48911, 241.54393, 897.61035],
        [219.03334, 386.14346, 345.77686, 869.02582],
        [678.05023, 374.65341, 809.80341, 874.80621]], device='cuda:0')
[19]:
print(f"Detection scores with yolov5:\n{scores_yolov5}\n")
print(f"Detection scores with yolort:\n{scores_yolort}")
Detection scores with yolov5:
tensor([0.88235, 0.84495, 0.72589, 0.70359], device='cuda:0')

Detection scores with yolort:
tensor([0.88238, 0.84486, 0.72629, 0.70077], device='cuda:0')
[20]:
print(f"Detection labels with yolort:\n{labels_yolov5}\n")
print(f"Detection labels with yolort:\n{labels_yolort}")
Detection labels with yolov5:
tensor([5, 0, 0, 0], device='cuda:0')

Detection labels with yolort:
tensor([5, 0, 0, 0], device='cuda:0')
[21]:
# Testing boxes
torch.testing.assert_allclose(boxes_yolort, boxes_yolov5, rtol=1e-2, atol=1e-7)
# Testing scores
torch.testing.assert_allclose(scores_yolort, scores_yolov5, rtol=1e-3, atol=1e-2)
# Testing labels
torch.testing.assert_allclose(labels_yolort, labels_yolov5)

print("Exported model has been tested, and the result looks good!")
Exported model has been tested, and the result looks good!

As you can see from these results, there are still small differences in the boxes, but the scores and labels agree closely.
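If you want to quantify the gap, a quick (purely illustrative) check is the largest absolute deviation between the two outputs:

print(f"max box deviation:   {(boxes_yolort - boxes_yolov5).abs().max().item():.4f} pixels")
print(f"max score deviation: {(scores_yolort - scores_yolov5).abs().max().item():.5f}")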

View this document as a notebook: https://github.com/zhiqwang/yolov5-rt-stack/blob/main/notebooks/comparison-between-yolort-vs-yolov5.ipynb