OpenCV Java Object Detection
This article demonstrates how to identify specific objects in real images with Java, using the OpenCV DNN (Deep Neural Network) module as the inference engine.
Object detection has become a widely used technique for identifying specific objects in pictures and videos across many fields.
OpenCV provides the DNN module as its inference engine. The goal of OpenCV Java Object Detection is therefore to detect real-world objects from different image sources using both the Mask R-CNN model, trained with TensorFlow, and the YOLO model, implemented in Darknet.

1. OpenCV Java Object Detection Overview
The following figure represents my intended approach to implementing OpenCV Java Object Detection from a programmer’s perspective.
The OpenCV Java Object Detection process works as follows:
- Takes user input for the CNN models and image information.
- Processes object detection based on the provided input.
- Displays the identified objects on the screen.
2. DNN Models For Object Detection
The Mask R-CNN and YOLO models are the essential components for implementing OpenCV Java Object Detection; see the References section for more details about each.
For the YOLO family, this article uses YOLOv4: it improves on YOLOv3 while remaining compatible with the same Darknet tooling, and it is still very effective even though newer releases such as YOLOv7 exist.
Mask R-CNN, on the other hand, is an extension of Faster R-CNN, designed as a deep learning model for both object detection and object segmentation. Mask R-CNN performs object segmentation by predicting a pixel-level mask for each object, allowing you to obtain not only the bounding box but also the exact shape of the object.
2.1 YOLO Model
YOLO (You Only Look Once) is a single Convolutional Neural Network (CNN) that:
- Optimizes detection speed by dividing the image into grids.
- Predicts the class (type) and bounding box for each grid cell simultaneously.
- Allows real-time, high-speed detection.
- Achieves approximately 43-50% accuracy (COCO dataset) in object detection.
2.2 Mask R-CNN Model
Based on Faster R-CNN, Mask R-CNN offers higher accuracy than YOLO, but at a lower speed.
- Predicts the bounding box, class (object type), and mask (at the pixel level) for each object.
- Provides more detailed predictions by adopting ROI Align.
- Achieves approximately 60-70% accuracy (COCO dataset) in object detection.
3. Download Pre-trained Model
The OpenCV DNN module is an inference engine in OpenCV that supports various frameworks, including:
- TensorFlow (.pb, .pbtxt)
- Caffe (.prototxt, .caffemodel)
- ONNX (.onnx)
- Darknet (YOLO, .cfg, .weights)
- Torch (.t7)
- OpenVINO Model (.xml, .bin)
To use both YOLO and Mask R-CNN models, the OpenCV DNN module requires downloading pre-trained models for both Darknet and TensorFlow.
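Section 8.2 shows how model loading is wired into the application; as a quick orientation, here is a minimal sketch of loading each model family with the DNN module. The file names are assumptions matching the downloads described below:

```java
import org.opencv.core.Core;
import org.opencv.dnn.Dnn;
import org.opencv.dnn.Net;

public class ModelLoadSketch {
    static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

    public static void main(String[] args) {
        // Darknet/YOLO: configuration file first, then the weights file.
        Net yoloNet = Dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights");

        // TensorFlow/Mask R-CNN: frozen graph first, then the text graph definition.
        Net mrcnnNet = Dnn.readNetFromTensorflow(
                "frozen_inference_graph.pb",
                "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt");

        System.out.println("YOLO loaded: " + (!yoloNet.empty()));
        System.out.println("Mask R-CNN loaded: " + (!mrcnnNet.empty()));
    }
}
```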
3.1 Download Pre-trained Model for YOLO
In OpenCV Java Object Detection, the following two files for YOLOv4 must be downloaded: the yolov4.weights file contains the learned parameters of the Darknet model, while the yolov4.cfg file defines the model’s architecture and training configuration. These two files are required as input parameters for OpenCV’s inference engine.
3.2 Download Pre-trained Model for Mask R-CNN
You can use a model in the Frozen Graph (.pb) format, which is compatible with TensorFlow 1.x and can be loaded into the OpenCV DNN module. Additionally, the latest version of the Mask R-CNN model uses the Inception V2 architecture, released on January 28, 2018. You can download the model from the following site:
If the download link for the Mask R-CNN frozen model is blocked, you can work around the issue by copying the link, pasting it into a new tab, and allowing the file to download.
4. Setting Up OpenCV Java Object Detection
Before diving into OpenCV Java Object Detection, we need to set up the necessary tools and identify the main class to execute. Therefore, it’s important to familiarize yourself with the tools required for development and launch the main program.
4.1 Tools for Downloading
The following tools must be downloaded or installed for the development of OpenCV Java Object Detection:
- JDK 17 or higher: The base SDK for OpenCV Java Object Detection.
- IDE: Apache NetBeans or Eclipse can be used. NetBeans includes the Matisse design tool, which makes building GUI applications easier.
- OpenCV: The Computer Vision library for object detection from various image sources.
You can refer to the previous post for instructions on how to set up Apache NetBeans and OpenCV.
4.2 How to Launch the Program
There are several ways to run the OpenCV Java Object Detection program. For example, you can use development tools like JDK with the command line, Eclipse IDE, or any other Java IDE. However, I will explain how to run it using Apache NetBeans IDE, where I wrote and tested the source code. First, you can download and extract the zip file from the download section to any location on your computer.
a. Open Properties
Right-click on the project named ‘JavaOpenCVObjDetector’ and select Properties.
b. Add OpenCV Java library
You need to add the OpenCV Java library under the Libraries category.

c. Add Native OpenCV library
In the Run category, the ObjectDetectorMain class is the main class to launch. In addition, you need to set the VM options as follows:
-Djava.library.path=where_to_opencv\opencv\build\java\x64

5. Implement OpenCV Java Object Detection
OpenCV’s inference engine, the DNN module, supports various deep learning models. With the model files for both YOLO and Mask R-CNN downloaded, as described in Section 3, ‘Download Pre-trained Models,’ we are ready to implement object detection on top of this module.
5.1 Process of OpenCV Java Object Detection
We will review the steps and methods involved in the process before implementing OpenCV Java Object Detection.
a. User Input:
Set up the model, configuration file, and image file from user input.
b. Load CNN Model:
Use a different method depending on the model type:
- readNetFromTensorflow → Load the Mask R-CNN model.
- readNetFromDarknet → Load the YOLO model.
c. Prepare the Blob for CNN Input:
- blobFromImage: Convert the image into a blob suitable for CNN input.
- VideoCapture: Extract a frame from an image source (e.g., video).
d. Perform Inference with the Blob:
- setInput: Pass the blob into the CNN model, putting the data in a format the model can understand.
- forward: Run the actual object detection.
e. Retrieve the Detection Results:
- Mask R-CNN: Extract the bounding box and mask (segmentation).
- YOLO: Extract the bounding box coordinates, class ID, and confidence score.
f. Draw the Detected Objects on the Original Image:
- rectangle: Draw bounding boxes.
- putText: Display the class ID and confidence score.
- For Mask R-CNN, render the mask on the image.
g. Save or Display the Output Result Based on User Options:
- imshow: Display the result on the screen.
- VideoWriter: Save the detected output to a file (optional).
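Taken together, these steps map onto only a handful of DNN-module calls. The following condensed sketch shows the YOLO path end to end; the file names and the 416×416 input size are assumptions for illustration, and the parsing and drawing steps are filled in by later sections:

```java
import java.util.ArrayList;
import java.util.List;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.dnn.Dnn;
import org.opencv.dnn.Net;
import org.opencv.highgui.HighGui;
import org.opencv.imgcodecs.Imgcodecs;

public class PipelineSketch {
    static { System.loadLibrary(Core.NATIVE_LIBRARY_NAME); }

    public static void main(String[] args) {
        // b. Load the CNN model (YOLO via Darknet; assumed file names).
        Net net = Dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights");

        // c. Prepare the blob for CNN input.
        Mat frame = Imgcodecs.imread("input.jpg");
        Mat blob = Dnn.blobFromImage(frame, 1 / 255.0, new Size(416, 416),
                new Scalar(0, 0, 0), true, false);

        // d. Perform inference with the blob.
        net.setInput(blob);
        List<Mat> outputs = new ArrayList<>();
        net.forward(outputs, net.getUnconnectedOutLayersNames());

        // e./f. Parsing the outputs, NMS, and drawing are covered in Sections 5.2 and 10.
        System.out.println("Output blobs: " + outputs.size());

        // g. Display the (still unannotated) frame.
        HighGui.imshow("Pipeline sketch", frame);
        HighGui.waitKey(0);
    }
}
```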
5.2 Configuring Output Layers for Models
First, we must specify which layers will serve as output layers before running object detection; the output layers are the layers whose results we need to read.
The YOLO model can detect its output layers automatically, whereas Mask R-CNN cannot. Therefore, you need to manually add an output layer named detection_out_final if you choose the Mask R-CNN model.
5.2.1 YOLO Model Output Layers
The YOLO model exposes its predicted values through its unconnected output layers. As a result, we can obtain detection outputs even without explicitly defining an output layer, since the predictions are read from those layers.
- YOLO predicts objects across multiple output layers, requiring us to select the predictions with the highest confidence scores. To produce the final detection results, we apply the Non-Maximum Suppression (NMS) technique, which eliminates redundant and overlapping bounding boxes; a minimal sketch follows this list.
- The output results can be extracted directly without explicitly defining output layers; all the necessary information is available in the unconnected layers.
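As noted above, OpenCV ships NMS as a built-in function. A minimal sketch, assuming the candidate boxes and their scores have already been collected from the YOLO output layers:

```java
import java.util.List;

import org.opencv.core.MatOfFloat;
import org.opencv.core.MatOfInt;
import org.opencv.core.MatOfRect2d;
import org.opencv.core.Rect2d;
import org.opencv.dnn.Dnn;

public final class NmsSketch {
    /** Returns the indices of the boxes that survive Non-Maximum Suppression. */
    public static int[] applyNms(List<Rect2d> boxes, List<Float> scores,
                                 float scoreThreshold, float nmsThreshold) {
        MatOfRect2d boxesMat = new MatOfRect2d(boxes.toArray(new Rect2d[0]));
        float[] scoreArray = new float[scores.size()];
        for (int i = 0; i < scoreArray.length; i++) {
            scoreArray[i] = scores.get(i);
        }
        MatOfFloat scoresMat = new MatOfFloat(scoreArray);

        MatOfInt keptIndices = new MatOfInt();
        Dnn.NMSBoxes(boxesMat, scoresMat, scoreThreshold, nmsThreshold, keptIndices);
        return keptIndices.empty() ? new int[0] : keptIndices.toArray();
    }
}
```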
5.2.2 Output Layers of the Mask R-CNN Model
The Mask R-CNN model performs object detection along with object segmentation.
This model processes two layers simultaneously:
- detection_out_final: Bounding boxes and confidence scores of detected objects.
- detection_masks: Masks for each detected object.
The detection_out_final output layer contains bounding boxes and class information, which identify the locations and types of objects; it is not, however, the output layer for mask predictions. The detection_masks layer is the one responsible for predicting the masks, and mask prediction is what distinguishes the Mask R-CNN model.
5.2.3 Structure of the detection_out_final Layer
This layer is composed of a 4D array, and the description of each axis is as follows:
a. Meaning of each dimension:
- First axis (Batch size): Represents the batch size. Mask R-CNN typically processes one image at a time, so the batch size is usually 1.
- Second axis (Number of Channels): Represents the number of channels, which is not used in this context.
- Third axis (Number of Objects): The maximum number of detected objects. Mask R-CNN supports a maximum of 100 objects for detection.
- Fourth axis (Attributes of Objects): Contains 7 values describing each detected object:
- detection_out_final[0, 0, i, 0]: Batch (image) index
- detection_out_final[0, 0, i, 1]: Class ID
- detection_out_final[0, 0, i, 2]: Confidence score
- detection_out_final[0, 0, i, 3]: Bounding box (x1)
- detection_out_final[0, 0, i, 4]: Bounding box (y1)
- detection_out_final[0, 0, i, 5]: Bounding box (x2)
- detection_out_final[0, 0, i, 6]: Bounding box (y2)
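In Java, the easiest way to consume this 4D Mat is to flatten it into a float array, exactly as the implementation in Section 10.5.1 does. The helper below is an illustrative sketch, not part of the project code:

```java
import org.opencv.core.Mat;

public final class DetectionRowSketch {
    /** Prints the seven attributes of every detection row in a detection_out_final Mat. */
    public static void dumpDetections(Mat detection) {
        final int cols = 7;                          // seven attributes per detection
        int rows = (int) (detection.total() / cols);
        float[] data = new float[(int) detection.total()];
        detection.reshape(1, rows).get(0, 0, data);  // flatten the 1x1xNx7 Mat

        for (int i = 0; i < rows; i++) {
            int classId = (int) data[i * cols + 1];  // index 1: class ID
            float confidence = data[i * cols + 2];   // index 2: confidence score
            // Indices 3-6: normalized bounding-box corners (x1, y1, x2, y2).
            System.out.printf("class=%d conf=%.2f box=(%.3f, %.3f, %.3f, %.3f)%n",
                    classId, confidence,
                    data[i * cols + 3], data[i * cols + 4],
                    data[i * cols + 5], data[i * cols + 6]);
        }
    }
}
```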
5.2.4 Structure of the detection_masks Layer
The detection_masks layer, like the detection_out_final layer, is a 4D array. This layer provides the object masks, which are an important output. To better understand its function, let’s take a closer look at the structure of the detection_masks layer and its dimensions.
a. Background
This layer contains the mask predictions for the objects detected by the Mask R-CNN model. Each mask represents the object within its bounding box at the pixel level: values near 1 indicate the object and values near 0 the background, and the float mask is thresholded (for example at 0.5) to obtain a binary mask.
b. Structure of the detection_masks Layer
The detection_masks layer is a 4D array; judging from the sample output shown in Section 10.5, its shape is (100, 90, 15, 15):
- First axis (Number of detected objects): The maximum number of detected objects. The Mask R-CNN model supports up to 100 objects per image.
- Second axis (Number of classes): One mask candidate per class; 90 classes for the COCO label set.
- Third and fourth axes (Size of Mask: H x W): The size of the mask for each object (15×15 here). The mask covers the object’s bounding box, and since (H, W) is much smaller than the original image, the mask is resized onto the box when rendered.
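The mask for a particular detection is read by slicing this 4D Mat at the detection index and class ID. The following sketch mirrors the extraction used later in Section 10.3 and is shown here only for illustration:

```java
import org.opencv.core.CvType;
import org.opencv.core.Mat;

public final class MaskRowSketch {
    /**
     * Extracts the HxW mask of detection i for classId from a detection_masks Mat
     * (shape: numDetections x numClasses x H x W).
     */
    public static Mat extractMask(Mat detectionMasks, int i, int classId) {
        Mat rowMat = detectionMasks.row(i).col(classId);
        Mat mask = new Mat(rowMat.size(2), rowMat.size(3), CvType.CV_32F);
        rowMat.reshape(1, (int) rowMat.total())
              .copyTo(mask.reshape(1, (int) rowMat.size(2)));
        return mask; // float values in [0, 1]; threshold (e.g. > 0.5) to binarize
    }
}
```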
5.3 Input Parameters of blobFromImage
Typically, the readNetFromDarknet method is used for the YOLO model, while the readNetFromTensorflow method is used for the Mask R-CNN model; both require a model configuration.
Additionally, a blob is the input tensor passed to the CNN, and it is created by the blobFromImage function.
Method signature:
```python
blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0, size=input_size, swapRB=True, crop=False)
```
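The call above is the Python form often shown in references. In the Java bindings used throughout this article, the equivalent is Dnn.blobFromImage; a sketch of the same call, assuming image is a Mat and inputSize a Size:

```java
// Java equivalent of the Python call above (org.opencv.dnn.Dnn).
Mat blob = Dnn.blobFromImage(image, 1 / 255.0, inputSize,
        new Scalar(0, 0, 0), // mean value to subtract (none here)
        true,                // swapRB: convert BGR to RGB
        false);              // crop: keep the aspect ratio instead of cropping
```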
Next, let’s look at the typical input sizes used by the YOLO and Mask R-CNN models.
5.3.1 Arguments of blobFromImage for the YOLO Model
- scalefactor=1/255.0: Normalizes pixel values to a range between 0 and 1.
- size=(416, 416): Adjusts the image size for the YOLO model.
- swapRB=true: Converts the image to RGB, as OpenCV uses BGR order by default.
- crop=false: Resizes the image while keeping the original aspect ratio (no cropping).
5.3.2 Arguments of blobFromImage for the Mask R-CNN Model
- scalefactor=1.0: No normalization of pixel values (keeps original scale).
- size=(1024, 1024): Adjusts the image size. Common sizes are 800×800 or 1024×1024.
- swapRB=true: Converts the image to RGB, as OpenCV uses BGR order by default.
- crop=false: Resizes the image while maintaining the original aspect ratio.
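Putting the two argument sets side by side, here is a sketch of the concrete Java calls; frame is assumed to be an already-loaded Mat, and the 800×800 Mask R-CNN size is one of the common choices mentioned above:

```java
// YOLO: normalize pixel values to [0, 1] and resize to 416x416.
Mat yoloBlob = Dnn.blobFromImage(frame, 1 / 255.0, new Size(416, 416),
        new Scalar(0, 0, 0), true, false);

// Mask R-CNN: keep the original pixel scale and resize to 800x800.
Mat mrcnnBlob = Dnn.blobFromImage(frame, 1.0, new Size(800, 800),
        Scalar.all(0), true, false);
```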
5.4 YOLO Model Size Parameters Explained
The size parameter of the YOLO model defines the input dimensions of the Deep Neural Network (DNN), typically represented as (width, height). As a result, you must resize your images to match the model’s expected input size, which was defined during the training of the Convolutional Neural Network (CNN).
For example:
- YOLOv4 → (416, 416)
- SSD MobileNet → (300, 300)
- ResNet → (224, 224)
Using the correct input size is crucial for achieving accurate results. Furthermore, in YOLO models, the input size is typically fixed and determined by the model’s architecture.
| Model | Input Size (size parameter) |
|---|---|
| YOLOv3 | 416×416 |
| YOLOv4 | 416×416 or 608×608 |
| YOLOv5 | 640×640 |
| YOLOv6 | 640×640 |
| YOLOv7 | 640×640 |
| YOLOv8 | 640×640 |
- The default size for YOLOv3 and YOLOv4 is 416×416, while a size of 608×608 is also available for YOLOv4.
- Starting from YOLOv5, the default size is 640×640.
5.5 Mask R-CNN Model Size Parameters Explained
- Mask R-CNN with the COCO Dataset:
- Mask R-CNN supports various input sizes, but sizes of (800, 800) or (1024, 1024) are commonly used. The exact input size is specified by the Mask R-CNN model type.
- For certain models, 1024×1024 is used as the input size.
- Research has shown that using a 1024×1024 image can lead to higher accuracy.
6. Class Design for OpenCV Java Object Detection
Now that we have an understanding of both the YOLO and Mask R-CNN models for the OpenCV DNN module, we can move on to the classes involved in our OpenCV Java Object Detection. Therefore, I believe six classes are necessary for the implementation of our OpenCV Java Object Detection.
6.1 Class Diagram and Associations
Here is my concept of the classes for OpenCV Java Object Detection.
The responsibilities of each class are described as follows:
- YoloDetector: The implementation class for object detection using the YOLO model. It inherits from ObjectDetector.
- DNNOption: The user input class that accepts files and options for detection.
- ObjectDetectorMain: The main class that contains the entry point the user must run.
- IObjectDetector: The interface that defines the detectObject method, which subclasses must implement.
- ObjectDetector: The parent class of both the YOLO and Mask R-CNN implementations; it declares detectObject as an abstract method.
- MaskRCNNDetector: The implementation class for object detection using the Mask R-CNN model. It inherits from ObjectDetector.
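To make these associations concrete, here is a minimal skeleton of the hierarchy. The signatures are simplified sketches of the classes detailed in Sections 7 through 10, and the DNNOption stub stands in for the real input class:

```java
// Simplified sketch of the class associations (full sources follow in later sections).
class DNNOption { /* stores the user input collected in Section 7 */ }

interface IObjectDetector {
    void detectObject();
}

abstract class ObjectDetector implements IObjectDetector {
    protected final DNNOption dnnOption;

    protected ObjectDetector(DNNOption dnnOption) {
        this.dnnOption = dnnOption;
    }
}

class YoloDetector extends ObjectDetector {
    YoloDetector(DNNOption option) { super(option); }

    @Override
    public void detectObject() { /* YOLO-specific detection */ }
}

class MaskRCNNDetector extends ObjectDetector {
    MaskRCNNDetector(DNNOption option) { super(option); }

    @Override
    public void detectObject() { /* Mask R-CNN-specific detection */ }
}
```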
7. User Input UI Layout
Using the OpenCV DNN module requires several parameters, which vary depending on the selected model for the OpenCV Java Object Detection implementation. Initially, I considered using a command-line parser to gather input from the user, but this would make the program more complicated. Therefore, I decided to create a simple window for user input. Afterward, we will focus on the methods provided by OpenCV.

Next, we will discuss the components and their associated variables, along with a description table.
| Swing Component | Variable Name | Value | Note |
|---|---|---|---|
| JComboBox | comboModelType | ‘MRCNN’ or ‘YOLO’ | Select the Mask R-CNN or YOLO model. |
| JComboBox | comboMediaType | ‘Image’ or ‘Video’ | Select image or video input. |
| JComboBox | comboDeviceType | ‘cpu’ or ‘gpu’ | Select CPU or GPU. |
| JTextField | txtModel | Path to the .weights (YOLO) or .pb (Mask R-CNN) file | The expected file extension follows the comboModelType selection. |
| JTextField | txtConfig | Config file path | Configuration file for the YOLO or Mask R-CNN model. |
| JTextField | txtFileIn | Image or video file path | File path for the image or video file, by media type. |
| JTextField | txtThreshold | Confidence score | Objects below the threshold are skipped; objects above it are kept. The default threshold is 0.5. |
| JCheckBox | isSaveOutput | true or false | Save images or videos with detected objects highlighted. |
| JCheckBox | isMaskShading | true or false | Shade masks only when the Mask R-CNN model is selected. |
7.1 Implementation of User Input Layout
You can save the parameters from the user input GUI in the DNNOption class for later use; the role of this class is to store the user input.
This is the GUI implementation: the main method, setUpOptions, is built on the showConfirmDialog method of JOptionPane and uses a JPanel for user input.
```java
int result = JOptionPane.showConfirmDialog(null, panel, WinName,
        JOptionPane.OK_CANCEL_OPTION, JOptionPane.PLAIN_MESSAGE);
```
The JPanel uses GridBagLayout for its input fields. All information entered by the user is saved when the “OK” button is clicked, as shown below:
```java
if (result == JOptionPane.OK_OPTION) {
    mediaType = comboMediaType.getSelectedItem().toString().trim();
    deviceType = comboDeviceType.getSelectedItem().toString().trim();
    modelPath = txtModel.getText();
    modelConfiguration = txtConfig.getText();
    mediaFile = txtFileIn.getText();
    isSaveOutput = chkSaveVdo.isSelected();
    threshold = Double.parseDouble(txtThreshold.getText());
    return this;
} else {
    //System.out.println("Cancelled");
    return null;
}
```
Next, here is the full source code of the setUpOptions method.
```java
public DNNOption setUpOptions() {
    // Create the main panel.
    JPanel panel = new JPanel();
    panel.setPreferredSize(new Dimension(400, 350));
    panel.setLayout(new GridBagLayout());
    GridBagConstraints gbc = new GridBagConstraints();
    gbc.insets = new Insets(5, 5, 5, 5);
    gbc.fill = GridBagConstraints.HORIZONTAL;
    gbc.gridx = 0;
    gbc.gridy = 0;

    // Create a JComboBox group.
    JPanel comboBoxGroup = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JComboBox<String> comboModelName = new JComboBox<>(new String[]{ObjectDetector.TYPE_NONE,
            ObjectDetector.MODEL_YOLO, ObjectDetector.MODEL_MASK_RCNN});
    JComboBox<String> comboMediaType = new JComboBox<>(new String[]{ObjectDetector.TYPE_NONE,
            ObjectDetector.TYPE_IMAGE, ObjectDetector.TYPE_VIDEO});
    JComboBox<String> comboDeviceType = new JComboBox<>(new String[]{ObjectDetector.TYPE_CPU,
            ObjectDetector.TYPE_GPU});
    comboBoxGroup.add(comboModelName);
    comboBoxGroup.add(comboMediaType);
    comboBoxGroup.add(comboDeviceType);
    comboModelName.setPreferredSize(new Dimension(100, 30));
    comboMediaType.setPreferredSize(new Dimension(100, 30));
    comboDeviceType.setPreferredSize(new Dimension(100, 30));
    panel.add(comboBoxGroup, gbc);
    comboBoxGroup.setPreferredSize(new Dimension(320, 40));

    // Create a JTextField + JButton group (model file).
    gbc.gridy++;
    JPanel group1 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup1 = new JPanel(new BorderLayout());
    JLabel lblModel = new JLabel("Model: ");
    JTextField txtModel = new JTextField(10);
    JButton btnFind1 = new JButton("Find");
    txtModel.setPreferredSize(new Dimension(200, 30));
    btnFind1.setPreferredSize(new Dimension(60, 30));
    ingroup1.add(lblModel, BorderLayout.WEST);
    ingroup1.add(txtModel, BorderLayout.CENTER);
    ingroup1.add(btnFind1, BorderLayout.EAST);
    ingroup1.setPreferredSize(new Dimension(310, 30));
    group1.setPreferredSize(new Dimension(350, 40));
    group1.add(ingroup1);
    panel.add(group1, gbc);

    // Create a JTextField + JButton group (config file).
    gbc.gridy++;
    JPanel group2 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup2 = new JPanel(new BorderLayout());
    group2.setPreferredSize(new Dimension(100, 100));
    JLabel lblConfig = new JLabel("Config: ");
    JTextField txtConfig = new JTextField(10);
    JButton btnFind2 = new JButton("Find");
    txtConfig.setPreferredSize(new Dimension(200, 30));
    btnFind2.setPreferredSize(new Dimension(60, 30));
    ingroup2.add(lblConfig, BorderLayout.WEST);
    ingroup2.add(txtConfig, BorderLayout.CENTER);
    ingroup2.add(btnFind2, BorderLayout.EAST);
    ingroup2.setPreferredSize(new Dimension(310, 30));
    group2.setPreferredSize(new Dimension(350, 40));
    group2.add(ingroup2);
    panel.add(group2, gbc);

    // Create a JTextField + JButton group (input media file).
    gbc.gridy++;
    JPanel group3 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup3 = new JPanel(new BorderLayout());
    group3.setPreferredSize(new Dimension(100, 100));
    JLabel lblFileIn = new JLabel("InFile: ");
    JTextField txtFileIn = new JTextField(10);
    JButton btnFind3 = new JButton("Find");
    txtFileIn.setPreferredSize(new Dimension(200, 30));
    btnFind3.setPreferredSize(new Dimension(60, 30));
    ingroup3.add(lblFileIn, BorderLayout.WEST);
    ingroup3.add(txtFileIn, BorderLayout.CENTER);
    ingroup3.add(btnFind3, BorderLayout.EAST);
    ingroup3.setPreferredSize(new Dimension(310, 30));
    group3.setPreferredSize(new Dimension(350, 40));
    group3.add(ingroup3);
    panel.add(group3, gbc);

    // Create a JLabel + JTextField group (threshold).
    gbc.gridy++;
    JPanel group4 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup4 = new JPanel(new BorderLayout());
    group4.setPreferredSize(new Dimension(100, 100));
    JLabel lblThreshold = new JLabel("Threshold: ");
    JTextField txtThreshold = new JTextField();
    txtThreshold.setHorizontalAlignment(JTextField.RIGHT);
    txtThreshold.setText("0.5");
    ingroup4.add(lblThreshold, BorderLayout.WEST);
    ingroup4.add(txtThreshold, BorderLayout.CENTER);
    ingroup4.setPreferredSize(new Dimension(310, 30));
    group4.setPreferredSize(new Dimension(350, 40));
    group4.add(ingroup4);
    panel.add(group4, gbc);

    // Create a JLabel + JCheckBox group (save output).
    gbc.gridy++;
    JPanel group5 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup5 = new JPanel(new BorderLayout());
    JLabel lblChkSaveVdo = new JLabel("Save Output: ");
    JCheckBox chkSaveVdo = new JCheckBox();
    ingroup5.add(lblChkSaveVdo, BorderLayout.CENTER);
    ingroup5.add(chkSaveVdo, BorderLayout.EAST);
    ingroup5.setPreferredSize(new Dimension(310, 30));
    group5.setPreferredSize(new Dimension(350, 40));
    group5.add(ingroup5);
    panel.add(group5, gbc);

    // Create a JLabel + JCheckBox group (mask shading).
    gbc.gridy++;
    JPanel group6 = new JPanel(new FlowLayout(FlowLayout.LEFT, 5, 0));
    JPanel ingroup6 = new JPanel(new BorderLayout());
    //ingroup6.setPreferredSize(new Dimension(100,100));
    JLabel lblChkMaskShade = new JLabel("Shading Mask: ");
    JCheckBox chkhkMaskShade = new JCheckBox();
    chkhkMaskShade.setEnabled(false);
    ingroup6.add(lblChkMaskShade, BorderLayout.CENTER);
    ingroup6.add(chkhkMaskShade, BorderLayout.EAST);
    ingroup6.setPreferredSize(new Dimension(310, 30));
    group6.setPreferredSize(new Dimension(350, 40));
    group6.add(ingroup6);
    panel.add(group6, gbc);

    // Enable mask shading only when the Mask R-CNN model is selected.
    comboModelName.addActionListener((ActionEvent e) -> {
        modelName = (String) comboModelName.getSelectedItem();
        if (modelName.equals(ObjectDetector.MODEL_MASK_RCNN)) {
            chkhkMaskShade.setEnabled(true);
        } else {
            chkhkMaskShade.setEnabled(false);
        }
    });
    comboMediaType.addActionListener((ActionEvent e) -> {
        mediaType = (String) comboMediaType.getSelectedItem();
    });
    comboDeviceType.addActionListener((ActionEvent e) -> {
        deviceType = (String) comboDeviceType.getSelectedItem();
    });

    // File chooser for the model file.
    btnFind1.addActionListener((ActionEvent e) -> {
        JFileChooser fileChooser = new JFileChooser();
        fileChooser.setCurrentDirectory(new File("C:\\"));
        String model = comboModelName.getSelectedItem().toString().trim();
        if (MODEL_YOLO.equals(model)) {
            FileNameExtensionFilter filter =
                    new FileNameExtensionFilter("YOLO File (.weights)", "weights");
            fileChooser.setFileFilter(filter);
        } else if (MODEL_MASK_RCNN.equals(model)) {
            FileNameExtensionFilter filter =
                    new FileNameExtensionFilter("MRCNN File (.pb)", "pb");
            fileChooser.setFileFilter(filter);
        }
        int returnValue = fileChooser.showOpenDialog(null);
        if (returnValue == JFileChooser.APPROVE_OPTION) {
            File selectedFile = fileChooser.getSelectedFile();
            txtModel.setText(selectedFile.getAbsolutePath());
        }
    });

    // File chooser for the configuration file.
    btnFind2.addActionListener((ActionEvent e) -> {
        JFileChooser fileChooser = new JFileChooser();
        fileChooser.setCurrentDirectory(new File("C:\\"));
        String model = comboModelName.getSelectedItem().toString().trim();
        if (MODEL_YOLO.equals(model)) {
            FileNameExtensionFilter filter =
                    new FileNameExtensionFilter("YOLO Config (.cfg)", "cfg");
            fileChooser.setFileFilter(filter);
        } else if (MODEL_MASK_RCNN.equals(model)) {
            // Note: extensions are given without the leading dot.
            FileNameExtensionFilter filter =
                    new FileNameExtensionFilter("MRCNN Config (.pbtxt)", "pbtxt");
            fileChooser.setFileFilter(filter);
        }
        int returnValue = fileChooser.showOpenDialog(null);
        if (returnValue == JFileChooser.APPROVE_OPTION) {
            File selectedFile = fileChooser.getSelectedFile();
            txtConfig.setText(selectedFile.getAbsolutePath());
        }
    });

    // File chooser for the input image or video file.
    btnFind3.addActionListener((ActionEvent e) -> {
        JFileChooser fileChooser = new JFileChooser();
        fileChooser.setCurrentDirectory(new File("C:\\")); // Default directory
        String mType = comboMediaType.getSelectedItem().toString().trim();
        if (TYPE_IMAGE.equals(mType)) {
            FileNameExtensionFilter filter = new FileNameExtensionFilter(
                    "Image Files (JPG, PNG)", "jpg", "jpeg", "png");
            fileChooser.setFileFilter(filter);
        } else if (TYPE_VIDEO.equals(mType)) {
            //chkSaveVdo.setEnabled(true);
            FileNameExtensionFilter filter =
                    new FileNameExtensionFilter("Video Files (MP4)", "mp4");
            fileChooser.setFileFilter(filter);
        }
        int returnValue = fileChooser.showOpenDialog(null);
        if (returnValue == JFileChooser.APPROVE_OPTION) {
            File selectedFile = fileChooser.getSelectedFile();
            txtFileIn.setText(selectedFile.getAbsolutePath());
        }
    });

    // OK / CANCEL button dialog.
    int result = JOptionPane.showConfirmDialog(null, panel, WinName,
            JOptionPane.OK_CANCEL_OPTION, JOptionPane.PLAIN_MESSAGE);
    if (result == JOptionPane.OK_OPTION) {
        mediaType = comboMediaType.getSelectedItem().toString().trim();
        deviceType = comboDeviceType.getSelectedItem().toString().trim();
        modelPath = txtModel.getText();
        modelConfiguration = txtConfig.getText();
        mediaFile = txtFileIn.getText();
        threshold = Double.parseDouble(txtThreshold.getText());
        isSaveOutput = chkSaveVdo.isSelected();
        isMaskRendering = chkhkMaskShade.isSelected();
        return this;
    } else {
        return null;
    }
}
```
7.2 OpenCV Java Object Detection Main Method
The ObjectDetectorMain class contains the main method for OpenCV Java Object Detection. Specifically, this class uses the DNNOption class to store the user input parameters.
Here is the source code:
```java
public static void main(String[] args) {
    DNNOption dnnOption = new DNNOption();
    IObjectDetector detector = null;

    if (dnnOption.setUpOptions() != null && dnnOption.isValid()) {
        if (dnnOption.modelName.equals(ObjectDetector.MODEL_MASK_RCNN)) {
            detector = new MaskRCNNDetector(dnnOption);
        } else if (dnnOption.modelName.equals(ObjectDetector.MODEL_YOLO)) {
            detector = new YoloDetector(dnnOption);
        }
        if (detector != null) {
            detector.detectObject();
        }
    } else {
        JOptionPane.showMessageDialog(null,
                "Please retry after checking your options.",
                "Insufficient option",
                JOptionPane.WARNING_MESSAGE);
    }
}
```
It calls the setUpOptions method for user input and performs a validation check on the options.
```java
if (dnnOption.setUpOptions() != null && dnnOption.isValid())
```
The MaskRCNNDetector class is instantiated if the model selected by the user is Mask R-CNN, while the YoloDetector class is instantiated if the user selects the YOLO model. In either case, each class must implement the detectObject method inherited from the IObjectDetector interface.
Based on the modelName selected by the user, the appropriate class is instantiated.
```java
if (dnnOption.modelName.equals(ObjectDetector.MODEL_MASK_RCNN)) {
    detector = new MaskRCNNDetector(dnnOption);
} else if (dnnOption.modelName.equals(ObjectDetector.MODEL_YOLO)) {
    detector = new YoloDetector(dnnOption);
}
```
Once the class instantiation is complete, the detectObject method of the implementation class is called.
```java
if (detector != null) {
    detector.detectObject();
}
```
8. ObjectDetector: The Parent Class
The ObjectDetector class is an abstract class that serves as the parent class for both the YOLO and Mask R-CNN detectors. It implements the IObjectDetector interface and declares the detectObject method as abstract.
Below is the full source code of the ObjectDetector class:
```java
/*
 * Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
 * Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
 */
package com.tobee.opencv.detector;

import com.tobee.opencv.detector.intrfce.IObjectDetector;
import com.tobee.opencv.detector.input.DNNOption;
import conf.CocoLabel;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfInt;
import org.opencv.core.Rect;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.dnn.Dnn;
import org.opencv.dnn.Net;
import org.opencv.imgproc.Imgproc;

/**
 *
 * @author Tobee
 */
public abstract class ObjectDetector implements IObjectDetector {

    protected String winName;
    protected String modelName;
    protected String mediaType;
    protected String deviceType;
    protected String mediaFile;
    protected String modelPath;
    protected String modelConfiguration;
    protected boolean isSaveOuput;
    public boolean isMaskRendering;
    protected double threshold;

    private final DNNOption dnnOption;

    static {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    }

    protected ObjectDetector(final DNNOption dnnOption) {
        this.dnnOption = dnnOption;
        setupOption();
        setupLabelSet();
    }

    private void setupLabelSet() {
        if (modelName.equals(MODEL_MASK_RCNN)) {
            CocoLabel.setupCocoLabelSet(MASK_RCNN_LABEL_FILE);
        } else if (modelName.equals(MODEL_YOLO)) {
            CocoLabel.setupCocoLabelSet(YOLO_LABEL_FILE);
        }
    }

    private void setupOption() {
        winName = dnnOption.WinName.replace("${Model}", dnnOption.modelName);
        modelName = dnnOption.modelName;
        mediaType = dnnOption.mediaType;
        deviceType = dnnOption.deviceType;
        mediaFile = dnnOption.mediaFile;
        modelPath = dnnOption.modelPath;
        modelConfiguration = dnnOption.modelConfiguration;
        isSaveOuput = dnnOption.isSaveOutput;
        isMaskRendering = dnnOption.isMaskRendering;
        threshold = dnnOption.threshold;
    }

    protected Net readDNNetByModelName() {
        Net dnnNet = null;
        if (modelName.equals(MODEL_YOLO)) {
            dnnNet = Dnn.readNetFromDarknet(modelConfiguration, modelPath);
        } else if (modelName.equals(MODEL_MASK_RCNN)) {
            dnnNet = Dnn.readNetFromTensorflow(modelPath, modelConfiguration);
        }
        if (dnnNet == null) return null;

        switch (deviceType) {
            case "cpu" -> {
                dnnNet.setPreferableTarget(org.opencv.dnn.Dnn.DNN_TARGET_CPU);
            }
            case "gpu" -> {
                dnnNet.setPreferableBackend(Dnn.DNN_BACKEND_CUDA);
                dnnNet.setPreferableTarget(Dnn.DNN_TARGET_CUDA);
            }
        }
        return dnnNet;
    }

    protected Map<String, Mat> getMapOutputLayer(final Net net) {
        if (modelName.equals(MODEL_YOLO)) {
            return getYoloOutputLayer(net);
        } else if (modelName.equals(MODEL_MASK_RCNN)) {
            return getMaskRCnnOutputLayer(net);
        }
        return null;
    }

    protected Map<String, Mat> getMaskRCnnOutputLayer(final Net net) {
        Map<String, Mat> retMap = new HashMap<>();

        // Auto detection of the output layers.
        List<String> layerNames = net.getLayerNames();
        MatOfInt outLayers = net.getUnconnectedOutLayers();
        int[] outLayersArray = outLayers.toArray();
        List<String> outputLayerNames = new ArrayList<>();
        for (int i : outLayersArray) {
            retMap.put(layerNames.get(i - 1), null);
            outputLayerNames.add(layerNames.get(i - 1)); // OpenCV uses 1-based indexes
        }

        // Generally, both `detection_out_final` and `detection_masks` are used by Mask R-CNN.
        boolean foundDetectionOutFinal = false;
        for (String layer : outputLayerNames) {
            if (layer.contains("detection_out_final")) {
                foundDetectionOutFinal = true;
                break;
            }
        }

        // Add the `detection_out_final` layer if it was not detected automatically.
        if (!foundDetectionOutFinal) {
            outputLayerNames.add("detection_out_final");
            retMap.put("detection_out_final", null);
        }
        return retMap;
    }

    protected Map<String, Mat> getYoloOutputLayer(final Net net) {
        Map<String, Mat> retMap = new HashMap<>();

        // Retrieve all layer names in the Net.
        List<String> layerNames = net.getLayerNames();

        // Get indexes for output layers.
        MatOfInt outLayers = net.getUnconnectedOutLayers();
        int[] outLayersArray = outLayers.toArray();

        // Auto detection of the output layers; add each name to the map.
        for (int i : outLayersArray) {
            retMap.put(layerNames.get(i - 1), null);
        }
        return retMap;
    }

    /**
     * @param orgMat     the original Mat object
     * @param targetSize the target size to resize to
     * @return java.util.Map
     */
    protected Map<String, Object> resizeMatImage(Mat orgMat, Size targetSize) {
        Map<String, Object> retMap = new HashMap<>();
        Rect padding = new Rect();
        int targetWidth = (int) targetSize.width;
        int targetHeight = (int) targetSize.height;
        int originalWidth = orgMat.cols();
        int originalHeight = orgMat.rows();

        // Minimum scale against the original ratio.
        float scaleX = (float) targetWidth / originalWidth;
        float scaleY = (float) targetHeight / originalHeight;
        float minScale = Math.min(scaleX, scaleY);

        int newWidth = (int) (originalWidth * minScale);
        int newHeight = (int) (originalHeight * minScale);

        // Padding value calculation.
        int left = (targetWidth - newWidth) / 2;
        int top = (targetHeight - newHeight) / 2;
        int right = targetWidth - newWidth - left;
        int bottom = targetHeight - newHeight - top;

        // Resizing and padding adjustment.
        Mat resizedMat = new Mat();
        Imgproc.resize(orgMat, resizedMat, new Size(newWidth, newHeight));
        Mat paddedMat = new Mat();
        Core.copyMakeBorder(resizedMat, paddedMat, top, bottom, left, right,
                Core.BORDER_CONSTANT, new Scalar(0, 0, 0));

        // The padding information.
        padding.x = left;
        padding.y = top;
        padding.width = right;
        padding.height = bottom;

        retMap.put("mat", paddedMat);
        retMap.put("padding", padding);
        retMap.put("scale", minScale); // The scale information.
        return retMap;
    }

    public static Mat getMrcnnROI(Mat frame, DetectedBBox mybbox) {
        int frameHeight = frame.rows();
        int frameWidth = frame.cols();
        Rect bbox = new Rect(mybbox.left, mybbox.top,
                mybbox.right - mybbox.left + 1, mybbox.bottom - mybbox.top + 1);

        // Adjust coordinates to ensure the ROI stays within the frame.
        int top = Math.max(0, Math.min(bbox.y, frameHeight - 1));
        int bottom = Math.max(0, Math.min(bbox.y + bbox.height, frameHeight));
        int left = Math.max(0, Math.min(bbox.x, frameWidth - 1));
        int right = Math.max(0, Math.min(bbox.x + bbox.width, frameWidth));

        // Validate the ROI, ensuring the width and height are greater than 0.
        if (top >= bottom || left >= right) {
            System.err.println("Invalid ROI!");
            return new Mat(); // return an empty Mat
        }
        // Return a safe sub-matrix.
        return frame.submat(top, bottom, left, right);
    }

    protected static class DetectedBBox {
        protected final double confidenceValue;
        protected final int classId;
        protected final int left;
        protected final int top;
        protected final int right;
        protected final int bottom;

        DetectedBBox(final double confidence, final int classId,
                     final int left, final int top, final int right, final int bottom) {
            this.confidenceValue = confidence;
            this.classId = classId;
            this.left = left;
            this.right = right;
            this.top = top;
            this.bottom = bottom;
        }
    }
}
```
8.1 Setting Output Layers for Models
The getMapOutputLayer method implements Section 5.2, ‘Configuring Output Layers for Models.’ The method selects the appropriate output layers based on the model, as follows:
```java
protected Map<String, Mat> getMapOutputLayer(final Net net) {
    if (modelName.equals(MODEL_YOLO)) {
        return getYoloOutputLayer(net);
    } else if (modelName.equals(MODEL_MASK_RCNN)) {
        return getMaskRCnnOutputLayer(net);
    }
    return null;
}
```
8.1.1 Setting Output Layers for YOLO
This method collects the unconnected output layers of the model’s CNN network.
The signature of the getYoloOutputLayer method is as follows:
```java
protected Map<String, Mat> getYoloOutputLayer(final Net net)
```
The method searches for an unconnected layer among all layers in the CNN network.
The implementation code is as follows:
```java
protected Map<String, Mat> getYoloOutputLayer(final Net net) {
    Map<String, Mat> retMap = new HashMap<>();

    // Retrieve all layer names in the Net.
    List<String> layerNames = net.getLayerNames();

    // Get indexes for output layers.
    MatOfInt outLayers = net.getUnconnectedOutLayers();
    int[] outLayersArray = outLayers.toArray();

    // Auto detection of the output layers; add each name to the map.
    for (int i : outLayersArray) {
        retMap.put(layerNames.get(i - 1), null); // OpenCV uses 1-based indexes
    }
    return retMap;
}
```
8.1.2 Setting Output Layers for Mask R-CNN
The getMaskRCnnOutputLayer method implements Section 5.2.2, ‘Output Layers of the Mask R-CNN Model.’ Specifically, this method saves the unconnected output layers and adds the detection_out_final layer when it is missing. The key point of this method is ensuring the detection_out_final layer is present, as both the detection_out_final and detection_masks layers are used by the Mask R-CNN model.
The method signature is:
```java
protected Map<String, Mat> getMaskRCnnOutputLayer(final Net net)
```
The method takes a Net object as a parameter and returns a map of layer names to Mat objects. At this point every Mat value is still null; the map is filled with actual results after blobFromImage prepares the input and the forward pass runs the detection.
```java
if (!foundDetectionOutFinal) {
    outputLayerNames.add("detection_out_final");
    retMap.put("detection_out_final", null);
}
```
Here is the source code of the method:
```java
protected Map<String, Mat> getMaskRCnnOutputLayer(final Net net) {
    Map<String, Mat> retMap = new HashMap<>();

    // Auto detection of the output layers.
    List<String> layerNames = net.getLayerNames();
    MatOfInt outLayers = net.getUnconnectedOutLayers();
    int[] outLayersArray = outLayers.toArray();
    List<String> outputLayerNames = new ArrayList<>();
    for (int i : outLayersArray) {
        retMap.put(layerNames.get(i - 1), null);
        outputLayerNames.add(layerNames.get(i - 1)); // OpenCV uses 1-based indexes
    }

    // Generally, both `detection_out_final` and `detection_masks` are used by Mask R-CNN.
    boolean foundDetectionOutFinal = false;
    for (String layer : outputLayerNames) {
        if (layer.contains("detection_out_final")) {
            foundDetectionOutFinal = true;
            break;
        }
    }

    // Add the `detection_out_final` layer if it was not detected automatically.
    if (!foundDetectionOutFinal) {
        outputLayerNames.add("detection_out_final");
        retMap.put("detection_out_final", null);
    }
    return retMap;
}
```
8.2 CNN Net Initialization
The readDNNetByModelName method creates a DNN from the model and configuration files based on the user input.
The method signature is:
```java
protected Net readDNNetByModelName()
```
Implementation code:
```java
protected Net readDNNetByModelName() {
    Net dnnNet = null;
    if (modelName.equals(MODEL_YOLO)) {
        dnnNet = Dnn.readNetFromDarknet(modelConfiguration, modelPath);
    } else if (modelName.equals(MODEL_MASK_RCNN)) {
        dnnNet = Dnn.readNetFromTensorflow(modelPath, modelConfiguration);
    }
    if (dnnNet == null) return null;

    switch (deviceType) {
        case "cpu" -> {
            dnnNet.setPreferableTarget(org.opencv.dnn.Dnn.DNN_TARGET_CPU);
        }
        case "gpu" -> {
            dnnNet.setPreferableBackend(Dnn.DNN_BACKEND_CUDA);
            dnnNet.setPreferableTarget(Dnn.DNN_TARGET_CUDA);
        }
    }
    return dnnNet;
}
```
The readNetFromDarknet method is used for YOLO, and the readNetFromTensorflow method is used for Mask R-CNN when the network is created.
```java
if (modelName.equals(MODEL_YOLO)) {
    dnnNet = Dnn.readNetFromDarknet(modelConfiguration, modelPath);
} else if (modelName.equals(MODEL_MASK_RCNN)) {
    dnnNet = Dnn.readNetFromTensorflow(modelPath, modelConfiguration);
}
```
The rest of the code handles the selection of CPU or GPU for processing.
```java
switch (deviceType) {
    case "cpu" -> {
        dnnNet.setPreferableTarget(org.opencv.dnn.Dnn.DNN_TARGET_CPU);
    }
    case "gpu" -> {
        dnnNet.setPreferableBackend(Dnn.DNN_BACKEND_CUDA);
        dnnNet.setPreferableTarget(Dnn.DNN_TARGET_CUDA);
    }
}
```
9. Setting up Labels and Colors
The CocoLabel class stores the appropriate label data for the selected model and assigns a randomly generated color to each label. It is initialized when the setupLabelSet() method is called in the constructor of the ObjectDetector class, as shown in the following code:
```java
protected ObjectDetector(final DNNOption dnnOption) {
    this.dnnOption = dnnOption;
    setupOption();
    setupLabelSet();
}
```
9.1 Label Initialization
The setupLabelSet method selects the YOLO_LABEL_FILE option for the YOLO model and the MASK_RCNN_LABEL_FILE option for the Mask R-CNN model. Both constants are passed as the parameter of the setupCocoLabelSet method in the CocoLabel class.
```java
private void setupLabelSet() {
    if (modelName.equals(MODEL_MASK_RCNN)) {
        CocoLabel.setupCocoLabelSet(MASK_RCNN_LABEL_FILE);
    } else if (modelName.equals(MODEL_YOLO)) {
        CocoLabel.setupCocoLabelSet(YOLO_LABEL_FILE);
    }
}
```
9.2 Save Label and Color
The setupCocoLabelSet method fills the COCO_LABLES map with the class IDs and labels read from the label data file, and populates a list of random colors used later for drawing.
```java
public static void setupCocoLabelSet(String labelFileName) {
    BufferedReader breader;
    COCO_LABLE_FILE = labelFileName;
    try {
        String line;
        breader = new BufferedReader(new InputStreamReader(
                CocoLabel.class.getResourceAsStream(COCO_LABLE_FILE)));
        String labels[];
        while ((line = breader.readLine()) != null) {
            if (line.isBlank() || line.isEmpty()) continue;
            if (line.startsWith("#")) continue;
            labels = line.split("\\:");
            if (labels.length == 2) {
                COCO_LABLES.put(Integer.valueOf(labels[0]), labels[1].trim());
            }
        }
    } catch (IOException ie) {
        ie.printStackTrace();
    }

    // Generate a random color for each label, used to draw bounding boxes.
    Random random = new Random();
    for (int i = 0; i < COCO_LABLES.size(); i++) {
        COLORS.add(new Scalar(new double[]{
                random.nextInt(255),
                random.nextInt(255),
                random.nextInt(255)
        }));
    }
}
```
9.3 Label File
Each detected object has a class ID, which was used to identify the object during the training process. There are two different label files, one per model: coco.names for YOLO and mscoco_labels.names for Mask R-CNN.
Make sure to use the appropriate file for each model to ensure proper functionality.
```java
while ((line = breader.readLine()) != null) {
    if (line.isBlank() || line.isEmpty()) continue;
    if (line.startsWith("#")) continue;
    labels = line.split("\\:");
    if (labels.length == 2) {
        COCO_LABLES.put(Integer.valueOf(labels[0]), labels[1].trim());
    }
}
```
Below is a sample label file that saves the labels and their corresponding class IDs.
```
0:person
1:bicycle
2:car
3:motorcycle
4:airplane
5:bus
6:train
7:truck
8:boat
...
```
9.4 Colors and Labels for Bounding Box
The following code describes how the labels are matched with colors for the bounding box. Specifically, these colors are randomly generated when the label data is first read from the file.
```java
// Generate a random color for each label, used to draw bounding boxes.
Random random = new Random();
for (int i = 0; i < COCO_LABLES.size(); i++) {
    COLORS.add(new Scalar(new double[]{
            random.nextInt(255),
            random.nextInt(255),
            random.nextInt(255)
    }));
}
```
You can use the method as follows:
```java
List<Scalar> colors = CocoLabel.getColorsByLabel();
```
10. Object Detection with the Mask R-CNN Model
The mediaType can be either ‘Image’ or ‘Video’. The ‘Image’ type calls the detectObjectOnImage method, while the ‘Video’ type calls the detectObjectOnVideo method.
10.1 User Input Values
The user must fill in the following information: Select ‘MRCNN’, choose either ‘Image’ or ‘Video’, and select either ‘GPU’ or ‘CPU’ in the respective combo boxes. Based on the selected file type (‘Image’ or ‘Video’), choose the corresponding file for the ‘InFile’.
Here is the description of each component:
- Model: The file path for frozen_inference_graph.pb. You can select this file by clicking the Find button.
- Config: The file path for mask_rcnn_inception_v2_coco_2018_01_28.pbtxt. You can select this file by clicking the Find button.
- InFile: The media file path, either an image or a video file depending on the selection in the first combo box.
- Save Output: A flag that determines whether to save the result. The output is saved in an ‘out’ folder inside the folder of the image or video file selected by the user.
- Shading Mask: A flag that determines whether to shade masks on the original image. If unchecked, the masks are not shaded on the original image.
You can get some test images from the Reference section.
a. Rendering masks, labels, and scores will be displayed when the checkbox is checked.
b. Labels will be shaded on the image only when the checkbox is checked.
10.2 detectObject Method in Object Detection
This method implements the detectObject method from the IObjectDetector interface.
```java
@Override
public void detectObject() {
    if (!new File(mediaFile).exists()) {
        JOptionPane.showMessageDialog(null,
                String.format("File [%s] Not Found", mediaFile),
                "File Error", JOptionPane.WARNING_MESSAGE);
        return;
    }
    if (!new File(modelPath).exists() || !new File(modelConfiguration).exists()) {
        JOptionPane.showMessageDialog(null,
                String.format("Cannot open Model[%s] \n config[%s]", modelPath, modelConfiguration),
                "Model Error", JOptionPane.WARNING_MESSAGE);
        return;
    }

    Net dnnNet = readDNNetByModelName();
    Map<String, Mat> outputLayers = getMapOutputLayer(dnnNet);

    if (mediaType.equals("Image")) {
        detectObjectOnImage(winName, dnnNet, outputLayers, mediaFile,
                threshold, isSaveOuput, isMaskRendering);
    } else if (mediaType.equals("Video")) {
        System.out.println(outputLayers.size());
        detectObjectOnVideo(winName, dnnNet, outputLayers, mediaFile,
                threshold, isSaveOuput, isMaskRendering);
    }
}
```
In the case of ‘Image’, the method calls detectObjectOnImage, while in the case of ‘Video’, it calls detectObjectOnVideo, based on the mediaType variable selected by the user.
```java
if (mediaType.equals("Image")) {
    detectObjectOnImage(winName, dnnNet, outputLayers, mediaFile,
            threshold, isSaveOuput, isMaskRendering);
} else if (mediaType.equals("Video")) {
    System.out.println(outputLayers.size());
    detectObjectOnVideo(winName, dnnNet, outputLayers, mediaFile,
            threshold, isSaveOuput, isMaskRendering);
}
```
10.3 detectObjectOnImage for Image Detection
The method performs object detection on image inputs.
The method signature is as follows:
```java
public void detectObjectOnImage(
        String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading);
```
Description of parameters:
- winName: The title of the window.
- dnnNet: The CNN network loaded by the readDNNetByModelName method.
- outputLayers: The map of output layers returned by the getMaskRCnnOutputLayer method.
- mediaFile: The path to the media file (image or video).
- threshold: The confidence threshold; detections scoring below it are skipped.
- isSaveOuput: A flag indicating whether to save the results.
- isMaskShading: A flag indicating whether to draw masks on the result.
The full source code is as follows:
```java
public void detectObjectOnImage(
        String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading) {

    HighGui.namedWindow(winName, HighGui.WINDOW_NORMAL);
    Mat frame = Imgcodecs.imread(mediaFile);

    int idx = new File(mediaFile).getName().split("\\.").length - 2;
    String outputFile = new File(mediaFile).getParent() + File.separator
            + "out" + File.separator
            + new File(mediaFile).getName().split("\\.")[idx] + "_mrcnn_out.jpg";

    // Stop the program if the image could not be read.
    if (frame.empty()) {
        System.out.println("can not read!!!");
        System.out.println("Output file is stored as " + outputFile);
        HighGui.waitKey(3000);
        return;
    }

    Map<String, Object> resultMap = resizeMatImage(frame, BLOB_MASK_RCNN_SIZE);
    Mat resizedMat = (Mat) resultMap.get("mat");
    Rect padding = (Rect) resultMap.get("padding");
    float scale = (float) resultMap.get("scale");

    Mat blob = Dnn.blobFromImage(resizedMat, 1.0, BLOB_MASK_RCNN_SIZE,
            Scalar.all(0), true, false);
    dnnNet.setInput(blob);
    blob.release();

    for (Iterator<String> iter = outputLayers.keySet().iterator(); iter.hasNext();) {
        String key = iter.next();
        Mat dtectMat = dnnNet.forward(key);
        outputLayers.put(key, dtectMat);
    }

    // Create an object for label information.
    Map<Integer, String> cocoLabels = CocoLabel.getCocoLabel();
    // Random colors generated earlier to draw bounding boxes.
    List<Scalar> colors = CocoLabel.getColorsByLabel();

    Mat detection = outputLayers.get("detection_out_final");
    Mat detectionMasks = outputLayers.get("detection_masks");

    float[] data = new float[(int) (detection.total())];
    detection.get(0, 0, data);
    int rows = detection.size(2);
    boolean foundObject = false;

    detection = detection.reshape(1, (int) detection.total() / 7);

    // Extract confidence values; convert the Mat data to a 1D array.
    int cols = detection.cols();
    data = new float[(int) detection.total()];
    detection.get(0, 0, data);

    for (int i = 0; i < rows; i++) {
        // Reconstructed from the garbled source: index 2 holds the confidence score.
        float confidence = data[i * cols + 2];
        if (confidence > THRESHOLD) {
            foundObject = true;
            int classId = (int) data[i * cols + 1];

            // Convert normalized values back to bounding-box coordinates.
            int left = (int) ((data[i * cols + 3] * MASK_RCNN_SIZE - padding.x) / scale);
            int top = (int) ((data[i * cols + 4] * MASK_RCNN_SIZE - padding.y) / scale);
            int right = (int) ((data[i * cols + 5] * MASK_RCNN_SIZE - padding.x) / scale);
            int bottom = (int) ((data[i * cols + 6] * MASK_RCNN_SIZE - padding.y) / scale);

            DetectedBBox bbox = new DetectedBBox(confidence, classId, left, top, right, bottom);
            String label = String.format("%s: %.2f", cocoLabels.get(bbox.classId), confidence * 100);
            Scalar classColor = colors.get(bbox.classId);

            if (isMaskShading) {
                // 1. Retrieve the mask data of the specific class from rowMat.
                Mat rowMat = detectionMasks.row(i).col(bbox.classId);
                // 2. Set the size of the mask and create the object mask.
                Mat objectMask = new Mat(rowMat.size(2), rowMat.size(3), CvType.CV_32F);
                rowMat.reshape(1, (int) rowMat.total()).copyTo(objectMask.reshape(1, (int) rowMat.size(2)));
                drawMaskedBBox(frame, bbox, objectMask, label, classColor);
            } else {
                drawBoxSimple(frame, bbox, label, classColor);
            }
            bbox = null;
        }
    }

    // Put efficiency information. getPerfProfile returns the overall time for inference (t)
    // and the timings for each of the layers (in layersTimes).
    MatOfDouble layersTimes = new MatOfDouble();
    double freq = Core.getTickFrequency() / 1000;
    double t = dnnNet.getPerfProfile(layersTimes) / freq;
    String label = String.format("Mask-RCNN, Inference time for a frame : %f ms", t);
    Imgproc.putText(frame, label, new Point(0, 15),
            Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, new Scalar(0, 0, 0));

    if (!foundObject) {
        System.out.println("Object not found");
    }

    HighGui.imshow(winName, frame);

    if (isSaveOuput) {
        // Write the frame with the detection boxes.
        Mat detectedFrame = new Mat();
        frame.convertTo(detectedFrame, CvType.CV_8U);
        Imgcodecs.imwrite(outputFile, detectedFrame);
        detectedFrame.release();
    }
    layersTimes.release();

    // Handle exit.
    int key = HighGui.waitKey(0); // Waits for key input
    if (key == 27) { // Exit when ESC key pressed
        System.out.println("Window closing with ESC key");
    }
    HighGui.destroyAllWindows(); // Destroy all windows.
    System.exit(0); // exit
}
```
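The drawing helpers drawBoxSimple and drawMaskedBBox are referenced here but their sources are not listed in this article. As a rough sketch of what drawBoxSimple could look like inside ObjectDetector, built from Imgproc.rectangle and Imgproc.putText to match step f of Section 5.1 (the body is an assumption, not the project's actual code):

```java
// Hypothetical helper: draw one bounding box with its label text.
protected static void drawBoxSimple(Mat frame, DetectedBBox bbox, String label, Scalar color) {
    // The bounding box itself.
    Imgproc.rectangle(frame, new Point(bbox.left, bbox.top),
            new Point(bbox.right, bbox.bottom), color, 2);
    // The class label and confidence score just above the box.
    int textY = Math.max(bbox.top - 5, 15);
    Imgproc.putText(frame, label, new Point(bbox.left, textY),
            Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, color, 1);
}
```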
10.4 Resizing The Image Source
The optimal input size for the Mask R-CNN model used here is 800; the input image is therefore resized whenever its original size differs.
10.4.1 Resize Mat Object
The resizeMatImage method resizes the source image to fit the Mask R-CNN model’s optimized input size of 800. It preserves the aspect ratio and fills any extra space with padding, so the image meets the required dimensions whether its original size is too large or too small.
```java
/**
 * @param orgMat     the original Mat object
 * @param targetSize the target size to resize to
 * @return java.util.Map
 */
protected Map<String, Object> resizeMatImage(Mat orgMat, Size targetSize) {
    Map<String, Object> retMap = new HashMap<>();
    Rect padding = new Rect();
    int targetWidth = (int) targetSize.width;
    int targetHeight = (int) targetSize.height;
    int originalWidth = orgMat.cols();
    int originalHeight = orgMat.rows();

    // Minimum scale against the original ratio.
    float scaleX = (float) targetWidth / originalWidth;
    float scaleY = (float) targetHeight / originalHeight;
    float minScale = Math.min(scaleX, scaleY);

    int newWidth = (int) (originalWidth * minScale);
    int newHeight = (int) (originalHeight * minScale);

    // Padding value calculation.
    int left = (targetWidth - newWidth) / 2;
    int top = (targetHeight - newHeight) / 2;
    int right = targetWidth - newWidth - left;
    int bottom = targetHeight - newHeight - top;

    // Resizing and padding adjustment.
    Mat resizedMat = new Mat();
    Imgproc.resize(orgMat, resizedMat, new Size(newWidth, newHeight));
    Mat paddedMat = new Mat();
    Core.copyMakeBorder(resizedMat, paddedMat, top, bottom, left, right,
            Core.BORDER_CONSTANT, new Scalar(0, 0, 0));

    // The padding information.
    padding.x = left;
    padding.y = top;
    padding.width = right;
    padding.height = bottom;

    retMap.put("mat", paddedMat);
    retMap.put("padding", padding);
    retMap.put("scale", minScale); // The scale information.
    return retMap;
}
```
The method returns a HashMap object containing the following items:
- Adjustment Scale: The scale factor used to resize the original image to the new size.
- Resized Mat Object: The resized image that has been adjusted to the target size.
- Padding Information: Information about the padding applied to the image to maintain the aspect ratio.
```java
retMap.put("mat", paddedMat);
retMap.put("padding", padding);
retMap.put("scale", minScale);
```
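For example, a hypothetical 1600×1200 input resized to the 800×800 target illustrates the math (the file name below is only a placeholder):

```java
// Hypothetical input: 1600x1200, target: 800x800.
// minScale = min(800/1600, 800/1200) = 0.5 -> resized to 800x600,
// with 100 pixels of padding added to the top and bottom.
Map<String, Object> result = resizeMatImage(Imgcodecs.imread("sample.jpg"), new Size(800, 800));
Mat padded = (Mat) result.get("mat");        // 800x800 padded image
Rect padding = (Rect) result.get("padding"); // x=0, y=100, width=0, height=100
float scale = (float) result.get("scale");   // 0.5f
```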
10.5 Find Blob for Mask R-CNN Model
Now, you can use the blobFromImage
method to create a blob as the input for the Mask R-CNN model using the resized Mat
object, adjusted to an 800-pixel size.
```java
Map<String, Object> resultMap = resizeMatImage(frame, BLOB_MASK_RCNN_SIZE);
Mat resizedMat = (Mat) resultMap.get("mat");
Rect padding = (Rect) resultMap.get("padding");
float scale = (float) resultMap.get("scale");

Mat blob = Dnn.blobFromImage(resizedMat, 1.0, BLOB_MASK_RCNN_SIZE,
        Scalar.all(0), true, false);
dnnNet.setInput(blob);
blob.release();
```
The information about the detected objects is now saved after the forward pass of the OpenCV DNN module.
```java
for (Iterator<String> iter = outputLayers.keySet().iterator(); iter.hasNext();) {
    String key = iter.next();
    Mat detectMat = dnnNet.forward(key);
    outputLayers.put(key, detectMat);
}
```
We have declared the saved information, namely detection_out_final
and detection_masks
, as output layers, as described in section 5.2.2 ‘Output Layers of the Mask R-CNN Model.’
```java
Mat detection = outputLayers.get("detection_out_final");
Mat detectionMasks = outputLayers.get("detection_masks");
```
Here are sample representations of both the detection_out_final
and detection_masks
layers:
a. The detection_out_final layer
```
Mat [ 1*1*100*7*CV_32FC1, isCont=true, isSubmat=true, nativeObj=0x1eff1c35870, dataAddr=0x1ef80776080 ]
```
b. The detection_masks layer
```
Mat [ 100*90*15*15*CV_32FC1, isCont=true, isSubmat=true, nativeObj=0x1eff1c3b450, dataAddr=0x1effeedb080 ]
```
The dimensional information for both layers, as shown above, has already been described in sections 5.2.3 (‘Structure of detection_out_final layer’) and 5.2.4 (‘Structure of detection_masks layer’).
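If you want to verify these shapes at runtime, a minimal sketch using `Mat.dims()` and `Mat.size(int)` prints each dimension:

```java
// detection_out_final: [1, 1, 100, 7] -> up to 100 detections, 7 values each.
Mat detection = outputLayers.get("detection_out_final");
for (int d = 0; d < detection.dims(); d++) {
    System.out.printf("detection dim %d: %d%n", d, detection.size(d));
}
// detection_masks: [100, 90, 15, 15] -> 100 detections x 90 classes x 15x15 masks.
Mat masks = outputLayers.get("detection_masks");
System.out.println("mask resolution: " + masks.size(2) + "x" + masks.size(3));
```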
10.5.1 Obtain Information of detection_out_final
We can obtain bounding boxes and confidence scores from the detection_out_final layer, and this information is used to generate bounding boxes and labels on the original images.
The following outlines the steps and code involved in this process.
a. Reshape the Mat object into rows of 7 elements each.
```java
detection = detection.reshape(1, (int) detection.total() / 7);
```
b. Convert the Mat data to a 1D array to read the confidence values.
```java
int cols = detection.cols();
float[] data = new float[(int) detection.total()];
detection.get(0, 0, data);
```
c. Draw the detected objects.
The following code selects bounding boxes with a default threshold value of 0.5 and draws the boxes on the image. You can adjust the threshold value if necessary.
```java
for (int i = 0; i < detection.rows(); i++) {
    float confidence = data[i * cols + 2];
    if (confidence > (threshold <= 0 ? DEFAULT_THRESHOLD : threshold)) {
        foundObject = true;
        int classId = (int) data[i * cols + 1];

        // Bounding box conversion: undo the 800-pixel scaling and padding.
        int left   = (int) ((data[i * cols + 3] * MASK_RCNN_SIZE - padding.x) / scale);
        int top    = (int) ((data[i * cols + 4] * MASK_RCNN_SIZE - padding.y) / scale);
        int right  = (int) ((data[i * cols + 5] * MASK_RCNN_SIZE - padding.x) / scale);
        int bottom = (int) ((data[i * cols + 6] * MASK_RCNN_SIZE - padding.y) / scale);
        DetectedBBox bbox = new DetectedBBox(confidence, classId, left, top, right, bottom);

        if (isMaskShading) {
            // 1. Retrieve the mask data of the specific class from rowMat.
            Mat rowMat = detectionMasks.row(i).col(bbox.classId);
            // 2. Set the size of the mask and create the object mask.
            Mat objectMask = new Mat(rowMat.size(2), rowMat.size(3), CvType.CV_32F);
            rowMat.reshape(1, (int) rowMat.total())
                  .copyTo(objectMask.reshape(1, (int) rowMat.size(2)));
            drawMaskedBBox(frame, bbox, objectMask, confidence);
        } else {
            // Build the label and color that drawBoxSimple expects (see its signature below).
            String label = String.format("%s: %.2f",
                    CocoLabel.getCocoLabel().get(classId), confidence * 100);
            Scalar classColor = CocoLabel.getColorsByLabel().get(classId);
            drawBoxSimple(frame, bbox, label, classColor);
        }
    }
}
detection.release();
```
10.5.2 Drawing Bounding Box
In this final step, we draw bounding boxes using either the drawMaskedBBox
or drawBoxSimple
method. The choice of method is determined by the boolean value isMaskShading
, which is set based on the user’s input. Specifically, if isMaskShading
is true, the drawMaskedBBox
method is used; otherwise, drawBoxSimple
is applied.
a. drawMaskedBBox method
This method draws bounding boxes, masks, and labels on the detected objects.
```java
// Draw the predicted bounding box, colorize and show the mask on the image.
private boolean drawMaskedBBox(Mat frame, DetectedBBox bbox, Mat objectMask, float confidence) {
    boolean foundObject = true;

    // Create an object for label information.
    Map<Integer, String> cocoLabels = CocoLabel.getCocoLabel();
    // Generate random colors in order to draw bounding boxes.
    List<Scalar> colors = CocoLabel.getColorsByLabel();

    String label = String.format("%s: %.2f", cocoLabels.get(bbox.classId), confidence * 100);
    Scalar classColor = colors.get(bbox.classId);
    Rect box = new Rect(bbox.left, bbox.top,
            bbox.right - bbox.left + 1, bbox.bottom - bbox.top + 1);

    // Display the label at the top of the bounding box.
    int[] baseLine = {0};
    Size labelSize = Imgproc.getTextSize(label, Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, 1, baseLine);
    box.y = Math.max(box.y, (int) labelSize.height);

    // Draw the background for the text.
    Imgproc.rectangle(frame,
            new Point(box.x, box.y - Math.round(1.5 * labelSize.height)),
            new Point(box.x + Math.round(1.5 * labelSize.width), box.y + baseLine[0]),
            new Scalar(255, 255, 255), Imgproc.FILLED);
    Imgproc.putText(frame, label, new Point(bbox.left, bbox.top - 5),
            Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, classColor, 2);

    // Check if the frame is empty.
    if (frame.empty()) {
        System.out.println("frame is empty! failed to load an image.");
        return false;
    }
    // Check the frame validation.
    int width = frame.cols();
    int height = frame.rows();
    if (width <= 0 || height <= 0) {
        System.out.println("invalid frame size.");
        return false;
    }

    // 3. Set the threshold for adjusting binarization of the mask.
    Imgproc.threshold(objectMask, objectMask, 0.5, 1.0, Imgproc.THRESH_BINARY);
    // 4. Prepare to fill the color by converting `objectMask` to `CV_8UC1`.
    objectMask.convertTo(objectMask, CvType.CV_8UC1);
    // 5. Shading: convert the gray-scale `objectMask` into a 3-channel color mask.
    Mat colorMask = new Mat();
    Imgproc.cvtColor(objectMask, colorMask, Imgproc.COLOR_GRAY2BGR);
    // 6. Apply the class color by multiplying it into the mask.
    Core.multiply(colorMask, classColor, colorMask);
    // 7. Smooth the edges.
    Mat smoothMask = new Mat();
    Imgproc.GaussianBlur(colorMask, smoothMask, new Size(objectMask.rows(), objectMask.cols()), 0);
    // 8. Apply the ROI.
    Mat roi = getMrcnnROI(frame, bbox);
    // 9. Check that the color mask and roi are the same size; if not, resize the mask.
    if (roi.size().width != smoothMask.size().width
            || roi.size().height != smoothMask.size().height) {
        Imgproc.resize(smoothMask, smoothMask, roi.size());
    }
    // 10. Convert the data type of `smoothMask` to `CV_8U`, matching `roi`.
    smoothMask.convertTo(smoothMask, CvType.CV_8U);
    // 11. Draw the bounding box on the original image.
    Imgproc.rectangle(frame, new Point(bbox.left, bbox.top),
            new Point(bbox.right, bbox.bottom), classColor, 2);
    // 12. Paint the mask onto the original image (alpha blending).
    Core.addWeighted(roi, 0.8, smoothMask, 0.5, 0, roi);

    return foundObject;
}
```
The output is shown in the following figure:

b. Simple Bounding Box Drawing
This method draws bounding boxes and labels around the objects without applying masks.
```java
private static void drawBoxSimple(Mat frame, DetectedBBox bbox, String label, Scalar selectedColor) {
    Rect box = new Rect(bbox.left, bbox.top,
            bbox.right - bbox.left + 1, bbox.bottom - bbox.top + 1);

    Point x_y = new Point(box.x, box.y);
    Point w_h = new Point(box.x + box.width, box.y + box.height);
    Imgproc.rectangle(frame, w_h, x_y, selectedColor, 1);

    int[] baseLine = {0};
    Size labelSize = Imgproc.getTextSize(label, Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, 1, baseLine);
    box.y = Integer.max(box.y, (int) labelSize.height);

    // Draw the background for the text.
    Imgproc.rectangle(frame,
            new Point(box.x, box.y - Math.round(1.5 * labelSize.height)),
            new Point(box.x + Math.round(1.5 * labelSize.width), box.y + baseLine[0]),
            new Scalar(255, 255, 255), Imgproc.FILLED);
    Imgproc.putText(frame, label, new Point(box.x, box.y),
            Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, selectedColor, 1);
}
```
The output is slightly different from the previous one:

10.6 Object Detection in Video
The detectObjectOnVideo
method detects objects in a video using the same detection technique. The key difference is that the video is processed frame by frame: each frame is converted to a Mat
object, and then objects are detected individually in each frame.
```java
public void detectObjectOnVideo(String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading) {

    HighGui.namedWindow(winName, HighGui.WINDOW_NORMAL);
    Mat frame = new Mat();
    VideoCapture capture = new VideoCapture(mediaFile);
    if (capture.open(mediaFile)) {
        capture.read(frame);
        HighGui.imshow(winName, frame);
    } else {
        System.out.println("Capture is not opened!");
        return;
    }

    VideoWriter vwriter = new VideoWriter();
    Mat detectedFrame = new Mat();
    String outputFile = mediaFile.substring(0, mediaFile.length() - 4) + "_yolov4_out.avi";
    Mat detection;
    Mat detectionMasks;

    while (capture.isOpened()) {
        if (HighGui.waitKey(100) == 27) { // ESC key to exit
            System.out.println("[ESC]Done processing !!!");
            HighGui.waitKey(1000);
            break;
        }
        // Get a frame from the video.
        capture.read(frame);
        // Stop the program if we reached the end of the video.
        if (frame.empty()) {
            System.out.println("Done processing !!!");
            System.out.println("Output file is stored as " + outputFile);
            HighGui.waitKey(3000);
            if (vwriter.isOpened()) {
                vwriter.release();
            }
            break;
        }

        Map<String, Object> resultMap = resizeMatImage(frame, BLOB_MASK_RCNN_SIZE);
        Mat resizedMat = (Mat) resultMap.get("mat");
        Rect padding = (Rect) resultMap.get("padding");
        float scale = (float) resultMap.get("scale");

        // Create a 4D blob from the frame and set the input of the network.
        Mat blob = Dnn.blobFromImage(resizedMat, 1.0, BLOB_MASK_RCNN_SIZE,
                Scalar.all(0), true, false);
        dnnNet.setInput(blob);

        // Run the forward pass once for each declared output layer.
        for (Iterator<String> iter = outputLayers.keySet().iterator(); iter.hasNext();) {
            String key = iter.next();
            Mat detectMat = dnnNet.forward(key);
            outputLayers.put(key, detectMat);
        }
        resizedMat.release();
        blob.release();

        detection = outputLayers.get("detection_out_final");
        detectionMasks = outputLayers.get("detection_masks");

        int maxObjCnt = detection.size(2);
        int attrSize = detection.size(3);
        detection = detection.reshape(1, (int) detection.total() / attrSize);

        // Extract confidence values: Mat data to a 1D array.
        int cols = detection.cols();
        float[] data = new float[(int) detection.total()];
        detection.get(0, 0, data);

        for (int i = 0; i < maxObjCnt; i++) {
            float confidence = data[i * cols + 2];
            if (confidence > (threshold <= 0 ? DEFAULT_THRESHOLD : threshold)) {
                int classId = (int) data[i * cols + 1];
                // Bounding box conversion: undo the scaling and padding.
                int left   = (int) ((data[i * cols + 3] * MASK_RCNN_SIZE - padding.x) / scale);
                int top    = (int) ((data[i * cols + 4] * MASK_RCNN_SIZE - padding.y) / scale);
                int right  = (int) ((data[i * cols + 5] * MASK_RCNN_SIZE - padding.x) / scale);
                int bottom = (int) ((data[i * cols + 6] * MASK_RCNN_SIZE - padding.y) / scale);
                DetectedBBox bbox = new DetectedBBox(confidence, classId, left, top, right, bottom);

                if (isMaskShading) {
                    // 1. Retrieve the mask data of the specific class.
                    Mat rowMat = detectionMasks.row(i).col(bbox.classId);
                    // 2. Set the size of the mask and create the object mask.
                    Mat objectMask = new Mat(rowMat.size(2), rowMat.size(3), CvType.CV_32F);
                    rowMat.reshape(1, (int) rowMat.total())
                          .copyTo(objectMask.reshape(1, (int) rowMat.size(2)));
                    drawMaskedBBox(frame, bbox, objectMask, confidence);
                    objectMask.release();
                    rowMat.release();
                } else {
                    // Build the label and color that drawBoxSimple expects.
                    String label = String.format("%s: %.2f",
                            CocoLabel.getCocoLabel().get(classId), confidence * 100);
                    Scalar classColor = CocoLabel.getColorsByLabel().get(classId);
                    drawBoxSimple(frame, bbox, label, classColor);
                }
            }
        }
        detection.release();

        // Put efficiency information. The function getPerfProfile returns the overall time
        // for inference (t) and the timings for each of the layers (in layersTimes).
        MatOfDouble layersTimes = new MatOfDouble();
        double freq = Core.getTickFrequency() / 1000;
        double t = dnnNet.getPerfProfile(layersTimes) / freq;
        String label = String.format("%s, Inference time for a frame : %f ms", modelName, t);
        Imgproc.putText(frame, label, new Point(0, 15),
                Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, new Scalar(0, 0, 0), 2);

        HighGui.imshow(winName, frame);

        if (isSaveOuput) {
            double fps = capture.get(Videoio.CAP_PROP_FPS);
            if (fps == 0D) {
                continue;
            }
            Size size = new Size(capture.get(Videoio.CAP_PROP_FRAME_WIDTH),
                    capture.get(Videoio.CAP_PROP_FRAME_HEIGHT));
            int fourcc_mjpg = VideoWriter.fourcc('M', 'J', 'P', 'G'); // Motion JPEG format
            // Open the writer only once; reopening it every frame would truncate the file.
            if (!vwriter.isOpened()) {
                isSaveOuput = vwriter.open(outputFile, fourcc_mjpg, fps, size, true);
                if (!isSaveOuput) {
                    JOptionPane.showMessageDialog(null,
                            String.format("Can not open the video file[sz:%s, fps:%f]",
                                    size.toString(), fps),
                            "error", JOptionPane.INFORMATION_MESSAGE);
                    break;
                }
            }
            frame.convertTo(detectedFrame, CvType.CV_8U);
            if (vwriter.isOpened()) {
                vwriter.write(detectedFrame);
            } else {
                System.out.println("========vwriter is not opened=======");
            }
        }
        layersTimes.release();
    }

    // Explicit termination of windows.
    HighGui.destroyAllWindows();
    System.exit(0);
}
```
Next, you can refer to the following skeleton code to extract frames from a video file, one by one.
```java
public void detectObjectOnVideo(String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading) {
    ...
    HighGui.namedWindow(winName, HighGui.WINDOW_NORMAL);
    Mat frame = new Mat();
    VideoCapture capture = new VideoCapture(mediaFile);
    if (capture.open(mediaFile)) {
        capture.read(frame);
        HighGui.imshow(winName, frame);
    } else {
        System.out.println("Capture is not opened!");
        return;
    }
    VideoWriter vwriter = new VideoWriter();
    ...
    while (capture.isOpened()) {
        if (HighGui.waitKey(100) == 27) { // ESC key to exit
            ...
            HighGui.waitKey(1000);
            break;
        }
        // Get a frame from the video.
        capture.read(frame);
        // Stop the program if we reached the end of the video.
        if (frame.empty()) {
            HighGui.waitKey(3000);
            if (vwriter.isOpened()) {
                vwriter.release();
            }
            break;
        }
        ...
        HighGui.imshow(winName, frame);
        ...
    }
    HighGui.destroyAllWindows();
    System.exit(0);
}
```
The result is as follows:
a. Video rendering with OpenCV, displaying labels, bounding boxes, and masks.

b. Video rendering with OpenCV, displaying labels and bounding boxes without masks.

10.6.1 Threshold Adjustment
In the figure below, a bird is falsely detected because its confidence score is approximately 0.65. Therefore, to improve the result, we can raise the threshold from 0.5 to 0.655, as shown in the following figure.
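Assuming the detectObjectOnVideo entry point shown earlier, the adjustment is just a different threshold argument:

```java
// 0.655 filters out the falsely detected bird, whose confidence is about 0.65.
detectObjectOnVideo(winName, dnnNet, outputLayers, mediaFile, 0.655, isSaveOuput, isMaskShading);
```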
Now, we have a clean result that no longer contains the falsely detected object, the bird.

10.6.2 Check Out of Range ROI
We also need to check whether the ROI (Region of Interest) is out of range when drawing masked bounding boxes. If the ROI is out of range and not checked, you may encounter an error like this:
```
Exception in thread "main" CvException [org.opencv.core.CvException: cv::Exception: OpenCV(4.10.0) C:\GHA-OCV-1\_work\ci-gha-workflow\ci-gha-workflow\opencv\modules\core\src\matrix.cpp:767: error: (-215:Assertion failed) 0 <= _rowRange.start && _rowRange.start <= _rowRange.end && _rowRange.end <= m.rows in function 'cv::Mat::Mat'
```
The following getMrcnnROI
method can eliminate this error:
```java
public static Mat getMrcnnROI(Mat frame, DetectedBBox mybbox) {
    int frameHeight = frame.rows();
    int frameWidth = frame.cols();
    Rect bbox = new Rect(mybbox.left, mybbox.top,
            mybbox.right - mybbox.left + 1, mybbox.bottom - mybbox.top + 1);

    // Adjust coordinates to ensure the ROI stays within the frame.
    int top    = Math.max(0, Math.min(bbox.y, frameHeight - 1));
    int bottom = Math.max(0, Math.min(bbox.y + bbox.height, frameHeight));
    int left   = Math.max(0, Math.min(bbox.x, frameWidth - 1));
    int right  = Math.max(0, Math.min(bbox.x + bbox.width, frameWidth));

    // Validate the ROI, ensuring the width and height are greater than 0.
    if (top >= bottom || left >= right) {
        System.err.println("Invalid ROI!");
        return new Mat(); // Return an empty Mat.
    }
    // Return a safe submatrix.
    return frame.submat(top, bottom, left, right);
}
```
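Since the method returns an empty Mat for an invalid ROI, the caller (step 8 of drawMaskedBBox) should guard before blending; a minimal sketch:

```java
// 8. Apply the ROI, skipping the mask blending when it falls outside the frame.
Mat roi = getMrcnnROI(frame, bbox);
if (roi.empty()) {
    return false; // nothing to blend for this detection
}
```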
11. Object Detection by YOLO Model
In this section, we explore the object detection technique using the YOLO model in our OpenCV Java Object Detection.
Similar to the Mask R-CNN approach, detection can be performed using either the detectObjectOnImage
or detectObjectOnVideo
method, depending on the mediaType
value provided by the user, which must be either ‘Image’ or ‘Video’.
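As a minimal dispatch sketch (the mediaType variable is an assumption standing in for the combo-box value; the signatures follow sections 11.3 and 10.6):

```java
// Route the request to the image or video pipeline based on the user's selection.
if ("Image".equals(mediaType)) {
    detectObjectOnImage(winName, dnnNet, outputLayers, mediaFile, threshold, isSaveOuput, isMaskShading);
} else if ("Video".equals(mediaType)) {
    detectObjectOnVideo(winName, dnnNet, outputLayers, mediaFile, threshold, isSaveOuput, isMaskShading);
}
```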
11.1 User Input Values
The user fills in the following information: select ‘YOLO’ in the model combo box, choose ‘Image’ or ‘Video’ as the media type, and pick either ‘GPU’ or ‘CPU’ as the backend. Then, depending on the chosen media type, select the appropriate file for the ‘InFile’ label.
The Description of Components:
- Model: The file path for `yolov4.weights`. You can select this file by clicking the Find button.
- Config: The file path for `yolov4.cfg`. You can select this file by clicking the Find button.
- InFile: The path to the media file. The type (image or video) depends on the selection made in the first combo box.
- Save Output: A flag that determines whether to save the result. The output will be saved in the `out` folder within the directory of the selected image or video file.
- Shading Mask: Disabled; applicable only to the Mask R-CNN model.
11.2 Explanation of Detection Results
We can extract a blob using the blobFromImage
method, which serves as an argument for the setInput
method of the CNN network.
```java
Mat blob = Dnn.blobFromImage(frame, 1 / 255.0, BLOB_YOLO_V4_SIZE,
        new Scalar(new double[] {0.0, 0.0, 0.0}), true, false);
dnnNet.setInput(blob);
blob.release();
```
Then, the forward
method processes the blob and returns a Mat
containing information about the detected objects based on the outputLayers
names.
```java
for (Iterator<String> iter = outputLayers.keySet().iterator(); iter.hasNext();) {
    String key = iter.next();
    Mat detectMat = dnnNet.forward(key);
    outputLayers.put(key, detectMat);
}
```
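The article does not show how outputLayers is first populated for YOLO; one plausible sketch (an assumption, not necessarily the author's helper) seeds it from the network's unconnected output layer names:

```java
// Seed the map with YOLO's output layer names (yolo_139, yolo_150, yolo_161);
// the forward loop above then replaces each placeholder with the layer's result.
Map<String, Mat> outputLayers = new LinkedHashMap<>();
for (String name : dnnNet.getUnconnectedOutLayersNames()) {
    outputLayers.put(name, new Mat());
}
```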
The following table describes the three output layers produced by the forward pass above, each containing detection information.
| Output Layer | Grid Count (Predicted BBox) | Role |
|---|---|---|
| yolo_139 | 17328 × 85 | Small object detection (finest resolution). |
| yolo_150 | 4332 × 85 | Middle-size object detection. |
| yolo_161 | 1083 × 85 | Large object detection (lowest resolution). |
Each row of the Mat
represents a grid cell, and each column contains 85 items, including the bounding box, objectness score, and class scores. Additionally, these outputs are analyzed and saved at three different resolutions: yolo_150
, yolo_161
, and yolo_139
.
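As a concrete illustration of this 85-column layout, here is a minimal sketch decoding a single row r of one output Mat (index conventions follow the detection loop in section 11.3):

```java
// One YOLO row: [cx, cy, w, h, objectness, class score 0 ... class score 79]
float[] row = new float[85];
output.get(r, 0, row);
float cx = row[0], cy = row[1]; // box center, normalized to [0, 1]
float w  = row[2], h  = row[3]; // box size, normalized to [0, 1]
float objectness = row[4];      // probability that the cell contains an object
int bestClass = 0;              // argmax over the 80 COCO class scores
for (int j = 6; j < 85; j++) {
    if (row[j] > row[bestClass + 5]) bestClass = j - 5;
}
```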
Now, we are ready to draw the detections on the original image.
11.3 Image Detection Method
If the user selects ‘Image’ from the combo box, the detectObjectOnImage
method is activated. This method performs object detection on the image using the OpenCV DNN module.
The signature of the method:
```java
public void detectObjectOnImage(String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading);
```
The following describes the method parameters:
- winName: The title of the activated window.
- dnnNet: The CNN model returned from the `readDNNetByModelName` method.
- outputLayers: The output layers from the `getMaskRCnnOutputLayer` method.
- mediaFile: The file path of the image.
- threshold: The confidence threshold; when it is 0 or lower, the default value (0.5) is used.
- isSaveOutput: A flag indicating whether to save the image.
- isMaskShading: Not available (Mask R-CNN only).
The full source code of the method:
```java
public void detectObjectOnImage(final String winName, final Net dnnNet,
        Map<String, Mat> outputLayers, final String mediaFile,
        double threshold, boolean isSaveOuput) {

    Mat frame = Imgcodecs.imread(mediaFile);
    int idx = new File(mediaFile).getName().split("\\.").length - 2;
    String outputFile = new File(mediaFile).getParent() + File.separator + "out"
            + File.separator + new File(mediaFile).getName().split("\\.")[idx] + "_yolov_out.jpg";

    Mat blob = Dnn.blobFromImage(frame, 1 / 255.0, BLOB_YOLO_V4_SIZE,
            new Scalar(new double[] {0.0, 0.0, 0.0}), true, false);
    dnnNet.setInput(blob);
    blob.release();

    for (Iterator<String> iter = outputLayers.keySet().iterator(); iter.hasNext();) {
        String key = iter.next();
        Mat detectMat = dnnNet.forward(key);
        outputLayers.put(key, detectMat);
    }

    // Lists to save the detected objects.
    List<Rect2d> boxes = new ArrayList<>();
    List<Integer> classIDs = new ArrayList<>();
    List<Float> confidences = new ArrayList<>();

    int width = (int) frame.size().width;
    int height = (int) frame.size().height;

    for (Iterator<String> outputMatIter = outputLayers.keySet().iterator(); outputMatIter.hasNext();) {
        String layerName = outputMatIter.next();
        Mat output = outputLayers.get(layerName);

        // Rows: grid cell counts.
        int rows = output.rows();
        // Columns: 85 (BBox coordinates + objectness + class scores).
        int cols = output.cols();

        for (int r = 0; r < rows; r++) {
            // Read the BBox information.
            float[] detection = new float[cols];
            output.get(r, 0, detection);
            float x = detection[0]; // Center x coordinate
            float y = detection[1]; // Center y coordinate
            float w = detection[2]; // Width
            float h = detection[3]; // Height
            float confidence = detection[4]; // Objectness score

            // Find the class with the highest score.
            float maxClassScore = -1;
            int classID = -1;
            for (int j = 5; j < cols; j++) {
                if (detection[j] > maxClassScore) {
                    maxClassScore = detection[j];
                    classID = j - 5;
                }
            }

            // Only save objects whose confidence is higher than THRESHOLD.
            if (confidence > THRESHOLD) {
                int xLeft = (int) ((x - w / 2) * width);
                int yTop = (int) ((y - h / 2) * height);
                int boxWidth = (int) (w * width);
                int boxHeight = (int) (h * height);
                boxes.add(new Rect2d(xLeft, yTop, boxWidth, boxHeight));
                classIDs.add(classID);
                confidences.add(confidence);
            }
        }
    }

    MatOfInt indices = applyNonMaximumSuppression(boxes, confidences);
    drawBBox(frame, boxes, classIDs, indices);
    HighGui.imshow(winName, frame);

    MatOfDouble layersTimes = new MatOfDouble();
    double freq = Core.getTickFrequency() / 1000;
    double t = dnnNet.getPerfProfile(layersTimes) / freq;
    String label = String.format("%s, Inference time for a frame : %f ms", modelName, t);
    Imgproc.putText(frame, label, new Point(0, 15),
            Imgproc.FONT_HERSHEY_SIMPLEX, 0.5, new Scalar(0, 0, 0), 2);

    if (isSaveOuput) {
        // Write the frame with the detection boxes.
        Mat detectedFrame = new Mat();
        frame.convertTo(detectedFrame, CvType.CV_8U);
        System.out.println("--> " + outputFile);
        Imgcodecs.imwrite(outputFile, detectedFrame);
        detectedFrame.release();
    }

    // Process when the window is closing.
    int key = HighGui.waitKey(0); // Wait for a key
    if (key == 27) { // Close when the ESC key is pressed
        System.out.println("Window closed");
    }
    // Explicit termination of windows.
    HighGui.destroyAllWindows();
    System.exit(0);
}
```
11.4 Handling Duplicate Bounding Boxes
The YOLO model generates three output layers (yolo_150
, yolo_161
, yolo_139
), leading to multiple bounding boxes, typically three, around the same object. As a result, detections overlap.
To resolve this, we apply the Non-Maximum Suppression (NMS) technique, which eliminates redundant boxes and keeps the most accurate one.
a. Apply the Non-Maximum Suppression(NMS)
We define the applyNonMaximumSuppression
method to perform NMS (Non-Maximum Suppression) on the bounding boxes for the final result.
```java
MatOfInt indices = applyNonMaximumSuppression(boxes, confidences);
```
Here’s the method signature:
```java
private MatOfInt applyNonMaximumSuppression(List<Rect2d> boxes, List<Float> confidences);
```
Description of arguments:
- `List<Rect2d> boxes`: The list of detected bounding boxes.
- `List<Float> confidences`: The confidence scores associated with each bounding box.
Full source code of the method:
```java
private MatOfInt applyNonMaximumSuppression(List<Rect2d> boxes, List<Float> confidences) {
    MatOfRect2d mOfRect = new MatOfRect2d();
    mOfRect.fromList(boxes);
    MatOfFloat mfConfs = new MatOfFloat(Converters.vector_float_to_Mat(confidences));
    MatOfInt indices = new MatOfInt();

    float score_threshold = THRESHOLD; // e.g. 0.6f
    float nms_threshold = THRESHOLD;   // e.g. 0.5f
    Dnn.NMSBoxes(mOfRect, mfConfs, score_threshold, nms_threshold, indices);
    return indices;
}
```
Then, call the Dnn.NMSBoxes
method with the detected bounding boxes and the threshold from the code above.
```java
Dnn.NMSBoxes(mOfRect, mfConfs, score_threshold, nms_threshold, indices);
```
11.5 Drawing Bounding Boxes
It is time to draw the bounding boxes from the YOLO processing. The drawBBox
method is responsible for drawing bounding boxes and labels.
The method signature is:
```java
private void drawBBox(Mat frame, List<Rect2d> boxes, List<Integer> classIDs, MatOfInt indices)
```
Arguments Description:
- `Mat frame`: The original image.
- `List<Rect2d> boxes`: A list of bounding boxes.
- `List<Integer> classIDs`: Detected class IDs.
- `MatOfInt indices`: The indices of the detected objects after applying Non-Maximum Suppression (NMS).
Full source code:
```java
private void drawBBox(Mat frame, List<Rect2d> boxes, List<Integer> classIDs, MatOfInt indices) {
    // Create an object for label information.
    Map<Integer, String> cocoLabels = CocoLabel.getCocoLabel();
    // Generate random colors in order to draw bounding boxes.
    List<Scalar> colors = CocoLabel.getColorsByLabel();

    List<Integer> indices_list = indices.toList();
    for (int i = 0; i < boxes.size(); i++) {
        if (indices_list.contains(i)) {
            Rect2d box = boxes.get(i);
            Point x_y = new Point(box.x, box.y);
            Point w_h = new Point(box.x + box.width, box.y + box.height);
            Point text_point = new Point(box.x, box.y - 5);

            Imgproc.rectangle(frame, w_h, x_y, colors.get(classIDs.get(i)), 1);
            String label = cocoLabels.get(classIDs.get(i));
            Imgproc.putText(frame, label, text_point,
                    Imgproc.FONT_HERSHEY_SIMPLEX, 0.7, colors.get(classIDs.get(i)), 2);
        }
    }
}
```
First, prepare the labels and arbitrary colors using getCocoLabel()
and getColorsByLabel()
from the CocoLabel
class.
```java
// Create an object for label information.
Map<Integer, String> cocoLabels = CocoLabel.getCocoLabel();
// Generate random colors in order to draw bounding boxes.
List<Scalar> colors = CocoLabel.getColorsByLabel();
```
Then, draw the bounding boxes by applying Non-Maximum Suppression (NMS) using the indices.
```java
List<Integer> indices_list = indices.toList();
for (int i = 0; i < boxes.size(); i++) {
    if (indices_list.contains(i)) {
        Rect2d box = boxes.get(i);
        Point x_y = new Point(box.x, box.y);
        Point w_h = new Point(box.x + box.width, box.y + box.height);
        Point text_point = new Point(box.x, box.y - 5);

        Imgproc.rectangle(frame, w_h, x_y, colors.get(classIDs.get(i)), 1);
        String label = cocoLabels.get(classIDs.get(i));
        Imgproc.putText(frame, label, text_point,
                Imgproc.FONT_HERSHEY_SIMPLEX, 0.7, colors.get(classIDs.get(i)), 2);
    }
}
```
The output is as follows:
11.6 Video Detection Method
This method detects objects in video frames when the user selects the ‘Video’ type. It uses the same technique as the detectObjectOnImage
method, with the main difference being the use of the VideoCapture
class from OpenCV.
The full source code of the detectObjectOnVideo method follows the same per-frame structure as the Mask R-CNN video method in section 10.6, applying the YOLO blob extraction, NMS, and drawing steps from detectObjectOnImage to each frame.
You can see the skeleton code for extracting frames from a video file using the OpenCV library.
```java
public void detectObjectOnVideo(String winName, Net dnnNet, Map<String, Mat> outputLayers,
        String mediaFile, double threshold, boolean isSaveOuput, boolean isMaskShading) {
    ...
    HighGui.namedWindow(winName, HighGui.WINDOW_NORMAL);
    Mat frame = new Mat();
    VideoCapture capture = new VideoCapture(mediaFile);
    if (capture.open(mediaFile)) {
        capture.read(frame);
        HighGui.imshow(winName, frame);
    } else {
        System.out.println("Capture is not opened!");
        return;
    }
    VideoWriter vwriter = new VideoWriter();
    ...
    while (capture.isOpened()) {
        if (HighGui.waitKey(100) == 27) { // ESC key to exit
            ...
            HighGui.waitKey(1000);
            break;
        }
        // Get a frame from the video.
        capture.read(frame);
        // Stop the program if we reached the end of the video.
        if (frame.empty()) {
            HighGui.waitKey(3000);
            if (vwriter.isOpened()) {
                vwriter.release();
            }
            break;
        }
        ...
        HighGui.imshow(winName, frame);
        ...
    }
    HighGui.destroyAllWindows();
    System.exit(0);
}
```
The result of the YOLO detection is as follows:


12. Wrapping Up
This post marks my first step into the world of object detection from a developer’s perspective. While the approach I introduced follows OpenCV’s conventions, its widespread adoption speaks to the strength and flexibility of the OpenCV library. I used the DNN module of OpenCV for this code, which is limited to models supported by the OpenCV inference engine.
I hope this introduction can serve as a helpful starting point for anyone beginning their journey into computer vision, as it has for me.
13. Download the Source Code
You can download the full source code of this example here: OpenCV Java Object Detection
References
- OpenCV with Java: YOLO object detection on images
- Deep learning based Object Detection and Instance Segmentation using Mask RCNN in OpenCV (Python / C++)
- YOLOv3 – Deep Learning Based Object Detection – YOLOv3 with OpenCV ( Python / C++ )
- Darknet: Open Source Neural Networks in C
- OpenCV with Java: Recognize and track objects in video