Parse HTML Table With Jsoup

Yatin Batra, May 21st, 2024 (Last Updated: May 21st, 2024)

Jsoup is an open-source Java library for working with HTML. It offers an API for parsing documents, extracting data, and manipulating content through DOM-style methods and CSS selectors. Let us delve into understanding how to parse an HTML table with Jsoup in Java.

1. Understanding Jsoup: Parsing HTML Tables

Jsoup parses HTML into a DOM tree and gives developers an API to navigate, extract, and modify it. Because table markup (table, tr, th, td) maps directly onto that tree, extracting tabular data from a web page usually takes only a few selector calls.
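As a minimal sketch of that idea, the snippet below turns the first table in an HTML string into a list of rows, each row being a list of cell texts. The class name TableToListSketch and the helper extractRows are hypothetical names chosen for illustration, and the sample markup is an assumption rather than a real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TableToListSketch {

    // Hypothetical helper: converts every <tr> of the first <table> into a list of cell texts
    static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        Element table = doc.select("table").first();
        if (table == null) {
            return rows; // no table in the document
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // "th, td" picks up header cells as well as data cells
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration
        String html = "<table>"
                + "<tr><th>Name</th><th>Role</th></tr>"
                + "<tr><td>Alice</td><td>Developer</td></tr>"
                + "</table>";
        System.out.println(extractRows(html)); // prints [[Name, Role], [Alice, Developer]]
    }
}
```

Returning plain lists keeps the sketch independent of any particular data model; a real project would likely map rows onto its own types.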

1.1 Pros

Ease of Use: Jsoup has a small, intuitive API, which makes HTML parsing accessible even to novice developers.
HTML Manipulation: Developers can traverse the DOM tree, select the elements they need, and modify them in place.
Robust Parsing: Jsoup follows the HTML5 parsing rules, so it copes with complex and even malformed markup much as a browser would.
Cross-Platform Compatibility: Being a Java library, Jsoup runs wherever a JVM is available, regardless of the operating system.
Extensive Documentation: Jsoup offers comprehensive documentation, including tutorials and examples, which speeds up learning and implementation.

1.2 Cons

Java Dependency: Jsoup runs on the JVM, so using it effectively requires working knowledge of Java.
No JavaScript Execution: Jsoup parses the HTML as delivered and does not execute JavaScript, so content rendered on the client side is invisible to it. This limits its usefulness for scraping JavaScript-heavy websites.
Performance Concerns: Jsoup builds the whole document tree in memory, so memory use and parse time grow with document size. Performance may degrade with extremely large HTML documents or complex selector queries.
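The JavaScript limitation can be illustrated with a small sketch. The markup below is an assumed example in which a script would add a second row in a browser; the class StaticHtmlOnlySketch and the helper countRows are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticHtmlOnlySketch {

    // Hypothetical helper: counts the <tr> elements Jsoup sees in the markup as delivered
    static int countRows(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table tr").size();
    }

    public static void main(String[] args) {
        // In a browser, the script would add a second row to the table
        String html = "<table id=\"data\"><tr><td>Static row</td></tr></table>"
                + "<script>"
                + "document.getElementById('data').insertRow().insertCell().textContent = 'Dynamic row';"
                + "</script>";
        // Jsoup keeps the script as inert text, so only the static row is found
        System.out.println(countRows(html)); // prints 1
    }
}
```

For pages that build their tables this way, the HTML would need to be rendered by another tool before being handed to Jsoup.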

2. Working Example

2.1 Adding Dependencies

To use Jsoup in your Java project, you need to include its dependency in your pom.xml if you’re using Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>

Or, if you’re using Gradle, add the following to your build.gradle:

implementation 'org.jsoup:jsoup:1.15.3'

2.2 Code Snippet

Below is a Java program that shows how to use the Jsoup library to parse an HTML table, update a cell, add a row, and delete a row.

package com.jcg.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableManipulationExample {

    public static void main(String[] args) {
        try {
            // Parse HTML content containing a table
            String html = "<table>" +
                    "<tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>" +
                    "<tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>" +
                    "</table>";
            Document doc = Jsoup.parse(html);

            // Select the table element
            Element table = doc.select("table").first();

            // Print the original table
            System.out.println("Original Table:");
            printTable(table);

            // Update content of a specific cell
            Element cellToUpdate = table.select("tr:eq(1) td:eq(1)").first();
            cellToUpdate.text("Updated Content");

            // Print the updated table
            System.out.println("\nUpdated Table:");
            printTable(table);

            // Create a new row element
            Element newRow = new Element("tr");
            // Populate the row with cell elements
            newRow.append("<td>New Row, Cell 1</td>");
            newRow.append("<td>New Row, Cell 2</td>");
            // Append the new row to the table body (jsoup inserts <tbody> while parsing)
            table.select("tbody").first().appendChild(newRow);

            // Print the table after adding a new row
            System.out.println("\nTable after adding a new row:");
            printTable(table);

            // Select the row to delete
            Element rowToDelete = table.select("tr:eq(1)").first();
            // Remove the row from the table
            rowToDelete.remove();

            // Print the table after deleting a row
            System.out.println("\nTable after deleting a row:");
            printTable(table);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Helper method to print the table
    private static void printTable(Element table) {
        // Iterate over each row in the table
        for (Element row : table.select("tr")) {
            // Iterate over each cell in the row
            for (Element cell : row.select("td")) {
                System.out.print(cell.text() + "\t");
            }
            System.out.println();
        }
    }
}

2.2.1 Code Explanation

The code works as follows:

Importing Required Classes: Import necessary classes from the Jsoup library for parsing HTML documents and manipulating HTML elements.
Main Class Definition: This is the main class definition. It contains the main method where the execution of the program starts.
Parsing HTML Content:
- Create an HTML string containing a table with two rows and two columns.
- Parse this HTML string into a Document object using Jsoup’s parse method.
Selecting the Table Element: Use Jsoup’s CSS selector syntax to select the first table element from the parsed document and store it in the table variable.
Printing the Original Table: Print the original table by calling the printTable method, passing the table element as an argument.
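Updating Content of a Cell: Select the cell in the second row and second column of the table and update its text content to “Updated Content”. The :eq() index is zero-based, so tr:eq(1) td:eq(1) targets the second row and the second cell within it.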
Printing the Updated Table: Print the updated table after modifying the content of a cell.
Adding a New Row to the Table: Create a new row, populate it with two cells, and append it to the table.
Printing the Table After Adding a New Row: Print the table after adding a new row.
Deleting a Row from the Table: Select the second row of the table and remove it from the DOM.
Printing the Table After Deleting a Row: Print the table after deleting a row.
Helper Method to Print the Table: Iterate over each row and cell in the provided table element and print the text content of each cell. This method is called to print both the original and modified tables.
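The cell update above relies on the zero-based :eq() selector. As an alternative sketch, the same cell can be reached by indexing into the selected rows and cells directly, which also makes bounds checking explicit. The class CellLookupSketch and the helper cellAt are hypothetical names introduced for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class CellLookupSketch {

    // Hypothetical helper: returns the <td> at the zero-based row and column, or null if out of range
    static Element cellAt(Element table, int rowIndex, int colIndex) {
        Elements rows = table.select("tr");
        if (rowIndex < 0 || rowIndex >= rows.size()) {
            return null;
        }
        Elements cells = rows.get(rowIndex).select("td");
        if (colIndex < 0 || colIndex >= cells.size()) {
            return null;
        }
        return cells.get(colIndex);
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>Row 1, Cell 1</td><td>Row 1, Cell 2</td></tr>"
                + "<tr><td>Row 2, Cell 1</td><td>Row 2, Cell 2</td></tr>"
                + "</table>";
        Element table = Jsoup.parse(html).select("table").first();

        // Second row, second column (both indexes are zero-based)
        Element cell = cellAt(table, 1, 1);
        if (cell != null) {
            cell.text("Updated Content");
        }
        // prints [Row 1, Cell 1, Row 1, Cell 2, Row 2, Cell 1, Updated Content]
        System.out.println(table.select("td").eachText());
    }
}
```

Returning null for an out-of-range position avoids the NullPointerException that an unchecked first() call would raise when a selector matches nothing.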

2.2.2 Code Output

When executed, the code will output the following on the IDE console:

Original Table:
Row 1, Cell 1	Row 1, Cell 2	
Row 2, Cell 1	Row 2, Cell 2	

Updated Table:
Row 1, Cell 1	Row 1, Cell 2	
Row 2, Cell 1	Updated Content	

Table after adding a new row:
Row 1, Cell 1	Row 1, Cell 2	
Row 2, Cell 1	Updated Content	
New Row, Cell 1	New Row, Cell 2	

Table after deleting a row:
Row 1, Cell 1	Row 1, Cell 2	
New Row, Cell 1	New Row, Cell 2

3. Conclusion

In conclusion, Jsoup gives developers an efficient way to parse, update, and delete content in HTML tables. Its API and traversal methods simplify extracting data from HTML documents, tables in particular.

Through the example code provided earlier, we’ve seen how Jsoup enables the parsing of HTML tables, allowing access to individual cells and rows for data extraction or modification. Additionally, the ability to update cell content and dynamically manipulate the structure of the table, such as adding or removing rows, demonstrates the versatility and flexibility that Jsoup offers in handling HTML content.

Moreover, Jsoup’s support for CSS selectors enables precise targeting of specific elements within the HTML document, streamlining the process of data extraction and manipulation. This capability proves invaluable, especially when dealing with complex HTML structures or large datasets.
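As a sketch of that kind of targeting, the snippet below pulls a single column out of one specific table on a page that contains several. The table ids, the sample data, the class ColumnSelectorSketch, and the helper columnValues are all assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

class ColumnSelectorSketch {

    // Hypothetical helper: returns the text of every <td> in the given one-based column
    // of the table with the given id. The id is concatenated into the selector as-is,
    // so this sketch assumes a simple id without special characters.
    static List<String> columnValues(String html, String tableId, int column) {
        Document doc = Jsoup.parse(html);
        // td:nth-child(n) matches cells that are the n-th child of their row (one-based);
        // header cells are skipped because they are <th>, not <td>
        return doc.select("table#" + tableId + " td:nth-child(" + column + ")").eachText();
    }

    public static void main(String[] args) {
        // Sample page with two tables, assumed for illustration
        String html = "<table id=\"users\"><tr><td>Alice</td><td>Developer</td></tr></table>"
                + "<table id=\"prices\">"
                + "<tr><th>Item</th><th>Price</th></tr>"
                + "<tr><td>Pen</td><td>1.50</td></tr>"
                + "<tr><td>Book</td><td>12.00</td></tr>"
                + "</table>";
        System.out.println(columnValues(html, "prices", 2)); // prints [1.50, 12.00]
    }
}
```

Note that :nth-child() counts from one, whereas the :eq() selector used in the earlier example counts from zero.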

However, it’s important to note that while Jsoup excels in parsing static HTML content, its functionality may be limited when it comes to dynamically generated content or websites heavily reliant on JavaScript for rendering. In such cases, alternative approaches or supplementary tools may be necessary to fully capture and process the desired data.

Overall, Jsoup is a dependable choice for web scraping, data extraction, and HTML parsing in Java. Its ease of use, CSS selector support, and extensive documentation make it a good fit for a wide range of web development and data analysis projects.