3 Examples of Parsing HTML File in Java using Jsoup

Javin Paul, September 23rd, 2014 (Last Updated: September 23rd, 2014)


HTML is the core of the web. All the pages you see on the internet are based on HTML, whether they are dynamically generated by JavaScript, JSP, PHP, ASP or any other web technology. Your browser parses the HTML and renders it for you. But what do you do if you need to parse an HTML document from a Java program, to find elements, tags and attributes, or to check whether a particular element exists?

If you have been programming in Java for some years, I am sure you have done some XML parsing using parsers like DOM and SAX. Ironically, there are only a few occasions when you need to parse an HTML document from a core Java application, one that doesn’t involve Servlets or other Java web technologies. To make things worse, the core JDK does not ship a convenient library for parsing real-world HTML either. That’s why, when it comes to parsing an HTML file, many Java programmers have to search Google to find out how to get the value of an HTML tag in Java.

When I needed this, I was sure there would be an open source library that implemented the functionality for me, but I didn’t know it would be as wonderful and feature-rich as Jsoup. It not only reads and parses HTML documents, but also lets you extract any element from the HTML file, along with its attributes and CSS class, in jQuery style, and at the same time it allows you to modify them. You can do almost anything with an HTML document using Jsoup.

In this article, we will parse an HTML file and find out the value of the title and heading tags. We will also see examples of downloading and parsing HTML from a file as well as from a URL, by parsing Google’s home page in Java.

What is JSoup Library

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers like Chrome and Firefox do. Here are some of the useful features of the jsoup library:

Jsoup can scrape and parse HTML from a URL, file, or string
Jsoup can find and extract data, using DOM traversal or CSS selectors
Jsoup allows you to manipulate the HTML elements, attributes, and text
Jsoup can clean user-submitted content against a safe white-list, to prevent XSS attacks
Jsoup also outputs tidy HTML

Jsoup is designed to deal with the different kinds of HTML found in the real world, from properly validated HTML to incomplete, invalid collections of tags. One of the core strengths of Jsoup is that it’s very robust.
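To make two of the features above concrete, here is a minimal sketch of extracting data with a CSS selector and then modifying an element. The markup, the class name JsoupFeatureSketch and the method names are made up for illustration; only the Jsoup calls themselves (parse, select, text, attr, addClass) come from the library, and the Jsoup JAR is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class JsoupFeatureSketch {

    // Hypothetical markup, used only for illustration.
    static final String SNIPPET =
            "<div id=\"links\">"
            + "<a class=\"ext\" href=\"http://example.com/a\">First</a>"
            + "<a href=\"/b\">Second</a>"
            + "</div>";

    // Extract data with a CSS selector: text of the first link with class "ext".
    static String firstExternalLinkText() {
        Document doc = Jsoup.parse(SNIPPET);
        Element link = doc.select("a.ext[href]").first();
        return link == null ? "" : link.text();
    }

    // Manipulate the text, an attribute and the CSS class of the second link.
    static String relabelSecondLink() {
        Document doc = Jsoup.parse(SNIPPET);
        Element second = doc.select("#links a").get(1);
        second.text("Updated").attr("rel", "nofollow").addClass("internal");
        return second.text() + "|" + second.attr("rel") + "|" + second.className();
    }

    public static void main(String[] args) {
        System.out.println(firstExternalLinkText());
        System.out.println(relabelSecondLink());
    }
}
```

The setters return the element itself, which is why the calls can be chained in jQuery style.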

HTML Parsing in Java using JSoup

In this Java HTML parsing tutorial, we will see three different examples of parsing and traversing HTML documents in Java using jsoup. In the first example, we will parse an HTML String, written as a String literal in Java. In the second example, we will download our HTML document from the web, and in the third example, we will load our own sample HTML file, login.html, for parsing. This file is a sample HTML document which contains a title tag and a div in the body section which contains an HTML form. The form has input tags to capture the username and password, and submit and reset buttons for further action. It’s proper HTML which can be validated, i.e. all tags and attributes are properly closed. Here is what our sample HTML file looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>

HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a URL, or an InputStream, and some of them also accept a base URI for resolving relative links. You can also specify the character encoding to read HTML files correctly in case they are not in “UTF-8” format.
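As a rough sketch of two of these overloads, the example below uses parse(String html, String baseUri) to resolve a relative link, and parse(InputStream in, String charsetName, String baseUri) to read bytes that are not UTF-8. The class name, method names, the example.com base URI and the sample markup are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class ParseOverloadsSketch {

    // parse(String html, String baseUri): the base URI lets absUrl()
    // turn a relative href into an absolute URL.
    static String absoluteLink() {
        Document doc = Jsoup.parse(
                "<a href=\"/docs/index.html\">Docs</a>", "http://example.com/");
        return doc.select("a").first().absUrl("href");
    }

    // parse(InputStream in, String charsetName, String baseUri): the charset
    // tells Jsoup how to decode the bytes (ISO-8859-1 here, not UTF-8).
    static String titleFromLatin1Stream() {
        byte[] bytes = "<html><head><title>Caf\u00e9</title></head><body></body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "ISO-8859-1", "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(absoluteLink());
        System.out.println(titleFromLatin1Stream());
    }
}
```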

The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element which extends Node. Also TextNode extends Node. As long as you pass in a non-null string, you’re guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element. Once you have a Document, you can get the data you want by calling appropriate methods in Document and its parent classes Element and Node.
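The guarantee described above can be checked with a small sketch. Here a string without any tags is parsed, and the resulting Document still has a head and a body. The class and method names are hypothetical; the two assignments only illustrate the Document, Element and Node hierarchy.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

final class DocumentShapeSketch {

    // Parses any non-null string and reports the basic document structure.
    static String describe(String html) {
        Document doc = Jsoup.parse(html);

        // These assignments compile because Document extends Element,
        // and Element extends Node.
        Element asElement = doc;
        Node asNode = asElement;

        return "head=" + (doc.head() != null)
                + ", body=" + (doc.body() != null)
                + ", text=" + doc.body().text();
    }

    public static void main(String[] args) {
        System.out.println(describe("Just some text, no tags"));
    }
}
```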

Java Program to parse HTML Document

Here is our complete Java program to parse an HTML String, an HTML file downloaded from the internet and an HTML file from the local file system. To run this program, you can use the Eclipse IDE, any other IDE, or the command prompt; in each case the Jsoup JAR must be on the classpath. In Eclipse, it’s very easy: just copy this code, create a new Java project, right click on the src folder and paste it. Eclipse will take care of creating the proper package and a Java source file with the same name, so there is very little work. If you already have a sample Java project, then it’s just one step. The following Java program shows three examples of parsing and traversing an HTML file. In the first example, we directly parse a String with HTML content; in the second, we parse an HTML file downloaded from a URL; and in the third, we load and parse an HTML document from the local file system.

import java.io.File;
import java.io.IOException;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
 
/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files, extract elements and manipulate data using
 * DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // JSoup Example 1 - Parsing an HTML String
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // JSoup Example 2 - Reading HTML page from URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
            // Print inside the try block so a failed download does not
            // report the title left over from Example 1
            System.out.println("Jsoup Can read HTML page from URL, title : " + title);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // JSoup Example 3 - Parsing an HTML file in Java
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats "login.html"
        // as the HTML content itself and the second argument as a base URI.
        // Right: pass a File together with the character encoding.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            title = htmlFile.title();
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // getting class from HTML element

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + title);
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            // login.html is missing or unreadable, so there is nothing to query
            e.printStackTrace();
        }
    }

}

Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
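The program above only reads the title and the class of the div. As a further sketch, the same login page can be queried with CSS selectors to list the ids of the form inputs and to read the form action. To keep the example self-contained, the markup of login.html is inlined as a String constant; the class name LoginFormSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class LoginFormSketch {

    // Body of the sample login.html, inlined so the sketch needs no file.
    static final String LOGIN_HTML =
            "<html><head><title>Login Page</title></head><body>"
            + "<div id=\"login\" class=\"simple\">"
            + "<form action=\"login.do\">"
            + "Username : <input id=\"username\" type=\"text\" /><br>"
            + "Password : <input id=\"password\" type=\"password\" /><br>"
            + "<input id=\"submit\" type=\"submit\" />"
            + "<input id=\"reset\" type=\"reset\" />"
            + "</form></div></body></html>";

    // Collects the id of every input inside the login form, in document order.
    static List<String> inputIds() {
        Document doc = Jsoup.parse(LOGIN_HTML);
        List<String> ids = new ArrayList<>();
        for (Element input : doc.select("#login form input")) {
            ids.add(input.id());
        }
        return ids;
    }

    // Reads the action attribute of the first form in the page.
    static String formAction() {
        Document doc = Jsoup.parse(LOGIN_HTML);
        return doc.select("form").attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputIds());
        System.out.println(formAction());
    }
}
```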

The Jsoup HTML parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:
unclosed tags (e.g. <p>Java <p>Scala is parsed to <p>Java</p> <p>Scala</p>)
implicit tags (e.g. a naked <td>Java is Great</td> is wrapped into a <table><tr><td>)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
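Two of these behaviours, unclosed tags and the reliable document structure, can be observed with a short sketch. The class name, method names and input strings are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class LenientParseSketch {

    // Returns the text of each paragraph Jsoup finds, even when
    // the <p> tags in the input were never closed.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    // Checks that a title given without <html> or <head> wrappers
    // still ends up inside the head of the parsed document.
    static boolean titleLandsInHead(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head().select("title").size() == 1
                && doc.body().select("title").isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(paragraphTexts("<p>Java <p>Scala"));
        System.out.println(titleLandsInHead("<title>Languages</title><p>Java <p>Scala"));
    }
}
```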

Jsoup is an excellent and robust open source library which makes it extremely easy to read HTML documents, body fragments and HTML strings, and to parse HTML content directly from the web.

Reference:

3 Examples of Parsing HTML File in Java using Jsoup from our JCG partner Javin Paul at the Javarevisited blog.