Unescape HTML Characters in Java

HTML entities are codes that stand in for characters that are reserved in HTML or difficult to type directly. For instance, the less-than symbol “<” is written as “&lt;” so that the browser does not treat it as the start of a tag. When working with data that contains HTML entities, it is often necessary to convert these codes back into their corresponding characters, a process known as unescaping. Let us look at how to unescape HTML characters in Java.
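To make the idea concrete, here is a minimal sketch of what unescaping involves, written with the JDK only (Java 9 or later). The class name HtmlUnescapeSketch and its five-entry entity table are assumptions made for illustration; it handles only the five core named entities plus numeric references, so it is not a substitute for a library that knows the full HTML entity set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlUnescapeSketch {

    // Illustrative table: only the five core named entities are known here.
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<",
            "gt", ">",
            "amp", "&",
            "quot", "\"",
            "apos", "'");

    // Matches &#x1F; (hexadecimal), &#123; (decimal) and &name; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));");

    public static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement;
            if (m.group(2) != null) {
                replacement = decode(m.group(2), 16, m.group());
            } else if (m.group(3) != null) {
                replacement = decode(m.group(3), 10, m.group());
            } else {
                // Unknown names (for example &nbsp;) are left untouched.
                replacement = NAMED.getOrDefault(m.group(4), m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to its character, or returns the
    // original text when the number is not a valid code point.
    private static String decode(String digits, int radix, String original) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original; // value does not fit in an int
        }
    }

    public static void main(String[] args) {
        // prints: <p>Tom & Jerry AB</p>
        System.out.println(unescape("&lt;p&gt;Tom &amp; Jerry &#65;&#x42;&lt;/p&gt;"));
    }
}
```

The input is scanned in a single pass, so a string such as &amp;lt; becomes &lt; rather than <. Decoding exactly one level of escaping is usually the safer behaviour.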

1. Ways to Unescape HTML Characters

In Java, there are multiple ways to unescape HTML symbols.

1.1 Using Apache Commons Text

Apache Commons Text is a library that provides a simple and efficient way to handle text processing. It includes a method for unescaping HTML entities.

1.1.1 Dependency

Include the following dependency in the pom.xml file.

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.9</version>
</dependency>

1.1.2 Code Example

import org.apache.commons.text.StringEscapeUtils;

public class UnescapeHtmlExample {
    public static void main(String[] args) {
        String escapedHtml = "&lt;p&gt;This is a paragraph.&lt;/p&gt;";
        String unescapedHtml = StringEscapeUtils.unescapeHtml4(escapedHtml);
        
        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Unescaped HTML: " + unescapedHtml);
    }
}

The code above uses the StringEscapeUtils.unescapeHtml4() method from the Apache Commons Text library. This method converts HTML entities back into their corresponding characters. In this example, the escaped string is converted to its normal form.

The output of the above code will be:

Escaped HTML: &lt;p&gt;This is a paragraph.&lt;/p&gt;
Unescaped HTML: <p>This is a paragraph.</p>

1.2 Using Jsoup Library

Jsoup is a popular Java library used for working with real-world HTML. It also provides functionality to unescape HTML entities.

1.2.1 Dependency

Include the following dependency in the pom.xml file.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

1.2.2 Code Example

import org.jsoup.parser.Parser;

public class UnescapeHtmlWithJsoup {
    public static void main(String[] args) {
        // Inner double quotes must be escaped inside a Java string literal
        String escapedHtml = "&lt;a href=\"https://example.com\"&gt;Link&lt;/a&gt;";
        String unescapedHtml = Parser.unescapeEntities(escapedHtml, true);
        
        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Unescaped HTML: " + unescapedHtml);
    }
}

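In this example, the Jsoup library is used to unescape HTML entities. The method Parser.unescapeEntities() converts HTML entities back to their corresponding characters. The second parameter, inAttribute, tells Jsoup whether the string is an attribute value, in which case entities are decoded in strict mode. It does not control whether ampersands are preserved. The entities in this example are all well formed, so the result is the same with either value; for text taken from an element body, false is the more natural choice.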

The output of the above code will be:

Escaped HTML: &lt;a href="https://example.com"&gt;Link&lt;/a&gt;
Unescaped HTML: <a href="https://example.com">Link</a>

1.3 A Note on the OWASP Java Encoder

The OWASP Java Encoder is often mentioned alongside the libraries above, but it is an encoding-only library: its Encode class escapes untrusted text for safe output and offers no method for unescaping HTML entities. It is the counterpart to unescaping rather than another way to do it.

1.3.1 Dependency

Include the following dependency in the pom.xml file.

<dependency>
    <groupId>org.owasp.encoder</groupId>
    <artifactId>encoder</artifactId>
    <version>1.2.3</version>
</dependency>

1.3.2 Code Example

import org.owasp.encoder.Encode;

public class EncodeHtmlWithOwasp {
    public static void main(String[] args) {
        String escapedHtml = "&lt;div&gt;Hello World&lt;/div&gt;";
        // forHtmlContent() escapes its input; it does not unescape it
        String encodedAgain = Encode.forHtmlContent(escapedHtml);

        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Encoded again: " + encodedAgain);
    }
}

Encode.forHtmlContent() escapes the characters &, <, and > so that text can be placed safely inside HTML content. Passing an already escaped string therefore escapes it a second time: every & becomes &amp;. To unescape, use one of the libraries from the previous sections, and use the encoder when writing untrusted text back into a page.

The output of the above code will be:

Escaped HTML: &lt;div&gt;Hello World&lt;/div&gt;
Encoded again: &amp;lt;div&amp;gt;Hello World&amp;lt;/div&amp;gt;

2. Conclusion

Unescaping HTML entities is a common requirement when working with web data in Java. Apache Commons Text and Jsoup each provide a one-line method for it, and the choice between them mostly depends on which library your project already uses. The OWASP Java Encoder covers the opposite direction: it escapes text for safe output and does not decode entities. Keep in mind that unescaped text can contain live markup, so escape it again before writing it into an HTML page.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).