Unescape HTML Characters in Java
Some characters are reserved in HTML and must be written as entities, short codes that stand in for the character. For instance, the less-than symbol “<” must be represented as “&lt;” in HTML. When working with data that contains HTML entities, it’s often necessary to convert these codes back into their corresponding characters, a process known as unescaping. Let us look at how to unescape HTML characters in Java.
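To make the idea concrete, here is a minimal sketch of what unescaping involves, written with only the JDK. The class name BasicHtmlUnescaper and its unescape() method are hypothetical names chosen for illustration. The sketch assumes that only the five basic named entities and numeric references need handling, so it is not a substitute for a full library.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicHtmlUnescaper {

    // Assumption: only these five named entities are recognized.
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<",
            "gt", ">",
            "amp", "&",
            "quot", "\"",
            "apos", "'");

    // Matches &name;, decimal &#65; and hexadecimal &#x41; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);");

    public static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = decode(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = decode(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left exactly as they appeared.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Turns a numeric reference into its character, or returns the
    // original text when the number is out of range.
    private static String decode(String digits, int radix, String fallback) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // prints: <p>Tom & Jerry</p>
        System.out.println(unescape("&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;"));
    }
}
```

HTML defines far more named entities than this sketch covers, which is the main reason to prefer one of the libraries described below.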
1. Ways to Unescape HTML Characters
In Java, there are multiple ways to unescape HTML symbols.
1.1 Using Apache Commons Text
Apache Commons Text is a library that provides a simple and efficient way to handle text processing. It includes a method for unescaping HTML entities.
1.1.1 Dependency
Include the following dependency in the pom.xml file.
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.9</version>
</dependency>
1.1.2 Code Example
import org.apache.commons.text.StringEscapeUtils;

public class UnescapeHtmlExample {
    public static void main(String[] args) {
        // Input in which the angle brackets are written as entities
        String escapedHtml = "&lt;p&gt;This is a paragraph.&lt;/p&gt;";
        // Convert the HTML 4 entities back into literal characters
        String unescapedHtml = StringEscapeUtils.unescapeHtml4(escapedHtml);
        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Unescaped HTML: " + unescapedHtml);
    }
}
The code above uses the StringEscapeUtils.unescapeHtml4() method from the Apache Commons Text library. This method converts HTML entities back into their corresponding characters. In this example, the escaped string is converted back to plain HTML markup.
The output of the above code will be:
Escaped HTML: &lt;p&gt;This is a paragraph.&lt;/p&gt;
Unescaped HTML: <p>This is a paragraph.</p>
1.2 Using Jsoup Library
Jsoup is a popular Java library used for working with real-world HTML. It also provides functionality to unescape HTML entities.
1.2.1 Dependency
Include the following dependency in the pom.xml file.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
1.2.2 Code Example
import org.jsoup.parser.Parser;

public class UnescapeHtmlWithJsoup {
    public static void main(String[] args) {
        // Input in which the brackets and quotes are written as entities
        String escapedHtml = "&lt;a href=&quot;https://example.com&quot;&gt;Link&lt;/a&gt;";
        // Second argument is inAttribute: true selects strict (attribute) mode
        String unescapedHtml = Parser.unescapeEntities(escapedHtml, true);
        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Unescaped HTML: " + unescapedHtml);
    }
}
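In this example, the Jsoup library is used to unescape HTML entities. The method Parser.unescapeEntities() converts HTML entities back to their corresponding characters. The second parameter, inAttribute, tells Jsoup whether to decode the string in strict mode, as it does for attribute values. Passing true selects that strict mode. Every entity in this input is well formed and ends with a semicolon, so the input decodes fully.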
The output of the above code will be:
Escaped HTML: &lt;a href=&quot;https://example.com&quot;&gt;Link&lt;/a&gt;
Unescaped HTML: <a href="https://example.com">Link</a>
1.3 What About the OWASP Java Encoder?
The OWASP Java Encoder library is sometimes suggested for this task, but it is an output-encoding library. Its Encode methods escape text so that it can be placed safely in HTML, and the library does not provide a method for decoding entities. It is shown here to make the difference clear.
1.3.1 Dependency
Include the following dependency in the pom.xml file.
<dependency>
    <groupId>org.owasp.encoder</groupId>
    <artifactId>encoder</artifactId>
    <version>1.2.3</version>
</dependency>
1.3.2 Code Example
import org.owasp.encoder.Encode;

public class EncodeHtmlWithOwasp {
    public static void main(String[] args) {
        // Input that is already escaped
        String escapedHtml = "&lt;div&gt;Hello World&lt;/div&gt;";
        // forHtmlContent() escapes; it does not decode entities
        String encodedAgain = Encode.forHtmlContent(escapedHtml);
        System.out.println("Escaped HTML: " + escapedHtml);
        System.out.println("Encoded again: " + encodedAgain);
    }
}
The Encode.forHtmlContent() method escapes the characters &, < and > so that untrusted text can be written safely into HTML content. Because the input here is already escaped, each & is escaped a second time, and the result is double-escaped rather than unescaped. To unescape, use one of the approaches from sections 1.1 and 1.2.
The output of the above code will be:
Escaped HTML: &lt;div&gt;Hello World&lt;/div&gt;
Encoded again: &amp;lt;div&amp;gt;Hello World&amp;lt;/div&amp;gt;
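Whichever library is used, it helps to remember that escaping and unescaping are inverse operations and that each call adds or removes exactly one level. The following sketch illustrates this with JDK-only helpers. The class EscapeLevels and its escape() and unescape() methods are hypothetical and assume that only &, < and > need handling; they are not part of any of the libraries above.

```java
public class EscapeLevels {

    // Hypothetical minimal escaper: handles only &, < and >.
    public static String escape(String s) {
        return s.replace("&", "&amp;")   // first, so the entities added below are not re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Inverse of escape(): "&amp;" is handled last so that one call
    // removes exactly one level of escaping.
    public static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "<div>Hello World</div>";
        String once = escape(raw);    // &lt;div&gt;Hello World&lt;/div&gt;
        String twice = escape(once);  // &amp;lt;div&amp;gt;Hello World&amp;lt;/div&amp;gt;

        // One unescape call undoes one escape call.
        System.out.println(unescape(twice).equals(once));          // prints: true
        System.out.println(unescape(unescape(twice)).equals(raw)); // prints: true
    }
}
```

If data arrives double-escaped, a single unescape call therefore still leaves entities in the text, which is usually a sign that the data was escaped twice upstream.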
2. Conclusion
Unescaping HTML entities is a common requirement when working with web data in Java. Apache Commons Text, with StringEscapeUtils.unescapeHtml4(), and Jsoup, with Parser.unescapeEntities(), both handle this task reliably, and the choice between them mostly depends on your specific needs and on which dependencies your project already has. The OWASP Java Encoder, by contrast, only escapes text, so it is the right tool for the opposite direction: encoding untrusted data before writing it into HTML.