Regular Expressions in Java – Soft Introduction

Farhan Khwaja, February 7th, 2012. Last Updated: October 21st, 2012


A regular expression is a pattern that can be applied to text (a String, in Java). Java provides the java.util.regex package for pattern matching with regular expressions. Java's regular expression syntax is very similar to Perl's and is easy to learn.

A regular expression either matches the text (or a part of it) or it fails to match.
* If a regular expression matches a part of the text, we can find out which part.
** If the regular expression is complex, we can also find out which part of the regular expression matched which part of the text.

A First Example

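The regular expression “[a-z]+” matches a sequence of one or more lowercase letters.
[a-z] means any character from a to z, inclusive, and + means “one or more”.

Suppose we supply the string “code 2 learn java tutorial”. The pattern matches “code”, “learn”, “java” and “tutorial”, but not the digit 2 or the spaces.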

How to do it in Java

First, you must compile the pattern:
import java.util.regex.*;
Pattern p = Pattern.compile("[a-z]+");

Next, you must create a matcher for the text by calling the matcher method on the pattern:
Matcher m = p.matcher("code 2 learn java tutorial");

NOTE:

Neither Pattern nor Matcher has a public constructor; we create them by using methods of the Pattern class.

Pattern Class: A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument.

Matcher Class: A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
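As a rough sketch of the two factory methods described above, the following compiles a pattern with and without a flag and obtains a Matcher from each. The class name CompileDemo and the helper isWord are made up for illustration; Pattern.compile(String, int) and Pattern.CASE_INSENSITIVE are part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompileDemo {
    // Compile once; a Pattern can be reused for any number of inputs.
    static final Pattern WORD = Pattern.compile("[a-z]+");

    // The two-argument overload of compile takes flags as its second argument.
    static final Pattern WORD_ANY_CASE =
            Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the whole input is matched by the pattern.
    static boolean isWord(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input); // a Matcher is obtained from its Pattern
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord(WORD, "java"));          // true
        System.out.println(isWord(WORD, "Java"));          // false: 'J' is upper case
        System.out.println(isWord(WORD_ANY_CASE, "Java")); // true
    }
}
```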

Once we have the matcher m, we can check whether a match was found and, if so, at which index position it starts and ends.

m.matches() returns true if the pattern matches the entire string, and false otherwise.
m.lookingAt() returns true if the pattern matches at the beginning of the string, and false otherwise.
m.find() returns true if the pattern matches any part of the text. Each subsequent call to m.find() continues searching from where the previous match ended.
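The difference between the three methods is easiest to see side by side. The sketch below applies each of them to the same inputs; the class and helper names are chosen only for this example.

```java
import java.util.regex.Pattern;

class MatchKindsDemo {
    static final Pattern LOWER = Pattern.compile("[a-z]+");

    // Whole input must match.
    static boolean matchesWhole(String text) {
        return LOWER.matcher(text).matches();
    }

    // A match must start at index 0, but need not reach the end.
    static boolean matchesPrefix(String text) {
        return LOWER.matcher(text).lookingAt();
    }

    // A match may occur anywhere in the input.
    static boolean matchesAnywhere(String text) {
        return LOWER.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "code 2 learn";
        System.out.println(matchesWhole(text));         // false
        System.out.println(matchesPrefix(text));        // true, matches "code"
        System.out.println(matchesAnywhere(text));      // true
        System.out.println(matchesPrefix("2 learn"));   // false, starts with a digit
        System.out.println(matchesAnywhere("2 learn")); // true, finds "learn"
    }
}
```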

Finding what was matched

After a successful match, m.start() will return the index of the first character matched and m.end() will return the index of the last character matched, plus one.

If no match was attempted, or if the match was unsuccessful, m.start() and m.end() will throw an IllegalStateException.
– This is a RuntimeException, so the compiler does not force you to catch it
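To illustrate when the exception occurs, here is a small sketch that calls m.start() and m.end() before any match attempt, after successful matches, and after a failed one. The helper span is hypothetical and exists only to turn the exception into a readable string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartEndDemo {
    // Returns "start-end" for the current match, or "no match" if none is available.
    static String span(Matcher m) {
        try {
            return m.start() + "-" + m.end();
        } catch (IllegalStateException e) {
            return "no match";
        }
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("[a-z]+").matcher("code 2 learn");
        System.out.println(span(m)); // no match: find() has not been called yet
        m.find();
        System.out.println(span(m)); // 0-4  ("code")
        m.find();
        System.out.println(span(m)); // 7-12 ("learn")
        m.find();                    // returns false: nothing left to match
        System.out.println(span(m)); // no match
    }
}
```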

It may seem strange that m.end() returns the index of the last character matched plus one, but this is just what most String methods require.
– For example, if m was created for the text "Now is the time", then "Now is the time".substring(m.start(), m.end()) will return exactly the matched substring.
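The sketch below collects every match using the substring approach described above. It also calls m.group(), a standard Matcher method not covered in the text, which should return the same string for each match. The class name GroupDemo and the helper allMatches are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns every substring of text matched by regex, in order of appearance.
    static List<String> allMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // end() is exclusive, so it can be passed straight to substring.
            String viaSubstring = text.substring(m.start(), m.end());
            // group() with no arguments returns the same matched text.
            String viaGroup = m.group();
            if (viaSubstring.equals(viaGroup)) {
                found.add(viaSubstring);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // 'N' is upper case, so the first match is "ow", not "Now".
        System.out.println(allMatches("[a-z]+", "Now is the time")); // [ow, is, the, time]
    }
}
```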

Java Program:

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] args) {
        String pattern = "[a-z]+";
        String text = "code 2 learn java tutorial";
        Pattern p = Pattern.compile(pattern);   // compile the regular expression
        Matcher m = p.matcher(text);            // create a matcher for the text
        // Each call to find() moves on to the next run of lowercase letters.
        while (m.find()) {
            System.out.print(text.substring(m.start(), m.end()) + "*");
        }
    }
}

Output: code*learn*java*tutorial*

Additional Methods

If m is a matcher, then

– m.replaceFirst(replacement) returns a new String where the first substring matched by the pattern has been replaced by replacement
– m.replaceAll(replacement) returns a new String where every substring matched by the pattern has been replaced by replacement
– m.find(startIndex) resets this matcher and then looks for the next pattern match, starting at the specified index
– m.reset() resets this matcher
– m.reset(newText) resets this matcher and gives it new text to examine (any CharSequence, such as a String, StringBuffer, or CharBuffer)
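A short sketch of these methods in use follows. The pattern, the input strings, and the helper names are assumptions made for the example; m.group(), which returns the text of the current match, is a standard Matcher method not listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Replaces only the first run of digits.
    static String maskFirst(String text) {
        return DIGITS.matcher(text).replaceFirst("#");
    }

    // Replaces every run of digits.
    static String maskAll(String text) {
        return DIGITS.matcher(text).replaceAll("#");
    }

    // Returns the first match found at or after fromIndex, or null if there is none.
    static String firstFrom(String text, int fromIndex) {
        Matcher m = DIGITS.matcher(text);
        return m.find(fromIndex) ? m.group() : null;
    }

    // Reuses one matcher for a second text by calling reset(newText).
    static String firstAfterReset(String firstText, String secondText) {
        Matcher m = DIGITS.matcher(firstText);
        m.find();              // consume a match in the first text
        m.reset(secondText);   // discard state and switch to the new input
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(maskFirst("a1b22c333"));        // a#b22c333
        System.out.println(maskAll("a1b22c333"));          // a#b#c#
        System.out.println(firstFrom("a1b22c333", 2));     // 22
        System.out.println(firstAfterReset("a1", "x9y"));  // 9
    }
}
```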

Regular Expression Syntax

The table below lists the most commonly used regular expression metacharacters in Java:

Subexpression	Matches
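^	Matches beginning of input (beginning of each line in MULTILINE mode).
$	Matches end of input (end of each line in MULTILINE mode).
.	Matches any single character except a line terminator. The DOTALL flag, (?s), allows it to match line terminators as well.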
[…]	Matches any single character in brackets.
[^…]	Matches any single character not in brackets
\A	Beginning of entire string
\z	End of entire string
\Z	End of entire string except allowable final line terminator.
re*	Matches 0 or more occurrences of preceding expression.
re+	Matches 1 or more occurrences of preceding expression.
re?	Matches 0 or 1 occurrence of preceding expression.
re{n}	Matches exactly n occurrences of preceding expression.
re{n,}	Matches n or more occurrences of preceding expression.
re{n,m}	Matches at least n and at most m occurrences of preceding expression.
a|b	Matches either a or b.
(re)	Groups regular expressions and remembers matched text.
(?:re)	Groups regular expressions without remembering matched text.
(?>re)	Matches independent pattern without backtracking.
\w	Matches word characters.
\W	Matches nonword characters.
\s	Matches whitespace. Equivalent to [ \t\n\x0B\f\r].
\S	Matches nonwhitespace.
\d	Matches digits. Equivalent to [0-9].
\D	Matches nondigits.
\A	Matches beginning of string.
\Z	Matches end of string. If a newline exists, it matches just before newline.
\z	Matches end of string.
\G	Matches point where last match finished.
\n	Back-reference to capture group number n, where n is a digit (for example \1).
\b	Matches word boundaries.
\B	Matches nonword boundaries.
\n, \t, etc.	Matches newlines, carriage returns, tabs, etc.
\Q	Escape (quote) all characters up to \E
\E	Ends quoting begun with \Q
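
A few of the constructs from the table can be tried out as follows. This is only a sketch: the patterns, inputs, and helper names are invented for the example. Note that inside a Java string literal each backslash must be doubled, so the regular expression \d is written "\\d".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxDemo {
    // (\w+) captures a word; \1 refers back to it, so this finds a repeated word.
    // The \b anchors keep the match from starting or ending inside a longer word.
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // \Q...\E quotes the text in between, so '+' is matched literally.
    static final Pattern LITERAL = Pattern.compile("\\Q1+1\\E");

    // \d{n} matches exactly n digits; each pair of parentheses is a capture group.
    static final Pattern DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns the first word that appears twice in a row, or null.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // True if the text contains the literal characters 1+1.
    static boolean containsLiteral(String text) {
        return LITERAL.matcher(text).find();
    }

    // Returns the year part of a yyyy-mm-dd string, or null if the format differs.
    static String yearOf(String text) {
        Matcher m = DATE.matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a test")); // is
        System.out.println(containsLiteral("1+1=2"));              // true
        System.out.println(containsLiteral("11"));                 // false
        System.out.println(yearOf("2012-02-07"));                  // 2012
    }
}
```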