xiaoxing tech

November 14, 2008

Java regex (Regular expression)

Filed under: Java, regex — xiaoxing @ 12:41 pm


US Zip Code: \d{5}(?:-\d{4})?
A zip code has two formats: five digits, or five digits followed by a hyphen and four more digits.
\d{5}: the pattern starts with five digits. The 5 within the braces means the preceding element (\d) must match exactly five times.
(?:-\d{4})?: a non-capturing group, which starts with (?:. The question mark after the closing parenthesis makes the whole group optional.
-\d{4}: a hyphen followed by four digits.
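
To see the zip code pattern at work, here is a minimal sketch in Java. The class name ZipCodeCheck and the method isZip are made up for this illustration; only the pattern itself comes from the breakdown above. It uses Matcher.matches(), which requires the whole input to fit the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class ZipCodeCheck {
    // Each backslash of the regex is doubled inside the Java string literal.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(?:-\\d{4})?");

    // Returns true only if the entire string is a five-digit zip code,
    // optionally followed by a hyphen and four more digits.
    static boolean isZip(String s) {
        return ZIP.matcher(s).matches();
    }
}
```

With this sketch, isZip("12345") and isZip("12345-6789") are accepted, while "1234" and "12345-678" are rejected.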

US SSN (Social Security number): \d{3}([-]?)\d{2}\1\d{4}
The digits are written in one of two ways: as nine digits, or as three digits, a hyphen, two digits, another hyphen and four more digits.
\d{3}: the pattern starts with three digits.
([-]?): a capturing group containing an optional hyphen. Whatever the group matches (a hyphen or nothing) is remembered and can be recalled later in the pattern as \1.
\d{2}: two more digits.
\1: a backreference to the text matched by the group earlier in the regular expression. If a hyphen was matched there, then a hyphen must appear here too. If no hyphen was matched there, then a hyphen cannot appear here.
\d{4}: four more digits.
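
The effect of the backreference is easiest to see by testing mixed inputs. The following is a small sketch, assuming a made-up class SsnCheck with a made-up method isSsn. It only checks the format described above, not whether the number is a valid SSN.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class SsnCheck {
    // Group 1 captures an optional hyphen; \1 later demands the same text again.
    private static final Pattern SSN = Pattern.compile("\\d{3}([-]?)\\d{2}\\1\\d{4}");

    // Returns true if the whole string is nine digits, either with both
    // hyphens or with none.
    static boolean isSsn(String s) {
        return SSN.matcher(s).matches();
    }
}
```

Here isSsn("123456789") and isSsn("123-45-6789") are accepted, but "123-456789" and "12345-6789" are rejected because the two separator positions disagree.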

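User ID: [a-zA-Z]{3,5}[a-zA-Z0-9]?\d{2}
A user ID might be three to five letters followed by three digits, or three to six letters followed by two digits. The optional [a-zA-Z0-9]? in the middle is what allows either one extra letter or one extra digit.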
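
A quick way to check which user IDs the pattern accepts is to try the boundary cases. This is only a sketch; the class UserIdCheck and the method isUserId are names invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class UserIdCheck {
    // Three to five letters, one optional letter or digit, then two digits.
    private static final Pattern USER_ID =
            Pattern.compile("[a-zA-Z]{3,5}[a-zA-Z0-9]?\\d{2}");

    // Returns true only if the entire string fits the user ID pattern.
    static boolean isUserId(String s) {
        return USER_ID.matcher(s).matches();
    }
}
```

With this sketch, "abc12", "abc123" and "abcdef12" are accepted, while "ab12" (too few letters), "abcdefg12" (seven letters) and "abcdef123" (six letters with three digits) are rejected.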

Note: In Java source code a regular expression is written as a string literal, so each \ character must be written as a double backslash, \\. For example, \d becomes "\\d".
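
The doubling rule can be illustrated with a short sketch. The class EscapeDemo and its members are hypothetical names used only here. The Java compiler turns each \\ in the literal into a single backslash before the regex engine ever sees it, so matching one literal backslash takes four of them in source.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class EscapeDemo {
    // The literal "\\d{3}" holds the five characters \d{3}.
    static final String THREE_DIGITS = "\\d{3}";

    // True if the whole string is exactly three digits.
    static boolean isThreeDigits(String s) {
        return Pattern.matches(THREE_DIGITS, s);
    }

    // The regex \\ matches one literal backslash; as a Java literal it is "\\\\".
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }
}
```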

1. Square brackets define a character class. [xyz] will match x, y or z, but only a single character.
2. If the class starts with ^, then it will match any character that is not listed inside the brackets: [^abc] matches any character except a, b or c (negation).
3. A hyphen can be used to include a range of characters: [a-z] will match any lowercase letter from a to z. [a-zA-Z] matches a through z or A through Z, inclusive (range).
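
The three rules above can be tried out with Pattern.matches, which tests a whole string against a regex. This is a minimal sketch; the class CharClassDemo and its method names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class CharClassDemo {
    // Rule 1: exactly one character, which must be x, y or z.
    static boolean isOneOfXyz(String s) {
        return Pattern.matches("[xyz]", s);
    }

    // Rule 2: exactly one character that is not a, b or c.
    static boolean isNotAbc(String s) {
        return Pattern.matches("[^abc]", s);
    }

    // Rule 3: exactly one letter in the range a-z or A-Z.
    static boolean isLetter(String s) {
        return Pattern.matches("[a-zA-Z]", s);
    }
}
```

Note that isOneOfXyz("xy") is false: a character class on its own consumes a single character, so a two-character string cannot match.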

Character class    Meaning
.                  Any character, except line terminators
\d                 A digit: [0-9]
\D                 A non-digit: [^0-9]
\s                 A whitespace character: [ \t\n\x0B\f\r]
\S                 A non-whitespace character: [^\s]
\w                 A word character: [a-zA-Z_0-9]
\W                 A non-word character: [^\w]
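
The predefined classes are handy for clean-up tasks as well as validation. Below is a small sketch using String.replaceAll and Pattern.matches; the class PredefinedClassDemo and its method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class PredefinedClassDemo {
    // \s+ matches a run of whitespace; replace each run with one space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // \D matches any non-digit; removing them leaves only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // \w+ matches one or more letters, digits or underscores.
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }
}
```

For instance, digitsOnly("(555) 123-4567") yields "5551234567", and isWord("user-01") is false because the hyphen is not a word character.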
