Basics of Regular Expressions

Hasan Gökçe
5 min read · Dec 21, 2020

1. Intro

Regular expressions are everywhere, even if we don’t realize it. They are called regex or regexp for short. The story of regex can be found on Wikipedia. Regex is used for:

  • validating forms
  • parsing HTML
  • searching text

2. Literals

The most basic thing we can match is a literal. The regex apple will match the text apple, and the regex lemon will match the text lemon.

  • regex: apple matches: apple
  • regex: lemon matches: lemon
  • regex: 3 matches: 3
  • regex: 5 cats matches: 5 cats

Regex works by walking character by character, from left to right.
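
To make literal matching concrete, here is a minimal sketch using Python's built-in re module. The sample sentence is my own, picked only for illustration.

```python
import re

# A sample sentence (hypothetical, only for illustration).
text = "I bought 5 cats and an apple"

# re.search scans the text from left to right and returns the first match.
apple = re.search(r"apple", text)
cats = re.search(r"5 cats", text)
lemon = re.search(r"lemon", text)

print(apple.group())  # apple
print(cats.span())    # (9, 15), the start and end positions of the match
print(lemon)          # None, because "lemon" does not occur in the text
```

A literal pattern finds exactly the characters you typed, in the same order, and nothing else.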

3. Alternation

Do you like cats and dogs? We can find both of them with a single regular expression by using alternation. Alternation is written with the | symbol, which means OR. The regular expression is cats|dogs, and it matches either cats or dogs.
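
Here is a small sketch of alternation in Python's re module. The example sentence is made up for this demo.

```python
import re

# cats|dogs matches either the literal "cats" or the literal "dogs".
pattern = re.compile(r"cats|dogs")

# Hypothetical sample text.
text = "Some people like cats, some like dogs, some like birds."

found = pattern.findall(text)
print(found)  # ['cats', 'dogs']
```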

4. Character Sets

It’s quite common to misspell some words, such as consensus. Character sets are written with square brackets []. They make it possible to match alternative or even incorrect spellings.

The regex con[sc]en[sc]us matches:

  • consensus
  • concensus
  • concencus

[sc] means that either s or c is accepted at this position, but nothing else. Likewise, [cat] matches a single c, a, or t, but it does not match the whole text cat. The beauty is that matching becomes more flexible than with literals alone!

A powerful character inside a set is:

  • ^ caret symbol

The caret is placed at the start of a character set, right after the opening bracket, and negates the set. [^cat] matches any single character that is not c, a, or t.
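
A quick sketch of character sets and negated sets in Python's re module. The word contentus and the text catnip are test inputs I made up. They are not from any real word list.

```python
import re

spelling = re.compile(r"con[sc]en[sc]us")

# "contentus" is a made-up non-matching input: t is not in [sc].
words = ["consensus", "concensus", "concencus", "contentus"]
accepted = [w for w in words if spelling.fullmatch(w)]
print(accepted)  # ['consensus', 'concensus', 'concencus']

# A set always stands for exactly one character.
whole_word = re.fullmatch(r"[cat]", "cat")
print(whole_word)  # None, one set cannot cover three characters

letters = re.findall(r"[cat]", "cat")
print(letters)  # ['c', 'a', 't']

# A caret right after the opening bracket negates the set.
others = re.findall(r"[^cat]", "catnip")
print(others)  # ['n', 'i', 'p']
```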

Are we in a consensus that regex is pretty cool?

5. Wild for Wildcard

We want to match any 7-character piece of text. The regex ....... is our answer. Seven dots match leopard, gorilla, hamster, and giraffe. The dot is used as a wildcard that matches any single character (by default, any character except a line break).

What if we want to match a literal dot character? We can escape it with a backslash. The regex Busy as a bee\. matches the text Busy as a bee. including the final dot.

Escape characters exist in almost all programming languages, but with different implementations. More information can be found at https://en.wikipedia.org/wiki/Escape_character .
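
The wildcard and the escaped dot can be tried out with a short Python sketch. The extra word cat and the text ending in ! are inputs I added to show the non-matching cases.

```python
import re

# "cat" is an extra input added here to show a non-match.
animals = ["leopard", "gorilla", "hamster", "giraffe", "cat"]

# Seven dots: exactly seven characters of any kind.
seven = [a for a in animals if re.fullmatch(r".......", a)]
print(seven)  # ['leopard', 'gorilla', 'hamster', 'giraffe']

# An escaped dot matches only a real dot.
escaped_ok = re.fullmatch(r"Busy as a bee\.", "Busy as a bee.")
escaped_bad = re.fullmatch(r"Busy as a bee\.", "Busy as a bee!")
print(escaped_ok is not None)  # True
print(escaped_bad)             # None

# Without the backslash, the dot is a wildcard and accepts "!" too.
unescaped = re.fullmatch(r"Busy as a bee.", "Busy as a bee!")
print(unescaped is not None)   # True
```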

6. Ranges

Ranges are where the real power of character sets starts to show. The regex [a-c] is equivalent to the regex [abc].

The dash character makes it possible to define a range.

The regex I adopted [2-4] [b-h]ats will match the texts:

  • I adopted 3 bats
  • I adopted 4 cats
  • I adopted 2 hats

By using ranges:

  • [A-Z] will match any single capital letter
  • [a-z] will match any single lowercase letter
  • [0-9] will match any single digit
  • [A-Za-z] will match any single capital or lowercase alphabetical character

Worth remembering: in all of these cases, [] matches only one character.
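
Here is a hedged sketch of ranges in Python. I am assuming a letter range of [b-h] so that b, c, and h are all covered. The narrower [a-c] is shown next to it for contrast, and the last two sentences in the list are my own non-matching inputs.

```python
import re

# Assumption: [b-h] is used so that bats, cats and hats all fit.
pattern = re.compile(r"I adopted [2-4] [b-h]ats")

texts = [
    "I adopted 3 bats",
    "I adopted 4 cats",
    "I adopted 2 hats",
    "I adopted 7 cats",  # 7 is outside [2-4]
    "I adopted 2 rats",  # r is outside [b-h]
]
matches = [t for t in texts if pattern.fullmatch(t)]
print(matches)  # ['I adopted 3 bats', 'I adopted 4 cats', 'I adopted 2 hats']

# With [a-c], the letter h is out of range, so "hats" is rejected.
narrow = re.fullmatch(r"I adopted [2-4] [a-c]ats", "I adopted 2 hats")
print(narrow)  # None

# [a-c] and [abc] behave the same way.
same = re.findall(r"[a-c]", "abcd") == re.findall(r"[abc]", "abcd")
print(same)  # True
```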

7. Shorthand Character Classes

While ranges are useful, writing them out for every character can be cumbersome. Shorthand character classes cover the most common ranges and make regular expressions much simpler.

  • \w : the word character. It represents [A-Za-z0-9_], which matches a single uppercase letter, lowercase letter, digit, or underscore.
  • \d : the digit character. It represents [0-9].
  • \s : the whitespace character. It represents [ \t\r\n\f\v], which matches a single space, tab, carriage return, line break, form feed, or vertical tab.

For instance, the regex \w\w\w\w\w\s\d matches 5 word characters, followed by a whitespace character, followed by a digit character. As a result, it matches the whole text sense 8.

Each shorthand also has a negated form:

  • \W : the non-word character. It represents [^A-Za-z0-9_], the opposite of \w.
  • \D : the non-digit character. It represents [^0-9], which matches any single character that is not a digit. The opposite of \d.
  • \S : the non-whitespace character. It represents [^ \t\r\n\f\v], which matches any single character that is not whitespace. The opposite of \s.
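
The shorthand classes can be checked with a short Python sketch. The small test strings are my own. One caveat: in Python 3, \w, \d, and \s are Unicode-aware by default for text patterns, so they match more than the ASCII ranges listed above unless the re.ASCII flag is passed.

```python
import re

sense = re.fullmatch(r"\w\w\w\w\w\s\d", "sense 8")
print(sense is not None)  # True

# Small made-up strings to show each class.
digits = re.findall(r"\d", "room 42b")
non_digits = re.findall(r"\D", "42b!")
non_space = re.findall(r"\S", "a b\tc")
non_word = re.findall(r"\W", "a_b-c")
print(digits)      # ['4', '2']
print(non_digits)  # ['b', '!']
print(non_space)   # ['a', 'b', 'c']
print(non_word)    # ['-'], the underscore counts as a word character

# Python detail: \w also accepts non-ASCII letters unless re.ASCII is set.
unicode_word = re.fullmatch(r"\w", "\u00e9")           # the letter e with an acute accent
ascii_word = re.fullmatch(r"\w", "\u00e9", re.ASCII)
print(unicode_word is not None)  # True
print(ascii_word)                # None
```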

8. Grouping

Remember that we used cats|dogs to express “either”. But what if we want to match the whole texts I love dogs and I love cats? A first attempt might be the regex I love cats|dogs. However, this regex matches either I love cats or just dogs. The reason is that the | symbol applies to the whole expression before and after it.

Now it’s time for grouping to come to the rescue! Grouping is written with parentheses ( ), letting us limit where a piece of regex applies.

The regex I love (cats|dogs) matches the text I love followed by either cats or dogs.

Groups are also called capture groups. They are a powerful way to select or capture part of a match.
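
The difference between the ungrouped and the grouped pattern is easy to see in a small Python sketch. fullmatch is used here so that only matches of the whole text count.

```python
import re

ungrouped = re.compile(r"I love cats|dogs")
grouped = re.compile(r"I love (cats|dogs)")

# Without grouping, the alternatives are "I love cats" and "dogs".
no_group_sentence = ungrouped.fullmatch("I love dogs")
no_group_word = ungrouped.fullmatch("dogs")
print(no_group_sentence)          # None
print(no_group_word is not None)  # True

# With grouping, only the animal part alternates.
with_group = grouped.fullmatch("I love dogs")
print(with_group is not None)  # True

# The parentheses also capture what was matched inside them.
captured = with_group.group(1)
print(captured)  # dogs
```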

9. Quantifiers — Fixed

Now things start to get exciting. Up to now, we have only matched character by character. To match heyyyy yo, we can write the regex \w\w\w\w\w\w\s\w\w, which matches 6 word characters, followed by a whitespace character, followed by 2 more word characters. Is there a better way to do this?

Yesss, thanks to quantifiers! Fixed quantifiers, written with curly braces { }, let us specify how many times the preceding character should be matched. y{3} matches yyy, and y{2,3} matches yy or yyy.

  • \w{5} matches exactly 5 word characters
  • \w{2,4} matches at minimum 2 and at maximum 4 word characters

The regex sto{3}p matches:

  • stooop

The regex yes{3,5} matches:

  • yesss
  • yessss
  • yesssss
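
A short Python sketch of fixed quantifiers. One thing it highlights is that a quantifier binds only to the single token right before it, so to repeat the s in yesss the braces must follow the s (yes{3,5}), while ye{3,5} repeats the e instead. The extra test words are my own.

```python
import re

# \w{6}\s\w{2} is the compact form of \w\w\w\w\w\w\s\w\w.
compact = re.fullmatch(r"\w{6}\s\w{2}", "heyyyy yo")
print(compact is not None)  # True

three_o = re.fullmatch(r"sto{3}p", "stooop")
two_o = re.fullmatch(r"sto{3}p", "stoop")
print(three_o is not None)  # True
print(two_o)                # None, only two o's

# "yess" and "yessssss" are extra inputs just outside the 3 to 5 range.
words = ["yess", "yesss", "yessss", "yesssss", "yessssss"]
in_range = [w for w in words if re.fullmatch(r"yes{3,5}", w)]
print(in_range)  # ['yesss', 'yessss', 'yesssss']

# ye{3,5} repeats the e, not the s.
repeated_e = re.fullmatch(r"ye{3,5}", "yeee")
not_s = re.fullmatch(r"ye{3,5}", "yesss")
print(repeated_e is not None)  # True
print(not_s)                   # None
```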

10. Quantifiers — Optional

For example, we want to match humor, but we also don’t want to forget the countries that use British English, where it is spelled humour. Optional quantifiers are ready to save us!

Optional quantifiers, written with the question mark ?, mark a character as optional. humou?r is our solution. Note that the question mark ? only applies to the character immediately before it.

By combining quantifiers and grouping, we can write more advanced regexes. The regex I will save (the )?world will match exactly I will save the world and I will save world.

Since ? is a special character in regex, we must escape it to match a normal question mark. The regex Isn’t the world beautiful\? matches Isn’t the world beautiful?
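
The optional quantifier can be sketched in Python like this. The misspelling humouur is a made-up input to show a non-match, and the code uses a plain ASCII apostrophe in the last example.

```python
import re

humor = re.compile(r"humou?r")

# "humouur" is a made-up misspelling: ? allows at most one u.
spellings = [w for w in ["humor", "humour", "humouur"] if humor.fullmatch(w)]
print(spellings)  # ['humor', 'humour']

# A group followed by ? makes the whole group optional.
save = re.compile(r"I will save (the )?world")
with_the = save.fullmatch("I will save the world")
without_the = save.fullmatch("I will save world")
print(with_the is not None, without_the is not None)  # True True

# An escaped question mark matches a literal "?".
question = re.fullmatch(r"Isn't the world beautiful\?", "Isn't the world beautiful?")
print(question is not None)  # True
```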

11. Quantifiers — 0 or More, 1 or More

Stephen Cole Kleene developed regular expressions in 1951 as a mathematical notation for describing patterns. In his honor, the next piece of syntax we will learn is known as the Kleene star. It is written with an asterisk *, and it is a quantifier that matches the preceding character 0 or more times. This means the character before the asterisk can be absent, appear once, or appear many times. Another quantifier is the Kleene plus +, which means the character before the plus symbol must appear one or more times.

  • * matches the preceding character 0 or more times.
  • + matches the preceding character 1 or more times.

Don’t forget that if you want to use these characters with their literal meaning, you need to escape them: \* or \+
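
Here is a minimal Python sketch of the star and the plus. The patterns yes* and yes+ and the little arithmetic text are my own examples, chosen only to show the difference between zero or more and one or more.

```python
import re

# Hypothetical test words: zero, one and three s characters.
words = ["ye", "yes", "yesss"]

# * allows the s to be missing entirely.
star = [w for w in words if re.fullmatch(r"yes*", w)]
print(star)  # ['ye', 'yes', 'yesss']

# + requires at least one s.
plus = [w for w in words if re.fullmatch(r"yes+", w)]
print(plus)  # ['yes', 'yesss']

# Escaped, both characters are matched literally.
literal = re.fullmatch(r"2\+2\*3", "2+2*3")
print(literal is not None)  # True
```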

12. Anchors

^ and $

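Anchors let us say how a text must start and end: ^ matches the start of the text and $ matches the end. For instance, the regex ^My.*John$ requires the text to start with My and end with John, with .* allowing any characters in between: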

  • My name is John — match
  • Hey, my name is John — no match

The ^ character guarantees that the text starts with My, and the $ character ensures that it ends with John.
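
The anchors can be tried in Python with the sketch below. I am assuming the pattern ^My.*John$, where the .* in the middle is there to allow any text between My and John. The third test sentence is my own addition.

```python
import re

# Assumption: .* between the anchors lets any characters sit in the middle.
anchored = re.compile(r"^My.*John$")

starts_and_ends = anchored.search("My name is John")
wrong_start = anchored.search("Hey, my name is John")
wrong_end = anchored.search("My name is John Smith")  # made-up extra input
print(starts_and_ends is not None)  # True
print(wrong_start)                  # None, the text does not start with My
print(wrong_end)                    # None, the text does not end with John

# Without anchors, the same pattern is found anywhere inside the text.
loose = re.search(r"My.*John", "Hey, My name is John Smith")
print(loose.group())  # My name is John
```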

Oh my god!

Are you now ready to use your regex superpower in the wild?

In short, so far we have learned:

  • Literals
  • Alternation |
  • Character sets []
  • Wildcards .
  • Ranges
  • Shorthand character classes \w, \d and \s
  • Groupings ()
  • Fixed quantifiers {}
  • Optional quantifiers ?
  • The Kleene star *
  • The Kleene plus +
  • The anchor symbols ^ and $

Good luck dude!

Written by Hasan Gökçe

Boğaziçi University - Software Engineering