Getting Started with Regex

If you’ve ever worked with messy data, weird text formats, or inconsistent naming, you’ve probably heard of regex (regular expressions). At its core, it’s a way to search, match, and manipulate text patterns efficiently. Once you understand the basics, it becomes one of the most powerful tools in your data toolkit.

So what is regex?

Regex is essentially a pattern language. Instead of searching for exact text, you define a rule that describes what you’re looking for.

For example:

  • Searching for "cat" finds exactly “cat”
  • Regex: "c.t" finds “cat”, “cut”, “cot”

That "." is a wildcard: it matches any single character. Already, you’re more flexible than a basic search.
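To see the wildcard in action, here is a minimal sketch using Python's built-in re module. The word list is made up for illustration.

```python
import re

# Made-up sample words for illustration.
words = ["cat", "cut", "cot", "cart", "ct"]

# "." stands for exactly one character, so "c.t" needs three characters in total.
pattern = re.compile(r"c.t")

# fullmatch requires the whole string to fit the pattern.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['cat', 'cut', 'cot']
```

Note that "cart" and "ct" are left out: the wildcard matches one character, no more and no less.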

Why does regex matter?

Think about real-world data:

  • Phone numbers in different formats (1-800-293-2394 vs 91-754-5393567)
  • Emails mixed into text (asdfasdf0-vick@email.com-jdjsknd)
  • IDs embedded in messy strings

Instead of manually cleaning or writing long logic, regex lets you:

  • Extract patterns
  • Validate formats
  • Clean inconsistent fields

This is huge in tools like Tableau Prep, Alteryx, Python, or SQL, where text cleaning is constant.

Core building blocks

You don’t need to memorize everything, since regex documentation is everywhere. A good place to start is with these:

1. Literal characters

hello

Matches exactly “hello”


2. Wildcards

.

Matches any single character (in most tools, anything except a line break by default)

Example: h.t

Matches:

  • hit
  • hot
  • hat

3. Character sets

b[aeiou]t

This means:

  • b : the letter b
  • [aeiou] : exactly one vowel
  • t : the letter t

Matches:

  • bat
  • bet
  • bit
  • bot
  • but

Won’t match:

  • btt (no vowel)
  • boat (two vowels in between, and the set matches only one character; you would need a quantifier for this!)
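Here is a quick sketch of the same character set in Python, using the words from the lists above. The second pattern adds a "+" quantifier to the set, which is one possible way to let "boat" through.

```python
import re

candidates = ["bat", "bet", "bit", "bot", "but", "btt", "boat"]

# One vowel, and only one, between "b" and "t".
one_vowel = re.compile(r"b[aeiou]t")
short_matches = [w for w in candidates if one_vowel.fullmatch(w)]
print(short_matches)  # ['bat', 'bet', 'bit', 'bot', 'but']

# "+" after the set means "one or more vowels", so "boat" now fits too.
many_vowels = re.compile(r"b[aeiou]+t")
longer_matches = [w for w in candidates if many_vowels.fullmatch(w)]
print(longer_matches)  # ['bat', 'bet', 'bit', 'bot', 'but', 'boat']
```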

4. Quantifiers (how many times?)

*  → 0 or more
+  → 1 or more
?  → 0 or 1

Example:

lo+l

Matches: lol, lool, loool, …
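To compare the three quantifiers side by side, here is a small Python sketch. The helper name matching is just something made up for this example.

```python
import re

samples = ["ll", "lol", "lool", "loool"]

def matching(pattern, items):
    """Return the items that fit the pattern from start to end."""
    return [s for s in items if re.fullmatch(pattern, s)]

plus = matching(r"lo+l", samples)      # one or more "o"
star = matching(r"lo*l", samples)      # zero or more "o"
optional = matching(r"lo?l", samples)  # zero or one "o"

print(plus)      # ['lol', 'lool', 'loool']
print(star)      # ['ll', 'lol', 'lool', 'loool']
print(optional)  # ['ll', 'lol']
```

The quantifier always applies to whatever comes right before it, which here is the single letter "o".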


5. Digits and word shortcuts

\d → any digit (0–9)
\w → letters, digits, and underscore
\s → whitespace (space, tab, line break)

Example:

\d\d\d

Matches any three consecutive digits
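A short Python sketch of the shortcuts, with made-up sample text. Raw strings (the r prefix) keep the backslashes intact, so the pattern is written with a single backslash.

```python
import re

# Made-up sample text for illustration.
text = "Order 123, unit 45, ref 6789"

# Three digits in a row. A longer number still contributes its first three digits.
three_digits = re.findall(r"\d\d\d", text)
print(three_digits)  # ['123', '678']

# \w+ grabs runs of letters, digits, and underscores.
tokens = re.findall(r"\w+", "user_1 ok")
print(tokens)  # ['user_1', 'ok']

# \s matches each whitespace character.
spaces = re.findall(r"\s", "a b\tc")
print(spaces)  # [' ', '\t']
```

The '678' result is worth noticing: the pattern does not know where a number ends, it only looks for three digits next to each other.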


Real example

Suppose you want to extract a phone number from a message.

Say you have a messy string like:

Call me at (123) 456-7890 tomorrow

And your regex would be something like this:

\(\d{3}\)\s\d{3}-\d{4}

The match would be:

(123) 456-7890

Breakdown:

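  • \( and \) : literal parentheses (the backslash escapes them)
  • \d{3} : exactly 3 digits
  • \s : a whitespace character (here, the space)
  • - : a literal hyphen
  • \d{4} : exactly 4 digits
  • {n} : “repeat the previous item n times”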

This might look intimidating at first, but it’s just combining the rules together.
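Here is how that extraction might look in Python. This is a minimal sketch, and the variable names are my own.

```python
import re

message = "Call me at (123) 456-7890 tomorrow"

# Escaped parentheses, 3 digits, a whitespace character, 3 digits, a hyphen, 4 digits.
phone_pattern = re.compile(r"\(\d{3}\)\s\d{3}-\d{4}")

# search scans the string and returns the first match, or None if there is none.
match = phone_pattern.search(message)
phone = match.group() if match else None
print(phone)  # (123) 456-7890
```

This pattern only covers the one format shown here. Numbers written differently (like 1-800-293-2394) would need their own pattern.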

Final Thoughts

From a data analyst’s perspective, regex is great for:

  • Cleaning category fields
  • Extracting IDs from strings
  • Standardizing messy inputs
  • Filtering specific patterns (like error codes or tags)

I like to think of it like this:

Instead of fixing data row-by-row, regex lets you fix patterns at scale.
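As a rough illustration of fixing a pattern at scale, here is a Python sketch that pulls a five-digit ID out of several messy strings in one pass. The sample values, the five-digit assumption, and the "ID-" output format are all hypothetical.

```python
import re

# Hypothetical messy values; real data will differ.
raw_values = [
    "Order ID-00123 (late)",
    "id-00456",
    "ref: ID-00789 / urgent",
    "no id here",
]

def standardize(value):
    """Find five digits in a row and rebuild a consistent ID, or return None."""
    found = re.search(r"\d{5}", value)
    return f"ID-{found.group()}" if found else None

clean_ids = [standardize(v) for v in raw_values]
print(clean_ids)  # ['ID-00123', 'ID-00456', 'ID-00789', None]
```

One rule handles every row, and rows that don't fit come back as None so they are easy to spot.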

Here is a website I like to use to practice:

https://regex101.com/

Regex feels confusing at first because it looks like code mashed into symbols. But once you treat it like a pattern game, it becomes way more intuitive.

Author:
Vivek Patel
Powered by The Information Lab
© 2026 The Information Lab