Getting Started with Regex

If you’ve ever worked with messy data, weird text formats, or inconsistent naming, you’ve probably heard of regex (regular expressions). At its core, it’s a way to search, match, and manipulate text patterns efficiently. Once you understand the basics, it becomes one of the most powerful tools in your data toolkit.

So what is regex?

Regex is essentially a pattern language. Instead of searching for exact text, you define a rule that describes what you’re looking for.

For example:

Searching for "cat" finds exactly “cat”
Regex: "c.t" finds “cat”, “cut”, “cot”

That "." is a wildcard: it matches any single character. Already, you’re more flexible than a basic search.
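To see the difference in practice, here is a minimal sketch using Python's built-in re module. The word list is made up for illustration, and fullmatch is used so that each word has to fit the pattern from start to finish.

```python
import re

# Hypothetical sample words, invented for this illustration
words = ["cat", "cut", "cot", "coat", "ct"]

# Plain search: only the exact text "cat" counts
exact = [w for w in words if w == "cat"]

# Regex search: "." stands in for exactly one character
pattern = re.compile(r"c.t")
flexible = [w for w in words if pattern.fullmatch(w)]

print(exact)     # ['cat']
print(flexible)  # ['cat', 'cut', 'cot']
```

Note that "coat" and "ct" are left out: the dot needs exactly one character between the c and the t.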

Why does regex matter?

Think about real-world data:

Phone numbers in different formats (1-800-293-2394 vs 91-754-5393567)
Emails mixed into text (asdfasdf0-vick@email.com-jdjsknd)
IDs embedded in messy strings

Instead of manually cleaning or writing long logic, regex lets you:

Extract patterns
Validate formats
Clean inconsistent fields

This is huge in tools like Tableau Prep, Alteryx, Python, or SQL, where text cleaning is constant.
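As a rough sketch of those three jobs in Python, the snippet below uses the standard re module on an invented string. The ID format (two capital letters, a dash, four digits) and the function name is_valid_id are assumptions made for this example, and the pattern symbols themselves are explained in the building blocks section of this post.

```python
import re

# Hypothetical messy input, invented for this sketch
text = "Order ID: AB-1042   shipped,  ref AB-2077"

# Assumed ID shape: two capital letters, a dash, four digits
ID_PATTERN = r"[A-Z]{2}-\d{4}"

# Extract: pull out every piece of text that has the ID shape
ids = re.findall(ID_PATTERN, text)

# Validate: the whole string must have the ID shape and nothing else
def is_valid_id(value):
    return re.fullmatch(ID_PATTERN, value) is not None

# Clean: collapse runs of whitespace into a single space
cleaned = re.sub(r"\s+", " ", text)

print(ids)                     # ['AB-1042', 'AB-2077']
print(is_valid_id("AB-1042"))  # True
print(is_valid_id("AB-10"))    # False
print(cleaned)                 # Order ID: AB-1042 shipped, ref AB-2077
```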

Core building blocks

You don’t need to memorize everything, since regex documentation is easy to find. A good place to start is with these:

1. Literal characters

hello

Matches exactly “hello”

2. Wildcards

The dot (.) matches any single character

Example: h.t

Matches: hat, hit, hot, hut

3. Character sets

b[aeiou]t

This means:

b : starts with b
[aeiou] : any vowel
t : ends with t

Matches: bat, bet, bit, bot, but

Won’t match:

btt (no vowel)
boat (two vowels in between, but the set matches only one character; you would need a quantifier for this)
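A quick way to check this yourself is a short Python sketch with the standard re module. The candidate list is made up for illustration. The second pattern adds a + after the set, which is one possible quantifier that would let "boat" through.

```python
import re

# Hypothetical candidate words, invented for this illustration
candidates = ["bat", "bet", "bit", "bot", "but", "btt", "boat"]

# Exactly one vowel between b and t
one_vowel = [w for w in candidates if re.fullmatch(r"b[aeiou]t", w)]

# One or more vowels between b and t (set plus the + quantifier)
many_vowels = [w for w in candidates if re.fullmatch(r"b[aeiou]+t", w)]

print(one_vowel)    # ['bat', 'bet', 'bit', 'bot', 'but']
print(many_vowels)  # ['bat', 'bet', 'bit', 'bot', 'but', 'boat']
```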

4. Quantifiers (how many times?)

*  → 0 or more
+  → 1 or more
?  → 0 or 1

Example:

lo+l

Matches: lol, lool, loool, …
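To compare the three quantifiers side by side, here is a small Python sketch using the standard re module. The sample strings are made up, and each pattern differs only in the symbol placed after the "o".

```python
import re

# Hypothetical sample strings, invented for this illustration
samples = ["ll", "lol", "lool", "loool"]

results = {}
for symbol in ["*", "+", "?"]:
    pattern = "lo" + symbol + "l"  # builds lo*l, lo+l, lo?l
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]

print(results["lo*l"])  # ['ll', 'lol', 'lool', 'loool']  (0 or more o's)
print(results["lo+l"])  # ['lol', 'lool', 'loool']        (1 or more o's)
print(results["lo?l"])  # ['ll', 'lol']                   (0 or 1 o)
```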

5. Digits and word shortcuts

\d → any digit (0–9)
\w → letters, digits, and underscore
\s → whitespace (spaces, tabs, line breaks)

Example:

\d\d\d

Matches any three digits in a row
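Here is a minimal Python sketch of the three shortcuts, using the standard re module on an invented string. In Python the patterns are written as raw strings (the r prefix) so that each shortcut keeps its single backslash.

```python
import re

# Hypothetical sample text, invented for this illustration
text = "Room 204, floor 3"

# \d\d\d : three digits in a row
three_digits = re.findall(r"\d\d\d", text)

# \w+ : runs of letters, digits, or underscores
words = re.findall(r"\w+", text)

# \s+ : split wherever there is whitespace
parts = re.split(r"\s+", text)

print(three_digits)  # ['204']
print(words)         # ['Room', '204', 'floor', '3']
print(parts)         # ['Room', '204,', 'floor', '3']
```

The lone "3" is not picked up by the three-digit pattern, and the comma stays attached to "204," when splitting because it is not whitespace.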

Real example

Suppose you want to extract a phone number from a line of text.

Say you have a messy string like:

Call me at (123) 456-7890 tomorrow

And your regex would be something like this:

\(\d{3}\)\s\d{3}-\d{4}

This gives the match:

(123) 456-7890

Breakdown:

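\( and \) : literal parentheses (escaped, because parentheses have a special meaning in regex)
\d{3} : exactly 3 digits
\s : a whitespace character (here, the space)
- : a literal hyphen
{n} : “repeat the previous item exactly n times”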

This might look intimidating at first, but it’s just combining the rules together.
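For readers working in Python, the same extraction could look like the sketch below, using the standard re module. The pattern is written as a raw string with single backslashes, and it only covers this one phone format, so treat it as an illustration and not a general phone-number parser.

```python
import re

text = "Call me at (123) 456-7890 tomorrow"

# \( \)   literal parentheses
# \d{3}   exactly three digits
# \s      one whitespace character
# \d{4}   exactly four digits
pattern = r"\(\d{3}\)\s\d{3}-\d{4}"

# search() scans the string and returns the first match, or None
match = re.search(pattern, text)
phone = match.group() if match else None

print(phone)  # (123) 456-7890
```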

Final Thoughts

From a data analyst’s perspective, regex is great for:

Cleaning category fields
Extracting IDs from strings
Standardizing messy inputs
Filtering specific patterns (like error codes or tags)

I like to think of it like this:

Instead of fixing data row-by-row, regex lets you fix patterns at scale.

Here is a website I like to use to practice:

https://regex101.com/

Regex feels confusing at first because it looks like code mashed into symbols. But once you treat it like a pattern game, it becomes way more intuitive.

Author:

Vivek Patel
