If you’ve ever worked with messy data, weird text formats, or inconsistent naming, you’ve probably heard of regex (regular expressions). At its core, it’s a way to search, match, and manipulate text patterns efficiently. Once you understand the basics, it becomes one of the most powerful tools in your data toolkit.
So what is regex?
Regex is essentially a pattern language. Instead of searching for exact text, you define a rule that describes what you’re looking for.
For example:
- Searching for "cat" finds exactly “cat”
- Regex: "c.t" finds “cat”, “cut”, “cot”
That "." is a wildcard: it matches any single character. Already, you’re more flexible than a basic search.
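Here is a quick sketch of that difference in Python, using the built-in re module. The sample sentence is made up for illustration.

```python
import re

text = "The cat cut a cot in the barn."

# Plain substring search: only the exact text "cat" counts.
exact_hits = text.count("cat")

# Regex search: "." stands in for any single character between "c" and "t".
pattern_hits = re.findall(r"c.t", text)

print(exact_hits)    # 1
print(pattern_hits)  # ['cat', 'cut', 'cot']
```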
Why does regex matter?
Think about real-world data:
- Phone numbers in different formats (1-800-293-2394 vs 91-754-5393567)
- Emails mixed into text (asdfasdf0-vick@email.com-jdjsknd)
- IDs embedded in messy strings
Instead of manually cleaning or writing long logic, regex lets you:
- Extract patterns
- Validate formats
- Clean inconsistent fields
This is huge in tools like Tableau Prep, Alteryx, Python, or SQL, where text cleaning is constant.
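To make those three jobs concrete, here is a minimal Python sketch using the built-in re module. The sample strings and the "ID-" format are invented for illustration, and the symbols used in the patterns are explained in the building blocks that follow.

```python
import re

# Extract: pull an ID-like token out of a messy string.
# The "ID-" prefix followed by digits is an assumed format for this sketch.
messy = "order ref: xx_ID-48213_final"
found = re.search(r"ID-[0-9]+", messy)
extracted = found.group() if found else None

# Validate: check that the whole value looks like 3 digits, a dash, 4 digits.
def looks_valid(value):
    return re.fullmatch(r"[0-9]{3}-[0-9]{4}", value) is not None

# Clean: collapse runs of spaces into one space and trim the ends.
cleaned = re.sub(r" +", " ", "  New    York  ").strip()

print(extracted)                                       # ID-48213
print(looks_valid("555-0199"), looks_valid("5550199")) # True False
print(cleaned)                                         # New York
```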
Core building blocks
You don’t need to memorize everything, since regex documentation is everywhere. But a good place to start is with these:
1. Literal characters
hello
Matches exactly “hello”
2. Wildcards
.
Matches any single character
Example: h.t
Matches:
- hit
- hot
- hat
3. Character sets
b[aeiou]t
This means:
- b: starts with b
- [aeiou]: any one vowel
- t: ends with t
Matches:
- bat
- bet
- bit
- bot
- but
Won’t match:
- btt (no vowel)
- boat (two vowels in between; you would need a quantifier for this!)
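You can check this list yourself with a few lines of Python. This sketch uses re.fullmatch so the whole word has to fit the pattern, and the last line previews how a quantifier (covered next) lets "boat" through.

```python
import re

words = ["bat", "bet", "bit", "bot", "but", "btt", "boat"]

# fullmatch requires the entire word to fit the pattern, not just part of it.
matches = [w for w in words if re.fullmatch(r"b[aeiou]t", w)]
non_matches = [w for w in words if not re.fullmatch(r"b[aeiou]t", w)]

# With "+" after the set, one or more vowels are allowed, so "boat" fits.
boat_with_quantifier = re.fullmatch(r"b[aeiou]+t", "boat") is not None

print(matches)               # ['bat', 'bet', 'bit', 'bot', 'but']
print(non_matches)           # ['btt', 'boat']
print(boat_with_quantifier)  # True
```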
4. Quantifiers (how many times?)
* → 0 or more
+ → 1 or more
? → 0 or 1
Example:
lo+l
Matches: lol, lool, loool, …
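A small Python sketch makes the difference between the three quantifiers easier to see. The candidate strings are made up, and each pattern differs only in the symbol after the "o".

```python
import re

candidates = ["ll", "lol", "lool", "loool"]

# "+" needs at least one "o".
plus = [c for c in candidates if re.fullmatch(r"lo+l", c)]

# "*" also accepts zero "o"s.
star = [c for c in candidates if re.fullmatch(r"lo*l", c)]

# "?" accepts zero or one "o", but no more.
optional = [c for c in candidates if re.fullmatch(r"lo?l", c)]

print(plus)      # ['lol', 'lool', 'loool']
print(star)      # ['ll', 'lol', 'lool', 'loool']
print(optional)  # ['ll', 'lol']
```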
5. Digits and word shortcuts
\d → any digit (0–9)
\w → letters, digits, and underscore
\s → whitespace
Example:
\d\d\d
Matches any three digits in a row
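Here is a rough Python sketch of the three shortcuts in action. The sample text is invented, and the patterns are written as raw strings (the r prefix) so Python passes the backslash through to the regex engine unchanged.

```python
import re

text = "Room 101, floor 7, ext 4420"

# Three digits in a row, wherever they appear (even inside a longer number).
three_digits = re.findall(r"\d\d\d", text)

# Runs of word characters: letters, digits, and underscore.
words = re.findall(r"\w+", text)

# Split on any run of whitespace (spaces or tabs).
parts = re.split(r"\s+", "id_7\tactive   yes")

print(three_digits)  # ['101', '442']
print(words)         # ['Room', '101', 'floor', '7', 'ext', '4420']
print(parts)         # ['id_7', 'active', 'yes']
```

Notice that the pattern picks up "442" from inside "4420": it looks for three digits in a row, not for a number that is exactly three digits long.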
Real example
Suppose I want to extract a phone number from a piece of text.
Say you have a messy string like:
Call me at (123) 456-7890 tomorrow
And your regex would be something like this:
\(\d{3}\)\s\d{3}-\d{4}
The match would be:
(123) 456-7890
Breakdown:
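- \d{3}: exactly 3 digits
- \s: a whitespace character (here, the space)
- {}: “repeat this many times”
- \( and \): literal parentheses (the backslash escapes them)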
This might look intimidating at first, but it’s just combining the rules together.
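If you want to try this example in Python, a minimal sketch could look like the following. The variable names are my own, and the last step (keeping only the digits) is an optional extra that is handy when you need to standardize the number afterwards.

```python
import re

message = "Call me at (123) 456-7890 tomorrow"

# Raw string, so the backslashes reach the regex engine unchanged.
phone_pattern = r"\(\d{3}\)\s\d{3}-\d{4}"

# search() scans the whole string and returns the first match, or None.
match = re.search(phone_pattern, message)
phone = match.group() if match else None

# Optional: keep only the digits of whatever was matched.
digits = "".join(re.findall(r"\d", phone)) if phone else None

print(phone)   # (123) 456-7890
print(digits)  # 1234567890
```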
Final Thoughts
From a data analyst’s perspective, regex is great for:
- Cleaning category fields
- Extracting IDs from strings
- Standardizing messy inputs
- Filtering specific patterns (like error codes or tags)
I like to think of it like this:
Instead of fixing data row-by-row, regex lets you fix patterns at scale.
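As a small illustration of that idea, here is a Python sketch that standardizes a handful of inconsistent category values with one rule. The category values and the standardize helper are hypothetical, made up for this example.

```python
import re

# Hypothetical category values with inconsistent spacing and casing.
rows = ["Home - Garden", "home-garden", "HOME  -  GARDEN", "Home -Garden"]

def standardize(value):
    # Remove any whitespace around the dash, then lowercase everything.
    return re.sub(r"\s*-\s*", "-", value).lower()

cleaned = [standardize(r) for r in rows]

print(cleaned)  # ['home-garden', 'home-garden', 'home-garden', 'home-garden']
```

One pattern handles every variant in the list, which is the whole point: you describe the mess once instead of fixing each row by hand.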
Here is a website I like to use to practice:
Regex feels confusing at first because it looks like code mashed into symbols. But once you treat it like a pattern game, it becomes way more intuitive.
