Regular Expressions

Regular Expression (RegEx) is a sequence of characters that forms a search pattern. It allows you to search, match, and manipulate strings based on specific patterns.

Why Use RegEx?

Pattern Matching - Find specific patterns in text
Data Validation - Validate emails, phones, etc.
Text Processing - Extract, replace, or split text
Search & Replace - Advanced find and replace operations

Key Concepts

Pattern - The regex expression you write
Match - Text that fits your pattern
Groups - Capture specific parts of matches
Raw Strings - Always use raw strings: r"pattern"

Python Setup

import re                    # Import the re module
pattern = r"your_pattern"    # Always use raw strings

Python re Module Functions

re.search(pattern, string)     # Returns first match object or None
re.findall(pattern, string)    # Returns list of all matches
re.split(pattern, string)      # Split string at each match
re.sub(pattern, repl, string)  # Replace matches with replacement
re.match(pattern, string)      # Match only at beginning
re.compile(pattern)            # Compile for reuse (faster)

Quick Examples:

re.search(r'\\d+', "Age: 25")           # Find first number
re.findall(r'\\d+', "Age: 25, Score: 98") # Find all numbers: ['25', '98']
re.split(r'\\s+', "hello   world")      # Split by whitespace: ['hello', 'world']
re.sub(r'\\d+', 'X', "Age: 25")         # Replace numbers: "Age: X"

Meta Characters

Character	Meaning	Example	Matches
`.`	Any character (except newline)	`a.c`	"abc", "axc"
`^`	Start of string	`^hello`	"hello world"
`$`	End of string	`world$`	"hello world"
`*`	0 or more repetitions	`ab*c`	"ac", "abc", "abbc"
`+`	1 or more repetitions	`ab+c`	"abc", "abbc"
`?`	0 or 1 repetition	`ab?c`	"ac", "abc"
`\|`	OR operator	`cat\|dog`	"cat", "dog"
`[]`	Character set	`[abc]`	"a", "b", "c"
`()`	Capturing group	`(ab)+`	"ab", "abab"

Character Sets and Ranges

[abc]           # Matches 'a', 'b', or 'c'
[a-z]           # Any lowercase letter
[A-Z]           # Any uppercase letter
[0-9]           # Any digit
[a-zA-Z0-9]     # Any letter or digit

# Negated sets (^ means NOT)
[^abc]          # Any character EXCEPT 'a', 'b', 'c'
[^0-9]          # Any non-digit
[^a-zA-Z]       # Any non-letter

# Combinations
[0-5][0-9]      # Two digits: 00-59
[a-z][A-Z]      # Lowercase + uppercase

Special Sequences

Sequence	Meaning	Example
`\\d`	Digit (0-9)	`\\d{3}` matches "123"
`\\D`	NOT digit	`\\D+` matches "abc"
`\\w`	Word char (a-z, A-Z, 0-9, _)	`\\w+` matches "hello_123"
`\\W`	NOT word char	`\\W+` matches "!@#"
`\\s`	Whitespace	`\\s+` matches spaces/tabs
`\\S`	NOT whitespace	`\\S+` matches "hello"
`\\b`	Word boundary	`\\bword\\b` matches whole word
`\\B`	NOT word boundary	`\\Bword\\B` matches middle

Quantifiers (Repetition)

Quantifier	Meaning	Example
`{m}`	Exactly m times	`a{3}` → "aaa"
`{m,n}`	Between m and n times	`a{2,4}` → "aa", "aaa", "aaaa"
`{m,}`	m or more times	`a{2,}` → "aa", "aaa"...
`{,n}`	0 to n times	`a{,3}` → "", "a", "aa", "aaa"
`*`	0 or more (same as `{0,}`)	`a*`
`+`	1 or more (same as `{1,}`)	`a+`
`?`	0 or 1 (same as `{0,1}`)	`a?`

Groups and Capturing

# Basic groups
(abc)                    # Capture group
(?:abc)                  # Non-capturing group
(?P<name>abc)           # Named group

# Using groups
pattern = r'(\\d{4})-(\\d{2})-(\\d{2})'  # Date: YYYY-MM-DD
match = re.search(pattern, "2023-12-25")
match.group(0)          # Full match: "2023-12-25"
match.group(1)          # First group: "2023"
match.group(2)          # Second group: "12"

# Named groups
pattern = r'(?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})'
match.group('year')     # "2023"

Flags (Modifiers)

re.IGNORECASE or re.I    # Case-insensitive
re.MULTILINE or re.M     # ^ and $ match line boundaries
re.DOTALL or re.S        # . matches newlines
re.VERBOSE or re.X       # Allow comments and whitespace

# Usage
re.search(pattern, text, re.IGNORECASE)
re.search(pattern, text, re.I | re.M)  # Multiple flags

Common Patterns

Email

r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'

Phone Numbers

r'\\(\\d{3}\\) \\d{3}-\\d{4}'     # (123) 456-7890
r'\\d{3}-\\d{3}-\\d{4}'         # 123-456-7890
r'\\d{10}'                    # 1234567890

Dates

r'\\d{4}-\\d{2}-\\d{2}'         # YYYY-MM-DD
r'\\d{2}/\\d{2}/\\d{4}'         # MM/DD/YYYY
r'[A-Za-z]+ \\d{1,2}, \\d{4}'  # Month DD, YYYY

URLs

r'https?://[^\\s]+'           # Basic URL
r'https?://(?:[-\\w.])+(?:\\:[0-9]+)?(?:/(?:[\\w/_.])*)?'  # Detailed

IP Address

r'\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}'

Credit Card

r'\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}'

Password (8+ chars, mixed case, digit, special)

r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[!@#$%^&*]).{8,}$'

Text Processing Examples

Clean Text

# Remove extra whitespace
re.sub(r'\\s+', ' ', text).strip()

# Extract only letters and numbers
re.sub(r'[^a-zA-Z0-9\\s]', '', text)

# Remove HTML tags
re.sub(r'<[^>]+>', '', html_text)

Extract Data

# Find all numbers
re.findall(r'\\d+', text)

# Find all words
re.findall(r'\\w+', text)

# Find all emails
re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}', text)

Split Text

# Split by multiple delimiters
re.split(r'[,;:|]', text)

# Split by whitespace
re.split(r'\\s+', text)

Advanced Features

Lookahead/Lookbehind

(?=...)      # Positive lookahead
(?!...)      # Negative lookahead
(?<=...)     # Positive lookbehind
(?<!...)     # Negative lookbehind

# Example: Match 'Java' only if followed by 'Script'
r'Java(?=Script)'

Greedy vs Non-Greedy

.*           # Greedy (matches as much as possible)
.*?          # Non-greedy (matches as little as possible)

# Example with HTML
r'<.*>'      # Greedy: matches entire '<div>hello</div><div>world</div>'
r'<.*?>'     # Non-greedy: matches '<div>', '</div>', '<div>', '</div>'

Best Practices

Use Raw Strings

# Good
pattern = r"\\d+\\s+\\w+"

# Bad (double escaping)
pattern = "\\\\d+\\\\s+\\\\w+"

Compile for Reuse

# Good for multiple uses
compiled = re.compile(r'\\d+')
for text in texts:
    compiled.search(text)

# Bad for multiple uses
for text in texts:
    re.search(r'\\d+', text)

Escape Special Characters

# Escape dots, dollars, etc.
r'\\$\\d+\\.\\d+'    # For "$25.99"
r'\\(\\d{3}\\)'     # For "(123)"

Anchor When Possible

r'^pattern$'     # Full string match
r'^pattern'      # Start of string
r'pattern$'      # End of string

Common Mistakes

Not using raw strings → Use r"pattern"
Forgetting to escape → \\. for literal dot
Greedy matching → Use .*? instead of .*
Case sensitivity → Use re.IGNORECASE flag
Not anchoring → Use ^ and $ for exact matches

Quick Testing

def test_pattern(pattern, test_string):
    match = re.search(pattern, test_string)
    return bool(match)

# Example
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
print(test_pattern(pattern, "user@example.com"))  # True
print(test_pattern(pattern, "invalid.email"))     # False

Why Use RegEx?​

Key Concepts​

Python Setup​

Python re Module Functions​

Meta Characters​

Character Sets and Ranges​

Special Sequences​

Quantifiers (Repetition)​

Groups and Capturing​

Flags (Modifiers)​

Common Patterns​

Email​

Phone Numbers​

Dates​

URLs​

IP Address​

Credit Card​

Password (8+ chars, mixed case, digit, special)​

Text Processing Examples​

Clean Text​

Extract Data​

Split Text​

Advanced Features​

Lookahead/Lookbehind​

Greedy vs Non-Greedy​

Best Practices​

Use Raw Strings​

Compile for Reuse​

Escape Special Characters​

Anchor When Possible​

Common Mistakes​

Quick Testing​

Why Use RegEx?

Key Concepts

Python Setup

Python re Module Functions

Meta Characters

Character Sets and Ranges

Special Sequences

Quantifiers (Repetition)

Groups and Capturing

Flags (Modifiers)

Common Patterns

Email

Phone Numbers

Dates

URLs

IP Address

Credit Card

Password (8+ chars, mixed case, digit, special)

Text Processing Examples

Clean Text

Extract Data

Split Text

Advanced Features

Lookahead/Lookbehind

Greedy vs Non-Greedy

Best Practices

Use Raw Strings

Compile for Reuse

Escape Special Characters

Anchor When Possible

Common Mistakes

Quick Testing