TUAW Tip: Regular Expressions for Beginners
Sometimes I think Regular Expressions are like the tax code: if someone professes to know everything about them, they're probably not telling the truth. In reality, Regular Expressions (or RegEx) is a syntax to help you construct very precise search terms to find and replace bits of text in a variety of applications.
In applications like Coda, BBEdit, and TextMate, you can search for a "string" -- meaning just any old collection of letters next to each other -- using a Regular Expression. For example, I could search for the string "laugh" and it would show up in laughter, slaughter, and (as long as the search ignores case) Laughlin.
While I can't show you everything about Regular Expressions, I can at least start you off. Keep reading for more about how you can integrate Regular Expressions into your workflow.
Let's pretend I have a list of items. They happen to be domain names, in this case:
- tuaw.com
- apple.com
- last.fm
- navy.mil
- google.com
- code.google.com
Personally, I think the most handy search term for me is .+. The . matches any single character and the + means "one or more of the previous thing," so together they work like a wildcard. Because a bare . matches anything, a period that should match only a literal period has to be escaped with a backslash: \. In our list, if I searched for .+\.com it would show hits on lines 1, 2, 5 and 6.
Of course, I could just search for .com and it would hit on the same lines. The difference is that with the RegEx in place, text editors will frequently highlight the entire line, making it easy to find and replace things. For example, if I wanted to blank out the lines in our list that contained .com, I would search for .+\.com and replace it with an empty string.
(Commenter Eric notes I can make the expression .+\.com$ to ensure that it's not catching something like www.commons.org. You can read more about the $ character in a little bit. Thanks, Eric!)
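To see why the backslash and the $ matter, here is a minimal sketch in Python using its standard re module (a text editor's search box should behave much the same way, but that is an assumption worth testing in your own editor). The domains list is the list from this post plus the two problem cases from the comments; the variable names are made up for illustration.

```python
import re

# The list from this post, plus two problem cases pointed out by readers.
domains = [
    "tuaw.com", "apple.com", "last.fm", "navy.mil",
    "google.com", "code.google.com",
    "www.commons.org", "haleyscomet.org",
]

loose = re.compile(r".+.com")      # unescaped dot: any character, then "com"
strict = re.compile(r".+\.com$")   # literal dot, and ".com" must end the line

loose_hits = [d for d in domains if loose.search(d)]
strict_hits = [d for d in domains if strict.search(d)]

print(loose_hits)    # includes www.commons.org and haleyscomet.org
print(strict_hits)   # only the four real .com domains
```

The loose pattern catches both .org lines because its second dot is happy to match the "s" in haleyscomet or the "." in www.commons, as long as "com" comes next.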
I can also search for something like g.+g, and get the string "goog" on lines 5 and 6.
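One thing worth knowing about .+ is that it is "greedy": it grabs as much text as it can while still letting the rest of the expression match. A small Python sketch (the first_match helper is hypothetical, written just for this example) shows the same g.+g search on our two Google lines and on a made-up phrase:

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None

print(first_match(r"g.+g", "google.com"))       # goog
print(first_match(r"g.+g", "code.google.com"))  # goog
# Greedy: the match runs from the first g all the way to the last g.
print(first_match(r"g.+g", "gag and gig"))      # gag and gig
```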
Next is the pipe character: |, which means "or." If I search for fm|mil, it will hit on both lines 3 and 4. I can highlight the entire line if I search for .+fm|.+mil.
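Here is a rough Python sketch of the two pipe searches, again using the standard re module and the list from this post. It collects the text each expression actually matches, which is the part an editor would highlight.

```python
import re

domains = [
    "tuaw.com", "apple.com", "last.fm",
    "navy.mil", "google.com", "code.google.com",
]

short = re.compile(r"fm|mil")        # matches just "fm" or "mil"
whole = re.compile(r".+fm|.+mil")    # matches the line up to and including them

short_matches = []
whole_matches = []
for domain in domains:
    m = short.search(domain)
    if m:
        short_matches.append(m.group())
    m = whole.search(domain)
    if m:
        whole_matches.append(m.group())

print(short_matches)  # ['fm', 'mil']
print(whole_matches)  # ['last.fm', 'navy.mil']
```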
You can also use Regular Expressions to add text in a repeatable sort of way. For example, if I wanted to add "http://" to the beginning of every line, I would search for ^ (that's shift + 6), and type "http://" in the replace box. After clicking "Replace All," I'd get a list that looked like this:
- http://tuaw.com
- http://apple.com
- http://last.fm
- http://navy.mil
- http://google.com
- http://code.google.com
You can do the same thing for adding text to the ends of lines by searching for $. Just as ^ is the way to find the beginning of a line, $ is the way to find the end of a line.
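The same two replacements can be sketched in Python. One difference from most text editors: Python's re module only treats ^ and $ as "start of each line" and "end of each line" when you pass the re.MULTILINE flag. The trailing "/" is just a made-up example of text to add at the end of a line.

```python
import re

text = "tuaw.com\napple.com\nlast.fm"

# ^ matches the empty spot at the start of every line.
with_scheme = re.sub(r"^", "http://", text, flags=re.MULTILINE)

# $ matches the empty spot at the end of every line.
with_slash = re.sub(r"$", "/", text, flags=re.MULTILINE)

print(with_scheme)
print(with_slash)
```

Neither search matches any actual characters, so nothing is removed; the replacement text is simply inserted at each position.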
This is just the very tippy top of the massive iceberg that is Regular Expressions. For example, you could find most email addresses in a text document by running a case-insensitive search for \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. Scary! But there are plenty of sites to help you learn about RegEx, and even help you build search queries.
- regular-expressions.info is a great resource for learning about RegEx, including an excellent tutorial. (This is also where I got that giant email search query.)
- You can download a fantastic RegEx cheat sheet from Added Bytes.
- RegExr is an excellent web-based utility that helps you construct a RegEx query by showing you results in real time. Hits are highlighted as you write your expression.
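If you'd like to poke at the email expression quoted above outside of a text editor, here is a minimal Python sketch. The sample sentence and its addresses are invented for the example, and re.IGNORECASE stands in for an editor's "ignore case" checkbox, since the expression itself only lists capital letters.

```python
import re

# The email expression from this post, with the dot before the
# top-level domain escaped so it matches a literal period.
EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)

text = "Write to tips@example.com or jane.doe+news@mail.example.org today."
found = EMAIL.findall(text)

print(found)  # ['tips@example.com', 'jane.doe+news@mail.example.org']
```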
If you have a favorite RegEx tip aimed at beginners, feel free to share in comments!
Reader Comments (Page 1 of 1)
Ichthydru said 4:11PM on 9-08-2008
Although ".+.com" could also match anythingcom as the ".+" part would match "anythin", the . (meaning any character) would match the "g", and the "com" matching... any guesses?
JTaby said 4:48PM on 9-08-2008
".+\.com"
Eric said 4:16PM on 9-08-2008
Your initial example is just slightly too general.
Instead of using ".+.com" I would use ".+\.com$" to make sure that ".com" is at the end. Likewise, searching for just ".com" would match a line like:
www.commons.org
or
haleyscomet.org
Also, kudos for including "+" in e-mail address. Many e-mail filters leave that out.
Robert Palmer said 4:22PM on 9-08-2008
Thanks -- though I can't take credit for the email filter ... as I noted, it came from regular-expressions.info.
And thanks also for the $ catch ... I've added that to the story.
Matti Niemelä said 4:38PM on 9-08-2008
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Please do NOT use this. Starting in 2009, ICANN (or whatever) has agreed to allow custom top-level domains, which means I could buy myself (for something like $100k) a my-domain.apple address (if I'm faster than Apple, that is, and have that money).
This means that my .apple TLD wouldn't validate with that RegEx.
eropuri said 12:13AM on 9-09-2008
Well, that is true, but there are already top-level domains that would not be matched by this. .MUSEUM fails this regexp, and is in use today. All of the IDN domains (non-ASCII domain names) fail this regexp. So it is poorly conceived.
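The point about longer top-level domains is easy to check with a short Python sketch. The addresses are invented, and the RELAXED pattern is only one possible loosening (an assumption for illustration, not a recommendation from regular-expressions.info); it still handles ASCII names only, so the IDN objection stands.

```python
import re

# The expression from the post: top-level domain limited to 2-4 letters.
EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE
)
# Hypothetical variant: two or more letters, with no upper limit.
RELAXED = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE
)

addresses = ["someone@example.com", "curator@example.museum"]

original_ok = [a for a in addresses if EMAIL.fullmatch(a)]
relaxed_ok = [a for a in addresses if RELAXED.fullmatch(a)]

print(original_ok)  # the .museum address is rejected
print(relaxed_ok)   # both addresses pass
```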
Shan said 7:00PM on 9-08-2008
RegExr is also available as a desktop app for Mac/Win/Lin, thanks to Adobe Flex and AIR.
http://gskinner.com/RegExr/desktop/
Jfro said 5:39PM on 9-08-2008
http://rubular.com/ is another good online regex testing site, sans flash which is a bonus in my book.
Tom said 5:05PM on 9-08-2008
Regexp's rock!
Possibly a better way to use the pipe character is to surround the expressions you are 'or'ing in parentheses:
.+\.(com|mil|fm|org)$
The pipe then applies only to those items listed in parens.
Parentheses have the added benefit of being regexp 'memory'. If you're doing search and replace, you can often use some variant of $1, $2,... (sometimes \1, \2,...) to refer to text matched by the 1st, 2nd, etc. set of parentheses. This lets you do things like:
Search: (.+\.(com|fm|mil|org))$
Replace: href="$1"
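Here is a rough Python sketch of that parenthesis "memory." Python's re.sub writes the reference as \1 rather than $1 in the replacement text. The pattern below uses an outer pair of parentheses as group 1 (the whole domain) and the inner pair as group 2 (just the ending); the sample text is taken from the list in the post.

```python
import re

# Group 1 = the whole domain, group 2 = only the ending after the last dot.
pattern = re.compile(r"^(.+\.(com|fm|mil|org))$", re.MULTILINE)

text = "tuaw.com\nlast.fm\nnavy.mil"

# \1 in the replacement is the text remembered by the first parentheses.
linked = pattern.sub(r'href="\1"', text)

# Each match also exposes its groups directly.
endings = [m.group(2) for m in pattern.finditer(text)]

print(linked)
print(endings)  # ['com', 'fm', 'mil']
```

Groups are numbered by the position of their opening parenthesis, counting from the left, which is why the outer pair is number 1.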
Finally, a shout-out to the use of ^ inside square brackets to indicate "not one of these characters". This can help solve problems like this -- say you're searching for the values inside the quotes in href="link/to/page". Doing something like this works most of the time:
href=".+"
unless you've got multiple quotes on the same line:
href="link/to/page">This now works "badly"
Instead of .+, use [^"]+, as such:
href="[^"]+"
(I want href=", followed by 1 or more non-quote characters, followed by a quote)
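The difference between the two expressions shows up clearly in a quick Python sketch run against the troublesome line above:

```python
import re

line = 'href="link/to/page">This now works "badly"'

# .+ is greedy, so it runs on to the LAST quote on the line.
greedy = re.search(r'href=".+"', line).group()

# [^"]+ cannot cross a quote, so it stops at the first closing quote.
precise = re.search(r'href="[^"]+"', line).group()

print(greedy)   # the whole line
print(precise)  # href="link/to/page"
```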
Nicolas Webb said 5:15PM on 9-08-2008
Is there a utility similar to RegexBuddy for Windows on the Mac? I'm a huge fan of the program and it's one that gets regular use on my VMWare Windows instance.
DennisQ said 1:20AM on 9-09-2008
Check out Reggy.app which is fantastic! http://reggyapp.com for download and link to google code page.
Chris said 5:32PM on 9-08-2008
A nice replacement for 'find . | grep' is the 'ack' tool.
http://petdance.com/ack/
FamousPete said 8:26PM on 9-08-2008
"Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems." –Jamie Zawinski
Kinjan said 2:26AM on 9-25-2008
Hi,
Good post... detailed information about chunks of Regular Expressions.
I have also posted an article on it; you may visit it at http://programminghack.wordpress.com
Thanks and Regards,
Kinjan Shah