
TUAW Tip: Regular Expressions for Beginners

Sometimes I think Regular Expressions are like the tax code: if someone professes to know everything about them, they're probably not telling the truth. In reality, Regular Expressions (or RegEx) is a syntax to help you construct very precise search terms to find and replace bits of text in a variety of applications.

In applications like Coda, BBEdit, and TextMate, you can search for a "string" -- meaning just any old collection of letters next to each other -- using a Regular Expression. For example, I could search for the string "laugh" and it would show up in laughter, slaughter, and (if the search ignores case) Laughlin.

While I can't show you everything about Regular Expressions, I can at least start you off. Keep reading for more about how you can integrate Regular Expressions into your workflow.

Let's pretend I have a list of items. They happen to be domain names, in this case:

  1. tuaw.com
  2. apple.com
  3. last.fm
  4. navy.mil
  5. google.com
  6. code.google.com

Personally, I think the handiest search term is .+. The dot matches any single character and the plus means "one or more," so together they work like a wildcard. Because a bare dot matches anything, a literal period has to be escaped with a backslash: \. In our list, if I searched for .+\.com it would show hits on lines 1, 2, 5 and 6.

Of course, I could just do a plain search for .com and it would hit on the same lines. The difference is that with the RegEx in place, text editors will frequently highlight the entire line, making it easy to find and replace things. For example, if I wanted to delete the text of every line that contained .com, I would search for .+\.com and replace it with an empty string.

(Commenter Eric notes I can make the expression .+\.com$ to ensure that it's not catching something like www.commons.org. You can read more about the $ character in a little bit. Thanks, Eric!)
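
If you'd rather try this outside a text editor, here is a minimal sketch using Python's built-in re module. The domains list just mirrors the example list above, and the variable names are my own for illustration.

```python
import re

# The example list from above, one entry per "line".
domains = ["tuaw.com", "apple.com", "last.fm", "navy.mil", "google.com", "code.google.com"]

# .+ is "one or more of any character", \. is a literal period,
# and $ pins the match to the end of the line.
pattern = re.compile(r".+\.com$")

# Collect the 1-based line numbers where the pattern matches.
hits = [n for n, d in enumerate(domains, start=1) if pattern.search(d)]
print(hits)

# Without the $, the pattern also matches the "www.com" part of this domain.
loose = re.search(r".+\.com", "www.commons.org")
strict = pattern.search("www.commons.org")
print(bool(loose), bool(strict))
```

The first print should list lines 1, 2, 5 and 6; the second shows that only the version without $ is fooled by www.commons.org.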

I can also search for something like g.+g, and get the string "goog" on lines 5 and 6.

Next is the pipe character: |, which means "or." If I search for fm|mil, it will hit on both lines 3 and 4. I can highlight the entire line if I search for .+fm|.+mil.
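
Here is a rough Python sketch of those two searches, again using the built-in re module. The find_hits helper is a hypothetical function I made up for this example; it reports the line number and the exact text each pattern matched.

```python
import re

domains = ["tuaw.com", "apple.com", "last.fm", "navy.mil", "google.com", "code.google.com"]

def find_hits(pattern, lines):
    """Return (line number, matched text) for every line the pattern matches."""
    out = []
    for n, line in enumerate(lines, start=1):
        m = re.search(pattern, line)
        if m:
            out.append((n, m.group()))
    return out

# A "g", one or more characters, then another "g".
goog = find_hits(r"g.+g", domains)

# The pipe means "or": match either "fm" or "mil".
either = find_hits(r"fm|mil", domains)

# Adding .+ in front of each alternative pulls in the rest of the line.
whole = find_hits(r".+fm|.+mil", domains)

print(goog)
print(either)
print(whole)
```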

You can also use Regular Expressions to add text in a repeatable sort of way. For example, if I wanted to add "http://" to the beginning of every line, I would search for ^ (that's shift + 6), and type "http://" in the replace box. After clicking "Replace All," I'd get a list that looked like this:

  1. http://tuaw.com
  2. http://apple.com
  3. http://last.fm
  4. http://navy.mil
  5. http://google.com
  6. http://code.google.com

You can do the same thing for adding text to the ends of lines by searching for $. Just as ^ is the way to find the beginning of a line, $ is the way to find the end of a line.
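
The same trick can be sketched in Python. One assumption to flag: in Python's re module, ^ and $ only match at every line when you pass the re.MULTILINE flag, whereas text editors usually behave that way by default. The trailing slash in the second replacement is just an illustrative choice.

```python
import re

text = "tuaw.com\napple.com\nlast.fm"

# With re.MULTILINE, ^ matches at the start of every line,
# so the replacement text is inserted in front of each one.
with_prefix = re.sub(r"^", "http://", text, flags=re.MULTILINE)

# Likewise, $ matches at the end of every line.
with_suffix = re.sub(r"$", "/", text, flags=re.MULTILINE)

print(with_prefix)
print(with_suffix)
```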

This is just the very tippy top of the massive iceberg that is Regular Expressions. For example, you could find most email addresses in a text document by running a case-insensitive search for \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. Scary! But there are plenty of sites to help you learn about RegEx, and even help you build search queries.

  • regular-expressions.info is a great resource for learning about RegEx, including an excellent tutorial. (This is also where I got that giant email search query.)
  • You can download a fantastic RegEx cheat sheet from Added Bytes.
  • RegExr is an excellent web-based utility that helps you construct a RegEx query by showing you results in real time. Hits are highlighted as you write your expression.
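
To see the email pattern mentioned above in action, here is a small Python sketch. The sample sentence and addresses are made up for illustration, and the re.IGNORECASE flag is there because the pattern only lists capital letters.

```python
import re

# The email pattern from the article, compiled case-insensitively.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Hypothetical sample text with two addresses in it.
sample = "Write to tips@example.com or Jane.Doe+news@mail.example.org today."

found = EMAIL.findall(sample)
print(found)
```

Both addresses should come back, including the one with a + in it.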

If you have a favorite RegEx tip aimed at beginners, feel free to share in comments!




15 Comments

Kinjan

Hi,

Good post. It gives detailed information about the pieces of Regular Expressions.

I have also posted an article on it; you may visit it at http://programminghack.wordpress.com

Thanks and Regards,
Kinjan Shah

September 25 2008 at 2:25 AM
FamousPete

"Some people, when confronted with a problem, think 'I know, I’ll use regular expressions.' Now they have two problems." –Jamie Zawinski

September 08 2008 at 8:27 PM
ChrisW

A nice replacement for 'find . | grep' is the 'ack' tool.

http://petdance.com/ack/

September 08 2008 at 5:18 PM
Nicolas Webb

Is there a utility similar to RegexBuddy for Windows on the Mac? I'm a huge fan of the program and it's one that gets regular use on my VMWare Windows instance.

September 08 2008 at 5:14 PM
1 reply to Nicolas Webb's comment
DennisQ

Check out Reggy.app which is fantastic! http://reggyapp.com for download and link to google code page.

September 09 2008 at 1:15 AM
tom

Regexp's rock!

Possibly a better way to use the pipe character is to surround the expressions you are 'or'ing in parentheses:

.+\.(com|mil|fm|org)$

The pipe then applies only to those items listed in parens (and the \. matches a literal period).

Parentheses have the added benefit of being regexp 'memory'. If you're doing search and replace, you can often use some variant of $1, $2,... (sometimes \1, \2,...) to refer to text matched by the 1st, 2nd, etc. set of parentheses. This lets you do things like:

Search: .+\.(com|fm|mil|org)$
Replace: href="$1"
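
As a rough illustration of that 'memory' in Python: the re module spells the references \1 and \2 rather than $1 and $2. This sketch is a variation on the search above, not the exact same one. I've added a second set of parentheses around the name so the whole domain can be rebuilt inside the quotes.

```python
import re

domains = ["tuaw.com", "code.google.com", "last.fm"]

# Group 1 remembers everything before the final period,
# group 2 remembers which ending matched.
pattern = re.compile(r"^(.+)\.(com|fm|mil|org)$")

# \1 and \2 in the replacement refer back to the two groups.
links = [pattern.sub(r'href="\1.\2"', d) for d in domains]

# The remembered text is also available directly from the match.
endings = [pattern.search(d).group(2) for d in domains]

print(links)
print(endings)
```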

Finally, a shout-out to the use of ^ inside square brackets to indicate "not one of these characters". This can help solve problems like this -- say you're searching for the values inside the quotes in href="link/to/page". Doing something like this works most of the time:

href=".+"

unless you've got multiple quotes on the same line:

href="link/to/page">This now works "badly"

Instead of .+, use [^"]+, as such:

href="[^"]+"

(I want href=", followed by 1 or more non-quote characters, followed by a quote)
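
A quick Python sketch of the difference, using the problem line from above:

```python
import re

line = 'href="link/to/page">This now works "badly"'

# .+ is greedy: it runs to the LAST quote on the line.
greedy = re.search(r'href=".+"', line).group()

# [^"]+ cannot cross a quote, so it stops at the first closing quote.
precise = re.search(r'href="[^"]+"', line).group()

print(greedy)
print(precise)
```

The first search swallows the whole line, while the second returns only href="link/to/page".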


September 08 2008 at 5:04 PM
Jeremy Knope

http://rubular.com/ is another good online regex testing site, sans flash which is a bonus in my book.

September 08 2008 at 4:55 PM
Shan

RegExr is also available as a desktop app for Mac/Win/Lin, thanks to Adobe Flex and AIR.

http://gskinner.com/RegExr/desktop/

September 08 2008 at 4:42 PM
Matti Niemelä

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

Please do NOT use this. Starting in 2009, ICANN (or whatever) has agreed to allow custom top-level domains, which means I could buy myself (for something like $100k) a my-domain.apple address (if I were faster than Apple, that is, and had that money).

This means that my .apple TLD wouldn't validate with that RegEx.

September 08 2008 at 4:38 PM
1 reply to Matti Niemelä's comment
eropuri

Well, that is true, but there are already top-level domains that would not be matched by this. .MUSEUM fails this regexp, and is in use today. All of the IDN domains (non-ASCII domain names) fail this regexp. So it is poorly conceived.
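
A small Python check of the .MUSEUM point, assuming the email pattern is used case-insensitively. The two addresses are made up for illustration.

```python
import re

# The email pattern under discussion; {2,4} caps the ending at four letters.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Hypothetical addresses: one with a six-letter ending, one with three.
long_tld = EMAIL.search("curator@example.museum")
short_tld = EMAIL.search("curator@example.com")

print(bool(long_tld), bool(short_tld))
```

The six-letter ending finds no match at all, because nothing after the period fits in two to four letters followed by a word boundary.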

September 09 2008 at 12:13 AM
Eric

Your initial example is just slightly too general.

Instead of using ".+.com" I would use ".+.com$" to make sure that ".com" is at the end. Likewise, searching for just ".com" would match a line like:
www.commons.org
or
haleyscomet.org

Also, kudos for including "+" in e-mail address. Many e-mail filters leave that out.

September 08 2008 at 4:15 PM
1 reply to Eric's comment
Robert Palmer

Thanks -- though I can't take credit for the email filter ... as I noted, it came from regular-expressions.info.

And thanks also for the $ catch ... I've added that to the story.

September 08 2008 at 4:22 PM