One feature that was added to both Progenesis SameSpots and Progenesis LC-MS in the past year is the ability to search for gels, runs, and peptides using regular expressions. But what’s a regular expression? Here’s what regular‑expressions.info has to say:
“A regular expression (or ‘regex’) is a special text format for describing a search term. You can think of regular expressions as wildcards on steroids.”
(I’ve misquoted that slightly, to make it sound less like it’s coming from a computer programmer!) And, thanks to dictionary.reference.com, a wildcard is defined as:
“A symbol that can represent any character or group of characters, as in a filename.”
When you open a file in Microsoft Word, for instance, you may notice that the Open dialog box is showing files named *.docx. The asterisk in that is the wildcard.
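To see that wildcard in action outside of a dialog box, here is a minimal Python sketch using the standard library's fnmatch module. The file names are made up for illustration, and this only shows how a *.docx-style pattern behaves in general; it is not how Word or Progenesis handle it internally.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, made up for this illustration.
file_names = ["report.docx", "notes.txt", "thesis.docx", "data.csv"]

# "*" stands in for any run of characters, so "*.docx" matches
# every name that ends in ".docx".
word_files = [name for name in file_names if fnmatchcase(name, "*.docx")]

print(word_files)  # ['report.docx', 'thesis.docx']
```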
If they’re so powerful, why haven’t you talked about them before?
It’s fair to say that we’ve been a bit quiet about this. While we’ve certainly been advising people via support on how to use regular expressions, they’re quite an advanced topic; it’s a bit like learning a new language.
To see what I mean, here’s an example search term that will find an email address in a bit of text:
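The search term itself is not reproduced in this text. As a rough stand-in, here is a simplified email pattern of the kind often quoted in regex tutorials, tried out in Python. It is an illustration only, not the original example; the sample text and address are invented, and the pattern will not cover every valid email address.

```python
import re

# Assumption: a simplified, commonly quoted email pattern, used purely for illustration.
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

# Hypothetical text containing a made-up address.
text = "Please contact jane.doe@example.com for the raw files."

match = EMAIL_PATTERN.search(text)
if match:
    print(match.group(0))  # jane.doe@example.com
```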
Thankfully, you’re unlikely to be searching for email addresses within your gel and run names. To help you get started with regular expressions, however, there are now FAQ pages in the support section:
- SameSpots: Can I use wildcards when filtering my gels?
- LC-MS: Can I use wildcards when filtering my runs and peptides?
But let’s quickly look at an example of where you could use a regular expression.
Example: finding missed cleavages in LC-MS
In Progenesis LC-MS, it’s not just the runs for which you can use regular expressions; you can also use them in the Peptide Filter screen, to search in any of the peptides’ text-based properties.
In this example, we’ll use our knowledge that trypsin should cleave immediately after lysine or arginine, unless that residue is followed by proline. These rules can be encoded as the following regular expression, entered into the Sequence field:
[^KR][KR][^P]
Translated into English: each of the three terms in square brackets specifies a single amino acid position in the ‘missed cleavage’ sequence we’re looking for. The letters inside the brackets list the amino acids allowed at that position. If the first character inside the brackets is the caret symbol (^), however, the letters list the amino acids that must not be present at that position.
So, this particular regex says we’re looking for peptides whose sequence contains:
- anything except lysine (K) or arginine (R), followed by
- either lysine or arginine, followed by
- anything except proline (P)
Remember, the above 3 amino acids can appear anywhere in the peptide’s sequence, but they must appear consecutively and in the order stated. In the screenshot shown above, the matching, uncleaved peptides are highlighted in pink.
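If you’d like to try the same idea outside of Progenesis, here is a minimal sketch in Python. The pattern is reconstructed from the description above, the peptide sequences are invented for illustration, and Python’s re module is only a stand-in: the regex engine inside the software may differ in its details.

```python
import re

# Assumption: pattern reconstructed from the description above.
# [^KR] = any residue except K or R, [KR] = K or R, [^P] = any residue except P.
MISSED_CLEAVAGE = re.compile(r"[^KR][KR][^P]")

# Hypothetical peptide sequences, invented for this illustration.
peptides = ["AAGKLLSR", "LVNELTEFAK", "GAKPLLDR", "TTRVVEK"]

# Keep only the sequences in which the three residues occur consecutively.
missed = [seq for seq in peptides if MISSED_CLEAVAGE.search(seq)]

for seq in missed:
    hit = MISSED_CLEAVAGE.search(seq)
    # Report the matching residues and their 1-based start position.
    print(f"{seq}: '{hit.group(0)}' at position {hit.start() + 1}")
```

Here AAGKLLSR and TTRVVEK are reported, because each contains an internal K or R that is not followed by proline. LVNELTEFAK is skipped because its only K is the final residue, and GAKPLLDR is skipped because its internal K is followed by P. Note that, as written, the pattern needs a residue on both sides of the K or R, so it will not flag a K or R at the very start of a sequence.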
They may be daunting at first, but regular expressions can hugely speed up some of the more repetitive tasks in your analysis. When you get a chance, it’s probably worth a little bit of time to learn some of the basics (as seen on the mug above). You may find they can increase your productivity, especially when dealing with very large datasets.
I’d love to hear how you get on too, whether good or bad – just leave a message in the comments for this post. Thanks.