Wednesday, July 22, 2015

Regex: Match All or Nothing

I ran across a fun regular expression problem today. The solution uses PCRE features such as \K, and backtracking control verbs like (*SKIP) and (*FAIL).

A string contains text which has some amount of whitespace at the beginning of each line (except the first line, which we want to ignore). Capture that leading whitespace, so it can be counted, replaced, etc.

The only caveat: if ANY of the lines (barring the first line) have no leading whitespace, then NONE of the lines with leading whitespace should be captured.

function some_name(
    1,
    20,
    'this is text',
    );

Other than the top line that starts with 'function some_name', the four lines that start with whitespace should match with an expression such as this:

^(\s+)

But how do you handle this scenario:

function some_name(
    1,
20,
    'this is text',
    );

where the line with '20,' has no leading whitespace?
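To make the problem concrete, here is a small Python sketch that runs the simple pattern against both samples. The variable names are mine, and Python's re module merely stands in for PCRE here, since this simple pattern behaves the same way in both.

```python
import re

good = "function some_name(\n    1,\n    20,\n    'this is text',\n    );"
bad = "function some_name(\n    1,\n20,\n    'this is text',\n    );"

# re.MULTILINE makes ^ match at the start of every line, not just the string.
naive = re.compile(r"^(\s+)", re.MULTILINE)

good_matches = naive.findall(good)
bad_matches = naive.findall(bad)

print(len(good_matches))  # 4: every indented line is captured
print(len(bad_matches))   # 3: the indented lines still match, but we want none
```

The second result is the trouble: the simple pattern has no way of knowing that a different line broke the rule.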

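The solution (split across lines for readability, which assumes the x flag so whitespace in the pattern is ignored, plus the m flag so ^ matches at the start of every line):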

^\A[^\n]*\n\K
(?=[\s\S]*^\S)
(?:[\s\S]*(*SKIP)(*FAIL))
|
^(\s+)

Let's break this down:

^\A[^\n]*\n\K

The \A anchors the pattern at the very beginning of the string (the ^ in front of it is redundant but harmless). Since we don't care about the first line, [^\n]*\n matches everything to the end of that line (including the newline). Then the \K escape tells the regex engine to drop everything matched so far from the reported match and start anew from the current position.
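Not every engine has \K. Python's built-in re module, for one, does not. As a rough sketch of the same idea there, you can match the first line and carry on from where that match ended. The names first_line, rest_start and rest are just mine for illustration.

```python
import re

text = "function some_name(\n    1,\n    20,\n"

# re.match is anchored at the start of the string, like \A.
first_line = re.match(r"[^\n]*\n", text)

# Instead of \K, remember where the first line ended and continue from there.
rest_start = first_line.end() if first_line else 0
rest = text[rest_start:]

print(repr(rest))
```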

Then using a look-ahead, see whether ANY lines start with a non-whitespace character:

(?=[\s\S]*^\S)
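Look-aheads are not PCRE-specific, so this piece can be tried on its own. Below is a minimal sketch in Python, assuming the first line has already been stripped off. The sample strings and the name has_flush_line are made up for the demonstration.

```python
import re

# The lines after the first one, for the two cases discussed above.
all_indented = "    1,\n    20,\n    'this is text',\n    );"
one_flush = "    1,\n20,\n    'this is text',\n    );"

# re.MULTILINE lets the inner ^ match at the start of any line.
has_flush_line = re.compile(r"(?=[\s\S]*^\S)", re.MULTILINE)

hit = has_flush_line.match(one_flush)
miss = has_flush_line.match(all_indented)

print(hit is not None, hit.end())  # True 0: it succeeds but consumes nothing
print(miss)                        # None: no line starts with non-whitespace
```

Because the look-ahead is zero-width, the engine is still sitting at the start of the second line afterward, which is why the pattern needs a separate step to actually eat the rest of the string.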

If the look-ahead finds even a single such line, the next group consumes ALL the characters to the end of the entire string. The (*SKIP) then tells the engine not to retry the match at any position inside the consumed text, and (*FAIL) forces the match to fail.

Effectively, this consumes all the remaining characters, so the alternative that comes afterward has no text left to match against, and the pattern FAILS.

On the other hand, if that look-ahead could NOT find any lines that start with non-whitespace, the OR condition at the end will look for, and match, the leading whitespace:

^(\s+)

You can see this in action on this page: https://regex101.com/r/pC6gB3/1 where you can see the differences when you add or remove the leading whitespace for any of the lines (again, barring the first line, which we ignore).
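If you are stuck with an engine that has neither \K nor the backtracking verbs, the same all-or-nothing behavior can be approximated in two steps. Here is a rough Python sketch. It is not the PCRE pattern above, just the same logic spelled out with the standard re module, and the function name leading_whitespace is my own.

```python
import re

def leading_whitespace(text):
    """Return the leading whitespace of each line after the first,
    or an empty list if any of those lines starts with non-whitespace."""
    # Drop the first line, which we want to ignore.
    _first, _newline, rest = text.partition("\n")

    # Same test as the look-ahead: does ANY remaining line start with \S?
    if re.search(r"^\S", rest, re.MULTILINE):
        return []  # all or nothing: one bad line rejects everything

    # Otherwise capture the leading whitespace as usual.
    return re.findall(r"^(\s+)", rest, re.MULTILINE)

good = "function some_name(\n    1,\n    20,\n    'this is text',\n    );"
bad = "function some_name(\n    1,\n20,\n    'this is text',\n    );"

print(leading_whitespace(good))  # four strings of four spaces each
print(leading_whitespace(bad))   # []
```

The single PCRE pattern is handier when you can only supply one expression (an editor's search box, for example). The two-step version is easier to read when you have a full programming language at hand.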