Regular Expression "Matching" vs "Capturing" - regex

I've been looking up regular expression tutorials trying to get the hang of them and was enjoying the tutorial in this link right up until this problem: http://regexone.com/lesson/12
I cannot seem to figure out what the difference between "matching" and "capturing" is. Nothing I write seems to select the text under the "Capture" section (not even .*).
Edit: Here is an example for the tutorial that confuses me: (.* (.*)) is considered correct and (.* .*) is not. Is this a problem with the tutorial or something I am not understanding?

Matching:
The engine matches part or all of the string but returns nothing.
Capturing:
The engine matches part or all of the string and returns what it matched.
--
What's the meaning of returning?
When you need to check, store, validate, or otherwise work with a part of the string that your regex matched, you need capturing groups (...)
In your example, this regex .*?\d+ just matches the dates and years See here
And this regex .*?(\d+) matches the whole and captures the year See here
And (.*?(\d+)) will match the whole and capture the whole and the year respectively See here
*Please notice the bottom right box titled Match groups
So returning....
1:
preg_match("/.*?\d+/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
)
2:
preg_match("/(.*?\d+)/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
)
3:
preg_match("/(.*?(\d+))/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
[2] => 1987
)
So as you can see in the last example, we have 2 capturing groups, indexed at 1 and 2 in the array. Index 0 always holds the whole matched string, even though no group captured it.
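The same three cases can be reproduced outside PHP. Below is a minimal sketch in Python, where the re module stands in for preg_match (my substitution, not part of the original answer) and the variable names are only illustrative. group(0) plays the role of index 0 in the PHP array, and groups() lists the captures.

```python
import re

text = "Jan 1987"

# 1: no parentheses, so there is a match but no capture groups.
plain = re.search(r".*?\d+", text)

# 2: one pair of parentheses around everything gives one capture group.
single = re.search(r"(.*?\d+)", text)

# 3: nested parentheses give two groups, numbered by opening parenthesis.
nested = re.search(r"(.*?(\d+))", text)

for label, m in [("plain", plain), ("single", single), ("nested", nested)]:
    # group(0) is always the whole match; groups() holds only the captures.
    print(label, repr(m.group(0)), m.groups())
```

The whole match is available in all three cases; only the number of extra captured pieces changes.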

Capturing in regexps means indicating that you're interested not only in matching (which is finding strings of characters that match your regular expression), but also in using specific parts of the matched string later on.
For example, the answer to the tutorial you linked to would be (\w{3}\s+(\d+)).
Now, why?
To simply match the date strings it would be enough to write \w{3}\s+\d+ (3 word characters, followed by one or more spaces, followed by one or more digits). Adding capture groups to the expression (a capture group is simply anything enclosed in parentheses ()) lets me later extract either the whole expression (using "$1", because the outermost pair of parentheses is the 1st the parser encounters) or just the year (using "$2", because the pair around the \d+ is the 2nd pair the parser encounters).
Capture groups come in handy when you're interested not only in matching strings to a pattern, but also in extracting data from the matched strings or modifying them in any way. For example, suppose you wanted to add 5 years to each of those dates in the tutorial: being able to extract just the year part from a matched string (using $2) would come in handy then.
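To make the "add 5 years" idea concrete, here is a rough Python sketch. The function name add_five_years and the helper bump are hypothetical names chosen for illustration. The pattern is the one from the answer above, so group 1 is the whole date and group 2 is the year.

```python
import re

# Outer group 1 is the whole date, inner group 2 is just the year.
DATE = re.compile(r"(\w{3}\s+(\d+))")

def add_five_years(text: str) -> str:
    def bump(match: re.Match) -> str:
        whole, year = match.group(1), match.group(2)
        prefix = whole[: len(whole) - len(year)]  # month name plus spacing
        return prefix + str(int(year) + 5)
    # sub() calls bump once per match and splices in its return value.
    return DATE.sub(bump, text)

print(add_five_years("Jun 1965 and May 2000"))
```

Without the inner group there would be no direct way to get at the year separately from the month.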

In a nutshell, a "Capture" saves the collected value in a special place so you can access it later.
As some have pointed out, the captured stuff can be used 'later on' in the same pattern, so that
/(ab*c):\1/
will match ac:ac, or abc:abc, or abbc:abbc etc. The (ab*c) will match an a, any number of b, then a c. Whatever it DOES match is 'captured'. In many programming and scripting languages, the syntax like \1, \2 etc has the special meaning referring to the first, second, etc captures. Since the first one might be abbc, then the \1 bit has to match abbc only, thus the only possible full match would then be 'abbc:abbc'
Perl and PHP both allow the \1 \2 syntax inside a pattern. Perl also exposes the captures as the variables $1, $2 etc. after a match, and both languages accept $1-style references in replacement strings. Many languages have picked up the powerful RegEx engine from Perl, so this usage is increasingly common.
In Perl, the typical use of $1 is:
/(ab*c)(de*f)/
then later (e.g. the next line of code)
$x = $1 . $2; # Perl concatenation; in PHP, read the captures from the array that preg_match() fills instead
So the capture is available until your next use of a regex. Depending on the programming language in use, those captured values may be overwritten by the next pattern match, or they may stay available through a match object or similar language feature.
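The backreference behaviour described above for /(ab*c):\1/ can be checked with a short Python sketch (Python is my choice here, not the answer's; its re module uses the same \1 syntax inside a pattern):

```python
import re

# \1 must repeat exactly what group 1 captured, not merely match ab*c again.
pattern = re.compile(r"(ab*c):\1")

for candidate in ["ac:ac", "abc:abc", "abbc:abbc", "abbc:abc"]:
    m = pattern.fullmatch(candidate)
    print(candidate, "->", m.group(1) if m else "no match")
```

The last candidate fails because both halves match ab*c individually, but the second half is not the same text the group captured.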

Take a look at these 2 regexes from your example:
# first
/(... (\d\d\d\d))/
# second
/... \d\d\d\d/
They both match "Jun 1965" and "May 2000"
(and incidentally many other things, like "555 1234").
The second one just matches: the answer is yes or no,
so you could say
if ($x=~/... \d\d\d\d/){do something}
The first one captures, so
if ($x=~/(... (\d\d\d\d))/){print $1,";;;",$2}
would print "Jun 1965;;;1965" when $x is "Jun 1965"

Related

RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challenge
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation | to capture either the pattern starting with a word boundary in group 1 or match 1+ word chars followed by optional whitespace chars.
What you capture in group 1 can be used as the replacement. The matches will not be in the replacement.
Using !? matches an optional exclamation mark. You could prepend that to the pattern, but it is not part of the example data.
\b(APP\d{1,5})\w*|\w+\s*
See a regex demo
In the replacement, use capture group 1, which most tools write as $1 or \1
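As a rough check of this approach, here is a Python sketch using the sample string from the question. It assumes an engine that replaces an unmatched group with an empty string, which Python's re.sub does from version 3.5 on; other replace tools may behave differently. The final strip() only removes the space left over at the end.

```python
import re

description = (
    "APP00001Was APP00002TEST APP00003Blah blah "
    "APP00004 Apple APP11112OrANGE APP"
)

# Branch 1 captures an ID and swallows any word characters glued to it.
# Branch 2 swallows any other word plus trailing whitespace, capturing nothing.
pattern = re.compile(r"\b(APP\d{1,5})\w*|\w+\s*")

# Replacing every match with group 1 keeps the IDs and drops the rest.
ids_only = pattern.sub(r"\1", description).strip()
print(ids_only)
```

If the host language is available, collecting the group 1 values directly (for example with findall) would be a simpler route; the replace trick matters when only a replace function exists.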

Regex referencing captured groups

Firstly, I'm very new to Regex so my apologies if this is a dumb question.
I'm just using an online Regex tester https://regex101.com (PCRE) to build the following scenario.
I want to capture 123445 and ABC1234 from the following sentence
Foo Bar 123445 Ref ABC1234
I just wanted to use a simple capturing group
((?:\w)+)
Which will identify 5 matching groups And then I could back reference it with $3 and $5
However when I attempt using Substitution with just one group, $3, I end up with the whole string. I tried some of the other languages and ended up with
$3 $3 $3 $3 $3
In the end I just used Foo\s*Bar\s*(\w+)\s*Ref\s*(\w+) and referencing groups $1 and $2 which works fine but just isn't very elegant.
Is it possible to create this kind of back referencing without specifically building capturing groups around each part of what you are trying to capture?
Thanks :)
"((?:\w)+) Which will identify 5 matching groups And then I could back reference it with $3 and $5"
No, that's not how backreferences work. There are exactly N groups in a regex, and N is the number of opening parentheses.
In ((?:\w)+) there are 2 groups, one "capturing" (which creates a backreference) and one "non-capturing" (which does not).
The number of times a group matches in a target string does not change the number of backreferences. Imagine the chaos this would create. Except for the most simplistic cases, how would you even know if what you're looking for is $3, $9 or $9000?
If your input string has a fixed structure, then your approach Foo\s*Bar\s*(\w+)\s*Ref\s*(\w+) with $1 and $2 is perfectly fine.
"Is it possible to create this kind of back referencing without specifically building capturing groups around each part of what you are trying to capture?"
No. You must build one capturing group for each part that you are trying to backreference to. If a group matches multiple times, you will get the last instance of each match in the input.
Some regex engines let you access each instance of what a particular group has captured from the host language. For example, the .NET regex engine does that. This is nice for post-processing, but the backreferences themselves (i.e. the $1) still work as above.
All that being said, the way to get '123445' and 'ABC1234' out of Foo Bar 123445 Ref ABC1234 in the way you were thinking of is to avoid regex and string.split() at the space, taking the 3rd and 5th parts (indexes 2 and 4 when counting from zero).
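A small Python sketch of the points in this answer, under the assumption that Python's re module behaves like the engines discussed (it does for group numbering and for repeated groups). The variable names are illustrative only.

```python
import re

sentence = "Foo Bar 123445 Ref ABC1234"

# One capturing group in the pattern means one group number, however
# many times the pattern matches in the input.
word = re.compile(r"((?:\w)+)")
group_count = word.groups            # number of capturing groups
all_matches = word.findall(sentence) # five separate matches of group 1

# A repeated group keeps only the last thing it captured.
last_only = re.match(r"(?:(\w+)\s*)+", sentence).group(1)

# Splitting on spaces avoids regex entirely (zero-based indexes 2 and 4).
parts = sentence.split(" ")
wanted = (parts[2], parts[4])

print(group_count, all_matches, last_only, wanted)
```

So the five words are five matches of the same group 1, not groups 1 to 5.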
It isn't entirely clear what you are trying to match and what you want to substitute with based on your question.
For the purpose of trying to get an answer for you, I'm going to assume that you want to match any word that has a number and replace it with something else.
\w*?\d+\w*? will match any word with a digit in it, and with JavaScript (you didn't specify a language), you can perform a static substitution, or a dynamic one with a replacer function.
const expression = /\b(\w*?\d+\w*?)\b/g;
const inputs = [
'Foo Bar 123445 Ref ABC1234',
'Hello World 123 Foo ABC123XYZ456'
];
// static string
console.log(inputs.map(i => i.replace(expression, '**redacted**')));
// dynamic string
console.log(inputs.map(i => i.replace(expression, s => new Array(s.length).fill('*').join(''))));

Regex Conditionals

I would like to control orphans in InDesign by applying a "No Break" character style based on a GREP expression. Basically, I need to target the last 2 words of a paragraph (That is to say: The last 2 strings of characters separated by a space).
I found a solution for my English publications where (\H+?\h?){2}$ works like a charm.
The problem is with my French publications, where some punctuation requires a space before it. I am trying to choose the matching pattern based on the last character of the paragraph: if it is a ?, ! or :, I match the last 3 "words" using (\H+?\h?){3}$; if not, then I match the last 2.
I thought the following expression would work:
(?(?=[\?!:]$)((\H+?\h?){3}$)|(\H+?\h?){2}$)
but somehow it always defaults to the "else" statement.
Can someone tell me where I went wrong?
Maybe you want option (A) below
See if I understand correctly ...
The requirements are:
Capture the last two words
Even if the paragraph ends with ?, ! or :
(A) Use this to capture as group: https://regexr.com/4lr6h
(\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:[\?!:]|$)
(B) Use this to capture only words: https://regexr.com/4lr84
\w*\s\w*(?=(?:$|[\?!:]))
(C) Use this to capture three last words with marks: https://regexr.com/4lr87
\w*\s\w*[\?!:]?$
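One likely reason the original conditional always falls through to the "else" branch: the lookahead in the condition is tested at the position where the match attempt starts, not at the end of the paragraph, so it can only succeed when the attempt starts right at the final punctuation mark. A plain alternation sidesteps this. The sketch below is in Python rather than InDesign GREP, so it is an approximation: \S and a literal space stand in for \H and \h, and a lookbehind checks the final character.

```python
import re

# First alternative: three space-separated chunks, the last ending in ?, ! or :
# Second alternative: the last two chunks of any other paragraph.
last_words = re.compile(r"(?:\S+ ?){3}(?<=[?!:])$|(?:\S+ ?){2}$")

for paragraph in ["Que voulez-vous dire ?", "This is a simple sentence."]:
    # search() returns the leftmost position where either branch succeeds.
    print(repr(last_words.search(paragraph).group(0)))
```

Whether InDesign's GREP flavour accepts the same lookbehind placement is an assumption to verify there.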

Powershell script to search, split and join in one line

Been racking my Friday brain on a regex problem with dealing with Sql Server object names.
An input to my Powershell script is a procedure name. The name can take many forms, such as
dbo.Procedure
[dbo].Procedure
dbo.[Procedure.Name]
etc
So far I'd come up with the following to split the value into its constituent parts:
[string[]] $procNameA = $procedure.Split("(?:\.)(?=(?:[^\]]|\[[^\]]*\])*$)")
In addition I have a regex that I could use to handle the square brackets
(?:\[)*([A-Za-z0-9. !]+)(?:\])*
And this is about as far as my limited regex experience will take me.
Now granted I could deal with a lot of this by treating each element in a ForEach and doing a RegEx replace there, but y'know that just seems so, I dunno, ungainly. So, question I have for any passing Powershell & RegEx guru: "How can I do all this in one line?"
What I'm looking for is a way to get the following results
Original               Corrected
=====================  =====================
dbo.ProcName           [dbo].[ProcName]
dbo.[ProcName]         [dbo].[ProcName]
[dbo].ProcName         [dbo].[ProcName]
[dbo].[ProcName]       [dbo].[ProcName]
[My.Schema].[My.Proc]  [My.Schema].[My.Proc]
[My.Schema].ProcName   [MySchema].[ProcName]
dbo.[ABadBADName!      [dbo].[[ABadBADName!]
(Notice the last instance, where an object name starts but does not end with a square bracket. Not that I'm expecting that [and if I saw anyone on my team naming an object like that I'd be asking HR if I can fire them for it], but I do like to be thorough.)
Think that covers everything...
So, over to you Powershell & RegEx gurus - how do I do this?
Please limit any answers to FULLY answering the question with code I can actually use and not just syntax suggestions.
Clarification: I am acutely aware that sometimes 'slow and steady wins the race' may apply here and that support wise it would be potentially safer to handle the rest in a ForEach, but that's not the point. Part of this is to help me understand just how flexible RegEx can be, so this is more of an educational exercise rather than a philosophical one.
Okay how about this:
@'
dbo.ProcName
dbo.[ProcName]
[dbo].ProcName
[dbo].[ProcName]
[My.Schema].[My.Proc]
[My.Schema].ProcName
dbo.[ABadBADName!
'@ -split '\s*\r?\n\s*' | % {
    $_ -replace '^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$', '[${schema}].[${proc}]'
}
Note that I'm only using ForEach-Object (%) here to iterate through your test cases; the actual replace is done with a single regex / replace.
Explanation
So the important part here is the regex:
^(?:\[(?<schema>[^\]]+)\]|(?<schema>[^\.]+))\.(?:\[(?<proc>[^\]]+)\]|(?<proc>[^\.]+))$
Breaking it down:
^ -- match the beginning of the string
(?: -- open a non-capturing group (for alternation purposes)
\[ -- match a literal left bracket [
(?<schema> -- start a named capture group, with the name schema
[^\]]+ -- match 1 or more of any character that is not a literal right square bracket ]
) -- end the schema capture group
\] -- match a literal right bracket ]
| -- alternation; if the previous expression didn't match, try what comes after this
(?<schema> -- again start a named capture group called schema; this is only tried if the other one didn't match.
[^\.]+ -- match 1 or more of any character that is not a literal dot .
) -- end the alternate schema capture group
) -- end the non-capturing group
\. -- match a literal dot . (this is the one separating schema and proc)
(the next part for proc is exactly the same steps as above, with a different name for the capturing group)
$ -- match the end of the string
In the replace, we just qualify the names of the groups with ${name} syntax instead of the numbers $1 (which would work too actually).
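For comparison, here is a sketch of the same idea in Python. One assumption to flag: Python's re module rejects a pattern that defines the same group name twice, so each branch gets its own name (s1/p1 for bracketed parts, s2/p2 for bare ones; these names and the function normalize are hypothetical), and a small replacement function picks whichever branch matched.

```python
import re

# Python's re rejects duplicate group names, so each branch gets its own
# (hypothetical) name: s1/p1 for bracketed parts, s2/p2 for bare ones.
NAME = re.compile(
    r"^(?:\[(?P<s1>[^\]]+)\]|(?P<s2>[^.]+))"
    r"\."
    r"(?:\[(?P<p1>[^\]]+)\]|(?P<p2>[^.]+))$"
)

def normalize(name: str) -> str:
    def rebuild(m: re.Match) -> str:
        # Exactly one name per pair is set; the other is None.
        schema = m.group("s1") or m.group("s2")
        proc = m.group("p1") or m.group("p2")
        return f"[{schema}].[{proc}]"
    # Names that do not fit the pattern are returned unchanged.
    return NAME.sub(rebuild, name)

for raw in ["dbo.ProcName", "[My.Schema].ProcName", "dbo.[ABadBADName!"]:
    print(raw, "->", normalize(raw))
```

The .NET version above is shorter precisely because that engine lets both branches share one group name.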

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
    $values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
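The same four changes carry over to other engines. As a sketch, here is the pattern in Python (my substitution for the PHP snippet; re.MULTILINE corresponds to the /m flag), run against the sample text from the question and against a variant where 8467 has grown to 846777:

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Match the prefix normally and capture only the field we want to keep.
pattern = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)

values = pattern.findall(text)
longer = pattern.findall(text.replace(";8467;", ";846777;"))
print(values, longer)
```

Because the prefix is matched rather than looked behind, its length no longer matters.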
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
If your engine supports variable-length lookbehind (.NET does, for example; many others do not), you can match when the first number has a minimum of 4 digits:
(?<=bdfg34f;\d{4,};)\d{0,9}
To match when the first number has 1 or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match only if the first number has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
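Whether these lookbehind patterns work depends on the engine. As one data point, Python's built-in re module is among those that reject them; the sketch below just attempts to compile the first pattern and reports what happens (the variable name supported is illustrative).

```python
import re

try:
    # The lookbehind contains \d{4,}, which has no fixed width.
    re.compile(r"(?<=bdfg34f;\d{4,};)\d{0,9}")
    supported = True
except re.error as exc:
    supported = False
    print("lookbehind rejected:", exc)
```

In such an engine, the capture-group approach is the practical alternative.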
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
    $bits = explode(";", $line);
    echo $bits[2]; // third column (indexes start at 0)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
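For what the no-regex route might look like, here is a rough Python sketch. The csv module plays the role that fgetcsv would in PHP, and the filter on the first field being bdfg34f is my assumption about which row is wanted, based on the original pattern.

```python
import csv
import io

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Treat each line as a semicolon-delimited record.
rows = csv.reader(io.StringIO(text), delimiter=";")

# Keep the third field (index 2) of rows whose first field is bdfg34f.
values = [row[2] for row in rows if row and row[0] == "bdfg34f"]
print(values)
```

The length of the second field never enters into it, which is the appeal of splitting on the delimiter.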