Regex referencing captured groups - regex

Firstly, I'm very new to Regex so my apologies if this is a dumb question.
I'm just using an online Regex tester https://regex101.com (PCRE) to build the following scenario.
I want to capture 123445 and ABC1234 from the following sentence
Foo Bar 123445 Ref ABC1234
I just wanted to use a simple capturing group
((?:\w)+)
Which will identify 5 matching groups And then I could back reference it with $3 and $5
However when I attempt using Substitution with just one group, $3, I end up with the whole string. I tried some of the other languages and ended up with
$3 $3 $3 $3 $3
In the end I just used Foo\s*Bar\s*(\w+)\s*Ref\s*(\w+) and referencing groups $1 and $2 which works fine but just isn't very elegant.
Is it possible to create this kind of back referencing without specifically building capturing groups around each part of what you are trying to capture?
Thanks :)

((?:\w)+)
Which will identify 5 matching groups And then I could back reference
it with $3 and $5
No, that's not how backreferences work. There are exactly N groups in a regex, and N is the number of opening parentheses.
In ((?:\w)+) there are 2 groups, one "capturing" (which creates a backreference) and one "non-capturing" (which does not).
The number of times a group matches in a target string does not change the number of backreferences. Imagine the chaos this would create. Except for the most simplistic cases, how would you even know if what you're looking for is $3, $9 or $9000?
If your input string has a fixed structure, then your approach Foo\s*Bar\s*(\w+)\s*Ref\s*(\w+) with $1 and $2 is perfectly fine.
Is it possible to create this kind of back referencing without
specifically building capturing groups around each part of what you
are trying to capture?
No. You must build one capturing group for each part that you are trying to backreference to. If a group matches multiple times, you will get the last instance of each match in the input.
Some regex engines let you access each instance of what a particular group has captured from the host language. For example, the .NET regex engine does that. This is nice for post-processing, but the backreferences themselves (i.e. the $1) still work as above.
All that being said, the way to get '123445' and 'ABC1234' out of Foo Bar 123445 Ref ABC1234 in the way you were thinking of is to avoid regex and string.split() at the space, taking parts 3 and 5 (indices 2 and 4 if your language counts from zero).
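A minimal sketch of the points above, assuming Python's re module (the question used PCRE on regex101, but the group-counting rule is the same). The variable names are illustrative only. It shows that the number of groups is fixed by the pattern, that a repeated group keeps only its last capture, and that splitting sidesteps the problem:

```python
import re

sentence = "Foo Bar 123445 Ref ABC1234"

# The pattern has one capturing group, however many times it matches.
group_count = re.compile(r"((?:\w)+)").groups

# A group inside a repetition only remembers its last iteration.
last_capture = re.fullmatch(r"(?:(\w+)\s*)+", sentence).group(1)

# Splitting at whitespace gives every word; pick the ones you need.
parts = sentence.split()
wanted = (parts[2], parts[4])

print(group_count, last_capture, wanted)
```

Here the split version needs no capturing groups at all, which is closer to what the question was hoping for.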

It isn't entirely clear what you are trying to match and what you want to substitute with based on your question.
For the purpose of trying to get an answer for you, I'm going to assume that you want to match any word that has a number and replace it with something else.
\w*?\d+\w*? will match any word with a digit in it. With JavaScript (you didn't specify a language), you can perform a static substitution, or a dynamic one with a replacer function.
const expression = /\b(\w*?\d+\w*?)\b/g;
const inputs = [
'Foo Bar 123445 Ref ABC1234',
'Hello World 123 Foo ABC123XYZ456'
];
// static string
console.log(inputs.map(i => i.replace(expression, '**redacted**')));
// dynamic string
console.log(inputs.map(i => i.replace(expression, s => new Array(s.length).fill('*').join(''))));

Related

RegEx Replace - Remove Non-Matched Values

Firstly, apologies; I'm fairly new to the world of RegEx.
Secondly (more of an FYI), I'm using an application that only has RegEx Replace functionality, therefore I'm potentially going to be limited on what can/can't be achieved.
The Challenge
I have a free text field (labelled Description) that primarily contains "useless" text. However, some records will contain either one or multiple IDs that are useful and I would like to extract said IDs.
Every ID will have the same three-letter prefix (APP) followed by a five digit numeric value (e.g. 12911).
For example, I have the following string in my Description Field;
APP00001Was APP00002TEST APP00003Blah blah APP00004 Apple APP11112OrANGE APP
THE JOURNEY
I've managed to very crudely put together an expression that is close to what I need (although, I actually need the reverse);
/!?APP\d{1,5}/g
Result;
THE STRUGGLE
However, on the Replace, I'm only able to retain the non-matched values;
Was TEST Blah blah Apple OrANGE APP
THE ENDGAME
I would like the output to be;
APP00001 APP00002 APP00003 APP00004 APP11112
Apologies once again if this is somewhat of a 'noddy' question; but any help would be much appreciated and all ideas welcome.
Many thanks in advance.
You could use an alternation | to capture either the pattern starting with a word boundary in group 1 or match 1+ word chars followed by optional whitespace chars.
What you capture in group 1 can be used as the replacement. The matches will not be in the replacement.
Using !? matches an optional exclamation mark. You could prepend that to the pattern, but it is not part of the example data.
\b(APP\d{1,5})\w*|\w+\s*
See a regex demo
In the replacement use capture group 1, usually written as $1 or \1
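A small sketch of this alternation trick, assuming Python's re module rather than the asker's replace-only application. The names description and ids are made up for the example. Every match is replaced by group 1, which is empty whenever the second alternative matched:

```python
import re

description = ("APP00001Was APP00002TEST APP00003Blah blah "
               "APP00004 Apple APP11112OrANGE APP")

# Alternative 1 captures an ID (and swallows trailing word characters);
# alternative 2 consumes any other word plus the whitespace after it.
pattern = re.compile(r"\b(APP\d{1,5})\w*|\w+\s*")

# \1 is empty when alternative 2 matched (Python 3.5 or later).
ids = pattern.sub(r"\1", description).strip()
print(ids)
```

The strip() call only removes the single space left over after the last ID.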

Regex fragment for "one or more instances of this pattern" [duplicate]

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is used to quantify a pattern sequence, thus all you need to do is convert the capturing group into a non-capturing one by replacing all unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
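Python's re.findall follows the same rule as Ruby's scan, so the effect can be sketched there as well (Python is an assumption here; the question is about Ruby). With capturing groups you get the groups back, and with non-capturing groups you get the whole matches:

```python
import re

text = " -45.124, 1124.325"

# With capturing groups, each result is a tuple of the groups.
with_groups = re.findall(r"(\+|\-)?\d+(\.\d+)?", text)

# With a non-capturing group, each result is the whole match.
whole_matches = re.findall(r"[+-]?\d+(?:\.\d+)?", text)

floats = [float(m) for m in whole_matches]
print(with_groups, whole_matches, floats)
```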
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just a regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
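For comparison, a rough Python counterpart of variant a, assuming the standard re module: iterate over match objects and read the whole match from each one, so the capturing group needed by the backreference does not get in the way.

```python
import re

text = "11234566666678"

# The backreference \1 forces a capturing group, so read group(0)
# (the whole match) from each match object instead of using findall.
runs = [m.group(0) for m in re.finditer(r"(\d)\1+", text)]
print(runs)
```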
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression in your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an array of arrays, each with one entry per capture group:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)

How to Use Delphi TRegEx to replace a particular capture group?

I am trying to use TRegex in Delphi XE7 to do a search and replace in a string.
The string looks like this "#FXXX(b, v," and I want to replace the second integer value v.
For example:
#F037(594,2027,-99,-99,0,0,0,0)
might become
#F037(594,Fred,-99,-99,0,0,0,0)
I am a newbie at RegEx but made up this pattern that seems to work fine for finding the match and identifying the right capturing group for the "2027" (the part below in parentheses). Here it is:
#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,
My problem is that I cannot work out how to replace just the captured group "2027" using the Delphi TRegEx implementation. I am getting rather confused about TMatch and TGroup and how to use them. Can anyone suggest some sample code? I also suspect I am not understanding the concept of backreferences.
Here is what I have so far:
Uses
RegularExpressions;
//The function that does the actual replacement
function TForm6.DoReplace(const Match: TMatch): string;
begin
//This causes the whole match to be replaced.
//#F037(594,2027,-99,-99,0,0,0,0) becomes Fred-99,-99,0,0,0,0)
//How to just replace the first matched group (ie 2027)?
If Match.Success then
Result := 'Fred';
end;
//Code to set off the regex replacement based on source text in Edit1 and put the result back into Memo1
//Edit1.text set to #F037(594,2027,-99,-99,0,0,0,0)
procedure TForm6.Button1Click(Sender: TObject);
var
regex: TRegEx;
Pattern: string;
Evaluator: TMatchEvaluator;
begin
Memo1.Clear;
Pattern := '#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,';
RegEx.Create(Pattern);
Evaluator := DoReplace;
Memo1.Lines.Add(RegEx.Replace(Edit1.Text, Pattern, Evaluator));
end;
When using regex replacements, the whole matched content will be replaced. You have access to the whole match, captured groups and named captured groups.
There are two different ways of doing this in Delphi.
You are currently using an Evaluator, that is, an object method containing instructions on what to replace. Inside this method you have access to the whole match content. The result will be the replacement string.
This way is useful if vanilla regex is not capable of what you want to do in the replacement (e.g. incrementing numbers, changing character case).
There is another overload Replace method that uses a string as replacement. As you want to do a basic regex replace here, I would recommend using it.
In this string you can backreference to your matched pattern ($0 for whole match, $Number for captured groups, ${Name} for named capturing groups), but also add whatever characters you want.
So you can capture everything you want to keep in groups and then backreference them, as recommended in Wiktor's comment.
As you are doing a single replace, I would also recommend using the class function TRegex.Replace instead of creating the regex and then replacing.
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)',
'$1Fred$2'));
PCRE regex also supports \K (omits everything matched before) and lookaheads, which can be used to capture exactly what you want to replace, like
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'#F\d{3}\(\s*\d{1,5}\s*,\s*\K\d{1,5}(?=\s*,)',
'Fred'));
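The group-based replacement is not specific to Delphi. As a hedged illustration in Python (an assumption, chosen only to show the idea; source and replaced are made-up names), the same pattern keeps the text around the second value and swaps only the value itself:

```python
import re

source = "#F037(594,2027,-99,-99,0,0,0,0)"

# Group 1: everything before the second value; group 2: the comma after it.
pattern = re.compile(r"(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)")

# \g<1> is used instead of \1 so the reference cannot run into following text.
replaced = pattern.sub(r"\g<1>Fred\g<2>", source)
print(replaced)
```

The \g<1> form matters when the replacement text starts with a digit, where \1 followed by that digit would be read as a different group number.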

Force first letter of regex matched value to uppercase

I am trying to get better at regular expressions. I am using regex101.com. I have a regular expression that has two capturing groups. I am then using substitution to incorporate my captured values into another location.
For example I have a list of values:
fat dogs
thin cats
skinny cows
purple salamanders
etc...
and this captures them into two variables:
^([^\s]+)\s+([^\s;]+)?.*
which I then substitute into new sentences using $1 and $2. For example:
$1 animals like $2 are a result of poor genetics.
(obviously this is a silly example)
This works and I get my sentences made but I'm stumped trying to force $1 to have an uppercase first letter. I can see all sorts of examples on MATCHING uppercase or lowercase but not transforming to uppercase.
It seems I need to do some sort of "function" processing. I need to pass $1 to something that will then break it into two pieces...first letter and all the other letters....transform piece one to uppercase...then smash back together and return the result.
Add to that error checking...and while it is unlikely $1 will have numeric values we should still do a safety check of some sort.
So if someone can just point me to the reading material I would appreciate it.
A regular expression will only match what is there. What you are doing is essentially:
Match item
Display matches
but what you want to be doing is:
Match item
Modify matches
Display modified matches
A regular expression doesn't do any 'processing' on the matches, it is just a syntax for finding the matches in the first place.
Most languages have string processing functions. For instance, if you had your matches in the variables $1 and $2 as above, you would want to do something along the lines of:
$1 = upper(substring($1, 0, 1)) + substring($1, 1)
assuming upper() is your language's string uppercasing function, and substring() returns a sub-string (zero-indexed).
Put very simply, regex can only replace from what is in your original string. There is no capital F in fat dogs so you can't get Fat dogs as your output.
This is possible in Perl, but only because Perl processes the text after the regex substitution has finished; it is not a feature of the regex itself. The following is a short Perl program (sans regex) that performs case transformation if run from the command line:
#!/usr/bin/perl -w
use strict;
print "fat dogs\n"; # fat dogs
print "\ufat dogs\n"; # Fat dogs
print "\Ufat dogs\n"; # FAT DOGS
The same escape sequences work in regexes too:
#!/usr/bin/perl -w
use strict;
my $animal = "fat dogs";
$animal =~ s/(\w+) (\w+)/\u$1 \U$2/;
print $animal; # Fat DOGS
Let me repeat though, it is Perl doing this, not the regex.
Depending on your real world example you may not have to change the case of the letter. If your input is Fat dogs then you will get the desired result. Otherwise, you will have to process $1 yourself.
In PHP you can use preg_replace_callback() to process the entire match, including captured groups, before returning the substitution string. Here is a similar PHP program:
<?php
$animal = "fat dogs";
print(preg_replace_callback('/(\w+) (\w+)/', 'my_callback', $animal)); // Fat DOGS
function my_callback($match) {
return ucfirst($match[1]) . ' ' . strtoupper($match[2]);
}
?>
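The callback idea carries over to other languages. A minimal Python sketch, assuming re.sub with a function as the replacement; build_sentence is a hypothetical helper, and the pattern is a simplified version of the one in the question:

```python
import re

def build_sentence(match):
    """Upper-case the first letter of group 1 and build the sentence."""
    first, second = match.group(1), match.group(2)
    first = first[:1].upper() + first[1:]
    return f"{first} animals like {second} are a result of poor genetics."

lines = ["fat dogs", "thin cats"]
sentences = [re.sub(r"^(\S+)\s+(\S+).*", build_sentence, line) for line in lines]
print(sentences)
```

Slicing with first[:1] rather than calling capitalize() leaves the rest of the word untouched, which matches the upper(substring(...)) pseudocode given earlier.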
I think it can be very simple, depending on your language of choice. You can first loop over the list of values, find your match, and then put the groups into your string, using a capitalize method on the first group:
import re
# my_list holds the input lines, e.g. ['fat dogs', 'thin cats']
for val in my_list:
    m = re.match(r'^([^\s]+)\s+([^\s;]+)?.*', val)
    print("%s animals like %s are a result of poor genetics." % (m.group(1).capitalize(), m.group(2)))
But if you want to do it all with regex, it is very unlikely to be possible, because you need to modify your string, and that is generally not a suitable task for regex.
So in the end the answer is that you CAN'T use regex to transform...that's not its job. Thanks to the input by others I was able to adjust my approach and still accomplish the objective of this self-inflicted academic assignment.
First from the OP you'll recall that I had a list and I was capturing two words from that list into regex variables. Well I modified that regex capture to get three capture groups. So for example:
^(\S)(\S+)\s+(\S+)?.*
//would turn fat dogs into
//$1 = f, $2 = at, $3 = dogs
So then using Notepad++ I then replaced with this:
\u$1$2 animals like $3 are a result of poor genetics.
In this way I was able to transform the first letter to uppercase, but as others pointed out this is NOT regex doing the transform but another process (in this case Notepad++, but it could be your C#, Perl, etc.).
Thank You everyone for helping the newbie.

Regular Expression "Matching" vs "Capturing"

I've been looking up regular expression tutorials trying to get the hang of them and was enjoying the tutorial in this link right up until this problem: http://regexone.com/lesson/12
I cannot seem to figure out what the difference between "matching" and "capturing" is. Nothing I write seems to select the text under the "Capture" section (not even .*).
Edit: Here is an example for the tutorial that confuses me: (.* (.*)) is considered correct and (.* .*) is not. Is this a problem with the tutorial or something I am not understanding?
Matching:
When the engine matches part or all of a string but returns nothing.
Capturing:
When the engine matches part or all of a string and returns something.
--
What's the meaning of returning?
When you need to check, store, validate or otherwise work with a part of the string that your regex matched, you need capturing groups (...)
In your example, the regex .*?\d+ just matches the dates and years See here
And this regex .*?(\d+) matches the whole and captures the year See here
And (.*?(\d+)) will match the whole and capture the whole and the year respectively See here
*Please notice the bottom right box titled Match groups
So returning....
1:
preg_match("/.*?\d+/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
)
2:
preg_match("/(.*?\d+)/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
)
3:
preg_match("/(.*?(\d+))/", "Jan 1987", $match);
print_r($match);
Output:
Array
(
[0] => Jan 1987
[1] => Jan 1987
[2] => 1987
)
So as you can see in the last example, we have 2 capturing groups, indexed at 1 and 2 in the array. Index 0 always holds the whole matched string, even though it is not captured by any group.
capturing in regexps means indicating that you're interested not only in matching (which is finding strings of characters that match your regular expression), but you're also interested in using specific parts of the matched string later on.
for example, the answer to the tutorial you linked to would be (\w{3}\s+(\d+)).
now, why ?
to simply match the date strings it would be enough to write \w{3}\s+\d+ (3 word characters, followed by one or more spaces, followed by one or more digits), but adding capture groups to the expression (a capture group is simply anything enclosed in parentheses ()) will allow me to later extract either the whole expression (using "$1", because the outer-most pair of parentheses is the 1st the parser encounters) or just the year (using "$2", because the pair of parentheses around the \d+ is the 2nd pair that the regexp parser encounters)
capture groups come in handy when you're interested not only in matching strings to pattern, but also extracting data from the matched strings or modifying them in any way. for example, suppose you wanted to add 5 years to each of those dates in the tutorial - being able to extract just the year part from a matched string (using $2) would come in handy then
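To make this concrete, here is a small sketch in Python (an assumption; the tutorial itself is language-neutral, and the variable names are illustrative). It reads the whole match and both groups, then uses the captured year to add 5 years:

```python
import re

date = "Jan 1987"
m = re.search(r"(\w{3}\s+(\d+))", date)

whole_match = m.group(0)   # always available, even without parentheses
outer_group = m.group(1)   # the outer pair of parentheses
year = m.group(2)          # the inner pair around \d+

# Use the captured year to shift the date five years forward.
shifted = re.sub(r"(\w{3}\s+)(\d+)",
                 lambda mm: mm.group(1) + str(int(mm.group(2)) + 5),
                 date)
print(whole_match, outer_group, year, shifted)
```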
In a nutshell, a "Capture" saves the collected value in a special place so you can access it later.
As some have pointed out, the captured stuff can be used 'later on' in the same pattern, so that
/(ab*c):\1/
will match ac:ac, or abc:abc, or abbc:abbc etc. The (ab*c) will match an a, any number of b, then a c. Whatever it DOES match is 'captured'. In many programming and scripting languages, the syntax like \1, \2 etc has the special meaning referring to the first, second, etc captures. Since the first one might be abbc, then the \1 bit has to match abbc only, thus the only possible full match would then be 'abbc:abbc'
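A quick check of that behaviour, sketched in Python (assumed here only because it is easy to run): the \1 must repeat exactly what the group captured, so mismatched halves fail.

```python
import re

pattern = re.compile(r"(ab*c):\1")

# \1 must repeat exactly what group 1 captured.
same = pattern.fullmatch("abbc:abbc") is not None
different = pattern.fullmatch("abbc:abc") is not None
print(same, different)
```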
Perl and (I think) PHP both allow the \1 \2 syntax, but they also use $1 $2 etc., which is considered more modern. Many languages have picked up the powerful regex engine from Perl, so there's increasing use of this in the world.
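Since your sample question seems to be on a PHP site, a caveat: the $1 and $2 variables are Perl's. In PHP the captures land in the array you pass to preg_match(), as $matches[1] and $matches[2]. In Perl, the typical use of $1 is: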
/(ab*c)(de*f)/
then later (eg next line of code)
$x = $1 . $2; # I hope that's PHP syntax for concatenation!
So the capture is available until your next use of a regex. Depending on the programming language in use, those captured values may be smashed by the next pattern match, or they may be permanently available through special syntax or use of the language.
take a look at these 2 regex - from your example
# first
/(... (\d\d\d\d))/
#second
/... \d\d\d\d/
they both match "Jun 1965" and "May 2000"
(and incidentally many other things like "555 1234")
the second one just matches it: a yes/no answer
so you could say
if ($x=~/... \d\d\d\d/){do something}
the first one captures so
/(... (\d\d\d\d))/
print $1,";;;",$2
would print "Jun 1965;;;1965" for the input "Jun 1965"