How to Use Delphi TRegEx to replace a particular capture group? - regex

I am trying to use TRegex in Delphi XE7 to do a search and replace in a string.
The string looks like this "#FXXX(b, v," and I want to replace the second integer value v.
For example:
#F037(594,2027,-99,-99,0,0,0,0)
might become
#F037(594,Fred,-99,-99,0,0,0,0)
I am a newbie at RegEx but made up this pattern that seems to work fine for finding the match and identifying the right capturing group for the "2027" (the part below in parentheses). Here it is:
#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,
My problem is that I cannot work out how to replace just the captured group "2027" using the Delphi TRegEx implementation. I am getting rather confused about TMatch and TGroup and how to use them. Can anyone suggest some sample code? I also suspect I am not understanding the concept of backreferences.
Here is what I have so far:
Uses
  RegularExpressions;

//The function that does the actual replacement
function TForm6.DoReplace(const Match: TMatch): string;
begin
  //This causes the whole match to be replaced.
  //#F037(594,2027,-99,-99,0,0,0,0) becomes Fred-99,-99,0,0,0,0)
  //How to just replace the first matched group (ie 2027)?
  If Match.Success then
    Result := 'Fred';
end;

//Code to set off the regex replacement based on source text in Edit1 and put the result back into Memo1
//Edit1.text set to #F037(594,2027,-99,-99,0,0,0,0)
procedure TForm6.Button1Click(Sender: TObject);
var
  regex: TRegEx;
  Pattern: string;
  Evaluator: TMatchEvaluator;
begin
  Memo1.Clear;
  Pattern := '#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,';
  RegEx.Create(Pattern);
  Evaluator := DoReplace;
  Memo1.Lines.Add(RegEx.Replace(Edit1.Text, Pattern, Evaluator));
end;

When using regex replacements, the whole matched content will be replaced. You have access to the whole match, captured groups and named captured groups.
There are two different ways of doing this in Delphi.
You are currently using an evaluator, that is, an object method that decides what the replacement is. Inside this method you have access to the whole match and its groups. Its result is the replacement string for the whole match.
This approach is useful when a plain replacement string cannot do what you need (e.g. incrementing numbers or changing character case).
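As a rough illustration of the evaluator idea in a different language, here is a minimal Python sketch. It uses Python's re module rather than Delphi's TRegEx, and the function name bump_second_value and the choice to increment the number are assumptions made up for this example. The callback receives the match and returns the text that replaces the whole match, so it has to rebuild the untouched parts around group 1 itself:

```python
import re

# Same pattern as in the question: group 1 is the second integer.
PATTERN = re.compile(r'#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,')

def bump_second_value(match: re.Match) -> str:
    """Return the whole match with group 1 replaced by its value plus one."""
    whole = match.group(0)
    # Offsets of group 1 relative to the start of the whole match.
    start = match.start(1) - match.start(0)
    end = match.end(1) - match.start(0)
    new_value = str(int(match.group(1)) + 1)
    return whole[:start] + new_value + whole[end:]

source = '#F037(594,2027,-99,-99,0,0,0,0)'
result = PATTERN.sub(bump_second_value, source)
print(result)
```

Returning only the new value from the callback would reproduce the problem in the question, because the callback's return value stands in for the entire match.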
There is another overloaded Replace method that takes a string as the replacement. As you want to do a basic regex replace here, I would recommend using it.
In this string you can use backreferences to your matched pattern ($0 for the whole match, $Number for captured groups, ${Name} for named capturing groups), and also add whatever literal characters you want.
So you can capture everything you want to keep in groups and then backreference them, as recommended in Wiktor's comment.
As you are doing a single replace, I would also recommend using the class function TRegEx.Replace instead of creating the regex and then replacing.
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)',
'$1Fred$2'));
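The same keep-the-surroundings-in-groups technique can be sketched in Python for comparison. This is an illustration with Python's re module, not the Delphi API; the variable names are made up for the example. Python writes backreferences in the replacement as \g<1> rather than $1:

```python
import re

source = '#F037(594,2027,-99,-99,0,0,0,0)'

# Group 1 keeps everything before the second integer, group 2 what follows it.
pattern = r'(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)'

# \g<1> and \g<2> put the kept parts back around the new text.
result = re.sub(pattern, r'\g<1>Fred\g<2>', source)
print(result)
```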
PCRE also supports \K (which drops everything matched so far from the reported match) and lookaheads, which can be used to match exactly the text you want to replace, like
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'#F\d{3}\(\s*\d{1,5}\s*,\s*\K\d{1,5}(?=\s*,)',
'Fred'));

Related

VSCode - find and replace with regexp, but keep word

I have multiple occurrences of src={icons.ICON_NAME_HERE} in my code that I would like to change to name="ICON_NAME_HERE".
Is it possible to do it with regular expressions, so I can keep whatever is in code as ICON_NAME_HERE?
To clarify:
I have, for example, src={icons.upload} and src={icons.download}, and I want to replace them all with one regexp, so that they get converted to name="upload" and name="download".
Try searching on the following pattern:
src=\{icons\.([^}]+)\}
And then replace with your replacement:
name="$1"
In case you are wondering, the part in parentheses in the search pattern is captured during the regex search. Then, we can access that captured group using $1 in the replacement. In this case, the captured group should just be the name of the icon.
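To check the pattern outside the editor, a small Python sketch behaves the same way on a sample line. The sample text is made up for this example, and Python spells the backreference \1 where VSCode uses $1:

```python
import re

text = '<Icon src={icons.upload} /> <Icon src={icons.download} />'

# Group 1 captures the icon name between "icons." and the closing brace.
converted = re.sub(r'src=\{icons\.([^}]+)\}', r'name="\1"', text)
print(converted)
```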

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string, except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
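As one possible equivalent in another language, here is a minimal Python sketch of the \b version. The function name strip_words is an assumption for this example:

```python
import re

# Each alternative removes the unwanted word together with one adjacent hyphen;
# \b stops it from firing inside longer words.
UNWANTED = re.compile(r'-production\b|\bproduction-|-public\b|\bpublic-')

def strip_words(value: str) -> str:
    """Remove the words "production" and "public" from a hyphenated name."""
    return UNWANTED.sub('', value)

for sample in ('california-public-local-card-production',
               'production-nevada-public'):
    print(strip_words(sample))
```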
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times and combine the group 1 values yourself to retrieve the result.

Replace string using regular expression in KETTLE

I would like to use a regular expression to replace a certain pattern in Kettle. For example, given AAAA >5< BBBB, I want to replace it with AAAA 555 BBBB. I know how to find the pattern, but I am not sure how to replace it with a new string. One constraint is that I have to match >< together as a pair, not > or < separately, because there is another pattern, <5>.
You can use the "Replace in String" step in a transformation.
Set use RegEx to "Y", type your regex in the Search box, with capturing groups if necessary, and put the replacement string in the replacement box, referring to capture groups as $1, $2, ...
It'll replace all occurrences of the regex in the original string.
If the Out Stream field is omitted, it'll overwrite the In stream field.
If you want the pattern >\d< replaced by a triple of the found digit, you can use Replace-In-String in regex mode:
Search: (.*)(>(\d)<)(.*)
Replace: $1$3$3$3$4
If you want all such patterns treated the same:
Search: (>(\d)<)
Replace: $2$2$2
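The simpler search/replace pair can be tried outside Kettle with a short Python sketch. This is only an illustration of the regex itself, not of the Kettle step; the sample string is made up, and Python uses \1 where the step uses $1. With a single capturing group around the digit, the replacement repeats that group three times, and <5> is left alone because the pattern requires > before the digit and < after it:

```python
import re

line = 'AAAA >5< BBBB <5> CCCC >7<'

# Group 1 is the digit between ">" and "<"; the replacement triples it.
tripled = re.sub(r'>(\d)<', r'\1\1\1', line)
print(tripled)
```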
EDIT due to your improved requirement
Since you intend to convert your "simple" markup to a more HTML-like markup, you had better use a User-Defined-Java-Expression. Also, you must avoid reintroducing simple markup when replacing repeatedly.

Extract querystring value from url using regex

I need to pull a variable out of a URL or get an empty string if that variable is not present.
Pseudo code:
String foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda";
String bar = "http://abcdefg.hij.klmnop.com/a/b/c.file";
When I run my regex I want to get 123 in the first case and an empty string in the second.
I'm trying this as my replace .*?foo=(.*?)&?.*
replacing this with $1 but that's not working when foo= isn't present.
I can't just do a match, it has to be a replace.
You can try this:
[^?]+(?:\?foo=([^&]+).*)?
If there are parameters and the first parameter is named "foo", its value will be captured in group #1. If there are no parameters the regex will still succeed, but I can't predict what will happen when you access the capturing group. Some possibilities:
it will contain an empty string
it will contain a null reference, which will be automatically converted to either an empty string or the word "null"
your app will throw an exception because group #1 didn't participate in the match.
This regex matches the sample strings you provided, but it won't work if there's a parameter list that doesn't include "foo", or if "foo" is not the first parameter. Those options can be accommodated too, assuming the capturing group thing works.
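As one data point on how an engine treats the non-participating group, here is a small Python sketch. It shows Python's re module only (other engines may behave differently, as listed above), and the helper name foo_value is made up for this example. In Python 3.5 and later, an unmatched group referenced in the replacement is replaced with an empty string:

```python
import re

PATTERN = re.compile(r'[^?]+(?:\?foo=([^&]+).*)?')

def foo_value(url: str) -> str:
    """Replace the whole URL with group 1, which may not have participated."""
    return PATTERN.sub(r'\1', url)

print(foo_value('http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda'))
print(repr(foo_value('http://abcdefg.hij.klmnop.com/a/b/c.file')))
```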
I think you need to do a match, then a replace. That way you can extract the value if it is present, and use "" if it is not. Something like this:
String bar;
// matches() must cover the whole string, hence the leading and trailing .*
if (foo.matches(".*\\?foo=([^&]+).*")) {
    bar = foo.replaceAll(".*\\?foo=([^&]+).*", "$1");
} else {
    bar = "";
}
I haven't tested the regex, so I don't know if it will work.
In perl you could use this:
s/[^?*]*\??(foo=)?([\d]*).*/$2/
This will get everything up to the ? to start, and then isolate the foo, grab the numbers in a group and let the rest fall where they may.
There's an important rule when using regular expressions: don't try to put unnecessary processing into them. Sometimes things can't be done with a single regular expression. Sometimes it is more advisable to use the host programming language.
Marius' answer makes use of this rule: rather than finding a convoluted way of replacing-something-only-if-it-exists, it is better to use your programming language to check for the pattern's presence, and replace only if necessary.
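A minimal Python sketch of that rule, letting the host language handle the "not present" case, could look like this. The function name extract_foo and the [?&] prefix (so that foo need not be the first parameter) are assumptions added for this example:

```python
import re

def extract_foo(url: str) -> str:
    """Return the value of the foo parameter, or '' when it is absent."""
    match = re.search(r'[?&]foo=([^&]*)', url)
    # The host language handles the missing case; the regex stays simple.
    return match.group(1) if match else ''

print(extract_foo('http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda'))
print(repr(extract_foo('http://abcdefg.hij.klmnop.com/a/b/c.file')))
```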

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
So, when I run this using the expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;", presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
Any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
The group matches one or more characters after B= that are not a ;, so it captures everything between B= and the first ; that follows.
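The difference between the greedy, lazy and negated-class forms can be seen in a short Python sketch. It is an illustration only: the surrounding .+ parts are dropped and a plain search is used instead of a replace, so that only the capturing group differs between the three patterns:

```python
import re

s = 'A=abc;B=def_3%^123+-;C=123;'

# Greedy: (.+) grabs as much as possible, so the match ends at the LAST ';'.
greedy = re.search(r'B=(.+);', s).group(1)

# Lazy: (.+?) grabs as little as possible, so the match ends at the FIRST ';'.
lazy = re.search(r'B=(.+?);', s).group(1)

# Negated class: the group cannot contain ';' at all, so it also stops there.
negated = re.search(r'B=([^;]+);', s).group(1)

print(greedy)
print(lazy)
print(negated)
```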
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.