Extract querystring value from URL using regex

I need to pull a variable out of a URL or get an empty string if that variable is not present.
Pseudo code:
String foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda";
String bar = "http://abcdefg.hij.klmnop.com/a/b/c.file";
When I run my regex I want to get 123 in the first case and an empty string in the second.
I'm trying this as my search pattern: .*?foo=(.*?)&?.*
and replacing it with $1, but that's not working when foo= isn't present.
I can't just do a match, it has to be a replace.

You can try this:
[^?]+(?:\?foo=([^&]+).*)?
If there are parameters and the first parameter is named "foo", its value will be captured in group #1. If there are no parameters the regex will still succeed, but I can't predict what will happen when you access the capturing group. Some possibilities:
it will contain an empty string;
it will contain a null reference, which will be automatically converted to an empty string or to the word "null";
your app will throw an exception because group #1 didn't participate in the match.
This regex matches the sample strings you provided, but it won't work if there's a parameter list that doesn't include "foo", or if "foo" is not the first parameter. Those options can be accommodated too, assuming the capturing group thing works.
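The "capturing group thing" can be checked in a concrete engine. Here is a minimal Python sketch (Python is chosen only for illustration, and the function name query_value_by_replace is hypothetical). It assumes Python 3.5 or later, where re.sub substitutes an empty string for a group that did not participate in the match, and it widens the pattern so that foo may appear anywhere in the parameter list:

```python
import re

def query_value_by_replace(url, name):
    """Return the value of query parameter `name`, or "" if absent, using only a replace."""
    # The optional group looks for "?name=" or "&name=" and captures the value;
    # the trailing .* consumes the rest so the whole URL is replaced.
    pattern = r'^(?:.*?[?&]' + re.escape(name) + r'=([^&#]*))?.*$'
    # An unmatched group 1 is substituted as "" (Python 3.5+).
    return re.sub(pattern, r'\1', url)

foo = "http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda"
bar = "http://abcdefg.hij.klmnop.com/a/b/c.file"
print(repr(query_value_by_replace(foo, "foo")))  # '123'
print(repr(query_value_by_replace(bar, "foo")))  # ''
```

Other engines may treat the unmatched group differently, so this behaviour should be verified in whatever language the replace actually runs in.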

I think you need to do a match, then a replace. That way you can extract the value if it is present, and use "" if it is not. Something like this:
String value;
if (foo.matches(".*[?&]foo=[^&]*.*")) {
    // foo= is present: replace the whole URL with the captured value
    value = foo.replaceFirst(".*[?&]foo=([^&]*).*", "$1");
} else {
    value = "";
}
I haven't tested the regex, so I don't know if it will work.

In perl you could use this:
s/[^?*]*\??(foo=)?([\d]*).*/$2/
This will get everything up to the ? to start, and then isolate the foo, grab the numbers in a group and let the rest fall where they may.

There's an important rule when using regular expressions: don't try to put unnecessary processing into them. Some things can't be done with a single regular expression, and sometimes it is more advisable to use the host programming language.
Marius' answer applies this rule: rather than finding a convoluted way of replacing-something-only-if-it-exists, it is better to use your programming language to check for the pattern's presence, and replace only if necessary.
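As one illustration of leaning on the host language instead of a single regex, here is a small Python sketch that uses the standard library's URL parser. The helper name query_value is hypothetical, and Python is used only as an example language:

```python
from urllib.parse import urlsplit, parse_qs

def query_value(url, name):
    """Return the first value of query parameter `name`, or "" if it is missing."""
    # parse_qs maps each parameter name to a list of its values.
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return params.get(name, [""])[0]

print(repr(query_value("http://abcdefg.hij.klmnop.com/a/b/c.file?foo=123&zoo=panda", "foo")))
print(repr(query_value("http://abcdefg.hij.klmnop.com/a/b/c.file", "foo")))
```

This does not satisfy the asker's "it has to be a replace" constraint; it only shows what the non-regex route looks like when that constraint can be dropped.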

Related

How to Use Delphi TRegEx to replace a particular capture group?

I am trying to use TRegex in Delphi XE7 to do a search and replace in a string.
The string looks like this "#FXXX(b, v," and I want to replace the second integer value v.
For example:
#F037(594,2027,-99,-99,0,0,0,0)
might become
#F037(594,Fred,-99,-99,0,0,0,0)
I am a newbie at RegEx but made up this pattern that seems to work fine for finding the match and identifying the right capturing group for the "2027" (the part below in parentheses). Here it is:
#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,
My problem is that I cannot work out how to replace just the captured group "2027" using the Delphi TRegEx implementation. I am getting rather confused about TMatch and TGroup and how to use them. Can anyone suggest some sample code? I also suspect I am not understanding the concept of backreferences.
Here is what I have so far:
uses
  RegularExpressions;
// The function that does the actual replacement
function TForm6.DoReplace(const Match: TMatch): string;
begin
  // This causes the whole match to be replaced.
  // #F037(594,2027,-99,-99,0,0,0,0) becomes Fred-99,-99,0,0,0,0)
  // How to just replace the first matched group (ie 2027)?
  if Match.Success then
    Result := 'Fred';
end;
// Code to set off the regex replacement based on source text in Edit1 and put the result back into Memo1
// Edit1.Text set to #F037(594,2027,-99,-99,0,0,0,0)
procedure TForm6.Button1Click(Sender: TObject);
var
  regex: TRegEx;
  Pattern: string;
  Evaluator: TMatchEvaluator;
begin
  Memo1.Clear;
  Pattern := '#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,';
  regex := TRegEx.Create(Pattern);
  Evaluator := DoReplace;
  Memo1.Lines.Add(regex.Replace(Edit1.Text, Evaluator));
end;
When using regex replacements, the whole matched content will be replaced. You have access to the whole match, captured groups and named captured groups.
There are two different ways of doing this in Delphi.
You are currently using an Evaluator, that is, an object method containing instructions on what to replace. Inside this method you have access to the whole match content. The result will be the replacement string.
This way is useful if vanilla regex is not capable of what you want to do in the replacement (e.g. incrementing numbers, changing character case).
There is another overloaded Replace method that uses a string as the replacement. As you want to do a basic regex replace here, I would recommend using it.
In this string you can backreference parts of your matched pattern ($0 for the whole match, $Number for captured groups, ${Name} for named capturing groups), but also add whatever characters you want.
So you can capture everything you want to keep in groups and then backreference it, as recommended in Wiktor's comment.
As you are doing a single replace, I would also recommend using the class function TRegEx.Replace instead of creating the regex and then replacing.
Memo1.Lines.Add(
  TRegex.Replace(
    Edit1.Text,
    '(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)',
    '$1Fred$2'));
PCRE regex also supports \K (omits everything matched before) and lookaheads, which can be used to capture exactly what you want to replace, like
Memo1.Lines.Add(
  TRegex.Replace(
    Edit1.Text,
    '#F\d{3}\(\s*\d{1,5}\s*,\s*\K\d{1,5}(?=\s*,)',
    'Fred'));
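For readers outside Delphi, the "capture what you keep, then put it back" idea can be sketched in Python. This is an illustration only, not the TRegEx API: the function name replace_second_value is hypothetical, and Python's re module has no \K, so only the group-based variant is shown. A callback is used here the way the Delphi evaluator is, but it rebuilds the match from its groups instead of returning just the new value.

```python
import re

# Same pattern as in the answer: groups 1 and 2 hold the text to keep.
PATTERN = re.compile(r'(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)')

def replace_second_value(text, new_value):
    """Replace only the second integer argument, keeping the surrounding text."""
    # Returning group 1 + new value + group 2 leaves the rest of the match untouched.
    return PATTERN.sub(lambda m: m.group(1) + new_value + m.group(2), text)

print(replace_second_value("#F037(594,2027,-99,-99,0,0,0,0)", "Fred"))
# prints #F037(594,Fred,-99,-99,0,0,0,0)
```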

regex expression for selecting a value

I want to write a regexp for the SIP message below that extracts a number:
< sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;preason=park;paction=park;ptoken=150009;pautortrv=180;nt_server_host=47.168.105.100:5060 >
(Actually there are "<" and ">" signs in the message, but the site does not let me write them.)
In this case, I want to select the ptoken value. I wrote an expression such as ptoken=(.*);p but it returns ptoken=150009;p, and I just need the number: 150009
How do I write a regexp for this case?
PS: I am writing this for an XML script.
Thanks,
I solved the problem by using two regexes:
<ereg assign_to="token" check_it="true" header="Refer-To:" regexp="(ptoken=([\d]*))" search_in="hdr"/>
<ereg assign_to="callParkToken" search_in="var" variable="token" check_it="true" regexp="([\d].*)" />
You could use the following regex:
ptoken=(\d+)
# searches for ptoken= literally
# captures every digit found in the first group
The number you want is then in the first group. Take a look at this demo on regex101.com. Depending on your actual needs, there could be better approaches though (XPath, perhaps, since the question is tagged XML).
You should use lookahead and lookbehind:
(?<=ptoken=)(.+?)(?=;)
It captures the characters (.+?) that are preceded by ptoken= and followed by ;
The <ereg ... > action has the assign_to parameter. In your case assign_to="token". In fact, the parameter can receive several variable names. The first is assigned the whole string matching the regular expression, and the following are assigned the "capture groups" of the regular expression.
If your regexp is ptoken=([\d]*), the whole match includes ptoken which is bad. The first capture group is ([\d]*) which is the required value. Thus, use <ereg regexp="ptoken=([\d]*)" assign_to="dummyvar,token" ..other parameters here.. >.
Is it working?
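The difference between the whole match and the first capture group, which the answers above rely on, can be seen in a short Python sketch. Python's re module stands in here for the ereg action, whose exact regex flavour is not stated in the question; the variable names are illustrative.

```python
import re

header = ("<sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;"
          "preason=park;paction=park;ptoken=150009;pautortrv=180;"
          "nt_server_host=47.168.105.100:5060>")

m = re.search(r'ptoken=(\d+)', header)
whole_match = m.group(0)  # includes the literal "ptoken="
token = m.group(1)        # the digits only

# Lookbehind variant: the whole match is already just the value.
token_via_lookbehind = re.search(r'(?<=ptoken=)[^;]+', header).group(0)

print(whole_match, token, token_via_lookbehind)
```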

Regular expression and extracting a value

I need to get hold of the |value| below:
{"token":"<input name=\"__RequestVerificationToken\" type=\"hidden\" value=\"KhWUxVIL697p18Gm3T1b4pCmXjK7iQujsJieYiLOKcKmKbdvC55kgaqg4G-uGqeUzmV3x6EMAV_ejPHe-Ok2kFqnjzVmvZmHySMpwKzGvq01\" />"}
What kind of regular expression would match that?
I have tried to use this:
.check(regex("input name='__RequestVerificationToken' type='hidden' value='([A-Za-z0-9+=/'-'_]+?)'").saveAs("token")))
But it does not match.
Also using a regex tester does not get me anywhere, please help me.
I would use something like this:
regex("<input.+__RequestVerificationToken.+value=\\?(\"|\')(.+)\\?(\"|\').+>")
It can be made shorter, but I was not sure what the actual example string looks like (does it contain escape characters, does it use single or double quotes).
Assuming that the string in your question is exactly the way it appears, with escaped double quotes (\") etc., here is the code:
val regexGroupExtractor = """.*value=\\"(.*)\\".*""".r
val regexGroupExtractor(e) = s // s holds the input string shown in the question
// e == "KhWUxVIL697p18Gm3T1b4pCmXjK7iQujsJieYiLOKcKmKbdvC55kgaqg4G-uGqeUzmV3x6EMAV_ejPHe-Ok2kFqnjzVmvZmHySMpwKzGvq01"
In general with regex it is often helpful to think of the pattern in reverse: instead of specifying what is included, specify what is not. In your case there is no need to specify which characters are "in" inside the (), instead focus on where the part you want starts and ends. Specifically in your example - quotes are outside the string you want, in fact the quotes are exactly the edges, so in my regex I capture whatever it is between them.
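The "focus on the edges" advice can be sketched in Python as well (an illustration in a different language from the Scala above; the variable names are mine). The first option matches the raw text exactly as posted, where every quote is preceded by a backslash. The second assumes the response really is JSON, decodes it first, and then matches plain double quotes.

```python
import json
import re

raw = r'{"token":"<input name=\"__RequestVerificationToken\" type=\"hidden\" value=\"KhWUxVIL697p18Gm3T1b4pCmXjK7iQujsJieYiLOKcKmKbdvC55kgaqg4G-uGqeUzmV3x6EMAV_ejPHe-Ok2kFqnjzVmvZmHySMpwKzGvq01\" />"}'

# Option 1: the edges are the two-character sequences \" around the value.
token_raw = re.search(r'value=\\"(.*?)\\"', raw).group(1)

# Option 2: decode the JSON, then the edges are plain double quotes.
html = json.loads(raw)["token"]
token_decoded = re.search(r'value="([^"]*)"', html).group(1)

print(token_raw == token_decoded)
```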

Repeating named capture groups

I have a string with a field like this: id="ID-120-1, ID-141-5, ID-92-5, N/A"
I'd like to capture only the "ID"s to a named capture group (i.e. without the "N/A" or other items that might creep in). I thought this might work, but no luck:
\bid=\"(?<id>(ID-\d+-\d+)+)
Any ideas?
The expression you are using only returns one because you are counting on the start of the id to be present in front of each ID value. The following adjustment should fix that.
(?:(?:=\")|(?:,\s))(?<id>(?:ID-\d+-\d+)*)
Another option would be to just drop the id=" check part altogether
(?<id>(?:ID-\d+-\d+))
Or you could add the ", " check on to the end of the id to make sure you are in the attribute.
(?<id>(?:ID-\d+-\d+))(?:(?:,\s)|(?:"))
You would need to capture commas and spaces also, as they are repeated in your string:
\bid=\"(?<id>(ID-\d+-\d+, )+)
I believe what you are trying to do is not possible with pure regex, especially if IDs and 'N/A' can be intermixed. You will need to have a loop in your program, or if you use Perl or PHP, you can run code in the replacement part of the regex (/e switch) to add the matches to an array.
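The "loop in your program" approach might look like the following Python sketch: first isolate the attribute value, then collect every ID inside it. The function name extract_ids is hypothetical, and Python is used only as an example host language.

```python
import re

def extract_ids(s):
    """Return every ID-<n>-<n> item found inside the id="..." attribute."""
    attr = re.search(r'\bid="([^"]*)"', s)
    if attr is None:
        return []
    # findall does the repetition, so items such as N/A are simply skipped.
    return re.findall(r'\bID-\d+-\d+\b', attr.group(1))

print(extract_ids('id="ID-120-1, ID-141-5, ID-92-5, N/A"'))
# prints ['ID-120-1', 'ID-141-5', 'ID-92-5']
```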

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again. In pseudo code:
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
    return inStr.replace( EXPRESSION, "$1");
}
So, when I run this using the expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;", presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regexp reference that explains how one can specify the "NEXT" token rather than the one at the end.
Any and all help greatly appreciated.
Similar questions on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches every character after B= that is not a ;, so it captures everything between B= and the first ; thereafter.
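A quick way to see the three behaviours side by side is the Python sketch below. It uses a simplified B=(...); pattern with a search, not the asker's full-string replace, so it only illustrates how far each quantifier reaches.

```python
import re

s = "A=abc;B=def_3%^123+-;C=123;"

greedy = re.search(r'B=(.+);', s).group(1)      # runs on to the LAST ';'
lazy = re.search(r'B=(.+?);', s).group(1)       # stops at the FIRST ';'
negated = re.search(r'B=([^;]+);', s).group(1)  # cannot cross a ';' at all

print(greedy)   # def_3%^123+-;C=123
print(lazy)     # def_3%^123+-
print(negated)  # def_3%^123+-
```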
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
    System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
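For comparison, the asker's getInnerString could be written with a find and no replace at all. This is a Python sketch under my own assumptions: the snake_case name get_inner_string and the empty-string fallback are illustrative choices, not something the question specifies.

```python
import re

def get_inner_string(in_str, start_token, end_token):
    """Return the text between start_token and the next end_token, or "" if not found."""
    # re.escape lets the tokens contain regex metacharacters;
    # the lazy .*? stops at the first end_token after start_token.
    m = re.search(re.escape(start_token) + r'(.*?)' + re.escape(end_token), in_str)
    return m.group(1) if m else ""

print(get_inner_string("A=abc;B=def_3%^123+-;C=123;", "B=", ";"))
# prints def_3%^123+-
```

The trade-off is the one described above: a find reads the group directly, while a replace has to match, and then throw away, the rest of the string.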