Regex - Matching a part of a URL

I'm trying to use a regular expression to match part of the following URL:
http://www.example.com/store/store.html?ptype=lst&id=370&3434323&root=nav_3&dir=desc&order=popularity
I want the Regex to find:
&3434323
Basically, it should find any part of the query string that doesn't follow the variable=value format. So I need it to look for sections of the URL that don't have an equals sign in them, and match just that part.
I tried using:
&\w*+[^=_-]
But it returns: &3434323&. I need it to not return the next ampersand.
And it must be done in regex. Thanks in advance!

You can use this regex:
[?&][^=]+(&|$)
It looks for any string that doesn't contain an equals sign ([^=]+), starts with a question mark or an ampersand ([?&]), and ends with an ampersand or the end of the URL ((&|$)).
Please note that this will return &3434323&, so you'll have to strip the ampersands on both sides in your code. I assume that you're fine with that. If you really don't want the second ampersand, you can use a lookahead:
[?&][^=]+(?=&|$)
If you don't want even the first ampersand, you can use this regex, but not all regex engines support lookbehind:
(?<=\?|&)[^=]+(?=&|$)
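As a rough illustration, the three patterns above can be compared side by side. The answer names no language, so Python's re module is assumed here (it supports lookahead and fixed-width lookbehind); the variable names are made up for this sketch.

```python
import re

url = ("http://www.example.com/store/store.html"
       "?ptype=lst&id=370&3434323&root=nav_3&dir=desc&order=popularity")

# 1. The trailing ampersand is consumed as part of the match.
with_both = re.search(r"[?&][^=]+(&|$)", url).group(0)

# 2. The lookahead checks for '&' or end of string without consuming it.
leading_only = re.search(r"[?&][^=]+(?=&|$)", url).group(0)

# 3. The lookbehind also keeps the leading '?' or '&' out of the match.
bare_value = re.search(r"(?<=\?|&)[^=]+(?=&|$)", url).group(0)

print(with_both, leading_only, bare_value)
```

Which variant to pick mostly depends on whether the engine in use supports lookbehind and on how much stripping the calling code is willing to do.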

Parsing query parameters can be tricky, but this may do the job:
((?:[?&])[^=&]+)(?=&|$)
It will not catch the ampersand at the end of the parameter, but it will include either the question mark or the ampersand at the beginning. It will match any parameter not in the form of a key-value pair.
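A minimal sketch of how that pattern might be used to collect every parameter without a value, assuming Python's re module. The URL below is a hypothetical variation of the one in the question, with a second value-less parameter added.

```python
import re

# Pattern from the answer: group 1 keeps the leading '?' or '&'.
VALUELESS = re.compile(r"((?:[?&])[^=&]+)(?=&|$)")

# Hypothetical URL with two parameters that have no '=value' part.
url = "http://www.example.com/store.html?ptype=lst&3434323&dir=desc&debug"

matches = VALUELESS.findall(url)

# Drop the leading delimiter if only the bare token is wanted.
tokens = [m[1:] for m in matches]
print(matches, tokens)
```

Because [^=&]+ cannot cross an ampersand, each match stays inside a single parameter.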

Related

How to get the queryparam vid from the url using regex

Please help me with this regex. I am trying to get the vid value from the following URL.
I tried the following, but I am not sure about it:
[\&]{1}vid[\=][\d]*
Is that correct?
Use vid=(\d+) if the ID is always numeric.
You can try your regex here:
https://regex101.com/r/dX3hD4/1
The trick here is to match between two patterns of interest -
"vid="
"&"
Anything you capture between that is what you're after.
Hence use this:
"http://gorid.com/api.jsp?acs=123&vid=432&skey=asdasd-asdas-adsasd".match("vid=([^&]*)")[1]
We access index 1 of the match result because index 0 holds the whole match and index 1 holds the first capture group, which is the value.
In a JS/PHP-type environment, you can match on something like this, which captures whatever lies between vid= and the following &:
vv = str.match(/vid=(.+?)&/)[1];
If the value is always numeric, replace (.+?) with (\d+?)
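The same capture-group idea, sketched in Python as an assumption (the answers above use JavaScript). The names m, vid and vid_num are illustrative only.

```python
import re

url = "http://gorid.com/api.jsp?acs=123&vid=432&skey=asdasd-asdas-adsasd"

# [^&]* stops at the next '&' or at the end of the string,
# so this also works when vid is the last parameter.
m = re.search(r"[?&]vid=([^&]*)", url)
vid = m.group(1) if m else None

# Numeric-only variant, for IDs that are always digits.
m_num = re.search(r"[?&]vid=(\d+)", url)
vid_num = m_num.group(1) if m_num else None

print(vid, vid_num)
```

Checking the match object before calling group(1) avoids an exception when the parameter is absent.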
The regex you wrote will not work because you are including the characters &vid= in the return value. To make sure the regex engine checks for the string &vid= but does not include it in the result you will need to use a lookbehind:
(?<=&vid=)([^&\r\n]+)
We use a positive lookbehind to find &vid= and then grab everything from that point until the next & sign or the end of the line.
For your second request, if you wish to verify that the content of vid is a valid number you need to specify that all the characters following &vid= should be digits and also include a positive lookahead that makes sure the next character after the digits is a & sign. The corresponding regular expression then becomes:
(?<=&vid=)([^\D]+)(?=&)
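A small sketch of both lookbehind patterns, assuming Python's re module, which accepts fixed-width lookbehind such as (?<=&vid=). The "|$" in the lookahead is an addition made here so that vid may also be the last parameter, and the second URL is hypothetical.

```python
import re

url = "http://gorid.com/api.jsp?acs=123&vid=432&skey=asdasd-asdas-adsasd"

# Lookbehind: "&vid=" must precede the match but is not part of it.
any_value = re.search(r"(?<=&vid=)[^&\r\n]+", url)

# Digits only; "|$" (an addition here) also accepts vid as the last parameter.
NUMERIC_VID = re.compile(r"(?<=&vid=)\d+(?=&|$)")
numeric = NUMERIC_VID.search(url)
rejected = NUMERIC_VID.search("http://gorid.com/api.jsp?acs=123&vid=43x&skey=1")

print(any_value.group(0), numeric.group(0), rejected)
```

Note that the lookbehind requires a literal & before vid=, so neither pattern finds vid when it is the first parameter after the question mark.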

Match string does not contain substring with regex

Ok, I know that it is a question often asked, but I did not manage to get what I wanted.
I am looking for a regular expression in order to find a pattern that does not contain a particular substring.
I want to find a URL that does not contain the b parameter.
http://www.website.com/a=789&c=146 > MATCH
http://www.website.com/a=789&b=412&c=146 > NOT MATCH
Currently, I have the following Regex:
\bhttp:\/\/www\.website\.com\/((?!b=[0-9]+).)*\b
But I am using \b incorrectly: the regex matches from the beginning of the string and stops when it finds b=, instead of not matching at all.
See: http://regex101.com/r/fN3zU5/3
Can someone help me please?
Just use a lookahead to check that whatever follows the URL is a space or the end of the line.
\bhttp:\/\/www\.website\.com\/(?:(?!b=[0-9]+).)*?\b(?= |$)
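To see why testing the lookahead at every character makes a difference, here is a minimal Python sketch. It assumes each URL is tested on its own as a whole string, so it anchors with ^ and $ instead of the \b and (?= |$) used for URLs embedded in text.

```python
import re

# Each character is consumed only if "b=<digits>" does not start at it.
NO_B = re.compile(r"^http://www\.website\.com/(?:(?!b=[0-9]+).)*$")

urls = [
    "http://www.website.com/a=789&c=146",
    "http://www.website.com/a=789&b=412&c=146",
]
results = [bool(NO_B.match(u)) for u in urls]
print(results)
```

In the second URL the repetition has to stop in front of b=412, so the final $ can never be reached and the whole match fails rather than returning a truncated URL.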
Use this:
^http:\/\/www\.website\.com\/((?!b=[0-9]+).)*$
\b only matches at word boundaries.
^ and $ match the start and end of the string.
It does not even need to be that complicated. If the b parameter would always come first, this shorter pattern is enough:
^http:\/\/www\.website\.com\/(?!b).*$
demo here : http://regex101.com/r/fN3zU5/5
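import re
# A negative lookahead at the start rejects the string if "b=" occurs anywhere in it.
pattern = re.compile(r"(?!.*?b=.*).*")
x = "http://www.website.com/a=789&b=412&c=146"
print(pattern.match(x))  # None, because x contains "b="
This looks ahead for "b=" anywhere in the string. Because the lookahead is negative, a string that contains it will not match.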
Have you looked at this possibility?
http://regex101.com/r/fN3zU5/6
^http:\/\/www\.website\.com\/[ac\=\d&]*$
This only allows &, =, a, c and digits. The complete URL is matched, and it cannot contain a "b=" parameter.
If you have more parameters and don't want to list them all, you can instead forbid the letter b anywhere in the parameter string:
^http:\/\/www\.website\.com\/[^b]*$
http://regex101.com/r/fN3zU5/7
^http:\/\/www\.website\.com\/(?!.*?b=.*?).*$ works too. Here only "b=" is rejected, wherever it appears in the parameter string, so "b" can still be used as the value of a parameter.
See
http://regex101.com/r/fN3zU5/8
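The difference between the [^b]* pattern and the lookahead pattern can be checked with a short Python sketch. The first test URL is hypothetical and was chosen so that the letter b appears only inside a value.

```python
import re

BASE = r"^http://www\.website\.com/"
no_b_char = re.compile(BASE + r"[^b]*$")          # rejects any 'b' after the host
no_b_param = re.compile(BASE + r"(?!.*?b=).*$")   # rejects only the text "b="

with_b_value = "http://www.website.com/a=abc&c=146"        # hypothetical
with_b_param = "http://www.website.com/a=789&b=412&c=146"

char_results = [bool(no_b_char.match(u)) for u in (with_b_value, with_b_param)]
param_results = [bool(no_b_param.match(u)) for u in (with_b_value, with_b_param)]
print(char_results, param_results)
```

The lookahead version is still approximate: a parameter whose name merely ends in b, such as ab=1, also contains the text "b=" and would be rejected.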
This is what you want: ^http:\/\/www\.website\.com\/(([^b]=[0-9]+).)*$
It's a simple pattern, not flexible, but it works:
http:\/\/www\.website\.com\/+a=+\w+&+c=+\w+

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/dir/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit: I only want the text after foo until the next forward slash. Some directories may also be named foo, and this would produce incorrect results. Thanks.
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
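The + versus * difference is easy to check. This is a small sketch in Python (an assumption, since the question uses Rubular), with illustrative variable names.

```python
import re

star = re.compile(r"/foo/([^/]*)")
plus = re.compile(r"/foo/([^/]+)")

empty_tail = "http://blablalba.com/foo/"
nested = "http://blablalba.com/foo/bar_soap/foo/dir2"

star_empty = star.search(empty_tail)    # matches with an empty group
plus_empty = plus.search(empty_tail)    # no match at all
first_segment = star.search(nested).group(1)

print(repr(star_empty.group(1)), plus_empty, first_segment)
```

In the nested URL the search stops at the first /foo/, so the later directory that is also named foo does not affect the result.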
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it up by /, and grab the first segment. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
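For comparison, the parse-then-split approach might look like this in Python, using urllib.parse.urlsplit as a stand-in for Ruby's URI::split. The helper segment_after and its marker parameter are hypothetical names introduced for this sketch.

```python
from urllib.parse import urlsplit

def segment_after(url, marker="foo"):
    """Return the path segment that follows the first `marker` segment, or None."""
    # urlsplit separates the path from the query string and the fragment.
    parts = urlsplit(url).path.split("/")
    if marker in parts:
        i = parts.index(marker)
        if i + 1 < len(parts):
            return parts[i + 1]
    return None

print(segment_after("http://blablalba.com/foo/bar_soap/foo/dir2"))
print(segment_after("http://blablalba.com/foo/bar_soap?x=1#frag"))
print(segment_after("http://blablalba.com/foo#bar"))
```

Since the URL parser has already removed the query string and fragment, no extra characters need to be added to a character class.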
You can try this regular expression:
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
The parentheses cause the regex engine to store the matched contents in a group, ([^\/]+), so you can get bar_soap out of the entire match /foo/bar_soap.
For example, in JavaScript you would get the matched group as follows:
regexp = /\/foo\/([^\/]+)/ ;
match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap

Regex for the URL

Can someone help me with writing the regex for the below URL?
I want a Regex to match the whole URL. The url format will be like this.
https://www.mywebsite.com/us/cgi-bin/binary?cmd=_payment-option&transaction_id=8768JKHKJG19322&account_number=6UN85941RH525783L&transaction_date=Apr 12, 2012&transaction_amount=-$11.00&ccode=USD&act_id=6K6218756F7819322&counterparty=Pretty Flower Florist&initiated_page=_login&go_Ah9w8keNJ8YRLMkAMTS_Izeq0br1CF6OVtGv69WzOo8AjgDgGIiBetMG-lK&Go_Actions
This is what I have got so far, but it is matching only till the first '&'
http[s]*:\/\/www.[a-zA-Z0-9.]*mywebsite.[a-zA-Z]*[/]*[a-zA-Z0-9]*[/]*cgi-bin[/]*binary[?]*cmd=[_a-z\-]*[[\&][a-zA-Z0-9_-]*[=][a-z ,A-Z0-9_-]*]*
How can I repeat the pattern &transaction_id=8768JKHKJG19322?
[[\&][a-zA-Z0-9_-]*[=][a-z ,A-Z0-9_-]*]* does not seem to work
This is not a very robust regex, but it should give you the idea: repeat common patterns.
http[s]?:\/\/www\.mywebsite\.com(?:\/[a-zA-Z-?=_&\d\s,$\.]+)+
A partial answer, because (as other posters have noted), it's not clear what you're trying to accomplish, and what your context is. If you just want to pull out the value of the query string parameter transaction_id, then this will do the job for you:
[&?]transaction_id=([^&]+)
In your original post, you have nested brackets. Brackets are for character classes only; you can't nest them.
Instead, use parentheses. Parentheses are used for two things: to indicate nesting or grouping, and to "capture" the value into the match[] array in your program.
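A minimal sketch of that fix in Python (assumed here; the question does not name a language). The repeated "&name=value" unit is wrapped in a non-capturing group, and the character classes are simplified from the ones in the question. PAIR and QUERY are illustrative names, and the test string is a shortened, space-free piece of the original URL.

```python
import re

# One "&name=value" pair, wrapped in a non-capturing group so it can repeat.
PAIR = r"(?:&[A-Za-z0-9_-]*=[A-Za-z0-9 ,_-]*)"
QUERY = re.compile(r"cmd=[_a-z-]*" + PAIR + r"*")

query = "cmd=_payment-option&transaction_id=8768JKHKJG19322&ccode=USD"
matched = QUERY.fullmatch(query) is not None
print(matched)
```

Using (?:...) rather than (...) keeps the repetition from overwriting a capture group on every iteration; use a capturing group only if the last pair is actually needed.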
As for recognizing the rest of the query string, you shouldn't have to match embedded spaces, as in your example &counterparty=Pretty Flower Florist; you should expect that spaces are encoded as + or %20.
Update:
This regex fragment will match the query string part of your input URLs:
([&?]([^=]+)(=([^&]+))?)*
It's not a precise restatement of the rules for query strings, but you can use it to capture parameter names and values. This part
([^=]+)
captures the parameter name, and this part
([^&]+)
captures the parameter value, if any.
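To pull out each name and value separately, the fragment can be applied one parameter at a time. This Python sketch makes two adjustments that are assumptions, not part of the answer: the name class also excludes & so that a value-less parameter cannot swallow the next one, and the value may be empty.

```python
import re

# Name stops at '=' or '&'; the '=value' part is optional.
PARAM = re.compile(r"[&?]([^=&]+)(?:=([^&]*))?")

query = "?cmd=_payment-option&transaction_id=8768JKHKJG19322&Go_Actions"

# group(2) is None when the parameter has no '=value' part.
params = [(m.group(1), m.group(2)) for m in PARAM.finditer(query)]
print(params)
```

Iterating with finditer sidesteps the usual limitation that a repeated capture group only remembers its last iteration.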

A regex that validates a web address and matches an empty string?

The current expression validates a web address (HTTP), how do I change it so that an empty string also matches?
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
If you want to modify the expression to match either an entirely empty string or a full URL, you will need to use the anchor metacharacters ^ and $ (which match the beginning and end of a line respectively).
^(|https?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)$
As dirkgently pointed out, you can simplify your match for the protocol a little, so I've included that for you too.
Though, if you are using this expression from within a program or script, it may be simpler to use the language's own means of checking whether the input is empty.
// in no particular language...
if input.length > 0 then
    if input matches <regex> then
        input is a URL
    else
        input is invalid
else
    input is empty
Put the whole expression in parentheses and mark it as optional (the “?” quantifier: zero or one repetition):
((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)?
Use the anchors ^ and $ around your expression and add |^$ to the end. This way you're using the | (or) operator with two expressions, giving two different match cases.
^(https?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)$|^$
The key here is that |^$ means "or match blank".
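A quick way to try the "URL or blank" alternation, sketched in Python as an assumption. The pattern below is the one above with the unnecessary backslashes removed, which should be equivalent; is_empty_or_url is a hypothetical helper name.

```python
import re

# URL alternative first, then "^$" for the empty string.
URL_OR_EMPTY = re.compile(
    r"^(https?://[\w-]+(\.[\w-]+)+([\w.,#?^=%&:/~+-]*[\w#?^=%&/~+-])?)$|^$"
)

def is_empty_or_url(text):
    """True if text is empty or looks like an http(s) URL under this pattern."""
    return URL_OR_EMPTY.search(text) is not None

checks = [is_empty_or_url(s) for s in
          ("", "https://example.com", "http://www.example.com/path?x=1", "not a url")]
print(checks)
```

The anchors on both alternatives matter: without them the pattern would succeed on any input that merely contains a URL or an empty position.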
Also, that expression will only work in JavaScript if you use a template string.
Use (Expr)?, where Expr is your URL matcher, just as you would write https? to cover both http and https. The ? is known as a quantifier. From Wikipedia:
? The question mark indicates there is zero or one of the preceding element.