Regex for the URL - regex

Can someone help me with writing the regex for the below URL?
I want a Regex to match the whole URL. The url format will be like this.
https://www.mywebsite.com/us/cgi-bin/binary?cmd=_payment-option&transaction_id=8768JKHKJG19322&account_number=6UN85941RH525783L&transaction_date=Apr 12, 2012&transaction_amount=-$11.00&ccode=USD&act_id=6K6218756F7819322&counterparty=Pretty Flower Florist&initiated_page=_login&go_Ah9w8keNJ8YRLMkAMTS_Izeq0br1CF6OVtGv69WzOo8AjgDgGIiBetMG-lK&Go_Actions
This is what I have got so far, but it is matching only till the first '&'
http[s]*:\/\/www.[a-zA-Z0-9.]*mywebsite.[a-zA-Z]*[/]*[a-zA-Z0-9]*[/]*cgi-bin[/]*binary[?]*cmd=[_a-z\-]*[[\&][a-zA-Z0-9_-]*[=][a-z ,A-Z0-9_-]*]*
How can I repeat the pattern &transaction_id=8768JKHKJG19322?
[[\&][a-zA-Z0-9_-]*[=][a-z ,A-Z0-9_-]*]* does not seem to work

This is not a very robust regex, but it should give you the idea: repeat the common patterns.
http[s]?:\/\/www\.mywebsite\.com(?:\/[a-zA-Z-?=_&\d\s,$\.]+)+

A partial answer, because (as other posters have noted), it's not clear what you're trying to accomplish, and what your context is. If you just want to pull out the value of the query string parameter transaction_id, then this will do the job for you:
[&?]transaction_id=([^&]+)
In your OP, you have nested brackets. Brackets are for character classes only; you can't nest them.
Instead, use parentheses. Parentheses are used for two things: to indicate nesting or grouping, and to "capture" the value into the match[] array in your program.
As for recognizing the rest of the query string, you shouldn't have to match embedded spaces, as in your example &counterparty=Pretty Flower Florist; you should expect that spaces are encoded as + or %20.
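A minimal Python sketch of the transaction_id pattern above, assuming the URL is available as a plain string (the shortened URL and the variable names are illustrative, not from the question):

```python
import re

# Shortened version of the URL from the question (illustrative).
url = ("https://www.mywebsite.com/us/cgi-bin/binary?cmd=_payment-option"
       "&transaction_id=8768JKHKJG19322&ccode=USD")

# [&?] ties the name to the start of a parameter;
# ([^&]+) captures everything up to the next '&' or the end of the string.
match = re.search(r"[&?]transaction_id=([^&]+)", url)
transaction_id = match.group(1) if match else None
print(transaction_id)
```

The value is read from group 1, so the leading & or ? and the parameter name are not part of the result.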
Update:
This regex fragment will match the query string part of your input URLs:
([&?]([^=]+)(=([^&]+))?)*
It's not a precise restatement of the rules for query strings, but you can use it to capture parameter names and values. This part
([^=]+)
captures the parameter name, and this part
([^&]+)
captures the parameter value, if any.
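As a rough illustration in Python: the re module keeps only the last repetition of a repeated capture group, so this sketch matches one parameter per iteration with finditer instead of wrapping the whole thing in (...)*. The pattern is a slight variation of the fragment above (the name excludes & as well as =), and the shortened URL is an assumption for the example.

```python
import re

# Shortened query string in the style of the question (illustrative).
url = ("https://www.mywebsite.com/us/cgi-bin/binary?cmd=_payment-option"
       "&transaction_date=Apr 12, 2012"
       "&counterparty=Pretty Flower Florist&Go_Actions")

# Group 1: parameter name; group 2: optional value (None when there is no '=').
param = re.compile(r"[&?]([^=&]+)(?:=([^&]*))?")

pairs = [m.groups() for m in param.finditer(url)]
for name, value in pairs:
    print(name, "->", value)
```

For real code, urllib.parse.urlsplit and urllib.parse.parse_qsl from the standard library are usually a safer choice than a hand-written pattern.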

Related

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
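In Python, the same substitution could be sketched with re.sub (the function name strip_keywords is made up for this example):

```python
import re

# Same alternation as the s/.../g form above, with word boundaries.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

def strip_keywords(name):
    # Replace every unwanted word (plus one adjacent hyphen) with nothing.
    return UNWANTED.sub("", name)

results = [strip_keywords(s) for s in (
    "california-public-local-card-production",
    "production-nevada-public",
    "subproduction-x",   # left alone thanks to \b
)]
print(results)
```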
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times and combine the group 1 values to retrieve the result.
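A hedged Python sketch of that "match multiple times" idea: it collects group 1 from every match and joins the pieces. A plain non-capturing group is used instead of the atomic group (?>...), which Python only supports from version 3.11 and which makes no difference for this input. The hyphen cleanup step and the function name are additions for the example. Any text after the last keyword would be dropped by this approach.

```python
import re

# Lazy prefix (group 1) followed by one of the unwanted words.
PIECE = re.compile(r"([\s\S]*?)(?:production|public)")

def keep_rest(name):
    # Join what precedes each keyword, then tidy up leftover hyphens.
    joined = "".join(m.group(1) for m in PIECE.finditer(name))
    return re.sub(r"-{2,}", "-", joined).strip("-")

first = keep_rest("california-public-local-card-production")
second = keep_rest("production-nevada-public")
print(first, second)
```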

Regex - Matching a part of a URL

I'm trying to use regular expression to match a part of the following url:
http://www.example.com/store/store.html?ptype=lst&id=370&3434323&root=nav_3&dir=desc&order=popularity
I want the Regex to find:
&3434323
Basically, it's meant to find any part of the query string that doesn't follow the variable=value format. So I need it to search for sections of the URL that don't have an equals sign in them, and match just that part.
I tried using:
&\w*+[^=_-]
But it returns: &3434323&. I need it to not return the next ampersand.
And it must be done in regex. Thanks in advance!
You can use this regex:
[?&][^=]+(&|$)
It looks for any string that doesn't contain the equals sign [^=]+, starts with a question mark or an ampersand [?&], and ends with an ampersand or the end of the URL (&|$).
Please note that this will return &3434323&, so you'll have to strip the ampersands on both sides in your code. I assume that you're fine with that. If you really don't want the second ampersand, you can use a lookahead:
[?&][^=]+(?=&|$)
If you don't want even the first ampersand, you can use this regex with a lookbehind, but not all regex engines support it:
(?<=\?|&)[^=]+(?=&|$)
Parsing query parameters can be tricky, but this may do the job:
((?:[?&])[^=&]+)(?=&|$)
It will not catch the ampersand at the end of the parameter, but it will include either the question mark or the ampersand at the beginning. It will match any parameter not in the form of a key-value pair.
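To compare the lookahead and lookbehind variants, here is a small Python sketch using the URL from the question. Both patterns exclude & inside the parameter, as in the last answer; the variable names are illustrative.

```python
import re

url = ("http://www.example.com/store/store.html?ptype=lst&id=370"
       "&3434323&root=nav_3&dir=desc&order=popularity")

# Lookahead only: the leading '?' or '&' is part of the match.
with_prefix = re.findall(r"[?&][^=&]+(?=&|$)", url)

# Lookbehind plus lookahead: only the bare parameter is matched.
bare = re.findall(r"(?<=[?&])[^=&]+(?=&|$)", url)

print(with_prefix, bare)
```

Python's re accepts this lookbehind because it has a fixed width of one character.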

regex expression for selecting a value

I want to write a regexp formula for the below sip message that takes number:
< sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;preason=park;paction=park;ptoken=150009;pautortrv=180;nt_server_host=47.168.105.100:5060 >
(Actually there are "<" and ">" signs in the message, but the site does not let me write)
In this case, I want to select the ptoken value. I wrote an expression such as ptoken=(.*);p, but it returns ptoken=150009;p; I just need the number: 150009
How do I write a regexp for this case?
PS: I write this for XML script..
Thanks,
I solved the problem by using two regexes:
<ereg assign_to="token" check_it="true" header="Refer-To:" regexp="(ptoken=([\d]*))" search_in="hdr"/>
<ereg assign_to="callParkToken" search_in="var" variable="token" check_it="true" regexp="([\d].*)" />
You could use the following regex:
ptoken=(\d+)
# searches for ptoken= literally
# captures every digit found in the first group
Your wanted numbers are in the first group then. Take a look at this demo on regex101.com. Depending on your actual needs, there could be better approaches (Xpath? as tagged as XML) though.
You should use lookahead and lookbehind:
(?<=ptoken=)(.+?)(?=;)
It captures any characters (.+?) that are preceded by ptoken= and followed by ;
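Outside the XML tool, the two suggestions can be compared with a short Python sketch (the message is the one from the question without the angle brackets; the variable names are made up):

```python
import re

msg = ("sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;"
       "preason=park;paction=park;ptoken=150009;pautortrv=180;"
       "nt_server_host=47.168.105.100:5060")

# Capture group: the whole match includes 'ptoken=', group 1 is the digits.
by_group = re.search(r"ptoken=(\d+)", msg).group(1)

# Lookarounds: the whole match is already just the value.
by_lookaround = re.search(r"(?<=ptoken=)(.+?)(?=;)", msg).group(0)

print(by_group, by_lookaround)
```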
The <ereg ... > action has the assign_to parameter. In your case assign_to="token". In fact, the parameter can receive several variable names. The first is assigned the whole string matching the regular expression, and the following are assigned the "capture groups" of the regular expression.
If your regexp is ptoken=([\d]*), the whole match includes ptoken which is bad. The first capture group is ([\d]*) which is the required value. Thus, use <ereg regexp="ptoken=([\d]*)" assign_to="dummyvar,token" ..other parameters here.. >.
Is it working?

What do the braces () in URL regular expression represent?

For example, in
r'^articles/(\d{4})/$', 'news.views.year_archive'
I understand all regexes except (\d{4}). Four digits but why the braces?
(python/django example)
another example:
r'^articles/(\d{4})/(\d{2})/(\d+)/$', 'news.views.article_detail'
Braces are used for grouping, which can be used to extract a subset of a match. They can also be used to indicate that a subset repeats (or is optional), although your regex does not use them that way.
See http://www.regular-expressions.info/brackets.html
Based on the usage, I'd wager that the code matching this URL is using the brackets to extract the year so that it can be used in a query. See the group function of the Match object
Django automatically extracts grouped subexpressions and uses them as parameters for your view:
The view gets passed an HttpRequest as its first argument and any values captured in the regex as remaining arguments.
...
A request to /articles/2005/03/ would match the third entry in the list. Django would call the function news.views.month_archive(request, '2005', '03').
https://docs.djangoproject.com/en/dev/topics/http/urls/
Besides grouping part of a regular expression together, round brackets also create a "backreference". A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.
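The effect of the parentheses can be seen with plain Python, without Django. In this sketch month_archive is a hypothetical stand-in for a view, not Django's own code; it only shows that each pair of parentheses yields one captured string that can be passed on as an argument.

```python
import re

# URL pattern in the style of the question, reduced to year and month.
pattern = re.compile(r"^articles/(\d{4})/(\d{2})/$")

def month_archive(request, year, month):
    # Hypothetical stand-in for a view; it only echoes its arguments.
    return f"archive for {year}-{month}"

match = pattern.match("articles/2005/03/")
captured = match.groups()          # one string per pair of parentheses
response = month_archive(None, *captured)
print(captured, response)
```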

How to match a string that does not end in a certain substring?

How can I write a regular expression that matches names that do not end with some string?
In my project, all classes whose names don't end with some string such as "controller" or "map" should inherit from a base class. How can I do this using a regular expression?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any lines not ending with "SomeText" as well (including empty lines, since .* can match nothing).
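A small Python sketch of both forms, with made-up names. One caveat for Python specifically: its re module requires fixed-width lookbehinds, so (?<!controller|map) is rejected there and the sketch chains two separate lookbehinds instead.

```python
import re

names = ["usercontroller", "sitemap", "invoice", "mapper"]

# Two fixed-width negative lookbehinds in front of the end anchor.
lookbehind = re.compile(r"(?<!controller)(?<!map)$")

# Negative lookahead from the start of the string.
lookahead = re.compile(r"^(?!.*(?:controller|map)$).*$")

kept_lookbehind = [n for n in names if lookbehind.search(n)]
kept_lookahead = [n for n in names if lookahead.match(n)]
print(kept_lookbehind, kept_lookahead)
```

When no regex is strictly required, not name.endswith(("controller", "map")) does the same check in Python.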
You could write a regex that contains two groups: one consists of one or more characters before controller or map, the other contains controller or map and is optional. The first group has to be lazy (.+?), otherwise a greedy .+ consumes the whole string and the optional second group never captures anything.
^(.+?)(controller|map)?$
With that you can match your string, and if the regex API you use has a group() method and group(2) is empty, the string does not end with controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
finally I did it in this way
public.*class.*[^(controller|map|spec)]$
it worked
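One caution about that last pattern: inside square brackets, (controller|map|spec) is not a group of alternatives but a set of single characters, so [^...]$ only tests the final character of the line. A small Python sketch with made-up class names shows how this differs from a lookahead-based check:

```python
import re

# Character class: rejects any line whose LAST CHARACTER is one of
# ( c o n t r l e | m a p s )
char_class = re.compile(r"public.*class.*[^(controller|map|spec)]$")

# Negative lookahead: rejects lines that END WITH one of the words.
lookahead = re.compile(r"^public.*class(?!.*(?:controller|map|spec)$).*$")

lines = [
    "public class usercontroller",
    "public class order",   # ends in 'r', but not in "controller"
    "public class box",
]

by_char_class = [s for s in lines if char_class.search(s)]
by_lookahead = [s for s in lines if lookahead.search(s)]
print(by_char_class, by_lookahead)
```

The character class version happens to reject the controller, map and spec names, but it also rejects unrelated names such as "order".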