Regex to include one thing but exclude another - regex

I've been having a lot of trouble finding how to write a regex to include certain URLs starting with a specified phrase while excluding another.
We want to include pages that start with:
/womens
/mens
/kids-clothing/boys
/kids-clothing/girls
/homeware
But we want to exclude anything that has /sXXXXXXX in the URL - where the X's are numbers.
I've written this so far to match the below URLs but it's behaving very oddly. Should I be using lookarounds or something?
\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/boys|homeware).*[^s[0-9]+].*
/homeware/bathroom/s2522424/4-tier-pastel-pop-drawers-approx-91cm-x25cm-x-28cm
/homeware/bathroom/towels-and-bathmats
/homeware/bathroom/towels-and-bathmats/s2506420/boutique-luxury-towels
/homeware/bathroom/towels-and-bathmats?page=3&size=36&cols=4&sort=&id=/homeware/bathroom/towels-and-bathmats&priceRange[min]=1&priceRange[max]=14
/homeware/bathroom?page=3&size=36&cols=4&sort=&id=/homeware/bathroom&priceRange[min]=1&priceRange[max]=35
/homeware/bedroom
/homeware/bedroom/bedding-sets
/homeware/bedroom/bedding-sets/s2471012/striped-reversible-printed-duvet-set
/homeware/bedroom/bedding-sets/s2472706/check-printed-reversible-duvet-set
/homeware/bedroom/bedding-sets/s2475332/union-jack-duvet-set
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520246/boys-lollipop-slogan-t-shirt
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520253/boys-2-pack-dinosaur-t-shirts
/kids-clothing/girls/great-value/sale?page=1&size=36&cols=4&sort=price.asc&id=/kids-clothing/girls/great-value/sale&priceRange[min]=0.5&priceRange[max]=7
/kids-clothing/girls/mini-shops/ballet-outfits
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2484120/3-pack-frill-pants-pinks
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2504431/3-pack-l-s-bodysuit
/mens/categories/tops?page=5&size=36&cols=4&sort=&id=/mens/categories/tops&priceRange[min]=2&priceRange[max]=22.5
/mens/categories/trousers-and-chinos
/mens/categories/trousers-and-chinos/s2438566/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2438574/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2458939/regatta-zip-off-lightweight-outdoor-trousers

You are on the right track. A negative lookahead will do it:
"^(?!.*\/s\d+)\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*"
The ^ anchors to the start of the string. The (?!.*\/s\d+) means that "/sXXXXXXX" can't appear anywhere in the string, and the rest of it matches your required starting tokens.
The reason [^s[0-9]+] didn't work is that [^xyz] matches only one single character. What you're effectively saying there is that you're looking for any character that isn't any combination of "s", "[" and "0-9", followed by "]". e.g. "s[234[s]".
The reason you need to put your negative lookahead at the start of the string is so nothing is matched at all. If you put it after the \/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*, you would still successfully match everything before the "/sXXXXXXX". i.e. for line 1 of your data, you would match "/homeware/bathroom/".

Yes, you need a negative lookaround:
/^\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/boys|homeware)(?:\/(?:(?!s\d+).)*)+$/gm
If you're comparing one line at a time you don't need the multiline (m) flag. It's probably behaving strangely because you had a character class (denoted by square brakcets) nested inside more square brackets, which doesn't work; you can't nest character classes. This was tested and works on refiddle.

Related

How to apply correct regex?

I have a special task which requires lots of regex and javascript parsing.
My head is almost exploding, so maybe I'm tired and forgot some small thing else I'm not newbie to regex so perhaps someone will point me to good direction here and show me where I did mistake.
So I have this regex code:
((?<=\ffmpg=).+(?=////u0026cs=nt))
to get the value of substring between 2 strings. The first string is called:
ffmpg= from this string it should start and it will end just before the other string start called //u0026cs=nt
The problem is that it is working fine until the html page contains only one parameter with the same name; because the source html has inside like 10's of ffmg and the same end string called cs=nt.
I can not even make regex to count the characters because every time you visit the html page the number of characters are different, sometimes +3 else +10. So the only way is to get this sting from the start of param1 to the end of param2.
This is the string I need to get: 1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012
This is the source html example:
\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\
I have copied 3 times the same just for this purpose because it is very big html source and I doubt I can upload it here.
Thanks for your help.
In your questions, you use (?<=\ffmpg=) where \f will match a form feed character which is not present in the data example. If you meant to use \\f it will match \f which is also not present in the example data.
You could get the match using a capturing group instead of using lookarounds as lookbehinds are not widely supported by all browsers.
If you just want to get a single match, you can omit the /g global flag.
If you use .+ you will match too much as the .+ will match until the end of the string and then backtracks until the first time it can match \\u0026cs=nt
What you could do instead is be specific in what you would allow to match which for the current string is a character class with the following characters [AC0-9%]+
You could broaden the character class with a range to match chars A-Z instead of AC for example and add more chars or ranges as required.
ffmpg=([AC0-9%]+)\\\\u0026cs=nt
Regex demo
For example
const regex = /ffmpg=([AC0-9%]+)\\\\u0026cs=nt/;
const str = `\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\`;
console.log(str.match(regex)[1]);
Try this:
(?<=ffmpg=)([A-F0-9%]+)
Explanation
Since your string only consists of url-encoded characters, you can use [A-F0-9%]+character class to capture it. It will stop when next string starts because there will be a backslash.
See online demo here.

How write a regex starts and ends with particular string?

I hava a string, like this:
{"content":(uint32)123", "id":(uint64)111, "test":{"hi":"(uint32)456"}}
I want to get result:
(uint32)123
(uint64)111
so I write regex like this:
[^(?!\")](\(uint32\)|\(uint64\))(\d)+[^(?!\")$]
but the result is:
:(uint32)123
:(uint64)111,
here the result adds : and ,
I hope that the regex does not begin with " and does not end with " , now I should how change my regex?
(\(uint(?:32|64)\)\d+) Works for me. It captures the entire string (uint[32/64])<any number of digits\> without bothering about the characters that come before or after.
Tested the following one in python
(?<!\")(\(uint32\)|\(uint64\))\d+(?!(\"|\d))
It looked like you was trying to use negative lookahead and negative lookbehind checks. But you did couple of mistakes:
You put them inside symbol group like this: [^(?!\")] what this regexp really mean - not any of symbols inside square bracket (^ - stands for not). How it should be instead: (?!\") - which mean symbol after current position shouldn't be quote (note: this will also work if there is no symbol after
To check symbol before you need to use look ahead check which have syntax (?<!some_regexp). So it would be (?<!\")
You don't need checks for start or end of the line. If you do you can put then into separate negative look ahead/behind statement.
Here is corrected example without line start/end checks:
(?<!\")(\(uint32\)|\(uint64\))(\d)+(?!\")(?!\d)
Note: you need to add (?!\d) at the end, cause otherwise it would match everything except last digit if there is quote.
Here is example with start/end of line checks:
(?<!^)(?<!\")(\(uint32\)|\(uint64\))(\d)+(?!\")(?!\d)(?!$)
P.S.: depending on language you using - you might not need to escape quote - you do need to escape quote only in case it is string escape sequence not regexp escape sequence.

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
Or my expected output is:
"." "" 'ami.12.0','allo.12'
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". So for example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see if that 12 is the start of either of the two ignored strings. It is not, so it continues with replacing it. It would be best to generalize this, but in your specific case, you can probably achieve this by instead doing a negative lookbehind for any prefixes of the words (that can be followed by digit sequences) that you want to skip. So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/dir/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit. I only want the text after foo until until the next forward slash after. Some directories may also be named as foo and this would render incorrect results. Thanks
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling out your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, and then use String#split split it up by /, and grab the first one. This would handle all the weird cases for URLs. One that you probably haven't though of yet is a URL with a specified fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
You can try this regular expression
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
the parentheses cause the regex engine to store the matched contents in a group ([^\/]+), so you can get bar_soap out of the entire match of /foo/bar_soap
For example, in javascript you would get the matched group as follows:
regexp = /\/foo\/([^\/]+)/ ;
match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap

How to match a string that does not end in a certain substring?

how can I write regular expression that dose not contain some string at the end.
in my project,all classes that their names dont end with some string such as "controller" and "map" should inherit from a base class. how can I do this using regular expression ?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$`
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any non-empty lines not ending with "SomeText" as well.
You could write a regex that contains two groups, one consists of one or more characters before controller or map, the other contains controller or map and is optional.
^(.+)(controller|map)?$
With that you may match your string and if there is a group() method in the regex API you use, if group(2) is empty, the string does not contain controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
finally I did it in this way
public.*class.*[^(controller|map|spec)]$
it worked