Regular expression quantifier questions - regex

I'm trying to find a regular expression that matches this kind of URL:
http://sub.domain.com/selector/F/13/K/100546/sampletext/654654/K/sampletext_sampletext.html
but doesn't match this one:
http://sub.domain.com/selector/F/13/K/10546/sampletext/5987/K/sample/K/101/sample_text.html
It should match only if /K/ occurs at least once and at most twice (something with a quantifier like {1,2}).
So far I have the following regexp:
http://sub\.domain\.com/selector/F/[0-9]{1,2}/[a-z0-9_-]+/
Now I need a hand adding a condition like:
Match this only if /K/ appears in the text 1 to 2 times.
Thanks in advance.
Best Regards.
Josema

Do you need to do this all in one line?
The approach I would take is to do a regex for /K/ and then count the number of matches I got.
I think Boost is a C++ library, right? In C# I would do it like this:
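using System.Text.RegularExpressions;

string url = "http://sub.domain.com/selector/F/13/K/100546/sampletext/654654/K/sampletext_sampletext.html";
// Count the /K/ occurrences; the question asks for at least 1 and at most 2.
int count = Regex.Matches(url, "/K/").Count;
if (count >= 1 && count <= 2)
{
    // good url found
}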
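For readers not using C#, here is a minimal Python sketch of the same counting idea. The function name has_one_or_two_k is made up for illustration, and the check also enforces the minimum of one /K/ that the question asks for.

```python
def has_one_or_two_k(url: str) -> bool:
    """Return True if the literal segment /K/ occurs once or twice in url."""
    # str.count counts non-overlapping occurrences of the substring.
    count = url.count("/K/")
    return 1 <= count <= 2


good = "http://sub.domain.com/selector/F/13/K/100546/sampletext/654654/K/sampletext_sampletext.html"
bad = "http://sub.domain.com/selector/F/13/K/10546/sampletext/5987/K/sample/K/101/sample_text.html"

print(has_one_or_two_k(good))
print(has_one_or_two_k(bad))
```

This only counts /K/ and says nothing about the rest of the URL, so it would normally be combined with a regex that checks the prefix.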
UPDATE
This regex would match everything up to the first two K's and then only allow the url filename.html after that:
^http://sub\.domain\.com/selector/F/[\d]+/[a-zA-Z]+/[\d]+/[a-zA-Z]+/[\d]+/K/[a-zA-Z_]+\.html$

This RE will match anything after the /F/[0-9]{1,2} that has 1 or 2 /K/ segments. Note that it also matches http://sub.domain.com/selector/F/13/K/100546/stuff/21515/stuff/sampletext/654654/K/stuff/sampletext_sampletext.html :
^http://sub\.domain\.com/selector/F/[0-9]{1,2}(?:/K(?=/)(?:(?!/K/)/[a-z0-9_.-]+)*){1,2}$
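A quick way to check that pattern is to run it against the two URLs from the question. This is a sketch using Python's re module as a stand-in for whatever engine is actually in use (it relies only on lookaheads and non-capturing groups); the names K_PATTERN and matches_k_rule are illustrative.

```python
import re

# /K must be followed by /, then any number of path segments that do not
# themselves start another /K/ segment; the whole group may occur 1 or 2 times.
K_PATTERN = re.compile(
    r"^http://sub\.domain\.com/selector/F/[0-9]{1,2}"
    r"(?:/K(?=/)(?:(?!/K/)/[a-z0-9_.-]+)*){1,2}$"
)


def matches_k_rule(url: str) -> bool:
    """Return True if url has the expected prefix and one or two /K/ segments."""
    return K_PATTERN.match(url) is not None


two_k = "http://sub.domain.com/selector/F/13/K/100546/sampletext/654654/K/sampletext_sampletext.html"
three_k = "http://sub.domain.com/selector/F/13/K/10546/sampletext/5987/K/sample/K/101/sample_text.html"

print(matches_k_rule(two_k))
print(matches_k_rule(three_k))
```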

Related

Regular Expression Repetition groups

I have a regex:
\*777\*[0-9]{10,}\*\d+\*(5|10|20|25|50|100)\*\d+#
That is what I have so far.
It can handle input such as *777*9283928839*89*5*9090#.
The format is: *777*phone*Qty*Item Code*pin#
The problem is that sometimes the input will look like this:
*777*phone*Qty*Item Code*Qty*Item Code*Qty*Item Code*pin#
The Qty*Item Code part can repeat, but the item code must be one of 5, 10, 20, 25, 50, 100.
I'm confused about how to make the regex check the repeated Qty*Item Code part.
Can someone give me a hint?
Thanks.
You can use the following:
\*777\*[0-9]{10,}\*(\d+\*(5|10|20|25|50|100)\*)+\d+#
Explanation
The part that's repeating seems to be this:
\d+\*(5|10|20|25|50|100)\*
If you enclose that in parentheses and add + after it, it will tell regex to match what's inside the parentheses one or more times:
(\d+\*(5|10|20|25|50|100)\*)+
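As a rough check of the repeated group, here is a small Python sketch. The names ORDER_PATTERN and is_valid_order are hypothetical, the groups are made non-capturing because nothing is extracted, and fullmatch is used on the assumption that the whole input must fit the format.

```python
import re

# *777*phone* followed by one or more Qty*ItemCode* pairs, then pin#
ORDER_PATTERN = re.compile(
    r"\*777\*[0-9]{10,}\*(?:\d+\*(?:5|10|20|25|50|100)\*)+\d+#"
)


def is_valid_order(text: str) -> bool:
    """Return True if text consists of the prefix, 1+ Qty*ItemCode pairs, and a pin."""
    return ORDER_PATTERN.fullmatch(text) is not None


print(is_valid_order("*777*9283928839*89*5*9090#"))             # one pair
print(is_valid_order("*777*9283928839*89*5*3*10*7*100*9090#"))  # three pairs
print(is_valid_order("*777*9283928839*89*7*9090#"))             # 7 is not an allowed item code
```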

Regex that matches a pattern only if string does not begin with 'N'

I need to put together a regex that matches a pattern only if the string does not begin with 'N'.
Here is my pattern so far [A-E]+[-+]?.
Now I want to make sure that it does not match something like:
N\A
NA
NB+
NB-
NCAB
This is for the REGEXP_SUBSTR function in an Oracle SQL database.
UPDATE
It looks like I should have been more specific, sorry
I want to extract from a string [A-E]+[-+]? but if the string also matches ^(N|n) then I want my regex to return nothing.
See examples below:
String    Returns
N/A       (nothing)
F1/AAA    AAA
NABC      (nothing)
FABC      ABC
To match a character between A and E not preceded by N, you can use:
([^N]|^)[A-E]+
If you want to exclude fields that contain N[A-E], add a negated condition to your query using the pattern N[A-E]. In other words, use two predicates: one to exclude NA and the other to find A.
To be more clear:
WHERE NOT REGEXP_LIKE(coln, 'N[A-E]') AND REGEXP_LIKE(coln, '[A-E]')
OK, I figured it out. I broadened the scope of the problem a little and realized that I can use the other parameters of REGEXP_SUBSTR so that only the second subexpression is returned:
REGEXP_SUBSTR(field1, '^([^NA-D][^A-D]*)?([A-D]+[-+]?)',1,1,'i',2)
I still have to give you guys the credit; a lot of good ideas led me here.
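The "optional prefix plus second subexpression" idea can be tried outside Oracle too. Below is a sketch in Python, assuming re.IGNORECASE plays the role of the 'i' match parameter and group(2) plays the role of the final subexpression argument; the function name extract_grade is made up.

```python
import re

# Group 1: an optional prefix that must not start with N (or A-D) and
# contains no A-D letters. Group 2: the part we actually want.
GRADE_PATTERN = re.compile(r"^([^NA-D][^A-D]*)?([A-D]+[-+]?)", re.IGNORECASE)


def extract_grade(value: str):
    """Return the second group, or None when the string starts with N or has no match."""
    m = GRADE_PATTERN.match(value)
    return m.group(2) if m else None


for sample in ["N/A", "F1/AAA", "NABC", "FABC"]:
    print(sample, "->", extract_grade(sample))
```

Note that this follows the pattern as posted, which uses [A-D] rather than the [A-E] from the question.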
Just throw a [^N]? in front. That should do it.
OOPS...
That actually needs to include an " OR ^ "...
It should look like this:
([^N]|^)[A-E]+[-+]?
Sorry about that... It looks like the right answer already got posted anyway.

Regex: matching unknown groups that repeat?

I'm trying to create a generic regex pattern for a crawler, to avoid so-called "crawler traps" (links that just add URL parameters and refer to the exact same page, which results in tons of useless data). A lot of the time, those links just add the same part to a URL over and over again. Here is an example from a log file:
http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/...
I can use regular expressions to narrow the scope of the crawler, and I would love to have a pattern that tells the crawler to ignore everything that has repeating parts. Is that possible with a regex?
Thanks in advance for some tips!
JUST TO CLARIFY:
The crawler traps are not designed to prevent crawling; they are a result of poor web design. All the sites we are crawling have explicitly allowed us to do so!
If you are already looping through a list of URLs, you could add matching as a condition to skip the current iteration:
import re

array = ["/abcd/abcd/abcd/abcd/", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/", "http://examplepage/apple/cake/banana/"]
# Capture a run of 4+ characters, then require it to appear 3 more times.
pattern1 = re.compile(r'.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*')
for url in array:
    if pattern1.match(url):
        print("It matches; skipping this URL")
        continue
    print(url)
Example regex:
.*?([^\/\&?]{4,})(?:[\/\&\?])(.*?\1){3,}.*
([^\/\&?]{4,}) matches and captures a run of 4 or more characters, none of which is /, & or ?.
(?:[\/\&\?]) matches a single /, & or ?.
(.*?\1){3,} matches as little as possible followed by the captured text again, and requires this 3 or more times.
You can use a backreference in Python/Perl regexes (and possibly others) to catch a pattern which is repeated:
>>> re.search(r"(/.+)\1", "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/").group(1)
'/cssms/chrome'
\1 refers to the first capture group, so (/.+)\1 means the same sequence repeated twice in a row. The leading / is there to keep the regex from matching the first single repeated letter (the t in http) and to catch repetitions in the path instead.
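One possible refinement, which is my own assumption and not part of the answer above, is to make the backreference line up with whole path segments so that something like /ab/abc is not flagged. A minimal sketch, with hypothetical names REPEATED_SEGMENTS and has_repeated_path:

```python
import re

# Group 1 is one or more complete /segment parts; \1 requires the same
# parts to follow immediately, and the lookahead makes sure the repeat
# ends on a segment boundary (a slash or the end of the string).
REPEATED_SEGMENTS = re.compile(r"((?:/[^/]+)+)\1(?=/|$)")


def has_repeated_path(url: str) -> bool:
    """Return True if a run of whole path segments is immediately repeated."""
    return REPEATED_SEGMENTS.search(url) is not None


trap = "http://examplepage.com/cssms/chrome/cssms/chrome/cssms/pages/browse/cssms/pages/misc/"
normal = "http://examplepage/apple/cake/banana/"

print(has_repeated_path(trap))
print(has_repeated_path(normal))
```

This only catches repeats that are directly adjacent; a segment that recurs with other segments in between would need the looser pattern from the other answer.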

regular expression match domain

I need a regular expression that extracts the domain from inputs like the following (input = expected match):
http://www.cnn.com/fred = www.cnn.com
cnn.com = cnn.com
www.cnn.com:8080 = www.cnn.com
I have the following regular expression (using pcre):
([^/]+://)?([^:/]+)
The above works fine in cases 2 and 3; in case 1, however, the match still includes the http:// prefix. Is there a regular expression option I can use to skip the http part?
Many thanks in advance.
This one should suit your needs:
^(?:(?:f|ht)tps?://)?([^/:]+)
The first group will contain what you're looking for.
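To see what the first group holds for the three inputs from the question, here is a short Python sketch. The question uses PCRE; this pattern uses only syntax that Python's re module also supports. The names HOST_PATTERN and extract_host are illustrative.

```python
import re

# Optional scheme (http, https, ftp, ftps), then capture everything up to
# the first slash or colon.
HOST_PATTERN = re.compile(r"^(?:(?:f|ht)tps?://)?([^/:]+)")


def extract_host(text: str):
    """Return the host part captured by group 1, or None if nothing matches."""
    m = HOST_PATTERN.match(text)
    return m.group(1) if m else None


for sample in ["http://www.cnn.com/fred", "cnn.com", "www.cnn.com:8080"]:
    print(sample, "->", extract_host(sample))
```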
This looks like the closest I could get to what I want. It's not perfect, but it seems to get the job done:
www?([^/:]+)

MATLAB 2012 regular expression

I have a set of strings that I'd like to parse in MATLAB 2012 that all have the following format:
string-int-int-int-int-string
I'd like to pluck out the third integer (the rest are 'don't cares'), but I haven't used MATLAB in ages and need to refresh on regular expressions. I tried using the regular expression '(.*)-(.*)-(.*)-\d-(.*)' but no dice. I did check out the MATLAB regexp page, but wasn't able to figure out how to apply that information to this case.
Anyone know how I might get the desired result? If so, could you explain what the expression you're using is doing to get that result so that others might be able to apply the answer to their unique situation?
Thanks in advance!
str = 'XyzStr-1-2-1000-56789-ILoveStackExchange.txt';
[tok] = regexp(str, '^.+?-.+?-.+?-(\d+?)-.+?-.+?', 'tokens');
tok{:}
ans =
'1000'
Update
Explanation, upon request.
^ - "Anchor", or match beginning of string.
.+? - Wildcard match, one or more, non-greedy.
- - Literal dash/hyphen.
(\d+?) - Digits match, one or more, non-greedy, captured into a token.
^.*?-.*?-.*?-(\d+)-.*?-.*?$
OR
^(?:[^-]*?-){3}(\d+)(?:.*?)$
Group 1 now contains the required data.
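Both patterns can be sanity-checked outside MATLAB. The sketch below uses Python's re module as a stand-in (the syntax used here is common to both engines) and the sample string from the earlier answer; the function name third_integer is made up.

```python
import re

# Variant 1: skip three lazily matched fields, capture the digits of the fourth.
PATTERN_A = re.compile(r"^.*?-.*?-.*?-(\d+)-.*?-.*?$")
# Variant 2: skip exactly three "anything but a dash, then a dash" fields.
PATTERN_B = re.compile(r"^(?:[^-]*?-){3}(\d+)(?:.*?)$")


def third_integer(text: str, pattern=PATTERN_B):
    """Return the third integer field as a string, or None if the format differs."""
    m = pattern.match(text)
    return m.group(1) if m else None


sample = "XyzStr-1-2-1000-56789-ILoveStackExchange.txt"
print(third_integer(sample, PATTERN_A))
print(third_integer(sample, PATTERN_B))
```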