Regex Ignore First 6 Matches Of Character - regex

This is my string: /my/name/is/the/following/string/name.lastname/file.txt
I want to extract name.lastname from this string.
I've tried using \/.*\.app, but this selects:
/my/name/is/the/following/string/name.lastname
How can I ignore the first 6 or 7 /'s?

You have quite a few good answers going for you. Here's one that uses positive look ahead (?=), with the end of string $.
([^\/]+)(?=\/[^\/]+$)
The benefit here is you can have as many folders prior to your last folder, and it will still work.
If we break this down, you have a
capturing group: ([^\/]+), and a
positive look ahead (?=\/[^\/]+$).
The capturing group matches any character except a forward slash ([^\/], where ^ negates the class), one or more times (+). On its own it would capture every segment between forward slashes, which is why we add the positive lookahead.
The key part of the positive lookahead is that it is anchored to the end of the string $. It requires a forward slash / (the (?=\/ portion), followed by one or more characters that are not forward slashes ([^\/]+), running up to the end of the string $. In other words, exactly one more path segment may follow the captured text.
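As a quick check of the lookahead pattern above, here is a minimal Python sketch. The choice of Python, the variable names, and the shorter second path are assumptions made for illustration; the question does not name a language.

```python
import re

path = "/my/name/is/the/following/string/name.lastname/file.txt"

# One path segment, provided exactly one more segment follows before the end.
last_folder = re.compile(r"([^/]+)(?=/[^/]+$)")

match = last_folder.search(path)
folder = match.group(1) if match else None
print(folder)

# The same pattern at a different depth (hypothetical shorter path).
short = last_folder.search("/a/b/c.txt")
short_folder = short.group(1) if short else None
print(short_folder)
```

Because the lookahead is tied to the end of the string, the number of leading folders does not matter.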

You may use a repeating pattern to consume, but not capture, the first six components of the path:
(?:\/[^\/]+){6}\/([^\/]+)
Your item will be available in the first capture group.
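A small Python sketch of the fixed-depth approach, assuming Python's re module and a hypothetical helper name seventh. Note that a path with fewer than seven components simply does not match.

```python
import re
from typing import Optional

# Skip six "/segment" components, then capture the seventh.
seventh_component = re.compile(r"(?:/[^/]+){6}/([^/]+)")

def seventh(path: str) -> Optional[str]:
    """Return the seventh path component, or None if the path is too shallow."""
    m = seventh_component.match(path)
    return m.group(1) if m else None

deep = seventh("/my/name/is/the/following/string/name.lastname/file.txt")
shallow = seventh("/a/b/c.txt")  # hypothetical path with only three components
print(deep, shallow)
```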

If you want a more flexible solution, i.e. the string between the
last two slashes (not necessarily the 6th and 7th), you can use:
\/([^\/]+)\/(?!.*\/)
Meaning:
\/ - A slash.
([^\/]+) - Capturing group No 1 - a sequence of chars other than a slash.
This is what you actually want to match.
\/ - Another slash.
(?! - Negative lookahead for:
.*\/ - a sequence of any chars and a slash.
) - End of negative lookahead (works even in JavaScript version of Regex).
In effect, the negative lookahead means: no slash may occur
anywhere further on in the string.
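The negative-lookahead version can be sketched in Python as follows. The helper name parent_segment and the second sample path are assumptions for illustration; the cross-check against plain string splitting is only there to show that both routes agree on these inputs.

```python
import re

# A segment enclosed in slashes, with no further slash anywhere after it.
between_last_two_slashes = re.compile(r"/([^/]+)/(?!.*/)")

def parent_segment(path):
    """Return the text between the last two slashes, or None."""
    m = between_last_two_slashes.search(path)
    return m.group(1) if m else None

paths = [
    "/my/name/is/the/following/string/name.lastname/file.txt",
    "/a/b/c.txt",  # hypothetical shorter path
]
segments = [parent_segment(p) for p in paths]
by_split = [p.split("/")[-2] for p in paths]  # non-regex cross-check
print(segments)
```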

Try this; it will match the segment in the 6th or 7th position:
([a-z\.]*)(?=\/[a-z]*\.txt)
(?=\/[a-z]*\.txt) is a lookahead that checks that a .txt file name follows
([a-z\.]*) is the capturing group that captures the name

((\/)[a-b]*).[^\/]{12}
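Hi, please try the regex above; it should return what you are expecting. Note that it relies on the target segment being exactly 13 characters long, and that the match includes the leading slash.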

Related

capturing values after an optional slash

I am trying to write in regex a string that allows me to have
an alphanumeric string of length no longer than 5 (as an example) [a-z0-9]{3,5}
followed by an optional forward slash /?
that cannot end in a 3
I want to capture any group of at least 3, with or without a slash, and then anything after it.
And I am having a very hard time accomplishing this. If I require the slash / it is much easier to do so.
When I try
(?=.+\/?.+)[a-z0-9]{2,5}\/?(?<!3\/|3)
I can capture what I want - up until the slash, but can't crack how to get anything after IF legit things occur
(?=.+\/?.+)[a-z0-9]{2,62}\/?.?
My requirement for length goes up by 1 - to 4 instead of 3 - due to the additional . I put after the \/?. I could change my match to account for it, but it becomes really difficult.
(?=.+\/?.+)[a-z0-9]{2,5}\/?(?<!3\/|3)$
This only gives me the last slash or non-slash followed by 2 to 5 characters.
(?=.+\/?.+)[a-z0-9]{2,62}\/?.*
or
(?=.+\/?.+)[a-z0-9]{2,62}\/?.?+
simply then ignores my ending rule of not being able to close with 3/ or 3. It also allows me to use more than 5 characters before the slash. Definitely not what I want :)
Is there a way to make an optional field still maintain length and ending rules?
I am running this script on both regexr.com and https://www.w3schools.com/jsref/tryit.asp?filename=tryjsref_regexp and gitbash and not getting the results I would like
Try:
^[a-z0-9]{3,5}(?<!3)(?:$|\/.*)
^ - beginning of the string
[a-z0-9]{3,5} - match a character in a-z or 0-9, between 3 and 5 times
(?<!3) - negative lookbehind: the last character matched so far must not be 3
(?:$|\/.*) - match either end of string $ or / and any number of characters.
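To see how the lookbehind version behaves, here is a small Python sketch. The sample strings and their expected outcomes are made up for illustration and are not from the question.

```python
import re

rule = re.compile(r"^[a-z0-9]{3,5}(?<!3)(?:$|/.*)")

# Hypothetical inputs mapped to whether they should be accepted.
samples = {
    "abc": True,           # 3 chars, no slash
    "abc/anything": True,  # slash plus arbitrary tail
    "abc3d/x": True,       # a 3 inside the run is fine, only the last char matters
    "ab3": False,          # ends in 3
    "ab3/x": False,        # 3 directly before the slash
    "abcdef": False,       # more than 5 chars before the end
    "ab": False,           # fewer than 3 chars
}

results = {s: rule.search(s) is not None for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

The length limit and the ending rule both survive the optional slash because the alternation (?:$|/.*) only allows the end of the string or a slash directly after the 3 to 5 characters.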
If the last character of this run should not be a 3, you can instead exclude it from the final character class, e.g. [a-z0-24-9]:
^[a-z0-9]{2,4}[a-z0-24-9](?:\/.*)?$
Explanation
^ Start of string
[a-z0-9]{2,4} Match 2-4 chars in the ranges a-z and 0-9
[a-z0-24-9] Match a single char in a-z, 0-2 or 4-9 (anything in the set except 3)
(?:\/.*)? Optionally match / and the rest of the line
$ End of string
If a 3 must not appear anywhere before the slash:
^[a-z0-24-9]{3,5}(?:\/.*)?$

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is an explanation of your regex pattern; if you only want the solution, skip to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits 0-9 (there must be at least one digit). Whatever this group matches is saved in the first capturing group.
(?:\w)+ Non-capturing group: \w matches one word character, equivalent to [A-Za-z0-9_], and the plus sign + after the group means match the whole group one or more times. Since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; \w+ can be used directly.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan matches the word scan in scan-02, and \- matches the hyphen after it, giving scan-. \d+ (one or more digits) at first matches the 02 after scan-, so the value so far is scan-02. Next comes (?:\w)+, which must match at least one word character. It tries to match the period ., but fails, because the period is not a word character. Is it over at this point? No: the regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ use +, each needs at least one character (a digit and a word character respectively), and the engine gives back characters from \d+ until both requirements are met.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) matches scan-2, and (?:\w)+ tries to match the period after scan-2 but fails. This is the important point: \d+ has matched only one digit and cannot give it back, so backtracking does not help. The engine then restarts the attempt one character further along, at the c in scan-2.shadowserver.org; the s in (scan\-\d+) fails to match c, and it keeps trying at each later position until the overall match fails.
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
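The backtracking described above can be observed directly. The following Python sketch compares the original pattern with the simple solution on the input text from the question; IIS URL Rewrite uses its own regex engine, so treat this only as an illustration of the matching logic, and the helper name group1 as a hypothetical.

```python
import re

original = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")
fixed = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

agents = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

def group1(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

for agent in agents:
    print(agent, "| original:", group1(original, agent), "| fixed:", group1(fixed, agent))
```

With the original pattern the single-digit host does not match at all, and scan-02 is captured as scan-0 because \d+ had to give one digit back to (?:\w)+.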

Using regex to determine straight (unordered hand)

A straight in poker is five cards in a row, for example 23456 or 89TJQ. With a "sorted" hand, the regex could be written as:
^(A2345|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)$
It's a bit verbose but straightforward enough. However, would it be possible to generate a (sensible) regex if the hand was unordered? For example, if the hand was 52634 or JQ89T?
One possible way would be to use a (?=.*<item>) lookahead for each card (which makes the check independent of order), for example:
^(?:
(?=.*A)(?=.*2)(?=.*3)(?=.*4)(?=.*5)
|(?=.*2)(?=.*3)(?=.*4)(?=.*5)(?=.*6)
|(?=.*3)(?=.*4)(?=.*5)(?=.*6)(?=.*7)
|(?=.*4)(?=.*5)(?=.*6)(?=.*7)(?=.*8)
|(?=.*5)(?=.*6)(?=.*7)(?=.*8)(?=.*9)
|(?=.*6)(?=.*7)(?=.*8)(?=.*9)(?=.*T)
|(?=.*7)(?=.*8)(?=.*9)(?=.*T)(?=.*J)
|(?=.*8)(?=.*9)(?=.*T)(?=.*J)(?=.*Q)
|(?=.*9)(?=.*T)(?=.*J)(?=.*Q)(?=.*K)
|(?=.*T)(?=.*J)(?=.*Q)(?=.*K)(?=.*A)
)
.{5}$
Are there other / better approaches to finding if a straight exists using regex only?
You can use the following regex:
(?!.*(.).*\1)(?:[A2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789T]{5}|[789TJ]{5}|[89TJQ]{5}|[9TJQK]{5}|[TJQKA]{5})
This works by first using a negative lookahead to ensure that the string doesn't contain any duplicates (?!.*(.).*\1). Then it matches 5 characters from any of the straight possibilities.
(?!.*(.).*\1)
(?! ... ) - negative lookahead ensuring what follows doesn't match
.* - match any character any number of times
(.) - capture a character into capture group #1
.* - match any character any number of times
\1 - match the same text as most recently matched by the 1st capture group
Against JQQ89, it works as follows:
- .* matches J
- (.) captures Q
- .* matches nothing
- \1 tries to match Q (and succeeds)
- Negative lookahead has a match, so fail the match.
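A minimal Python sketch of the duplicate-check approach. Building the ten character classes from a rank string, the name is_straight, and the use of fullmatch to anchor the pattern to the whole hand are assumptions of this sketch, not part of the answer.

```python
import re

RANKS = "A23456789TJQKA"  # ace at both ends, as in the question's list of straights
windows = [RANKS[i:i + 5] for i in range(10)]  # "A2345", "23456", ..., "TJQKA"

# No repeated card, then five cards all drawn from one window of consecutive ranks.
straight = re.compile(
    r"(?!.*(.).*\1)(?:" + "|".join(f"[{w}]{{5}}" for w in windows) + ")"
)

def is_straight(hand):
    """True if the five-character hand is a straight in any order."""
    return straight.fullmatch(hand) is not None

for hand in ["52634", "JQ89T", "JQQ89", "KA234", "22345"]:
    print(hand, is_straight(hand))
```

Five distinct cards taken from a set of five consecutive ranks must be exactly that set, which is why the duplicate check plus a character class is enough.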

RegEx - if then else

I am trying to work out a regex expression but struggle with conditionals. I have a list of 100s of URLs that look like this:
/name/something/details/55334
/name/page/1/2
/name/somethingdifferent/34523
/name/page/1
/name/something/553/1
Bottom line: I want to remove everything from the first number onwards, except when the segment just before the number is the word 'page'.
1. /name/something/details/
2. /name/page/1/2
3. /name/somethingdifferent/
4. /name/page/1
5. /name/something
I will be removing it with Google Analytics Content Grouping or potentially with DataStudio. I already removed /name/ so I have:
1. /something/details/55334
2. /page/1/2
3. /somethingdifferent/34523
4. /page/1
5. /something/553/1
but want to add another rule and remove the numbers so I get:
1. /something/details/
2. /page/1/2
3. /somethingdifferent/
4. /page/1
5. /something
have already tried:
\(?(?=(page\/[0-9]+))(\2)|(\/\d+)
following the syntax of:
(?(?=condition))(IF)|(ELSE)
but it highlights all numbers after text.
Thanks for your help.
sampak
Try ^(\/page.*|[^0-9]*); it works with your examples.
A version that also handles the /name prefix: ^(page[\/\d]*|[^\d\s])*
One option might be to match not a whitespace or digit while not matching /page.
Then match a forward slash and 1+ digits followed by any char 0+ times to omit that from the result.
^((?:(?!\/page)[^\d\s])*\/)\d.*
In parts
^ Start of string
( Capture group 1
(?: Non capturing group
(?!\/page) Negative lookahead, assert what is directly to the right is not /page
[^\d\s] Match any char except a digit or whitespace char
)* Close non capturing group and repeat 0+ times
\/ Match /
) Close group 1
\d.* Match a digit followed by any char except a newline 0+ times
In the replacement use the first capturing group
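The replace step can be tried out with a short Python sketch on the five sample URLs (with /name already removed). Google Analytics and Data Studio use their own regex flavours, so this only illustrates the pattern's logic; the variable names are hypothetical.

```python
import re

# Group 1: everything up to and including the slash before the first number,
# as long as no "/page" starts inside it.
strip_numbers = re.compile(r"^((?:(?!/page)[^\d\s])*/)\d.*")

urls = [
    "/something/details/55334",
    "/page/1/2",
    "/somethingdifferent/34523",
    "/page/1",
    "/something/553/1",
]

# Replace the whole match with group 1; URLs that do not match stay unchanged.
cleaned = [strip_numbers.sub(r"\1", u) for u in urls]
for before, after in zip(urls, cleaned):
    print(before, "->", after)
```

Note that the last URL becomes /something/ with a trailing slash, since group 1 always ends with the slash before the number.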
If you also want to remove /name you could use:
^\/name((?:(?!\/page)[^\d\s])*\/)\d.*

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with row names as follows:
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried following with no luck...
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.
Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.
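For readers who want to try the same pattern outside R, here is a rough Python equivalent of the gsub call, using re.sub with the identical regex. The switch to Python is an assumption made for illustration; the original answer is R.

```python
import re

names = [
    "A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240",
    "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240",
    "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240",
]

# Match the whole name and replace it with group 1:
# the text after the first dash, up to (not including) the first period.
between = [re.sub(r"^[^-]*-([^.]+).*", r"\1", n) for n in names]
print(between)
```

Because [^.]+ stops at the first period, this equals "up to the last period" only when each name contains a single period, as in the sample data.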
This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
The outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3
Regex
-([^.]+)\\.
Description
- matches the character - literally
1st Capturing group ([^.]+)
[^.]+ matches a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
. matches the character . literally
\\. matches the character . literally (the backslash is doubled for the R string)
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`
This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it looks for a - immediately before the text that is actually matched;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it looks for a period (escaped as \\.) following what is actually captured.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* matches any character, repeated 0 to as many as possible times (here up to the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.
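The lookaround pattern can also be tried in Python, whose re module supports the same lookbehind and lookahead syntax. This is a sketch rather than a translation of stringr; the helper name extract and the second name with two periods are hypothetical, added to show that the greedy .* runs to the last period while a [^.]+ capture stops at the first.

```python
import re

lookaround = re.compile(r"(?<=-).*(?=\.)")

def extract(name):
    """Text after the first dash, up to (not including) the last period."""
    m = lookaround.search(name)
    return m.group(0) if m else None

single = extract("A17-R-Null-C-3.AT2G41240")

# Hypothetical name containing two periods.
two_periods = "A17-R-Null-C-3.1.AT2G41240"
to_last_period = extract(two_periods)
to_first_period = re.sub(r"^[^-]*-([^.]+).*", r"\1", two_periods)

print(single, to_last_period, to_first_period)
```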