RegEx: Get every word until last 4 words

RegEx: Get every word until last 4 words - regex

I have strings like
wwww-wwww-wwww
wwww-www-ww-ww
Many w separated with -
But it's not regular wwww-wwww, it could be w-w-w-w as well
I try to find a regex that capture every word until the last 4 words.
So the result for example 1 would be the first 8w's (wwww-wwww)
For 2nd example the first 5w's (wwww-w)
Is it possible to do this in regex?
I have something like this right now:
^\w*(?=\w{4}$)
or maybe
[^-]*(?=\w{4}$)
I have 2 problems with my "solutions":
the last 4 words will not be captured for example 2. They are interrupted by the -
the words before the last 4 will not be captured. They are interrupted by the -.

Yes, it's possible with a slightly more sophisticated lookahead assertion:
/\w(?=(?:-*\w){4,}$)/x
Explanation:
/ # Start of regex
\w # Match a "word" character
(?= # only if the following can be matched afterwards:
(?: # (Start of capturing group)
-* # - zero or more separators
\w # - exactly one word character
){4,} # (End of capturing group), repeated 4 or more times.
$ # Then make sure we've reached the end of the string.
) # End of lookahead assertion/x
Test it live on regex101.com.

Related

Extract a sub-string from a matched string

I am attempting to extract a sub-string from a string after matching for 24 at the beginning of the string. The substring is a MAC id starting at position 6 till the end of the string. I am aware that a sub string method can do the job. I am curious to know a regex implementation.
String = 2410100:80:a3:bf:72:d45
After much trial and error, this the reg-ex I have which I think is convoluted.
[^24*$](?<=^\S{6}).*$
How can this reg-ex be modified to match for 24, then extract the substring from position 6 till the end of the line?
https://regex101.com/r/vcvfMx/2
Expected Results: 00:80:a3:bf:72:d45

You can use:
(?<=^24\S{3}).*$
Here's a demo: https://regex101.com/r/HqT0RV/1/
This will get you the result you expect (i.e., 00:80:a3:bf:72:d45). However, that doesn't seem to be a valid MAC address (the 5 at the end seems to be not part of the MAC). In which case, you should be using something like this:
(?<=^24\S{3})(?:[0-9a-f]{2}:){5}[0-9a-f]{2}
Demo: https://regex101.com/r/HqT0RV/2
Breakdown:
(?<= # Start of a positive Lookbehind.
^ # Asserts position at the beginning of the string.
24 # Matches `24` literally.
\S{3} # Matches any three non-whitespace characters.
) # End of the Lookbehind (five characters so far).
(?: # Start of a non-capturing group.
[0-9a-f] # A number between `0` and `9` or a letter between `a` and `f` (at pos. #6).
{2} # Matches the previous character class exactly two times.
: # Matches `:` literally.
) # End of the non-capturing group.
{5} # Matches the previous group exactly five times.
[0-9a-f] # Any number between `0` and `9` or any letter between `a` and `f`.
{2} # Matches the previous character class exactly two times.

How to use regular expression to use as few groups as possible to match as long string as possible

For example, this is the regular expression
([a]{2,3})
This is the string
aaaa // 1 match "(aaa)a" but I want "(aa)(aa)"
aaaaa // 2 match "(aaa)(aa)"
aaaaaa // 2 match "(aaa)(aaa)"
However, if I change the regular expression
([a]{2,3}?)
Then the results are
aaaa // 2 match "(aa)(aa)"
aaaaa // 2 match "(aa)(aa)a" but I want "(aaa)(aa)"
aaaaaa // 3 match "(aa)(aa)(aa)" but I want "(aaa)(aaa)"
My question is that is it possible to use as few groups as possible to match as long string as possible?

How about something like this:
(a{3}(?!a(?:[^a]|$))|a{2})
This looks for either the character a three times (not followed by a single a and a different character) or the character a two times.
Breakdown:
( # Start of the capturing group.
a{3} # Matches the character 'a' exactly three times.
(?! # Start of a negative Lookahead.
a # Matches the character 'a' literally.
(?: # Start of the non-capturing group.
[^a] # Matches any character except for 'a'.
| # Alternation (OR).
$ # Asserts position at the end of the line/string.
) # End of the non-capturing group.
) # End of the negative Lookahead.
| # Alternation (OR).
a{2} # Matches the character 'a' exactly two times.
) # End of the capturing group.
Here's a demo.
Note that if you don't need the capturing group, you can actually use the whole match instead by converting the capturing group into a non-capturing one:
(?:a{3}(?!a(?:[^a]|$))|a{2})
Which would look like this.

Try this Regex:
^(?:(a{3})*|(a{2,3})*)$
Click for Demo
Explanation:
^ - asserts the start of the line
(?:(a{3})*|(a{2,3})*) - a non-capturing group containing 2 sub-sequences separated by OR operator
(a{3})* - The first subsequence tries to match 3 occurrences of a. The * at the end allows this subsequence to match 0 or 3 or 6 or 9.... occurrences of a before the end of the line
| - OR
(a{2,3})* - matches 2 to 3 occurrences of a, as many as possible. The * at the end would repeat it 0+ times before the end of the line
-$ - asserts the end of the line

Try this short regex:
a{2,3}(?!a([^a]|$))
Demo
How it's made:
I started with this simple regex: a{2}a?. It looks for 2 consecutive a's that may be followed by another a. If the 2 a's are followed by another a, it matches all three a's.
This worked for most cases:
However, it failed in cases like:
So now, I knew I had to modify my regex in such a way that it would match the third a only if the third a is not followed by a([^a]|$). So now, my regex looked like a{2}a?(?!a([^a]|$)), and it worked for all cases. Then I just simplified it to a{2,3}(?!a([^a]|$)).
That's it.
EDIT
If you want the capturing behavior, then add parenthesis around the regex, like:
(a{2,3}(?!a([^a]|$)))

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with column names as follows;
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried following with no luck...
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.

Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
See IDEONE demo
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.

This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
See a demo on regex101.com. Outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3

Regex
-([^.]+)\\.
Description
- matches the character - literally
1st Capturing group ([^\\.]+)
[^\.]+ match a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
. matches the character . literally
\\. matches the character . literally
Debuggex Demo
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`

This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it looks for a - immediately before the next captured group;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it looks for a period (escaped as \\.) following what is actually captured.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* captures any character, repeated 0 to as many as possible times (here the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.

regex conditional statement (only parse numbers if string has a certain beginning)

first I used a string that returned my relaystates, so “1.0.0.0.1.1.0.0” would get parsed/grouped with \d+,
then my eight switches used ‘format response’, e.g. {1} to get the state for each switch.
now I need to get the numbers out of this string: “RELAYS.1.0.0.0.1.1.0.0”
\d+ will still get the numbers but I only want to get them IF the string starts with “RELAYS"
can anyone please explain how I could do that?
thnx a million in advance!
Edited icebear (today 00:24)

With a .NET engine, you could use the regex (?<=^RELAYS[\d.]*)\d+. But most regex engines don't support indefinite repetition in a negative lookbehind assertion.
See it live on regexhero.net.
Explanation:
(?<= # Assert that the following can be matched before the current position:
^RELAYS # Start of string, followed by "RELAYS"
[\d.]* # and any number of digits/dots.
) # End of lookbehind assertion
\d+ # Match one or more digits.
With a PCRE engine, you could use (?:^RELAYS\.|\G\.)(\d+) and access group 1 for each match.
See it live on regex101.com.
Explanation:
(?: # Start a non-capturing group that matches...
^RELAYS\. # either the start of the string and "RELAYS."
| # or
\G\. # the position after the previous match, followed by "."
) # End of non-capturing group
(\d+) # Match a number and capture it in group 1

I need a regx to validate a name that can be 1, 2, or 3 words

In this example I try to validate for a city name. It works if I enter San Louis Obispo but not if I enter Boulder Creek or Boulder. I thought ? was supposed to make a block optional.
if (!/^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$/.test(field)){
return "Enter City only a-z A-Z .\' allowed and not over 20 characters.\n";
}

I think spaces are the problem (\s). You made second and third words optional (by using * instead of +), but not the spaces. Question mark is only being applied to the third word because of parentheses.

The issue with your regex is that, in english, it says to match a word that's required to be followed by a space that's optionally followed by another word but then is required to have another space and then optionally another word. So, a single-word would not match - however, a word followed by two spaces would. Additionally two words that have a space at the end would also match - but neither without the trailing spaces would match.
To fix your exact regex you should add another grouping (non-matching group with (?: instead of just () around the second word to the end of the sentence) and have this group as optional with ?. Also, move the \s's inside the optional groups as well.
Try this:
^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+(?:\s[a-zA-Z']+)?)?$
Regex explaind:
^ # beginning of line
[a-zA-Z'-]+ # first matching word
(?: # start of second-matching word
\s[a-zA-Z'-]+ # space followed by matching word
(?: # start of third-matching word
\s[a-zA-Z']+ # space followed by matching word
)? # third-matching word is optional
)? # second-matching word is optional
$ # end of line
Alternatively, you can try the following regex:
^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2})$
This will match 1 through 3 words, or "cities", in a given line with the ability to adjust the range of words without having to further-duplicate the matching set for each new word.
Regex explained:
^( # start of line & matching group
[a-zA-Z'-]+ # required first matching word
(?: # start a non-matching group (required to "match", but not returned as an individual group)
\s # sub-group required to start with a space
[a-zA-Z'-]+ # sub-group matching word
){0,2} # sub-group can match 0 -> 2 times
)$ # end of matching group & line
So, if you want to add the ability to match more than 3 words, you can change the 2 in the {0,2} range above to be the number of words you want to match minus 1 (i.e. if you want to match 4 words, you'll set it to {0,3}).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

RegEx: Get every word until last 4 words - regex

Related

Extract a sub-string from a matched string

How to use regular expression to use as few groups as possible to match as long string as possible

R- regex extracting a string between a dash and a period

regex conditional statement (only parse numbers if string has a certain beginning)

I need a regx to validate a name that can be 1, 2, or 3 words

Categories

Resources