Perl regexp specific letters in string - regex

The input strings consist of the letters I, N, P, U, Y and X.
-I have to verify, with a Perl regexp, that a string contains only these letters and nothing else
-I also have to verify that the input contains at least 2 occurrences of "NP" (without quotes)
Example string:
INPYUXNPININNPXX
The strings are all uppercase.

You can use this lookahead-based regex in PCRE:
^(?=(?:.*?NP){2})[INPUYX]+$
Online Demo: http://regex101.com/r/zH3jQ3
Explanation:
^ assert position at start of a line
(?=(?:.*?NP){2}) Positive Lookahead - Assert that the regex below can be matched
(?:.*?NP){2} Non-capturing group
Quantifier: Exactly 2 times
.*? matches any character (except newline)
Quantifier: Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
NP matches the characters NP literally (case sensitive)
[INPUYX]+ match a single character present in the list below
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
INPUYX a single character in the list INPUYX literally (case sensitive)
$ assert position at end of a line
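As a rough illustration, the same pattern can be tried from Python, whose `re` module also supports lookaheads. Python is an assumption here (the question is about Perl), and the helper name `is_valid` and every test string other than the one from the question are made up for this sketch.

```python
import re

# The lookahead requires two occurrences of "NP";
# the character class restricts the alphabet.
PATTERN = re.compile(r'^(?=(?:.*?NP){2})[INPUYX]+$')

def is_valid(s):
    """Return True if s uses only I, N, P, U, Y, X and contains "NP" at least twice."""
    return PATTERN.match(s) is not None

print(is_valid("INPYUXNPININNPXX"))  # example string from the question
print(is_valid("INPYUX"))            # only one "NP"
print(is_valid("NPANP"))             # contains a letter outside the set
```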

Use this:
^[INPUYX]*NP[INPUYX]*?NP[INPUYX]*$
See it in action: http://regex101.com/r/vI2xQ6
Effectively, this allows 0 or more characters from your character class, matches the first (required) occurrence of NP, then ensures that NP occurs at least once more before the end of the string.
If you wanted to capture the text between the two NPs, you could do:
^(?=(?:(.*?)NP){2})[INPUYX]+$
Or, as @ikegami points out, to match ONLY a single line (rejecting a trailing newline): \A(?=(?:(.*?)NP){2})[INPUYX]+\z
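A small sketch of the capturing variant, written in Python as an assumption (the question is about Perl). Python's `re` spells the strict end-of-string anchor `\Z` where Perl uses `\z`; the variable names are illustrative.

```python
import re

# The group inside the repeated lookahead keeps its last iteration,
# i.e. the text between the first and the second "NP".
capture = re.compile(r'\A(?=(?:(.*?)NP){2})[INPUYX]+\Z')

m = capture.match("INPYUXNPININNPXX")
middle = m.group(1) if m else None
print(middle)

# With ^...$ a single trailing newline is still accepted;
# with \A...\Z it is rejected.
loose = re.compile(r'^[INPUYX]*NP[INPUYX]*?NP[INPUYX]*$')
strict = re.compile(r'\A[INPUYX]*NP[INPUYX]*?NP[INPUYX]*\Z')
print(bool(loose.match("NPNP\n")), bool(strict.match("NPNP\n")))
```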

The cleanest solution is:
/^[INPUXY]*\z/ && /NP.*NP/s
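For comparison, a sketch of the same two-step check in Python (an assumption; the answer itself is Perl). `re.fullmatch` stands in for `^...\z`, and `re.DOTALL` corresponds to the `/s` flag. The function name `valid_two_step` is hypothetical.

```python
import re

def valid_two_step(s):
    # Step 1: only the allowed letters, anchored at both ends.
    only_allowed = re.fullmatch(r'[INPUXY]*', s) is not None
    # Step 2: "NP" appears at least twice somewhere in the string.
    two_np = re.search(r'NP.*NP', s, re.DOTALL) is not None
    return only_allowed and two_np

print(valid_two_step("INPYUXNPININNPXX"))
print(valid_two_step("INPYUX"))
```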
The following is the most efficient as it avoids matching the string twice and it prevents backtracking on failure:
/
^
(?: (?:[IPUXY]|N(?!P))* NP ){2}
[INPUXY]*
\z
/x
To capture what's between the two NP, you can use
/
^
(?:[IPUXY]|N(?!P))* NP
( (?:[IPUXY]|N(?!P))* ) NP
[INPUXY]*
\z
/x
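A Python sketch of the single-pass idea, using `re.VERBOSE` in place of `/x` (Python is an assumption; the answer is Perl). Note one deliberate choice: the "N not followed by P" branch is written here as `N(?!P)`, which also accepts runs of N such as `NNP`. The names `SINGLE_PASS` and `between_nps` are illustrative, and `\Z` is Python's spelling of `\z`.

```python
import re

SINGLE_PASS = re.compile(r'''
    \A
    (?: [IPUXY] | N(?!P) )* NP        # anything up to the first NP
    ( (?: [IPUXY] | N(?!P) )* ) NP    # captured text up to the second NP
    [INPUXY]*                         # remaining allowed letters
    \Z
''', re.VERBOSE)

def between_nps(s):
    """Return the text between the first two NPs, or None if s is invalid."""
    m = SINGLE_PASS.match(s)
    return m.group(1) if m else None

print(between_nps("INPYUXNPININNPXX"))
print(between_nps("NNPINNP"))
print(between_nps("INPYUX"))
```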

Related

How to match a pattern which is repeated more than nth times non consecutively?

I am trying to match a pattern "city, state" (e.g., "Austin, TX") on this sample vector
> s <- c("Austin, TX", "Forth Worth, TX", "Ft. Worth, TX",
"Austin TX", "Austin, TX, USA", "Ft. Worth, TX, USA")
> grepl('[[:alnum:]], [[:alnum:]$]', s)
[1] TRUE TRUE TRUE FALSE TRUE TRUE
However, there are two cases where I would like to get FALSE:
-when there is more than 1 comma (e.g., "Austin, TX, USA")
-when there is another punctuation mark before the comma (e.g., "Ft. Worth, TX")
You can use the following regex pattern:
grepl("^[a-z ]+, [a-z]+$", s, perl=TRUE, ignore.case=TRUE)
Regex Explanation:
^[a-z ]+, [a-z]+$/gmi
^ assert position at start of a line
[a-z ]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
(space) the literal space character
, (comma, space) matches a comma followed by a space literally
[a-z]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case insensitive)
$ assert position at end of a line
ignore.case: insensitive. Case insensitive match (ignores case of [a-zA-Z])
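To see how the pattern behaves on the sample vector, here is a rough Python equivalent (Python is an assumption; the question uses R). `re.fullmatch` plays the role of the `^` and `$` anchors and `re.IGNORECASE` that of `ignore.case=TRUE`.

```python
import re

samples = ["Austin, TX", "Forth Worth, TX", "Ft. Worth, TX",
           "Austin TX", "Austin, TX, USA", "Ft. Worth, TX, USA"]

# Letters and spaces, one comma plus space, then letters only.
city_state = re.compile(r'[a-z ]+, [a-z]+', re.IGNORECASE)

results = [city_state.fullmatch(s) is not None for s in samples]
print(results)  # [True, True, False, False, False, False]
```

Only the first two entries pass: the others contain a period, lack the comma, or have a second comma.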
Here is the RegEx: ^([^.,])+,\s([^.,])+$
^ assert position at start of the string
1st Capturing group ([^\.,])+
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\.,] match a single character not present in the list below
\. matches the character . literally
, the literal character ,
, matches the character , literally
\s match any white space character [\r\n\t\f ]
2nd Capturing group ([^\.,])+
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
[^\.,] match a single character not present in the list below
\. matches the character . literally
, the literal character ,
$ assert position at end of the string
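The note about repeated capturing groups can be checked with a short Python sketch (an assumption; the explanation itself is language-neutral). Moving the `+` inside the parentheses makes each group capture the whole field. The variable names are illustrative.

```python
import re

text = "Austin, TX"

# Quantifier outside the group: only the last repetition is kept.
repeated = re.match(r'^([^.,])+,\s([^.,])+$', text)
# Quantifier inside the group: the whole run is captured.
whole = re.match(r'^([^.,]+),\s([^.,]+)$', text)

print(repeated.groups())  # ('n', 'X')
print(whole.groups())     # ('Austin', 'TX')
```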

How to regex just the second half of firsthalf.secondhalf():

I'm having trouble getting a regex to work for this. I basically want to recognize just the second half of something like firsthalf.secondhalf(): as a function. So in the example above, only the .secondhalf(): part would be recognized separately and given a different color than firsthalf.
I've tried, but to no avail:
<regex>(\w*()\b)</regex>
Try following:
(\w*\(\):)
\w*\(\)\:$
\w* match any word character [a-zA-Z0-9_]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\( matches the character ( literally
\) matches the character ) literally
\: matches the character : literally
$ assert position at end of the string
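A quick Python check of the pattern explained above (Python is an assumption, since the question does not name a language). The second pattern, which also includes the leading dot, is a hypothetical variant added for illustration.

```python
import re

line = "firsthalf.secondhalf():"

# Method name plus "():" anchored at the end of the string.
name_only = re.search(r'\w*\(\):$', line)
# Hypothetical variant that also includes the leading dot.
with_dot = re.search(r'\.\w+\(\):$', line)

print(name_only.group())  # secondhalf():
print(with_dot.group())   # .secondhalf():
```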

regex findall to retrieve a substring based on start and end character

I have the following string:
6[Sup. 1e+02]
I'm trying to retrieve a substring of just 1e+02. The variable first refers to the above specified string. Below is what I have tried.
re.findall(' \d*]', first)
You need to use the following regex:
\b\d+e\+\d+\b
Explanation:
\b - Word boundary
\d+ - Digits, 1 or more
e - Literal e
\+ - Literal +
\d+ - Digits, 1 or more
\b - Word boundary
Sample code:
import re
# Raw string so the backslashes reach the regex engine unchanged
p = re.compile(r'\b\d+e\+\d+\b')
test_str = "6[Sup. 1e+02]"
print(p.findall(test_str))  # ['1e+02']
import re
first = "6[Sup. 1e+02]"
result = re.findall(r"\s+(.*?)\]", first)
print(result)
Output:
['1e+02']
Demo
http://ideone.com/Kevtje
regex Explanation:
\s+(.*?)\]
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “]” literally «\]»

Need a regular expression to match a pattern

I need a regular expression to match below pattern
Word1 OR Word2 OR Word3 OR......
Basically, this is a string that contains words separated by OR.
You can do:
(\w+)(?=(?:\s+OR)|(?:\s*$))
The following will match based on what you've given:
^\w+(?: OR \w+)*$
^ assert position at start of the string
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
(?: OR \w+)* Non-capturing group
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
_OR_ matches the characters _OR_ literally (case sensitive)
\w+ match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ assert position at end of the string
NOTE: I used _OR_ to show the spaces around OR in the quotation as the whitespace was ignored.
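The validating pattern can be tried from Python (an assumption; no language is given in the question). The sketch first checks the overall shape and then splits on the literal separator; the function name `split_or_words` is made up for the example.

```python
import re

def split_or_words(s):
    """Return the words of 'A OR B OR C', or None if s has another shape."""
    # Validate: a word, then zero or more " OR word" repetitions.
    if re.fullmatch(r'\w+(?: OR \w+)*', s) is None:
        return None
    # Extract: splitting on the literal separator yields the words.
    return s.split(" OR ")

print(split_or_words("Word1 OR Word2 OR Word3"))
print(split_or_words("Word1 OR"))
```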
Do something like this and access the second group using \2:
((\w+)\sOR)*
Try playing with the link below, read the explanation and you'll understand how it works:
https://regex101.com/r/vY6mA7/1

Regex to contain one of three characters?

I need to write a regex that matches strings that have at least one of three characters, say x, y and z. I tried "[xyz]^" but it doesn't work. The string may contain any other characters but must contain at least one of the three given characters, in any order or position.
Regex Demo
\b\w*(x|y|z)\w*\b
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
\w* match any word character [a-zA-Z0-9_]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (x|y|z)
1st Alternative: x
x matches the character x literally (case sensitive)
2nd Alternative: y
y matches the character y literally (case sensitive)
3rd Alternative: z
z matches the character z literally (case sensitive)
\w* match any word character [a-zA-Z0-9_]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
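In Python, for instance, the difference between finding the words that contain one of the letters and testing the whole string might look like this (a sketch; Python and the sample sentence are assumptions made for illustration):

```python
import re

sentence = "apple zebra box"

# Word-level: each match is a whole word containing x, y or z.
words = [m.group() for m in re.finditer(r'\b\w*(x|y|z)\w*\b', sentence)]
print(words)  # ['zebra', 'box']

# String-level: a plain character class search is enough to test
# whether any of the three letters occurs anywhere in the string.
has_any = re.search(r'[xyz]', sentence) is not None
print(has_any)  # True
```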
This might be what you are looking for:
^.*[xyz].*$
The following regex should match:
^.*[xyz].*$
In Python:
>>> import re
>>> re.match(r'^.*[xyz].*$', 'AzE')
<re.Match object; span=(0, 3), match='AzE'>
>>> re.match(r'^.*[xyz].*$', 'AEz')
<re.Match object; span=(0, 3), match='AEz'>
>>> re.match(r'^.*[xyz].*$', 'AE')
>>>