regex in parenthesis at the beginning - regex

I have a regex trying to divide questions by speciality. Say I have the following regex:
(?P<speciality>[0-9x]+)
It works fine for this question (correct match: 7)
(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;
And for this (correct match: 8 and 13)
(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:
But not for this one (incorrect match: 20).
First trimester spontaneous abortion (before 20 wk) is most commonly due to:
I only need the numbers in parentheses at the beginning of the question, all other parentheses should be ignored. Is this possible with a regex alone (lookahead?).

If your regex flavor supports \G continuous matching and \K reset beginning of match, try:
(?:^\(|\G,)\K[\dx]+
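^\( matches an opening parenthesis at the start of the string; the alternative \G, matches a comma at the position where the previous match ended. \K then resets the start of the reported match, so only the following [\dx]+ (one or more digits or x) is returned. (\d is shorthand for [0-9].) The matches will be in $0.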
Test at regex101.com; Regex FAQ
PHP example
$str = "(1x,2,3x) abc (1,2x,3) d";
preg_match_all('~(?:^\(|\G,)\K[\dx]+~', $str, $out);
print_r($out[0]);
Array
(
[0] => 1x
[1] => 2
[2] => 3x
)
Test at eval.in

Perhaps something like this will work (you don't mention the regex flavor that you're using, though I am guessing it is PCRE by the use of the named group - and yes, it does use positive lookahead):
/^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))/mg
The caret ^ combined with the multiline modifier m (which causes the anchors ^ and $ to match at the beginning and end of each line, instead of only at the beginning and end of the string) ensures that the match is at the start of the paragraph. The specialities are captured in the speciality named capture group; the only caveat is that if more than one speciality is given (as in your example starting with (8,13)), the capture will be a comma-delimited list (8,13 in that case).
Please see Regex Demo here.
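As a rough illustration of how the multiline flag behaves, here is a minimal Python sketch of the same pattern. The sample text is stitched together from the question (the first line is shortened), re.MULTILINE stands in for the m modifier, and finditer stands in for g; none of this is from the original answer.

```python
import re

# Pattern from the answer above, without the /.../mg delimiters.
# re.MULTILINE plays the role of the m modifier.
PATTERN = re.compile(r"^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))", re.MULTILINE)

# Sample text assembled from the questions in the post (first line shortened).
text = (
    "(7)Which of the following is LEAST to be considered as a risk factor?\n"
    "(8,13)30 year old woman with amenorrhea\n"
    "First trimester spontaneous abortion (before 20 wk) is most commonly due to:"
)

# finditer walks over every match, like the g modifier.
captures = [m.group("speciality") for m in PATTERN.finditer(text)]
print(captures)  # ['7', '8,13']
```

The third line produces no match because its parenthesis is not at the start of the line.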

(?P<speciality>[0-9x]+) matches any nonempty sequence of digits (or the letter x) anywhere in the input. The parentheses just delimit the capturing group; they are not part of the match.
To match one or more numbers separated by commas, between parentheses at the beginning of the line, you could use something like this:
^\((\d+)(,(\d+))*\)
EDIT
It seems that a repeated capturing group, as in (,(\d+))*, only returns its last match. So to get the values, it is necessary to capture the complete list of numbers and parse it afterwards:
^\((?P<specialities>(\d+)(,(\d+))*)\)
This captures one or more numbers separated by commas, between parentheses.
The start-of-line anchor ensures the match is at the beginning of the line.
Demo
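The point about repeated groups can be checked with a short Python sketch. The input line "(8,13,21)..." is a made-up example, and the non-capturing inner groups are a small simplification of the pattern above.

```python
import re

line = "(8,13,21)Some question text"  # hypothetical input

# A repeated capturing group keeps only its last iteration.
repeated = re.match(r"^\((\d+)(,(\d+))*\)", line)
print(repeated.group(1), repeated.group(3))  # 8 21  (13 is lost)

# Capture the whole comma-separated list instead, then split it.
whole = re.match(r"^\((?P<specialities>\d+(?:,\d+)*)\)", line)
numbers = whole.group("specialities").split(",")
print(numbers)  # ['8', '13', '21']
```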

Related

How can I get the second part of a hyphenated word using regex?

For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct for getting either sh0rt or t3rm.
You swapped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
Regex | Demo | Explanation
\b([a-zA-Z0-9]+)- | Demo 1 | \b is a word boundary to prevent a partial word match; the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-) | Demo 2 | Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-).
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex | Demo | Explanation
-([a-zA-Z0-9]+)\b | Demo 3 | The other way around, with a leading -; get the value from capture group 1.
-\K[a-zA-Z0-9]+\b | Demo 4 | Match - and use \K to drop what has been matched so far from the reported match. Then match the allowed chars in the character class 1 or more times.
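For comparison, a minimal Python sketch of the same four ideas. Python's built-in re module has no \K, so a lookbehind is swapped in for the last variant; that substitution is an assumption of this sketch, not part of the answer.

```python
import re

word = "sh0rt-t3rm"

# Capture-group variants: the wanted text is in group 1.
first = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
second = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)

# Lookaround variants: the whole match is the wanted text.
first_la = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)
# No \K in Python's re, so a lookbehind stands in for it here.
second_lb = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first, second, first_la, second_lb)  # sh0rt t3rm sh0rt t3rm
```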
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side to put the match in list context, because that is when it returns the captures (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
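The effect of the greedy .* is easy to see side by side. This is a small Python sketch rather than Perl, and the input string with two hyphens is a made-up example.

```python
import re

s = "long-sh0rt-t3rm"  # hypothetical input with two hyphens

# Without a prefix, the match starts at the first hyphen.
after_first = re.search(r"-(.+)", s).group(1)

# A greedy .* first grabs everything, then backs off to the last hyphen.
after_last = re.search(r".*-(.+)", s).group(1)

print(after_first)  # sh0rt-t3rm
print(after_last)   # t3rm
```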
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's a part of a text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
Demo

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract occurrences of the clause match-this that are enclosed by anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the ~match-this~ occurrences, but when I tried to negate it with (?!(~match-this)), it matched only empty strings.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added the asterisk quantifier to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with a single regex, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
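A minimal Python sketch of this discard approach, using the string from the question. Reporting the start offsets is an addition for illustration only.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep only the matches where capture group 1 took part;
# the ~match-this~ alternative matches but leaves group 1 unset.
kept = [
    (m.start(1), m.group(1))
    for m in pattern.finditer(s)
    if m.group(1) is not None
]
print(kept)
# [(0, 'match-this'), (23, 'match-this'), (35, 'match-this'),
#  (46, 'match-this'), (68, 'match-this')]
```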
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end anyway, because the character is still consumed.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's a difference between telling the engine to "not match" a character and telling the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What puzzles me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing the matches, since they also come with starting positions and lengths: a first match at position 0 with length 10 means you look at input text positions -1 and 10.
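Both ideas, the three-way lookaround pattern and reading the neighbouring characters from the match positions, can be sketched in Python as follows. The bounds checks and the use of None for "no neighbour" are choices made for this sketch.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

rows = []
for m in pattern.finditer(s):
    # Lookarounds consume nothing, so fetch the neighbours from the input.
    before = s[m.start() - 1] if m.start() > 0 else None
    after = s[m.end()] if m.end() < len(s) else None
    rows.append((m.start(), before, after))

print(rows)
# [(0, None, '~'), (23, ' ', ' '), (35, '~', '#'), (46, '#', '~'), (68, '~', None)]
```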

Regular expression to find number in parentheses, but only at beginning of string

Disclaimer: I'm new to writing regular expressions, so the only problem may be my lack of experience.
I'm trying to write a regular expression that will find numbers inside of parentheses, and I want both the numbers and the parentheses to be included in the selection. However, I only want it to match if it's at the beginning of a string. So in the text below, I would want it to get (10), but not (2) or (Figure 50).
(10) Joystick Switch - Contains control switches (Figure 50)
Two (2) heavy lifting straps
So far, I have (\(\d+\)) which gets (10) but also (2). I know ^ is supposed to match the beginning of a string (or line), but I haven't been able to get it to work. I've looked at a lot of similar questions, both here and on other sites, but have only found parts of solutions (finding things inside of parentheses, finding just numbers at the beginning for a string, etc.) and haven't quite been able to put them together to work.
I'm using this to create a filter in a CAT tool (for those of you in translation) which means that there's no other coding languages involved; essentially, I've been using RegExr to test all of the other expressions I've written, and that's worked fine.
The regex should be
^\(\d+\)
^ Anchors the regex at the start of the string.
\( Matches (. Should be escaped as it has got special meaning in regex
\d+ Matches one or more digits
\) Matches the )
Capturing parentheses, as in (\(\d+\)), are not necessary here because the pattern matches nothing other than what you want. They are needed only when you want to extract part of a matched pattern.
For example, if you want to match (50) but extract only the digits, 50, you can use
\((\d+)\)
Here the \d+ part is inside capture group 1. That is, capture group 1 will be 50, whereas the entire matched string is (50).
Regex Demo
Like so:
^\(\d+\)
^ anchor
( and ) are regex metacharacters, so they need to be escaped with \ to match literally.
So \( and \) match literal parentheses.
Unescaped ( and ) form a capturing group.
\d+ matches 1 or more digits
Demo
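For readers who want to try the patterns outside a CAT tool, here is a small Python sketch using the two sample lines from the question. re.MULTILINE is added so that ^ also matches at the start of each line; that flag is an assumption about how the text is fed in.

```python
import re

text = (
    "(10) Joystick Switch - Contains control switches (Figure 50)\n"
    "Two (2) heavy lifting straps"
)

# Without the anchor, every parenthesized number matches.
unanchored = re.findall(r"\(\d+\)", text)

# With ^ (and MULTILINE), only a number at the start of a line matches.
anchored = re.findall(r"^\(\d+\)", text, re.MULTILINE)

# A capture group extracts just the digits from the match.
digits = re.search(r"\((\d+)\)", "(50)").group(1)

print(unanchored)  # ['(10)', '(2)']
print(anchored)    # ['(10)']
print(digits)      # 50
```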

grab n letter words don't count apostrophes regex

I'm trying to learn regex in R more deeply. I gave myself what I thought was an easy task that I can't figure out. I want to extract all 4 letter words. In these four letter words I want to ignore (don't count) apostrophes. I can do this without regex but want a regex solution. Here's a MWE and what I've tried:
text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
pattern <- "\\b[A-Za-z]{4}\\b(?!')"
pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"
regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))
Desired output:
[[1]]
[1] "This" "Jon's" "dogs'" "'bout" "word"
I thought the second pattern would work but it grabs words containing 5 characters as well.
This is a good challenging question and here is a tricky answer.
> x <- "This Jon's dogs' 'bout there in Mike's re'y word."
> re <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
> regmatches(x, gregexpr(re, x, perl=T))[[1]]
## [1] "This" "Jon's" "dogs'" "'bout" "word"
Explanation:
The idea is to skip any word patterns that consist of 5 or more letter characters and an optional apostrophe.
On the left side of the alternation operator we match the subpattern we do not want, then make it fail and force the regular expression engine not to retry that substring, using backtracking control verbs, as explained below:
(*SKIP) # advances to the position in the string where (*SKIP) was
# encountered signifying that what was matched leading up
# to cannot be part of the match
(?!) # equivalent to (*FAIL), causes matching failure,
# forcing backtracking to occur
The right side of the alternation operator matches what we want...
Additional Explanation:
Essentially, in simple terms you are using the discard technique.
(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)
You use the alternation operator in context placing what you want to exclude on the left, ( saying throw this away, it's garbage ) and place what you want to match in a capturing group on the right side.
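The capture-group form of the discard technique does not need (*SKIP), so it can be tried in engines without backtracking verbs. A minimal Python sketch, assuming the sample sentence from the question:

```python
import re

text = "This Jon's dogs' 'bout there in Mike's re'y word."

# Left side: 5+ letter words (discarded). Right side: 4-letter words (kept).
pattern = re.compile(r"(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)", re.IGNORECASE)

# findall returns group 1 for each match; discarded matches give ''.
words = [w for w in pattern.findall(text) if w]
print(words)  # ['This', "Jon's", "dogs'", "'bout", 'word']
```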
You can use this pattern:
(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])
You can use the discard technique and use a regex like this:
\b\w{0,2}\b(?:'\w)?|\b\w{3}(?!')\b|\b\w{5,}\b|('?\b\w+\b'?\w?)
Working demo
MATCH 1
1. [0-4] `This`
MATCH 2
1. [5-10] `Jon's`
MATCH 3
1. [11-16] `dogs'`
MATCH 4
1. [17-22] `'bout`
MATCH 5
1. [32-36] `word`
For R, the special characters need to be escaped (each backslash doubled inside the string literal).
As you can see in the regex pattern, you put whatever you don't want on the left side of the pattern and leave what you really want inside the capturing group at the rightmost side. The idea behind the discard technique is:
discard this|don't want this|still don't care this|(Oh yeah! I grab this)
THANKS to EdConttrell and johnwait for helping me to improve the answer.
EDITED twice: (thanks hex494D49):
(?i)(?<=\W|^)(?<!')'*(?:\w{4}|\w'*\w{3}|\w{2}'*\w{2}|\w{3}'*\w|\w{2}'*\w'*\w|\w'*\w{2}'*\w|\w'*\w'*\w{2}|\w'*\w'*\w'*\w)'*(?!')(?=\W|$)
Better go for every possible cases...
But, title of question states :
grab n letter words don't count apostrophes regex
So I would not recommend my solution.
Another solution that I think may be slightly clearer / more concise:
Regex
(?<![\w'])(?:'?\w'?){4}(?![\w'])
Explanation
(?<![\w'])
This is a Negative Lookbehind Assertion: it checks that the match is not preceded by the ' char or a word char (for ASCII input, \w is the same as [a-zA-Z0-9_], so it also matches digits and underscores).
(?:'?\w'?){4}
This matches any word char, optionally preceded/succeeded by a '. The (?: ... ) makes the group non-capturing.
(?![\w'])
This is a Negative Lookahead assertion, ensuring that the group is not followed by another apostrophe or word char.
The purpose of the first and last terms is to ensure that the 4 matches by the middle group are not surrounded by more characters: i.e. the word only has 4 letters.
They are more or less equivalent to a \b word boundary detection, except that they count an apostrophe as part of a word which \b does not.
Issues
The regex won't match strings that start or end with double apostrophes, ''. I don't think this is a huge loss.
Example
See this link on regex101.com.
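The lookaround pattern from this last answer also runs unchanged in Python, which gives a quick way to check it against the sample sentence. This is only a sketch; the question itself is about R.

```python
import re

text = "This Jon's dogs' 'bout there in Mike's re'y word."

# Four word chars, each with optional apostrophes around it, and no
# further word char or apostrophe directly before or after.
pattern = re.compile(r"(?<![\w'])(?:'?\w'?){4}(?![\w'])")

words = pattern.findall(text)
print(words)  # ['This', "Jon's", "dogs'", "'bout", 'word']
```

Because \w also matches digits and underscores, a token such as ab12 would be counted as a four-letter word by this pattern.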

Matching on repeated substrings in a regex

Is it possible for a regex to match based on other parts of the same regex?
For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?
Matches:
abcabc
xyz abc xyz
Doesn't Match:
abc123
Undefined: (Can match or not, whichever is easiest)
ababa
a
Ideally, I'd like something in the perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.
Use capture groups and backreferences.
/^(.{3}).*\1$/
The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
Applying this to your case:
/^(.{3}).*\1$/
(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)
A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):
^ matches the start of the line.
(.{3}) grabs three characters of any type and saves them in a group for later reference.
.* matches anything for as long as possible. (You don't care what's in the middle of the line.)
\1 matches the group that was captured in step 2.
$ matches the end of the line.
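The steps above can be checked against the examples from the question with a short Python sketch (Python rather than Perl, purely for illustration; the backreference syntax is the same).

```python
import re

# Group 1 grabs the first three characters; \1 must repeat them at the end.
pattern = re.compile(r"^(.{3}).*\1$")

samples = ["abcabc", "xyz abc xyz", "abc123", "ababa", "a"]
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
# {'abcabc': True, 'xyz abc xyz': True, 'abc123': False, 'ababa': False, 'a': False}
```

The two "undefined" cases, ababa and a, do not match, because the group and the backreference cannot overlap.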
For the same characters at the beginning and end:
/^(.{3}).*\1$/
This is a backreference.
This works:
my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;
For matching the beginning and the end you should add ^ and $ anchors.