I'm trying to learn regex in R more deeply. I gave myself what I thought was an easy task that I can't figure out. I want to extract all 4 letter words. In these four letter words I want to ignore (don't count) apostrophes. I can do this without regex but want a regex solution. Here's a MWE and what I've tried:
text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
pattern <- "\\b[A-Za-z]{4}\\b(?!')"
pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"
regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE))
** Desired output:**
[[1]]
[1] "This" "Jon's" "dogs'" "'bout" "word"
I thought the second pattern would work but it grabs words containing 5 characters as well.
This is a good challenging question and here is a tricky answer.
> x <- "This Jon's dogs' 'bout there in Mike's re'y word."
> re <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
> regmatches(x, gregexpr(re, x, perl=T))[[1]]
## [1] "This" "Jon's" "dogs'" "'bout" "word"
Explanation:
The idea is to skip any word patterns that consist of 5 or more letter characters and an optional apostrophe.
On the left side of the alternation operator we match the subpattern we do not want.
Making it fail and forcing the regular expression engine to not retry the substring using backtracking control. As explained below:
(*SKIP) # advances to the position in the string where (*SKIP) was
# encountered signifying that what was matched leading up
# to cannot be part of the match
(?!) # equivalent to (*FAIL), causes matching failure,
# forcing backtracking to occur
The right side of the alternation operator matches what we want...
Additional Explanation:
Essentially, in simple terms you are using the discard technique.
(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)
You use the alternation operator in context placing what you want to exclude on the left, ( saying throw this away, it's garbage ) and place what you want to match in a capturing group on the right side.
You can use this pattern:
(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])
You can use the discard technique and use a regex like this:
\b\w{0,2}\b(?:'\w)?|\b\w{3}(?!')\b|\b\w{5,}\b|('?\b\w+\b'?\w?)
Working demo
MATCH 1
1. [0-4] `This`
MATCH 2
1. [5-10] `Jon's`
MATCH 3
1. [11-16] `dogs'`
MATCH 4
1. [17-22] `'bout`
MATCH 5
1. [32-36] `word`
For R it is needed to be escaped the special characters.
As you can see in the regex pattern you can use whatever you don't want at the left side of the pattern and leaving what you really want inside the capturing group at the rightest side. The idea behind the discard technique is:
discard this|don't want this|still don't care this|(Oh yeah! I grab this)
THANKS to EdConttrell and johnwait for helping me to improve the answer.
EDITED twice: (thanks hex494D49):
(?i)(?<=\W|^)(?<!')'*(?:\w{4}|\w'*\w{3}|\w{2}'*\w{2}|\w{3}'*\w|\w{2}'*\w'*\w|\w'*\w{2}'*\w|\w'*\w'*\w{2}|\w'*\w'*\w'*\w)'*(?!')(?=\W|$)
Better go for every possible cases...
But, title of question states :
grab n letter words don't count apostrophes regex
So I would not recommend my solution.
Another solution that I think may be slightly clearer / more concise:
Regex
(?<![\w'])(?:'?\w'?){4}(?![\w'])
Explanation
(?<![\w'])
This is a Negative Lookbehind Assertion: it checks that the match is not preceded by the ' char or a word char (\w is the same as [a-zA-Z]).
(?:'?\w'?){4}
This matches any word char, optionally preceded/succeeded by a '. The (?: ... ) makes the group non-capturing.
(?![\w'])
This is a Negative Lookahead assertion, ensuring that the group is not followed by another apostrophe or letter char.
The purpose of the first and last terms is to ensure that the 4 matches by the middle group are not surrounded by more characters: i.e. the word only has 4 letters.
They are more or less equivalent to a \b word boundary detection, except that they count an apostrophe as part of a word which \b does not.
Issues
The regex won't match strings that start or end with double apostrophes, ''. I don't think this is a huge loss.
Example
See this link on regex101.com.
Related
For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm
You flipped the square brackets and the parenthesis, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you you might use for example one of:
Regex
Demo
Explanation
\b([a-zA-Z0-9]+)-
Demo 1
\b is a word boundary to prevent a partial word match, the value is in capture group 1.
\b[a-zA-Z0-9]+(?=-)
Demo 2
Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-)
To get t3rm in sh0rt-t3rm you might use for example one of:
Regex
Demo
Explanation
-([a-zA-Z0-9]+)\b
Demo 3
The other way around with a leading - and get the value from capture group 1.
-\K[a-zA-Z0-9]+\b
Demo 4
Match - and use \K to keep out what is matched so far. Then match 1 or more times the allowed chars in the character class.
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parenthesis on the left-hand-side so to make regex run in a list context because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
There are of course many other variations on how the data may look like, other than just being one hyphenated-word. In particular, if it's a part of a text, you may want to include word-boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
Demo
I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ catch these ~match-this~, but when I tried the negation of it (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add asterisk wild card on any negation of tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wild card on both, it catches all match-this including those which enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or Does it need more??
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by *match-this~
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print (res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead again anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's a difference between telling the engine to "not match" a character and to tell the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (f.e. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as expected by me.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
Using Regular Expression,
from any line of input that has at least one word repeated two or more times.
Here is how far i got.
/(\b\w+\b).*\1
but it is wrong because it only checks for single char, not one word.
input: i might be ill
output: < i might be i>ll
<> marks the matched part.
so, i try to do (\b\w+\b)(\b\w+\b)*\1
but it is not working totally.
Can someone give help?
Thanks.
this should work
(\b\w+\b).*\b\1\b
greedy algorithm will ensure longest match. If you want second instance to be a separate word you have to add the boundaries there as well. So it's the same as
\b(\w+)\b.*\b\1\b
Positive lookahead is not a must here:
/\b([A-Za-z]+)\b[\s\S]*\b\1\b/g
EXPLANATION
\b([A-Za-z]+)\b # match any word
[\s\S]* # match any character (newline included) zero or more times
\b\1\b # word repeated
REGEX 101 DEMO
To check for repeated words you can use positive lookahead like this.
Regex: (\b[A-Za-z]+\b)(?=.*\b\1\b)
Explanation:
(\b[A-Za-z]+\b) will capture any word.
(?=.*\b\1\b) will lookahead if the word captured by group is present or not. If yes then a match is found.
Note:- This will produce repeated results because the word which is matched once will again be matched when regex pointer captures it as a word.
You will have to use programming to strip off the repeated results.
Regex101 Demo
I need to find all the words in an inputted text that has (?i:val) in it and are no longer that 5 characters.
So far I got: \b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b
If we take this sample text to look in: In computer science, a value is an expression which cannot be evaluated any further (a normal form). Val is also a match
I get 3 matches (value, evaluated and Val), however evaluated should not match the pattern, as it is too long. What is the right way to get this straight?
Your pattern does not account for the length of the words matched.
Use word boundaries and a lookahead like this:
(?i)\b(?=\w*val)\w{1,5}\b
See regex demo
The regex matches:
\b - a leading word boundary since the next pattern is \w
(?=\w*val) - a lookahead making sure there is a val substring after zero or more word characters
\w{1,5} - matches 1 to 5 word characters
\b - trailing word boundary that stops words of more than 5 characters long from matching
You may use an ASCII JS version of the regex:
/\b(?=[a-z]*val)[a-z]{1,5}\b/i
It's important to understand why the "evaluated" was matched. Note:
[a-zA-Z]* matches the "e"
(?i:val) matches "val"
[a-zA-Z]* matches "uated"
Actually there's not repetition here! The pattern was matched in only one iteration.
You can achieve what you want using lookarounds, but I think that regex is not the best tool for this task. I highly recommend you using other functions depending on what you have.
I have a regex trying to divide questions by speciality. Say I have the following regex:
(?P<speciality>[0-9x]+)
It works fine for this question (correct match: 7)
(7)Which of the following is LEAST to be considered as a risk factor for esophageal cancer?;
And for this (correct match: 8 and 13)
(8,13)30 year old woman with amenorrhea, low serum estrogen and high serum LH/FSH, the most likely diagnosis is:
But not for this one (incorrect match: 20).
First trimester spontaneous abortion (before 20 wk) is most commonly due to:
I only need the numbers in parentheses at the beginning of the question, all other parentheses should be ignored. Is this possible with a regex alone (lookahead?).
If your regex flavor supports \G continuous matching and \K reset beginning of match, try:
(?:^\(|\G,)\K[\dx]+
^\( would match parenthesis at start | OR \G match , after last match. Then \K resets and match + one or more of [\dx]. (\d is a shorthand for [0-9]). Matches will be in $0.
Test at regex101.com; Regex FAQ
PHP example
$str = "(1x,2,3x) abc (1,2x,3) d";
preg_match_all('~(?:^\(|\G,)\K[\dx]+~', $str, $out);
print_r($out[0]);
Array
(
[0] => 1x
[1] => 2
[2] => 3x
)
Test at eval.in
Perhaps something like this will work (you don't mention the regex flavor that you're using, though I am guessing it is PCRE by the use of the named group - and yes, it does use positive lookahead):
^\((?P<speciality>(?:[0-9x]+,?)+)(?=\))/mg
The caret ^ combined with the multiline modifier \m (which causes the anchors ^ and $ to match the beginning and end of lines, respectively, instead of the beginning and end of the string) will ensure that what is matched is at the start of the paragraph. The specialties will be captured in the specialty named capture group; the only caveat is that if more than one specialty is given (as in your example starting (8,13)) the capture will be a comma-delimited list, just as the specialty is a comma-delimited list (to use the same example, the capture will be 8,13 in that case).
Please see Regex Demo here.
(?P<speciality>[0-9x]+) matches any nonempty sequence of digits anywhere in the input. the parentheses just delimit the capturing group but are not part of the match.
to match a number (or more separated by commas) between parentheses at the beginning of the line you could use something like this
^\((\d+)(,(\d+))*\)
EDIT
it seems repeated capturing groups, as in (,(\d+))*, will only return the last match. so to get the values it'd be necessary to catch the complete list of numbers and parse it afterwards:
^\((?P<specialities>(\d+)(,(\d+))*)\)
will catch one or more numbers separated by commas, between parentheses.
added the start of line anchor so it is at the beginning of the line.
Demo