regular expression in R, match substring only if things after

regular expression in R, match substring only if things after - regex

my_string = "2011, this year I made 750,000 dollars"
Is there an elegant way to match "2011" and "750,000" in the string above. The idea is to extract numeric values when it looks like to numeric values, i.e. \d+ or \d+[\.,]?\d* depending on the presence of a comma after
I tried this but it doesn't match exactly what I wanted, I got "2011," which is no good
library(stringr)
str_match_all(fkin, "(\\d+[\\.,]?\\d*)
Here is my expected resut:
"2011" "750,000"

You can do:
[0-9]+(?:[,.][0-9]+)*
It's very elegant, I tried it in front of a mirror.

Here is a one regex pure base R approach to extract integer or float values that are not part of the string of digits separated with a hyphen:
> str <- "2011, this year I made 750,000 dollars and 750,000-589 here"
> regmatches(str, gregexpr('(?<!\\d-)\\b\\d+(?:[,.]\\d+)?+(?!-)', str, perl=T))[[1]]
[1] "2011" "750,000"
See the IDEONE demo and a regex demo.
Since the regex contains lookarounds, you need to specify the perl=TRUE argument.
Pattern explanation:
(?<!\d-) - a negative lookbehind failing the match when a digit with a hyhen precedes the current location
\b\d+ - a word boundary (before the next digit, there cannot be a word char - letter, digit or _)
(?:[,.]\d+)?+ - a non-capturing group ((?:...)) matching 1 or 0 sequences of a comma or dot ([,.]) followed with 1 or more digits (and this sequence is matched possessively (see ?+) so that the regex engine did not check for a hyphen after \b\d+)
(?!-) - a negative loookahead that fails the match if there is a hyphen after the digits detected.

Related

How do I create a regex expression for a 10 digit phone number with the same separator?

I am trying to create a basic regular expression to match a phone number which can either use dots [.] or hyphens [-] as the separator.
The format is 123.456.7890 or 123-456-7890.
The expression I am currently using is:
\d\d\d[-.]\d\d\d[-.]\d\d\d\d
The issue here is that it also matches the phone numbers that have both separators in them which I want to be termed as invalid/not a match. For example, with my expression, 123.456-7890 and 123-456.7890 show up as a match, something I do not want happening.
Is there a way to do that?

Use a backreference:
^\d{3}([.-])\d{3}\1\d{4}$
Here is an explanation of the regex:
^ from the start of the number
\d{3} match any 3 digits
([.-]) then match AND capture either a dot or a dash separator
\d{3} match any 3 digits
\1 match the SAME separator seen earlier
\d{4} match any 4 digits
$ end of the number

You can use this regex:
^\d{3}([-.])\d{3}\1\d{4}$
You can see that it works here.
Key point here - is that you capture your desired character using brackets ([-.])
and then reuse it with back reference \1.

Is there a way to use Regex to capture numbers out of a string based on a specific leading letters?

I need to extract any number between 4-10 digits that following directly after 'PO#' OR 'PO# ' (with a whitespace). I do not want to include the PO# with the actual value that is extracted, however I do need it as criteria to target the value within a string. If the digits are less than 4 or greater than 10, I do not wish to capture the value and would like to otherwise ignore it.
A sample string would look like this:
PO#12445 for Vendor Enterprise
or
Invoice# 21412556 for Vendor Enterprise for PO# 12445
My current RegEX expression captures PO# with '#' and I use additional logic after the fact to remove the '#', however my expression is also capturing Invoice# and Inv# which I don't want it to do. I'd like it to only target PO#.
Current Expression: [P][O][#]\s*[0-9]{3,9}\d+\w
Any help would be greatly appreciated!

If you need only the digits, you can use \b(?<=PO#)\s?(\d{4,10})\b, with:
(?<=PO#): positivive lookbehind, be sure that this pattern is present before the needed pattern (PO followed by #)
\s?: 0 or 1 whitespace
(\d{4,10}): between 4 and 10 digits
\b: word boundaries to avoid ie. the 10 first digits of a 11 digits pattern match or 'SPO#' to match
Edit: Alexander Mashin is right about the lookbehind having to be fixed width, so \b(?<=PO#)\s?(\d{4,10})\b is better https://regex101.com/r/1KBQd1/5
Edit: added word boundaries

You can use a capturing group and repeat matching the digits 4-10 times using [0-9]{4,10}.
Note that [P][O][#] is the same as PO#
\bPO#\s*([0-9]{4,10})\b
\bPO#\s* Match PO# preceded by a word boundary and match 0+ whitespace chars
( Capture group 1
[0-9]{4,10} Match 4 - 10 digits
)\b Close group followed by a word boundary to prevent the match being part of a larger word
Regex demo

If PCRE is available, how about:
PO#\s*\K\d{4,10}(?=\D|$)
PO#\s* matches the leading substring "PO#" followed by 0 or more whitespaces.
\K resets the starting position of the match and works as a positive (zero length) lookbehind.
\d{4,10} matches a sequence of digits of 4 <= length <= 10.
(?=\D|$) is the positive lookahead to match a non-digit character or the end of the string.

Match only two combinations and ignore the rest in REGEX - tableau

I have a dozen input ID's and I need to match only two particular patterns while ignoring the rest. I have a column that would flag those valid/invalid if the regex match is true.
Test string:
1.) B-123456
2.) 985463728
My regex should strictly match the above two patterns and ignore the rest. The first test string would have an alphabet B followed by a hyphen and then few digits while the second test string is purely numbers. Below is what I tried:
[Bb\d][-\d][0-9]{1,9}
Please help me out with this as I have tried weird combinations and I am missing out on something tiny. My regex includes other combinations as well which should not happen.

You could match either bB a - and 6 digits, or match 9 digits surrounded by word boundaries:
\b(?:[Bb]-[0-9]{6}|[0-9]{9})\b
Regex demo
If the number of digits can vary, you could make the bB and the hyphen optional and either match 1+ digits using [0-9]+ or use a quantifier [0-9]{1,9}
\b(?:[bB]-)?[0-9]+\b
Or use anchors to assert the start ^ and the end $ of the string
^(?:[bB]-)?[0-9]+$

Regular expression of two digit number where two digits are not same

I am trying to write a regular expression that will match a two digit number where the two digits are not same.
I have used the following expression:
^([0-9])(?!\1)$
However, both the strings "11" and "12" are not matching. I thought "12" would match. Can anyone please tell me where I am going wrong?

You need to allow matching 2 digits. Your regex ^([0-9])(?!\1)$ only allows 1 digit string. Note that a lookahead does not consume characters, it only checks for presence or absence of something after the current position.
Use
^(\d)(?!\1)\d$
^^
See demo
Explanation of the pattern:
^ - start of string
(\d) - match and capture into Group #1 a digit
(?!\1) - make sure the next character is not the same digit as in Group 1
\d - one digit
$ - end of string.

regexp - find numbers in a string in any order

I need to find a regexp that allows me to find strings in which i have all the required numbers but only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
The pattern 123 only once and in any order and in any position of the string
I tried a lot of things and this seemed to be the closer while it's still very far from what I want :
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"

What i exactly need is to find all 3 digit numbers containing 1 2 AND 3 digits
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you do not need to allow the same digit more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not not "consume" characters, it does not "match" the pattern inside the look-ahead, the regex engine stays at the same location in the input string. In this regex, it is the beginning of string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text inside group 1 with \\1. Thus, if there is 121, there will be no match since the look-ahead will return false as it will find two 1s.

you can as well use this
grep('^([123])((?!\\1)\\d)(?!\\2|\\1)\\d', a, value=TRUE, perl=T)
see demo

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

regular expression in R, match substring only if things after - regex

You can do: [0-9]+(?:[,.][0-9]+)* It's very elegant, I tried it in front of a mirror.

Related

How do I create a regex expression for a 10 digit phone number with the same separator?

Is there a way to use Regex to capture numbers out of a string based on a specific leading letters?

Match only two combinations and ignore the rest in REGEX - tableau

Regular expression of two digit number where two digits are not same

regexp - find numbers in a string in any order

Categories

Resources