Regular expression to keep a running count of individual characters - regex

Consider the following vector x
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")
Using a single regular expression, I'd like to keep only those elements of x that contain more than three letters. Here are the rules:
The letters are not always consecutive (this is the main issue)
The string elements of x can be of any number of characters
There will be nothing in the string except digits and lower-case letters
The simple way I thought of would be to remove everything that is not a letter and then take the number of characters, something like the following.
x[nchar(gsub("[0-9]+", "", x)) > 3]
# [1] "abcd00ab" "abcdefg" "000s00r00g00t00"
I know that there are statements like [a-z]{4,} that finds four or more consecutive lower-case letters. But what if individual letters are scattered about the string? How can I keep a "running count" of letters such that when it passes three, it becomes a non-match? Right now all I can think of is to write [a-z]+ a bunch of times, but this can get ugly if I want to match say, five or more letters.
This gets me there, but you can see how this could be ugly for longer strings.
grep("[a-z]+.*[a-z]+.*[a-z]+.*[a-z]+.*", x)
# [1] 2 3 4
Is there a way to do that with a better regular expression?

Try this where \\D matches a non-digit, .* matches a string of 0 or more characters and (...){4} says to match four times, i.e. more than 3.
grep("(\\D.*){4}", x, value = TRUE)
This will match if there are 4 or more non-digits. Just replace 4 with 6 if you need more than 5. If it's important to have the number 3 in the regexp, then try the pattern (\\D.*){3}\\D instead.
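For readers who want to check the counting behaviour outside R, here is a minimal Python sketch of the same (\D.*){n} idea. The helper name has_n_nondigits and the use of Python's re module are assumptions made for illustration; they are not part of the answer above.

```python
import re

x = ["000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00"]

def has_n_nondigits(s, n):
    """Return True if s contains at least n non-digit characters."""
    # (?:\D.*){n}: a non-digit followed by anything, repeated n times.
    pattern = r"(?:\D.*){%d}" % n
    return re.search(pattern, s) is not None

# "More than three" letters means n = 4.
more_than_three = [s for s in x if has_n_nondigits(s, 4)]
print(more_than_three)
```

Changing the second argument is all it takes to ask for a different minimum, which is the point of using the repetition operator.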

There is a repetition operator you can use: {n} matches the previous token or group n times. To make matches more efficient, you should also be specific in what may be matched between letters (in your case only digits, not "any" character (which the dot . matches)):
^(?:[0-9]*[a-z]){4}[0-9a-z]*$
matches all strings that contain at least four lowercase letters (that is, more than three).
Explanation:
^ # Start of string
(?: # Start of a (non-capturing) group:
[0-9]* # Match any number of digits
[a-z] # Match one lowercase ASCII letter
){4} # Repeat the group exactly four times
[0-9a-z]* # Then match any following digits/letters
$ # until the end of the string
In R:
grep("^(?:[0-9]*[a-z]){4}[0-9a-z]*$", x, perl=TRUE, value=TRUE)
gives you a character vector with all the elements that are matched by the regex.

The grep command below finds the elements that have four or more letters:
> grep("^(?:[^a-z]*[a-z]){4}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
OR
> grep("^(?:[^a-z]*[a-z]){3}[^a-z]*[a-z]", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
To find the elements that have 5 or more letters:
> grep("^(?:[^a-z]*[a-z]){5}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg"
Explanation:
^ the beginning of the string
(?: group, but do not capture (4 times):
[^a-z]* any character except: 'a' to 'z' (0 or more times)
[a-z] any character of: 'a' to 'z'
){4} end of grouping
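As a rough cross-check of the pattern explained above, the following Python sketch compares it with a plain letter count for n = 4 and n = 5. The function name at_least_n_letters is hypothetical, and Python's re module stands in for R's perl=TRUE engine here.

```python
import re

x = ["000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00"]

def at_least_n_letters(s, n):
    # re.match anchors at the start of the string, so no explicit ^ is needed.
    # Each repetition skips non-letters and then takes exactly one letter.
    return re.match(r"(?:[^a-z]*[a-z]){%d}" % n, s) is not None

for n in (4, 5):
    by_regex = [s for s in x if at_least_n_letters(s, n)]
    # Reference answer: count the lowercase letters directly.
    by_count = [s for s in x if sum(c.islower() for c in s) >= n]
    print(n, by_regex, by_regex == by_count)
```

Because [^a-z]* and [a-z] cannot match the same character, the engine has very little backtracking to do when a string fails.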

Related

Regex match all letters after a digit

I want to match any letters that occur after a digit(s). There will not be any other digits in the sentence.
# example 1 one letter
> 'word 1 b'
> array(b)
# example 2 multiple letters
>'3c, d, e'
> array (c, d, e)
# example 3 no match
>'word 5'
> array()
# example 4 multiple letters multiple digits
>'words 12a b c'
> array(a, b, c)
I've tried [^\d]+?([A-Za-z]) but this matches letters before the digits also, and not the one that is attached to the digit (e.g. in example 4, 12a, or example 2, 3c)
Since this works for you, here are the possible solutions:
(?:\G(?!^)|\d+)[^a-z]*\K[a-z]
(?<=\d.*)[a-z]
See regex #1 demo and regex #2 demo. Details:
(?:\G(?!^)|\d+) - one or more digits or the end of the previous successful match
[^a-z]* - zero or more characters other than lowercase letters
\K - match reset operator discarding all text matched so far
[a-z] - a lowercase letter.
The second regex means:
(?<=\d.*) - a location that is immediately preceded with a digit and then any zero or more chars other than line break chars, as many as possible
[a-z] - a lowercase letter.
To exclude the word and, you can use
(?:\G(?!^)(?:\s+and\b)?|\d+)[^a-z\n]*\K[a-z]
See this regex demo. Or,
(?<=\d.*)[a-z](?<!\band\b)(?!(?<=\ban)d\b)(?!(?<=\ba)nd\b)
See this regex demo.
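Python's built-in re module does not support \G or \K and only allows fixed-width lookbehinds, so neither pattern above can be used there as written. A minimal sketch of the same idea (collect lowercase letters that come after the first run of digits) is to locate the digits first and then search the remainder; the function name letters_after_digits is an assumption for illustration.

```python
import re

def letters_after_digits(text):
    """Return the lowercase letters that occur after the first digit run."""
    m = re.search(r"\d+", text)
    if m is None:
        return []  # no digits, so nothing can follow them
    # Only look at the part of the string after the digits.
    return re.findall(r"[a-z]", text[m.end():])

for sample in ["word 1 b", "3c, d, e", "word 5", "words 12a b c"]:
    print(repr(sample), letters_after_digits(sample))
```

This is a two-step substitute for the single-regex solutions, not a translation of them.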
It sounds like what you might need is a zero-width group, one that is required by the expression but is not part of the capture. A zero-width lookbehind asserts that a digit precedes the match without consuming it, and the captured group is the run of letters directly after the digit. Note that this only finds letters attached to the digit, such as the a in 12a.
(?<=\d)([a-z]+)

Regex: a character should not occur more than 3 times consecutively

Using regex in Python 3, I have a 16-digit number as input.
I need to check that no 4 consecutive digits are the same:
1234567890111234 ------> valid
1234555567891234 ------> invalid
You could search the string for the pattern (.)\1{3}, which matches 4 identical consecutive characters. If re.search returns None, it's a valid string:
import re
lst = ['12345678901112', '12345555678912']
for x in lst:
    print(x)
    # No run of 4 identical characters means the string is valid.
    print('Valid:', re.search(r'(.)\1{3}', x) is None)
#12345678901112
#Valid: True
#12345555678912
#Valid: False
Here (.) matches any single character and captures it as group 1, which the back reference \1 then refers to. The quantifier {3} on \1 requires three more copies of that character, so the whole pattern matches only when 4 consecutive characters are the same.
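If the 16-digit length should be checked in the same step, one possible variant moves the back reference into a negative lookahead. This is only a sketch: the names NUMBER_RE and is_valid are hypothetical, and it assumes the input must be exactly 16 ASCII digits.

```python
import re

# (?!.*(\d)\1{3}) rejects any digit that is repeated 4 times in a row;
# \d{16} then requires exactly 16 digits (fullmatch covers the whole string).
NUMBER_RE = re.compile(r"(?!.*(\d)\1{3})\d{16}", re.ASCII)

def is_valid(number):
    return NUMBER_RE.fullmatch(number) is not None

for n in ["1234567890111234", "1234555567891234"]:
    print(n, is_valid(n))
```

The first example contains only three consecutive 1s, so it passes; the second contains 5555 and is rejected.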

regexp - find numbers in a string in any order

I need a regexp that finds the strings that contain all the required digits, each of them only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
That is, the digits 1, 2 and 3, each only once, in any order and in any position of the string.
I tried a lot of things, and this seemed to be the closest, although it's still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3.
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If the same digit must not appear more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not "consume" characters: the regex engine stays at the same location in the input string, which in this regex is the beginning of the string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as in group 1 with \\1. Thus, if the string is 121, there will be no match, since the look-ahead returns false once it finds two 1s.
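The same look-ahead works unchanged in Python's re module, which makes it easy to experiment with. In the sketch below the vector a is copied from the question, while the names no_repeat and result are made up for illustration; in a Python raw string the back reference is written with a single backslash.

```python
import re

a = ["12", "13", "112", "123", "113", "1123", "23", "212", "223", "213",
     "2123", "312", "323", "313", "3123", "1223", "1213", "12123", "2313",
     "23123", "13123"]

# The look-ahead fails as soon as any character occurs twice in the string;
# [123]{3} then requires exactly three characters drawn from 1, 2 and 3.
no_repeat = re.compile(r"^(?!.*(.).*\1)[123]{3}$")

result = [s for s in a if no_repeat.search(s)]
print(result)
```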
You can also use this, which checks each digit against the ones before it:
grep('^([123])((?!\\1)[123])(?!\\2|\\1)[123]$', a, value=TRUE, perl=T)
see demo

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, or at the end of the string or followed by whitespace. If you can't use lookbehind (for example, if you're using JavaScript) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
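The difference between the lookaround version and the version that consumes the trailing whitespace can be seen in a short Python sketch. Python's re module is used here only as a convenient test bed, and the variable names are assumptions for illustration.

```python
import re

text = "dog god ogd, dogs o"

# Lookarounds only inspect the neighbouring characters; they do not consume them.
with_lookarounds = re.findall(r"(?<!\S)[dog]+(?!\S)", text)

# Consuming the trailing whitespace means the next word has no leading
# whitespace left to match against, so adjacent words get skipped.
consuming = re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", text)

print(with_lookarounds)
print(consuming)
```

In the second list "god" is missing because the space before it was already used up by the match for "dog".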
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Anything between two square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but the class still matches only 1 character. The + and * are quantifiers: + matches 1 or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}), putting one or two numbers in between: {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between () parentheses is a capturing group, whose matched text can be reused later, say as a backreference or in a replacement string. The ?! is a negative lookahead: the match fails if the character class inside it matches at that position, which ensures that a run of the allowed characters is not matched when it is immediately followed by another word character or a comma.
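The pieces described above (character class, counted quantifier, capturing group, negative lookahead) can be tried out in a few lines of Python. This is just a sketch; the helper full and the sample strings other than the sentence from the question are invented for illustration.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# {2} means exactly two repetitions; {1,3} means one to three.
print(full(r"[dog]{2}", "do"), full(r"[dog]{2}", "dog"))
print(full(r"[dog]{1,3}", "dog"), full(r"[dog]{1,3}", "doog"))

# A capturing group can be reused in the replacement string as \1.
wrapped = re.sub(r"([dog]+)", r"<\1>", "dog cat")
print(wrapped)

# The negative lookahead rejects runs followed by a word character or comma.
matches = re.findall(r"[dog]+(?![\w,])", "dog god ogd, dogs o")
print(matches)
```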

A regular expression that matches 25 chars and starts with digits

I have a text field which I need to validate using a regex. My requirement is as follow:
CCCCNNNNNN or CCCCNNNNNNN (Template)
1234ABCDEFG or 123-ABCDEFG (Example string)
Rules:
The whole string is maximum 25 characters
The first four characters (CCCC) must be alphanumeric
CCCC is exactly 4 characters and can be letters or digits
CCCC can have a dash sign as 4th character
NNNNNNNNNNNN can be up to 21 characters and only numbers
E.g. AAAA 1234 A58- is a valid string for CCCC.
Here is my research notes:
I will need to match numerics first
I will need the + char to specify to match this pattern X times
I will need to match letters after that for 8-9 spaces
There is a wonderful post on RegEx patterns here:
Matching numbers with regular expressions — only digits and commas
My goal is to apply this REGEX pattern to a text box Mask in a WinForms app.
....
....
...yeah - I think the answer you are looking for (and I stress "think") is this expression:
^[0-9A-Za-z]{3}[0-9A-Za-z-]\d{0,21}$
That is:
^ # assert beginning (not in the middle)
[0-9A-Za-z]{3} # three characters that are: 0-9 or a-z (upper or lower)
[0-9A-Za-z-] # one character that is: 0-9 or a-z (upper or lower) or a dash
\d{0,21} # anywhere from 0 to 21 digits
$ # assert at the end (not somewhere in the middle)
If you want to match several values of this form, put the above expression (minus the assertions) into parentheses, along with whatever is allowed to separate the values (I chose \s, i.e. whitespace, or the end of the string after the last value), and then use the + quantifier:
^([0-9A-Za-z]{3}[0-9A-Za-z-]\d{0,21}(?:\s+|$))+$
will match/validate the following input:
1234567890 AAAA123456789012345678901 GGG-123 hhh5 A1B2000000000
If you wanted something else, you'll have to ask a clearer question (there's a lot of contradiction and repetition in your question that makes it EXTREMELY confusing)
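To experiment with the single-value expression before wiring it into a form, a small Python sketch may help. It is not .NET/WinForms code, the names FIELD_RE, is_valid and all_valid are hypothetical, and it assumes that only ASCII digits count as digits (hence [0-9] instead of \d). Several values are handled by splitting on whitespace rather than by one larger regex.

```python
import re

# Three alphanumerics, one alphanumeric-or-dash, then 0 to 21 digits.
FIELD_RE = re.compile(r"[0-9A-Za-z]{3}[0-9A-Za-z-][0-9]{0,21}")

def is_valid(value):
    # fullmatch plays the role of the ^ and $ assertions.
    return FIELD_RE.fullmatch(value) is not None

def all_valid(text):
    # Validate every whitespace-separated value; empty input is rejected.
    tokens = text.split()
    return bool(tokens) and all(is_valid(t) for t in tokens)

sample = "1234567890 AAAA123456789012345678901 GGG-123 hhh5 A1B2000000000"
print(all_valid(sample))
print(is_valid("GG-G123"), is_valid("AAAA" + "1" * 22))
```

The last line shows two rejections: a dash in the third position, and a value that exceeds the 25-character maximum.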