R regular expressions: unexpected behavior of "[:digit:]"

R regular expressions: unexpected behavior of "[:digit:]" - regex

I'd like to extract elements beginning with digits from a character vector but there's something about POSIX regular expression syntax that I don't understand.
I would think that
vec <- c("012 foo", "305 bar", "other", "notIt 7")
grep(pattern="[:digit:]", x=vec)
would return 1 2 4 since they are the four elements that have digits somewhere in them. But in fact it returns 3 4.
Likewise grep(pattern="^0", x=vec) returns 1 as I would expect because element 1 starts with a zero. However grep(pattern="^[:digit:]", x=vec) returns integer(0) whereas I would expect it to return 1 2 since those are the elements that start with digits.
How am I misunderstanding the syntax?

Try
grep(pattern="[[:digit:]]", x=vec)
instead as the 'meta-patterns' between colons usually require double brackets.

Another solution
grep(pattern="\\d", x=vec)

man 7 regex
Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class. Standard character class names are:
alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit
Therefore a character class that is the sole member of a bracket expression will look like double-brackets, such as [[:digit:]]. As another example, consider that [[:alnum:]] is equivalent to [[:alpha:][:digit:]].

Related

Regular expression not containing 101

I came across the regular expression not containing 101 as follows:
0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
I was unable to understand how the author come up with this regex. So I just thought of string which did not contain 101:
01000100
I seems that above string will not be matched by above regex. But I was unsure. So tried translating to equivalent pcre regex on regex101.com, but failed there too (as it can be seen my regex does not even matches string containing single 1.
Whats wrong with my translation? Is above regex indeed correct? If not what will be the correct regex?

Here is a bit shorter expression ^0*(1|00+)*0*$
https://www.regex101.com/r/gG3wP5/1
Explanation:
(1|00+)* we can mix zeroes and ones as long as zeroes occur in groups
^0*...0*$ we can have as many zeroes as we want in prefix/suffix
Direct translation of the original regexp would be like
^(0*1*0*|(1|00|000)*|(0+1+0+)*)$
Update
This seems like artificially complicated version of the above regexp:
(1|00|000)* is the same as (1|00+)*
it is almost the solution, but it does not match strings 0, 01.., and ..10
0*1*0* doesn't match strings with 101 inside, but matches 0 and some of 01.., and ..10
we still need to match those of 01.., and ..10 which have 0 & 1 mixed inside, e.g. 01001.. or ..10010
(0+1+0+)* matches some of the remaining cases but there are still some valid strings unmatched
e.g. 10010 is the shortest string that is not matched by all of the cases.
So, this solution is overly complicated and not complete.

read the explanation in the right side tab in regex101 it tells you what your regex does( I think you misunderstood what list operator does) , inside a list operator ( [ ) , the other characters such as ( won't be metacharacters anymore so the expression [(0*1*0*)[1(00)(000)] will be equivalent to [01()*[] which means it matches 0 or 1 or ( or ) or [
The correct translation of the regular expression 0∗1∗0∗+(1+00+000)∗+(0+1+0+)∗
will be as follows:
^((?:0*1*0*)|(?:1|00|000)*|(?:0+1+0+)*)$
regex101
Debuggex Demo
What your regex [(0*1*0*)[1(00)(000)]*(0+1+0+)*] does:
[(0*1*0*)[1(00)(000)]* -> matches any of characters 0,(,),*,[ zero or more times followed by
(0+1+0+)* --> matches the pattern 0+1+0+ 0 or more times followed by
] --> matches the character ]
so you expression is equivalent to
[([)01](0+1+0+)*] which is not a regular expression to match strings that do not contain 101

0* 1* ( (00+000)* 1*)* (ε+0)
i think this expression covers all cases because --
any number apart from 1 can be broken into constituent 2's and 3's i.e. any number n=2*i+3*j. So there can be any number of 0's between 2 consecutive 1's apart from one 0.Hence, 101 cannot be obtained.
ε+0 for expressions ending in one 0.

The RE for language not containing 101 as sub-string can also be written as (0*1*00)*.0*.1*.0*
This may me a smaller one then what you are using. Try to make use of this.

Regular Expression I got (0+10)1. (looks simple :P)
I just considered all cases to make this.
you consider two 1's we have to end up with continuous 1's
case 1: 11111111111111...
case 2: 0000000011111111111111...(once we take two 1's we cant accept 0's so one and only chance is to continue with 1's)
if you consider only one 1 which was followed by 0 So, no issue and after one 1 we can have any number of 0's.
case 3: 00000000 10100100010000100000100000 1111111111
=>(0*+10*)1
final answer (0+10)1.
Thanks for your patience.

Check ICD10 via regex

I need to check icd10 code this code generate with few condition
min length is 3.
first character is letter and not is 'U'.
second and third is digit.
fourth is dot(.)
fifth to eight charactor is letter or digit.
Ex.:
Right : "A18.32","A28.2","A04.0","A18.R252", "A18", "A18.52", "R18", "R18."
Wrong : "A184.32","U18","111."

is this an icd-10-cm code you are looking to verify.
if so I believe that the 3rd digit is alpha or numeric
taken from page 7
https://www.cms.gov/Medicare/Coding/ICD10/downloads/032310_ICD10_Slides.pdf
if so the following regular expression should validate.
^([a-tA-T]|[v-zV-Z])\d[a-zA-Z0-9](\.[a-zA-Z0-9]{1,4})?$
otherwise you can edit the above regular expression to check characte 2 and 3 as numeric.
^([a-tA-T]|[v-zV-Z])\d{2}(\.[a-zA-Z0-9]{1,4})?$

You could try something like so: ^[A-TV-Z]\d{2}(\.[A-Z\d]{0,4})?$. An example is available here.
This is how the answer satisfies your condition:
Min length is 3: ^[A-TV-Z]\d{2}...$ attempts to match a letter and 2 digits. The ^ and $ ensure that there is nothing else in the string which does not satisfy the regular expression. This segment: (\.[A-Z\d]{0,4})? is surrounded by the ? operator: (...)?. This means that the content within the round brackets may or may not be there.
First character is letter and not is 'U'. This is satisfied by [A-TV-Z], which matches all the upper case letters which are between A and T, V and Z inclusive. This omits the letter U.
Second and third is digit. \d{2} means match two digits.
Fourth is dot(.): This is satisfied by \.. The extra \ is needed because the period character is a special character in regular expressions, which means match any character (exception new lines, unless a special option is passed along).
Fifth to eight charactor is letter or digit. [A-Z\d]{0,4} means any letter or digits, repeated between 0 and 4 times.

Try this:
\b[a-tv-zA-TV-Z]\d{2}(\.[a-zA-Z0-9]{,4})?\b
I assume by your example the dot and everything after it is optional
This regex will match a word boundary \b, a letter other than u or U [a-tv-zA-TV-Z], two digits \d{2} and then an optional dot followed by 0-4 letters or digits (\.[a-zA-Z0-9]{,4})? and a second word boundary \b

This question is old, but I had the same issue of validating ICD-10 codes, so it seemed worth an updated answer.
As it turns out, there are two flavors of ICD-10 codes: ICD-10-CM and ICD-10-PCS. From their usage guidelines:
The ICD-10-CM is a morbidity classification published by the United
States for classifying diagnoses and reason for visits in all health
care settings.
and
The ICD-10-PCS is a procedure classification published by the United
States for classifying procedures performed in hospital inpatient
health care settings.
Both Sets
In both the ICD-10-CM and ICD-10-PCS coding systems, you can validate the structure of a code with a regular expression, but validating the content (in terms of which specific combinations of letters and numbers are valid) may be technically possible, but is practically infeasible. A lookup table would be a better bet.
ICD-10-CM
From the Conventions section of the guidelines:
Format and Structure:
The ICD-10-CM Tabular List contains categories, subcategories and
codes. Characters for categories, subcategories and codes may be
either a letter or a number. All categories are 3 characters. A
three-character category that has no further subdivision is equivalent
to a code. Subcategories are either 4 or 5 characters. Codes may be 3,
4, 5, 6 or 7 characters. That is, each level of subdivision after a
category is a subcategory. The final level of subdivision is a code.
Codes that have applicable 7th characters are still referred to as
codes, not subcategories. A code that has an applicable 7th character
is considered invalid without the 7th character.
According to this specification, you'd expect a valid regular expression would look like this:
^\w{3,7}$
However, a review of the actual values shows that, in all cases, the first character is an upper case letter, the second character is a digit, and any alphabetic characters in the remaining available positions are upper case as well. As such, you can use this information to more precisely specify what you're validating:
^[A-Z]\d[A-Z\d]{1,5}$
If you want to allow for a possible period in the fourth position followed by up to four more characters as specified by the OP:
^[A-Z]\d[A-Z\d](\.[A-Z\d]{0,4})?$
ICD-10-PCS
From the Conventions section of the guidelines:
One of 34 possible values can be assigned to each axis of
classification in the seven character code: they are the numbers 0
through 9 and the alphabet (except I and O because they are easily
confused with the numbers 1 and 0). The number of unique values used
in an axis of classification differs as needed...As with words in their
context, the meaning of any single value is a combination of its axis
of classification and any preceding values on which it may be
dependent...Within a PCS table, valid codes include all combinations
of choices in characters 4 through 7 contained in the same row of the
table. [For example], 0JHT3VZ is a valid code, and 0JHW3VZ is
not a valid code.
So to validate the structure of an ICD-10-PCS code:
^[A-HJ-NP-Z\d]{7}$

Use this exp simple :
'^([A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}|[A-TV-Za-tv-z]{1}[0-9]{1}[A-Za-z0-9]{1}.[A-Za-z0-9]{1,4})$'

Regex expression to catch variant of a word

I am looking for simple way to use regex and catch variant of word with simplest format.
For example, the 5 variants of the word below.
hike
hhike
hiike
hikke
hikkee
Using something similar to the format below...
[([a-zA-Z]){4,}]
Thanks

Are you looking for something like /h+i+k+e+/?
Meaning:
The literal h character repeated 1 to infinity times
The literal i character repeated 1 to infinity times
The literal k character repeated 1 to infinity times
The literal e character repeated 1 to infinity times
DEMO
If each character can maximum be there twice, you can use /h{1,2}i{1,2}k{1,2}e{1,2}/ meaning "present 1 or 2 times".

You probably cannot solve this generically (i.e. for any word) under standard regex syntax.
For a given word, as others have pointed out, it is trivial.

This is more of a soundex kind of problem I think:
https://stackoverflow.com/a/392236/514463

Need to capture single character, but ignore digit

I'm parsing out flight info.
Here's the sample data:
E0.777 7 3:09
E0.319 N 1:43
E0.735 8 1:45
E0.735 N 1:48
E0.M80 9 3:21
E0.733 1:48
I need to populate fields like this:
Equipment: 735
On Time: N
Duration: 1:48
Problem I'm having is capturing the Y or N character but ignoring the single digit, then capturing the duration.
This is the expression I have tried:
#"^.{3}(.{3})\s?([N|Y]?)?(?:[0-9]\s+)?(\w{4})"
Edit: I updated the sample data to clarify my question. Equipment is not always three digits, it could be a character and two digits. The data between the equipment and the duration could be a boolean N or Y, a single digit, or white space. Only the boolean should be captured.

Firstly, you mix up the concepts of alternation and character classes [Y|N] would match 3 different characters: Y or | or N. Either use (...) or leave out the pipe.
Secondly your double ? after the character class does not really do anything. Thirdly, at the end you only match consecutive spaces if a digit was found. But if there is no digit, the last ? will ignore the subpattern, thus not allowing spaces either.
Lastly, \w does not match :.
Try this:
#"^.{3}(\d{3})\s?(?:([NY])|\d)\s+(\d:\d\d)"
You should also think about restricting the repeated . at the beginning to a more precise character class (i.e \w{2}\., but I don't know the possibilities there).

#"^..\.(\d{3})\s(?:([YN])|\d)\s*(\S{4})"
Changed .{3} to ..\. which is a bit more specific about there being a literal . for character 3.
(?:([YN])|\d) matches either Y/N or a digit, but only captures a Y or N. Notice that it's [YN] not [Y|N].
Changed \w{4} to \S{4} since \w doesn't match colons :.

This will do it...
^\w\d\.(\d{3})\s(?:([YN])|\d)\s*(\d:\d{2})$
I made some other changes to your regex because it was easier for me to just rewrite it based off your data then to try to modify what you had.
This will capture the Y or N or it won't capture anything in that group. I also tried to be more specific with your duration regex.
Update: This works with your new requirements...
^\w\d\.(\w{3})\s(?:([YN])|\d|\s)\s*(\d:\d{2})$
You can see it working on your data here... http://regexr.com?32j1b
(hover over each line to see the matched groups)

This captures all lines with Y or N and ignores everything else:
^...(\d{3})\s*([YN])\s*(\d+:\d+)

Regular Expression to match pattern once or more with no partial matches

Better explained with examples:
HHH
HHHH
HHHBBHHH
HHHBH
BB
HHBH
I need to come up with a regexp that matches only 3 H's or a multiple of 3 H's (so 6, 9, 12, ... H's are ok as well) and 5 H's are not ok. And if possible I don't want to use Perl regexps.
So for the input above the regexp would match (1), (3) and (6) only.
I'm just starting with regular expressions here so I don't exactly know how I'm supposed to approach this.
edit
Just to clear something up:, an H can only be in one group of 3 H's. The group of 3 H's might be HHH or HHBH.
That's why in example 2 above it is not a match because the last H is not in a group of 3 H's. And you can't take the last 3 H's in a group because the middle 2 H's have already been inside a group before.

You can use the following regular expression:
^([^H]*H[^H]*H[^H]*H[^H]*)+$
It matches any string which contains in total 3 H or any multiple of 3. In between there might be any other character.
Explanation:
^ begin of string
( start of group
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]* any string of characters (or none) which is not 'H'
)+ containing the group once or twice or ...
$ end of string
By repeating the subpattern [^H]*H three times we make sure that there are indeed 3 H included, [^H]* allows any separating characters.
Note: use either egrep or run grep with additional argument -E.

Use this to match a multiple of 3 H's:
(H{3})+
Here is a complete regex for your examples:
^(H{3})+B*(H{3})*$
Edit: It looks like you need to count non-consecutive H's. In that case:
^(([^H]*H){3})+[^H]*$
That should match any string with a multiple of 3 H's.

Given the requirement that H's can be arbitrarily interleaved with non-H's, but that the total number of H's must be a non-zero multiple of 3 (so XXX, containing no H's, is not a match), then the total regular expression is anything but trivial. This is not a beginner's regular expression.
I'm going to assume that the dialect of regular expression treats {} and () as metacharacters for counting and grouping, and includes + for one-or-more. If you're using a regular expression system that has a different requirement (\{\}, for example) then adjust accordingly.
You need the regex to match the whole string, so there are no stray H's allowed. So, it must start with ^ and end with $. You need to allow an arbitrary number of non-H's at front and back. The H's may be separated by an arbitrary number of non-H's. That leads to:
^([^H]*H[^H]*H[^H]*H)+[^H]*$
Ouch; that is hard to read! It says the line must consist of 1 or more (+) groups of an arbitrary number of non-H's followed by an H, an arbitrary number of non-H's, another H, an arbitrary number of non-H's and a third H; all of which can be followed by an arbitrary number of non-H's.
Using the {} for counting:
^(([^H]*H){3})+[^H]*$
That's still hard to read. Note that my description said "arbitrary number of non-H's at front and back", but I only use the [^H]* at the back; that's because the repeating pattern allows an arbitrary number of non-H's at the front anyway so there's no need to repeat that fragment.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js