What is the purpose of .*\\? - regex

I have been playing around with list.files() and I wanted to only list 001.csv through 010.csv and I came up with this command:
list_files <- list.files(directory, pattern = ".*\\000|010", full.names = TRUE)
This code gives me what I want, but I do not fully understand what is happening with the pattern argument. How does pattern = ".*\\000" work?

\\0 is a backreference that inserts the whole regex match to that point. Compare the following to see what that can mean:
sub("he", "", "hehello")
## [1] "hello"
sub("he\\0", "", "hehello")
## [1] "llo"
With strings like "001.csv" or "009.csv", what happens is that the .* matches zero characters, the \\0 repeats those zero characters one time, and the 00 matches the first two zeros in the string. Success!
This pattern won't match "100.csv" or "010.csv" because it can't find anything to match that is doubled and then immediately followed by two 0s. It will, though, match "1100.csv", because it matches 1, then doubles it, and then finds two 0s.
So, to recap, ".*\\000" matches any string beginning with xx00 where x stands for any substring of zero or more characters. That is, it matches anything repeated twice and then followed by two zeros.
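If the goal is simply to select 001.csv through 010.csv, a pattern that spells out the allowed names avoids relying on the backreference at all. The sketch below is one possible way to do that, not the asker's method: the file names are made up for illustration, and grep() on a character vector stands in for list.files() so that the example runs without a directory.

```r
# Hypothetical file names standing in for the contents of `directory`
files <- sprintf("%03d.csv", c(1:12, 100))

# 001-009 or 010, anchored at both ends so "100.csv" or "011.csv" cannot slip in
wanted <- grep("^(00[1-9]|010)\\.csv$", files, value = TRUE)
print(wanted)
```

The same pattern string can be passed as the pattern argument of list.files(). The parentheses matter: | has the lowest precedence, so without them the anchors would apply to only one alternative each.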

Related

regexp - find numbers in a string in any order

I need to find a regexp that allows me to find strings in which I have all the required numbers but only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
That is, the digits 1, 2 and 3, each exactly once, in any order.
I tried a lot of things and this seemed to be the closest, although it is still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3.
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you need to rule out repeated digits, you need a Perl-compatible regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=TRUE)
## => [1] "123" "213" "312"
Note the double-escaped back-reference. The (?!.*(.).*\\1) negative look-ahead checks that the string has no repeated characters, with the help of a capturing group (.) and a back-reference that requires the same captured text to appear again later in the string. If a repeated character is found, there is no match.
A negative look-ahead only asserts the absence of some pattern after the current regex engine position, i.e. it succeeds if the pattern inside it cannot be matched, and fails otherwise. Thus, it does not "consume" characters: the regex engine stays at the same location in the input string. In this regex, that location is the beginning of the string (^). So, right at the beginning of the string, the regex engine looks for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as group 1 with \\1. Thus, for 121 there will be no match, since the look-ahead fails as soon as it finds the two 1s.
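To see what each part of that pattern contributes, the two conditions can be tested separately. This is a small sketch on a shortened, illustrative version of the vector a; the names no_repeat and three_of_123 are made up for the example.

```r
a <- c("12", "123", "113", "213", "1123", "312", "3123")

# TRUE when no character occurs twice anywhere in the string
no_repeat <- grepl("^(?!.*(.).*\\1)", a, perl = TRUE)

# TRUE when the string is exactly three characters taken from 1, 2, 3
three_of_123 <- grepl("^[123]{3}$", a)

# Both conditions together leave only the permutations of 1, 2, 3
result <- a[no_repeat & three_of_123]
print(result)
```

"12" passes the look-ahead on its own (nothing repeats) and is only rejected by the [123]{3} part, which is why the two pieces have to be combined.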
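You can also use this:
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=TRUE)
Here the first digit is captured, each later digit is checked against the earlier captures, and the $ anchor keeps longer strings such as "3123" from matching.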

Replace a capture group with repeats of a single character while retaining the length of the capture group

Suppose you want to replace AXA with AAA, but also AXXXXXA with AAAAAAA.
Basically any number of X characters between two As with the appropriate number of As.
Using gsub() I tried:
gsub(x = "AXA", pattern = "(A)(X+)(\\1)", replacement = "\\1\\1\\1")
which gives AAA. However, it is AAA no matter how long X+ gets. How can I access the length of Subgroup 2 in the output?
Possible duplicate of this:
Replace repeating character with another repeated character
But IMHO sufficiently different for a separate question.
You have a fixed replacement pattern: you capture A in the first group, so \\1 refers to A, and you always get 3 As. You need a different approach: replace each X that sits between two As. It is possible with a Perl-style regex:
input = "AXXXA"
gsub("(?:A|(?<!^)\\G)\\KX(?=X*A)", "A", input, perl=TRUE)
Output of the demo code:
[1] "AAAAA"
\G anchors each match at the end of the previous one, which forces consecutive matches, and \K drops the initially matched A from the match. The (?=X*A) look-ahead makes sure that only Xs stand between the current X and the closing A.
EDIT:
This approach works with longer strings, too (here, we are replacing each Xyz between 123 with A):
input = "123XyzXyzXyz123"
gsub("(?:123|(?<!^)\\G)\\KXyz(?=(?:Xyz)*123)", "A", input, perl=TRUE)
Output: [1] "123AAA123"
EDIT 2:
To replace any letters between 2 As we can use \p{L} shorthand character class to match any letter before A:
gsub("(?:A|(?<!^)\\G)\\K\\p{L}(?=\\p{L}*A)", "A", input, perl=TRUE)
=> [1] "XSDFAAAAAA"
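For comparison, the length-preserving replacement asked about in the question can also be done without \G and \K, by locating each run of X and building a replacement of the same length. This is an alternative sketch, not the approach used in the answer above; fill_between is a hypothetical helper name.

```r
# Replace every run of X that sits between two As with a run of As of equal length
fill_between <- function(x) {
  # Look-behind and look-ahead keep the surrounding As out of the match
  m <- gregexpr("(?<=A)X+(?=A)", x, perl = TRUE)
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(runs) strrep("A", nchar(runs))
  )
  x
}

print(fill_between(c("AXA", "AXXXXXA", "AXAXXA")))
```

Because nchar() is applied to each matched run, the length of what the question calls subgroup 2 is available when the replacement is built, which a fixed replacement string in gsub() cannot provide.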

Remove last occurrence of character

A question came across talkstats.com today in which the poster wanted to remove the last period of a string using regex (not strsplit). I made an attempt to do this but was unsuccessful.
N <- c("59.22.07", "58.01.32", "57.26.49")
#my attempts:
gsub("(!?\\.)", "", N)
gsub("([\\.]?!)", "", N)
How could we remove the last period in the string to get:
[1] "59.2207" "58.0132" "57.2649"
Maybe this reads a little better:
gsub("(.*)\\.(.*)", "\\1\\2", N)
[1] "59.2207" "58.0132" "57.2649"
Because it is greedy, the first (.*) will match everything up to the last . and store it in \\1. The second (.*) will match everything after the last . and store it in \\2.
It is a general answer in the sense that you can replace the \\. with any character of your choice to remove the last occurrence of that character. There is only one replacement to do!
You can even do:
gsub("(.*)\\.", "\\1", N)
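The role of greediness in these patterns can be checked by switching the first group to a lazy quantifier. A minimal sketch, assuming a single sample string; the variable names are illustrative only.

```r
N <- "59.22.07"

# Greedy (.*) runs to the last dot, so the last dot is the one removed
greedy <- sub("(.*)\\.(.*)", "\\1\\2", N)

# Lazy (.*?) stops at the first dot, so the first dot is removed instead
lazy <- sub("(.*?)\\.(.*)", "\\1\\2", N, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```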
You need this regex: -
[.](?=[^.]*$)
And replace it with empty string.
So, it should be like: -
gsub("[.](?=[^.]*$)","",N,perl = TRUE)
Explanation: -
[.] // Match a dot
(?= // Followed by
[^.] // Any character that is not a dot.
* // with 0 or more repetition
$ // Till the end. So, there should not be any dot after the dot we match.
)
So, as soon as a dot (.) is found inside the look-ahead, the match fails, because there is another dot somewhere after the dot that the pattern is currently trying to match.
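The look-ahead can be checked by asking where it matches. The sketch below extends the sample vector with two made-up edge cases, a string with a single dot and a string with none.

```r
N <- c("59.22.07", "58.01.32", "57.26.49", "1.5", "nodots")

# Position of the dot that the look-ahead accepts (-1 means no match)
last_dot <- as.integer(regexpr("[.](?=[^.]*$)", N, perl = TRUE))
print(last_dot)

# At most one dot per string can satisfy the look-ahead, so sub() is enough
cleaned <- sub("[.](?=[^.]*$)", "", N, perl = TRUE)
print(cleaned)
```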
I'm sure you know this by now since you use stringi in your packages, but you can simply do
N <- c("59.22.07", "58.01.32", "57.26.49")
stringi::stri_replace_last_fixed(N, ".", "")
# [1] "59.2207" "58.0132" "57.2649"
I'm pretty lazy with my regex, but this works:
gsub("(.*)(\\.)([0-9]+$)","\\1\\3",N)
I tend to take the opposite approach from the standard. Instead of replacing the '.' with a zero-length string, I just parse the two pieces that are on either side.

regular expression to strip leading characters up to first encountered digit

I have a string titled thisLine and I'd like to remove all characters before the first integer. I can use the command
regexpr("[0123456789]",thisLine)[1]
to determine the position of the first integer. How do I use that index to split the string?
The short answer:
sub('^\\D*', '', thisLine)
where
^ matches the beginning of the string
\\D matches any non-digit (it is the opposite of \\d)
\\D* tries to match as many consecutive non-digits as possible
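A quick check of that pattern on a few made-up inputs, including the edge cases of a string that already starts with a digit and a string with no digit at all:

```r
thisLine <- c("abc123def", "42 apples", "no digits")

# Drop the leading run of non-digits; a string that starts with a digit is left alone
stripped <- sub("^\\D*", "", thisLine)
print(stripped)
```

Note that a string containing no digit at all is reduced to the empty string, since \\D* then matches the whole input.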
Another option, using a lazy quantifier and a capture group:
sub("^.*?(\\d)","\\1",thisLine)
#breaking down the regex
#^ beginning of line
#. any character
#* repeated any number of times (including 0)
#? lazy modifier (match the fewest characters possible with *)
#() groups the digit
#\\d digit
#\\1 backreference to first captured group (the digit)
You want the substring function.
Or use gsub to do work in one shot:
> gsub('^[^[:digit:]]*[[:digit:]]', '', 'abc1def')
[1] "def"
You may want to include that first digit, which can be done with a capture:
> gsub('^[^[:digit:]]*([[:digit:]])', '\\1', 'abc1def')
[1] "1def"
Or, as flodel and Alan indicate, simply replace all leading non-digits with an empty string. See flodel's answer.
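The question asked how to use the index returned by regexpr() directly. A minimal sketch of that route with substring() follows; from_first_digit is a hypothetical helper name, and leaving strings without any digit unchanged is an assumption about the desired behaviour.

```r
# Cut the string at the index returned by regexpr(); leave it unchanged if no digit exists
from_first_digit <- function(x) {
  pos <- as.integer(regexpr("[0-9]", x))
  ifelse(pos > 0, substring(x, pos), x)
}

print(from_first_digit(c("abc123def", "42 apples", "no digits")))
```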

RegEx match character before digit but

only after at least 3 characters and only one of those characters should be matched e.g.
for lumia820 the match should be a8 but for aa6 there should not be any match.
My current attempt is /([a-z]{3,})([0-9])/, however this wrongly includes the leading characters. This is probably an easy one for regex specialists but I am completely stuck here.. Can someone pls help?
Assuming you're in an environment that allows lookbehinds, you could do this:
/(?<=[a-z]{2,})([a-z][0-9])/
That will look for two or more letters right before what we want to capture, make sure that they're there without including them in the capture group, and then capture the third (or more) letter followed by the number. The capture itself will make sure that the third letter is there.
@HolyMac per your comment:
Note that I am using c#, and I'm not sure of the differences with Objective-C, but the following matches f9 for me:
string testString = "abasfsdf9314";
Regex regex = new Regex("(?<=[a-z]{2,})([a-z][0-9])");
Match match = regex.Match(testString);
If you need at least 3, you can use {2,} to match 2 or more, and then capture the following character along with the next digit:
/[a-z]{2,}([a-z][\d])[\d]*/
[a-z]{2,} matches at least 2 characters at the start. This ensures there are 2 or more characters before the one you capture.
([a-z][\d]) captures the next character followed by the first digit
[\d]* matches any remaining trailing digits.
If this must be anchored, don't forget ^$.
/^[a-z]{2,}([a-z][\d])[\d]*$/
JavaScript example:
// Matching example aabc9876 yields c9
"a string with aabc9876 and other stuff".match(/[a-z]{2,}([a-z][\d])[\d]*/)
// ["aabc9876", "c9"]
// Non-matching example with zx8
"a string with zx8 should not match".match(/[a-z]{2,}([a-z][\d])[\d]*/)
// null
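For readers following the R threads earlier on this page, the capture-group approach carries over without a look-behind. This is a sketch using base R's regexec(); the input vector reuses the strings from the question and the C# example, and hits is an illustrative name.

```r
x <- c("lumia820", "aa6", "abasfsdf9314")

# regexec() returns the full match plus each capture group; no match gives character(0)
m <- regmatches(x, regexec("[a-z]{2,}([a-z][0-9])", x))

# Keep only capture group 1, or NA when no digit is preceded by at least three letters
hits <- vapply(m, function(g) if (length(g)) g[2] else NA_character_, character(1))
print(hits)
```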