Explain this R regex - regex

Recently here an R question was answered by mrdwab that used a regex that was pretty cool (LINK). I liked the response but can't generalize it because I don't understand what's happening (I fooled with the different numeric values being supplied but that didn't really yield anything useful). Could someone break the regex down piece by piece and explain what's happening?
x <- c("WorkerId", "pio_1_1", "pio_1_2", "pio_1_3", "pio_1_4", "pio_2_1",
"pio_2_2", "pio_2_3", "pio_2_4")
gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", x) #Explain me please
Thank you in advance.

Anywhere you have a lowercase letter and two digits separated by underscores (e.g., a_1_2), the regex captures the letter and each digit in its own group. \\1, \\2, and \\3 are backreferences to those captured groups from the pattern:
\\1 <- a
\\2 <- 1
\\3 <- 2
The result of running gsub as you have it above is to find every such match, swap the two digits, and join them with a dot instead of the underscore. So, for example, a_1_2 becomes a_2.1.
"\\1_\\3\\.\\2"
# a_ 2 . 1
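For reference, running the call on x from the question gives:
gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", x)
## [1] "WorkerId" "pio_1.1" "pio_2.1" "pio_3.1" "pio_4.1" "pio_1.2" "pio_2.2" "pio_3.2" "pio_4.2"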

Related

Replace improper commas in CSV file

This question may have been asked before, but I couldn't find it. I have a list of CSV files (439 or so) where, in a few of the files, someone also used commas in editorial comments. The result is that I can't put the files into a data frame, since the files now do not have the same number of elements after splitting them. Anyways, the problem I'm facing looks like this:
vec1 <- paste("484,1213,0,62.0006,1,go -- late F1 max, but glide?")
vec2 <- paste("467,1387,0,62.0026,1,goes2")
ls <- list(vec1, vec2)
What I want to do is to have a data frame with six columns. If there wasn't a comma in the editorial comments for vec1, I could use (and have been using, until I found this problematic example) the following:
library(plyr)  # ldply() comes from plyr
df <- ldply(ls, function(x) unlist(strsplit(x[1], split = ",")))
However, I'm getting the obvious error message that the results do not all have the same length. Is there any way of getting rid of that comma, or turning it into a semi-colon, or ensuring that, if there are 7 elements in a vector, elements 6 and 7 are combined?
If it helps, this is how I'm reading the files in R (I'm using scan because there is other information in the files that I want. There are some odd encoding issues going on here as well, but this seems to work).
data <- scan(file, fileEncoding="latin1", blank.lines.skip = FALSE, what = "list", sep = "\n", quiet = TRUE)
If you need the comments, you can still replace the 6th comma with a semicolon and use your previous solution:
gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)
Regex explanation:
((?:[^,]*,){5}[^,]*) - a capturing group that we will refer to as Group 1 with \\1 in the replacement pattern, matching
(?:[^,]*,){5} - 5 sequences of non-comma characters followed by a comma
[^,]* - 0 or more non-commas
, - the comma we'll turn into a ; in the replacement
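For reference, applying it to vec1 from the question turns only the stray 6th comma (the one inside the comment) into a semicolon:
gsub("((?:[^,]*,){5}[^,]*),", "\\1;", vec1, perl=TRUE)
## [1] "484,1213,0,62.0006,1,go -- late F1 max; but glide?"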
Or (as #CathG pointed out) a \\K operator can also be used with Perl-like expressions:
sub("^([^,]+,){5}[^,]+\\K,", ";", vec1, perl=T)
From PCRE documentation:
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence.
However, it will not "normalize" any other commas that might follow.
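Putting that together with the split from the question, a minimal sketch (it assumes plyr is loaded for ldply, as in the original approach):
library(plyr)
# replace the stray 6th comma (when present) with a semicolon, then split as before
fixed <- lapply(ls, function(x) gsub("((?:[^,]*,){5}[^,]*),", "\\1;", x, perl = TRUE))
df <- ldply(fixed, function(x) unlist(strsplit(x[1], split = ",")))
df
##    V1   V2 V3      V4 V5                            V6
## 1 484 1213  0 62.0006  1 go -- late F1 max; but glide?
## 2 467 1387  0 62.0026  1                         goes2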

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the previous match. \K resets the starting point of the reported match, so any previously consumed characters are no longer included, which is similar to a lookbehind.
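A minimal illustration of \K on its own, using a toy string rather than the question's data:
gsub("a\\Kb", "X", "abab", perl = TRUE)  # only the "b" after each "a" is replaced
## [1] "aXaX"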
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

Replacing the first vowel-consonant occurrence with consonant-vowel using sub in R

I know that it should be something like this but definitely I am missing something in the syntax:
yy=sub(r'\b[aeiou][^aeiou]*',r'\b[^aeiou][aeiou]*',"abmmmm")
I expect to have "bammmm" as output
Error: unexpected string constant in "yy=sub(r'\b[aeiou][^aeiou]*'"
I am not sure what the exact syntax is.
Please run your code in RStudio or any R compiler. I am new to regex, and giving me Python code wouldn't help me understand the situation. Thanks!
This is what you want:
yy=sub("\\b([aeiou])([^aeiou])","\\2\\1","abmm")
I'll explain how it works:
Suppose you ask me to substitute any vowel-consonant pair with any consonant-vowel pair. That doesn't make much sense on its own: should I change ab to ba, ce, or da? It could be any one of them, because you never specified any relationship between the vowel in the vowel-consonant pair and the vowel in the consonant-vowel pair. Therefore it doesn't make sense to put a regular expression in the 2nd argument, and you are not allowed to.
To achieve what you asked for, you can add parentheses to the regular expression in the 1st argument. The first ( marks group 1, the second ( marks group 2, etc. (note: group 0 is the whole matched string). You can then use \1, \2, ... in the second argument to put the matched groups wherever you want them.
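Applied to the original input from the question, the backreference swap gives:
sub("\\b([aeiou])([^aeiou])", "\\2\\1", "abmmmm")
## [1] "bammmm"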
As an alternative to using a regular expression for this, there's a nice string reversal function in example(strsplit)
strReverse <- function(x)
  sapply(lapply(strsplit(x, NULL), rev), paste, collapse = "")
dd <- "abmmmm"
paste(strReverse(substr(dd, 1, 2)), substr(dd, 3, nchar(dd)), sep = "")
## [1] "bammmm"

Grep for Pattern in File in R

In a document, I'm trying to look for occurrences of a 12-character string that contains letters and digits. A sample string is: "PXB111X2206"
I'm trying to get the line numbers that contain this string in R using the below:
FileInput = readLines("File.txt")
prot_pattern="([A-Z0-9]{12})";
prot_string<-grep(prot_pattern,FileInput)
prot_string
This worked fine until it hit a document containing all upper-case titles and returned a line containing the word "CONCENTRATIO"
The string I am trying to look for is: "PXB111X2206". I am expecting the grep to return the line numbers containing the string "PXB111X2206". However, it is returning the line number containing the word "CONCENTRATIO".
What is wrong with my expression above? Any idea what I am doing wrong here?
Here is some sample input:
Each design objective described herein is significantly important, yet it is just one aspect of what it takes to achieve a successful project.
A successful project is one where project goals are identified early on and where the interdependencies of all building systems are coordinated concurrently from the planning and programming phase.
CONCENTRATION:
The areas of concentration for design objectives: accessible, aesthetics, cost effective, functional/operational, historic preservation, productive, secure/safe, and sustainable and their interrelationships must be understood, evaluated, and appropriately applied.
Each of these design objectives is presented in the design objectives document number. PXB111X2206.
Thanks & Regards,
Simak
You are using a very powerful tool for a very simple task. The expression
[A-Z0-9]{12}
will match any run of 12 uppercase letters and/or digits, for example the word "CONCENTRATIO". However, your "PXB111X2206" is not even 12 characters long, so that is not the string being matched. If you only want to match "PXB111X2206", you can use it as a regular expression itself. For example, if your file contents are:
foo
CONCENTRATIO.
bazz
foo bar bazz PXB111X2206 foo bar bazz
foo
bar
bazz
and you use:
grep('PXB111X2206',readLines("File.txt"))
then R will only match line 4 as you would wish.
EDIT
If you are looking for that specific pattern try:
grep('[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}',readLines("File.txt"))
That expression will match strings like 'AAADDDADDDD', where A is a capital letter and D a digit. The regular expression contains character classes (the symbols inside the square brackets) and quantifiers (the numbers inside the curly braces) that tell how many repetitions of the preceding class the expression accepts; if no quantifier is present, exactly one is assumed.
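As a quick check, here is a sketch against a trimmed-down version of the sample text (the line contents are abbreviated here):
lines <- c("Each design objective described herein is significantly important.",
           "CONCENTRATION:",
           "Each of these design objectives is presented in the design objectives document number. PXB111X2206.")
grep("[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}", lines)  # only the line with PXB111X2206
## [1] 3
grep("[A-Z0-9]{12}", lines)  # the original pattern matches the CONCENTRATION line instead
## [1] 2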
Let's take a look at what your regular expression means. [A-Z0-9] means any capital letter or digit, and {12} means the previous expression must occur exactly 12 times. The string CONCENTRATIO is 12 capital letters, so it's no surprise that grep picks it up. If you want to take out the matches that match just letters or just numbers, you could try something like
allletters <- grep("[A-Z]{12}",strings)
allnumbers <-grep("[0-9]{12}",strings)
both <- grep("[A-Z0-9]{12}",strings)
the matches you wanted would then be something like
both <- both[!both %in% union(allletters,allnumbers)]
Someone with better regexfu might have a more elegant solution, but this will work too.

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions, or must I do something as I would in MS Excel, using the MID and FIND functions? E.g. in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this, where the regular expression matches a space followed by any sequence of characters, and sub replaces that with an empty string:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative, if you wanted the two words in separate columns of a data frame, is as follows. Here as.is = TRUE makes the columns character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
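## [1] "USDZAR" "R157" "SPX"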
If you're like me, in that regexps will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
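## [1] "USDZAR" "R157" "SPX"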
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) without really knowing the first thing about regexps.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
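For completeness, a minimal sketch of that suggestion; \x20 is the regex hex escape for a space, and perl = TRUE is used so the PCRE engine interprets it:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub("\\x20.*", "", x, perl = TRUE)  # drop the first space and everything after it
## [1] "USDZAR" "R157" "SPX"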