Extract everything inside quotation marks but keep quoted content quoted - regex

I have the following case
"_,\"'() is a marker of \"'_( ,)\"."
and I want to extract this string with Regex such that:
_,\"'() marker \"'_( ,)\"
matches.
Another example for better readability (however the previous example is more important for the use case)
"Test is a marker for 'testing'"
which should result in
Test marker testing
and
G is a abbreviation for "GDP ('Gross Domestic Product')"
which should result in
G abbreviation GDP ('Gross Domestic Product')
There are only two options: either marker or abbreviation.
My current regex is the following:
/(.*)+ is the (father|mother) of (?:"([^,]*)")./
But it doesn't work with the first example.
Any help is much appreciated.

You could use
(\w+) is a (marker|abbreviation) for ("[^"]*?"|'[^']*?')
R Example
library(stringr)
convert <- function(s) {
res <- str_match(s, "(\\w+) is a (marker|abbreviation) for (\"[^\"]*?\"|'[^']*?')")
# return the pieces joined by spaces, with the outer quotes stripped from the last one
paste(res[2], res[3], substr(res[4], 2, nchar(res[4]) - 1))
}
print(convert("Test is a marker for 'testing'")) # Test marker testing
print(convert("G is a abbreviation for \"GDP ('Gross Domestic Product')\"")) # G abbreviation GDP ('Gross Domestic Product')
Also, see the demo of the regex
P.S. As most of your questions were about R language, I thought to show an example exactly in R. Hope it is helpful.
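For anyone who would rather not depend on stringr, here is a minimal base R sketch of the same idea. The name convert_base is made up for illustration, and it assumes the same sentence shape as the pattern above. Instead of substr it strips one layer of matching outer quotes with a backreference, so quotes nested inside stay as they are.

```r
# Hypothetical base R variant of convert(); not part of the original answer.
convert_base <- function(s) {
  pattern <- "(\\w+) is a (marker|abbreviation) for (\"[^\"]*\"|'[^']*')"
  # regexec() keeps the capture groups; m[1] is the full match
  m <- regmatches(s, regexec(pattern, s))[[1]]
  if (length(m) == 0) return(NA_character_)  # sentence did not match
  # drop the outer quote pair only; \\1 forces the closing quote to equal the opening one
  inner <- sub("^([\"'])(.*)\\1$", "\\2", m[4], perl = TRUE)
  paste(m[2], m[3], inner)
}

convert_base("Test is a marker for 'testing'")
convert_base("G is a abbreviation for \"GDP ('Gross Domestic Product')\"")
```

A string that does not follow the "X is a marker/abbreviation for ..." shape gives NA rather than an error.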

Related

Extract info inside parenthesis in R

I have some rows, some have parenthesis and some don't. Like ABC(DEF) and ABC. I want to extract info from parenthesis:
ABC(DEF) -> DEF
ABC -> NA
I wrote
gsub(".*\\((.*)\\).*", "\\1",X)
It works well for ABC(DEF), but outputs "ABC" when there are no parentheses.
If you do not want to get ABC when using sub with your regex, you need to add an alternative that matches any string without parentheses, so that the whole string is replaced with the (empty) capture group.
X <- c("ABC(DEF)", "ABC")
sub(".*(?:\\((.*)\\)).*|.*", "\\1",X)
The added part is the trailing |.* alternative.
See the IDEONE demo.
Note you do not have to use gsub, you only need one replacement to be performed, so a sub will do.
Also, a stringr str_match would also be handy for this task:
str_match(X, "\\((.*)\\)")
or
str_match(X, "\\(([^()]*)\\)")
Using str_extract() will work.
library(stringr)
df$`new column` <- str_extract(df$`existing column`, "(?<=\\().+?(?=\\))")
This creates a new column containing any text found inside parentheses in an existing column. If a value has no parentheses, it is filled in with NA.
The inspiration for my answer comes from this answer on the original question about this topic
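If you want the NA behaviour without loading stringr, something like the following base R sketch should do it. The variable names are only for illustration, and it assumes there is at most one unnested pair of parentheses per element.

```r
X <- c("ABC(DEF)", "ABC")

# lookbehind/lookahead need perl = TRUE; regexpr() returns -1 where nothing matches
m <- regexpr("(?<=\\()[^()]*(?=\\))", X, perl = TRUE)

out <- rep(NA_character_, length(X))  # start with NA everywhere
out[m != -1] <- regmatches(X, m)      # regmatches() returns only the actual matches
out
# [1] "DEF" NA
```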

Regexp match all between parenthesis.

I have the following snippet of text:
#article{carr2006,
title={Techniques for qualitative and quantitative measurement of aspects of laser-induced damage important for laser beam propagation},
author={Carr, CW and Feit, MD and Nostrand, MC and Adams, JJ},
journal={Meas. Sci. Technol.},
volume={17},
number={7},
pages={1958},
year={2006},
publisher={IOP Publishing}
}
#article{NIF1998,
author = {Schwartz, Sheldon and Feit, Michael D. and Kozlowski, Mark R. and Mouser, Ron P.},
title = {Current 3-ω large optic test procedures and data analysis for the quality assurance of National Ignition Facility optics},
journal = {Proc. SPIE},
volume = {3578},
number = {},
pages = {314-321},
year = {1999},
}
And I've been trying to extract the article by its tag; however, I fail to understand how greedy/non-greedy matching works, or rather how to capture everything in the brackets when it contains more brackets :/
The following regexp returns a result up until first brackets, which is not what I'm aiming for...
/\{(carr2006[^}]+)\}?/s
Also was trying to capture full text with #article in front, but that doesn't work either...
/#*\{(carr2006[^}]+)\}?/s
Any explanations on what I'm doing wrong would be helpful :)
You may change your regex like below, substituting your own tag (e.g. carr2006) for 1st_standard.
#\w+\{1st_standard(?:,\s*\w+\s*=\s*(?:{[^}]*}|"[^"]*"))+,?\s*\}
DEMO
\s* matches any run of whitespace characters, including line breaks, so the pattern also works across the lines of an entry.
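To try the pattern out in R, here is a rough sketch with the tag passed in as a parameter. The helper entry_pattern and the shortened sample text are made up for illustration, and the tag is assumed to contain only word characters (it is pasted into the regex unescaped).

```r
# Shortened sample in the same shape as the question's text (hypothetical data)
bib <- paste(
  "#article{carr2006,",
  "title={Techniques for measurement},",
  "volume={17},",
  "year={2006}",
  "}",
  "#article{NIF1998,",
  "author = {Schwartz, Sheldon},",
  "year = {1999},",
  "}",
  sep = "\n"
)

# Build the pattern for one tag: repeated ", key = {value}" fields, then the closing brace
entry_pattern <- function(tag) {
  paste0("#\\w+\\{", tag,
         "(?:,\\s*\\w+\\s*=\\s*(?:\\{[^}]*\\}|\"[^\"]*\"))+,?\\s*\\}")
}

entry <- regmatches(bib, regexpr(entry_pattern("carr2006"), bib, perl = TRUE))
cat(entry, "\n")
```

Note that [^}]* stops at the first closing brace, so this only works as long as field values do not themselves contain nested braces.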

",(?!.*\\))" returning "Invalid Regex" error in R

I've got a string that I'm working with and I'm trying to select only the commas that are outside of the parentheses so that I can split the string based on that. Here's the string I'm working with:
"LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
I'm trying to use the regex mentioned in the question title and it's telling me that it's not valid. Presumably this is because the closing parenthesis that is supposed to be escaped is being recognized by R as the parenthesis closing the match group and so the second parenthesis is throwing everything off. I'm just curious about how to work around this. Here is the syntax I'm using:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ',(?!.*\\))')
I can obviously just do the inverse of what I'm doing now and instead of splitting the text on the commas outside of the parentheses, simply replace the commas inside of the parentheses and then split the string on commas, but I'd like to know why this isn't working.
I believe your regex isn't working because it uses Perl-style syntax (the (?!...) negative lookahead), which R's default regex engine does not support; it requires the perl=T flag. I think it is also slightly malformed in that you should check that the opening and closing parentheses are complete. I think this is a general solution that matches more than just your specific case:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ",(?![^(]*\\))", perl=T)
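As a quick check of what that split produces, here is a small sketch; the trimws() call and the variable name parts are my additions, only there to drop the space left after each comma.

```r
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"

# split on commas that are NOT followed by a ")" without an intervening "("
parts <- strsplit(counties, ",(?![^(]*\\))", perl = TRUE)[[1]]
parts <- trimws(parts)  # remove the leading space after each comma
parts
```

The parenthesised township list stays together in the first element, and the remaining counties come out as separate elements.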
Because you have an unbalanced ),
https://regex101.com/r/jE0lI9/1
should be:
counties <- "LIVINGSTON (Townships of Brighton, Deerfield, Genoa, Hartland,, Oceola & Tyrone), MACOMB, MONROE, OAKLAND, SANILAC, ST. CLAIR, AND WAYNE COUNTIES"
tmp <- strsplit(counties, ',(?!.*\\))', perl=TRUE)
If I have understood the question correctly, try this:
strsplit(gsub("\\(.*\\)", "", counties), ",")[[1]]

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position where the last match ended. \K resets the starting point of the reported match, so any previously consumed characters are no longer included in it, which is similar to a lookbehind.
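To see the two features in isolation, here are two small sketches. They are simplified illustrations, not the pattern from this answer; the inputs and variable names are made up.

```r
# \K: "k" must be matched, but only the vowel after it is reported and replaced.
# This simplified version handles a single vowel between two k's.
res_k <- gsub("k\\K[aeiou](?=k)", "", c("kek", "koki", "ikka"), perl = TRUE)
res_k
# [1] "kk"   "kki"  "ikka"

# \G: each match has to start where the previous one ended,
# so only the leading run of digits is replaced.
res_g <- gsub("\\G[0-9]", "x", "123a45", perl = TRUE)
res_g
# [1] "xxxa45"
```

Combining both, as in the pattern above, lets one "k" prefix be followed by any number of consecutive vowel removals.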
Regular Expression Explanation
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.

Excel RegEx Functions in R

I regularly work with Excel sheets where some fields (observations) contain large amounts of text in a partly structured form (at least visually).
So the content of a single Cell/Obs might be somewhat like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions which I can use to look for a string within the cell, so let's say that the data is in "A1";
in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
Function GetPostCode(PostCode As Range) As String
regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
regex.IgnoreCase = True
regex.MultiLine = True
Set X = regex.Execute(PostCode.Value)
For Each x1 In X
GetPostCode = UCase(x1)
Exit For
Next
End Function
What kind of structures/functions could I use in r to accomplish this?
The cells really contain much more data than that; it's purely an example, and I have a number of different "get" functions with different regexes.
I've had a good look at all the grep-type commands but am struggling with limited/developing R skills.
I've been working around this kind of principle, but have pretty much stalled (where textfield is the column with my text in it, obviously!). I can get a list of all the rows that contain a post code, but not JUST the post code:
df$postcode <- df[(df$textfield = grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any Help appreciated!
I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches to extract the matching parts of the strings:
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options probably include str_extract from stringr or stri_extract from stringi packages.
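To get something closer to the Excel GetPostCode function, the regexpr/regmatches idea can be wrapped up roughly like this. The name get_postcode and the sample vector are made up for illustration; it returns NA where no post code is found and upper-cases the result the way the VBA version does.

```r
# Hypothetical R counterpart of the VBA GetPostCode function
get_postcode <- function(text) {
  m <- regexpr("[A-Z]{3}\\d{3,}\\b\\w*", text, ignore.case = TRUE, perl = TRUE)
  hit <- !is.na(m) & m != -1               # elements that actually matched
  out <- rep(NA_character_, length(text))
  out[hit] <- toupper(regmatches(text, m)) # first match per element, upper-cased
  out
}

cells <- c("My name is John Doe\nMy Post code is abc123\nMy Favorite Pet is: A dog",
           "no code in this cell")
get_postcode(cells)
# [1] "ABC123" NA
```

With a data frame this could be used as df$postcode <- get_postcode(df$textfield), assuming textfield is a character column.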