Remove all text between two brackets - regex

Suppose I have some text like this,
text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")
and I would like to remove (edit: get rid of) all of the text between the [ and ] (and the brackets themselves). What's the best way to do this? Here is my feeble attempt using regex and the stringr package:
str_extract(text, "\\[[a-z]*\\]")
Thanks for any help!

Use this (subject is the input string):
gsub("\\[[^\\]]*\\]", "", subject, perl=TRUE)
What the regex means:
\[        # a literal '['
[^\]]*    # any character except ']' (0 or more times, matching as many as possible)
\]        # a literal ']'

The following should do the trick. The ? makes .* lazy, so it matches as few characters as possible before the next ].
gsub('\\[.*?\\]', '', text)
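To see what the lazy quantifier changes, here is a minimal sketch on a shortened, made-up string (x is an illustrative stand-in, not the asker's data):

```r
# Shortened stand-in for the transcript (hypothetical sample, not the original text)
x <- "[A]: one. [B]: two."

# Greedy: .* runs to the LAST "]", so everything from the first "[" to the
# last "]" is removed as a single match
greedy <- gsub("\\[.*\\]", "", x)

# Lazy: .*? stops at the FIRST "]" after each "[", so each tag is removed separately
lazy <- gsub("\\[.*?\\]", "", x)

print(greedy)  # ": two."
print(lazy)    # ": one. : two."
```

The greedy version silently deletes the first speaker's words, which is why the lazy form (or a negated bracket expression) is needed here.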

Here's another approach:
library(qdap)
bracketX(text, "square")

I think this technically answers what you've asked, but you probably want to add \\: and a space to the end of the regex for prettier text (this also removes the colon and the space that follow each tag).
library(stringr)
str_replace_all(text, "\\[.+?\\]", "")
#> [1] ": We need tax policies that respect the wage earners..."
vs...
str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..."
Created on 2018-08-16 by the reprex package (v0.2.0).
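If stringr is not available, the same cleanup can be sketched in base R. This is only an illustration on a made-up string; making the colon and the space optional is my assumption, so that tags not followed by a colon are removed as well:

```r
# Shortened stand-in for the transcript (hypothetical sample)
x <- "[A]: one. [B]: two."

# \\[[^][]*\\] matches one bracketed tag; ":? ?" optionally consumes a
# following colon and a single space
cleaned <- gsub("\\[[^][]*\\]:? ?", "", x)

print(cleaned)  # "one. two."
```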

There is no need to use a PCRE regex with a negated character class / bracket expression; a "classic" TRE regex will work, too:
subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "
Details:
\\[ - a literal [ (must be escaped or used inside a bracket expression like [[] to be parsed as a literal [)
[^][]* - a negated bracket expression that matches 0+ chars other than [ and ] (note that the ] at the start of the bracket expression is treated as a literal ])
] - a literal ] (outside a bracket expression this character is not special in either PCRE or TRE regexps and does not have to be escaped).
If you want to only replace the square brackets with some other delimiters, use a capturing group with a backreference in the replacement pattern:
gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"
The (...) parenthetical construct forms a capturing group, and its contents can be accessed with a backreference \1 (as the group is the first one in the pattern, its ID is set to 1).
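The same capturing group can also be used to pull the bracketed text out rather than delete it. A minimal sketch, reusing the subject string from above (the variable names chunks and inner are mine):

```r
subject <- "Some [string] here and [there]"

# gregexpr() locates every bracketed chunk; regmatches() extracts them
chunks <- regmatches(subject, gregexpr("\\[[^][]*]", subject))[[1]]

# The backreference \\1 keeps only what was inside the brackets
inner <- sub("\\[([^][]*)]", "\\1", chunks)

print(chunks)  # "[string]" "[there]"
print(inner)   # "string" "there"
```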

Related

R - regular expression - capturing a number in file name

I have several files. Example names are as follows:
ABC2_5XYZ_7_data.csv
DEF2_10QST_7_data.csv
Every time I read the filenames, I would like to capture the number right after the first _ and store it in another variable.
In the above example these are "5" and "10".
Can anyone suggest something?
I think this would work. I added a couple more strings just to make sure. Since we are looking for the first and only match, we can use sub().
x <- c("ABC2_5XYZ_data.csv", "DEF2_10QST_data.csv", "A123_456ABC_data.csv", "X9F4_7912D_data.csv")
sub(".*_(\\d+).*", "\\1", x)
# [1] "5" "10" "456" "7912"
The regular expression .*_(\\d+).* captures the digits immediately following the underscore. The \\1 returns us the captured digits.
.* matches any character (except newline), zero or more times
_ matches the character _ literally
( starts the capturing group
\\d+ matches a digit one or more times
) ends the capturing group
.* matches any character (except newline), zero or more times
Further explanation can be found at regex101
Update after OP changed the question: In response to your comments, and the changed question, you can use the following. Note that we are still using sub() (not gsub()!) since we want the first match.
x <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv")
sub("[[:alnum:]]+_(\\d+).*", "\\1", x)
# [1] "5" "10"
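One caveat with sub() is that it returns the input unchanged when the pattern does not match, so a stray filename would pass through as if it were a number. A small sketch of a wrapper that guards against this; extract_first_number and the "notes.txt" entry are hypothetical, added only for illustration:

```r
# Hypothetical helper (not from the thread): returns the number after the
# first underscore as an integer, or NA when the name does not fit the pattern
extract_first_number <- function(filenames) {
  pattern <- "^[[:alnum:]]+_(\\d+).*$"
  digits <- sub(pattern, "\\1", filenames)
  # sub() leaves non-matching strings untouched, so mask those entries
  digits[!grepl(pattern, filenames)] <- NA_character_
  as.integer(digits)
}

files <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv", "notes.txt")
print(extract_first_number(files))  # 5 10 NA
```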

Cleaning strings in R: add punctuation w/o overwriting last character

I'm new to R and unable to find other threads with a similar issue.
I'm cleaning data that requires punctuation at the end of each line. I am unable to add, say, a period without overwriting the final character of the line preceding the carriage return + line feed.
Sample code:
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)
The contents of Data2:
[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?
Capturing group:
Use a capturing group around your character class and reference the group inside of your replacement.
gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Lookarounds:
You can switch on PCRE by using perl=T and use lookarounds to achieve this.
gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=T)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
The negated Unicode property \pP class matches any character except any kind of punctuation character.
Instead of using a capturing group, I used \K here. This escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence. As well, I used a Positive Lookahead to assert that a carriage return, newline sequence and a literal asterisk character follows.
There are several ways to do it:
Capture group:
gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)
Positive lookbehind (a zero-width assertion, so the preceding character is checked but not replaced):
gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=T)
EDIT: fixed the backslashes and removed the uncertainty about R support for these.
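The reason the original attempt lost a letter is that the character class is part of the match, so it gets replaced along with the line break. A stripped-down sketch on a made-up two-line string (x is illustrative) shows the three behaviours side by side:

```r
# Tiny stand-in for the transcript data (hypothetical sample)
x <- "ab\ncd"

# The class [^[:punct:]] consumes "b", so "b" is replaced too
overwritten <- sub("[^[:punct:]]\n", ".\n", x)

# Capturing the character and writing it back with \\1 preserves it
kept <- sub("([^[:punct:]])\n", "\\1.\n", x)

# A lookbehind only checks the character; it is never part of the match
asserted <- sub("(?<=[^[:punct:]])\n", ".\n", x, perl = TRUE)

print(overwritten)  # "a.\ncd"
print(kept)         # "ab.\ncd"
print(asserted)     # "ab.\ncd"
```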

Use regex to insert space between collapsed words

I'm working on a choropleth in R and need to be able to match state names with match.map(). The dataset I'm using sticks multi-word names together, like NorthDakota and DistrictOfColumbia.
How can I use regular expressions to insert a space between lower-upper letter sequences? I've successfully added a space but haven't been able to preserve the letters that indicate where the space goes.
places = c("NorthDakota", "DistrictOfColumbia")
gsub("[[:lower:]][[:upper:]]", " ", places)
[1] "Nort akota"       "Distric  olumbia"
Use parentheses to capture the matched expressions, then backreferences \1, \2, ... (written \\1, \\2 in an R string) to retrieve them:
places = c("NorthDakota", "DistrictOfColumbia")
gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", places)
## [1] "North Dakota" "District Of Columbia"
You want to use capturing groups to capture the matched context so you can refer back to each matched group in your replacement. To access a group, write two backslashes \\ followed by the group number.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('([[:lower:]])([[:upper:]])', '\\1 \\2', places)
# [1] "North Dakota" "District Of Columbia"
Another way, switch on PCRE by using perl=T and use lookaround assertions.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('[a-z]\\K(?=[A-Z])', ' ', places, perl=T)
# [1] "North Dakota" "District Of Columbia"
Explanation:
The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included. In effect, it throws away everything matched up to that point.
[a-z] # any character of: 'a' to 'z'
\K # '\K' (resets the starting point of the reported match)
(?= # look ahead to see if there is:
[A-Z] # any character of: 'A' to 'Z'
) # end of look-ahead
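A variant of the same idea uses a lookbehind together with the lookahead, so the match is purely the position between the two letters. This is a sketch rather than part of the answers above; "Ohio" is added only to show that single-word names pass through unchanged:

```r
places <- c("NorthDakota", "DistrictOfColumbia", "Ohio")

# Zero-width match: the position preceded by a lowercase letter and
# followed by an uppercase letter; nothing is consumed, only a space is inserted
spaced <- gsub("(?<=[a-z])(?=[A-Z])", " ", places, perl = TRUE)

print(spaced)  # "North Dakota" "District Of Columbia" "Ohio"
```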

Matching two single quotes or double quote

I have the following strings. They are lat/longs in degrees, minutes and seconds format, and can be entered as follows:
Option1: 25º 23" 40.6' or
Option2: 25º 23'' 40.6' or
Option3: 25 23 40.6
With one regex I would like to match all of these strings; the problem for me is matching both the " (double quote) and the '' (two single quotes).
I have the following so far.
^[+|-]?[0-9]{1,2}[\º| ][ ]?[0-9]{1,2}[\"|'{2}| ]
I am building and testing the regex in the terminal on Linux (Ubuntu). From the output I get in the terminal, it matches the " (double quote) but only ONE of the '' (two single quotes).
How can I change the regex to match the " (double quote) and the '' (two single quotes) in one expression?
Thanks in advance.
Check out this pattern:
([+-]?\d{1,2}(?:\.\d{1,2})?.)\s*(\d{1,2}(?:\.\d{1,2})?[\S]*)\s*(\d{1,2}(?:\.\d{1,2})?'?)
It does not depend on any particular separator character, supports up to two digits in each part, and resolves your issue.
Your regex has problems. For example, [\"|'{2}| ] matches a single ", |, ', {, 2, } or space. Try the following:
^([+-]?\d+)º? ?\b(\d+)\b(?:''|\")? ?([\d.]+)'?$
Explanation:
^ # Start of string
([+-]?\d+) # Match an integer
º?[ ]? # Match a degree and/or a space (both optional)
\b(\d+)\b # Match a positive integer (entire number)
(?:''|\")?[ ]? # Match quotes and/or space (all optional)
([\d.]+) # Match a floating point number
'? # Match an optional single quote
$ # End of string
I think what you really want to have with the Regex above is
^[+|-]?[0-9]{1,2}º? ?[0-9]{1,2}(\"|'{2})? ?[0-9]{1,2}\.[0-9]'?
Although this also matches weird things like
25 23'' 40.6
Your regex uses character classes (the sections in [ and ]), each of which can match only one single character. You can group multiple characters together with ( and ) and make these groups optional with a ?.
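To check a pattern like this against the three input styles, here is a small R sketch. It uses the anchored pattern from the second answer with one change of mine: the º sign is written as the escape \u00ba and wrapped in a non-capturing group, so the ? applies to the whole character regardless of encoding. The fourth input is a made-up negative case.

```r
# Pattern from the answer above as an R string (backslashes doubled)
pattern <- "^([+-]?\\d+)(?:\u00ba)? ?\\b(\\d+)\\b(?:''|\")? ?([\\d.]+)'?$"

inputs <- c(
  "25\u00ba 23\" 40.6'",  # Option 1: double quote
  "25\u00ba 23'' 40.6'",  # Option 2: two single quotes
  "25 23 40.6",           # Option 3: no symbols
  "25\u00ba 23' 40.6'"    # hypothetical bad input: only ONE single quote
)

# perl = TRUE selects the PCRE engine, which understands (?:...) groups
matched <- grepl(pattern, inputs, perl = TRUE)
print(matched)  # TRUE TRUE TRUE FALSE
```

The key piece is the alternation (?:''|\"), which treats two single quotes as one unit instead of listing characters inside a bracket expression.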

Regex - Find all matching words that don't begin with a specific prefix

How would I construct a regular expression to find all words that end in a string but don't begin with a string?
e.g. Find all words that end in 'friend' that don't start with the word 'girl' in the following sentence:
"A boyfriend and girlfriend gained a friend when they asked to befriend them"
The words boyfriend, friend and befriend should match. The word 'girlfriend' should not.
Off the top of my head, you could try:
\b # word boundary - matches start of word
(?!girl) # negative lookahead for literal 'girl'
\w* # zero or more letters, numbers, or underscores
friend # literal 'friend'
\b # word boundary - matches end of word
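As a quick check of the negative-lookahead pattern, here is a sketch in R (the question is language-agnostic, so the choice of R is mine). perl = TRUE is needed because R's default regex engine does not support lookaheads:

```r
sentence <- "A boyfriend and girlfriend gained a friend when they asked to befriend them"

# \\b(?!girl) rejects any word that starts with "girl";
# \\w*friend\\b then requires the word to end in "friend"
hits <- regmatches(
  sentence,
  gregexpr("\\b(?!girl)\\w*friend\\b", sentence, perl = TRUE)
)[[1]]

print(hits)  # "boyfriend" "friend" "befriend"
```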
Update
Here's another non-obvious approach which should work in any modern implementation of regular expressions:
Assuming you wish to extract a pattern which appears within multiple contexts but you only want to match if it appears in a specific context, you can use an alternation where you first specify what you don't want and then capture what you do.
So, using your example, to extract all of the words that either are or end in friend except girlfriend, you'd use:
\b # word boundary
(?: # start of non-capture group
girlfriend # literal (note 1)
| # alternation
( # start of capture group #1 (note 2)
\w* # zero or more word chars [a-zA-Z0-9_]
friend # literal
) # end of capture group #1
) # end of non-capture group
\b
Notes:
1. This is what we do not wish to capture.
2. And this is what we do wish to capture.
Which can be described as:
for all words
first, match 'girlfriend' and do not capture (discard)
then match any word that is or ends in 'friend' and capture it
In JavaScript:
const target = 'A boyfriend and girlfriend gained a friend when they asked to befriend them';
const pattern = /\b(?:girlfriend|(\w*friend))\b/g;
let result = [];
let arr;
while ((arr = pattern.exec(target)) !== null) {
    // arr[1] is only set when the capturing branch matched
    if (arr[1]) {
        result.push(arr[1]);
    }
}
console.log(result);
which, when run, will print:
[ 'boyfriend', 'friend', 'befriend' ]
This may work:
\w*(?<!girl)friend
you could also try
\w*(?<!girl)friend\w* if you wanted to match words like befriended or boyfriends.
I'm not sure if (?<! is available in all regex flavors, but this expression worked in Expresso (which I believe uses .NET).
Try this:
/\b(?!girl)\w*friend\b/ig
I changed Rob Raisch's answer into a regexp that finds words containing a specific substring but not also containing a different specific substring:
\b(?![\w_]*Unwanted[\w_]*)[\w_]*Desired[\w_]*\b
So, for example, \b(?![\w_]*mon[\w_]*)[\w_]*day[\w_]*\b will find every word with "day" in it (e.g. day, tuesday, daywalker), except if it also contains "mon" (e.g. monday).
Maybe useful for someone.
In my case I needed to exclude words that have a given prefix from the regex matching result.
The text was query-string params:
?=&sysNew=false&sysStart=true&sysOffset=4&Question=1
The prefix is sys, and I don't want the words that start with sys.
The key to solving the issue was the word boundary \b:
\b(?!sys)\w+\b
Then I added that part to the bigger regex for the query string:
(\b(?!sys)\w+\b)=(\w+)
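Applied to the sample query string, the combined pattern picks out the only parameter whose name does not start with sys. A minimal sketch in R (my choice of language; the answer does not name one), using regexec() to get the capture groups:

```r
query <- "?=&sysNew=false&sysStart=true&sysOffset=4&Question=1"

# Group 1 is the parameter name (must not start with "sys"), group 2 its value;
# values such as "false" are skipped because they are not followed by "="
m <- regmatches(
  query,
  regexec("(\\b(?!sys)\\w+\\b)=(\\w+)", query, perl = TRUE)
)[[1]]

print(m)  # "Question=1" "Question" "1"
```

The first element is the full match, followed by one element per capturing group.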