I'm cleaning text strings in R. I want to remove all the punctuation except apostrophes and hyphens. This means I can't use the [:punct:] character class (unless there's a way of saying [:punct:] but not '-).
! " # $ % & ( ) * + , . / : ; < = > ? @ [ \ ] ^ _ { | } ~ and the backtick must come out.
For most of the above, escaping is not an issue. But for square brackets, I'm really having issues. Here's what I've tried:
gsub('[abc]', 'L', 'abcdef') #expected behaviour, shown as sanity check
# [1] "LLLdef"
gsub('[[]]', 'B', 'it[]') #only 1 substitution, ie [] treated as a single character
# [1] "itB"
gsub('[\[\]]', 'B', 'it[]') #single escape, errors as expected
Error: '\[' is an unrecognized escape in character string starting "'[\["
gsub('[\\[\\]]', 'B', 'it[]') #double escape, single substitution
# [1] "itB"
gsub('[\\]\\[]', 'B', 'it[]') #double escape, reversed order, NO substitution
# [1] "it[]"
I'd prefer not to use fixed=TRUE with gsub since that will prevent me from using a character class. So, how do I include square brackets in a regex character class?
ETA additional trials:
gsub('[[\\]]', 'B', 'it[]') #double escape on closing ] only, single substitution
# [1] "itB"
gsub('[[\]]', 'B', 'it[]') #single escape on closing ] only, expected error
Error: '\]' is an unrecognized escape in character string starting "'[[\]"
ETA: the single substitution was caused by not setting perl=T in my gsub calls. ie:
gsub('[[\\]]', 'B', 'it[]', perl=T)
You can use [:punct:] if you combine it with a negative lookahead:
(?!['-])[[:punct:]]
This way a [:punct:] character is only matched if it is not in ['-]. The negative lookahead assertion (?!['-]) enforces this condition: it fails when the next character is a ' or a -, and then the complete expression fails at that position.
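As a minimal sketch of the lookahead approach in R: the sample string below is made up for illustration, and perl = TRUE is assumed to be required because lookaheads are a PCRE feature rather than part of R's default engine.

```r
# Hypothetical sample text (not from the question)
x <- "it's a well-known [test], isn't it?"

# (?!['-]) refuses to match when the next character is ' or -,
# so only the remaining punctuation is removed.
cleaned <- gsub("(?!['-])[[:punct:]]", "", x, perl = TRUE)

print(cleaned)
# [1] "it's a well-known test isn't it"
```

The apostrophes and the hyphen survive, while the brackets, comma and question mark are dropped.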
Inside a character class you only need to escape the closing square bracket:
Try using '[[\\]]' or '[[\]]' (I am not sure about escaping the backslash as I don't know R.)
See this example.
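For the square-bracket part specifically, here is a small sketch of two spellings that should both replace each bracket separately. The trials in the question suggest that R's default engine does not treat \] inside a class as an escaped bracket, so the first form uses the POSIX convention of listing ] first; the second form relies on PCRE via perl = TRUE.

```r
# Default engine: a ] placed first in the class is taken literally,
# so "[][]" is a class containing ] and [.
default_engine <- gsub("[][]", "B", "it[]")

# PCRE: backslash escapes are honoured inside a character class.
pcre_engine <- gsub("[\\[\\]]", "B", "it[]", perl = TRUE)

print(default_engine)
# [1] "itBB"
print(pcre_engine)
# [1] "itBB"
```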
Related
I need a regular expression that will do the following transformation:
Input: ab\xy
Output: aby
Input: ab\\xy
Output: ab\xy
Consider all of those backslashes as LITERAL backslashes. That is, the first input is the sequence of characters ['a', 'b', '\', 'x', 'y'], and the second is ['a', 'b', '\', '\', 'x', 'y'].
The rule is "in a string of characters, if a backslash is encountered, delete it and the following character ... unless the following character is a backslash, in which case delete only one of the two backslashes."
This is escape sequence hell and I can't seem to find my way out.
You may use
(?s)\\(\\)|\\.
and replace with $1 to restore the \ when a double backslash is found.
Details:
(?s) - a dotall modifier so that . can match any character, including line break characters
\\(\\) - matches a backslash and then matches and captures another backslash into Group 1
| - or
\\. - matches any escape sequence (a backslash + any char).
See the regex demo and a PHP demo:
$re = '/\\\\(\\\\)|\\\\./s';
$str = 'ab\\xy ab\\\\xy ab\\\\\\xy';
echo $result = preg_replace($re, '$1', $str);
// => aby ab\xy ab\y
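The same pattern can be sketched in R, assuming perl = TRUE so that the (?s) modifier is available. The helper name unescape is made up for this illustration. Each literal backslash needs four backslashes in the R source: two for the R string parser, times two for the regex engine.

```r
# Hypothetical helper: drop "\" plus the following character,
# but turn a doubled backslash into a single one.
unescape <- function(x) {
  gsub("(?s)\\\\(\\\\)|\\\\.", "\\1", x, perl = TRUE)
}

# R source "ab\\xy" is the five characters a b \ x y, and so on.
inputs <- c("ab\\xy", "ab\\\\xy", "ab\\\\\\xy")
result <- unescape(inputs)

writeLines(result)
# aby
# ab\xy
# ab\y
```

When the second alternative matches, group 1 did not take part in the match, so \\1 contributes nothing and the escape sequence is simply removed.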
I am suffering from regex illness. I am taking medicine but nothing is happening, and now I am stuck again with this issue:
<cfset Change = replacenocase(mytext,'switch(cSelected) {',' var x = 0;while(x < cSelected.length){switch(cSelected[x]) {','one')>
This did not change anything.
I tried REReplace too:
<cfset Change = rereplacenocase(mytext,'[switch(cSelected) {]+',' var x = 0;while(x < cSelected.length){switch(cSelected[x]) {','one')>
This created weird results.
Parentheses, square brackets, and curly brackets are special characters in any implementation of RegEx. Wrapping something in [square brackets] means any of the characters within so [fifty] would match any of f,i,t,y. The plus sign after it just means to match any of these characters as many times as possible. So yes [switch(cSelected) {]+ would replace switch(cSelected) {, but it would also replace any occurrence of switch, or s, or w, or the words this or twitch() because each character in these is represented in your character class.
As a regex, you would instead want (switch\(cSelected\) \{) (the + isn't useful here, and we have to escape the parentheses that we want literally represented). It is also a good idea to escape curly braces because they have special meaning in parts of regex, and I believe that when you're new to regex, there's no such thing as over-escaping.
(switch # Opens Capture Group
# Literal switch
\(cSelected # Literal (
# Literal cSelected
\) # Literal )
# single space
\{ # Literal {
) # Closes Capture Group
You can also try something like (switch\(cSelected\)\s*\{), using the token \s* to represent any number of whitespace characters.
(switch # Opens CG1
# Literal switch
\(cSelected # Literal (
# Literal cSelected
\) # Literal )
\s* # Token: \s for white space
# * repeats zero or more times
\{ # Literal {
) # Closes CG1
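The difference between the character class and the escaped pattern can be sketched in R (not ColdFusion, so the function names differ from the question). The mytext value below is a made-up stand-in for the asker's variable, and the replacement text is copied from the question.

```r
# Hypothetical input standing in for the asker's mytext
mytext <- "function f() { SWITCH(cSelected)   { case 1: break; } }"
replacement <- "var x = 0;while(x < cSelected.length){switch(cSelected[x]) {"

# Escaped parentheses and brace are literal; \s* tolerates any whitespace.
fixed_text <- sub("switch\\(cSelected\\)\\s*\\{", replacement, mytext,
                  ignore.case = TRUE)

# The bracketed version is a character class: it matches any run of
# the listed characters, including unrelated words.
target <- "this twitch"
class_hit <- regmatches(target, regexpr("[switch(cSelected) {]+", target))

print(fixed_text)
print(class_hit)
# [1] "this twitch"
```

sub() is used rather than gsub() to mirror the 'one' scope in the question, which replaces only the first occurrence.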
What's needed, and the reason people can't be of much assistance, is an excerpt from what you're trying to modify and more lines of code.
Potential reasons that the non-regex ReplaceNoCase() isn't working are that it can't make the match it needs, which could be a whitespace issue, or that you have two variables setting Change to an action based on the mytext variable.
I'm new to R and unable to find other threads with a similar issue.
I'm cleaning data that requires punctuation at the end of each line. I am unable to add, say, a period without overwriting the final character of the line preceding the carriage return + line feed.
Sample code:
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)
The contents of Data2:
[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?
Capturing group:
Use a capturing group around your character class and reference the group inside of your replacement.
gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
# the parentheses around the character class and the \\1 in the replacement are the changes
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Lookarounds:
You can switch on PCRE by using perl=T and use lookarounds to achieve this.
gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=T)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
The negated Unicode property \pP class matches any character except any kind of punctuation character.
Instead of using a capturing group, I used \K here. This escape sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence. As well, I used a Positive Lookahead to assert that a carriage return, newline sequence and a literal asterisk character follows.
There are several ways to do it:
Capture group:
gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)
Positive lookbehind (a zero-width assertion, so the preceding character is checked but not consumed):
gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=T)
EDIT: fixed the backslashes and removed the uncertainty about R support for these.
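To check that these approaches agree, here is a short sketch that runs the capture group, \K and lookbehind variants on the sample data from the question. The variable names are chosen for this illustration only.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# 1. Capture the non-punctuation character and put it back with \\1.
capture <- gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

# 2. \K drops the already-matched character from the reported match,
#    and the lookahead leaves the line break untouched.
keep_out <- gsub("[^[:punct:]]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)

# 3. The lookbehind checks the preceding character without consuming it.
lookbehind <- gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl = TRUE)

print(capture)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
```

Note that all three only add a period before a line break that is followed by an asterisk, so "ana mu" is left unchanged, exactly as in the outputs shown above.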
I'm working on a choropleth in R and need to be able to match state names with match.map(). The dataset I'm using sticks multi-word names together, like NorthDakota and DistrictOfColumbia.
How can I use regular expressions to insert a space between lower-upper letter sequences? I've successfully added a space but haven't been able to preserve the letters that indicate where the space goes.
places = c("NorthDakota", "DistrictOfColumbia")
gsub("[[:lower:]][[:upper:]]", " ", places)
[1] "Nort akota" "Distric olumbia"
Use parentheses to capture the matched expressions, then \N (written \\N in an R string, where N is the group number) to retrieve them:
places = c("NorthDakota", "DistrictOfColumbia")
gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", places)
## [1] "North Dakota" "District Of Columbia"
You want to use capturing groups to capture the matched context so you can refer back to each matched group in your replacement call. To access a group, write two backslashes \\ followed by the group number.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('([[:lower:]])([[:upper:]])', '\\1 \\2', places)
# [1] "North Dakota" "District Of Columbia"
Another way, switch on PCRE by using perl=T and use lookaround assertions.
> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('[a-z]\\K(?=[A-Z])', ' ', places, perl=T)
# [1] "North Dakota" "District Of Columbia"
Explanation:
The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included. Basically, it throws away everything that has been matched up to that point.
[a-z] # any character of: 'a' to 'z'
\K # '\K' (resets the starting point of the reported match)
(?= # look ahead to see if there is:
[A-Z] # any character of: 'A' to 'Z'
) # end of look-ahead
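Both ideas can be wrapped in small helpers, sketched below. The function names are made up for this illustration, and the second variant uses a lookbehind plus lookahead pair (a zero-width match) instead of \K, which is an alternative rather than what the answers above use.

```r
# Capture-group version: put both letters back with a space between them.
split_camel <- function(x) {
  gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", x)
}

# Lookaround version: match the empty position between a lower-case and
# an upper-case letter and insert a space there (needs perl = TRUE).
split_camel_pcre <- function(x) {
  gsub("(?<=[a-z])(?=[A-Z])", " ", x, perl = TRUE)
}

places <- c("NorthDakota", "DistrictOfColumbia", "Ohio")
print(split_camel(places))
# [1] "North Dakota"         "District Of Columbia" "Ohio"
```

Single-word names such as "Ohio" pass through unchanged, so the helper can be applied to the whole column before calling match.map().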
Unless I am missing something, this regex seems pretty straightforward:
grepl("Processor\.[0-9]+\..*Processor\.Time", names(web02))
However, it doesn't like the escaped periods, \. for which my intent is to be a literal period:
Error: '\.' is an unrecognized escape in character string starting "Processor\."
What am I misunderstanding about this regex syntax?
My R-Fu is weak to the point of being non-existent but I think I know what's up.
The string handling part of the R processor has to peek inside the strings to convert \n and related escape sequences into their character equivalents. R doesn't know what \. means so it complains. You want to get the escaped dot down into the regex engine so you need to get a single \ past the string mangler. The usual way of doing that sort of thing is to escape the escape:
grepl("Processor\\.[0-9]+\\..*Processor\\.Time", names(web02))
Embedding one language (regular expressions) inside another language (R) is usually a bit messy and more so when both languages use the same escaping syntax.
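A quick sketch of the two layers of escaping, using made-up column names in place of names(web02). It also shows what goes wrong if the dots are left unescaped: the call runs, but the pattern matches more than intended.

```r
# Hypothetical column names standing in for names(web02)
cols <- c("Processor.0.Pct.Processor.Time",
          "Memory.Available.MBytes",
          "ProcessorX12YProcessorZTime")

# "\\." in R source is two characters, \ and ., once the string is parsed.
two_chars <- nchar("\\.")

# Escaped dots: only a literal period matches.
escaped <- grepl("Processor\\.[0-9]+\\..*Processor\\.Time", cols)

# Same meaning without backslashes: a dot inside brackets is literal.
bracketed <- grepl("Processor[.][0-9]+[.].*Processor[.]Time", cols)

# Unescaped dots match any character, so the third name slips through.
unescaped <- grepl("Processor.[0-9]+..*Processor.Time", cols)

print(escaped)    # TRUE FALSE FALSE
print(unescaped)  # TRUE FALSE TRUE
```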
Instead of
\.
Try
\\.
You need to escape the backslash first.
Another way of doing this is to put the period inside a bracket expression, where it is treated literally, for example:
grepl("[.]", ".")
# [1] TRUE
grepl("[.]", "a")
# [1] FALSE
From the docs (?regex):
The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.