regex remove punct removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that
gsub("[[:punct:]]", "", txt)
actually removes a relevant character. The character is "ק" and it is located in the "E" spot on the keyboard. Interestingly, the gsub function in R removes the "ק" character and then all words get messed up. Does anyone have an idea why?

According to Regular Expressions as used in R:
Certain named classes of characters are predefined. Their
interpretation depends on the locale (see locales); the interpretation
below is that of the POSIX locale.
According to the POSIX locale, [[:punct:]] should match ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. In other locales it may match additional characters, so you might need to adjust your regex to remove only the characters you want:
txt <- "!\"#$%&'()*+,\\-./:;<=>?@[\\\\^\\]_`{|}~"
gsub("[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]", "", txt, perl = TRUE)
Sample program output:
[1] ""
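A minimal sketch of the same idea, with the ASCII punctuation written as code-point ranges instead of an enumerated list. The sample string and the name ascii_punct are made up for illustration, and the Hebrew letters are written as \u escapes so the snippet does not depend on the file encoding:

```r
# Hypothetical sample: two Hebrew words (both contain U+05E7, qof) plus ASCII punctuation
txt <- "\u05e9\u05e7\u05dc, \u05d1\u05d5\u05e7\u05e8!"

# ASCII punctuation only, as four ranges: ! to /, : to @, [ to `, { to ~
ascii_punct <- "[!-/:-@\\[-`{-~]"

# perl = TRUE so the class is interpreted by PCRE rather than by the locale
cleaned <- gsub(ascii_punct, "", txt, perl = TRUE)
cleaned
```

Because the class lists only ASCII code points, no Hebrew letter can fall inside it, whatever the locale says about [[:punct:]].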

Related

UTF-8: Create character (string) by char code number

How can I create a UTF-8 string like "\u0531" in R, but taking the code "0531" as a variable?
I have a bad string (consisting of "UTF-8 codes in tags"), which I would like to dynamically turn into a good string (proper UTF-8 string).
badString <- "<U+0531><U+0067>"
goodString <- "Աg" # how can I generate that by a function?
turnBadStringToGoodString<- function (myString){
newString <- gsub("<U\\+([0-9]{4})>","\\u\\1",myString)
newString2 <- parse(text = paste0("'", newString, "'"))[[1]]
return (
newString2
)
}
turnBadStringToGoodString ( badString )
# returns an expression. What to do next?
Please note that the desired outcome can be achieved by manually typing
"\u0531\u0067"
But how can that be done with a function? Thank you for ideas.
Also related: Converting a \u escaped Unicode string to ASCII
I would suggest using gsubfn with a regex that captures the digits and returns only the converted Unicode symbols:
library(gsubfn)
badString <- "<U+0531><U+0067>"
turnBadStringToGoodString<- function (myString){
return (
gsubfn("<U\\+(\\d{4})>", ~ parse(text = paste0("'", paste0("\\u",x), "'"))[[1]],myString)
)
}
turnBadStringToGoodString(badString)
[1] "Աg"
A bit of explanation:
<U\\+(\\d{4})> matches <, U, +, then captures four digits into Group 1, and then matches >
The value in Group 1 is passed to the callback function (with ~, we refer to it as x inside), which performs the conversion.
gsubfn handles all non-overlapping matches in the input string.
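As an alternative that avoids parse(), here is a base R sketch that converts the captured hex digits with strtoi() and intToUtf8(). The function name decode_u_tags is hypothetical, and the pattern is widened to [0-9A-Fa-f] on the assumption that real tags may contain hex letters, which \\d would miss:

```r
# Hypothetical helper: replace each <U+XXXX> tag with the character it names
decode_u_tags <- function(s) {
  # Locate every tag in every element of s
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tags) {
    if (length(tags) == 0) return(character(0))
    hex <- substr(tags, 4, 7)                     # the four hex digits
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)
  })
  s
}

good <- decode_u_tags("<U+0531><U+0067>")
good
```

Text outside the tags is left untouched, so a mixed string such as "a<U+0067>" comes back as "ag".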

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
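It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x), which allows whitespace inside patterns for readability. Without that flag, the spaces around | are matched literally, so nothing matches. In R, you can enable it with the (?x) inline modifier together with the perl=T option: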
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove everything from the first hyphen or dot up to the end of the string, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* will match all characters up to the end of the string ($).
For example:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
The following finds .ST or -XST at the end of each string and replaces it with an empty string (effectively removing that part). Remember that gsub returns the modified string rather than modifying it in place, so you won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
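To make the difference between the three patterns concrete, here is a small sketch on a shortened, made-up subset of the row names (the variable names are illustrative only):

```r
d <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST")

# Spaces around | are literal characters, so neither alternative can match
with_spaces <- gsub("[.](.*)$ | [-](.*)$", "", d)

# Same alternation without the spaces: cut from the first . or - to the end
from_first_sep <- gsub("[.].*$|[-].*$", "", d)

# Anchored suffix removal: only a trailing .ST or -XST is dropped
suffix_only <- gsub("(\\.ST|-XST)$", "", d)

with_spaces      # unchanged input
from_first_sep   # "AAK" "ALIV" "ATCO"
suffix_only      # "AAK" "ALIV-SDB" "ATCO-A"
```

Which of the last two is right depends on whether the part after the hyphen (for example the share class in ATCO-A) should be kept.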

Remove trailing and leading spaces and extra internal whitespace with one gsub call

I know you can remove trailing and leading spaces with
gsub("^\\s+|\\s+$", "", x)
And you can remove internal spaces with
gsub("\\s+"," ",x)
I can combine these into one function, but I was wondering if there is a way to do it with just one use of the gsub function.
trim <- function (x) {
  x <- gsub("^\\s+|\\s+$", "", x)
  gsub("\\s+", " ", x)
}
testString<- " This is a test. "
trim(testString)
Here is an option:
gsub("^ +| +$|( ) +", "\\1", testString) # with Frank's input, and Agstudy's style
We use a capturing group to make sure that multiple internal spaces are replaced by a single space. Change " " to \\s if you expect non-space whitespace you want to remove.
Using a positive lookbehind:
gsub("^ *|(?<= ) | *$",'',testString,perl=TRUE)
# "This is a test."
Explanation:
## "^ *" matches any leading space
## "(?<= ) " The general form is (?<=a)b :
## matches a "b"( a space here)
## that is preceded by "a" (another space here)
## " *$" matches trailing spaces
You can just add \\s+(?=\\s) to your original regex:
gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl=T)
You've asked for a gsub option and gotten good options. There's also rm_white_multiple from "qdapRegex":
> testString<- " This is a test. "
> library(qdapRegex)
> rm_white_multiple(testString)
[1] "This is a test."
If an answer not using gsub is acceptable then the following does it. It does not use any regular expressions:
paste(scan(textConnection(testString), what = "", quiet = TRUE), collapse = " ")
giving:
[1] "This is a test."
You can also use nested gsub calls, though this is less elegant than the previous answers:
> gsub("\\s+"," ",gsub("^\\s+|\\s+$","",testString))
[1] "This is a test."
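As a quick check that a single-call pattern agrees with the two-step version, here is a sketch on a made-up input that has leading, trailing, and repeated internal whitespace (including a tab):

```r
# Hypothetical test input
x <- "  This   is a\t test.  "

# Two-step reference: trim the ends, then collapse internal runs
two_step <- gsub("\\s+", " ", gsub("^\\s+|\\s+$", "", x))

# Single call: the lookahead removes every whitespace character that is
# followed by another one, so only the last character of each run survives
one_call <- gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x, perl = TRUE)

two_step
one_call
```

Note that the single-call version keeps the last whitespace character of each run as it is, so a run ending in a tab would leave a tab, whereas the two-step version always leaves a space.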

strsplit inconsistent with gregexpr

A comment on my answer to this question points out that the strsplit call, which should give the desired result, does not, even though the regex seems to correctly match only the first and last commas in a character vector. This can be verified using gregexpr and regmatches.
So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?
# We would like to split on the first comma and
# the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"
# Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34" "56" "78" "90"
# Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )
# Matching positions are at
unlist(m)
[1] 4 13
# And extracting them...
regmatches( x , m )
[[1]]
[1] "," ","
Huh?! What is going on?
The theory of @Aprillion is correct. From the R documentation:
The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
In other words, at each iteration ^ will match the beginning of a new string (without the preceding items).
To simply illustrate this behavior:
> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (thanks to @JoshO'Brien for the link).
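Since gregexpr evaluates the pattern against the whole string in one pass, one way to get the intended split is to reuse its match positions with regmatches(..., invert = TRUE). This is a sketch of a workaround, not something proposed in the answers above:

```r
x <- "123,34,56,78,90"

# Same pattern as in the question: first comma and last comma only
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE returns the pieces between the matches instead of the matches
parts <- regmatches(x, m, invert = TRUE)[[1]]
parts
```

Unlike strsplit, nothing is cut off the front of the string between matches, so ^ and the lookahead are only ever tested against the original input.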

Convert punctuation to space

I have a bunch of strings with punctuation in them that I'd like to convert to spaces:
"This is a string. In addition, this is a string (with one more)."
would become:
"This is a string In addition this is a string with one more "
I can go through and do this manually with the stringr package (str_replace_all()) one punctuation symbol at a time (, / . / ! / ( / ) / etc.), but I'm curious if there's a faster way, I'd assume using regexes.
Any suggestions?
x <- "This is a string. In addition, this is a string (with one more)."
gsub("[[:punct:]]", " ", x)
[1] "This is a string In addition this is a string with one more "
See ?gsub for doing quick substitutions like this, and ?regex for details on the [[:punct:]] class, i.e.
‘[:punct:]’ Punctuation characters:
‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.
Have a look at ?regex. With stringr:
library(stringr)
str_replace_all(x, '[[:punct:]]',' ')
"This is a string In addition this is a string with one more "
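Replacing each punctuation character individually leaves a double space wherever punctuation sits next to a space (for example after ". "). If single spaces are wanted, one option is to treat any run of punctuation and whitespace as a single separator. A minimal base R sketch (the variable name out is illustrative):

```r
x <- "This is a string. In addition, this is a string (with one more)."

# Any run of punctuation and/or whitespace becomes exactly one space
out <- gsub("[[:punct:][:space:]]+", " ", x)
out
```

The final ")." still becomes one trailing space; wrap the call in trimws() if that should go as well.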