UTF-8: Create character (string) by char code number

How can I create a UTF-8 string like "\u0531" in R, but taking the code "0531" as a variable?
I have a bad string (consisting of "UTF-8 codes in tags"), which I would like to dynamically turn into a good string (proper UTF-8 string).
badString <- "<U+0531><U+0067>"
goodString <- "Աg" # how can I generate that by a function?
turnBadStringToGoodString<- function (myString){
newString <- gsub("<U\\+([0-9]{4})>","\\u\\1",myString)
newString2 <- parse(text = paste0("'", newString, "'"))[[1]]
return (
newString2
)
}
turnBadStringToGoodString ( badString )
# returns an expression. What to do next?
Please note that the desired outcome can be achieved by manually typing
"\u0531\u0067"
But how can that be done with a function? Thank you for ideas.
Also related: Converting a \u escaped Unicode string to ASCII

I would suggest using gsubfn with a regex that captures the digits and returns only the converted Unicode symbols:
library(gsubfn)
badString <- "<U+0531><U+0067>"
turnBadStringToGoodString<- function (myString){
return (
gsubfn("<U\\+(\\d{4})>", ~ parse(text = paste0("'", paste0("\\u",x), "'"))[[1]],myString)
)
}
turnBadStringToGoodString(badString)
[1] "Աg"
A bit of explanation:
<U\\+(\\d{4})> matches <, U and +, then captures four digits into Group 1, and finally matches >
The value in Group 1 is passed to the callback function (with ~, we refer to it as x inside), and the conversion is performed inside the callback.
gsubfn handles all non-overlapping matches in the input string.
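If you would rather not call parse() on generated code, the same conversion can be sketched in base R with strtoi() and intToUtf8(). This is only a sketch, not the method used above: the helper name decode_u_tags is hypothetical, and the pattern is widened to hexadecimal digits on the assumption that tags such as <U+00E9> may also occur (\\d{4} would not match those).

```r
# Hypothetical helper: replace every "<U+XXXX>" tag with the character it encodes.
decode_u_tags <- function(x) {
  # Assumption: code points are written as 4 to 6 hexadecimal digits.
  pattern <- "<U\\+([0-9A-Fa-f]{4,6})>"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tags) {
    hex <- sub(pattern, "\\1", tags, perl = TRUE)  # keep only the hex digits
    codes <- strtoi(hex, base = 16L)               # e.g. "0531" becomes 1329
    vapply(codes, intToUtf8, character(1))         # code point to one-character string
  })
  x
}

badString <- "<U+0531><U+0067>"
goodString <- decode_u_tags(badString)
goodString
```

In a UTF-8 capable session this should print the same string as typing "\u0531\u0067" by hand. Text outside the tags is left untouched.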

Related

Regex - change commas only in a portion of a string

I make a lot of changes to an original CSV string that contains many comma delimiters. I have to replace with ";" either only the commas inside the expression ||....|| or only the commas outside it, so that the delimiter inside ||....|| differs from the delimiter in the rest of the string.
Example:
(.*)(?:\|\|)(?:.*)(,)(?:.*)\|\|
After I use
var regex = /myregex/g;
var str = str.replace(regex, ',')
thanks

You can use
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/\|{2}[\w\W]*?\|{2}/g, (x) => x.replace(/,/g, ';')) );
The regex is
/\|{2}.*?\|{2}/gs // matches any text between two double pipes
/\|{2}[\w\W]*?\|{2}/g // matches any text between two double pipes
/\|{2}.*?\|{2}/g // matches any text but line breaks between two double pipes
Note the . does not match line breaks without the s modifier flag.
The regex matches double pipe, then any zero or more chars, as few as possible up to the next double pipe.
Then, x, the whole match value, is passed as an argument to the anonymous callback function used as a replacement argument, and all commas are replaced with ; only inside the matches.
The "contrary" solution is to match and capture the strings between double pipes and only match commas in all other contexts so that you could keep the captures and replace those commas:
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/(\|{2}[\w\W]*?\|{2})|,/g, (x,y) => y || ';') );

Big Thanks.
I also found
var newStr = str.replace(/\|{2}.*?\|{2}/g, function(match) {
return match.replace(/,/g,";");
});
Do you think it is possible to do the contrary and change all the commas outside the occurrences of ||...||?

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide the metacharacter expression in my gsub() call into alternatives, but the combined pattern does not match anything.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"

It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. In R, you can enable it with the inline (?x) modifier together with the perl=TRUE option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* will match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
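To see why the original pattern matched nothing, here is a small sketch comparing the alternation with and without the spaces around |. It assumes the default regex engine (no perl=TRUE, no (?x)), where a space in the pattern is an ordinary character that must appear in the input. The vector d is a shortened copy of the sample data.

```r
d <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST")

# Spaces are literal here: " [-]" needs a space before the hyphen, and
# nothing can follow the end-of-string anchor in "$ ", so neither branch matches.
with_spaces <- gsub("[.](.*)$ | [-](.*)$", "", d)
with_spaces
# [1] "AAK.ST"      "ALIV-SDB.ST" "ATCO-A.ST"

# The same alternation without the spaces matches from the first . or -.
without_spaces <- gsub("[.](.*)$|[-](.*)$", "", d)
without_spaces
# [1] "AAK"  "ALIV" "ATCO"
```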

This will find .ST or -XST at the end of the text and replace it with an empty string (effectively removing that part). Don't forget that gsub returns the modified string rather than modifying it in place, so you won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
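The difference between the two readings of the task can be sketched side by side. The input below is a shortened, illustrative subset of the vector above.

```r
strings <- c("AAK.ST", "ALIV-SDB.ST", "AAC-XST", "AAD-XSTV")

# Remove only a trailing ".ST" or "-XST".
suffix_only <- gsub("(\\.ST|-XST)$", "", strings)
suffix_only
# [1] "AAK"      "ALIV-SDB" "AAC"      "AAD-XSTV"

# Remove everything from the first "." or "-" to the end.
from_first_separator <- gsub("[.-].*$", "", strings)
from_first_separator
# [1] "AAK"  "ALIV" "AAC"  "AAD"
```

Which one is right depends on whether tickers such as ALIV-SDB should keep their share-class part.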

R: gsub of exact full string with fixed = T

I am trying to gsub an exact FULL string, so I know I need to use ^ and $. The problem is that I have special characters in the strings (could be [, or .), so I need to use fixed=T, and this overrides the ^ and $. Any solution is appreciated.
I need to replace the 1st and 2nd elements of exact_orig with the 1st and 2nd elements of exact_change, but only if the full string is matched from beginning to end.
exact_orig = c("oz","32 oz")
exact_change = c("20 oz","32 ct")
gsub_FixedTrue <- function(i) {
for(k in seq_along(exact_orig)) i = gsub(exact_orig[k],exact_change[k],i,fixed=TRUE)
return(i)
}
Test cases:
print(gsub_FixedTrue("32 oz")) #gives me "32 20 oz" - wrong! Must be "32 ct"
print(gsub_FixedTrue("oz oz")) # gives me "20 oz 20 oz" - wrong! Must remain as "oz oz"
I read a somewhat similar thread, but could not make it work for full string (grep at the beginning of the string with fixed =T in R?)

If you want to exactly match full strings, I don't think you really want to use regular expressions in this case. How about just using the match() function?
fixedTrue<-function(x) {
m <- match(x, exact_orig)
x[!is.na(m)] <- exact_change[m[!is.na(m)]]
x
}
fixedTrue(c("32 oz","oz oz"))
# [1] "32 ct" "oz oz"
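If anchors are still wanted, one option is PCRE's \Q...\E quoting with perl=TRUE, which treats the enclosed text literally while ^ and $ keep their usual meaning. This is a minimal sketch under two assumptions: the search strings never contain the sequence \E, and the replacements contain no backslashes. The function name replace_exact is hypothetical.

```r
exact_orig <- c("oz", "32 oz")
exact_change <- c("20 oz", "32 ct")

# Hypothetical helper: replace a string only when the whole string equals
# one of the entries in exact_orig.
replace_exact <- function(x) {
  for (k in seq_along(exact_orig)) {
    # \Q...\E quotes the text literally; ^ and $ still anchor the match.
    pattern <- paste0("^\\Q", exact_orig[k], "\\E$")
    x <- sub(pattern, exact_change[k], x, perl = TRUE)
  }
  x
}

replace_exact(c("32 oz", "oz oz", "oz"))
# [1] "32 ct" "oz oz" "20 oz"
```

Because the replacements are applied one after another, an earlier replacement could be picked up by a later pattern if its result happens to equal a later entry of exact_orig. The match() approach above avoids that.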

Split string on [:punct:] except for underscore in R

I have an equation as a string where the variables in the string equation are variables in the R workspace. I would like to replace each variable with its numeric value in the R workspace. This is easy enough when the variable names don't contain punctuation.
Here is a simple example.
x <- 5
y <- 10
yy <- 15
z <- x*(y + yy)
zAsChar <- "z=x*(y+yy)"
vars <- unlist(strsplit(zAsChar, "[[:punct:]]"))
notVars <- unlist(strsplit(zAsChar, "[^[:punct:]]"))
varsValues <- sapply(vars[vars != ""], FUN=function(aaa) get(aaa))
notVarsValues <- notVars[notVars != ""]
paste(paste0(varsValues, notVarsValues), collapse="")
This yields "125=5*(10+15)", which is great.
However, I would love the option to use underscores in the variable names so that I can use "subscripts" for variable names. I am using these strings in math mode in R markdown.
So I need a [:punct:] that excludes _. I tried using [\\+\\-\\*\\/\\(\\)\\=] rather than [:punct:], but with this approach I couldn't split on the minus sign. Is there a way to preserve the _?

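Instead of [:punct:], use the Unicode character class \pP (shortcut for \p{P}) and its negation \PP:
[^\\PP_]
(This works with the perl=TRUE option. Note that \pP covers only Unicode punctuation; characters such as + and = belong to the symbol category \pS and are not matched by it.)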
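Another way to keep the underscore, sketched here as an alternative rather than as the approach above, is a negative lookahead in front of [[:punct:]] with perl=TRUE. Unlike \pP, the POSIX class also covers ASCII symbols such as + and =. The variable names in the sample string are made up for illustration.

```r
zAsChar <- "z_1=x_a*(y+yy)"

# (?!_) rejects a match when the next character is an underscore,
# so "_" stays inside the variable names.
vars <- strsplit(zAsChar, "(?!_)[[:punct:]]", perl = TRUE)[[1]]
vars <- vars[vars != ""]  # drop the empty piece between "*" and "("
vars
# [1] "z_1" "x_a" "y"   "yy"
```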

Are you sure you need to do all this string manipulation? The substitute() function can help you out:
substitute(z==x*(y+yy), list(x=x, y=y, yy=yy,z=z))
Or if you really need to start with a character value
do.call("substitute", list(parse(text=zAsChar)[[1]],list(x=x, y=y, yy=yy,z=z)))
# 125 = 5 * (10 + 15)
You can deparse() the result to turn it back into a character.
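As a sketch of the full round trip, the list of values can be built from the expression itself with all.vars() and mget(), so variable names containing underscores need no special handling. The names z_1 and x_a are invented for this example, and the code assumes every variable in the string exists in the calling environment.

```r
x_a <- 5
y <- 10
yy <- 15
z_1 <- x_a * (y + yy)
zAsChar <- "z_1=x_a*(y+yy)"

expr <- parse(text = zAsChar)[[1]]
# Look up each variable named in the expression.
values <- mget(all.vars(expr), envir = environment(), inherits = TRUE)
filled <- do.call("substitute", list(expr, values))
result <- deparse(filled)
result
# [1] "125 = 5 * (10 + 15)"
```

deparse() inserts spaces around operators. If the compact form is needed, gsub(" ", "", result) removes them.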

regex remove punct removes non-punctuation characters in R

While filtering and cleaning text in Hebrew, I found that
gsub("[[:punct:]]", "", txt)
actually removes a relevant character. The character is "ק" and it is located in the "E" spot on the keyboard. Interestingly, the gsub function in R removes the "ק" character and then all words get messed up. Does anyone have an idea why?

According to Regular Expressions as used in R:
Certain named classes of characters are predefined. Their
interpretation depends on the locale (see locales); the interpretation
below is that of the POSIX locale.
According to the POSIX locale, [[:punct:]] should capture ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. So, you might need to adjust your regex to remove only the characters you want:
txt <- "!\"#$%&'()*+,\\-./:;<=>?@[\\\\^\\]_`{|}~"
gsub("[\\\\!\"#$%&'()*+,./:;<=>?@[\\^\\]_`{|}~-]", "", txt, perl = T)
Sample program output:
[1] ""
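Applied to Hebrew text, the same idea can be sketched with explicit ASCII ranges, so that no locale-dependent class is involved. This assumes the unwanted removal comes from the locale's interpretation of [[:punct:]]. The sample text is made up and written with \u escapes so it does not depend on the file encoding.

```r
# Sample Hebrew text with a comma and an exclamation mark.
txt <- "\u05e7\u05d5\u05e8\u05d0, \u05e9\u05dc\u05d5\u05dd!"

# The four ASCII punctuation ranges: ! to /, : to @, [ to `, { to ~.
cleaned <- gsub("[!-/:-@\\[-`{-~]", "", txt, perl = TRUE)
cleaned
```

Only ASCII punctuation is removed here, so the letter ק survives. Non-ASCII punctuation such as the Hebrew maqaf would have to be added to the class explicitly.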
