Regexp_replace PART of unicode (emoji) in Scala-Spark dataframe

I am trying to use Spark's regexp_replace to find every emoji whose Unicode escape starts with
\uD83D and replace just that leading \uD83D with " \uD83D" (the same surrogate with a space in front), but I've had no luck.
Here is an example:
I want to take all instances of "😂" (which in Scala is \uD83D\uDE02) and replace those with " 😂". That's easy enough with one emoji and works with this code:
.select(functions.regexp_replace($"text2", "[(\uD83D\uDE02)]", " \uD83D\uDE02").as("split2"))
With the above code, if I have a string like this "😂😂😂😂" in the text2 column, it will turn it into " 😂 😂 😂 😂", which I can then easily split by space.
I want to apply this to ALL emojis that start with \uD83D, so I assumed the following would work, but it doesn't:
.select(functions.regexp_replace($"text2", "[(\uD83D)]", " \uD83D").as("split2"))
This does not affect the data in any way. Even the following leaves the data untouched, with or without the parentheses and/or brackets:
.select(functions.regexp_replace($"text2", "[(u)]", " \uD83D").as("split2"))
If I could replace just the first six characters of those escape sequences (the \uD83D), then "😱😂😂😱" in the text2 column would become " 😱 😂 😂 😱", which is exactly what I need.
Thanks for your help!

You can use the regex \\B\uD83D.{1} and replace each match with the captured group $1 followed by a space, then trim to drop the trailing space and split. (Your bracketed attempts most likely fail because a character class in java.util.regex is matched against whole code points, so a lone high surrogate such as \uD83D inside [...] can never match half of a surrogate pair, whereas a bare \uD83D literal can.)
import org.apache.spark.sql.functions.{regexp_replace, split, trim}
import spark.implicits._ // in spark-shell; provides toDF and the $"..." column syntax

val df = Seq(
  "😂😂😂😂",
  "😱😂😂😱"
).toDF("text2")

df.select(
  split(
    trim(regexp_replace($"text2", "(\\B\uD83D.{1})", "$1 ")),
    " "
  ).as("split2")
).show
//+-----------------+
//| split2 |
//+-----------------+
//|[😂, 😂, 😂, 😂]|
//|[😱, 😂, 😂, 😱]|
//+-----------------+
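To sanity-check the regex outside Spark: regexp_replace compiles its pattern with Java's regex engine, so a plain-Scala sketch (with illustrative variable names) should behave the same:
// Plain java.util.regex, no Spark: "\uD83D.{1}" matches the \uD83D high surrogate
// plus the char after it (the low surrogate), i.e. one whole emoji from that block.
val text = "\uD83D\uDE31\uD83D\uDE02\uD83D\uDE02\uD83D\uDE31" // 😱😂😂😱
val spaced = text.replaceAll("(\\B\uD83D.{1})", "$1 ").trim
val parts = spaced.split(" ")
// parts: Array(😱, 😂, 😂, 😱)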

Related

XSLT fn:tokenize ignore leading and trailing spaces

A simple date string needs to be tokenized. I'm using this sample XSLT code:
fn:tokenize(date, '[ .\s]+')
All variants of a bad date format (e.g. "10.10.2020", "10. 10 .2020", "10 . 10. 2020") are tokenized fine using the function above, except when a leading space is present (e.g. " 10.10.2020"); the first element is then tokenized as a blank " " string.
Is there an option to ignore leading spaces as well, so that no matter how bad the format is, only the delimiter "." starts another token and all spaces are stripped?
The right solution seems to be:
fn:tokenize(normalize-space(date), '[ .\s]+')
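For readers outside XSLT, a rough Scala sketch of the same normalize-then-tokenize idea (only an analogy: normalize-space also collapses internal whitespace, which plain trim does not):
// Trim first (the rough analogue of normalize-space), then split on runs of
// dots and whitespace, so a leading space can no longer produce a blank token.
val dates = Seq("10.10.2020", "10. 10 .2020", " 10.10.2020")
val tokens = dates.map(_.trim.split("[ .\\s]+").toList)
// every element: List(10, 10, 2020)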

Split string and also get delimiter in array

I want to split a string like this:
"Street:§§§§__inboundRow['Adress']['Street']__§§§§ und Postal: §§§§__inboundRow['Adress']['Postal']__§§§§ super"
My code in Groovy:
def parts = ret.split(/§§§§__inboundRow.+?__§§§§/)
So the array I get is
["Street:", " und Postal: ", " super"]
But what I want is:
["Street:", "§§§§__inboundRow['Adress']['Street']__§§§§", " und Postal: ", "§§§§__inboundRow['Adress']['Postal']__§§§§", " super"]
How do I achieve this?
Try splitting on a positive lookahead. Since you want to retain the delimiters you use to split, a lookaround is probably the way to go here.
var str = "String1:§§§§__inboundRow[Test]__§§§§andString2:§§§§__inboundRow[Test1]__§§§§";
console.log(str.split(/(?=§§§§__inboundRow\[Test\d*\]__§§§§|and)/));
I don't know which exact language you are using, but this should work anywhere you can split using a regex with lookahead (JavaScript certainly supports it). The pattern used to split was:
(?=§§§§__inboundRow\[Test\d*\]__§§§§|and)
This says to split when we can assert that what follows is either the text §§§§__inboundRow[Test\d*]__§§§§ or and.
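Since the question itself is Groovy, which runs on the same java.util.regex engine as Scala, here is a hedged Scala sketch adapted to the question's actual sample string: splitting both before each placeholder (lookahead) and after it (lookbehind) keeps the delimiters as their own elements. The pattern is an assumption derived from the sample data, not from the [Test] example above:
// Split before each placeholder (lookahead) and after it (lookbehind), so the
// placeholder itself survives as an array element instead of being consumed.
val ret = "Street:§§§§__inboundRow['Adress']['Street']__§§§§ und Postal: " +
  "§§§§__inboundRow['Adress']['Postal']__§§§§ super"
val parts = ret.split("(?=§§§§__inboundRow)|(?<=__§§§§)")
// parts: Array(Street:, §§§§__inboundRow['Adress']['Street']__§§§§,
//              " und Postal: ", §§§§__inboundRow['Adress']['Postal']__§§§§, " super")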

Dealing with Spaces and NA's when Uniting Multiple Columns with Tidyr

So using the simple dataframe below, I want to create a new column that has all the days for each person, separated by a semi-colon.
For example, using Doug, it should look like - Monday; Wednesday; Friday
I would like to use tidyr's unite function for this, but when I use it I get Monday;;Wednesday;;Friday because of the NAs (which could also be blank spaces). Sometimes there are semicolons at the beginning and end as well. So I'm hoping there's a way to keep using unite, enhanced with a regular expression, so that I end up with each day of the week separated by a single semicolon and no semicolons at the beginning or end.
I would also like to stick with tidyr, dplyr, stringr, etc.
library(dplyr) # for %>%
library(tidyr) # for unite

Names<-c("Doug","Ken","Erin","Yuki","John")
Monday<-c("Monday"," "," ","Monday","Monday")
Tuesday<-c(" ","Tuesday","Tuesday"," ","Tuesday")
Wednesday<-c(" ","Wednesday","Wednesday","Wednesday"," ")
Thursday<-c(" "," "," "," ","Thursday")
Friday<-c(" "," "," "," ","Friday")
Days<-data.frame(Monday,Tuesday,Wednesday,Thursday,Friday)
Days<-Days%>%unite(BestDays,Monday,Tuesday,Wednesday,Thursday,Friday,sep="; ",remove=FALSE)
You can try:
Names<-c("Doug","Ken","Erin","Yuki","John")
Monday<-c("Monday",NA,NA,"Monday","Monday")
Tuesday<-c(NA,"Tuesday","Tuesday",NA,"Tuesday")
Wednesday<-c(NA,"Wednesday","Wednesday","Wednesday",NA)
Thursday<-c(NA,NA,NA,NA,"Thursday")
Friday<-c(NA,NA,NA,NA,"Friday")
Days<-data.frame(Monday,Tuesday,Wednesday,Thursday,Friday)
concat_str = function(str) str %>% na.omit %>% paste(collapse = "; ") # %>% needs magrittr (or dplyr) loaded
Days$BestDaysConcat = apply(Days[,c("Monday","Tuesday","Wednesday","Thursday","Friday")], 1, concat_str)
From getAnywhere("unite_.data.frame"), unite calls do.call("paste", c(data[from], list(sep = sep))) under the hood, and as far as I know paste doesn't provide a way to omit NAs unless you implement it manually.
Nevertheless, you can use a regular-expression approach with gsub from base R to clean up the united column:
gsub("^\\s;\\s|;\\s{2}", "", Days$BestDays)
# [1] "Monday" "Tuesday; Wednesday"
# [3] "Tuesday; Wednesday" "Monday; Wednesday"
# [5] "Monday; Tuesday; Thursday; Friday"
This removes either the ^\\s;\\s pattern or the ;\\s{2} pattern: the former handles a string that starts with a blank (we drop the blank and the ;\\s that follows it), while the latter handles the ;\\s{2} runs that blanks leave both in the middle and at the end of the string.
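The omit-then-join idea is not R-specific; here is a minimal cross-language sketch in Scala (variable names made up for illustration), with Option[String] standing in for NA:
// Omitting missing values before joining means there is nothing to clean up
// afterwards, which is exactly what na.omit + paste(collapse) does above.
val days: Seq[Option[String]] =
  Seq(Some("Monday"), None, Some("Wednesday"), None, Some("Friday"))
val best = days.flatten.mkString("; ")
// best: "Monday; Wednesday; Friday"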

Splitting a string by space except when contained within quotes

I've been trying to split a space-delimited string with double quotes in R for some time but without success. An example of a string is as follows:
rainfall snowfall "Channel storage" "Rivulet storage"
It's important for us because these are column headings that must match the subsequent data. There are other suggestions on this site as to how to go about this but they don't seem to work with R. One example:
Regex for splitting a string using space when not surrounded by single or double quotes
Here is some code I've been trying:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"']+|\"([^\"]*)\""
split <- strsplit(str, regex, perl=T)
What I would like is
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
but what I get is:
[1] "" " " " " " "
The vector is the right length (which is encouraging) but of course the strings are empty or contain a single space. Any suggestions?
Thanks in advance!
scan will do this for you
scan(text=str, what='character', quiet=TRUE)
[1] "rainfall" "snowfall" "Channel storage" "Rivulet storage"
As mplourde said, use scan. That's by far the cleanest solution (unless you want to keep the \", that is...).
If you want to use regexes for this (or for something not solved that easily by scan), you are still looking at it the wrong way: your regex matches exactly what you want to keep, so using it as the split pattern in strsplit cuts out everything you want.
In these scenarios you should look at the function gregexpr, which returns the starting positions of your matches and adds the lengths of the matches as an attribute. Its result can be passed to the function regmatches(), like this:
str <- 'rainfall snowfall "Channel storage" "Rivulet storage"'
regex <- "[^\\s\"]+|\"([^\"]+)\""
regmatches(str,gregexpr(regex,str,perl=TRUE))
But if you just need the character vector that mplourde's solution returns, go for that. Most likely that's what you're after anyway.
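The match-what-you-want approach carries over to any engine with match iteration. A minimal Scala sketch of the same regex (java.util.regex, like the perl=TRUE call above); the names here are illustrative only:
// Match either a run of non-space, non-quote characters, or a quoted chunk;
// when the quoted alternative fired, keep its inner group, else the whole match.
val str = "rainfall snowfall \"Channel storage\" \"Rivulet storage\""
val regex = "[^\\s\"]+|\"([^\"]*)\"".r
val tokens = regex.findAllMatchIn(str)
  .map(m => Option(m.group(1)).getOrElse(m.matched))
  .toList
// tokens: List(rainfall, snowfall, Channel storage, Rivulet storage)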
You can use strapply from the gsubfn package. With strapply you define the string to match rather than the string to split on.
str <- "rainfall snowfall 'Channel storage' 'Rivulet storage'"
strapply(str,"\\w+|'\\w+ \\w+'",c)[[1]]
[1] "rainfall" "snowfall" "'Channel storage'" "'Rivulet storage'"

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%\d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly formatted. To use only one backslash, you'd need to write your string with single quotes; with double quotes, you'd need two backslashes here.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute-with-an-expression feature (see :help sub-replace-\=) to run over all of the target matches, appending them to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')
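For comparison outside Vimscript, a minimal Scala sketch of the same extraction, run against the single-quoted variant b from above (so the backslashes are literal characters); the variable names are illustrative:
// Find every number sandwiched between a literal \% and l, instead of
// splitting and then filtering.
val b = """\%8l............\|\%11l..........\|\%17l.........\|\%20l...."""
val nums = """\\%(\d+)l""".r.findAllMatchIn(b).map(_.group(1)).toList
// nums: List(8, 11, 17, 20)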