Ruby: Split string with occurrence - regex

I have a string like:
"Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
I would like to split on the > character.
I would like the result to look like this:
["Premier League", "Arsenal|Bayern Munich|Breaking News", "Premier League", "Chelsea", "Ligue 1", etc ...]
But I can't figure out how to achieve this.
If I do my_string.split(">") I only get:
["Premier League", "Arsenal|Bayern Munich|Breaking News|Premier League", "Chelsea|Ligue 1", etc...]
I would like to build categories and subcategories by parsing this string.
For example:
Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|
should give:
Premier League
Arsenal
Bayern Munich
Breaking News
Premier League
Chelsea
Ligue 1
Lille

Replace the | in front of each > with another character (e.g. <) and then split on it.
s = "Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
# turn each "|Category>" boundary into "<Category>" so the string can be split on "<"
p s.gsub(/\|([^\|]+)>/, "<\\1>").split("<").
    map { |x| a = x.split(">"); [a[0], a[1].split("|")] }
Alternatively, parse the string from the back.
a = []
s = "Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
while /\b([^\|]+)>([^>]+)$/ === s do
  s = $`                           # keep only the part before the last "Category>items" chunk
  a.unshift([$1, $2.split('|')])
end
p a
I don't think it is easy to split the string directly.
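That said, if the flat array shown in the question is all you need, a single split comes close: split on > or on a | that introduces the next "Category>" segment. A minimal sketch (it assumes, as in the sample data, that names never contain > and that every new category is preceded by |):
s = "Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille"
# split on ">" or on a "|" whose following |-delimited segment contains ">"
p s.split(/>|\|(?=[^|]*>)/)
# => ["Premier League", "Arsenal|Bayern Munich|Breaking News", "Premier League", "Chelsea", "Ligue 1", "Lille"]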

Related

Using regexp to join two dataframes in Spark

Say I have a dataframe df1 with a column "color" that contains a bunch of colors, and another dataframe df2 with a column "phrase" that contains various phrases.
I'd like to join the two dataframes where the color in df1 appears in the phrases in df2. I cannot use df1.join(df2, df2("phrase").contains(df1("color"))), since that would join wherever the word appears within the phrase. I don't want to match words like scaRED, for example, where RED is part of another word. I only want to join when the color appears as a separate word in the phrase.
Can I use a regular expression to solve this? Which function can I use, and what is the syntax when I need to reference the column in the expression?
You could create a regex pattern that checks for word boundaries (\b) around the colors and use a regexp_replace check as the join condition:
val df1 = Seq(
  (1, "red"), (2, "green"), (3, "blue")
).toDF("id", "color")
val df2 = Seq(
  "red apple", "scared cat", "blue sky", "green hornet"
).toDF("phrase")
val patternCol = concat(lit("\\b"), df1("color"), lit("\\b"))
df1.join(df2, regexp_replace(df2("phrase"), patternCol, lit("")) =!= df2("phrase")).
  show
// +---+-----+------------+
// | id|color|      phrase|
// +---+-----+------------+
// |  1|  red|   red apple|
// |  3| blue|    blue sky|
// |  2|green|green hornet|
// +---+-----+------------+
Note that "scared cat" would have been a match in the absence of the enclosing word boundaries.
Building up on your own solution, you can also try this:
df1.join(df2, array_contains(split(df2("phrase"), " "), df1("color")))
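If the phrases may contain punctuation or mixed case, the same idea still works after a little normalisation. A sketch reusing df1/df2 from the first answer (tokenising on the non-word pattern \W+ and lower-casing both sides are assumptions about how you want to normalise the text):
import org.apache.spark.sql.functions.{array_contains, lower, split}
// tokenise each phrase on runs of non-word characters and compare lower-cased tokens,
// so "Scared, cat!" still matches only the whole word "scared", never "red"
df1.join(
  df2,
  array_contains(split(lower(df2("phrase")), "\\W+"), lower(df1("color")))
).show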
I did not see your data, but this is a starter with a little variation. No need for regex as far as I can see, but who knows:
// You need to do some parsing first, like stripping out . and ? and maybe lower- or upper-casing
// You did not provide an example of the JOIN
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val checkValue = udf { (array: WrappedArray[String], value: String) =>
  array.iterator.map(_.toLowerCase).contains(value.toLowerCase)
}
// Gen some data
val dfCompare = spark.sparkContext.parallelize(Seq("red", "blue", "gold", "cherry")).toDF("color")
val rdd = sc.parallelize(Array(
  ("red", "hello how are you red", 10),
  ("blue", "I am fine but blue", 20),
  ("cherry", "you need to do some parsing and I like cherry", 30),
  ("thebluephantom", "you need to do some parsing and I like fanta", 30)
))
// rdd.collect
val df = rdd.toDF()
val df2 = df.withColumn("_4", split($"_2", " "))
df2.show(false)
dfCompare.show(false)
val res = df2.join(dfCompare, checkValue(df2("_4"), dfCompare("color")), "inner")
res.show(false)
returns:
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|_1    |_2                                           |_3 |_4                                                      |color |
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|red   |hello how are you red                        |10 |[hello, how, are, you, red]                             |red   |
|blue  |I am fine but blue                           |20 |[I, am, fine, but, blue]                                |blue  |
|cherry|you need to do some parsing and I like cherry|30 |[you, need, to, do, some, parsing, and, I, like, cherry]|cherry|
+------+---------------------------------------------+---+--------------------------------------------------------+------+

Splitting CSV on commas except when the comma is in quotes

I am having problems splitting up a CSV file with data like the following:
Cat, car, dog, "A string, that has comma's in it", airplane, truck
I originally tried splitting the file with the following code:
csvFile.splitEachLine( /,\s*/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    mapList.add(tmpMap)
}
It results in:
Cat
car
dog
A string
that has comma's in it
airplane
truck
What I would like is:
Cat
car
dog
A string, that has comma's in it
airplane
truck
You should change your regex a little:
def mapList = []
def csvFile = "Cat, car, dog, \"A string, that has comma's in it\", airplane, truck"
csvFile.splitEachLine( /,(?=(?:[^"]*\"[^"]*")*[^"]*\Z)\s/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    tmpMap.putAt("column5", parts[4])
    tmpMap.putAt("column6", parts[5])
    mapList.add(tmpMap)
}
print mapList
But it's better to use an existing library for this; it will make your life much easier. Take a look at https://github.com/xlson/groovycsv
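For instance, a minimal sketch with groovycsv (the @Grab coordinates/version and the added header row are assumptions; check the project README for the current release):
@Grab('com.xlson.groovycsv:groovycsv:1.3') // version is an assumption
import static com.xlson.groovycsv.CsvParser.parseCsv
// groovycsv understands quoted fields, so the embedded comma stays inside column4
def csvText = '''column1,column2,column3,column4,column5,column6
Cat,car,dog,"A string, that has comma's in it",airplane,truck'''
for (line in parseCsv(csvText)) {
    println line.column4 // => A string, that has comma's in it
}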

Remove everything except periods and numbers from a string with regex in R

I know there are many questions on Stack Overflow regarding regex, but I cannot accomplish this one easy task with the available help I've seen. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
                a                    b
1 Los Angeles, CA c(34.0522, 118.2437)
2    New York, NY  c(40.7128, 74.0059)
3    San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the period (i.e. remove "c", "(" and ")"). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.
If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
                a                 b
1 Los Angeles, CA 34.0522, 118.2437
2    New York, NY  40.7128, 74.0059
3    San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
Try this:
gsub("[\\c|\\(|\\)]", "",df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
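Note that inside a character class the | is literal (it just adds | to the set) and ( and ) need no escaping, so the pattern above simply strips the characters c, ( and ). An equivalent, slightly shorter sketch:
gsub("[c()]", "", df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"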
Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parse it, and then create the string you want.
vapply(
  b,
  function(bi) {
    toString(eval(parse(text = bi)))
  },
  character(1)
)
Here is another option with str_extract_all from stringr. Extract the numeric parts with str_extract_all into a list, convert them to numeric, rbind the list elements, and cbind the result with the first column of 'df':
library(stringr)
cbind(df[1], do.call(rbind,
        lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))

Editing sentence start <s> and end </s> with R for prediction

I'm building an NLP model in R to predict the next word. So, for a corpus of 3 sentences:
a<-"i like cheese"
b<-"the dog like cat"
c<-"the cat eat cheese"
I want it to become:
> a
"<.s> i like cheese <./s>"
> b
"<.s> the dog like cat <./s>"
> c
"<.s> the cat eat cheese <./s>"
Is there a simpler way to do this than:
a <- unlist(strsplit(a, " "))
a <- c("<.s>", a)
a[length(a) + 1] <- "<./s>"
a <- paste(a, collapse = " ")
> a
"<.s> i like cheese <./s>"
You are simply concatenating strings, so this should work:
a <- paste("<.s>", a, "<./s>")
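And because paste() is vectorised, the same call handles the whole corpus in one go, for example:
corpus <- c(a, b, c)
paste("<.s>", corpus, "<./s>")
# gives "<.s> i like cheese <./s>", "<.s> the dog like cat <./s>", "<.s> the cat eat cheese <./s>"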

Applying Groovy RegEx with Conditional Matching

Using Groovy and regular expression(s), how can I convert this:
String shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
to print out
Shopping for Saturday
TOMATOES
TEA
LENTIL SOUP (2 packets)
I'm not a regex guru, so I couldn't find a regex to do the conversion in just one replaceAll step (I think it should be possible to do it that way). This works, though:
def shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
def (list, day) = (shopping =~ /SHOPPING LIST\((.*)\) for (\w+)/)[0][1, 2]
println "Shopping for $day\n" +
    list.replaceAll(/: packets=(\d+)/, ' ($1 packets)').replaceAll(', ', '\n')
First it captures the strings "TOMATOES, TEA, LENTIL SOUP: packets=2" and "Saturday" into the variables list and day respectively. Then it processes the list string to convert it into the desired output, replacing the "packets=" occurrences and splitting the list by commas (.replaceAll(', ', '\n') is equivalent to .split(', ').join('\n')).
One thing to notice is that if the shopping string does not match the first regex, it will throw an exception for trying to access the first match ([0]). You can avoid that by doing:
(shopping =~ /SHOPPING LIST\((.*)\) for (\w+)/).each { match, list, day ->
    println "Shopping for $day\n" +
        list.replaceAll(/: packets=(\d+)/, ' ($1 packets)').replaceAll(', ', '\n')
}
Which won't print anything if the first regex doesn't match.
I like to use the String find method for these kinds of cases; I think it's clearer than the =~ syntax:
String shopping = "SHOPPING LIST(TOMATOES, TEA, LENTIL SOUP: packets=2) for Saturday"
def expected = """Shopping for Saturday
TOMATOES
TEA
LENTIL SOUP (2 packets)"""
def regex = /SHOPPING LIST\((.*)\) for (.+)/
assert expected == shopping.find(regex) { full, items, day ->
    List<String> formattedItems = items.split(", ").collect { it.replaceAll(/: packets=(\d+)/, ' ($1 packets)') }
    "Shopping for $day\n" + formattedItems.join("\n")
}