editing sentence start <s> and end </s> with R for prediction

I'm building an NLP model in R to predict the next word. So, for a three-sentence corpus:
a<-"i like cheese"
b<-"the dog like cat"
c<-"the cat eat cheese"
I want it to become:
>a
"<.s> i like cheese <./s>"
>b
"<.s> the dog like cat <./s>"
>c
"<.s> the cat eat cheese <./s>"
Is there a simpler way to do this than:
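a <- unlist(strsplit(a, " "))
# Prepend and append the markers; assigning to a[1] and a[length(a)]
# would overwrite the first and last words instead.
a <- c("<.s>", a, "<./s>")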
a<-paste(a, collapse = " ")
> a
"<.s> i like cheese <./s>"

You are simply concatenating strings so this should work:
a <- paste("<.s>", a, "<./s>")

Related

Ruby: Split string with occurrence

I have a string like:
"Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
I would like to split it on the > character.
I would like the result to look like:
["Premier League", "Arsenal|Bayern Munich|Breaking News", "Premier League", "Chelsea", "Ligue 1", etc ...]
But I can't figure out how to achieve this.
If I do my_string.split(">") I only get:
["Premier League", "Arsenal|Bayern Munich|Breaking News|Premier League", "Chelsea|Ligue 1", etc...]
I would like to build categories and subcategories by parsing this string.
For example:
Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|
should give:
Premier League
Arsenal
Bayern Munich
Breaking News
Premier League
Chelsea
Ligue 1
Lille

Replace each | that comes right before a category name (the text directly in front of a >) with another character (e.g. <), and then split on that character.
s="Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
p s.gsub(/\|([^\|]+)>/,"<\\1>").split("<").
map{|x| a=x.split(">"); [a[0], a[1].split("|")]}
Alternatively, parse the string from the back.
a=[]
s="Premier League>Arsenal|Bayern Munich|Breaking News|Premier League>Chelsea|Ligue 1>Lille|Premier League>Liverpool|Premier League>Manchester-City|Premier League>Manchester-United|MercaShow|Mercato|Ligue 1>PSG"
while /\b([^\|]+)>([^>]+)$/ === s do
s=$`
a.unshift([$1, $2.split('|')])
end
p a
I don't think it is easy to split the string directly.

Splitting csv on commas except when the string is in quotes

I am having problems splitting up a csv file with data like the following:
Cat, car, dog, "A string, that has comma's in it", airplane, truck
I originally tried splitting the file with the following code:
csvFile.splitEachLine( /,\s*/ ){ parts ->
tmpMap = [:]
tmpMap.putAt("column1", parts[0])
tmpMap.putAt("column2", parts[1])
tmpMap.putAt("column3", parts[2])
tmpMap.putAt("column4", parts[3])
mapList.add(tmpMap)
}
It results in:
Cat
car
dog
A string
that has comma's in it
airplane
truck
What I would like is:
Cat
car
dog
A string, that has comma's in it
airplane
truck

You should change your regex a little:
def mapList = []
def csvFile = "Cat, car, dog, \"A string, that has comma's in it\", airplane, truck"
csvFile.splitEachLine( /,(?=(?:[^"]*\"[^"]*")*[^"]*\Z)\s/ ){ parts ->
tmpMap = [:]
tmpMap.putAt("column1", parts[0])
tmpMap.putAt("column2", parts[1])
tmpMap.putAt("column3", parts[2])
tmpMap.putAt("column4", parts[3])
tmpMap.putAt("column5", parts[4])
tmpMap.putAt("column6", parts[5])
mapList.add(tmpMap)
}
print mapList
But it's better to use an existing library for that. It will make your life much easier. Take a look at https://github.com/xlson/groovycsv

Search for largest word match in a vocabulary from a given string

I have a string "the big cat in the zoo", and my vocabulary has ["in the zoo", "the zoo"].
I can't do a direct search; I have to search the combinations:
1) zoo
2) the zoo
3) in the zoo
and return only "in the zoo", which is the longest matching string.
How can I do this reverse search and match in Python?

You could try something along these lines:
import itertools

str1 = "the big cat in the zoo"
vocabulary = ["in the zoo", "the zoo"]
str1 = str1.split()
# Each (first, last) pair of word indices marks one multi-word span
for first, last in itertools.combinations(range(len(str1)), 2):
    new_str = ' '.join(str1[first:last+1])
    print(new_str)
This gives you the output,
the big
the big cat
the big cat in
the big cat in the
the big cat in the zoo
big cat
big cat in
big cat in the
big cat in the zoo
cat in
cat in the
cat in the zoo
in the
in the zoo
the zoo
Adapt it as needed for your problem, for example by checking each candidate against the vocabulary and keeping the longest hit.

Sort your list items by descending length.
Loop through your list items with if (mystring.Contains(vocabularyItem)) ... and stop at the first hit, which is then the longest match.
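
Since the question asks for Python, the length-sorted idea might look like the sketch below. The function name longest_vocab_match is made up for illustration, and padding both strings with spaces is an assumption added here so that a phrase only matches whole words separated by single spaces.

```python
def longest_vocab_match(text, vocabulary):
    """Return the longest vocabulary phrase that occurs in text, or None."""
    # Pad with spaces so a phrase only matches on word boundaries
    # (assumption: words are separated by single spaces).
    padded = f" {text} "
    # Try the longest phrases first; the first hit is the longest match.
    for phrase in sorted(vocabulary, key=len, reverse=True):
        if f" {phrase} " in padded:
            return phrase
    return None


print(longest_vocab_match("the big cat in the zoo", ["in the zoo", "the zoo"]))
# prints: in the zoo
```

Sorting costs little for a small vocabulary; for a very large one, an index keyed by the last word would probably be a better fit.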

How to remove unwanted space between words inside a character vector using R?

I have a character vector like:
"I t is tim e to g o"
I wanted it to be:
"It is time to go"

This regex works in your case: "\\s(?=\\S\\s\\S{2,}|\\S$)"
string <- "I t is tim e to g o"
gsub("\\s(?=\\S\\s\\S{2,}|\\S$)", "", string, perl=TRUE)
## [1] "It is time to go"
Try this and replace with an empty string. See the demo: https://regex101.com/r/nL5yL3/32

Using rex may make this type of task a little simpler. Although in this case maybe not :)
string <- "I t is tim e to g o"
library(rex)
re_substitutes(string, rex(
  space %if_next_is%
    list(
      list(non_space, space, at_least(non_space, 2)) %or%
      list(non_space, end)
    )
  ), "", global = TRUE)
#> [1] "It is time to go"

String pattern matching problem

Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, eg.
cat x dog cat x cat x dog x dog x cat x dog x cat
Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').
What I want to do is find every 'cat' that is followed by any characters except 'dog' and then by 'cat'. I want to remove that first instance of 'cat' in each case.
In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':
cat x dog [cat] x cat x dog x dog x cat x dog x cat
To end up with:
cat x dog x cat x dog x dog x cat x dog x cat
How can this be done?
I thought of somehow using a regular expression like (n)(?=(n)), as VonC recommended in another answer:
(cat)(?=(.*cat))
to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.
The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.

Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.
If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.
Pseudocode follows:
// Find all the cats and dogs
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);
// Pick every cat whose next cat comes before the next dog
int[] idsToRemove = doLogic(catLocations, dogLocations);
// Remove each identified cat, from the end to the front,
// so that the earlier indices stay valid
for (int id : idsToRemove.reverse())
    string.removeSubstring(id, "cat".length());
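
For comparison, a single regex with a lookahead also seems to handle this case. The sketch below is in Python rather than Java, and the names PATTERN and remove_cats_without_dog are made up for illustration; the pattern uses only lookahead syntax that java.util.regex supports as well. It assumes that 'cat' and 'dog' never overlap with the filler text.

```python
import re

# "cat", then any run of characters in which no "dog" starts, then another "cat".
# The lookahead consumes nothing, so back-to-back candidates are all checked.
PATTERN = re.compile(r"cat(?=(?:(?!dog).)*?cat)", re.DOTALL)


def remove_cats_without_dog(text):
    """Delete every 'cat' that reaches the next 'cat' without passing a 'dog'."""
    return PATTERN.sub("", text)


s = "cat x dog cat x cat x dog x dog x cat x dog x cat"
print(remove_cats_without_dog(s))
```

Only the three letters of each matching 'cat' are deleted, so the spaces that surrounded it stay in place; collapse them in a second step if that matters.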
