String pattern matching problem - regex

Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, e.g.:
cat x dog cat x cat x dog x dog x cat x dog x cat
Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').
What I want to do is find every 'cat' that is followed by any characters except 'dog' and then by 'cat'. I want to remove that first instance of 'cat' in each case.
In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':
cat x dog [cat] x cat x dog x dog x cat x dog x cat
To end up with:
cat x dog x cat x dog x dog x cat x dog x cat
How can this be done?
I thought of somehow using a regular expression like (n)(?=(n)), as VonC recommended here:
(cat)(?=(.*cat))
to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.
The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.

Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.
If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.
Pseudocode follows:
// Find all the cats and dogs
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);
int[] idsToRemove = doLogic(catLocations, dogLocations);
// Remove each identified cat, from the end to the front
for (int id : idsToRemove.reverse())
    string.removeSubstring(id, "cat".length());
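The two passes above can be sketched in Python as follows. This is only an illustration of the idea, not the Java solution itself: the function name remove_cats_not_followed_by_dog is made up for this example, and it assumes that "no dog before the next cat" simply means that the next 'cat'/'dog' token after a 'cat' is another 'cat'.

```python
import re

def remove_cats_not_followed_by_dog(text):
    # Pass 1: collect every 'cat' and 'dog' with its start offset, in order.
    tokens = [(m.start(), m.group()) for m in re.finditer(r"cat|dog", text)]

    # Pass 2: a 'cat' is marked for removal when the very next token
    # is another 'cat' (no 'dog' in between).
    to_remove = [
        start
        for (start, word), (_, next_word) in zip(tokens, tokens[1:])
        if word == "cat" and next_word == "cat"
    ]

    # Delete from the end to the front so earlier offsets stay valid.
    for start in reversed(to_remove):
        text = text[:start] + text[start + len("cat"):]
    return text

sample = "cat x dog cat x cat x dog x dog x cat x dog x cat"
print(remove_cats_not_followed_by_dog(sample))
```

Note that only the three letters of 'cat' are removed, so the surrounding spaces are left in place; collapse the whitespace afterwards if that matters.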

Related

Splitting CSV on commas except when the string is in quotes

I am having problems splitting up a csv file with data like the following
Cat, car, dog, "A string, that has comma's in it", airplane, truck
I originally tried splitting the file with the following code:
csvFile.splitEachLine( /,\s*/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    mapList.add(tmpMap)
}
It results in:
Cat
car
dog
A string
that has comma's in it
airplane
truck
What I would like is:
Cat
car
dog
A string, that has comma's in it
airplane
truck
You should change your regex a little:
def mapList = []
def csvFile = "Cat, car, dog, \"A string, that has comma's in it\", airplane, truck"
csvFile.splitEachLine( /,(?=(?:[^"]*\"[^"]*")*[^"]*\Z)\s/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    tmpMap.putAt("column5", parts[4])
    tmpMap.putAt("column6", parts[5])
    mapList.add(tmpMap)
}
print mapList
That said, it's better to use an existing library for this. It will make your life much easier. Take a look at https://github.com/xlson/groovycsv
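To illustrate the "use a library" advice outside Groovy, here is a minimal Python sketch using the standard csv module (not groovycsv). The helper name split_csv_line is made up for this example; skipinitialspace=True is what lets the parser accept the space after each comma in the sample data.

```python
import csv
import io

def split_csv_line(line):
    # skipinitialspace=True drops the blank after each comma, so the quote
    # that follows ", " is still recognised as the start of a quoted field.
    reader = csv.reader(io.StringIO(line), skipinitialspace=True)
    return next(reader)

line = 'Cat, car, dog, "A string, that has comma\'s in it", airplane, truck'
for field in split_csv_line(line):
    print(field)
```

The quoted field comes back as a single value with its inner comma intact, which is the behaviour asked for in the question.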

Search for largest word match in a vocabulary from a given string

I have a string, "the big cat in the zoo", and my vocabulary is ["in the zoo", "the zoo"].
I can't do a direct search; I have to search the combinations:
1) zoo
2) the zoo
3) in the zoo
and return only "in the zoo", which is the longest matching string.
How can I do this reverse search and match in Python?
Could try something along the lines of this -
import itertools

str1 = "the big cat in the zoo"
vocabulary = ["in the zoo", "the zoo"]
str1 = str1.split()
# Every pair (first, last) with first < last is a contiguous run of two or more words
for first, last in itertools.combinations(range(len(str1)), 2):
    new_str = ' '.join(str1[first:last+1])
    print(new_str)
This gives you the output,
the big
the big cat
the big cat in
the big cat in the
the big cat in the zoo
big cat
big cat in
big cat in the
big cat in the zoo
cat in
cat in the
cat in the zoo
in the
in the zoo
the zoo
Edit it however you want to change it to use it for your problem's conditions.
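One possible way to adapt the loop above to the question is sketched below. The function name longest_span_in_vocabulary is hypothetical, and the sketch assumes that matches should fall on whole-word boundaries and that single words count as candidates too.

```python
from itertools import combinations

def longest_span_in_vocabulary(sentence, vocabulary):
    words = sentence.split()
    vocab = set(vocabulary)
    best = None
    # Choosing two of the len(words) + 1 cut points yields every contiguous
    # run of words, single words included.
    for start, end in combinations(range(len(words) + 1), 2):
        candidate = " ".join(words[start:end])
        if candidate in vocab and (best is None or len(candidate) > len(best)):
            best = candidate
    return best

print(longest_span_in_vocabulary("the big cat in the zoo", ["in the zoo", "the zoo"]))
```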
Sort your list items by descending length.
Loop through your list items with if (mystring.Contains(vocabularyItem)) ...
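In Python, the sort-then-search idea could look roughly like this. The name longest_vocabulary_match is made up for the example, and note that a plain substring test does not respect word boundaries.

```python
def longest_vocabulary_match(text, vocabulary):
    # Try the longest phrases first, so the first hit is the longest match.
    for phrase in sorted(vocabulary, key=len, reverse=True):
        if phrase in text:
            return phrase
    return None

print(longest_vocabulary_match("the big cat in the zoo", ["in the zoo", "the zoo"]))
```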

Reverse replace all using character and word

I found this pattern in another post:
Pattern p = Pattern.compile("[^xyz]"); s.replaceAll(p.pattern(), "-");
It allows replacing all characters except x, y and z.
How can we adapt it to also keep a word in the reverse replace-all? For example, I'd like to keep x, y, z and the word dog.
Example:
"abcxyzabcxyzdfrdogdzx" --> "xyzxyzdogzx"
Thanks
For this, you'll need to capture words that you want globally, and join the matches together.
In JS:
/([xyz]|dog)/g
Breakdown:
[xyz] - any character from the list x, y, z
dog - match dog literally
([xyz]|dog) - capture the list, or dog
/g - the global modifier
let string = "abcxyzabcxyzdfrdogdzx",
    regex = /([xyz]|dog)/g,
    whatWeAreLookingFor = string.match(regex).join("");
console.log(whatWeAreLookingFor);
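For comparison, the same "match what you want and join" idea can be sketched in Python. The function names are made up for this example. The second function is closer to the replaceAll style from the question: it assumes Python 3.5 or later, where a group that did not participate in the match is replaced by an empty string.

```python
import re

def keep_by_matching(s):
    # Collect every 'dog' or single x/y/z character and glue them together.
    return "".join(re.findall(r"dog|[xyz]", s))

def keep_by_replacing(s):
    # Either match 'dog' (captured and put back via \1) or one unwanted
    # character (group 1 is empty, so the character is dropped).
    return re.sub(r"(dog)|[^xyz]", r"\1", s)

print(keep_by_matching("abcxyzabcxyzdfrdogdzx"))
print(keep_by_replacing("abcxyzabcxyzdfrdogdzx"))
```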

Unexpected result of negative lookahead on word (R regex)

I'm trying to create rules for a sentence that contains "dog" but not "cat". I would like the function to return FALSE since the string contains both "dog" and "cat".
Using negation:
grepl("cat.*[^dog]", "asdfasdfasdf cat adsfafds dog", perl=T)
Using negative lookahead:
grepl("cat.*(?!dog)", "asdfasdfasdf cat adsfafds dog", perl=T)
Using str_detect function in the stringr package
require(stringr)
str_detect("asdfasdfasdf cat adsfafds dog", "cat.*(?!dog|$)")
All three of these methods return TRUE.
You can use this regex to find strings that contain cat but not dog:
^((cat((?!dog).)*)|(((?!dog).)*?cat((?!dog).)*)+)$
It's based on the answer here. It takes into account that dog can come before or after cat.
The problem with ALL of your solutions is that cat.* will find cat and then .* will eat up EVERYTHING, including dogs.
Also, you forgot to handle the cases where dog comes before cat.
As Druzion points out, char classes are not the way to go.
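The behaviour described above is not specific to R; a small Python sketch shows the same thing. The pattern in contains_dog_but_not_cat is an alternative to the regex given earlier, not the same one, and it targets the question's stated goal (contains "dog" but not "cat"); the function name is made up for this example.

```python
import re

text = "asdfasdfasdf cat adsfafds dog"

# The question's pattern still matches: '.*' can stop at any position that is
# not immediately followed by 'dog' (for example the very end of the string).
naive_match = re.search(r"cat.*(?!dog)", text) is not None

def contains_dog_but_not_cat(s):
    # (?!.*cat) is checked once at the start and rejects any string that has
    # 'cat' anywhere; '.*dog' then requires a 'dog' somewhere in the string.
    return re.search(r"^(?!.*cat).*dog", s, flags=re.DOTALL) is not None

print(naive_match)
print(contains_dog_but_not_cat(text))
print(contains_dog_but_not_cat("only a dog here"))
```

Anchoring the negative lookahead at the start of the string is what makes it cover 'cat' both before and after 'dog'.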
A simple solution is to create a function that checks:
i) if the string contains both cat and dog, return FALSE
ii) otherwise, return TRUE
R Code
cat_dog <- function(x) {
  # FALSE when x contains both "cat" and "dog"
  if (length(grep("(?=.*cat)(?=.*dog)", x, perl = TRUE)) != 0) {
    return(FALSE)
  } else {
    return(TRUE)
  }
}
Updated Code
cat_dog <- function(x) {
  # TRUE only when x contains "dog" but not "cat"
  if (length(grep("(?=.*dog)", x, perl = TRUE)) != 0) {
    if (length(grep("(?=.*cat)", x, perl = TRUE)) != 0) {
      return(FALSE)
    } else {
      return(TRUE)
    }
  } else {
    return(FALSE)
  }
}

editing sentence start <s> and end </s> with R for prediction

I'm building an NLP model to predict the next word in R. So, for a three-sentence corpus:
a<-"i like cheese"
b<-"the dog like cat"
c<-"the cat eat cheese"
I want it to become:
>a
"<.s> i like cheese <./s>"
>b
"<.s> the dog like cat <./s>"
>c
"<.s> the cat eat cheese <./s>"
Is there a simpler way to do this than:
a <- unlist(strsplit(a, " "))
a <- c("<.s>", a, "<./s>")
a <- paste(a, collapse = " ")
> a
"<.s> i like cheese <./s>"
You are simply concatenating strings so this should work:
a <- paste("<.s>", a, "<./s>")
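The same concatenation applied to the whole corpus at once can be sketched in Python (used here only for illustration; the helper add_sentence_markers and its parameters are made up for this example):

```python
def add_sentence_markers(sentence, start="<.s>", end="<./s>"):
    # Plain concatenation with single spaces, like paste() with its default separator.
    return f"{start} {sentence} {end}"

corpus = ["i like cheese", "the dog like cat", "the cat eat cheese"]
marked = [add_sentence_markers(s) for s in corpus]
for line in marked:
    print(line)
```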