Splitting CSV on commas except when the string is in quotes - regex

I am having problems splitting up a CSV file with data like the following:
Cat, car, dog, "A string, that has comma's in it", airplane, truck
I originally tried splitting the file with the following code:
csvFile.splitEachLine( /,\s*/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    mapList.add(tmpMap)
}
It results in:
Cat
car
dog
A string
that has comma's in it
airplane
truck
What I would like is:
Cat
car
dog
A string, that has comma's in it
airplane
truck

You should change your regex a little:
def mapList = []
def csvFile = "Cat, car, dog, \"A string, that has comma's in it\", airplane, truck"
// Split on a comma only when an even number of quotes follows it, i.e. the comma is outside quotes
csvFile.splitEachLine( /,(?=(?:[^"]*\"[^"]*")*[^"]*\Z)\s/ ){ parts ->
    tmpMap = [:]
    tmpMap.putAt("column1", parts[0])
    tmpMap.putAt("column2", parts[1])
    tmpMap.putAt("column3", parts[2])
    tmpMap.putAt("column4", parts[3])
    tmpMap.putAt("column5", parts[4])
    tmpMap.putAt("column6", parts[5])
    mapList.add(tmpMap)
}
print mapList
That said, it is better to use an existing library for this; it will make your life much easier. Take a look at https://github.com/xlson/groovycsv
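For comparison, here is a minimal sketch of the library route in Python rather than Groovy, so it illustrates the idea and is not the groovycsv API. It assumes the standard csv module; skipinitialspace=True is needed because the sample line has a space after each comma. The helper name parse_csv_line is made up for this example.

```python
import csv

def parse_csv_line(line):
    # skipinitialspace=True drops the blank after each comma, so the
    # opening quote is seen at the start of the field and honored.
    return next(csv.reader([line], skipinitialspace=True))

line = 'Cat, car, dog, "A string, that has comma\'s in it", airplane, truck'
for field in parse_csv_line(line):
    print(field)
```

The quoted field comes back as a single value with its embedded comma intact and without the surrounding quotes.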

Related

Using Regex to replace a value in a string from a list of items in array

I'm not sure what I'm asking is possible. I have the following situation:
Dim animalList as string = "Dog|Cat|Bird|Mouse"
Dim animal_story_string as string = "One day I was walking down the street and I saw a dog"
Dim hasAnAnimalonList as Boolean
Dim animals() as String = animalList.Split("|")
Dim regex As New Regex("\b" & String.Join("\b|\b", animals) & "\b", RegexOptions.IgnoreCase)
If regex.IsMatch(animal_story_string) Then
hasAnAnimalonList = True
'Here I would like to replace/format the animal found with HTML bold tags so it would look
'like "One day I was walking down the street and I saw a <b>dog</b>"
End If
In the past I would loop each value in the animalList and if a match is found replace it at that time. Like
For Each animal As string in animals
' I would replace every animal in the list
' If Cat and Birds and Mouse were not in the string it did not matter
animal_story_string = animal_story_string.Replace(animal, "<b>" + animal + "</b>")
Next
Is there a way to do it using a Regex function?
Yes. Split the Dog|Cat|Bird|Mouse string, wrap each item in \b word boundaries, and join the results with | to build the regex pattern, as shown below. You can then replace all the matches in a single Regex.Replace call using a MatchEvaluator function.
Dim animalList = "Dog|Cat|Bird|Mouse"
Dim regexPattern = String.Join("|", animalList.Split("|"c).Select(Function(x) $"\b{x}\b"))
Dim animal_story_string = "One day I was walking down the street and I saw a dog or maybe a fat cat! I didn't see a bird though."
Dim hasAnAnimalonList = Regex.IsMatch(animal_story_string, regexPattern, RegexOptions.IgnoreCase)
If hasAnAnimalonList Then
Dim replace = Regex.Replace(
animal_story_string,
regexPattern,
Function(m) $"<b>{m.Value}</b>", RegexOptions.IgnoreCase)
Console.WriteLine(replace)
End If
Writes in the console:
One day I was walking down the street and I saw a <b>dog</b> or maybe a fat <b>cat</b>! I didn't see a <b>bird</b> though.
... and in HTML renderer...
One day I was walking down the street and I saw a dog or maybe a fat cat! I didn't see a bird though.
I think
/(?:^|(?<= ))(Dog|Cat|Bird|Mouse)(?:(?= )|$)/i
or even
/\b(Dog|Cat|Bird|Mouse)\b/i
will do what you want?
See: https://regex101.com/r/V4Uhg7/1
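The same build-a-pattern-and-replace idea can be sketched in Python for comparison. This is an illustration, not part of the VB.NET answers above: the function name bold_animals is hypothetical, and the replacement callable passed to re.sub plays the role of the MatchEvaluator.

```python
import re

def bold_animals(text, animal_list="Dog|Cat|Bird|Mouse"):
    # Escape each word, then wrap the whole alternation in word boundaries.
    words = [re.escape(w) for w in animal_list.split("|")]
    pattern = r"\b(?:" + "|".join(words) + r")\b"
    # The callable receives each match and returns its replacement,
    # so the original casing of the matched word is preserved.
    return re.sub(pattern, lambda m: "<b>" + m.group(0) + "</b>",
                  text, flags=re.IGNORECASE)

story = ("One day I was walking down the street and I saw a dog "
         "or maybe a fat cat! I didn't see a bird though.")
print(bold_animals(story))
```

Because of the \b boundaries, a word such as "category" is left alone even though it contains "cat".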

Search for largest word match in a vocabulary from a given string

I have a string "the big cat in the zoo", and my vocabulary is ["in the zoo", "the zoo"].
I can't do a direct search; I have to search the combinations:
1) zoo
2) the zoo
3) in the zoo
and return only "in the zoo", which is the longest matching string.
How can I do this reverse search and match in Python?
You could try something along these lines:
import itertools

str1 = "the big cat in the zoo"
vocabulary = ["in the zoo", "the zoo"]
str1 = str1.split()
# Each (first, last) index pair selects one contiguous run of two or more words
for first, last in itertools.combinations(range(len(str1)), 2):
    new_str = ' '.join(str1[first:last+1])
    print (new_str)
This gives you the output,
the big
the big cat
the big cat in
the big cat in the
the big cat in the zoo
big cat
big cat in
big cat in the
big cat in the zoo
cat in
cat in the
cat in the zoo
in the
in the zoo
the zoo
Edit it however you want to change it to use it for your problem's conditions.
Sort your vocabulary items by descending length.
Loop through the sorted items and stop at the first one the string contains, with if (mystring.Contains(vocabularyItem)) ...
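A minimal Python sketch of the sort-then-check approach just described. The function name longest_vocab_match is made up for illustration, and it uses plain substring containment, so it does not respect word boundaries.

```python
def longest_vocab_match(text, vocabulary):
    # Try the longest entries first so the first hit is the longest match.
    for item in sorted(vocabulary, key=len, reverse=True):
        if item in text:
            return item
    return None  # nothing from the vocabulary occurs in the text

print(longest_vocab_match("the big cat in the zoo", ["in the zoo", "the zoo"]))
# -> in the zoo
```

If matches must line up with whole words, the containment check would need to be replaced, for example by comparing against the word combinations generated in the previous answer.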

Regular Expression to match last word when string starts with pattern

I'm trying to create a regex to match the last word of a string, but only if the string starts with a certain pattern.
For example, I want to get the last word of a string only if the string starts with "The cat".
"The cat eats butter" -> would match "butter".
"The cat drinks milk"-> would match "milk"
"The dog eats beef" -> would find no match.
I know the following will give me the last word:
\s+\S*$
I also know that I can use a positive look behind to make sure a string starts with a certain pattern:
(?<=The cat )
But I can't figure out how to combine them.
I'll be using this in C# and I know I could combine this with some string comparison operators, but I'd like this all to be in one regex, as this is one of several regex pattern strings that I'll be looping through.
Any ideas?
Use the following regex:
^The cat.*?\s+(\S+)$
Details:
^ - Start of the string.
The cat - The "starting" pattern.
.*? - A sequence of arbitrary chars, reluctant version.
\s+ - A sequence of "white" chars.
(\S+) - A capturing group - sequence of "non-white" chars,
this is what you want to capture.
$ - End of the string.
So the last word will be in the first capturing group.
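As a quick check of the pattern above, here is a sketch in Python rather than the C# the question targets. The pattern is the one from this answer; the names LAST_WORD_AFTER_CAT and last_word are invented for the example.

```python
import re

# Same pattern as above: anchor on "The cat", then capture the final word.
LAST_WORD_AFTER_CAT = re.compile(r"^The cat.*?\s+(\S+)$")

def last_word(sentence):
    match = LAST_WORD_AFTER_CAT.search(sentence)
    # Group 1 holds the last word; None means the prefix did not match.
    return match.group(1) if match else None

for s in ["The cat eats butter", "The cat drinks milk", "The dog eats beef"]:
    print(s, "->", last_word(s))
```

The first two sentences yield "butter" and "milk", and the third yields None because it does not start with "The cat".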
What about this one?
^The\scat.*\s(\w+)$
My regex knowledge is quite rusty, but couldn't you simply "add" the words you are looking for at the start of \s+\S*$, if you know that will return the last word?
Something like this then (the "\" is supposed to be the escape sign so it's read as the actual word):
\T\h\e\ \c\a\t\ \s+\S*$
Without Regex
No need for regex. Just use C#'s StartsWith together with Split(' ') and LINQ's Last().
using System;
using System.Linq;
using System.Text.RegularExpressions;
class Example {
static void Main() {
string[] strings = {
"The cat eats butter",
"The cat drinks milk",
"The dog eats beef"
};
foreach(string s in strings) {
if(s.StartsWith("The cat")) {
Console.WriteLine(s.Split(' ').Last());
}
}
}
}
Result:
butter
milk
With Regex
If you prefer, however, a regex solution, you may use the following.
using System;
using System.Text.RegularExpressions;
class Example {
static void Main() {
string[] strings = {
"The cat eats butter",
"The cat drinks milk",
"The dog eats beef"
};
Regex regex = new Regex(@"(?<=^The cat.*)\b\w+$");
foreach(string s in strings) {
Match m = regex.Match(s);
if(m.Success) {
Console.WriteLine(m.Value);
}
}
}
}
Result:
butter
milk

Creating a list of words does not work with a list of sentences

I am trying to take a list of sentences and split each sentence into a new list containing its words.
def create_list_of_words(file_name):
    for word in file_name:
        word_list = word.split()
    return word_list
sentence = ['a frog ate the dog']
x = create_list_of_words(sentence)
print x
This is fine as my output is
['a', 'frog', 'ate', 'the', 'dog']
However, when I try to do a list of sentences it no longer reacts the same.
my_list = ['the dog hates you', 'you love the dog', 'a frog ate the dog']
for i in my_list:
    x = create_list_of_words(i)
    print x
Now my out
There are a few issues in your second script:
i is the string 'the dog hates you', while in the first script the parameter was the list ['a frog ate the dog']: one is a string and the other is a list.
word_list = word.split(): with this line inside the loop, you reassign word_list on every iteration. Instead, use the append function as I wrote in my code sample.
When passing a string to the function, you need to split the string before the word loop.
Try this:
def create_list_of_words(str_sentence):
    sentence = str_sentence.split()
    word_list = []
    for word in sentence:
        word_list.append(word)
    return word_list
li_sentence = ['the dog hates you', 'you love the dog', 'a frog ate the dog']
for se in li_sentence:
    x = create_list_of_words(se)
    print x
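Since str.split() already returns a list of words, the loop with append can be shortened. A sketch of that shorter variant, using a hypothetical helper named split_sentences that takes the whole list of sentences at once:

```python
def split_sentences(sentences):
    # One word list per sentence; split() with no arguments splits on whitespace.
    return [sentence.split() for sentence in sentences]

li_sentence = ['the dog hates you', 'you love the dog', 'a frog ate the dog']
for words in split_sentences(li_sentence):
    print(words)
```

This prints one list per sentence, for example ['the', 'dog', 'hates', 'you'] for the first one.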

String pattern matching problem

Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, eg.
cat x dog cat x cat x dog x dog x cat x dog x cat
Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').
What I want to do is find every 'cat' that is followed by any characters except 'dog' and then by 'cat'. I want to remove that first instance of 'cat' in each case.
In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':
cat x dog [cat] x cat x dog x dog x cat x dog x cat
To end up with:
cat x dog x cat x dog x dog x cat x dog x cat
How can this be done?
I thought of somehow using a regular expression like (n)(?=(n)), as VonC recommended here:
(cat)(?=(.*cat))
to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.
The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.
Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.
If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.
Pseudocode follows:
// Find all the cats and dogs
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);
int [] idsToRemove = doLogic(catLocations, dogLocations);
// Remove each identified cat, from the end to the front
for (int id : idsToRemove.reverse())
string.removeSubstring(id, "cat".length());
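The two-pass idea in the pseudocode can be sketched in Python as follows. This is one possible reading of the doLogic step, assuming a cat should be removed exactly when the next 'cat' or 'dog' after it is another 'cat'. The function name remove_cats_before_cats is hypothetical.

```python
import re

def remove_cats_before_cats(text):
    # Pass 1: record every 'cat' and 'dog' with its position, in order.
    tokens = [(m.start(), m.group(0)) for m in re.finditer(r"cat|dog", text)]
    # Pass 2: mark a cat for removal when the next token is also a cat.
    to_remove = [pos
                 for (pos, word), (_, next_word) in zip(tokens, tokens[1:])
                 if word == "cat" and next_word == "cat"]
    # Remove from the end to the front so earlier offsets stay valid.
    for pos in reversed(to_remove):
        text = text[:pos] + text[pos + len("cat"):]
    return text

sample = "cat x dog cat x cat x dog x dog x cat x dog x cat"
print(remove_cats_before_cats(sample))
```

Only the word itself is removed, so the space that preceded the deleted 'cat' stays in the result; trimming the extra whitespace is left out of this sketch.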