CSV Regex skipping first comma - regex

I am using regex for CSV processing where data can be in Quotes, or no quotes. But if there is just a comma at the starting column, it skips it.
Here is the regex I am using:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)
Now the example data I am using is:
,"data",moredata,"Data"
Which should have 4 matches ["","data","moredata","Data"], but it always skips the first comma. It is fine if there is quotes on the first column, or it is not blank, but if it is empty with no quotes, it ignores it.
Here is a sample code I am using for testing purposes, it is written in Dart:
void main() {
String delimiter = ",";
String rawRow = ',,"data",moredata,"Data"';
RegExp exp = new RegExp(r'(?:'+ delimiter + r'"|^")(^,|""|[\w\W]*?)(?="'+ delimiter + r'|"$)|(?:'+ delimiter + '(?!")|^(?!"))([^'+ delimiter + r']*?)(?=$|'+ delimiter + r')');
Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n","").replaceAll("\r","").trim());
List<String> row = new List();
matches.forEach((Match m) {
//This checks to see which match group it found the item in.
String cellValue;
if (m.group(2) != null) {
//Data found without speech marks
cellValue = m.group(2);
} else if (m.group(1) != null) {
//Data found with speech marks (so it removes escaped quotes)
cellValue = m.group(1).replaceAll('""', '"');
} else {
//Anything left
cellValue = m.group(0).replaceAll('""', '"');
}
row.add(cellValue);
});
print(row.toString());
}

Investigating your expression
(,"|^")
(""|[\w\W]*?)
(?=",|"$)
|
(,(?!")|^(?!"))
([^,]*?|)
(?=$|,)
(,"|^")(""|[\w\W]*?)(?=",|"$) This part is to match quoted strings, that seem to work for you
Going through this part (,(?!")|^(?!"))([^,]*?|)(?=$|,)
(,(?!")|^(?!")) start with comma not followed by " OR start of line not followed by "
([^,]*?|) Start of line or comma zero or more non greedy and |, why |
(?=$|,) end of line or , .
In CSV this ,,,3,4,5 line should give 6 matches but the above only gets 5
You could add (^(?=,)) at the begining of second part, the part that matches non quoted sections.
Second group with match of start and also added non capture to groups
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Here is another that might work
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
How that works i described here: Build CSV parser using regex

Related

Capturing a delimiter that isn't in between single quotes

Like the question says, is it possible to use a single Regex string to get a delimiter that isn't in between some quotes?
For example, I want to split this string with the delimiter &:
"example=3&testing='f&tmp'"
should produce
["example=3", "testing='f&tmp'"]
Essentially, things inside single quotes (' ') should remain untouched.
I found out how to get things within quotes with expression: (?:'.*?')
The closest I could get to a tangible solution was: (.[^']&[^'])
It is not an easy task for a String#split, but is quite a feasible task for Matcher#find if you use
[^&\s=]+=(?:'[^']*'|[^\s&]*)
(see this regex demo) and this Java code:
String text = "example=3&testing='f&tmp'";
Pattern p = Pattern.compile("[^&\\s=]+=(?:'[^']*'|[^\\s&]*)");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group());
}
System.out.println(res);
// => [example=3, testing='f&tmp']
Details
[^&\s=]+ - one or more chars other than &, = and whitespace
= - a = char
(?:'[^']*'|[^\s&]*) - a non-capturing group matching either ', zero or more chars other than ' and then a ', or zero or more chars other than whitespace and &.

Regex - change commas only in a portion of a string

I make a lot of changes on a original csv string. there is a lot of comma delimiter. I have to replace by a ";" either only the commas inside the expression || ....|| or only the commas outside this expression. i need to do this change in order to have different delimiter in the expression ||....|| compare to the rest of the string.
Example:
(.*)(?:\|\|)(?:.*)(,)(?:.*)\|\|
After I use
var regex = /myregex/g;
var str = str.replace(regex, ',')
thanks
You can use
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/\|{2}[\w\W]*?\|{2}/g, (x) => x.replace(/,/g, ';')) );
The regex is
/\|{2}.*?\|{2}/gs // matches any text between two double pipes
/\|{2}[\w\W]*?\|{2}/g // matches any text between two double pipes
/\|{2}.*?\|{2}/g // matches any text but line breaks between two double pipes
Note the . does not match line breaks without the s modifier flag.
The regex matches double pipe, then any zero or more chars, as few as possible up to the next double pipe.
Then, x, the whole match value, is passed as an argument to the anonymous callback function used as a replacement argument, and all commas are replaced with ; only inside the matches.
The "contrary" solution is to match and capture the strings between double pipes and only match commas in all other contexts so that you could keep the captures and replace those commas:
const string = "aba,bjlj,alj,ljlj||name1,name2,name3||jflkj,glfgjlf,jflg,fjlfd||name1,name2||fd,sdfsfd,dfs||name1,name2,name3,name4,name5||";
console.log( string.replace(/(\|{2}[\w\W]*?\|{2})|,/g, (x,y) => y || ';') );
Big Thanks.
I also find
var newStr = str.replace(/\|{2}.*?\|{2}/g, function(match) {
return match.replace(/,/g,";");
});
Do you think is it possible to do the contrary and change all the comma outside the occurence ||...|| ?

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
You could first split the string using '|' to get the array of string and then loop through it to perform the logic you want and get the output as required.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|")
var output: String = ""
for(i<-0 until out.length -1 by 2){
val ref = out(i) + "|" + out(i+1) + "\n"
output = output + ref
}
val finalout = output.replaceAll("\"\"","") //just to remove the starting double quote
println(finalout)

DART Conditional find and replace using Regex

I have a string that sometimes contains a certain substring at the end and sometimes does not. When the string is present I want to update its value. When it is absent I want to add it at the end of the existing string.
For example:
int _newCount = 7;
_myString = 'The count is: COUNT=1;'
_myString2 = 'The count is: '
_rRuleString.replaceAllMapped(RegExp('COUNT=(.*?)\;'), (match) {
//if there is a match (like in _myString) update the count to value of _newCount
//if there is no match (like in _myString2) add COUNT=1; to the string
}
I have tried using a return of:
return "${match.group(1).isEmpty ? _myString + ;COUNT=1;' : 'COUNT=$_newCount;'}";
But it is not working.
Note that replaceAllMatched will only perform a replacement if there is a match, else, there will be no replacement (insertion is still a replacement of an empty string with some string).
Your expected matches are always at the end of the string, and you may leverage this in your current code. You need a regex that optionally matches COUNT= and then some text up to the first ; including the char and then checks if the current position is the end of string.
Then, just follow the logic: if Group 1 is matched, set the new count value, else, add the COUNT=1; string:
The regex is
(COUNT=[^;]*;)?$
See the regex demo.
Details
(COUNT=[^;]*;)? - an optional group 1: COUNT=, any 0 or more chars other than ; and then a ;
$ - end of string.
Dart code:
_myString.replaceFirstMapped(RegExp(r'(COUNT=[^;]*;)?$'), (match) {
return match.group(0).isEmpty ? "COUNT=1;" : "COUNT=$_newCount;" ; }
)
Note the use of replaceFirstMatched, you need to replace only the first match.

Regex Replace everything except between the first " and the last "

i need a regex that replaces everything except the content between the first " and the last ".
I need it like this:
Input String:["Key:"Value""]
And after the regex i only need this:
Output String:Key:"Value"
Thanks!
You can try something like this.
patern:
^.*?"(.*)".*$
Substion:
$1
On Regex101
Explination:
the first part ^.*?" matches as few characters as possible that are between the start of the string and a double quote
the second part(.*)" makes the largest match it can that ends in a double quote, and stuffs it all in a capture group
the last part .*$ grabs what ever is left and includes it in the match
Finally you replace the entire match with the contents of the first capture group
Can you say why you need a RegExp?
A function like:
String unquote(String input) {
int start = input.indexOf('"');
if (start < 0) return input; // or throw.
int end = input.lastIndexOf('"');
if (start == end) return input; // or throw
return input.substring(start + 1, end);
}
is going to be faster and easier to understand than a RegExp.
Anyway, for the challenge, let's say we do want a RegExp that replaces the part up to the first " and from the last " with nothing. That's two replaces, so you can do an
input.replaceAll(RegExp(r'^[^"]*"|"[^"]*$'), "")`
or you can use a capturing group and a computed replacement like:
input.replaceFirstMapped(RegExp(r'^[^"]*"([^]*)"[^"]*$'), (m) => m[1])
Alternatively, you can use the capturing group to select the text between the two and extract it in code, instead of doing string replacement:
String unquote(String input) {
var re = RegExp(r'^[^"]*"([^]*)"[^"]$');
var match = re.firstMatch(input);
if (match == null) return input; // or throw.
return match[1];
}