Perl Regex: How to remove quotes inside quotes from CSV line

I've got a line from a CSV file, held as a string, with " as the field encloser and , as the field separator. Sometimes there are " characters in the data that break the field enclosers. I'm looking for a regex to remove these ".
My string looks like this:
my $csv = qq~"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""~;
I've looked at this but I don't understand how to tell it to only remove quotes that are:
1. not at the beginning of the string
2. not at the end of the string
3. not preceded by a ,
4. not followed by a ,
I managed to tell it to remove 3 and 4 at the same time with this line of code:
$csv =~ s/(?<!,)"(?!,)//g;
However, I cannot fit the ^ and $ in there since the lookahead and lookbehind both do not like being written as (?<!(^|,)).
Is there a way to achieve this only with a regex besides splitting the string up and removing the quote from each element?

For manipulating CSV data I'd recommend using Text::CSV. There's a lot of potential complexity within CSV data, and while it is possible to construct code to handle it yourself, it isn't worth the effort when there's a tried and tested CPAN module to do it for you.

Don't use a regex to parse a CSV file. CPAN provides plenty of good modules: as nickifat suggests, use Text::CSV, or you can use Text::ParseWords like this:
use Text::ParseWords;
while (<DATA>) {
    chomp;
    my @f = quotewords ',', 0, $_;
    print join "|" => @f;
}
__DATA__
"123456","024003","Stuff","",""28" stuff with more stuff","2"," 1.99 ","",""
Output:
123456|024003|Stuff||28 stuff with more stuff|2| 1.99 ||

This should work:
$csv =~ s/(?<=[^,])"(?=[^,])//g
1 and 2 imply that there must be at least one character before and after the quote, hence the positive lookarounds. 3 and 4 imply that these characters can be anything but a comma.

Thanks for the help here. I was having issues with badly formatted CSV with embedded double-quotes. I would make one slight addition to the lookahead portion of the regex otherwise null values at the end of the line will be corrupted:
(?<=[^,])\"(?=[^,\n])
Adding the \n will eliminate a match against the last double-quote at end-of-line.
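As a rough illustration of the lookaround approach, here is the same pattern in Python rather than Perl (a sketch only; the names `inner_quote`, `cleaned` and `cleaned_with_newline` are made up for this example). It applies the pattern to the line from the question, once without and once with a trailing newline:

```python
import re

line = '"123456","024003","Stuff","","28" stuff with more stuff","2"," 1.99 ","",""'

# A quote is removed only when a non-comma character sits on both sides of it.
# Excluding "\n" in the lookahead protects the closing quote of the last field
# when the line still carries its newline.
inner_quote = re.compile(r'(?<=[^,])"(?=[^,\n])')

cleaned = inner_quote.sub('', line)
cleaned_with_newline = inner_quote.sub('', line + '\n')
print(cleaned)
```

Only the stray quote after 28 is removed; the enclosing quotes and the empty "" fields are left alone in both cases.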

The suggested
$csv =~ s/(?<=[^,])"(?=[^,])//g;
is probably the best answer. Without these advanced regex features, you could also do the same with
$csv =~ s/([^,])"([^,])/$1$2/g;
or
$csv = join (',', map {s/"//g;"\"$_\""} split (',', $csv));
I think you should be aware that your string is not well-formed CSV. In a CSV file, double quotes inside values must be doubled (http://en.wikipedia.org/wiki/Comma-separated_values). With your format, values cannot contain quotes next to commas.
CSV is not as simple a format as it looks. If you decide to use "real" CSV, you should use a module.
Otherwise, you should probably remove all the double quotes in order to simplify your code and make it clear that you are not doing CSV.
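To show what "doubled" quotes look like in practice, here is a small sketch using Python's standard csv module (not the Perl modules discussed above; the sample row is invented for the example). The writer doubles the embedded quote, and the reader turns it back into a single one:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # enclose every field in quotes
writer.writerow(['123456', '28" stuff with more stuff', ''])
encoded = buf.getvalue()
print(encoded, end='')

# Reading the line back recovers the value with a single quote character.
decoded = next(csv.reader(io.StringIO(encoded)))
```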

Related

TCL multi capture group for simplified csv string parsing with regexp

I'm trying to parse a simplified CSV format with TCL regexp. I chose regexp over split so that I can perform a rudimentary format compliance test.
My problem is that I want to use a count quantifier but want to exclude the ',' from the match.
My test line:
set line "2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, 1/640, F/8.0, ISO 100, Partial 450D 0.0%"
So far I have:
regexp -all {(?:([^\,]*)\,){8}} $line dummy date tm off cam exp fnum iso com
My thought process is:
Get a match group for all characters that are not comma up to the next comma.
Now I want to match this 8 times, so I put it into a non-capturing group followed by a counting quantifier. But that defeats the purpose as now nothing is matched. What I need is a way to make the match go through the CSV 8 times and capture the text but not the comma.
My CSV is simplified in the following ways:
No quoted strings in the CSV
No empty entries in CSV
I've checked google for csv matching but most hits were too blown up due to allowing special cases in the CSV content.
Thanks,
Gert
In the regexp command, the interaction between the -all switch and the match variables is that the values captured in the last iteration of matching are used to fill the variables. This means that you can't fill eight variables by having one capture group and iteratively matching it eight times.
Your regular expression doesn't match anyway, since it requires a comma after the last field.
For this particular example, you could use the invocation
% regexp -all -inline {[^,]+} $line
{2017/08/21 16:06:20.0} { REALTIME} { late by 0.3} { EOS450D} { 1/640} { F/8.0} { ISO 100} { Partial 450D 0.0%}
This means to match all groups of characters that aren't commas (note that the comma isn't special: you don't need to escape it) and return them as a list.
As you noted, this is the same as using
% split $line ,
(which is also about five times faster).
You didn't want to use split because you wanted to do some validation: it is unclear what forms of validation you wanted to do, but you can easily validate the number of fields found:
% set fields [split $line ,]
% if {[llength $fields] != 8} {puts stderr "wrong number of fields"}
You can store the fields in variables and validate them separately, which is a lot easier to get right than trying to validate them all at the same time while extracting them:
lassign $fields date tm off cam exp fnum iso com
if {![regexp {ISO\s+\d+} $iso]} {puts stderr "in search of valid ISO"}
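The same split-first, validate-afterwards idea can be sketched in Python (not Tcl; `parse_record` and `EXPECTED_FIELDS` are hypothetical names chosen for this example). It checks the field count and then validates one field on its own:

```python
import re

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

EXPECTED_FIELDS = 8  # assumption: matches the eight variables in the question


def parse_record(text):
    """Split on commas, then validate the field count and the ISO field."""
    fields = text.split(',')
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"wrong number of fields: {len(fields)}")
    date, tm, off, cam, exp, fnum, iso, com = fields
    if not re.search(r'ISO\s+\d+', iso):
        raise ValueError(f"invalid ISO field: {iso!r}")
    return fields


fields = parse_record(line)
```

Each further check can be added as its own small test on a single field, which tends to be easier to get right than one large pattern.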
The best method is still to split the data string using the csv package. Even if you just want to use this simplified CSV now, sooner than you think you might want to, say, allow fields with commas in them.
package require csv
set fields [::csv::split $line]
Documentation: csv (package), if, lassign, llength, package, puts, regexp, set, split, Syntax of Tcl regular expressions
ETA: Getting rid of leading/trailing whitespace. This is a bit unusual, since CSV data is usually arranged to be fields of strictly significant text separated by a separator character. If there is anything to be trimmed, it is usually done when saving the data.
A good way is to put the matched groups through an lmap/string trim filter:
lmap field [regexp -all -inline {[^,]+} $line] {string trim $field}
Another way is to get rid of whitespace around commas first, and then split:
split [regsub -all {\s*,\s*} $line ,] ,
You can use the Tcllib variant of split that splits by regular expression:
package require textutil
::textutil::splitx $line {\s*,\s*}
You can also swap out the earlier regular expression for [^\s,][^,]*[^\s,] (will not match fields of less than two characters). This is a regular expression that is on the verge of becoming too complex to be useful.
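For comparison, the three trimming strategies can be sketched in Python (an illustration only, not Tcl; the variable names are invented). On the test line from the question they all give the same list:

```python
import re

line = ("2017/08/21 16:06:20.0, REALTIME, late by 0.3, EOS450D, "
        "1/640, F/8.0, ISO 100, Partial 450D 0.0%")

# 1. Split on commas, then trim each field.
trimmed = [field.strip() for field in line.split(',')]

# 2. Split on a pattern that swallows whitespace around each comma.
split_on_pattern = re.split(r'\s*,\s*', line)

# 3. Match the fields directly; this misses fields shorter than two characters.
matched = re.findall(r'[^\s,][^,]*[^\s,]', line)

print(trimmed)
```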

Stop regex selecting first character after match

I have a csv in the following format:
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
"12345"|"ABC"|"ABC"[tab delimiter]
However, tabs also appear in the text. I need to remove the tabs which are not preceded by a " .
I have the following regex which highlights the tabs which are not followed by a "
\t[^\"]
but this highlights the character after the tab as well, I would like to only select and remove the tab.
Note: Not sure if this matters but I am running the command in TextPad before I run it in Perl.
EDIT test data http://pastebin.com/dYfrcSPc
Use this one:
\t(?!")
It means a tab character that is not followed by a " character.
If you cannot download a proper CSV module such as Text::CSV, you can use a lightweight alternative that is part of the core: Text::ParseWords:
use strict;
use warnings;
use Text::ParseWords;
while (<DATA>) {
    my @list = quotewords('\t', 1, $_); # split on tabs, keeping the quotes
    tr/\t//d for @list;                 # delete the tabs left inside fields
    print join "\t", @list;
}
__DATA__
"12345"|"ABC "|"ABC" next field
"12345"|"ABC"|" ABC" next field
"123 45"|"ABC"|"ABC" next field
(Note: Tab characters might have been destroyed by stackoverflow formatting)
This will parse the lines and ignore quoted tabs. We can then simply remove them and put the line back together.
Well, the easiest way would be using negative lookbehind...
s/(?<!")\t//g;
... as it'll match only those tab characters not preceded by " character. But if your perl doesn't support it, don't worry - there's another way:
s/([^"])\t/$1/g;
... that is, replacing any non-" symbol followed by \t with that symbol alone.
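The difference between consuming the neighbouring character and merely looking at it can be seen in this Python sketch (the sample `record` is invented; Python's `re` is used here instead of Perl):

```python
import re

# One tab inside the second field, one delimiter tab after the closing quote.
record = '"12345"|"AB\tC"|"ABC"\t'

# The lookbehind inspects the previous character without making it part of the match.
with_lookbehind = re.sub(r'(?<!")\t', '', record)

# Capturing the previous character and putting it back also works here.
with_group = re.sub(r'([^"])\t', r'\1', record)

# The pattern from the question also consumes the character after the tab.
consuming = re.sub(r'\t[^"]', '', record)

print(repr(with_lookbehind), repr(consuming))
```

One caveat: the capture-group form consumes the preceding character, so with two tabs in a row it removes only the first, whereas the lookbehind form removes both.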

PHP preg_replace is not matching entire pattern

I'm stuck trying to get the PHP preg_replace to work properly. I want to find all matches of a pattern and replace them with a string. But, for some reason, it's finding only partial matches and replacing all of them. I'm trying to remove the "password" from every line of a text file. The password is always at the end of each line, contains 4 to 8 alpha-numeric characters, and always follows two pipe characters.
Example:
$data = 'A00000001|A00000001|FirstName|LastName|email@address|Role||password'.PHP_EOL;
$data .= 'B00000002|B00000002|FirstName|LastName|email@address|Role||password'.PHP_EOL;
$delim = '|';
$newData = preg_replace("/".$delim.$delim."[a-zA-Z0-9]{4,8}/", $delim.$delim, $data);
echo $newData;
Output:
||||||1|||||||||1|||||||||e||||||||||||e||m||a||i||l||@||a||d||d||r||e||s||s|||||R||o||l||e||||||||||||
||||||2|||||||||2|||||||||e||||||||||||e||m||a||i||l||@||a||d||d||r||e||s||s|||||R||o||l||e||||||||||||
||
I've tried many variations with different groupings using parentheses, and putting back-to-back [a-zA-Z0-9] patterns instead of {#}. I've tried adding line start ^ and end $ to my pattern. I'm stuck. I know this will end up being something simple that I'm just overlooking. That's why I need some fresh eyes on this.
You should use this regex
/(?<=\|\|)[a-zA-Z0-9]{4,8}$/m
You need to escape | since it represents OR in regex
$ marks the end of the string, or the end of each line when the m modifier is set
(?<=\|\|) is a zero-width lookbehind
Looks like you can just escape your delimiters.
$newData = preg_replace('/\\'.$delim.'\\'.$delim.'[a-zA-Z0-9]{4,8}/', $delim.$delim, $data);
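The effect of the unescaped pipe is easy to reproduce. The following Python sketch (Python's `re` rather than PHP's preg functions; the sample data is adapted from the question) first shows an empty alternative matching at every position, then builds the pattern with the delimiter escaped:

```python
import re

delim = '|'
data = ('A00000001|A00000001|FirstName|LastName|email@address|Role||password\n'
        'B00000002|B00000002|FirstName|LastName|email@address|Role||password\n')

# Unescaped, "||x" means "empty OR empty OR x", so it matches before every character.
broken = re.sub(r'||[a-z]{4}', '-', 'ab')

# Escaping the delimiter makes the two pipes literal; MULTILINE anchors $ per line.
pattern = re.compile(re.escape(delim * 2) + r'[a-zA-Z0-9]{4,8}$', re.MULTILINE)
new_data = pattern.sub(delim * 2, data)
print(new_data, end='')
```

In PHP the counterpart of `re.escape` is `preg_quote`, which accepts the regex delimiter as its second argument.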
I'm trying to remove the "password" from every line of a text file.
In this case, anchor the regex after properly escaping your delimiter, and use the m modifier so that $ matches at the end of every line. Assuming the delimiter shouldn't be kept either, you could use:
preg_replace('/\|\|.*?$/m', '', $data);
If it should, use a look-behind or:
preg_replace('/\|\|.*?$/m', '||', $data);
On a separate note: this looks like an SQL dump or a CSV file. If so, consider using whichever variation of COPY ... DELIMITER ... your RDBMS offers instead:
http://www.postgresql.org/docs/current/static/sql-copy.html
You could then create a temporary table, import, drop the column, do whatever else you need to do, and populate the final tables as needed once you're done.

Regex Replace Cleaning a string from unwanted characters

I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
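A minimal sketch of that second approach in Python (`slugify` is a name invented for this example): lowercase the title, replace every run of non-alphanumeric characters with one dash, and trim any dash left at either end.

```python
import re


def slugify(title):
    """Lowercase, collapse each run of non-alphanumerics into one dash, trim dashes."""
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower())
    return slug.strip('-')


print(slugify('Football & Rugby News!'))
```

The final `strip('-')` is there because a title ending in punctuation would otherwise leave a trailing dash.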
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # all permitted characters, as the body of a character class
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news

How do I extract words from a comma-delimited string in Perl?

I have a line:
$myline = 'ca,cb,cc,cd,ce';
I need to match ca into $1, cb into $2, etc..
Unfortunately
$myline =~ /(?:(\w+),?)+/;
doesn't work. With pcretest it only matches 'ce' into $1.
How to do it right?
Do I need to put it into the while loop?
Why not use the split function:
@parts = split(/,/,$myline);
split splits a string into a list of strings using the regular expression you supply as a separator.
Isn't it easier to use my @parts = split(/,/, $myline) ?
Although split is a good way to solve your problem, a capturing regex in list context also works well. It's useful to know about both approaches.
my $line = 'ca,cb,cc,cd,ce';
my @words = $line =~ /(\w+)/g;
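The same behaviour shows up in Python's `re` module, which makes for a quick side-by-side sketch (variable names invented for the example): a capture group inside a repeated group keeps only its last match, while a global search returns every match.

```python
import re

line = 'ca,cb,cc,cd,ce'

# A capture group inside a repeated group keeps only its final iteration.
repeated = re.match(r'(?:(\w+),?)+', line)
last_only = repeated.group(1)

# Collecting all matches is the counterpart of /(\w+)/g in list context.
words = re.findall(r'\w+', line)

print(last_only, words)
```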
Look into the CSV modules you can download from CPAN, e.g. Text::CSV or Text::CSV_XS.
This will get you what you need and also account for any comma-separated values that happen to be quoted.
Using these modules makes it easy to split the data out and parse through it...
For example:
my @field = $csv->fields;
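To illustrate why a real CSV parser matters once fields are quoted, here is a sketch with Python's standard csv module (not Text::CSV; the sample line is invented). A plain split cuts the quoted field in two, while the parser keeps it whole:

```python
import csv

line = 'ca,"cb, with comma",cc'

# A naive split treats the comma inside the quotes as a separator.
naive = line.split(',')

# The csv reader honours the quotes and returns three fields.
fields = next(csv.reader([line]))

print(naive)
print(fields)
```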
If the number of elements is variable, then you're not going to do it in the way you're aiming for. Loop through the string using the global flag:
while($myline =~ /(\w+)\b/g) {
# do something with $1
}
I am going to guess that your real data is more complex than 'ca,cb,cc,cd,ce', however if it isn't then the use of regular expressions probably isn't warranted. You'd be better off splitting the string on the delimiting character:
my @things = split ',', $myline;