Perl Regex To Remove Commas Between Quotes? - regex
I am trying to remove the commas between double quotes in a string while leaving the other commas intact. (The quoted part belongs to an email address field, which sometimes contains spare commas.) The following "brute force" code works OK on my machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
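To see the pattern above at work, here is a minimal sketch that wraps it in a helper and applies it to the string from the question. The sub name remove_commas_in_quotes and the second sample string are made up for illustration; the sketch assumes balanced double quotes and no escaped quotes inside a field.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution itself is the pattern explained above.
sub remove_commas_in_quotes {
    my ($string) = @_;
    # Each match is a single comma that sits inside a quoted section.
    $string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
    return $string;
}

my $sample = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
print remove_commas_in_quotes($sample), "\n";

# A made-up line with two quoted fields, to show that commas outside quotes survive.
print remove_commas_in_quotes('a,"b,c",d,"e,,f",g'), "\n";
```

Because of \G, each new match continues exactly where the previous one ended, which is what keeps the engine inside the quoted section until (*SKIP)(*FAIL) pushes it past the closing quote.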
If you use an older Perl version (before 5.10), \K and the backtracking control verbs may not be supported. In that case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2';
my @row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", @row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93@gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.
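If the position of the quoted field is not known in advance, the same idea can be applied to every field that contains a quote. This is only a sketch under assumptions: the sub name strip_quoted_commas is invented here, and the data is assumed to contain no apostrophes or backslashes, both of which Text::ParseWords treats specially.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper: strip commas from every field that has a quoted part,
# instead of hardcoding index 2.
sub strip_quoted_commas {
    my ($line) = @_;
    # keep = 1 preserves the quotes; "// ''" guards against undefined entries.
    my @fields = map { $_ // '' } quotewords(',', 1, $line);
    for my $field (@fields) {
        # Any comma still inside a field must have been protected by quotes.
        $field =~ tr/,//d if $field =~ /"/;
    }
    return join ',', @fields;
}

print strip_quoted_commas('06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93@gmail.com>,1,2'), "\n";
```

The loop relies on for aliasing each element of @fields, so the tr operation edits the array in place.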
Related
Multi-order splitting inside Perl
I have a string which comes from a CSV file:
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
which should be translated (somehow) to
'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168;rs16997168;rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
so that Perl's split does not split the single field GSA-rs16997168,rs16997168 into two separate fields, i.e. a comma should be replaced by a semicolon if it is between the two " characters. I can't find how to do this on Google.
What I've tried so far:
$str =~ s/"([^"]+),([^"]+)"/"$1;$2"/g;
but this fails with more than two expressions.
It would be great if I could somehow tell Perl's split function to count everything within "" as one field even if that text contains the , delimiter, but I don't know how to do that :( I've heard of lookaheads, but I don't see how I can use them here :(
Why try to recreate a CSV parser when perfectly good ones exist?
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
# $fh is assumed to be a filehandle already opened on the CSV file
while ( my $row = $csv->getline($fh) ) {
    $row->[5] =~ s/,/;/g;
    $csv->say(\*STDOUT, $row);
}
My guess is that we wish to capture up to four commas after the last ", for which we would start with a simple expression such as:
(.*",.+?,.+?,.+?,.+?),
Test
use strict;
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
my $regex = qr/(.*",.+?,.+?,.+?,.+?),/mp;
if ( $str =~ /$regex/g ) {
    print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
    # print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
    # print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}
If this expression isn't what you wanted and you wish to modify it, you can explore and test it at regex101.com.
Why use a CSV module and a regex? Just use a regex and cut out the middleman.
$str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g;
https://regex101.com/r/tRDCen/1
Readable version:
(?m:
    (?: , | ^ )
    "
  |
    (?! ^ )
    \G
)
[^",]*
\K
,
(?= [^"]* " )
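The single-regex approach can be tried out with a small sketch like the one below. The sub name semicolons_in_quotes is made up for illustration, and the sketch assumes that a quoted field always starts right after a comma or at the start of a line, with no escaped quotes inside it.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one from the answer above.
sub semicolons_in_quotes {
    my ($str) = @_;
    # Each match is one comma inside a quoted field; \G chains the matches together.
    $str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g;
    return $str;
}

my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1';
my @fields = split /,/, semicolons_in_quotes($str);
print scalar(@fields), " fields; field 5 is $fields[5]\n";
```

After the substitution, a plain split on commas keeps the quoted field in one piece.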
Finding and Concatenating strings
I want to find some punctuation characters and surround them with spaces. For example, if any punctuation character is found, I want to add a space before and after it.
$line =~ s/[?%&!,.%*\[◦\]\\;#<>{}#^=\+()\$]/" $1 "/g ;
I tried using $1 as in PHP, but it didn't work. I searched the web and couldn't find the Perl syntax. Additionally, how can I preserve ... as a single token? What is the correct syntax for my problem?
Use this:
#!/usr/bin/perl -w
use strict;
my $string = "For example; If i found any puncs. above list, i want to add spaces to front and end of token.";
$string =~ s/([[:punct:]])/ $1 /g;
print "$string\n";
Outputs:
For example ; If i found any puncs . above list , i want to add spaces to front and end of token .
Obviously, if you want different output, you can change the replacement part between the last two slashes; I've just replaced every punctuation character with " punctuation ".
You need to surround the match pattern with () to capture it into $1:
$line =~ s/([?%&!,.%*\[◦\]\\;#<>{}#^=\+\(\)\$])/ $1 /g;
EDIT (as per OP's comment): how can I preserve '...' as a single token?
One way would be to revert the changes for that token. After the substitution above, each dot is padded on both sides, so two spaces separate neighbouring dots:
$line =~ s/ \. {2}\. {2}\. / ... /g;
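An alternative to reverting afterwards is to let the capture group match the three dots first. The following is a minimal sketch rather than the answerer's method: the sub name space_out_punctuation is invented, and it uses the [[:punct:]] class from the other answer instead of the hand-written character list.

```perl
use strict;
use warnings;

# Hypothetical helper: pad punctuation with spaces, treating "..." as one token.
sub space_out_punctuation {
    my ($line) = @_;
    # Alternation is tried left to right, so the three-dot case wins over a single dot.
    $line =~ s/(\.{3}|[[:punct:]])/ $1 /g;
    return $line;
}

print space_out_punctuation('Wait... what?'), "\n";
```

Any other multi-character token that should stay together can be added as a further alternative in front of the character class.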
Regex to match perl variable
I'm currently learning about regular expressions and I'm trying to create a regex to match any legal variable name in Perl. This is what I have written so far:
^\$[A-Za-z_][a-zA-Z0-9_]*
The only problem is that the regex also returns true for strings with special characters; for example, the string $a& matches. What did I do wrong?
Thanks! Rotem
Parsing Perl is difficult, and the rules for what is and is not a variable are complicated. If you're attempting to parse Perl, consider using PPI instead. It can parse a Perl program and do things like find all the variables. PPI is what perlcritic uses to do its job.
If you want to try and do it anyway, here are some edge cases to consider...
$^F
$/
${^ENCODING}
$1
$élite # with utf8 on
${foo}
*{foo} = \42;
*{$name} = \42; # with strict off
${$name} = 42; # with strict off
And of course the other sigils @%*. And detecting whether something is inside a single-quoted string. Which is my way of strongly encouraging you to use PPI rather than try to do it yourself.
If you want practice, realistic practice is to pull the variable out of a larger string, rather than do exact matches.
# Match the various sigils.
my $sigils = qr{ [\$\@\%*] }x;
# Match $1 and @1 and so on
my $digit_var = qr{ $sigils \d+ }x;
# Match normal variables (first character is a word character but not a digit)
my $named_var = qr{ $sigils [^\W\d] \w* }x;
# Combine all the various variable matches
my $match_variable = qr{ ( $named_var | $digit_var ) }x;
This uses the () capture operator to grab just the variable. It also uses the /x modifier to make the regex easier to read, and alternative delimiters to avoid leaning toothpick syndrome. Using \w instead of A-Z ensures that Unicode characters will be picked up when utf8 is on, and that they won't when it's off. Finally, qr is used to build up the regex in pieces. Filling in the gaps is left as an exercise.
You need a $ at the end; otherwise the regex just matches as far as it can and ignores the rest. So it should be:
^\$[A-Za-z_][A-Za-z0-9_]*$
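The effect of anchoring both ends can be checked with a short sketch. The sub name is_simple_scalar_name is made up, and the sketch uses \z instead of $ (a substitution on my part) because $ also matches just before a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the whole string is a plain ASCII scalar name.
sub is_simple_scalar_name {
    my ($candidate) = @_;
    # \A and \z pin the match to the very start and very end of the string.
    return $candidate =~ /\A\$[A-Za-z_][A-Za-z0-9_]*\z/ ? 1 : 0;
}

for my $candidate ('$a&', '$foo_1', '$1abc') {
    printf "%-8s %s\n", $candidate, is_simple_scalar_name($candidate) ? 'ok' : 'rejected';
}
```

This only covers the simple names from the question, not the special variables and other edge cases listed in the answer above.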
I needed to solve this problem to create a simple source code analyzer. This subroutine extracts Perl user variables from an input section of code:
sub extractVars {
    my $line = shift;
    chomp $line;
    $line =~ s/#.*//; # Remove comments
    $line =~ s/\s*;\s*$//; # Remove trailing ;
    my @vars = ();
    my $match = 'junk';
    while ($match ne '') {
        push @vars, $match if $match ne 'junk';
        $match = '';
        if ($line =~ s/(
            [\@\$\%] # $@%
            {? # optional brace
            \$? # optional $
            [\w^0-9] # begin var name
            [\w\-\>\${}\[\]'"]* # var name
            [\w}\]] # end var name
            |
            [\@\$\%] # $@%
            {? # optional brace
            \$? # optional $
            [\w^0-9] # one letter var name
            [}\]]? # optional brace or bracket
            )//x) {
            $match = $1;
            next;
        }
    }
    return @vars;
}
Test it with this code:
my @variables = extractVars('$a $a{b} $a[c] $scalar @list %hash $list[0][1] $list[-1] $hash{foo}{bar} $aref->{foo} $href->{foo}->{bar} @$aref %$hash_ref %{$aref->{foo}} $hash{\'foo\'} "$a" "$var{abc}"');
It does NOT work if the variable name contains spaces, for example:
$hash{"baz qux"}
${ $var->{foo} }[0]
replace newlines within quoted string with \n
I need to write a quick (by tomorrow) filter script to replace line breaks (LF or CRLF) found within double-quoted strings with the escaped newline \n. The content is a (broken) JavaScript program, so I need to allow for escape sequences like "ab\"cd" and "ab\\"cd"ef" within a string.
I understand that sed is not well suited for the job as it works per line, so I turn to Perl, of which I know nothing :)
I've written this regex: "(((\\.)|[^"\\\n])*\n?)*" and tested it with http://regex.powertoy.org. It does match quoted strings with line breaks; however, perl -p -e 's/"(((\\.)|[^"\\\n])*(\n)?)*"/TEST/g' does not.
So my questions are:
how do I make Perl match line breaks?
how do I write the "replace-by" part so that it keeps the original string and only replaces newlines?
There is a similar question with an awk solution, but it is not quite what I need.
NOTE: I usually don't ask "please do this for me" questions, but I really don't feel like learning perl/awk by tomorrow... :)
EDIT: sample data
"abc\"def" - matches as one string
"abc\\"def"xy" - matches "abc\\" and "xy"
"ab
cd
ef" - is replaced by "ab\ncd\nef"
Here is a simple Perl solution:
s§
    \G # match from the beginning of the string or the last match
    ([^"]*+) # till we get to a quote
    "((?:[^"\\]++|\\.)*+)" # match the whole quote
§
    $a = $1;
    $b = $2;
    $b =~ s/\r?\n/\\n/g; # replace what you want inside the quote
    "$a\"$b\"";
§gex;
Here is another solution in case you don't want to use /e and would rather do it with one regex:
use strict;
$_=<<'_quote_';
hai xtest "aa xx aax" baix "xx" x "axa\"x\\" xa "x\\\\\"x" ax xbai!x
_quote_
print "Original:\n", $_, "\n";
s/
    (
        (?:
            # at the beginning of the string match till inside the quotes
            ^(?&outside_quote) "
            # or continue from last match which always stops inside quotes
            | (?!^)\G
        )
        (?&inside_quote) # eat things up till we find what we want
    )
    x # the thing we want to replace
    (
        (?&inside_quote) # eat more possibly till end of quote
        # if going out of quote make sure the match stops inside them
        # or at the end of string
        (?: " (?&outside_quote) (?:"|\z) )?
    )
    (?(DEFINE)
        (?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
        (?<inside_quote> (?:[^"\\x]++|\\.)*+ ) # handle escapes
    )
/$1Y$2/xg;
print "Replaced:\n", $_, "\n";
Output:
Original:
hai xtest "aa xx aax" baix "xx" x "axa\"x\\" xa "x\\\\\"x" ax xbai!x
Replaced:
hai xtest "aa YY aaY" baix "YY" x "aYa\"Y\\" xa "Y\\\\\"Y" ax xbai!x
To work with line breaks instead of x, just replace it in the regex like so:
s/
    (
        (?:
            # at the beginning of the string match till inside the quotes
            ^(?&outside_quote) "
            # or continue from last match which always stops inside quotes
            | (?!^)\G
        )
        (?&inside_quote) # eat things up till we find what we want
    )
    \r?\n # the thing we want to replace
    (
        (?&inside_quote) # eat more possibly till end of quote
        # if going out of quote make sure the match stops inside them
        # or at the end of string
        (?: " (?&outside_quote) (?:"|\z) )?
    )
    (?(DEFINE)
        (?<outside_quote> [^"]*+ ) # just eat everything till quoting starts
        (?<inside_quote> (?:[^"\\\r\n]++|\\.)*+ ) # handle escapes
    )
/$1\\n$2/xg;
Until the OP posts some example content to test with, try adding the "m" (and possibly the "s") flag to the end of your regex; from perldoc perlreref (reference):
m Multiline mode - ^ and $ match internal lines
s match as a Single line - . matches \n
For testing you might also find it useful to add the command-line argument "-i.bak", so that you keep a backup of the original file (now with the extension ".bak").
Note also that if you want to group but not store something you can use (?:PATTERN) rather than (PATTERN). Once you have your captured content, use $1 through $9 to access the stored matches from the matching section.
For more info see the link above as well as perldoc perlretut (tutorial) and perldoc perlre (full-ish documentation).
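One point the flags alone do not cover: with -p, Perl hands the substitution one line at a time, so a quoted string that spans several lines is never seen as a whole. A minimal sketch, assuming the whole input has been read into a single string (for example with -0777 on the command line), could look like this. The sub name escape_newlines_in_quotes is made up, and the quoted-string pattern is borrowed from the first answer.

```perl
use strict;
use warnings;

# Hypothetical helper; expects the complete input in one string.
sub escape_newlines_in_quotes {
    my ($text) = @_;
    $text =~ s{
        "                            # opening quote
        ( (?: [^"\\]++ | \\. )*+ )   # body: ordinary characters or backslash escapes
        "                            # closing quote
    }{
        (my $body = $1) =~ s/\r?\n/\\n/g;   # LF or CRLF becomes a literal \n
        qq("$body");
    }gex;
    return $text;
}

print escape_newlines_in_quotes(qq(var s = "ab\ncd\r\nef";\nvar t = 1;\n));
```

Line breaks outside quoted strings are left alone, because the substitution only ever rewrites text between a matching pair of quotes.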
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common;
$_ = '"abc\"def"' . '"abc\\\\"def"xy"' . qq("ab\ncd\nef");
print "before: {{$_}}\n";
s{($RE{quoted})}
 { (my $x=$1) =~ s/\n/\\n/g;
   $x
 }ge;
print "after: {{$_}}\n";
Using Perl 5.14.0 (install with perlbrew) one can do this:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use Regexp::Common qw/delimited/;
my $data = <<'END';
"abc\"def"
"abc\\"def"xy"
"ab
cd
ef"
END
my $output = $data =~ s/$RE{delimited}{-delim=>'"'}{-keep}/$1=~s!\n!\\n!rg/egr;
print $output;
I need 5.14.0 for the /r flag of the inner substitution. If someone knows how to avoid this, please let me know.
How do I use Perl to intersperse characters between consecutive matches with a regex substitution?
The following lines of comma-separated values contain several consecutive empty fields:
$rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution. I tried this first of all:
$rawData =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could have something to do with lookahead vs. lookbehind assertions, so I tried the following regex:
$rawData =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
    chomp $row;
    print join(',',
        map { length $_ ? $_ : 'N/A'}
        split /,/, $row, -1
    ), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as @ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't bypass the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
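You could search for (?<=,)(?=,|$) and replace that with N/A. This regex matches the (empty) position between two commas or between a comma and the end of a line. For data that spans several lines, use the /m modifier so that $ matches at the end of every line.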
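As a sketch of the zero-width approach, the substitution below inserts N/A at every empty position without consuming any commas, so consecutive empty fields are all found in a single pass. The sub name fill_empty_fields is made up for illustration; an empty first field on a line is not covered by this pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: put N/A into every empty field that follows a comma.
sub fill_empty_fields {
    my ($text) = @_;
    # Nothing is consumed: the match is the empty position after a comma that is
    # followed by another comma or by the end of a line (/m).
    $text =~ s{(?<=,)(?=,|$)}{N/A}mg;
    return $text;
}

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
            . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
print fill_empty_fields($rawData);
```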
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop at most twice. Note that an empty field at the very end of a line (a comma directly before the newline) is left as it is.
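Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string. Apply it to one chomped line at a time, since a trailing newline would otherwise count as part of the last field.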