I have a CSV file in the format shown below, and I'm using Perl's split with a comma as the delimiter. The problem is that one line contains the quoted string "HTTP Large, GMS, ZMS: Large Files" with embedded commas, and the split fails: the resulting array has too few elements. How can I modify the split command?
my @values = split('\,', $line);
CSV File
10852,800 Mob to Int'l,235341739,573047,84475.40,0.0003,Inbound,Ber unit
10880,"HTTP Large, GMS, ZMS: Large Files",52852810,128,13712.68,0.0002,,Rer unit
13506,Presence National,2716766818,2447643,309116.40,0.0001,Presence,per Cnit
Issues like embedded commas are precisely why modules such as Text::CSV were created. If, but only if, the data does not have embedded commas, then you can make regular expressions work. When the data has embedded commas, it is time to move to a tool designed to handle CSV with embedded commas, and that would be Text::CSV in Perl (and its relatives Text::CSV_PP and Text::CSV_XS).
I have also used the same approach as yours and it works fine for me. Try this code:
my @values = split(/(?<="),(?=")/, $line);
Hope it helps.
Related
We are using a simple curl to get metrics via an API. The problem is that the output always contains the same set of fields, but their position within the output is not fixed.
We need to do this with a "simple" regex since the tool only accepts this.
/"name":"(.*)".*?"memory":(\d+).*?"consumer_utilisation":(\w+|\d+).*?"messages_unacknowledged":(\d+).*?"messages_ready":(\d+).*?"messages":(\d+)/s
It works fine for:
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
However if the output order is changed, then it doesn't match any more:
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
I need a relative definition of the strings to match, since I never know at which position they will appear. It's 9 different queue-metric groups in total.
The simple option is to use a regex for each key-value pair instead of one large regex.
/"name":"((?:[^\\"]|\\.)*)"/
/"memory":(\d+)/
This other option is not a regex, but it might be sufficient. Instead of using a regex, you could simply transform the response before reading it. Since you say "We are using a simple curl", I'm guessing you're talking about the curl command-line tool. You could pipe the result into a short Perl command.
perl -ne 'use JSON; use Text::CSV qw(csv); $hash = decode_json $_; csv (sep_char=> ";", out => *STDOUT, in => [[$hash->{name}, $hash->{memory}, $hash->{consumer_utilisation}, $hash->{messages_unacknowledged}, $hash->{messages_ready}, $hash->{messages}]]);'
This will keep the order the same, making it easier to use a regex to read out the data.
input
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
output
queue1;89048;;0;0;0
queue2;21944;;0;0;0
queue3;21944;;0;0;0
For this to work you need Perl and the modules JSON and Text::CSV installed. On my system they are provided by the packages perl, libjson-perl and libtext-csv-perl.
Note: I'm currently using ; as the separator. If it occurs inside one of the values, that value will be surrounded by double quotes: "name":"que;ue1" => "que;ue1";89048;;0;0;0. If a value includes both a ; and a ", the " will be escaped by placing another one before it: "name":"q\"ue;ue1" => "q""ue;ue1";89048;;0;0;0.
I have to handle a weird CSV format, and I have been running into problems. The regex I have been able to work out thus far is
(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*?)+?
My files are often broken and irregular, since we have to deal with OCR'd text which is usually not checked by our users. Therefore, we tend to end up with lots of weird things, like a single " within a field, or even a newline character (which is why I am using a regex instead of my previous readLine()-based solution). I've gotten it to parse almost everything correctly, except that it captures [,] [,]. How can I get it to NOT select fields containing only a single comma? When I try to have it not select commas, it turns "156,000" into [156] and [000].
The test string I've been using is
"156,000","",""i","parts","dog"","","Monthly "running" totals"
The ideal capture output is
[156,000],[],[i],[parts],[dog],[],[Monthly "running" totals]
I can do with or without the internal quotes, since I can always just strip them during processing.
Thank you all very much for your time.
Your CSV is indeed irregular and difficult to parse. I suggest you first do two replacements on your data.
// remove all invalid doubled "" (those that are not an empty field)
input = Regex.Replace(input, @"(?<!,|^)""""(?=,|$)|(?<=,)""""(?!,|$)", "\"");
// now escape all remaining inner " as \"
input = Regex.Replace(input, @"(?<!,|^)""(?!,|$)", "\\\"");
// at this stage you have proper CSV data, and I suggest using a good .NET CSV parser
// to parse it and get the individual values
Friends,
I need some help with a regex pattern match and replace.
I usually use %s/findstring/replacestring/g for a match and replace within the same line.
But my file is something like this:
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
I need to match the <tracker xid.*> pattern, skip all the lines until it matches <displayLine.*>, and, if both patterns match, remove the
<isRequired>.*
line. Something like: if the pattern matched on both the 4th and 6th lines, remove the 7th line.
Kindly throw some light on how to achieve this.
You have to match the entire set of lines. For that, note that . does not match a newline character; this must be explicitly specified via \n. With that, you have multiple options:
Match the entire block, use capture groups to excise the line
The pattern is more complex, but this is the general approach:
:%s/\(<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*\n\)<isRequired.*\n/\1/g
Match the minimal block, delete separately
This just establishes a match via :global, then uses relative addressing to remove the line.
:g/<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*/+5delete
Caveats
Only do this if you are absolutely sure that the XML source is in a consistent, well-known format. Text editors and regular expressions are a quick and ready tool for this, but fundamentally the wrong one; be aware of this, and don't blame the tool when something goes wrong. For production-grade reliability and automation, please use an XML tool (such as XSL transformations).
When you say 'something like this' it looks like what you've got there is XML. I can't say for sure, because 'something like this' covers a lot of defects.
However if it is XML, it's a really bad idea to try and parse it with a regular expression. The reason being that XML is a defined data format with a quite strict specification. If everyone sticks to that spec, then all is fine and dandy.
However, if someone is assuming you will handle their XML as XML, and you're not (because you're using a regular expression), what you will be creating is a brittle piece of code that at some point in the future will just randomly break for no apparent reason - because they stuck to the XML spec, but changed something in an entirely valid way.
So assuming that it is XML, and looks 'something like' the example below - I would suggest using Perl and XML::Twig to parse your data.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $xml;
{ local $/; $xml = <DATA> };
my $data = XML::Twig->new( pretty_print => 'indented' )->parse($xml);
foreach my $element ( $data->root->children('tracker') ) {
    my $xid = $element->att('xid');
    print $xid, "\n";
    foreach my $subelement ( $element->children ) {
        if ( $subelement->name eq 'isRequired' ) {
            # delete the 'isRequired' line
            $subelement->delete;
        }
    }
}
$data->print;
__DATA__
<xml>
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
</tracker>
</xml>
If you know the input is in the example format (with only one open-tag per line, and every tracker tag containing a displayLines and an isRequired tag), or you can force it into that format, then I think a search-and-replace is too unwieldy, and full XML parsing is "correct" but way more complicated than you need; try a simpler method with the :g command:
:g#<tracker xid#/<displayLine/+1d
This searches for every line matching "<tracker xid", finds the next line matching "<displayLine", and deletes the line after it (the <isRequired> line).
Thus you don't need a specific number of lines between "<tracker" and "<displayLine", so it is more robust to variances in line offsets; but it is still quite fragile to format changes.
However, I repeat the warnings from others: if the format is not easily and consistently predictable then I'd suggest parsing the file line by line in a loop, or using a real XML parser (possibly using Vim's Perl or Python integration), rather than using an :s or :g command.
I have a huge task to do: separating voltage data from recorded .csv files of the format
13/03/2014 18:48,71.556671,71.651062,71.639755,72.130692,71.961441,72.646423,72.262756,72.334511,7.812012
I am new to regular expressions; how do I get the data from column 10, repeatedly?
I have over 10,000,000 files to reduce and average down to 32,000 rows for Excel to graph. Any advice is greatly welcome; I'm trying to use PowerGREP to get up to speed.
Not that I would say that regex is the tool for it, but here goes:
(?:[^,]*,){9}([^,]*)
I.e. nine "columns" of non-commas, separated by commas, then capture the tenth in group 1.
E.g. use it with a Perl one-liner:
perl -ne 'chomp; /(?:[^,]*,){9}([^,]*)/ and print "$1\n"'
This is kind of a weird question, at least for me, as I don't exactly understand what is fully involved in this. Basically, I have been doing this process where I save a scraped document (such as a web page) to a .txt file. Then I can easily use Perl to read this file and put each line into an array. However, it is not doing this based on any visible thing in the document (i.e., it is not going by HTML linebreaks); it just knows where a new line is, based on the .txt format.
However, I would like to cut this process out and do the same thing from within a variable: instead, I would have what would have been the contents of the .txt file in a string, and I want to parse it the same way, line by line. The problem is that I don't know much about how this would work, as I don't really understand how Perl can tell where a new line is. (I'm not going by HTML line breaks; often it is just a web-based .txt file, which presents to my scraper, WWW::Mechanize, as a web page, so there is no HTML to go by.) I figure I can do this using other delimiters, such as blank spaces, but I'm interested to know if there is a way to do this by line. Any info is appreciated.
I'd like to cut the actual saving of a file to reduce issues related to permissions on servers I use and also am just curious if I can make the process more efficient.
Here's an idea that might help you: you can open from strings as well as files.
So if you used to do this:
open( my $io, '<', 'blah.txt' ) or die "Could not open blah.txt! - $!";
my @list = <$io>;
You can just do this:
open( my $io, '<', \$text_I_captured );
my @list = <$io>;
It's hard to tell what your code's doing since we don't have it in front of us; it would be easier to help if you posted what you had. However, I'll give it a shot. If you scrape the text into a variable, you will have a string which may have embedded line breaks. These will either be \n (the traditional Unix newline) or \r\n (the traditional Windows newline sequence). Just as you can split on a space to get (a first approximation of) the words in a sentence, you can instead split on the newline sequence to get the lines in. Thus, the single line you'll need should be
my @lines = split(/\r?\n/, $scraped_text);
Use the $/ variable; it determines what to break lines on. So:
local $/ = " ";
while(<FILE>)...
would give you chunks separated by spaces. Just set it back to "\n" to get back to the way it was; or better yet, leave the local $/ scope and let the global value come back, in case it was something other than "\n" to begin with.
You can eliminate it altogether:
local $/ = undef;
This reads whole files in one slurp, and then you can iterate through them however you like. Just be aware that if you do a split or a splice, you may end up copying the string over and over, using lots of CPU and lots of memory. One way to do it with less copying is:
# perl -de 0
> $_="foo\nbar\nbaz\n";
> while( /\G([^\n]*)\n/go ) { print "line='$1'\n"; }
line='foo'
line='bar'
line='baz'
If you're breaking apart things by newlines, for example. \G matches either the beginning of the string or the end of the last match, within a /g-tagged regex.
Another weird tidbit is $/=\10... if you give it a scalar reference to an integer (here 10), you can get record-length chunks:
# cat fff
eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun
# perl -de 0
$/ = \10;
open FILE, "<fff";
while(<FILE>){ print "chunk='$_'\n"; }
chunk='eurgpuwerg'
chunk='piuewrngpi'
chunk='euwngipuen'
chunk='rgpiunergp'
chunk='iunerpigun'
chunk='
'
More info: http://www.perl.com/pub/a/2004/06/18/variables.html
If you combine this with FM's answer of using:
$data = "eurgpuwergpiuewrngpieuwngipuenrgpiunergpiunerpigun";
open STRING, "<", \$data;
while(<STRING>){ print "chunk='$_'\n"; }
I think you can get every combination of what you need...