Regular expression to match CSV delimiters - regex

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:
1,"abcd",2,"de,fg",3,"hijk"
I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
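For illustration, here is the sort of lookahead-based approach being asked about, as a minimal Perl sketch (assuming fields never contain escaped quotes or embedded newlines): a comma is a delimiter only when an even number of double quotes follows it on the rest of the line.
# Hedged sketch: split on commas that are followed by an even number of quotes,
# i.e. commas that sit outside any quoted field.
my $line = '1,"abcd",2,"de,fg",3,"hijk"';
my @fields = split /,(?=(?:[^"]*"[^"]*")*[^"]*$)/, $line;
print "$_\n" for @fields;   # 1 / "abcd" / 2 / "de,fg" / 3 / "hijk"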

See my post that solves this problem for more detail.
^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ Will match the whole line, then you can use match.Groups[1 ].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.

CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.
What language are you using?

As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).
I suggest the tool CSVFIX as likely to do what you need.
To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):
"""",,"",a,"a,b"
Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:
"",,"",a",b c",
The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!
This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:
use strict;
use warnings;

my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );

foreach my $string (@list)
{
    print "Pattern: <<$string>>\n";
    while ($string =~ m/ (?: " ( (?:""|[^"])* ) " | ( [^,"] [^,]* ) | ( .? ) )
                         (?: $ | , ) /gx)
    {
        print "Found QF: <<$1>>\n" if defined $1;
        print "Found PF: <<$2>>\n" if defined $2;
        print "Found EF: <<$3>>\n" if defined $3;
    }
}
Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!
Output:
Pattern: <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern: <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>
We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct.
Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
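For instance, a minimal post-processing sketch (assuming the quoted-field capture from the regex above is already in hand):
my $field = '""';                 # what the QF capture holds for the input """"
(my $clean = $field) =~ s/""/"/g; # collapse each doubled quote to a single one
print "<<$clean>>\n";             # <<">> -- a single double quote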

Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?
Without context it's impossible to give a more specific solution.

Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.
Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:
use Text::CSV;
use feature 'say';

my $csv = Text::CSV->new( { binary => 1, eol => $/ } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

my $rows = $csv->getline_all(\*STDIN);
for my $row (@$rows) {
    say join("\t", @$row);
}
That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?

I know this is old, but this RegEx works for me:
/(\"[^\"]+\")|[^,]+/g
It could potentially be used with any language. I tested it in JavaScript, so the g is just a global modifier. It works even with messed-up lines (extra quotes), but empty fields are not handled.
Just sharing, maybe this will help someone.
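For what it's worth, a rough Perl equivalent of that pattern (a sketch only; the same caveat about empty fields applies):
my $line = '1,"abcd",2,"de,fg",3,"hijk"';
my @fields = $line =~ /("[^"]+"|[^,]+)/g;   # a quoted chunk, or a run of non-commas
print join("\n", @fields), "\n";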

Related

Is there a Perl regex metacharacter, or a way to specify a default value, if a subpattern capture does not match?

Here's the idea. I am parsing command line options but doing it across the entire command line, not by each @ARGV element separately.
program --format="%H:%M:%S" --timeout 12 --nofail
I want the parsing to work with these cases.
--name=value, easy to parse
--name value, pretty easy
--name no value, default the value to 1
Here is the regex which works, except it cannot do the missing value case
%options = "#ARGV" ~= /--([A-Za-z]+)[= ]([^-]\S*)/g;
i.e. match --name=value or --name value, but not --name --name; that is two names, not a --name=value pair.
If a --name has no value following it that matches the second capture in the regex, is there a way, within the regex, to specify a default, in my case a 1, to indicate "true"? I.e. if a --name has no argument, like --nofail, then set that argument to 1, indicating true.
Actually, in asking this I figured out a workaround using separate match statements which is fine. However, just out of curiosity, the question still stands, is there a Perl regex way to have a default if a submatch fails?
I don't see how to return a list reflecting a changed input from a regex alone. To change the input we need the s{}{} (substitution) operator, as we need code in its replacement part to analyze captures and decide what to change; and we get a string, not a list, which needs to be processed further (split).
Here is then one such take, with a minimal intrusion of code.
Match name and value, with = or space between them, and if the value ($2) is undefined give it a value; we need /e to implement that.† While we are at it, put a space between all name-value pairs. This goes under /r so that the changed string is returned, and passed through split.
my %arg = split ' ',
    $args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1 . ' ' . ($2 // '1 ') }ergx;
The split can be done by another regex instead but that's still extra processing.
A complete program (with more flags added to the input)
use warnings;
use strict;
use feature 'say';
my $args = shift // q(--fmt="%H:%M" --f1 --time 12 --f2 --f3);
say $args;
my %arg = split ' ',
$args =~ s{ --(\w+) (?: =|\s+|\z) ([^-]\S*)? }{ $1 . ' ' . ($2//'1 ') }ergx;
say "$_ => $arg{$_}" for keys %arg;
This prints as expected. But note that there may be edge cases, and in particular having a space inside (a quoted) argument value, like "%H %M", would require a far more complex pattern.
I presume that the regex ask is for play/study. Normally this is handled by libraries like Getopt::Long. If that is somehow not possible, then processing @ARGV term by term is nice and easy -- and fast.
† In order to actually do "if value ($2) is undefined give it a value" we need to run code in the replacement part, which is done under the /e modifier.
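For completeness, a minimal Getopt::Long sketch for the three cases in the question (hedged: option names and types are assumed from the example command line):
use Getopt::Long;
my ($format, $timeout, $nofail);
GetOptions(
    'format=s'  => \$format,    # --format="%H:%M:%S" or --format %H:%M:%S
    'timeout=i' => \$timeout,   # --timeout 12
    'nofail'    => \$nofail,    # bare flag, set to 1 when present
) or die "Bad options\n";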

Need to add prefix to captured names, based on given lists

I have two arrays @Mister and @Mrs and need to add a prefix based on the values.
@Mister = qw(Parasuram Raghavan Srivatsan);
@Mrs = qw(Kalai Padmini Maha);
my $str = "I was invited the doctor Parasuram and Kalai and civil Engineer Raghavan and Padmini and finally Advocate Srivatsan and Maha";
#Mr. Parasuram Mr. Raghavan Mr. Srivatsan
if(grep ($_ eq $str), @Mister)
{ $str=~s/($_)/Mr. $1/g; }
#Mrs. Kalai Mrs. Padmini Mrs. Maha
if(grep ($_ eq $str), @Mrs)
{ $str=~s/($_)/Mrs. $1/g; }
Output Should be:
I was invited the doctor Mr. Parasuram and Mrs. Kalai and civil Engineer Mr. Raghavan and Mrs. Padmini and finally Advocate Mr. Srivatsan and Mrs. Maha
Could someone simplify the way I am doing this and explain what's wrong in this code?
A simple take
my $mr_re = join '|', @Mister;
my $mrs_re = join '|', @Mrs;
$str =~ s/\b($mr_re)\b/Mr. $1/g;
$str =~ s/\b($mrs_re)\b/Ms. $1/g;
(note that I used the neutral Ms above instead of Mrs)
However, when we consider the bewildering complexity of names, the \b doesn't take care of all ways for a name to contain another. An easy example: the - is readily found in names and \b is an anchor between \w and \W, where \w does not include -.
Thus Name-Another would be matched by Name alone as well.
If there are characters other than alphanumeric (plus _) that can be inside names consider
my $w_re = qr/[a-z-]/i;                          # list all characters that can be in a name
$str =~ s/(?<!$w_re)($mr_re)(?!$w_re)/Mr. $1/g;  # same for Ms.
where the negative lookarounds (?<! and (?! are assertions that fail if a name character (one matching $w_re) is adjacent, but do not consume anything. Thus they delimit acceptable names.
The same holds for accents and many other characters used in names in various cultures. The task of forming a satisfactory $w_re may be tricky even for one particular use case.
If names can come in multiple words (with spaces), then in order to handle names within others you would have to parse them in general. That is a complex task; look for modules, as a little regex won't cut it.
A simple fix would be to preprocess lists to check for names with multiple words that contain other names from your lists, and to handle that case by case.
For your example, with hard-coded and verifiable names, the above works. However, in general, when assembling a regex from strings make sure that all (ASCII) non-word characters are escaped, so that you actually have the intended literal characters without a special meaning:
my $mr_re = join '|', map { quotemeta } @Mister;
my $mrs_re = join '|', map { quotemeta } @Mrs;
See quotemeta; inside a regex use \Q, described in perlrebackslash and perlre.
Note that this problem critically relies on sensible input.
If names are duplicated in the lists the problem is ill-posed: if they repeat in the sentence, it is unknown which is which; if they don't, it is unknown whether it is Mr. or Ms. Thus the name lists should first be checked for duplicates.
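A small sketch of such a pre-check (hedged; it only reports names present in both lists):
my %seen    = map { $_ => 1 } @Mister;
my @clashes = grep { $seen{$_} } @Mrs;
warn "ambiguous names: @clashes\n" if @clashes;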
"Could someone simplify the way I am doing and whats wrong in this code."
The first part is addressed by zdim in a way I would do it too, but the "what's wrong" part could get some more addressing, in my opinion (just nitpicking, but maybe useful for someone):
if(grep ($_ eq $str), @Mister)
{ $str =~ s/($_)/Mr. $1/g; }
Your list entries will never equal $str; I think you meant $str =~ /$_/.
Either use an additional pair of parentheses around both the condition and @list, or use the block form of grep (grep { $str =~ /$_/ } @Mister) - otherwise grep will miss the list argument, since right now it takes the one existing pair of parentheses as the limiter for its argument list.
The $_ used inside grep is not available outside of it, so the $str substitution would use whatever value $_ currently has. In the example it would most likely be undef, so 'Mr. ' would be inserted between each character of the former $str.
Like I said: A perfectly good solution to your problem is given in zdim's answer, but you also asked "what's wrong in this code".
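For reference, a minimal corrected sketch of what the question's code seems to intend, looping instead of using grep so the matched name is in scope for the substitution:
for my $name (@Mister) {
    $str =~ s/\b\Q$name\E\b/Mr. $name/g;
}
for my $name (@Mrs) {
    $str =~ s/\b\Q$name\E\b/Mrs. $name/g;
}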
@ssr1012 and other readers: Be careful! It's tempting to think there is a universal solution for this problem. But, unfortunately, even @zdim's approach will give undesirable results if the same name appears in both arrays, and it is still tricky if a name in one array is the same as a name in the other array except for a few additional characters at the start or end.
Here's your example using slightly different names:
my @Mister = qw(Parasuram Mahan Srivatsan);
my @Mrs = qw(Kalai Padmini Maha);
...
# I was invited the doctor Mr. Parasuram and Ms. Kalai and civil Engineer Mr. Ms. Mahan and Ms. Padmini and finally Advocate Mr. Srivatsan and Ms. Maha
See the "Mr. Ms. Mahan"? You don't have enough information for a universal solution. This is only reliable if your names are hard-coded and checked first to avoid collisions.
Even if you added first names, you might not have enough information - guessing gender from first names is unreliable in many language cultures.

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last-modified .csv in a folder and parses its values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first and second lines (the title and the column names of the .csv) be ignored;
The file sometimes comes with 2-5 empty lines at the end, as I show in my sample (ignore the dot at the end of it; it doesn't exist in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
    binary => 1,
    eol    => $/,
});

# open the file, I use the DATA internal file handle here
my $title = <DATA>;

# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );

while (my $row = $csv->getline_hr(*DATA)) {
    # you can now access the variables via their header names, e.g.:
    if (defined $row->{Duration}) {   # this will skip the blank lines
        say $row->{Duration};
    }
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
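For reference, a dump like the one above can be produced with something along these lines (inside the while loop, for one row):
use Data::Dumper;
print Dumper($row);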
open ...
my $names_from_first_line = <$file>;   # you can use them or just ignore them
while (my $line = <$file>) {
    unless ($line =~ /\S/) {
        # skip empty lines
        next;
    }
    ..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
    chomp $line;
    next if $. == 1 || $. == 2;
2) The file sometimes comes with 2-5 empty lines at the end, as I show in my sample (ignore the dot at the end of it; it doesn't exist in the file).
while ( my $line = <$file> ) {
    chomp $line;
    next if $. == 1 || $. == 2;
    next if $line =~ /^\s*$/;
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
    warn qq(next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;); # Temp debugging line
    next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
    warn qq($line matched regular expression); # Temp debugging line
    ...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, which is one of the reasons why Perl has such a reputation for being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions are an extremely powerful tool, and worth the effort to learn. With some experience, you'll be able to decode them easily.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use postfix if, and should never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}\/\d{2}\/\d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a false value. The or states that if open doesn't return a true value, execute the expression that follows, which kills your program.
So, look through your program and see any place where you make an assumption that something works as expected, and think about what happens if it didn't. Then, add checks in your program to do something if you get that exception. It could be that you want to report the error, or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. Whatever you do, check for possible errors (especially from user input) and handle them.
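For example, a hedged sketch of such a guard around the parse_line call from the question (the field count of 9 is taken from the question's column list; this assumes it runs inside the read loop):
my @fields = parse_line ',', 0, $line;
unless (@fields >= 9 && defined $fields[0]) {
    warn "skipping malformed line $.: $line";
    next;
}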
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space and then the time, which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us to debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly formatted; to use only one backslash, you'd need to write your string with single quotes. With double quotes, you'd need two backslashes here.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, I see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern, based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
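A minimal sketch of that array-driven approach (the variant list here is just the handful from the question):
my @llc_variants = ('llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,');
my $field  = 'L.L.C';
my $is_llc = grep { lc($field) eq $_ } @llc_variants;   # true: 'l.l.c' is in the list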
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
    $Sq = $_;
    $Sq =~ s/[.\s]//g;                 # squish away . and " " in the temporary save version
    $Sq = lc($Sq);
    $Sq =~ /^llc$/ and $_ = 'L.L.C.';  # try to match, if so save the canonical version
    $Sq =~ /^ibm/  and $_ = 'IBM';     # a different match
    print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
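A rough sketch of that map idea (the entries shown are only examples):
my %canonical = (
    llc => 'L.L.C.',   # hypothetical entries; extend as new variants are discovered
    inc => 'Inc.',
);
my $suffix = 'l. l. c.';
(my $key = lc $suffix) =~ s/[.,\s]//g;    # normalise before lookup: 'llc'
my $clean = $canonical{$key} // $suffix;  # fall back to the original if unknown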
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my @strings = (
    "l.l.c",
    "llc",
    "LLC",
    "lLc",
    "l,l,c",
    "L . L C ",
    "l W c"
);
my @seps = ('.', ',', '\s');
my $sep_regex = '[' . join('', @seps) . ']*';

my $regex_def = join '', (
    '[lL]',
    $sep_regex,
    '[lL]',
    $sep_regex,
    '[cC]'
);
print "definition: $regex_def\n";
foreach my $str (@strings) {
    if ( $str =~ /$regex_def/ ) {
        print "$str matches\n";
    } else {
        print "$str doesn't match\n";
    }
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
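For instance, a tiny sketch using edit distance (this assumes the CPAN module Text::Levenshtein is available; any similar scoring scheme would do):
use Text::Levenshtein qw(distance);
my $d = distance(lc 'lLc', 'llc');   # 0 after case-folding
print "close enough\n" if $d <= 1;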
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', @seps) . ']*';

foreach my $str (@strings) {
    my $copy = $str;
    $copy =~ s/$sep_regex//g;
    $copy = lc $copy;
    print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.