Use Regex to modify specific column in a CSV - regex

I'm looking to convert some strings in a CSV which are in 0000-2400 hour format to 00-24 hour format. e.g.
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
The 7th and 9th columns are departure and arrival times, respectively. Preferably the lines should look like this when I'm done:
2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00
The whole csv will eventually be imported into R and I want to try and handle some of the processing beforehand because it will be kinda large. I initially attempted to do this with Perl but I'm having trouble picking out multiple digits w/ a regex. I can get a single digit before a given comma with a lookbehind expression, but not more than one.
I'm also open to being told that doing this in Perl is needlessly silly and I should stick to R. :)

I may as well offer my own solution to this, which is
s/"(\d\d)\d\d"/"$1"/g

As I mentioned in the comments, using a CSV module such as Text::CSV is the safe option. Here is a quick sample script showing how it's used. You'll notice that it does not preserve the quotes, even though I set keep_meta_info and expected it to. If that matters to you, I'm sure there's a way to fix it.
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;
my $csv = Text::CSV->new({
    binary         => 1,
    eol            => $/,
    keep_meta_info => 1,
});
while (my $row = $csv->getline(*DATA)) {
    # Columns 7 and 9 (indexes 6 and 8): keep the first two digits only
    for ($row->[6], $row->[8]) {
        s/\d\d\K\d\d//;
    }
    $csv->print(*STDOUT, $row);
}
__DATA__
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
Output:
2011-01-01,AA,12478,31703,12892,32575,09,-4.00,12,-26.00,2475.00
2011-01-02,AA,12478,31703,12892,32575,09,-2.00,12,1.00,2475.00
2011-01-03,AA,12478,31703,12892,32575,09,-3.00,12,4.00,2475.00
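For comparison, the same column-wise idea can be sketched in Python with the standard csv module. This is only an illustration and not part of the answer above: the function name shorten_time_columns and its default column indexes (6 and 8, zero-based, for the 7th and 9th columns) are assumptions, and like the Perl script it drops the quotes on output.

```python
import csv
import io

def shorten_time_columns(text, columns=(6, 8)):
    """Truncate HHMM strings to HH in the given zero-based columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        for i in columns:
            row[i] = row[i][:2]  # keep only the hour digits
        writer.writerow(row)
    return out.getvalue()

sample = '2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00\n'
print(shorten_time_columns(sample), end="")
```

Because the parser splits the fields, the edit cannot accidentally touch a four-digit value in some other column.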

Related

Including regex on variable before matching string

I'm trying to find and extract the occurrence of words read from a text file in a text file. So far I can only find when the word is written correctly and not munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
    my $fileName = $_[0];
    my $stringToCount = $_[1];
    my $numberOfOccurrences = 0;
    my @wordArray = wordsInFileToArray ($fileName);
    foreach (@wordArray)
    {
        # Count every case-insensitive match in the current word
        my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
        $numberOfOccurrences += $numberOfNewOccurrences;
    }
    return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by a user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
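The normalization idea can be sketched in Python as follows. This is a rough illustration under stated assumptions: the names NORMS, normalize and count_normalized are hypothetical, the '#' and '1' entries come from the answer above, and the '3' entry is an extra assumption added only to make the example complete.

```python
# Hypothetical table: munged character -> normal character.
# '#' -> 'a' and '1' -> 'i' follow the answer; '3' -> 'e' is an assumption.
NORMS = str.maketrans({"#": "a", "1": "i", "3": "e"})

def normalize(word):
    """Lower-case the word and undo the known substitutions."""
    return word.lower().translate(NORMS)

def count_normalized(words, target):
    """Count the words that contain the target once both are normalized."""
    target = normalize(target)
    return sum(1 for w in words if target in normalize(w))

print(count_normalized(["russ1#nh#ck3r", "russianhacker", "other"], "russianhacker"))
```

As noted above, this only works when each munged character maps back to exactly one normal character.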
As a note, your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted as such. See the docs for quotemeta.
There are parts of the problem which aren't specified precisely enough (yet).
Some roll-your-own approaches, depending on the details, are:
If user-defined substitutions are global (replace every occurrence of a character in every string), the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled words, if found). There may be false positives, so also plan on some post-processing.
If the user can supply a list of substitutions along with the words that they apply to (the mangled or the corresponding unmangled ones), then we can have a more targeted run.
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my @text = qw(that h#cker was a russ1#anh#ck3r);
my @matches = amatch($target, ["25%"], @text);
say for @matches; #==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument to amatch specifies the acceptable percentage deviation from the target string. For this particular example we need to allow every fourth character to be "edited." That much room for tweaking can result in accidental matches, which then need to be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
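For readers who want to see what the edit distance mentioned above actually measures, here is a minimal Python sketch of the textbook dynamic-programming computation. It is not how String::Approx is implemented (the module has its own algorithm and thresholds); the function name levenshtein is just an illustrative choice.

```python
def levenshtein(a, b):
    """Count the insertions, deletions and replacements needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("hacker", "h#cker"))
```

A single substituted character costs one edit, which is why a percentage threshold scales with the length of the target.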
I ended up solving the problem by putting the regex directly into the variable that I match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
    my $fileName = $_[0];
    my $mungedWordToCount = $_[1];
    my $numberOfOccurrences = 0;
    open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
    # Turn every "a" into a character class that also accepts its munged forms
    $mungedWordToCount =~ s/a/\[a\#4\]/gi;
    while (my $currentLine = <$inputFile>)
    {
        chomp ($currentLine);
        $numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
    }
    close ($inputFile) or die "Can't close file: $!";
    return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
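The same idea, building the pattern from a variable plus a user-supplied substitution table, might look like this in Python. This is a sketch under assumptions: munged_pattern, count_munged, the substitution table and the sample lines are all hypothetical and chosen for illustration. Every character is escaped, so metacharacters in the search word are matched literally.

```python
import re

def munged_pattern(word, substitutions):
    """Build a regex in which each letter also accepts its munged forms."""
    parts = []
    for ch in word:
        alternatives = ch + substitutions.get(ch.lower(), "")
        if len(alternatives) > 1:
            # Character class such as [a\#4]; escaping keeps ']' or '^' literal
            parts.append("[" + "".join(re.escape(c) for c in alternatives) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def count_munged(lines, word, substitutions):
    """Count all matches of the munged pattern across the given lines."""
    pattern = munged_pattern(word, substitutions)
    return sum(len(pattern.findall(line)) for line in lines)

subs = {"a": "#4", "i": "1", "e": "3"}  # hypothetical user-defined table
print(count_munged(["russ1#nh4ck3r", "RussianHacker", "unrelated"], "russianhacker", subs))
```

Note that this handles one-for-one character substitutions only; a munged form that adds or removes characters would need a different approach, such as approximate matching.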
Thanks for the suggestions, people.

Perl MongoDB API - regular expression in filters

I am working in Perl with MongoDB. I have a collection whose documents have a large text field, and I need to find all documents in which that field contains several given strings.
So for instance, if this is a database of movie quotes one row would have value:
We must totally destroy all spice production on Arrakis. The Guild and
the entire Universe depends on spice. He who can destroy a thing,
controls a thing.
I want to be able to match that row with terms "spice", "Arrakis", and "Guild" where ALL of those terms have to be in the text.
My current approach can only achieve matches if the terms provided happen to be in the correct order, i.e.:
$db->get_collection( 'quotes' )->find( { quote => qr/spice.*Arrakis.*Guild/i } );
That's a match, but
$db->get_collection( 'quotes' )->find( { quote => qr/Guild.*spice.*Arrakis/i } );
is not a match.
If I were working with a SQL database I could do:
... WHERE quote LIKE '%spice%' and quote LIKE '%Arrakis%' and quote LIKE '%Guild%'
but with the MongoDB interface you only get one shot per field.
Is there a way to match multiple words where all are required in one regex, or is there another way to get more than one crack at a field in the MongoDB interface?
One way: a bunch of positive lookahead assertions:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my @tests = ("The Guild demands that the spice must flow from Arrakis",
             "House Atreides will be transported to Arrakis by the Guild.");
for my $test (@tests) {
    # Each lookahead must succeed from the start of the string, in any order
    if ($test =~ m/^(?=.*spice)
                    (?=.*Guild)
                    (?=.*Arrakis)/x) {
        say "$test: matches";
    } else {
        say "$test: fails";
    }
}
produces:
The Guild demands that the spice must flow from Arrakis: matches
House Atreides will be transported to Arrakis by the Guild.: fails
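If the list of terms is not known in advance, the lookahead pattern can be generated. The Python sketch below shows that, together with a second option: MongoDB's $and operator can combine several $regex conditions on the same field, which is the closest analogue of the SQL query in the question. The helper names all_terms_regex and all_terms_filter are hypothetical, and whether a given driver accepts the filter in exactly this form is not verified here.

```python
import re

def all_terms_regex(terms):
    """One regex that matches only if every term occurs, in any order."""
    lookaheads = "".join("(?=.*" + re.escape(t) + ")" for t in terms)
    # DOTALL lets '.' cross line breaks in multi-line quotes
    return re.compile("^" + lookaheads, re.IGNORECASE | re.DOTALL)

def all_terms_filter(field, terms):
    """A MongoDB-style filter document: one $regex condition per term."""
    return {"$and": [{field: {"$regex": re.escape(t), "$options": "i"}}
                     for t in terms]}

rx = all_terms_regex(["spice", "Arrakis", "Guild"])
print(bool(rx.search("The Guild demands that the spice must flow from Arrakis")))
print(all_terms_filter("quote", ["spice", "Guild"]))
```

The single-regex form fits the one-condition-per-field call in the question; the $and form keeps each term as its own simple pattern.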

Perl decimal to ASCII conversion

I am pulling SNMP information from an F5 LTM, and storing this information in a psql database. I need help converting the returned data in decimal format into ASCII characters.
Here is an example of the information returned from the SNMP request:
iso.3.6.1.4.1.3375.2.2.10.2.3.1.9.10.102.111.114.119.97.114.100.95.118.115 = Counter64: 0
In my script, I need to identify the different sections of this information:
my ($prefix, $num, $char_len, $vs) = ($oid =~ /($vsTable)\.(\d+)\.(\d+)\.(.+)/);
This gives me the following:
(prefix= .1.3.6.1.4.1.3375.2.2.10.2.3.1)
(num= 9 )
(char-len= 10 )
(vs= 102.111.114.119.97.114.100.95.118.115)
The variable $vs is the Object name in decimal format. I would like to convert this to ASCII characters (which should be "forward_vs").
Does anyone have a suggestion on how to do this?
Given that this is related to interpreting SNMP data, it seems logical to me to use one or more of the SNMP modules available from CPAN. You have to know quite a lot about SNMP to determine when the string you quote stops being the identifier (prefix) and starts to be the value. You have a better chance of getting a general solution with SNMP code than with hand-hacked code.
Jonathan Leffler has the right answer, but here are a couple of things to expand your Perl horizons:
use v5.10;
$_ = "102.111.114.119.97.114.100.95.118.115";
say "Version 1: " => eval;
say "Version 2: " => pack "W".(1+y/.//) => /\d+/g;
Executed, that prints:
Version 1: forward_vs
Version 2: forward_vs
Once both are clear to you, you may hit space to continue or q to quit. :)
EDIT: The last one can also be written
pack "WW".y/.//,/\d+/g
But please don't. :)
my $new_vs = join("", map { chr($_) } split(/\./,$vs));
Simple solution:
$ascii .= chr for split /\./, $vs;
pack 'C*', split /\./
For example,
>perl -E"say pack 'C*', split /\./, $ARGV[0]" 102.111.114.119.97.114.100.95.118.115
forward_vs
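For readers outside Perl, the same conversion is a one-liner in Python. The sketch below also uses the length prefix from the question (the "10" in front of the name) as a sanity check; the function names oid_to_text and split_length_prefixed are hypothetical.

```python
def oid_to_text(dotted):
    """Convert '102.111.114' style decimal codes to the characters they encode."""
    return "".join(chr(int(part)) for part in dotted.split("."))

def split_length_prefixed(suffix):
    """Split '10.102.111...' into (length, text) and verify the length prefix."""
    length, _, rest = suffix.partition(".")
    text = oid_to_text(rest)
    if int(length) != len(text):
        raise ValueError("length prefix does not match name length")
    return int(length), text

print(oid_to_text("102.111.114.119.97.114.100.95.118.115"))
```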

Regular expression to match CSV delimiters

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:
1,"abcd",2,"de,fg",3,"hijk"
I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
See my post that solves this problem for more detail.
^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ will match the whole line; you can then use match.Groups[1].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.
CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.
What language are you using?
As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).
I suggest the tool CSVFIX as likely to do what you need.
To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):
"""",,"",a,"a,b"
Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:
"",,"",a",b c",
The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!
This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:
use strict;
use warnings;
my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );
foreach my $string (@list)
{
    print "Pattern: <<$string>>\n";
    # $1 = quoted field (QF), $2 = plain field (PF), $3 = empty field (EF)
    while ($string =~ m/ (?: " ( (?:""|[^"])* ) " | ( [^,"] [^,]* ) | ( .? ) )
                         (?: $ | , ) /gx)
    {
        print "Found QF: <<$1>>\n" if defined $1;
        print "Found PF: <<$2>>\n" if defined $2;
        print "Found EF: <<$3>>\n" if defined $3;
    }
}
Note that, as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!
Output:
Pattern: <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern: <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>
We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct.
Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
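To illustrate the point that an existing parser already handles these cases, here is a short Python sketch using the standard csv module on the well-formed lines from this thread. The helper name parse_line is hypothetical. The parser both ignores the comma inside the quoted field and collapses the doubled quotes in one pass.

```python
import csv

def parse_line(line):
    """Parse a single CSV line into a list of field values."""
    return next(csv.reader([line]))

print(parse_line('1,"abcd",2,"de,fg",3,"hijk"'))
print(parse_line('"""",,"",a,"a,b"'))
```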
Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?
Without context it's impossible to give a more specific solution.
Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.
Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
# Read every row from standard input, then print the fields tab-separated
my $rows = $csv->getline_all(*STDIN);
for my $row (@$rows) {
    say join("\t", @$row);
}
That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?
I know this is old, but this RegEx works for me:
/(\"[^\"]+\")|[^,]+/g
It could potentially be used with any language. I tested it in JavaScript, so the g is just the global modifier. It works even with messed-up lines (extra quotes), but empty fields are not dealt with.
Just sharing, maybe this will help someone.

what do I use to match MS Word chars in regEx

I need to find and delete all the non standard ascii chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.
Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
this works because the problem with "Word chars" for me is the ones which are illegal in Unicode, and I've got no way of sanitising user input.
Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
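The character-set point can be illustrated with a small Python sketch. The bytes 0x80 to 0x9F are control codes in ISO-8859-1 but mostly printable punctuation in windows-1252, so decoding with the right codec is often the whole fix. The function name decode_word_bytes and the sample bytes are assumptions made for illustration.

```python
def decode_word_bytes(raw):
    """Decode bytes as windows-1252; the few undefined bytes become U+FFFD."""
    return raw.decode("cp1252", errors="replace")

raw = b"\x93quoted\x94 \x96 it\x92s"
print(decode_word_bytes(raw))        # curly quotes, en dash, apostrophe
print(repr(raw.decode("latin-1")))   # same bytes read as control characters
```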
I usually use a JEdit macro that replaces the most common of them with a more ascii-friendly version, i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/OpenOffice/whatever, and can of course be modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers
What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
s/([\x{80}-\x{FF}])/$map{$1}/g;
print;
}
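The same two steps, first survey the non-ASCII characters and then replace them through a mapping, can be sketched in Python. The REPLACEMENTS table below is a hypothetical starting point covering a few common Word characters, not a complete list, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical mapping: code point -> plain ASCII replacement
REPLACEMENTS = {
    0x2018: "'", 0x2019: "'",   # curly single quotes
    0x201C: '"', 0x201D: '"',   # curly double quotes
    0x2013: "-", 0x2014: "-",   # en dash, em dash
    0x2026: "...",              # horizontal ellipsis
}

def non_ascii_counts(text):
    """Count each character outside the ASCII range (0 - 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)

def to_plain_ascii(text):
    """Replace the mapped characters; unmapped ones are left untouched."""
    return text.translate(REPLACEMENTS)

print(non_ascii_counts("it\u2019s \u201cfine\u201d"))
print(to_plain_ascii("it\u2019s \u201cfine\u201d\u2026"))
```

Running the survey first shows which characters actually occur, so the mapping only needs entries for those.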
What I would do is, use AutoHotKey, or python SendKeys or some sort of visual basic that would send me all possible keys (also with shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
# range() excludes its end point, so add 1 to include 'z' and '9'
chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])
specials = ''.join(['-', '=', '\\', '/', ',', '.', '`'])
all = chars + nums + specials
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text, and use it as a database of all displayable keys in your keyboard layout (you might want to switch the default input language more than once to get absolutely all displayable characters).
If the char is in the result text document - it is displayable, otherwise not. No need for regexp. You can of course afterward embed the characters range within a script or a program.