use perl to replace unix timestamp within text string - regex

OK, I have a data file with two columns of data, RecordNumber and Notes. They are separated by pipes and look like this:
Record1|1234567890 username notes notes notes notes 1254184921 username notes notes notes notes|
... This goes on for thousands of records.
Using a Perl script (and possibly some regex) I need to take the notes column and parse it out to make 3 new columns separated with pipes to load into a table. The columns need to be Note_Date|Note_Username|Note_Text.
The 10-digit string of numbers throughout the notes column is a unix timestamp. My second task is to take this and convert it to a regular timestamp. Please, any help would be appreciated.
Thanks.

You may need to modify this for your needs:
use strict;
use warnings;
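while (<>) {
    # Split the line into record number and notes column.
    my @a = split(/\|/);
    # Each note is a timestamp, a username, then text up to the next digit.
    while ($a[1]=~/\s*(\d+)\s+(\w+)\s+([^0-9]*)/g) {
        my ($t, $u, $n) = ($1, $2, $3);
        # In scalar context localtime returns a human-readable date string.
        $t = localtime($t);
        print $a[0], "|$t $u $n|\n";
    }
}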
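The output above still separates date, username and text with spaces rather than pipes. As a rough sketch of the pipe-separated layout the question asks for, here is the same idea in Python. It assumes every note starts with exactly 10 digits followed by a single-word username, that note text never contains a 10-digit number of its own, and that UTC is an acceptable timezone; the names `NOTE_RE` and `split_notes` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# A 10-digit timestamp, a username, then note text running up to the
# next timestamp or the end of the notes column (assumed layout).
NOTE_RE = re.compile(r"(\d{10})\s+(\w+)\s+(.*?)(?=\s+\d{10}\s|\s*$)")

def split_notes(line):
    """Return one 'Record|Note_Date|Note_Username|Note_Text' row per note."""
    record, notes = line.rstrip("\n").split("|")[:2]
    rows = []
    for ts, user, text in NOTE_RE.findall(notes):
        # Convert the unix timestamp to a readable UTC date.
        when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        rows.append("|".join([record, when.strftime("%Y-%m-%d %H:%M:%S"), user, text]))
    return rows

if __name__ == "__main__":
    sample = ("Record1|1234567890 username notes notes notes notes "
              "1254184921 username notes notes notes notes|")
    for row in split_notes(sample):
        print(row)
```

The record number is kept as a leading column so each note can still be tied back to its record; drop it if the table does not need it.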

Related

Perl MongoDB API - regular expression in filters

I am working in Perl with a MongoDB. I have a collection whose documents have a big text field, and I need to find all rows that contain multiple strings in that field.
So for instance, if this is a database of movie quotes one row would have value:
We must totally destroy all spice production on Arrakis. The Guild and
the entire Universe depends on spice. He who can destroy a thing,
controls a thing.
I want to be able to match that row with terms "spice", "Arrakis", and "Guild" where ALL of those terms have to be in the text.
My current approach can only achieve matches if the terms provided happen to be in the correct order, i.e.:
$db->get_collection( 'quotes' )->find( { quote => qr/spice.*Arrakis.*Guild/i } );
That's a match, but
$db->get_collection( 'quotes' )->find( { quote => qr/Guild.*spice.*Arrakis/i } );
is not a match.
If I were working with a SQL database I could do:
... WHERE quote LIKE '%spice%' and quote LIKE '%Arrakis%' and quote LIKE '%Guild%'
but with the MongoDB interface you only get one shot per field.
Is there a way to match multiple words where all are required in one regex, or is there another way to get more than one crack at a field in the MongoDB interface?
One way: a bunch of positive lookahead assertions:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my @tests = ("The Guild demands that the spice must flow from Arrakis",
             "House Atreides will be transported to Arrakis by the Guild.");
for my $test (@tests) {
    # Each lookahead scans from the start of the string, so the order
    # in which the words appear does not matter.
    if ($test =~ m/^(?=.*spice)
                    (?=.*Guild)
                    (?=.*Arrakis)/x) {
        say "$test: matches";
    } else {
        say "$test: fails";
    }
}
produces:
The Guild demands that the spice must flow from Arrakis: matches
House Atreides will be transported to Arrakis by the Guild.: fails
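If the list of required words is not fixed, the lookahead pattern can be built from the list. The following is only a sketch of the pattern construction, written in Python; `all_terms_pattern` is a hypothetical helper name, and the case-insensitive and dot-matches-newline flags are assumptions chosen to mirror the /i in the question and to cope with multi-line quotes.

```python
import re

def all_terms_pattern(terms):
    """Build a regex that matches only if every term occurs somewhere."""
    # One lookahead per term; each one scans from the start of the text,
    # so the terms may appear in any order.
    lookaheads = "".join(f"(?=.*{re.escape(t)})" for t in terms)
    return re.compile("^" + lookaheads, re.IGNORECASE | re.DOTALL)

if __name__ == "__main__":
    pattern = all_terms_pattern(["spice", "Arrakis", "Guild"])
    for text in ("The Guild demands that the spice must flow from Arrakis",
                 "House Atreides will be transported to Arrakis by the Guild."):
        print(text, "->", "matches" if pattern.search(text) else "fails")
```

Escaping each term keeps characters such as . or ? in a search word from being read as regex syntax.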

Need help in designing a regular expression

"LIM-1-2::PROVPEC=NTK552DA,CTYPE=\"LIM C-Band\":OOS-AU,UEQ"
"2XOSC-1-4::PROVPEC=NTK554BA,CTYPE=\"OSC w/WSC 2 Port SFP 2 Port 10/100 BT\":OOS-AU,UEQ"
"P155M-1-4-1::PROVPEC=NTK592NP,CTYPE=\"OC-3 0-15dB CWDM 1511 nm\":OOS-AU,UEQ"
I have this data in a file. I need to extract -1-2 for the first equipment and likewise -1-4-1 for the last one. I will be using this data later. I was able to figure out how to get -1-1, but my expression is not versatile enough to also get -1-1-4.
Equipment can also have a subslot. This list is tentative.
The format is EQP-shelf-slot-subslot. I need an expression that checks whether the subslot exists and gives me output in the form -shelf-slot-subslot or -shelf-slot.
How about:
my ($wanted) = $str =~ /^\w+([^:]+)/;
or, if quotes are part of the string:
my ($wanted) = $str =~ /^"\w+([^:]+)/;
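For comparison, a slightly stricter variant can be sketched in Python. It assumes the shelf, slot and optional subslot are always plain digits and that the identifier always ends at a double colon; `AID_RE` and `shelf_slot` are illustrative names, not part of any library.

```python
import re

# Optional leading quote, the equipment name, then two or three "-number"
# parts (shelf, slot, optional subslot), which must be followed by "::".
AID_RE = re.compile(r'^"?\w+((?:-\d+){2,3})::')

def shelf_slot(line):
    """Return '-shelf-slot' or '-shelf-slot-subslot', or None if no match."""
    m = AID_RE.match(line)
    return m.group(1) if m else None

if __name__ == "__main__":
    lines = [
        r'"LIM-1-2::PROVPEC=NTK552DA,CTYPE=\"LIM C-Band\":OOS-AU,UEQ"',
        r'"P155M-1-4-1::PROVPEC=NTK592NP,CTYPE=\"OC-3 0-15dB CWDM 1511 nm\":OOS-AU,UEQ"',
    ]
    for line in lines:
        print(shelf_slot(line))
```

Unlike the looser [^:]+ version, this returns nothing for a line whose identifier does not follow the expected shape, which may or may not be what you want.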

Regex to select semicolons that are not enclosed in double quotes

I have string like
a;b;"aaa;;;bccc";deef
I want to split string based on delimiter ; only if ; is not inside double quotes. So after the split, it will be
a
b
"aaa;;;bccc"
deef
I tried using look-behind, but I'm not able to find a correct regular expression for splitting.
Regular expressions are probably not the right tool for this. If possible, use a CSV library and specify ; as the delimiter and " as the quote character; this should give you exactly the fields you are looking for.
That being said, here is one approach. It works by ensuring that there is an even number of quotation marks between the ; we are considering splitting at and the end of the string.
;(?=(([^"]*"){2})*[^"]*$)
Example: http://www.rubular.com/r/RyLQyR8F19
This will break down if you can have escaped quotation marks within a string, for example a;"foo\"bar";c.
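As an illustration, here is how that lookahead split might look in Python. One detail differs from the regex as written: Python's re.split includes the text of capturing groups in its result, so the sketch uses non-capturing groups instead. The function name `split_outside_quotes` is made up for this example.

```python
import re

# Split on ";" only when an even number of quotation marks remains between
# that ";" and the end of the string, i.e. when we are outside any quotes.
# Non-capturing groups keep re.split from returning the group contents.
SPLIT_RE = re.compile(r';(?=(?:(?:[^"]*"){2})*[^"]*$)')

def split_outside_quotes(s):
    return SPLIT_RE.split(s)

if __name__ == "__main__":
    for field in split_outside_quotes('a;b;"aaa;;;bccc";deef'):
        print(field)
```

The same limitation applies: escaped quotation marks inside a quoted field throw off the even/odd count.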
Here is a much cleaner example using Python's csv module:
import csv, io
# Python 3: StringIO lives in the io module and print is a function.
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    print('\n'.join(row))
A regular expression will only get messier and break on even minor changes. You are better off using a CSV parser in whatever scripting language you use. Perl has a built-in module called Text::ParseWords (so you don't need to download anything from CPAN if there are any restrictions) that lets you specify the delimiter, so you are not limited to ,. Here is a sample snippet:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::ParseWords;
my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);
print "$_\n" for @ary;
Output
a
b
aaa;;;bccc
deef
This is kind of ugly, but if you don't have \" inside your quoted strings (meaning you don't have strings that look like this: "foo bar \"badoo\" goo"), you can split on the " first and then assume that all your even numbered array elements are, in fact, strings (and split the odd numbered elements into their component parts on the ; token).
If you do have \" in your strings, then you'll want to first convert those into some other temporary token that you'll convert back later after you've performed your operation.
Here's a fiddle...
http://jsfiddle.net/VW9an/
var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth'
var strCp = str.replace(/\\"/g,"--##--");
var parts = strCp.split(/"/);
var allPieces = new Array();
for(var i in parts){
    if(i % 2 == 0){
        var innerParts = parts[i].split(/\;/)
        for(var j in innerParts)
            allPieces.push(innerParts[j])
    }
    else{
        allPieces.push('"' + parts[i] +'"')
    }
}
for(var a in allPieces){
    allPieces[a] = allPieces[a].replace(/--##--/g,'\\"');
}
console.log(allPieces)
Match All instead of Splitting
Answering long after the battle because no one used the way that seems the simplest to me.
Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:
"[^"]*"|[^";]+
See the matches in the Regex Demo.
The left side of the alternation | matches full quoted strings
The right side matches any chars that are neither ; nor "
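A minimal sketch of this match-all approach in Python, using the regex exactly as given; `fields` is just an illustrative wrapper name. One caveat that follows from the pattern itself: because the right side needs at least one character, empty fields (as in a;;b) are silently dropped rather than returned as empty strings.

```python
import re

# Left alternative: a complete quoted string.
# Right alternative: a run of characters that are neither ";" nor '"'.
FIELD_RE = re.compile(r'"[^"]*"|[^";]+')

def fields(s):
    """Return every field in s, keeping quoted strings intact."""
    return FIELD_RE.findall(s)

if __name__ == "__main__":
    for field in fields('a;b;"aaa;;;bccc";deef'):
        print(field)
```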

Use Regex to modify specific column in a CSV

I'm looking to convert some strings in a CSV which are in 0000-2400 hour format to 00-24 hour format. e.g.
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
The 7th and 9th columns are departure and arrival times, respectively. Preferably the lines should look like this when I'm done:
2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00
The whole csv will eventually be imported into R and I want to try and handle some of the processing beforehand because it will be kinda large. I initially attempted to do this with Perl but I'm having trouble picking out multiple digits w/ a regex. I can get a single digit before a given comma with a lookbehind expression, but not more than one.
I'm also open to being told that doing this in Perl is needlessly silly and I should stick to R. :)
I may as well offer my own solution to this, which is
s/"(\d\d)\d\d"/"$1"/g
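Note that this substitution rewrites every quoted four-digit field on the line, not only the 7th and 9th columns. If that matters, the change can be limited to specific columns. Below is a sketch in Python that assumes no field contains an embedded comma, so a plain split is safe; `shorten_times` and its `columns` parameter are hypothetical names, with columns given as zero-based indexes.

```python
import re

def shorten_times(line, columns=(6, 8)):
    """Trim quoted HHMM values to HH in the given zero-based columns."""
    # Assumption: no commas inside quoted fields, so split(",") is enough.
    fields = line.rstrip("\n").split(",")
    for i in columns:
        # Only touch a field that is exactly four digits inside quotes.
        fields[i] = re.sub(r'^"(\d\d)\d\d"$', r'"\1"', fields[i])
    return ",".join(fields)

if __name__ == "__main__":
    row = '2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00'
    print(shorten_times(row))
```

For data that may contain commas inside quoted fields, a real CSV parser is the safer choice.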
Like I mentioned in the comments, using a CSV module like Text::CSV is a safe option. This is a quick sample script of how it's used. You'll notice that it does not preserve quotes, though it should, since I put in keep_meta_info. If it's important to you, I'm sure there's a way to fix it.
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;
my $csv = Text::CSV->new({
    binary => 1,
    eol => $/,
    keep_meta_info => 1,
});
while (my $row = $csv->getline(*DATA)) {
    for ($row->[6], $row->[8]) {
        s/\d\d\K\d\d//;
    }
    $csv->print(*STDOUT, $row);
}
__DATA__
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
Output:
2011-01-01,AA,12478,31703,12892,32575,09,-4.00,12,-26.00,2475.00
2011-01-02,AA,12478,31703,12892,32575,09,-2.00,12,1.00,2475.00
2011-01-03,AA,12478,31703,12892,32575,09,-3.00,12,4.00,2475.00

regex to strip out image urls?

I need to separate out a bunch of image urls from a document in which the images are associated with names like this:
bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"
I assume I can detect the start of a link with:
/http:[^ ,]+/i
But how can I get all of the links separated from the document?
EDIT: To clarify the question: I just want to strip out the URLs from the file minus the variable name, equals sign and double quotes so I have a new file that is just a list of URLs, one per line.
Try this...
(http://)([a-zA-Z0-9\/\\.])*
If the format is constant, then this should work (python):
import re
s = """bellpepper = "http://images.com/bellpepper.jpg" (...) """
re.findall("\"(http://.+?)\"", s)
Note: this is not "find an image in a file" regexp, just an answer to the question :)
Do you mean that your document has that kind of format and you just want to get the http part? You can just split on the "=" delimiter, without a regex:
$f = fopen("file","r");
if ($f){
    while( !feof($f) ){
        $line = fgets($f,4096);
        $s = explode(" = ",$line);
        $s = preg_replace("/\"/","",$s);
        print $s[1];
    }
    fclose($f);
}
On the command line:
#php5 myscript.php > newfile.ext
If you are using a language other than PHP, there are similar string splitting methods you can use, e.g. Python's or Perl's split(). Check your language's documentation for details.
You may try this, if your tool supports positive lookbehind:
/(?<=")[^"\n]+/