I have a file containing several rows of code, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";
# Print out.
while (<IN>) {
print "Number: $1\n"
while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split the line and then pick the field you want:
my $cutoff = 59;
while (my $line = <IN>) {
my @nums = split /,\s*/, $line;
print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex then you may want to use /e modifier, so in the substitution part you can run code. Then you can test the particular match and print it there.
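As a rough illustration of the /e idea, here is a minimal sketch. The sample lines are copied from the question; the names $cutoff and @kept are made up for the example. With /e the replacement side is evaluated as Perl code, so the numeric test can run there.

```perl
use strict;
use warnings;

my $cutoff = 59;    # assumed threshold, taken from the question
my @lines  = (
    "160101, 0100, 58.8,",
    "160101, 0200, 59.3,",
    "160101, 0300, 59.5,",
    "160101, 0400, 59.1,",
);

my @kept;           # hypothetical collector for the values that pass the test
for my $line (@lines) {
    my $copy = $line;    # substitute on a copy so the sample data stays intact
    # With /e the replacement is run as code; its last value becomes the replacement text.
    $copy =~ s{^[^,]+,[^,]+,\s*([^,]+),}{
        push @kept, $1 if $1 >= $cutoff;
        $1;
    }e;
}

print "Number: $_\n" for @kept;
```

The numeric comparison still happens in ordinary Perl code; the regex only locates the third column.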
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group, which captures the number 59 and its decimals, or two-digit numbers from 60-99 and their decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
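A small sketch to try the extended pattern against a few values. The sample lines beyond those in the question are invented for the test, and the leading ^ anchor is an addition for the example. Note that the pattern requires a decimal point, so a bare 60 would not match.

```perl
use strict;
use warnings;

# The pattern from above, anchored at the start of the line (the anchor is an assumption).
my $re = qr/^[^,]+,[^,]+,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+),/;

# Invented sample lines covering values below 59, at 59, and above 100.
my @samples = (
    "160101, 0100, 58.8,",
    "160101, 0200, 59.3,",
    "160101, 0500, 60.0,",
    "160101, 0600, 101.5,",
    "160101, 0700, 9.9,",
);

# Keep the captured number for every line that matches.
my @matched = map { /$re/ ? $1 : () } @samples;

print "Number: $_\n" for @matched;
```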
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, communicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool to do numeric comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59;
Assuming the line ends with a comma:
print foreach map { s/.+?(\d+\.\d+),$/$1/; $_ } <IN>;
In case there might be something after the rightmost comma:
print foreach map { s/.+?(\d+\.\d+),[^,]*$/$1/; $_ } <IN>;
But I wouldn't use a regexp in that case:
print foreach map { (split /,\s*/)[-2] } <IN>;
I would suggest not using a regex when split is a better tool for the job. Likewise - regex is very bad at detecting numeric values - it works on text based patterns.
But how about:
while ( <> ) {
print ((split /,\s*/)[2],"\n");
}
If you want to test a conditional:
while ( <> ) {
my @fields = split /,\s*/;
print $fields[2],"\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However, it's probably worth noting that 3-argument open with a lexical file handle is recommended now.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.
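Putting the pieces together, a minimal sketch that uses a lexical handle with 3-argument open. To keep the example self-contained it opens an in-memory string (the variable $text is an assumption for the demo); with a real file you would pass 'file.txt' instead of \$text.

```perl
use strict;
use warnings;

# Sample data from the question, held in a string for the demo.
my $text = join '', map { "$_\n" }
    "160101, 0100, 58.8,",
    "160101, 0200, 59.3,",
    "160101, 0300, 59.5,",
    "160101, 0400, 59.1,";

# 3-argument open with a lexical handle; \$text makes it an in-memory file.
open( my $input, '<', \$text ) or die "Can't open in-memory file: $!";

# Extract the third field of each line, then keep the values >= 59.
my @wanted = grep { $_ >= 59 } map { ( split /,\s*/ )[2] } <$input>;
close $input;

print join( "\n", @wanted ), "\n";
```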
I'm currently stuck in Vim trying to find a search/replace one-liner that replaces a number with another value, incremented for each new match it finds.
I'm working on SVG (XML) code to batch-process files whose text Inkscape cannot handle (a bug with plain SVG multiline text).
<tspan
x="938.91315"
y="783.20563"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="833.20563"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="883.20563"
id="tspan13021">✗Text3</tspan>
etc.
So what I want to do is to change that to this result:
<tspan
x="938.91315"
y="200"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="240"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="280"
id="tspan13021">✗Text3</tspan>
etc.
So I duckducked and found the best vim tips resource from zzapper, but I cannot understand it:
convert yy to 10,11,12 :
:let i=10 | 'a,'bg/Abc/s/yy/\=i/ |let i=i+1
I then adapted it to something I can understand and should work in my home vim:
:let i=300 | 327,$ smagic ! y=\"[0-9]\+.[0-9]\+\" ! \=i ! g | let i=i+50
But somehow it doesn't loop, all I get is that:
<tspan
x="938.91315"
300
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
300
id="tspan13019">Text2</tspan><tspan
x="938.91315"
300
id="tspan13021">✗Text3</tspan>
So here I'm seriously stuck. I cannot figure out what doesn't work :
My adaptation of the original formula ?
My data layout ?
My .vimrc ?
I'll try to find other resources by myself, but on that kind of trick they are pretty rare I find, and like in zzapper tips, not always delivered with a manual.
One way to fix it:
:let i = 300 | g/\m\<y=/ s/\my="\zs\d\+.\d\+\ze"/\=i/ | let i += 50
Translation:
let i = 300 - hopefully obvious
g/\m\<y=/ ... - for all lines matching \m\<y=, apply the following command; the "following command" is s/.../.../ | let ...; the regexp:
\m - "magic" regexp
\< - match only at word boundary
s/\my="\zs\d\+.\d\+\ze"/\=i/ - substitute; the regexp:
\m - "magic" regexp
\d\+ - one or more digits
\zs...\ze - replace only what is matched between these points
\=i - replace with the value of expression i
let i += 50 - hopefully obvious again.
For more information: :help :g, :help \zs, :help \ze, :help s/\=.
Just to add my take as a memo (wrote this as an answer as an EDIT didn't seem right). Sorry it is not the best vim scripting here but it enables me to understand (I'm not a vim specialist).
:let i=300 | 323,$g/y="/smagic![0-9]\+.[0-9]\+!\=i!g | let i+=50
Assign the initial value to i :
:let i=300
Start :global (:g) function from line 323 to the end of file:
323,$g
Pattern to match for executing the commands (literal text here)
y="
Substitution with magic on (magic meaning special characters "enabled")
smagic
Pattern to find
[0-9]\+.[0-9]\+
(one or more digits 0-9, then any single character, then one or more digits again; note that an unescaped . is not a literal dot, use \. for that)
Replaced with
\=i
\= tells vim to evaluate i, not to write it literally
Increment i with 50 for the next iteration
let i+=50
This part is still in the g function.
The separators:
| separates the different commands
/ delimits the parts of the :g command
! delimits the parts of the smagic command
This is my data (in a file):
5807035;Fab;2015/01/05;04;668100;18:06:01,488;18:06:02,892
5807028;Opt;2015/01/05;04;836100;17:12:45,223;17:12:47,407
5807028;Fab;2015/01/05;04;836100;17:12:47,470;17:12:48,172
5807027;Opt;2015/01/05;04;926100;17:12:31,807;17:12:34,365
5807027;Fab;2015/01/05;04;926100;17:12:34,443;17:12:37,095
5807026;Opt;2015/01/05;04;682100;17:12:11,698;17:12:19,062
5807026;Fab;2015/01/05;04;682100;17:12:19,124;17:12:21,667
5807025;Opt;2015/01/05;04;217100;17:12:00,669;17:12:02,635
This is my Perl code :
while ( $data =~ m/(\d+);(Opt|Fab);(.+);(\d{2});(.+);(.+);(.+)\n(\d+);(Opt|Fab);.+;\d{2};.+;(.+);(.+)\n/g ) {
if ( "$1" eq "$8" && "$2" ne "$9" ) {
print OUTFILE "$1;$3;$4;$5;$6;$7;$10;$11\n";
}
}
The lines 1 and 2 match the regex, but do not satisfy the condition of the if statement. That's fine.
On the other hand, lines 2 and 3 satisfy the regex AND the condition of the if statement. However, these lines are not retrieved.
I suppose it's because the regex read two lines, then the next two lines, etc. I think I should include the condition of the if statement in the regex (if I'm not mistaken).
What do you guys think ?
The variable $data holds the content of my CSV file.
Since you want to check line 1 & 2, then 2 & 3, you need to prevent the regex engine from consuming the 2nd line by placing the regex to match the second line in a look-ahead:
while ( $data =~ m/(\d+);(Opt|Fab);(.+);(\d{2});(.+);(.+);(.+)\n(?=(\d+);(Opt|Fab);.+;\d{2};.+;(.+);(.+)\n)/g ) {
I didn't think about it much when I first answered, but as @ThisSuitIsBlackNot suggested in the comments, using regular expressions to parse CSV results in code that is hard to maintain. Using a CSV library to parse and process the data is a better idea here.
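For comparison, a minimal sketch of a line-by-line alternative that compares each line with the one after it. It assumes no field contains the ; separator or any quoting, so a plain split stands in for a real CSV library here. The names @rows and @paired are made up for the example, and the data is a subset of the sample from the question.

```perl
use strict;
use warnings;

my $data = <<'END';
5807035;Fab;2015/01/05;04;668100;18:06:01,488;18:06:02,892
5807028;Opt;2015/01/05;04;836100;17:12:45,223;17:12:47,407
5807028;Fab;2015/01/05;04;836100;17:12:47,470;17:12:48,172
5807027;Opt;2015/01/05;04;926100;17:12:31,807;17:12:34,365
5807027;Fab;2015/01/05;04;926100;17:12:34,443;17:12:37,095
END

# One array reference of fields per line.
my @rows = map { [ split /;/ ] } split /\n/, $data;

my @paired;
for my $i ( 0 .. $#rows - 1 ) {
    my ( $cur, $next ) = @rows[ $i, $i + 1 ];

    # Same id, different type (Opt vs Fab), as in the original if condition.
    next unless $cur->[0] eq $next->[0] && $cur->[1] ne $next->[1];

    # Same output layout as the original print: id, date, 04, number,
    # both times of the current line, then both times of the next line.
    push @paired, join ';', @{$cur}[ 0, 2 .. 6 ], @{$next}[ 5, 6 ];
}

print "$_\n" for @paired;
```

Because every line is compared with its successor, lines 2 and 3 are checked as a pair even though line 2 was already compared with line 1.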
I've got a function in Perl that reads the last modified .csv in a folder, and parses its values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second lines (the title and the column names from the .csv) be ignored;
The file sometimes comes with 2-5 empty lines at the end, as I show in my sample (ignore the dot at the end of it; it doesn't exist in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while (my $line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
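A complete version of the loop above might look like the following sketch. It reads from an in-memory string (the variable $text is an assumption, used so the example runs on its own) and uses parse_line from the core Text::ParseWords module, as in the question. The @records array is made up to collect the output.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Sample in the shape shown in the question, with trailing blank lines.
my $text = <<'END';
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"


END

open my $file, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
while ( my $line = <$file> ) {
    chomp $line;
    next if $. <= 2;             # skip the title line and the column names
    next if $line =~ /^\s*$/;    # skip empty lines

    my @fields = parse_line( ',', 0, $line );
    my ( $date, $time ) = split /\s+/, $fields[0];
    my ( $country, $name ) = $fields[4] =~ /(.+) - (.*)/;
    push @records, "$date, $time, $country, $name, $fields[5], $fields[8]";
}
close $file;

print "$_\n" for @records;
```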
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}\/\d{2}\/\d{4}/ is a regular expression pattern. The pattern is between the outer /.../:
^ - Beginning of the line.
" - Quotation mark.
\d{2} - Followed by two digits.
\/ - Followed by a slash (escaped with a backslash, because / also delimits the pattern).
\d{2} - Followed by two more digits.
\/ - Followed by a slash.
\d{4} - Followed by four more digits.
This should describe the first part of your line, which is an opening quotation mark followed by the date in MM/DD/YYYY format. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and they are one of the reasons why Perl has a reputation for being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions are an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}\/\d{2}\/\d{4}/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a false value and sets $! to the reason. The or states that if open returns false, the statement that follows is executed, which kills your program.
So, look through your program, find any place where you assume that something works as expected, and think about what happens if it doesn't. Then, add checks in your program to do something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. Whatever you do, check for possible errors (especially from user input) and handle them.
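As one small illustration of that kind of check, here is a sketch that verifies the match on the "Line" field succeeded before using the captures. The helper name split_line_field and the sample values are hypothetical, not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (country, name) on success, or an empty list
# (after a warning) when the field is not in the expected "XXX - Name" shape.
sub split_line_field {
    my ($field) = @_;
    if ( $field =~ /^(.+) - (.*)$/ ) {
        return ( $1, $2 );
    }
    warn qq(Unexpected "Line" field: "$field"\n);
    return;
}

for my $field ( "ARG - NAME 1", "garbage" ) {
    my ( $country, $name ) = split_line_field($field)
        or next;    # skip this record instead of using undefined values
    print "$country, $name\n";
}
```

Here a bad field is reported and skipped; depending on the program, dying or logging might be the better reaction.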
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.
I'm using the following regular expression to find the exact occurrences in infinitives. Flag is global.
(?!to )(?<!\w) (' + word_to_search + ') (?!\w)
To give example of what I'm trying to achieve
looking for out should not bring : to outlaw
looking for out could bring : to be out of line
looking for to should not bring : to etc. just because it matches the first to
I've already done these steps, however, to cross out/off should be in the result list too. Is there any way to create an exception without compromising what I have achieved?
Thank you.
I'm still not sure I understand the question. You want to match something that looks like an infinitive verb phrase and contains the whole word word_to_search? Try this:
"\\bto\\s(?:\\w+[\\s/])*" + word_to_search + "\\b"
Remember, when you create a regex in the form of a string literal, you have to escape the backslashes. If you tried to use "\b" to specify a word boundary, it would have been interpreted as a backspace.
I know the OR operator, but the question was rather how to organize the structure so it can look ahead and behind. I'm going to explain what I have done so far:
var strPattern:String = '(?!to )(?<!\w) (' + word_to_search + ') (?!\w)|';
strPattern+='(?!to )(?<!\w) (' + word_to_search + '\/)|';
strPattern+='(?!to )(\/' + word_to_search + ')';
var pattern:RegExp = new RegExp(strPattern, "g");
The first line is the same as the line in my question; it searches for structures like to bail out for cases where you type out. The second line is for matching structures like to cross out/off. But we need something else to match to cross out/off if the word is off, so the third line adds that extra condition.