I have the following format for my CSV file that I downloaded online and need to parse it. I want to be able to get rid of the 'unit' column. How can I go about doing this? I want to be able to do it as I parse through the file and not copy it to a different file because the file is very large. Thanks!
<radio>,<mcc>,<net>,<area>,<cell>,<unit>,<lon>,<lat>,<range>,<samples>,<changeable>,<created>,<updated>,<averageSignal>
UMTS,262,2,801,86355,,13.28527,52.521711,37,7,1,1282569574,1300175362,-91
GSM,262,2,801,1795,,13.276605,52.525348,5714,9,1,1282569574,1300175362,-87
#!/usr/bin/env perl
use strict;
use warnings;
my $col_to_delete = '<unit>';
chomp ( my @header = split /,/, <> );
my @newheader = grep { not $_ eq $col_to_delete } @header;
print join (",", @newheader),"\n";
while ( <> ) {
    chomp;
    my %row;
    @row{@header} = split /,/;
    print join (",", @row{@newheader}),"\n";
}
Reads STDIN line by line, and prints as you go, so you don't need lots of memory.
You can probably do it via in-place editing, but you'll want to test it to make sure that it works satisfactorily first. You'd do this by setting $^I, but really you're better off creating two files, because it's shockingly easy to clobber one accidentally.
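As a variation on the script above, here is a minimal sketch that works with column positions instead of building a hash for every row. The helper names keep_indices and drop_column are made up for illustration, and the sample row is a shortened version of the data in the question. Passing -1 as the limit to split is my own addition: it keeps trailing empty fields, which plain split /,/ would drop.

```perl
use strict;
use warnings;

# Hypothetical helper: given the header fields and a column name,
# return the indices of the columns to keep.
sub keep_indices {
    my ($header, $col_to_delete) = @_;
    return grep { $header->[$_] ne $col_to_delete } 0 .. $#{$header};
}

# Hypothetical helper: split one CSV line and keep only the wanted columns.
# The limit of -1 stops split from dropping trailing empty fields.
sub drop_column {
    my ($line, @keep) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;
    return join ',', @fields[@keep];
}

# Shortened sample header and row, for illustration only.
my @header = split /,/, '<radio>,<mcc>,<net>,<area>,<cell>,<unit>,<lon>,<lat>', -1;
my @keep   = keep_indices(\@header, '<unit>');
print drop_column('UMTS,262,2,801,86355,,13.28527,52.521711', @keep), "\n";
```

In a real run you would compute @keep once from the first line and then call drop_column inside the while ( <> ) loop, so memory use stays at one line at a time.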
First a quick intro. I'm new here, so if I screw up a post, please let me know and I'll fix it.
I've been trying to accomplish my goal using perl, but I'm stuck. I don't need to use perl to accomplish it, but I figure it's that, or Excel and I like perl better. If you have a better method please share.
I start with a file (output from a log file). It is 1 line, with fields delimited by colons. Here is an example of the file:
RmDenySumm:SGID=46244:Req=15000:tsid=46244:AllocBw=38332:BwList=12456/12500/3750/5876/3750:tsid=63042:AllocBw=38750:BwList=15000/12500/3750/3750/3750:tsid=63043:AllocBw=36717:BwList=14706/12500/3750/5761:tsid=63044:AllocBw=37011:BwList=15000/12500/5761/3750:tsid=61741:AllocBw=38450:BwList=12339/3750/6501/12502/3357:tsid=61721:AllocBw=37460:BwList=12500/15000/4200/5760:tsid=2072:AllocBw=31975:BwList=12136/12339/3750/3750:tsid=2073:AllocBw=24260:BwList=14634/5876/3750:tsid=30842:AllocBw=38453:BwList=14634/12500/5761/5557:tsid=30843:AllocBw=37105:BwList=15000/15000/3750/3355:tsid=30844:AllocBw=38295:BwList=14706/12339/3750/3750/3750:tsid=30845:AllocBw=25601:BwList=5762/12339/3750/3750:tsid=30846:AllocBw=38455:BwList=15000/12136/5761/5557:tsid=30847:AllocBw=26974:BwList=14634/12339:tsid=30848:AllocBw=29634:BwList=14634/15000:tsid=30849:AllocBw=37338:BwList=14838/15000/3750/3750:tsid=60958:AllocBw=36898:BwList=12339/12500/6501/5557:tsid=60959:AllocBw=37178:BwList=12339/12500/12339:tsid=60960:AllocBw=27339:BwList=12339/15000:tsid=60962:AllocBw=34839:BwList=12339/3750/15000/3750:tsid=60963:AllocBw=37500:BwList=15000/15000/3750/3750:tsid=60964:AllocBw=38346:BwList=15000/3754/15000/4592:tsid=60965:AllocBw=24626:BwList=15000/5876/3750:tsid=60966:AllocBw=34513:BwList=12502/12500/5761/3750
I need to grab all of the "AllocBw=######" fields, separate the number part from the "AllocBw", add the numbers together, then subtract the sum from a set value.
In perl, I have this:
#!/usr/bin/perl -w
use Data::Dumper;
#
#
my $file = "/home/nick/perl/svcgroup.txt";
my @asplit;
my $c = 0;
open (FILE, "<", $file) or die "Can't open file".$!."\n";
while (<FILE>) {
    $_ =~ s/\n//g;
    push(@asplit, split (":", $_));
    #print Dumper @asplit;
}
foreach $splits (@asplit) {
    if ($splits =~ m/AllocBw/) {
        print $splits."\n";
    }
}
#print Dumper @asplit;
print "\n\n";
close FILE;
exit;
Which leaves me with:
AllocBw=38332
AllocBw=38750
AllocBw=36717
AllocBw=37011
AllocBw=38450
AllocBw=37460
AllocBw=31975
AllocBw=24260
AllocBw=38453
AllocBw=37105
AllocBw=38295
AllocBw=25601
AllocBw=38455
AllocBw=26974
AllocBw=29634
AllocBw=37338
AllocBw=36898
AllocBw=37178
AllocBw=27339
AllocBw=34839
AllocBw=37500
AllocBw=38346
AllocBw=24626
AllocBw=34513
This is where I get stuck. I'm not sure how to strip these values down to the number and add them up.
If someone can assist, I'd be grateful. If this is more easily accomplished using something other than Perl, that's fine too. My programming scope is limited, as I only make small scripts to accomplish small repetitive tasks at work.
EDIT FOR BORODIN
ie (not formatted like this, this is just for illustration):
AllocBw 12575+
AllocBw 12568+
AllocBw 12358 = TotAllocBw 37501
MaxBw 38800*3=116400
116400(MaxBw) - 37501(TotAllocBw) = TotAvaiBw 78899
This would just be a big bonus. The script you wrote works perfectly well for my purposes and I can adapt it as I need. Thanks again! Much appreciated. I was able to follow everything you did differently in the script and learned some new stuff.. Thanks for that as well.
It is simplest to use a global regular expression match to find all occurrences of AllocBw=... in each line of your input file.
This program's outer while loop iterates over all the lines in the input file; since your file holds a single line, its body should execute only once.
The inner while iterates over all instances of the regex pattern AllocBw=(\d+) (AllocBw= followed by any number of decimal digits) and captures the numeric value into $1.
The captured number is added to $total each time, and can simply be printed at the end.
use strict;
use warnings;
my $file = '/home/nick/perl/svcgroup.txt';
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my $total = 0;
while ( <$fh> ) {
$total += $1 while /AllocBw=(\d+)/g;
}
printf "Total: %d\n", $total;
output
Total: 826049
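The question also asks for the total to be subtracted from a set value. A minimal sketch of that last step, assuming the maximum is the 38800*3 figure from the illustration in the question; the helper name total_alloc_bw and the short sample line are made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: sum every AllocBw=<digits> value found in $text.
sub total_alloc_bw {
    my ($text) = @_;
    my $total = 0;
    $total += $1 while $text =~ /AllocBw=(\d+)/g;
    return $total;
}

# Assumed values, taken from the illustration in the question.
my $max_bw = 38800 * 3;
my $line   = 'tsid=1:AllocBw=12575:tsid=2:AllocBw=12568:tsid=3:AllocBw=12358';

my $total = total_alloc_bw($line);
printf "TotAllocBw: %d\n", $total;
printf "TotAvaiBw: %d\n",  $max_bw - $total;
```

With the three sample values this prints 37501 and 78899, the same figures as in the illustration.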
I am a first year grad student who's relatively new in computational biology. I recently started using Perl and it's not the easiest language to learn, at least not for me.
I need help applying my idea/logic the right way to figure out the solution to my problem.
I have a dna string and I want to split it at specific sites to get multiple fragments using information from an enzyme file that contains lines of recognition sites. Once the fragments are obtained, I want to output the list of dna fragments in an output file. I want to create an output file for every line in the enzyme file I am going to extract the information from, to apply it to the dna string.
Here's what I mean exactly:
Hypothetical scenario:
Enzyme.File contains:
abc/at'gtct// (abc is the name of the enzyme. (atgtct) is the recognition site.)
def/cgg'ataaa// ........
Suppose the dna string is: $dna = "accggttatgtctaaacggataaagtctcggataaattt" (recognition sites are bolded)
For line 1
When I extract the info from the first line/enzyme(abc) from the enzyme file and apply it to this string, the output should be:
accggttat
gtctaaacggataaagtctcggataaattt
(split between at'gtct) the apostrophe represents the cut point
(note: Even though there is another gtct in the string, it does not split it because at ought to precede it.)
For line 2
$dna = accggttatgtctaaacggataaagtctcggataaattt (Info is applied to same dna string)
Info from line/enzyme 2 (def) would split the dna as follow:
accggttatgtctaaacgg (split between cgg'ataaa)
ataaagtctcgg
ataaattt
I want to put each output from the different lines in separate file with distinct names. (I can take care of assigning the names)
So in conclusion, this example would create two new files, one named "abc_whatever" and one named "def_whatever". Important: If the enzyme file had 8 lines with different enzymes, I would get 8 new output files with their distinct dna fragments.
Here's what I've tried so far:
#!/usr/bin/perl;
use warnings;
use strict;
open(ENZ,$ARGV[0]) || die; # ENZ(file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>) {
if ( match pattern etc..) { # I took care of that and created captured groups of
$1 = holds "abc" # the info I needed from the line e.g. I captured
$2 = ..."at" # (abc)/(at)'(gtct)//, so they are stored in $1,$2,$3
$3 = ..."gtct" # respectively
}
while (<$dna>){
my @fragments_array = split(/$3/, $dna);
open (OutFile, ">$dna"."_"."$1")
print OutFile shift @fragments_array,"\n";
foreach (@fragments_array) {
print OutFile "$3$_\n";
close OutFile;
}
}
}
close ENZ;
FIRST
I can only create an output for the 1st line in the Enzyme file. I want to create an output file for all the lines.
SECOND
I am not properly cutting the dna. From other examples I have seen online, it looks like I am gonna have to use the following functions to properly apply the enzyme information on the dna. The functions include:
the for loop, length and substr(),
If you can, please demonstrate your work in the simplest form (no extravagant, impressing codes lol :-) since I am just learning this language)
Thanks in advance!
FIRST I can only create an output only for the 1st line in the Enzyme file. I want to create an output file for all the lines.
That's simply because you put close OutFile; into the foreach (@fragments_array) loop, instead of placing the close after the loop body.
SECOND I am not properly cutting the dna.
That's because you forgot to include $2, the head of the recognition site (e.g. the at of atgtct), in the split pattern as well as in the output.
The problem is easier to solve if we just insert a newline character between the head and the tail of every recognition site:
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]) || die; # ENZ (file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>)
{
if (m-(.*)/(.*)'(.*)//-)
{
my ($head, $tail) = ($2, $3); # $2$3 is the recognition site; save it
open(OutFile, ">${dna}_$1");
(my $fragments = $dna) =~ s/$head$tail/$head\n$tail/g; # insert NLs
print OutFile $fragments, "\n";
close OutFile;
}
}
close ENZ;
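If you would rather have the fragments as a list (for example to count them or write them one by one), the same cut can be sketched with split and a zero-width pattern. This is only an illustration; the helper name cut_dna is made up, and \Q...\E is there in case a site ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $dna between $head and $tail at every recognition site.
sub cut_dna {
    my ($dna, $head, $tail) = @_;
    # Zero-width split: the cut position must be preceded by $head
    # and followed by $tail, so no characters are consumed.
    return split /(?<=\Q$head\E)(?=\Q$tail\E)/, $dna;
}

my $dna = 'accggttatgtctaaacggataaagtctcggataaattt';
print "$_\n" for cut_dna($dna, 'cgg', 'ataaa');
```

For the def enzyme (cgg'ataaa) this prints the three fragments listed in the question.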
I changed your code a bit, hope it works now
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]);
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
my ($enzyme, $first, $second) = ("", "", "");
for my $line (<ENZ>) {
chomp($line); # remove \n at the end of string
my @elements = split(/\/|'/, $line); # split string into tokens (e.g. abc/at'gtct// => (abc, at, gtct)); split drops the trailing empty fields left by "//"
my ($firstPart, $secondPart) = ($elements[1], $elements[2]);
if ($dna =~ /(.*)$firstPart$secondPart(.*)/) { # greedy (.*): cuts only at the last occurrence of the site
$first = $1 . $firstPart;
$second = $secondPart . $2;
$enzyme = $elements[0];
open(OUTPUT, ">$enzyme" . "_something");
print OUTPUT "$first\n$second\n";
close(OUTPUT);
}
}
close ENZ;
EDIT: this is the working version. I suggest you learn how to use regular expressions if you want to use Perl for your study. They are the strongest tool in Perl.
I need to create a perl script that reads the last modified file in a given folder (the file is always a .csv) and parses the values from their columns, so I can load them into a mysql database.
The main problem is: I need to separate the Date from the Hours, and the Country from the Names(CHN, DEU and JPN represent China, Deutschland and Japan).
They come together like in the example below:
"02/12/2014 09:00:00","3600","1","CHN - NAME1","0%","0%"
"02/12/2014 09:00:00","3600","1","DEU - NAME2","10%","75.04%"
"02/12/2014 09:00:00","3600","1","JPN - NAME3","0%","100%"
So far I can split the lines, but how can I make it understand that each value enclosed in "" and separated by , should be inserted into my arrays?
my %date;
my %hour;
my %country;
my %name;
my %percentage_one;
my %percentage_two;
# Selects latest file in the given directory
my $files = File::DirList::list('/home/cvna/IN/SCRIPTS/zabbix/roaming/tratamento_IAS/GPRS_IN', 'M');
my $file = $files->[0]->[13];
open(CONFIG_FILE,$file);
while (<CONFIG_FILE>){
# Splits the file into various lines
@lines = split(/\n/,$_);
# For each line that i get...
foreach my $line (@lines){
# I need to split the values between , without the ""
# And separating Hour from Date, and Name from Country
@aux = split(/......./,$line)
}
}
close(CONFIG_FILE);
readline or <> only reads one line. There's no need to split it on newlines. But, instead of fixing your code, use Text::CSV:
#!/usr/bin/perl
use 5.010;
use warnings;
use strict;
use Text::CSV;
my $csv = 'Text::CSV'->new({ binary => 1 }) or die 'Text::CSV'->error_diag;
while (my $row = $csv->getline(*DATA)) {
my ($date, $time) = split / /, $row->[0];
my ($country, $name) = split / - /, $row->[3];
print "Date: $date\tTime: $time\tCountry: $country\tName: $name\n";
}
__DATA__
"02/12/2014 09:00:00","3600","1","CHN - NAME1","0%","0%"
"02/12/2014 09:00:00","3600","1","DEU - NAME2","10%","75.04%"
"02/12/2014 09:00:00","3600","1","JPN - NAME3","0%","100%"
Looking at your code, it appears you're pretty new to Perl. The Text::CSV module is a nice solution, but unfortunately, isn't a standard module. You'll need to use CPAN to install it. It isn't difficult, but may require you to be the administrator of your computer.
The module Text::ParseWords is a standard module and can handle quoted words much like Text::CSV can.
You'll basically need to split the line (which I do with the parse_line function). The first parameter is , which is what I want to split my line upon. Unlike split itself, parse_line doesn't split on separators that are inside quotes, and it handles backslash-escaped quotes. This is very similar to Text::CSV.
Once you've split your line, you'll need to split date from time and country from name. In my example, I show two ways of doing this: One uses split and the other uses a matching regular expression. Either one will work.
use strict; # Lets you know when you misspell variable names
use warnings; # Warns of issues (e.g. using undefined variables)
use feature qw(say); # Lets you use 'say' instead of 'print' (No \n needed)
use Text::ParseWords;
while ( my $line = <DATA> ) {
my ($date_time, $foo, $bar, $country_name, $percent1, $percent2)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
say "$date, $time, $country, $name";
}
__DATA__
"02/12/2014 09:00:00","3600","1","CHN - NAME1","0%","0%"
"02/12/2014 09:00:00","3600","1","DEU - NAME2","10%","75.04%"
"02/12/2014 09:00:00","3600","1","JPN - NAME3","0%","100%"
In your actual program, you'll open your file, and make sure you've opened that file. You can test for that, or use autodie:
use strict; # Lets you know when you misspell variable names
use warnings; # Warns of issues (e.g. using undefined variables)
use feature qw(say); # Lets you use 'say' instead of 'print' (No \n needed)
use Text::ParseWords;
use autodie;
open my $config_file, "<", $file; # No need for testing thanks to use autodie!
# What you need to do if you don't use autodie
# open my $config_file, "<", $file or die qq(Can't open "$file" for reading);
while ( my $line = <$config_file> ) {
my ($date_time, $foo, $bar, $country_name, $percent1, $percent2)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
say "$date, $time, $country, $name"; # Show fields were correctly parsed.
}
It looks like you want to store the data: I see you have multiple hashes that I bet you're trying to keep in parallel. Take a look at how references allow you to build more complex structures:
my %data; #Where I'll be storing the data...
$data{$key}->{DATE} = $date;
$data{$key}->{HOUR} = $hour;
$data{$key}->{COUNTRY} = $country;
...
Now, all of your data is in %data. You can pass it around from place to place in your program, and not worry whether you've updated each and every single hash.
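To make that concrete, here is a rough sketch that fills one hash of hash references from the sample lines. The helper name load_rows and the "COUNTRY-NAME" key are my assumptions for the example; pick whatever key actually identifies a row in your data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: parse CSV lines into one hash of hash references.
# The "COUNTRY-NAME" key is an assumption; use whatever identifies a row.
sub load_rows {
    my (@lines) = @_;
    my %data;
    for my $line (@lines) {
        chomp $line;
        my ($date_time, undef, undef, $country_name, $percent1, $percent2)
            = parse_line(',', 0, $line);
        my ($date, $hour)    = split /\s+/, $date_time;
        my ($country, $name) = split / - /, $country_name;
        $data{"$country-$name"} = {
            DATE        => $date,
            HOUR        => $hour,
            COUNTRY     => $country,
            NAME        => $name,
            PERCENT_ONE => $percent1,
            PERCENT_TWO => $percent2,
        };
    }
    return \%data;
}

my $data = load_rows(
    '"02/12/2014 09:00:00","3600","1","CHN - NAME1","0%","0%"',
    '"02/12/2014 09:00:00","3600","1","DEU - NAME2","10%","75.04%"',
);
print "$_: $data->{$_}{DATE} $data->{$_}{HOUR}\n" for sort keys %$data;
```

Everything about one row now lives under one key, so there is a single structure to pass around instead of six parallel hashes.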
Once you get the hang of references, you are on your way to writing Object Oriented Perl code.
Get a good book on Modern Perl too. Perl coding techniques have changed quite a bit since Perl 5 was released. Unfortunately, most people never learn the way Perl should be written because they learn from old books that are lying around, or from looking at older code written in the Perl 3 and Perl 4 error (pun intended). Perl is a flexible and powerful language that allows you to quickly generate yourself enough rope to hang yourself. Learning good programming techniques will allow you to write more complex and comprehensive programs that are actually easier to read and maintain.
Almost complete program...
Here's the complete program that finds the most recent file in a particular directory, then reads in that file and parses the lines.
I'm using the -M file test. This file test returns the time since the file was last modified, expressed as the age of the file in days, measured from when the program started. For example, a file that was last modified 2 1/2 days ago will return 2.5 while a file last modified one day and four hours ago will return 1.16666667. You can use this to compare the age of the various files.
This program works with Perl 5.8.8 without installing any new modules, and I've tested it with data I've made up.
You can see I use open ... or die ...; without any issues. Are you getting some other error? Do you have use strict; and use warnings; set in your program?
#! /usr/bin/env perl
#
use strict; # Lets you know when you misspell variable names
use warnings; # Warns of issues (e.g. using undefined variables)
use Text::ParseWords;
use Benchmark;
use constant {
DATA_FILE_DIR => "temp",
};
#
# Find newest file in the directory
#
opendir my $data_dir, DATA_FILE_DIR
or die qq(Cannot open directory for reading.);
my $newest_file;
while ( my $file = readdir $data_dir ) {
next if $file eq "." or $file eq "..";
my $full_name = DATA_FILE_DIR . "/" . $file;
if ( not defined $newest_file
or -M $full_name < -M $newest_file ) {
$newest_file = $full_name;
}
}
print qq(Using file is "$newest_file"\n);
closedir $data_dir;
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
# Read in the entire line
my ($date_time, $foo, $bar, $country_name, $percent1, $percent2)
= parse_line ',', 0, $line;
# Split the DATE/TIME field
my ($date, $time) = split /\s+/, $date_time;
# Split the Country/Name field
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
# Print statement merely shows that these four fields are truly split.
print "$date, $time, $country, $name\n";
}
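Since the question says the file is always a .csv, the newest-file search could also be narrowed to regular .csv files. A minimal sketch, using only core Perl; the helper name newest_csv is made up, and it compares modification times from stat instead of -M (larger epoch time means newer):

```perl
use strict;
use warnings;

# Hypothetical helper: return the most recently modified .csv file in $dir,
# or undef if there is none. stat()[9] is the modification time in seconds
# since the epoch, so the largest value is the newest file.
sub newest_csv {
    my ($dir) = @_;
    opendir my $dh, $dir or die qq(Cannot open directory "$dir": $!);
    my @files = grep { /\.csv\z/i && -f $_ } map { "$dir/$_" } readdir $dh;
    closedir $dh;
    my ($newest) = sort { (stat($b))[9] <=> (stat($a))[9] } @files;
    return $newest;
}
```

You would call it as my $newest_file = newest_csv("temp"); and then open $newest_file exactly as in the program above.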
I have a file with strings in each row as follows
"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1
the next line could look like
84545,X,2
I'm trying to parse this text in Perl. Note: quotes are present in the strings when there are several items in a row, but not when there is only one item.
I would like to parse each item into an array. I tried the following regex
@fields = ($_ =~ /(\d+\_\d+),*/g);
but it is missing the last 2714. How do I capture that edge case? Any help appreciated. Thanks in advance
It looks like you have a CSV File, so use an actual CSV parser for it like Text::CSV.
After you parse the columns, you can separate your first field into the array:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};
if ($csv->parse($line)) {
my @columns = $csv->fields();
my @nums = split ',', $columns[0];
print "@nums\n";
}
Outputs:
229269_2 190594_2 94552_2 266076_2 269628_2 165328_2 99319_2 263339_2 263300_2 99315_2 271509_2 2714
Why not a regex ?
Yes, of course it's possible to use a regex for practically anything. But what you need to understand is that this will make your code extremely fragile and difficult to maintain.
Even if you want to use a regular expression, you should STILL do this in two steps. First separate the initial column(s) of your CSV, and then process the specific column that you're worried about.
Because you're just working with the first column, you could use code like the following:
use strict;
use warnings;
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};
if ($line =~ /^"(.*?)"|^([^,]*)/) {
my $column0 = $1 // $2;
my @nums = split ',', $column0;
print "@nums\n";
}
The above happens to accomplish the same thing as the previous code. However, it has one big flaw: it's not nearly as obvious to the maintaining programmer what's going on.
Whenever a new coder, or even yourself in 6 months, views the first set of code, it is extremely obvious what format your data is in. You're working with a CSV file, and the first column is a list separated by commas. The second code also works, but the new maintainer must actually read the regex and figure out what's going on to understand both what format the data is in, and whether the code is actually doing it correctly.
Anyway, do whatever you will, but I strongly advise you to use an actual CSV Parser for parsing csv files.
If all you want is all but the last two fields...
my $string = qq("229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1);
$string =~ s/"//g; # delete the quotes
my @f = split (/,/, $string); # split on the comma
pop @f; pop @f; # jettison the last two columns
# @f contains what you're looking for
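If installing Text::CSV is not an option, the same two-step idea (take the first column, then split it) can be sketched with the core module Text::ParseWords, which copes with both the quoted and the unquoted form of the first column. This swaps in a different module from the one recommended above, and the helper name first_column_items is made up; the sample line is shortened.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: return the items of the first CSV column,
# whether or not that column was quoted.
sub first_column_items {
    my ($line) = @_;
    chomp $line;
    my ($first) = parse_line(',', 0, $line);   # 0 = strip the quotes
    return split /,/, $first;
}

print join(' ', first_column_items('"229269_2,190594_2,2714",A,1')), "\n";
print join(' ', first_column_items('84545,X,2')), "\n";
```

The first call yields three items (including the trailing 2714 that the original regex missed), the second yields the single item 84545.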
scenario: I am a Jr. C# developer, but recently (3 days) began learning Perl for batch files. I have a requirement to parse through a text file, extract some key data, then output the key data to a new text file. As seems to always be the case, there are butt loads of fragmented examples on the net regarding how to 'read' from a file, 'write' to a file, 'store' line by line into an array, 'filter' this and that, yadda yadda, but nothing discussing the entire process of read, filter, write. Trying to splice examples from the net together is no good, because none seem to work together as coherent code. Coming from C#, Perl's syntax structure is hella confusing. I just need some advice on this process.
My objective is to parse a text file, single out all lines similar to the one below, by date, and output only the first 8 digits of the 2nd number group and 5 digits from the 3rd number group to a new text file.
11122 20100223454345 ....random text..... [keyword that identifies all the
entries I need]... random text 0.0034543345
I know regex is likely the best option, and have most of the expression written, but it does not work in Perl!
Question: Could someone please show a simple (dummy) example of how to read from, filter (using dummy regex) the file, then output the (dummy) results to a new file? I'm not concerned with functional details, I can learn those, I just need the syntax structure Perl uses. For example:
open(FH, '<', 'dummy1.txt')
open(NFH, '>', 'dummy2.txt')
@array; or $dumb;
while(<FH>)
{
filter each line [REGEX] and shove it into [@array or $dumb scalar]
}
print(join(',', @array)) to dummy2.txt
close FH;
close NFH;
Note: For various reasons, I cannot paste my source code in here, sorry. Any help is appreciated.
UPDATE: ANSWER:
Much thanks to all those who provided insight into my issue. After reading through your replies, as well as conducting further research, I learned that there are dozens of ways to accomplish the same task in Perl (which I am not a fan of). In the end, this is how I solved the problem, and IMO it's the cleanest, and most succinct, solution for those having similar struggles. Thanks again for all the help.
#======================================================================
# 1. READ FILE: inputFile.txt
# 2. CREATE FILE: outputFile.txt
# 3. WRITE TO: outputFile.txt IF line matches REGEX constraints
# 4. CLOSE FILES: outputFile.txt & inputFile.txt
#==========================================================================
#1
$readFile = 'C:/.../.../inputFile.txt';
open(FH, '<', $readFile) or Error("Could not read file ($!)");
#2
$writeFile = 'C:/.../.../outputFile.txt';
open(NFH, '>', $writeFile) or Error("Cannot write to file ($!)");
#3
@lines = <FH>;
LINE: foreach $line (@lines)
{
if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/)
{
$date = $1;
$elapsedtime = $2;
print NFH "$date,$elapsedtime\n";
}
}
#4
close NFH;
close FH;
perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file? covers most of the different scenarios for how to use files.
However, I will add that you should always start your scripts with use strict; and use warnings;, and because you're doing file processing, use autodie; will serve you well too.
With that in mind, a quick stub would be the following:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'dummy1.txt';
open my $outfh, '>', 'dummy2.txt';
while (my $line = <$infh>) {
chomp $line; # Remove \n
if ($line =~ /your_regex_here/) { # whatever processing you need goes here
print $outfh "your new data\n"; # no comma after the filehandle
}
}
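For a version of that stub you can actually run, here is a small sketch of the whole read, filter, write cycle. The helper name filter_lines is made up, the regex is the one from the asker's own solution, and in-memory filehandles stand in for dummy1.txt and dummy2.txt so the example needs no files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: for every line of $in_fh that matches $regex,
# write the captured groups to $out_fh, joined with commas.
sub filter_lines {
    my ($in_fh, $out_fh, $regex) = @_;
    while (my $line = <$in_fh>) {
        chomp $line;
        if (my @captures = $line =~ $regex) {
            print {$out_fh} join(',', @captures), "\n";
        }
    }
}

# In-memory filehandles stand in for dummy1.txt and dummy2.txt here.
my $input = "11122 20140301 xx KEYWORD time was 0.25\nskip me\n";
open my $in_fh,  '<', \$input     or die "Cannot read input: $!";
open my $out_fh, '>', \my $output or die "Cannot write output: $!";
filter_lines($in_fh, $out_fh, qr/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/);
close $out_fh;
print $output;
```

To work on real files, replace the two references with file names; the loop itself does not change.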
while(<FH>)
{
# variable $_ contains the current line
if(m/regex_goes_here/) #by default, the regex match operator m// attempts to match the default $_ variable
{
#do actions
}
}
Also note, m/regex/ is the same as /regex/
Refer to:
http://perldoc.perl.org/perlvar.html#General-Variables
http://perldoc.perl.org/perlre.html
For capturing variables from regex match, THIS might help
EDIT
If you want a different variable than the default $_, as @Miller suggested, use while($line = <FH>) followed by if($line =~ m/regex_goes_here/)
=~ is the Binding Operator
One tip. Don't explicitly open filehandles to your input and output files. Instead read from STDIN and write to STDOUT. Your program will be far more flexible and easier to use as you'll be able to treat it like a Unix filter.
$ your_filter_program < your_input.txt > your_output.txt
And doing this actually makes your program simpler to write too.
while (<>) { # <> reads from STDIN (or from files named on the command line)
# transform your data (which is in $_) in some way
...
print; # prints $_ to STDOUT
}
You might find the first few chapters of Data Munging with Perl are useful.
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
INPUT_FILE => "NAME_OF_INPUT_FILE",
OUTPUT_FILE => "NAME_OF_OUTPUT_FILE",
FILTER => qr/regex_for_line_to_filter/,
};
open my $in_fh, "<", INPUT_FILE;
open my $out_fh, ">", OUTPUT_FILE;
while ( my $line = <$in_fh> ) {
chomp $line;
next unless $line =~ FILTER;
$line =~ s/regular_expression/replacement/;
say {$out_fh} $line;
}
close $in_fh;
close $out_fh;
The $in_fh is your input file, and $out_fh is your output file. I basically open both, and loop through the input. The chomp removes the \n from the end. I always recommend doing that.
The next goes to the next iteration of the loop unless I match FILTER which is a regular expression matching lines you want to keep. This is identical to:
if ( $line !~ FILTER ) {
next;
}
I then use the substitution command to get the parts of the line I want, and munge them into the output I want. I may be better off expanding this a bit, maybe using split to break my line into pieces and then using only the pieces I want. I could then use substr to pull out substrings from the selected pieces.
The say command is like print except it automatically adds in a NL on the end. This is how you write a line to a file.
Now, get Learning Perl and read it. If you know any programming, it shouldn't take you more than a week to go through the first half of the book. That should be more than enough to be able to write a program like this. The more complex stuff like references and object orientation might take a bit longer.
Online documentation can be found at http://perldoc.perl.org. You can look up the use statements, which are called pragmas, over there. Documentation on the individual functions is also available.
If I understood well, this one liner will do the job:
perl -ane 'print substr($F[1],0,8),"\t",substr($F[-1],0,5),"\n" if /keyword/' in.txt
Assuming in.txt is:
11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345
11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345
11122 40100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.65487
11122 50100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.6215
output:
20100223 0.003
40100223 0.654
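If the one-liner is hard to read, the same logic can be spelled out as a small script. This is only a sketch: the helper name extract_fields is made up, and split ' ' stands in for what the -a switch does (split on whitespace into @F).

```perl
use strict;
use warnings;

# Hypothetical helper: the same logic as the one-liner, written as a sub.
# Returns an empty list if the line does not contain the keyword.
sub extract_fields {
    my ($line) = @_;
    return unless $line =~ /keyword/;
    my @F = split ' ', $line;          # what -a does: split on whitespace
    return (substr($F[1], 0, 8), substr($F[-1], 0, 5));
}

my @lines = (
    '11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345',
    '11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345',
);
for my $line (@lines) {
    my @out = extract_fields($line) or next;   # skip lines without the keyword
    print join("\t", @out), "\n";
}
```

Only the first sample line contains the keyword, so only one output line is printed.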