space delimited file handling - regex

I have insider transactions of a company in a space delimited file. Sample data looks like the following:
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
Col1 is the serial number, which I don't need to print.
Col2 is the name of the person who made the trades. This column is not consistent: it has a first name, last name, and middle initial, and for some insiders a suffix as well (Mr, Dr, Jr, etc.).
Col3 is the date, in the format Month Day,Year.
Col4 is the number of shares traded.
Col5 is the price at which shares were purchased or sold.
I need your help to print each column value separately. Thanks.

Count the total number of fields read; the difference between that and the number of non-name fields gives you the width of the name.
#!/bin/bash
# uses bash features, so needs a /bin/bash shebang, not /bin/sh
# read all fields of each line into an array
while read -r -a fields; do
    # calculate name width assuming 5 non-name fields
    name_width=$(( ${#fields[@]} - 5 ))
    cur_field=0
    # read initial serial number
    ser_id=${fields[cur_field]}; (( ++cur_field ))
    # read name
    name=''
    for ((i=0; i<name_width; i++)); do
        name+=" ${fields[cur_field]}"; (( ++cur_field ))
    done
    name=${name# } # trim leading space
    # date spans two fields due to containing a space
    date=${fields[cur_field]}; (( ++cur_field ))
    date+=" ${fields[cur_field]}"; (( ++cur_field ))
    # final fields are one span each
    num_shares=${fields[cur_field]}; (( ++cur_field ))
    price=${fields[cur_field]}; (( ++cur_field ))
    # print in newline-delimited form
    printf '%s\n' "$ser_id" "$name" "$date" "$num_shares" "$price" ""
done
Run as follows (if you saved the script as process):
./process <input.txt >output.txt

It might be a little easier in Perl.
perl -lane '
    @date = splice @F, -4, 2;
    @left = splice @F, -2, 2;
    splice @F, 0, 1;
    print join "|", "@F", "@date", @left
' file
Gilliland Michael S|January 2,2013|20,000|19
Still George J Jr|January 2,2013|20,000|19
Bishkin S. James|February 1,2013|150,000|21
Mellin Mark P|May 28,2013|238,000|25.26
You can change the delimiter in the join to suit your requirements.

Here is the data separated using awk
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file
1|Gilliland Michael S|January 2,2013|20,000|19
2|Still George J Jr|January 2,2013|20,000|19
3|Bishkin S. James|February 1,2013|150,000|21
4|Mellin Mark P|May 28,2013|238,000|25.26
You now have your data in variables c1 to c5.
Or better displayed here:
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file | column -t -s "|"
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26

Related

AWK comparing string date value from task list to today's date

I have a todo.txt task list that I'd like to filter to show any tasks scheduled for today or for dates in the future - i.e. show nothing scheduled for past dates, and only show tasks that have a date scheduled.
The file's lines and field order sometimes change to include a 'threshold' date (think snooze/postpone the task until then) in the format t:YYYY-MM-DD (as produced by date +%Y-%m-%d), which says 'don't start this task until this date'.
Data file:
50 (A) Testing due date due:2018-09-22 t:2018-09-25
04 (B) Buy Socks, Underwear t:2018-09-22
05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
16 (C) Watch Thor Ragnarock
12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
x 2018-09-21 pri:B Buy Prebiotics +health #web due:2018-09-21
So far I've come up with this:
cat t | awk -F: -v date="$(date +%Y-%m-%d)" '/due:|t:/ $2 >= date || $3 >= date { print $0}'|
nl
Problem is, the date comparison works on the "due:" field, since it usually comes before the "t:" field. Also, entries older than today are output.
Output:
1 50 (A) Testing due date due:2018-09-22 t:2018-09-25
2 05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
3 12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
Questions:
How do I correctly make the date comparison against the "t:" value after the ":" separator if "t:" is present - and against the "due:" value if it is not?
Date greater-than (">") seems to work, but greater-than-or-equal (">=") does not.
$ cat tst.awk
{
    orig = $0
    sched = ""
    for (i=NF; i>0; i--) {
        if ( sub(/^t:/,"",$i) ) {
            sched = $i
            break
        }
        else if ( sub(/^due:/,"",$i) ) {
            sched = $i
        }
    }
    $0 = orig
}
sched >= date
$ awk -v date="$(date +%Y-%m-%d)" -f tst.awk file
50 (A) Testing due date due:2018-09-22 t:2018-09-25
05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
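A side note on why ">=" appeared to fail in the original attempt: with -F: splitting, the compared field still carries text around the date, so awk falls back to string comparison against stray characters. Bare YYYY-MM-DD strings do compare correctly, even for equality; fields with leading or trailing text match almost anything. A quick demonstration (mine, not from the answer above):

```shell
awk 'BEGIN {
  print ("2018-09-22" >= "2018-09-22")   # equal ISO dates do satisfy >=
  print ("2018-09-22 t" >= "2018-09-22") # trailing text still compares greater
  print ("t:2018-09-20" >= "2018-09-22") # "t" sorts after digits, so even old dates pass
}'
```

All three comparisons print 1, which is why stripping the t:/due: prefixes before comparing, as the script above does, is the key step.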

Search for multiple strings in many text files, count hits on combinations

I'm struggling to automate a reporting exercise, and would appreciate some pointers or advice please.
I have several hundred thousand small (<5kb) text files. Each contains a few variables, and I need to count the number of files that match each combination of variables.
Each file contains a device number, such as /001/ /002/.../006/.
Each file also contains a date string, such as 01.10.14 (dd.mm.yy)
Some files contain a 'status' string which is always "Not Settled"
I need a way to trawl through each file on a Linux system (spread across several subdirectories) and produce a report file that counts, per device, how many files include each date stamp (over a 6-month range) and, for each of those dates, how many contain the status string.
The report might look like this:
device, date, total count of files
device, date, total "not settled" count
e.g.
/001/, 01.12.14, 356
/001/, 01.12.14, 12
/001/, 02.12.14, 209
/001/, 02.12.14, 8
/002/, 01.12.14, 209
/002/, 01.12.14, 7
etc etc
In other words:
Foreach /device/
Foreach <date>
count total matching files - write number to file
count total matching 'not settled' files - write number to file
Each string to match could appear anywhere in the file.
I tried using grep piped into a second (and third) grep command, but I'd like to automate this and loop through the variables (6 devices, about 180 dates, 2 status strings). I suspect Perl or Bash is the answer, but I'm out of my depth.
Please can anyone recommend an approach to this?
Edit: Some sample data as mentioned in the comments. The information is basically receipt data from tills - as would be sent to a printer. Here's a sample (identifying bits stripped out).
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37!
c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
Contents = Not Settled
In the case above, I'd be looking for /003/ , 08.01.15, and "Not Settled"
Many thanks.
First, read everything into an SQLite database, then run queries against it to your heart's content. Putting the data in an SQL database is going to save you time if you need to tweak anything. Besides, even simple SQL can tackle this kind of thing if you have the right tables set up.
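If you go that route, the skeleton is small. Everything below (database file, table, and column names) is my own illustration, not from the question; the hand-written INSERT stands in for a loop over the real files:

```shell
# One row per receipt file: device, date, and whether it settled.
rm -f report.db
sqlite3 report.db "CREATE TABLE receipts (device TEXT, day TEXT, settled INTEGER);"
# A real run would INSERT one row per parsed file; two rows by hand here:
sqlite3 report.db "INSERT INTO receipts VALUES ('/003/','08.01.15',1),('/003/','08.01.15',0);"
# Per-device, per-date totals plus the 'Not Settled' subset in one query:
sqlite3 report.db "SELECT device, day, COUNT(*), SUM(settled = 0) FROM receipts GROUP BY device, day;"
```

The GROUP BY gives both requested counts (total files, and the unsettled subset) in a single pass, which is the "simple SQL" point above.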
First of all I agree with @Sinan :-)
The following might work as a hack to make a hash out of your file data.
# report.pl
use strict;
use warnings;
use Data::Dumper;

my %report;
my ($date, $device);

while (<>) {
    next unless m/^ .*
        (?<device>\/00[1-3]\/) .*
        (?<date>\d{2}\.\d{2}\.\d{2})
        .*$/x ;
    ($date, $device) = ($+{date}, $+{device});
    $_ = <> unless eof;
    if (/Contents/) {
        $report{$date}{$device}{"u_count"}++ ;
    }
    else {
        $report{$date}{$device}{"count"}++ ;
    }
}
print Dumper(\%report);
This seems to work with a collection of data files in the format shown below. Since you don't say or show where the Contents = Not Settled line appears, I assume it is either part of the last line along with the device ID, or on a separate, final line of each file.
Explanation:
The script reads all of the files passed on the command line (e.g. as a glob) through the while (<>) {} loop. First, next unless m/ ... skips lines of input until one matches the line carrying the device and date information.
Next, the match uses named capture groups ((?<device>...) and (?<date>...)) to hold the values of the patterns it finds, and places those values in corresponding variables (($date, $device) = ($+{date}, $+{device});). These could simply be $1 and $2, but naming keeps me organized here.
Then, in case there is another line to read, $_ = <> unless eof; reads it and tries the final conditional match in order to increment the count or u_count entry.
Data file format:
file1.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
file2.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/002/132 08.01.15 11:18 A-00
Contents = Not Settled
(a set of files for testing are listed here: http://pastebin.com/raw.php?i=7ALU80fE).
perl report.pl file*.data
Data::Dumper Output:
$VAR1 = {
          '08.01.15' => {
                          '/002/' => {
                                       'u_count' => 4
                                     },
                          '/003/' => {
                                       'count' => 1
                                     }
                        },
          '08.12.15' => {
                          '/003/' => {
                                       'count' => 1
                                     }
                        }
        };
From that you can make a report by iterating through the hash with keys() (the dates) and retrieving the inner hash and count values per machine. Really it would be a good idea to have some tests to make sure everything works as expected - that, or just do as @Sinan Ünür suggests: use SQLite!
NB: this code was not extensively tested :-)
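For comparison, the same per-file extraction can be sketched in shell with awk (the regexes below are assumed from the sample receipts, not confirmed by the poster). It prints "device date status" for one file; tallying across all files is then a sort | uniq -c away:

```shell
# Hypothetical sketch: pull the device ID, the dd.mm.yy date, and the
# settled/not-settled status out of one receipt file.
extract() {
    awk '
        # the "*4300 772/080/003/132 08.01.15 ..." line carries device and date
        match($0, /\/00[0-9]\//) { dev = substr($0, RSTART, RLENGTH) }
        match($0, /[0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]/) { day = substr($0, RSTART, RLENGTH) }
        /Not Settled/ { unsettled = 1 }
        END { if (dev != "") print dev, day, (unsettled ? "not_settled" : "settled") }
    ' "$1"
}
# e.g.:  for f in */*.txt; do extract "$f"; done | sort | uniq -c
```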

How to select a column from a file with command line

I have a file that appears as follows:
some random text : azoidfalkrnalrkazlkja
zlazekamzlekazmlekalzkemlkmlkmlkmlkmlkml
o&kjoik&oék"&po"éképo"k&éo"kéo"koé"kk"k"
Column1 Column2 Column3 Column4 Column5
=======================================
0 1 1000 No_Light X Disabled (Persistent)
1 1 1010 Online X E-Port 10:20:30:40:50:60:70:80 "some comment"
2 1 1020 Online X F-Port 10:00:00:00:00:00:00:00
3 1 1030 No_Light X Disabled (Persistent)
I can extract all "Online" status with grep "^ *[0-9].*Online" ./myfile. How can I then extract further information for each line (for instance, add each value to a $COLUMN variable) ?
I would like to extract all data from the 3rd column, and then treat the result as an array to extract the data from each line.
EDIT: Quoting Jotne's answer, I did something like this:
COLUMN=3
MYVARIABLE=($(awk '/Online/ {print $c}' c="$COLUMN" file))
echo ${MYVARIABLE[0]}
To get information from, e.g., column #3 on lines that are Online:
COLUMN=3
awk '/Online/ {print $c}' c="$COLUMN" file
1010
1020
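An equivalent form (my stylistic variant, not from the original answer) passes the shell variable with awk's -v option instead of a trailing assignment:

```shell
# -v assigns the awk variable before any input is read.
COLUMN=3
printf '%s\n' \
  '0 1 1000 No_Light X Disabled (Persistent)' \
  '1 1 1010 Online X E-Port 10:20:30:40:50:60:70:80 "some comment"' \
  '2 1 1020 Online X F-Port 10:00:00:00:00:00:00:00' |
awk -v c="$COLUMN" '/Online/ {print $c}'
```

The practical difference is that a -v variable is already set in a BEGIN block, whereas a trailing file-style assignment takes effect only once reading starts.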

Separating a string in Excel VBA

I have a series (thousands and thousands) of call records that I'm trying to create a spreadsheet of. They're all in a text file. The format looks like this:
12/ 13/ 05 Syracuse, NY 10: 22 AM 111- 111- 1111 2 $ - $ - $ -
12/ 13/ 05 New York, NY 10: 28 AM 111- 111- 1111 (F) 2 $ - $ - $ -
12/ 13/ 05 Orlando, FL 10: 48 AM 111- 111- 1111 (F) 4 $ - $ - $ -
3/ 9/ 09 Internal 4: 51 PM 111- 111- 1111 (E) 23 $ - $ - $ -
10/ 14/ 11 Colorado Site 8: 12 AM 111- 111- 1111 14 $ - $ - $ -
1/ 3/ 12 Dept 27 3: 16 PM 111- 111- 1111 (F) 93 $ - $ - $ -
11/ 12/ 12 Internal 3: 13 PM 18765 (E) 16 $ - $ - $ -
11/ 14/ 12 Internal 11: 43 AM John Doe 3 $ - $ - $ -
Month/ day/ year/ city called, STATE HH: MM APM 123- 456 7890 OptionalCode $Charge $Tax $Total
This is, minus details, directly from the file. There are no quotes around strings and no tabs. I tried to use Text to Columns, but some city names contain a space and others don't.
Anyone want to point me in the right direction? RegEx maybe (Which I've heard of but never used)? Something else?
Update:
Thanks for the early feedback. The lines are actual data from my file, though I stripped city names and phone numbers. I've updated with the city information to show the variance there. As best I can see, none of the city names contain a comma, but I'm dealing with close to 120,000 lines total and, obviously, haven't checked them all.
The city won't always have a space - Syracuse above doesn't; New York, however, does. The month and day, too, aren't always 2 digits, which also throws off checks for length. I can read to the first, then second, forward slash, though - those delimiters are fixed after the month and day values.
And the bracketed code doesn't always appear: sometimes it's there, sometimes not, though it only ever seems to be one letter when it does.
I hope this clears a few things up. This would have been far easier if the data had been stored correctly in the first place. Sigh.
Updates 2,3 & 4 Added a few lines from call log changes per Robin's request.
I know you asked for a VBA solution, but I do my call record parsing purely in a spreadsheet with formulae.
I have uploaded a workbook solution here (version 3).
Once you have the workbook open, copy and paste the contents of your text file into cell A2. Then fill down the range B2:X2 as far as necessary.
The formulae will work with any variation in the length of the month, day, year, city, state, time, code, charge, tax and total.
Let me know if any lines break. You can easily check for these by using the AutoFilter dropdown in the headers to select for errors/extraneous values. Append any offending lines to your question.
Updates:
Version 2 takes care of the situation where the City field contains a location name, and the State field is blank.
Version 3 takes care of the situation where the Phone Number field contains an extension number or name.
Something like this might work if there are no commas in the city name.
Sub foo()
    thisLine = "12/ 13/ 05 City Name, ST 10: 28 AM 111- 111- 1111 (F) 2 $ - $ - $ -"
    thisDate = Mid(thisLine, 1, 10)
    thisLine = Mid(thisLine, 12)
    firstComma = InStr(1, thisLine, ",")
    City = Mid(thisLine, 1, firstComma - 1)
    thisLine = Mid(thisLine, firstComma + 2)
    State = Left(thisLine, 2)
    thisLine = Mid(thisLine, 4)
    thisTime = Left(thisLine, 9)
    thisLine = Mid(thisLine, 11)
    thisPhone = Left(thisLine, 14)
    thisLine = Mid(thisLine, 16)
    tempArray = Split(thisLine, "$")
    If UBound(tempArray) = 3 Then
        optionalCode = tempArray(0)
        charge = "$" & tempArray(1)
        tax = "$" & tempArray(2)
        Total = "$" & tempArray(3)
    Else
        ' throw an error: something went wrong
    End If
End Sub
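The same landmark idea (fixed-width date, first comma, dollar signs) can be sanity-checked outside VBA. A hedged shell sketch on one sample line, with the same caveat as the code above, that it assumes a comma-bearing city and two-digit month/day:

```shell
line='12/ 13/ 05 City Name, ST 10: 28 AM 111- 111- 1111 (F) 2 $ - $ - $ -'
# first 10 characters are the date -- only true when month and day are both
# two characters wide, which the poster says is not guaranteed
thedate=$(printf '%s\n' "$line" | cut -c1-10)
# city runs from character 12 up to the first comma
city=$(printf '%s\n' "$line" | cut -d, -f1 | cut -c12-)
echo "$thedate / $city"
```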

How do you add columns from a spreadsheet and then generate percentages in Perl?

I have a lot of data in a format like this
Amistad Academy District Amistad Academy 596 812 73.4
Andover School District Andover 39 334 11.7
Ansonia School District Ansonia High School 427 732 58.3
Ansonia School District Ansonia Middle School 219 458 47.8
Ansonia School District Mead School 431 642 67.1
Ansonia School District Prendergast School 504 787 64
What I need to do is grep a bunch of school districts and then take the last column, sum it over all the matching districts (all of Ansonia, for example), then divide that number by the sum of the next-to-last column. I have no trouble getting the school districts into separate files - that was just a grep. Now, however, I'm stuck, and I'm thinking it might be easier to just do it in Excel. I've been playing with solutions in Perl like
#!/opt/local/bin/perl
use strict;
use warnings;
use ARGV::readonly;

my @data;
my @headers = split ',', <>;

while (<>) {
    my @row = split;
    $data[$_] += $row[$_] for (0 .. $#row);
}

$" = "\t";
print "@headers", "\n";
print "@data";
but can't figure out the syntax to do the sum and division.
Thanks.
You are summing every column. You just want to sum two of them. Otherwise, you're practically there.
use feature 'say'; # say() needs perl v5.10+

my $sum_last = 0; # Use better name.
my $sum_penu = 0; # Use better name.

while (<>) {
    chomp;
    my @row = split /\t/;
    next if $row[0] ne 'Ansonia School District';
    $sum_last += $row[-1];
    $sum_penu += $row[-2];
}

say $sum_last / $sum_penu;
The program below will pick out the values from the file and keep the running totals for each school district in a hash. The contents of the hash are printed when all the data has been read. It works from the unfiltered file - you don't have to grep it into separate sources.
I notice that your data seems to be tab-separated, and it is important to use split /\t/ so that fields containing space characters don't get split up as well.
You don't say what the data means so I can't make the code more readable.
Please ask again if you have any further questions.
use strict;
use warnings;

open my $fh, '<', 'myfile' or die $!;
scalar <$fh>; # lose header record

my %data;

while (<$fh>) {
    my @fields = split /\t/;
    my $district = shift @fields;
    $data{$district}[0] += $fields[-2];
    $data{$district}[1] += $fields[-1];
}

for my $district (sort keys %data) {
    printf "%s - %f\n", $district, $data{$district}[1] / $data{$district}[0];
}
output
Andover School District - 0.035030
Ansonia School District - 0.090569
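For comparison, the same running totals fit in an awk one-liner (my sketch; the file name and header labels below are invented for the demonstration, and the input is assumed tab-separated with one header line, as in the Perl script):

```shell
# Build a small tab-separated sample (header + two Ansonia rows) and total it.
printf 'District\tSchool\tCol3\tCol4\tCol5\n' >  districts.tsv
printf 'Ansonia School District\tAnsonia High School\t427\t732\t58.3\n' >> districts.tsv
printf 'Ansonia School District\tAnsonia Middle School\t219\t458\t47.8\n' >> districts.tsv
# Sum the last and next-to-last columns per district, then print the ratio.
awk -F'\t' 'NR > 1 { penu[$1] += $(NF-1); last[$1] += $NF }
    END { for (d in penu) printf "%s - %f\n", d, last[d] / penu[d] }' districts.tsv
# prints: Ansonia School District - 0.089160
```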