split on certain column when it is a url and has spaces - regex

I have thousands of lines of data similar to
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
To get just the first column and the fourth column (url) into another file, I was using
awk '{print $1 $4}' file > smallerfile
Now the fourth column url sometimes has spaces and the entire path did not get captured in some cases. I also suspect it might contain other characters too (e.g. -, _, etc.), so I wasn't sure if I could split on "-". How can I get just the first column and the fourth column in its entirety?
Thanks

Assuming your normal lines (i.e. those without extra spaces in url) have always 17 fields:
awk '{printf "%s",$1;for(i=4;i<NF-12;i++)printf "%s%s",OFS,$i;if(NF)print ""}' input.txt
Output:
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
It prints the first field, then field 4 plus any extra fields belonging to the url (the ones that push the total field count above 17). The if(NF) suppresses empty lines; delete it if you need them.
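A quick way to sanity-check this, assuming the two sample lines above are saved to a file (sample.txt here is just an illustrative name):

```shell
# Write the two sample lines to a file and run the field-counting awk.
cat > sample.txt <<'EOF'
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
EOF
# Print $1, then every field from $4 up to $(NF-13) - the url plus any
# extra fields the spaces in the path produced.
awk '{printf "%s",$1;for(i=4;i<NF-12;i++)printf "%s%s",OFS,$i;if(NF)print ""}' sample.txt
```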

You can try this way:
awk -F[-:] '{ split($2,a," "); print $1 ":" a[1] $5 }' file
The idea is to use - and : as field separators to allow any number of spaces inside the parentheses.
But indeed the path can contain a hyphen too. To handle this you can use sed instead, which checks for the space and hyphen after the path:
sed -r 's/^(\S+)[^:]+:\s+(.+?)\s+-.*/\1 \t\2/' file

Use the pattern /\.xml/ to decide what to print
awk '$4~/\.xml/{print $1,$4} $5~/\.xml/{print $1,$4,$5}' file
will produce output
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
What it does:
$4~/\.xml/ checks if the pattern .xml is contained in the 4th field; if so, print $1 and $4.
$5~/\.xml/ checks if the pattern .xml is contained in the 5th field; if so, print $1, $4 and $5.
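The same sample data can be used to check this approach (sample.txt is again just an illustrative name):

```shell
cat > sample.txt <<'EOF'
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
EOF
# Print $1 and $4 when the path ends in field 4; add $5 when one space
# in the path pushed the .xml part into field 5.
awk '$4~/\.xml/{print $1,$4} $5~/\.xml/{print $1,$4,$5}' sample.txt
```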

Related

awk: Use gensub to substitute multiple lines from a paragraph record

I have an input file with multiple paragraphs separated by at least two newlines (\n\n), and I'm wanting to extract fields from lines within certain paragraphs. I think the processing will be simplest if I can get gensub to work as I'm hoping. Considering the following input file:
[Record R1]
Var1=0
Var2=20
Var3=5
[Record R2]
Var1=10
Var3=9
Var4=/var/tmp/
Var2=12
[Record R3]
Var1=2
Var3=5
Var5=19
I want to print only the value of Var2 from records R1 and R3 (where, in R3, Var2 doesn't actually exist). I can easily group all of the variables into their corresponding record by setting RS="\n\n"; then they are all contained within $0. But since I don't know where it will appear in the list ahead of time, I want to use something like gensub to extract it. This is what I have going:
awk '
BEGIN {
RS="\n\n"
}
/Record R1/ || /Record R3/ {
print gensub(/[\n.]*Var2=(.*)[\n.]*/, "\\1", "g", $0)
}
' /tmp/input.txt
But instead of only printing 20 (the value of Var2 from R1), it prints the following:
[Record R1]
Var1=0
20
Var3=5
[Record R3]
Var1=2
Var3=5
Var5=19
The intent is that the regex in the gensub command would capture all characters (newlines: \n; and non-newlines: .) before and after Var2=XX and replace everything with XX. But instead, it's only capturing the characters on the same line as Var2=XX. Can awk's gensub do this kind of multi-line substitution?
I know an alternative would be to loop over all the fields in the record, then split the field that matches Var2= on the = sign, but that feels less efficient as I scale this out to multiple variables.
I don't understand what it is you're trying to do with gensub() but to do what you seem to be trying to do in any awk is:
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[12]$/) print f["Var2"]; delete f}' file
20
12
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[13]$/) print f["Var2"]; delete f}' file
20
gensub() doesn't care if the string it's operating on is one line or many lines btw - \n is just one more character, no different from any other character.
Oh, hang on, now I see what you're thinking with that gensub() - your problems are:
[\n.]* means zero or more newlines or periods, but you don't have any periods in your input, so it's the same as \n* - and you don't have any newlines immediately before a Var2.
Var2 doesn't exist in your 2nd record, so the regexp can't match it there.
The (.*) will match everything to the end of the record (leftmost-longest matching).
The "g" is misleading since you only expect 1 match.
So using gensub() on multi-line text isn't an issue; your regexp is just wrong.
another awk
$ awk -v RS= '/\[Record R[13]\]/{for(i=2;i<=NF;i++)
{v=sub(/ *Var2=/,"",$i);
if(v) print $i}}' file
20

How to extract multiple mailing addresses from PDF file using pdftotext

I am using pdftotext in a bash script, trying to extract the names and addresses from PDF postage labels.
An example PDF file:
Delivered By:
1st Class
Postage on Account GB
First Last
HouseName
Street
Town
County
Postcode
Customer Reference: 12400 / 203 1
32224983765
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C228
Delivered By:
1st Class
Postage on Account GB
First Last
HouseNumber
Street
Town
Postcode
Customer Reference: 12401 / 200 1
32224286536
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C414
Please note:
The addresses do not have a fixed length, i.e. some consist of only 4 lines, and I have seen some with up to 6 lines.
The number of addresses in the PDF file will not be known in advance.
So far, I have just got:
pdftotext label.pdf - | grep -A10 "Postage on Account GB" | tail -n+3 | head -n -3
The - avoids creating a file. The grep -A10 outputs the matching line plus the 10 lines after the match "Postage on Account GB". The tail -n+3 removes the match and the next line. The head -n -3 removes the last 3 lines. That works fine when there's only one address in the file consisting of 6 lines, but I'm stuck when it comes to multiple addresses and with different lengths.
Put simply, I would like to extract the data from after the blank line after Postage on Account GB, until the line before the next blank line. Then format the output so that addresses are comma delimited and each on a new line, such as:
First Last, HouseName, Street, Town, County, Postcode
First Last, HouseNumber, Street, Town, Postcode
Updated Answer
In the light of your comments, I have updated my answer as follows:
pdftotext file.pdf - | perl -00 -wnl -e 'BEGIN{$a=$r=0} if($a){($add=$_)=~tr/\n/,/; $r=1; $a=0; next} if($r){printf "%s,%s\n",$_,$add;$r=0} $a=1 if m/Postage on Account/;'
One record is read each time through the loop - a record is separated by blank lines above and below because of -00. At the start, I set $a and $r flags to zero, meaning we are not looking at an address nor a reference. If we are looking at an address, I translate all newlines into commas and note that we are now looking for a reference. If we find a reference, we print it and the saved address and note that we are no longer looking at an address or a reference. If we find the string "Postage on Account", we note that we are now expecting an address to follow.
Sample Output
Customer Reference: 12400 / 203 1,First Last,HouseName,Street,Town,County,Postcode
Customer Reference: 12401 / 200 1,First Last,HouseNumber,Street,Town,Postcode
Original Answer
I think I'd go with Perl in paragraph mode:
pdftotext file - | perl -00 -wnl -e 'BEGIN{$p=1} if($p==1){tr/\n/,/;print;$p=0}; $p=1 if /Postage/'
The -00 sets Perl in paragraph mode, treating each blank-line-delimited block as a paragraph. The BEGIN{...} sets the print flag ($p) so the first paragraph gets printed. On subsequent paragraphs, when the print flag is set, the newlines get changed into commas with tr and the paragraph gets printed and the flag reset. Finally, whenever we see the word Postage we set the print flag.
pdftotext filename.pdf - |sed -n '/Postage on Account GB/,/Customer Reference:/{/Postage on Account GB/!{/Customer Reference:/!p}}' |grep . |tr '\n' ',' |sed 's/,$//g' |sed "s/Postcode/&\n/g" |sed 's/^,//g'
First Last,HouseName,Street,Town,County,Postcode
First Last,HouseNumber,Street,Town,Postcode

Extract string between fields from begin to end and end to begin of the line - space delimited - shell commands

I have one logfile that is a space-delimited file. The structure is this:
Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *
I want to be able to extract the filenames, which, to my misfortune, sometimes contain a space in their name, e.g. "file name.txt".
I cannot simply cut this using the field position, because of that space that sometimes appears in the name of the files.
The way I was thinking of doing this was getting what is between field 8 counting from the left and field 8 counting from the right.
But I cannot think of anything to help me with that.
Has anyone had to do this before who can shed some light?
Thanks
This is difficult to attempt without a larger data sample, but here is a rough solution that will discard the tenth field if it does not match a specified pattern. (This only works if there is at most a single space ' ' in the file name):
#!/bin/sh
STORE1=$( echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *" | awk '{print $9}' )
STORE2=$( echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *" | awk '{print $10}' )
# if the tenth field matches the string "X" discard it
if [ "$STORE2" != "X" ]
then STORE1="$STORE1 $STORE2"
fi
printf "%s" "$STORE1"
Here's a quick test with python:
import re
txt = "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *"
print(re.search(r"\d+(\.\d+){3}\s+\d+\s+(.*)(\s+\S+){8}", txt).group(2))
Yes, I realize this is not shell, but the regular expression will pick up anything between the (ip address, integer) and before the last 8 fields as you were attempting. Just use the regex and apply it to your script.
echo "Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *"
sed -r 's#.*/([^.]+\.[A-Za-z]*).*#\1#' logfile.txt
The regex could be explained as follows:
.*/ Matches every character until the last slash.
([^.]+\.[A-Za-z]*) Matches everything from there and up to the first dot, followed by alphabetic characters. This is the filename. The text matched is captured by the group.
.* Matches the rest of the line.
The whole line is therefore substituted with \1, the text captured by group 1 (the filename), and printed to standard output (logfile.txt here is the input file, not the destination).
Some assumptions were made: the file must always have a slash from its path, the filename must have only one dot for the extension, and the extension consists of only alphabetic characters.
Thanks everyone for the inputs. I thought a bit more about it and used AWK to get that done.
Looping over the file content from the first field I want ($9) to the last field minus 8.
cat file | awk '{out=""; for(i=9;i<=NF-8;i++){out=out" "$i}; print out}'
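For what it's worth, a slightly tidied variant of the same idea: seeding out with $9 avoids the leading space the empty-string initialisation produces, and awk can read the input directly (the log line is inlined here just for demonstration):

```shell
# Fields 9 .. NF-8 hold the filename; the last 8 fields are fixed.
printf '%s\n' 'Mon Oct 05 23:17:52 2015 0 10.0.0.1 3989728 /dir/file name.txt X X X X acct proto 0 *' |
awk '{out=$9; for(i=10;i<=NF-8;i++) out=out" "$i; print out}'
```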

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
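For example, run against the two sample lines from the question:

```shell
printf '%s\n' \
  '1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019' \
  '1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019' |
awk '{print $4}'   # runs of whitespace collapse, so the wanted value is field 4
```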
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to the value in $4 and performs the default action, print.
PS: do not use cat with a program that can read data itself, like awk.
In case field 4 contains 0 (awk treats 0 as false, so $0=$4 would suppress that line), you can make it more robust like:
awk '{$0=$4}1' Data.txt
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as the FS value. This regex is parsed and sets the field separator to three or four spaces. In regex, {min,max} is called a range (interval) quantifier; it repeats the previous token from min to max times.

Print line after multiline match with sed

I am trying to create a script to pull out an account code from a file. The file itself is long and contains a lot of other data, but I have included below an excerpt of the part I am looking at (there is other content before and after this excerpt).
The section of the file I am interested in sometimes look like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
VIN No.
AAAAAA01 9999 1000 30 days
and sometimes it looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
AAAAAA01 9999 1000 30 days
(one field cut off the end, where that field had been wrapping down onto its own line)
I know I can use | tr -s ' ' | cut -d ' ' -f 1 to pull the code once I have the line it is on, but that is not a set line number (the content before this section is dynamic).
I am starting by trying to handle the case with the extra field; I figure it will be easy enough to make that an optional match with ?
The number of spaces used to separate the fields can change as this is essentially OCRed.
A few of my attempts so far - (assume the file is coming in from STDIN)
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s\+VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\n\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\n\s*VIN No\.\s*/{n;p;}'
These all failed to match whatsoever
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*/,/\s\*VIN No\.\s*/{n;p;}'
This at least matched something, but frustratingly printed the VIN No. line, followed by every second line after it. It also seems like it would be more difficult to mark as an optional part of the expression.
So, given an input of the full file (including either of the above excerpts), I am looking for an output of either
AAAAAA01 9999 1000 30 days
(which I can then trim to the required data) or AAAAAA01 if there is an easier way of getting straight to that.
This might work for you (GNU sed):
sed -n '/Account/{n;/VIN No\./n;p}' file
Use sed with the -n switch; this makes sed act like grep, i.e. it only prints lines when explicitly told to by the commands P or (in this case) p.
/Account/ match a line with the pattern Account
For the above match only:
n normally this would print the current line and then read the next line into the pattern space, but as the -n is in action no printing takes place. So now the pattern space contains the next line.
/VIN No\./n If the current line contains VIN No., effectively empty the pattern space and read in the next line.
p print whatever is currently in the pattern space.
So this is a condition within a condition: when we encounter Account, print either the following line or the line after that.
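Both layouts from the question can be checked quickly:

```shell
# Layout with the wrapped "VIN No." line:
printf '%s\n' 'Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.' \
              'VIN No.' \
              'AAAAAA01 9999 1000 30 days' |
sed -n '/Account/{n;/VIN No\./n;p;}'
# Layout without it:
printf '%s\n' 'Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.' \
              'AAAAAA01 9999 1000 30 days' |
sed -n '/Account/{n;/VIN No\./n;p;}'
```

Both runs print the AAAAAA01 line.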
awk '/^\s*Account\s+Customer Order No\.\s+Whse\s+Payment Terms\s+Stock No\.\s+Original Invoice No\.$/ {
getline;
if (/^\s*VIN No\.$/) getline;
print;
exit;
}' file
Going strictly off your input, in both cases the desired field is on the last line. So to print the first field of the last line,
awk 'END {print $1}'
Result
AAAAAA01
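Assuming the account line really is the last line of the input, a quick check (in awk's END block, $0 still holds the last record read):

```shell
printf '%s\n' 'Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.' \
              'VIN No.' \
              'AAAAAA01 9999 1000 30 days' |
awk 'END {print $1}'   # first field of the last line
```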