How to extract multiple mailing addresses from PDF file using pdftotext - regex

I am using pdftotext in a bash script, trying to extract the names and addresses from PDF postage labels.
An example PDF file:
Delivered By:
1st Class
Postage on Account GB
First Last
HouseName
Street
Town
County
Postcode
Customer Reference: 12400 / 203 1
32224983765
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C228
Delivered By:
1st Class
Postage on Account GB
First Last
HouseNumber
Street
Town
Postcode
Customer Reference: 12401 / 200 1
32224286536
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C414
Please note:
The addresses do not have a fixed length, i.e. some consist of only 4 lines, and I have seen some with up to 6 lines.
The number of addresses in the PDF file will not be known in advance.
So far, I have just got:
pdftotext label.pdf - | grep -A10 "Postage on Account GB" | tail -n+3 | head -n -3
The - sends the text to stdout instead of creating a file. The grep -A10 prints the matching line "Postage on Account GB" plus the 10 lines after it. The tail -n+3 removes the match and the next line. The head -n -3 removes the last 3 lines. That works fine when there's only one address in the file consisting of 6 lines, but I'm stuck when it comes to multiple addresses of different lengths.
Put simply, I would like to extract the data from after the blank line after Postage on Account GB, until the line before the next blank line. Then format the output so that addresses are comma delimited and each on a new line, such as:
First Last, HouseName, Street, Town, County, Postcode
First Last, HouseNumber, Street, Town, Postcode

Updated Answer
In the light of your comments, I have updated my answer as follows:
pdftotext file.pdf - | perl -00 -wnl -e 'BEGIN{$a=$r=0} if($a){($add=$_)=~tr/\n/,/; $r=1; $a=0; next} if($r){printf "%s,%s\n",$_,$add;$r=0} $a=1 if m/Postage on Account/;'
One record is read each time through the loop - a record is separated by blank lines above and below because of -00. At the start, I set $a and $r flags to zero, meaning we are not looking at an address nor a reference. If we are looking at an address, I translate all newlines into commas and note that we are now looking for a reference. If we find a reference, we print it and the saved address and note that we are no longer looking at an address or a reference. If we find the string "Postage on Account", we note that we are now expecting an address to follow.
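The same paragraph-at-a-time idea can also be sketched in awk, which has its own paragraph mode (RS=""). This is only a minimal sketch: the function name extract_addresses is made up for illustration, it prints just the addresses (no reference), and it assumes the pdftotext output really does separate the blocks with blank lines as described in the question.

```shell
# Hypothetical helper: reads pdftotext-style text on stdin, prints one address per line.
extract_addresses() {
  awk '
    BEGIN { RS = ""; FS = "\n"; OFS = ", " }   # paragraph mode: each blank-line-separated block is a record
    want  { $1 = $1; print; want = 0; next }   # rebuild the record so its lines are joined with OFS
    /Postage on Account/ { want = 1 }          # the paragraph after this one is an address
  '
}

# Made-up sample mirroring the layout described in the question.
printf '%s\n' 'Delivered By:' '1st Class' 'Postage on Account GB' '' \
  'First Last' 'HouseName' 'Street' 'Town' 'County' 'Postcode' '' \
  'Customer Reference: 12400 / 203 1' | extract_addresses
```

With a real label file this would be used as pdftotext label.pdf - | extract_addresses.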
Sample Output
Customer Reference: 12400 / 203 1,First Last,HouseName,Street,Town,County,Postcode
Customer Reference: 12401 / 200 1,First Last,HouseNumber,Street,Town,Postcode
Original Answer
I think I'd go with Perl in paragraph mode:
pdftotext file - | perl -00 -wnl -e 'BEGIN{$p=1} if($p==1){tr/\n/,/;print;$p=0}; $p=1 if /Postage/'
The -00 sets Perl in paragraph mode, treating each blank-line-delimited block as a paragraph. The BEGIN{...} sets the print flag ($p) so the first paragraph gets printed. On subsequent paragraphs, when the print flag is set, the newlines get changed into commas with tr, the paragraph gets printed and the flag is reset. Finally, whenever we see the word Postage we set the print flag.

pdftotext filename.pdf - |sed -n '/Postage on Account GB/,/Customer Reference:/{/Postage on Account GB/!{/Customer Reference:/!p}}' |grep . |tr '\n' ',' |sed 's/,$//g' |sed "s/Postcode/&\n/g" |sed 's/^,//g'
First Last,HouseName,Street,Town,County,Postcode
First Last,HouseNumber,Street,Town,Postcode

Related

Sed remove only first occurrence of a string

I have several strings in my text file which look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace any character with another one. This time I want to replace the characters comma-space with a pipe, but only for the first match, so that the country name is not affected at the same time.
I need to convert it to something like that:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, and the country name "Queensland, Australia" stays complete. Can someone help me to achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
If you do: sed 's/, /|/' file.txt doesn't work.
The output should be like that:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
The s command will by default only replace the first occurrence of a string in the pattern space, unless you pass the g option.
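A quick way to see the difference is to run both forms on one of the sample lines. The variable names here are only for illustration.

```shell
line='Brisbane, Queensland, Australia|BNE'

# Without g: only the first ", " on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/, /|/')

# With g: every ", " on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/, /|/g')

printf '%s\n%s\n' "$first_only" "$every"
# Brisbane|Queensland, Australia|BNE
# Brisbane|Queensland|Australia|BNE
```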
Since you have not posted the expected output for your sample file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, the line is reconstructed.

split on certain column when it is a url and has spaces

I have thousands of lines of data similar to
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
To get just the first column and the fourth column (url) into another file, I was using
awk '{print $1 $4}' file > smallerfile
Now the fourth column (the url) sometimes has spaces, and in those cases the entire path did not get captured. I also suspect it might contain other characters (e.g. -, _ etc.), so I wasn't sure whether I can split on "-". How can I get just the first column and the fourth column in its entirety?
Thanks
Assuming your normal lines (i.e. those without extra spaces in url) have always 17 fields:
awk '{printf "%s",$1;for(i=4;i<NF-12;i++)printf "%s%s",OFS,$i;if(NF)print ""}' input.txt
Output:
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
It prints the first field, then field 4 plus any extra fields that belong to the url, i.e. those that push the total number of fields above 17. This also removes empty lines; if you need to keep them, delete if(NF).
You can try this way:
awk -F[-:] '{ split($2,a," "); print $1 ":" a[1] $5 }' file
The idea is to use - and : as field separators to allow any number of spaces inside the parenthesis.
But the path itself can contain a hyphen too. To guard against that, you can use sed instead and match the space and hyphen that follow the path:
sed -r 's/^(\S+)[^:]+:\s+(.+?)\s+-.*/\1 \t\2/' file
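A variation on the same idea, as a sketch only: anchor on the " - " that is immediately followed by a weekday abbreviation, so that spaces and hyphens inside the path do not matter. The function name first_and_path is made up, and the sketch assumes that every line has the "(user date):" prefix followed by the path, " - ", and a date starting with an English three-letter weekday, as in the two sample lines.

```shell
# Hypothetical helper: prints "<first column> <full path>" for each input line.
first_and_path() {
  # \1 = first column, \2 = everything between "): " and the last " - <weekday> "
  sed -E 's/^([^ ]+) [^)]*\): +(.*) - (Mon|Tue|Wed|Thu|Fri|Sat|Sun) .*$/\1 \2/'
}

printf '%s\n' \
  'efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234' \
  | first_and_path
```

Because (.*) is greedy, a path that itself contains " - " is still captured whole, as long as no weekday name follows that inner hyphen.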
Use the pattern /\.xml/ to decide what to print
awk '$4~/\.xml/{print $1,$4} $5~/\.xml/{print $1,$4,$5}' file
will produce output
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
What does it do?
$4~/\.xml/ checks whether the 4th field contains .xml; if so, it prints $1 and $4.
$5~/\.xml/ checks whether the 5th field contains .xml; if so, it prints $1, $4 and $5.

Using sed, find mailing address in each text file and store portions of it to separate variables?

I have a series of text files that each contain the string "Address" twice in different parts of the file, and later the string "Subscriber Address", making for three total appearances of "Address". Using sed, I'd like to harvest data immediately following the first instance of "Address" in each file while ignoring the rest. Sometimes the full address will appear in two lines as follows...
Address
100 MAIN ST
STRATFORD CT 06614
And sometimes the address line will wrap, moving the City, State and ZIP to a third line as follows...
Address
NO 10 GREEN ACRES
LANE
SHELTON CT 06484
I'd like to store the output in variables: Address1, Address2, City, State and Zip. Using each of the examples above, the desired outcome would be...
Address1=100 MAIN ST
City=STRATFORD
State=CT
Zip=06614
Address1=NO 10 GREEN ACRES
Address2=LANE
City=SHELTON
State=CT
Zip=06484
A suitable alternative in the second example would be to concatenate address lines 1 and 2, resulting in the following...
Address1=NO 10 GREEN ACRES LANE
City=SHELTON
State=CT
Zip=06484
I know that this is a lot to ask. Any help is very much appreciated.
Sed is not well suited to this. It works line by line by default, and keeping state across lines is awkward.
You could switch to e.g. an AWK clone (awk, gawk, nawk).
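As a rough sketch of what the awk route could look like: collect the lines after the first line containing "Address" until a line ends in a 5-digit ZIP, then split that last line into city, state and ZIP. The function name parse_first_address is made up, and the sketch assumes the address block always ends with a "CITY ST 12345" line and that no street line ends in five digits.

```shell
# Hypothetical helper: prints KEY=value lines for the first address on stdin.
parse_first_address() {
  awk '
    !found && /Address/ { found = 1; next }      # only the first "Address" line starts a block
    found {
      if (NF >= 3 && $NF ~ /^[0-9][0-9][0-9][0-9][0-9]$/) {
        city = $1                                # city may be several words
        for (i = 2; i <= NF - 2; i++) city = city " " $i
        for (i = 1; i <= n; i++) printf "Address%d=%s\n", i, addr[i]
        printf "City=%s\nState=%s\nZip=%s\n", city, $(NF - 1), $NF
        exit                                     # ignore the later "Address" blocks
      }
      addr[++n] = $0                             # a street line, possibly wrapped
    }
  '
}

printf '%s\n' 'Address' 'NO 10 GREEN ACRES' 'LANE' 'SHELTON CT 06484' | parse_first_address
```

The KEY=value lines can then be read into shell variables, for example with a while IFS='=' read -r key value loop.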
Here is my attempt at this:
$ cat file
test
First Address
100 MAIN ST STRATFORD CT 06614
test
Second Address
100 MAIN ST
STRATFORD CT 06614
test
Third Address
NO 10 GREEN ACRES
LANE
SHELTON CT 06484
test
$ sed -n '/Address/{:start;N;/[^0-9]$/b start;s/\n/|/g;p}' file |
sed 1d |
sed 's/^Address|//;s| \([0-9]\+\)$|\nZip: \1|' |
sed 's| \([A-Z]\+\)$|\nState: \1|'|
sed 's/|\([^|]\+\)$/\nCity: \1/' |
sed '/^[^:]\+$/s|\(.*\)|Address: \1|;s/|/ /g'
Address: Second Address 100 MAIN ST
City: STRATFORD
State: CT
Zip: 06614
Address: Third Address NO 10 GREEN ACRES LANE
City: SHELTON
State: CT
Zip: 06484
(Let me not explain exactly how it works :-))
P.S. The idea behind this long command is to reduce the file to lines containing only addresses; after that we delete the 1st line and continue with the others. Regular expressions then transform each address into the required format.
sed -ne '/./{H;$!d;}' -e 'x;/Address/,/^$/!d' -e 's/\n/#/g;s/#Address#//' -e 's/\(.*\)#\(.*\)#\(.*\)/Address1=\1\nAddress2=\2\n\3\n/;s/\(.*\)#\(.*\)/Address1=\1\n\2\n/;s/\([A-Za-z]*\)\s\([A-Za-z][A-Za-z]\)\s\([0-9]\{5\}\)/City=\1\nState=\2\nZip=\3/p' addr.txt
This flattens the addresses out and formats them; then you just need to uniq them.

AWK - Split file by value in specific column

I have the following AWK script (provided by Armali on this site) which splits a tab-delimited file by date (month/year) and saves each part as yyyymmm. I now have an additional condition by which the file should be split: by month/year and also by the unique value in column 3, with each file saved as yyyymmm_Col3Uniquevalue.
The current script is
awk "NR>1{split($2,date,\"/\");print>date[3]strftime(\"%%b.txt\",(date[2]-1)*31*24*60*60)}" input.txt
Data Format:
Country Date Type
HongKong 31/01/2012 Television
Japan 14/01/2012 Press
Japan 05/01/2012 Television
Japan 16/02/2013 Press
Japan 15/02/2013 Television
Output will be 4 txt files:
2012Jan_Press - Containing record 2
2012Jan_Television - Containing record 1,3
2013Feb_Press - Containing record 4
2013Feb_Television - Containing record 5
Play with this for a bit to make sure you understand it:
$ cat file
Country Date Type
HongKong 31/01/2012 Television
Japan 14/01/2012 Press
Japan 05/01/2012 Television
Japan 16/02/2013 Press
Japan 15/02/2013 Television
$ cat tst.awk
NR>1 {
split($2,a,"/")
secs = mktime(a[3]" "a[2]" "a[1]" 0 0 0")
mth = strftime("%b", secs)
file = a[3] mth "_" $3
print file
}
$ awk -f tst.awk file
2012Jan_Television
2012Jan_Press
2012Jan_Television
2013Feb_Press
2013Feb_Television
Look up mktime() and strftime() in the GNU awk manual.
Just change print file to print > file when you're done testing.
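mktime() and strftime() are GNU awk extensions. Where gawk is not available, the month abbreviation could be taken from a lookup table instead; the following is a sketch under that assumption, with target_names as a made-up function name. Like tst.awk above, it only prints the target file names, and it assumes that none of the three columns contains a space.

```shell
# Hypothetical helper: prints the target file name for each data row on stdin.
target_names() {
  awk '
    BEGIN { split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mon, " ") }
    NR > 1 {
      split($2, d, "/")                  # d[1]=day, d[2]=month, d[3]=year
      print d[3] mon[d[2] + 0] "_" $3    # "+ 0" turns "01" into the index 1
    }
  '
}

printf '%s\n' 'Country Date Type' 'HongKong 31/01/2012 Television' \
  'Japan 14/01/2012 Press' 'Japan 16/02/2013 Press' | target_names
```

As with the gawk version, the print can be redirected to that name once the output looks right.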
With TAB separated fields...:
awk -F\t "NR>1{split($2,date,\"/\");print>date[3]strftime(\"%%b_\"$3\".txt\",(date[2]-1)*31*24*60*60)}" input.txt
$3 had to be excluded from the quoted format string.
If the date field $2 also contains the time after a space, split on the space as well as on "/" so that the year still ends up in date[3]:
awk -F\t "NR>1{split($2,date,\"[/ ]\");print>date[3]strftime(\"%%b_\"$3\".txt\",(date[2]-1)*31*24*60*60)}" input.txt

Print line after multiline match with sed

I am trying to create a script to pull out an account code from a file. The file itself is long and contains a lot of other data, but I have included below an excerpt of the part I am looking at (there is other content before and after this excerpt).
The section of the file I am interested in sometimes looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
VIN No.
AAAAAA01 9999 1000 30 days
and sometimes it looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
AAAAAA01 9999 1000 30 days
(one field cut off the end, where that field had been wrapping down onto its own line)
I know I can use | tr -s ' ' | cut -d ' ' -f 1 to pull the code once I have the line it is on, but that is not a set line number (the content before this section is dynamic).
I am starting by trying to handle the case with the extra field; I figure it will be easy enough to make that an optional match with ?
The number of spaces used to separate the fields can change as this is essentially OCRed.
A few of my attempts so far - (assume the file is coming in from STDIN)
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s\+VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\n\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\n\s*VIN No\.\s*/{n;p;}'
These all failed to match whatsoever
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*/,/\s\*VIN No\.\s*/{n;p;}'
This at least matched something, but frustratingly printed the VIN No. line, followed by every second line after it. It also seems like it would be more difficult to mark as an optional part of the expression.
So, given an input of the full file (including either of the above excerpts), I am looking for an output of either
AAAAAA01 9999 1000 30 days
(which I can then trim to the required data) or AAAAAA01 if there is an easier way of getting straight to that.
This might work for you (GNU sed):
sed -n '/Account/{n;/VIN No\./n;p}' file
Use sed with the -n switch; this makes sed act like grep, i.e. lines are only printed explicitly, using the commands P or (in this case) p.
/Account/ match a line with the pattern Account
For the above match only:
n normally this would print the current line and then read the next line into the pattern space, but as the -n is in action no printing takes place. So now the pattern space contains the next line.
/VIN No\./n If the current line contains VIN No., effectively empty the pattern space and read in the next line.
p print whatever is currently in the pattern space.
So this is a condition within a condition. When we encounter Account, print either the following line or the line following that.
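The same condition-within-a-condition can be sketched in awk to go straight to the account code. The function name account_code is made up, and the sketch assumes that the header line is the first line in the input containing "Account".

```shell
# Hypothetical helper: prints the first field of the line following the header.
account_code() {
  awk '
    /Account/ {
      getline                         # move to the line after the header
      if (/VIN No\./) getline         # skip the wrapped "VIN No." line if present
      print $1                        # the account code is the first field
      exit
    }
  '
}

printf '%s\n' \
  'Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.' \
  'VIN No.' \
  'AAAAAA01 9999 1000 30 days' | account_code
```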
awk '/^\s*Account\s+Customer Order No\.\s+Whse\s+Payment Terms\s+Stock No\.\s+Original Invoice No\.$/ {
getline;
if (/^\s*VIN No\.$/) getline;
print;
exit;
}'
Going strictly off your input, in both cases the desired field is on the last line. So to print the first field of the last line,
awk 'END {print $1}'
Result
AAAAAA01