AWK - Split file by value in specific column - regex

I have the following AWK script (provided by Armali on this site) which basically splits a tab-delimited file by date (month/year) and saves each part as yyyymmm. I now have an additional condition by which the file should be split: by month/year and also by the unique value in column 3, saving each file as yyyymmm_Col3Uniquevalue.
The current script (written for the Windows cmd shell, hence the \" and %% escapes) is:
awk "NR>1{split($2,date,\"/\");print>date[3]strftime(\"%%b.txt\",(date[2]-1)*31*24*60*60)}" input.txt
Data Format:
Country Date Type
HongKong 31/01/2012 Television
Japan 14/01/2012 Press
Japan 05/01/2012 Television
Japan 16/02/2013 Press
Japan 15/02/2013 Television
Output will be 4 txt files:
2012Jan_Press - Containing record 2
2012Jan_Television - Containing record 1,3
2013Feb_Press - Containing record 4
2013Feb_Television - Containing record 5

Play with this for a bit to make sure you understand it:
$ cat file
Country Date Type
HongKong 31/01/2012 Television
Japan 14/01/2012 Press
Japan 05/01/2012 Television
Japan 16/02/2013 Press
Japan 15/02/2013 Television
$ cat tst.awk
NR>1 {
    split($2,a,"/")
    secs = mktime(a[3]" "a[2]" "a[1]" 0 0 0")
    mth = strftime("%b", secs)
    file = a[3] mth "_" $3
    print file
}
$ awk -f tst.awk file
2012Jan_Television
2012Jan_Press
2012Jan_Television
2013Feb_Press
2013Feb_Television
Look up mktime() and strftime() in the GNU awk manual.
Just change print file to print > file when you're done testing.
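For example, a minimal sketch of the finished script, using exactly the pieces shown above:
NR>1 {
    split($2,a,"/")
    secs = mktime(a[3]" "a[2]" "a[1]" 0 0 0")
    mth = strftime("%b", secs)
    file = a[3] mth "_" $3
    print > file
}
If you ever hit the system's limit on open files with many distinct output names, print with >> and close(file) after each write instead.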

With TAB-separated fields:
awk -F\t "NR>1{split($2,date,\"/\");print>date[3]strftime(\"%%b_\"$3\".txt\",(date[2]-1)*31*24*60*60)}" input.txt
Note that $3 had to be placed outside the quoted format string.
If the date field $2 also contains the time after a space, split on the space as well as on "/" so that the year still ends up in date[3]:
awk -F\t "NR>1{split($2,date,\"[/ ]\");print>date[3]strftime(\"%%b_\"$3\".txt\",(date[2]-1)*31*24*60*60)}" input.txt


Preview csv data when file can have data with newline or data without newline

Users can upload two types of file data. Please find the sample data below.
Sample 1:
Name mobile url message text
test11 1234567890 www.example.com "Data Test New
Date:27/02/2020
Items: 1
Total: 3
Regards
ABC DATa
Ph:091 : 123456789"
test12 1234567891 www.example.com hello
Sample 2:
test12 1234567891 www.example.com hello
test13 1234567892 www.example.com hi
test14 1234567893 www.example.com hi
A user file can have 2-3 million records, so I want to give the user a preview option showing the first 10 lines of their uploaded file. To get the first 10 lines I am using the command below:
awk -v RS='"[^"]*"' 'NR>10{exit} {gsub(/\r?\n/, "\\n", RT); ORS=RT} 1' test.csv
It works perfectly when the file's rows contain double-quoted values, but for Sample 2 it prints all the records from the file.
The command below works for Sample 2 but not for Sample 1:
head -n10 test.csv | tr '^' ','
Expected Output:
Sample 1:
Name mobile url message text
test11 1234567890 www.example.com "Data Test New\nDate:27/02/2020\nItems: 1\nTotal: 3\nRegards\nABC DATa\nPh:091 : 123456789"
test12 1234567891 www.example.com hello
Sample 2:
test12 1234567891 www.example.com hello
test13 1234567892 www.example.com hi
test14 1234567893 www.example.com hi
I need a command which will work in both cases.
You may try this GNU awk:
awk -v RS='("[^"]*")?\r?\n' 'NF {
    # rewrite the record terminator: newlines inside the quoted field become literal \n
    ORS = gensub(/\r?\n(.)/, "\\\\n\\1", "g", RT)
    ++n
    print
}
n == 10 {exit}' file
Or as a single line:
awk -v RS='("[^"]*")?\r?\n' 'NF{ORS = gensub(/\r?\n(.)/, "\\\\n\\1", "g", RT); ++n; print} n==10{exit}' file
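Run against Sample 1, this should reproduce the expected preview, with each quoted multi-line field collapsed to literal \n sequences:
$ awk -v RS='("[^"]*")?\r?\n' 'NF{ORS = gensub(/\r?\n(.)/, "\\\\n\\1", "g", RT); ++n; print} n==10{exit}' test.csv
Name mobile url message text
test11 1234567890 www.example.com "Data Test New\nDate:27/02/2020\nItems: 1\nTotal: 3\nRegards\nABC DATa\nPh:091 : 123456789"
test12 1234567891 www.example.com hello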
With GNU awk:
awk 'BEGIN{
    FS="[\t,|]"     # field separators: tab, comma and pipe
    RS="\r{0,1}\n"  # input record separator
}
$NF~/^"/ && $NF~/[^"]$/{   # last field starts with " but does not end with "
    m=$0                   # build the new row in variable m
    while ($0~/[^"]$/){    # loop while the current row does not end with "
        getline            # read the next row
        m=m "\\n" $0       # append the current row to variable m
        NR--               # decrease the row counter
    }
    $0=m                   # copy the newly built row back into the current row
}
NR<=10{print}' file
As one line:
awk 'BEGIN{FS="[\t,|]"; RS="\r{0,1}\n"} $NF~/^"/ && $NF~/[^"]$/ {m=$0; while ($0~/[^"]$/){getline; m=m "\\n" $0; NR--}; $0=m} NR<=10{print}' file
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
This might work for you (GNU sed):
sed -E '/^([^"]*"[^"]*")*[^"]*"[^"]*$/{
:a;N;//ba;s/\n/\\n/g};x;s/^/x/;/x{10}/{x;q};x' file
The solution is in two halves:
The first part joins consecutive lines which have unbalanced quotes.
The second part counts the lines printed.
The first part appends lines until the double quotes are balanced (or does nothing if a line contains no quotes); the embedded newlines are then replaced by the two-character sequence \n.
The second part uses the hold space to maintain a counter (10 in this solution) which, when reached, terminates processing.
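The counting idiom is easier to see in isolation; this sketch prints just the first 3 lines of its input by growing a string of x's in the hold space:
$ seq 5 | sed -E 'x;s/^/x/;/x{3}/{x;q};x'
1
2
3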
N.B. This works on one file per invocation. To process more than one file at a time, use:
sed -nsE '/^([^"]*"[^"]*")*[^"]*"[^"]*$/{:a;N;//ba;s/\n/\\n/g};p
x;s/^/x/;/x{10}/{:b;n;bb};x' file1 file2 ... filen

I wrote the code mentioned below, but how can I delete the column-heading row from the second file onwards after merging two files?

File Name: first.txt
id name contact
1 abc 7679876789
2 bcd 9867363675
File Name: second.txt
id name contact
3 cde 7979436789
4 bgb 9845363675
After merging both files (first.txt and second.txt), I got the following output:
id name contact
1 abc 7679876789
2 bcd 9867363675
id name contact
3 cde 7979436789
4 bgb 9845363675
But I want the output to look like this:
id name contact
1 abc 7679876789
2 bcd 9867363675
3 cde 7979436789
4 bgb 9845363675
So I need to remove the header row (id name contact) coming from the second file.
This might help you, but it creates a new file, and note that it removes every duplicated line, not just the repeated header:
cat file1.txt file2.txt | awk '!seen[$0]++' > file3.txt
You can skip the header line while merging:
awk 'FNR==1 && NR > FNR {next}; 1' first.txt second.txt
FNR is the line number within the current file, while NR is the global line number, so FNR==1 && NR > FNR is true only on the first line of every file after the first; next skips exactly those repeated header lines.
This is easier than merging: look for a line like the first line and remove that line.
When both files have the same header and the header is not a data row, you can use
awk 'remembered==$0 {next} FNR==1 {remembered=$0} 1' combined.txt
Without awk you need to do more processing, e.g. using head -1 (or sed 1q) to find the line you want to skip, and then:
sed '1p; /id name contact/d' combined.txt
# Or, reading the header into a variable first:
headrow=$(sed 1q combined.txt)
sed "1p; /$headrow/d" combined.txt

How to extract multiple mailing addresses from PDF file using pdftotext

I am using pdftotext in a bash script, trying to extract the names and addresses from PDF postage labels.
An example PDF file:
Delivered By:
1st Class
Postage on Account GB
First Last
HouseName
Street
Town
County
Postcode
Customer Reference: 12400 / 203 1
32224983765
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C228
Delivered By:
1st Class
Postage on Account GB
First Last
HouseNumber
Street
Town
Postcode
Customer Reference: 12401 / 200 1
32224286536
RETURN TO: MyName,
DoorNumber, Street, Town,
City, Postcode, Country
121-0434 905 20200-000 6190 C414
Please note:
The addresses do not have a fixed length, i.e. some consist of only 4 lines, and I have seen some with up to 6 lines.
The number of addresses in the PDF file will not be known in advance.
So far, I have just got:
pdftotext label.pdf - | grep -A10 "Postage on Account GB" | tail -n+3 | head -n -3
The - writes to stdout instead of creating a file. grep -A10 prints each line matching "Postage on Account GB" plus the 10 lines after it. tail -n+3 removes the match and the following line. head -n -3 removes the last 3 lines. That works fine when there is only one 6-line address in the file, but I'm stuck when it comes to multiple addresses of different lengths.
Put simply, I would like to extract the data from after the blank line after Postage on Account GB, until the line before the next blank line. Then format the output so that addresses are comma delimited and each on a new line, such as:
First Last, HouseName, Street, Town, County, Postcode
First Last, HouseNumber, Street, Town, Postcode
Updated Answer
In the light of your comments, I have updated my answer as follows:
pdftotext file.pdf - | perl -00 -wnl -e 'BEGIN{$a=$r=0} if($a){($add=$_)=~tr/\n/,/; $r=1; $a=0; next} if($r){printf "%s,%s\n",$_,$add;$r=0} $a=1 if m/Postage on Account/;'
One record is read each time through the loop - a record is separated by blank lines above and below because of -00. At the start, I set $a and $r flags to zero, meaning we are not looking at an address nor a reference. If we are looking at an address, I translate all newlines into commas and note that we are now looking for a reference. If we find a reference, we print it and the saved address and note that we are no longer looking at an address or a reference. If we find the string "Postage on Account", we note that we are now expecting an address to follow.
Sample Output
Customer Reference: 12400 / 203 1,First Last,HouseName,Street,Town,County,Postcode
Customer Reference: 12401 / 200 1,First Last,HouseNumber,Street,Town,Postcode
Original Answer
I think I'd go with Perl in paragraph mode:
pdftotext file - | perl -00 -wnl -e 'BEGIN{$p=1} if($p==1){tr/\n/,/;print;$p=0}; $p=1 if /Postage/'
The -00 sets Perl in paragraph mode, treating each blank-line-delimited block as a paragraph. The BEGIN{...} sets the print flag ($p) so the first paragraph gets printed. On subsequent paragraphs, when the print flag is set, the newlines get changed into commas with tr, the paragraph gets printed and the flag is reset. Finally, whenever we see the word Postage we set the print flag.
pdftotext filename.pdf - | sed -n '/Postage on Account GB/,/Customer Reference:/{/Postage on Account GB/!{/Customer Reference:/!p}}' | grep . | tr '\n' ',' | sed 's/,$//g' | sed "s/Postcode/&\n/g" | sed 's/^,//g'
First Last,HouseName,Street,Town,County,Postcode
First Last,HouseNumber,Street,Town,Postcode
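For comparison, the same blank-line logic is also compact in awk's paragraph mode; this is a sketch that assumes pdftotext really does emit a blank line before and after every block, as described in the question (RS= turns on paragraph mode, FS='\n' makes each line a field, and $1 = $1 rejoins the lines with ', '):
pdftotext label.pdf - | awk -v RS= -v FS='\n' -v OFS=', ' '
    want { $1 = $1; print; want = 0 }     # the previous block contained the marker, so this one is the address
    /Postage on Account GB/ { want = 1 }  # the next paragraph holds the address
'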

split on certain column when it is a url and has spaces

I have thousands of lines of data similar to
abc:1.35 (Johndoe 10-Oct-14): /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml - Wed Aug 27 17:57:37 2014 33 13590770 33056 1 422 6367 234
efg:1.1 (Jane 12-Oct-14): /usr/data/2013a/resources/source data/abstractbpweight/file.xml - Tue Aug 26 17:57:37 2014 33 13590770 33056 1 422 6367 234
To get just the first column and the fourth column (url) into another file, I was using
awk '{print $1 $4}' file > smallerfile
Now, the fourth column (the URL) sometimes has spaces, so the entire path did not get captured in some cases. I also suspect it might contain other characters (e.g. -, _, etc.), hence I wasn't sure if I can split using "-". How can I get just the first column and the fourth column in its entirety?
Thanks
Assuming your normal lines (i.e. those without extra spaces in the URL) always have 17 fields:
awk '{printf "%s",$1;for(i=4;i<NF-12;i++)printf "%s%s",OFS,$i;if(NF)print ""}' input.txt
Output:
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
It prints the first field, then field 4 plus any extra fields that belong to the URL and push the total field count above 17. The if(NF) guard also removes empty lines; delete it if you need to keep them.
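If you would rather not depend on a fixed field count, here is another sketch that collects everything from field 4 up to the lone "-" that follows the path in the sample data (this assumes that separator always appears as a field of its own):
awk '{
    out = $1
    for (i = 4; i <= NF && $i != "-"; i++)  # append URL fields until the standalone "-"
        out = out OFS $i
    print out
}' input.txt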
You can try this way:
awk -F[-:] '{ split($2,a," "); print $1 ":" a[1] $5 }' file
The idea is to use - and : as field separators to allow any number of spaces inside the parentheses.
But indeed the path can contain a hyphen too. To guard against that, you can use sed instead, which checks for the space and hyphen after the path:
sed -r 's/^(\S+)[^:]+:\s+(.+?)\s+-.*/\1 \t\2/' file
Use the pattern /\.xml/ to decide what to print
awk '$4~/\.xml/{print $1,$4} $5~/\.xml/{print $1,$4,$5}' file
will produce output
abc:1.35 /usr/data/2013a/resources/fspecs/abstractbpweight/abstractbpweight.xml
efg:1.1 /usr/data/2013a/resources/source data/abstractbpweight/file.xml
What it does:
$4~/\.xml/ checks whether the pattern .xml occurs in the 4th field; if so, print $1 and $4.
$5~/\.xml/ checks whether the pattern .xml occurs in the 5th field; if so, print $1, $4 and $5.

Select rows from a text file by a matching pattern in one column

I have a file in the following format.
Table_name Value
Employee 0
student 50
Payroll 0
animals 20
I need to fetch the entire row in which the value is non-zero.
The expected output would be
student 50
animals 20
If the value is zero for all the rows, I should get a mail alert like "all values are zero".
This can work:
$ awk '$2' file
Table_name Value
student 50
animals 20
If the 2nd field is not 0 then the condition is evaluated as true; note that {print $0} is the default action in awk, so it can be omitted if this is the only action you want to perform. To skip the header line as well, use awk 'NR>1 && $2' file.
Regarding the mail alert, I think you'd better show some of your code to give us a reference point.
The following awk should meet both of your requirements:
awk 'NR>1 && $2 {nz=1;print}; END{if (!nz) print "all zeroes, send email"}' file
You just need to replace print "all zeroes, send email" with your mail sending command.
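For example, a minimal sketch wiring the two pieces together, assuming a mail(1) command is available and using a placeholder address: the END block sets awk's exit status so the shell can trigger the alert when no non-zero row was found:
awk 'NR>1 && $2 {nz=1; print} END{exit !nz}' file > nonzero.txt ||
    mail -s "all values are zero" user@example.com < /dev/null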
Code for GNU sed (note -r so that + is treated as a quantifier):
$ sed -r '/\S+\s+[0]+\b/d' file
Table_name Value
student 50
animals 20
$ sed -r '/\S+\s+([1-9]|[0]+[1-9])/!d' file
student 50
animals 20
Why not perl:
perl -ane 'print if($.!=1 && $F[1]!=0)' your_file
Very simply with awk:
awk -F " " '{if($2 > 0) print $0}' your_file