Print line after multiline match with sed - regex

I am trying to create a script to pull out an account code from a file. The file itself is long and contains a lot of other data, but I have included below an excerpt of the part I am looking at (there is other contents before and after this excerpt)
The section of the file I am interested in sometimes look like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
VIN No.
AAAAAA01 9999 1000 30 days
and sometimes it looks like this
Account Customer Order No. Whse Payment Terms Stock No. Original Invoice No.
AAAAAA01 9999 1000 30 days
(one field cut off the end, where that field had been wrapping down onto it's own line)
I know I can use | tr -s ' ' | cut -d ' ' -F 1 to pull the code once I have the line it is on, but that is not a set line number (the content before this section is dynamic).
I am starting by trying to handled the case with the extra field, I figure it will be easy enough to make that an optional match with ?
The number of spaces used to separate the fields can change as this is essentially OCRed.
A few of my attempts so far - (assume the file is coming in from STDIN)
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s\+VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\n\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\s*VIN No\.\s*/{n;p;}'
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*\r\n\s*VIN No\.\s*/{n;p;}'
These all failed to match whatsoever
| sed -n '/\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.\s*/,/\s\*VIN No\.\s*/{n;p;}'
This at least matched something, but frustratingly printed the VIN No. line, followed by every second line after it. It also seems like it would be more difficult to mark as an optional part of the expression.
So, given an input of the full file (including either of the above excerpts), I am looking for an output of either
AAAAAA01 9999 1000 30 days
(which I can then trim to the required data) or AAAAAA01 if there is an easier way of getting straight to that.

This might work for you (GNU sed):
sed -n '/Account/{n;/VIN No\./n;p}' file
Use sed with the -n switch, this makes sed act like grep i.e. only print lines explicitly using the commands P or (this case) p.
/Account/ match a line with the pattern Account
For the above match only:
n normally this would print the current line and then read the next line into the pattern space, but as the -n is in action no printing takes place. So now the pattern space contains the next line.
/VIN No\./n If the current line contains Vin No effectively empty the pattern space and read in the next line.
p print whatever is currently in the pattern space.
So this a condition within a condition. When we encounter Action print either the following line or the line following that.

awk '/^\s*Account\s\+Customer Order No\.\s\+Whse\s\+Payment Terms\s\+Stock No\.\s\+Original Invoice No\.$/ {
getline;
if (/^\s*VIN No\.$/) getline;
print;
exit;
}'

Going strictly off your input, in both cases the desired field is on the last line. So to print the first field of the last line,
awk 'END {print $1}'
Result
AAAAAA01

Related

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And i want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
so i want to remove the duplicates in each line before the: |
and only let the first one there.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator "|" in this case after which the individual columns can be accessed column-wise starting with $1. In this answer, am maintaining a unique-value array seen and printing the line only if the value from $1 is not seen previously.

Delete Specific Lines with AWK [or sed, grep, whatever]

Is it possible to remove lines from a file using awk? I'd like to find any lines that have Y in the last column and then remove any lines that match the value in column 2 of said line.
Before:
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,N
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,N
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,Y
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,Y
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
So awk would find that row 3 has Y in the last column, then look at column 2 [TRACKINGKEY1] and remove all lines that have TRACKINGKEY1 in column 2.
Expected result:
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
The reason for this is that our shipping program puts out a file whenever a shipment is processed, as well as when that shipment gets voided [in case of an error]. So what I end up with is the initial package info, then the same info indicating that it was voided, then yet another set of lines with the new shipment info. Unfortunately our ERP software has a fairly simple scripting language in which I can't even make an array so I'm limited to shell tools.
Thanks in advance!
One way is to take 2 pass to same file using awk:
awk -F, 'NR == FNR && $NF=="Y" && !($2 in seen){seen[$2]}
NR != FNR && !($2 in seen)' file file
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
Explanation:
NR == FNR # if processing the file 1st time
&& $NF=="Y" # and last field is Y
&& !($2 in seen) { # we haven't seen field 2 before
seen[$2]} # store field 2 in array seen
}
NR != FNR # when processing the file 2nd time
&& !($2 in seen) # array seen doesn't have field 2
# take default action and print the line
This solution is kind of gross, but kind of fun.
grep ',Y$' file | cut -d, -f2 | sort -u | grep -vwFf - file
grep ',Y$' file -- find the lines with Y in the last column
cut -d, -f2 -- print just the tracking key from those lines
sort -u -- give just the unique keys
grep -vwFf - file --
read the unique tracking keys from stdin (-f -)
only consider them a match if they are whole words (-w)
they are fixed strings, not regular expressions (-F)
then exclude lines matching these patterns (-v) from file

how to grep exact string match across 2 files

I've UTF-8 plain text lists of usernames, 1 per line, in list1.txt and list2.txt. Note, in case pertinent, that usernames may contain regex characters e.g. ! ^ . ( and such as well as spaces.
I want to get and save to matches.txt a list of all unique values occurring in both lists. I've little command line expertise but this almost gets me there:
grep -Ff list1.txt list2.txt > matches.txt
...but that is treating "jdoe" and "jdoe III" as a match, returning "jdoe III" as the matched value. This is incorrect for the task. I need the per-line pattern match to be the whole line, i.e. from ^ to $. I've tried adding the -x flag but that gets no matches at all (edit: see comment to accepted answer - I got the flag order wrong).
I'm on OS X 10.9.5 and I don't have to use grep - another command line (tool) solving the problem will do.
All you need to do is add the -x flag to your grep query:
grep -Fxf list1.txt list2.txt > matches.txt
The -x flag will restrict matches to full line matches (each PATTERN becomes ^PATTERN$). I'm not sure why your attempt at -x failed. Maybe you put it after the -f, which must be immediately followed by the first file?
This awk will be handy than grep here:
awk 'FNR==NR{a[$0]; next} $0 in a' list1.txt list2.txt > matches.txt
$0 is the line, FNR is the current line number of the current file, NR is the overall line number (they are only the same when you are on the first file). a[$0] is a associative array (hash) whose key is the line. next will ensure that further clauses (the $0 in a) will not run if the current clause (the fact that this is the first file) did. $0 in a will be true when the current line has a value in the array a, thus only lines present in both will be displayed. The order will be their order of occurence in the second file.
A very simple and straightforward way to do it that doesn't require one to do all sorts of crazy things with grep is as follows
cat list1.txt list2.txt|grep match > matches.txt
Not only that, but it's also easier to remember, (especially if you regularly use cat).
grep -Fwf file1 file2 would match word to word !!

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column, that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line(117897124) has to many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
Seems to me that your awk implementation has an upper limit for the number of records it can read in one go of 117,897,124. The limits can vary according to your implementation, and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
Just in case that you don't want to split the file, then maybe you could look for the limits file correspondent to your awk implementation. Maybe you can define unlimited as the Number of Records value, although I believe that is not a good idea, as you might end up using a lot of resources...
If you have enough free space on disk (because creates a temp .swp file) I suggest to use Vim, vim regex has small difference but you can convert from standard regex to vim regex with this tool http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Find duplicates of a file by name in a directory recursively - Linux

I have a folder which contains sub folders and some more files in them.
The files are named in the following way
abc.DEF.xxxxxx.dat
I'm trying to find the duplicate files only matching 'xxxxxx' in the above pattern ignoring the rest. The extension .dat doesn't change. But the length of abc and DEF might change. The order of separation by periods also doesn't change.
I'm guessing I need to use Find in the following way
find -regextype posix-extended -regex '\w+\.\w+\.\w+\.dat'
I need help coming up with the regular expression. Thanks.
Example:
For a file named 'epg.ktt.crwqdd.dat', I need to find duplicate files containing 'crwqdd'.
You can use awk for that:
find /path -type f -name '*.dat' | awk -F. 'a[$4]++'
Explanation:
Let find give the following output:
./abd.DdF.TTDFDF.dat
./cdd.DxdsdF.xxxxxx.dat
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
Basically, spoken with the words of a computer, you want to count the occurrences of a pattern between .dat and the next dot and print those lines where pattern appeared at least the second time.
To achieve this we split the file names by the . what gives us 5(!) fields:
echo ./abd.DEF.xxxxxx.dat | awk -F. '{print $1 " " $2 " " $3 " " $4 " " $5}'
/abd DEF xxxxxx dat
Note the first, empty field. The pattern of interest is $4.
To count the occurrences of a pattern in $4 we use an associative array a and increment it's value on each occurrence. Unoptimized, the awk command will look like:
... | awk -F. '{{if(a[$4]++ > 1){print}}'
However, you can write an awk program in the form:
CONDITION { ACTION }
What will give us:
... | awk -F. 'a[$4]++ > 1 {print}'
print is the default action in awk. It prints the whole current line. As it is the default action it can be omitted. Also the >1 check can be omitted because awk treats integer values greater than zero as true. This gives us the final command:
... | awk -F. 'a[$4]++'
To generalize the command we can say the pattern of interest isn't the 4th column, it is the next to last column. This can be expressed using number of fields in awk its NF:
... | awk -F. 'a[$(NF-1)]++'
Output:
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat