I have googled my problem a lot and tested different solutions, but none seem to work. I have even used the same command before with success, but now I cannot get my desired output.
I have file1
AAA;123456789A
BBB;123456789B
CCC;123456789C
And file2
1;2;3;CCC;pippo
1;2;3;AAA;pippo
1;2;3;BBB;pippo
1;2;3;*;pippo
My desired output is this:
1;2;3;CCC;pippo;CCC;123456789C
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
I tried with this command:
awk -F";" -v OFS=";" 'FNR == NR {a[$10]=$1; b[$20]=$2; next}($10 in a){ if(match(a[$10],$4)) print $0,a[$10],b[$20]}' file1 file2
But I get this output (only one entry, even with bigger files):
1;2;3;CCC;pippo;CCC;123456789C
What am I doing wrong? If it works for one entry, it should work for all the others. Why is this not happening?
Also, why doesn't it work if I set a[$1]=$1?
Thank you for helping!
If possible, could you explain the answer, so next time I won't have to ask for help?
EDIT: Sorry, I did not mention (since I wanted to keep the example minimal) that some fields in file2 are just "*". I'd also like to add an "else, if it doesn't match, do something" branch.
awk to the rescue!
$ awk 'BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
{print $0,a[$4]}' file1 file2
1;2;3;CCC;pippo;CCC;123456789C
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
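To explain what's going on (since you asked): NR==FNR is only true while awk is reading the first file, so for every line of file1 the whole line ($0) is stored in array a, keyed by its first field, and next skips to the following line. While reading file2, $4 is used to look up that array, so the matching file1 line is appended after the current line. In your own attempt, $10 and $20 don't exist on two-field lines, so they are empty strings; every file1 line overwrites the same a[""] entry, which ends up holding only the last file1 value, so the match() test could only succeed for the CCC row here. Commented version of the answer:

awk 'BEGIN{FS=OFS=";"}       # split input fields and join output fields on ";"
     NR==FNR{a[$1]=$0; next}  # first file only: remember the whole line, keyed by field 1
     {print $0, a[$4]}        # second file: print the line plus the stored line for its 4th field
    ' file1 file2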
UPDATE:
Based on the original input, this was only doing an exact-match lookup. If you want to skip the entries where there is no match, you need to qualify the print block with $4 in a:
$ awk 'BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
$4 in a{print $0,a[$4]}' file1 file2
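And for the "else, do something" case from your edit (for example the "*" rows), a minimal sketch could be the following, where the "NOMATCH" text is just a placeholder for whatever you actually want to do:

awk 'BEGIN{FS=OFS=";"}
     NR==FNR{a[$1]=$0; next}
     { if ($4 in a) print $0, a[$4]; else print $0, "NOMATCH" }' file1 file2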
join is made for this sort of thing:
$ join -t';' -1 4 -o1.{1..5} -o2.{1..2} <(sort -t';' -k4 file2) <(sort -t';' file1)
1;2;3;AAA;pippo;AAA;123456789A
1;2;3;BBB;pippo;BBB;123456789B
1;2;3;CCC;pippo;CCC;123456789C
The output is what you asked for except for the ordering of lines, which I assume isn't important. The -o options to join are needed because you want the full set of fields; you can try omitting them and you'll get the join field on the left a single time instead, which might also be fine.
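For comparison, omitting the -o options would give output along these lines, with the join field (field 4 of file2) moved to the front and printed once:

$ join -t';' -1 4 <(sort -t';' -k4 file2) <(sort -t';' file1)
AAA;1;2;3;pippo;123456789A
BBB;1;2;3;pippo;123456789B
CCC;1;2;3;pippo;123456789C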
I'm trying to write a bash script that gets user input, checks a .txt file for the line that contains that input, then plugs that into a wget statement to start a download.
In testing the functionality, awk seems to print out every line, not just the pattern-matched lines.
chosen=DSC01985
awk -v c="$chosen" 'BEGIN {FS="/"; /c/}
{print $8, "found", c}
END{print " done"}' ./imgLink.txt
The above should read imgLink.txt, search for the pattern, and report that the pattern was found. Instead it prints the 8th field of every line in the file.
I have tried moving /c/ out of the BEGIN block, but to no avail.
What's going on here?
Example input:
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01533.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01536.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01543.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01558.jpg
https://xxxx/xxxx/xxxx/xxxx/xxx/DSC01565.jpg
etc.
Example output:
...
DSC02028.jpg found DSC01985
DSC02030.jpg found DSC01985
DSC02032.jpg found DSC01985
DSC02038.jpg found DSC01985
DSC02042.jpg found DSC01985
etc.
You were close in your attempt, but you can't search for the contents of an awk variable by writing /c/ (that is a literal regexp); you need a different method for this. Could you please try the following, assuming that the string you want to look for appears in the URL values which you have currently xxxed out in your post:
awk -v c="$chosen" -F'/' '$0 ~ c{print $NF " found " c}' Input_file
I'm not sure why you have written done in your END block; you could add it here too if you need it. Also, $NF means the last field of the current line, which you can print as per your need.
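If you want the match restricted to the file-name part of the URL rather than anywhere in the line, the same idea works with $NF ~ c:

awk -v c="$chosen" -F'/' '$NF ~ c{print $NF " found " c}' ./imgLink.txt

And since you mentioned plugging the line into wget, one way to capture the first matching URL in your script could be (a sketch, reusing your existing file and variable names):

url=$(awk -v c="$chosen" '$0 ~ c{print; exit}' ./imgLink.txt)
wget "$url"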
I have a big file with links something like this:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link1.html#XYZ
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index1.php?index=2
http://bbbb.com/web1/index1.php?index=3
http://bbbb.com/web1/index1.php?index=4
http://bbbb.com/web1/index1.php?index=5
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
I want to remove all duplicate links and remain with:
http://aaaa1.com/weblink/link1.html#XXX
http://aaaa1.com/weblink/link2.html#XXX
http://bbbb.com/web1/index.php?index=1
http://bbbb.com/web1/index1.php?index=1
http://bbbb.com/web1/index2.php?index=666
http://bbbb.com/web1/index3.php?index=666
http://bbbb.com/web1/index4.php?index=5
How can I do this?
Could you please try the following:
awk -F'[#?]' '!a[$1]++' Input_file
Explanation of above code:
awk -F'[#?]' ' ##Start the awk script, setting the field separator to either # or ? (a character class), as per the OP's sample Input_file.
!a[$1]++ ##Use $1 (everything before the first # or ?) as an index into array a. The condition is true only when $1 has NOT been seen before; the ++ then marks it as seen.
##Because the default action is to print, each distinct $1 is printed only the first time it occurs, so all later duplicates are NOT printed.
' Input_file ##Mention the Input_file name here.
I hope this will clear all the duplicate links from your file, as long as the duplicates really share exactly the same value before the # or ?.
sort -u your-link-file.txt
And if you want to store the result in another file, then use this:
sort -u your-link-file.txt > result.txt
I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there might be some regex or awk-like tool that could append commas to the end of a line in case there are missing commas in these rows?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad all rows from the 2nd row onward with empty cell values, based on the number of columns in the header row, so that the number of columns is not hard-coded. Assigning $nc=$nc forces awk to rebuild the record, filling any missing trailing fields with empty values joined by OFS:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk, where assigning an empty $3 forces the record to be rebuilt with three comma-separated fields, appending the trailing comma:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To present some balance to all the awk solutions, the following could be a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for lines containing 2 commas
:v applies a global command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
I've encountered the following problem and haven't found a solution, nor an explanation of why awk behaves in this strange way.
So let's say I have the following text in a file:
startcue
This shouldn't be found.
startcue
This is the text I want to find.
endcue
startcue
This shouldn't be found either.
And I want to find the lines "startcue", "This is the text I want to find.", and "endcue".
I naively assumed that a simple range search with awk '/startcue/,/endcue/' would do it, but this prints out the whole file. I guess awk finds the first range correctly, but the third startcue then triggers printing again, so it prints all the lines until the end of the file (still, this all seems a bit strange to me).
Now to the question: How can I get awk to print out just the lines I want? And maybe as an extra question: Can anybody explain awk's behaviour?
Thanks
$ awk '/startcue/{f=1; buf=""} f{buf = buf $0 RS} /endcue/{printf "%s",buf; f=0}' file
startcue
This is the text I want to find.
endcue
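The trick is that a later startcue simply resets the buffer, so a block that never reaches an endcue is thrown away instead of printed. The same script, spread out with comments and unchanged in behavior:

awk '/startcue/ { f=1; buf="" }            # startcue: start (or restart) collecting into an empty buffer
     f          { buf = buf $0 RS }        # while collecting, append the current line plus a newline
     /endcue/   { printf "%s", buf; f=0 }  # endcue: the block is complete, print it and stop collecting
    ' file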
Here is a simple way to do it.
Since the data is separated by blank lines, I set RS to nothing.
This makes awk work with the data in blocks.
Then find all blocks starting with startcue and ending with endcue:
awk -v RS="" '/^startcue/ && /endcue$/' file
startcue
This is the text I want to find.
endcue
If startcue and endcue are always the start line and the end line and each appears only once in the block, this should do it. (PS: testing shows that it does not matter if there are more or fewer hits in the block; this always prints the block if both startcue and endcue are found.)
awk -v RS="" '/startcue/ && /endcue/' file
startcue
This is the text I want to find.
endcue
And this should work too:
awk -v RS="" '/startcue.*endcue/' file
startcue
This is the text I want to find.
endcue
To summarize the problem, you want to print lines from startcue to endcue, but not if the endcue is missing. Ed Morton's approach is good. Here is yet another approach:
$ tac file | awk '/endcue/,/startcue/' | tac
startcue
This is the text I want to find.
endcue
How it works
tac file
This prints the lines in reverse order. tac is just like cat except that the lines come out in reverse order.
awk '/endcue/,/startcue/'
This prints all lines starting from endcue and finishing with startcue. When done this way, passages with missing endcues are not printed.
tac
This reverses the lines once again so that they are back in the correct order.
How awk ranges work
Consider:
awk '/startcue/,/endcue/' file
This tells awk to start printing when it finds startcue and to continue printing until it finds endcue. This is exactly what it does on your file.
There is no implied rule that the range /startcue/,/endcue/ cannot itself contain further instances of startcue. awk simply starts printing when it sees the first occurrence of startcue and continues until it finds endcue.
No buffering needed:
{m,n,g}awk 'BEGIN { _ +=_ ^= ORS = FS = RS = "\nendcue\n"
sub("end", "?start", RS)
__= substr(RS, _+--_) } (NF=_<NF) && $!_=__$_'
startcue
This is the text I want to find.
endcue
I have a huge .txt file, 300 GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine, but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
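With that, the whole pipeline from the question could become, for example:

cut -d ' ' -f 1 < file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt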
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines contains a very large number of fields, that could cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
Seems to me that your awk implementation has an upper limit for the number of records it can read in one go of 117,897,124. The limits can vary according to your implementation, and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
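A rough sketch of that idea, reusing the pipeline from the question and assuming chunk_* is a free filename prefix:

split -l 100000000 file_name chunk_            # pieces of at most 100,000,000 lines each
for f in chunk_*; do
    awk '{print $1}' "$f" | grep -o '/ns/.*'   # run the original extraction on each piece
done | awk '!seen[$0]++' > test1.txt           # de-duplicate across all pieces at the end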
In case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Maybe you can define unlimited as the Number of Records value, although I believe that is not a good idea, as you might end up using a lot of resources...
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex syntax differs slightly from standard regex, but you can convert from standard regex to vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem, then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.