How do I merge two files when the first two columns match in both files, replacing the first file's values with the second file's columns? I mean...
Same number of columns:
FILE 1:
121212,0100,1.1,1.2,
121212,0200,2.1,2.2,
FILE 2:
121212,0100,3.1,3.2,3.3,
121212,0130,4.1,4.2,4.3,
121212,0200,5.1,5.2,5.3,
121212,0230,6.1,6.2,6.3,
OUTPUT:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
In other words, I need to print $0 of the second file when $1 and $2 match in both files. I understand the logic, but I can't implement it using arrays, which apparently should be used.
Please take a moment to explain any code.
Use awk to print the first 2 fields in the pattern file and pipe to grep to do the match:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
The -f option tells grep to take the pattern from a file but using - instead of a filename makes grep take the patterns from stdin.
So the first awk script produces the patterns from file1, which we pipe to grep to match against file2:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1
121212,0100
121212,0200
You probably want to anchor the match to the beginning of the line using ^:
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1
^121212,0100
^121212,0200
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
Here's one way using awk:
awk -F, 'FNR==NR { a[$1,$2]; next } ($1,$2) in a' file1 file2
Results:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
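To try the one-liner in isolation, here is a minimal sketch (file names taken from the question). a[$1,$2] creates an array entry keyed on the first two fields joined with SUBSEP, and ($1,$2) in a tests membership without creating a new entry:

```shell
# Sample inputs from the question
cat > file1 <<'EOF'
121212,0100,1.1,1.2,
121212,0200,2.1,2.2,
EOF
cat > file2 <<'EOF'
121212,0100,3.1,3.2,3.3,
121212,0130,4.1,4.2,4.3,
121212,0200,5.1,5.2,5.3,
121212,0230,6.1,6.2,6.3,
EOF

# First pass (FNR==NR): store keys from file1; second pass: print matching file2 lines
awk -F, 'FNR==NR { a[$1,$2]; next } ($1,$2) in a' file1 file2
# -> 121212,0100,3.1,3.2,3.3,
#    121212,0200,5.1,5.2,5.3,
```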
Related
I would like to use sed (I think?) to replace a pattern with file contents.
Example
File 1 (primary file)
Hello <<CODE>>
Goodbye.
File 2 (contents)
Anonymous person,
I am very nice.
File 3 (target)
Hello Anonymous person,
I am very nice.
Goodbye.
Right now, I am using this command:
sed "/<<CODE>>/{r file2
:a;n;ba}" file1 | \
sed "s/<<CODE>>//g" > \
file3
But this outputs:
Hello
Anonymous person,
I am very nice.
Goodbye.
(note the newline after Hello)
How can I do this without getting that newline?
(note that file2 may contain all sorts of things: brackets, newlines, quotes, ...)
Much simpler to use awk:
awk 'FNR==NR{s=(!s)?$0:s RS $0;next} /<<CODE>>/{sub(/<<CODE>>/, s)} 1' file2 file1
Hello Anonymous person,
I am very nice.
Goodbye.
Explanation:
FNR==NR - Execute this block for the first file in the input, i.e. file2
s=(!s)?$0:s RS $0 - Concatenate the whole file content into the string s
next - Skip to the next line until EOF on the first file
/<<CODE>>/ - If a line containing <<CODE>> is found, execute this block
sub(/<<CODE>>/, s) - Replace <<CODE>> with the string s (the contents of file2)
1 - Print the line (awk's default action)
EDIT: Non-regex way:
awk 'FNR==NR{s=(!s)?$0:s RS $0; next}
i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' file2 file1
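A runnable sketch of the index()/substr() variant, with the sample files recreated inline; the literal 8 is the length of the string <<CODE>>:

```shell
cat > file1 <<'EOF'
Hello <<CODE>>
Goodbye.
EOF
cat > file2 <<'EOF'
Anonymous person,
I am very nice.
EOF

# index() finds <<CODE>> as a plain string, so neither the marker nor the
# replacement text is ever treated as a regex
awk 'FNR==NR{s=(!s)?$0:s RS $0; next}
     i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' file2 file1
```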
The awk though it might be harder to read, is probably the right way to go.
Just for comparison sake, here's a ruby version
ruby -ne 'BEGIN{@body=File.open("file2").read}; puts gsub(/<<CODE>>/,@body);' < file1
not too bad in Perl:
cat file1 | perl -e "open FH, qq(file2); \$f2=join '', <FH>; chomp \$f2; map {s/<<CODE>>/\$f2/g; print \$_} <STDIN>" > file3
(Maybe I am not the best Perl coder.)
It's straightforward: read the whole of file2 in, substitute it, then print.
You could do it simply with:
awk '{gsub("<<CODE>>", filetwo)}1' filetwo="$(<file2)" file1
I am thinking about a solution where we use Bash to evaluate a cat process outside of the single quotes delimiting the sed script, but unfortunately the following doesn't work as soon as your file2 contains a newline:
sed 's/<<CODE>>/'"$(cat file2)"'/' file1
It accepts spaces in file2, but not newlines, as you can see in the following piece of code that works well, whatever the content of file2:
sed 's/<<CODE>>/'"$(cat file2 | tr -d '\n')"'/' file1
But, obviously, this modifies the content of the file before inclusion, which is simply bad. :-(
So if you want to play, you can first use tr to turn the newlines into some character that is unlikely to appear, and translate those back after sed has worked on its script:
sed 's/<<CODE>>/'"$(cat file2 | tr '\n' '\3')"'/' file1 | tr '\3' '\n'
The basic idea is this. Suppose that you want to search a file for multiple patterns from a pipe with awk :
... | awk -f - '{...}' someFile.txt
* '...' is just short for some code
* '-f -' indicates the pattern is taken from pipe
Is there a way to know which pattern is being searched at each instant within the awk script? (Just as $1 refers to the first field, is there something like $PATTERN that contains the pattern currently being searched, or a way to get something like it?)
More Elaboration:
if I have 2 files:
someFile.txt containing:
1
2
4
patterns.txt containing:
1
2
3
4
running this command:
cat patterns.txt | awk -f - '{...}' someFile.txt
What should I type between the braces so that only the patterns in patterns.txt that are not matched in someFile.txt are printed? (In this case, the number 3 in patterns.txt is not matched.)
Under the requirements that patterns.txt be supplied as stdin and that the processing be done with awk:
$ cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
3
This was tested using GNU awk.
Explanation
We want to remove from patterns.txt anything that matches a line in someFile.txt. To do this, we first read in someFile.txt and create patterns from it. Next, we print only the lines from patterns.txt that do not match any of the patterns from someFile.txt.
FNR==NR{p=p "\n" $0;next;}
NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: someFile.txt. We save all such lines in the newline-separated variable p. We then tell awk to skip the remaining commands and jump to the next line.
p !~ $0
If we got here, then we are now reading the second named file on the command line, which is - for stdin. This boolean condition evaluates to either true or false. If it is true, the line is printed; if not, it is skipped. In other words, the above is awk's cryptic shorthand for:
p !~ $0 {print $0}
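To make the behaviour concrete, here is that command run against the example data from the question, recreated with heredocs. Note that each pattern line is matched against the string p as a regex, so a pattern like 1 would also "match" a stored line 12:

```shell
cat > someFile.txt <<'EOF'
1
2
4
EOF
cat > patterns.txt <<'EOF'
1
2
3
4
EOF

# Patterns not found anywhere in p (the lines of someFile.txt) are printed
cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
# -> 3
```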
cmd | awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($0 ~ p) delete pats[p]} END{ for (p in pats) print p }' - someFile.txt
Another way in awk
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -
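Unlike the regex-based approach above, $0 in a is an exact whole-line comparison, which may be preferable when the "patterns" are really fixed strings. A self-contained sketch:

```shell
cat > someFile.txt <<'EOF'
1
2
4
EOF
cat > patterns.txt <<'EOF'
1
2
3
4
EOF

# First file fills a[]; for stdin (NR>FNR), print lines not already seen
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -
# -> 3
```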
I have two files. File1 is as follows
Apple
Cat
Bat
File2 is as follows
I have an Apple
Batman returns
This is a test file.
Now I want to check which strings in the first file are not present in the second file. I can do a grep -f file1 file2, but that's giving me the matched lines in the second file.
To get the strings that are in the first file and also in the second file:
grep -of file1 file2
The result (using the given example) will be:
Apple
Bat
To get the strings that are in the first file but not in the second file, you could:
grep -of file1 file2 | cat - file1 | sort | uniq -u
Or even simpler (thanks to @triplee's comment):
grep -of file1 file2 | grep -vxFf - file1
The result (using the given example) will be:
Cat
From the grep man page:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
From the uniq man page:
-u, --unique
Only print unique lines
If you want to show words from file1 that are not in file2, a dirty way is to loop through the words and grep silently. If there is no match, print the word:
while read word
do
grep -q "$word" f2 || echo "$word"
done < f1
To match exact words, add -w: grep -wq...
Test
$ while read word; do grep -q "$word" f2 || echo "$word"; done < f1
Cat
$ while read word; do grep -wq "$word" f2 || echo "$word"; done < f1
Cat
Bat
A better approach is to use awk:
$ awk 'FNR==NR {a[$1]; next} {for (i=1;i<=NF;i++) {if ($i in a) delete a[$i]}} END {for (i in a) print i}' f1 f2
Cat
Bat
This stores the values from file1 in the array a[]. Then it loops through all lines of file2, checking every field; if a field matches a value in the array a[], that element is removed from the array. Finally, the END{} block prints the values that were not found.
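Here is a self-contained run of that command, with the sample data from the question placed in f1 and f2. One caveat: the order in which for (i in a) visits the remaining keys is unspecified, so pipe through sort if you need a stable order:

```shell
cat > f1 <<'EOF'
Apple
Cat
Bat
EOF
cat > f2 <<'EOF'
I have an Apple
Batman returns
This is a test file.
EOF

# Whole-field comparison: the word "Batman" does not delete the key "Bat"
awk 'FNR==NR {a[$1]; next}
     {for (i=1;i<=NF;i++) {if ($i in a) delete a[$i]}}
     END {for (i in a) print i}' f1 f2 | sort
# -> Bat
#    Cat
```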
I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like this:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?
Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
Some things that might be non-obvious to newcomers but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of its default of one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of its default newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
The first sed command turns the output of someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
The second sed command reads that as its script (-f - reads sed commands from the file -, i.e. gets them from the pipe) and applies it to the file file1.txt.
Remark for other readers:
OP wants to use sed, but as notified in comments, it may not be the easiest way to solve this question. Deleting lines with awk could be simpler. Another (easy) solution could be to use grep with -v (invert match) and -f (read patterns from files) options, in this way:
someawkcommand | grep -v -f - file1.txt
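For example, with file1.txt as above and the awk output simulated by printf (a hypothetical stand-in for someawkcommand). Since grep works line by line, each pattern line deletes every line of file1.txt that contains it, wherever it occurs, not just the contiguous block; add -F if the patterns should be taken literally rather than as regexes:

```shell
cat > file1.txt <<'EOF'
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
EOF

# Stand-in for someawkcommand; each output line becomes one grep pattern
printf '%s\n' '>Genome1' 'ATGCAAAAG' 'CAATAA' | grep -v -f - file1.txt
```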
Edit: Following @rici's comments, here is a new command that takes the output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't do it at home. Grown-ups are strongly encouraged to consider avoiding sed for this.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it will add an empty line instead of removed pattern. Can't fix it easily (problems if pattern is at beginning/end of file). Add a substitution to remove it if you really feel like it.
This can be done more easily in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0; next}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt
This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does global substitution(s) using the contents of file1. Finally, it removes any blank lines at the end of the file caused by the deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2
Please refer to the file contents below.
@HD VN:1.0 SO:unsorted
@SQ SN:Chr1 LN:30427680
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?
Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up to LN: and print what's left, but only if a substitution took place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
Using grep:
grep -oP 'LN:\K.*' filename
Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.
Using perl :
perl -ne 'print $& if /LN:\K.*/' filename
or
perl -ne 'print $1 if /LN:(.*)/' filename
Another awk
awk -F"LN:" 'NF>1 {print $2}' file
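Any of these variants can be checked against a copy of the header lines; for instance, the last one (with LN: as the field separator, everything after it becomes field 2 on the one line that contains it):

```shell
cat > input.txt <<'EOF'
@HD VN:1.0 SO:unsorted
@SQ SN:Chr1 LN:30427680
@PG ID:bowtie2 PN:bowtie2 VN:2.1.0
EOF

# Only the @SQ line splits into more than one field
awk -F"LN:" 'NF>1 {print $2}' input.txt
# -> 30427680
```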