I am trying to compare two files and, on a match, return the columns of one of them. My current code drops the non-matching patterns and prints only the matching ones. I need to print all results, both matching and non-matching, using grep.
File 1:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
File 2:
F
A
B
Z
C
P
E
Current Result:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
Expected Result:
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Bash Code:
while IFS=',' read point lat lon; do
check=`grep "${point} /home/aaron/file2 | awk '{print $1}'`
echo "${check},${lat},${lon}"
done < /home/aaron/file1
In awk:
$ awk -F, 'NR==FNR{a[$1]=$0;next}{print ($1 in a?a[$1]:$1)}' file1 file2
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Explained:
$ awk -F, ' # field separator to ,
NR==FNR { # file1
a[$1]=$0 # hash record to a, use field 1 as key
next
}
{
print ($1 in a?a[$1]:$1) # print match if found, else nonmatch
}
' file1 file2
If you don't care about order, there's a join binary in GNU coreutils that does just what you need:
$ sort file1 > sortedFile1
$ sort file2 > sortedFile2
$ join -t, -a 2 sortedFile1 sortedFile2
A,42.4,-72.2
B,47.2,-75.9
C,41.7,-95.2
E
F
P
Z,38.3,-70.7
It relies on files being sorted and will not work otherwise.
Now will you please get out of my /home/ ?
Another join-based solution that preserves the order:
f() { nl -nln -s, -w1 "$1" | sort -t, -k2; }; join -t, -j2 -a2 <(f file1) <(f file2) |
sort -t, -k2 |
cut -d, -f2 --complement
F
A,42.4,-72.2,2
B,47.2,-75.9,3
Z,38.3,-70.7,4
C,41.7,-95.2,5
P
E
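This cannot beat the awk solution, but it is another alternative that uses the Unix toolchain and the decorate-undecorate pattern. Note that, as the output shows, matched lines keep file2's line number as a trailing field.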
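The matched lines in the output above end with an extra number. A minimal sketch of the same decorate-sort-undecorate idea that drops it again could look like this. It decorates only file2, and the temporary directory, the s1/s2 file names and the sample data are assumptions made for illustration:

```shell
# Sample data from the question, written to a scratch directory (assumption).
tmpdir=$(mktemp -d)
printf '%s\n' 'A,42.4,-72.2' 'B,47.2,-75.9' 'Z,38.3,-70.7' 'C,41.7,-95.2' > "$tmpdir/file1"
printf '%s\n' F A B Z C P E > "$tmpdir/file2"

# join needs both inputs sorted with the same collation.
export LC_ALL=C
sort -t, -k1,1 "$tmpdir/file1" > "$tmpdir/s1"

# Decorate: prefix every file2 key with its line number ("n,key"), sort by key.
awk '{ print NR "," $0 }' "$tmpdir/file2" | sort -t, -k2,2 > "$tmpdir/s2"

# Join on the key, move the line number (last field) to the front,
# sort numerically on it, then undecorate by cutting it off.
result=$(join -t, -1 1 -2 2 -a 2 "$tmpdir/s1" "$tmpdir/s2" |
  awk -F, '{ n = $NF; sub(/,[^,]*$/, ""); print n "," $0 }' |
  sort -t, -k1,1n |
  cut -d, -f2-)

printf '%s\n' "$result"
rm -r "$tmpdir"
```

Keeping the line number as the sort key until the very last step is what restores file2's original order.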
Problems with your current solution:
1. You are missing a double-quote in grep "${point} /home/aaron/file2.
2. You should start with the other file for printing all lines in that file
while IFS=',' read point; do
echo "${point}$(grep "${point}" /home/aaron/file1 | sed 's/[^,]*,/,/')"
done < /home/aaron/file2
3. The grep can give more than one result. Which one do you want (head -1) ?
An improvement would be
while IFS=',' read point; do
echo "${point}$(grep "^${point}," /home/aaron/file1 | sed -n '1s/[^,]*,/,/p')"
done < /home/aaron/file2
4. Using while is the wrong approach.
For small files it will get the work done, but you will get stuck with larger files. The reason is that you call grep once for each line in file2, reading file1 over and over.
Better is using awk or some other solution.
Another solution is using sed with the output of another sed command:
sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1
This will give commands for the second sed.
sed -f <(sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1) /home/aaron/file2
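To see what the generated commands look like, here is a small sketch that writes them to a file before applying them. The scratch directory, the script.sed name and the shortened sample data are assumptions, and -E is used in place of -r only because both GNU and BSD sed accept it:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'A,42.4,-72.2' 'B,47.2,-75.9' > "$tmpdir/file1"
printf '%s\n' F A B > "$tmpdir/file2"

# First sed: turn each "key,rest" record into a substitution command.
sed -E 's#([^,]*),(.*)#s/^\1$/\1,\2/#' "$tmpdir/file1" > "$tmpdir/script.sed"
cat "$tmpdir/script.sed"
# s/^A$/A,42.4,-72.2/
# s/^B$/B,47.2,-75.9/

# Second sed: run the generated commands against file2.
result=$(sed -f "$tmpdir/script.sed" "$tmpdir/file2")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

This only works as long as the data contains no characters that are special to sed, such as /, & or a backslash.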
I have a collection of plain text files named yymmdd_nnnnnnnnnn.txt. I want to append another number sequence to each filename, so that each file becomes yymmdd_nnnnnnnnnn_iiiiiiiii.txt, where iiiiiiiii is taken from the one line in each file that ends with the text "GST: 123456789⏎" (or similar). While I am sure that there will be only one such matching line in each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer to a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
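On the \d warning: awk uses POSIX extended regular expressions, which have no \d shorthand, so a bracket expression such as [0-9] is needed instead. A minimal sketch, with made-up input lines around the GST line from the question:

```shell
# The 'Invoice' and 'Total' lines are invented sample input (assumption).
gst=$(printf '%s\n' 'Invoice 42' 'GST: 112060340' 'Total: 10' |
  awk '/GST: / && match($0, /[0-9]+$/) {
         # match() sets RSTART and RLENGTH to the trailing run of digits
         print substr($0, RSTART, RLENGTH)
       }')
printf '%s\n' "$gst"
```

Note also that the original pipeline had the two commands the wrong way round: awk was started with no input file and grep ignored whatever awk wrote to the pipe.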
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "${f%.txt}_$n.txt"
done
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi line script:-
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want your files to be easy to work with, leave as many potential places for them to be split with newlines as possible. It is much easier to alter files by putting what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.
I'm not very fluent in bash but actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got a following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get numbers of lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want all lines matching some regex /\s*X./, that occur right after one match with another regex /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: is there any smart way, or any tool I don't know of, to do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
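The question asks for the numbers on a single line. One way to get that, sketched here with a shortened version of the sample input (the input lines are an assumption), is to pipe the result through paste:

```shell
lines=$(printf '%s\n' 'Some' 'spam' 'I want following letters:' 'X1' 'X2' \
                      'I do not want these:' 'X4' 'I want following letters:' 'X7' |
  awk '/I want following/{p=1;next} !/^X/{p=0;next} p{print NR}' |
  paste -sd ' ' -)   # join the line numbers with single spaces
printf '%s\n' "$lines"
# 4 5 9
```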
Following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above also prints the lines containing "I want following letters:". If you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To add line number to output use following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n $entry "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here are some comments:
No need to use cat file|grep ... => grep ... file
Do not use the syntax for i in $(command); it's often the cause of bugs, and there's always a smarter solution.
No need to use a tmp file either
And then, there are a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7
I'm trying to add a 'chr' string at the start of the lines where it is missing. This operation is needed only on the lines that do not start with '#'.
At first I used grep and sed, as follows, but I want the command to overwrite the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to run the command on the file in place, I wrote this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41
If I understand your expected result correctly, try:
sed -ri '/^(#|chr)/! s/^/chr/' file
Your question isn't clear and you didn't provide the expected output so we can't test a potential solution but if all you want is to add chr to the start of lines where it's not already present and which don't start with # then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
This can be done with a single sed invocation. The script is something like the following.
If you have input of the form
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
then prepending chr to the non-commented lines that contain no chr is done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on a file in place (-i in modern sed versions).
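A hedged sketch of such an in-place run, keeping a backup copy. The temporary file, the .bak suffix and the four sample lines are assumptions. Unlike the script above, this variant skips only lines that start with chr, which is closer to what the question asks for:

```shell
tmp=$(mktemp)
printf '%s\n' '##meta' '#CHROM POS' '1 955597' 'chr1 957898' > "$tmp"

# "b" with no label jumps to the end of the script, so header lines and
# lines already starting with chr are printed unchanged.
sed -i.bak '/^#/b
/^chr/b
s/^/chr/' "$tmp"

result=$(cat "$tmp")
printf '%s\n' "$result"
rm -f "$tmp" "$tmp.bak"
```

Writing the suffix directly after -i is the form that both GNU and BSD sed accept.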
I have two files. File1 is as follows
Apple
Cat
Bat
File2 is as follows
I have an Apple
Batman returns
This is a test file.
Now I want to check which strings in the first file are not present in the second file. I can do a grep -f file1 file2, but that gives me the matched lines in the second file.
To get the strings that are in the first file and also in the second file:
grep -of file1 file2
The result (using the given example) will be:
Apple
Bat
To get the strings that are in the first file but not in the second file, you could:
grep -of file1 file2 | cat - file1 | sort | uniq -u
Or even simpler (thanks to @triplee's comment):
grep -of file1 file2 | grep -vxFf - file1
The result (using the given example) will be:
Cat
From the grep man page:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
From the uniq man page:
-u, --unique
Only print unique lines
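To see why uniq -u works here, it may help to look at the counts first. A small sketch using the files from the question, written to a scratch directory (an assumption):

```shell
tmpdir=$(mktemp -d)
printf '%s\n' Apple Cat Bat > "$tmpdir/file1"
printf '%s\n' 'I have an Apple' 'Batman returns' 'This is a test file.' > "$tmpdir/file2"

# Apple and Bat show up twice (once from grep -o, once from file1),
# Cat only once, so uniq -u keeps just Cat.
grep -of "$tmpdir/file1" "$tmpdir/file2" | cat - "$tmpdir/file1" | sort | uniq -c

# The second form does not depend on counting.
missing=$(grep -of "$tmpdir/file1" "$tmpdir/file2" | grep -vxFf - "$tmpdir/file1")
printf '%s\n' "$missing"
rm -r "$tmpdir"
```

The counting variant would wrongly drop a string that is listed twice in file1 and absent from file2, which is one more reason to prefer the second form.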
If you want to show the words from file1 that are not in file2, a dirty way is to loop through the words and grep silently. If there is no match, print the word:
while read word
do
grep -q "$word" f2 || echo "$word"
done < f1
To match exact words, add -w: grep -wq...
Test
$ while read word; do grep -q "$word" f2 || echo "$word"; done < f1
Cat
$ while read word; do grep -wq "$word" f2 || echo "$word"; done < f1
Cat
Bat
A better approach is to use awk:
$ awk 'FNR==NR {a[$1]; next} {for (i=1;i<=NF;i++) {if ($i in a) delete a[$i]}} END {for (i in a) print i}' f1 f2
Cat
Bat
This stores the values in file1 in the array a[]. Then it loops through all lines of file2, checking every field. If a field matches a key in a[], that element is removed from the array. Finally, the END{} block prints the values that were not found.
Given a file like this:
a
b
a
b
I'd like to be able to use sed to replace just the last line that contains an instance of "a" in the file. So if I wanted to replace it with "c", then the output should look like:
a
b
c
b
Note that I need this to work irrespective of how many matches it might encounter, or the details of exactly what the desired pattern or file contents might be. Thanks in advance.
Not quite sed only:
tac file | sed '/a/ {s//c/; :loop; n; b loop}' | tac
testing
% printf "%s\n" a b a b a b | tac | sed '/a/ {s//c/; :loop; n; b loop}' | tac
a
b
a
b
c
b
Reverse the file, then for the first match, make the substitution and then unconditionally slurp up the rest of the file. Then re-reverse the file.
Note, an empty regex (here as s//c/) means re-use the previous regex (/a/)
I'm not a huge sed fan, beyond very simple programs. I would use awk:
tac file | awk '/a/ && !seen {sub(/a/, "c"); seen=1} 1' | tac
Many good answers here; here's a conceptually simple two-pass sed solution assisted by tail that is POSIX-compliant and doesn't read the whole file into memory, similar to Eran Ben-Natan's approach:
sed "$(sed -n '/a/ =' file | tail -n 1)"' s/a/c/' file
sed -n '/a/=' file outputs the numbers of the lines (function =) matching regex a, and tail -n 1 extracts the output's last line, i.e. the number of the line in file file containing the last occurrence of the regex.
Placing the command substitution $(sed -n '/a/=' file | tail -n 1) directly before ' s/a/c/' results in an outer sed script such as 3 s/a/c/ (with the sample input), which performs the desired substitution only on the last line on which the regex occurred.
If the pattern is not found in the input file, the whole command is an effective no-op.
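If you want the no-match case to be explicit rather than implicit, the two passes can be wrapped in a small function. This is only a sketch: replace_last is a hypothetical name, and it assumes that neither the regex nor the replacement contains a slash:

```shell
# Hypothetical helper: replace_last FILE REGEX REPLACEMENT
replace_last() {
  file=$1 regex=$2 repl=$3
  # Pass 1: number of the last line matching the regex (empty if none).
  last=$(sed -n "/$regex/=" "$file" | tail -n 1)
  if [ -z "$last" ]; then
    cat "$file"                      # no match: pass the input through
  else
    # Pass 2: substitute only on that line.
    sed "${last}s/$regex/$repl/" "$file"
  fi
}

tmp=$(mktemp)
printf '%s\n' a b a b > "$tmp"
result=$(replace_last "$tmp" a c)
printf '%s\n' "$result"
rm -f "$tmp"
```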
Another approach:
sed "`grep -n '^a$' a | cut -d \: -f 1 | tail -1`s/a/c/" a
The advantage of this approach is that you run sequentially on the file twice, and not read it to memory. This can be meaningful in large files.
This might work for you (GNU sed):
sed -r '/^PATTERN/!b;:a;$!{N;/^(.*)\n(PATTERN.*)/{h;s//\1/p;g;s//\2/};ba};s/^PATTERN/REPLACEMENT/' file
or another way:
sed '/^PATTERN/{x;/./p;x;h;$ba;d};x;/./{x;H;$ba;d};x;b;:a;x;/./{s/^PATTERN/REPLACEMENT/p;d};x' file
or if you like:
sed -r ':a;$!{N;ba};s/^(.*\n?)PATTERN/\1REPLACEMENT/' file
On reflection, this solution may replace the first two:
sed '/a/,$!b;/a/{x;/./p;x;h};/a/!H;$!d;x;s/^a$/c/M' file
If the regexp is nowhere to be found in the file, the file will pass through unchanged. Once the regex matches, all lines are stored in the hold space and are printed when one or both conditions are met. If a subsequent match is encountered, the contents of the hold space are printed and the latest match replaces them. At the end of the file, the first line of the hold space holds the last matching line, and this can be replaced.
Another one:
tr '\n' ' ' | sed 's/\(.*\)a/\1c/' | tr ' ' '\n'
in action:
$ printf "%s\n" a b a b a b | tr '\n' ' ' | sed 's/\(.*\)a/\1c/' | tr ' ' '\n'
a
b
a
b
c
b
A two-pass solution for when buffering the entire input is intolerable:
sed "$(sed -n /a/= file | sed -n '$s/$/ s,a,c,/p' )" file
(the earlier version of this hit a bug with history expansion encountered on a redhat bash-4.1 install, this way avoids a $!d that was being mistakenly expanded.)
A one-pass solution that buffers as little as possible:
sed '/a/!{1h;1!H};/a/{x;1!p};$!d;g;s/a/c/'
Simplest:
tac | sed '0,/a/ s/a/c/' | tac
Here is all done in one single awk
awk 'FNR==NR {if ($0~/a/) f=NR;next} FNR==f {$0="c"} 1' file file
a
b
c
b
This reads the file twice. First run to find last a, second run to change it.
tac infile.txt | sed "s/a/c/; ta ; b ; :a ; N ; ba" | tac
The first tac reverses the lines of infile.txt, the sed expression (see https://stackoverflow.com/a/9149155/2467140) replaces the first match of 'a' with 'c' and prints the remaining lines, and the last tac reverses the lines back to their original order.
Here is a way with only using awk:
awk '{a[NR]=$1}END{x=NR;cnt=1;while(x>0){a[x]=((a[x]=="a"&&--cnt==0)?"c <===":a[x]);x--};for(i=1;i<=NR;i++)print a[i]}' file
$ cat f
a
b
a
b
f
s
f
e
a
v
$ awk '{a[NR]=$1}END{x=NR;cnt=1;while(x>0){a[x]=((a[x]=="a"&&--cnt==0)?"c <===":a[x]);x--};for(i=1;i<=NR;i++)print a[i]}' f
a
b
a
b
f
s
f
e
c <===
v
It can also be done in perl:
perl -e '@a=reverse<>;END{for(@a){if(/a/){s/a/c/;last}}print reverse @a}' temp > your_new_file
Tested:
> cat temp
a
b
c
a
b
> perl -e '@a=reverse<>;END{for(@a){if(/a/){s/a/c/;last}}print reverse @a}' temp
a
b
c
c
b
>
Here's another option:
sed -e '$ a a' -e '$ d' file
The first command appends an a and the second deletes the last line. From the sed(1) man page:
$ Match the last line.
d Delete pattern space. Start next cycle.
a text Append text, which has each embedded newline preceded by a backslash.
Here's the command:
sed '$s/.*/a/' filename.txt
And here it is in action:
> echo "a
> b
> a
> b" > /tmp/file.txt
> sed '$s/.*/a/' /tmp/file.txt
a
b
a
a
awk-only solution:
awk '/a/{printf "%s", all; all=$0"\n"; next}{all=all $0"\n"} END {sub(/^[^\n]*/,"c",all); printf "%s", all}' file
Explanation:
When a line matches a, all lines from the previous a up to (but not including) the current a (i.e. the content stored in the variable all) are printed.
When a line doesn't match a, it gets appended to the variable all.
The last line matching a would not be able to get its all content printed, so you manually print it out in the END block. Before that though, you can substitute the line matching a with whatever you desire.
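One edge case: if no line matches at all, the END block as written replaces the first line of the input. A possible guard is sketched below. The found flag is an addition, not part of the answer above, and the sample input is made up:

```shell
result=$(printf '%s\n' x a b a b |
  awk '/a/ { printf "%s", all; all = $0 "\n"; found = 1; next }
       { all = all $0 "\n" }
       END {
         # Only rewrite the buffered block if a match was ever seen.
         if (found) sub(/^[^\n]*/, "c", all)
         printf "%s", all
       }')
printf '%s\n' "$result"
```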
Given:
$ cat file
a
b
a
b
You can use POSIX grep to count the matches:
$ grep -c '^a' file
2
Then feed that number into awk to print a replacement:
$ awk -v last=$(grep -c '^a' file) '/^a/ && ++cnt==last{ print "c"; next } 1' file
a
b
c
b