show only matched strings - grep - regex

I have two files. File1 is as follows
Apple
Cat
Bat
File2 is as follows
I have an Apple
Batman returns
This is a test file.
Now I want to check which strings in the first file are not present in the second file. I can do a grep -f file1 file2, but that's giving me the matched lines in the second file.
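With the sample files, that looks like:
$ grep -f file1 file2
I have an Apple
Batman returns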

To get the strings that are in the first file and also in the second file:
grep -of file1 file2
The result (using the given example) will be:
Apple
Bat
To get the strings that are in the first file but not in the second file, you could append file1 to the matches and keep only the lines that occur exactly once:
grep -of file1 file2 | cat - file1 | sort | uniq -u
Or even simpler (thanks to @triplee's comment):
grep -of file1 file2 | grep -vxFf - file1
Here -v inverts the match, -x matches whole lines only, -F treats the patterns as fixed strings, and -f - reads them from the first grep's output. The result (using the given example) will be:
Cat
From the grep man page:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
From the uniq man page:
-u, --unique
Only print unique lines

If you want to show words from file1 that are not in file2, a quick-and-dirty way is to loop through the words and grep silently; when there is no match, print the word:
while read word
do
    grep -q "$word" f2 || echo "$word"
done < f1
To match exact words, add -w: grep -wq...
Test
$ while read word; do grep -q "$word" f2 || echo "$word"; done < f1
Cat
$ while read word; do grep -wq "$word" f2 || echo "$word"; done < f1
Cat
Bat
A better approach is to use awk:
$ awk 'FNR==NR {a[$1]; next} {for (i=1;i<=NF;i++) {if ($i in a) delete a[$i]}} END {for (i in a) print i}' f1 f2
Cat
Bat
This stores the values of file1 in the array a[]. Then it loops through all lines of file2, checking every single field; if a field matches a key in a[], that key is deleted from the array. Finally, the END{} block prints the values that were never found.
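For readability, the same one-liner can be laid out as a commented script (same logic as above):
awk '
FNR==NR { a[$1]; next }               # first file: store every word as a key of a[]
{
    for (i = 1; i <= NF; i++)         # second file: scan each field of every line
        if ($i in a) delete a[$i]     # a word that is found is removed from the array
}
END { for (i in a) print i }          # whatever is left was never matched
' f1 f2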

Related

Printing Both Matching and Non-Matching Patterns

I am trying to compare two files and then return one of the files' columns upon a match. The code that I am using right now excludes non-matching patterns and just prints out matching patterns. I need to print all results, both matching and non-matching, using grep.
File 1:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
File 2:
F
A
B
Z
C
P
E
Current Result:
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
Expected Result:
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Bash Code:
while IFS=',' read point lat lon; do
    check=`grep "${point} /home/aaron/file2 | awk '{print $1}'`
    echo "${check},${lat},${lon}"
done < /home/aaron/file1
In awk:
$ awk -F, 'NR==FNR{a[$1]=$0;next}{print ($1 in a?a[$1]:$1)}' file1 file2
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E
Explained:
$ awk -F, '                    # field separator to ,
NR==FNR {                      # file1
    a[$1]=$0                   # hash record to a, use field 1 as key
    next
}
{
    print ($1 in a?a[$1]:$1)   # print match if found, else nonmatch
}
' file1 file2
If you don't care about order, there's a join binary in GNU coreutils that does just what you need:
$ sort file1 > sortedFile1
$ sort file2 > sortedFile2
$ join -t, -a 2 sortedFile1 sortedFile2
A,42.4,-72.2
B,47.2,-75.9
C,41.7,-95.2
E
F
P
Z,38.3,-70.7
It relies on the files being sorted and will not work otherwise. Now will you please get out of my /home/?
Another join-based solution, preserving the order:
f() { nl -nln -s, -w1 "$1" | sort -t, -k2; }
join -t, -j2 -a2 <(f file1) <(f file2) |
    sort -t, -k2 |
    cut -d, -f2 --complement
F
A,42.4,-72.2,2
B,47.2,-75.9,3
Z,38.3,-70.7,4
C,41.7,-95.2,5
P
E
It cannot beat the awk solution, but it is another alternative built from the Unix toolchain, based on the decorate-sort-undecorate pattern.
Problems with your current solution:
1. You are missing a double-quote in grep "${point} /home/aaron/file2.
2. You should iterate over the other file in order to print all of its lines:
while IFS=',' read point; do
    echo "${point}$(grep "${point}" /home/aaron/file1 | sed 's/[^,]*,/,/')"
done < /home/aaron/file2
3. The grep can give more than one result; which one do you want (the first, as with head -1)?
An improvement would be:
while IFS=',' read point; do
    echo "${point}$(grep "^${point}," /home/aaron/file1 | sed -n '1s/[^,]*,/,/p')"
done < /home/aaron/file2
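For the sample files, this should print the expected result:
$ while IFS=',' read point; do echo "${point}$(grep "^${point}," /home/aaron/file1 | sed -n '1s/[^,]*,/,/p')"; done < /home/aaron/file2
F
A,42.4,-72.2
B,47.2,-75.9
Z,38.3,-70.7
C,41.7,-95.2
P
E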
4. Using while is the wrong approach.
For small files it will get the work done, but you will get stuck with larger files: grep is called for every line of file2, so file1 is read over and over again.
Better is using awk or some other tool that reads each file only once.
Another solution is using sed with the output of another sed command:
sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1
This will give commands for the second sed.
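For the sample file1 above, the generated script would be:
s/^A$/A,42.4,-72.2/
s/^B$/B,47.2,-75.9/
s/^Z$/Z,38.3,-70.7/
s/^C$/C,41.7,-95.2/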
Feeding these commands to a second sed, which processes file2, then produces the expected result:
sed -f <(sed -r 's#([^,]*),(.*)#s/^\1$/\1,\2/#' /home/aaron/file1) /home/aaron/file2

removing last character of every word in files

I have multiple files, each with just one line of simple text. I want to remove the last character of every word in each file. Every file has a different length of text.
The closest I got is to edit one file:
awk '{ print substr($1, 1, length($1)-1); print substr($2, 1, length($2)-1); }' file.txt
But I cannot figure out how to make this general, for files with different word counts.
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file
This should do the removal: the loop strips the last character of every field, and the trailing 7 is simply a true condition, so awk prints each rebuilt line.
If it was tested ok, and you want to overwrite your file, you can do:
awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file > tmp && mv tmp file
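With GNU awk 4.1 or later you could also edit the file in place (this assumes gawk; plain awk has no such option):
gawk -i inplace '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' file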
Example:
kent$ awk '{for(x=1;x<=NF;x++)sub(/.$/,"",$x)}7' <<<"foo bar foobar"
fo ba fooba
Use awk to loop over all the fields in each row, up to NF, and apply the substr function:
awk '{for (i=1; i<=NF; i++) {printf "%s ", substr($i, 1, length($i)-1)}}END{printf "\n"}' file
For a sample input file
ABCD ABC BC
The awk logic produces an output
ABC AB B
Another way, setting the output record separator (ORS) to an empty string and just using print:
awk 'BEGIN{ORS="";}{for (i=1; i<=NF; i++) {print substr($i, 1, length($i)-1); print " "}}END{print "\n"}' file
I would go for a Bash approach:
Since ${var%?} removes the last character of a variable:
$ var="hello"
$ echo "${var%?}"
hell
And you can use the same approach on arrays:
$ arr=("hello" "how" "are" "you")
$ printf "%s\n" "${arr[@]%?}"
hell
ho
ar
yo
What about going through the files, reading their only line (you said the files consist of just one line) into an array, and using the above-mentioned expansion to remove the last character of each word:
for file in dir/*; do
    read -r -a myline < "$file"
    printf "%s " "${myline[@]%?}"
done
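A quick demonstration (the directory and file name are made up for this example):
$ mkdir -p dir && echo "hello how are you" > dir/sample.txt
$ for file in dir/*; do read -r -a myline < "$file"; printf "%s " "${myline[@]%?}"; done; echo
hell ho ar yo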
A sed version, assuming words are composed only of letters (if not, just adapt the class [[:alpha:]] to your needs) and are separated by spaces and punctuation:
sed 's/$/ /;s/[[:alpha:]]\([[:blank:][:punct:]]\)/\1/g;s/ $//' YourFile
An awk version (gawk, in fact, because of the word-boundary regex):
gawk '{gsub(/.\>/, "");print}' YourFile
# or, optimized by @kent ;-) thanks for the tips
gawk '4+gsub(/.\>/, "")' YourFile
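Tested on the same sample input as above, this should yield:
$ gawk '4+gsub(/.\>/, "")' <<<"foo bar foobar"
fo ba fooba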
$ cat foo
word1
word2 word3
$ sed 's/\([^ ]*\)[^ ]\( \|$\)/\1\2/g' foo
word
word word
A word is any string of characters excluding space (=[^ ]).
EDIT: If you want to enforce POSIX (--posix), you can use:
$ sed --posix 's/\([^ ]*\)[^ ]\([ ]\{,1\}\)/\1\2/g' foo
word
word word
This \( \|$\) changes to \([ ]\{,1\}\), i.e. there is an optional space at the end.

Sed replace pattern with file contents

I would like to use sed (I think?) to replace a pattern with file contents.
Example
File 1 (primary file)
Hello <<CODE>>
Goodbye.
File 2 (contents)
Anonymous person,
I am very nice.
File 3 (target)
Hello Anonymous person,
I am very nice.
Goodbye.
Right now, I am using this command:
sed "/<<CODE>>/{r file2
:a;n;ba}" file1 | \
sed "s/<<CODE>>//g" > \
file3
But this outputs:
Hello
Anonymous person,
I am very nice.
Goodbye.
(note the newline after Hello)
How can I do this without getting that newline?
(note that file2 may contain all sorts of things: brackets, newlines, quotes, ...)
Much simpler to use awk:
awk 'FNR==NR{s=(!s)?$0:s RS $0;next} /<<CODE>>/{sub(/<<CODE>>/, s)} 1' file2 file1
Hello Anonymous person,
I am very nice.
Goodbye.
Explanation:
FNR==NR - Execute this block for first file in input i.e. file2
s=(!s)?$0:s RS $0 - Concatenate whole file content in string s
next - Read next line until EOF on first file
/<<CODE>>/ - If a line with <<CODE>> is found execute that block
sub(/<<CODE>>/, s) - Replace <<CODE>> with string s (data of file2)
1 - print the output
EDIT: A non-regex way, using index() and substr() instead of sub():
awk 'FNR==NR{s=(!s)?$0:s RS $0; next}
i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' file2 file1
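One reason to prefer it: in sub(), an & in the replacement text stands for the matched string, so a file2 containing & would be mangled by the regex version, while the index/substr variant copies s literally. A quick check (the file names here are made up for this demo):
$ printf 'Hello <<CODE>>\nGoodbye.\n' > f1
$ printf 'you & me\n' > f2
$ awk 'FNR==NR{s=(!s)?$0:s RS $0; next}
  i=index($0, "<<CODE>>"){$0=substr($0, 1, i-1) s substr($0, i+8)} 1' f2 f1
Hello you & me
Goodbye.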
The awk, though it might be harder to read, is probably the right way to go. Just for comparison's sake, here's a Ruby version:
ruby -ne 'BEGIN{@body=File.open("file2").read}; puts gsub(/<<CODE>>/,@body);' < file1
Not too bad in Perl:
cat file1 | perl -e "open FH, qq(file2); \$f2=join '', <FH>; chomp \$f2; map {s/<<CODE>>/\$f2/g; print \$_} <STDIN>" > file3
(Maybe I am not the best Perl coder.)
It's straightforward: read the whole of file2 in, substitute, then print.
You could do it simply with awk's command-line variable assignment (filetwo is set before file1 is read):
awk '{gsub("<<CODE>>", filetwo)}1' filetwo="$(<file2)" file1
I was thinking about a solution where Bash evaluates a cat process outside of the single quotes delimiting the sed script, but unfortunately the following does not work as soon as your file2 contains a newline:
sed 's/<<CODE>>/'"$(cat file2)"'/' file1
It accepts spaces in file2, but not newlines. The following variant works whatever the content of file2, as you can see:
sed 's/<<CODE>>/'"$(cat file2 | tr -d '\n')"'/' file1
But, obviously, this modifies the content of the file before inclusion, which is simply bad. :-(
So if you want to play, you can first tr the newlines to some weird, unexpected character and translate those back after sed has done its work (this assumes the chosen character, \3 here, never occurs in your data):
sed 's/<<CODE>>/'"$(cat file2 | tr '\n' '\3')"'/' file1 | tr '\3' '\n'

Is there a way to obtain the current pattern searched in an AWK script?

The basic idea is this. Suppose that you want to search a file for multiple patterns from a pipe with awk :
... | awk -f - '{...}' someFile.txt
* '...' is just short for some code
* '-f -' indicates the patterns are taken from the pipe
Is there a way to know which pattern is being searched at each instant within the awk script? (Just as you know $1 is the first field, is there something like $PATTERN that contains the current pattern being searched, or a way to get something like it?)
More Elaboration:
if I have 2 files:
someFile.txt containing:
1
2
4
patterns.txt containing:
1
2
3
4
running this command:
cat patterns.txt | awk -f - '{...}' someFile.txt
What should I type between the braces so that only the patterns in patterns.txt that have not been matched in someFile.txt are printed? (In this case, the number 3 in patterns.txt is not matched.)
Under the requirements that patterns.txt be supplied as stdin and that the processing be done with awk:
$ cat patterns.txt | awk 'FNR==NR{p=p "\n" $0;next;} p !~ $0' someFile.txt -
3
This was tested using GNU awk.
Explanation
We want to remove from patterns.txt anything that matches a line in someFile.txt. To do this, we first read in someFile.txt and create patterns from it. Next, we print only the lines from patterns.txt that do not match any of the patterns from someFile.txt.
FNR==NR{p=p "\n" $0;next;}
NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: someFile.txt. We save all such lines in the newline-separated variable p. We then tell awk to skip the remaining commands and jump to the next line.
p !~ $0
If we got here, then we are now reading the second named file on the command line, which is - for stdin. This boolean condition evaluates to either true or false; if it is true, the line is printed, and if not, it is skipped. In other words, the above is awk's cryptic shorthand for:
p !~ $0 {print $0}
cmd | awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($0 ~ p) delete pats[p]} END{ for (p in pats) print p }' - someFile.txt
Another way in awk:
cat patterns.txt | awk 'NR>FNR&&!($0 in a);{a[$0]}' someFile.txt -
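Written out with comments (a sketch; note that this one tests exact line equality via array membership rather than regex matching):
cat patterns.txt | awk '
NR > FNR && !($0 in a)   # second input (stdin): print a pattern not seen as a line before
{ a[$0] }                # remember every line read, from either input, as an array key
' someFile.txt -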

AWK replace $0 of second file when match few columns

How do I merge two files so that, when the first two columns match in both files, the first file's values are replaced with the second file's columns? I mean...
Same number of columns:
FILE 1:
121212,0100,1.1,1.2,
121212,0200,2.1,2.2,
FILE 2:
121212,0100,3.1,3.2,3.3,
121212,0130,4.1,4.2,4.3,
121212,0200,5.1,5.2,5.3,
121212,0230,6.1,6.2,6.3,
OUTPUT:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
In other words, I need to print $0 of the second file when $1 and $2 match in both files. I understand the logic, but I can't implement it using arrays, which apparently should be used.
Please take a moment to explain any code.
Use awk to print the first 2 fields in the pattern file and pipe to grep to do the match:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
The -f option tells grep to take the patterns from a file, but using - instead of a filename makes grep take the patterns from stdin.
So the first awk script produces the patterns from file1, which we pipe to grep to match against file2:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1
121212,0100
121212,0200
You probably want to anchor the match to the beginning of the line using ^:
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1
^121212,0100
^121212,0200
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
Here's one way using awk:
awk -F, 'FNR==NR { a[$1,$2]; next } ($1,$2) in a' file1 file2
Results:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
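Since the question asks for an explanation of any code, here is the same one-liner laid out with comments (a sketch; same logic):
awk -F, '                      # fields are comma-separated
FNR==NR { a[$1,$2]; next }     # file1: store each (field1,field2) pair as an array key
($1,$2) in a                   # file2: print lines whose first two fields were stored
' file1 file2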