How to get complementary lines from two text files?
File file1.txt has
123 foo
234 bar
...
File file2.txt has
123 foo
333 foobar
234 bar
...
I want to get all lines in file1.txt and not in file2.txt. The two files are hundreds of MB large and contain non-ASCII characters. What's a fast way to do this?

For good performance with large files, don't read much of a file into memory; work with what's on disk as much as possible.
String matching can be done efficiently with hashing.
One strategy (a sketch follows the list):
Scan file2.txt line by line. For each line:
Hash the string for the line. The hashing algorithm you use does matter; djb2 is one example, but there are many.
Put the key into a hash-set structure. Do not keep the string data.
Scan file1.txt line by line. For each line:
Hash the string for the line.
If the hash key is not found in the set built from file2.txt:
Write the string data for this line to the output where you're tracking the different lines (e.g. standard output or another file). The hash didn't match, so this line appears in file1.txt but not in file2.txt.
Because the string data is discarded, a hash collision could wrongly drop a line; keep the original strings (or re-check matches against them) if you need exact results.
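A minimal Python sketch of this strategy, under the assumptions above (one short digest per line of file2.txt, both files streamed; the output file name is just an example):
import hashlib

def line_key(line):
    # Short, stable digest of the raw line bytes (also fine for non-ASCII input).
    return hashlib.blake2b(line, digest_size=8).digest()

# Build a set holding one small hash per line of file2.txt.
seen = set()
with open("file2.txt", "rb") as f2:
    for line in f2:
        seen.add(line_key(line.rstrip(b"\r\n")))

# Stream file1.txt and keep only the lines whose hash was never seen in file2.txt.
with open("file1.txt", "rb") as f1, open("only_in_file1.txt", "wb") as out:
    for line in f1:
        if line_key(line.rstrip(b"\r\n")) not in seen:
            out.write(line)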

Lines, specifically?
fgrep -vxf file2.txt file1.txt
Here -f file2.txt reads the patterns from file2.txt, -x matches whole lines only, -v inverts the match so only the lines of file1.txt that match none of them are printed, and fgrep (i.e. grep -F) treats the patterns as fixed strings rather than regular expressions.

"Hundreds of MB" is not so much.
I would solve this task this way (in Perl):
$ cat complementary.pl
my %f;
open(F, $ARGV[1]) or die "Can't open file2: $ARGV[1]\n";
$f{$_} = 1 while (<F>);    # remember every line of file2
close(F);
open(F, $ARGV[0]) or die "Can't open file1: $ARGV[0]\n";
while (<F>) {
    print if not exists $f{$_};    # line of file1 not seen in file2
}
close(F);
Example of usage:
$ cat file1.txt
100 a
200 b
300 c
$ cat file2.txt
200 b
100 a
400 d
$ perl complementary.pl file1.txt file2.txt
300 c

Related

How to extract values between 2 known strings

I have some huge files containing mixed binary and XML data. I want to extract all values between two XML tags that occur multiple times in the file. The pattern looks like this: <C99><F1>050</F1><F2>random value</F2></C99>. The XML portions are not formatted; everything is on a single line.
I need all values between <F1> and </F1> from <C99> where the value is in the range 050 to 999 (<F1> exists under other fields as well, but I only need the F1 values from C99). I need to count them, to see how many C99 elements have an F1 value between 050 and 999.
I want a hint on how I could easily reach and extract those values (using cat and grep? or sed?). Sorting and counting is easy once the values are exported to a file.
My temporary solution:
After removing all binary data from the file, I can run the following command:
cat filename | grep -o "<C99><F1>..." > file.txt
This exports the first 12 characters of every string starting with <C99><F1>.
<C99><F1>001
<C99><F1>056
<C99><F1>123
<C99><F1>445
.....
Once exported to a text file, I replace <C99><F1> with nothing and then sort and count the remaining values.
Thank you!
Using XMLStarlet:
$ xml sel -t -v '//C99/F1[. >= 50 and . <= 999]' -nl data.xml | wc -l
Not much of a hint there, sorry.
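As a further hint, the count can also be done with a small regex pass in Python; this is only a sketch, assuming the F1 value is always exactly three digits (as in the samples) and reusing the placeholder name filename:
import re

# Three-digit <F1> value sitting directly inside a <C99> element.
pattern = re.compile(rb"<C99><F1>(\d{3})</F1>")

with open("filename", "rb") as f:   # binary mode tolerates the non-XML noise
    data = f.read()

# Count C99 elements whose F1 value is between 050 and 999.
count = sum(1 for m in pattern.finditer(data)
            if 50 <= int(m.group(1)) <= 999)
print(count)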

simply pass a variable into a regex OR string search in awk

This is driving me nuts. Here's what I want to do, and I've made it as simple as possible:
This is written into an awk script:
#!/bin/bash/awk
# pass /^CHEM/, /^BIO/, /^ENG/ into someVariable and search file.txt
/someVariable/ {print NR, $0}
OR I would be fine with (but like less)
#!/bin/bash/awk
# pass "CHEM", "BIO", "ENG" into someVariable and search file.txt
$1=="someVariable" {print NR, $0}
I find all kinds of stuff on BASH/SHELL variables being passed but I don't want to learn BASH programming to simply pass a value to a variable.
Bonus: I actually have to search for 125 values in each document, with 40 documents needing to be evaluated. It can't hurt to ask a bit more, but how would I take a separate file of these 125 values and pass them individually to someVariable?
I have all sorts of ways to do this in BASH, but I don't understand them, and there has got to be a way to simply cycle through a set of search terms dynamically in awk (perhaps with an array, since I don't believe awk has lists).
Thank you as I am tired of beating my head into a wall.
I actually have to search 125 values in each document, with 40 documents needing to be evaluated.
Let's put the strings that we want to search for in file1:
$ cat file1
apple
banana
pear
Let's call the file that we want to search file2:
$ cat file2
ear of corn
apple blossom
peas in a pod
banana republic
pear tree
To search file2 for any of the words in file1, use:
$ awk 'FNR==NR{a[$1]=1;next;} ($1 in a){print FNR,$0;}' file1 file2
2 apple blossom
4 banana republic
5 pear tree
How it works
FNR==NR{a[$1]=1;next;}
This stores every word that we are looking for as a key in array a.
In more detail, NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file1. For every line in file1, we set a[$1] to 1.
next tells awk to skip the rest of the commands and start over with the next line.
($1 in a){print FNR,$0;}
If we get to this command, we are on file2.
If the first field is a key in array a, then we print the line number and the line.
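For comparison only, here is a minimal Python sketch of the same set-membership logic (same file1/file2 names as above), which may make the awk one-liner easier to follow:
# Collect the search words from file1, then print matching lines of file2
# with their line numbers -- the same idea as FNR==NR / ($1 in a) in awk.
with open("file1") as f1:
    words = {line.split()[0] for line in f1 if line.strip()}

with open("file2") as f2:
    for number, line in enumerate(f2, start=1):
        fields = line.split()
        if fields and fields[0] in words:
            print(number, line, end="")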
"...For example I wanted the text between two regexp from file2. Let's say /apple/, /pear/. How would I substitute and extract the text between those two regexp?..."
while read b e; do awk "/^$b$/,/^$e$/" <(seq 1 100); done << !
> 1 5
> 2 8
> 90 95
> !
1
2
3
4
5
2
3
4
5
6
7
8
90
91
92
93
94
95
Here the range input is between the two exclamation points, and as the data file I used the numbers 1..100. Notice the double quotes instead of single quotes around the awk script: they let the shell expand $b and $e before awk sees the range pattern.
If you have entered start end values in the file ranges, and your data in file data
while read b e; do awk "/^$b$/,/^$e$/" data; done < ranges
If you want to print the various ranges to different files, you can do something like this
while read b e; do awk "/^$b$/,/^$e$/ {print > $b$e}" data; done < ranges
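The same range logic as a minimal Python sketch, assuming the ranges and data files described above (one "begin end" pair per line of ranges, one value per line of data):
# For each "begin end" pair in ranges, print the lines of data between the line
# equal to begin and the line equal to end, inclusive -- the same behaviour as
# awk's /^begin$/,/^end$/ range pattern.
with open("ranges") as r:
    pairs = [line.split() for line in r if line.strip()]

for begin, end in pairs:
    printing = False
    with open("data") as d:
        for line in d:
            value = line.strip()
            if value == begin:
                printing = True
            if printing:
                print(value)
            if value == end:
                printing = False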
A slight variation that you may or may not like... I sometimes use the BEGIN section to read the contents of a file into an array...
BEGIN {
    # Read file1 into array a before the main input is processed.
    # Adjust the field number ($3 here) to wherever the search term sits in file1.
    count = 1
    while (("cat file1" | getline) > 0)
    {
        a[count] = $3
        count++
    }
    close("cat file1")
}
The rest continues in much the same way. Anyway, maybe that works for you as well.

Parse target file based on source file contents

I am trying to search for lines in FileB (which is comma separated) that contain content from lines in FileA. I originally tried using grep, but it does not seem to like some of the characters in FileA. I don't think the CSV formatting matters much, at least not to grep.
$ grep -f FileA FileB
grep: Unmatched [ or [^
I am open to using any generally available Linux command, Perl, or Python. There is no specific expression that can be matched, which is why I'm using the content of FileA to match on. Below are some example lines from FileA that we want to match in FileB.
page=--&id='`([{^~
page=&rows_select=%' and '%'='
l=admin&x=&id=&pagex=http://.../search/cache?ei=utf-&p=change&fr=mailc&u=http://sub.domain.com/cache.aspx?q=change&d=&mkt=en-us&setlang=en-us&w=afe,dbfcd&icp=&.intl=us&sit=dbajdy.alt
The lines in FileB that contain the above strings will also contain additional characters, i.e. the strings in the two files will not be a one-for-one match:
if FileA contains abc and FileB contains 012abc*(), then 012abc*() should be printed
A simple Python solution would be:
with open('filea', 'r') as fa:
    patterns = [p.rstrip('\n') for p in fa if p.strip()]

with open('fileb', 'r') as fb:
    for line in fb:
        if any(p in line for p in patterns):
            print(line, end='')
which stores the whole pattern file in memory and checks each line of the other file for any of the patterns as a substring.
But why wouldn't you just use diff? I'd have to look at the manpage, but I'm pretty sure there's a way to make it report the similarities between two files. After googling:
Using diff to find the portions of many files that are the same? (bizzaro-diff, or inverse-diff)
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff
they give that solution:
diff --unchanged-group-format='@@ %dn,%df
%<' --old-group-format='' --new-group-format='' \
--changed-group-format='' a.txt b.txt
Untested solution:
Logic:
Store each line from FileA as a key in the lines array
For each line of FileB:
Check whether any stored line appears as a part of the current line
If index(..) returns > 0:
Print that line from FileB
awk 'NR==FNR{lines[$0]++;next}{for (line in lines) {if (index($0,line)>0) {print $0}}}' FILEA FILEB
Use fgrep (or equivalently grep -F). That interprets the patterns (the contents of FileA) as literal strings to search for instead of regular expressions.

diff while ignoring patterns within a line, but not the entire line

I often need to compare two files while ignoring certain changes within those files. I don't want to ignore entire lines, just a portion of them. The most common case of this is timestamps on the lines, but there are a couple dozen other patterns that I need to ignore too.
File1:
[2012-01-02] Some random text foo
[2012-01-02] More output here
File2:
[1999-01-01] Some random text bar
[1999-01-01] More output here
In this example, I want to see the difference on line number 1, but not on line number 2.
Using diff's -I option will not work because it ignores the entire line. Ideal output:
--- file1 2013-04-05 13:39:46.000000000 -0500
+++ file2 2013-04-05 13:39:56.000000000 -0500
@@ -1,2 +1,2 @@
-[2012-01-02] Some random text foo
+[1999-01-01] Some random text bar
[2012-01-02] More output here
I can pre-process these files with sed:
sed -e's/^\[....-..-..\]//' < file1 > file1.tmp
sed -e's/^\[....-..-..\]//' < file2 > file2.tmp
diff -u file1.tmp file2.tmp
but then I need to put those temporary files somewhere, and remember to clean them up afterwards. Also, my diff output no longer refers to the original filenames, and no longer emits the original lines.
Is there a widely available variant of diff, or a similar tool, that can do this as a single command?
You can use process substitution to avoid file creation and cleanup; the syntax is as follows:
$ diff <(command with output) <(other command with output)
In your case:
diff <(sed -e 's/^\[....-..-..\]//' file1) <(sed -e 's/^\[....-..-..\]//' file2)
Hope this helps.
It isn't exactly what you are looking for since I'm not sure how to retain the dates, but this does solve a couple of your issues:
diff -u --label=file1 <(sed 's/^\[....-..-..\]//' file1) --label=file2 <(sed 's/^\[....-..-..\]//' file2)
Output:
--- file1
+++ file2
@@ -1,2 +1,2 @@
- Some random text foo
+ Some random text bar
More output here
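If a small script is acceptable instead of a diff variant, another hedged option is to compare normalized lines but print the originals, e.g. with Python's difflib; this is only a sketch, and the timestamp pattern and file names are assumptions taken from the example above:
import difflib
import re

STAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}\]")

def normalize(lines):
    # Strip the leading timestamp so it is ignored during the comparison.
    return [STAMP.sub("", line) for line in lines]

with open("file1") as f1, open("file2") as f2:
    a, b = f1.readlines(), f2.readlines()

# Compare the normalized lines, but print the original ones.
matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        for line in a[i1:i2]:
            print(" " + line, end="")
    else:
        for line in a[i1:i2]:
            print("-" + line, end="")
        for line in b[j1:j2]:
            print("+" + line, end="")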

AWK: Write lines into multiple files

I'm trying to extract sequences from a FASTA file using awk.
e.g. the file looks like this and it contains 703 sequences. I want to extract each of them to separate files.
>sequence_1
AACTTGGCCTT
>sequence_2
AACTTGGCCTT
.
.
.
I'm using this awk script:
awk '/>/ {OUT=substr($0,2) ".fasta"}; OUT {print > OUT}' file.fasta
...which works, but only for the first 16 sequences, and then I get an error saying:
.fasta makes too many open files
input record number 35, file file.fasta
source line number 1
You would need to close files when you're done. Try:
awk '/>/ {close(OUT); OUT=substr($0,2) ".fasta"}; OUT {print > OUT}' file.fasta
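The same split can also be done with a short Python sketch (a minimal version, assuming every header starts with ">" and that the text after ">" is safe to use as a file name):
out = None
with open("file.fasta") as fasta:
    for line in fasta:
        if line.startswith(">"):
            # Start a new output file named after the sequence header.
            if out:
                out.close()
            out = open(line[1:].strip() + ".fasta", "w")
        if out:
            out.write(line)
if out:
    out.close()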