I have two files
crackedHashes.txt formatted as Hash:Password
C3B9FE4E0751FC204C29183910DB9EB4:fretful
CA022093C4BAFA397FAC5FB2E407FCA9:remarkable
36E13152AA93A7631608CD9DD753BD2A:please
hashList.txt formatted as Username:Hash
Frank:C3B9FE4E0751FC204C29183910DB9EB4
Jane:A67BC194586C11FD2F6672DE631A28E0
Lisa:CA022093C4BAFA397FAC5FB2E407FCA9
John:36E13152AA93A7631608CD9DD753BD2A
Dave:6606866DB8B0232B371C2C4C35B37D01
I want a new file that combines the two lists by matching hash.
output.txt
Frank:C3B9FE4E0751FC204C29183910DB9EB4:fretful
Lisa:CA022093C4BAFA397FAC5FB2E407FCA9:remarkable
John:36E13152AA93A7631608CD9DD753BD2A:please
I've been scouring the forums here and can only find things returning one string or not using regex (matching the whole line). I tried to do it in parts: first I broke up crackedHashes.txt with sed 's/:.*//' crackedHashes.txt, was going to do the same for the other file, and then compare by basically writing a bunch of outfiles and comparing those. I also tried variations of grep -f crackedHashes.txt hashList.txt > outfile.txt, but that yielded many more "results" than it was supposed to.
I could manually do grep <hash> hashList.txt, but when it comes to doing this across whole files, line by line, I'm a bit lost.
With GNU join, bash and GNU sort:
join -1 1 -2 2 -t : <(sort crackedHashes.txt) <(sort -t : -k 2 hashList.txt) -o 2.1,1.1,1.2
Output:
John:36E13152AA93A7631608CD9DD753BD2A:please
Frank:C3B9FE4E0751FC204C29183910DB9EB4:fretful
Lisa:CA022093C4BAFA397FAC5FB2E407FCA9:remarkable
See: man join
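If you'd rather keep the original order of hashList.txt and skip the sorting, a rough awk equivalent (just a sketch, assuming the Hash:Password and Username:Hash layouts shown above) would be:
awk -F: 'NR==FNR {pw[$1]=$2; next} $2 in pw {print $1 FS $2 FS pw[$2]}' crackedHashes.txt hashList.txt > output.txt
This prints the matches in the order they appear in hashList.txt instead of the sorted order join produces.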
There is a string located within a file that starts with 4bceb and is 32 characters long.
To find it I tried the following
Input:
find / -type f 2>/dev/null | xargs grep "4bceb\w{27}" 2>/dev/null
After entering the command, it seems like it is waiting for some additional input.
Your command seems alright in principle, i.e. it should correctly execute the grep command for each file find returns. However, I don't believe your regular expression (or rather, the way you call grep) is correct for what you want to achieve.
First, in order to get your expression to work, you need to tell grep that you are using Perl syntax by specifying the -P flag.
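For example, the original pipeline with just that flag added would be:
find / -type f 2>/dev/null | xargs grep -P "4bceb\w{27}" 2>/dev/null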
Second, your regexp will make grep return the full lines that contain sequences starting with "4bceb" that are at least 32 characters long, but may be longer as well. If, for example, your ./test.txt file contents were
4bcebUUUUUUUUUUUUUUUUUUUUUUUU31
4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
4bcebWWWWWWWWWWWWWWWWWWWWWWWWWW33
sometext4bcebYYYYYYYYYYYYYYYYYYYYYYYYY32somemoretext
othertext 4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32 evenmoretext
your output would include all lines except the first one (in which the sequence is shorter than 32 characters). If you actually want to limit your results to lines containing sequences that are exactly 32 characters long, you can use the -w (--word-regexp) flag with grep, which requires the match to be a whole word and would only return lines 2 and 5 in the above example.
Third, if you only want the match but not the surrounding text, the grep flag -o will do exactly this.
And finally, you don't need to pipe the find output into xargs, as grep can directly do what you want:
grep -rnPow / -e "4bceb\w{27}"
will recursively (-r) scan all files starting from / and return just the ones that contain matching words, along with the matches themselves (as well as the line numbers they were found in, as a result of the -n flag):
./test.txt:2:4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
./test.txt:5:4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32
If possible, I'm looking for a bash one liner that concatenates all the files in a folder that are labelled motif<number>.motif into an output.txt file.
I have a few issues I'm struggling with.
A: The <number> contained in the filename can be one or two digits long and I don't know how to use regex (or something similar) to get all the files with either one or two digits. I can get the filenames containing single or double digit numbers separately using:
motif[0-9].motif or motif[0-9][0-9].motif
but can't work out how to get all the files listed together.
B: The second issue I have is I don't know how many files will be in the directory in advance, so I can't just use a range of numbers to select the files. This command is in the middle of a long pipeline.
So let's say I have 20 files:
motif1.motif
motif2.motif
...
motif19.motif
motif20.motif
I'd need to cat >> the contents of all of them into output.txt.
You can do:
cat motif{[0-9],[0-9][0-9]}.motif > output
or with extglob:
shopt -s extglob nullglob
cat motif[0-9]?([0-9]).motif > output
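Note that the extglob pattern expands in plain lexical order, so motif10.motif comes before motif2.motif, while the brace-expansion form already gives numeric order for one- and two-digit numbers. If you need strictly numeric order regardless, one possible sketch (it reuses the extglob setting above and assumes GNU sort for -V and filenames without whitespace) is:
printf '%s\n' motif+([0-9]).motif | sort -V | xargs cat > output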
I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them - normally to the screen, but the pipe | takes the output of the command on its left and feeds it to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which is told to split after 1000000 lines by the switch -l1000000. The - (with spaces around it) tells it to read its input not from a file but from "standard input" - the output of sort -u in this case.
Written like this, the command produces files named outfile_aa, outfile_ab and so on; the last word, outfile_, is the prefix for those output files and you can change it to whatever you like.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match, but with -v it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing but zero or more whitespace characters (like spaces or tabs).
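For instance, with GNU grep,
printf 'a\n\n   \nb\n' | grep -v '^\s*$'
prints only the lines a and b.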
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh) and run it with
./combine.sh
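If the shell complains that permission is denied, make the script executable first:
chmod +x combine.sh
Alternatively, sh combine.sh runs it without changing permissions.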
I am trying to write a bash script to automate check-in of changed files after comparing a local folder to a remote folder.
To achieve this I am trying to extract the filename together with a portion of the remote folder's path, to be used in the check-in commands. I am seeking assistance on extracting the filename with its path.
To do the comparison I used the diff command as follows:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/
The above command prints output in the following format:
Files /home/xxxx/myprojects/company/apps/product/package/test/fileName.java and /productdev/product/product121/java/package/test/filename.java differ
I want to extract the file name between 'and' and 'differ', so I used a lookaround regular expression in a grep command:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP '(?<=and) .*(?=differ)'
which gave me:
/productdev/product/product121/java/package/test/filename.java
I would like to display the path starting from java to the end of the text, as in java/package/test/filename.java. How can I do that?
You could try the below grep command:
grep -oP 'and.*\/\Kjava.*?(?= differ)'
That is,
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP 'and.*\/\Kjava.*?(?=\s*differ)'
The way I see it, you will be getting all the files compared in both folders, so you will end up with several lines like the one you mentioned.
So the first step would be grep-ing only the lines with "differ" in them (in case that command prints any other kind of line too).
You can ignore the above step if I am wrong and didn't understand it right.
The next step is extracting both paths. For that you can use this:
awk '{print $2,$4}'
This will print only the 2nd and 4th fields, i.e. both paths.
awk splits fields on runs of whitespace by default, so the amount of spacing between them doesn't matter.
Another simple way of doing it is:
cut -d" " -f 2,4
This will also do the same.
Here with "-d" flag we are specifying a delimiter to separate strings and with "-f" flag we are specifying the number of field places to pick from(so 2nd and 4th field).
Once you have these paths you can always store them in two variables and cut or awk out whatever parts of them you want.
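For example (a sketch that assumes the remote path contains a single java/ directory, as in your sample output), you could pipe the diff output through
awk '{print $4}' | sed 's|.*/java/|java/|'
which would print java/package/test/filename.java for the sample line above.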
I'm trying to output matching lines in 2 files using AWK. I made it easier by making 2 files with just one column each; they're phone numbers. I found many people asking the same question and getting the answer to use:
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of AWK and know that the above script should work, yet I'm unable to figure out why it's not.
Is there any other way I can achieve the same result?
GREP is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I ran some spot checks to find out whether there are matches: I took random numbers from the smaller file and grep'd them through the big one, and I did find matches, so I'm sure there are some.
Any help is appreciated!
[edit as requested by @Jaypal]
Sample code from both files :
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx@xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx@xxx:~$
Update
The problem happens to be with the kind of file you were using. Apparently it came from a DOS system and has \r (carriage return) characters at the end of its lines. To fix that, "sanitize" the files with:
dos2unix file1 file2
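If dos2unix isn't installed, stripping the carriage returns with tr works as well for whichever file carries them (a minimal alternative that writes a cleaned copy instead of converting in place):
tr -d '\r' < file2 > file2.clean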
Former answer
Your awk command is fine. However, you can also compare files with grep -f:
grep -f file1 file2
This takes every line of file1 as a pattern and prints the lines of file2 that match any of them.
You can add options to make the matching stricter:
grep -wFf file1 file2
-w matches whole words only
-F treats the patterns as fixed strings (no regex).
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are
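Combining both, grep -wFf b a prints the same single line here, since these patterns contain no regex metacharacters; -F mainly matters when patterns contain characters like . or *, and it is usually faster for large pattern lists such as a file full of phone numbers.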