How to remove both matching lines while removing duplicates - regex

I have a large text file called "main" containing a list of email addresses, and I have sent mail to some of them. I have a list of 'sent' emails. Now, I want to remove the 'sent' emails from the "main" list.
In other words, I want to remove both matching rows from the text file while removing duplicates. Example:
I have:
email#email.com
test#test.com
email#email.com
I want:
test#test.com
Is there an easy way to achieve this? Please suggest a tool or method, keeping in mind that the text file is larger than 10 MB.

In terminal:
cat test | sort | uniq -c | awk -F" " '{if($1==1) print $2}'
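To see why this works: uniq -c prefixes each line with its number of occurrences, and the awk filter keeps only the lines whose count is 1. With the sample data from the question:
$ sort test | uniq -c
      2 email#email.com
      1 test#test.com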

I use cygwin a lot for such tasks, as the unix command line is incredibly powerful.
Here's how to achieve what you want:
cat main.txt | sort -u | grep -Fvxf sent.txt
sort -u will remove duplicates (by sorting the main.txt file first), and grep will take care of removing the unwanted addresses.
Here's what the grep options mean:
-F plain text search
-v invert results
-x will force the whole line to match the pattern
-f read patterns from the specified file
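For example, with the data from the question (assuming sent.txt contains the one address that was already mailed):
$ printf '%s\n' email#email.com test#test.com email#email.com > main.txt
$ printf '%s\n' email#email.com > sent.txt
$ cat main.txt | sort -u | grep -Fvxf sent.txt
test#test.com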
Oh, and if your files are in the Windows format (CR LF line endings), you'll want to do this instead:
cat main.txt | dos2unix | sort -u | grep -Fvxf <(cat sent.txt | dos2unix)
Just like with the Windows command line, you can simply add:
> output.txt
at the end of the command line to redirect the output to a text file.

Related

How can I cut and count strings in a plain text document?

I have a large plain text document (apparently the output of a recursive grep -rn over a source tree). I tried:
cat textplain.txt | grep '^\.[\/[:alpha:]]*[\.\:][[:alpha:]]*'
I want the output result like below :
./external/selinux/libsepol/src/mls.c
./external/selinux/libsepol/src/handle.c
./external/selinux/libsepol/src/constraint.c
./external/selinux/libsepol/src/sidtab.c
./external/selinux/libsepol/src/nodes.c
./external/selinux/libsepol/src/conditiona.c
Question:
What should I do?
Just regenerate the file with
grep -lr des ./android/source/code
-l only lists the files with matches without showing their contents
-r is still needed to search subdirectories
-n has no influence on -l, so it can be omitted. -c instead of -l would add the number of occurrences to each file name, but then you'll probably want to pipe through grep -v :0 to skip the zeroes.
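A sketch of that counting variant (the match counts shown are made up for illustration):
$ grep -rc des ./android/source/code | grep -v :0
./external/selinux/libsepol/src/mls.c:4
./external/selinux/libsepol/src/handle.c:1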
Or, use cut and sort -u:
cut -d: -f1 textplain.txt | sort -u
-d: delimit columns by :
-f1 only output the first column
-u output unique lines
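For example, if textplain.txt holds grep -rn style lines (the text after each path is hypothetical):
$ cat textplain.txt
./external/selinux/libsepol/src/mls.c:77:foo des bar
./external/selinux/libsepol/src/mls.c:90:des again
./external/selinux/libsepol/src/handle.c:12:more des
$ cut -d: -f1 textplain.txt | sort -u
./external/selinux/libsepol/src/handle.c
./external/selinux/libsepol/src/mls.c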

Retrieve file name when regexp matches content

So I have a directory that includes a bunch of text files, and inside each file there is a line with the file's time stamp, in the format:
TimeStamp: mm/dd/yyyy
I am writing a script that takes in 3 inputs: month, date, and year, and I want to retrieve the name of the files that have the time stamps matched with the inputs.
I am using this line of code to match the files and output all the rows found to another file.
egrep 'TimeStamp: "$2"/"$3"/"$1"' inFile > outFile
However, I have not figured out a way to get the file names during the process.
Also, I believe there is a quick and simple way to do this with awk, but I am new to awk, so I am still struggling with it.
Note:
I'm assuming you want to BOTH capture matching lines AND, SEPARATELY, the names (paths) of the files that had matches (therefore, using just egrep -l is not enough).
Based on your question, I've changed 'TimeStamp: "$2"/"$3"/"$1"' to "TimeStamp: $2/$3/$1", because the former would treat $2, ... as literals (would not expand them), due to being enclosed in a single-quoted string.
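A quick illustration of the quoting difference (the positional parameters here are hypothetical: $1 = year, $2 = month, $3 = day):
$ set -- 2016 05 31
$ echo 'TimeStamp: "$2"/"$3"/"$1"'
TimeStamp: "$2"/"$3"/"$1"
$ echo "TimeStamp: $2/$3/$1"
TimeStamp: 05/31/2016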
If you already have a single filename to pass to egrep, you can use && to conditionally output that filename if that file contained matches (in addition to capturing the matches in a file).
egrep "TimeStamp: $2/$3/$1" inFile > outFile && printf '%s\n' inFile
When processing an entire directory, the simple and POSIX-compliant - but inefficient - approach is to process files in a loop:
for f in *; do
[ -f "$f" ] || continue # skip non-files or break, if dir. is empty
egrep "TimeStamp: $2/$3/$1" "$f" >> outFile && printf '%s\n' "$f"
done
If you use bash and GNU grep or BSD grep (also used on OSX), there's a more efficient solution:
egrep -sH "TimeStamp: $2/$3/$1" * |
tee >(cut -d: -f1 | sort -u > outFilenames) |
cut -d: -f2- > outFile
Since * potentially also matches directories, -s suppresses the error messages stemming from the (invariably failing) attempts to process them as files.
-H ensures that each matching line is prefixed with the input filename followed by :
tee >(...) ... sends input to both stdout and the command inside >(...).
cut -d: -f1 | sort -u extracts the matching filenames from the result lines, creates a sorted list without duplicates from them, and sends them to file outFilenames.
cut -d: -f2- then extracts the matching lines (stripped of their filename prefix) and captures them in file outFile.
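To illustrate the two cut calls on a hypothetical result line: field 1 is everything before the first colon (the file name), and -f2- re-joins all remaining fields, so colons inside the matched line survive:
$ printf 'a.txt:TimeStamp: 05/31/2016\n' | cut -d: -f1
a.txt
$ printf 'a.txt:TimeStamp: 05/31/2016\n' | cut -d: -f2-
TimeStamp: 05/31/2016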
Alternatively, use grep -l. From the manual:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is specified by POSIX.)

find the first match of a regex in a file, and print it

I have a collection of words on one side, and a file on the other side. I need their intersection, i.e. the words that appear at least once in the file.
I am able to extract the matching lines with
sed -rn 's/(word1|word2|blablabla|wordn)/\1/p' myfile.txt
but I cannot get any further.
Thank you for helping, Olivier
Perhaps grep may work here?
grep -o -E 'word1|word2|word3' file.txt | sort -u
You can do it using grep and sort:
grep -o 'word1\|word2\|word3' myfile.txt | sort -u
The -o switch makes grep output only the matching string, not the complete line. sort -u sorts the matching words and removes duplicates.
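For example, with a made-up myfile.txt:
$ printf '%s\n' 'word1 here' 'nothing relevant' 'word2 and word1' > myfile.txt
$ grep -o 'word1\|word2' myfile.txt | sort -u
word1
word2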
If I got you right, you just need to pipe the sed results to uniq:
sed -rn 's/.*(word1|word2|blablabla|wordn).*/\1/p' myfile.txt | uniq
Also, you need to match the whole line in sed in order to get just the desired words as output; that's why I've placed .* in front of and at the end of the pattern. Note that uniq only collapses adjacent duplicates, so if identical words can appear far apart, pipe through sort -u instead.
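A quick demonstration of why the .* anchors matter (the input line is made up):
$ printf 'foo word1 bar\n' | sed -rn 's/(word1|word2)/\1/p'
foo word1 bar
$ printf 'foo word1 bar\n' | sed -rn 's/.*(word1|word2).*/\1/p'
word1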

Grep: How do I print both the matching line and the pattern (using a file of patterns)?

1. I am using cygwin.
2. I am grepping across multiple files in a directory.
3. I am pulling regex patterns from a file.
4. I am writing the results to a file.
5. I would like each result line to also contain the matching pattern(s).
Currently, the command I'm using achieves 1-4 from above.
grep -E -i -f c:\patterns\patterns.txt c:\dir\*.csv > c:\results\results.csv
I know that if I add the -o parameter it will just give me the matching pattern instead, and from there I could match up the line numbers of the -o output with those of the output without -o. But -o seems to take so much longer.
The pattern file itself is in excess of 5,000 lines. The files that I'm searching exceed a million lines.
Sample Input: https://www.dropbox.com/s/axltx3wcj9ina32/SampleInputFiles.zip
Sample of Desired Output: https://www.dropbox.com/s/ko3dz4hzhnqg8pm/output.csv.zip
How do I get the data I need?
Thanks,
Chad
I did it with two greps and a join based on the matching line numbers:
grep -onf /cygdrive/c/TEMP/exp_plain.txt filelist.txt | sort -u > 1.txt
grep -nf /cygdrive/c/TEMP/exp_plain.txt filelist.txt | sort -u > 2.txt
join -t ":" 1.txt 2.txt
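To illustrate the intermediate format (file names and contents here are made up): if the patterns file matches foo on line 1 and bar on line 2, then 1.txt holds lineNumber:match pairs, 2.txt holds lineNumber:fullLine pairs, and join merges them on the line number:
$ cat 1.txt
1:foo
2:bar
$ cat 2.txt
1:a foo here
2:bar there
$ join -t ":" 1.txt 2.txt
1:foo:a foo here
2:bar:bar there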

Replacing text and duplicates

I have a log file with lines filled with things like this:
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
I want to extract only the username portion, and then remove duplicates, so that I am only left with this:
biaxib
hihi
hoho
ihatespam
The ruleset is:
Extract the text between "/home/Users/" and "/....." at the end
Remove duplicate lines after the above rule is applied
Do this inside Linux
Can someone help me with how to create such a script, or statement to do this?
Assuming that the username always appears as the 4th component of the path (which is cut field 5, because the leading / puts an empty first field before home):
$ cat test.txt
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.
$ cat test.txt | cut -d/ -f 5 | sort | uniq
biaxib
hihi
hoho
ihatespam
cat /path/to/your/log/file.txt | python3 -c '
import sys
for line in sys.stdin:
    # username is the 4th path component: index 4, since
    # splitting "/home/..." on "/" yields an empty element first
    print(line.split("/")[4])
' | sort | uniq
More conciseness is probably achievable in perl or with other built-in tools (see the other answer), but I personally shy away from the standard Linux text-manipulation tools (edit: cut is a useful one, though).
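For reference, a perl sketch of the same field logic ($F[4] is the username because the leading slash produces an empty field 0):
perl -F/ -lane 'print $F[4]' file.txt | sort -u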