How can I cut and gather statistics on strings in a plain-text document? - regex

I have a large plain-text document (please refer to the attached picture for its content). This is what I have tried so far:
cat textplain.txt | grep '^\.[\/[:alpha:]]*[\.\:][[:alpha:]]*'
I want the output to look like this:
./external/selinux/libsepol/src/mls.c
./external/selinux/libsepol/src/handle.c
./external/selinux/libsepol/src/constraint.c
./external/selinux/libsepol/src/sidtab.c
./external/selinux/libsepol/src/nodes.c
./external/selinux/libsepol/src/conditional.c
Question:
What should I do?

Just regenerate the file with
grep -lr des ./android/source/code
-l only lists the files with matches, without showing their contents
-r is still needed to search subdirectories
-n has no influence on -l, so it can be omitted. -c instead of -l would add the number of matching lines to each file name, but you'll probably want to pipe through grep -v :0 to skip the zeroes.
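For example, the count-per-file variant just described might look like this (a sketch; the pattern and path are taken from the command above, and the anchored ':0$' avoids matching a colon-zero elsewhere in a file name):
grep -rc des ./android/source/code | grep -v ':0$'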
Or, use cut and sort -u:
cut -d: -f1 textplain.txt | sort -u
-d: delimit columns by :
-f1 only output the first column
-u output unique lines
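As an illustration, suppose textplain.txt contains grep-style file:line:text output such as these hypothetical lines:
./external/selinux/libsepol/src/mls.c:42:des_key
./external/selinux/libsepol/src/handle.c:7:des_init
./external/selinux/libsepol/src/mls.c:99:des_free
Then the pipeline prints each path exactly once:
cut -d: -f1 textplain.txt | sort -u
./external/selinux/libsepol/src/handle.c
./external/selinux/libsepol/src/mls.c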

Related

Linux Script + Regex

I'm trying to write a script that takes a string, then reads the file test and prints the names that start with that pattern (ignoring case) to another file. The test file contains this data:
u001:x:Laith_Budairi
u002:x:laith_buda
u003:x:bara_adnan
u004:x:Basim_khadir
u005:x:bilal_jarrar
This is what I tried to do:
echo type a pattern to find
read s
cat test | cut -d: -f3 | grep -i '^$s' > printfile
but it doesn't print anything to the file. What do I do?
Single quotes ('...') prevent variable expansion. Simply changing '^$s' to "^$s" would work, but there are other things to improve:
You don't need to cat, you can send cut a filename:
cut -d: -f3 test | grep ...
You do not really need the cut, you could grep immediately:
grep -i "^u...:x:$s" test > printfile
Though if you only want the name part, then the cut approach is fine.
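Putting it together, a minimal corrected version of the script (a sketch, assuming the input file is named test as above):
#!/bin/bash
echo "type a pattern to find"
read s
# double quotes let $s expand; -i ignores case
cut -d: -f3 test | grep -i "^$s" > printfile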

Retrieve file name when regexp matches content

So I have a directory which includes a bunch of text files, and inside each file there is a line that has the file's time stamp which has the format:
TimeStamp: mm/dd/yyyy
I am writing a script that takes in 3 inputs: month, date, and year, and I want to retrieve the name of the files that have the time stamps matched with the inputs.
I am using this line of code to match the files and output all the rows found to another file.
egrep 'TimeStamp: "$2"/"$3"/"$1"' inFile > outFile
However, I have not figured out a way to get the files names during the process.
Also, I believe there is a quick and simple way to do this with awk, but I am new to awk, so I am still struggling with it.
Note:
I'm assuming you want to BOTH capture matching lines AND, SEPARATELY, the names (paths) of the files that had matches (therefore, using just egrep -l is not enough).
Based on your question, I've changed 'TimeStamp: "$2"/"$3"/"$1"' to "TimeStamp: $2/$3/$1", because the former would treat $2, ... as literals (would not expand them), due to being enclosed in a single-quoted string.
If you already have a single filename to pass to egrep, you can use && to conditionally output that filename if that file contained matches (in addition to capturing the matches in a file).
egrep "TimeStamp: $2/$3/$1" inFile > outFile && printf '%s\n' inFile
When processing an entire directory, the simple and POSIX-compliant - but inefficient - approach is to process files in a loop:
for f in *; do
[ -f "$f" ] || continue # skip non-files or break, if dir. is empty
egrep "TimeStamp: $2/$3/$1" "$f" >> outFile && printf '%s\n' "$f"
done
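For example, if that loop were saved as findstamps.sh (a hypothetical name) and called with year, month, and day as its three arguments:
./findstamps.sh 2014 06 25   # matches lines like "TimeStamp: 06/25/2014"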
If you use bash and GNU grep or BSD grep (also used on OSX), there's a more efficient solution:
egrep -sH "TimeStamp: $2/$3/$1" * |
tee >(cut -d: -f1 | sort -u > outFilenames) |
cut -d: -f2- > outFile
Since * potentially also matches directories, -s suppresses the error messages stemming from the (invariably failing) attempts to process them as files.
-H ensures that each matching line is prefixed with the input filename followed by :
tee >(...) ... sends input to both stdout and the command inside >(...).
cut -d: -f1 | sort -u extracts the matching filenames from the result lines, creates a sorted list without duplicates from them, and sends them to file outFilenames.
cut -d: -f2- then extracts the matching lines (stripped of their filename prefix) and captures them in file outFile.
grep -l
Explanation
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which
output would normally have been printed. The scanning will stop on the first
match. (-l is specified by POSIX.)
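So if you only need the filenames, not the matching lines themselves, a single call suffices (a sketch using the same positional parameters as above; -s again silences directory errors):
egrep -sl "TimeStamp: $2/$3/$1" * > outFilenames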

How to remove both matching lines while removing duplicates

I have a large text file containing a list of emails called "main", and I have sent mails to some of them. I have a list of 'sent' emails. Now, I want to remove the 'sent' emails from the list "main".
In other words, I want to remove both matching lines from the text file while also removing duplicates. Example:
I have:
email@email.com
test@test.com
email@email.com
I want:
test@test.com
Is there an easy way to achieve this? Please suggest a tool or method, but keep in mind that the text file is larger than 10 MB.
In terminal:
sort test | uniq -c | awk '$1 == 1 {print $2}'
I use cygwin a lot for such tasks, as the unix command line is incredibly powerful.
Here's how to achieve what you want:
cat main.txt | sort -u | grep -Fvxf sent.txt
sort -u will remove duplicates (by sorting the main.txt file first), and grep will take care of removing the unwanted addresses.
Here's what the grep options mean:
-F plain text search
-v invert results
-x will force the whole line to match the pattern
-f read patterns from the specified file
Oh, and if your files are in the Windows format (CR LF line endings), you'll have to do this instead:
cat main.txt | dos2unix | sort -u | grep -Fvxf <(cat sent.txt | dos2unix)
Just like with the Windows command line, you can simply add:
> output.txt
at the end of the command line to redirect the output to a text file.
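A quick worked example with hypothetical contents: if main.txt holds
a@a.com
b@b.com
a@a.com
and sent.txt holds
b@b.com
then
sort -u main.txt | grep -Fvxf sent.txt > output.txt
leaves output.txt containing only a@a.com.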

Regex grep file contents and invoke command

I have a file that has been generated containing MD5 info along with filenames. I want to remove those files from the directory they are in. I'm not sure how to go about doing this exactly.
filelist (file) contains:
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
My command (which I would like to combine with rm) looks like this:
grep -o "\((.*)\)" filelist
returns this:
(dupe)
(somefile)
Almost good, although the parentheses need to be eliminated (not sure how). I tried grep -Po "(?<=\().*(?=\))" filelist using a lookbehind/lookahead, but the command didn't work.
The next thing I would like to do is take the output filenames and delete them from the directory they are in. I'm not sure how to script it, but it would essentially do:
<returned results from grep>
rm dupe $target
rm somefile $target
If I understand correctly, you want to take lines like these
MD5 (dupe) = 1fb218dfef4c39b4c8fe740f882f351a
MD5 (somefile) = a5c6df9fad5dc4299f6e34e641396d38
extract the second column without the parentheses to get the filenames
dupe
somefile
and then delete the files?
Assuming the filenames don't have spaces, try this:
# this is where your duplicate files are.
dupe_directory='/some/path'
# Check that you found the right files:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} ls -l "$dupe_directory/{}"
# Looks ok, delete:
awk '{print $2}' file-with-md5-lines.txt | tr -d '()' | xargs -I{} rm -v "$dupe_directory/{}"
xargs -I{} means to replace the argument (dupe filename) with {} so it can be used in a more complex command.
The tool you're looking for is xargs: http://unixhelp.ed.ac.uk/CGI/man-cgi?xargs
It's pretty standard on *nix systems.
UPDATE: Given that target equals the directory where the files live...
I believe the syntax would look something like:
yourgrepcmd | xargs -I{} rm "$target/{}"
The -I creates a placeholder string, and each line from your grep command gets inserted there.
UPDATE:
The step you need to remove the parens is a little use of sed's substitution command (http://unixhelp.ed.ac.uk/CGI/man-cgi?sed)
Something like this:
cat filelist | sed "s/MD5 (\([^)]*\)) .*$/\1/" | xargs -I{} rm "$target/{}"
The moral of the story here is: if you learn to use sed and xargs (or awk, if you want something a little more advanced), you'll be a more capable Linux user.
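Incidentally, the lookaround approach from the question does work if your grep is GNU grep built with PCRE support (a sketch, reusing the xargs pattern above; [^)]* is used instead of .* so the match stops at the first closing parenthesis):
grep -oP '(?<=\()[^)]*(?=\))' filelist | xargs -I{} rm -v "$target/{}"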

How do I select files 1-88

I have files in a directory named OIS001_OD_EYE_MIC.png through OIS176_OD_EYE_MIC.png.
Extracting numbers 1-99 is easy, as shown by the regex below.
I want 1-88 to divide the directory in half.
Why? So I can have two even sets of files to compress
ls | sed -n '/^[A-Z]*0[0-9][0-9].*EYE_MIC.png/p'
Here is my attempt at getting 0-99. Can you help me get 1-88, and perhaps 89-176?
You can use a brace-expansion range, {start..end}, like this:
echo OIS{001..088}_OD_EYE_MIC.png
will expand to
OIS001_OD_EYE_MIC.png OIS002_OD_EYE_MIC.png [...] OIS087_OD_EYE_MIC.png OIS088_OD_EYE_MIC.png
Look for "Brace Expansion" in bash's man page.
With a new-enough bash:
ls OIS0{01..88}_OD_EYE_MIC.png
With regexes you have to think about what the strings in a given number range look like (you can't match a numeric range directly). For 1-88:
/^[A-Z]*(00[1-9]|0[1-7][0-9]|08[0-8]).*EYE_MIC.png/
For 89-176:
/^[A-Z]*(089|09[0-9]|1[0-6][0-9]|17[0-6]).*EYE_MIC.png/
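Applied the same way as the ls | sed attempt in the question, the first range could be used like this (a sketch; grep -E enables the alternation syntax):
ls | grep -E '^[A-Z]*(00[1-9]|0[1-7][0-9]|08[0-8]).*EYE_MIC\.png'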
Here are some more examples.
Here's a piped parallel alternative:
ls -v | columns --by-columns -c2 | tr -s ' ' \
| tee >(cut -d' ' -f1 | tar cf part1.tar -T -) \
>(cut -d' ' -f2 | tar cf part2.tar -T -) > /dev/null
This method needs more work if the files have whitespace in their names.
The idea is to columnate the file-list and use tee to multiplex it into separate compression processes.
The columns program comes with the autogen package (at least in Debian).
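Alternatively, if the names really do run from OIS001 through OIS176 with no gaps, the brace expansions shown earlier can split the set directly (a sketch; the zero-padded ranges need bash 4+):
tar cf part1.tar OIS{001..088}_OD_EYE_MIC.png
tar cf part2.tar OIS{089..176}_OD_EYE_MIC.png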