Scanning folder to find all files with minimum matching words (UNIX) - regex

Having some trouble figuring out the command line for the following issue and hoping you can help!
Basically, I have a folder which contains ~1000 PDFs. I need to search through every PDF and return the file names of the PDFs that match certain words X amount of times.
For example, I have 10 PDFs which all contain the word "Fragile". I would like to return a list of all files that contain "Fragile" a minimum of 3 times throughout the PDF.
I am currently using pdfgrep and giving it a regex to look for, but it returns all the files that match at least once. I have seen a few recommendations out there for piping the command into awk, but I'm not sure what that really does...

Don't know much about pdfgrep, but if the output is like the examples on https://pdfgrep.org/ it should be fairly easy to count the lines in the output, doing something like:
for f in *.pdf; do if [ $(pdfgrep -nHm 10 "Fragile" "$f" | wc -l) -gt 2 ]; then echo "$f"; fi; done
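As for the awk suggestions you have seen: awk can filter another command's output by a numeric condition, which avoids the explicit shell loop. A minimal sketch, under the assumptions that your pdfgrep build supports -c (a per-file match count, as in GNU grep), that it prints file.pdf:count when given several files, and that no file name contains a colon:
# "file.pdf:4" lines come out of pdfgrep; awk keeps the names whose count is at least 3
pdfgrep -c "Fragile" *.pdf | awk -F: '$2 >= 3 { print $1 }'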

Related

Finding files using regular expressions/wildcards

Within a particular directory, I have a series of files that are labelled sequentially: image0000.png, image0001.png, image0002.png, etc. They are labelled by number, but I don't necessarily know how many preceding zeroes there are in the filename, i.e. whether it will be image0001.png or image00001.png.
Within a bash script, I wish to find a single file at a time (over a for loop) and then apply some processing to it. This search could start at zero and end before the last file, or could use varying steps. To expand, I could want to find image0000.png, image0001.png, image0002.png and so forth, or I could start at image0010.png and find every other file, i.e. the next two would be image0012.png and image0014.png.
To try and find the first file (image0000.png), I've tried using find and ls, with the following outputs:
$ find video/figs/ -name 'image*[0]0.png'
video/figs/image00100.png
video/figs/image00000.png
$ ls -l video/figs/image*[0]0.png
-rw-r--r-- 1 user machine 165K Feb 19 09:06 video/figs/image00000.png
-rw-r--r-- 1 user machine 207K Feb 19 09:06 video/figs/image00100.png
Similar results occur for finding the second (i.e., find video/figs/ -name 'image*[0]1.png' finds image00101.png and image00001.png). So it's finding the file I want (image00001.png), but is also finding one that I don't (image00101.png). Can anyone help me understand why, and fix it?
I would use ls and grep for that:
ls | grep -oP '0*[1-9]+\.png'
Example:
$:/tmp/test$ ls
00001.png 00002.png 00010.png 00013.png 00201.png
$:/tmp/test$ ls | grep -oP '0*[1-9]+\.png'
00001.png
00002.png
00013.png
01.png
I suspect you don't want to dive into subdirectories and collect files, sorted by number, spread over subdirs.
So find isn't necessary.
ls image*{08..10}.png
image00010.png image0008.png image0009.png image0010.png image008.png image009.png
Part 2 of your question, only find every other file:
ls image*{08..10..2}.png
image00010.png image0008.png image0010.png image008.png
Maybe you know for loops. It works like
for (i in 8 to 10 by 2)
or
for (int i=8; i <= 10; i+=2)
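In bash itself that pseudocode can be written as a C-style loop; a minimal sketch (the two-digit zero padding in printf is an assumption, adjust the width to your file names):
for ((i=8; i<=10; i+=2)); do
    printf -v n '%02d' "$i"    # zero-pad the number, e.g. 8 -> 08
    ls image*"$n".png
done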
Restricting the search to find image00010.png but not imageAB010.png wouldn't work.
The reason to exclude 101 is still unclear. Maybe it's only a sorting thing.
With directories other than the PWD, there is no big difference:
ls video/figs/image*{08..10..2}.png
Note that instead of ls you can just use the program you actually want to run on the files, provided it can handle more than one file at a time, as ls does.
Sincere thanks to everyone who contributed an answer - perhaps I explained it poorly, or I was too wedded to the code I'd already written to use any of the provided answers. However, I've found the following solutions:
1) Why did I find more files than I expected?
find video/figs/ -name 'image*[0]0.png' uses shell wildcards, which are very limited, and thus the above was interpreted as finding a file with a name of the form image<wildcard>00.png. There is no way, using the -name option, to restrict the * so that it matches only a given character (in this case, zero or more occurrences of 0).
2) How do I find the image files with an unknown number of padding zeroes?
The following is an MWE from my final code. It demonstrates how to search within a given directory SEARCH_DIR (not necessarily including subdirectories; I haven't checked).
f1=0   # starting number
f2=10  # end number
df=2   # number to skip between images
for ((f=$f1; f<=$f2; f=$f+$df)); do
    export iFile=$(find "$SEARCH_DIR" -regex '.*/image0*'$f'.png')
done
The export ensures the variable is available to sub-processes, and the iFile=$(...) syntax captures the output of the command in the variable iFile. The bit within the parentheses is the bit I was looking for: find "$SEARCH_DIR" -regex '.*/image0*'$f'.png'
a) find $SEARCH_DIR specifies the location for the search
b) -regex tells find to match with regular expressions, which are more powerful than shell wildcards and allow me to limit the wildcards as required
c) '.*/image0*'$f'.png': find's regular expression has to match the whole path, which is why I need the initial .*/ part. The 0* now performs as I originally wanted: the * wildcard matches zero or more occurrences of the preceding term, which here is 0 (so if I wanted zero or more matches of any digit, I would use [0-9]*). The $f term picks up the numbered file in the for loop.
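To illustrate the difference for, say, f=10 (a sketch using the paths from the question; the extra file name is hypothetical):
find video/figs/ -name 'image*10.png'         # glob: * matches anything, so image00110.png would match too
find video/figs/ -regex '.*/image0*10\.png'   # regex: 0* matches only zeroes, so only zero-padded names match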

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and writes them out, normally to the screen; but the pipe | takes the output of the command on its left and feeds it to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which is told by the switch -l1000000 to split after 1000000 lines. The - (with spaces around it) tells it to read its input not from a file but from "standard input"; the output of sort -u in this case. The last word, outfile_, can be changed by you, if you want.
Written like it is, this will result in files like outfile_aa, outfile_ab and so on - you can modify this with the last word in this command.
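To check the result afterwards, counting the lines of each output file may be useful (this assumes the default outfile_ prefix used above):
wc -l outfile_*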
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match, but now it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing but zero or more whitespace characters (like spaces or tabs).
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh) and run it with
./combine.sh
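If the input files or the output prefix change from run to run, a slightly more general variant of the script could take them as arguments; this is just a sketch, not tested on your data:
#!/bin/sh
# usage: ./combine.sh PREFIX FILE...
prefix=$1
shift
cat "$@" | grep -v '^\s*$' | sort -u | split -l1000000 - "$prefix"
Called as ./combine.sh outfile_ file1 file2 file3 it behaves like the commands above, empty-line removal included.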

Grep pattern match between very large files is way too slow

I've spent way too much time on this and am looking for suggestions. I have two very large files (FASTQ files from an Illumina sequencing run, for those interested). What I need to do is match a pattern common to both files and print that line plus the 3 lines below it into two separate files, without duplications (which exist in the original files). Grep does this just fine, but the files are ~18GB and matching between them is ridiculously slow. An example of what I need to do is below.
FileA:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT
+DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
__\^c^ac]ZeaWdPb_e`KbagdefbZb[cebSZIY^cRaacea^[a`c
You can see 3 unique headers starting with #, each followed by 3 additional lines
FileB:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
There are 4 headers here but only 2 are unique as one of them is repeated 3 times
I need the common headers between the two files without duplicates plus the 3 lines below them. In the same order in each file.
Here's what I have so far:
grep -E --only-matching '#DLZ38V1.*/' FileA | sort -u -o FileA.sorted
grep -E --only-matching '#DLZ38V1.*/' FileB | sort -u -o FileB.sorted
comm -12 FileA.sorted FileB.sorted > combined
combined
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/
This is only the common headers between the two files without duplicates. This is what I want.
Now I need to match these headers to the original files and grab the 3 lines below them but only once.
If I use grep I can get what I want for each file
while read -r line; do
    grep -A3 -m1 -F "$line" FileA
done < combined > FileA.Final
FileA.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
The while loop is repeated to generate FileB.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
This works but FileA and FileB are ~18GB and my combined file is around ~2GB. Does anyone have any suggestions on how I can dramatically speed up the last step?
Depending on how often you need to run this:
you could dump (you'll probably want bulk inserts with the index built afterwards) your data into a Postgres (sqlite?) database, build an index on it, and enjoy the fruits of 40 years of research into efficient implementations of relational databases with practically no investment from you.
you could mimic having a relational database by using the unix utility 'join', but there wouldn't be much joy, since that doesn't give you an index; it is still likely to be faster than 'grep', though you might hit physical limitations... I never tried to join two 18G files.
you could write a bit of C code (put your favourite compiled-to-machine-code language here) which converts your strings (four letters only, right?) into binary and builds an index (or more) based on it. This could be made lightning fast and with a small memory footprint, as your fifty-character string would take up only two 64-bit words.
Thought I should post the fix I came up with for this. Once I obtained the combined file (above), I used a Perl hash reference to read the headers into memory and scan FileA. Matches in FileA were hashed and used to scan FileB. This still takes a lot of memory but works very fast: from 20+ days with grep to ~20 minutes.
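For anyone wanting to avoid Perl, the same hash-lookup idea can be sketched in awk; this is not the poster's actual code, and it assumes a strict 4-line record layout and that the headers in combined end with the trailing slash, as shown above:
# First pass loads the common headers into an array; second pass streams the big
# file in 4-line records and prints each wanted record the first time it is seen.
awk 'NR == FNR { want[$0]; next }              # combined: remember the headers
     FNR % 4 == 1 {                            # header line of each 4-line record
         key = $0; sub(/[0-9]+$/, "", key)     # drop the trailing 1 or 2 after the slash
         keep = (key in want) && !seen[key]++
     }
     keep' combined FileA > FileA.Final
Run it a second time with FileB in place of FileA to produce FileB.Final.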

OSX: list files that contain Author:XXXX, then grep them to find how many methods they have

I am trying to come up with a script that will find how many methods/functions are in files written by specific developers.
For example, I want to filter down to the files that contain Author:XXXXXX, then search those files to see how many instances of -(void)test they contain (count the lines that start with it, or list the lines).
Thanks
One possible solution:
for F in $(grep -l Author *.txt); do echo $F; grep -c void $F; done
You might want to put some quote marks in various places to protect the file names from expansion...
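A variant with the quoting in place, and with a pattern a bit closer to the -(void) methods mentioned in the question; the *.txt glob is kept from the command above, everything else is a sketch:
grep -l 'Author:' *.txt | while IFS= read -r f; do
    printf '%s: ' "$f"
    grep -c -e '- *(void)' "$f"    # count lines containing "-(void)" or "- (void)"
done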

Text editor that searches within search results

Does anyone know of a text editor that searches within search results using regex?
I would like to perform a regex search on several text files and get a list of matches and then apply another regex search on the search results to further narrow down results. I would prefer a Windows GUI editor rather than a specialized editor with a steeper learning curve like Vim or Emacs.
You might want to look at PowerGrep. It's not exactly a text editor, but you can open files containing your search results within its built-in text editor, and edit stuff there.
The main thing though is that it allows you to search using a regex (or list of regexes), then apply an additional regex to each search result, before returning a 'final' result, which I believe is what you are asking for. Kind of hard to explain, but maybe you get the idea.
The only problem with PowerGrep is that its UI is not very good. To say it takes some getting used to is an understatement. But once you figure it out, you can do a lot of powerful stuff (search/replace, data collection, etc on multiple files whose file names can also be regexes).
The companion product EditPadPro by the same company is also a great editor that has a really good regex engine built-in (probably the same one as in PowerGrep), but it doesn't allow you to do the 'regex-applied-to-a-regex-result' that I think you are asking for.
Do you want a list of files in which the text matches both regexps, or a list of lines?
In the first case you can do:
{ grep -l -R 'pattern1' * ; grep -l -R 'pattern2' * ; } | sort | uniq -d
Note that with Windows you can get those binaries from GnuWin32 and use nearly the same syntax in a batch file:
( grep -l -R "pattern1" *
grep -l -R "pattern2" *
) | sort | uniq -d
In the latter case you can, with Vim, use my answer about narrowing quickfix results with a regexp.
Of course you can also copy your search results to a buffer and do some linewise filtering.
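If a flat list of matching lines is all you need, chaining two greps on the command line also gives the "search within search results" effect; a sketch with placeholder patterns:
grep -RnE 'pattern1' . | grep -E 'pattern2'    # the second grep filters the file:line:text output of the first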