Random sample from regex - regex

I would like to test a tool on a small number of files from a directory. To run the tool on all files in the directory, I would run:
./my-tool input/*.test
However, the tool takes a long time to run, so I would like to test it on only a subset of the files in input/. Currently I am copying a random subset to another folder and using a wildcard pattern to grab all files from that folder.
My question is: is there any way to limit the number of matches? I.e., a way to run ./my-tool input/[PATTERN].test where [PATTERN] is a pattern that will expand to only n matches. Even better, is there a way to do that and randomize which files are returned?

On GNU/Linux you can easily and robustly select a subset of files with shuf:
shuf -ze -n 10 input/*.test | xargs -0 ./my-tool
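Here -e makes shuf treat its command-line arguments as input lines, -z/-0 keep the file names NUL-delimited (so names with spaces survive), and -n limits the sample size. A slight variation with the sample size in a variable (a sketch assuming GNU shuf and xargs):
n=10
shuf -ze -n "$n" input/*.test | xargs -0 ./my-tool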

Related

Finding files using regular expressions/wildcards

Within a particular directory, I have a series of files that are labelled sequentially: image0000.png, image0001.png, image0002.png, etc. They are labelled by number, but I don't necessarily know how many preceding zeroes there are in the filename, i.e. whether it will be image0001.png or image00001.png.
Within a bash script, I wish to find a single file at a time (over a for loop), and then apply some processing to the file. This search could start at zero and end before I've reached the end, or could be of varying steps. To expand, I could want to find image0000.png, image0001.png, image0002.png and so forth, or I could start at image0010.png and find every other file, i.e. the next two would be image0012.png and image0014.png.
To try and find the first file (image0000.png), I've tried using find and ls, with the following outputs:
$ find video/figs/ -name 'image*[0]0.png'
video/figs/image00100.png
video/figs/image00000.png
$ ls video/figs/image*[0]0.png
-rw-r--r-- 1 user machine 165K Feb 19 09:06 video/figs/image00000.png
-rw-r--r-- 1 user machine 207K Feb 19 09:06 video/figs/image00100.png
Similar results occur when finding the second file (i.e., find video/figs/ -name 'image*[0]1.png' finds image00101.png and image00001.png). So it's finding the file I want (image00001.png), but also finding one that I don't (image00101.png). Can anyone help me understand why, and fix it?
I would use ls and grep for that:
ls | grep -oP '0*[1-9]+\.png'
Example:
$:/tmp/test$ ls
00001.png 00002.png 00010.png 00013.png 00201.png
$:/tmp/test$ ls | grep -oP '0*[1-9]+\.png'
00001.png
00002.png
00013.png
01.png
I suspect you don't want to dive into subdirectories and collect files, sorted by number, spread over subdirs, so find isn't necessary.
ls image*{08..10}.png
image00010.png image0008.png image0009.png image0010.png image008.png image009.png
Part 2 of your question, only find every other file:
ls image*{08..10..2}.png
image00010.png image0008.png image0010.png image008.png
If you know for-loops, it's like this:
for (i in 8 to 10 by 2)
or
for (int i=8; i <= 10; i+=2)
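In bash that stepping could also be written as an explicit loop (a sketch; printf pads to two digits the way {08..10..2} does):
for ((i = 8; i <= 10; i += 2)); do
    printf -v n '%02d' "$i"
    ls image*"$n".png
done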
Restricting the search so that it finds image00010.png but not imageAB010.png wouldn't work this way.
The reason to exclude 101 is still unclear to me; maybe it's only a sorting thing.
For directories other than the PWD, there is no big difference:
ls video/figs/image*{08..10..2}.png
Note that instead of ls you just use the program you want to run on the files, provided that program can handle more than one file at a time, as ls does.
Sincere thanks to everyone who contributed an answer - perhaps I explained it poorly, or I was too wedded to the code I'd already written to use any of the provided answers. However, I've found the following solutions:
1) Why did find return more files than I expected?
find video/figs/ -name 'image*[0]0.png' uses shell-style wildcards, not regular expressions, so the pattern above was interpreted as a file named image<wildcard>00.png. There is no way, using the -name option, to restrict * to repetitions of a given character (in this case, zero or more 0s).
2) How do I find the image files with an unknown number of padding zeroes?
The following is an MWE from my final code. It demonstrates how to search within a given directory SEARCH_DIR (not necessarily including subdirectories, but I haven't checked).
f1=0   # Starting number
f2=10  # End number
df=2   # number to skip between images
for ((f=$f1; f<=$f2; f=$f+$df)); do
    export iFile=$(find "$SEARCH_DIR" -regex '.*/image0*'$f'.png')
done
The export ensures the variable is available to sub-processes, with the iFile=$() syntax allowing me to assign the output of the command to the variable iFile. The bit within the parentheses is the bit I was looking for: find "$SEARCH_DIR" -regex '.*/image0*'$f'.png'
a) find $SEARCH_DIR specifies the location for the search
b) -regex specifies to use regular expressions, which are more powerful than standard shell wildcards and allow me to constrain the wildcards as required
c) '.*/image0*'$f'.png': find's regular expression must match the entire path, so I need the initial .*/ to cover the directory part. The 0* now behaves as I originally wanted: the * wildcard matches zero or more occurrences of the preceding term, which here is 0 (so if I wanted zero or more occurrences of any digit, I would use [0-9]*). The $f term searches for the numbered file in the for loop.

How to search files in windows file explorer with specified extension name?

We can search for files in Windows 7 or later using the search box:
(I don't have image uploading privileges. I mean the top-right area in Windows File Explorer.)
When I search for MATLAB files using "*.m", it not only returns *.m files, but also returns *.mp3, *.mp4 files. Is there any way to show *.m files exclusively?
Thanks!
I assume you used the quotation marks here just to show the text you typed, because ironically putting the search in quotation marks is exactly how it should work...
so
*.m
finds .mp3 as well as .m but
"*.m"
should only find the .m files. Alternatively you could also write
ext:".m"
which would guarantee that only extensions are searched. (Although I am not sure this is ever necessary here, because while Windows file names can contain a dot and files can also lack an extension, I am not sure whether both can happen at the same time.)
using the following
"*.m"
will solve your problem. You can find more information on the query syntax on MSDN at the following link: Advanced Query Syntax.
Beyond that, you can also take advantage of the wildcard character *.
For example, if you want to search for a file whose name ends with 024 or starts with 024, you can type *024.* or 024*.* respectively in the search box.
Here the * after the . stands for any extension; if you want a particular one, spell out the extension, like 024.png.
Explorer doesn't have a regex search function.
You need to use PowerShell instead of Windows Explorer;
for example, where '(?i)Out' is a regex:
Get-ChildItem -Path e:\temp -Recurse -File | Where-Object { $_.Name -match '(?i)Out' }
Alternatively, you can simply search for your extension like this:
.extension
eg:
typing .exe will give you all the files with .exe extensions in a folder.
PS: Typing .xml OR .vmcx will give you both types of files. This is useful if you want to make an archive of different kinds of files stored in different folders or locations.
You can get close to proper regex support with the mostly awesome Cygwin, and as a bonus you get most every Linux tool running natively on Windows. Windows search, by contrast, still doesn't know that .* means "zero or more of anything" or that ^ means the start of a line (and $ the end), so some things are still weird.
And a startlingly large bunch of weird corner cases that only deranged Perl programmers would notice also fail the test.
It gets plenty of other things wrong too, but Cygwin is more workable than anything built into any Windows OS, plus you get perl, grep, diff, wget, curl, etc. -- the whole GNU toolset for free.
If you want a full-on bash shell with proper respect for regex, install the super neat-o Bash for Windows 10
Either will do what you want. And they're a billion times faster than that stupid search bar that takes off at 100 mph then crawls to 1 pixel per 10 minutes near the end.

Specifying a range of files using regex

I have a huge number of files (in the hundreds of thousands) that all follow the same naming format.
The filename format is:
[prefix][number][suffix]
where the [prefix] and [suffix] of all the files is the same, and just the number part changes. The number part is something like 0004732
So the filenames are:
[prefix]004732[suffix]
[prefix]004733[suffix]
[prefix]004734[suffix]
etc.
I need to move a range of about 100,000 files (with consecutive numbers) to another directory, and I was wondering if it is possible to do this with a regular expression.
You're looking for character classes. It's a bit difficult to specify number ranges using regex because it works on text, not numbers, but it can be done, something like this (for numbers 000-199):
prefix[0-1][0-9][0-9]suffix
prefix[0-1]\d\dsuffix #this also works in PERL regex
More complicated numbers get trickier. For 0-211:
prefix([0-1][0-9][0-9]|20[0-9]|21[0-1])suffix
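As a sketch of how such a pattern could be applied to actually move the matching files (assuming GNU find and mv, and an existing target directory dest/, all of which are placeholders here):
find . -maxdepth 1 -regextype posix-extended \
    -regex '.*/prefix([0-1][0-9][0-9]|20[0-9]|21[0-1])suffix' \
    -exec mv -t dest/ {} +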
If you're on Windows, install Cygwin; if you're on Mac OS X or Linux, just open a terminal. Then do the following:
ls PREFIX* | sed 's/PREFIX\(0[0-9]\)SUFFIX/mv & tmp\/PREFIX\1SUFFIX/' | sh
What is this doing?
Lists all files starting with the specified prefix
Pipes this list to sed, which uses a regex pattern to match only files that fall within the range you specify
Builds an mv command string for each matching file
Pipes the mv command strings to the shell (sh), which executes them
You can tweak the regex to match your number range by looking at the following:
http://www.regular-expressions.info/numericranges.html
To the best of my knowledge, there is no regex that handles the complex cases, but you can use a loop easily.
The following code runs on Linux. I ran similar code on Windows using Cygwin and it works as well; maybe there is a similar way to do it natively on Windows.
If the two numbers have the same number of digits:
Example: from
[prefix]000012345[suffix]
to
[prefix]000056789[suffix]
:
for (( i=12345; i<=56789; i++ )); do mv "[prefix]0000$i[suffix]" /newDirectoryPath; done
Otherwise you can do it with multiple (usually two or three) commands.
Example: from
[prefix]000012345[suffix]
to
[prefix]003456789[suffix]
:
for (( i=12345; i<=99999; i++ )); do mv "[prefix]0000$i[suffix]" /newDirectoryPath; done
for (( i=100000; i<=999999; i++ )); do mv "[prefix]000$i[suffix]" /newDirectoryPath; done
for (( i=1000000; i<=3456789; i++ )); do mv "[prefix]00$i[suffix]" /newDirectoryPath; done
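The split into several loops is only needed because the zero padding changes width as the numbers grow. A single loop with printf padding would avoid that (a sketch; [prefix], [suffix], and the destination are placeholders as above):
for ((i = 12345; i <= 3456789; i++)); do
    printf -v n '%09d' "$i"       # pad to the 9-digit field used in the file names
    mv "[prefix]${n}[suffix]" /newDirectoryPath
done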

Grep pattern match between very large files is way too slow

I've spent way too much time on this and am looking for suggestions. I have two very large files (FASTQ files from an Illumina sequencing run, for those interested). What I need to do is match a pattern common between both files and print that line plus the 3 lines below it into two separate files without duplications (which exist in the original files). Grep does this just fine but the files are ~18GB and matching between them is ridiculously slow. An example of what I need to do is below.
FileA:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
TGTTCAAAGCAGGCGTATTGCTCGAATATATTAGCATGGAATAATAGAAT
+DLZ38V1_0262:8:2316:21261:100790#ATAGCG/1
__\^c^ac]ZeaWdPb_e`KbagdefbZb[cebSZIY^cRaacea^[a`c
You can see 3 unique headers starting with #, each followed by 3 additional lines
FileB:
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
_[_ceeeefffgfdYdffed]e`gdghfhiiihdgcghigffgfdceffh
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
There are 4 headers here but only 2 are unique as one of them is repeated 3 times
I need the common headers between the two files without duplicates plus the 3 lines below them. In the same order in each file.
Here's what I have so far:
grep -E '#DLZ38V1.*/' --only-matching FileA | sort -u -o FileA.sorted
grep -E '#DLZ38V1.*/' --only-matching FileB | sort -u -o FileB.sorted
comm -12 FileA.sorted FileB.sorted > combined
combined
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/
This is only the common headers between the two files without duplicates. This is what I want.
Now I need to match these headers to the original files and grab the 3 lines below them but only once.
If I use grep I can get what I want for each file
while read -r line; do
    grep -A3 -m1 -F "$line" FileA
done < combined > FileA.Final
FileA.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
NAGGATTTAAAGCGGCATCTTCGAGATGAAATCAATTTGATGTGATGAGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/1
BP\ccceeggggfiihihhiiiihiiiiiiiiihighiighhiifhhhic
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
NTTTCAGTTAGGGCGTTTGAAAACAGGCACTCCGGCTAGGCTGGTCAAGG
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/1
BP\cccc^ea^eghffggfhh`bdebgfbffbfae[_ffd_ea[H\_f_c
The while loop is repeated to generate FileB.Final
#DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
GCCATTCAGTCCGAATTGAGTACAGTGGGACGATGTTTCAAAGGTCTGGC
+DLZ38V1_0262:8:1101:1369:2106#ATAGCG/2
_aaeeeeegggggiiiiihihiiiihgiigfggiighihhihiighhiii
#DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
GAAATCAATGGATTCCTTGGCCAGCCTAGCCGGAGTGCCTGTTTTCAAAC
+DLZ38V1_0262:8:1101:1430:2087#ATAGCG/2
This works but FileA and FileB are ~18GB and my combined file is around ~2GB. Does anyone have any suggestions on how I can dramatically speed up the last step?
Depending on how often you need to run this:
you could dump your data into a Postgres (SQLite?) database (you'll probably want bulk inserts, with the index built afterwards), build an index on it, and enjoy the fruits of 40 years of research into efficient implementations of relational databases with practically no investment from you.
you could mimic having a relational database by using the Unix utility 'join', but there wouldn't be much joy, since it doesn't give you an index; still, it is likely to be faster than 'grep', though you might hit physical limitations... I have never tried to join two 18GB files.
you could write a bit of C code (put your favourite compiled-to-machine-code language here) which converts your strings (four letters only, right?) into binary and builds an index (or more) based on it. This could be made lightning fast with a small memory footprint, as your fifty-character string would take up only two 64-bit words.
Thought I should post the fix I came up with for this. Once I obtained the combined file (above), I used a Perl hash reference to read the headers into memory and scan file A. Matches in file A were hashed and used to scan file B. This still takes a lot of memory but works very fast: from 20+ days with grep to ~20 minutes.
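The Perl script itself isn't shown, but the same hash-lookup idea can be sketched in awk (assumptions: every record is exactly 4 lines, and every header ends in /1 or /2 as in the samples above):
awk 'NR==FNR { want[$0]; next }               # pass 1: load the combined headers into a hash
     FNR % 4 == 1 {                           # pass 2: every 4th line, starting at 1, is a header
         key = substr($0, 1, length($0) - 1)  # drop the trailing 1/2 so it matches combined
         keep = (key in want) && !(key in seen)
         seen[key] = 1                        # emit each common record only once
     }
     keep' combined FileA > FileA.Final
The same command with FileB in place of FileA produces FileB.Final; the records come out in file order rather than in the order of the combined file.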

Linux Command Line Zip with Regex

I have thousands of jpg files that are all called 1.jpg, 2.jpg, 3.jpg and so on. I need to zip up a range of them and I thought I could do this with regex, but so far haven't had any luck.
Here is the command
zip images.zip '[66895-105515]'.jpg
Does anyone have any ideas?
I am quite sure it is not possible to match whole multi-digit number ranges like this with a bracket expression (digit ranges, yes, but not whole multi-digit numbers), as regular expressions work on the character level. However, you can use the "seq" command to generate the list of file names and use "xargs" to pass them to "zip":
seq --format %g.jpg 66895 105515 | xargs zip images.zip
I tested the command with a bunch of dummy files under Linux and it works fine.
Use zip in conjunction with ls and the bash range ({m..n}) operator like this:
ls {66895..105515}".jpg" 2>/dev/null | zip jpegs -@
You need to pipe some stuff - list the files, filter by the regex, zip up each listed file.
ls | grep [66895-10551] | xargs zip images.zip
Edit: Whoops, didn't test with multi-digit numbers. As denisw mentions, this method won't work.