Finding files using regular expressions/wildcards - regex

Within a particular directory, I have a series of files that are labelled sequentially: image0000.png, image0001.png, image0002.png, etc. They are labelled by number, but I don't necessarily know how many leading zeroes there are in the filename, i.e. whether it will be image0001.png or image00001.png.
Within a bash script, I wish to find a single file at a time (over a for loop), and then apply some processing to the file. This search could start at zero and end before I've reached the end, or could be of varying steps. To expand, I could want to find image0000.png, image0001.png, image0002.png and so forth, or I could start at image0010.png and find every other file, i.e. the next two would be image0012.png and image0014.png.
To try and find the first file (image0000.png), I've tried using find and ls, with the following outputs:
$ find video/figs/ -name 'image*[0]0.png'
video/figs/image00100.png
video/figs/image00000.png
$ ls video/figs/image*[0]0.png
-rw-r--r-- 1 user machine 165K Feb 19 09:06 video/figs/image00000.png
-rw-r--r-- 1 user machine 207K Feb 19 09:06 video/figs/image00100.png
Similar results occur for finding the second file (i.e., find video/figs/ -name 'image*[0]1.png' finds image00101.png and image00001.png). So it's finding the file I want (image00001.png), but is also finding one that I don't (image00101.png). Can anyone help me understand why, and fix it?

I would use ls and grep for that:
ls | grep -oP '0*[1-9]+\.png'
Example:
$:/tmp/test$ ls
00001.png 00002.png 00010.png 00013.png 00201.png
$:/tmp/test$ ls | grep -oP '0*[1-9]+\.png'
00001.png
00002.png
00013.png
01.png
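Note that the pattern above is unanchored, which is why 00201.png yields only the partial match 01.png, and 00010.png (ending in 0) is missed entirely. An anchored variant (a sketch) keeps whole names and catches both:
ls | grep -P '^0*[1-9][0-9]*\.png$'
Here [1-9][0-9]* requires only the first non-zero digit to be non-zero, so all five example files match in full.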

I suspect you don't want to dive into subdirectories and collect files, sorted by number, spread over subdirs.
So find isn't necessary.
ls image*{08..10}.png
image00010.png image0008.png image0009.png image0010.png image008.png image009.png
Part 2 of your question, only find every other file:
ls image*{08..10..2}.png
image00010.png image0008.png image0010.png image008.png
If you know for-loops, it's like this:
for (i in 8 to 10 by 2)
or
for (int i=8; i <= 10; i+=2)
Restricting the search to find image00010.png but not imageAB010.png wouldn't work.
The reason to exclude 101 is still unclear. Maybe it's only a sorting thing.
With directories other than the PWD, there is no big difference:
ls video/figs/image*{08..10..2}.png
Note that instead of ls, you can use the program you actually want to run on the files directly, provided it can handle more than one file at a time, as ls does.
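If the program only handles one file at a time, you can loop over the same expansion; a short sketch (nullglob makes the padded variants that don't exist disappear instead of being passed through literally, and process is a placeholder for whatever command you want to apply):
shopt -s nullglob
for f in image*{08..10..2}.png; do
    process "$f"    # hypothetical per-file command
done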

Sincere thanks to everyone who contributed an answer - perhaps I explained it poorly, or I was too wedded to the code I'd already written to use any of the provided answers. However, I've found the following solutions:
1) Why did I find more answers than I expected?
find video/figs/ -name 'image*[0]0.png' uses glob patterns, which support only a very limited form of wildcard, and thus the above was interpreted as finding a file with name image<wildcard>00.png. There is no way, using the -name option, to restrict the application of * so that it matches only a given character (in this case, zero or more occurrences of 0).
2) How do I find the image files with an unknown number of padding zeroes?
The following is a MWE from my final code. It demonstrates how to search within a given directory SEARCH_DIR (note that find descends into subdirectories by default; add -maxdepth 1 if that is not wanted).
f1=0  # starting number
f2=10 # end number
df=2  # step between images
for ((f=f1; f<=f2; f+=df)); do
export iFile=$(find "$SEARCH_DIR" -regex '.*/image0*'"$f"'\.png')
done
The export ensures the variable is available to sub-processes, with the iFile=$() syntax allowing me to capture the result of the command in the variable iFile. The bit within the parentheses is the bit I was looking for: find "$SEARCH_DIR" -regex '.*/image0*'"$f"'\.png'
a) find $SEARCH_DIR specifies the location for the search
b) -regex specifies to use regular expressions, which are more powerful than the glob patterns used by -name and allow me to limit wildcards as required
c) '.*/image0*'"$f"'\.png': find -regex matches against the entire path, so I need the initial .*/ for the match to succeed. The 0* now performs as I originally wanted: the * is a quantifier matching zero or more of the preceding term, which here is 0 (so if I wanted to search for zero or more matches of any digit, I would use [0-9]*). The $f term searches for the numbered file in the for loop, and \. matches a literal dot rather than any character.
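As an aside (a sketch, not part of my final code): when the files all live in one directory, bash's extended globs can express "zero or more zeros" directly, avoiding a find call per iteration. Here process is a placeholder for the real command:
shopt -s extglob nullglob
for ((f=f1; f<=f2; f+=df)); do
    for iFile in "$SEARCH_DIR"/image*(0)"$f".png; do
        # *(0) matches zero or more literal zeros, so for f=10 this
        # matches image00010.png but not image00110.png
        process "$iFile"   # hypothetical processing command
    done
done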

Related

Scanning folder to find all files with minimum matching words (UNIX)

Having some trouble figuring out the command line for the following issue and hoping you can help!
Basically, I have a folder which contains ~1000 PDFs. I need to search through every PDF and return the file names of PDFs that match certain words X amount of times.
For example, I have 10 PDF's which all contain the word "Fragile". I would like to return a list of all files that contain "Fragile" a minimum of 3 times throughout the PDF.
I am currently using pdfgrep and giving it a regex to look for, but it will return all the files that match at least once. I have seen a few recommendations out there to pipe the command through awk, but I'm not sure what this really does...
Don't know much about pdfgrep, but if the output is like on https://pdfgrep.org/ it should be fairly easy to count the lines in the output, doing something like:
for f in *.pdf; do if [ "$(pdfgrep -nHm 10 "Fragile" "$f" | wc -l)" -gt 2 ]; then echo "$f"; fi; done
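Note that wc -l counts matching lines, so several hits on one line count only once. If your pdfgrep supports the -c/--count option (it prints the number of matches per file), a variant of the same idea could be:
for f in *.pdf; do
    # -c prints the match count for the file; require at least 3
    if [ "$(pdfgrep -c 'Fragile' "$f")" -ge 3 ]; then
        echo "$f"
    fi
done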

export filenames to temp file bash

I have a lot of files in multiple directories that all have the following setup for the filename:
prob123456_01
I want to delete the trailing "_01" off of each file name and export them to a temp file. How exactly would I delete the trailing "_01" as well as export? I am rather new to scripting so any help would be greatly appreciated!
As you've tagged with bash, I'll assume that you can use globstar
shopt -s globstar # enable globstar
for f in **_[0-9][0-9]; do echo "${f%_*}"; done > tmp
With globstar enabled, the pattern **_[0-9][0-9] matches any file ending in _, followed by any 2 digit number, in the current directory and any subdirectories. ${f%_*} removes the end of the file name using bash's built-in string manipulation functionality.
Better yet, as Charles Duffy suggests (thanks), you can use an array instead of a loop:
files=( **_[0-9][0-9] ); printf '%s\n' "${files[@]%_*}"
The array is filled with the filenames that match the same pattern as before. ${files[@]%_*} removes the last part from each element of the array and passes them all as arguments to printf, which prints each result on a separate line.
Either of these approaches is likely to be quicker than using find as everything is done in the shell, without executing any separate processes.
Previously I had suggested the pattern **_{00..99}, although this is not ideal for a couple of reasons. It is less efficient, as it expands to **_00, **_01, **_02, ..., **_99. Also, any of those 100 patterns that don't match will be included literally in the output unless another option, nullglob, is enabled.
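A quick way to see that difference (a sketch to run in a scratch directory):
shopt -s globstar
echo **_{00..99}    # unmatched patterns such as **_47 are printed literally
shopt -s nullglob
echo **_{00..99}    # unmatched patterns now simply disappear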
It's up to you whether you use [0-9] or [[:digit:]] but the advantage of the latter is that it matches all characters defined to be a digit, which may vary depending on your locale. If this isn't a concern, I would go with the former.
If I understand you correctly, you want a list of the filenames without the trailing _01. The following would do that:
find . -type f -name '*_01' | sed 's/_01$//' > tmp.lst
find . -type f -name '*_01' looks for all the files in the current directory, and its descendent directories, for files with names ending in _01.
| is the so-called pipe, handing the results of the left-hand call to the right-hand call.
sed 's/_01$//' removes the _01 from the end of each filename.
> tmp.lst writes the result into the file tmp.lst
These are all pretty basic parts of working with bash and its likes, so it might be a good idea to look at a tutorial or two and familiarize yourself with those and a few others ;)

Specifying a range of files using regex

I have a huge amount of files (in the hundreds of thousands) that all have the same format of name.
The filename format is:
[prefix][number][suffix]
where the [prefix] and [suffix] of all the files is the same, and just the number part changes. The number part is something like 0004732
So the filenames are:
[prefix]004732[suffix]
[prefix]004733[suffix]
[prefix]004734[suffix]
etc.
I need to move a range of about 100,000 files (with consecutive numbers) to another directory, and I was wondering if it is possible to do this with a regular expression.
You're looking for character classes. It's a bit difficult to specify number ranges using regex because it works on text, not numbers, but it can be done something like this (this matches 000-199, which covers files 1-100):
prefix[0-1][0-9][0-9]suffix
prefix[0-1]\d\dsuffix #this also works in Perl regex
More complicated numbers get trickier. For 0-211:
prefix([0-1][0-9][0-9]|20[0-9]|21[0-1])suffix
If you're on Windows, install Cygwin, and do the following. If you're on Mac OS X or Linux, just open a terminal. You'll need to do the following:
ls PREFIX* | sed -n 's/PREFIX\(0[0-9]\)SUFFIX/mv & tmp\/PREFIX\1SUFFIX/p' | sh
What is this doing?
Lists all files starting with the specified prefix
Pipes this list to sed, which uses a regex pattern to match only files that fall within the range you specify
Builds a new string containing the move command (-n plus the p flag ensure that only matching, rewritten lines are printed, so sh never sees filenames outside the range)
Pipes the move command string to the shell (sh) and executes it
You can tweak the regex to match your number range by looking at the following:
http://www.regular-expressions.info/numericranges.html
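Before committing, it's worth previewing the generated commands by dropping the final | sh (a sketch using the same placeholder PREFIX/SUFFIX names):
ls PREFIX* | sed -n 's/PREFIX\(0[0-9]\)SUFFIX/mv & tmp\/PREFIX\1SUFFIX/p'
Each output line is an mv command; append | sh once they look right.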
To the best of my knowledge, there is no single regex to handle the complex cases, but you can use a loop easily:
The following code runs on Linux. I ran similar code on Windows using Cygwin and it works as well; there may also be a similar native way on Windows.
If the two numbers have the same number of digits:
Example: from
[prefix]000012345[suffix]
to
[prefix]000056789[suffix]
:
for (( i=12345; i<=56789; i++ )); do mv "[prefix]0000$i[suffix]" /newDirectoryPath; done
Otherwise you can do it with multiple (usually two or three) commands;
Example: from
[prefix]000012345[suffix]
to
[prefix]003456789[suffix]
:
for (( i=12345; i<=99999; i++ )); do mv "[prefix]0000$i[suffix]" /newDirectoryPath; done
for (( i=100000; i<=999999; i++ )); do mv "[prefix]000$i[suffix]" /newDirectoryPath; done
for (( i=1000000; i<=3456789; i++ )); do mv "[prefix]00$i[suffix]" /newDirectoryPath; done
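Alternatively, one loop can cover any range if printf regenerates the zero padding (a sketch; pre, suf and dest are placeholders for your actual prefix, suffix and target directory, and the numbers are assumed to be padded to nine digits as in the example above):
pre=prefix; suf=suffix; dest=/newDirectoryPath
for (( i=12345; i<=3456789; i++ )); do
    printf -v num '%09d' "$i"     # e.g. 000012345
    mv -- "$pre$num$suf" "$dest/"
done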

Regex and shell - multiple recursive rename

I have a folder with several hundreds of folders inside it. These folders contain another folder each, called images, and in this folder there is sometimes a strictly numerically named .jpg file. Sometimes there are other JPG files in the folder as well, but these need to be ignored if they aren't strictly numeric.
I would like to learn how to write a script which would, when run in a given folder, traverse every single subfolder and look for this numeric file. It would then add the "_n" suffix to a copy of each, if such a file does not already exist.
Can this be done through the unix terminal easily?
To be more specific, this is the structure I'm dealing with:
master folder
    18556
        images
            2234.jpg
    47772
        images
            2234.jpg
            2234_n.jpg
            some_pic.jpg
    77377
        images
    88723
        images
            22.jpg
            some_pic.jpg
After the script is run, the situation would look like this:
master folder
    18556
        images
            2234.jpg
            2234_n.jpg
    47772
        images
            2234.jpg
            2234_n.jpg
            some_pic.jpg
    77377
        images
    88723
        images
            22.jpg
            22_n.jpg
            some_pic.jpg
Update: Sorry about the typo, I accidentally put 2235 into 47772.
Update 2: Regarding the 2nd comment on the mathematical.coffee's answer, the OS I am currently on (at work) is MacOS, but my main machines are running CentOS and Ubuntu at home, so I just assumed my situation applies to all unix based systems.
You can use the -regex switch to find to match /somefolder/images/numeric.jpg:
find . -type f -regex './[^/]+/images/[0-9]+\.jpg$'
Edit: refinement from @JonathanLeffler: add -type f to find so it only finds files (i.e. don't match a directory called '12345.jpg').
The ./[^/]+/ is for the first folder (if that first folder is always numeric too you can change it to [0-9]+).
The [0-9]+\.jpg$ means a jpg file with file name only being numeric.
You might want to change the jpg to jpe?g to allow .jpeg, but that's up to you.
Then it's a matter of copying these to xxx_n.jpg.
for f in $(find . -type f -regex './[^/]+/images/[0-9]+\.jpg$')
do
# replace '.jpg' in $f (filename) with '_n.jpg'
newf=${f/\.jpg/_n\.jpg}
# see if this new file exists
if [ ! -f "$newf" ];
then
# if not exists, copy it.
cp "$f" "$newf"
fi
done
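Note that looping over $(find ...) word-splits the output, so it breaks on paths containing spaces. A more robust sketch of the same logic pushes the loop into find -exec:
find . -type f -regex './[^/]+/images/[0-9]+\.jpg$' -exec sh -c '
    for f in "$@"; do
        newf="${f%.jpg}_n.jpg"           # 2234.jpg -> 2234_n.jpg
        [ -f "$newf" ] || cp "$f" "$newf"
    done' sh {} +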
What should be the logic behind the renames in folder 47772? If we assume you want to rename all the files consisting only of numbers to numbers + _n:
With mmv you could write it like:
mmv "[0-9][0-9]*.jpg" "#1#2#3_n.jpg"
Note: mmv is for moving; mcp is for copying, and so is more appropriate to this question.
In response to Vader's question:
Well I checked the man page and the problem is that it's a bit strange.
I was thinking [0-9]* would match zero or more numbers. It turns out that this assumption was wrong.
The problem is that I could not say that I want two or more digits at the start of the name.
So [0-9][0-9]* matches a name starting with at least two digits (after that, the * takes all the rest up to the .jpg). Each [0-9] is a wildcard of its own, and so I had to make the to-pattern:
"#1#2#3_n.jpg" With e.g 1234.jpg I have #1 = 1; #2 = 2, #3 = 34 So
#1#2#3 -> 1234; _n appends the _n and .jpg the extension
However, it would also rename files like 12some_other_stuff.jpg to 12some_other_stuff_n.jpg. It's not ideal, but it achieves in this context what was intended.
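Without mmv, a bash sketch of the same copy-with-suffix idea, run inside an images folder; unlike the mmv pattern above, the extended glob *([0-9]) keeps the match strictly numeric, so 12some_other_stuff.jpg is left alone:
shopt -s extglob nullglob
for f in [0-9][0-9]*([0-9]).jpg; do    # two or more digits, then .jpg
    [ -f "${f%.jpg}_n.jpg" ] || cp -- "$f" "${f%.jpg}_n.jpg"
done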

Apply regular expression substitution globally to many files with a script

I want to apply a certain regular expression substitution globally to about 40 Javascript files in and under a directory. I'm a vim user, but doing this by hand can be tedious and error-prone, so I'd like to automate it with a script.
I tried sed, but handling more than one line at a time is awkward, especially if there is no limit to how many lines the pattern might match.
I also tried this script (on a single file, for testing):
ex $1 <<EOF
gs/,\(\_\s*[\]})]\)/\1/
EOF
The pattern will eliminate a trailing comma in any Perl/Ruby-style list, so that "[a, b, c,]" will come out as "[a, b, c]" in order to satisfy Internet Explorer, which alone among browsers, chokes on such lists.
The pattern works beautifully in vim but does nothing if I run it in ex, as per the above script.
Can anyone see what I might be missing?
You asked for a script, but you mentioned that you are vim user. I tend to do project-wide find and replace inside of vim, like so:
:args **/*.js | argdo %s/,\(\_s*[\]})]\)/\1/ge | update
This is very similar to the :bufdo solution mentioned by another commenter, but it will use your args list rather than your buflist (and thus doesn't require a brand new vim session nor for you to be careful about closing buffers you don't want touched).
:args **/*.js - sets your arglist to contain all .js files in this directory and subdirectories
| - pipe is vim's command separator, letting us have multiple commands on one line
:argdo - run the following command(s) on all arguments. it will "swallow" subsequent pipes
% - a range representing the whole file
:s - substitute command, which you already know about
:s_flags, ge - global (substitute as many times per line as possible) and suppress errors (i.e. "No match")
| - this pipe is "swallowed" by the :argdo, so the following command also operates once per argument
:update - like :write but only when the buffer has been modified
This pattern will obviously work for any vim command which you want to run on multiple files, so it's a handy one to keep in mind. For example, I like to use it to remove trailing whitespace (%s/\s\+$//), set uniform line-endings (set ff=unix) or file encoding (set fileencoding=utf-8), and retab my files.
1) Open all the files with vim:
bash$ vim $(find . -name '*.js')
2) Apply substitute command to all files:
:bufdo %s/,\(\_s*[\]})]\)/\1/ge
3) Save all the files and quit:
:wall
:q
I think you'll need to recheck your search pattern, it doesn't look right. I think where you have \_\s* you should have \_s* instead.
Edit: You should also use the /ge options for the :s... command (I've added these above).
You can automate the actions of both vi and ex by passing the argument +'command' from the command line, which enables them to be used as text filters.
In your situation, the following command should work fine:
find /path/to/dir -name '*.js' | xargs -n 1 ex +'%s/,\(\_s*[\]})]\)/\1/g' +'wq!'
You can use a combination of the find command and sed:
find /path -type f -iname "*.js" -exec sed -i.bak 's/,[ \t]*\([]})]\)/\1/g' {} +
Note that sed works line by line, so this catches a trailing comma only when the closing bracket is on the same line.
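For the multi-line case (comma at the end of one line, closing bracket on the next), perl can slurp each file whole; a sketch of the same substitution:
find /path -type f -iname '*.js' -exec perl -0777 -pi.bak -e 's/,(\s*[\]})])/$1/g' {} +
Here -0777 makes perl read the whole file at once, so \s can match across newlines, and -i.bak edits in place with a backup.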
If you are on windows, Notepad++ allows you to run simple regexes on all opened files.
Searching for ,\s*([\]})]) and replacing with $1
should work for the type of lists you describe.