Bash: find and concatenate filenames with two digits - regex

I'm trying to find all instances of csv files in a set of directories and concatenate them into one csv file.
The catch is that the directories are numbered. I only want directories that end in two digits. For example, I want directories RUN11, RUN12, etc, but not RUN1, RUN2.
If I didn't care about having two-digit numbers, I'd do this (from here)
find $(pwd)/RUN* -name '*csv' |xargs cat > big_cat_file.csv
I tried this:
find $(pwd)/RUN[!0-9]{2} -name '*csv' |xargs cat > big_cat_file.csv
But it says no such file or directory.
How can I grab csv files from directories with names like RUN11, RUN12, but not RUN1, RUN2?

You are trying to use regular expression syntax where you need to use a glob.
You just need to specify the range twice, rather than using {2}:
find "$PWD"/RUN[0-9][0-9] -name '*csv' |xargs cat > big_cat_file.csv
(Note that [!0-9] matches any single character except a digit.)
To accommodate any legal filename that might match *csv, you should use the -exec primary instead of xargs. (Consider what would happen if a file name contains whitespace, or in the worst case, a newline.)
find "$PWD"/RUN[0-9][0-9] -name '*csv' -exec cat {} + > big_cat_file.csv
This not only works with any valid file name, but minimizes the number of calls to cat that are required.
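If you'd rather keep xargs, a null-delimited pipeline is just as safe; a sketch assuming GNU find and xargs (for -print0/-0 support):
find "$PWD"/RUN[0-9][0-9] -name '*csv' -print0 | xargs -0 cat > big_cat_file.csv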

Extracting filenames containing one or more numbers then cat contents to output file

If possible, I'm looking for a bash one-liner that concatenates all the files in a folder that are labelled motif<number>.motif into an output.txt file.
I have a few issues I'm struggling with.
A: The <number> contained in the filename can be one or two digits long and I don't know how to use regex (or something similar) to get all the files with either one or two digits. I can get the filenames containing single or double digit numbers separately using:
motif[0-9].motif or motif[0-9][0-9].motif
but can't work out how to get all the files listed together.
B: The second issue I have is that I don't know how many files will be in the directory in advance, so I can't just use a fixed range of numbers to select the files. This command is in the middle of a long pipeline.
So let's say I have 20 files:
motif1.motif
motif2.motif
...
motif19.motif
motif20.motif
I'd need to cat >> the contents of all of them into output.txt.
You can do:
cat motif{[0-9],[0-9][0-9]}.motif > output
or with extglob, where ?([0-9]) matches zero or one digit:
shopt -s extglob nullglob
cat motif[0-9]?([0-9]).motif > output
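If strict numeric order matters and the numbers might eventually run past two digits, a sketch using GNU sort's version sort (assuming the motif<number>.motif names contain no whitespace, which holds for the names shown):
printf '%s\n' motif*.motif | sort -V | xargs cat > output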

For the love of BASH, regex, locate & find - contains A not B

Goal: Regex pattern for use with find and locate that "Contains A but not B"
So I have a bash script that manipulates a few video files.
In its current form, I create a variable to act on later with a for loop that works well:
if [ "$USE_FIND" = true ]; then
vid_files=$(find "${DIR}" -type f -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)")
else
vid_files=$(locate -ir "${DIR}.*\.\(mkv\|avi\|ts\|mp4\|m2ts\)")
fi
So "contains A" is any one of the listed extensions.
I'd like to add to a condition where if a certain string (B) is contained the file isn't added to the array (can be a directory or a filename).
I've spent some time with lookaheads trying to implement this to no avail. So an example of "not contains B" as "Robot" - I've used different forms of .*(?!Robot).*
e.g. ".*\(\?\!Robot\).*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" for find but it doesn't work.
I've sort of exhausted regex101.com, the terminal, and chmod +x at this point and would welcome some help. I think it's the fact that it's called through a bash script that's causing me the difficulty.
One of my many sources of reference in trying to sort this:
Ref: Is there a regex to match a string that contains A but does not contain B
You may want to avoid using find inside a command substitution to build a list of files because, while this is admittedly rare, filenames can contain newlines.
You could use an array, which will handle file names without issues (assuming the array is later expanded properly).
declare -a vid_files=()
while IFS= read -r -d '' file
do
    [[ "$file" =~ Robot ]] && continue   # skip any path containing "Robot"
    vid_files+=("$file")
done < <(find "${DIR}" -type f -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" -print0)
The -print0 option of find generates a null byte to separate the file names, and the -d '' option of read allows a null byte to be used as a record separator (both obviously go together).
You can get the list of files using "${vid_files[@]}" (the double quotes are important to prevent word splitting). You can also iterate over the list easily:
for file in "${vid_files[@]}"
do
echo "$file"
done
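Alternatively, the exclusion can be pushed into find itself with ! -path, so no shell-side filtering is needed; a sketch (assuming GNU find; -path tests the whole pathname, directories included) that feeds the same read loop:
find "${DIR}" -type f ! -path '*Robot*' -regex ".*\.\(mkv\|avi\|ts\|mp4\|m2ts\)" -print0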

export filenames to temp file bash

I have a lot of files in multiple directories that all have the following setup for the filename:
prob123456_01
I want to delete the trailing "_01" off of each file name and export them to a temp file. How exactly would I delete the trailing "_01" as well as export? I am rather new to scripting so any help would be greatly appreciated!
As you've tagged with bash, I'll assume that you can use globstar:
shopt -s globstar # enable globstar
for f in **_[0-9][0-9]; do echo "${f%_*}"; done > tmp
With globstar enabled, the pattern **_[0-9][0-9] matches any file ending in _ followed by two digits, in the current directory and any subdirectories. ${f%_*} removes everything from the final _ onward using bash's built-in string manipulation (so prob123456_01 becomes prob123456).
Better yet, as Charles Duffy suggests (thanks), you can use an array instead of a loop:
files=( **_[0-9][0-9] ); printf '%s\n' "${files[@]%_*}"
The array is filled with the filenames that match the same pattern as before. ${files[@]%_*} removes the last part from each element of the array and passes the results as arguments to printf, which prints each one on a separate line.
Either of these approaches is likely to be quicker than using find as everything is done in the shell, without executing any separate processes.
Previously I had suggested the pattern **_{00..99}, although this is not ideal for a couple of reasons. It is less efficient, as it expands to **_00, **_01, **_02, ..., **_99. Also, any of those 100 patterns that don't match will be included literally in the output unless another option, nullglob, is enabled.
It's up to you whether you use [0-9] or [[:digit:]] but the advantage of the latter is that it matches all characters defined to be a digit, which may vary depending on your locale. If this isn't a concern, I would go with the former.
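Combining the two options, a sketch in which nullglob guarantees that a directory with no matches yields an empty array rather than the literal pattern:
shopt -s globstar nullglob
files=( **_[0-9][0-9] )
(( ${#files[@]} )) && printf '%s\n' "${files[@]%_*}" > tmp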
If I understand you correctly, you want a list of the filenames without the trailing _01. The following would do that:
find . -type f -name '*_01' | sed 's/_01$//' > tmp.lst
find . -type f -name '*_01' looks for all the files in the current directory, and its descendent directories, for files with names ending in _01.
| is the so-called pipe, handing the results of the left-hand call to the right-hand call.
sed 's/_01$//' removes the _01 from the end of each filename.
> tmp.lst writes the result into the file tmp.lst
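To check the sed step in isolation, using the sample name from the question:
printf '%s\n' prob123456_01 | sed 's/_01$//'
which prints prob123456.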
These are all pretty basic parts of working with bash and the like, so it might be a good idea to look at a tutorial or two and familiarize yourself with those and a few others ;)

unix find filenames that are lexicographically less than a given filename

I have a list of files in a directory that are automatically generated by a system with the date in the filename. Some examples are: audit_20111020, audit_20111021, audit_20111022, etc.
I want to clean up files older than 18 months therefore I want to put together a unix find command that will find files less than audit_20100501 and delete them.
Does any know how to use lexicographical order as a criteria in the find command?
Another Perl variant:
perl -E'while(<audit_*>) { say if /(\d{8})/ && $1 < 20100501}'
Replace say with unlink once it prints the expected filenames.
Note: < performs numerical comparison; use lt if you want string comparison.
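For illustration, the string-comparison form would be (a sketch; for equal-length digit strings the two comparisons agree):
perl -E'while(<audit_*>) { say if /(\d{8})/ && $1 lt "20100501" }'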
With Perl it's easy. Type perl and:
for (glob "*")
{
my($n) = /(\d+)/;
unlink if ($n < 20100501);
}
^D
Test before using. Note that I'm assuming this is a fixed format and that the directory only contains these files.
It is possible to sort find's results with the sort command (the embedded YYYYMMDD dates order correctly under plain lexicographic sorting, so -n is not needed):
find . -name "audit*" | sort
... then find a way to split this list.
But for what you want to do, i.e. delete files older than a certain date (18 months is ~547 days), you could use the below instead (note that +547 means "changed more than 547 days ago", and that -ctime tests inode change time, which may differ from the date embedded in the name):
find . -name 'audit_*' -type f -ctime +547 | xargs -I{} rm -f {}
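Since the date is embedded in the name, a plain bash loop can also apply the lexicographic test directly; a sketch (keep the echo until the output looks right, then remove it to actually delete):
for f in audit_*; do
    [[ "$f" < "audit_20100501" ]] && echo rm -- "$f"
done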

Bash go through list of dirs and generate md5

What would be the bash script that:
Goes through a directory, and puts all the sub-directories in an array
For each dir, generate an md5 sum of a file inside that dir
Also, the file whose md5sum has to be generated doesn't always have the same name and path. However, the pattern is always the same:
/var/mobile/Applications/{ the dir name here is taken from the array }/{some name}.app/{ binary, whose name is the same as its parent dir, but without the .app extension }
I've never worked with bash before (and have never needed to) so this may be something really simple and nooby. Anybody got an idea? As can be seen by the path, this is designed to be run on an iDevice.
for dir in /var/mobile/Applications/*; do
    for app in "$dir"/*.app; do
        appdirname=${app##*/}          # e.g. SomeName.app
        appname=${appdirname%.app}     # e.g. SomeName
        binary="$app/$appname"
        if [ -f "$binary" ]; then
            echo "I: dir=$dir appname=$appname binary=$binary"
        fi
    done
done
Try this; I hope the code is straightforward. The two things worth explaining are:
${app##*/}, which uses the ## operator to strip off the longest prefix matching the expression */.
${appdirname%.app}, which uses the % operator to strip off the shortest suffix matching the expression .app. (You could have also used %% (strip longest suffix) instead of %; since the pattern .app contains no wildcards, the shortest and longest matches are identical.)
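A quick demonstration of the two expansions on a hypothetical path:
app="/var/mobile/Applications/ABC123/Foo.app"   # hypothetical example
appdirname=${app##*/}          # Foo.app
appname=${appdirname%.app}     # Foo
To actually produce the checksum in the loop above, replace the echo line with md5sum "$binary" (or with md5 "$binary" if the device ships md5 instead of md5sum; which tool an iDevice provides is an assumption here).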
Try something like:
ls -1 /Applications/*/Contents/Info.plist | while read name; do md5 -r "$name"; done
The above will show the md5 checksum of every application's Info.plist file, like:
d3bde2b76489e1ac081b68bbf18a7c29 /Applications/Address Book.app/Contents/Info.plist
6a093349355d20d4af85460340bc72b2 /Applications/Automator.app/Contents/Info.plist
f1c120d6ccc0426a1d3be16c81639ecb /Applications/Calculator.app/Contents/Info.plist
Bash is very easy, but you need to know your system's CLI tools.
To print the md5 hash of all files in a directory recursively:
find /yourdirectory/ -type f -print0 | xargs -0 md5sum
If you only want to list the tree of directories:
find /tmp/ -type d
You can generate a list with:
MYLIST=$( find /tmp/ -type d )
and use for to iterate over it:
for i in $MYLIST; do
    echo "$i"
done
(Be aware that this relies on word splitting, so it breaks on directory names containing whitespace; a more robust null-delimited variant is sketched below.)
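A sketch of that more robust variant, using the same find invocation:
find /tmp/ -type d -print0 | while IFS= read -r -d '' dir; do
    echo "$dir"
done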
If you are new to bash:
http://tldp.org/LDP/Bash-Beginners-Guide/html/
http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html