How to grep for a file extension - regex

I am currently trying to make a script that would grep input to see if something is of a certain file type (zip, for instance), although the text before the file type could be anything. So, for instance,
something.zip
this.zip
that.zip
would all fall under the category. I am trying to grep for these using a wildcard, and so far I have tried this
grep ".*.zip"
But whenever I do that, it finds the .zip files just fine, yet it still displays output if there are additional characters after the .zip; for instance, .zippppppp or .zipdsjdskjc would still be picked up by grep. What should I do to prevent grep from displaying matches that have additional characters after the .zip?

Test for the end of the line with $ and escape the second . with a backslash so it only matches a period and not any character.
grep ".*\.zip$"
However, ls *.zip is a more natural way to do this if you want to list all the .zip files in the current directory, or find . -name "*.zip" for all .zip files in the sub-directories starting from (and including) the current directory.
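A quick way to sanity-check the corrected pattern (with made-up names) is to pipe a few sample lines through it:
printf '%s\n' something.zip this.zippppppp that.zipdsjdskjc | grep ".*\.zip$"
Only something.zip is printed; the other two names are rejected because of the trailing characters after .zip.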

On UNIX, try:
find . -type f -name \*.zip

You can also use grep to find all files with a specific extension:
find . | grep -e "\.gz$"
The . means the current folder.
If you want to specify a folder other than the current folder, just replace the . with the path of the folder.
Here is an example: Let's find all files that end with .gz and are in the folder /var/log
find /var/log/ |grep -e "\.gz$"
The output is something similar to the following:
✘ ⚙> find /var/log/ |grep -e "\.gz$"
/var/log//mail.log.1.gz
/var/log//mail.log.0.gz
/var/log//system.log.3.gz
/var/log//system.log.7.gz
/var/log//system.log.6.gz
/var/log//system.log.2.gz
/var/log//system.log.5.gz
/var/log//system.log.1.gz
/var/log//system.log.0.gz
/var/log//system.log.4.gz
The $ anchors the match to the end of the line, so only names that end in .gz are matched.

I use this to get a listing of the file types inside a folder.
find . -type f | grep -Eio '\.\w*$' | sort -u
Outputs for example:
.DS_Store
.MP3
.aif
.aiff
.asd
.doc
.flac
.jpg
.m4a
.m4p
.m4r
.mp3
.pdf
.png
.txt
.wav
.wma
.zip
BONUS: with
find . -type f | grep -Eio '\.\w*$' | sort | uniq -c
You'll get the file count:
106 .DS_Store
35 .MP3
89 .aif
5 .aiff
525 .asd
1 .doc
60 .flac
48 .jpg
149 .m4a
11 .m4p
1 .m4r
12844 .mp3
1 .pdf
5 .png
9 .txt
108 .wav
44 .wma
2 .zip
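If you would rather see the most common extensions first, the counts can be sorted numerically as a small additional tweak:
find . -type f | grep -Eio '\.\w*$' | sort | uniq -c | sort -rn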

You need to do a couple of things. It should look like this:
grep '.*\.zip$'
You need to escape the second dot, so it will just match a dot, and not any character. Using single quotes makes the escaping a bit easier.
You need the dollar sign at the end of the line to indicate that you want the "zip" to occur at the end of the line.

grep -r pattern --include="*.txt" /path/to/dir/
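For example, assuming you want to search for the literal word TODO only inside .txt files under a hypothetical src/ directory:
grep -r "TODO" --include="*.txt" src/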

Try: grep -o -E "(\\.([A-Za-z])+)+"
I used this to get multi-dotted/multiple extensions. So if the input was hello.tar.gz, then it would output .tar.gz.
For single dotted, use grep -o -E "\\.([A-Za-z])+$".
Tested on Cygwin/MingW+MSYS.

One more fix/addon of the above example:
# multi-dotted/multiple extensions
grep -oEi "(\\.([A-Za-z0-9])+)+" file.txt
# single dotted
grep -oEi "\\.([A-Za-z0-9])+$" file.txt
This will pick up file extensions like '.mp3', and so on.

Just reviewing some of the other answers. The .* isn't necessary, and if you're looking for a certain file extension, it's best to include -i so that the match is case-insensitive, in case the file is HELLO.ZIP, for example. Keep the pattern quoted, though: without quotes the shell strips the backslash before grep ever sees it, and the unescaped dot would then match any character.
grep -i '\.zip$'

If you just want to find in the current folder, why not with this simple command without grep ?
ls *.zip

Simply do:
grep ".*\.zip$"
The "$" indicates the end of the line, and the escaped dot matches a literal period.

Related

rename files by removing the first 171 characters?

I have thousands of files downloaded from the internet with a naming convention like this:
HTTP_services.cgi?FILENAME=%2Fdata%2FGPM_L3%2FGPM_3IMERGM.06%2F2020%2F3B-MO.MS.MRG.3IMERG.20200301-S000000-E235959.03.V06B.HDF5&FORMAT=bmM0Lw&BBOX=-9,114.3,-8,115.8&LABEL=3B-MO.MS.MRG.3IMERG.20200301-S000000-E235959.03.V06B.HDF5.SUB.nc4
I want to rename all the files by removing the first 171 characters of the filename, so that I end up with a file named "3B-MO.MS.MRG.3IMERG.20200301-S000000-E235959.03.V06B.HDF5.SUB.nc4".
Is there any one-liner solution that I can use? I am using the terminal on a Mac.
You may try the below regex:
.{171}
Explanation of the above regex:
. - A metacharacter that matches any character except a newline.
{171} - A quantifier indicating that the preceding element occurs exactly 171 times.
You can use the GNU rename utility and run the command below to achieve your result:
rename 's/.{171}//' *.nc4
Worth Reading: I can't run rename command on MACOS. What to do?
rename is the best solution, but you can also use substring commands:
for file in *IMERG* ; do
mv "$file" "${file:171}"
done
or alternatively using cut:
for file in *IMERG* ; do
mv "$file" "$(echo "$file" | cut -c 172-)"
done
provided you are sure that removing exactly 171 characters is correct for every file name.
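If the fixed character count ever feels fragile, a hedged alternative (assuming every downloaded name contains LABEL= right before the part you want to keep, as in the example above) is to strip everything up to the last LABEL= with bash parameter expansion:
for file in *IMERG*; do
  mv "$file" "${file##*LABEL=}"   # keep only what follows the last "LABEL="
done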

Searching files in directory using regex and grep

So I have to find every file in the /etc directory that starts with a, b, or c.
What I have is grep -l '/^[a-cA-C].*/g' /etc/*, though I keep getting every file in the /etc directory.
I use grep -l to get every file (I guess whether I use find or grep doesn't matter), and
'/^[a-cA-C].*/g' to find everything that starts with a, b, or c, uppercase or lowercase, followed by zero or more characters, with a global flag so it doesn't stop after the first match.
I know the regex is right because I've checked it with a regex checker online.
EDIT: found the solution --> ls /etc/[a-cA-C]*
Here my example:
find ./ -type f -exec basename {} \; | grep -Ei '^(a|b|c)'
It searches recursively and finds all files, but outputs only the basename of each file. Is that OK for you?
You can try this one:
find | grep '^\./[abc]'
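If you only need regular files directly under /etc (not in subdirectories), a small hedged variant using find's case-insensitive name test would be:
find /etc -maxdepth 1 -type f -iname '[a-c]*'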

Use [msys] bash to remove all files whose name matches a pattern, regardless of file-name letter-case

I need a way to clean up a directory, which is populated with C/C++ built-files (.o, .a, .EXE, .OBJ, .LIB, etc.) produced by (1) some tools which always create files having UPPER-CASE names, and (2) other tools which always create lower-case file names. (I have no control over the tools.)
I need to do this from a MinGW 'msys' bash.exe shell script (or bash command prompt). I understand piping (|), but haven't come up with the right combination of exec's yet. I have successfully filtered the file names, using commands like this example:
ls | grep '.\.[eE][xX][eE]'
to list all files having any case-combination of letters in the file-extension--this example gets all the executable (e.g. ".EXE") files.
(I'll be doing similar for .o, .a, .OBJ, .LIB, .lib, .MAP, etc., which all share the same directory as the C/C++ source files. I don't want to delete the source files, only the built-files. And yes, I probably should rework the directory structure, to use a separate directory for the built-files [only], but that will take time, and I need a quick solution now.)
How can I merge the above command with "something" else (e.g., like the 'rm -f' command???), to carry this the one step further, to actually delete [only] those filtered-out files from the current directory? (I'm hopeful for a solution which does not require a temporary file to hold the filtered file names.)
Adding this answer because the accepted answer is suggesting practices which are not-recommended in actual scripts. (Please don't feel bad, I was also on that track once..)
Parsing ls output is a NO-NO! See http://mywiki.wooledge.org/ParsingLs for a more detailed explanation of why.
In short, ls separates filenames with newlines, which can themselves appear inside a filename. (Plus, ls does not handle other special characters properly; it prints its output in human-readable form.) In unix/linux, it's perfectly valid to have a newline in a filename.
A unix filename cannot contain a NUL character, though. Hence the command below should work.
find /path/to/some/directory -iname '*.exe' -print0 | xargs -0 rm -f
find: is a tool used to, well, find files matching the required pattern/criterion.
-iname: search using particular names, case insensitive. Note that the argument to -iname is wildcard, not regex.
-print0: Print the file names separated by NULL character.
xargs: Takes the input from stdin & runs the command supplied (rm -f in this case) on it. The input is separated by whitespace by default.
-0 specifies that the input is separated by null character.
Or, an even better approach:
find /path/to/some/directory -iname '*.exe' -delete
-delete is a built-in feature of find, which deletes the files found with the pattern.
Note that if you want to do some other operation, like moving the files to a particular directory, you'd need to use the first option with xargs.
Finally, this command find /path/to/some/directory -iname '*.exe' -delete would recursively find and delete the matching *.exe files/directories. You can restrict the search to the current directory with -maxdepth 1, and to regular files (not directories, pipes, etc.) with -type f. Check the manual link I provided for more details.
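Putting those options together for the original question (several built-file extensions, current directory only), a hedged sketch, where the exact extension list is only an example, would be:
find . -maxdepth 1 -type f \( -iname '*.exe' -o -iname '*.o' -o -iname '*.obj' -o -iname '*.lib' -o -iname '*.a' -o -iname '*.map' \) -delete
If you want to preview what would be removed, run it with -print in place of -delete first.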
Is this what you mean?
rm -f `ls | grep '.\.[eE][xX][eE]'`
But if you use a long listing (ls -l | grep ...), the output will have other fields that you have to strip out, such as the date, so you might want to print just the file name itself. Try something like:
rm -f `ls -l | grep '.\.[eE][xX][eE]' | awk '{print $9}'`
where your file name is in the 9th field, like:
-rwxr-xr-x 1 Administrators None 283 Jul 2 2014 search.exe
You can use the following command:
ls | grep '.\.[eE][xX][eE]' | xargs rm -f
xargs turns its standard input (in this case, the output of the previous command) into arguments for the rm -f command.

Remove duplicate filename extensions

I have thousands of files named something like filename.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz.gz
I am using the find command like this: find . -name "*.gz*" to locate these files, and I want to either use -exec or pipe to xargs with some magic command to clean up this mess, so that I end up with filename.gz
Can someone please help me come up with this magic command that would remove the unneeded instances of .gz? I have tried experimenting with sed 's/\.gz//' and sed 's/(\.gz)//', but they do not seem to work (or, to be more honest, I am not very familiar with sed). I do not have to use sed, by the way; any solution that helps solve this problem is welcome :-)
one way with find and awk:
find $(pwd) -name '*.gz'|awk '{n=$0;sub(/(\.gz)+$/,".gz",n);print "mv",$0,n}'|sh
Note:
I assume there are no special characters (like spaces) in your filenames. If there were, you would need to quote the filenames in the mv command.
I added $(pwd) to get the absolute path of each found name.
You can remove the trailing |sh to check whether the generated mv commands are correct.
If everything looks good, add the |sh back to execute the mv commands.
You may use
ls a.gz.gz.gz |sed -r 's/(\.gz)+/.gz/'
or without the regex flag
ls a.gz.gz.gz |sed 's/\(\.gz\)\+/.gz/'
ls *.gz | perl -ne '/((.*?.gz).*)/; print "mv $1 $2\n"'
It will print shell commands to rename your files; it won't execute those commands, so it is safe. To execute them, you can save them to a file and run it, or simply pipe them to the shell:
ls *.gz | ... | sh
sed is great for replacing text inside files.
You can do that with bash string substitution:
for file in *.gz.gz; do
mv "${file}" "${file%%.*}.gz"
done
This might work for you (GNU sed):
printf '%s\n' *.gz | sed -r 's/^([^.]*)(\.gz){2,}$/mv -v & \1\2/e'
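To preview the commands before anything runs, drop the trailing e flag so sed just prints the generated mv commands (names that don't match are printed unchanged):
printf '%s\n' *.gz | sed -r 's/^([^.]*)(\.gz){2,}$/mv -v & \1\2/'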
find . -name "*.gz.gz" |
while read f; do echo mv "$f" "$(sed -r 's/(\.gz)+$/.gz/' <<<"$f")"; done
This only previews the renaming (mv) command; remove the echo to perform actual renaming.
Processes matching files in the current directory tree, as in the OP (and not just files located directly in the current directory).
Limits matching to files that end in at least 2 .gz extensions (so as not to needlessly process files that end in just one).
When determining the new name with sed, makes sure that substring .gz doesn't just match anywhere in the filename, but only as part of a contiguous sequence of .gz extensions at the end of the filename.
Handles filenames with special characters such as embedded spaces correctly (with the exception of filenames with embedded newlines).
Using bash string substitution:
for f in *.gz.gz; do
mv "$f" "${f%%.gz.gz*}.gz"
done
This is a slight modification of jaypal's nice answer (which would fail if any of your files had a period as part of its name, such as foo.c.gz.gz). (Mine is not perfect, either) Note the use of double-quotes, which protects against filenames with "bad" characters, such as spaces or stars.
If you wish to use find to process an entire directory tree, the variant is:
find . -name \*.gz.gz | \
while read f; do
mv "$f" "${f%%.gz.gz*}.gz"
done
And if you are fussy and need to handle filenames with embedded newlines, change the while read to while IFS= read -r -d $'\0', and add a -print0 to find; see How do I use a for-each loop to iterate over file paths output by the find utility in the shell / Bash?.
But is this renaming a good idea? How was your filename.gz.gz created? gzip has guards against accidentally doing so. If you circumvent these via something like gzip -c $1 > $1.gz, buried in some script, then renaming these files will give you grief.
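If your shell is bash with the extglob option available, a hedged variant that strips only the trailing run of .gz (so a name like foo.c.gz.gz becomes foo.c.gz rather than foo.gz) could look like this:
shopt -s extglob                       # enable the +(...) extended glob pattern
for f in *.gz.gz; do
  mv -- "$f" "${f%%+(.gz)}.gz"         # remove the longest trailing run of .gz, then add one back
done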
Another way with rename:
find . -iname '*.gz.gz' -exec rename -n 's/(\.\w+)\1+$/$1/' {} +
When happy with the results remove -n (dry-run) option.

How do I use grep in the terminal to print a list of files matching a specific grep pattern?

For a school project, I have to SSH into a folder on the school server, the /usr/bin folder, which has a list of files, then print a list of the files that start with "file". I know Regex half-decently, at least conceptually, but I'm not sure of the UNIX command to do this.
I tried grep '^[file][a-zA-Z0-9]*' (start of a line, letters f-i-l-e, then 0 or more occurrences of any other number or digit) but that doesn't seem to work.
Help?
You can use find command for this once you are connected to your school server.
find /usr/bin -type f -name "file*"
How would I do it if I wanted all files that started with a OR b, and ended with a OR b
Using find (note that -regex matches against the whole path, not just the basename):
find /usr/bin -type f -regex ".*/[ab][^/]*[ab]"
Using ls and grep:
ls -1 /usr/bin | grep "^[ab].*[ab]$"
You should be able to use a simple ls command to get this information.
cd /usr/bin
ls -1 file*
For more complex matches, you could pipe the output of ls to grep, but wwomack's solution is simplest for your scenario.
# for file names starting with "file"
ls /usr/bin | grep ^file
# more complex file names
ls /usr/bin | grep "^[ab].*[ab]$"
# files that do not start with alphabetic characters
ls -a | grep "^[^a-zA-Z]"
grep works on the contents of files, not file names. But, using pipes (|), you are able to treat the output (referred to as stdout) of one command as an input file (stdin) to another command.
You'll want to study regular expressions (and grep) more on your own, but here are some basics. First, grep operates on a line-by-line basis, comparing each line to the regex and printing it if it matches. At the beginning of the regex ^ anchors the match to the beginning of the line; at the end, $ anchors it to the end. If the regex pattern does not begin or end with these symbols then any subsequence of the line that matches the pattern causes the line to match.
For example, grep ^file$ only matches if the line only contains the word file while grep file matches any line that contains the word file anywhere. grep file$ matches lines that end with the word file with 0 or more characters before it.
Regarding your question, "whose names do not start with either a lowercase or an uppercase English letter" your command could be much simplified (see third example), but also notice that you begin the pattern with $: since $ matches the end of the line, your regex is impossible. One final note, in my example, I used ls -a to return all files including hidden . files. On Unix and Linux systems, if the first character of the file name is a dot, then the file will not normally show up when listing a directory.