Regular expression for begins with and ends with - regex

I'm trying to run a search with git to get me all the staged files in one of two folders: local or components. I only want to get JS files. The command runs in the console.
What I have so far:
STAGED_FILES=($(git diff --cached --name-only --diff-filter=ACM | grep "^(local|components).*?.js"))
This gets me all the staged files:
git diff --cached --name-only --diff-filter=ACM
This gets me all files paths beginning with local or components
grep "^(local|components)"
And this gets me all js files
grep ".js"
And this returns me nothing for some reason:
($(git diff --cached --name-only --diff-filter=ACM | grep "^(local|components).*?.js"))
What is the Regular Expression that I could search with that would get me all the js files in these two folders?

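It didn't work because grep uses basic regular expressions by default, in which (, |, ) and ? are literal characters rather than operators. You can use -E for extended regular expressions. Note that POSIX regular expressions define no lazy quantifiers, so even with -E the ? after .* does not make the match non-greedy; that makes no difference to whether a line matches.
For example, consider these: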
$ echo "asfasdfzasdfasdfz" | grep -E "a.*?z"
asfasdfzasdfasdfz
$ echo "asfasdfzasdfasdfz" | grep "a.*?z"
$ echo "asfasdfzasdfasdf?z" | grep "a.*?z"
asfasdfzasdfasdf?z
As you can see, without -E grep treats ? as a literal character that must also appear in the input.
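Applied to the original question, a minimal sketch of the filter could look like this. The path names are made-up sample input standing in for the output of git diff --cached --name-only, and I have assumed you want a / right after the folder name, a literal dot before js, and nothing after js:

```shell
# Sample paths (hypothetical) standing in for the output of
# `git diff --cached --name-only --diff-filter=ACM`.
staged='local/a.js
components/ui/b.js
other/c.js
local/readme.md
localjs'

# -E makes ( | ) operators; \. matches a literal dot; $ anchors the end.
js_files=$(printf '%s\n' "$staged" | grep -E '^(local|components)/.*\.js$')
printf '%s\n' "$js_files"
```

Only local/a.js and components/ui/b.js should come through; in the real command, the printf would be replaced by the git diff call.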

Besides the regular expression based answers, you can do this directly in Git, which has the notion of a "pathspec" including shell style globbing:
git diff --cached --name-only \
--diff-filter=ACM -- 'local/**/*.js' 'components/**/*.js'
(line broken for display formatting; note that the ** support is new in Git version 1.8.2).
That said, regular expressions are "more powerful" than shell globs, so you may want to keep nu11p01n73R's answer in mind. Note, however, that non-greedy matches (*?) match as little as possible, rather than as much as possible:
pattern    input         result (matched part in parentheses)
abc.*e     0abcdefeged   0(abcdefege)d
abc.*?e    0abcdefeged   0(abcde)feged
abc.*d     0abcdefeged   0(abcdefeged)
abc.*?d    0abcdefeged   0(abcd)efeged
Your expression, ^(local|components).*?.js, says (in extended interpretations anyway): match the start of line; then match either local or components as literal text; then match as few characters as possible, perhaps none; then match any character; then match a literal j; then match a literal s. Hence this matches local-jaguar-xjs-vehicles because it begins with local, contains some text, has one character more before js, and continues on.
The shell glob pattern local/**/*.js matches only the directory local, followed by any number—possibly zero—of subdirectory components, followed by a file whose name ends with .js, with the dot matched literally. So this is equivalent to the pattern ^local/(.*/|)[^/]*\.js$: the literal text local matched at the start of the line, followed by one slash; followed by either: any number of characters ending in slash (taking as many as possible), or nothing at all; followed by any number (including none) of any character except slash, followed by a literal .js, followed by the end of the line.
Note that because this expression is anchored at both ends (must match at the beginning and end of line), and there is only one Kleene star in the middle, it does not matter whether we use a greedy or non-greedy match: the left-side anchor matches at the left and the right-side anchor matches at the right, and a greedy match takes as much of the middle as it can—i.e., all of it—while a non-greedy match takes as little of the middle as it can ... which is still "all of it".
(This does, of course, assume that the file names are being printed with just one on each line. Fortunately git diff --name-only does just that. Also, shell ** for "any number of directories" is not supported in all shells, nor all non-shell file name globbing, but it is used in Git's pathspecs (search for "pathspec").)
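To see the difference between the loose expression and the anchored one, here is a small sketch using grep -E on made-up path names. I have written (.*/)? instead of (.*/|) because an empty alternative is not portable across regex implementations, and dropped the ? after .* since it does not change which lines match:

```shell
# Hypothetical path names, one per line, as `git diff --name-only` prints them.
paths='local/app.js
local/sub/dir/util.js
local-jaguar-xjs-vehicles
local/app.json
components/x.js'

# Loose pattern: an unescaped dot and no end anchor.
loose=$(printf '%s\n' "$paths" | grep -E '^(local|components).*.js')

# Anchored pattern modelled on the glob local/**/*.js.
strict=$(printf '%s\n' "$paths" | grep -E '^local/(.*/)?[^/]*\.js$')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern lets every sample line through, including local-jaguar-xjs-vehicles and local/app.json, while the anchored one keeps only the two .js files under local/.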

Related

Trying to use GNU find to search recursively for filenames only (not directories) containing a string in any portion of the file name

Trying to find a command that is flexible enough to allow for some variations of the string, but not other variations of it.
For instance, I am looking for audio files that have some variation of "rain" in the filename only (rains, raining, rained, rainbow, rainfall, a dark rain cloud, etc), whether at the beginning, end or middle of the filename.
However, this also includes words like "brain", "train", "grain", "drain", "Lorraine", et al, which are not wanted (basically any word that has nothing to do with the concept of rain).
Something like this fails:
find . -name '*rain*' ! -name '*brain*'| more
And I'm having no luck with even getting started on building a successful regex variant because I cannot wrap my mind around regex ... for instance, this doesn't do anything:
# this is incomplete, just a stub of where I was going
# -type f also still includes a directory name
find . -regextype findutils-default -iregex '\(*rain*\)' -type f
Any help would be greatly appreciated. If I could see a regex command that does everything I want it to do, with an explanation of each character in the command, it would help me learn more about regex with the find command in general.
edit 1:
Taking cues from all the feedback so far from jhnc and Seth Falco, I have tried this:
find . -type f | grep -Pi '(?<![a-zA-Z])rain'
I think this pretty much works (I don't think it is missing anything), my only issue with it is that it also matches on occurrences of "rain" further up the path, not only in the file name. So I get example output like this:
./Linux/path/to/radiohead - 2007 - in rainbows/09 Jigsaw Falling Into Place.mp3
Since "rain" is not in the filename itself, this is a result I'd rather not see. So I tried this:
find . -type f -printf '%f\n' | grep -Pi '(?<![a-zA-Z])rain'
That does ensure that only filenames are matched, but it also does not output the paths to the filenames, which I would still like to see, so I know where the file is.
So I guess what I really need is a PCRE (PCRE2 ?) which can take the seemingly successful look-behind method, but only apply it after the last path delimiter (/ since I am on Linux), and I am still stumped.
specification:
1. match "rain"
2. in filename
3. only at start of a word
4. case-insensitive
assumptions:
5. define "word" to be sequence of letters (no punctuation, digits, etc)
6. paths have form prefix/name where prefix can have one or more levels delimited by / and name does not contain /
constraints:
7. find -iregex matches against entire path (-name only matches filename)
8. find -iregex must match entirety of path (eg. "c" is only a partial match and does not match path "a/b/c")
method:
find can return matches against non-files (eg. directories). Given definition 6, we would be unable to tell if name is a directory or an ordinary file. To satisfy 2, we can exclude non-files using find's -type f predicate.
We can compare paths found by find against our specification by using find's case-insensitive regex matching predicate (-iregex). The "grep" flavour (-regextype grep) is sufficiently expressive.
Just using 1, a suitable regex is: rain
2+6+7 says we must forbid / after "rain": rain[^/]*$
    [/] matches character in set (ie. /)
    [^/]: ^ inverts match: ie. character that is not /
    * matches preceding match zero or more times
    $ constrains preceding match to occur at end of input
3+5 says there must be no immediately preceding word characters: [^a-z]rain[^/]*$
    a-z is a shortcut for the range a to z
8 requires matching the prefix explicitly: ^.*[^a-z]rain[^/]*$
    ^ outside of [...] constrains subsequent match to occur at beginning of input
    . matches anything
    [^a-z] matches a non-alphabetic
Final command-line:
find . -type f -regextype grep -iregex '^.*[^a-z]rain[^/]*$'
Note: The leading ^ and trailing $ are not actually required, given 8, and could be elided.
exercise for the reader:
extend "word" to non-ASCII characters (eg. UTF-8)
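The final command-line from the method above can be tried on a throwaway tree. This is only a sketch: it assumes GNU find (for -regextype), and the directory and file names are invented for the test:

```shell
# Build a temporary tree with hypothetical names, then run the final command.
tmp=$(mktemp -d)
mkdir -p "$tmp/in rainbows" "$tmp/weather"
touch "$tmp/in rainbows/09 Jigsaw.mp3" \
      "$tmp/weather/heavy rain.wav" \
      "$tmp/weather/Rainfall.wav" \
      "$tmp/weather/brain.wav" \
      "$tmp/weather/train-rain.wav"

# -type f drops directories; the regex must match the whole path.
found=$(cd "$tmp" && find . -type f -regextype grep \
        -iregex '^.*[^a-z]rain[^/]*$' | LC_ALL=C sort)
printf '%s\n' "$found"
rm -rf "$tmp"
```

The file under "in rainbows" is skipped because its only "rain" is followed by a /, and brain.wav is skipped because a letter precedes "rain".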
You probably want to use either a character class, word boundary, or just have a negative look behind for alpha characters.
Look Behind
^.+(?<![a-zA-Z])rain[^\/]*$
Matches any instance of rain, but only if it's not following [a-zA-Z], and doesn't have any slashes afterwards. Unfortunately, find doesn't support look ahead or look behind… so we'll use a character class instead.
Character Class
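^.*[^a-zA-Z]rain[^/]*$
Matches a character that isn't [a-zA-Z], then rain if it comes immediately after, so long as there are no slashes afterwards. Every path that find prints has a directory part in front of the file name (./ here), so a file name that starts with rain is still preceded by a / and no separate start-of-line alternative is needed. The (?:...) non-capturing group is left out because it is a Perl-style extension that find's regex flavours do not support.
You can use it in find like this:
find ./ -iregex '.*[^a-zA-Z]rain[^/]*'
The ^ at the start and $ at the end of the pattern are implied when using find with -iregex, so you can omit them.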

Linux find command: searching for a filename containing parentheses

I need to find files with filenames like this:
<some regex-matched text> (1).<some regex-matched text>
i.e. I want to search for filenames containing
text ending in a space
then an opening bracket (parenthesis)
followed by the numeral 1
followed by a closing bracket
possibly followed by a dot followed by some more text...
I first went find . -regex '.* \(1\)\..*'. But the brackets are sort of ignored: files matching .* 1\..* are returned.
In the course of my attempt to find an answer I found this page covering Linux find. Here I find this phrase:
"Grouping is performed with backslashes followed by parentheses ‘\(’,
‘\)’."
[NB to show you the reader a single backslash, as shown in that page, I have doubled the backslashes to write the single backslashes above!]
I wasn't sure what to make of that, i.e. how to escape ordinary brackets in that case. I thought maybe doubling up the backslashes in the find expression might work, but it didn't.
Even if I try to do it without using a regex, find seems to have some problems with brackets and/or a dot in this place:
mike#M17A .../accounts $ find . -name *(1).pdf
[... finds stuff OK]
mike#M17A .../accounts $ find . -name *(1).*
find: paths must precede expression: ..
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec|time] [path...] [expression]
mike#M17A .../accounts $ find . -name *(1)\.*
find: paths must precede expression: ..
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec|time] [path...] [expression]
NB putting a space after in the initial * in these attempts also fails...
That is because you don't need to escape these parentheses: in find's default regular expression syntax, plain ( and ) are literal characters, and the escaped forms \( and \) are the grouping operators. This should work:
find . -regex '.* (1)\(\..*\)?'
Here the escaped parentheses \(\..*\) form a group, so that ? can make the last part (possibly followed by a dot followed by some more text) optional.
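A quick way to check this is on a throwaway directory. The sketch below assumes GNU find with its default regex syntax, and the file names are invented. It also shows a quoted -name glob; the "paths must precede expression" errors in the question are most likely caused by the shell expanding the unquoted pattern before find ever sees it:

```shell
# Throwaway directory with hypothetical file names.
tmp=$(mktemp -d)
touch "$tmp/report (1).pdf" "$tmp/report 1.pdf" "$tmp/notes (1)" "$tmp/report.pdf"

# Default regex syntax: ( ) are literal, \( \) group, ? means "optional".
by_regex=$(cd "$tmp" && find . -type f -regex '.* (1)\(\..*\)?' | LC_ALL=C sort)

# Quoted glob: the shell passes the pattern to find untouched.
by_name=$(cd "$tmp" && find . -type f -name '* (1)*' | LC_ALL=C sort)

printf 'regex:\n%s\n\nname:\n%s\n' "$by_regex" "$by_name"
rm -rf "$tmp"
```

Both commands should list only "notes (1)" and "report (1).pdf". The glob is slightly looser than the regex, since it accepts any text after (1), not just a dot and an extension.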

foo[E1,E2,...]* glob matches desired contents, but foo[E1,E2,...]_* does not?

I saw something weird today in the behaviour of the Bash Shell when globbing.
So I ran an ls command with the following Glob:
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]* | grep ":"
the result was as expected
GM12878_Hs_InSitu_MboI_rE1_TagDirectory:
GM12878_Hs_InSitu_MboI_rE2_TagDirectory:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rG1_TagDirectory:
GM12878_Hs_InSitu_MboI_rG2_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
however when I change the same regex by introducing an underscore to this
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_* | grep ":"
my expected result is the complete set as shown above, however what I get is a subset:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
Can someone explain what's wrong in my logic when I introduce an underscore sign before the asterisk?
I am using Bash.
You misunderstand what your glob is doing.
You were expecting this:
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*
to be a glob of files that have any of those comma-separated segments but that's not what [] globbing does. [] globbing is a character class expansion.
Compare:
$ echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
to what you were trying to get (which is brace {} expansion):
$ echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}
GM12878_Hs_InSitu_MboI_rE1 GM12878_Hs_InSitu_MboI_rE2 GM12878_Hs_InSitu_MboI_rF GM12878_Hs_InSitu_MboI_rG1 GM12878_Hs_InSitu_MboI_rG2 GM12878_Hs_InSitu_MboI_rH
You wanted that latter expansion.
Your expansion uses a character class which matches one of the characters E-H, 1-2, or a comma; it's identical to:
GM12878_Hs_InSitu_MboI_r[EFGH12,]_*
which, as I expect you can now see, isn't going to match any two character entries (where the underscore-less version will).
* in filesystem globs is not like * in regex. In a regex * means "0 or more of the preceding pattern," but in filesystem globs it means "anything at all of any size". So in your first example, the _ is just part of the "anything" from the * but in the second you're matching any single character within your character class (not the patterns you seem to be trying to define) followed by _ followed by anything at all.
Also, character classes don't work the way you're trying to use them. [...] will match any single character within the brackets, so your pattern is actually the same as [EFGH12,] since those are all the characters in the class you defined.
To get the grouping of patterns you want, you should use { instead of [ like
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_* | grep ":"
As far as I know, and this article supports me, the square brackets don't work as a choice but as a character set, so using [E1,E2,F,G1,G2,H] actually is equivalent to exactly one occurrence of [EGHF12,]. You can then interpret the second result as "one character of EGHF12, and an underscore", which matches GM12878_Hs_InSitu_MboI_rF_TagDirectory: but not GM12878_Hs_InSitu_MboI_rG1_TagDirectory: (there the r is followed by more than "one occurrence of...").
The first glob works because you used the asterisk, which matches what is missed by the wrong [...].
A correct expression would be:
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}* | grep ":"
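The character-class behaviour can be checked without any files, because the shell's case statement uses the same bracket syntax as file name globbing. This is just a sketch: check is a helper made up for the demonstration, and the long GM12878_Hs_InSitu_MboI_ prefix is dropped to keep the names short:

```shell
# Report whether a name matches the bracket pattern from the question and
# the shorter character class it is equivalent to.
check() {
    case $1 in
        r[E1,E2,F,G1,G2,H]_*) a=yes ;;
        *) a=no ;;
    esac
    case $1 in
        r[EFGH12,]_*) b=yes ;;
        *) b=no ;;
    esac
    printf '%s: original=%s equivalent=%s\n' "$1" "$a" "$b"
}

result=$(check rF_TagDirectory; check rG1_TagDirectory; check r,_TagDirectory)
printf '%s\n' "$result"
```

The two patterns agree on every name: rF_ matches, rG1_ does not (two characters sit between r and _), and even r,_ matches because the comma is just another member of the class.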

Specific Perl Regular Expression Needed

Have a perl script that needs to process all files of a certain type from a given directory. The files should be those that end in .jup, and SHOULDN'T contain the word 'TEMP_' in the filename. I.E. It should allow corrected.jup, but not TEMP_corrected.jup.
Have tried a look-ahead, but obviously have the pattern incorrect:
/(?!TEMP_).*\.jup$/
This returns the entire directory contents though, including files with any extension and those containing TEMP_, such as the file TEMP_corrected.jup.
The regular expression you want is:
/^((?!TEMP_).)*\.jup$/
The main difference is that your regular expression is not anchored at the start of the string, so it matches any substring that satisfies your criteria - so in the example of TEMP_corrected.jup, the substrings corrected.jup and EMP_corrected.jup both match.
(The other difference is that putting () round both the lookahead and the . ensures that TEMP_ isn't allowed anywhere in the string, as opposed to just not at the start. Not sure whether that's important to you or not!)
If you're getting files other than .jup files, then there is another problem - your expression should only match .jup files. You can test your expression with:
perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
then type strings: perl will echo them back if they match, and not if they don't. For example:
$ perl -ne 'if(/^((?!TEMP_).)*\.jup$/) {print;}'
foo
foo.jup
foo.jup <-- perl printed this because 'foo.jup' matched
TEMP_foo.jup

Using the star sign in grep

I am trying to search for the substring "abc" in a specific file in linux/bash
So I do:
grep '*abc*' myFile
It returns nothing.
But if I do:
grep 'abc' myFile
It returns matches correctly.
Now, this is not a problem for me. But what if I want to grep for a more complex string, say
*abc * def *
How would I accomplish it using grep?
The asterisk is just a repetition operator, but you need to tell it what you repeat. /*abc*/ matches a literal * followed by ab and zero or more c's (the second * applies to the c; the first has nothing to repeat, so grep's basic regular expressions treat it as a literal asterisk, which is why your search returned nothing). If you want to match anything, you need to say .* -- the dot means any character (within certain guidelines). If you want to just match abc, you could just say grep 'abc' myFile. For your more complex match, you need to use .* -- grep 'abc.*def' myFile will match a string that contains abc followed by def with something optionally in between.
Update based on a comment:
* in a regular expression is not exactly the same as * in the console. In the console, * is part of a glob construct, and just acts as a wildcard (for instance ls *.log will list all files that end in .log). However, in regular expressions, * is a modifier, meaning that it only applies to the character or group preceding it. If you want * in regular expressions to act as a wildcard, you need to use .* as previously mentioned -- the dot is a wildcard character, and the star, when modifying the dot, means find zero or more dots; i.e. find zero or more of any character.
The dot character means match any character, so .* means zero or more occurrences of any character. You probably mean to use .* rather than just *.
Use grep -P - which enables support for Perl style regular expressions.
grep -P "abc.*def" myfile
The "star sign" is only meaningful if there is something in front of it. If there isn't, the tool (grep in this case) may treat it as an error or as a literal asterisk. For example:
'*xyz' has nothing for the star to repeat
'a*xyz' means zero or more occurrences of 'a' followed by xyz
This worked for me:
grep ".*${expr}" - with double-quotes, preceded by the dot.
Where ${expr} is whatever string you need in the end of the line.
So in your case:
grep ".*abc.*" myFile
Standard unix grep.
The expression you tried, like those that work on the shell command line in Linux for instance, is called a "glob". Glob expressions are not full regular expressions, which is what grep uses to specify strings to look for. Here is an (old, small) post about the differences. The glob expressions (as in "ls *") are interpreted by the shell itself.
It's possible to translate from globs to REs, but you typically need to do so in your head.
If you're not using regular expressions, your grep variant of choice should be fgrep (grep -F), which treats the pattern as a literal string. Note that the literal string to search for is then abc, without the asterisks.
Try grep -E for extended regular expression support
Also take a look at:
The grep man page
'*' works as a modifier for the previous item. So 'abc*def' searches for 'ab' followed by 0 or more 'c's followed by 'def'.
What you probably want is 'abc.*def' which searches for 'abc' followed by any number of characters, followed by 'def'.
This may be the answer you're looking for:
grep abc MyFile | grep def
Only thing is... it will output lines where "def" is before OR after "abc"
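A short sketch of that difference, using three made-up input lines:

```shell
# Hypothetical input: "abc" and "def" in both orders, plus a line without "def".
lines='abc then def
def then abc
abc only'

# Two greps in a pipe: both words must be present, in any order.
either_order=$(printf '%s\n' "$lines" | grep abc | grep def)

# One pattern: "abc" must come before "def".
abc_first=$(printf '%s\n' "$lines" | grep 'abc.*def')

printf 'either order:\n%s\n\nabc first:\n%s\n' "$either_order" "$abc_first"
```

The piped version prints the first two lines, while 'abc.*def' prints only the first.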
$ cat a.txt
123abcd456def798
123456def789
Abc456def798
123aaABc456DEF
* matches the preceding character zero or more times.
$ grep -i "abc*def" a.txt
$
It would match, for instance "abdef" or "abcdef" or "abcccccccccdef". But none of these are in the file, so no match.
. means "match any character" Together with *, .* means match any character any number of times.
$ grep -i "abc.*def" a.txt
123abcd456def798
Abc456def798
123aaABc456DEF
So we get matches.
There are a lot of online references about regular expressions, which is what is being used here.
I have summarized the other answers and made these examples to show how regex and glob work.
There are three files
echo 'abc' > file1
echo '*abc' > file2
echo '*abcc' > file3
Now I execute the same commands on these 3 files; let's see what happens.
(1)
grep '*abc*' file1
As you said, this one returns nothing. * wants to repeat something in front of it. For the first *, there is nothing in front of it to repeat, so grep treats this * as just a literal character *. Because the string in the file is abc, there is no * in the string, so nothing is found. The second *, after c, means repeat c 0 or more times.
(2)
grep '*abc*' file2
This one returns *abc: because there is a * at the front, it matches the pattern *abc*.
(3)
grep '*abc*' file3
This one returns *abcc because there is a * at the front and 2 c's at the tail, so it matches the pattern *abc*.
(4)
grep '.*abc.*' file1
This one returns abc because .* indicates 0 or more repetitions of any character.
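The four cases can be run in one go. A sketch that recreates the three files in a temporary directory and greps them all at once, so each output line is prefixed with the file it came from:

```shell
tmp=$(mktemp -d)
echo 'abc'   > "$tmp/file1"
echo '*abc'  > "$tmp/file2"
echo '*abcc' > "$tmp/file3"

# Without -E, a leading * is a literal asterisk, so file1 has no match.
# `|| true` keeps a strict script going if grep finds nothing (exit status 1).
literal_star=$(cd "$tmp" && grep '*abc*' file1 file2 file3 || true)

# .* means "any characters, possibly none", so every file matches.
any_chars=$(cd "$tmp" && grep '.*abc.*' file1 file2 file3 || true)

printf 'pattern *abc*:\n%s\n\npattern .*abc.*:\n%s\n' "$literal_star" "$any_chars"
rm -rf "$tmp"
```

With '*abc*' only file2 and file3 are reported; with '.*abc.*' all three are.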