I'm struggling with Bash regular expressions - regex

I'm frustrated trying to find out how to use regex to do anything useful. I'm completely uncertain on everything that I do, and I've resorted to trial and error; which has not been effective.
I'm trying to list files in the current directory that start with a letter, contain a number, end with a dot followed by a lowercase character, etc.
So I know starts with a letter would be:
^[a-zA-Z]
but I don't know how to follow that up with CONTAINS a number. I know ends with a dot can be [\.]*, but I'm not sure. I'm seeing that $ is also used to match strings at the end of the word.
I have no idea if I should be using find with regex to do this, or ls | grep .... I'm completely lost. Any direction would be appreciated.
I guess the specific question I was trying to ask was how do I glue the expressions together. For example, I tried ls | grep ^[a-zA-Z][0-9] but this only shows files that start with a letter immediately followed by a number. I don't know how to write a regex that starts with a letter and then adds the next requirement, i.e. contains a number.

Starts with a letter: ^[a-zA-Z]
Contains a number: .*[0-9].*
Ends with a dot and lowercase letter: \.[a-z]$
Together:
^[a-zA-Z].*[0-9].*\.[a-z]$
The best way to find files that match a regex is with find -regex. When you use it, the ^ and $ anchors are implied, so you can omit them. You'll need to tack on .*/ at the front to match any directory, since -regex matches against the whole path (directory and file name), not just the file name.
find . -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]'
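As a quick sanity check of the combined pattern, here is a minimal sketch that feeds a handful of made-up file names through grep. The names are hypothetical and exist only to exercise the three conditions; nothing here touches the real directory.

```shell
# Hypothetical file names, one per line (not taken from the question).
names='report2021.c
a1.b
1abc.d
notes.txt
x9.TXT
data7.tar.z'

# Starts with a letter, contains a digit, ends with a dot plus one lowercase letter.
pattern='^[a-zA-Z].*[0-9].*\.[a-z]$'

matches=$(printf '%s\n' "$names" | grep "$pattern")
printf '%s\n' "$matches"
```

Only report2021.c, a1.b and data7.tar.z should survive: the other names fail at least one of the three requirements.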

There's plenty of documentation online, eg. GNU's Reference Manual.
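Your particular example would require something like the following (note that POSIX classes are only recognized inside a bracket expression, hence the double brackets):
^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$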
or if POSIX classes are not available:
^[a-zA-Z].*[0-9].*\.[a-z]$
You can read either as:
start of line
a letter (upper or lower case)
any character, zero or more times
a digit
any character, zero or more times
a dot (must be escaped with a backslash)
a lower case letter
end of line
Once you settle on a regular expression, you can use it with ls within the directory you wish to find the files in:
cd <dir>
ls -1 | grep '^[a-zA-Z].*[0-9].*\.[a-z]$'
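A small sketch comparing the POSIX-class spelling with the explicit ranges. It assumes a grep that supports POSIX character classes, and it writes them inside a bracket expression (`[[:alpha:]]`), which is the form grep recognizes. The sample names are invented for illustration.

```shell
# Hypothetical names used only for the comparison.
names='Report7.c
a1.b
9lives.x
readme.md'

# POSIX classes must sit inside [...], hence the double brackets.
posix=$(printf '%s\n' "$names" | grep '^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$')

# Same pattern spelled with explicit ranges.
ranges=$(printf '%s\n' "$names" | grep '^[a-zA-Z].*[0-9].*\.[a-z]$')

printf '%s\n' "$posix"
```

For plain ASCII names like these, both spellings should select the same two entries.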
NOTE: I tried to improve my answer based on some of the comments.

Related

How do I create a RegEx which has multiple criteria?

I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and
the letter i. How many results are found?
I am working in a generic Linux command prompt in a online emulated environment. I am allowed to use grep, awk or sed though I am feeling a preference for grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
Per a previous lab I already used something like below which finds me all countries which have 9 characters, however I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it finds every place where i occurs as well as every word that is 9 characters long, and I need both conditions to hold at the same time. I tried multiple grep statements as well, and it seems the emulator may not accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.
One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the -o flag, which lists only the matching words, printing one word per line.
Next, we use Linux pipe (not regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters and i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
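To make the two-step pipeline concrete, here is a minimal sketch that runs it on a stand-in for the lab's countries file. The file contents below are an assumption (the real file is not shown in the question), and `\b` and `\w` are treated as the GNU grep extensions they are.

```shell
# Stand-in for the lab's "countries" file; the contents are made up.
countries=$(mktemp)
printf '%s\n' Argentina Venezuela France Australia Guatemala India > "$countries"

# Step 1: keep only 9-character words. Step 2: keep those containing an i.
words=$(grep -Eo '\b\w{9}\b' "$countries" | grep 'i')

# grep -c counts the matching lines, which answers "how many results".
count=$(grep -Eo '\b\w{9}\b' "$countries" | grep -c 'i')

printf '%s\n' "$words"
echo "$count"
rm -f "$countries"
```

India contains an i but is dropped by the first step, and Venezuela and Guatemala pass the first step but are dropped by the second.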
The fact that you are looking for words (as opposed to whole lines in the file) complicates the regex, but it is also possible to come up with a single regex to match these words.
\b(?=\w*i)\w{9}\b
This builds on \b\w{9}\b you already have. (?=\w*i) is the AND condition. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more letters, and then our i). We're using \w* in the lookahead, not .*, so we are looking at the same word. (?=.*i) would have matched any i also after the nine characters.
After finding the i, we continue to make sure the word is only 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
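One caveat when trying the lookahead from the command line: `(?=...)` is Perl-style syntax, which grep -E does not understand. The sketch below assumes GNU grep built with PCRE support, so that the -P option is available; the sample words are made up.

```shell
# Made-up sample; Philippines has an i but is 11 characters long.
sample=$(mktemp)
printf '%s\n' Argentina Venezuela Australia Philippines > "$sample"

# -P selects Perl-compatible regexes (needed for the lookahead),
# -o prints only the matched word.
hits=$(grep -Po '\b(?=\w*i)\w{9}\b' "$sample")

printf '%s\n' "$hits"
rm -f "$sample"
```

If -P is not available in the emulator, the two-grep pipeline shown earlier in this thread avoids the lookahead entirely.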

Match files that have a 0000 in the filename on a certain position

I have hundreds of files like this:
201670000_FOR1.xml
201670000_GAL0.xml
201670000_GAL1.xml
20184301_2.xml
20184301_3.xml
20184301_4.xml
I need to match all the files that have a 0000 on position 6-9. The first 3 files should match, the bottom 3 not. I tried:
find -E . -regex '/^.{6}0000*/' | wc -l
but it yields zero results. What would the correct regex look like?
You can use this find regex:
find -E . -regex '.*/.{5}0{4}.*'
./201670000_GAL0.xml
./201670000_FOR1.xml
./201670000_GAL1.xml
Regex Details:
.* matches the directory part of the path, up to the /
.{5} matches first 5 characters after /
Then we match 4 zeroes using 0{4}
Finally .* for remaining characters.
You can also avoid regex by using this glob pattern:
find . -name '?????0000*'
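A minimal sketch that recreates the six example files in a scratch directory and counts the matches both ways. Note that -E is the BSD/macOS find spelling; GNU find uses -regextype posix-extended instead. To sidestep that difference, the regex below avoids the {n} syntax and spells out five dots and four zeros, which is an adaptation and not the exact pattern from the answer.

```shell
# Scratch directory holding empty copies of the question's example files.
tmpdir=$(mktemp -d)
for f in 201670000_FOR1.xml 201670000_GAL0.xml 201670000_GAL1.xml \
         20184301_2.xml 20184301_3.xml 20184301_4.xml; do
    : > "$tmpdir/$f"
done

# Glob version: five arbitrary characters, then 0000, then anything.
glob_count=$(cd "$tmpdir" && find . -name '?????0000*' | wc -l | tr -d ' ')

# Regex version without {n}, so no -E / -regextype is required.
regex_count=$(cd "$tmpdir" && find . -regex '.*/.....0000.*' | wc -l | tr -d ' ')

echo "$glob_count $regex_count"
rm -rf "$tmpdir"
```

Both counts should be 3 for these six files.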
The slashes cannot be part of a file name. Take them out. (Some tools do require slashes as delimiters around regexes but find is definitely not one of them.)
Your examples all have five characters, not six, before the zeros, and 0* matches only zeros, not a zero followed by anything (that would be 0.*) so you probably want ^.{5}0{4}.*
More economically and succinctly,
wc -l ?????0000*
matches all files with this pattern in the current directory, and
wc -l **/?????0000*
in many shells examines all subdirectories recursively (but ** is not properly portable to POSIX sh).
It's not clear from your question whether you want to examine subdirectories, but find always examines subdirectories, too, unless you specifically tell it not to. On a tree with many subdirectories, this can make a significant difference in performance.

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
Just discovered that the regex of finding the string
'+33. ... ... ..' is by the following regex
\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the preceding expression. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
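Here is a small sketch of the -a and -o options at work on a file that contains NUL and control bytes. The file is a made-up stand-in for the .xry dump, and the phone number in it is invented; the bracket expression used is the [ ._-] class described above.

```shell
# Fake "binary" file: NUL and control bytes around one invented number.
sample=$(mktemp)
printf 'hdr\0\001+331 234 567 89\0\002junk\n+3312\0\n' > "$sample"

# -a treats the file as text, -o prints only the matched substring,
# -E enables +, ? and {n} without backslashes.
found=$(grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' "$sample")

# Stricter variant: one digit, then groups of 3, 3 and 2 digits.
strict=$(grep -Eao '\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}' "$sample")

printf '%s\n' "$found"
rm -f "$sample"
```

The short +3312 on the second line is not reported, because it has too few digits after the country code.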
It means that you're finding a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Grep is messing up my understanding

For some time I have been trying to play with grep to retrieve data from files and I noticed something funny.
It might be my ignorance but here is what happens...
Suppose I have a file ABC. The data is:
a
abc
ab
bac
bb
ac
Now I ran this grep command,
grep a* ABC
I found that the output contains lines starting with b as well as lines starting with a. Why is this happening?
You used 'a*' as your search pattern... the '*' means ZERO or MORE of the previous character, so 'b.c' matches, having ZERO or more 'a's in it.
On a semi-related note, I'd recommend quoting the 'a*' bit, since if you have ANY files in the current directory which start with a, you'll be VERY surprised to see what you're really searching for, since the shell (bash,zsh,csh,sh,dash,wtfsh...) will perform wildcard expansion automatically BEFORE the command is executed.
if you want to search for lines which START with 'a', then you'll need to anchor the search pattern with a leading ^ character, so your pattern becomes '^a*', but again, the * means ZERO or more, so it's not useful in this situation where you only have one letter... use '^a' instead.
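A quick sketch of the difference, using a scratch copy of the ABC file from the question (the temporary file name is incidental):

```shell
# Recreate the question's ABC file in a temporary location.
abc=$(mktemp)
printf '%s\n' a abc ab bac bb ac > "$abc"

# 'a*' can match zero a's, so every line counts as a match.
all=$(grep -c 'a*' "$abc")

# '^a' requires an actual a at the start of the line.
starts=$(grep '^a' "$abc")

echo "$all"
printf '%s\n' "$starts"
rm -f "$abc"
```

The first command reports all six lines, including bb, while the second keeps only a, abc, ab and ac.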
As a contrived example, if you wanted to find all the lines containing a 'c' AND those containing the letters 'bc', then you could use 'b*c' as the search pattern... meaning ZERO or more b's, and a c.
The power of the regex search pattern is immense, and takes some time to grok. Peruse the man pages for grep(1), regex(7), pcre(3), pcresyntax(3), pcrepattern(3).
Once you get the hang of them, regex's are useful in sed, grep, perl, vim, (probably emacs too), ... uh, it's late (early?) nothing more comes to mind, but they're VERY powerful.
As some bonus, '*' means ZERO or more, '+' means ONE or more, and '?' means ZERO or ONE.
So to search for things with two or more a's... 'aa+', which is 1 a, and 1+ a (1 or more)
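One thing to watch when trying these: in plain grep (basic regular expressions) an unescaped + or ? is just a literal character, so the sketch below uses grep -E. The sample lines are made up.

```shell
# Made-up lines to exercise the quantifiers.
lines='a
ab
aa
baaa
abab'

# 'aa+' : one a followed by one or more a's, i.e. two or more in a row.
two_or_more=$(printf '%s\n' "$lines" | grep -E 'aa+')

# '^ab?$' : an a, optionally followed by a single b, and nothing else.
optional_b=$(printf '%s\n' "$lines" | grep -E '^ab?$')

printf '%s\n' "$two_or_more"
printf '%s\n' "$optional_b"
```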
I ramble.... (regex(7)!)
grep tries to find that pattern anywhere in the line. Use ^a to get lines starting with a, or ^a*$ to find lines containing only a's (including the empty line).
also, please quote that shell argument (eg: '^a*$'), if you use a* and there is a file in the working directory starting with an a you will get very weird results...
Try this, it works for me. The ^ means beginning of a line - so it has to start with a.
grep '^a' ABC
You need to put quotes around your pattern:
grep "a*" ABC
Otherwise the * is interpreted by the shell (which does wild-card filename matching), instead of by grep itself.
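The expansion is easy to see by putting echo in front of the command. A minimal sketch, assuming a scratch directory that holds ABC plus one hypothetical file whose name starts with a (apple.txt is invented for the demonstration):

```shell
# Scratch directory with ABC and one file that the glob a* will match.
workdir=$(mktemp -d)
printf '%s\n' a abc bb > "$workdir/ABC"
: > "$workdir/apple.txt"

# Unquoted: the shell replaces a* with matching file names before grep runs.
unquoted=$(cd "$workdir" && echo grep a* ABC)

# Quoted: the pattern reaches grep untouched.
quoted=$(cd "$workdir" && echo grep 'a*' ABC)

echo "$unquoted"
echo "$quoted"
rm -rf "$workdir"
```

In the unquoted case grep would end up searching for the string apple.txt inside ABC, which is almost certainly not what was intended.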

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciao
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
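The effect can be checked from a shell with grep -E, as a rough sketch. grep has no \d or lazy quantifiers, so [0-9] and plain * are substituted here; that does not change whether the pattern can match at all. LC_ALL=C is set so that [A-Z] means ASCII uppercase only.

```shell
name='Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'

# Second group forbids dots: it can only cover "The", and ".Act" is not
# three uppercase letters, so nothing matches. grep -c prints 0 and exits
# with status 1, hence the "|| true".
no_dots=$(printf '%s\n' "$name" | LC_ALL=C grep -Ec 'x([0-9]+)\.([^.]*)\.[A-Z]{3}' || true)

# Second group allows dots: it can stretch across the whole title.
with_dots=$(printf '%s\n' "$name" | LC_ALL=C grep -Ec 'x([0-9]+)\.(.*)\.[A-Z]{3}' || true)

echo "$no_dots $with_dots"
```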
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
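The expansions above still leave the dots in place, while the question asks for spaces. One possible extra step, sketched here with tr (any other dot-to-space substitution would do equally well), uses the sample name from the question:

```shell
s='Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'

# Drop everything up to and including the first x.
t=${s#*x}

# Drop the trailing ".XXX." block (three uppercase letters) and what follows.
t=${t%.[A-Z][A-Z][A-Z].*}

# Turn the remaining dots into spaces.
title=$(printf '%s' "$t" | tr '.' ' ')

echo "$title"
```

Inside the renaming loop, the same tr step could be applied to newfile before calling mv, keeping in mind that the extension has already been stripped at that point.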
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
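The "second place" effect can be reproduced with sed -E, as a rough sketch. sed has no \d, so [0-9] stands in for it, and the input string is invented so that x followed by digits and a dot occurs twice.

```shell
# Invented input with two candidate positions: "x1." and "x22."
input='x1.b.x22.c'

# Leading .* is greedy, so the capture lands on the LAST candidate.
with_padding=$(printf '%s\n' "$input" | sed -E 's/.*x([0-9]+)\..*/\1/')

# Without the leading .*, the match starts at the FIRST candidate.
without_padding=$(printf '%s\n' "$input" | sed -E 's/x([0-9]+)\..*/\1/')

echo "$with_padding $without_padding"
```

The padded version captures 22, the unpadded one captures 1.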
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.