Grep is messing up my understanding - regex

For some time I have been trying to play with grep to retrieve data from files, and I noticed something funny.
It might be my ignorance but here is what happens...
Suppose I have a file ABC. The data is:
a
abc
ab
bac
bb
ac
Now I ran this grep command:
grep a* ABC
I found that the output also contains lines starting with b, not only lines starting with a. Why is this happening?

You used 'a*' as your search pattern... the '*' means ZERO or MORE of the previous character, so every line matches, even 'bb', which simply has ZERO 'a's in it.
On a semi-related note, I'd recommend quoting the 'a*' bit, since if you have ANY files in the current directory which start with a, you'll be VERY surprised to see what you're really searching for, since the shell (bash,zsh,csh,sh,dash,wtfsh...) will perform wildcard expansion automatically BEFORE the command is executed.
If you want to search for lines which START with 'a', then you'll need to anchor the search pattern with a leading ^ character, so your pattern becomes '^a*', but again, the * means ZERO or more, so it's not useful in this situation where you only have one letter... use '^a' instead.
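To see the difference on the sample data, here is a small sketch. It recreates the ABC file from the question in a scratch directory (the temporary path is only for illustration) and compares the unanchored and anchored patterns.

```shell
# Recreate the sample file from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' a abc ab bac bb ac > "$tmp/ABC"

# 'a*' can match zero characters, so every line counts as a match.
all=$(grep -c 'a*' "$tmp/ABC")

# '^a' requires an actual 'a' at the start of the line.
anchored=$(grep '^a' "$tmp/ABC")

echo "lines matched by 'a*': $all"
echo "lines matched by '^a':"
echo "$anchored"
rm -r "$tmp"
```

The first count covers all six lines of the file, while the anchored pattern keeps only the lines that begin with a.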
As a contrived example, if you wanted to find all the lines containing a 'c' AND those containing the letters 'bc', then you could use 'b*c' as the search pattern... meaning ZERO or more b's, and a c.
The power of the regex search pattern is immense, and takes some time to grok. Peruse the man pages for grep(1), regex(7), pcre(3), pcresyntax(3), pcrepattern(3).
Once you get the hang of them, regex's are useful in sed, grep, perl, vim, (probably emacs too), ... uh, it's late (early?) nothing more comes to mind, but they're VERY powerful.
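As a bonus, '*' means ZERO or more, '+' means ONE or more, and '?' means ZERO or ONE. (With grep, '+' and '?' need the -E option; in GNU grep's default basic syntax you would write '\+' and '\?' instead.)
So to search for things with two or more a's... 'aa+', which is 1 a, and 1+ a (1 or more).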
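A quick way to try the quantifiers is to feed grep a few made-up lines. This is only a sketch: the sample lines are invented, and it uses -E so that + and ? work unescaped.

```shell
# Invented sample lines with different numbers of consecutive a's.
sample=$(printf '%s\n' b a aa baaa)

# 'aa+' : one 'a' followed by one or more 'a', i.e. two or more in a row.
two_or_more=$(printf '%s\n' "$sample" | grep -E 'aa+')

# '^ba?$' : a 'b', then zero or one 'a', then end of line.
zero_or_one=$(printf '%s\n' "$sample" | grep -E '^ba?$')

echo "$two_or_more"
echo "$zero_or_one"
```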
I ramble.... (regex(7)!)

grep tries to find that pattern anywhere in the line. Use ^a to get lines starting with a, or ^a*$ to find lines containing only a's (including the empty line).
Also, please quote that shell argument (e.g. '^a*$'). If you use a* and there is a file in the working directory starting with an a, you will get very weird results...

Try this, it works for me. The ^ means beginning of a line - so it has to start with a.
grep ^a ABC

You need to put quotes around your pattern:
grep "a*" ABC
Otherwise the * is interpreted by the shell (which does wild-card filename matching), instead of by grep itself.
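The expansion is easy to observe by putting echo in front of the command, so the shell prints what grep would have received. This is a sketch in a scratch directory; the file name apple.txt is made up for the demonstration.

```shell
orig_dir=$PWD
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' a abc bb > ABC
: > apple.txt   # hypothetical empty file whose name starts with 'a'

# Unquoted: the shell replaces a* with the matching file names first.
unquoted=$(echo grep a* ABC)
# Quoted: the pattern a* is passed through unchanged.
quoted=$(echo grep 'a*' ABC)

echo "$unquoted"
echo "$quoted"
cd "$orig_dir" || exit 1
rm -r "$tmp"
```

In the unquoted case grep would end up searching for the pattern apple.txt, which is almost certainly not what was intended.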

Related

How do I create a RegEx which has multiple criteria?

I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and
the letter i. How many results are found?
I am working in a generic Linux command prompt in an online emulated environment. I am allowed to use grep, awk or sed, though I am feeling a preference for grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
Per a previous lab I already used something like below which finds me all countries which have 9 characters, however I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it will find me all instances where i is found and all words which are 9 characters, and I need both to happen at the same time. I tried multiple grep statements as well and it seems the emulator may not accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.
One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the -o flag, which lists only the matching words, printing one word per line.
Next, we use Linux pipe (not regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters and i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
The fact that you are looking for words (as opposed to whole lines in the file) complicates the regex, but it is also possible to come up with a single regex to match these words.
\b(?=\w*i)\w{9}\b
This builds on \b\w{9}\b you already have. (?=\w*i) is the AND condition. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more letters, and then our i). We're using \w* in the lookahead, not .*, so we are looking at the same word. (?=.*i) would have matched any i also after the nine characters.
After finding the i, we continue to make sure the word is only 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
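Both approaches can be compared side by side. The sketch below uses a few made-up lines in place of the real countries file, and it assumes GNU grep, because lookahead needs the PCRE engine behind -P (plain -E does not support it).

```shell
# Made-up lines standing in for the 'countries' file.
sample=$(printf '%s\n' 'Venezuela India' 'Argentina Guatemala' 'Fiji Singapore')

# Two greps: pick the nine-character words, then keep those containing i.
piped=$(printf '%s\n' "$sample" | grep -Eo '\b\w{9}\b' | grep 'i')

# One regex: the lookahead checks for an i inside the same word.
single=$(printf '%s\n' "$sample" | grep -Po '\b(?=\w*i)\w{9}\b')

echo "$piped"
echo "$single"
```

Appending | wc -l to either pipeline gives the count the lab asks for.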

I'm struggling with Bash regular expressions

I'm frustrated trying to find out how to use regex to do anything useful. I'm completely uncertain on everything that I do, and I've resorted to trial and error; which has not been effective.
I'm trying to list files in the current directory that start with a letter, contain a number, end with a dot followed by a lowercase character, etc.
So I know starts with a letter would be:
^[a-zA-Z]
but I don't know how to follow that up with CONTAINS a number. I know ends with a dot can be [\.]*, but I'm not sure. I'm seeing that $ is also used to match strings at the end of the word.
I have no idea if I should be using find with regex to do this, or ls | grep .... I'm completely lost. Any direction would be appreciated.
I guess the specific question I was trying to ask was how do I glue the expressions together. For example, I tried ls | grep ^[a-zA-Z][0-9] but this only shows files that start with a letter, followed by a number. I don't know how to write a regex that starts with a letter, and then add the next requirement, i.e. contains a number.
Starts with a letter: ^[a-zA-Z]
Contains a number: .*[0-9].*
Ends with a dot and lowercase letter: \.[a-z]$
Together:
^[a-zA-Z].*[0-9].*\.[a-z]$
The best way to find files that match a regex is with find -regex. When you use that the ^ and $ anchors are implied, so you can omit them. You'll need to tack on .*/ at the front to match any directory, since -regex matches both the directory and file name.
find . -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]'
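Here is a sketch of that find command run against a scratch directory. The file names are invented for the test. One caveat worth knowing: .* also matches a slash, so in a deeper directory tree the letter and the digit could come from a directory name; writing [^/]* instead of .* after the leading .*/ would avoid that.

```shell
orig_dir=$PWD
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Invented file names; only report2.c meets all three conditions.
touch report2.c 2fast.c notes.c data9.TXT

# -regex has to match the whole path (./name), hence the leading '.*/'.
found=$(find . -type f -regex '.*/[a-zA-Z].*[0-9].*\.[a-z]' | sort)

echo "$found"
cd "$orig_dir" || exit 1
rm -r "$tmp"
```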
There's plenty of documentation online, eg. GNU's Reference Manual.
Your particular example, would require something like:
^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$
or if POSIX classes are not available:
^[a-zA-Z].*[0-9].*\.[a-z]$
You can read either as:
start of line
a letter (upper or lower case)
any character, zero or more times
a digit
any character, zero or more times
a dot (must be escaped with a backslash)
a lower case letter
end of line
Once you settle on a regular expression, you can use it with ls within the directory you wish to find the files in:
cd <dir>
ls -1 | grep '^[a-zA-Z].*[0-9].*\.[a-z]$'
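To check the expression without touching real files, you can pipe a list of names into grep. This is a sketch with invented names. Note that a POSIX class only works inside a bracket expression, so it is written [[:alpha:]], and that ranges such as a-z depend on the locale, which is why LC_ALL=C is set here.

```shell
# Invented file names to test against.
names=$(printf '%s\n' report2.c 2fast.c notes.c data9.TXT a1.b)

# POSIX character classes, each wrapped in its own bracket expression.
with_classes=$(printf '%s\n' "$names" |
    LC_ALL=C grep '^[[:alpha:]].*[[:digit:]].*\.[[:lower:]]$')

# The same conditions written with explicit ranges.
with_ranges=$(printf '%s\n' "$names" |
    LC_ALL=C grep '^[a-zA-Z].*[0-9].*\.[a-z]$')

echo "$with_classes"
echo "$with_ranges"
```

Both variables should hold the same two names, report2.c and a1.b.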
NOTE: I tried to improve my answer based on some of the comments.

Select all files that do not have string between 2 other strings

I have a set of files that I need to loop through to find all the files that do not have a specific string between 2 other specific strings. How can I do that?
I tried this but it didn't work:
grep -lri "\(stringA\).*\(?<!stringB\).*\(stringC\)" ./*.sql
EDIT:
the file could have structure as following:
StringA
StringB
StringA
StringC
All I want is to know if there are any occurrences where stringA and stringC have no stringB in between.
You can use the -L option of grep to print all files which don't match and look for the specific combination of strings:
grep -Lri "\(stringA\).*\(stringB\).*\(stringC\)" ./*.sql
The short answer is along the lines of:
grep "abc[^(?:def)]*ghi" ./testregex
That's based on a testregex file like so:
abcghiabc
abcdefghi
abcghi
The output will be:
$ grep "abc[^(?:def)]*ghi" ./testregex
abcghiabc
abcghi
Mapped to your use-case, I'd wager this translates roughly to:
grep -lri "stringA[^(?:stringB)]*stringC" ./*.sql
Note that I've removed the ".*" between each string, since that will match the very string that you're attempting to exclude.
Update: The original question now calls out line breaks, so use grep's -z flag:
-z
Use a null character instead of a newline as the line terminator. That is, grep knows where each line ends, but sees the input as one big line.
Thus:
grep -lriz "stringA[^(?:stringB)]*stringC" ./*.sql
When I first had to use this approach myself, I wrote up the following explanation...
Specifically: I wanted to match "any character, any number of times,
non-greedy (so defer to subsequent explicit patterns), and NOT
MATCHING THE SEQUENCE />".
The last part is what I'm writing to share: "not matching the sequence
/>". This is the first time I've used character sequences combined
with "any character" logic.
My target string:
<img class="photo" src="http://d3gqasl9vmjfd8.cloudfront.net/49c7a10a-4a45-4530-9564-d058f70b9e5e.png" alt="Iron or Gold" />
My first attempt:
<img.*?class="photo".*?src=".*?".*?/>
This worked in online regex testers, but failed for some reason within
my actual Java code. Through trial and error, I found that replacing
every ".*?" with "[^<>]*?" was successful. That is, instead of
"non-greedy matching of any character", I could use "non-greedy
matching of any character except < or >".
But, I didn't want to use this, since I've seen alt text which
includes these characters. In my particular case, I wanted to use the
character sequence "/>" as the exclusion sequence -- once that
sequence was encountered, stop the "any character" matching.
This brings me to my lesson:
Part 1: Character sequences can be achieved using (?:regex). That is,
use the () parenthesis as normal for a character sequence, but prepend
with "?:" in order to prevent the sequence from being matched as a
target group. Ergo, "(?:/>)" would match "/>", while "(?:/>)*" would
match "/>/>/>/>".
Part 2: Such character sequences can be used in the same manner as
single characters. That is, "[^(?:/>)]*?" will match any character
EXCEPT the sequence "/>", any number of times, non-greedy.
That's pretty much it. The keywords for searching are "non-capturing
groups" and "negative lookahead|lookbehind", and the latter feature
goes much deeper than I've gone so far, with additional flags that I
don't yet grok. But the initial understanding gave me the tool I
needed for my immediate task, and it's a feature that I've wondered
about for awhile -- thus, I figured I'd share the basic introduction
in case any of you were curious about tucking it away in your toolset.
After playing around with the statement provided by DreadPirateShawn:
stringA[^(?:stringB)]*stringC
I figured out that it is not a truly valid regex. This statement was excluding every character in the given set and not the full string. So I continued digging.
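A short test makes the problem visible. The sketch below reuses the abc/def/ghi strings from the earlier answer and adds a few invented lines: the bracket expression rejects any of the single characters ( ? : d e f ) and knows nothing about the sequence "def".

```shell
# abcfghi and abcfedghi do NOT contain "def" between abc and ghi,
# so a correct "not def" pattern would have to accept them.
lines=$(printf '%s\n' abcdefghi abcfghi abcfedghi abcxyzghi)

matched=$(printf '%s\n' "$lines" | grep 'abc[^(?:def)]*ghi')

echo "$matched"
```

Only abcxyzghi gets through; the two lines that merely contain a d, e or f are rejected as well.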
After some googling and testing the pattern, I came up with the following statement, that seems to fit my needs:
stringA\s*\t*(?:(?!stringB).)*\s*\t*stringC
This pattern matches any text except the provided string between 2 specified strings. It also takes into consideration whitespace characters.
There is more testing to be done, but it seems that this pattern perfectly fits my requirements
UPDATE: Here is a final version of the statement that seems to work for me:
grep -lriz "(set feedback on){0,}[ \t]*(?:(?!set feedback off).)*[ \t]*select sysdate from dual" ./*.sql
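For anyone trying to reproduce this: as far as I know, lookahead is only available in grep through -P (GNU grep's PCRE mode), and in that mode the dot does not cross newlines unless (?s) is set. The following is a sketch under those assumptions, using the stringA/stringB/stringC placeholders from the question and two invented files.

```shell
tmp=$(mktemp -d)
# one.sql: stringB always sits between stringA and stringC.
printf '%s\n' stringA stringB stringC > "$tmp/one.sql"
# two.sql: the second stringA is followed directly by stringC.
printf '%s\n' stringA stringB stringA stringC > "$tmp/two.sql"

# (?s)            let . match newlines too
# (?:(?!stringB).)*?  consume characters as long as stringB does not start here
hits=$(cd "$tmp" && grep -lizP '(?s)stringA(?:(?!stringB).)*?stringC' ./*.sql)

echo "$hits"
rm -r "$tmp"
```

Only two.sql is listed, because it is the only file where a stringA reaches a stringC without passing a stringB.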

Regular Expression Using the Dot-Matches-All Mode

Normally the . doesn't match newline unless I specify the engine to do so with the (?s) flag. I tried this regexp on my editor's (UltraEdit v14.10) regexp engine using Perl style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation, and I have written to support about it repeatedly. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes you are right this looks like a bug.
Your interpretation is correct, if you are in Perl mode and not POSIX.
However, it should apply to POSIX as well.
Although defining the modifiers like you do is very rare.
Mostly you provide a string with delimiters and the modifier afterwards, like /.*i/s
But this doesn't matter because your way is correct too. And if it weren't supported, it wouldn't match the first newline either.
So yes, this is definitely a bug in your program.
You're right that that regex should match the entire string (all 4 lines). My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".
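The expected behaviour can be checked with another PCRE engine. Here is a sketch using GNU grep (an assumption; any Perl-compatible tool would do) on the sample text from the question: -z makes grep treat the input as one record, -o prints only the match, and tr strips the NUL byte that grep appends under -z.

```shell
# The four sample lines from the question.
text=$(printf '%s\n' aaa bbb aiaiaiaiaa bbbicicid)

# With (?s) the greedy .* runs across newlines up to the last 'i'.
match=$(printf '%s\n' "$text" | grep -Pzo '(?s).*i' | tr -d '\0')

echo "$match"
```

The match runs from the first character through "bbbicici" on the last line, which is what the question expected.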

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciao
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
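The commands above still leave dots between the words, while the question asked for spaces. One possible extra step (a sketch, not part of the original answer) is to run the trimmed name through tr:

```shell
s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"

t=${s#*x}                    # drop everything up to the first 'x'
t=${t%.[A-Z][A-Z][A-Z].*}    # drop from ".DOC." onwards

# Second step: turn the remaining dots into spaces.
title=$(printf '%s\n' "$t" | tr '.' ' ')

echo "$title"
```

Inside the loop, the same tr step could be applied to newfile before the mv.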
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
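The effect of the leading .* is easy to reproduce with sed. This is a sketch on an invented input with two candidate positions; [0-9] stands in for \d because sed -E has no \d.

```shell
input='ax1.bx2.c'

# Padded: the greedy leading .* swallows as much as it can, so the
# capture comes from the LAST place where x<digits>. can match.
padded=$(printf '%s\n' "$input" | sed -E 's/.*x([0-9]+)\..*/\1/')

# Unpadded: the match starts at the FIRST x<digits>. in the input.
unpadded=$(printf '%s\n' "$input" | sed -E 's/x([0-9]+)\..*/[\1]/')

echo "$padded"
echo "$unpadded"
```

The padded version captures 2, while the unpadded one captures 1 and leaves the unmatched prefix "a" in place.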
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.