Regex: grep('pattern') catches 'pattern2'

I'm looking for a logical solution, using regex, so that I can query grep for pattern and not catch pattern2. Some kind of 'stop', or 'up until' logic.
This question is about performing this type of query, not about naming conventions. I'm not looking for a workaround, just the regexp logic.
For the sake of argument, let's make the context 'up to date' ubuntu bash. But what I really want is something that only utilizes the regexp logic.
For a list as below
entry
entry1
entry2
entry.qualifier
entry.qualifier2
pseudo command: grep("entry")
Note: this will match all of the entries because there is no 'stop' logic. I'm sure the solution is actually quite simple; I just haven't used regex in a long time.
Something like 'not anything after the pattern'?

grep supports word boundary so a pure regex based answer would be:
grep '\bentry\b' file
However grep also supports -w flag (match words) so you can also use:
grep -w 'entry' file
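One caveat worth checking against the sample list: a word boundary treats "." as a non-word character, so entry.qualifier still counts as containing the word entry. The sketch below assumes GNU grep (for -w) and assumes the goal might be to keep only the bare entry line, in which case the POSIX -x option (whole-line match) is one way to get that:

```shell
# Sample list from the question.
list='entry
entry1
entry2
entry.qualifier
entry.qualifier2'

# Word match: "entry" must not touch letters, digits or underscores.
word_matches=$(printf '%s\n' "$list" | grep -w 'entry')

# Whole-line match: the line must be exactly "entry".
line_matches=$(printf '%s\n' "$list" | grep -x 'entry')

printf 'grep -w:\n%s\n\ngrep -x:\n%s\n' "$word_matches" "$line_matches"
```

Here -w keeps entry, entry.qualifier and entry.qualifier2, while -x keeps only entry.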

If you're using GNU grep, what can help here are the word boundary anchor operators \< and \> that it supports. That is to say \<entry\>.
POSIX doesn't specify any \b or \< or -w command line option. What if you have to use grep that doesn't have them? The problem can be solved by testing each line of the file with pure regular expression which must match it completely.
Suppose we want to pick out lines which contain the identifier entry that isn't a substring of a longer identifier name. Suppose identifiers are strings of English letters, digits and underscores. We can use this:
grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$'
Note that the entire pattern is anchored on both ends, so that it must completely match an entire line. It matches any occurrence of entry which:
is either not preceded by anything, or else is preceded by a non-identifier character, possibly with other characters in front of it; and
is either not followed by anything, or else followed by a non-identifier character, possibly followed by other characters.
This approach is also useful if you have a specific idea of what constitutes a "word" which differs from the definition used by the GNU grep \b or \< operators. Suppose the file format is such that entry123 is in fact two different tokens entry and 123, and thus has to match. However entryabc must not match. For this, the GNU grep pattern \bentry\b or \<entry\> won't help; it will not match entry123. However, the above trick can readily be adapted to work:
grep -E '^(|.*[^A-Za-z])entry([^A-Za-z].*|)$'
I.e. entry surrounded by nothing, or else by characters that are not upper or lower case letters. So this is worth keeping in your back pocket.
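As a quick check of the letters-only variant, here is a minimal sketch on made-up sample lines. It uses an equivalent formulation, (^|[^A-Za-z])entry([^A-Za-z]|$), which avoids the empty alternatives; that rewrite is my assumption about what is most portable, not part of the answer above.

```shell
# Hypothetical sample lines: "entry123" should match, "entryabc" should not.
sample='entry
entry123
entryabc
reentry
see entry here'

# "entry" preceded by line start or a non-letter,
# and followed by a non-letter or line end.
matches=$(printf '%s\n' "$sample" | grep -E '(^|[^A-Za-z])entry([^A-Za-z]|$)')

printf '%s\n' "$matches"
```

The lines entryabc and reentry are rejected because a letter touches entry on one side.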

Related

Regex whitespace before character [duplicate]

I am attempting to grep for all instances of Ui\. not followed by Line or even just the letter L
What is the proper way to write a regex for finding all instances of a particular string NOT followed by another string?
Using lookaheads
grep "Ui\.(?!L)" *
bash: !L: event not found
grep "Ui\.(?!(Line))" *
nothing
Negative lookahead, which is what you're after, requires a more powerful tool than the standard grep. You need a PCRE-enabled grep.
If you have GNU grep, the current version supports options -P or --perl-regexp and you can then use the regex you wanted.
If you don't have (a sufficiently recent version of) GNU grep, then consider getting ack.
The answer to part of your problem is here, and ack would behave the same way:
Ack & negative lookahead giving errors
You are using double-quotes for grep, which permits bash to "interpret ! as history expand command."
You need to wrap your pattern in SINGLE-QUOTES:
grep 'Ui\.(?!L)' *
However, see @JonathanLeffler's answer to address the issues with negative lookaheads in standard grep!
You probably can't perform standard negative lookaheads using grep, but usually you should be able to get equivalent behaviour using the "inverse" switch '-v'. Using that, you can construct a regex for the complement of what you want to match and then pipe it through two greps.
For the regex in question you might do something like
grep 'Ui\.' * | grep -v 'Ui\.L'
(Edit: this is not as strong as a true lookahead, but can often be used to work around the problem.)
If you need to use a regex implementation that doesn't support negative lookaheads and you don't mind matching extra character(s)*, then you can use negated character classes [^L], alternation |, and the end of string anchor $.
In your case grep 'Ui\.\([^L]\|$\)' * does the job.
Ui\. matches the string you're interested in
\([^L]\|$\) matches any single character other than L or it matches the end of the line: [^L] or $.
If you want to exclude more than just one character, then you just need to throw more alternation and negation at it. To find a not followed by bc:
grep 'a\(\([^b]\|$\)\|\(b\([^c]\|$\)\)\)' *
Which is either (a followed by not b or followed by the end of the line: a then [^b] or $) or (a followed by b which is either followed by not c or is followed by the end of the line: a then b, then [^c] or $).
This kind of expression gets to be pretty unwieldy and error prone with even a short string. You could write something to generate the expressions for you, but it'd probably be easier to just use a regex implementation that supports negative lookaheads.
*If your implementation supports non-capturing groups then you can avoid capturing extra characters.
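To see the "a not followed by bc" idea in action, here is a small sketch on made-up lines. It uses the ERE spelling (grep -E) with the redundant grouping removed, which is my simplification of the BRE expression above:

```shell
# Hypothetical sample lines.
sample='abc
abd
ab
a
xyz'

# "a" followed by a non-b, or by end of line,
# or by "b" that is itself followed by a non-c or end of line.
matches=$(printf '%s\n' "$sample" | grep -E 'a([^b]|$|b([^c]|$))')

printf '%s\n' "$matches"
```

Only abc (a followed by bc) and xyz (no a at all) are left out.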
If your grep doesn't support -P or --perl-regexp, and you can install a PCRE-enabled grep, e.g. "pcregrep", then it won't need any command-line options (as GNU grep does) to accept Perl-compatible regular expressions; you just run
pcregrep 'Ui\.(?!Line)'
You don't need another nested group for "Line" as in your example "Ui.(?!(Line))" -- the outer group is sufficient, like I've shown above.
Let me give you another example of negative lookahead assertions: when you have a list of lines returned by "ipset", each line showing the number of packets in the middle of the line, and you don't need the lines with zero packets, you just run:
ipset list | pcregrep "packets(?! 0 )"
If you like Perl-compatible regular expressions and have perl but don't have pcregrep, or your grep doesn't support --perl-regexp, you can use one-line perl scripts that work the same way as grep:
perl -e 'while (<>) {if (/Ui\.(?!Line)/){print;};}'
Perl accepts stdin the same way as grep, e.g.
ipset list | perl -e "while (<>) {if (/packets(?! 0 )/){print;};}"
At least for the case of not wanting an 'L' character after the "Ui." you don't really need PCRE.
grep -E 'Ui\.($|[^L])' *
Here I've made sure to match the special case of the "Ui." at the end of the line.
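The difference between the two-grep pipeline and the single-pattern approach shows up on a line that contains both a wanted and an unwanted occurrence. A minimal sketch, with made-up sample lines:

```shell
# Hypothetical sample lines; the last one has both "Ui.Line" and "Ui.Box".
sample='Ui.Line
Ui.Button
ends with Ui.
Ui.Line and Ui.Box'

# Two greps: any line containing "Ui.L" is dropped, even if it also has a good hit.
piped=$(printf '%s\n' "$sample" | grep 'Ui\.' | grep -v 'Ui\.L')

# One pattern: a line is kept if at least one "Ui." is not followed by "L".
single=$(printf '%s\n' "$sample" | grep -E 'Ui\.($|[^L])')

printf 'pipeline:\n%s\n\nsingle pattern:\n%s\n' "$piped" "$single"
```

The pipeline loses the last line, which is the sense in which it is weaker than a true lookahead.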

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why in the first case the escape characters are needed?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
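The expansion can be observed directly without running grep at all. The following sketch works in a scratch directory (created with mktemp, an assumption about the environment) and uses set -- to capture the words exactly as a command would receive them:

```shell
# Fix the sort order of wildcard results so the output is predictable.
LC_ALL=C
export LC_ALL

# Work in a scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch abcd.txt abcdef.jpg file.txt

# Unquoted: the shell replaces ab* with the matching file names.
set -- grep -o ab* file.txt
unquoted="$*"

# Quoted: the pattern reaches grep unchanged.
set -- grep -o 'ab*' file.txt
quoted="$*"

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

cd / && rm -r "$demo_dir"
```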
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
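Word splitting is easy to demonstrate with a helper that reports how many arguments it received. Both count_args and the file name below are hypothetical, chosen only for illustration:

```shell
# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

filename='my notes.txt'   # hypothetical file name containing a space

unquoted_count=$(count_args $filename)     # split into "my" and "notes.txt"
quoted_count=$(count_args "$filename")     # kept as a single argument

printf 'unquoted: %s argument(s)\nquoted:   %s argument(s)\n' \
  "$unquoted_count" "$quoted_count"
```

With the unquoted form, grep would be asked to search two files, my and notes.txt, neither of which is the intended one.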
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
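A small sketch of the BRE/ERE difference, using the interval example from above on a made-up input line. The third command shows what happens when ERE-style bare braces are used in a BRE context (the PCRE form is left out because it needs grep -P, which not every grep has):

```shell
line='a  b'   # "a", two spaces, "b"

# BRE: braces must be escaped to mean "repeat 1 to 3 times".
bre=$(printf '%s\n' "$line" | grep 'a[[:space:]]\{1,3\}b')

# ERE: braces have their special meaning without escaping.
ere=$(printf '%s\n' "$line" | grep -E 'a[[:space:]]{1,3}b')

# Bare braces in BRE are taken literally, so nothing matches here.
literal=$(printf '%s\n' "$line" | grep 'a[[:space:]]{1,3}b' || true)

printf 'BRE: [%s]\nERE: [%s]\nbare braces in BRE: [%s]\n' "$bre" "$ere" "$literal"
```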
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E .+ myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\ myfile.txt
Note that you need to prefix the \ which is intended for grep with another \, because the backslash, like a space, has a special meaning for the shell; if you did not write \\., grep would not see a backslash-period, but just a period.
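The three spellings can be checked against each other. This sketch writes a few made-up lines to a temporary file (standing in for myfile.txt) and confirms that all three commands select the same line:

```shell
# Hypothetical stand-in for myfile.txt.
myfile=$(mktemp)
printf '%s\n' 'no match here' 'dots ... between spaces' 'trailing...' > "$myfile"

bracket=$(grep -E ' [.]+ ' "$myfile")      # quotes, period in a bracket expression
escaped=$(grep -E ' \.+ ' "$myfile")       # quotes, period escaped for grep
unquoted=$(grep -E \ \\.+\  "$myfile")     # no quotes, everything escaped for the shell

printf '%s\n%s\n%s\n' "$bracket" "$escaped" "$unquoted"

rm -f "$myfile"
```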

Exactly two capitalized words on a line

I want to create a regular expression which can replace lines that contain exactly two words beginning with an uppercase with the character 'X'.
I'm currently using this:
sed -e '/\b[A-Z][a-z]*\b c X /home/Morgan/desktop/test
The problem is the following: it only changes lines which contain 1 or more words described by the regular expression in my test.txt.
I don't know how to say that I want an X only on lines with exactly 2 words beginning with an uppercase. Either word can occur anywhere within the line.
My test.txt contains:
Bonjour oui oui Bonjour -> this must be replaced by X
Bonjour Bonjour Bonjour -> this mustn't
Bonjour Oui bonjour oui -> this must be replaced by X
You seem to be attempting to use the Perl/PCRE word boundary \b but typical sed implementations do not understand this regular expression dialect. By your problem description, you are looking for beginning and end of line, anyway; this is a very basic regex anchor which was introduced already in the original grep: ^ matches beginning of line, and $ matches end of line.
Without anchors, a regular expression will match anywhere in the line. To say "only two" you really must check the entire line and make sure there are not three or more of what you're looking for.
"Find a line with exactly two words which begin with uppercase" needs to be rephrased or massaged a bit before you can attempt to write a regex. If we -- provisionally, for this discussion -- define w to mean "word which does not begin with uppercase" and W to mean one which does, you want ^w*Ww*Ww*$ -- exactly two uppercase words, and zero or more non-uppercase words in any position before, between, or after them.
A word which begins with uppercase is [A-Z][a-z]* (this requires all the subsequent characters to be lowercase) and a word which doesn't is [a-z][a-z]* (or [a-z]\+ if your sed supports that regex variation).
Because words need spaces between them, an optional word expression needs to be parenthesized so you can say "zero or more of this entire sequence". Typically, sed regex requires grouping parentheses to be backslashed as well, though this differs between versions.
So, try this:
sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/' file
If indeed you have GNU sed, this can be simplified a bit:
sed -r 's/^([a-z]+ )*[A-Z][a-z]*( [a-z]+)* [A-Z][a-z]*( [a-z]+)*$/X/' file
This definition of "word" might not be sufficient; perhaps you can refine it to suit your circumstances. In particular, the spacing is assumed to be regular (exactly one space between words; no leading or trailing whitespace on the lines) and no text may contain characters outside of spaces and the alphabetics a-z in upper or lower case. (Whether accented characters like è and Á are also considered alphabetics in this range depends on your locale settings. Maybe set LC_ALL=fr_FR.utf-8 in your script if French locale settings are important.)
Notice also how the sed substitution command requires exactly three delimiter characters -- traditionally, we use a slash, but you can use any punctuation character. The form is s/regex/replacement/flags where the regex, the replacement, and the flags can all be empty, but the s and the delimiters are always required.
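For a quick check, here is the backslashed (BRE) substitution applied to the three lines from the question, without the "->" annotations. Setting LC_ALL=C for sed is my assumption, made so that [a-z] and [A-Z] are plain ASCII ranges regardless of the system locale:

```shell
# The three test lines from the question.
sample='Bonjour oui oui Bonjour
Bonjour Bonjour Bonjour
Bonjour Oui bonjour oui'

# Replace a line with X only if it has exactly two capitalized words.
result=$(printf '%s\n' "$sample" |
  LC_ALL=C sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/')

printf '%s\n' "$result"
```

The first and third lines become X; the line with three capitalized words is left alone.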

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
Just discovered that the regex of finding the string
'+33. ... ... ..' is by the following regex
\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what interpunction you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous character. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
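To try both expressions without the original file, the sketch below builds a small stand-in containing NUL bytes and a made-up number, then extracts it using the [ ._-] separator class described above. It assumes a grep that supports -a and -o (GNU grep does):

```shell
# Hypothetical stand-in for the binary file: text fragments separated by NUL bytes.
data_file=$(mktemp)
printf 'junk\000+331 234 567 89\000more junk\n' > "$data_file"

# -a: treat the binary file as text; -o: print only the part that matched.
loose=$(grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' "$data_file")
strict=$(grep -Eao '\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}' "$data_file")

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"

rm -f "$data_file"
```

Both patterns pick out the same number here; they differ on input where the digit groups have other lengths or no separators.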
It means that grep found a match, but the file you're grepping isn't a text file; it's a binary file containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, e.g.
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
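The commands above still leave the dots in the title. One way to get the requested "212 The Actual Title Of the Chapter" is a second step that turns dots into spaces; using tr for that is my addition, not part of the answer above:

```shell
s='Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'

t=${s#*x}                    # drop everything up to and including the first "x"
t=${t%.[A-Z][A-Z][A-Z].*}    # drop ".DOC." and everything after it

# Second step: replace each remaining dot with a space.
title=$(printf '%s\n' "$t" | tr '.' ' ')

printf '%s\n' "$title"
```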
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
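The effect of a leading .* can be seen with sed on a made-up input that has two candidate positions. The sketch uses sed's basic syntax, so \d+ is spelled [0-9][0-9]*:

```shell
input='ax1.bx2.c'   # hypothetical input with two places where "x<digits>." occurs

# The greedy leading .* swallows as much as it can,
# so the capture comes from the LAST "x<digits>.".
padded=$(printf '%s\n' "$input" | sed 's/.*x\([0-9][0-9]*\)\..*/\1/')

# Without the padding, the match starts at the FIRST "x<digits>.".
unpadded=$(printf '%s\n' "$input" | sed 's/x\([0-9][0-9]*\)\..*/[\1]/')

printf 'padded:   %s\nunpadded: %s\n' "$padded" "$unpadded"
```

The padded form captures 2, while the unpadded form matches at the first occurrence and produces a[1].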
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.