How to comprehend expression/pattern in find/grep/rsync? - regex

I have to use find, grep and rsync commands for my program. Generally, I rarely used all of these in a single script so didn't notice earlier. Is there a category of regular-expression that fit these commands like:
find command: follows regex type1
grep command: follows regex type2
rsync command: follows regex type3
For example, for finding all the paths which lead to my program log file, we can do:
find -type f -name "foo.log*"
In the command above, the star does not behave as it would in a regular expression. In a regex, the star means zero or more instances of the immediately preceding expression, which is the character 'g' in this case. So if the pattern were really a regex, it would match filenames like:
foo.lo
foo.log
foo.logg
foo.loggg
and so on...
Like find, rsync behaves this way when given a pattern for its source and destination paths. grep, on the other hand, does seem to follow regular expressions.
So, in total:
Do all of these commands follow a different kind of regular expression?
Or do some of them follow regex while others do not, and if not, what pattern syntax do they follow? Basically, I'm looking for a general overview of the pattern syntax each of these tools uses.
I'm new to Linux tools. Please guide!

There is a big difference between wildcards and regular expressions.
Wildcards:
special characters that define a simple search pattern
used by shells (bash, old MS-DOS, ...), and by many unix commands (find, ...)
limited set of wildcards, typically just:
* - zero or more chars (any combination)
? - exactly one char (any char)
[...] - exactly one char out of a set or range of chars, such as [0-9a-f] for a hex digit
see tutorial: https://linuxhint.com/bash_wildcard_tutorial/
Regular Expression:
a sequence of characters that define a search pattern
think of regular expressions (regex for short) as wildcards on steroids
regex patterns are used to find or find and replace strings
powerful language, natively supported by most programming languages
there are different flavors of regular expressions, typically grouped into these categories:
POSIX Basic (BRE - Basic Regular Expressions)
POSIX Extended (ERE - Extended Regular Expressions)
Perl and PCRE (Perl Compatible Regular Expressions)
JavaScript
many more flavors, see https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines
some unix commands allow you to select one regex flavor or another; for example:
grep uses POSIX Basic by default
grep -E or egrep uses POSIX Extended
grep -P uses Perl
Wikipedia article: https://en.wikipedia.org/wiki/Regular_expression
tutorial: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
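To make the wildcard/regex difference concrete, here is a minimal sketch. It assumes a throwaway directory created with mktemp, and the file names are made up for illustration. The same text, foo.log followed by a star, selects different names depending on whether it is read as a wildcard (find -name) or as a basic regex (grep).

```shell
# Throwaway directory so the demo cannot touch real files.
demo_dir=$(mktemp -d)
touch "$demo_dir/foo.lo" "$demo_dir/foo.log" "$demo_dir/foo.log.1" "$demo_dir/foo.logg"

# Wildcard: '*' stands for any run of characters after "foo.log".
glob_matches=$(find "$demo_dir" -type f -name 'foo.log*' | sed 's|.*/||' | LC_ALL=C sort | paste -sd ' ' -)

# Regex (BRE): 'g*' means zero or more 'g'; '^' and '$' anchor the whole name.
regex_matches=$(ls "$demo_dir" | grep '^foo\.log*$' | LC_ALL=C sort | paste -sd ' ' -)

echo "wildcard: $glob_matches"
echo "regex: $regex_matches"
rm -rf "$demo_dir"
```

The wildcard picks up foo.log, foo.log.1 and foo.logg but not foo.lo, while the regex picks up foo.lo, foo.log and foo.logg but not foo.log.1.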

Related

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why in the first case the escape characters are needed?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
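The expansion described above can be observed directly. This is a rough sketch that recreates the example files in a temporary directory (the directory and the use of set -- to count arguments are illustration choices, not part of the original answer). Note that set -- overwrites the script's positional parameters.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/abcd.txt" "$demo_dir/abcdef.jpg"
printf 'abbb\nxyz\n' > "$demo_dir/file.txt"

# Unquoted: the shell replaces the pattern with the matching filenames.
set -- "$demo_dir"/ab*
unquoted_count=$#

# Quoted: the pattern stays a single, literal argument.
set -- 'ab*'
quoted_count=$#
quoted_arg=$1

# With quotes, grep receives the regex and searches file.txt as intended.
quoted_result=$(grep -o 'ab*' "$demo_dir/file.txt")

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count argument ($quoted_arg)"
echo "grep -o 'ab*' printed: $quoted_result"
rm -rf "$demo_dir"
```

The unquoted pattern turns into two arguments (the two filenames), while the quoted one reaches the command as the single string ab*.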
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
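A small sketch of the word-splitting problem, assuming the default IFS and a hypothetical filename that contains a space. Counting the resulting arguments with set -- shows what a command such as grep would actually receive.

```shell
filename='my notes.txt'   # hypothetical name containing a space

# Unquoted: the value is split at the space into separate words.
set -- $filename
unquoted_words=$#

# Double-quoted: the value is passed through as one word.
set -- "$filename"
quoted_words=$#

echo "unquoted: $unquoted_words words, quoted: $quoted_words word"
```

With the unquoted form, grep would be asked to search two files named my and notes.txt, neither of which is the intended file.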
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using Perl extensions in an ERE context). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
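As a quick check of the BRE and ERE spellings above, the following sketch runs both against the same made-up sample lines and counts the matching lines. The PCRE form is left out because grep -P is not available everywhere.

```shell
# Sample lines: 1 space, 3 whitespace chars, none, and 5 whitespace chars between a and b.
sample=$(printf 'a b\na \t b\nab\na \t \t b\n')

# BRE: the braces must be escaped to act as a repetition count.
bre_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b')

# ERE: the braces are special without escaping.
ere_count=$(printf '%s\n' "$sample" | grep -Ec 'a[[:space:]]{1,3}b')

echo "BRE matched $bre_count lines, ERE matched $ere_count lines"
```

Both forms should agree: only the first two sample lines have between one and three whitespace characters between a and b.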
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E  .+  myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\  myfile.txt
Note that the \ intended for grep must itself be prefixed by another \, because the backslash has, like a space, a special meaning for the shell. If you did not write \\., grep would not see a backslash-period, but just a period. The final "\ " (backslash, space) is followed by a second, unescaped space that separates the pattern from the filename.
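To compare the two quoted styles above, here is a minimal sketch on made-up input. It also counts what happens when the period is left unescaped and the spaces are dropped, which is roughly what grep sees when nothing is protected.

```shell
sample=$(printf 'end ... start\nno.dots\n')

# Bracket style: [.] is a literal period for grep; quotes keep the spaces.
bracket_count=$(printf '%s\n' "$sample" | grep -Ec ' [.]+ ')

# Backslash style: \. is a literal period for grep; quotes keep the spaces.
backslash_count=$(printf '%s\n' "$sample" | grep -Ec ' \.+ ')

# Unprotected meaning: '.' matches any character, so every non-empty line matches.
unescaped_count=$(printf '%s\n' "$sample" | grep -Ec '.+')

echo "bracket: $bracket_count, backslash: $backslash_count, unescaped: $unescaped_count"
```

The two escaped forms select only the line containing " ... ", while the unescaped pattern matches both lines.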

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results looks wildly different to what I expect, and what I get from Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from fewest to most features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
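The effect of -E on + can be seen with a tiny made-up word list. This is only a sketch: the same pattern text is given to grep once as a basic and once as an extended regex.

```shell
words=$(printf 'baa\nba+\nb\n')

# ERE: '+' is a quantifier, so 'ba+$' means b followed by one or more a.
ere_match=$(printf '%s\n' "$words" | grep -E 'ba+$')

# BRE: '+' is an ordinary character, so 'ba+$' means the literal text "ba+".
bre_match=$(printf '%s\n' "$words" | grep 'ba+$')

echo "ERE matched: $ere_match"
echo "BRE matched: $bre_match"
```

This is why a pattern full of + quantifiers finds nothing sensible without -E: each + is being searched for as a literal plus sign.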
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need it, but PCREs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

Regular expression to search for not-a-specific-sequence-of-characters

I am looking for a regular expression that matches "not-a-specific-sequence-of-characters". A solution suddenly dawned on me (after a few years!) I am running bash on a Macintosh computer.
As an example, I want to match the word Path as long as it is not preceded by the word posix or Posix. Here is the regular expression I came up with:
[^[:space:]]*([^x]|[^i]x|[^s]ix|[^o]six|[^Pp]osix)Path
I would like to ask if there might be a more efficient or otherwise better approach. This approach can become somewhat cumbersome the longer the "not" sequence of characters is.
Perl regexes have handy "look-around" features.
perl -ne 'print if /(?<![pP]osix)Path/' file
GNU grep has a -P flag to enable perl-compatible regular expressions, but OSX does not have GNU tools by default.
A straightforward technique is to filter the output of grep:
grep 'Path' file | grep -v '[pP]osixPath'
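A short sketch of the two-stage grep filter on made-up input. One caveat worth seeing in action: the filter works on whole lines, so a line that contains both a wanted Path and a posixPath is dropped entirely, whereas the look-behind version would still print it.

```shell
sample=$(printf 'usePath\nposixPath\nPosixPath\nmyPath and posixPath\n')

# Keep lines mentioning Path, then drop any line that contains posixPath or PosixPath.
filtered=$(printf '%s\n' "$sample" | grep 'Path' | grep -v '[pP]osixPath')

echo "$filtered"
```

Only usePath survives here; the last sample line is removed even though myPath on its own would qualify.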

increase the integer value of 1 with Regex

I want to increase the value. For example:
Jerry1
Jerry2
Jerry3
Jerry4
I want to change that.
Jerry2
Jerry3
Jerry4
Jerry5
How can I change ?
Don't try to abuse regular expressions for everything.
By design, regular expressions are not meant to support counting. The reason is simple: counting requires at least a type-2 language, and processing those is significantly more complex than processing type-3 ("regular") languages.
See Wikipedia for details: https://en.wikipedia.org/wiki/Chomsky_hierarchy
So by the definition, once you fully support counting it probably no longer is a regular language.
There are extensions around, for example Perl extended regular expressions, that do allow you to solve this particular problem. But essentially, they are no longer regular expressions; they invoke an external function to do the work.
The following perl extended regular expression should do what you want:
s/(-?\d+)/$1 + 1/eg
but essentially, only the matching part is a regular expression; the substitution is Perl, and so Turing-complete. The e flag indicates that the right-hand part should be evaluated by Perl as code, not used as a regexp substitution string.
You can of course do this trick in pretty much any other regular expression engine. Match, then compute the increment, then substitute the match with the new value.
Full perl filter demo:
> echo 'Test 123 test 0 Banana9 -17 3 route66' | perl -pe 's/(-?\d+)/$1+1/eg'
Test 124 test 1 Banana10 -16 4 route67
The p flag makes perl read standard input and apply the program to each line, then output the result. That is why the actual script consists of the substitution only. This is what makes Perl so popular for unix scripting. You can even mass-apply this filter to a whole set of files (see -i for in-place modification, and the perlrun manual page). So in order to modify a whole set of files in place (backups will be postfixed with .bak):
perl -p -i.bak -e 's/(-?\d+)/$1+1/eg' <filenames>
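The same match, compute, substitute idea can be sketched without Perl. The following uses awk, which is a swapped-in tool rather than anything from the answer above, and the function name increment_trailing_number is made up. Unlike the Perl one-liner, it only handles a number at the end of each line.

```shell
# Hypothetical helper: add 1 to a number at the end of each input line.
increment_trailing_number() {
  awk '{
    if (match($0, /[0-9]+$/)) {
      prefix = substr($0, 1, RSTART - 1)   # text before the number
      number = substr($0, RSTART) + 1      # the number itself, plus one
      print prefix number
    } else {
      print                                # no trailing number: leave the line alone
    }
  }'
}

result=$(printf 'Jerry1\nJerry2\nJerry9\nplain\n' | increment_trailing_number | paste -sd ' ' -)
echo "$result"
```

The regex only locates the digits; the arithmetic happens in ordinary awk code, which is the same division of labour as in the Perl version.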

SED regular expressions trouble

I have built the following regular expression in order to fix a big SQL dump with invalid tags
This searches
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It behaves like a charm in EditPadPro (Windows) but it gives me conflicts with the codepages when I try to import it in my Linux based MySQL.
So since the file comes from a Linux installation I tried my luck with SED but it gives me errors errors errors. Obviously it has a different way to build regular expressions.
It is quite urgent to do the substitutions so I have no time reading the SED manual.
Can you give a hand to migrate my regular expressions to a SED friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - unknown command: ')'
Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.
Sed just has some different escaping rules to the Regex flavor you're using.
() escaped \( \) - for grouping
[] are not - for character classes
{} escaped \{ \} - for repetition counts (quantifiers)
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.
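For comparison, here is a simplified sketch using sed -E (extended regular expressions), which avoids most of the backslashes. It is not a full translation of the original pattern: it only handles the [video]...[/video] form with two of the subdomains, because ERE has neither non-capturing groups (?:...) nor the lazy quantifier .+? that the [ame=...] branch relies on. The # delimiter is chosen so the slashes in the URL need no escaping.

```shell
line='[video]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]'

# \1 = whole URL, \2 = optional subdomain, \3 = video id (stops at [, ], & or ").
converted=$(printf '%s\n' "$line" |
  sed -E 's#\[video\](http://(www\.|uk\.)?youtube\.com/watch\?v=([^][&"]+))\[/video\]#[video=youtube;\3]\1[/video]#')

echo "$converted"
```

Because the optional subdomain becomes a capturing group in ERE, the back-reference numbers shift compared with the PCRE original, which is easy to overlook when porting.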