Why does \d not match digits when using awk?

I'm finding a behaviour I can't really explain with awk. Maybe it's a silly mistake but I can't figure it out.
I have a file called files with some random filenames.
$ cat -e files
3beds.txt$
file4.txt$
file3.txt$
dedo$
file5.txt$
texto5.txt$
metoo.txt$
34lions$
texto2.txt$
file1.txt$
7hello$
summer$
missing$
hello.mundo$
helloWorld.txt$
texto3$
awkvars$
texto4$
yes$
file2.txt$
I want to print only the filenames containing digits. I used the command:
awk '/\d/{print $0}' files
But my result was:
$ awk '/\d/{print $0}' files
3beds.txt
dedo
hello.mundo
helloWorld.txt
I'd really appreciate it if someone can explain to me why those lines are being printed. Thank you!

Clue: the four lines that matched were the four lines that contained "d".
So, clearly \d was being interpreted as a literal "d".
Why? Because awk's regex syntax is POSIX Extended Regular Expressions, not the Perl, PCRE or ECMAScript syntax you might be used to. POSIX ERE has no \d shorthand, so \d does not stand for "digit" as you were expecting. Your awk treated the backslash as an escape that forces a literal "d".
The equivalent for \d in awk depends on the semantics you want[1]. [0-9] will match only the ten ASCII digits. You could also use the POSIX character class for digit inside a POSIX Bracket Expression, [[:digit:]]:
When used on strings with non-ASCII characters, the [:digit:] class may include digits in other scripts, depending on the locale.
My quotations are from regular-expressions.info, which has a wealth of info on the many syntaxes. This page took info from that page and turned it into a convenient table that compares 15 of them in great detail.
[1]: Even for regex engines that support the shorthand \d, the semantics can differ:
Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9]. In most flavors that support Unicode, \d includes all digits from all scripts. Notable exceptions are Java, JavaScript, and PCRE. These Unicode flavors match only ASCII digits with \d.
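As a rough illustration of the point above, here is a minimal sketch that assumes any POSIX-style awk and uses four names borrowed from the question's list. The variable names are made up for the example. It uses [0-9]; on awk builds that support POSIX classes, [[:digit:]] should select the same lines for ASCII input.

```shell
# Four sample names taken from the question's list.
names='3beds.txt
dedo
file4.txt
summer'

# [0-9] matches the ten ASCII digits.
with_digits=$(printf '%s\n' "$names" | awk '/[0-9]/')

# A literal "d" is what the \d pattern effectively matched in the question.
with_d=$(printf '%s\n' "$names" | awk '/d/')

printf 'lines with a digit:\n%s\n' "$with_digits"
printf 'lines with a "d":\n%s\n' "$with_d"
```

The second list reproduces the surprising result from the question: names such as dedo are selected because they contain the letter, not a digit.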

With awk, if you want to print only the lines containing digits, you can use this regexp instead:
awk '/[[:digit:]]/' files
3beds.txt
file4.txt
file3.txt
file5.txt
texto5.txt
34lions
texto2.txt
file1.txt
7hello
texto3
texto4
file2.txt

Related

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why are the escape characters needed in the first case?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!
You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, ab* could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
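The expansion described above can be observed without running grep at all. This is a minimal sketch, assuming a POSIX-style shell with mktemp available; the directory and variable names are invented for the demo. It counts how many arguments the shell would hand to a command for the unquoted and the quoted pattern.

```shell
# Work in a throwaway directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
: > "$demo_dir/abcd.txt"
: > "$demo_dir/abcdef.jpg"

# Unquoted: the shell replaces the pattern with the matching filenames.
set -- "$demo_dir"/ab*
unquoted_count=$#

# Quoted: the pattern is passed through as one literal argument.
set -- "$demo_dir/ab*"
quoted_count=$#
quoted_arg=${1##*/}

echo "unquoted: $unquoted_count argument(s)"
echo "quoted:   $quoted_count argument(s): $quoted_arg"

rm -rf "$demo_dir"
```

With the two files present, the unquoted form yields two arguments, which is how a filename ends up in the position where grep expects the regex.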
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
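A small sketch of the word-splitting half of that advice, assuming the default IFS and a hypothetical filename that contains a space:

```shell
filename='my notes.txt'   # hypothetical name containing a space

# Unquoted expansion is split on whitespace into separate words.
set -- $filename
unquoted_words=$#

# Double-quoted expansion stays a single word.
set -- "$filename"
quoted_words=$#

echo "unquoted: $unquoted_words word(s), quoted: $quoted_words word(s)"
```

In the unquoted case grep would receive "my" and "notes.txt" as two separate file arguments.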
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context, as here and here). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
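To see the BRE and ERE forms side by side, here is a minimal sketch with three made-up sample lines. The PCRE form is left out because it needs grep -P, which is not available in every grep build.

```shell
# Three sample lines: one space, four spaces, and no space between a and b.
sample=$(printf 'a b\na    b\nab\n')

# BRE (grep's default): the interval braces must be escaped.
bre_count=$(printf '%s\n' "$sample" | grep -c 'a[[:space:]]\{1,3\}b')

# ERE (grep -E): the same interval with unescaped braces.
ere_count=$(printf '%s\n' "$sample" | grep -E -c 'a[[:space:]]{1,3}b')

echo "BRE matches: $bre_count, ERE matches: $ere_count"
```

Both commands should select only the line with a single space, since four spaces exceed the {1,3} interval and "ab" has none.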
Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E  .+  myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\  myfile.txt
Note that the \ intended for grep must itself be prefixed by another \, because the backslash, like the space, has a special meaning for the shell: if you did not write \\., grep would see just a period rather than a backslash-period. The final escaped space must also be followed by an ordinary space to separate the pattern from the filename.
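The three spellings can be checked against each other. This is a minimal sketch using a temporary file with three invented lines; each command should count the same lines if the quoting is equivalent. Note the two spaces before the filename in the last command: the first one is escaped and belongs to the pattern, the second one separates the arguments.

```shell
sample_file=$(mktemp)
printf '%s\n' 'x ... y' 'x...y' 'x . y' > "$sample_file"

# Quoted pattern, period protected by a bracket expression.
bracket_count=$(grep -E -c ' [.]+ ' "$sample_file")

# Quoted pattern, period protected by a backslash.
escaped_count=$(grep -E -c ' \.+ ' "$sample_file")

# No quotes: "\ " protects each space and "\\." delivers "\." to grep.
bare_count=$(grep -E -c \ \\.+\  "$sample_file")

echo "$bracket_count $escaped_count $bare_count"
rm -f "$sample_file"
```

Only the lines where the periods are surrounded by spaces are counted, so 'x...y' is excluded in all three cases.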

Regex: grep('pattern') catches 'pattern2'

I'm looking for a logical solution, using regex, so that I can query grep for pattern and not catch pattern2. Some kind of 'stop', or 'up until' logic.
This question is about performing this type of query, not about naming conventions. I'm not looking for a workaround, just the regexp logic.
For the sake of argument, let's make the context 'up to date' ubuntu bash. But what I really want is something that only utilizes the regexp logic.
For a list as below
entry
entry1
entry2
entry.qualifier
entry.qualifier2
pseudo command: grep("entry")
Note, this will match all of the entries because there is no 'stop' logic. I'm sure the solution is actually quite simple, I just haven't used regex in a long time.
Something like 'not anything after the pattern'?
grep supports word boundary so a pure regex based answer would be:
grep '\bentry\b' file
However grep also supports -w flag (match words) so you can also use:
grep -w 'entry' file
If you're using GNU grep, what can help here are the word boundary anchor operators \< and \> that it supports. That is to say \<entry\>.
POSIX doesn't specify any \b or \< or -w command line option. What if you have to use grep that doesn't have them? The problem can be solved by testing each line of the file with pure regular expression which must match it completely.
Suppose we want to pick out lines which contain the identifier entry that isn't a substring of a longer identifier name. Suppose identifiers are strings of English letters, digits and underscores. We can use this:
grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$'
Note that the entire pattern is anchored on both ends, so that it must completely match an entire line. It matches any occurrence of entry which:
is either not preceded by anything, or else is preceded by a non-identifier character, possibly with other characters in front of it; and
is either not followed by anything, or else followed by a non-identifier character, possibly followed by other characters.
This approach is also useful if you have a specific idea of what constitutes a "word" which differs from the definition used by the GNU grep \b or \< operators. Suppose the file format is such that entry123 is in fact two different tokens entry and 123, and thus has to match. However entryabc must not match. For this, the GNU grep pattern \bentry\b or \<entry\> won't help; it will not match entry123. However, the above trick can readily be adapted to work:
grep -E '^(|.*[^A-Za-z])entry([^A-Za-z].*|)$'
I.e. entry surrounded by nothing, or else by characters that are not upper or lower case letters. So this is worth keeping in your back pocket.
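Here is a minimal sketch that runs the -w option and the portable anchored pattern against the list from the question. It assumes a grep that supports -w (GNU and BSD grep do) and accepts an empty alternative inside the ERE group, as GNU grep does.

```shell
entries='entry
entry1
entry2
entry.qualifier
entry.qualifier2'

# Extension (not in POSIX): -w requires the match to be a whole word.
with_w=$(printf '%s\n' "$entries" | grep -w 'entry')

# Portable ERE from the answer: anchor the whole line and require a
# non-identifier character (or nothing) on each side of "entry".
portable=$(printf '%s\n' "$entries" |
  grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$')

printf '%s\n' "$portable"
```

Both approaches drop entry1 and entry2 but keep entry.qualifier and entry.qualifier2, because "." is not an identifier character.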

Simple REGEX to find ON TAPE Numbers

I need to find the string ‘ON TAPE (000012, 000013)’. The number of course changes each time I need to search. I've been trying to learn regex, but I'm not taking to it very well. Anyone mind filling in the blank for me with a regex that will locate the string ‘ON TAPE (000012, 000013)’ ?
Welcome :)
(.*?\(\d+\, \d+\))
Check this out: Regex101
This is all dependent on the exact flavor of regex that you are using. Different languages handle regular expressions differently. Assuming that only the number is going to change, you could try
Perl-style (PCRE)
In Perl-compatible regex engines, the () characters represent grouping, so they need to be escaped to match literal parentheses. (The \s and \d shorthands are Perl-style extensions, not part of POSIX.)
/ON\sTAPE\s\(\d+,\s\d+\)/
\s matches any whitespace character
\( and \) match the parentheses
\d matches any numeric character
+ means that the previous character can be repeated 1 to n times
javascript
For this particular case, javascript accepts the same pattern.
php
For this particular case, php accepts the same pattern.
python
For this particular case, python accepts the same pattern.
grep
With grep's default basic syntax (BRE), you don't need to escape the parentheses, and the + and \d operators are not available (\s is a GNU extension).
ON\sTAPE\s([0-9][0-9]*,\s[0-9][0-9]*)
\s matches any whitespace character
[0-9] matches any numeric character
* means that the previous character can be repeated 0 to n times.
PS. The link that Nikolas shared is really useful :)
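For a command-line check, here is a minimal sketch using grep -E with only POSIX ERE constructs. The log line is hypothetical, and -o (print only the matched text) is a common extension, not part of POSIX.

```shell
# Hypothetical input line containing the string from the question.
line='Job 42 wrote data ON TAPE (000012, 000013) at 03:00'

# ERE: \( and \) are literal parentheses, [0-9]+ is one or more digits.
match=$(printf '%s\n' "$line" | grep -E -o 'ON TAPE \([0-9]+, [0-9]+\)')

printf '%s\n' "$match"
```

Because the numbers are matched with [0-9]+, the same command keeps working when the tape numbers change.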

Grep for multiple strings with escaped pipe in each

I'm using Gitbash within Windows. I want to grep for a set of strings, each of which ends with a |
I think I can do each one singly with a backslash to escape the pipe:
grep abcdef\| filename.tsv
But to do them all together I end up with:
grep 'abcdef\|\|uvwxyz\|' filename.tsv
which fails. Any ideas?
I could just do each string individually and then concatenate the resulting files, but it would take days.
In basic POSIX regexes - which are used by grep by default - you must not escape the literal |. However, you need to escape the | if it is used as a regex syntax element to specify alternatives (the \| alternation is a GNU extension to basic syntax).
The following expression should work:
grep 'abcdef|\|uvwxyz|' filename.tsv
An ERE might be the way to go, for easier readability.
egrep '(abcdef|uvwxyz)[|]' filename.tsv
This lets you manage your string list a little more easily, and "escapes" the trailing vertical bar by putting it inside a bracket expression. (This works for dots, asterisks, etc, as well.)
If egrep isn't available on your system, you can check to see if your existing grep includes a -E option for extended regexes.
There are two competing effects here which you may be confusing. Firstly, the | must be escaped or quoted so that it is not interpreted by the shell. Secondly, depending on which regex mode you are using, escaping/unescaping the pipe changes whether it is a literal character or a metacharacter.
I would suggest that you change your pattern to this:
grep 'abcdef|\|uvwxyz|' file
In basic regex mode, an escaped pipe \| is a regex OR, so this matches either pattern followed by a literal pipe.
Alternatively, if all your patterns end in a pipe and you have more than just two, perhaps you could use this:
grep -E '(abc|def|ghi)\|' file
In extended mode, escaping the pipe has the opposite effect, so this pattern matches any of the sequences of letters followed by a literal pipe.
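A minimal sketch of the extended-syntax approach on four invented rows. It also shows grep -F with repeated -e options, which is not from the answers above but is a standard way to search for several literal strings without any regex escaping.

```shell
rows='abcdef|1
uvwxyz|2
abcdef 3
other|4'

# ERE: | separates alternatives; [|] is a literal pipe.
ere_count=$(printf '%s\n' "$rows" | grep -E -c '(abcdef|uvwxyz)[|]')

# Fixed strings: -F disables regex syntax, -e supplies each literal.
fixed_count=$(printf '%s\n' "$rows" | grep -F -c -e 'abcdef|' -e 'uvwxyz|')

echo "ERE: $ere_count, fixed strings: $fixed_count"
```

Both commands should count only the first two rows; "abcdef 3" lacks the trailing pipe and "other|4" has the pipe but not one of the strings.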

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension, not one of the standard POSIX classes):
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
In case you ask, the option used is Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes so strings can be the better option sometimes. For example cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
To validate that a text box accepts only ASCII, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
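Since the question asks for a count, the same bracket pattern can be used in a case statement. This is a minimal sketch under a few assumptions: a POSIX-style shell, mktemp, a filesystem that accepts UTF-8 names, and three invented filenames. Setting LC_ALL=C is meant to make the range from space to ~ compare byte values.

```shell
# Compare bytes, so that the range " " to "~" means ASCII 0x20-0x7E.
LC_ALL=C
export LC_ALL

demo_dir=$(mktemp -d)
: > "$demo_dir/plain.txt"
: > "$demo_dir/$(printf 'caf\303\251.txt')"    # UTF-8 bytes for e-acute
: > "$demo_dir/$(printf 'na\303\257ve.md')"    # UTF-8 bytes for i-diaeresis

count=0
for path in "$demo_dir"/*; do
  name=${path##*/}
  case $name in
    # Any byte outside printable ASCII marks the name as affected.
    *[!\ -~]*) count=$((count + 1)) ;;
  esac
done

echo "$count"
rm -rf "$demo_dir"
```

As with the printf one-liner, names containing ASCII control characters are counted too.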
This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g ; # thus all non ASCII or specific items in question could be cleaned. Very nice either in selection or pre-processing of items that will eventually become hash keys.