Does wildcard characters behave differently? - regex

I want to understand if behavior of wildcard characters is same or not, for example:
1) On unix prompt, if i type ls *.xml, then it lists all files ending with .xml, for example: 1.xml, first.xml etc.
--> So as it appears, * matches any character.
2) Now, i was trying to find some text using grep in a file, and i executed following command:
grep -i "*.xml" first.txt
To my utter surprise there was no results returned, even though first.txt had contents like: first.xml, second.xml.
If i do grep -i "xml" first.xml, then I get the results.
This behavior is causing confusion, how does * matches any text in case (1) and in case (2) it is failing?
Does this behave different in different situations, and if so, where to find this info.

Shell uses globbing. https://en.wikipedia.org/wiki/Glob_(programming)
Grep uses regular expressions. https://en.wikipedia.org/wiki/Regular_expression
They are two different languages, and for grep, * is not a wild card. It means the previous character, repeated zero or more times
You want
grep '.*\.xml'
which means any character (.) repeated zero or more times (*) followed by a literal '.' (\.) xml
(For practical purposes you would just use fgrep .xml or grep '\.xml$' of course)

Related

Why do these two grep commands produce different results?

$ grep "^底线$" query_20220922 | wc -l
95701
$ grep -iF "底线" query_20220922 | wc -l
796591
Shouldn't the count be exactly the same? I want to count the exact match of the string.
-F matches a fixed string anywhere in a line. ^xyz$ matches lines which contain "xyz" exactly (nothing else).
You are looking for -x/--line-regexp and not -F/--fixed-strings.
To match lines which contain your search text exactly, without anything else and without interpreting your search text as regular expression, combine the two flags: grep -xF 'findme' file.txt.
Also, case-insensitive matching (-i) can match more lines too than case-sensitive matching (the default).
No, they do different things. The first uses a regular expression to search for "底线" alone on an input line (^ in a regular expression means beginning of line, and $ means end of line).
The second searches for the string anywhere on an input line. The -i flag does nothing at all here (it selects case-insensitive matching, but this is not well-defined for CJK character sets, so basically a no-op) and -F says to search literally (which makes the search faster for internal reasons, but doesn't change the semantics of a search string which doesn't contain any regex metacharacters).
It should be easy to see how they differ. For a large input file, it might be a bit challenging to find the differences if they are not conveniently mixed; but for a quick start, try
diff -u <(grep -m5 "^底线$" query_20220922) <(grep -m5Fi "底线" query_20220922)
where -m5 picks out the first five matches. (Try a different range, perhaps with tail, if the differences are all near the end of the file, for example.)
Tangentially, you usually want to replace the pipe to wc -l with grep -c; also,you might want to try grep -Fx "底线" as a faster alternative to the first search.

grep and regex stored in string

my question is quite short:
a="'[0-9]*'"
grep -E '[0-9]*' #for example, line containing 000 will be recognized and printed
but
grep -E $a #line containing 000 WILL NOT be printed, why is that?
Does substitution for grep regex change the command's behaviour or have I missed something from a syntactic point of view? In other words, how do I make it so that grep accepts regex from a string stored in a variable.
Thank you in advance.
Quotes go around data, not in data. That means, when you store data (in this case, a regex expression) in a variable, don't embed quotes in the variable; instead, put double-quotes around the variable when you use it:
a="[0-9]*"
grep -E "$a"
You can sometimes get away with leaving the double-quotes off when using variables (as in Avinash Raj's comment), but it's not generally safe. In this case, it'll work fine provided there are no files or subdirectories in the current working directories with names that happen to start with a digit. You see, without double-quotes around $a, the shell will take its value, try to split it into multiple words (not a problem here), try to expand each word that contains shell wildcards into a list of matching files (potential problem here), and pass that to the command (grep) as its list of arguments. That means that if you happen to have files that start with digits in the current directory, grep thinks you ran a command like this:
grep -E 1file.txt 2file.jpg 3file.etc
... and it treats the first filename as the pattern to search for, and any other filenames as files to be searched. And you'll be scratching your head wondering why your script works or fails depending on which directory you happen to be in.
Note: the pattern [0-9]* is a valid regular expression, and a valid shell glob (wildcard) pattern, but it means very different things in the two contexts. As a regex, it means 0 or more digits in a row. As a shell glob, it means something that starts with a digit. Speaking of which, grep -E '[0-9]*' is not actually going to be very useful, since everything contains strings of 0 or more digits, so it'll match every line of every file you feed it.

Grep Search Specific Character Trouble

I have searched extensively and cannot figure out what I am doing wrong here. I have a text file that may contain a string similar to the following:
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
I generally know what the sting will look like up until the disk percentage indicator (i.e. 11%), but in the final part of the string I need to figure out if it ends in the usr (or sub) directories.
I want to use grep to do this search but am having problems. For example, the following command gives me output, but once i replace any of the "." characters where the "G" or "%" would be, or if I try to add "/usr/.*" at the end it refuses to return anything.
$ egrep ^/dev/dir1/dir2\s*\d*.\s*\d*.\s*\d*.\s*\d*.\s*.*$ testfile
/dev/dir1/dir2 200G 22G 179G 11% /usr/dir3/dir4
grep's extended regular expressions do not support using \d to match digits. Instead, use [0-9] or [:digit:]. You can use the following grep command:
egrep '^/dev/dir1/dir2\s*[0-9]*G\s*[0-9]*G\s*[0-9]*G\s*[0-9]*%\s*.*$'
You can also pass grep the -P option to enable Perl compatible regular expressions, which do support \d:
grep -P '^/dev/dir1/dir2\s*\d*G\s*\d*G\s*\d*G\s*\d*%\s*.*$'
Note the use of grep instead of egrep in the above command; -P is incompatible with egrep.
As a side note, I prefer to use + instead of * when I can, because it is stricter and can cause errors to become apparent sooner. For example, I assume there will always be at least one space and one digit in each place in the input, so you can use \s+ and [0-9]+ (or \d+). If your original pattern had used +, it would not have matched at all in the first place (whether it was quoted or not), and you would have known you had a problem even before adding the G or % to it. A working example is
egrep '^/dev/dir1/dir2\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+[0-9]+.\s+.+$'

Grep for multiple strings with escaped pipe in each

I'm using Gitbash within Windows. I want to grep for a set of strings, each of which ends with a |
I think I can do each one singly with a backslash to escape the pipe:
grep abcdef\| filename.tsv
But to do them all together I end up with:
grep 'abcdef\|\|uvwxyz\|' filename.tsv
which fails. Any ideas?
I could just do each string individually and then concatenate the resulting files, but it would take days.
In basic posix regexes - which are used by grep - you must not escape the literal |. However you need to escape the | if it is used as a regex syntax element to specify alternatives.
The following expression should work:
grep 'abcdef|\|uvwxyz|' filename.tsv
An ERE might be the way to go, for easier readability.
egrep '(abcdef|uvwxyz)[|]' filename.tsv
This lets you manage your string list a little more easily, and "escapes" the trailing vertical bar by putting it inside a range. (This works for dots, asterisks, etc, as well.)
If egrep isn't available on your system, you can check to see if your existing grep includes a -E option for extended regexes.
There are two competing effects here which you may be confusing. Firstly, the | must be escaped or quoted so that it is not interpreted by the shell. Secondly, depending on which regex mode you are using, escaping/unescaping the pipe changes whether it is a literal character or a metacharacter.
I would suggest that you change your pattern to this:
grep 'abcdef|\|uvwxyz|' file
In basic regex mode, an escaped pipe \| is a regex OR, so this matches either pattern followed by a literal pipe.
Alternatively, if all your patterns end in a pipe and you have more than just two, perhaps you could use this:
grep -E '(abc|def|ghi)\|' file
In extended mode, escaping the pipe has the opposite effect, so this pattern matches any of the sequences of letters followed by a literal pipe.

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
Just discovered that the regex of finding the string
'+33. ... ... ..' is by the following regex
\+33.[0-9].[0-9].[0-9].[0-9].' Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to explain what you want the dot to match. I have replaced it with [ ._-] which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what interpunction you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change, and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous (and actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is of course a syntax error at beginning of string, where there is no preceding expression to repeat one or more times; but GNU grep will just silently ignore it, rather than report an error).
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous character. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([^ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([^ ._-][0-9]{3}){2}[^ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
It means that you're find a match but the file you're greping isn't a text file, it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0.9].[0.9].[0-9]' >> resultaten.txt