Using the star sign in grep - regex

I am trying to search for the substring "abc" in a specific file in linux/bash
So I do:
grep '*abc*' myFile
It returns nothing.
But if I do:
grep 'abc' myFile
It returns matches correctly.
Now, this is not a problem for me. But what if I want to grep for a more complex string, say
*abc * def *
How would I accomplish it using grep?

The asterisk is just a repetition operator, but you need to tell it what to repeat. /*abc*/ matches a string containing a literal * followed by ab and zero or more c's (the second * applies to the c; the first has nothing to repeat, so grep takes it as a literal asterisk). If you want to match anything, you need to say .* -- the dot means any character (within certain guidelines). If you want to just match abc, you could just say grep 'abc' myFile. For your more complex match, you need to use .* -- grep 'abc.*def' myFile will match a string that contains abc followed by def with something optionally in between.
Update based on a comment:
* in a regular expression is not exactly the same as * in the console. In the console, * is part of a glob construct, and just acts as a wildcard (for instance ls *.log will list all files that end in .log). However, in regular expressions, * is a modifier, meaning that it only applies to the character or group preceding it. If you want * in regular expressions to act as a wildcard, you need to use .* as previously mentioned -- the dot is a wildcard character, and the star, when modifying the dot, means find zero or more dots; i.e. find zero or more of any character.
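A minimal sketch of the three patterns discussed above, run against a small made-up sample file (the file contents and variable names are assumptions for illustration, not from the question):

```shell
# Build a small sample file (contents are made up for illustration).
sample=$(mktemp)
printf '%s\n' 'xabcy' 'abc then def' 'ab' '*abc' > "$sample"

# Plain substring search: counts every line containing "abc".
plain=$(grep -c 'abc' "$sample")

# The leading * has nothing to repeat, so grep takes it literally:
# this only finds lines containing "*ab" followed by zero or more c's.
starred=$(grep -c '*abc*' "$sample")

# .* is the regex wildcard: "abc", then anything (or nothing), then "def".
wild=$(grep -c 'abc.*def' "$sample")

echo "plain=$plain starred=$starred wild=$wild"
rm -f "$sample"
```

Only the line that really contains an asterisk is counted for '*abc*', which is why the original command found nothing in a file without one.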

The dot character means match any character, so .* means zero or more occurrences of any character. You probably mean to use .* rather than just *.

Use grep -P - which enables support for Perl style regular expressions.
grep -P "abc.*def" myfile

The "star sign" is only meaningful if there is something in front of it. If there isn't, the tool (grep in this case) may treat it as an error or simply take it literally. For example:
'*xyz' is meaningless
'a*xyz' means zero or more occurrences of 'a' followed by xyz
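As a quick illustration of the "zero or more" part, here is a small sketch with made-up input lines. Because a* is allowed to match no a at all, every line that contains xyz counts as a match:

```shell
# "a*" may match zero a's, so any line containing "xyz" matches.
# The last line ("xy") lacks the z and is the only one left out.
matches=$(printf '%s\n' 'xyz' 'aaxyz' 'bxyz' 'xy' | grep -c 'a*xyz')
echo "$matches"
```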

This worked for me:
grep ".*${expr}" - with double-quotes, preceded by the dot.
Where ${expr} is whatever string you need in the end of the line.
So in your case:
grep ".*abc.*" myFile
Standard unix grep.

The expression you tried, like those that work on the shell command line in Linux for instance, is called a "glob". Glob expressions are not full regular expressions, which is what grep uses to specify strings to look for. Here is an (old, small) post about the differences. The glob expressions (as in "ls *") are interpreted by the shell itself.
It's possible to translate from globs to REs, but you typically need to do so in your head.

You're not using regular expressions, so your grep variant of choice should be fgrep, which will behave as you expect it to.

Try grep -E for extended regular expression support
Also take a look at:
The grep man page

'*' works as a modifier for the previous item. So 'abc*def' searches for 'ab' followed by 0 or more 'c's followed by 'def'.
What you probably want is 'abc.*def' which searches for 'abc' followed by any number of characters, followed by 'def'.

This may be the answer you're looking for:
grep abc MyFile | grep def
Only thing is... it will output lines where "def" is before OR after "abc"
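To make the difference concrete, here is a small sketch comparing the two-grep pipeline with a single ordered pattern. The three input lines are made up for illustration:

```shell
# Three made-up lines: abc before def, def before abc, and abc alone.
lines=$(printf '%s\n' 'abc then def' 'def then abc' 'abc only')

# Two greps in a pipeline: both words must be present, in any order.
either_order=$(printf '%s\n' "$lines" | grep abc | grep -c def)

# One pattern with .* : abc must come before def on the line.
in_order=$(printf '%s\n' "$lines" | grep -c 'abc.*def')

echo "either order: $either_order, abc before def: $in_order"
```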

$ cat a.txt
123abcd456def798
123456def789
Abc456def798
123aaABc456DEF
* matches the preceding character zero or more times.
$ grep -i "abc*def" a.txt
$
It would match, for instance "abdef" or "abcdef" or "abcccccccccdef". But none of these are in the file, so no match.
. means "match any character" Together with *, .* means match any character any number of times.
$ grep -i "abc.*def" a.txt
123abcd456def798
Abc456def798
123aaABc456DEF
So we get matches.
There are a lot of online references about regular expressions, which is what is being used here.

To summarize the other answers, I made these examples to show how the regex and the glob work.
There are three files
echo 'abc' > file1
echo '*abc' > file2
echo '*abcc' > file3
Now I execute the same commands for these 3 files. Let's see what happens.
(1)
grep '*abc*' file1
As you said, this one returns nothing. * wants to repeat whatever is in front of it. For the first *, there is nothing in front of it to repeat, so grep treats this * as a literal * character. Because the string in the file is abc, there is no * in the string, so it is not found. The second *, after the c, means repeat c 0 or more times.
(2)
grep '*abc*' file2
This one returns *abc: because there is a * at the front, it matches the pattern *abc*.
(3)
grep '*abc*' file3
This one returns *abcc because there is a * at the front and 2 c's at the tail, so it matches the pattern *abc*.
(4)
grep '.*abc.*' file1
This one returns abc because .* indicates 0 or more repetitions of any character.

Related

Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname and city can vary in size.
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to get the standard xx_ from the beginning, then any code that has at least 3 characters, and after that it must have another underscore, and so on.
Could anybody help ?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
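Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior. (Non-greedy quantifiers are a Perl-style feature that POSIX extended regexes do not define, so with grep this form needs -P rather than -E.)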
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
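For grep -E specifically, a hedged variant of the pattern above can be sketched as follows. It assumes a POSIX extended regex engine, so the non-greedy ? markers are dropped (the letter classes cannot run past an underscore or a dot anyway), and ^ and $ anchors are added so the whole filename has to fit. The sample filenames are made up for illustration:

```shell
# ERE version: greedy quantifiers only, anchored at both ends.
pattern='^xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)$'

# Hypothetical filenames: the first two follow the convention, the rest
# break it (code too short, wrong prefix, wrong extension).
matched=$(printf '%s\n' \
  'xx_abc_smith_john_paris.doc' \
  'xx_abcd_li_bo_rome.docx' \
  'xx_ab_smith_john_paris.doc' \
  'yy_abc_smith_john_paris.pdf' \
  'xx_abc_smith_john_paris.txt' | grep -E "$pattern")

printf '%s\n' "$matched"
```

In a real directory the printf would be replaced by the listing from the question (ls -1).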

foo[E1,E2,...]* glob matches desired contents, but foo[E1,E2,...]_* does not?

I saw something weird today in the behaviour of the Bash Shell when globbing.
So I ran an ls command with the following Glob:
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]* | grep ":"
the result was as expected
GM12878_Hs_InSitu_MboI_rE1_TagDirectory:
GM12878_Hs_InSitu_MboI_rE2_TagDirectory:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rG1_TagDirectory:
GM12878_Hs_InSitu_MboI_rG2_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
however when I change the same regex by introducing an underscore to this
ls GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_* | grep ":"
my expected result is the complete set as shown above, however what I get is a subset:
GM12878_Hs_InSitu_MboI_rF_TagDirectory:
GM12878_Hs_InSitu_MboI_rH_TagDirectory:
Can someone explain what's wrong in my logic when I introduce an underscore sign before the asterisk?
I am using Bash.
You misunderstand what your glob is doing.
You were expecting this:
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]*
to be a glob of files that have any of those comma-separated segments but that's not what [] globbing does. [] globbing is a character class expansion.
Compare:
$ echo GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]
to what you were trying to get (which is brace {} expansion):
$ echo GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}
GM12878_Hs_InSitu_MboI_rE1 GM12878_Hs_InSitu_MboI_rE2 GM12878_Hs_InSitu_MboI_rF GM12878_Hs_InSitu_MboI_rG1 GM12878_Hs_InSitu_MboI_rG2 GM12878_Hs_InSitu_MboI_rH
You wanted that latter expansion.
Your expansion uses a character class which matches the character E-H, 1-2, and ,; it's identical to:
GM12878_Hs_InSitu_MboI_r[EFGH12,]_*
which, as I expect you can now see, isn't going to match any two character entries (where the underscore-less version will).
* in filesystem globs is not like * in regex. In a regex * means "0 or more of the preceding pattern," but in filesystem globs it means "anything at all of any size". So in your first example, the _ is just part of the "anything" from the * but in the second you're matching any single character within your character class (not the patterns you seem to be trying to define) followed by _ followed by anything at all.
Also, character classes don't work the way you're trying to use them. [...] will match any character within the brackets, so your pattern is actually the same as [EFGH12,] since those are all the letters in class you define.
To get the grouping of patterns you want, you should use { instead of [ like
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}_* | grep ":"
As far as I know, and this article supports me, the square brackets don't work as a choice but as a character set, so using [E1,E2,F,G1,G2,H] actually is equivalent to exactly one occurrence of [EGHF12,]. You can then interpret the second result as "one character of EGHF12, and an underscore", which matches GM12878_Hs_InSitu_MboI_rF_TagDirectory: but not GM12878_Hs_InSitu_MboI_rG1_TagDirectory: (there the r is followed by more than "one occurrence of...").
The first regex works because you used the asterisk, which matches what is missed by the wrong [...].
A correct expression would be:
ls GM12878_Hs_InSitu_MboI_r{E1,E2,F,G1,G2,H}* | grep ":"
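The character-class behaviour can be checked without any files on disk. The sketch below uses a case statement, which applies the same pattern-matching rules as filename globbing; the helper name matches_bracket and the comma-named test string are made up for illustration:

```shell
# Prints "yes" if the name matches the bracket glob from the question.
matches_bracket() {
  case $1 in
    GM12878_Hs_InSitu_MboI_r[E1,E2,F,G1,G2,H]_*) echo yes ;;
    *) echo no ;;
  esac
}

# One character from the set (F), then the underscore: matches.
single=$(matches_bracket GM12878_Hs_InSitu_MboI_rF_TagDirectory)
# E is taken from the set, but the next character is 1, not _.
double=$(matches_bracket GM12878_Hs_InSitu_MboI_rE1_TagDirectory)
# The comma is an ordinary member of the set, so this matches too.
comma=$(matches_bracket GM12878_Hs_InSitu_MboI_r,_TagDirectory)

echo "F: $single, E1: $double, comma: $comma"
```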

How to substitute a string even it contains regex meta characters using Shell or Perl?

I want to substitute a word, which may contain regex metacharacters, with another word, for example, substitute .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it will change the line below
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution stops regex metacharacters from being interpreted, so it does not have the side effect.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
Can anybody help with this? Many thanks.
Please note that the word I want to substitute is NOT hard coded; it comes from an input file, so you can consider it a variable.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and @:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the metacharacters are not the same on both sides. In the search pattern they are mainly the regex characters []{}()\.*+^$, whereas in the replacement only & and \ are special (plus the separator, which is whichever character follows the s, in both parts).
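When the word comes from a variable rather than being typed by hand, the escaping can be automated. The following is only a sketch: escape_bre is a hypothetical helper name, it assumes sed's default basic regular expressions with / as the separator, and it escapes the search side only (a replacement containing & or \ would need its own treatment):

```shell
# Hypothetical helper: backslash-escape BRE metacharacters and the /
# separator so the string is matched literally by sed.
escape_bre() {
  printf '%s' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

word='.$foo'                  # would normally be read from the input file
from=$(escape_bre "$word")    # becomes \.\$foo

result=$(printf '%s\n' 'pre$foo to .$foobar' | sed "s/$from/.to/g")
echo "$result"
```

Only the occurrence with a real leading dot is replaced, so the first pre$foo is left alone.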

Find and trim part of what is found using regular expression

I'm a newbie in writing regular expressions
I have a file name like this TST0101201304-123.txt and my target is to get the numbers between '-' and '.txt'
So I wrote this expression: -([0-9]*)\.txt. This gets me the numbers that I want, but it also retrieves the hyphen '-' and the last part of the string, '.txt', so the result in the example above is '-123.txt'
So my question is:
Is there a way in regular expressions to get only part of the matched string, like a submatch of the match without the need to trim it in my shell script code for unix?
I found this answer but it is getting the same result:
Regexp: Trim parts of a string and return what ever is left
Tip: To test my regular expressions I used this website
You can use lookbehind and lookahead
(?<=-)[0-9]*(?=[.]txt)
Don't know if it would work in unix
Different regex-engines are different. Since you're using expr match, you need to make two changes:
expr match expects a regex that matches the entire string; so, you need to add .* at the beginning of yours, to cover everything before the hyphen.
expr match uses POSIX Basic Regular Expressions (BREs), which use \( and \) for grouping (and capturing) rather than merely ( and ).
But, conveniently, when you give expr match a regex that contains a capture-group, its output is the content of that capture-group; you don't need to do anything else special. So:
$ expr match TST0101201304-123.txt '.*-\([0-9]*\)\.txt'
123
sed is your friend.
echo filename | sed -e 's/.*-\([0-9]*\)\.txt/\1/'
should get you what you want.
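Two ways to pull out the number in a shell script can be sketched like this, using the filename from the question. The variable names are made up, and the second option is a swapped-in technique (plain POSIX parameter expansion) rather than a regex:

```shell
name='TST0101201304-123.txt'

# Option 1: let sed match the whole name and keep only the captured digits.
with_sed=$(printf '%s\n' "$name" | sed 's/.*-\([0-9]*\)\.txt$/\1/')

# Option 2: no regex at all, just parameter expansion.
tmp=${name##*-}              # drop everything up to the last hyphen
with_expansion=${tmp%.txt}   # drop the .txt suffix

echo "$with_sed $with_expansion"
```

The parameter expansion version avoids starting an extra process, which can matter inside a loop over many files.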

how to use regex under find command

I need to list all filenames of the form alien.digits
where digits can be anything from one digit to many,
but it should not match if the name is mixed with anything else, like alien.htm, alien.1php, alien.1234.pj.123, alien.123.12, alien.12.12p.234htm
I wrote:
find home/jassi/ -name "alien.[0-9]*"
But it is not working and it's matching everything.
Any solution for that?
I think what you want is
find home/jassi/ -regex ".*/alien\.[0-9]+"
With -name option you don't specify a regular expression but a glob pattern.
Be aware that find expects that the whole path is matched by the regular expression.
Try this: find home/jassi/ -regex ".*/alien\.[0-9]+$"
It will match all files that have alien. and end with at least one digit but nothing else than digits. The $ character means end of string.
The * modifier means 0 or more of the previous match, and . means any character, which means it's matching alien.
Try this instead:
alien\.[0-9]+$
The + modifier means 1 or more of the previous match, and the . has been escaped to a literal character, and the $ on the end means "end of string".
You can also add a ^ to the start of the regex if you want to make sure that only files that exactly match your regex are returned. The ^ character means "start of string", so ^alien\.[0-9]+$ will match alien.1234, but it won't match not_an_alien.1234.
It worked for me:
find home/jassi/ -type f -regex ".*/alien\.[0-9]+"
I had to provide -type f to check that it's a file, else it would also show a directory of the same name.
Thanks bmk. I just figured out and at the same time you responded exactly the same thing. Great!
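The difference between -name and -regex can be tried out in a scratch directory. This is a sketch under a few assumptions: the file names are taken from or modelled on the question, find supports -regex (GNU and BSD find do, but it is not part of POSIX find), and [0-9][0-9]* is used in place of [0-9]+ so the pattern does not depend on which regex dialect find defaults to:

```shell
# Scratch directory with a mix of wanted and unwanted names.
dir=$(mktemp -d)
touch "$dir/alien.1" "$dir/alien.123" "$dir/alien.htm" \
      "$dir/alien.1php" "$dir/alien.123.12"
mkdir "$dir/alien.7"          # a directory that looks like a match

# -name takes a glob: "[0-9]*" is one digit followed by anything at all.
glob_count=$(find "$dir" -type f -name 'alien.[0-9]*' | wc -l | tr -d ' ')

# -regex must match the whole path, so nothing may follow the digits.
regex_hits=$(find "$dir" -type f -regex '.*/alien\.[0-9][0-9]*' | sort)
regex_count=$(printf '%s\n' "$regex_hits" | wc -l | tr -d ' ')

echo "glob: $glob_count files, regex: $regex_count files"
rm -rf "$dir"
```

The glob also picks up alien.1php and alien.123.12, while the regex keeps only alien.1 and alien.123, and -type f leaves the alien.7 directory out of both.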