How to use zgrep and regular expression? - regex

I need to search inside a .gz file, and I found out that I should use zcat / zgrep. After a bit of research, I still can't figure out how to use a regex with zgrep.
I tried zgrep '[\s\S]{10,}' a.gz, but nothing comes out even though the file contains strings of at least 10 characters.
So how can I use zgrep to display strings of at least 10 characters?

You should not use \S and \s inside a POSIX bracket expression: [\S\s] matches only a literal \, S or s. Use . instead of [\s\S] to match any char with a POSIX BRE/ERE regex.
Also, in a BRE pattern, {10,} must be written as \{10,\}; left unescaped, {10,} matches the literal string {10,}.
Use
zgrep '.\{10,\}' a.gz
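To compare the escaped, extended and unescaped spellings side by side, here is a small sketch. It assumes gzip and zgrep are installed and that zgrep hands -E through to grep, as the usual gzip implementation does; the sample lines and the temporary file are made up for illustration.

```shell
# Build a throwaway .gz file (hypothetical sample data).
tmp=$(mktemp -d)
printf 'short\nexactly10!\nthis line is long enough\n' | gzip > "$tmp/a.gz"

# BRE: the interval braces must be escaped.
bre_out=$(zgrep '.\{10,\}' "$tmp/a.gz")

# ERE: with -E the braces work unescaped.
ere_out=$(zgrep -E '.{10,}' "$tmp/a.gz")

# BRE with unescaped braces: searches for a literal "{10,}", which is not in the file.
literal_out=$(zgrep '.{10,}' "$tmp/a.gz" || true)

printf 'BRE:\n%s\nERE:\n%s\n' "$bre_out" "$ere_out"
rm -rf "$tmp"
```

Both the BRE and the ERE form should print the two lines that are at least 10 characters long, while the unescaped BRE form prints nothing.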

Related

How to get first and last names having 6 characters long using regular expression in linux

I want to find all names whose first and last names are both exactly 6 characters long.
I wrote grep '^.{6},.{6}$' names.txt, but it does not give any output. The names are in the file names.txt.
Sample ->
firstname,lastname
If you are using basic regular expressions, you have to escape the curly braces.
Using your pattern, you can add -E for extended regular expressions so you don't have to escape them.
grep -E '^.{6},.{6}$' names.txt
Note that a dot in the pattern can also match a space or a comma, so the pattern could also match:
,,,,,,,,,,,,, or a comma with six spaces on each side
You could tighten the pattern so that each name is exactly 6 characters other than a space, tab or comma, using the negated bracket expression [^,[:blank:]]:
grep -E '^[^,[:blank:]]{6},[^,[:blank:]]{6}$' names.txt
With basic regular expressions, escape the braces instead:
grep '^.\{6\},.\{6\}$' names.txt
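The difference between the loose and the strict pattern is easy to check on a few lines. This is only a sketch: the names below are invented sample data, not taken from the asker's names.txt.

```shell
# Hypothetical sample data: only the first two lines are real 6+6 names.
names='robert,miller
joseph,harris
anna,smith
ab cde,miller'

# Loose pattern: any 6 characters on each side of the comma.
loose=$(printf '%s\n' "$names" | grep -E '^.{6},.{6}$')

# Strict pattern: 6 characters that are not a comma, space or tab.
strict=$(printf '%s\n' "$names" | grep -E '^[^,[:blank:]]{6},[^,[:blank:]]{6}$')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern also accepts "ab cde,miller" because the dot matches the space; the strict one keeps only the first two lines.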

How to comment a include line using sed [duplicate]

I am using sed in a shell script to edit filesystem path names. Suppose I want to replace
/foo/bar
with
/baz/qux
However, sed's s/// command uses the forward slash / as the delimiter. If I do that, I see an error message emitted, like:
▶ sed 's//foo/bar//baz/qux//' FILE
sed: 1: "s//foo/bar//baz/qux//": bad flag in substitute command: 'b'
Similarly, sometimes I want to select line ranges, such as the lines between a pattern foo/bar and baz/qux. Again, I can't do this:
▶ sed '/foo/bar/,/baz/qux/d' FILE
sed: 1: "/foo/bar/,/baz/qux/d": undefined label 'ar/,/baz/qux/d'
What can I do?
You can use an alternative regex delimiter as a search pattern by backslashing it:
sed '\,some/path,d'
And just use it as is for the s command:
sed 's,some/path,other/path,'
You probably want to protect other metacharacters, though; this is a good place to use Perl and quotemeta, or equivalents in other scripting languages.
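Both uses of an alternative delimiter can be tried on throwaway input. A minimal sketch, assuming | does not occur in your paths; the input lines are made up.

```shell
# s command: pick a delimiter that does not occur in the paths.
replaced=$(printf '/foo/bar/file.txt\n' | sed 's|/foo/bar|/baz/qux|')
echo "$replaced"

# Address range: introduce the custom delimiter with a backslash.
kept=$(printf 'keep 1\nstart foo/bar\ndropped\nbaz/qux end\nkeep 2\n' |
  sed '\|foo/bar|,\|baz/qux|d')
echo "$kept"
```

The first command rewrites the path prefix, and the second deletes everything from the foo/bar line through the baz/qux line.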
From man sed:
/regexp/
Match lines matching the regular expression regexp.
\cregexpc
Match lines matching the regular expression regexp. The c may be any character other than backslash or newline.
s/regular expression/replacement/flags
Substitute the replacement string for the first instance of the regular expression in the pattern space. Any character other than backslash or newline can be used instead of a slash to delimit the RE and the replacement. Within the RE and the replacement, the RE delimiter itself can be used as a literal character if it is preceded by a backslash.
Perhaps the closest to a standard, the POSIX/IEEE Open Group Base Specification says:
[2addr] s/BRE/replacement/flags
Substitute the replacement string for instances of the BRE in the
pattern space. Any character other than backslash or newline can
be used instead of a slash to delimit the BRE and the replacement.
Within the BRE and the replacement, the BRE delimiter itself can be
used as a literal character if it is preceded by a backslash.
When there is a slash / in the original string or the replacement string, we need to escape it with \. The following command works on Ubuntu 16.04 (sed 4.2.2):
sed 's/\/foo\/bar/\/baz\/qux/' file
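Escaping by hand stops being practical once the path arrives in a shell variable. One common idiom, shown here as an assumption rather than something the answers above prescribe, is to let sed add the backslashes first. The variable names and the sample path are hypothetical, and the sketch covers only the search side in BRE mode.

```shell
# Literal text we want to match (contains /, . and *).
path='/a.b/c*'

# Put a backslash before each BRE metacharacter and before the / delimiter.
# The escaping command itself uses | as its delimiter to keep it readable.
escaped=$(printf '%s\n' "$path" | sed 's|[][\.*^$/]|\\&|g')
echo "$escaped"

# Only the first input line contains the literal path.
result=$(printf '/a.b/c*/x\n/aXb/c/x\n' | sed "s/$escaped/NEW/")
echo "$result"
```

Without the escaping step, the dot and the star would act as regex operators and the slashes would end the s command early.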

Regex which captures a pattern plus everything after until a character is reached

I want a regular expression that matches every occurrence of +(other), together with everything up to the next comma.
With
(word),+(other)(word),(code),(word),(other)(code),(example)
(code),+(other),+(other)(code)(word)(example),(example),+(example)
+(code),(other)(code)(word),(code),(word)
I want to return
+(other)(word)
+(other)
+(other)(code)(word)(example)
The command I would use looks something like egrep -o '\+\(other).*,'
The only problem is that the comma in this regex isn't necessarily the next comma. Right now the command returns
+(other)(word),(code),(word),(other)(code),
+(other),+(other)(code)(word)(example),(example),
With .*, you consume zero or more characters, as many as possible, up to and including the last , on the line.
To stop before the first , instead, use a negated bracket expression [^,] and apply the * quantifier to it:
egrep -o '\+\(other\)[^,]*'
The [^,]* pattern matches zero or more characters other than ,.
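Run against the sample from the question, the two patterns behave as follows. The sketch uses grep -oE, which is equivalent to egrep -o and avoids the obsolescence warning that recent GNU grep releases print for egrep.

```shell
input='(word),+(other)(word),(code),(word),(other)(code),(example)
(code),+(other),+(other)(code)(word)(example),(example),+(example)
+(code),(other)(code)(word),(code),(word)'

# Greedy: .*, runs to the last comma on each line.
greedy=$(printf '%s\n' "$input" | grep -oE '\+\(other\).*,')

# Negated bracket expression: stops before the first comma.
matches=$(printf '%s\n' "$input" | grep -oE '\+\(other\)[^,]*')
printf '%s\n' "$matches"
```

The second command yields the three matches the question asks for, without the trailing comma.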
If your grep supports Perl compatible regular expressions (PCRE), you can use non-greedy matching:
$ grep -Po '\+\(other\).*?,' infile
+(other)(word),
+(other),
+(other)(code)(word)(example),

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Basically, I'm inside a loop, and only sometimes do I get files with this pattern. Currently I use grep "$i" test.txt, which works in all cases except when I encounter such patterns.
I'm actually grepping for the exact file number and file sequence, so if it says 123_29887, it will be 123_29887. Thanks.
You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
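A sketch of such a loop is below. It assumes a grep built with PCRE support for -P; the data line comes from the question, while the second name in the loop is invented. For comparison it also runs grep -F, a standard option that treats the whole pattern as a fixed string, which is a different technique from \Q and \E but gives the same result for purely literal names.

```shell
# One line of test.txt from the question, held in a variable for the demo.
data='NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv'

pcre_counts=''
fixed_counts=''
for i in '**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887' 'no_such_file_1_2'; do
  # PCRE: everything between \Q and \E is taken literally.
  pcre=$(printf '%s\n' "$data" | grep -cP "\Q${i}\E" || true)
  # Fixed string: no regex at all; -- protects names that start with a dash.
  fixed=$(printf '%s\n' "$data" | grep -cF -- "$i" || true)
  printf '%s: pcre=%s fixed=%s\n' "$i" "$pcre" "$fixed"
  pcre_counts="$pcre_counts$pcre"
  fixed_counts="$fixed_counts$fixed"
done
```

Note that the \Q...\E form breaks if a name itself contains \E, so it is only safe when the names are known not to.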

when to escape special character in shell

guys:
It is hard for me to judge when to escape special characters in the shell, and which characters should be escaped. For example:
sed '/[0-9]\{3\}/d' filename.txt
In the command above, why should we escape { but leave [ unchanged? I think they are both special characters.
Can you help me with this?
/br
ruan
The general answer is that you need to escape characters that have special meaning when you want to treat them as literal characters, not for their special meaning. The rules for what characters have special meaning vary from program to program.
Your specific question involves characters that have special meaning to sed; single quotes prevent any enclosed characters from being interpreted by bash.
In this case, the backslashes before { and } are there for sed, not for the shell, and what they mean depends on which regex mode sed is using. First, consider this command:
sed '/[0-9]{3}/d' filename.txt
If sed treats both [ and { as special when unescaped, which is the case in extended regex mode (sed -E, or -r with GNU sed), this pattern deletes any line that contains 3 digits in a row. The [0-9] is not a literal 5-character string; it's a bracket expression that matches any single digit. The {3} isn't a literal 3-character string; it's an interval that matches exactly 3 repetitions of the preceding expression. (In sed's default basic mode, an unescaped {3} is literal text instead.) Lines like the following will be matched:
593
3296
but not
34a7
because there aren't 3 digits in a row.
Now, consider your command:
sed '/[0-9]\{3\}/d' filename.txt
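The [0-9] is still a regular expression that matches a single digit. But now you have escaped the braces, and what that does depends on the mode. In sed's default basic (BRE) mode, which is what this command uses, \{3\} is the interval operator, so the command deletes lines containing 3 digits in a row. Only in extended mode (sed -E) do the escaped braces stop being a modifier and become the literal characters {, 3, and }; there the pattern would match lines like the following: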
0{3}
1{3}
5{3}
but not lines like
346
because there are no braces.
This difference in behavior comes from sed's two regex modes.
In its default mode, sed supports only basic regular expressions, so { is matched literally unless escaped, as you noticed:
sed '/[0-9]\{3\}/d'
In extended regex mode, neither [ nor { needs escaping:
sed -r '/[0-9]{3}/d'
OR on OSX:
sed -E '/[0-9]{3}/d'
[ and ] delimit a character class (bracket expression) in both basic and extended regex modes (even the shell's glob patterns support it).
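The two modes can be compared directly on a few test lines. A minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the input lines are made up.

```shell
input='593
3296
34a7
5{3}'

# Basic mode (default): \{3\} is the interval, {3} is literal text.
bre_interval=$(printf '%s\n' "$input" | sed '/[0-9]\{3\}/d')
bre_literal=$(printf '%s\n' "$input" | sed '/[0-9]{3}/d')

# Extended mode (-E): the roles are swapped.
ere_interval=$(printf '%s\n' "$input" | sed -E '/[0-9]{3}/d')
ere_literal=$(printf '%s\n' "$input" | sed -E '/[0-9]\{3\}/d')

printf 'interval form keeps:\n%s\n\nliteral form keeps:\n%s\n' "$bre_interval" "$bre_literal"
```

The interval forms delete 593 and 3296, while the literal forms delete only the line 5{3}.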
I think your question pertains to special characters in regular expressions. Check this out:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
It mainly depends on the sed version (POSIX-compliant or extended behavior), and then you need to adapt to the shell because, as you say, some modifications happen before sed receives its script. The best examples are the use of single or double quotes at the shell level, and \( versus ( at the sed level.
so:
define the pattern (reg ex) you want
adapt for the sed version/option you are using
adapt for shell interpretation
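The shell layer can be inspected by printing the script before handing it to sed. A small sketch, assuming a POSIX shell; the variable names n and script are made up for the example.

```shell
n=3

# Single quotes: the shell passes the script to sed untouched.
printf '%s\n' '/[0-9]\{3\}/d'

# Double quotes: the shell expands $n first. The \{ survives because, inside
# double quotes, a backslash is only special before $, a backquote, ", \ and newline.
script="/[0-9]\{$n\}/d"
printf '%s\n' "$script"

# sed receives the same BRE either way and deletes lines with 3 digits in a row.
out=$(printf '593\n34a7\n' | sed "$script")
echo "$out"
```

Printing the script this way is a cheap check whenever it is unclear whether the shell or sed is consuming a backslash.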
As an exercise, let's build the sed command that substitutes &/$IFS (the literal text, not the value of IFS) for \{, with double quotes around the sed script, in a Bash/KSH shell with POSIX or GNU sed.