Caret symbol does not work with grep - regex

This does not work:
grep -h '^zip' log*
This works:
grep -h '[^bg]zip' log*
The log* files definitely contain a file named zip, because the second command prints out the file name. But the first does not print anything at all. I tried several variations and found that the caret only works as negation inside brackets. Outside brackets, it does not seem to indicate that whatever follows it must be at the beginning of the word.
What is wrong here? I am using Ubuntu 12.4.

beginning of the word
^ marks the beginning of line, not word. "foo zip" will not match against ^zip, but "zip foo" will. If you want to match zip at the beginning of a word, use this:
grep \\bzip
\b marks a word boundary, but you need to double up on escapes because your shell will strip one. (grep '\bzip' also works.)
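A small sketch of the difference, using three made-up sample lines (they are not taken from the question's log files, and sample_lines is just a helper name chosen here). \b is an extension rather than standard POSIX syntax, so this assumes GNU grep:

```shell
# Hypothetical sample input; the real log files are not shown in the question.
sample_lines() {
    printf '%s\n' 'zip foo' 'foo zip' 'foozip'
}

# ^ anchors at the start of the line: only "zip foo" matches.
sample_lines | grep '^zip'

# \b anchors at a word boundary: "zip foo" and "foo zip" match,
# "foozip" does not because "zip" is in the middle of a word there.
sample_lines | grep '\bzip'
```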

Related

Remove 2nd occurrence after a given flag

How can I parse every line in a .txt file and remove everything after the second occurrence of a / that follows a given flag (jdk)?
For example
/usr/lib/jvm/jdk-1.7.0/2.0/zi/etc/GMT
/usr/lib/jvm/jdk1.7.2/3.0/zi/etc/GMT
/usr/share/servertool-java-openjdk/4.0/jce.jar
becomes
/usr/lib/jvm/jdk-1.7.0/2.0/
/usr/lib/jvm/jdk1.7.2/3.0/
/usr/share/servertool-java-openjdk/4.0/
Note, I can't just split on jdk, because it may be jdk-1.*.*/ etc.
My end goal is to find all the unique paths on a highly restricted SELinux box that has the output of a locate jdk stored in an output.txt file.
Update: my attempt so far, to get closer is
cat output.txt | awk -F '\\jdk' '{print $1"jdk"}' | sort -u
This just chops everything after jdk, and removes dupes.
sed is a very appropriate tool for this job. You'll use the s/// command to remove the part of the line you want to delete.
Note the slashes in the s/// command can be changed to other characters so that any slashes you have in the pattern or replacement parts don't need to be escaped.
Your pattern will be:
in capturing parentheses:
"jdk" followed by zero or more non-slashes
followed by a slash
followed by one or more non-slashes
followed by a slash
followed by any number of characters
The replacement will be the text that was captured.
You'll want to refer to the sed manual
3.3 The s Command
5 Regular Expressions: selecting text
5.2 Basic (BRE) and extended (ERE) regular expression
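Putting the pattern description above together, one possible sed command looks like the sketch below. The function name trim_after_jdk is made up for this example, and the input lines are the three sample paths from the question. It uses basic regular expressions, so "one or more non-slashes" is spelled [^/][^/]*:

```shell
# Keep everything up to and including the second "/" after "jdk".
# "|" is used as the s command delimiter so the slashes need no escaping.
# Text before "jdk" is not part of the match, so it is left untouched.
trim_after_jdk() {
    sed 's|\(jdk[^/]*/[^/][^/]*/\).*|\1|'
}

printf '%s\n' \
    '/usr/lib/jvm/jdk-1.7.0/2.0/zi/etc/GMT' \
    '/usr/lib/jvm/jdk1.7.2/3.0/zi/etc/GMT' \
    '/usr/share/servertool-java-openjdk/4.0/jce.jar' |
    trim_after_jdk | sort -u
```

With the real data you would feed output.txt into the same pipeline instead of the printf.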
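If you want to replace in the same file, you can use the script below:
#!/bin/bash
# Rewrite output.txt so that each line ends at the second "/" after "jdk".
while IFS= read -r line
do
    x=${line#/*jdk*/*/}      # what follows the shortest "/...jdk.../.../" prefix
    replace=${line%"$x"}     # the line with that trailing part removed
    printf '%s\n' "$replace"
done < output.txt > output.txt.tmp && mv output.txt.tmp output.txt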

Can someone break down this regular expression?

While looking for a way to format 'ifconfig' output and display only the network interface names, I found a regular expression that worked like a charm for OS X.
ifconfig -a | sed -E 's/[[:space:]:].*//;/^$/d'
How can I break down this regular expression so I can understand it?
Here is the sed command
s/[[:space:]:].*//;/^$/d
There is a semicolon in the middle, so it's actually two commands:
s/[[:space:]:].*//
/^$/d
First command is a substitution. What to substitute? It's between the 1st 2 slashes.
[[:space:]:].*
A character class [] matching any kind of whitespace or a colon, followed by zero or more (*) of any character (.). This matches everything in a line from the first whitespace or colon onward.
Substitute with what? Between the 2nd two slashes: s/...//: Nothing. The matched strings are deleted from each line.
This leaves the interface names, which start their lines. The other lines remain too, but they are now empty, because they start with whitespace.
How to remove these empty lines? That's the second command:
/^$/d
Find empty lines that match regex with nothing between start of line ^ and end of line $. Then delete them with command d.
All that's left are the interface names.
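To watch the two commands work one after the other, here is a sketch that runs them on made-up ifconfig-style text. The function fake_ifconfig and its four lines are invented for illustration; real ifconfig output differs between systems:

```shell
# Invented stand-in for ifconfig output: interface lines start at column 0,
# detail lines start with a tab.
fake_ifconfig() {
    printf 'lo0: flags=8049 mtu 16384\n'
    printf '\tinet 127.0.0.1 netmask 0xff000000\n'
    printf 'en0: flags=8863 mtu 1500\n'
    printf '\tether 00:11:22:33:44:55\n'
}

# First command only: cut each line at the first whitespace or colon.
# The indented lines become empty because they start with a tab.
fake_ifconfig | sed -E 's/[[:space:]:].*//'

# Both commands: the second one deletes the now-empty lines.
fake_ifconfig | sed -E 's/[[:space:]:].*//;/^$/d'
```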
This is more a sequence of commands than it is a regular expression, but I suppose breaking the sequence down may be instructive.
Read the manpage on ifconfig to find this
Optionally, the -a flag may be used instead of an interface name. This
flag instructs ifconfig to display information about all interfaces in
the system. The -d flag limits this to interfaces that are down, and
-u limits this to interfaces that are up. When no arguments are given,
-a is implied.
That's one part done. The pipe (|) sends what ifconfig would normally print to the standard output to the standard input of sed instead.
You're passing sed the option -E. Again, man sed is your friend and tells you that this option means
Interpret regular expressions as extended (modern) regular
expressions rather than basic regular expressions (BRE's). The
re_format(7) manual page fully describes both formats.
This isn't all you need though... The first string that you're giving sed lets it know which operation to perform.
Search the same manual for the word "substitute" to reach this
paragraph:
[2addr]s/regular expression/replacement/flags
Substitute the replacement string for the first instance of
the regular expression in the pattern space. Any character other than
backslash or newline can be used instead of a slash to delimit the RE
and the replacement. Within the RE and the replacement, the RE
delimiter itself can be used as a literal character if it is preceded
by a backslash.
Now we can run man 7 re_format to decode the first command s/[[:space:]:].*// which means "for each line passed to standard input, substitute the part matching the extended regular expression [[:space:]:].* with the empty string"
[[:space:]:] = match either a : or any character in the character class [:space:]
.* = match any character (.), zero or more times (*)
To understand the second command look for the [2addr]d part of the sed manual page.
[2addr]d
Delete the pattern space and start the next cycle.
Let's then look at the next command /^$/d which says "for each line passed to standard input, delete it if it corresponds to the extended regex ^$"
^$ = a line that contains no characters between its start (^) and its end ($)
We've discussed how to start with man pages and follow the clues to "decode" commands you see in everyday life.
Thanks Benjamin and Xufox for the resources. After taking a look, this is my conclusion:
s/[[:space:]:].*//;
[[:space:]:] this will search for spaces and/or : and begin the execution of the command, and this and anything that comes afterwards(hence the '.*') will be substituted by nothing (because the next thing is //, which in between should be what we would want to substitute for, which in this case is nothing.).
;
marks the end of the first command
and then we have
/^$/d
where ^$ means search for all empty spaces and d to delete them.
This is half wrong. Take a look at the other answer which gives you the complete and correct response! Thanks guys.

Ignoring strings without using the -v flag

I am trying to use egrep to find lines in a file that contain a certain word, but don't start with that word.
I am currently doing this:
egrep '^word|word' file.txt
I tried putting it in brackets with the ^ negation symbol, but brackets specify each letter individually rather than the word as a whole.
egrep '^[^word]|word' file.txt
How can I do this without using the -v flag? For example, I want to ignore every "The" at the beginning of a sentence but still spot the others.
All you need is:
grep '..*word' file
or:
grep -E '.+word' file
to find lines that contain word at a location other than the start of a line.
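A quick way to check this is to run it on a few made-up lines. The helper sample_lines and its contents are invented for this sketch:

```shell
# Hypothetical sample input.
sample_lines() {
    printf '%s\n' \
        'word at the start' \
        'a word in the middle' \
        'word and another word' \
        'nothing here'
}

# At least one character must precede "word", so a line whose only
# "word" sits at the very start is not selected.
sample_lines | grep '..*word'
```

Note that a line which starts with word and also contains it again later is still selected, because the later occurrence satisfies the pattern.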

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret)". I think the ".+" means "one or more of anything", but I'm not sure if that's correct and, if it is, what distinguishes it from ".*".
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceded by "//".
Thank you.
The first line of output tells you that the problem comes from bash (your shell). Bash finds the ! and attempts to substitute into your command the last command you entered that begins with \/\/. To avoid this, escape the ! or use single quotes. To see ! in action, try !cat: it will execute the last command you entered that began with cat.
You don't need to escape /, it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name. eg.
grep -v "^//" bar.txt | grep "#include"
If you're really hung up on using regular expressions, then a simple one would look like this (match start of string ^, any amount of whitespace [[:space:]]*, exactly two forward slashes /{2}, any number of any characters .*, followed by #include). Note that, as written, it matches the #include lines that are preceded by //, i.e. the commented-out ones:
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using a negative lookahead, which is a PCRE feature and requires the -P option.
Your negative lookahead won't work without a start anchor.
This will of course require GNU grep.
You must use single quotes to use ! in your regex; otherwise history expansion is attempted with the text after the !, which is the reason for the !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h matches a single horizontal whitespace character, so \h* matches zero or more of them.
Without -P, or with a non-GNU grep, you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
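To see how the [[:blank:]]* part behaves, here is a sketch on a slightly larger made-up input than bar.txt. The function sample_source and the two extra lines are invented for illustration:

```shell
# The two lines from the question plus two invented ones.
sample_source() {
    printf 'hello\n'
    printf '//world\n'
    printf '  // indented comment\n'
    printf 'a // trailing comment\n'
}

# Drops lines whose first non-blank characters are "//".
# A "//" later in the line does not count, so the last line is kept.
sample_source | grep -v '^[[:blank:]]*//'
```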
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. The regex will also pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
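A sketch of both variants on a small made-up source file. The function sample_c and its five lines are invented for this example:

```shell
# Invented C-like input covering the cases discussed above.
sample_c() {
    printf '#include <stdio.h>\n'
    printf '  #  include "local.h"\n'
    printf '// #include <disabled.h>\n'
    printf '#include MACRO_NAME\n'
    printf 'int main(void) { return 0; }\n'
}

# Full pattern: finds the first two lines only.
sample_c | grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'

# Without the trailing ["<] part: also finds the MACRO_NAME line,
# and still skips the line that starts with //.
sample_c | grep '^[[:space:]]*#[[:space:]]*include'
```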
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
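You have to use the -P (perl) option, and the lookahead needs a start anchor:
grep -P '^(?!//)' bar.txt
For the lines not beginning with "//", you could use (^[^/]{2}.*$), although that also rejects lines shorter than two characters and lines with a single / in either of the first two positions.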
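If neither -v nor -P is allowed, "does not begin with //" has to be written out as a list of the beginnings that are acceptable. The sketch below is one way to do that with a plain extended regex; the function name not_slash_slash and the sample lines are made up here, and this is not a pattern taken from any of the answers:

```shell
# A line qualifies if it is empty, starts with a non-slash, or starts
# with a single slash that is followed by end of line or a non-slash.
not_slash_slash() {
    grep -E '^($|[^/]|/($|[^/]))'
}

printf '%s\n' 'hello' '//world' '/usr/include' '/' | not_slash_slash
```

This keeps hello, /usr/include and the lone /, and drops //world.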
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a'
while with awk:
awk '/a/ && /b/'
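Both commands can be compared on a few made-up lines (sample_lines and its contents are invented for this sketch):

```shell
# Hypothetical input: two lines contain both letters, two contain only one.
sample_lines() {
    printf '%s\n' 'a then b' 'b then a' 'only a' 'only b'
}

# grep has to spell out both orders in one regex.
sample_lines | grep -E 'a.*b|b.*a'

# awk combines two independent tests with &&.
sample_lines | awk '/a/ && /b/'
```

Both commands select the same two lines, 'a then b' and 'b then a'.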

Regex (grep) for multi-line search needed [duplicate]

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is an attempt to match the same indentation as the line where the method starts.
As you can imagine, this search prints the main method in a C (*.c) source file.
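The same options can be adapted to the SQL search from the question. This is only a sketch: it assumes GNU grep built with PCRE support, the function name list_matching_sql is made up, and the two .sql files are invented test data written to a temporary directory:

```shell
# -z: read each file as one record; (?s): let . match newlines;
# -i: ignore case; -l: print only the names of matching files.
list_matching_sql() {
    grep -lPzi '(?s)select.*customerName.*from' "$@"
}

demo_dir=$(mktemp -d)
printf 'select\n\tcustomerName,\n\tcity\nfrom customers;\n' > "$demo_dir/multi.sql"
printf 'select id\nfrom orders;\n' > "$demo_dir/other.sql"

# Only multi.sql contains select ... customerName ... from.
list_matching_sql "$demo_dir"/*.sql

rm -rf "$demo_dir"
```

Because -l only reports file names, the trailing NUL that -z adds with -o is not an issue here.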
I am not very good with grep, but your problem can be solved using awk.
For example:
awk '/select/,/from/' *.sql
This prints every line from a line matching select through the next line matching from. You then need to check whether the returned statements contain customerName, which you can do by piping the result into awk or grep again.
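A sketch of that two-step idea on made-up SQL text (sample_sql and its contents are invented; note that awk and grep are case-sensitive here, so SELECT in capitals would need extra handling):

```shell
# Invented SQL input: one multi-line select plus an unrelated statement.
sample_sql() {
    printf 'select\n'
    printf '    customerName,\n'
    printf '    city\n'
    printf 'from customers;\n'
    printf 'update orders set total = 0;\n'
}

# Step 1: the range pattern prints the select ... from block.
sample_sql | awk '/select/,/from/'

# Step 2: check that block for customerName.
if sample_sql | awk '/select/,/from/' | grep -q 'customerName'; then
    echo 'select statement mentions customerName'
fi
```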
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
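Both problems can also be worked around without Perl by flattening each file to a single line first, so that grep sees the whole statement at once, and by using .* between the keywords so that commas, periods, blanks and quotes are all allowed. This is a rough sketch of that different technique, not the approach described in this answer; the function name has_customer_select and the demo files are made up:

```shell
# Succeeds if the file contains select ... customerName ... from,
# even when the statement is spread over several lines.
has_customer_select() {
    # Turn newlines, carriage returns and tabs into spaces, then search.
    tr '\n\r\t' '   ' < "$1" | grep -Eqi 'select.*customerName.*from'
}

demo_dir=$(mktemp -d)
printf 'SELECT\n\tcustomerName,\n\tcity\nFROM customers;\n' > "$demo_dir/multi.sql"
printf 'select id\nfrom orders;\n' > "$demo_dir/other.sql"

# Print the names of the files that contain such a statement.
for f in "$demo_dir"/*.sql; do
    if has_customer_select "$f"; then
        printf '%s\n' "$f"
    fi
done

rm -rf "$demo_dir"
```

Like the original regex, this can report a file where the three words appear in unrelated statements, since the greedy .* spans the whole file.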
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
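$/ = "\n\n";    # Read a paragraph at a time
while (<>)
{
    # /s lets . match newlines, /i ignores case
    if ($_ =~ m/SELECT.*customerName.*FROM/si)
    {
        print "$ARGV\n";    # name of the file currently being read
        close(ARGV);        # skip to the next file
    }
}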
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.