Say I have a line in a file "This is perhaps the easiest place to add new functionality." and I want to grep two words close to each other. I do
grep -ERHn "\beasiest\W+(?:\w+\W+){1,6}?place\b" *
that works and gives me the line. But when I do
grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" *
it fails, defeating the whole point of the {1,10}?
This pattern is listed on the regular-expressions.info site and in a couple of regex books. They do not describe it with grep, but that should not matter.
Update
I put the regex into a python script. Works, but doesn't have the nice grep -C thing ...
#!/usr/bin/python
# Makes print() behave the same under Python 2 and Python 3.
from __future__ import print_function
import re
import sys
import os

word1 = sys.argv[1]
word2 = sys.argv[2]
dist = sys.argv[3]

# Match word1 ... word2 or word2 ... word1 with at most `dist` words in between.
regex_string = (r'\b(?:'
                + word1
                + r'\W+(?:\w+\W+){0,'
                + dist
                + '}?'
                + word2
                + r'|'
                + word2
                + r'\W+(?:\w+\W+){0,'
                + dist
                + '}?'
                + word1
                + r')\b')
regex = re.compile(regex_string)

def findmatches(PATH):
    # Walk the directory tree and search every file it contains.
    for root, dirs, files in os.walk(PATH):
        for filename in files:
            fullpath = os.path.join(root, filename)
            with open(fullpath, 'r') as f:
                matches = regex.findall(f.read())
            for m in matches:
                print("File:", fullpath, "\n\t", m)

if __name__ == "__main__":
    findmatches(sys.argv[4])
Calling it as
python near.py charlie winning 6 path/to/charlie/sheen
works for me.
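The regex construction can also be packaged as a small helper. This is only a minimal sketch, not the script above: `build_near_regex` is a hypothetical name, and it additionally passes both words through `re.escape`, an assumption made here so that words containing regex metacharacters are matched literally.

```python
import re

def build_near_regex(word1, word2, dist):
    """Compile a pattern matching word1 and word2, in either order,
    with at most `dist` other words between them (hypothetical helper)."""
    w1, w2 = re.escape(word1), re.escape(word2)
    # Lazy gap: one separator run, then up to `dist` intervening words.
    gap = r'\W+(?:\w+\W+){0,%d}?' % int(dist)
    return re.compile(r'\b(?:' + w1 + gap + w2 + '|' + w2 + gap + w1 + r')\b')

regex = build_near_regex('charlie', 'winning', 6)
sample = 'charlie is always winning, and winning suits charlie'
found = regex.findall(sample)
for match in found:
    print(match)
```

Because the group is non-capturing, `findall` returns the full matched text rather than a sub-group.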
Do you really need the (?:...) structure? It is a non-capturing group (not a lookahead), and a plain group does the same job here.
Maybe this is enough:
grep -ERHn "\beasiest\W+(\w+\W+){1,10}new\b" *
Here is what I get:
echo "This is perhaps the easiest place to add new functionality." | grep -EHn "\beasiest\W+(\w+\W+){1,10}new\b"
(standard input):1:This is perhaps the easiest place to add new functionality.
Edit
As Camille Goudeseune said:
To make it easily usable, this can be added in a .bashrc:
grepNear() {
    grep -EHn "\b$1\W+(\w+\W+){1,10}$2\b"
}
Then at a bash prompt: echo "..." | grepNear easiest new
grep -E does not support the non-capturing groups of Perl-style (and Python) regular expressions. When you write something like (?:\w+\W+), you are asking grep to match a question mark ? followed by a colon : followed by one or more word chars \w+ followed by one or more non-word chars \W+. ? is a special character for grep regexes, for sure, but since it follows the beginning of a group, it is automatically escaped (in the same way that the regex [?] matches the question mark). This also explains why the first pattern in the question appeared to work: in grep -E the ? after {1,6} is not a laziness modifier but makes the preceding group optional, and place directly follows easiest, so the group never has to match.
Let us test it. I have the following file:
$ cat file
This is perhaps the easiest place to add new functionality.
grep does not match it with the expression you used:
$ grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" file
Then, I created the following file:
$ cat file2
This is perhaps the easiest ?:place ?:to ?:add new functionality.
Note that each word is preceded by ?:. In this case, your expression matches the file:
$ grep -ERHn "\beasiest\W+(?:\w+\W+){1,10}?new\b" file2
file2:1:This is perhaps the easiest ?:place ?:to ?:add new functionality.
The solution is to remove the ?: of the expression:
$ grep -ERHn "\beasiest\W+(\w+\W+){1,10}?new\b" file
file:1:This is perhaps the easiest place to add new functionality.
Since you do not need a non-capturing group here (at least as far as I can see), dropping it causes no problem.
Bonus point: you can simplify your expression by changing {1,10} to {0,10} and removing the ? that follows it:
$ grep -ERHn "\beasiest\W+(\w+\W+){0,10}new\b" file
file:1:This is perhaps the easiest place to add new functionality.
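For comparison, Python's `re` module does accept `(?:...)` and the lazy `{1,10}?`, so the pattern from the question works there unchanged. The sketch below runs both the original and the simplified pattern against the sample sentence; the variable names are illustrative only.

```python
import re

line = "This is perhaps the easiest place to add new functionality."

# Pattern from the question: non-capturing group plus lazy quantifier.
lazy = re.compile(r"\beasiest\W+(?:\w+\W+){1,10}?new\b")
# Simplified pattern from the answer: plain group, greedy {0,10}.
simple = re.compile(r"\beasiest\W+(\w+\W+){0,10}new\b")

lazy_match = lazy.search(line).group(0)
simple_match = simple.search(line).group(0)
print(lazy_match)
print(simple_match)
```

Both patterns select the same span here, because only one occurrence of "new" follows "easiest" in the sentence.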
Related
I have this config file with entry names encased in brackets: []. I need to extract each entry name into a list or variable to be used in a for loop. Still new and fumbling with some commands. I have a feeling grep is my answer but I don't know where to start. Any help would be appreciated.
[dropbox]
type = dropbox
scope = dropbox
token = {"access_token":"my_token"}
[drive2]
type = drive
scope = drive
token = {"access_token":"other_token"}
You can use sed:
sed -rn 's/(^\[)(.*)(\]$)/\2/p' configfile
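-r enables extended regular expressions and -n suppresses the automatic printing of every line. The pattern splits each line of the file (configfile) into three groups: [ at the start of the line, then anything (.*), then ] at the end of the line. The substitution replaces the whole line with just the second group, and the p flag prints the result.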
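If the names are needed inside a script rather than a shell loop, the same capture-group idea can be sketched in Python. This assumes the config text has already been read into a string; `config_text` and `section_names` are illustrative names, not part of the question.

```python
import re

config_text = """[dropbox]
type = dropbox
scope = dropbox
token = {"access_token":"my_token"}
[drive2]
type = drive
scope = drive
token = {"access_token":"other_token"}
"""

# re.MULTILINE makes ^ and $ match at every line boundary,
# which mirrors sed processing the file one line at a time.
section_names = re.findall(r'^\[(.*)\]$', config_text, flags=re.MULTILINE)
for name in section_names:
    print(name)
```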
You can use GNU grep:
printf '[dropbox]\ntype = dropbox\n' | grep -Po '\[\K[^\]]*'
# Prints: dropbox
Here, grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only, 1 match/line, not the entire lines.
\[\K[^\]]* : literal [, escaped, which is followed by the special character \K that tells the regex engine to pretend that the match starts at that point, which is followed by any non-] character, repeated 0 or more times ([^\]]*).
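Python's standard `re` module has no `\K`, but a lookbehind gives the same effect of requiring the `[` without including it in the match. A minimal sketch, with `text` and `names` as illustrative names:

```python
import re

text = "[dropbox]\ntype = dropbox"

# (?<=\[) asserts that a literal [ sits just before the match
# without making it part of the matched text.
names = re.findall(r'(?<=\[)[^\]]*', text)
print(names)
```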
SEE ALSO:
grep manual
I have a large dictionary file that contains one word per line.
I want to extract all lines that contain only one kind of vowel, so "see" and "best" and "levee" and "whenever" would be extracted, but "like" or "house" or "and" wouldn't. I don't mind going over the file a few times, changing the vowel I'm looking for each time.
This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt
returns no words containing any other vowels but E, but it also gives me words like BBC or BMW. How can I make the contained vowel a requirement?
How about
grep -i '^[^aiou]*e[^aiou]*$'
?
Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.
awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
        c = "[b-df-hj-np-tv-z]"
        for (v in vowel)
            regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
    $0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
            hit[v] = ( hit[v] ? hit[v] ORS : "") $0
            next } }
    END { for (v in vowel) {
            printf "=== %s ===\n", vowel[v]
            print hit[v] } }' /usr/share/dict/words
You'll notice that it prints words with syllabic y like jolly and cycle. A more complex regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.
The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but that would collect all matches in the same bucket.
Demo: https://ideone.com/wNrvPu
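For illustration, here is how the bucketing could look in a language that does support backreferences. This is a sketch in Python, not a translation of the Awk script: the pattern, the sample word list, and the name `bucket_by_vowel` are assumptions, and like the Awk version it treats y as a consonant.

```python
import re
from collections import defaultdict

# Capture the first vowel; after it, only non-vowels or that same
# vowel (the backreference \1) may appear until the end of the word.
SINGLE_VOWEL = re.compile(r'^[^aeiou]*([aeiou])(?:[^aeiou]|\1)*$')

def bucket_by_vowel(words):
    """Group words that contain exactly one kind of vowel by that vowel."""
    buckets = defaultdict(list)
    for word in words:
        m = SINGLE_VOWEL.match(word.lower())
        if m:
            buckets[m.group(1)].append(word)
    return dict(buckets)

result = bucket_by_vowel(
    ["see", "best", "levee", "whenever", "like", "house", "and", "BBC"])
for vowel, words in sorted(result.items()):
    print("=== %s ===" % vowel)
    print("\n".join(words))
```

Words without any vowel, such as BBC, fail the match because the capture group requires at least one vowel.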
Using -P (perl) option:
^(?=.*e)[^aiou]+$
Explanation:
^ # beginning of line
(?=.*e) # positive lookahead, make sure we have at least one "e"
[^aiou]+ # 1 or more characters, none of which is a, i, o or u
$ # end of line
cat file.txt
see
best
levee
whenever
like
house
and
BBC
BMW
grep -P '^(?=.*e)[^aiou]+$' file.txt
see
best
levee
whenever
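Since the question mentions repeating the search for each vowel, the lookahead pattern can be generated per vowel. A minimal sketch in Python; `single_vowel_pattern` is a hypothetical helper name, and the word list is the sample file from above.

```python
import re

VOWELS = "aeiou"

def single_vowel_pattern(vowel):
    """Build ^(?=.*V)[^others]+$ for the given vowel V (hypothetical helper)."""
    others = VOWELS.replace(vowel, "")
    return re.compile(r'^(?=.*%s)[^%s]+$' % (vowel, others))

words = ["see", "best", "levee", "whenever", "like", "house", "and", "BBC", "BMW"]

only_e = [w for w in words if single_vowel_pattern("e").search(w)]
only_a = [w for w in words if single_vowel_pattern("a").search(w)]
print(only_e)
print(only_a)
```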
I want to grep the shortest match and the pattern should be something like:
<car ... model=BMW ...>
...
...
...
</car>
... means any character and the input is multiple lines.
You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?.
By default grep doesn't support non-greedy modifiers, but you can use grep -P to use the Perl syntax.
Actually, .*? only works with Perl syntax. I am not sure what the equivalent in grep's extended regexp syntax would be. Fortunately, you can use Perl syntax with grep, so grep -P would work, but grep -E (which is the same as egrep) would not: it would stay greedy.
See also: http://blog.vinceliu.com/2008/02/non-greedy-regular-expression-matching.html
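The difference between greedy and lazy quantifiers is easy to see in any engine that supports both. A small Python sketch, with a made-up input string:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the end of the line and backtracks to the last >.
greedy = re.findall(r'<.*>', text)
# Lazy: .*? stops at the first > that allows the pattern to match.
lazy = re.findall(r'<.*?>', text)

print(greedy)
print(lazy)
```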
grep
For non-greedy match in grep you could use a negated character class. In other words, try to avoid wildcards.
For example, to fetch all links to jpeg files from the page content, you'd use:
grep -o '"[^" ]\+\.jpg"'
To deal with multiple lines, pipe the input through xargs first. For performance, use ripgrep.
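The effect of the negated character class can be checked against a greedy wildcard on the same input. A Python sketch with an invented HTML fragment; note that Python writes `+` where basic grep syntax writes `\+`.

```python
import re

html = '<img src="a.jpg"> <img src="b.png"> <a href="dir/c.jpg">'

# Negated class: a match can never run past a quote or a space.
links = re.findall(r'"[^" ]+\.jpg"', html)
# Greedy wildcard for contrast: swallows everything up to the last .jpg"
greedy = re.findall(r'".*\.jpg"', html)

print(links)
print(greedy)
```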
My grep that works after trying out stuff in this thread:
echo "hi how are you " | grep -shoP ".*? "
Just make sure you append a space to each one of your lines
(Mine was a line by line search to spit out words)
Sorry I am 9 years late, but this might work for the viewers in 2020.
So suppose you have a line like "Hello my name is Jello".
Now you want to find the words that start with 'H' and end with 'o', with any number of characters in between. We don't want whole lines, just the words, so we can use the expression:
grep -o "H[^ ]*o" file
This returns all such words (-o prints only the matched part instead of the whole line). It works because [^ ] allows any character except a space in between, so a match cannot span several words on the same line.
You can replace the space with any other separator character you want.
Suppose the initial line was "Hello-my-name-is-Jello"; then you can get the words using the expression:
grep -o "H[^-]*o" file
The short answer is using the next regular expression:
(?s)<car .*? model=BMW .*?>.*?</car>
(?s) - makes . match newline characters too, so a match can span multiple lines
.*? - matches any character, any number of times, in a lazy way (minimal match)
A (little) more complicated answer is:
(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>
This makes it possible to match car1 and car2 in the following text:
<car1 ... model=BMW ...>
...
...
...
</car1>
<car2 ... model=BMW ...>
...
...
...
</car2>
(..) represents a capturing group
\1 in this context matches the same text as most recently matched by capturing group number 1
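The backreference version can be tried out in Python, whose `re` module accepts the same `(?s)` flag and `\1` syntax. In this sketch the attribute values and element bodies are invented stand-ins for the `...` placeholders above.

```python
import re

text = """<car1 year=2001 model=BMW color=red>
first entry
</car1>
<car2 year=2005 model=BMW color=blue>
second entry
</car2>"""

# (?s) lets . cross newlines; \1 forces the closing tag to repeat
# whatever tag name group 1 captured in the opening tag.
pattern = re.compile(r'(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>')

matches = list(pattern.finditer(text))
tags = [m.group(1) for m in matches]
print(tags)
```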
I know that it's a bit of a dead post, but I just noticed that this works. It removed both clean-up and cleanup from my output.
> grep -v -e 'clean\-\?up'
> grep --version
grep (GNU grep) 2.20
I have a string like this one below (nvram extract) that is used by tinc VPN to define the network hosts:
1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>
I'm trying to extract from the above:
host1.network.org
host2.network.org
host3.network.org
The hostnames and keys are made up, but the structure of the input string is accurate. By the way, the end node could just as well be defined as an IP address, so I'm trying to extract what's in between the second occurrence of "<" and the first occurrence of "<<". Since there are multiple matches, the occurrences are counted from either the beginning of the line or the ">" character. So the above could be read as follows:
1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>
1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>
0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>
As I need this info in a shell script, I guess I would need to store each host/IP as an element of an array.
I have used regexp online editors, and managed to work out this string:
^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)
However, if I run
grep -Eo '^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)'
against the initial string, I get the full string in return, so I must be doing something wrong :-/
P.S. running on BusyBox:
`BusyBox v1.25.1 (2017-05-21 14:11:58 CEST) multi-call binary.
Usage: grep [-HhnlLoqvsriwFE] [-m N] [-A/B/C N] PATTERN/-e PATTERN.../-f FILE [FILE]...
Search for PATTERN in FILEs (or stdin)
-H Add 'filename:' prefix
-h Do not add 'filename:' prefix
-n Add 'line_no:' prefix
-l Show only names of files that match
-L Show only names of files that don't match
-c Show only count of matching lines
-o Show only the matching part of line
-q Quiet. Return 0 if PATTERN is found, 1 otherwise
-v Select non-matching lines
-s Suppress open and read errors
-r Recurse
-i Ignore case
-w Match whole words only
-x Match whole lines only
-F PATTERN is a literal (not regexp)
-E PATTERN is an extended regexp
-m N Match up to N times per file
-A N Print N lines of trailing context
-B N Print N lines of leading context
-C N Same as '-A N -B N'
-e PTRN Pattern to match
-f FILE Read pattern from file`
Thanks!
OK, no response to my comment so I'll enter it as answer. How about
\w*[a-z]\w*(\.\w*[a-z]\w*)+
It matches at least two parts of a fully qualified name, separated by a dot.
grep -Eo '\w*[a-z]\w*(\.\w*[a-z]\w*)+'
yields
host1.network.org
host2.network.org
host3.network.org
(assuming your string is entered in stdin ;)
The regex you have is based on capturing groups and with grep you can only get full matches. Besides, you use -E (POSIX ERE flavor), while your regex is actually not POSIX ERE compatible as it contains lazy quantifiers that are not supported by this flavor.
I think you can extract all non-< chars between < and << followed with a digit and then a < with a PCRE regex (-P option):
s='1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>'
echo "$s" | grep -oP '(?<=<)[^<]+(?=<<[0-9]<)'
See the regex demo and a grep demo.
Output:
host1.network.org
host2.network.org
host3.network.org
Here, (?<=<) is a positive lookbehind that only checks for the < presence immediately to the left of the current location but does not add < to the match value, [^<]+ matches 1+ chars other than < and (?=<<[0-9]<) (a positive lookahead) requires <<, then a digit, and then a < but again does not add these chars to the match.
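The same lookbehind/lookahead pattern can be checked in Python, which also returns the hosts as a list, the analogue of the array the question asks about. This is only a sketch for testing the regex; the variable names are illustrative, and the input is the sample string from the question split into its three records for readability.

```python
import re

records = [
    '1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = '
    '8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>',
    '1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = '
    'irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>',
    '0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = '
    'wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>',
]
s = ''.join(records)  # the real input is one unbroken string

# Lookbehind requires a preceding <; lookahead requires <<digit< after.
hosts = re.findall(r'(?<=<)[^<]+(?=<<[0-9]<)', s)
for host in hosts:
    print(host)
```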
If you have no PCRE option in grep, try replacing all the text you do not need with some char, and then either split with awk, or use grep:
echo "$s" | \
sed 's/[^<]*<[^<]*<\([^<][^<]*\)<<[0-9]<[^<]*<<[^<]*[<>]*/|\1/g' | \
grep -oE '[^|]+'
See another online demo.
Can anyone explain to me how the regular expression works in the sed substitute command?
$ cat path.txt
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:
/opt/omni/lbin:/opt/omni/sbin:/root/bin
$ sed 's/\(\/[^:]*\).**/\1/g' path.txt
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
The above sed command uses the back-reference and save operator concepts.
Can anyone explain to me how the regular expression, especially /[^:]*, works in the substitute command to get only the first path in each line?
I think you wrote an extra asterisk * in your sed code, so it should be like this:
$ sed 's/\(\/[^:]*\).*/\1/g' file
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
Changing the delimiter helps to understand it a little better:
sed 's#\(/[^:]*\).*#\1#g'
The s#something#otherthing#g is a basic sed command that looks for something and replaces it with otherthing, for every occurrence on every line.
If you do s#\(something\)#\1#g then you "save" that something and can print it back with \1.
Hence, what it does is match a pattern like /[^:]* and then print it back. /[^:]* means / followed by any number of characters other than :. So it matches / plus the rest of the string up to the first colon :. It stores that piece of the string and prints it back, while .* matches the rest of the line, which is dropped.
Small examples:
# get the leading lowercase letters
$ echo "hello123bye" | sed 's#\([a-z]*\).*#\1#g'
hello
# get everything until it finds the number 3
$ echo "hello123bye" | sed 's#\([^3]*\).*#\1#g'
hello12
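The capture-and-replace step behaves the same way in Python's `re.sub`, which makes it easy to experiment with. A minimal sketch using the lines of path.txt from the question; `first_paths` is an illustrative name. Python writes the group as `( )` where basic sed syntax writes `\( \)`.

```python
import re

lines = [
    "/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin",
    "/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:",
    "/opt/omni/lbin:/opt/omni/sbin:/root/bin",
]

# Group 1 captures / plus everything up to the first colon;
# .* consumes the rest of the line; \1 keeps only the capture.
first_paths = [re.sub(r'(/[^:]*).*', r'\1', line) for line in lines]
for path in first_paths:
    print(path)
```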
[^:]*
in regex would match all characters except for :, so it would match until this:
/usr/kbos/bin
also it would match these,
/usr/local/bin
/usr/jbin
/usr/bin
/usr/sas/bin
since all of these contain only characters that are not :.
.* matches any character, zero or more times.
Thus, the regex [^:]*.* would match all of these expressions:
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/jbin:/usr/bin:/usr/sas/bin
/usr/bin:/usr/sas/bin
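However, you get only the first field (i.e., /usr/kbos/bin) by using the back-reference in sed, because the regex matches at the leftmost possible position: the group \(\/[^:]*\) captures everything up to the first colon, .* consumes the rest of the line, and the replacement \1 keeps only the captured part.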