I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
I want to grep only the bolded words.
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?
You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.
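As a minimal sketch, the same pattern can be wrapped in a shell function so it works for any path component. The helper name match_component is hypothetical (it is not part of the answer above), and it assumes the word passed in contains no regex metacharacters.

```shell
# Hypothetical helper: print lines in which $1 is a complete path component.
# Assumes $1 contains no regex metacharacters.
match_component() {
  grep -E "(^|/)$1(\$|/)"
}

result=$(printf '%s\n' /superman /superman1 /batman/superman /superman2/batman |
  match_component superman)
printf '%s\n' "$result"
```

With these four sample lines, only /superman and /batman/superman should be printed.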
grep '/superman\>'
\> is the "end of word" marker; in "superman3" the word does not end right after "man", so that line is not matched.
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character
In the case where the line is /batman/superman,
the pattern superman/ does not appear
the pattern /superman is:
at the end of the line, which is OK, but
is preceded by the character "n" which is a word constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
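The difference described above can be reproduced with a short sketch. The sample lines are taken from the question; the variable names are only illustrative.

```shell
lines='/superman
/superman1
/batman/superman
/superman2/batman'

# -w with the leading slash: in /batman/superman the match is preceded
# by the word character "n", so that line is rejected.
with_slash=$(printf '%s\n' "$lines" | grep -w '/superman')

# -w without the slash: "superman" is preceded by "/" (a non-word
# character) and followed by end of line, so both lines pass.
plain=$(printf '%s\n' "$lines" | grep -w 'superman')

printf 'with slash:\n%s\nwithout slash:\n%s\n' "$with_slash" "$plain"
```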
I've files with below content:
76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
417435aceb671e41213697055b86d860d9a9a61c remotes/origin/feature/ASC-122356-3762
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
I want to search if origin/master pattern is there or not in the file.
I'm currently doing like grep -e '^\S\+ origin/master$' but it's not correct. How can I do it?
The following works with grep: a positive number of non-space characters, followed by a space, followed by a possibly empty sequence of non-space characters, followed by the expected string.
grep -P '\S+ \S*origin/master$' test
This can be improved to make sure origin is either at the beginning of the second column or preceded by a /, which eliminates strings like remotes/backup-origin/master:
grep -P '^\S+ (|\S*/)origin/master$' test
Note that these expressions require -P (Perl-compatible regular expressions).
Your pattern uses '^\S\+ ' which, because of the '^', requires everything before origin/master to be non-space characters followed by a single space, so the remotes/ prefix prevents a match.
Consider a similar version that asks only for ONE space before the ref name:
grep -e ' \S\+origin/master$'
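Where -P or \S is not available, the same check can be sketched with POSIX character classes and grep -E. The last sample line is a hypothetical near miss added for illustration, and the variable names are assumptions.

```shell
# Sample data from the question plus one hypothetical near-miss line.
refs='76a6f0f631888fbd359420796093d19a3928123d remotes/origin/feature/ASC-122356
ae863a41fef068215be1529216e9dbba1314fa6f remotes/origin/master
0000000000000000000000000000000000000000 remotes/backup-origin/master'

# POSIX ERE: hash, one space, optional "something/" prefix, then origin/master.
pattern='^[^[:space:]]+ ([^[:space:]]*/)?origin/master$'

matched=$(printf '%s\n' "$refs" | grep -E "$pattern")

# -q only reports through the exit status, which suits an existence check.
if printf '%s\n' "$refs" | grep -Eq "$pattern"; then
  found=yes
else
  found=no
fi
printf '%s\n%s\n' "$found" "$matched"
```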
I have tens of long text files (10k - 100k records each) where some characters were lost by careless handling and got replaced with question marks. I need to build a list of corrupted words.
I'm sure the most effective approach would be to regex the file with sed or awk or some other bash tools, but I'm unable to compose a regex that would do the trick.
Here are a couple of sample records for processing:
?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.
Desired output would be:
?ilkin
?igadlo-?van
Veli?anõ
My best result so far seems to retrieve only words from the beginning of records:
awk '$1 ~/\?/ {print $1}' test.txt
->
?ilkin,
?igadlo-?van,
I need to build a list of corrupted words
If the aim is only to search for matches, grep would be the fastest and most powerful tool:
grep -Po '(^|)([^?\s]*?\?[^\s,]*?)(?=\s|,|$)' test.txt
The output:
?ilkin
?igadlo-?van
Veli?anõ
Explanation:
-P option, allows Perl regular expressions
-o option, tells to print only matched substrings
(^|) - matches the start of the string or an empty value (we can't use the word boundary anchor \b in this case because the question mark ? is a non-word character, so \b would fall between ? and the letters next to it)
[^?\s]*? - matches any characters except ? and whitespace \s, if present
\?[^\s,]*? - matches a question mark ? followed by any characters except whitespace \s and comma , (which can sit at the right word boundary)
(?=\s|,|$) - lookahead positive assertion, ensures that a needed substring is followed by either whitespace \s, comma , or placed at the end of the string
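Since the question started from awk, here is a minimal awk sketch that avoids -P. It assumes words are separated only by spaces and commas, as in the two sample records; the variable names are illustrative.

```shell
records='?ilkin, Aleksandr, Zahhar, isa
?igadlo-?van, Maria, Karl, abikaasa, 27.10.45, Veli?anõ raj.'

# Split each record on runs of spaces and commas, then keep the
# fields that contain a literal question mark.
corrupted=$(printf '%s\n' "$records" |
  awk -F'[ ,]+' '{ for (i = 1; i <= NF; i++) if (index($i, "?") > 0) print $i }')
printf '%s\n' "$corrupted"
```

Using index() instead of a regex match sidesteps the need to escape the question mark.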
I need to find repeated words in a file using egrep (or grep -E) in unix (bash)
I tried:
egrep "(\<[a-zA-Z]+\>) \1" file.txt
and
egrep "(\b[a-zA-Z]+\b) \1" file.txt
but for some reason these consider things to be repeats that aren't!
for example, it thinks the string "word words" meets the criteria despite the word boundary condition \> or \b.
\1 matches whatever string was matched by the first capture. That is not the same as matching the same pattern as was matched by the first capture. So the fact that the first capture matched on a word boundary is no longer relevant, even though the \b is inside the capture parentheses.
If you want the second instance to also be on a word boundary, you need to say so:
egrep "(\b[a-zA-Z]+) \1\b" file.txt
That is no different from:
egrep "\b([a-zA-Z]+) \1\b" file.txt
The space in the pattern forces a word boundary, so I removed the redundant \bs. If you wanted to be more explicit, you could put them in:
egrep "\<([a-zA-Z]+)\> \<\1\>" file.txt
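A small sketch of the difference, assuming a grep that supports \b and back-references in -E mode (GNU grep does). The sample text and variable names are made up for illustration.

```shell
sample='word words
and and
no repeats here'

# Boundary only inside the capture: \1 may still match a prefix of "words".
loose=$(printf '%s\n' "$sample" | grep -E '(\b[a-zA-Z]+\b) \1')

# Boundary after the back-reference as well: only true repeats match.
strict=$(printf '%s\n' "$sample" | grep -E '\b([a-zA-Z]+) \1\b')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```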
I use
pcregrep -M '(\b[a-zA-Z]+)\s+\1\b' *
to check my documents for such errors. This also works if there is a line break between the duplicated words.
Explanation:
-M, --multiline run in multiline mode (important if a line break is between the duplicated words).
[a-zA-Z]+: Match words
\b: Word boundary, see tutorial
(\b[a-zA-Z]+) group it
\s+ match at least one (but as many more as necessary) whitespace characters. This includes newline.
\1: Match whatever was in the first group
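If pcregrep is not installed, repeats across line breaks can be roughly approximated with awk. This is only a sketch under a simplifying assumption: it compares whitespace-separated tokens exactly, so "words." and "words" count as different. The sample text and variable names are hypothetical.

```shell
text='this is is
a test
test again'

# Remember the previous token, even across lines, and report a token
# that equals the one before it. Appending "" forces string comparison.
dupes=$(printf '%s\n' "$text" |
  awk '{ for (i = 1; i <= NF; i++) { if (($i "") == (prev "")) print NR ": " $i; prev = $i } }')
printf '%s\n' "$dupes"
```

The number printed is the line on which the second copy of the word appears.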
This is the expected behaviour. See what man grep says:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the
beginning and end of a word. The symbol \b matches the empty string at
the edge of a word, and \B matches the empty string provided it's not
at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and
\W is a synonym for [^_[:alnum:]].
and then in another place we see what "word" is:
Matching Control
Word-constituent characters are letters, digits, and the underscore.
So this is what it will produce:
$ cat a
hello bye
hello and and bye
words words
this are words words
"words words"
$ egrep "(\b[a-zA-Z]+\b) \1" a
hello and and bye
words words
this are words words
"words words"
$ egrep "(\<[a-zA-Z]+\>) \1" a
hello and and bye
words words
this are words words
"words words"
egrep "(\<[a-zA-Z]+\>) \<\1\>" file.txt
fixes the problem.
basically, you have to tell \1 that it needs to stay in word boundaries too
Is there a way to use extended regular expressions to find a specific pattern that ends with a string.
I mean, I want to match first 3 lines but not the last:
file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
I know that in grep, metacharacter $ matches the end of a line but I'm not interested in matching a line end but string end. Groups in grep are very odd, I don't understand them well yet.
I tried with group matching, actually I have a similar REGEX but it does not work with grep -E
(\w+).pdf$
Is there a way to do string ending match in grep/egrep?
Your example also works if you match the space after the string:
grep -E '\.pdf ' input.txt
What you call a "string" is similar to what grep calls a "word". A word is a run of alphanumeric characters (plus the underscore). The nice thing about words is that you can match a word end with the special \>, which matches the end of a word with a zero-length match. That also matches at the end of a line. But the set of word characters cannot be changed and does not contain punctuation, so we cannot use it here.
If you need to match at the end of line too, where there is no space after the word, use:
grep -E '\.pdf |\.pdf$' input.txt
To include cases where the character after the file name is not a space character ' ' but other whitespace, like a tab (\t), or where the name is directly followed by a comment starting with #, use:
grep -E '\.pdf[[:space:]#]|\.pdf$' input.txt
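If the file name is always the first whitespace-separated field, an alternative sketch is to let awk test the end of that field instead of the end of the line. This assumes names contain no spaces and are separated from the comment by whitespace; it does not cover a # glued directly to the name.

```shell
input='file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations'

# $ inside the regex anchors to the end of field 1, not the end of the line.
pdf_lines=$(printf '%s\n' "$input" | awk '$1 ~ /\.pdf$/')
printf '%s\n' "$pdf_lines"
```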
I will illustrate the matching of word boundaries too, because that would be the perfect solution, except that we cannot use it here because we cannot change the set of characters that are seen as parts of a word.
The input contains foo as separate word, and as part of longer words, where the foo is not at the end of the word, and therefore not at a word boundary:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n'
foo bar
foo.bar
foobar
foo_bar
foo
Now, to match the boundaries of words, we can use \< for the beginning, and \> to match the end:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n' | grep 'foo\>'
foo bar
foo.bar
foo
Note how _ is matched as a word char - but otherwise, wordchars are only the alphanumerics, [a-zA-Z0-9].
Also note how foo at the end of the line is matched, in the line containing only foo. We do not need a special case for the end of line.
You can use \> operator
grep 'word\>' fileName
You need to escape the . in your regex. This regex will match anything that ends in .pdf (and only things that end in .pdf):
.*\.pdf$
Positive lookaheads are well suited for this kind of thing. Give this a try:
grep -P "(^\w+\.pdf)(?=\s)" file
I assume filenames will always be on the start of the line.
My testfile is:
PolicyChain:ComplementaryUser Caught
PolicyChain:SourceIP Caught
My regex is:
cat testfile | grep -E -o '[^PolicyChain:].+?'
It matches:
mplementaryUser Caught
SourceIP Caught
I'm ultimately just trying to match the string after the colon but before the space. Please help??
[^PolicyChain:] is a character class that matches one character that is NOT (as indicated by the ^) among P,o,l,i,c,y,C,h,a,i,n or :.
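Then you match one or more characters with .+? (intended as lazy; POSIX ERE has no lazy quantifiers, so grep -E still matches to the end of the line).
Since the regex has to start by matching a character that is not in that set, it cannot start matching at the C or the o of ComplementaryUser (both are in the set), so the match begins at m.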
I suggest that your decision to use a character class is an error, and you want a positive lookbehind instead, such as (?<=^PolicyChain:): http://www.regular-expressions.info/refadv.html
A positive lookbehind means, 'look behind my current position and attempt to match this lookbehind regex. If it does match, we can continue with the rest of the main regex. If it does not match, we cannot continue.'
However, note that lookaheads and lookbehinds are not POSIX-compliant, and you must use a Perl-compatible regex engine (PCRE) to have them. (Or .NET, Python, Java, Ruby...)
Try this instead.
cat testfile | sed -e "s/.*:\([^ ][^ ]*\).*/\1/"
You can simply use cut (the second cut drops everything from the first space onward):
echo "PolicyChain:ComplementaryUser Caught" | cut -d: -f2 | cut -d' ' -f1
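The lookbehind discussed above needs PCRE; a portable sketch of the same extraction is to let awk split each line on both the colon and the space. It assumes every line has the shape prefix:name rest, as in the test file; the variable name is illustrative.

```shell
# Field separator is "either a colon or a space", so field 2 is the
# text after the colon and before the first space.
names=$(printf '%s\n' 'PolicyChain:ComplementaryUser Caught' 'PolicyChain:SourceIP Caught' |
  awk -F'[: ]' '{ print $2 }')
printf '%s\n' "$names"
```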