Grep Regex Match Anything Including Newlines - regex

I'm using the following command:
grep -o '<tag.*tag>\|^--.*'
However, I need to match newlines as well. How can I do this?
I have tried a few of the following options found: (?s).* (.*?) .*? (.* )?
I have to use grep because awk and other commands will take too long on longer lines in my data.
Anyone know how to fix my code to match newlines too? Your input is highly appreciated.

Actually, I got it, needed to add the -P option and change the command to:
grep -oP '(<tag.*?tag>|^--.*)'
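By using -P, I was able to use the non-greedy .*? pattern while also keeping my OR condition. (Note that grep still reads its input one line at a time, so . only crosses a newline if the whole input is read as a single record, e.g. with GNU grep's -z option.)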
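For matches that really span several lines, here is a minimal sketch, assuming GNU grep built with PCRE support. The -z option makes grep read NUL-separated records, so an ordinary text file becomes one record, and (?s) lets . match a newline. The helper name extract_tags and the sample input are made up for illustration, and only the tag alternative of the original pattern is covered.

```shell
# Sketch only: assumes GNU grep with -P (PCRE) support.
# extract_tags is a hypothetical helper name, not from the original post.
extract_tags() {
    # -z: read NUL-separated records, so a normal text file is one record
    # (?s): let . match newlines as well
    # with -z, each -o match is terminated by NUL, so turn NULs into newlines
    grep -zoP '(?s)<tag.*?tag>' | tr '\0' '\n'
}

# Made-up sample: the tag opens on one line and closes on the next.
printf '%s\n' 'before <tag a' 'b tag> after' 'plain' | extract_tags
```

Because -z loads the whole input as one record, this trades memory for the ability to match across lines.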

Related

Can I perform a 'non-global' grep and capture only the first match found for each line of input?

I understand that what I'm asking can be accomplished using awk or sed, I'm asking here how to do this using GREP.
Given the following input:
.bash_profile
.config/ranger/bookmarks
.oh-my-zsh/README.md
I want to use GREP to get:
.bash_profile
.config/
.oh-my-zsh/
Currently I'm trying
grep -Po '([^/]*[/]?){1}'
Which results in output:
.bash_profile
.config/
ranger/
bookmarks
.oh-my-zsh/
README.md
Is there some simple way to use GREP to only get the first matched string on each line?
I think you can grep for the leading non-slash characters like this:
grep -Eo '^[^/]+'
On another SO site there is another similar question with solution.
You don't need grep for this at all.
cut -d / -f 1
The -o option says to print every substring which matches your pattern, instead of printing each matching line. Your current pattern matches every string which doesn't contain slashes (optionally including a trailing slash); but it's easy to switch to one which only matches this pattern at the beginning of a line.
grep -o '^[^/]*' file
Notice the addition of the ^ beginning of line anchor, and the omission of the -P option (which you were not really using anyway) as well as the silly beginner error {1}.
(I should add that plain grep uses POSIX BRE syntax, in which unescaped parentheses and braces are literal characters; grep -E would support these constructs just fine, or you could stay with the POSIX BRE variation, which requires a backslash to use round or curly brackets as metacharacters. You can probably ignore these details and just use grep -E everywhere unless you really need the features of grep -P, though also be aware that -P is not portable.)
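If the trailing slash from the desired output should be kept, an anchored ERE pattern with an optional slash is one way to do it. This is a sketch using the sample input from the question; first_component is a hypothetical helper name.

```shell
# Hypothetical helper: print the first path component of each input line,
# keeping the trailing slash when there is one.
first_component() {
    # ^ pins the match to the start of the line, so -o prints
    # at most one match per line; /? keeps an optional trailing slash
    grep -oE '^[^/]*/?'
}

printf '%s\n' .bash_profile .config/ranger/bookmarks .oh-my-zsh/README.md | first_component
```

For the three sample lines this should print .bash_profile, .config/ and .oh-my-zsh/, one per line.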

bash regex for word with some suffixes but not one specific

I need (case-insensitive) all matches of several variations on a word--except one--including unknowns.
I want
accept
acceptance
acceptable
accepting
...but not "acception." A coworker used it when he meant "exception." A lot.
Since I can't anticipate the variations (or typos), I need to allow things like "acceptjunk" and "acceptMacarena"
I thought I could accomplish this with a negative lookahead, but this didn't give the results I needed
grep -iE '(?!acception)(accept[a-zA-Z]*)[[:space:]]' file
The trick is that I can accept (har) lines that contain "acception," provided that the other words match. For example this line is okay to match:
The acceptance of the inevitable is the acception.
...otherwise by now I'd have piped grep through grep -v and been done with it:
grep -iE '(accept)[a-zA-Z]*[[:space:]]' | grep -vi 'acception'
I've found some questions that are similar and many that are not quite so. Using a-zA-Z is likely unnecessary in grep -i but I'm flailing. I'm probably missing something small or basic...but I'm missing it nonetheless. What is it?
Thanks for reading.
PS: I'm not married to grep--but I am operating in bash--so if there's a magic awk command that would do this I'm all ears (eyes).
PPS: forgot to mention that on https://regex101.com/ the above lookahead seemed to work, but it doesn't with my full grep command.
To use lookarounds, you need GNU grep with PCRE available (the -P option); -E does not support them:
grep -iP '(?!acception)(accept[a-z]*)[[:space:]]'
With awk, this might work
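awk '{ip=$0; gsub(/acception/, ""); if(/accept[a-zA-Z]*[[:space:]]/) print ip}'
ip=$0 saves the input line
gsub(/acception/, "") removes every occurrence of the unwanted word; other unwanted words can be added with alternation
if(/accept[a-zA-Z]*[[:space:]]/) print ip then prints the saved line if it still contains one of the words being searched for
Note that awk regexes are case-sensitive by default, so as written this does not cover the case-insensitive requirement.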
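A quick way to check the lookahead approach against the cases from the question is sketched below, assuming GNU grep with PCRE support. It differs from the command above in using \b word boundaries instead of a trailing [[:space:]], so a word at the end of a line or before punctuation also counts; that is a swapped-in design choice, not part of the original answer. match_accept is a hypothetical helper name and the sample lines are made up.

```shell
# Sketch only: assumes GNU grep with -P (PCRE) support.
# match_accept is a hypothetical helper name.
match_accept() {
    # \b anchors the attempt to the start of a word;
    # (?!acception\b) rejects exactly the word "acception";
    # -i makes the whole pattern case-insensitive
    grep -iP '\b(?!acception\b)accept[a-z]*\b'
}

printf '%s\n' \
    'The acceptance of the inevitable is the acception.' \
    'That was an acception to the rule.' \
    'acceptMacarena is fine too' | match_accept
```

The first and third lines should be printed: the first because "acceptance" matches even though "acception" is also present, the third because unknown suffixes are allowed. Variants such as "acceptions" would still match with this pattern.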

using \b with grep pattern

I am learning bash and I have come across regular expressions.
There is an exercise where I have to match a word and I tried to use \b<word>\b but for some reason it was not matched until I used \\b<word>\\b. I actually tried it out of desperation when I couldn't understand why \b wasn't working.
You are probably using grep \bword\b, which is really grep bwordb after bash parses the backslashes.
Use grep '\bword\b' (note the single-quotes).
You can also use grep -w word to match whole words only.
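The effect of the quoting can be seen directly by letting printf show what the shell passes along. This is a small sketch with made-up sample input.

```shell
# An unquoted backslash is removed by the shell before the command runs.
printf '%s\n' \bword\b      # the command receives: bwordb
printf '%s\n' '\bword\b'    # single quotes preserve: \bword\b

# Made-up sample input: only the first line contains "word" as a whole word.
printf '%s\n' 'a word here' 'swordfish' | grep -c '\bword\b'
printf '%s\n' 'a word here' 'swordfish' | grep -cw 'word'
```

Both grep commands count one matching line, since "swordfish" contains "word" only as part of a longer word.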

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret)". I think the ".+" means "one or more of anything", but I'm not sure if that's correct and, if it is correct, what distinguishes it from ".*".
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceded by "//".
Thank you.
The first error message tells you that the problem comes from bash (your shell). Bash sees the ! inside double quotes and performs history expansion: it tries to substitute the most recent command you entered that begins with \/\/. To avoid this, escape the ! or use single quotes. For an example of !, try !cat: it re-runs the last command you entered that began with cat.
You don't need to escape /, as it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name, e.g.:
grep -v "^//" bar.txt | grep "#include"
If you're really hung up on using regular expressions then a simple one would look like this (match start of line ^, any amount of whitespace [[:space:]]*, exactly two slashes /{2}, any number of any characters .*, followed by #include). Note that as written it selects the commented-out #include lines, i.e. the ones you want to exclude:
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using a negative lookahead, which is a PCRE feature and requires the -P option.
Your negative lookahead won't work without a start anchor.
This will of course require GNU grep.
You must use single quotes to use ! in your regex; otherwise history expansion is attempted with the text after the !, which is the reason for the !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h matches a single horizontal whitespace character, so \h* matches 0 or more of them.
Without -P, or with a non-GNU grep, you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. It will pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
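To see which lines the #include pattern above selects, here is a small sketch on made-up sample lines; find_includes is a hypothetical helper name.

```shell
# find_includes is a hypothetical helper name; the sample lines are made up.
find_includes() {
    # start of line, optional whitespace, #, optional whitespace,
    # "include", optional whitespace, then either " or <
    grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
}

printf '%s\n' \
    '#include <stdio.h>' \
    '// #include <old.h>' \
    '  # include "local.h"' \
    '#include MACRO_NAME' \
    'int main(void) { return 0; }' | find_includes
```

Only the first and third sample lines should be printed: the commented-out line fails at the start anchor, and the macro form fails at the final ["<] as described above.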
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (perl) option, and anchor the lookahead at the start of the line:
grep -P '^(?!//)' bar.txt
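For the lines not beginning with "//", you could use (^[^/]{2}.*$), though note that this also rejects lines that begin with a single slash, that have a slash as their second character, or that are shorter than two characters.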
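For a regex-only version that needs neither -v nor lookaheads, one option is to enumerate the ways a line can start without being "//". This is a sketch with made-up sample lines; not_double_slash is a hypothetical helper name.

```shell
# not_double_slash is a hypothetical helper name; the sample lines are made up.
not_double_slash() {
    # A line does not begin with // if it begins with a non-slash,
    # or with a slash followed by a non-slash,
    # or if it is empty or consists of a single slash.
    grep -E '^([^/]|/[^/]|/?$)'
}

printf '%s\n' 'hello' '//world' '/single' '/' '' 'a//b' | not_double_slash
```

Every sample line except //world should be printed, including the empty line and the lone slash.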
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a'
while with awk:
awk '/a/ && /b/'
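The same comparison on concrete input, as a sketch: the words alpha and beta and the helper names are made up for illustration, and the last helper adds a negated condition, which would be considerably more awkward to express as a single grep regex.

```shell
# Hypothetical helpers; "alpha", "beta" and "gamma" are made-up search words.
both_grep() { grep -E 'alpha.*beta|beta.*alpha'; }
both_awk()  { awk '/alpha/ && /beta/'; }
# awk conditions also combine with negation:
both_not_gamma() { awk '/alpha/ && /beta/ && !/gamma/'; }

sample() {
    printf '%s\n' 'alpha beta' 'beta alpha' 'alpha only' 'beta alpha gamma'
}

sample | both_grep
sample | both_awk
sample | both_not_gamma
```

The first two commands select the same three lines; the third drops the line that also contains gamma.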

Grep regex to unscramble a word

I want to unscramble a word using the grep command.
I am using below code. I know there are other ways to do it, but I think I'm missing something here:
grep "^[yxusonlia]\{9\}$" /usr/share/dict/words
should produce one output:
anxiously
but it produces:
annulosan
innoxious
and many more. Basically I can't find how I should specify that characters
can only be matched once, so that I get only one output.
I apologise if it seems very simple but I tried a lot and can't find anything.
You can use grep -P (PCRE regex) with negative lookahead
grep -P '^(?:([yxusonlia])(?!.*?\1)){9}$' /usr/share/dict/words
anxiously
Explanation:
This grep regex uses the negative lookahead (?!.*?\1) for each character matched by group #1, i.e. \1. Each character is matched if and only if the same character does not occur again anywhere later in the string.
You can use lookaheads to make sure that each letter is matched exactly one time. It is verbose and requires a version of grep that supports lookaheads (e.g. via -P). It may be better to build the search string programmatically.
grep -P "^(?=.*y)(?=.*x)(?=.*u)(?=.*s)(?=.*o)(?=.*n)(?=.*l)(?=.*i)(?=.*a)[yxusonlia]{9}$" /usr/share/dict/words
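Building the search string programmatically could look like the sketch below, assuming GNU grep with PCRE support. It only handles the case where every letter is distinct and none is a regex metacharacter; unscramble_pattern is a hypothetical helper name, and the three sample words are the ones from the question.

```shell
# Sketch only: assumes GNU grep with -P (PCRE) support.
# unscramble_pattern is a hypothetical helper name.
unscramble_pattern() {
    letters=$1
    # Turn each letter x into a lookahead (?=.*x)
    lookaheads=$(printf '%s' "$letters" | sed 's/./(?=.*&)/g')
    # ^ + lookaheads + [letters]{count} + $
    printf '^%s[%s]{%s}$\n' "$lookaheads" "$letters" "${#letters}"
}

# Show the generated pattern, then try it on the words from the question.
unscramble_pattern yxusonlia
printf '%s\n' anxiously annulosan innoxious | grep -P "$(unscramble_pattern yxusonlia)"
```

Against a real word list the call would be grep -P "$(unscramble_pattern yxusonlia)" /usr/share/dict/words, with the path taken from the question.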