Linux - Only find a pattern within a line, not the whole line - regex

I want to use a regex to find a pattern in a file. That pattern may be in the middle of a line, but I don't want the whole line. I tried grep -a pattern file but this returns the entire line that contains the regex. The following is an example of what I'm trying to do. Does anyone know a way to do this?
Example:
Input: AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC
Regex: Xx.*yY
Output: XxXxXxXxBananasyYyYyYyY

You were close; you just need the -o flag:
grep -o 'Xx.*yY' <<<AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC
XxXxXxXxBananasyYyYyYyY

Use the -o option to print just the part of the line that matches the regexp:
grep -o pattern file
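As a small illustration of what -o does, the sketch below runs it on the question's input and on a second, made-up line (the `id=...` line is a hypothetical sample, not from the question). It assumes a grep that supports -o, such as GNU or BSD grep.

```shell
# Input taken from the question.
input='AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC'

# .* is greedy: the match runs from the first "Xx" to the last "yY".
greedy=$(printf '%s\n' "$input" | grep -o 'Xx.*yY')
printf '%s\n' "$greedy"

# Hypothetical sample line with two matches.
line='id=17 name=foo id=42'

# -o prints every non-overlapping match on its own line.
matches=$(printf '%s\n' "$line" | grep -o 'id=[0-9]*')
printf '%s\n' "$matches"
```

Note that a line containing several matches produces several output lines, which matters if you expect exactly one result per input line.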

In addition to grep -o (the simplest way), there are a couple of other options:
In bash, without relying on any particular implementation of grep:
$ regex='Xx.*yY'
$ [[ AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC =~ $regex ]]
$ echo "${BASH_REMATCH[0]}"
XxXxXxXxBananasyYyYyYyY
Using expr, which is a little unwieldy (in part because the regular expression is implicitly anchored to the beginning of the string), but is defined by the POSIX standard so it should work on any POSIX platform, regardless of the shell used.
$ expr AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC : '[^X]*\(Xx.*yY\)'
XxXxXxXxBananasyYyYyYyY
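When using the expr approach in a script, it may help to check the exit status, since expr exits non-zero when the captured text is empty. A minimal sketch, assuming inputs that expr will not mistake for one of its own operators (the `plain text` input and the `status` variable are made up for illustration):

```shell
input='AAAAAAAAAAAAAXxXxXxXxBananasyYyYyYyYBBBBBBBCCCCCC'

# The pattern is anchored at the start, so [^X]* skips the leading text;
# the \(...\) group is the part that expr prints.
if match=$(expr "$input" : '[^X]*\(Xx.*yY\)'); then
  printf '%s\n' "$match"
fi

# No match: expr prints an empty line and exits with status 1.
status='matched'
if ! miss=$(expr 'plain text' : '[^X]*\(Xx.*yY\)'); then
  status='no match'
fi
printf '%s\n' "$status"
```

The `[^X]*` prefix only works here because no X appears before the wanted match; other inputs need a different way to skip the leading text.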

Related

Can I perform a 'non-global' grep and capture only the first match found for each line of input?

I understand that what I'm asking can be accomplished using awk or sed, I'm asking here how to do this using GREP.
Given the following input:
.bash_profile
.config/ranger/bookmarks
.oh-my-zsh/README.md
I want to use GREP to get:
.bash_profile
.config/
.oh-my-zsh/
Currently I'm trying
grep -Po '([^/]*[/]?){1}'
Which results in output:
.bash_profile
.config/
ranger/
bookmarks
.oh-my-zsh/
README.md
Is there some simple way to use GREP to only get the first matched string on each line?
I think you can grep for the leading non-/ characters like this:
grep -Eo '^[^/]+'
There is a similar question, with a solution, on another Stack Exchange site.
You don't need grep for this at all.
cut -d / -f 1
The -o option says to print every substring which matches your pattern, instead of printing each matching line. Your current pattern matches every string which doesn't contain slashes (optionally including a trailing slash); but it's easy to switch to one which only matches this pattern at the beginning of a line.
grep -o '^[^/]*' file
Notice the addition of the ^ beginning-of-line anchor, and the omission of the -P option (which you were not really using anyway) as well as the redundant {1} quantifier (repeating something exactly once is a no-op).
(I should add that plain grep does not treat bare parentheses or braces as grouping or repetition operators; grep -E supports these constructs just fine, or you could switch to the POSIX BRE variation, which requires a backslash to use round or curly brackets as metacharacters. You can probably ignore these details and just use grep -E everywhere unless you really need the features of grep -P, though also be aware that -P is not portable.)
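The anchored pattern above drops the trailing slash, whereas the question asked to keep it. A sketch of one way to keep it, assuming a grep with -E and -o (the variable names are illustrative), next to the cut variant for comparison:

```shell
# Sample input from the question.
paths='.bash_profile
.config/ranger/bookmarks
.oh-my-zsh/README.md'

# ^ pins the match to the start of the line, so -o can print at most one
# match per line; /? keeps the first slash when there is one.
with_slash=$(printf '%s\n' "$paths" | grep -Eo '^[^/]*/?')
printf '%s\n' "$with_slash"

# cut returns the first /-separated field, without the slash.
without_slash=$(printf '%s\n' "$paths" | cut -d / -f 1)
printf '%s\n' "$without_slash"
```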

Why do my results appear to differ between ag and grep?

I'm having trouble correctly (and safely) executing the right regex searches with grep. I seem to be able to do what I want using ag.
What I want to do, in plain English:
Search my current directory (recursively?) for files that have lines containing both the words "nested" and "merge"
Successful attempt with ag:
$ ag --depth=2 -l "nested.*merge|merge.*nested" .
scratch.md
scratch.rb
Unsuccessful attempt with grep:
$ grep -elr 'nested.*merge|merge.*nested' .
grep: nested.*merge|merge.*nested: No such file or directory
grep: .: Is a directory
What am I missing? Also, could either approach be improved?
Thanks!
You probably want -E, not -e (or just egrep).
man grep explains the error: -e takes the text that follows it as the pattern, so -elr searches for the pattern lr and treats your regex and . as file names.
You can use grep -lr 'nested.*merge\|merge.*nested' or grep -Elr 'nested.*merge|merge.*nested' for your case.
For the latter, -E selects ERE (extended regular expression) syntax. grep uses BRE by default, where | matches a literal | character and \| means alternation (a GNU extension).
For more detail about ERE and BRE, you can read this article
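To see the -E form working end to end, here is a throwaway sketch that builds a temporary directory with two made-up files (`both.txt` and `separate.txt` are hypothetical names) and searches it recursively. It assumes mktemp and a grep that supports -r.

```shell
# Scratch directory with hypothetical test files.
dir=$(mktemp -d)
printf 'deep nested hash merge\n' > "$dir/both.txt"
printf 'nested only\nmerge only\n' > "$dir/separate.txt"

# -E: ERE syntax, so a bare | means alternation.
# -l: list matching file names only.  -r: recurse into directories.
found=$(cd "$dir" && grep -Elr 'nested.*merge|merge.*nested' .)
printf '%s\n' "$found"

rm -rf "$dir"
```

Only `both.txt` is listed, because grep matches line by line and `separate.txt` has the two words on different lines.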

difference between 'i' and 'I' in sed

I thought i and I both mean ignorecase in sed, e.g.
$ echo "abcABC"|sed -e 's/a/j/gi'
jbcjBC
$ echo "abcABC"|sed -e 's/a/j/gI'
jbcjBC
However, looks like it's only for substitution:
$ echo "abcABC"|sed -e '/a/id' # <--
d
abcABC
$ echo "abcABC"|sed -e '/a/Id'
$
It's really confusing.
Where can I find full reference of the meaning of regular expression for sed?
i and I are indeed flags to the s command; they are not generally applicable to all uses of regular expressions in sed. The GNU man page is oddly silent on which flags s accepts (or even the fact that s accepts flags), so you'll have to look in the info page (run info sed).
Other uses of regular expressions are governed by the function in which they are used.
In your first address example, i is the actual sed function applied to lines that match the regular expression a; i means to insert text, so /a/id inserts the text "d" before each matching line. In GNU sed, a capital I directly after an address regex (/a/I) is a GNU extension that makes the address match case-insensitively, so /a/Id runs d on every line that matches a in any case, deleting the line.
The sed man page in FreeBSD in the section describing options to the s (substitute) command, says only:
i or I  Match the regular expression in a case-insensitive way
Thus, the following are identical:
s/a/j/gi
s/a/j/gI
But that's only using i as a modifier to the s command. In your second example, you're using i as a command. The man page in this case states:
[1addr]i\
text Write text to the standard output.
and at least in FreeBSD's sed, there is no I (capital-I) command. So your sed script /a/id would (1) match lines containing an a, and if found (2) print the text "d". Which is what you saw.
And since I is not a command, I would have expected an error, but my results match yours -- /a/Id appears to eliminate output.
Note that the available commands and the completeness of the documentation may differ depending on the variant of sed you are using.
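The three behaviours discussed in this thread can be compared side by side. This sketch assumes GNU sed specifically: the I flag on s, the one-line form of the i command, and the I modifier on an address are GNU extensions or GNU-style shortcuts, and other sed variants may reject or interpret them differently. The three-line input in the last command is made up for illustration.

```shell
# I as a flag to s: case-insensitive substitution.
out_s=$(printf 'abcABC\n' | sed 's/a/j/gI')
printf '%s\n' "$out_s"

# i as a command: insert the text "d" before each line matching /a/.
out_i=$(printf 'abcABC\n' | sed '/a/id')
printf '%s\n' "$out_i"

# I as an address modifier (GNU): /a/I matches "a" or "A", then d deletes.
out_d=$(printf 'abc\nABC\nxyz\n' | sed '/a/Id')
printf '%s\n' "$out_d"
```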

Looking for a trailing $ sign using Regex

The pattern I'm looking for looks like $guid1$, with a $ sign on each side. Unfortunately, my regex in grep (and probably elsewhere) interprets that last $ as something else.
"\$guid[0-9]" works but "\$guid[0-9]\$" does not. What can I do?
You need to use single quotes around your regex:
grep '\$guid1\$' file
Or use fgrep (equivalent to grep -F) for a fixed-string search:
fgrep '$guid1$' file
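The reason quoting matters: inside double quotes the shell itself consumes the backslash in \$, so grep receives `$guid[0-9]$`, where the trailing $ is an end-of-line anchor. The sketch below compares the variants on a made-up line (the `token ... here` text is hypothetical).

```shell
# Hypothetical sample line containing the token.
line='token $guid1$ here'

# Single quotes: grep receives \$guid[0-9]\$ and both $ are literal.
single=$(printf '%s\n' "$line" | grep -o '\$guid[0-9]\$')

# Double quotes: the shell strips the backslashes, grep sees $guid[0-9]$,
# and the final $ anchors to end of line, so nothing matches here.
double=$(printf '%s\n' "$line" | grep -o "\$guid[0-9]\$" || true)

# Double quotes done right: \\ yields a backslash, \$ yields a dollar sign.
escaped=$(printf '%s\n' "$line" | grep -o "\\\$guid[0-9]\\\$")

# -F: fixed-string search, no regex metacharacters at all.
fixed=$(printf '%s\n' "$line" | grep -oF '$guid1$')

printf 'single=%s double=%s escaped=%s fixed=%s\n' \
  "$single" "$double" "$escaped" "$fixed"
```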

grep through binary file

I have a binary file that contains lines in the following form:
blabla^A2013.04.03-09:35:04^Ablabla
where ^A is the binary character 001.
I want to be able to perform a grep that will give me only what is between the ^A (not the whole line).
I know that the -o flag prints only the match, but I don't know how to search for that binary character.
You should be able to include control-A on the command line by simply typing control-A where you want it to appear. At worst, you might need to type control-V before it. You can also explore notations using bash's ANSI-C quoting such as $'\001'.
Try this:
grep --binary-files=text pattern file.txt
So:
$ grep --binary-files=text -oP '\^\K[^\^]+(?=\^)' file.txt
A2013.04.03-09:35:04
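Note that the -P pattern above matches a literal caret character, which is why its output starts with A. If the file really contains the control character 001, one possible approach is to put that byte in a shell variable and build the pattern from it. A sketch without -P, assuming a grep that supports -a and -o (the variable names are illustrative):

```shell
# The control-A byte (octal 001), produced portably with printf.
soh=$(printf '\001')

# Sample record from the question, with real control-A delimiters.
# -a treats the input as text; -o prints delimiter + payload + delimiter;
# tr then strips the two delimiter bytes.
between=$(printf 'blabla\0012013.04.03-09:35:04\001blabla\n' |
  grep -a -o "${soh}[^${soh}]*${soh}" |
  tr -d '\001')
printf '%s\n' "$between"
```

In bash, `$'\001'` can be used in place of the `soh` variable.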