Regex for The Same Pattern Multiple Times in One Line

The pattern I'm looking for is this:
TXT.*\.txt
That pattern can occur multiple times in any given line. I would like to either extract each instance of the pattern out or alternatively delete the text that surrounds each instance using sed (or anything, really).
Thanks!

You can use Perl as:
$ cat file
foo TXT1.txt bar TXT2.txt baz
foo TXT3.txt bar TXT4.txt baz
$ perl -ne 'print "$1\n" while(/(TXT.*?\.txt)/g)' file
TXT1.txt
TXT2.txt
TXT3.txt
TXT4.txt
$

You can use grep as:
grep -o 'TXT[^.]*\.txt' file
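Both answers change one detail of the pattern in the question: they stop `.*` from running on to the last `.txt` on the line. Here is a minimal sketch of the difference, assuming a grep that supports -o (GNU and BSD grep do). The variable names are only illustrative.

```shell
# Sample line taken from the Perl answer above.
line='foo TXT1.txt bar TXT2.txt baz'

# Greedy .* runs to the last ".txt" on the line, so both names merge into one match.
greedy=$(printf '%s\n' "$line" | grep -o 'TXT.*\.txt')

# [^.]* cannot cross a dot, so each file name is matched on its own.
bounded=$(printf '%s\n' "$line" | grep -o 'TXT[^.]*\.txt')

printf 'greedy:\n%s\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
```

The Perl answer gets the same effect with the non-greedy `.*?`, which plain grep without -P does not offer.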

Related

Why Perl regex doesn't match "\n" with a following character?

This is my file foo.txt:
a
b
This is what I'm doing:
$ perl -pi -e 's/\nb/z/g' foo.txt
Nothing changes in the file, while I'm expecting it to become:
az
Why? It's Perl v5.34.0.
The first time you evaluate the substitution, you match against a␊. The second time, against b␊. So it doesn't match either time.
You want to match against the entire file. You can tell Perl to treat the entire file as one line with -0777, or with its alias -g in Perl 5.36 and later (so not on your 5.34.0).
perl -i -gpe's/\nb/z/g' foo.txt # 5.36+
perl -i -0777pe's/\nb/z/g' foo.txt

sed regex with alternative on Solaris doesn't work

Currently I'm trying to use sed with regex on Solaris but it doesn't work.
I need to show only the lines that match my regex.
sed -n -E '/^[a-zA-Z0-9]*$|^a_[a-zA-Z0-9]*$/p'
input file:
grtad
a_pitr
_aupa
a__as
baman
12353
ai345
ki_ag
-MXx2
!!!23
+_)#*
I want to show only the lines that match the regex above:
grtad
a_pitr
baman
12353
ai345
Is there another way to use alternation? Is it possible in Perl?
Thanks for any solutions.
With Perl
perl -ne 'print if /^(a_)?[a-zA-Z0-9]*$/' input.txt
The (a_)? matches a_ one-or-zero times, so optionally. It may or may not be there.
The (a_) also captures the match, which is not needed, so you can use (?:a_)? instead. The ?: makes () only group what is inside (so ? applies to the whole thing) without remembering it.
with grep
$ grep -xiE '(a_)?[a-z0-9]*' ip.txt
grtad
a_pitr
baman
12353
ai345
-x match whole line
-i ignore case
-E extended regex; if not available, use grep -xi '\(a_\)\?[a-z0-9]*' (note that \? in a basic regex is a GNU extension)
(a_)? match a_ zero or one time
[a-z0-9]* zero or more letters or digits
With sed
sed -nE '/^(a_)?[a-zA-Z0-9]*$/p' ip.txt
or, with GNU sed
sed -nE '/^(a_)?[a-z0-9]*$/Ip' ip.txt
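The alternation in the question and the optional group in the answers describe the same set of lines. A quick way to check this is to run both over the sample input, as in this sketch. It assumes grep -E is available and uses a hypothetical scratch file. It says nothing about what the Solaris sed accepts.

```shell
# Sample input from the question, written to a hypothetical scratch file.
tmp=$(mktemp)
printf '%s\n' grtad a_pitr _aupa a__as baman 12353 ai345 ki_ag -MXx2 '!!!23' '+_)#*' > "$tmp"

# Alternation, as written in the question.
alternation=$(grep -E '^[a-zA-Z0-9]*$|^a_[a-zA-Z0-9]*$' "$tmp")

# Optional group, as written in the answers.
optional=$(grep -E '^(a_)?[a-zA-Z0-9]*$' "$tmp")

printf '%s\n' "$optional"
rm -f "$tmp"
```

Both variables end up holding the same five lines, so the optional group is simply the shorter way to write it.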

How to filter foobar from all grep results of foo?

I am searching a large codebase for all occurrences of the company acronym, which is a small 3-character word like foo. I normally do this sort of thing with
grep -Rnoi 'foo' *
starting at the top of the code base. However, since this is a small word that can produce an overwhelming amount of false positives, like 'foobar' or 'foocat', how might I go about filtering out the false positives?
I was thinking something along the lines of...
grep -Rnoi 'foo' * | grep [excludeMagicOption] 'foobar'
where the displayed results shows all foo occurrences without 'foobar'. What are some options for doing this?
If I understand your question correctly, you only want to match foo and not foocat. Use the -w or --word-regexp option to match only whole-word occurrences of foo. Example:
Input file
$ cat foo.txt
foo
foocat
foobar
foo
foofighter
Use/Output
$ grep -Roniw 'foo' foo.txt
1:foo
4:foo
You can add more conditions to the initial regex to just match a set of whole words. From your example in the comment foo and foo-, you could use:
grep -Roniw 'foo[-]*' foo.txt
Input file
$ cat foo.txt
foo
foocat
foobar
foo
foofighter
foo-
Use/Output
$ grep -Roniw 'foo[-]*' foo.txt
1:foo
4:foo
6:foo-
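You can use a word boundary, denoted by \b in many (not all) regex engines. GNU grep supports it, including in egrep and grep -E. It matches at the edge of a run of word characters (letters, digits, underscore), so it also matches at the start and end of a line and next to punctuation.
For example, with test.txt: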
foo
foobar
foocat
foobar = foocat * 3
foobar = foo++
Feel the foo
What are the foo's price?
Strange how football changes.
Where is foo and bar?
Using:
grep -E '\bfoo\b' test.txt
Gives:
foo
foobar = foo++
Feel the foo
What are the foo's price?
Where is foo and bar?
Edit: Some regular expression engines use other character sequences for word boundaries. There is a summary here: http://www.regular-expressions.info/refwordboundaries.html
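You want the -v option. Drop -o from the first grep, otherwise the second grep only ever sees the matched text foo and has nothing to filter:
grep -Rni 'foo' * | grep -v 'foobar'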
From grep --help:
-v, --invert-match select non-matching lines
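The whole-word and exclusion approaches do not select the same lines. The sketch below counts matching lines for each on the test.txt sample from the word-boundary answer. It assumes a grep with -w (GNU and BSD grep have it), and the scratch file is hypothetical.

```shell
# Sample lines from the word-boundary answer, in a hypothetical scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
foo
foobar
foocat
foobar = foocat * 3
foobar = foo++
Feel the foo
What are the foo's price?
Strange how football changes.
Where is foo and bar?
EOF

# Whole-word match: keeps "foobar = foo++", drops foocat and football.
word_hits=$(grep -ciw 'foo' "$tmp")

# Exclusion: drops every line containing "foobar", keeps foocat and football.
filtered_hits=$(grep -i 'foo' "$tmp" | grep -vc 'foobar')

printf 'grep -w: %s lines\n' "$word_hits"
printf 'grep -v: %s lines\n' "$filtered_hits"
rm -f "$tmp"
```

Exclusion with -v needs one pattern per false positive and discards a line such as `foobar = foo++` even though it contains a real hit, so -w or \b is usually the closer fit for this question.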

regex command line linux - select all lines between two strings

I have a text file with contents like this:
here is some super text:
this is text that should
be selected with a cool match
And this is how it all ends
blah blah...
I am trying to get the two lines (but could be more or less lines) between:
some super text:
and
And this is how
I am using grep on an ubuntu machine and a lot of the patterns I've found seem to be specific to different kinds of regex engines.
So I should end up with something like this:
grep "my regex goes here" myFileNameHere
Not sure if egrep is needed, but could use that just as easy.
You can use addresses in sed:
sed -e '/some super text/,/And this is how/!d' file
!d deletes every line that is not in the range.
To exclude the border lines, you must be more clever:
sed -n -e '/some super text/ {n;b c}; d;:c {/And this is how/ {d};p;n;b c}' file
Or, similarly, in Perl:
perl -ne 'print if /some super text/ .. /And this is how/' file
To exclude the border lines again, change it to
perl -ne '$in = /some super text/ .. /And this is how/; print if $in > 1 and $in !~ /E/' file
I don't see how it could be done in grep. Using awk:
awk '/^And this is how/ {p=0}; p; /some super text:$/ {p=1}' file
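The awk one-liner excludes the border lines purely through the order of its rules: the flag is cleared before the print rule and set after it. The sketch below runs it next to an inclusive sed range on the sample text. The scratch file and variable names are hypothetical.

```shell
# Sample text from the question, in a hypothetical scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
here is some super text:
this is text that should
be selected with a cool match
And this is how it all ends
blah blah...
EOF

# Inclusive: print from the start line through the end line.
inclusive=$(sed -n '/some super text/,/And this is how/p' "$tmp")

# Exclusive: clear the flag before the print rule, set it after the print rule.
exclusive=$(awk '/^And this is how/ {p=0}; p; /some super text:$/ {p=1}' "$tmp")

printf 'inclusive:\n%s\n\nexclusive:\n%s\n' "$inclusive" "$exclusive"
rm -f "$tmp"
```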
Try pcregrep instead of plain grep, because plain grep matches one line at a time and cannot return a match that spans several lines.
$ pcregrep -M -o '(?s)some super text:[^\n]*\n\K.*?(?=\n[^\n]*And this is how)' file
this is text that should
be selected with a cool match
(?s) The dotall modifier lets the dot match newline characters too.
\K Discards everything matched so far from the reported match.
From pcregrep --help
-M, --multiline run in multiline mode
-o, --only-matching=n show only the part of the line that matched
TL;DR
With your corpus, another way to solve the problem is by matching lines with leading whitespace, rather than using a flip-flop operator of some sort to match start and end lines. The following solutions work with your posted example.
GNU Grep with PCRE Compiled In
$ grep -Po '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match
Alternative: Use pcregrep Instead
$ pcregrep -o '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match

regular expression pattern replacement *.xml

I am new to regular expressions and don't deal with them regularly, so I'm posting this as a question.
I want to replace
blah.xml
haha.xml
to
user/home/blah.xml
user/home/haha.xml
I would prefer to do it with sed.
Cheers
SK
You can use sed as:
$ cat file
foo
blah.xml
haha.xml
bar
$ sed -r 's#([^.]*\.xml)#user/home/\1#g' file
foo
user/home/blah.xml
user/home/haha.xml
bar
To answer your question in the comments, try:
$ echo "file is blah.xml" | sed -r 's#(\w+\.xml)#user/home/\1#'
file is user/home/blah.xml
sed 's=\(.*\.xml\)=user/home/\1=g'
sed -e 's#^#user/home/#'
inserts user/home/ at the beginning of each line.
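The answers differ in where the prefix lands when the file name is not at the start of the line. This sketch runs two of the variants on the sample sentence from the comment. It assumes a sed that accepts -E (GNU and BSD sed do) and uses [[:alnum:]_] as a portable stand-in for \w.

```shell
line='file is blah.xml'

# .* starts matching at the beginning of the line, so the prefix lands there.
whole_line=$(printf '%s\n' "$line" | sed 's=\(.*\.xml\)=user/home/\1=g')

# A run of word characters before ".xml" limits the match to the file name itself.
name_only=$(printf '%s\n' "$line" | sed -E 's#[[:alnum:]_]+\.xml#user/home/&#g')

printf '%s\n' "$whole_line"
printf '%s\n' "$name_only"
```

When each line holds nothing but a file name, as in the question, both variants give the same result.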