capturing each word containing pattern regex - regex

I'm trying to write a sed script that finds every word containing a certain pattern and prepends a string to each of those words. For example:
foobarbaz barfoobaz barbazfoo barbaz
might turn into:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
I understand the basics of capture groups and backreferences, but I'm still having trouble. Specifically, I can't get it to capture each whole word separately.
s/\(.*\)men\(.*\)/ not just the \1men\2, but the \1women\2 and \1children\2 too /
I tried using \s for whitespace, as many sites recommend, but my sed treats \s as the separate characters \ and s.

You could use the non-space character \S as follows:
sed 's/\S*foo\S*/qux&/g' <<< "foobarbaz barfoobaz barbazfoo barbaz"
This will match words containing foo. The replacement string qux& prepends qux to every match (& stands for the matched text). Output:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz

This works as long as the words themselves contain no spaces:
echo "foobarbaz barfoobaz barbazfoo barbaz" | sed 's/\([^ ]*foo[^ ]*\)/qux\1/g'
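If \S is not available in your sed, the same idea can be written with a POSIX character class. This is only a sketch: it treats any run of non-whitespace characters as a "word", which is my assumption about what the question means by a word.

```shell
# Prepend qux to every whitespace-delimited word that contains foo.
# [^[:space:]] is the POSIX spelling of "any non-whitespace character".
input='foobarbaz barfoobaz barbazfoo barbaz'
result=$(printf '%s\n' "$input" | sed 's/[^[:space:]]*foo[^[:space:]]*/qux&/g')
printf '%s\n' "$result"
```

Because the pattern must contain foo, words such as barbaz are never matched and pass through unchanged.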

Related

Using sed to replace space delimited strings

echo 'bar=start "bar=second CONFIG="$CONFIG bar=s buz=zar bar=g bar=ggg bar=f bar=foo bar=zoo really?=yes bar=z bar=yes bar=y bar=one bar=o que=idn"' | sed -e 's/^\|\([ "]\)bar=[^ ]*[ ]*/\1/g'
Actual output:
CONFIG="$CONFIG buz=zar bar=ggg bar=foo really?=yes bar=yes bar=one que=idn"
Expected output:
CONFIG="$CONFIG buz=zar really?=yes que=idn"
What am I missing in my regex?
Edit:
This works as expected (with GNU sed):
's/\(^\|\(['\''" ]\)\)bar=[^ ]*/\2/g; s/[ ][ ]\+/ /g; s/[ ]*\(['\''"]\+\)[ ]*/\1/g'
Portable (POSIX) sed regular expressions are pretty limited. They don't include \w as a synonym for [a-zA-Z0-9_], for example. They also don't include \b, which matches the zero-length string at the beginning or end of a word (which is what you really want in this situation); GNU sed accepts both as extensions.
s/ bar=[^ ]* *//
is close, but the problem is the trailing * removes the space that might precede the next bar=. So, in ... bar=aaa bar=bbb ... the first match is bar=aaa leaving bar=bbb ... to try for the second match but it won't match because you already consumed the space before bar.
s/ bar=[^ ]*//
is better -- don't consume the trailing spaces, leave them for the next match attempt. If you want to match bar=something even if it's at the beginning of the string, insert a space at the beginning first:
sed 's/^bar=/ bar=/; s/ bar=[^ ]*//'
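A small sketch of the point above, using made-up sample strings (not the input from the question) and adding the g flag so that every occurrence is removed. The final s/^ // that strips a leftover leading space is my addition.

```shell
input='x bar=a bar=b y'

# The trailing " *" eats the space that the next " bar=" needs,
# so the second bar= survives.
consumed=$(printf '%s\n' "$input" | sed 's/ bar=[^ ]* *//g')

# Leaving the trailing space alone lets adjacent matches succeed.
kept=$(printf '%s\n' "$input" | sed 's/ bar=[^ ]*//g')

# Leading-space trick for a bar= at the very start of the string.
start='bar=a bar=b foobar=c y'
cleaned=$(printf '%s\n' "$start" | sed 's/^bar=/ bar=/; s/ bar=[^ ]*//g; s/^ //')

printf '%s\n' "$consumed" "$kept" "$cleaned"
```

Note that foobar=c is left alone, because the pattern requires a space directly before bar=.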
If you want to remove all instances of bar=something then you can simplify your regex as such:
\sbar=\w+
This matches every bar= followed by a run of word characters. The bar= must be preceded by a whitespace character.
Demonstration:
https://regex101.com/r/xbBhJZ/3
As sed:
s/\sbar=\w\+//g
Because the leading whitespace is required, this correctly leaves foobar=bar alone.
Like Waxrat's answer, you have to insert a space at the beginning for it to properly match as it's now matching against a preceding whitespace character before the bar=. This can be easily done since you're quoting your string explicitly.

Using sed to replace string matching regex with wildcards

I have a string I'm trying to manipulate with sed:
js/plex.js?hash=f1c2b98&version=2.4.23"
Desired output is
js/plex.js"
This is what I'm currently trying
sed -i s'/js\/plex.js[\?.\+\"]/js\/plex.js"/'
But it is only matching the first ? and returns this output
js/plex.js"hash=f1c2b98&version=2.4.23"
I can't see why this isn't working after a few hours
This works
echo 'js/plex.js?hash=f1c2b98&version=2.4.23"' | sed 's:\.js?[^"]*:.js:'
With the original Regex:
Firstly, I would suggest using a different delimiter (like :) in sed when the regex contains /. Secondly, [] matches any single one of the characters inside the brackets, so the .\+ is not treated as "one or more of any character" and does not extend to the end of the line. You could potentially try putting the \+ after the [].
Perhaps:
sed 's#\(js/plex\.js\)?[^"]*#\1#'
Here # is used as the delimiter, so the / in the path needs no escaping.
\(js/plex\.js\)?[^"]*   # match this pattern and replace it with the marked part, \1
The marked pattern
In sed you can mark part of a pattern, or the whole pattern, by enclosing it in \( and \). Whatever the enclosed part matches is stored for use in the replacement.
In my example, this is the pattern without marking:
js/plex\.js?[^"]*
But I only want sed to remember js/plex.js and replace the whole match with just that piece. In sed the first marked pattern is known as \1, the second as \2, and so forth:
\(js/plex\.js\) ---> is marked as \1
Hence I replace the match with \1, which leaves the closing " in place.
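To see why the bracket expression from the question stops after one character, it may help to run it next to a pattern that consumes the whole query string. This is a sketch; the second command is one possible fix, not the only one.

```shell
input='js/plex.js?hash=f1c2b98&version=2.4.23"'

# Bracket expression from the question: [...] matches exactly ONE
# character from the set, so only the ? is consumed.
one_char=$(printf '%s\n' "$input" | sed 's/js\/plex.js[\?.\+\"]/js\/plex.js"/')

# [^"]* consumes everything up to (but not including) the closing quote.
stripped=$(printf '%s\n' "$input" | sed 's#\(js/plex\.js\)?[^"]*#\1#')

printf '%s\n' "$one_char" "$stripped"
```

The first result reproduces the unwanted output from the question; the second is the desired js/plex.js".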

Grep for a string that ends with specific character

Is there a way to use extended regular expressions to find a specific pattern that ends with a given string?
I mean, I want to match first 3 lines but not the last:
file_number_one.pdf # comment
file_number_two.pdf # not interesting
testfile_number____three.pdf # some other stuff
myfilezipped.pdf.zip some comments and explanations
I know that in grep, metacharacter $ matches the end of a line but I'm not interested in matching a line end but string end. Groups in grep are very odd, I don't understand them well yet.
I tried with group matching, actually I have a similar REGEX but it does not work with grep -E
(\w+).pdf$
Is there a way to do string ending match in grep/egrep?
Your example works with matching the space after the string also:
grep -E '\.pdf ' input.txt
What you call a "string" is similar to what grep calls a "word". A word is a run of alphanumeric characters and underscores. The nice thing about words is that you can match a word end with the special \>, which matches at the end of a word with a match of zero length. That also matches at the end of a line. But the set of word characters cannot be changed, and it does not include punctuation such as the dot, so we cannot use it here.
If you need to match at the end of line too, where there is no space after the word, use:
grep -E '\.pdf |\.pdf$' input.txt
To include cases where the character after the file name is not a space character ' ' but other whitespace, like a tab (\t), or where the name is directly followed by a comment starting with #, use:
grep -E '\.pdf[[:space:]#]|\.pdf$' input.txt
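The two alternatives can also be folded into one group. A minimal sketch, run against the four sample lines from the question plus one hypothetical line (report.pdf) that I added to exercise the end-of-line case:

```shell
# Match .pdf only when it is followed by whitespace, a #, or the end of the line.
matches=$(printf '%s\n' \
  'file_number_one.pdf # comment' \
  'file_number_two.pdf # not interesting' \
  'testfile_number____three.pdf # some other stuff' \
  'myfilezipped.pdf.zip some comments and explanations' \
  'report.pdf' |
  grep -E '\.pdf([[:space:]#]|$)')
printf '%s\n' "$matches"
```

The .pdf.zip line is rejected because the character after .pdf is a dot, which is neither whitespace, #, nor the end of the line.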
I will illustrate the matching of word boundaries too, because that would be the perfect solution, except that we cannot use it here because we cannot change the set of characters that are seen as parts of a word.
The input contains foo as separate word, and as part of longer words, where the foo is not at the end of the word, and therefore not at a word boundary:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n'
foo bar
foo.bar
foobar
foo_bar
foo
Now, to match the boundaries of words, we can use \< for the beginning, and \> to match the end:
$ printf 'foo bar\nfoo.bar\nfoobar\nfoo_bar\nfoo\n' | grep 'foo\>'
foo bar
foo.bar
foo
Note how _ is matched as a word char - but otherwise, wordchars are only the alphanumerics, [a-zA-Z0-9].
Also note how foo at the end of the line is matched, in the line containing only foo. We do not need a special case for the end of the line.
You can use the \> operator:
grep 'word\>' fileName
You need to escape the . in your regex. This regex will match anything that ends in .pdf (and only things that end in .pdf):
.*\.pdf$
Positive lookaheads are well suited to this kind of thing. Give this a try:
grep -P "(^\w+\.pdf)(?=\s)" file
I assume filenames will always be on the start of the line.

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Hi, basically I'm inside a loop, and only sometimes do I get files with this pattern. So currently I'm running grep "$i" test.txt, which works in all cases except when I encounter such patterns.
And I'm actually grepping for the exact file number and file sequence. So if it says 123_29887, it will be 123_29887. Thanks.
You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
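As an alternative to \Q and \E (a different technique, swapped in here because it does not need -P), grep -F treats the whole pattern as a fixed string. A sketch of the loop, with a second, hypothetical name added only to show a non-match:

```shell
# The line from test.txt in the question, held in a variable for the demo.
line='NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv'

found=''
for i in '**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887' 'no_such_name_000_00000'; do
  # -F: fixed string, no regex metacharacters; -q: quiet, exit status only;
  # -e: keeps a pattern that starts with - from being read as an option.
  if printf '%s\n' "$line" | grep -qF -e "$i"; then
    found="$found$i;"
  fi
done
printf '%s\n' "$found"
```

Against a real file, the test inside the loop would be grep -qF -e "$i" test.txt.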

Why doesn't sed interpret this regex properly?

echo "This is a test string" | sed 's/This/\0/'
First I match substring This using the regex This. Then I replace the entire string with the first match using \0. So the result should be just the matched string.
But it prints out the entire line. Why is this so?
You don't replace the whole string with \0, just the pattern match, which is This. In other words, you replace This with This.
To replace the whole line with This, you can do:
echo "This is a test string" | sed '/This/s/.*/This/'
It looks for a line matching This, and replaces the whole line with This. In this case (since there is only one line) you can also do:
echo "This is a test string" | sed 's/.*/This/'
If you want to reuse the match, then you can do
echo "This is a test string" | sed 's/.*\(This\).*/\1/'
\( and \) are used to remember the match inside them. It can be referenced as \1 (if you have more than one pair of \( and \), then you can also use \2, \3, ...).
In the example above this is not very helpful, since we know that inside \( and \) is the word This, but if we have a regex inside the parentheses that can match different words, this can be very helpful.
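Two short sketches of the above. The first uses &, the portable spelling of "the whole match", to show that only the matched part is replaced. The second puts a regex inside the parentheses; the invoice sentence is a made-up input.

```shell
# & is the entire match, so only "This" is rewritten; the rest of the line stays.
wrapped=$(echo "This is a test string" | sed 's/This/[&]/')

# Keep only the four-digit number captured by \( \); the surrounding .* on
# both sides consumes, and so discards, the rest of the line.
year=$(echo "Invoice 2014 was paid" | sed 's/.*\([0-9][0-9][0-9][0-9]\).*/\1/')

printf '%s\n' "$wrapped" "$year"
```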
sed 's/.*\(PatThis\).*/PatThat/'
or
sed '/PatThis/ s/.*/PatThat/'
In your request, PatThis and PatThat have the same content ("This"). In the comment ("I need to select a number using \d\d\d\d and then use it as replacement") you have two different values for PatThis and PatThat.
The \1 is not really needed when you know the exact content (unless PatThis is a regex with special characters like \ & ? .).