Regex for finding comma between words - regex

What's the regular expression for finding all instances of a comma, without a trailing space, between words? e.g. "someword,otherword"?
I'm using this pattern in Eclipse's search tool:
([^,\n\s']),([^,\n\s\)\]'])
which works perfectly, but when I use this same pattern with grep like:
grep -nHIirE -- ([^,\n\s']),([^,\n\s\)\]'])
it finds nothing. What am I doing wrong?

Use word boundaries \b:
echo "abc,def ghi, jkl" | grep '\b,\b'
(find the first comma, but not the second)

Passing ([^,\n\s']),([^,\n\s\)\]']) to grep is a string literal search, you need to put it in single or double quotes to make it a regex search "([^,\n\s']),([^,\n\s\)\]'])"

Using grep this pattern should work:
grep -nHIir -- "[^,'[:space:]],[^],)[:space:]']"
Use [:space:] to match any whitespace including newlines
Importantly don't escape ] inside a negated character class. Just place it at first position.
Avoid unnecessary grouping
You don't even need extended regex flavor for this

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user#ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user#ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookahead, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4

capturing each word containing pattern regex

I'm trying to write a sed script that finds every word that contains a certain pattern and then prepends all words that contain that pattern. For example:
foobarbaz barfoobaz barbazfoo barbaz
might turn into:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
I understand the basics of capture groups and backrefrences, but I'm still having trouble. Specifically I can't get it so that it captures each whole word separately.
s/\(.*\)men\(.*\)/ not just the \1men\2, but the \1women\2 and \1children\2 too /
I tried using \s, for whitespace as many sites recommend, but sed treats \s as the separate characters \ and s
You could use the non-space character \S as follows:
sed 's/\S*foo\S*/qux&/g' <<< "foobarbaz barfoobaz barbazfoo barbaz"
this will match words containing foo. The replacement string qux& will prepend every matched pattern with qux. Output:
quxfoobarbaz quxbarfoobaz quxbarbazfoo barbaz
It works fine if no spaces in each word.
echo "foobarbaz barfoobaz barbazfoo barbaz" | sed 's/\([^ ]*foo[^ ]*\)/qux\1/g'

Using sed to replace string matching regex with wildcards

I have a string I'm trying manipulate with sed
js/plex.js?hash=f1c2b98&version=2.4.23"
Desired output is
js/plex.js"
This is what I'm currently trying
sed -i s'/js\/plex.js[\?.\+\"]/js\/plex.js"/'
But it is only matching the first ? and returns this output
js/plex.js"hash=f1c2b98&version=2.4.23"
I can't see why this isn't working after a few hours
This works
echo 'js/plex.js?hash=f1c2b98&version=2.4.23"' | sed s:.js?.*:.js:g
With the original Regex:
Firstly I would suggest use a different delimiter (like : in sed when using / in the regex. Secondly, the use of [] means that you are matching the characters inside the brackets (and as such it will not expand the .+ to the end of the line - you could potentially try put the + after the [])
perhaps
sed 's#\(js/plex.js?\)[^"]\+".*#\1#g'
..
\# is used as a delimiter
\(js/plex.js?\)[^"]\+".* #find this pattern and replace everything with your marked pattern \1 found
The marked pattern
In sed you can mark part of a pattern or the whole pattern buy using \( \). .
When part of a pattern is enclosed by brackets () escaped by backslashes..the pattern is marked/stored...
in my example this is my pattern without marking
js/plex.js?[^"]\+".*
but I only want sed to remember js/plex.js? and replace the whole line with only this piece of pattern js/plex.js? ..with sed the first marked pattern is known as \1, the second \2 and so forth
\(js/plex.js?\) ---> is marked as \1
Hence I replace the whole line with \1

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advice
EDIT:
Hi, basically i'm inside a loop and only sometimes i get files with this pattern. So currently im putting like grep "$i" test.txt which works in all the cases except when I have to encounter such patterns.
And I'm actually grepping for the exact file_number, file sequence.So if it says 123_29887 it will be 123_29887. Thanks.
You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how to I write a regex for SED that matches a string that is delimited bt two other strings ?
You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.