Unable to figure out regex bash or sed or awk

I want to reduce jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter a -?

Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation: the engine tries to match the literal - first. That first succeeds at the first - in the string. Then [^-]* matches as many non-hyphen characters as possible, so it consumes 1.6.0_30 (because the next character is a hyphen). Now the engine tries to match $, but that fails because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine abandons the match at the first - and continues through the string. It then matches the literal - against the second -, and [^-]* consumes fcs.x86_64. Now we are actually at the end of the string and $ matches, so the full match (which will be removed) is -fcs.x86_64.
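As a quick check of the explanation above, here is a minimal sketch that runs the anchored substitution and compares it with the shell's shortest-suffix removal, which performs the same "cut at the last hyphen" operation. The function name strip_last_dash_field is made up for illustration.

```shell
# strip_last_dash_field is a hypothetical helper name, not part of any tool.
# It removes the last hyphen and everything after it, using the anchored pattern.
strip_last_dash_field() {
  printf '%s\n' "$1" | sed 's/-[^-]*$//'
}

pkg='jdk-1.6.0_30-fcs.x86_64'
strip_last_dash_field "$pkg"   # prints jdk-1.6.0_30

# Same result without sed: ${pkg%-*} removes the shortest suffix matching "-*".
printf '%s\n' "${pkg%-*}"      # prints jdk-1.6.0_30
```

A string with no hyphen at all is left unchanged by both forms.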

Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2

Try this:
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If you are using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.
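To make the parameter expansion easier to follow, the sketch below contrasts the shortest-suffix (%) and longest-suffix (%%) operators on the same value. The variable names shortest, longest and literal are only illustrative.

```shell
var='jdk-1.6.0_30-fcs.x86_64'

shortest=${var%-*}      # remove the shortest suffix matching "-*"
longest=${var%%-*}      # remove the longest suffix matching "-*"
literal=${var%%-fcs*}   # remove from the literal "-fcs" onward

printf '%s\n' "$shortest"   # prints jdk-1.6.0_30
printf '%s\n' "$longest"    # prints jdk
printf '%s\n' "$literal"    # prints jdk-1.6.0_30
```

With a pattern that starts with the literal -fcs, % and %% give the same answer here because -fcs occurs only once in the string.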

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user#ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user#ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.

Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left, and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :

Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0

If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
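To see why treating every colon as a delimiter still works when some values are omitted, this small sketch lists each field with its index. The helper name list_fields is invented for the example.

```shell
# list_fields is a hypothetical helper: print every colon-separated field
# together with its position, so empty fields become visible.
list_fields() {
  printf '%s\n' "$1" | awk -F: '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

line='ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off'
list_fields "$line"
# prints:
# 1=[ip=10.0.20.100]
# 2=[]
# 3=[10.0.20.1]
# 4=[255.255.254.0]
# 5=[ws01.example.com]
# 6=[]
# 7=[off]
```

The "double colon" is simply an empty field between two single-colon delimiters, so the netmask stays in position 4.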

Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
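For the /proc/cmdline use case mentioned in the question, one possible sketch is to isolate the ip= token first and then apply the same cut. The sample command line and the function name netmask_from_cmdline are assumptions made for illustration; on a real system the input would come from /proc/cmdline.

```shell
# netmask_from_cmdline is a hypothetical helper. It splits its argument on
# whitespace, keeps only the token that starts with "ip=", strips that prefix,
# and prints the 4th colon-separated field.
netmask_from_cmdline() {
  printf '%s\n' "$1" | tr -s '[:space:]' '\n' | sed -n 's/^ip=//p' | cut -d: -f4
}

# Sample data standing in for the contents of /proc/cmdline.
cmdline='BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet'
netmask_from_cmdline "$cmdline"   # prints 255.255.254.0
```

On a live system the call would look like netmask_from_cmdline "$(cat /proc/cmdline)".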

What does the following SED pattern exactly do?

I am working on a CGI script and the developer who worked on this before me has used a SED Pattern.
COMMAND=`echo "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g"`
Here com_tex is the name of the text box in HTML.
What this line does is take a value from the HTML text box and assign it to a shell variable. The sed pattern is apparently (I am not sure) necessary to extract the value from the HTML without the other unnecessary accompanying stuff.
I will also mention the issue that prompted this question. The same pattern is used for a text area where I enter a command, and I need it retrieved exactly as entered. However, it is getting jumbled up. E.g. if I enter the following command in the text box:
/usr/bin/free -m >> /home/admin/memlog.txt
The value that gets stored in the variable is:
%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt
All of us can get that / is being substituted by %2F, a space by + and the > sign by %3E.
But I just can not figure how this is specified in the above pattern! Will someone please tell me how that pattern works or what pattern should I substitute there so that I would get my entered command instead of the output I am getting?

sed -n
-n switch means "don't print": sed's automatic printing of every line is suppressed
's/
s is for substitutions, / is a delimiter so the command looks like
s/Thing to sub/substitution/optional extra command
^.*com_tex=
^ means the start of the line
.* means match 0 or more of any character
So it will match the longest string from the start of the line up to com_tex=
\(\)
This is a capture group, whatever is matched inside these brackets is saved and can be used later
[^&]*
[^] When the hat is the first character inside square brackets, it means match any single character that is not listed in the brackets
* The same as before, means 0 or more matches
The capture group combined with this means capture everything up to (but not including) the next &.
.*$
The same as the first bit except $ means the end of the line, so this matches everything until the end
/\1/p'
After the second / is the substitution. \1 is the capture group from before, so this will substitute everything we matched in the first part(the whole line) with the capture group.
p means print, this must be explicitly stated as the -n switch was used and will prevent other lines from being printed.
|
PIPE
s/%20/ /g
Sub %20 for a space, g means global so do it for every match on the line
HTH :)
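The walkthrough above can be tried on a sample query string. This is only a sketch: the query strings and the function name extract_com_tex are invented for the example, while the two sed commands are the ones from the question.

```shell
# extract_com_tex is a hypothetical wrapper around the pipeline from the question.
extract_com_tex() {
  printf '%s\n' "$1" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed 's/%20/ /g'
}

extract_com_tex 'user=admin&com_tex=ls%20-l%20/tmp&submit=Run'   # prints: ls -l /tmp
extract_com_tex 'user=admin&submit=Run'                          # prints nothing: no com_tex= to match
```

The second call shows the effect of -n together with p: a line on which the substitution does not happen is not printed at all.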

This is not performed by any of the patterns. The escaping is URL encoding (percent-encoding): the browser applies it when it submits the form data, and the sed pipeline only reverses the %20 case.
I will try to explain the patterns a little at a time
sed -n
-n specifies that sed should not print out the text to be matched, ie the html, after applying the commands.
The command following is of the form 's/regexp/replacement/flags'
^.*com_tex=\([^&]*\).*$
^ matches the beginning of the line
.* matches zero to many of any character
com_tex= matches the characters literally
\([^&]*\) '\(' specifies the beginning of a group that can later be backreferenced via its index. '[^&]*' matches zero to many characters which are not '&'. '\)' specifies the end of the group.
.* See above
$ matches the end of the line
\1
The above replacement is a backreference to the first (and only) group in the regexp i.e. '[^&]*'. So the replacement replaces the entire line with all characters immediately following 'com_tex=' till the first '&'.
The p flag specifies that if a substitution took place, the current line post substitution should be printed.
sed "s/%20/ /g"
The above is much simpler: it replaces all (not just the first) occurrences of '%20' with a space ' '.
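Since the pipeline only handles %20, the + and other %XX sequences from the question stay encoded. One way to decode them is sketched below with awk; the function name urldecode is an assumption (it is not a standard command), and the sketch only aims at plain ASCII input like the example in the question.

```shell
# urldecode is a hypothetical helper: turn "+" into a space and every %XX
# escape into the character with that hexadecimal code (ASCII input assumed).
urldecode() {
  printf '%s\n' "$1" | awk '
    BEGIN { hex = "0123456789ABCDEF" }
    {
      gsub(/\+/, " ")                      # form encoding uses + for a space
      out = ""
      while (match($0, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
        hi = index(hex, toupper(substr($0, RSTART + 1, 1))) - 1
        lo = index(hex, toupper(substr($0, RSTART + 2, 1))) - 1
        out = out substr($0, 1, RSTART - 1) sprintf("%c", hi * 16 + lo)
        $0 = substr($0, RSTART + 3)        # continue after the decoded escape
      }
      print out $0
    }'
}

raw='%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt'
urldecode "$raw"   # prints /usr/bin/free -m >> /home/admin/memlog.txt
```

In the CGI script this would replace the second sed, for example COMMAND=$(urldecode "$value") where value holds the text captured by the first sed. Executing a decoded command from a web form is a separate security question that this sketch does not address.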

How to terminate a regular expression and start another

I have a file which has data something like this
34sdf, 434ssdf, 43fef,
34sdf, 434ssdf, 43fef, sdfsfs,
I have to identify the sdfsfs, token and replace it and/or print the line.
The exact condition is that the tokens are comma separated, and the target expression starts with a non-numeric character and runs until a comma is met.
Now I start with [^0-9] for the leading non-numeric character, but the next character is unknown to me: it can be a number, a special character, a letter or even a space. So I wanted a (anything)*. But the previous [] comes into play and spoils it. [^0-9]* or [^0-9].*, or [^0-9]\+.*, or [^0-9]{1}*, or [^0-9][^,]* or [^0-9]{1}[^\,]*: nothing has worked so far. So my question is how to write a regex for this (a non-numeric starting character, then any number of characters other than a comma, up to the comma). I am using grep and sed (GNU). Another question: does it make any difference whether the regex is POSIX or non-POSIX?

Something like that maybe?
(?:(?:^(\D.*?))|(?:,\s(\D.*?))),
This captures the string that starts with a non-numeric character.
I'm not sure if sed supports \D, but you can easily replace it with [^0-9] if not, which you already know.
EDIT: Can be trimmed to:
(?:\s|^)(\D.*?),

With sed, and slight modifications to your last regex:
sed -n 's/.*,[ ]*\([^ 0-9][^\,]*\),/\1/p' input

I think pattern (\s|^)(\D[^,]+), will catch it.
It matches white-space or start of string and group of a non-digit followed by anything but comma, which is followed by comma.
You can use [^0-9] if \D is not supported.

This might work for you (GNU sed):
sed '/\b[^0-9,][^,]*/!d' file # only print lines that match
or:
sed -n 's/\b[^0-9,][^,]*/XXX/gp' file # substitute `XXX` for match
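As an alternative to a single regex, the condition from the question can also be checked token by token. The following is only a sketch of that idea with awk, not one of the answers above; the function name find_non_numeric_tokens and the "line: token" output format are made up for illustration.

```shell
# find_non_numeric_tokens is a hypothetical helper. It reads lines on stdin,
# splits them on commas, trims leading blanks from each token, and prints
# "line-number: token" for every non-empty token that does not start with a digit.
find_non_numeric_tokens() {
  awk -F',' '{
    for (i = 1; i <= NF; i++) {
      tok = $i
      sub(/^[ \t]+/, "", tok)
      if (tok != "" && tok !~ /^[0-9]/) print NR ": " tok
    }
  }'
}

printf '%s\n' '34sdf, 434ssdf, 43fef,' '34sdf, 434ssdf, 43fef, sdfsfs,' | find_non_numeric_tokens
# prints: 2: sdfsfs
```

Because it relies only on bracket expressions and anchors, this avoids the POSIX versus Perl-style syntax question (\D, (?:...), .*?) altogether.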

Why sed doesn't print an optional group?

I have two strings, say foo_bar and foo_abc_bar. I would like to match both of them, and if the first one is matched I would like to emphasize it with = sign. So, my guess was:
echo 'foo_abc_bar' | sed -r 's/(foo).*(abc)?.*(bar)/\1=\2=\3/g'
> foo==bar
or
echo 'foo_abc_bar' | sed -r 's/(foo).*((abc)?).*(bar)/\1=\2=\3/g'
> foo==
But as output above shows none of them work.
How can I specify an optional group that will match if the string contains it or just skip if not?

The solution:
echo 'foo_abc_bar' | sed -r 's/(foo)_((abc)_)?(bar)/\1=\3=\4/g'
Why your previous attempts didn't work:
.* is greedy, so for the regex (foo).*(abc)?.*(bar) attempting to match 'foo_abc_bar' the (foo) will match 'foo', and then the .* will initially match the rest of the string ('_abc_bar'). The regex will continue until it reaches the required (bar) group and this will fail, at which point the regex will backtrack by giving up characters that had been matched by the .*. This will happen until the first .* is only matching '_abc_', at which point the final group can match 'bar'. So instead of the 'abc' in your string being matched in the capture group it is matched in the non-capturing .*.
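The effect described above can be made visible by capturing the two .* parts as well and printing each piece in brackets. This is a sketch for illustration only; the extra groups and the function name show_groups are additions, and sed -E is used as the more portable spelling of GNU sed's -r.

```shell
# show_groups is a hypothetical helper: groups 2 and 4 are the two ".*" parts,
# group 3 is the optional (abc).
show_groups() {
  printf '%s\n' "$1" | sed -E 's/(foo)(.*)(abc)?(.*)(bar)/[\2][\3][\4]/'
}

show_groups 'foo_abc_bar'   # prints [_abc_][][]
```

The first .* has swallowed _abc_, so the optional group and the second .* are both left with nothing.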
Explanation of my solution:
The first and most important thing is to replace the .* with _, there is no need to match any arbitrary string if you know what the separator will be. The next thing we need to do is figure out exactly which portion of the string is optional. If the strings 'foo_abc_bar' and 'foo_bar' are both valid, then the 'abc_' in the middle is optional. We can put this in an optional group using (abc_)?. The last step is to make sure that we still have the string 'abc' in a capturing group, which we can do by wrapping that portion in an additional group, so we end up with ((abc)_)?. We then need to adjust the replacement because there is an extra group, so instead of \1=\2=\3 we use \1=\3=\4, \2 would be the string 'abc_' (if it matched). Note that in most regex implementations you could also have used a non-capturing group and continued to use \1=\2=\3, but sed does not support non-capturing groups.
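Here is the solution applied to both input strings, as a small sketch. The wrapper name emphasize is invented, and -E is used in place of -r (GNU sed accepts both).

```shell
# emphasize is a hypothetical wrapper around the substitution from this answer.
emphasize() {
  printf '%s\n' "$1" | sed -E 's/(foo)_((abc)_)?(bar)/\1=\3=\4/g'
}

emphasize 'foo_abc_bar'   # prints foo=abc=bar
emphasize 'foo_bar'       # prints foo==bar
```

When the optional group does not take part in the match, \3 expands to the empty string, which is why the second call prints two adjacent = signs.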
An alternative:
I think the regex above is your best bet because it is most explicit (it will only match the exact strings you are interested in). However, in engines that support it you could also avoid the issue described above by using lazy repetition (matches as few characters as possible) instead of greedy repetition (matches as many characters as possible), written by changing the .* to .*?. Be aware that lazy quantifiers are a Perl-style feature that POSIX regular expressions do not define, so sed does not treat .*? as lazy; with sed, the expression below finds abc only because that group is no longer optional:
echo 'foo_abc_bar' | sed -r 's/(foo).*?(abc).*?(bar)/\1=\2=\3/g'

Maybe you could simply use:
echo 'foo_abc_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
echo 'foo_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
> foo=abc=bar=
> foo=bar=
This avoids the foo==bar you get with foo_bar and I found it a bit weird to show emphasis by putting = sometimes before the match, sometimes after the match.

Substitution till the end of the line in bash

I have a huge text file with lots of lines like:
asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636
.....
For each line, I would like to remove everything after DATA_ up to the end of the line, so I would get:
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
.....
I know that you can do something similar with:
sed -e "s/^DATA_*$/DATA_/g" filename.txt
but it does not work.
Do you know how?
Thanks

You have two problems: you're unnecessarily matching beginning and end of line with ^ and $, and you're looking for _* (zero or more underscores) instead of .* (zero or more of any character). Here's what you want:
sed -e 's/_DATA_.*/_DATA_/'
The g on the end (global) won't do anything, because you're already going to remove everything from the first instance of "DATA" onward - there can't be another match.
P.S. The -e isn't strictly necessary if you only have one expression, but if you think you might tack more on, it's a convenient habit.

With regular expressions, * means the previous character, any number of times. To match any character, use .
So what you really want is .* which means any character, any number of times, like this:
sed 's/DATA_.*/DATA_/' filename.txt
Also, I removed the ^ which means start of line, since you want to match "DATA_" even if it's not in the beginning of a line.
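A short sketch comparing the corrected command with the original attempt on the two sample lines from the question. The function name trim_after_data and the variable sample are illustrative only.

```shell
# trim_after_data is a hypothetical helper: keep everything up to and
# including _DATA_ and drop the rest of each line read from stdin.
trim_after_data() {
  sed 's/_DATA_.*/_DATA_/'
}

sample='asdasdasdaasdasd_DATA_3424223423423423
gsgsdgsgs_DATA_6846343636'

printf '%s\n' "$sample" | trim_after_data
# prints:
# asdasdasdaasdasd_DATA_
# gsgsdgsgs_DATA_

# The original attempt changes nothing: ^DATA_*$ only matches a line that
# consists of "DATA" followed by zero or more underscores.
printf '%s\n' "$sample" | sed -e 's/^DATA_*$/DATA_/g'
```

To edit filename.txt itself rather than print the result, GNU sed's -i option can be added to the first command.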

Using awk: set the field delimiter to "_DATA_", then print field 1 ($1) followed by the delimiter. No regular expression is needed.
$ awk -F"_DATA_" '{print $1"_DATA_"}' file
asdasdasdaasdasd_DATA_
gsgsdgsgs_DATA_
