What does the following SED pattern exactly do?

I am working on a CGI script and the developer who worked on this before me has used a SED Pattern.
COMMAND=`echo "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' | sed "s/%20/ /g"`
Here com_tex is the name of the text box in HTML.
What this line does is take a value from the HTML text box and assign it to a shell variable. The sed pattern is apparently (I am not sure) needed to extract the value without the other accompanying data.
Here is the issue that prompted this question. The same pattern is used for a text area where I enter a command, and I need it retrieved exactly as typed. However, it comes back jumbled. E.g. if I enter the following command in the text box:
/usr/bin/free -m >> /home/admin/memlog.txt
The value that gets stored in the variable is:
%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt
It is clear that / is being replaced by %2F, a space by +, and the > sign by %3E.
But I cannot figure out where this is specified in the above pattern. Will someone please tell me how that pattern works, or what pattern I should use instead so that I get the command I entered rather than the output above?

sed -n
The -n switch means "don't print": sed will not automatically print each line.
's/
s is for substitution, / is the delimiter, so the command looks like
s/Thing to sub/substitution/optional extra command
^.*com_tex=
^ means the start of the line
.* means match 0 or more of any character
So it will match the longest string from the start of the line up to com_tex=
\(\)
This is a capture group, whatever is matched inside these brackets is saved and can be used later
[^&]*
[^] When the hat is the first character inside square brackets, it means match any single character that is not listed inside the brackets
* The same as before, means 0 or more matches
The capture group combined with this means capture every character up to, but not including, the next &.
.*$
The same as the first bit except $ means the end of the line, so this matches everything until the end
/\1/p'
After the second / is the substitution. \1 is the capture group from before, so this will substitute everything we matched in the first part (the whole line) with the capture group.
p means print. This must be stated explicitly because the -n switch was used, so only lines where the substitution succeeded are printed.
|
PIPE
s/%20/ /g
Substitute a space for %20; g means global, so do it for every match on the line
HTH :)
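To see the whole pipeline in action, here is a minimal sketch run against a made-up query string. The a= and b= fields and the ls%20-l value are assumptions for illustration; only com_tex comes from the question.

```shell
# Hypothetical query string; only the com_tex field name is from the question.
QUERY_STRING='a=1&com_tex=ls%20-l&b=2'

# Same two sed stages as the original line, using $(...) instead of backticks.
COMMAND=$(printf '%s\n' "$QUERY_STRING" |
    sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p' |
    sed 's/%20/ /g')

printf '%s\n' "$COMMAND"
```

The first sed isolates ls%20-l and the second turns %20 into a space. Nothing in either stage touches +, %2F or %3E.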

The %2F, + and %3E substitutions are not performed by either of the patterns. They are URL encoding (percent-encoding), which the browser applies to the form data before sending it, so the value has to be decoded separately.
I will try to explain the patterns a little at a time
sed -n
-n specifies that sed should not automatically print each input line (here, the query string) after applying the commands.
The command following is of the form 's/regexp/replacement/flags'
^.*com_tex=\([^&]*\).*$
^ matches the beginning of the line
.* matches zero to many of any character
com_tex= matches the characters literally
\([^&]*\) '\(' specifies the beginning of a group that can later be backreferenced via its index. '[^&]*' matches zero to many characters which are not '&'. '\)' specifies the end of the group.
.* See above
$ matches the end of the line
\1
The above replacement is a backreference to the first (and only) group in the regexp, i.e. '[^&]*'. So the entire line is replaced with the characters immediately following 'com_tex=' up to, but not including, the next '&' (or the end of the line if there is none).
The p flag specifies that if a substitution took place, the current line post substitution should be printed.
sed "s/%20/ /g"
The above is much simpler: it replaces all (not just the first) occurrences of '%20' with a space ' '.
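To get the command back exactly as typed, the extracted value needs full URL decoding rather than a single %20 replacement. A minimal sketch, assuming a POSIX shell and awk; urldecode is a hypothetical helper name, not part of the original script, and it is only meant for single-byte (ASCII) escapes:

```shell
# Hypothetical helper: decode application/x-www-form-urlencoded text.
urldecode() {
    printf '%s\n' "$1" | awk '
    BEGIN { hex = "0123456789ABCDEF" }
    {
        gsub(/\+/, " ")                 # "+" stands for a space in form data
        out = ""
        # Consume one %XX escape at a time, left to right.
        while (match($0, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            h = toupper(substr($0, RSTART + 1, 2))
            code = (index(hex, substr(h, 1, 1)) - 1) * 16 + index(hex, substr(h, 2, 1)) - 1
            out = out substr($0, 1, RSTART - 1) sprintf("%c", code)
            $0 = substr($0, RSTART + 3)
        }
        print out $0
    }'
}

# The encoded value from the question, as it would arrive in QUERY_STRING.
QUERY_STRING='com_tex=%2Fusr%2Fbin%2Ffree+-m+%3E%3E+%2Fhome%2Fadmin%2Fmemlog.txt'
raw=$(printf '%s\n' "$QUERY_STRING" | sed -n 's/^.*com_tex=\([^&]*\).*$/\1/p')
COMMAND=$(urldecode "$raw")
printf '%s\n' "$COMMAND"
```

The + signs are converted before the %XX escapes so that an encoded literal plus (%2B) is not turned into a space afterwards.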

Related

Regular expression to match string in line between single ":" field delimiters and exclude them, when the string also contains "::" field delimiters

Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user#ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user#ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.

Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.

Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0

If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.

Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
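For the /proc/cmdline case mentioned in the question, one possible approach is to isolate the ip= token first and then take the fourth colon-separated field. This is only a sketch: the cmdline string below is invented for illustration (BOOT_IMAGE, ro and quiet are assumptions), and on a real system you would read /proc/cmdline instead.

```shell
# Made-up kernel command line; only the ip= value is from the question.
cmdline='BOOT_IMAGE=/vmlinuz ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet'

# One token per line, keep the one starting with ip= (prefix stripped),
# then select field 4. Empty fields still count, so "::" is not a problem.
netmask=$(printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^ip=//p' | cut -d: -f4)

printf '%s\n' "$netmask"
```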

extracting text between nth and n+1th occurence of a character using sed

I'd like to know how to take the following string
/text1/text2/text3/wanted_text/text5/text6
and get the wanted text, based solely on its position between the 4th and 5th /?

A substitution command is enough (I've obviously assumed that the interesting part is between the 4th and 5th / as you said):
echo your_text | sed -E 's!(/[^/]+){3}/([^/]+).*!\2!'
where I've used ! as separator for the parts of the substitution command, in order to avoid having to escape every /.
More in detail:
s!…!…! is the search-and-substitute command, where you put the search pattern in the first … and the replacement in the second …;
the search pattern is (/[^/]+){3}/([^/]+).* and matches 3 occurrences of a / followed by 1 or more non-/, followed by a / followed by 1 or more non-/, followed by the rest of the line; the (…) are for grouping a part of a regex such that you can apply quantifiers (like {3}) to the whole group (just like in (/[^/]+){3}), and for capturing the matching text to allow you to refer to it in the replacement; in this case, the third of the 3 texts matching (/[^/]+){3} is referred to via \1, whereas the text matched by ([^/]+) is referred to via \2;
the replacement is simply \2 (see previous point).
For more details about how the search pattern works, and to understand all of its parts, you can refer to this demo on regex 101.
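(-E enables extended regular expressions, which makes the script more readable. It was historically not in POSIX, though GNU and BSD sed both accept it. Without it, you have to prepend \ to each of (, ), {, } and +; note that \+ is itself a GNU extension in basic regular expressions.)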
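Here is the substitution run on the sample string from the question, next to a cut equivalent for comparison. The cut variant is an addition of mine, not part of the answer above; it relies on the leading / making the first field empty, so the wanted text is field 5.

```shell
path='/text1/text2/text3/wanted_text/text5/text6'

# The substitution from the answer: keep only the second capture group.
by_sed=$(printf '%s\n' "$path" | sed -E 's!(/[^/]+){3}/([^/]+).*!\2!')

# Alternative sketch: split on / and take field 5 (field 1 is empty).
by_cut=$(printf '%s\n' "$path" | cut -d/ -f5)

printf '%s\n%s\n' "$by_sed" "$by_cut"
```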

Only remove white space which precedes a specific character. with libre office regex engine

I'm attempting to perform this in Libre Office's regex engine and sed in ubuntu terminal.
1.Example strings:
-Polizeiwache (f)police station
-Freibad (n)open-air swimming pool
2.Desired outputs:
-Polizeiwache (f)policestation
-Freibad (n)open-airswimmingpool
I've been trying to select the character ) and replace every succeeding space with nothing.
Any help is appreciated.

What you're trying to accomplish is unclear. Your text says "replace every", but your example shows replacing only the first space. To replace every:
sed 'h;s/[^)]*//;s/ //g;x;s/).*//;G;s/\n//'
What this does:
h copy the line to sed's hold space
s/[^)]*// replace not-a-) repeated with nothing. This deletes the first part of the line.
s/ //g replace a blank with nothing. g option says do for every occurrence. Now we have the second part of the line as we want it.
x exchange hold space and pattern (working) space. Now we have the whole line in the pattern space again.
s/).*// replace ) followed by any character sequence with nothing. Now we have the first part of the line.
G append the hold space. Now we have re-joined the whole line (second part edited), separated by a newline.
s/\n// remove the newline in the middle.

You can try this:
sed -i.bak 's/\()[^ ]*\) /\1/g' yourfile
Pattern details:
\( # open the capture group 1
) # a literal closing parenthesis
[^ ]* # zero or more (*) characters that are not a space [^ ]
\) # close the capture group 1
# a space (do you see it?)
\1 is a backreference to capture group 1; that is, it contains everything that has been matched in this group. Since the space is not in the group, it is removed.
g stands for global: the substitution is applied to every match on the line.
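Because every match of this pattern has to start at a ), the g flag only removes the first space after each ). If every space after the ) should go, one option is to repeat the substitution in a loop until it stops matching. A minimal sketch using sed's label and t (branch if a substitution was made) commands; strip_after_paren is a hypothetical function name:

```shell
# Hypothetical wrapper: repeat the substitution until no space follows the ).
strip_after_paren() {
    sed -e ':a' -e 's/\()[^ ]*\) /\1/' -e 'ta'
}

result=$(printf '%s\n' \
    '-Polizeiwache (f)police station' \
    '-Freibad (n)open-air swimming pool' | strip_after_paren)

printf '%s\n' "$result"
```

The space before the opening parenthesis is left alone because the match can only begin at the closing one.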

Here is an awk
awk -F\) '{gsub(/ /,"",$NF)}1' OFS=\) file
-Polizeiwache (f)policestation
-Freibad (n)open-airswimmingpool
It sets ) as the field separator, and then removes the spaces from the last field.
Or if you like to change only first space, use sub instead of gsub:
awk -F\) '{sub(/ /,"",$NF)}1' OFS=\) file
-Polizeiwache (f)policestation
-Freibad (n)open-airswimming pool

Regular expression to match beginning and end of a line?

Could anyone tell me a regex that matches the beginning or end of a line? E.g. if I used sed 's/[regex]/"/g' file, the output would be each line in quotes. I tried [\^$] and [\^\n] but neither of them seemed to work. I'm probably missing something obvious; I'm new to these.

Try:
sed -e 's/^/"/' -e 's/$/"/' file

To add quotes to the start and end of every line is simply:
sed 's/.*/"&"/g'
The RE you were trying to come up with to match the start or end of each line, though, is:
sed -r 's/^|$/"/g'
It's an ERE (enabled by "-r"), so it will work with GNU sed but not older seds.

matthias's response is perfectly adequate, but you could also use a backreference to do this. if you're learning regular expressions, they are a handy thing to know.
here's how that would be done using a backreference:
sed 's/\(^.*$\)/"\1"/g' file
at the heart of that regex is ^.*$, which means match anything (.*) surrounded by the start of the line (^) and the end of the line ($), which effectively means that it will match the whole line every time.
putting that term inside parentheses creates a capture group that we can refer back to later on (in the replace pattern). but for sed to realize that you mean to create a group instead of matching literal parentheses, you have to escape them with backslashes. thus, we end up with \(^.*$\) as our search pattern.
the replace pattern is simply a double quote followed by \1, which is our backreference (refers back to the first pattern match enclosed in parentheses, hence the 1). then add your last double quote to end up with "\1".
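A quick way to check that these approaches agree is to run them on the same input. This sketch uses two made-up lines and assumes GNU sed for the backreference form, where ^ and $ inside \( \) are treated as anchors:

```shell
# Made-up sample input, two lines.
input='alpha
beta gamma'

# Two separate substitutions: one at the start anchor, one at the end anchor.
two_step=$(printf '%s\n' "$input" | sed -e 's/^/"/' -e 's/$/"/')

# & in the replacement stands for the whole matched text.
ampersand=$(printf '%s\n' "$input" | sed 's/.*/"&"/')

# Explicit capture group and backreference.
backref=$(printf '%s\n' "$input" | sed 's/\(^.*$\)/"\1"/')

printf '%s\n' "$two_step"
```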

Unable to figure out regex bash or sed or awk

I wanted to split the following jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried the following sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way; is there a way to start from the end of the word and traverse backwards until I encounter -?

Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation. The engine tries to match the literal - first. That first succeeds at the first - in the string (obviously). Then [^-]* matches as many non-- characters as possible, so it will consume 1.6.0_30 (because the next character is in fact a hyphen). Now the engine will try to match $, but that does not work because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine will abandon matching at the first - and continue through the string. Then the engine will match the literal - with the second -. Now [^-]* will consume fcs.x86_64. Now we are actually at the end of the string and $ will match, so the full match (which will be removed) is -fcs.x86_64.

Use cut >>
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2

Try doing this :
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If using bash, sh or ash, you can do :
var=jdk-1.6.0_30-fcs.x86_64
echo ${var%%-fcs*}
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.
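For comparison, here are the approaches from this thread side by side, plus one variation that is my own assumption about what the asker wanted: ${var%-*} removes the shortest suffix that starts with a hyphen, which is the closest shell equivalent to scanning backwards from the end until the first -, and it does not depend on the suffix being fcs.

```shell
var='jdk-1.6.0_30-fcs.x86_64'

# Anchored sed: delete from the last hyphen to the end of the string.
by_sed=$(printf '%s\n' "$var" | sed 's/-[^-]*$//')

# cut: keep the first two hyphen-separated fields.
by_cut=$(printf '%s\n' "$var" | cut -d- -f-2)

# Parameter expansion: strip the shortest trailing match of -*.
by_expansion=${var%-*}

printf '%s\n' "$by_sed" "$by_cut" "$by_expansion"
```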
