extract substring with SED - regex

I have strings like these:
input1 = abc-def-ghi-jkl
input2 = mno-pqr-stu-vwy
I want to extract the first word between the "-" characters.
For the first string I want to get: def
For the second string I want to get: pqr
I want to use sed. Could you help me, please?

Use
sed 's,^[^-]*-\([^-]*\).*,\1,' file
The string after the first - will be captured up to the second - and the rest will be matched, then the matched line will be replaced with the group text.
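A minimal sketch of that substitution applied to the two sample strings from the question. The function name second_field is hypothetical and only wraps the command for illustration.

```shell
# Hypothetical helper: print the text between the first and second dash.
second_field() {
  # ^[^-]*-      everything up to and including the first dash
  # \([^-]*\)    the captured word (group 1)
  # .*           the rest of the line, which is discarded
  printf '%s\n' "$1" | sed 's,^[^-]*-\([^-]*\).*,\1,'
}

second_field 'abc-def-ghi-jkl'   # prints: def
second_field 'mno-pqr-stu-vwy'   # prints: pqr
```

A line that contains no dash at all does not match the pattern, so sed prints it unchanged.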

With bash:
var='input1 = abc-def-ghi-jkl'
var=${var#*-} # remove shortest prefix `*-`, this removes `input1 = abc-`
echo "${var%%-*}" # remove longest suffix `-*`, this removes `-ghi-jkl`
Or with awk:
awk -F'-' '{print $2}' <<<'input1 = abc-def-ghi-jkl'
Use - as input field separator and print the second field.
Or with cut:
cut -d'-' -f2 <<<'input1 = abc-def-ghi-jkl'
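The two parameter expansions can be wrapped in a small function, sketched here under the assumption that a POSIX-style shell is enough (nothing in it is bash-specific). The names between_dashes and rest are made up for this example.

```shell
# Hypothetical helper: second dash-separated field without any external tool.
between_dashes() {
  rest=${1#*-}                  # drop the shortest prefix ending in "-"
  printf '%s\n' "${rest%%-*}"   # drop the longest suffix starting with "-"
}

between_dashes 'input1 = abc-def-ghi-jkl'   # prints: def
```

If the string has no dash, neither expansion removes anything and the whole string is printed, so check for a dash first if that matters.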

When you want to use sed, you can choose between solutions like
# Double processing
echo "$input1" | sed 's/[^-]*-//;s/-.*//'
# Normal approach
echo "$input1" | sed -r 's/^[^-]*-([^-]*).*/\1/'
# Funny alternative
echo "$input1" | sed -r 's/(^[^-]*-|-.*)//g'
The obvious "external" tool would be cut. You can also look at a Bash builtin solution like
[[ ${input1} =~ ([^-]*)-([^-]*) ]] && printf %s "${BASH_REMATCH[2]}"

grep solution (in my opinion this is the most natural approach, as you are only trying to find matches to a regular expression - you are not looking to edit anything, so there should be no need for the more advanced command sed)
grep -oP '^[^-]*-\K[^-]*(?=-)' << EOF
> abc-qrs-bobo-the-clown
> 123-45-6789
> blah-blah-blah
> no dashes here
> mahi-mahi
> EOF
Output
qrs
45
blah
Explanation
Look at the inputs first, included here for completeness as a heredoc (more likely you would name your file as the last argument to grep.) The solution requires at least two dashes to be present in the string; in particular, for mahi-mahi it will find no match. If you want to find the second mahi as a match, you can remove the lookahead assertion at the end of the regular expression (see below).
The regular expression does this. First note the command options: -o to return only the matched substring, not the entire line; and -P to use Perl extensions. Then, the regular expression: start from the beginning of the line (^); look for zero or more non-dash characters followed by dash, and then (\K) discard this part of the required match from the substrings found to match the pattern. Then look for zero or more non-dash characters again - this will be returned by the command. Finally, require a dash following this pattern, but do not include it in the match. This is done with a lookahead (marked by (?= ... )).
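The variant without the lookahead, mentioned in the explanation, could look like the sketch below. It assumes GNU grep built with PCRE support (-P); the function name first_after_dash is hypothetical.

```shell
# Hypothetical helper: same pattern as above, minus the trailing (?=-),
# so one dash is enough for a match.
first_after_dash() {
  grep -oP '^[^-]*-\K[^-]*'
}

printf '%s\n' 'abc-qrs-bobo-the-clown' 'no dashes here' 'mahi-mahi' | first_after_dash
# prints:
# qrs
# mahi
```

The line without any dash still produces no output, because the part before \K requires at least one dash.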

Related

Convert regex positive look ahead to sed operation

I would like to use sed to find and replace every occurrence of - with _ but only before the first occurrence of = on every line.
Here is a dataset to work with:
ke-y_0-1="foo"
key_two="bar"
key_03-three="baz-jazz-mazz"
key-="rax_foo"
key-05-five="craz-"
In the end the dataset should look like this:
ke_y_0_1="foo"
key_two="bar"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
I found this regex will match properly.
\-(?=.*=)
However, the regex uses a positive lookahead, and it appears that sed (even with -E, -e or -r) does not know how to work with positive lookaheads.
I tried the following but keep getting Invalid preceding regular expression
cat dataset.txt | sed -r "s/-(?=.*=)/_/g"
Is it possible to convert this in a usable way with sed?
Note, I do not want to use perl. However I am open to awk.
You can use
sed ':a;s/^\([^=]*\)-/\1_/;ta' file
See the online demo:
#!/bin/bash
s='ke-y_0-1="foo"
key_two="bar"
key_03-three="baz-jazz-mazz"
key-="rax_foo"
key-05-five="craz-"'
sed ':a; s/^\([^=]*\)-/\1_/;ta' <<< "$s"
Output:
ke_y_0_1="foo"
key_two="bar"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
Details:
:a - setting a label named a
s/^\([^=]*\)-/\1_/ - matches zero or more chars other than = from the start of the string (capturing them into Group 1, \1), then a - char, and replaces the match with the Group 1 value (\1) and a _ (which replaces the found - char)
ta - jump to label a upon a successful replacement. Otherwise, stop.
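Putting a label and further commands on one line separated by ; is accepted by GNU sed, but some other sed implementations read the rest of the line as part of the label. A sketch of the same loop with each command in its own -e argument, which should be more portable; the function name dash_to_underscore_in_key is hypothetical.

```shell
# Hypothetical helper: replace every "-" before the first "=" with "_".
dash_to_underscore_in_key() {
  # Each pass replaces the LAST dash before the first "=" (because [^=]*
  # is greedy); "ta" repeats until no dash is left in the key part.
  sed -e ':a' -e 's/^\([^=]*\)-/\1_/' -e 'ta'
}

printf '%s\n' 'ke-y_0-1="foo"' 'key_03-three="baz-jazz-mazz"' | dash_to_underscore_in_key
# prints:
# ke_y_0_1="foo"
# key_03_three="baz-jazz-mazz"
```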
You might also use awk setting the field separator to = and replace all - with _ for the first field.
To print only the replaced lines:
awk 'BEGIN{FS=OFS="="}gsub("-", "_", $1)' file
Output
ke_y_0_1="foo"
key_03_three="baz-jazz-mazz"
key_="rax_foo"
key_05_five="craz-"
If you want to print all lines:
awk 'BEGIN{FS=OFS="="}{gsub("-", "_", $1);print}' file

How to find and replace a pattern string using sed/perl/awk?

I have a file foo.properties with contents like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
In my script, I need to replace whatever value is set for ph (the current value is unknown to the bash script) and change it to 0.5. So the file should look like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
I know it can be easily done if the current value is known by using
sed "s/\,ph\:0.03\,/\,ph\:0.5\,/" foo.properties
But in my case, I have to actually read the contents against allNames and search for the value and then replace within a for loop. Rest all is taken care of but I can't figure out the sed/perl command for this.
I tried using sed "s/\,ph\:.*\,/\,ph\:0.5\,/" foo.properties and some variations but it didn't work.
A simpler sed solution:
sed -E 's/([=,]ph:)[0-9.]+/\10.5/g' file
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
Here we match ([=,]ph:) (i.e. , or = followed by ph:) and capture it in group #1. This should be followed by 1+ of the [0-9.] characters to match any number. In the replacement we put \1 back, followed by 0.5
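Since the question mentions doing this inside a loop, here is a sketch that takes the new value as a parameter. The function name set_ph is hypothetical, and it assumes the value contains no /, & or backslash, which would be interpreted by sed.

```shell
# Hypothetical helper: set_ph NEW_VALUE, reads lines on stdin.
set_ph() {
  # Inside double quotes, \\1 reaches sed as \1 (the captured "=ph:" or ",ph:"),
  # and $1 is expanded by the shell to the new value.
  sed -E "s/([=,]ph:)[0-9.]+/\\1$1/g"
}

printf '%s\n' 'allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5' | set_ph 0.5
# prints: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
```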
With your shown samples, please try following awk code.
awk -v new_val="0.5" '
match($0,/,ph:[0-9]+(\.[0-9]+)?/){
val=substr($0,RSTART+1,RLENGTH-1)
sub(/:.*/,":",val)
print substr($0,1,RSTART) val new_val substr($0,RSTART+RLENGTH)
next
}
1
' Input_file
Detailed explanation: The awk variable new_val holds the new value to put in. The main program uses awk's match function to look for the regex ,ph:[0-9]+(\.[0-9]+)? in each line. If a match is found, the matched text minus its leading comma is stored in the variable val, and everything from the : to the end of val is then replaced with a bare :, leaving ph:. The line is printed as the part up to and including the comma, then val, then the new value, then the rest of the line; next skips further processing of that line. The trailing 1 prints all other lines, which have no match, unchanged.
2nd solution: Using sub function of awk.
awk -v newVal="0.5" '/^allNames=/{sub(/,ph:[^,]*/,",ph:"newVal)} 1' Input_file
Would you please try a perl solution:
perl -pe '
s/(?<=\bph:)[\d.]+(?=,|$)/0.5/;
' foo.properties
The -pe option makes perl read the input line by line, perform
the operation, then print it, as sed does.
The regex (?<=\bph:) is a zero-length lookbehind which matches
the string ph: preceded by a word boundary.
The regex [\d.]+ will match a decimal number.
The regex (?=,|$) is a zero-length lookahead which matches
a comma or the end of the string.
As the lookbehind and the lookahead have zero length, they are not
substituted by the s/../../ operator.
[Edit]
As Dave Cross comments, the lookahead (?=,|$) is unnecessary as long as the input file is correctly formatted.
Works with decimal place or not, or no value, anywhere in the line.
sed -E 's/(^|[^-_[:alnum:]])ph:[0-9]*(\.[0-9]+)?/\1ph:0.5/g'
Or possibly:
sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g'
The top one uses "not other naming characters" to describe the character immediately before a name, the bottom one uses delimiter characters (you could add more characters to either). The purpose is to avoid clashing with other_ph or autograph.
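A sketch of the delimiter-based variant run against names that merely end in ph. It puts the captured delimiter back with \1 so the character before ph: is not lost, and escapes the decimal point. The function name replace_ph and the sample line are made up for illustration.

```shell
# Hypothetical helper: only touch "ph:" when it follows a delimiter
# (or starts the line).
replace_ph() {
  sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g'
}

printf '%s\n' 'allNames=other_ph:1.5,autograph:2,ph:0.03' | replace_ph
# prints: allNames=other_ph:1.5,autograph:2,ph:0.5
```

other_ph and autograph are left alone because the character before their ph: is not one of the listed delimiters.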
Here you go
#!/usr/bin/perl
use strict;
use warnings;
print "\nPerl Starting ... \n\n";
while (my $recordLine =<DATA>)
{
chomp($recordLine);
if (index($recordLine, "ph:") != -1)
{
$recordLine =~ s/ph:.*?,/ph:0.5,/g;
print "recordLine: $recordLine ...\n";
}
}
print "\nPerl End ... \n\n";
__DATA__
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
output:
Perl Starting ...
recordLine: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5 ...
Perl End ...
Using any sed in any shell on every Unix box (the other sed solutions posted that use sed -E require GNU or BSD seds):
a) if ph: is never the first tag in the allNames list (as shown in your sample input):
$ sed 's/\(,ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
b) or if it can be first:
$ sed 's/\([,=]ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5

regex help to end match on first occurrence of a character [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.
Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'
In this specific case, you can get the job done without using a non-greedy regex.
Try this non-greedy regex [^/]* instead of .*?:
sed 's|\(http://[^/]*/\).*|\1|g'
With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
don't output -n
search, match pattern, replace and print s/<pattern>/<replace>/p
use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
remember match between brackets \( ... \), later accessible with \1,\2...
match http://
followed by anything in brackets [], [ab/] would mean either a or b or /
first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except / character
* is to repeat previous group so [^/]* means characters except /.
so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
we want to search until the end of the domain, so stop on the next /: add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain, so add .*
now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include the slash after the domain as well, then add one more slash in the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
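The steps above can be sketched as a small function. As an assumption beyond the original answer, it also accepts https by making the s optional with the BRE interval \{0,1\}, and it tolerates a URL with no path. The name url_origin is hypothetical.

```shell
# Hypothetical helper: print scheme and host of an http(s) URL,
# or nothing when the argument does not look like one.
url_origin() {
  # [^/]* stops at the first "/" after "://", which is what a
  # non-greedy .*? would have done in Perl-style regexes.
  printf '%s\n' "$1" | sed -n 's;^\(https\{0,1\}://[^/]*\).*;\1;p'
}

url_origin 'http://www.suon.co.uk/product/1/7/3/'   # prints: http://www.suon.co.uk
url_origin 'https://example.org'                    # prints: https://example.org
url_origin 'not a url'                              # prints nothing
```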
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
Example (finding first sequence of digits):
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex benefits from an alternation |. At each position engine tries to pick the longest match (this is a POSIX standard which is followed by couple of other engines as well) which means it goes with . until a match is found for ([0-9]+).*. But order is important too.
Since global flag is set, engine tries to continue matching character by character up to the end of input string or our target. As soon as the first and only capturing group of left side of alternation is matched (EXPRESSION) rest of line is consumed immediately as well .*. We now hold our value in the first capturing group.
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like the ERE version but with no alternation involved. That's all. At each single position the engine tries to match a digit. If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character, and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
-EDE: end
-SDE: start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
Output:
start block #1 end
The first regex \(end\).* matches and captures the first end delimiter end and substitutes the whole match with the captured characters, which are the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.
sed does not support "non greedy" operator.
You have to use "[]" operator to exclude "/" from match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
Non-greedy solution for more than a single character
This thread is really old but I assume people still need it.
Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO], because a bracket expression matches single characters, not a word.
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
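The two substitutions can be combined into one sed call, sketched below. The function name drop_through_first_hello is hypothetical, and the sketch keeps the answer's assumption that top_sekrit never occurs in the input.

```shell
# Hypothetical helper: delete everything up to and including the first HELLO.
drop_through_first_hello() {
  # Step 1 marks only the first HELLO (no g flag).
  # Step 2 can then be greedy, because the marker occurs exactly once.
  sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//'
}

printf '%s\n' 'aaa HELLO bbb HELLO ccc' | drop_through_first_hello
# prints " bbb HELLO ccc" (with the leading space)
```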
HTH!
This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
another way, not using regex, is to use fields/delimiter method eg
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"
sed certainly has its place, but this is not one of them!
As Dee has pointed out: Just use cut. It is far simpler and much more safe in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see this is a lot more flexible approach.
(all credit to Dee)
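For comparison, a sketch of a similar split done with shell parameter expansion only, which avoids starting a cut process per component. This is an alternative to the cut approach, not part of it; the variable rest is introduced just for this example.

```shell
url="http://www.suepearson.co.uk/product/174/71/3816/"

protocol=${url%%://*}   # everything before the first "://"
rest=${url#*://}        # everything after the first "://"
host=${rest%%/*}        # up to the first "/" of the remainder
urlpath=${rest#*/}      # after the first "/" of the remainder

printf '%s\n' "$protocol" "$host" "$urlpath"
# prints:
# http
# www.suepearson.co.uk
# product/174/71/3816/
```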
sed -E 's|(http://[^/]+/).*|\1|'
There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string, like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:
-e "s,$,/,"
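Put together, the loop and the final append might look like this sketch. It uses -E, which current GNU sed and BSD sed both accept; the function name strip_path_loop is hypothetical.

```shell
# Hypothetical helper: peel off one trailing "/segment" per pass,
# then put a single "/" back at the end.
strip_path_loop() {
  sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop' -e 's,$,/,'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | strip_path_loop
# prints: http://www.suepearson.co.uk/
```

Note that the last command appends a / to every input line, including lines that are not URLs.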
sed -E interprets regular expressions as extended (modern) regular expressions
Update: -E on MacOS X, -r in GNU sed.
Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, start here.
I realize this is an old entry, but someone may find it useful.
As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}
This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with seds numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.
Since PCRE is also tagged here, we could use GNU grep with the lazy (non-greedy) match .*? in the regex, which stops at the first, nearest match, as opposed to .* (which is greedy and goes to the last occurrence of the match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: using grep's oP options here where -P is responsible for enabling PCRE regex here. In main program of grep mentioning regex which is matching starting http/https followed by :// till next occurrence of / since we have used .*? it will look for first / after (http/https://). It will print matched part only in line.
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
Don't bother, I got it on another forum :)
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too
Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo $A|awk '
{
var=gensub(/\//,"||",3,$0) ;
sub(/\|\|.*/,"",var);
print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!
Another sed version:
sed 's|\([^/]\)/[[:alnum:]].*|\1|' file.txt
It matches a / that follows a non-slash character and is followed by an alphanumeric character (so neither slash of :// qualifies), together with the rest of the characters to the end of the line. It then replaces all of that with the character before the slash, i.e. it deletes the path.
@Daniel H (concerning your comment on andcoz' answer, although a long time ago): deleting trailing zeros works with
s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g
it's about clearly defining the matching conditions ...
You should also think about the case where there is no matching delims. Do you want to output the line or not. My examples here do not output anything if there is no match.
You need the prefix up to the third /. So match, two times, a run of non-/ characters followed by /, then another run of non-/ characters; capture all of that, then match / and the rest of the line, and print the captured part. This idea works with any single-character delimiter.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is lot faster than eating char at a time.
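How it works: t x jumps to label x if a substitution has succeeded since the last input line was read. s/ #cee: /\n/ replaces the first delimiter with \n. D deletes up to the first \n and restarts the cycle without reading new input. On that second pass a substitution has been made, so t x jumps to the end and the remainder is printed.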
If there is start and end delims, it is just easy to remove end delims until you reach the nth-2 element you want and then do D trick, remove after end delim, jump to delete if no match, remove before start delim and and print. This only works if start/end delims occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'
If you have access to gnu grep, then can utilize perl regex:
grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/
The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>] : followed by anything that is not >
*> : until you find >
//g : replace any such strings present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
<a href="apple">apple</a>
<a href="banana">banana</a>
<a href="https:example.com">Example.com</a>
Explanation: basically as above. Here,
s/href/` : replace href with ` (backtick)
s/<a : find start of URL
[^`] : followed by anything that is not ` (backtick)
*` : until you find a `
/<a href/g : replace each of those found with <a href
Unfortunately, as mentioned, this is not supported in sed.
To overcome this, I suggest using the next best thing (actually even better): vim's sed-like capabilities.
Define in .bash_profile:
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to have input from pipe in vimdo.
Most of the syntax is the same, but Vim offers more advanced features, and \{-} is its standard non-greedy quantifier. See :help regexp.

Regex parsing issue of multi-line file, replacing two consistent patterns around arbitrary persistent text [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.
Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'
In this specific case, you can get the job done without using a non-greedy regex.
Try this non-greedy regex [^/]* instead of .*?:
sed 's|\(http://[^/]*/\).*|\1|g'
With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
don't output -n
search, match pattern, replace and print s/<pattern>/<replace>/p
use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
remember match between brackets \( ... \), later accessible with \1,\2...
match http://
followed by anything in brackets [], [ab/] would mean either a or b or /
first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except / character
* is to repeat previous group so [^/]* means characters except /.
so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
we want to search untill the end of domain so stop on the next / so add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain so add .*
now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include backslash after the domain as well, then add one more backslash in the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r ‍'s/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
Example (finding first sequence of digits) Live demo:
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex benefits from an alternation |. At each position engine tries to pick the longest match (this is a POSIX standard which is followed by couple of other engines as well) which means it goes with . until a match is found for ([0-9]+).*. But order is important too.
Since global flag is set, engine tries to continue matching character by character up to the end of input string or our target. As soon as the first and only capturing group of left side of alternation is matched (EXPRESSION) rest of line is consumed immediately as well .*. We now hold our value in the first capturing group.
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like ERE version but with no alternation involved. That's all. At each single position engine tries to match a digit.
If it is found, other following digits are consumed and captured and the rest of line is matched immediately otherwise since * means
more or zero it skips over second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
-EDE: end
-SDE: start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
Output:
start block #1 end
First regex \(end\).* matches and captures first end delimiter end and substitues all match with recent captured characters which
is the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to second regex \(\(start.*\)*.\)* that is same as POSIX BRE version above. It matches a single character
if start delimiter start is not matched otherwise it matches and captures the start delimiter and matches the rest of characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.
sed does not support "non greedy" operator.
You have to use "[]" operator to exclude "/" from match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
Non-greedy solution for more than a single character
This thread is really old but I assume people still needs it.
Lets say you want to kill everything till the very first occurrence of HELLO. You cannot say [^HELLO]...
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
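The two-step placeholder trick above could be sketched like this. It is only an illustration: the function name is made up, and instead of a word such as top_sekrit it assumes the control byte \001 never occurs in the input.

```shell
# Hypothetical helper: delete everything up to and including the first HELLO.
# Assumption: the byte \001 never appears in the input.
strip_through_first_hello() {
    marker=$(printf '\001')
    # Step 1: turn only the first HELLO into the marker.
    # Step 2: the greedy .* is now safe, because the marker occurs only once.
    sed "s/HELLO/${marker}/; s/.*${marker}//"
}

printf '%s\n' 'aaa HELLO bbb HELLO ccc' | strip_through_first_hello
```

Lines that contain no HELLO pass through unchanged, since neither substitution matches.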
This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
another way, not using regex, is to use fields/delimiter method eg
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"
sed certainly has its place but this is not one of them!
As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see this is a lot more flexible approach.
(all credit to Dee)
sed -E 's|(http:\/\/[^\/]+\/).*|\1|'
There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:
-e "s,$,/,"
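Put together, the loop and the final "put the slash back" step might look like the sketch below. The function name url_origin_loop is hypothetical, and -E is used in place of -r so that the same line also runs on BSD/macOS sed.

```shell
# Hypothetical wrapper around the loop approach described above.
url_origin_loop() {
    sed -E \
        -e ':loop' \
        -e 's|(http://.+)/.*|\1|' \
        -e 't loop' \
        -e 's|$|/|'
    # Each pass strips the last /segment; "t loop" repeats until the
    # substitution stops matching, then the trailing slash is re-added.
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | url_origin_loop
```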
sed -E interprets regular expressions as extended (modern) regular expressions
Update: -E on MacOS X, -r in GNU sed.
Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, start here.
I realize this is an old entry, but someone may find it useful.
As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\} (no space inside the braces, or sed rejects the interval)
This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar> so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with sed's numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.
Since PCRE is also tagged here, we could use GNU grep with the lazy (non-greedy) quantifier .*?, which stops at the first, nearest match, as opposed to .* (which is greedy and runs to the last occurrence of the match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: grep's -o option prints only the matched part of the line, and -P enables PCRE. The regex matches a leading http or https followed by ://, and because .*? is lazy it stops at the first / after that.
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
don't bother, i got it on another forum :)
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too
Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo $A|awk '
{
var=gensub(/\//,"||",3,$0) ;
sub(/\|\|.*/,"",var);
print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!
Another sed version:
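sed 's|/[[:alnum:]].*||' file.txt
It matches a / followed by an alphanumeric character (so not another forward slash) as well as the rest of the characters till the end of the line, and then replaces the match with nothing (i.e. deletes it). Note that the character class has to be written inside a bracket expression, [[:alnum:]], and that this assumes the line has no scheme prefix: in http://www... the second slash is followed by a letter, so the match would start there.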
@Daniel H (concerning your comment on andcoz' answer, although it was a long time ago): deleting trailing zeros works with
s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g
it's about clearly defining the matching conditions ...
You should also think about the case where there are no matching delims. Do you want to output the line or not? My examples here do not output anything if there is no match.
You need the prefix up to the 3rd /, so select, twice, a string of any length not containing / followed by a /, then a string of any length not containing /, then match a / followed by any string, and print the selection. This idea works with any single-character delimiter.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
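The same idea can be parameterised on which slash to stop at. This is only a sketch: the function name prefix_before_nth_slash is invented here, and -E is used as the portable spelling of -r.

```shell
# Hypothetical helper: print the part of each line before the n-th "/".
# Lines with fewer than n slashes print nothing (because of -n ... p).
prefix_before_nth_slash() {
    n=$1
    # (n-1) groups of "non-slashes followed by a slash", then one more
    # run of non-slashes, then the n-th slash and the rest of the line.
    sed -nE "s,(([^/]*/){$((n - 1))}[^/]*)/.*,\1,p"
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' |
    prefix_before_nth_slash 3
```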
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is a lot faster than eating one char at a time.
Jump to the label if a substitution has already succeeded. Replace the 1st delim with \n. Remove up to the first \n. If \n was added, jump to the end and print.
If there are start and end delims, it is just as easy to remove end delims until you reach the nth-2 element you want and then do the D trick, remove after the end delim, jump to delete if no match, remove before the start delim and print. This only works if start/end delims occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'
If you have access to GNU grep, then you can use Perl regex:
grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/
The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>] : followed by anything that is not >
*> : until you find >
//g : replace any such strings present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
apple
banana
Example.com
Explanation: basically as above. Here,
s/href/` : replace href with ` (backtick)
s/<a : find start of URL
[^`] : followed by anything that is not ` (backtick)
*` : until you find a `
/<a href/g : replace each of those found with <a href
Unfortunately, as mentioned, this is not supported in sed.
To overcome this, I suggest using the next best thing (actually an even better one): vim's sed-like capabilities.
Define in .bash_profile:
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to have input from pipe in vimdo.
Most of the syntax is the same, but Vim has more advanced features, and \{-} is its standard non-greedy quantifier; see :help regexp.

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
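To connect this back to the myvalue=$( ... ) form asked for in the question, here is a small self-contained sketch. The temporary file stands in for input.txt, and -E is used as the portable spelling of -r.

```shell
# Recreate the sample input from the question in a temporary file.
tmpfile=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$tmpfile"

# -n suppresses the default output; the p flag prints only lines where
# the substitution succeeded, reduced to capture group 1.
myvalue=$(sed -nE 's/.*abc([0-9]+)xyz.*/\1/p' "$tmpfile")
rm -f "$tmpfile"

echo "$myvalue"   # 12345
```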
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
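A quick way to see the greedy problem and the negated-class workaround side by side (a sketch on a single sample line; the variable names are illustrative):

```shell
line='abc12345xyz'

# Greedy: the leading .* swallows the whole line, so the group is empty.
greedy=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9]*\).*/\1/')

# Negated class: [^0-9]* stops at the first digit, so the group gets the number.
bounded=$(printf '%s\n' "$line" | sed -e 's/[^0-9]*\([0-9]*\).*/\1/')

printf 'greedy=[%s] bounded=[%s]\n' "$greedy" "$bounded"
# greedy=[] bounded=[12345]
```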
You can use awk with match() to access the captured group (this needs GNU awk: the three-argument form of match() is a gawk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
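If gawk is not available, a rough equivalent can be built from the two-argument match() plus RSTART and RLENGTH, which other awks provide as well. The offsets 3 and 6 below are simply the lengths of abc and of abc plus xyz, so this sketch only fits this particular pattern:

```shell
# match() sets RSTART/RLENGTH for the whole "abc<digits>xyz" hit;
# substr() then trims the 3-character prefix and suffix around the digits.
result=$(printf '%s\n' a abc12345xyz b |
    awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }')

echo "$result"   # 12345
```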
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after abc([0-9]+)xyz to get rid of text before and after the number in the substitution. Keep abc and xyz inside the regex: with a bare .*([0-9]+).* the greedy leading .* would leave only the last digit for the group.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
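A small run of the FS idea on sample lines, as a sketch (plain awk is assumed to be one of gawk or mawk here, matching the answer's own list):

```shell
# FS eats "everything up to abc" and "xyz to end of line", so on a
# matching line $1 and $3 are empty and $2 is what sits in between.
result=$(printf '%s\n' a abc12345xyz b |
    awk 'BEGIN { FS = "(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) { print $2 }')

echo "$result"   # 12345
```

Lines without either delimiter have a single field, so $2 is empty and nothing is printed for them.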
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file