Only get alphanumeric characters in capture group using sed - regex

Input:
x.y={aaa b .c}
Note that the content within {} is only an example; in reality it could be any value.
Problem: I would like to keep only the alphanumeric characters within the {}.
So it would become:
x.y={aaabc}
Trial 0
$ echo 'x.y={aaa b .c}' | sed 's/[^[:alnum:]]\+//g'
xyaaabc
This is great, but I'd like to only modify the part within {}. So I thought this may need capture groups, hence I went ahead and tried these:
Trial 1
$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{(.*)\}/x.y={\1}/'
x.y={aaa b .c}
Here I have captured the content I want to modify (aaa b .c) correctly, but I need a way to somehow do s/[^[:alnum:]]\+//g only on \1.
Instead, I tried capturing all alphanumeric characters only (to \1) like this:
Trial 2
$ echo 'x.y={aaa b .c}' | sed -E 's/x.y=\{([[:alnum:]]+)\}/x.y={\1}/'
x.y={aaa b .c}
Of course, it doesn't work because I'm only expecting alnum's and then immediately a } literal. I didn't tell it to ignore the non-alnum's. I.e, this part:
s/x.y=\{([[:alnum:]]+)\}/x.y={\1}/
^^^^^^^^^^^^^^^^^^
It literally matches: an open brace, some alnum's, and a closing brace -- which is not what I want. I'd like it to match everything, but only capture the alnum's.
Example of input/output:
x.y={aaa b .c} blah
blah
x.y={1 2 3 def} blah
blah
to
x.y={aaabc} blah
blah
x.y={123def} blah
blah
I searched the web before finally giving up and posting the question, but I didn't find anything helpful, as I didn't see anyone with a problem similar to mine. I would appreciate some help with this, as I'd love to have a better understanding of variables in regex/sed, thanks!

With your shown samples, please try following in awk. Written and tested in GNU awk.
awk '
match($0,/\{[^}]*}/){
val=substr($0,RSTART,RLENGTH)
gsub(/[^{}a-zA-Z0-9]/,"",val)
$0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/\{[^}]*}/){ ##using match function of awk to match from { to first occurrence of }
val=substr($0,RSTART,RLENGTH) ##Creating val which has sub string of matched regex in it.
gsub(/[^{}a-zA-Z0-9]/,"",val) ##Globally removing everything apart from { }, letters and digits in val.
$0=substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH) ##saving everything before match val and everything after match here.
}
1 ##Printing the current line, whether or not the `match` block above edited it.
' Input_file ##Mentioning Input_file name here.
Generic solution: In case you have multiple occurrences of { and } then try following awk code.
awk '
{
line=""
while(match($0,/\{[^}]*}/)){
val=substr($0,RSTART,RLENGTH)
gsub(/[^{}a-zA-Z0-9]/,"",val)
line=(line?line:"") (substr($0,1,RSTART-1) val)
$0=substr($0,RSTART+RLENGTH)
}
if(RSTART+RLENGTH!=length($0)){
$0=line $0
}
else{
$0=line
}
}
1
' Input_file
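The same loop can be written more compactly. What follows is only a sketch, not the answerer's code: it assumes braces are never nested, keeps ASCII letters and digits, and wraps the awk program in a hypothetical shell function named clean_braces.

```shell
# clean_braces: hypothetical helper name; reads files given as arguments, or stdin.
clean_braces() {
  awk '{
    out = ""
    # Consume the line one {...} group at a time.
    while (match($0, /\{[^}]*\}/)) {
      inner = substr($0, RSTART + 1, RLENGTH - 2)   # text between the braces
      gsub(/[^a-zA-Z0-9]/, "", inner)               # keep letters and digits only
      out = out substr($0, 1, RSTART - 1) "{" inner "}"
      $0 = substr($0, RSTART + RLENGTH)             # continue after this group
    }
    print out $0                                    # append whatever is left
  }' "$@"
}

printf '%s\n' 'x.y={1 2 3 def} blah {a.b}' | clean_braces
```

Because the remainder of the line is always appended after the loop, no extra check on RSTART and RLENGTH is needed.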

With sed (tested on GNU sed, syntax may vary for other implementations):
$ sed -E ':a s/(\{[[:alnum:]]*)[^[:alnum:]]+([^}]*})/\1\2/; ta' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
:a marks that location as label a (used to jump using ta as long as the substitution succeeds)
(\{[[:alnum:]]*) matches { followed by zero or more alnum characters
[^[:alnum:]]+ matches one or more non-alnum characters
([^}]*}) matches till the next } character
If perl is okay:
$ perl -pe 's/\{\K[^}]+(?=\})/$&=~s|[^a-z\d]+||gir/e' ip.txt
x.y={aaabc} blah
blah
x.y={123def} blah
blah
\{\K[^}]+(?=\}) match sequence of { to } (assuming } cannot occur in between)
\{\K and (?=\}) are used to avoid the braces from being part of the matched portion
e flag allows you to use Perl code in replacement portion, in this case another substitute command
$&=~s|[^a-z\d]+||gir here, $& refers to entire matched portion, gi flags are used for global/case-insensitive and r flag is used to return the value of this substitution instead of modifying $&
[^a-z\d]+ matches non-alphanumeric characters (assuming ASCII, you can also use [^[:alnum:]]+)
use \W+ if you want to preserve underscores as well
For both solutions, you can add x\.y= prefix if needed to narrow the scope of matching.
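As a sketch of the prefix idea for the sed version: the function name strip_in_braces is made up for illustration, and two details differ from the command above. The script is split into separate -e expressions, which some non-GNU sed implementations need for labels, and } is added to the negated bracket expression so that the non-alnum run cannot extend past the closing brace.

```shell
# strip_in_braces: hypothetical helper name; reads files given as arguments, or stdin.
strip_in_braces() {
  # Repeatedly delete one run of non-alphanumeric characters inside x.y={...}
  # until no further substitution succeeds.
  sed -E -e ':a' \
         -e 's/(x\.y=\{[[:alnum:]]*)[^[:alnum:]}]+([^}]*\})/\1\2/' \
         -e 'ta' "$@"
}

printf '%s\n' 'x.y={aaa b .c} blah' 'blah' 'x.y={1 2 3 def} blah' 'blah' | strip_in_braces
```

Lines without the x.y= prefix are passed through unchanged.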

Here is another gnu-awk solution using FPAT:
s='x.y={aaa b .c}'
awk -v OFS= -v FPAT='{[^}]+}|[^{}]+' '
{
for (i=1; i<=NF; ++i)
if ($i ~ /^{/) $i = "{" gensub(/[^[:alnum:]]+/, "", "g", $i) "}"
} 1' <<< "$s"
x.y={aaabc}

Related

regex for capturing a number with a range of digits in AWK

I'm trying to capture numbers inside a file using AWK. I can capture all of them, but I'm not able to capture only those with a certain number of digits. What am I doing wrong?
echo -e "$teste" | awk '/_OA/ { match($0,/\[\([:digit:]{4,13}\]/);oa = substr($0,RSTART,RLENGTH);print oa}'
File sample:
_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]
Expected:
6712227000168
355016
5512987000
Hint for the regex match answers:
Thanks so much for all the answers. I found a link saying that I need to use the --posix option because of my awk version.
With your shown samples, please try the following awk solution. It sets the field separator to ] or [ and, for lines that start with _OA, prints the second-to-last field.
awk -F"[][]" '/^_OA /{print $(NF-1)}' Input_file
You could update the pattern and the values for RSTART and RLENGTH to not match the leading and trailing square brackets.
The digits part should be [[:digit:]] and there is a \( in the pattern that matches ( that should not be there.
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}\]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' <<< "$teste"
Output
6712227000168
355016
5512987000
As there are multiple occurrences of digits between square brackets, if you want to match multiple occurrences:
teste='_OA Tasdsd, OA .. [91][355016][123456789][1][9999]'
awk '/_OA/ {
while(match($0,/\[[[:digit:]]{4,13}]/)){
start=RSTART+1; len=RLENGTH-2
s=substr($0,start,len)
res=res?res","s:s
$0=substr($0,start+len)
}
print res
res = ""
}' <<< "$teste"
Output
355016,123456789,9999
Your regexp \[\([:digit:]{4,13}\] says:
1. \[ = the literal character [
2. \( = the literal character (
3. [:digit:] = a bracket expression containing a character set of the characters :, d, i, g, t
4. {4,13} = a regexp interval that's 4 to 13 repetitions of the preceding bracket expression
5. \] = the literal character ]
The 2 main issues with that which are causing your regexp to be unable to match any of your input are:
You don't have any (s in your input (from #2 above), and
To match digits you need a character class [:digit:] inside a bracket expression [[:digit:]], not a character set :digit: inside a bracket expression [:digit:] (from #3 above)
You also don't actually need to escape the ] at the end of the regexp as it's only a regexp metachar (end of bracket expression) if preceded by a matching unescaped [ (start of bracket expression).
So the regexp I think you wanted to write instead would have been:
\[[[:digit:]]{4,13}]
e.g.:
$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART,RLENGTH);print oa}' file
[6712227000168]
[355016]
[5512987000]
or to only print the numbers:
$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' file
6712227000168
355016
5512987000
If you're not married to awk:
grep -Eo '[[:digit:]]{4,13}'
With GNU awk:
gawk 'match($0, /[[:digit:]]{4,13}/, m) {print m[0]}'
but that only matches the first such number in each record. To find them all:
gawk '{
line = $0
while (match(line, /[[:digit:]]{4,13}/, m)) {
print m[0]
line = substr(line, m[0,"start"] + m[0,"length"])
}
}'
Ref the match function in the gawk manual.
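A hedged variation on the grep idea: the plain digit pattern would also pick a 4 to 13 digit slice out of a longer number, so this sketch keeps the square brackets in the pattern and strips them afterwards with tr. The function name extract_bracketed is an assumption made for illustration.

```shell
# extract_bracketed: hypothetical helper name; reads files given as arguments, or stdin.
extract_bracketed() {
  # -o prints each match on its own line; tr then deletes the enclosing brackets.
  grep -Eo '\[[0-9]{4,13}]' "$@" | tr -d '[]'
}

printf '%s\n' \
  '_OA ............. [6712227000168]' \
  '_OA Tasdsd, OA .. [91][355016]' \
  '_OA Tasdsd, DA .. [91][5512987000]' | extract_bracketed
```

Unlike the single match() call, this prints every bracketed number of the right length on a line, one per output line.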
You can use
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}'
See the online demo:
#!/bin/bash
s='_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]'
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}' <<< "$s"
Output:
6712227000168
355016
5512987000
Details:
\[ - a [ char
[[:digit:]]{4,13} - four to thirteen digits (note that the [:digit:] POSIX character class must be used within [...], a bracket expression)
] - a ] char (it is not special, no need escaping)
And substr($0,RSTART+1,RLENGTH-2) means that we
$0 - take the current line
RSTART+1 - starting with the second char
RLENGTH-2 - and then as many characters as is the match length - 2 (thus getting rid of enclosing [ and ] chars)

How to match and cut the string with different conditions using sed?

I want to grep the string which comes after WORK= and ignore it if a parenthesis comes after that string.
The text looks like this :
//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
So, desirable output should print only :
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
So far , I could just match and cut before WORK= but could not remove WORK= itself:
sed -E 's/(.*)(WORK=.*)/\2/'
I am not sure how to continue . Can anyone help please ?
You can use
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' file > newfile
Details:
-n - suppresses the default line output
/WORK=.*([^()]*)/! - skips any line that contains WORK= followed by any text and then a (...) substring
s/.*WORK=\([^,]*\).*/\1/p - otherwise, removes everything up to and including WORK=, captures into Group 1 any zero or more chars other than a comma, and removes the rest of the line; p prints the result.
See the sed demo:
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' <<< "$s"
Output:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
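A slightly narrower variant, offered as a sketch: the command above skips a line whenever (...) appears anywhere after WORK=, including in a later field. The version below only rejects a line when the parenthesis is part of the WORK= value itself. The function name work_values is hypothetical.

```shell
# work_values: hypothetical helper name; reads files given as arguments, or stdin.
work_values() {
  # First expression drops lines whose WORK= value contains "(" before the next comma;
  # second expression prints just the value for the remaining lines.
  sed -n -e '/WORK=[^,]*(/d' -e 's/.*WORK=\([^,]*\).*/\1/p' "$@"
}

printf '%s\n' \
  '//INALL TYPE=GH,WORK=HU.ET.ET(IO)' \
  '//INA2 WORK=HU.TY.TY(OP),TYPE=KK' \
  '//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT' \
  '//OOP2 TYPE=KO,WORK=TEST1.TEST2' \
  '//H1 WORK=OP.TEE.GHU,TYPE=IU' | work_values
```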
Could you please try following awk, written and tested with shown samples in GNU awk.
awk '
match($0,/WORK=[^,]*/){
val=substr($0,RSTART+5,RLENGTH-5)
if(val!~/\([a-zA-Z]+\)/){ print val }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/WORK=[^,]*/){ ##Using match function to match WORK= till comma comes.
val=substr($0,RSTART+5,RLENGTH-5) ##Creating val with sub string of match regex here.
if(val!~/\([a-zA-Z]+\)/){ print val } ##checking that val does not contain ( letters ), then printing val here.
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -n '/.*WORK=\([^,]\+\).*/{s//\1/;/(.*)/!p}' file
Extract the string following WORK= and if that string does not contain (...) print it.
This will work if there is only zero or one occurrence of WORK= and that the exclusion depends only on the (...) occurring within that string and not other following fields.
For a global solution with the same stipulations for parens:
sed -n '/WORK=\([^,]\+\)/{s//\n\1\n/;s/[^\n]*\n//;/(.*).*\n/!P;D}' file
N.B. This prints each such string on a separate line and excludes empty strings.

Remove a specific part of string split by multiple-character delimiter in bash

I'm trying to process my string and steps as below:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
I want to split the string on the delimiter \n and remove the part (together with its delimiter) that matches myvar. The result should be
\\[[123 one (/)\n\\[[789 three (/)
But I have not found a solution yet.
If the delimiter is a single character like :, it can be done with a sed command:
mytext2='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar2=456
echo "$mytext2" | sed -E "s/[^:]*${myvar2}[^:]*(:|$)//g"
Result as expected: \\[[123 one (/):\\[[789 three (/)
How can be done if delimiter is a multiple-character in this case?
Thanks.
Even if you use sed -E, you still lack some support (?<=, ?!, ?=, ? etc). I suggest you use perl (Perl Compatible Regular Expressions).
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)';
myvar=456;
echo $mytext | perl -pe "s/(?<=\\\n).*${myvar}.*?(\\\n|$)//g";
Details:
(?<=\\\n): string starts after \n. \\ escape character \
.*${myvar}.*?(\\\n|$): get string which contains value of variable myvar and ends with \n or end of line.
Result.
\\[[123 one (/)\n\\[[789 three (/)
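If you can find a single character that never occurs in the data, for example :, you can first replace the delimiter with it. Note that tr '\n' ':' translates real newline characters; if the string contains the two literal characters \ and n instead, replace them with sed 's/\\n/:/g' first.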
echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
The full script looks like this
mytext='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar=456
v=$(echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g")
echo $v
If the multiple character delimiter is something like abcd, you can try to use sed to replace it first instead of using tr.
I suggest awk in this case, since you can specify the literal multi-character delimiter pattern as the input/output field separator, iterate over the fields, and keep only those that do not match your value:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
awk -v myvar=$myvar 'BEGIN{FS=OFS="\\\\n"} {s="";
for (i=1; i<=NF; i++) {
if ($i !~ myvar) {s = s (i==1 ? "" : OFS) $i;}
}
} END{print s}' <<< "$mytext"
# => \\[[123 one (/)\\n\\[[789 three (/)
NOTES:
BEGIN{FS=OFS="\\\\n"} - sets the input/output field separator to \n
-v myvar=$myvar passes the myvar to awk
s="" - assigns s to an empty string
for (i=1; i<=NF; i++) {...} - iterates over all fields
if ($i !~ myvar) {...} - if the current field value does not match myvar...
s = s (i==1 ? "" : OFS) $i;} - appends to s either the current field value alone (if it is the first field) or the output separator followed by the current field value (if it is not the first)
END{print s} - prints s after the field checks.
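A variant of the same idea, as a sketch under a few assumptions: the function name drop_segment is made up, the value is compared as a fixed substring with index() rather than as a regex, the joiner is written out as a literal \n so the output uses the same delimiter as the input, and a counter (instead of i==1) decides when a joiner is needed so that dropping the first segment leaves no leading delimiter.

```shell
# drop_segment TEXT VALUE: hypothetical helper; TEXT uses a literal backslash-n as delimiter.
drop_segment() {
  printf '%s\n' "$1" | awk -v pat="$2" '
    BEGIN { FS = "\\\\n" }              # regex for a literal backslash followed by n
    {
      out = ""; kept = 0
      for (i = 1; i <= NF; i++)
        if (index($i, pat) == 0)        # keep segments that do not contain pat
          out = out (kept++ ? "\\n" : "") $i
      print out
    }'
}

mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
drop_segment "$mytext" 456
```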

Regular expression for not more than one occurrence of consecutive characters

I'm looking for regular expression that will match only if 2 consecutive characters occur in string once.
for example:
1123456 - match
1122345 - not match
1121125 - not match
1234567 - not match
1112345 - not match
currently have this regex: ([0-9])\1{1,} but it matches 1122345 as well which is not what i need
This awk does it, if you have minimal awk (mawk) or GNU awk (gawk):
awk -F "" '
{
d=0
for(i=1;i<NF;i++){
if ($i==$(i+1)) d++
}
if (d==1) print
}' file
Setting the field separator to the empty string ("") lets you read each line character by character. If character i equals character i+1, d is incremented. If d==1 at the end of the line, the line is printed.
From your sample:
$ cat file
1123456
1122345
1121125
1234567
1112345
It outputs:
1123456
Important remark:
GNU awk manual says the use of empty string as field separator is a "dark corner", meaning that it is not standard and some implementations may handle it differently. If you want to be sure that it will work with any awk, go for
awk '
{
d=0
n=split($0,ch,"")
for(i=1;i<n;i++){
if (ch[i]==ch[i+1]) d++
}
if (d==1) print
}' file
It passed the gawk --posix test and yields the same result.
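If neither an empty field separator nor an empty split() separator is available, the same count can be done with substr(). This is only a sketch of that alternative, wrapped in a hypothetical function named one_pair.

```shell
# one_pair: hypothetical helper name; prints lines with exactly one adjacent equal pair.
one_pair() {
  awk '{
    d = 0
    n = length($0)
    # Compare each character with its right-hand neighbour.
    for (i = 1; i < n; i++)
      if (substr($0, i, 1) == substr($0, i + 1, 1)) d++
    if (d == 1) print
  }' "$@"
}

printf '%s\n' 1123456 1122345 1121125 1234567 1112345 | one_pair
```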

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
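To tie this back to the myvalue=$( ... ) form from the question, here is a minimal sketch. It uses -E, which GNU sed and BSD/macOS sed both accept for extended regexes, in place of -r; the function name extract_value is an assumption.

```shell
# extract_value: hypothetical helper name; reads files given as arguments, or stdin.
extract_value() {
  # -n suppresses default output; p prints only lines where the substitution happened.
  sed -nE 's/.*abc([0-9]+)xyz.*/\1/p' "$@"
}

myvalue=$(printf '%s\n' a b c abc12345xyz a b c | extract_value)
echo "$myvalue"
```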
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use awk with match() to access the captured group (the three-argument form of match() is a GNU awk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
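For awk implementations without the array argument, a sketch of the same extraction using only RSTART and RLENGTH. It relies on the fixed lengths of the abc prefix and xyz suffix (3 characters each), and the function name between_abc_xyz is hypothetical.

```shell
# between_abc_xyz: hypothetical helper name; reads files given as arguments, or stdin.
between_abc_xyz() {
  # Skip the 3-character prefix and drop the 3-character suffix from the match.
  awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }' "$@"
}

printf '%s\n' a b c abc12345xyz a b c | between_abc_xyz
```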
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl has the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use gawk with components of a regex is the gensub function.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire text matched by the regex (between the //), so you need the .* before abc and after xyz to get rid of the text around the number. Keeping abc and xyz in the pattern also stops the greedy leading .* from swallowing all but the last digit.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file