Replace newline in quoted strings in huge files - regex

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table in Oracle, but the newlines cause errors, so I need to replace them with a space.
I already run some other Perl commands on these files to fix other errors, so I would like a solution in a one-line Perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing and I can't solve my problem with the solutions mentioned there.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....

Sounds like you want a CSV parser like Text::CSV_XS (install it through your OS's package manager or your favorite CPAN client):
$ perl -MText::CSV_XS -e '
    my $csv = Text::CSV_XS->new({ sep => "|", binary => 1 });
    while (my $row = $csv->getline(*ARGV)) {
        $csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
    }' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the usual comma, replaces newlines with spaces in every field, and prints the transformed record.

In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1{print;next} /^[0-9]{4,}\|/{print "\n" $0;next} {print " " $0}' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, so \n is only added before lines starting with four or more digits followed by a | char (matched with the ^[0-9]{4,}\| POSIX ERE pattern); any other line is a continuation and is appended after a single space, which replaces the newline.
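Run against the sample above (saved as test.txt, dropping the redirection), it reproduces the desired output:
$ awk 'NR==1{print;next} /^[0-9]{4,}\|/{print "\n" $0;next} {print " " $0}' ORS="" test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....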
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and when the second line doesn't start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but still fits into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' test.txt
With -0777, you slurp the file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) that are not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot be defeated by backtracking into the line-break matches.
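If the file is too big to slurp, a line-by-line Perl sketch in the same spirit is possible: buffer lines until the quotes on the buffered record balance out. This assumes quotes are never backslash-escaped inside fields (doubled "" quotes are fine, since they keep the count even); fixed.txt is an arbitrary output name:
perl -ne 'chomp;
    $b = length($b) ? "$b $_" : $_;       # join continuation lines with a space
    if ((() = $b =~ /"/g) % 2 == 0) {     # even quote count: the record is complete
        print "$b\n"; $b = "";
    }
    END { print "$b\n" if length $b }' test.txt > fixed.txt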

With your shown samples, this can be done simply with an awk program. Written and tested in GNU awk; it should work in any awk. It should be fast even on huge files, since it avoids slurping the whole file into memory (worth noting, since the OP will use it on huge files).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val" "$0;val=""};next} 1' Input_file
Explanation: a detailed explanation of the above.
awk '                            ##Start the awk program.
gsub(/"/,"&")%2!=0{              ##Check whether the number of " chars on this line is odd; if it is, a quoted string is left open on this line.
  if(val==""){ val=$0 }          ##If val is empty, set val to the current line.
  else{ print val" "$0; val="" } ##Else (val is not empty) print val, a space and the current line, then empty val.
  next                           ##Skip the remaining statements.
}
1                                ##If the number of " chars on a line is even, the condition above (the gsub one) is skipped and the line is printed as-is.
' Input_file                     ##Mention the Input_file name here.
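For instance, with the question's sample saved as Input_file, the output is:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....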

Related

Perl regex negative lookbehind to detect a file lacking a final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
>
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that the -n0 flag causes perl to "slurp" the entire file as a single record. Thus, there is only one $, the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.
/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. Use \z for that.
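A quick demonstration of the difference (the input here ends with a newline):
$ printf 'aaa\nbbb\n' | perl -n0 -e 'print "\$ matched\n" if /(?<!\n)$/; print "\\z matched\n" if /(?<!\n)\z/'
$ matched
$ also matches just before the final newline, where the preceding character is b, so the negative lookbehind succeeds there; \z matches only at the absolute end, where the preceding character is \n.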
Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
          my $bw = File::ReadBackwards->new(shift @ARGV);
          print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will be more likely to understand it than I would the regular-expression approach.
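If installing a module is not an option, a plain seek/read sketch can inspect just the last byte instead (this assumes a non-empty, seekable file; whence 2 means "from the end"):
perl -le 'open my $fh, "<", shift or die $!;
          seek $fh, -1, 2 or die $!;   # position one byte before end-of-file
          read $fh, my $c, 1;
          print "pathological last line" if $c ne "\n"' file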
The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
    s/^\s*$/\n/mg;      # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
    s/[\n][\n]+/\n\n/g; # collapse newline runs: paragraphs end up separated by one blank line (two LFs), like cat -s.
    s/^[\n]+//;         # first line is visible or "nontrivial."
    s/[\n]+\z/\n/;      # last line is visible or "nontrivial."
    print STDOUT;
    print "\n" unless m/\n\z/; # IF we detect a pathological last line, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
aaa
bb
bash:
bash:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' | zz.pl
aaa
bb
bash:

Using awk or sed to merge / print lines matching a pattern (oneliner?)

I have a file that contains the following text:
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
I am hoping to achieve the following result using awk or sed (as a oneliner would be a bonus! [Perl oneliner would be acceptable as well]):
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
I'd like to have each line that matches 'subject' and 'object' combined in the order that each one is listed, separated with a comma. May I see an example of this done with awk, sed, or perl? (Preferably as a oneliner if possible?)
I have tried some uses of awk to perform this; I am still learning, I should add:
awk '{if ($0 ~ /subject/) pat1=$1; if ($0 ~ /object/) pat2=$2} {print $0,pat2}'
But it does not do what I thought it would! So I know I have the syntax wrong. Seeing a worked example would greatly help me learn.
Not perl or awk, but easier.
$ pr -2ts, file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
Explanation
-2  two columns
t   omit the page header (filename, date, page number, etc.)
s,  use a comma as the column separator
I'd do it something like this in perl:
#!/usr/bin/perl
use strict;
use warnings;
my @subjects;
while ( <DATA> ) {
    m/^subject:(\w+)/ and push @subjects, $1;
    m/^object:(\w+)/ and print "subject:", shift @subjects, ",object:", $1, "\n";
}
__DATA__
subject:asdfghj
subject:qwertym
subject:bigger1
subject:sage911
subject:mothers
object:cfvvmkme
object:rjo4j2f2
object:e4r234dd
object:uft5ed8f
object:rf33dfd1
Reduced down to one liner, this would be:
perl -ne '/^(subject:\w+)/ and push @s, $1; /^object/ and print shift(@s), ",", $_' file
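Run against the question's input, that prints:
$ perl -ne '/^(subject:\w+)/ and push @s, $1; /^object/ and print shift(@s), ",", $_' file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1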
grep, paste and process substitution
$ paste -d , <(grep 'subject' infile) <(grep 'object' infile)
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
This treats the output of grep 'subject' infile and grep 'object' infile like files due to process substitution (<( )), then pastes the results together with paste, using a comma as the delimiter (indicated by -d ,).
sed
The idea is to read and store all subject lines in the hold space, then for each object line fetch the hold space, get the proper subject and put the remaining subject lines back into hold space.
First the unreadable oneliner:
$ sed -rn '/^subject/H;/^object/{G;s/\n+/,/;s/^(.*),([^\n]*)(\n|$)/\2,\1\n/;P;s/^[^\n]*\n//;h}' infile
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1
-r is for extended regex (no escaping of parentheses, + and |) and -n does not print by default.
Expanded, more readable and explained:
/^subject/H # Append subject lines to hold space
/^object/ { # For each object line
G # Append hold space to pattern space
s/\n+/,/ # Replace first group of newlines with a comma
# Swap object (before comma) and subject (after comma)
s/^(.*),([^\n]*)(\n|$)/\2,\1\n/
P # Print up to first newline
s/^[^\n]*\n// # Remove first line (can't use D because there is another command)
h # Copy pattern space to hold space
}
Remarks:
When the hold space is fetched for the first time, it starts with a newline (H adds one), so the newline-to-comma substitution replaces one or more newlines, hence the \n+: two newlines for the first time, one for the rest.
To anchor the end of the subject part in the swap, we use (\n|$): either a newline or the end of the pattern space – this is to get the swap also on the last line, where we don't have a newline at the end of the pattern space.
This works with GNU sed. For BSD sed, as found in macOS, some changes are required:
The -r option has to be replaced by -E.
There has to be an extra semicolon before the closing brace: h;}
To insert a newline in the replacement string (swap command), we have to replace \n by either '$'\n'' or '"$(printf '\n')"'.
Since you specifically asked for a "oneliner" I assume brevity is far more important to you than clarity so:
$ awk -F: -v OFS=, 'NR>1&&$1!=p{f=1}{p=$1}f{print a[++c],$0;next}{a[NR]=$0}' file
subject:asdfghj,object:cfvvmkme
subject:qwertym,object:rjo4j2f2
subject:bigger1,object:e4r234dd
subject:sage911,object:uft5ed8f
subject:mothers,object:rf33dfd1

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. It is really a pattern (the letter 'b' followed by some number of digits). I could do it the hard way by looping through the file and extracting the desired string (b687 in the example above), but is there an easier way, perhaps using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word consisting of b followed by 3 digits.
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
What this does: it reads the file and, when a line matches the pattern HEROKU_POSTGRESQL_AQUA_URL, prints the 4th token on that line, in this case b687.
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned, b####..., and printing only that pattern. The -n tells sed not to print lines that do not contain it. The rest of the sed command is a substitution: .* matches any string at the beginning, followed by \(...\), which forms a group containing the regex that matches your b####. The replacement prints only group 1, and the p at the end tells sed to print the result (since by default we told sed not to print with the -n).
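For comparison, a Perl take on the same extraction (a sketch, assuming the b-number is the only lowercase-b-plus-digits word in temp.txt):
varname=$(perl -ne 'print $1 if /\b(b\d+)\b/' temp.txt)
echo "$varname"    # b687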

Substitute words not in double quotes

$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a unix sed command such that only a basic that is not inside quotes is changed (change basic to ring).
Expected output:
$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters, and repeat it an even number of times (with {2}*). Then we match basic. Because everything before basic is matched as well, it would also be replaced; that's why this part is wrapped in another pair of parentheses, capturing it and writing it back in the replacement with \1, followed by ring.
One caveat: if there are multiple occurrences of basic on one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A lookbehind would be a solution, but it would have to be variable-length, which only the .NET regex engine supports. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
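If escaping is still ruled out but every occurrence must be replaced in one pass, a Perl sketch can test each match with a lookahead instead (assuming quotes are balanced on each line, a basic is outside quotes exactly when an even number of " characters follows it):
perl -pe 's/basic(?=(?:[^"]*"[^"]*")*[^"]*$)/ring/g' file0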
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes.
Assuming that there are no escaped quotes in strings, i.e. "This is a trap \" hehe", awk is able to solve this problem:
awk -F\" 'BEGIN {OFS=FS}
{
for(i=1; i<=NF; i++){
if(i%2)
gsub(/basic/,"ring",$i)
}
print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
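For reference, here is that one-liner form, with its output on the question's file0:
$ awk -F\" 'BEGIN{OFS=FS} {for(i=1;i<=NF;i++) if(i%2) gsub(/basic/,"ring",$i); print}' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring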
If basic is at the beginning of the line:
sed -e 's/^basic/ring/' file0

How to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do from within my bash script is:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I used * instead, and I added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n  suppress automatic printing of each line
-r  extended regex, so you don't have to escape the capture-group parens ()
\1  the capture group match
/g  global match
/p  print the result of the substitution
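For example, against the question's example.txt:
$ sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp' example.txt
12345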
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl; the -n option instructs Perl to read one line at a time from STDIN and execute the code, and the -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match as much of the line as it can. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
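The greediness is easy to see in any backtracking engine; Perl behaves the same way here as the sed commands above, so the capture gets only the last digit:
$ echo 'abc12345xyz' | perl -pe 's/.*(\d+).*/$1/'
5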
You can use gawk (GNU awk) with the three-argument form of match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc([0-9]+)xyz. If it succeeds, it stores the captured slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, at which the substring begins (1 if it starts at the beginning of the string), a successful match is truthy and triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks for the pattern [0-9]+ when it occurs between abc and xyz, and prints just the digits.
Perl has the cleanest syntax, but if you don't have perl (not always there, I understand), then a way to use components of a regex in gawk is the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire match of the regex (between the //), so you need the .* before abc and after xyz to get rid of the text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always have a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from the pattern, using grep -o requires 2 passes. But I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, contains nothing but digits, that's your answer to print out.
If you're extra cautious, confirm that $1 and $3 are both zero-length.
** Edited after realizing that a zero-length $2 would trip up my previous solution.
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces with a complex regex that matches multiple times per line, or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even-numbered columns:
# those will be the parts of "just the match"
for (i = 2; i <= NF; i += 2) print $i
}'
If you also run OFS = ""; $1 = $1; afterwards, then instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the entire $0 is laid out in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read the line in: a straight-up print will be byte-for-byte identical to printing immediately upon reading.
Once I tested this to the extreme, using a regex that matches valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167 MB text file with plenty of CJK Unicode all over, read in at once into $0 and cranked through this split logic, resulting in an NF of around 175,000,000, each field being a single character of either ASCII or multi-byte UTF-8 Unicode.
You can do it with the shell:
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0;
    next;
}
{
    # default: do nothing
}
Or, as a one-liner:
gawk '/.*abc([0-9]+)xyz.*/' file
Note that both of these print the whole matching line, not just the captured digits; plain awk regex patterns have no capture groups.