Extract Filename before date Bash shellscript - regex

I am trying to extract part of a filename: everything before the date and suffix. I am not sure of the best way to do it in a Bash script. Regex?
The names are part of the filename. I am trying to store the result in a shell script variable. The prefixes will not contain strange characters. The suffix will always be the same. The files are stored in a directory, and I will use a loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder

$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$

No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo "${file%%_????-??-??.out}"
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
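As a quick reference for those four operators, here is a minimal sketch. The function name strip_date_suffix is made up for illustration, and the glob assumes the fixed _YYYY-MM-DD.out suffix from the question.

```shell
# Hypothetical helper: print the name without its _YYYY-MM-DD.out suffix.
# Names that do not end in such a suffix are printed unchanged.
strip_date_suffix() {
    printf '%s\n' "${1%_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].out}"
}

file=EXAMPLE_FILE_2_2017-10-12.out
echo "${file%_*}"     # % : shortest match removed from the end   -> EXAMPLE_FILE_2
echo "${file%%_*}"    # %%: longest match removed from the end    -> EXAMPLE
echo "${file#*_}"     # # : shortest match removed from the start -> FILE_2_2017-10-12.out
echo "${file##*_}"    # ##: longest match removed from the start  -> 2017-10-12.out
strip_date_suffix "$file"                                        # -> EXAMPLE_FILE_2
```

Using [0-9] instead of ? makes the pattern a little stricter: only digits are accepted in the date positions.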

Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
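If you want the date check without a regex, a glob in a case statement is one possible middle ground. This is only a sketch: it swaps the =~ match for shell pattern matching, the function name date_prefix is hypothetical, and it assumes the suffix is always .out as in the question.

```shell
# Hypothetical helper: print the part before _YYYY-MM-DD.out, or return 1
# if the name does not end in a date-like suffix.
date_prefix() {
    case $1 in
        ?*_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].out)
            printf '%s\n' "${1%_????-??-??.out}" ;;
        *)
            return 1 ;;
    esac
}

for fn in *.out; do
    # Store the extracted prefix in a variable, as the question asks.
    if folder=$(date_prefix "$fn"); then
        printf '%s => %s\n' "$fn" "$folder"
    else
        printf '%s: no date suffix\n' "$fn" >&2
    fi
done
```

The return status does the job of comparing $cap against $fn: a name without a date is reported instead of being silently trimmed at its last underscore.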

Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character

Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done

awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2

You could also try this awk solution, which handles all the .out files. Note that it was written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of awk you may not need it.
Explanation, with the solution in non-one-liner form:
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile, as its name suggests, skips the remaining lines of the current file and jumps to the next file, saving some CPU cycles.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Related

rename multiple files splitting filenames by '_' and retaining first and last fields

Say I have the following files:
a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt
I want to rename them in such a way that I split their filenames by _ and I retain the first and last field, so I end up with:
a_b.txt a_c.txt a_e.txt a_i.txt
Thought it would be easy, but I'm a bit stuck...
I tried rename with the following regexp:
rename 's/^([^_]*).*([^_]*[.]txt)/$1_$2/' *.txt
But what I would really need to do is to actually split the filename, so I thought of awk, but I'm not so proficient with it... This is what I have so far (I know at some point I should specify FS="_" and grab the first and last field somehow...
find . -name "*.txt" | awk -v mvcmd='mv "%s" "%s"\n' '{old=$0; <<split by _ here somehow and retain first and last fields>>; printf mvcmd,old,$0}'
Any help? I don't have a preferred method, but it would be nice to use this to learn awk. Thanks!
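Your rename attempt was close; the problem is that the greedy .* swallows everything up to the extension, so the final group captures only ".txt". You just need to pin the final group to the text after the last underscore.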
rename 's/^([^_]*).*_([^_]*[.]txt)$/$1_$2/' *_*_*.txt
I added a _ before the last opening parenthesis (this is the crucial fix), and a $ anchor at the end, and also extended the wildcard so that you don't process any files which don't contain at least two underscores.
The equivalent in Awk might look something like
find . -name "*_*_*.txt" |
awk -F _ '{ system("mv " $0 " " $1 "_" $(NF)) }'
This is somewhat brittle because of the system call; you might need to rethink your approach if your file names could contain whitespace or other shell metacharacters. You could add quoting to partially fix that, but then the command will fail if the file name contains literal quotes. You could fix that, too, but then this will be a little too complex for my taste.
Here's a less brittle approach which should cope with completely arbitrary file names, even ones with newlines in them:
find . -name "*_*_*.txt" -exec sh -c 'for f; do
mv "$f" "${f%%_*}_${f##*_}"
done' _ {} +
find will supply a leading path before each file name, so we don't need mv -- here (there will never be a file name which starts with a dash).
The parameter expansion ${f##pattern} produces the value of the variable f with the longest available match on pattern trimmed off from the beginning; ${f%%pattern} does the same, but trims from the end of the string.
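One caveat worth sketching: because find passes a path, ${f%%_*} also cuts at an underscore in a directory name. A possible way around that is to split off the directory first. The helper name first_last_path is hypothetical, and this only prints the new name rather than renaming anything.

```shell
# Hypothetical helper: keep the first and last _-separated fields of the
# file name only, leaving any directory part untouched.
# The body runs in a subshell so dir and base do not leak to the caller.
first_last_path() (
    case $1 in
        */*) dir=${1%/*}/ base=${1##*/} ;;
        *)   dir= base=$1 ;;
    esac
    printf '%s\n' "$dir${base%%_*}_${base##*_}"
)

first_last_path ./my_dir/a_b_c_d_e.txt   # prints ./my_dir/a_e.txt
```

Inside the find ... -exec sh -c loop, the same two-step split could be applied to "$f" before calling mv.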
With your shown samples, please try the following pure Bash code, which relies on Bash's parameter expansion. The glob matches only .txt files with at least two underscores in their name, so files like a_b.txt are not picked up, as per the requirement.
for file in *_*_*.txt
do
firstPart="${file%%_*}"
secondPart="${file##*_}"
newName="${firstPart}_${secondPart}"
mv -- "$file" "$newName"
done
This answer works for your example, but @tripleee's "find" approach is more robust.
for f in a_*.txt; do mv "$f" "${f%%_*}_${f##*_}"; done
Details: https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html / https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Here's an alternate regexp for the given samples:
$ rename -n 's/_.*_/_/' *.txt
rename(a_b_c_d_e_f_g_h_i.txt, a_i.txt)
rename(a_b_c_d_e.txt, a_e.txt)
rename(a_b_c.txt, a_c.txt)
A different rename regex
rename 's/(\S_)[a-z_]*(\S\.txt)/$1$2/'
Using the same regex with sed or using awk within a loop.
for a in a_*; do
name=$(echo "$a" | awk -F_ -v OFS=_ '{print $1, $NF}'); #Or
#name=$(echo "$a" | sed -E 's/(\S_)[a-z_]*(\S\.txt)/\1\2/g');
mv "$a" "$name";
done

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
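The awk one-liner above takes the first two characters as the prefix. If the prefix can be longer, one possible alternative is plain shell parameter expansion. This is a sketch only: the function name add_prefix is made up, and it assumes the prefix is everything before the first digit.

```shell
# Hypothetical helper: repeat the non-digit prefix of the first item in
# front of every item of a comma-separated list.
# The body runs in a subshell so IFS and set -f do not affect the caller.
add_prefix() (
    set -f                      # no globbing when $rest is split below
    prefix=${1%%[0-9]*}         # everything before the first digit
    rest=${1#"$prefix"}         # the list without the leading prefix
    out=
    IFS=,
    for n in $rest; do
        out=${out:+$out,}$prefix$n
    done
    printf '%s\n' "$out"
)

add_prefix 'A-1,2,3,4,5'    # prints A-1,A-2,A-3,A-4,A-5
add_prefix 'B-1,20,300'     # prints B-1,B-20,B-300
```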
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.
Assuming this is for shell scripting (csh/tcsh syntax shown), you can do so with two sed calls:
set string = "A-1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A-1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
Could you please try the following (if you are OK with awk). Note that the prefix "A-" is hardcoded here.
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i !~ /^A/&&$i !~ /\"A/){
$i="A-"$i
}
}
}
1' Input_file
If your data is in a file named d (tried with GNU sed):
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

Swap Strings within a line in Bash

I'm parsing a document with a bash script and outputting different parts of it. At one point I need to find and reformat text of the form:
(foo)[X]
[Y]
(bar)[Z]
to something like:
X->foo
Y
Z->bar
Now, I'm able to grep the parts I want with RegEx, but I'm having trouble swapping the two elements in one line and handling the fact that the text in parentheses is optional. Is this even possible with a combination of sed and grep?
Thank You for your time.
You can use sed:
sed -e 's/(\([^)]*\))\[\([^]]*\)]/\2->\1/' -e 's/\[\([^]]*\)]/\1/' file
This works for your given input example:
X->foo
Y
Z->bar
You might need to make the patterns more strict if you have more kinds of input to handle.
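As one possible way to make the patterns stricter, the two expressions can be anchored so that only lines consisting of exactly "(text)[key]" or "[key]" are rewritten. This is a sketch under that assumption; the function name swap_fields is hypothetical.

```shell
# Hypothetical helper: reformat stdin, touching only lines that consist of
# exactly "(text)[key]" or "[key]"; every other line passes through as is.
swap_fields() {
    sed -e 's/^(\([^)]*\))\[\([^]]*\)]$/\2->\1/' \
        -e 's/^\[\([^]]*\)]$/\1/'
}

printf '%s\n' '(foo)[X]' '[Y]' '(bar)[Z]' 'see (a)[b] here' | swap_fields
```

With that input, the first three lines become X->foo, Y and Z->bar, while the last line is left untouched because the brackets are not the whole line.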
You can use awk:
awk -F '[][()]+' '{print (NF>3 ? $3 "->" $2 : $2)}' file
X->foo
Y
Z->bar
You can even do it in bash itself, although it's not pretty.
# Three capture groups:
# 1. The optional parenthesized text
# 2. The contents of the parentheses
# 3. The contents of the square brackets
regex="(\((.*)\))?\[(.*)\]"
while IFS= read -r str; do
[[ "$str" =~ $regex ]]
# If the 2nd array element is not empty, print -> followed by the
# non-empty value.
echo "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
done < file.txt

Remove newlines (\n) but exclude lines with specific regex?

After a lot of searching, I've come across a few ways to remove newlines using sed or tr
sed ':a;N;$!ba;s/\n//g'
tr -d '\n'
However, I can't find a way to exclude the action from specific lines. I've learned that one can use the "!" in sed as a means to exclude an address from a subsequent action, but I can't figure out how to incorporate it into the sed command above. Here's an example of what I'm trying to resolve.
I have a file formatted as such:
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
I want the file formatted in this fashion:
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
I've been focusing on trying to exclude lines containing the ">" character, as this is the only constant regex that would exist on lines that have the ">" character (note: the sequence_ID_n is unique to each entry preceded by the ">" and, thus, cannot be relied upon for regex matching).
I've attempted this:
sed ':a;N;$!ba;/^>/!s/\n//g' file.txt > file2.txt
It runs without generating an error, but the output file is the same as the original.
Maybe I can't do this with sed? Maybe I'm approaching this problem incorrectly? Should I be trying to define a range of lines to operate on (i.e. only lines between lines beginning with ">")?
I'm brand new to basic text manipulation, so any suggestions are greatly, greatly appreciated!
This awk should work:
$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
This might work for you (GNU sed):
sed ':a;N;/^>/M!s/\n//;ta;P;D' file
Remove newlines from lines that don't begin with a >.
Using GNU sed:
sed -r ':a;/^[^>]/{$!N;s/\n([^>])/\1/;ta}' inputfile
For your input, it'd produce:
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
As @1_CR already said, @jaypal's solution is a good way to do it. But I really could not resist trying it in pure Bash. See the comments for details:
The input data:
$ cat input.txt
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
>sequence_ID_20
gattaca
The script:
$ cat script
#!/usr/bin/env bash
# Bash 4 - read the data line by line into an array
readarray -t data < "$1"
# Bash 3 - read the data line by line into an array
#while read line; do
# data+=("$line")
#done < "$1"
# A search pattern
pattern="^>sequence_ID_[0-9]"
# An array to insert the revised data
merged=()
# A counter
counter=0
# Iterate over each item in our data array
for item in "${data[@]}"; do
# If an item matches the pattern
if [[ "$item" =~ $pattern ]]; then
# Add the item straight into our new array
merged+=("$item")
# Raise the counter in order to write the next
# possible non-matching item to a new index
(( counter++ ))
# Continue the loop from the beginning - skip the
# rest of the code inside the loop for now since it
# is not relevant after we have found a match.
continue
fi
# If we have a match in our merged array then
# raise the counter one more time in order to
# get a new index position
[[ "${merged[$counter]}" =~ $pattern ]] && (( counter++ ))
# Add a non matching value to the already existing index
# currently having the highest index value based on the counter
merged[$counter]+="$item"
done
# Test: Echo each item of our merged array
printf "%s\n" "${merged[@]}"
The result:
$ ./script input.txt
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
>sequence_ID_20
gattaca
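For comparison, here is a shorter shell sketch that avoids arrays and treats any line starting with ">" as a header, rather than matching the sequence_ID_ text. The function name join_fasta is made up for illustration; it reads stdin and writes stdout.

```shell
# Hypothetical helper: join the sequence lines that follow each ">" header.
join_fasta() {
    pending=0                   # 1 while a joined sequence line is still open
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                # Close the previous sequence before printing a new header.
                [ "$pending" -eq 0 ] || printf '\n'
                printf '%s\n' "$line"
                pending=0 ;;
            *)
                printf '%s' "$line"
                pending=1 ;;
        esac
    done
    # Terminate the last sequence line, if there was one.
    [ "$pending" -eq 0 ] || printf '\n'
}

join_fasta < input.txt > output.txt
```

A shell read loop is slow on large files, so the awk answers are likely the better choice for real sequence data.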
Jaypal's solution is the way to go, here's a GNU awk variant
awk -v RS='>sequence[^\\n]+\\n' \
'{gsub("\n", "");printf "%s%s%s", $0, NR==1?"":"\n", RT}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
Here is one way to do it with awk
awk '{printf (/^>/&&NR>1?RS:"")"%s"(/^>/?RS:""),$0}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
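To tie this back to the myvalue=$( ... ) form from the question, here is a minimal sketch that wraps the portable expression above in a function. The name extract_number and the temporary input file are only for illustration.

```shell
# Hypothetical helper: print the digits found between "abc" and "xyz" on
# each matching line of the given file (basic regex only, no + needed).
extract_number() {
    sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' "$1"
}

# Build a throwaway copy of the question's sample input.
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

myvalue=$(extract_number "$input")
echo "$myvalue"    # prints 12345
rm -f "$input"
```

If several lines match, myvalue holds one number per line.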
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppresses the automatic printing of every line
-r enables extended regular expressions, so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use awk with match() to access the captured group (the three-argument form is a GNU awk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
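Where GNU awk is not available, a similar result is possible with the two-argument match(), which sets RSTART and RLENGTH. This is a sketch that relies on the literal markers abc and xyz each being three characters long; the function name extract_with_awk is hypothetical.

```shell
# Hypothetical helper: same extraction with two-argument match(), which
# sets RSTART (start of match) and RLENGTH (length of match).
# Reads the named files, or stdin when called without arguments.
extract_with_awk() {
    awk 'match($0, /abc[0-9]+xyz/) {
        # skip the 3-character "abc" and drop the 3-character "xyz"
        print substr($0, RSTART + 3, RLENGTH - 6)
    }' "$@"
}

printf '%s\n' a abc12345xyz b | extract_with_awk   # prints 12345
```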
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex match (between the //), so you need the .* before abc and after xyz to get rid of the surrounding text. The literal abc and xyz must stay in the pattern: with a bare .*([0-9]+).* the greedy leading .* would swallow all but the last digit.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why do you even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file