Objective
On Linux, I am trying to get an end-user friendly string representing available system memory.
Example:
Your computer has 4 GB of memory.
Success criteria
I consider these aspects end-user friendly (you may disagree):
1G is more readable than 1.0G (1 vs. 1.0)
1GB is more readable than 1G (GB vs. G)
1 GB is more readable than 1GB (space-separated unit of measure)
memory is more readable than RAM, DDR or DDR3 (no jargon)
Starting point
The free utility from procps-ng has an option intended for humans:
-h, --human
Show all output fields automatically scaled to shortest three digit unit
and display the units of print out. Following units are used.
B = bytes
K = kilos
M = megas
G = gigas
T = teras
If unit is missing, and you have petabyte of RAM or swap, the number is
in terabytes and columns might not be aligned with header.
so I decided to start there:
> free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       1.4G       2.4G         0B       159M       841M
-/+ buffers/cache:       472M       3.4G
Swap:         3.9G         0B       3.9G
3.8G sounds promising, so all I have to do now is...
Required steps
Filter the output for the line containing the human-readable string (i.e. Mem:)
Pick out the memory total from the middle of the line (i.e. 3.8G)
Parse out the number and unit of measure (i.e. 3.8 and G)
Format and display a string more to my liking (e.g. G ↝ GB, ...)
My attempt
free -h | \
awk '/^Mem:/{print $2}' | \
perl -ne '/(\d+(?:\.\d+)?)(B|K|M|G|T)/ && printf "%g %sB\n", $1, $2'
outputs:
3.8 GB
Desired solution
I'd prefer to just use gawk, but I don't know how
Use a better, even canonical if there is one, way to parse a "float" out of a string
I don't mind the fastidious matching of "just the recognised magnitude letters" (B|K|M|G|T), even if this would unnecessarily break the match with the introduction of new sizes
I use %g to output 4.0 as 4, which is something you may disagree with, depending on how you feel about these comments: https://unix.stackexchange.com/a/70553/10283.
My question, in summary
Could you do the above in awk only?
Could my perl be written more elegantly than that, keeping the strictness of it?
Remember:
I am a beginner robot. Here to learn. :]
What I learned from Andy Lester
Summarised here for my own benefit: to cement learning, if I can.
Use regex character classes, not regex alternation, to pick out one character from a set
perl has a -a option, which splits $_ from -e or -n into @F:
for example, this gawk:
echo foo bar baz | awk '{print $2}'
can be written like this in perl:
echo foo bar baz | perl -ane 'print "$F[1]\n";'
Unless there is something equivalent to gawk's --field-separator, I think I still like gawk better, although of course doing everything in perl is both cleaner and more efficient. (is there an equivalent?)
EDIT: actually, this proves there is, and it's -F just like in gawk:
echo ooxoooxoooo | perl -Fx -ane 'print join "\n", @F'
outputs:
oo
ooo
oooo
perl has a -l option, which is just awesome: think of it as Python's str.rstrip (see the link if you are not a Python head) applied to $_, except it re-appends the \n to the output automatically for you
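A quick illustration of -l on made-up input (without it, each line's length includes the trailing newline; with it, the newline is stripped on input and re-added on output):
> printf 'foo\nbar\n' | perl -ne 'print length($_), "\n"'
4
4
> printf 'foo\nbar\n' | perl -lne 'print length($_)'
3
3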
Thanks, Andy!
Yes, I'm sure you could do this awk-only, but I'm a Perl guy so here's how you'd do it Perl-only.
Instead of (B|K|M|G|T) use [BKMGT].
Use Perl's -l to automatically strip newlines from input and add them on output.
I don't see any reason to have Awk do some of the stripping and Perl doing the rest. You can do autosplitting of fields with Perl's -a.
I don't know what the output from free -h is exactly (my free doesn't have an -h option), so I'm guessing at this:
free -h | \
perl -alne'/^Mem:/ && ($F[1]=~/(\d+(?:\.\d+)?)([BKMGT])/) && printf( "%g %sB", $1, $2)'
An awk (actually gawk) solution
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",substr($2,1,RSTART-1), a[0]); else r=$2 " B";print "Your computer has " r " of memory."}'
or broken down for readability
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",
                substr($2,1,RSTART-1), a[0]); else r=$2 " B";
print "Your computer has " r " of memory."}'
Where
FNR is the current record (line) number; when it equals 2, the {} commands run
$2 is the 2nd field
if (condition) command; else command;
match(string, regex, matches array). The regex says "must end with one of B, K, M, G or T"
r=sprintf sets the variable r, with %.0f printing the float with no decimals
RSTART tells where the match occurred; a[0] is the matched text
Output with the example above:
Your computer has 4 GB of memory.
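As a minimal sketch of the 3rd-arg match() mechanics on a made-up value (gawk only):
$ echo 3.8G | gawk '{ if (match($1, "[BKMGT]$", a)) print substr($1, 1, RSTART-1), a[0] }'
3.8 G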
Another lengthy Perl answer:
free -b |
perl -lane 'if(/Mem/){ @u=("B","KB","MB","GB"); $F[1]/=1024, shift @u while ($F[1]>1024); printf("%.2f %s", $F[1],$u[0])}'
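To see the scaling loop in isolation, here is the same idea as a standalone sketch with a hard-coded byte count (4 GiB, made up for the demo):
$ perl -le '@u = ("B","KB","MB","GB"); $n = 4 * 2**30; $n /= 1024, shift @u while $n > 1024; printf "%.2f %s\n", $n, $u[0]'
4.00 GB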
Related
I need to extract from this file the lines that start with a number in the range 10-20. I tried grep "[10-20]" tmp_file.txt, but on a file with this format
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
21.
it returned nearly everything, marking every 0, 1 and 2 it found, because [10-20] is a character class (the single characters 1, 0-2 and 0), not the number range 10-20 :/
With an extended regular expression (-E):
grep -E '^(1[0-9]|20)\.' file
Output:
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
See: The Stack Overflow Regular Expressions FAQ
Another one with awk, using range patterns (the second command narrows the range to 13):
awk '/^10\./,/^20\./' tmp_file.txt
awk '/^10\./,/^13\./' tmp_file.txt
10.aa
12.bbb
13.cccc
Try
grep -w -e '^1[[:digit:]]' -e '^20' tmp_file.txt
-w forces matches of whole words. That prevents matching lines like 100.... It's not POSIX, but it's supported by every grep that most people will encounter these days. Use grep -e '^1[[:digit:]]\.' -e '^20\.' ... if you are concerned about portability.
The -e option can be used multiple times to specify multiple patterns.
[[:digit:]] may be more reliable than [0-9]. See In grep command, can I change [:digit:] to [0-9]?.
Assuming the file might not be sorted and using numeric comparison
awk -F. '$1 >= 10 && $1 <= 20' < file.txt
grep is not the tool for this, because grep finds text patterns, but does not understand numeric values. Making patterns that match the 11 values from 10-20 is like stirring a can of paint with a screwdriver. You can do it, but it's not the right tool for the job.
A much clearer way to do this is with Perl:
$ perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20' foo.txt
This says to print a line of the file if the beginning of the line ^ matches one or more digits \d+ and if the numeric value of what was matched $1 is between the values of 10 and 20.
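A quick check on some made-up lines:
$ printf '9.x\n10.aa\n20.j\n21.z\n' | perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20'
10.aa
20.j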
I currently use this perl command to increment the last number in a string:
perl -pe 's/(\d+)(?!.*\d+)/$1+1/e' <<< "abc123_00456.txt"
It outputs abc123_457.txt, while I want abc123_00457.txt.
I also want something like 99 to increment to 100, though if that's too hard, 00 is also acceptable.
Some more examples of what I want:
09 -> 10
004 -> 005
I also want to be able to increment by any number (not just 1), so no ++.
I do not want to use shell's builtins to accomplish this.
Try this:
perl -pe 's/(\d+)(?=\D*\z)/my $n = $1; ++$n/e' <<< "abc123_00456.txt"
The ++ operator preserves the number of digits when incrementing a string.
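For example, reproducing the 09 -> 10 and 004 -> 005 cases from the question:
$ perl -le '$n = "09"; print ++$n'
10
$ perl -le '$n = "004"; print ++$n'
005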
Alternatively:
perl -pe 's/(\d+)(?=\D*\z)/sprintf "%0*d", length($1), $1 + 1/e' <<< "abc123_00456.txt"
This lets you increment by more than just 1 (or perform other arithmetic operations).
sprintf %d formats an integer in decimal format. 0 means to pad the result with zeroes; * means the field width is taken from the next argument instead of the format string itself. (E.g. %05d means "format a number by padding it with zeroes until it is at least 5 characters wide".)
Here we simply take the length of the original string of digits (length($1)) and use it as our field width. The number to format is $1 + 1. If it is shorter than the original string, sprintf automatically adds zeroes.
See also perldoc -f sprintf.
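For example, with the question's 00456 (width 5):
$ perl -e 'printf "%0*d\n", 5, 456 + 1'
00457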
You can use a formatted string with sprintf:
perl -pe 's/(\d+)(?!.*\d)/sprintf("%05d",$1+1)/e' <<< "abc123_00456.txt"
The 5 gives the width of your number, the 0 is the character used to pad the number.
For an unknown width, you can build the format string dynamically:
perl -pe 's/(\d+)(?!.*\d)/sprintf("%0".length($1)."d",$1+1)/e' <<< "abc123_00456.txt"
With GNU awk for the 3rd arg to match():
$ awk -v n=17 'match($0,/(.*[^0-9])([0-9]+)(.*)/,a){$0=a[1] sprintf("%0*d",length(a[2]),a[2]+n) a[3]} 1' <<< "abc123_00456.txt"
abc123_00473.txt
With any awk in any shell on every UNIX box:
$ awk -v n=17 'match($0,/[0-9]+\./){lgth=RLENGTH-1; tgt=substr($0,RSTART,lgth); $0=substr($0,1,RSTART-1) sprintf("%0*d",lgth,tgt+n) substr($0,RSTART+lgth)} 1' <<< "abc123_00456.txt"
abc123_00473.txt
This might work for you (GNU sed & Bash):
sed -E 's/^([^0-9]*([0-9]+[^0-9]+)*0*)([0-9]+)(.*)/echo "\1$((\3+1))\4"/e' file
Pretty new to AWK programming. I have a file1 with entries as:
15>000000513609200>000000513609200>B>I>0011>>238/PLMN/000100>File Ef141109.txt>0100-75607-16156-14 09-11-2014
15>000000513609200>000000513609200>B>I>0011>Danske Politi>238/PLMN/000200>>0100-75607-16156-14 09-11-2014
15>000050354428060>000050354428060>B>I>0011>Danske Politi>238/PLMN/000200>>4100-75607-01302-14 31-10-2014
I want to write an awk script where, if the 3rd field minus the 2nd field is 0, it prints field 2. Else, if the difference is > 0, it prints every number from the 2nd field up to the 3rd field, incrementing by 1. There will be no scenario where the 3rd field is less than the 2nd, so I'm ignoring that condition.
I was doing something as:
awk 'NR > 2 { print p } { p = $0 }' file1 | awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
(( Someone told me awk is close to C in terms of syntax ))
But from the output it looks to me like the string-to-numeric and numeric-to-string conversions are not taking place at the right place at the right time. Shouldn't AWK take care of that automatically?
The OUTPUT that I get:
513609200
513609201
513609200
Which is not quite as expected. One evident issue is that it's ignoring the preceding 0s.
Kindly help me modify the AWK script to get the desired result.
NOTE:
awk 'NR > 2 { print p } { p = $0 }' file1 is just to remove the 1st and last entry in my original file1. So the part that needs to be fixed is:
awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
In awk, think of $ as an operator that retrieves the value of the given field number ($0 being a special case)
$1 is the value of field 1
$NF is the value of the field given in the NF variable
So, $($3 - $2) will try to get the value of the field number given by the expression ($3 - $2).
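To see $ taking an expression, a tiny made-up example ($(3 - 1) is $2):
$ echo 'x>20>23' | awk -F'>' '{ print $(3 - 1) }'
20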
You need fewer $ signs
awk -F">" '{
if ($3 == $2)
print $2
else {
v=$2
while (v < $3)
print v++
}
}'
Normally this would work, but your numbers are beyond awk's integer bounds, so you need another solution to handle them. I'm posting this to prompt other solutions and to better illustrate your specification.
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
note that this will skip the rows that you say are impossible (3rd field less than the 2nd)
A small scale example
$ cat file_0
x>1000>1000>etc
x>2000>2003>etc
x>3000>2999>etc
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file_0
1000
2000
2001
2002
2003
Apparently, newer versions of gawk have a --bignum (-M) option for arbitrary-precision integers; if you have a compatible version, that may solve your problem, but I don't have access to verify.
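A quick check of -M, assuming your gawk was built with MPFR/GMP support (otherwise -M is not available):
$ gawk -M 'BEGIN { print 2^100 }'
1267650600228229401496703205376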
For anyone who does not have ready access to gawk with bigint support, it may be simpler to consider other options if some kind of "big integer" support is required. Since ruby has an awk-like mode of operation,
let's consider ruby here.
To get started, there are just four things to remember:
invoke ruby with the -n and -a options (-n for the awk-like loop; -a for automatic parsing of lines into fields ($F[i]));
awk's $n becomes $F[n-1];
explicit conversion of numeric strings to integers is required;
To specify the lines to be executed on the command line, use the '-e TEXT' option.
Thus a direct translation of:
awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
would be:
ruby -an -F'>' -e '($F[1].to_i .. $F[2].to_i).each {|i| puts i }' file
To guard against empty lines, the following script would be slightly better:
($F[1].to_i .. $F[2].to_i).each {|i| puts i } if $F.length > 2
This could be called as above, or if the script is in a file (say script.rb) using the incantation:
ruby -an -F'>' script.rb file
Given the OP input data, the output is:
513609200
513609200
50354428060
The left-padding can be accomplished in several ways -- see for example this SO page.
Problem:
I need to match an exact format for a mailing machine software program. I can count the number of newlines, carriage returns, tabs, etc. using tools like
cat -vte
and
od -c
and
wc -l ( or wc -c )
However, I'd like to know the exact number of leading and trailing spaces between characters
and sections of text. Tabs as well.
Question:
How would you go about analyzing and then matching a template exactly using common unix
tools + perl or python? One-liners preferred. Also, what's your advice for matching
a DOS-encoded file? Would you translate it to *nix line endings first, then analyze, or leave it as is?
UPDATE
Using this to see individual spaces [ assumes no '%' chars in file ]:
sed 's/ /%/g' filename.000
Plan to build a script that analyzes each line's tab and space content.
Using #shiplu's solution with a nod to the anti-cat crowd:
while read l;do echo $l;echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));done<filename.000
Still needs some tweaks for Windows, but it's well on its way.
SAMPLE TEXT
Key for reading:
newlines marked with \n
Carriage returns marked with \r
Unknown space/tab characters marked with [:space:] ( need counts on those )
\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK 99999\r\n
\n
\n
[:space:] 10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:] D_ \r[:space:] _O\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:] Pantz McManliss\r\n
[:space:] Gibberish Ave\r\n
[:space:] Northern Mirkwood, ME 99999\r\n
( untold variable amounts of \n chars go here )
UPDATE 2
Using IFS with read gives similar results to the ruby posted by someone below.
while IFS='' read -r line
do
printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
done < filename.000
perl -nlE'say 0+( () = /\s/g );'
Unlike the currently accepted answer, this doesn't split the input into fields only to discard the result. It also doesn't needlessly create an array just to count the number of values in a list.
Idioms used:
0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times () = /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.
If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:
perl -nE'say tr/ \t//;'
In both cases, you can pass the input via STDIN or via a file named by an argument.
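For example, on a made-up line containing two spaces and one tab:
$ printf 'a b\tc d\n' | perl -nE'say tr/ \t//;'
3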
Regular expressions in Perl or Python would be the way to go here.
Perl Regular Expressions
Python Regular Expressions
Regular Expressions Cheat Sheet
Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.
perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt
This will count individual groups of tabs or spaces, instead of counting all the whitespace in the entire line. For example, a line consisting of foo, 4 spaces, bar, and 8 trailing spaces will print
foo    bar
Count: 4
Count: 8
You may wish to skip single spaces (spaces between words). I.e. don't count the spaces in Bathtime for BonZo. If so, replace + with {2,} or whatever minimum you think is appropriate.
counting blanks:
sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c
before, behind and between text. Do you want to count newlines, tabs, etc. in the same go and sum them up, or as separate step?
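For example, two made-up lines with four spaces in total:
$ printf 'a b  c\nd e\n' | sed 's/[^ ]//g' | tr -d "\n" | wc -c
4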
If you want to count the number of spaces in pm.txt, this command will do,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));
done;
If you want to count the number of spaces, \r, \n, \t use this,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' \r\n\t' | wc -c`));
done;
read will strip leading and trailing whitespace. If you don't want that, there is a nasty way. First split your file so that there is only 1 line per file, using
split -l 1 -d pm.txt
After that there will be a bunch of x* files. Now loop through them:
for x in x*; do echo $((`cat $x | wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;
Remove those files with rm x*;
In case Ruby counts (it does count :)
ruby -lne 'puts $_.scan(/\s/).size'
and now some Perl (slightly less intuitive IMHO):
perl -lne 'print scalar(@{[/(\s)/g]})'
If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.
I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead, and I added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppress automatic printing of each line
-r extended regex, so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
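For example (GNU sed; on BSD/macOS sed use -E instead of -r):
$ echo 'abc12345xyz' | sed -rn 's/.*abc([0-9]+)xyz.*/\1/p'
12345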
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this will multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use gawk's match() with a third argument to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, where that substring begins (1 if it starts at the beginning of the string) and 0 if there is no match, a successful match triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs between abc and xyz and prints just the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex match (everything between the //), so you need the .* before abc and after xyz to get rid of the text around the number in the substitution, and you need the literal abc and xyz around the capture so the greedy .* can't swallow the digits.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '^[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, contains nothing but digits, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing a zero-length $2 would trip up my previous solution
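For instance, on a made-up input (this assumes your awk honors the ^ and $ anchors inside the FS regex, which is what the trick relies on):
$ echo 'abc12345xyz' | gawk 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
12345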
there's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
   # simply print out all your even-numbered columns --
   # those will be the parts of "just the match":
   for (i = 2; i <= NF; i += 2) print $i }'
if you also run another OFS = ""; $1 = $1;, then instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the entire $0's fields are now in data1-sep1-data2-sep2-... order, all while $0 looks EXACTLY the same as when you first read in the line: a straight-up print will be byte-for-byte identical to printing immediately upon reading.
Once I tested it to the extreme, using a regex that matches valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167 MB text file with plenty of CJK Unicode all over, read in all at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF-8.
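A compact sketch of the same wrap-and-resplit idea, using awk's built-in SUBSEP ("\034") as a stand-in nonce instead of a random one, on a made-up input:
$ echo 'abc12345xyz def678ghi' | awk 'BEGIN { FS = OFS = SUBSEP } { gsub(/[0-9]+/, FS "&" FS); $0 = $0; for (i = 2; i <= NF; i += 2) print $i }'
12345
678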
You can do it with the shell:
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##abc}"
echo "num is ${t%%xyz}";;
esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
  # default, do nothing
}
Or, equivalently, as a one-liner:
gawk '/.*abc([0-9]+)xyz.*/' file