shell regex not found (number, length and K letter) - regex

my code or regex not found, im tried with:
'^([0-9]{7,8})+([K|0-9]{1})'
'#[0-9]{7,8}[K 0-9]{1}'
'#[0-9]{7,8}[K 0-9]{1}'
"^([0-9]{7,8})+([K|0-9]{1})$"
*i need return (1234567 or 12345678) + K or number (not found \d)
ex: 123456789 12345678K 12345678
*with this regex '^[0-9]{7,8}[K|0-9]{1}' return:
ok 184587939
ok 17678977K
ok 186613074
ok 18661307Z (dont work the last digit)
invalido 18R613074
ok 1845879398888888 (not length found)
ok 18458793U
invalido 18661G074
invalido 18661G07T
ok 18458793
invalido 1845E793
#!/bin/bash
var='^([0-9]{7,8})+([K|0-9]{1})$'
for LINEA in `cat rut.txt ` #LINEA guarda el resultado del fichero rut.txt
do
rut=`echo $LINEA | cut -d ":" -f1` #Extrae rut
rut=$(echo $rut | tr 'a-z' 'A-Z')
while :
do
if [[ $rut =~ $var ]];then ()
#$rut >> rutok.txt
echo $rut | cat >> rutok.txt
echo $rut' ok'
break
else
#$rut >> rutinv.txt
echo $rut | cat >> rutinv.txt
echo $rut' inv'
break
fi
done
done
exit 0

If Iunderstand you correctly, I believe this is what you're after.
^([0-9]{8}|[0-9]{7}K)$

Your question is extremely unclear, but in addition, your code is incredibly overcomplicated. You seem to be looking for simply grep and grep -v. Looping over the lines in a file with for is definitely wrong (see http://mywiki.wooledge.org/DontReadLinesWithFor) but a while read -r loop is also usually an antipattern. In the general case, these are usually examples of an Awk script begging to be born.
cut -d : -f1 rut.txt |
tr A-Z a-z |
awk '/^([0-9]{7,8})([KZ0-9])$/ { print >>"rutok.txt"; next }
{ print >> "rutinv.txt" }'
The {1} repetition is just superfluous (otherwise we'd have to write h{1}t{1}t{1}p{1} etc to match a simple string!) and inside a character class, you just enumerate the character ranges; if you don't want to match a literal | character, don't put it in the character class.
([0-9]{7,8})+ means one or more repetitions of 7 or 8 numbers; so 7, 8, 14, 15, 16, 21, ... repetitions. According to the exposition, this is not what you want.
I'm guessing the regex above covers your case: seven or eight digits and a K or one more digit.
Moreover, returning to your shell script, you should not use uppercase for your own variables (these are reserved for system use); and you need to quote your strings unless you specifically want the shell to perform token splitting and wildcard expansion on the values (though that's probably likely to be harmless here). Finally, the while loop does not seem to serve any useful purpose, as you break out of it on the first iteration in every case.

You can try this:
^([0-9]{7,8})([k0-9])$
For
184587939
17678977K
186613074
and ^([0-9]{7,8})([a-z0-9])$ For
184587939
17678977K
186613074
17678977A
18661307B
17678977U
18661307L

Related

Shell script linux, validating integer

This code is for check if a character is a integer or not (i think). I'm trying to understand what this means, I mean... each part of that line, checking the GREP man pages, but it's really difficult for me. I found it on the internet. If anyone could explain me the part of the grep... what means each thing put there:
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful, you could easily analyze the
'^(+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: ESC character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetation
[0-9]: range of numbers from 0-9
+: one or more repetation
$: end of line (no more characters allowed)
so it accepts like: -312353243 or +1243 or 5678
but do not accept: 3 456 or 6.789 or 56$ (as dollar sign).

Linux: counting spaces and other characters in file

Problem:
I need to match an exact format for a mailing machine software program. It expects a certain format. I can count the number of new lines, carriage returns, tabs ...etc. using tools like
cat -vte
and
od -c
and
wc -l ( or wc -c )
However, I'd like to know the exact number of leading and trailing spaces between characters
and sections of text. Tabs as well.
Question:
How would you go about analyzing then matching a template exactly using common unix
tools + perl or python? One-liners preferred. Also, what's your advice for matching
a DOS encoded file? Would you translate it to NIX first, then analyze, or leave, as is?
UPDATE
Using this to see individual spaces [ assumes no '%' chars in file ]:
sed 's/ /%/g' filename.000
Plan to build a script that analyzes each line's tab and space content.
Using #shiplu's solution with a nod to the anti-cat crowd:
while read l;do echo $l;echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));done<filename.000
Still needs some tweaks for Windows but it's well on it's way.
SAMPLE TEXT
Key for reading:
newlines marked with \n
Carriage returns marked with \r
Unknown space/tab characters marked with [:space:] ( need counts on those )
\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK 99999\r\n
\n
\n
[:space:] 10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:] D_ \r[:space:] _O\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:] 45454545454545[:space:] 10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:] Pantz McManliss\r\n
[:space:] Gibberish Ave\r\n
[:space:] Northern Mirkwood, ME 99999\r\n
( untold variable amounts of \n chars go here )
UPDATE 2
Using IFS with read gives similar results to the ruby posted by someone below.
while IFS='' read -r line
do
printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
done < filename.000
perl -nlE'say 0+( () = /\s/g );'
Unlike the currently accepted answer, this doesn't split the input into fields, discarding the result. It also doesn't needlessly create an array just to count the number of values in a list.
Idioms used:
0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times () = /.../g matched.
-l, when used with -n, will cause the input to be "chomped", so this removes line feeds from the count.
If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:
perl -nE'say tr/ \t//;'
In both cases, you can pass the input via STDIN or via a file named by an argument.
Regular expressions in Perl or Python would be the way to go here.
Perl Regular Expressions
Python Regular Expressions
Regular Expressions Cheat Sheet
Yes, it may take an initial time investment to learn "perl, schmerl, zwerl" but once you've gained experience with an extremely powerful tool like Regular Expressions, it can save you an enormous amount of time down the road.
perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt
This will count individual groups of tab or space, instead of counting all the whitespace in the entire line. For example:
foo bar
Will print
foo bar
Count: 4
Count: 8
You may wish to skip single spaces (spaces between words). I.e. don't count the spaces in Bathtime for BonZo. If so, replace + with {2,} or whatever minimum you think is appropriate.
counting blanks:
sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c
before, behind and between text. Do you want to count newlines, tabs, etc. in the same go and sum them up, or as separate step?
If you want to count the number of spaces in pm.txt, this command will do,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' ' | wc -c`));
done;
If you want to count the number of spaces, \r, \n, \t use this,
cat pm.txt | while read l;
do echo $((`echo $l | wc -c` - `echo $l | tr -d ' \r\n\t' | wc -c`));
done;
read will strip any leading characters. If you dont want it, there is a nasty way. First split your file so that only 1 lines are there per file using
`split -l 1 -d pm.txt`.
After that there will be bunch of x* files. Now loop through it.
for x in x*; do echo $((`cat $x | wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;
Remove the those files by rm x*;
In case Ruby counts (it does count :)
ruby -lne 'puts scan(/\s/).size'
and now some Perl (slightly less intuitive IMHO):
perl -lne 'print scalar(#{[/(\s)/g]})'
If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl I'd have wasted half a day.

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"

Grep regular expression for digits in character string of variable length

I need some way to find words that contain any combination of characters and digits but exactly 4 digits only, and at least one character.
EXAMPLE:
a1a1a1a1 // Match
1234 // NO match (no characters)
a1a1a1a1a1 // NO match
ab2b2 // NO match
cd12 // NO match
z9989 // Match
1ab26a9 // Match
1ab1c1 // NO match
12345 // NO match
24 // NO match
a2b2c2d2 // Match
ab11cd22dd33 // NO match
to match a digit in grep you can use [0-9]. To match anything but a digit, you can use [^0-9]. Since that can be any number of , or no chars, you add a "*" (any number of the preceding). So what you'll want is logically
(anything not a digit or nothing)* (any single digit) (anything not a digit or nothing)* ....
until you have 4 "any single digit" groups. i.e. [^0-9]*[0-9]...
I find with grep long patterns, especially with long strings of special chars that need to be escaped, it's best to build up slowly so you're sure you understand whats going on. For example,
#this will highlight your matches, and make it easier to understand
alias grep='grep --color=auto'
echo 'a1b2' | grep '[0-9]'
will show you how it's matching. You can then extend the pattern once you understand each part.
I'm not sure about all the other input you might take (i.e. is ax12ax12ax12ax12 valid?), but this will work based on what you posted:
%> grep -P "^(?:\w\d){4}$" fileWithInput
With grep:
grep -iE '^([a-z]*[0-9]){4}[a-z]*$' | grep -vE '^[0-9]{4}$'
Do it in one pattern with Perl:
perl -ne 'print if /^(?!\d{4}$)([^\W\d_]*\d){4}[^\W\d_]*$/'
The funky [^\W\d_] character class is a cosmopolitan way to spell [A-Za-z]: it catches all letters rather than only the English ones.
If you don't mind using a little shell as well, you could do something like this:
echo "a1a1a1a1" |grep -o '[0-9]'|wc -l
which would display the number of digits found in the string. If you like, you could then test for a given number of matches:
max_match=4
[ "$(echo "a1da4a3aaa4a4" | grep -o '[0-9]'|wc -l)" -le $max_match ] || echo "too many digits."
Assuming you only need ASCII, and you can only access the (fairly primitive) regexp constructs of grep, the following should be pretty close:
grep ^[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*[a-zA-Z]*[0-9][a-zA-Z]*$ | grep [a-zA-Z]
You might try
[^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*[0-9][^0-9]*
But this will match 1234. why doesn't that match your criteria?
The regex for that is:
([A-Za-z]\d){4}
[A-Za-z] - for character class
\d - for number
you wrapp them in () to group them indicating the format character follow by number
{4} - indicating that it must be 4 repetitions
you can use normal shell script, no need complicated regex.
var=a1a1a1a1
alldigits=${var//[^0-9]/}
allletters=${var//[0-9]/}
case "${#alldigits}" in
4)
if [ "${#allletters}" -gt 0 ];then
echo "ok: 4 digits and letters: $var"
else
echo "Invalid: all numbers and exactly 4: $var"
fi
;;
*) echo "Invalid: $var";;
esac
thanks for your answers
finaly i wrote some script and it work perfect:
. /P ab2b2 cd12 z9989 1ab26a9 1ab1c1 1234 24 a2b2c2d2
#!/bin/bash
echo "$#" |tr -s " " "\n"s >> sorting
cat sorting | while read tostr
do
l=$(echo $tostr|tr -d "\n"|wc -c)
temp=$(echo $tostr|tr -d a-z|tr -d "\n" | wc -c)
if [ $temp -eq 4 ]; then
if [ $l -gt 4 ]; then
printf "%s " "$tostr"
fi
fi
done
echo

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have the escape the capture group parens().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of bracks ($1).
You can do this will multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from with the pattern space (a line).
You can use awk with match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*([0-9]+).*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after the ([0-9]+) to get rid of text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if you actual situation is more complex, the REs will need to me modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##abc}"
echo "num is ${t%%xyz}";;
esac
done <"file"
For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
/* default, do nothing */
}
gawk '/.*abc([0-9]+)xyz.*/' file