Using regex in Bash with mapfile - regex

Edit 2:
Minimal input file: input/input.txt
#-----------
snapshot=83
#-----------
time=30142088
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240480
heap_tree=empty
#-----------
snapshot=84
#-----------
time=30408368
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240552
heap_tree=empty
#-----------
snapshot=85
#-----------
time=30674648
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240464
heap_tree=empty
#-----------
snapshot=86
#-----------
Actual output:
input.txt/*
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
Expected output:
input.txt
time, heap, stack,
30142088, 20224, 240480
30408368, 20224, 240552
30674648, 20224, 240464
Edit: Originally, the problem may have been due to the lack of multiline capability in Bash's regex. However, after stripping newlines from the text, the problem remains, except that the output now has between one and five lines instead of zero.
I'm trying to write a Bash script to parse a text file into a desirable CSV file with the needed information.
As part of the script, I iterate through n files. Each of the files contains m matches for a given regex, and each match contains three capture groups.
I want to format the three capture groups into a CSV row, then concatenate all the rows of all the matches of all the files and write them to a *.csv file.
I'm quite comfortable using regex in high-level languages such as Kotlin or C#, but I have no experience with regex in Bash. I used this answer as a starting point, but it doesn't seem to be working for me (mapfile -t matches < <( format_row "$text" "$regex" ) doesn't do anything).
Here's the full code with the relevant portion noted:
#!/bin/bash
# RELEVANT CODE BELOW
regex="time=([0-9]+)\nmem_heap_B=([0-9]+)\n.*\nmem_stacks_B=([0-9]+)"
format_row() {
local s=$1 regex=$2
while [[ $s =~ $regex ]]
do
time="${BASH_REMATCH[1]}"
heap="${BASH_REMATCH[2]}"
stack="${BASH_REMATCH[3]}"
echo "${time}, ${heap}, ${stack}"
echo ""
s=${s#*"${BASH_REMATCH[3]}"}
done
}
for file in $1/*
do
echo "Parsing ${file}..."
echo $file >> $2
echo "time, heap, stack" >> $2
text=$(<${file})
mapfile -t matches < <( format_row "$text" "$regex" )
printf "%s\n" "${matches[@]}" >> $2
echo "" >> $2
done
echo ""
echo "Done"
Thanks!

There are two problems here:
Although bash's =~ operator can match newlines, it does not understand the \n escape sequence. You have to use actual newlines in your regex. This can also be achieved by C-style strings $'\n'.
The regex quantifier * is greedy. When matching ...
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.).*b=(.) ]]
... you end up with BASH_REMATCH[1]=1 and BASH_REMATCH[2]=3 instead of 1 and 1.
In other regex dialects such as PCRE you could use the non-greedy quantifier *?. Bash's =~ uses POSIX extended regular expressions, which have no non-greedy quantifiers, so we need a workaround: replace .* with something that cannot match more than wanted, for instance
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.)[^=]*b=(.) ]]
In your case, we have to make sure the match stops at the first mem_stacks_B and does not run on to a later one.
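A minimal sketch of the difference, using the example string from above. The variable names (greedy, bounded and the two *_result variables) are only illustrative:

```shell
#!/bin/bash
s="a=1,b=1 a=2,b=2 a=3,b=3"

# Greedy .* runs on to the last "b=", so group 2 comes from the third pair.
greedy='a=(.).*b=(.)'
[[ $s =~ $greedy ]] && greedy_result="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"

# [^=]* cannot cross an "=", so the match stays inside the first pair.
bounded='a=(.)[^=]*b=(.)'
[[ $s =~ $bounded ]] && bounded_result="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"

echo "greedy:  $greedy_result"
echo "bounded: $bounded_result"
```

Storing each pattern in a variable and expanding it unquoted on the right-hand side of =~ avoids quoting surprises inside [[ ... ]].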
As you didn't post any example input and expected output, I can only guess. However, I assume the following regex could work for you:
regex=$'time=([0-9]+)
mem_heap_B=([0-9]+)
([^\n]*\n){TODO set number of lines allowed here}
mem_stacks_B=([0-9]+)'
Note that now you have to use BASH_REMATCH[4] instead of [3].
At the marked location you have to insert the number of lines allowed between mem_heap_B and mem_stacks_B. The number can be a constant (e.g. {5}) or a range (e.g. {1,10}). With a range, make sure the upper bound is not so high that the match could skip the next mem_stacks_B and pick up a later one instead. For ranges it may therefore be more appropriate to use two matches, something like
regex1='time=([0-9]+)
mem_heap_B=([0-9]+)'
regex2='mem_stacks_B=([0-9]+)'
while
    [[ "$s" =~ $regex1 ]] &&
    time="${BASH_REMATCH[1]}" &&
    heap="${BASH_REMATCH[2]}" &&
    [[ "$s" =~ $regex2 ]] &&
    stack="${BASH_REMATCH[1]}"
do
    echo "$time, $heap, $stack"
    s="${s#*$stack}"
done >> "$2"
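For the sample input added to the question, exactly one line (mem_heap_extra_B) sits between mem_heap_B and mem_stacks_B, so the single-regex variant can be sketched with {1}. This is a sketch under that assumption, not a drop-in replacement. The rows variable is only there for the demo, and the loop strips the whole match (BASH_REMATCH[0]) rather than just the last group:

```shell
#!/bin/bash
# Two snapshots in the layout shown in the question
text=$(cat <<'EOF'
#-----------
snapshot=83
#-----------
time=30142088
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240480
heap_tree=empty
#-----------
snapshot=84
#-----------
time=30408368
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240552
heap_tree=empty
EOF
)

# $'...' turns \n into real newlines; {1} assumes exactly one line in between
regex=$'time=([0-9]+)\nmem_heap_B=([0-9]+)\n([^\n]*\n){1}mem_stacks_B=([0-9]+)'

format_row() {
    local s=$1 regex=$2
    while [[ $s =~ $regex ]]; do
        # Group 3 is the skipped line, so the stack value is group 4
        echo "${BASH_REMATCH[1]}, ${BASH_REMATCH[2]}, ${BASH_REMATCH[4]}"
        # Drop everything up to and including the text just matched
        s=${s#*"${BASH_REMATCH[0]}"}
    done
}

rows=$(format_row "$text" "$regex")
echo "$rows"
```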
By the way:
https://www.shellcheck.net/ helps you to make your script more robust.
First and foremost: quote your variables.
You can use do cmd1; cmd2; done >> file instead of do cmd1 >> file; cmd2 >> file; done.
mapfile -t matches < <(format_row "$text" "$regex")
printf "%s\n" "${matches[@]}" >> "$2"
could be written as just
format_row "$text" "$regex" >> "$2"
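As a small illustration of the single-redirection tip, assuming a temporary file as the target (outfile and the file names in the loop are made up for the demo):

```shell
#!/bin/bash
outfile=$(mktemp)   # hypothetical output file for the demo

for name in a.txt b.txt; do
    echo "$name"
    echo "time, heap, stack"
done >> "$outfile"   # one redirection covers every command in the loop

# Read the result back to check what was written
mapfile -t written < "$outfile"
echo "${#written[@]} lines written"
rm -f "$outfile"
```

Besides being shorter, this opens the output file once for the whole loop instead of once per command.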

Related

Bash regex overwrite line if multiple match

I have a bash script where I have 3 regular expressions. I would like to, through conditional if, to find the match of the first pattern in the file.
If there is a match, then look for a match in the second pattern but only with the lines that have matched the first pattern.
Finally, to check the third pattern only with the lines that have matched the second pattern (which are also the ones that had already matched the first pattern).
I have the following code but I don't know how to tell that if there is a match to overwrite the "line" value to decrease the number of total lines to only the ones matching.
#!/bin/bash
pattern1= egrep '^([^,]*,){31}[1-9][0-9].*'
pattern2= egrep '^([^,]*,){16}[0-1].[3-9].*'
pattern3= egrep '^([^,]*,){32}[2-9][0-9].*'
while read line
do
if [[$line == $pattern1]];then
newline == $pattern1
if [[$newline == $pattern2 ]];then
newline2 == $pattern2
if [[$newline2 == $pattern3 ]]; then
echo $pattern3
fi
done < mj1.csv #this is the input file
I will call this script like ./b1.sh <filename>.
Some input data:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,10,10,11/15/1984,21,272,21.74469541,CHI,1,BOS,0,-20,1,33,12,24,0.5,0,1,0,3,3,1,0,2,2,2,2,1,1,4,27,17.1
1985,11,11,11/17/1984,21,274,21.75017112,CHI,1,PHI,0,-9,1,44,4,17,0.235,0,0,,8,8,1,0,5,5,7,5,2,4,5,16,12.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,14,14,11/23/1984,21,280,21.76659822,CHI,0,SEA,1,19,1,30,9,13,0.692,0,0,,5,6,0.833,0,4,4,3,4,1,4,4,23,19.5
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,16,16,11/27/1984,21,284,21.77754962,CHI,0,GSW,0,-6,1,24,6,10,0.6,0,0,,1,1,1,0,2,2,3,3,2,4,1,13,11.1
1985,17,17,11/29/1984,21,286,21.78302533,CHI,0,PHO,0,-5,1,30,9,17,0.529,1,1,1,3,4,0.75,1,2,3,2,2,0,2,5,22,14
1985,18,18,11/30/1984,21,287,21.78576318,CHI,0,LAC,1,4,1,37,9,15,0.6,0,0,,2,4,0.5,2,3,5,5,3,0,4,4,20,15.5
1985,19,19,12/2/1984,21,289,21.79123888,CHI,0,LAL,1,1,1,42,7,13,0.538,0,0,,6,8,0.75,2,0,2,3,1,1,4,3,20,12.9
1985,20,20,12/4/1984,21,291,21.79671458,CHI,1,NJN,1,15,1,35,7,13,0.538,0,0,,6,6,1,1,2,3,6,1,0,3,3,20,16
1985,21,21,12/7/1984,21,294,21.80492813,CHI,1,NYK,1,2,1,43,8,16,0.5,0,1,0,5,7,0.714,1,1,2,3,2,0,6,5,21,9.3
1985,22,22,12/8/1984,21,295,21.80766598,CHI,1,DAL,1,2,1,35,10,23,0.435,0,0,,0,0,,4,3,7,2,0,2,2,3,20,11.2
1985,23,23,12/11/1984,21,298,21.81587953,CHI,1,DET,0,-7,1,37,13,28,0.464,0,1,0,1,3,0.333,1,7,8,6,2,0,3,4,27,16.2
1985,24,24,12/12/1984,21,299,21.81861739,CHI,0,DET,0,-7,1,30,6,17,0.353,0,2,0,9,10,0.9,0,1,1,2,2,1,1,5,21,12.5
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,26,26,12/15/1984,21,302,21.82683094,CHI,1,PHI,0,-12,1,27,7,16,0.438,0,0,,0,0,,1,1,2,2,1,0,1,2,14,7.2
1985,27,27,12/18/1984,21,305,21.83504449,CHI,1,HOU,0,-8,1,45,8,20,0.4,0,1,0,2,4,0.5,1,2,3,8,3,0,1,2,18,14.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
To make things easier, pattern1 matches all rows where column PTS is higher than 10, pattern 2 matches the rows where column FG_PCT is higher than 0.3, and pattern 3 matches all rows where column GmSc is higher than 19.
While an awk solution is going to be a bit faster ... we'll focus on a bash solution per OP's request.
First issue is regex matching uses the =~ operator and not the == operator.
Second issue is that keeping a row only if all 3 regexes match means we want to and (&&) the results of all 3 regex matches.
Third issue addresses some basic syntax issues with OP's current code (eg, space after [[ and before ]]; improper assignments of regex patterns to the pattern* variables).
One bash idea:
pattern1='^([^,]*,){31}[1-9][0-9].*'
pattern2='^([^,]*,){16}[0-1].[3-9].*'
pattern3='^([^,]*,){32}[2-9][0-9].*'
head -1 mj1.csv > mj1.new.csv
while read -r line
do
    if [[ "${line}" =~ $pattern1 && "${line}" =~ $pattern2 && "${line}" =~ $pattern3 ]]
    then
        # do whatever with $line, eg:
        echo "${line}"
    fi
done < mj1.csv >> mj1.new.csv
This generates:
$ cat mj1.new.csv
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
NOTE: OP hasn't (yet) provided the expected output so at this point I have to assume OP's regexes are correct
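For comparison, here is a rough sketch of the awk route mentioned at the top. It follows OP's prose description (PTS > 10, FG_PCT > 0.3, GmSc > 19) with numeric comparisons, so it is an assumption about the intent and can keep rows that the regexes reject (for example a GmSc of 19.4). The temporary file and the kept array exist only for the demo:

```shell
#!/bin/bash
csv=$(mktemp)   # stand-in for mj1.csv: header plus the first three data rows
cat > "$csv" <<'EOF'
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
EOF

# Column 17 is FG_PCT, column 32 is PTS, column 33 is GmSc; NR == 1 keeps the header
mapfile -t kept < <(awk -F, 'NR == 1 || ($32 > 10 && $17 > 0.3 && $33 > 19)' "$csv")

printf '%s\n' "${kept[@]}"
rm -f "$csv"
```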

Swap Strings within a line in Bash

I'm parsing a document with a bash script and outputting different parts of it. At one point I need to find and reformat text of the form:
(foo)[X]
[Y]
(bar)[Z]
to something like:
X->foo
Y
Z->bar
Now, I'm able to grep the parts I want with RegEx, but I'm having trouble swapping the two elements in one line and handling the fact that the text in parentheses is optional. Is this even possible with a combination of sed and grep?
Thank You for your time.
You can use sed:
sed -e 's/(\([^)]*\))\[\([^]]*\)]/\2->\1/' -e 's/\[\([^]]*\)]/\1/' file
This works for your given input example:
X->foo
Y
Z->bar
You might need to make the patterns more strict if you have more kinds of input to handle.
You can use awk:
awk -F '[][()]+' '{print (NF>3 ? $3 "->" $2 : $2)}' file
X->foo
Y
Z->bar
You can even do it in bash itself, although it's not pretty.
# Three capture groups:
# 1. The optional parenthesized text
# 2. The contents of the parentheses
# 3. The contents of the square brackets
regex="(\((.*)\))?\[(.*)\]"
while IFS= read -r str; do
    [[ "$str" =~ $regex ]]
    # If the 2nd capture group is not empty, print -> followed by the
    # non-empty value.
    echo "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
done < file.txt
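A slightly stricter variant of the same idea, as a sketch: the negated classes [^)]* and [^]]* keep each group from running past its own closing delimiter, and lines that do not match are skipped instead of producing an empty line. The function name swap_line is made up for illustration:

```shell
#!/bin/bash
# Group 2: text inside (...), optional; group 3: text inside [...]
regex='(\(([^)]*)\))?\[([^]]*)\]'

swap_line() {
    [[ $1 =~ $regex ]] || return 1
    printf '%s\n' "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
}

swapped=$(printf '%s\n' '(foo)[X]' '[Y]' '(bar)[Z]' |
    while IFS= read -r str; do
        swap_line "$str"
    done)
echo "$swapped"
```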

Capturing multiline string with regex and replacing it with itself, indented on every line

Background
I have a large batch of markdown files that have code blocks marked off with three backticks and a language name (github style). Like so:
```ruby
def method_missing
puts "Where's the method?"
end
```
I'd like to change the way these are marked off, so that instead of using three backticks, code blocks are set off with indentation (Stack Overflow style), as follows:
    def method_missing
    puts "Where's the method?"
    end
Problem
I'm doing a find-and-replace across multiple files in Sublime Text with this expression
(?s)```ruby(.*?)```
This effectively captures what I'd like, but I'm having trouble finding a good way to replace the capture group $1 with an indented version of itself. At best, I can insert a soft tab before the entire capture group.
Any suggestions? Thanks in advance.
Alternatively: Is there a quick way to do this with a bash script using grep?
You can't do this with grep, but you could do it using text manipulation utilities like sed. For example, saying:
sed -n '/```ruby/,/```/{/```ruby/b;/```/b;s/^/    /p }' filename
would produce:
    def method_missing
    puts "Where's the method?"
    end
for your sample input.
It captures lines between ```ruby and three backticks; adds 4 spaces in front of those lines; and prints those.
If you want a TAB character instead of those spaces, substitute s/^/    /p with s/^/\t/p in the expression above.
I'm sure you could get a one-liner to work, but it might just be simpler to do it with a shell script:
#!/bin/bash
while IFS= read -r line; do
    if [[ $line =~ ^'```ruby' ]]; then
        indent=true
    elif [[ $line =~ ^'```' ]]; then
        indent=
    elif [[ -n $indent ]]; then
        # printf, unlike echo -e, leaves backslashes inside the code untouched
        printf '\t%s\n' "$line"
    else
        printf '%s\n' "$line"
    fi
done < file
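The same loop can be wrapped in a function and tried on inline sample text. This is only a sketch: indent_ruby_blocks is a hypothetical name, and it indents with four spaces (as the question asks) rather than a tab:

```shell
#!/bin/bash
# Reads Markdown on stdin; drops the fence lines of ruby blocks and
# indents the lines between them by four spaces.
indent_ruby_blocks() {
    local line inside=
    while IFS= read -r line; do
        if [[ -z $inside && $line == '```ruby'* ]]; then
            inside=1                    # opening fence: drop it
        elif [[ -n $inside && $line == '```'* ]]; then
            inside=                     # closing fence: drop it
        elif [[ -n $inside ]]; then
            printf '    %s\n' "$line"   # inside a block: indent
        else
            printf '%s\n' "$line"       # outside: copy unchanged
        fi
    done
}

sample=$(printf '%s\n' 'Intro' '```ruby' 'def method_missing' 'end' '```' 'Outro')
result=$(indent_ruby_blocks <<< "$sample")
echo "$result"
```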

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash afterwards to read them. (This only works if the groups themselves do not contain any spaces; otherwise we would have to use another separator character and make read split on it by setting $IFS to that character.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of digits. Alternatively, a \b (word boundary) would have worked here, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Alternatively, you can add < grep-result-test.txt after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* is greedy, so we have to insert a space into our regexp to keep it from swallowing the start of the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
The p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names); then the first two words would be put into $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array, and similarly for the other indices.
When the input ends, the loop ends, too.
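To make the two forms of read concrete, here is a small sketch on one line of the sed output. The variable names pid and rest are chosen for illustration:

```shell
#!/bin/bash
line="17875 11 /tmp/FlashXXu8pvMg"

# read -a: every word becomes one array element
read -r -a array <<< "$line"

# Named variables: the last one receives whatever is left of the line
read -r pid rest <<< "$line"

echo "${array[0]} | ${array[2]} | $rest"
```

With only two names given, rest receives both remaining words, which is the "getting the rest" behaviour described above.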
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or later, you can use in-process regular expressions. The syntax is Perl-like (=~), and the match groups are available via ${BASH_REMATCH[x]}, where x is the index of the match group.
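Applied to the simple example from the question, that could look like the following sketch (re and captured are names picked here):

```shell
#!/bin/bash
var="text more text and yet more text"
re='yet more (text)'

# Keep the pattern in a variable and expand it unquoted after =~
if [[ $var =~ $re ]]; then
    captured=${BASH_REMATCH[1]}   # first parenthesized group
    echo "$captured"
fi
```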
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As Mark said, the [[ ... ]] conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want) and matches the left operand against it. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
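To try the pure-bash loop without lsof, the two sample lines from the question can be fed in through a here-document. This is a sketch: user stands in for $USER, and the triples array only collects the three groups for inspection instead of running cp:

```shell
#!/bin/bash
user=j   # assumed user name, as in the sample input
regex=" ([0-9]+) +$user +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "

triples=()
while IFS= read -r line; do
    if [[ $line =~ $regex ]]; then
        # PID, file descriptor number, path of the deleted file
        triples+=("${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}")
    fi
done <<'EOF'
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
EOF

printf '%s\n' "${triples[@]}"
```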
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo "$var" | grep -o "yet more text" | grep -o "text"

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
    local pattern=$1
    local string=$2
    local result=${string/${pattern}*/}
    [ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ $regex ]]; then   # keep $regex unquoted, or it is matched as a literal string
    n=${#BASH_REMATCH[*]}
    i=0
    while [[ $i -lt $n ]]
    do
        echo "capture[$i]: ${BASH_REMATCH[$i]}"
        let i++
    done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
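Although BASH_REMATCH carries no offsets, the 0-based index of the first match can be derived from the matched text itself. The sketch below assumes the pattern has no anchors or other context-dependent assertions (with something like a$ the first literal occurrence of the matched text need not be where the regex matched). regex_index is a made-up helper name:

```shell
#!/bin/bash
# Prints the 0-based index of the first match of an ERE in a string,
# or returns 1 without output if there is no match.
regex_index() {
    local string=$1 regex=$2 prefix
    [[ $string =~ $regex ]] || return 1
    # Cut the string at the first occurrence of the matched text
    prefix=${string%%"${BASH_REMATCH[0]}"*}
    echo "${#prefix}"
}

idx=$(regex_index "This is a a123 test" 'a[0-9]+')
echo "$idx"
```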
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
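A sketch of the -v variant that also captures the match length. Here re is an awk variable name chosen for this example. Note that awk's match returns a 1-based position, and that awk processes backslash escapes in values passed with -v:

```shell
#!/bin/bash
var_to_search="This is a a123 test"
regex_to_find='a[0-9]+'

# -v hands the shell value to awk as the awk variable re
match_data=($(awk -v re="$regex_to_find" '{
    where = match($0, re)
    if (where) print where, RLENGTH
    else       print -1, 0
}' <<< "$var_to_search"))

echo "index=${match_data[0]} length=${match_data[1]}"
```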
If you use bash 4.x, you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill