capture repeating regex pattern as one group, sed in bash script - regex

I wrote a working expression that extracts two pieces of data from valid lines of text. The first capture group is the numerical section including periods. The second is the remaining characters of the line as long as the line is valid. A line is invalid if the numerical section ends with a period or the line ends with a number.
1.1 the quick 1-1 (no match due to ending hyphen and number)
11.2 brown fox jumped (should return '11.2' and 'brown fox jumped')
1.41.1 over the lazy (should return '1.41.1' and 'over the lazy')
2.1. dog (no match due to numerical section trailing period)
The expression ^((?:[0-9]+\.)+[0-9]+) (.*)[^0-9]$ works when tested on various regex testing sites.
My issue is... that I have failed to adapt this expression to work with sed from a bash script that loops through lines of text ($L).
IFS=$'\t' read -r NUM STR < <(sed 's#^\(\(?:[0-9]\+\.\)\+[0-9]\+\) \(.*)[^0-9]$#\1\t\2#p;d' <<< $L )
What does work is below where I replaced the capturing of repeating groups with repeating digits and periods. I would prefer not to do this because it could match lines starting with periods and multiple periods in a row. Also it loses the last char of the captured string but I expect I can figure that part out.
IFS=$'\t' read -r NUM STR < <(sed 's#^\([0-9\.]\+[0-9]\+\) \(.*[^0-9]\)$#\1\t\2#p;d' <<< $L )
Please help me understand what I'm doing wrong. Thank you.

An ERE for that would be:
^([0-9]+(\.[0-9]+)*) (.*[^0-9])$
with \1 and \3 being the capture groups of interest
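As a minimal sketch of how that ERE might be plugged into sed from bash, assuming a sed that accepts -E (the helper name parse_line and the tab variable are illustrative, not from the question):

```shell
#!/usr/bin/env bash
# Sketch: apply the ERE with sed -E; \1 is the number, \3 the text.
# parse_line is a hypothetical helper name.
tab=$'\t'   # literal tab, so we do not rely on sed understanding \t

parse_line() {
    # -n plus the p flag prints only lines where the substitution happened
    sed -nE "s#^([0-9]+(\.[0-9]+)*) (.*[^0-9])\$#\1${tab}\3#p" <<< "$1"
}

IFS=$'\t' read -r NUM STR < <(parse_line '11.2 brown fox jumped')
printf '%s|%s\n' "$NUM" "$STR"
```

For an invalid line sed prints nothing, so read hits end-of-file and NUM and STR are left empty.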
But I'm not sure that using sed + read is the best approach for capturing the data in variables; you could just use bash builtins instead:
while IFS=' ' read -r num str
do
    [[ $num =~ ^([0-9]+(\.[0-9]+)*)$ && $str =~ [^0-9]$ ]] || continue
    declare -p num str
done < input.txt
There's a side effect with this solution, though: read strips leading and trailing spaces from the line, as well as the whole run of spaces that separates the first field from the rest.
If you need those spaces then you can match the whole line instead:
regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'
while IFS='' read -r line
do
    [[ $line =~ $regex ]] || continue
    num=${BASH_REMATCH[1]} str=${BASH_REMATCH[3]}   # groups 1 and 3, as noted above
    declare -p num str
done < input.txt
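To check the whole-line approach against the four sample lines from the question, a small sketch like the following may help. The helper name split_line is an assumption made for illustration:

```shell
#!/usr/bin/env bash
regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'

# Hypothetical helper: sets num and str on success, returns 1 otherwise.
split_line() {
    [[ $1 =~ $regex ]] || return 1
    num=${BASH_REMATCH[1]}   # numeric section
    str=${BASH_REMATCH[3]}   # rest of the line (group 2 is the inner repeat)
}

while IFS= read -r line
do
    if split_line "$line"
    then
        printf '%s|%s\n' "$num" "$str"
    else
        printf 'no match: %s\n' "$line"
    fi
done <<'EOF'
1.1 the quick 1-1
11.2 brown fox jumped
1.41.1 over the lazy
2.1. dog
EOF
```

Only the second and third lines should be split; the other two are reported as non-matching.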


Bash regex overwrite line if multiple match

I have a bash script with 3 regular expressions. Using a conditional if, I would like to find the lines in the file that match the first pattern.
If there is a match, then look for a match in the second pattern but only with the lines that have matched the first pattern.
Finally, to check the third pattern only with the lines that have matched the second pattern (which are also the ones that had already matched the first pattern).
I have the following code but I don't know how to tell that if there is a match to overwrite the "line" value to decrease the number of total lines to only the ones matching.
pattern1= egrep '^([^,]*,){31}[1-9][0-9].*'
pattern2= egrep '^([^,]*,){16}[0-1].[3-9].*'
pattern3= egrep '^([^,]*,){32}[2-9][0-9].*'
while read line
if [[$line == $pattern1]];then
newline == $pattern1
if [[$newline == $pattern2 ]];then
newline2 == $pattern2
if [[$newline2 == $pattern3 ]]; then
echo $pattern3
done < mj1.csv #this is the input file
I will call this script like ./ <filename>.
Some input data:
To make things easier, pattern1 matches all rows where column PTS is higher than 10, pattern 2 matches the rows where column FG_PCT is higher than 0.3, and pattern 3 matches all rows where column GmSc is higher than 19.
While an awk solution is going to be a bit faster ... we'll focus on a bash solution per OP's request.
First issue is regex matching uses the =~ operator and not the == operator.
Second issue is that to keep a row only if all 3 regexes match means we want to and (&&) the results of all 3 regex matches.
Third issue addresses some basic syntax issues with OP's current code (eg, space after [[ and before ]]; improper assignments of regex patterns to the pattern* variables).
One bash idea:
head -1 mj1.csv >
while read -r line
do
    if [[ "${line}" =~ $pattern1 && "${line}" =~ $pattern2 && "${line}" =~ $pattern3 ]]
    then
        # do whatever with $line, eg:
        echo "${line}"
    fi
done < mj1.csv >>
This generates:
$ cat
NOTE: OP hasn't (yet) provided the expected output so at this point I have to assume OP's regexes are correct
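A self-contained sketch of the same idea, combining three =~ tests with &&. The three-column data and the shortened patterns below are made up for illustration (the OP's real columns sit much further to the right, hence {31}, {16} and {32} in the question), and filter_rows is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical 3-column layout: PTS,FG_PCT,GmSc
pattern1='^[1-9][0-9]'               # column 1 starts with two digits, the first non-zero
pattern2='^([^,]*,){1}0\.[3-9]'      # column 2 starts with 0.3 through 0.9
pattern3='^([^,]*,){2}[2-9][0-9]'    # column 3 starts with a digit 2-9 followed by a digit

filter_rows() {
    local line
    while IFS= read -r line
    do
        # keep the row only if all three regexes match
        if [[ $line =~ $pattern1 && $line =~ $pattern2 && $line =~ $pattern3 ]]
        then
            printf '%s\n' "$line"
        fi
    done
}

filter_rows <<'EOF'
PTS,FG_PCT,GmSc
12,0.45,21.3
8,0.50,25.0
15,0.20,30.1
22,0.61,9.4
EOF
```

With this sample data only the row 12,0.45,21.3 passes all three tests; every other row fails at least one of them.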

Using regex in Bash with mapfile

Edit 2:
Minimal input file: input/input.txt
Actual output:
time, heap, stack
time, heap, stack
30674648, 20224, 240464
time, heap, stack
time, heap, stack
time, heap, stack
30674648, 20224, 240464
Expected output:
time, heap, stack,
30142088, 20224, 240480
30408368, 20224, 240552
30674648, 20224, 240464
Edit: Originally, the problem may have been due to Bash's regex's lack of multiline capability. However, after stripping newlines from the text, the problem remains, with the exception that the output now has between one and five lines instead of zero.
I'm trying to write a Bash script to parse a text file into a desirable CSV file with the needed information.
As part of the script, I iterate through n files. Each of the files contains m matches for a given regex, and each match contains three capture groups.
I want to format the three capture groups into a CSV row, then concatenate all the rows of all the matches of all the files and write them to a *.csv file.
I'm quite comfortable using Regex in high level languages such as Kotlin or C#, however I have no experience with Regex in Bash. I used this answer as a starting point, however it doesn't seem to be working for me (mapfile -t matches < <( format_row "$text" "$regex" ) doesn't do anything).
Here's the full code with the relevant portion noted:
format_row() {
local s=$1 regex=$2
while [[ $s =~ $regex ]]
echo "${time}, ${heap}, ${stack}"
echo ""
for file in $1/*
echo "Parsing ${file}..."
echo $file >> $2
echo "time, heap, stack" >> $2
mapfile -t matches < <( format_row "$text" "$regex" )
printf "%s\n" "${matches[@]}" >> $2
echo "" >> $2
echo ""
echo "Done"
There are two problems here:
Although bash's =~ operator can match newlines, it does not understand the \n escape sequence. You have to use actual newlines in your regex. This can also be achieved by C-style strings $'\n'.
The regex quantifier * is greedy. When matching ...
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.).*b=(.) ]]
... you end up with BASH_REMATCH=(1 3) instead of (1 1).
In other regex dialects like PCRE you could use the non-greedy quantifier *?. However, in bash we have to use a workaround and have to replace .* with something that cannot match more than wanted, for instance
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.)[^=]*b=(.) ]]
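Both points can be tried in isolation. The following is a small sketch; the variable names (greedy_re, bounded_re, nl and so on) are chosen for illustration only:

```shell
#!/usr/bin/env bash
s="a=1,b=1 a=2,b=2 a=3,b=3"
greedy_re='a=(.).*b=(.)'        # .* runs on to the last b=
bounded_re='a=(.)[^=]*b=(.)'    # [^=]* cannot cross the next = sign

[[ $s =~ $greedy_re ]]  && greedy="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
[[ $s =~ $bounded_re ]] && bounded="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"

# A newline in the regex has to be a real newline, e.g. via $'\n'.
nl=$'\n'
text="x=1${nl}y=2"
line_re="x=(.)${nl}y=(.)"
[[ $text =~ $line_re ]] && across="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
printf 'across lines: %s\n' "$across"
```

The greedy pattern pairs the first a= with the last b=, while the bounded one stays inside the first a=...,b=... pair.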
In your case we have to make sure that the next mem_stacks is not matched
As you didn't post any example input and expected output, I can only guess. However, I assume the following regex could work for you:
([^\n]*\n){TODO set number of lines allowed here}
Note that now you have to use BASH_REMATCH[4] instead of [3].
At the marked location you have to insert the numbers of lines allowed between mem_heap and mem_stacks. The number can be constant (e.g. {5}) or a range (e.g. {1,10}). In case of ranges you have to make sure that the maximum bound is not so high that you could accidentally skip the next mem_stacks and match another mem_stacks instead. Thus, in case of ranges it might be more appropriate to use two matches. Something like
[[ "$s" =~ $regex1 ]] &&
time="${BASH_REMATCH[1]}" &&
heap="${BASH_REMATCH[2]}" &&
[[ "$s" =~ $regex2 ]] &&
echo "$time, $heap, $stack"
done >> "$2"
By the way: helps you to make your script more robust.
First and foremost: quote your variables.
You can use do cmd1; cmd2; done >> file instead of do cmd1 >> file; cmd2 >> file; done.
mapfile -t matches < <(format_row "$text" "$regex")
printf "%s\n" "${matches[@]}" >> "$2"
could be written as just
format_row "$text" "$regex" >> "$2"
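For reference, a loop of the format_row kind usually has to shorten the string after each match, otherwise the same match is found forever. The sketch below assumes that trimming step (it is not visible in the code quoted above) and uses a hypothetical helper name, all_matches, with the toy input from earlier in this answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print capture groups 1 and 2 of every match of $2 in $1.
# Assumes the regex never matches the empty string, otherwise the loop would not end.
all_matches() {
    local s=$1 regex=$2
    while [[ $s =~ $regex ]]
    do
        printf '%s, %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
        # drop everything up to and including the text just matched
        s=${s#*"${BASH_REMATCH[0]}"}
    done
}

all_matches "a=1,b=1 a=2,b=2 a=3,b=3" 'a=(.)[^=]*b=(.)'
```

Because the function writes to standard output, its result can be redirected directly, without going through mapfile.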

perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
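The tail test works because command substitution strips a trailing newline, so the result is non-empty only when the last byte is something else. As a sketch, it could be wrapped in a function that takes a file name; lacks_final_newline and the temporary files are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file's last byte is not a linefeed.
# An empty file yields an empty string and is therefore not reported.
lacks_final_newline() {
    [ -n "$(tail -c 1 < "$1")" ]
}

tmpdir=$(mktemp -d)
printf 'aaa\nbbb\n' > "$tmpdir/good.txt"
printf 'aaa\nbbb'   > "$tmpdir/bad.txt"

for f in "$tmpdir/good.txt" "$tmpdir/bad.txt"
do
    if lacks_final_newline "$f"
    then
        echo "${f##*/}: pathological last line"
    else
        echo "${f##*/}: ok"
    fi
done
rm -rf "$tmpdir"
```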
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that the -n0 flag causes perl to "slurp" the entire file as a single record. Thus, there is only one $, the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.
/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. Use \z for that.
Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
my $bw = File::ReadBackwards->new(shift @ARGV);
print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will more likely understand it than I would with the regular-expression approach.
The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
    s/^\s*$/\n/mg; # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
    s/[\n][\n]+/\n\n/g; # exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
    s/^[\n]+//; # first line is visible or "nontrivial."
    s/[\n]+\z/\n/; # last line is visible or "nontrivial."
    print STDOUT;
    print "\n" unless m/\n\z/; # IF detect pathological last line, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named First a messy file, then how it looks after gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' |

How to match string (with regular expression) that begins with a string

In a bash script I have to match strings that begin with the string lo repeated exactly 3 times; so lololoba is good, loloba is bad, lololololoba is good, balololo is bad.
I tried with this pattern: "^$str1/{$n,}" but it doesn't work, how can I do it?
According to OPs comment, lololololoba is bad now.
This should work:
pat="^(lo){3}"
[[ $s =~ $pat ]] && echo good || echo bad
EDIT (As per OPs comment):
If you want to match exactly 3 times (i.e lolololoba and such should be unmatched):
change the pat="^(lo){3}" to:
You can use following regex :
Instead of lo you can put your variable.
See demo
You can use this awk to match exactly 3 occurrences of lo at the beginning:
# input file
cat file
# awk command to print only valid lines
awk -F '^(lo){3}' 'NF == 2 && !($2 ~ /^lo/)' file
As per your comment:
... more than 3 is bad so "lolololoba" is not good!
You'll find that @Jahid's answer doesn't fit (as his gives you "good" for that test string).
To use his answer with the correct regex:
[[ $s =~ $pat ]] && echo good || echo bad
This verifies that there are three "lo"s at the beginning, and not another one immediately following the three.
Note that if you're using bash you'll have to escape that ! in the first line (which is what my regex above does)
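Since bash's =~ uses POSIX extended regular expressions, which have no lookahead, one possible way to express "three lo, and not a fourth" is to spell out what may follow. The pattern below is a sketch of that idea, not necessarily the one used in the answers above; the helper name check is hypothetical:

```shell
#!/usr/bin/env bash
# After three "lo", the rest must be empty, start with a non-"l" character,
# be a lone trailing "l", or start with "l" followed by a non-"o" character.
pat='^(lo){3}($|[^l]|l$|l[^o])'

check() {
    if [[ $1 =~ $pat ]]; then echo good; else echo bad; fi
}

for s in lololoba loloba lololololoba balololo
do
    printf '%s: %s\n' "$s" "$(check "$s")"
done
```

Storing the pattern in a single-quoted variable also sidesteps any quoting or history-expansion issues at an interactive prompt.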

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
    echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
    cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash to read them afterwards. (And in this way it only works if the groups themselves do not contain any spaces; otherwise we would have to use another divider character and patch read by setting $IFS to this value.)
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
    cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively, a \b (word boundary) would have worked here, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./ < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
The p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
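The walkthrough above can be tried without lsof by feeding the two sample lines from the question through the same sed program. This is a dry-run sketch: it assumes a sed that accepts -E (used here in place of -r), fixes the user name to j as the question suggests, and only echoes the cp commands. The helper name extract_fields is hypothetical:

```shell
#!/usr/bin/env bash
user=j   # stands in for $USER with this sample data
regex=" ([0-9]+) +$user +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "

# Hypothetical helper: reduce each matching line to "pid fd path".
extract_fields() {
    sed -E -n -e "s%^.*$regex.*\$%\1 \2 \3%p"
}

extract_fields <<'EOF' |
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
some unrelated line
EOF
while read -r pid fd path
do
    # echo instead of cp, so nothing is copied while experimenting
    echo "cp /proc/$pid/fd/$fd ~/${path##*/}"
done
```

Using read with three named variables instead of an array is the alternative mentioned above; the unrelated line produces no output because sed prints only lines where the substitution succeeded.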
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or later, you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
    if [[ $REPLY =~ $regex ]]; then
        echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
    fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
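For the single-line case, a minimal sketch with one of the sample lines from the question might look like this (user is fixed to j for the sample data; the variable names pid, fd and path are chosen for illustration):

```shell
#!/usr/bin/env bash
user=j   # stands in for $USER with this sample data
regex=" ([0-9]+) +$user +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
line='npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)'

if [[ $line =~ $regex ]]
then
    pid=${BASH_REMATCH[1]}    # first group: process id
    fd=${BASH_REMATCH[2]}     # second group: file descriptor number
    path=${BASH_REMATCH[3]}   # third group: path under /tmp
    printf '%s %s %s\n' "$pid" "$fd" "$path"
fi
```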
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"