How to match a string (with a regular expression) that begins with a given string - regex

In a bash script I have to match strings that begin with exactly 3 repetitions of the string lo; so lololoba is good, loloba is bad, lololololoba is good, balololo is bad.
I tried this pattern: "^$str1/{$n,}" but it doesn't work. How can I do it?
EDIT:
According to OP's comment, lololololoba is now bad.

This should work:
pat="^(lo){3}"
s="lolololoba"
[[ $s =~ $pat ]] && echo good || echo bad
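EDIT (as per OP's comment):
If you want to match exactly 3 times (i.e. lolololoba and the like should not match),
change the pat="^(lo){3}" to:
pat='^(lo){3}($|l$|l[^o]|[^l])'
The alternatives cover the end of the string, a lone trailing l, an l followed by anything but o, and any character other than l.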

You can use the following regex:
^(lo){3}.*$
Instead of lo you can put your variable.
See demo https://regex101.com/r/sI8zQ6/1
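Putting a variable in place of lo, as suggested above, can also be done without any regex. What follows is only a sketch under assumptions: the function name begins_with_exactly and its argument order are invented for illustration, and "exactly" is taken to mean that the repeated prefix must not be followed by one more copy of the unit.

```bash
#!/usr/bin/env bash

# begins_with_exactly UNIT N STRING  (hypothetical helper, not from the answers)
# Succeeds if STRING starts with N copies of UNIT that are not
# immediately followed by another copy of UNIT.
begins_with_exactly() {
    local unit=$1 n=$2 s=$3 head="" i
    # Build the required prefix, e.g. "lololo" for unit=lo, n=3.
    for ((i = 0; i < n; i++)); do
        head+=$unit
    done
    # Quoted parts are compared literally; the trailing * is a glob.
    [[ $s == "$head"* && ${s#"$head"} != "$unit"* ]]
}

for s in lololoba loloba lololololoba balololo; do
    if begins_with_exactly lo 3 "$s"; then
        echo "$s: good"
    else
        echo "$s: bad"
    fi
done
# prints:
# lololoba: good
# loloba: bad
# lololololoba: bad
# balololo: bad
```

Because the comparison is literal, the unit may contain characters that would be special in a regex.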

You can use this awk to match exactly 3 occurrences of lo at the beginning:
# input file
cat file
lololoba
balololo
loloba
lololololoba
lololo
# awk command to print only valid lines
awk -F '^(lo){3}' 'NF == 2 && !($2 ~ /^lo/)' file
lololoba
lololo

As per your comment:
... more than 3 is bad so "lolololoba" is not good!
You'll find that @Jahid's answer doesn't fit (as his gives you "good" for that test string).
A negative lookahead expresses the requirement directly, but bash's =~ uses POSIX extended regular expressions, which have no lookahead, so the pattern needs a PCRE-capable tool such as GNU grep -P:
pat='^(lo){3}(?!lo)'
s="lolololoba"
grep -qP "$pat" <<< "$s" && echo good || echo bad
This verifies that there are three "lo"s at the beginning, and not another one immediately following the three.
Note that in an interactive bash a ! inside double quotes triggers history expansion, which is why the pattern above is single-quoted.

Related

capture repeating regex pattern as one group, sed in bash script

I wrote a working expression that extracts two pieces of data from valid lines of text. The first capture group is the numerical section including periods. The second is the remaining characters of the line as long as the line is valid. A line is invalid if the numerical section ends with a period or the line ends with a number.
1.1 the quick 1-1 (no match due to ending hyphen and number)
11.2 brown fox jumped (should return '11.2' and 'brown fox jumped')
1.41.1 over the lazy (should return '1.41.1' and 'over the lazy')
2.1. dog (no match due to numerical section trailing period)
The expression ^((?:[0-9]+\.)+[0-9]+) (.*)[^0-9]$ works when tested on various regex testing sites.
My issue is... that I have failed to adapt this expression to work with sed from a bash script that loops through lines of text ($L).
IFS=$'\t' read -r NUM STR < <(sed 's#^\(\(?:[0-9]\+\.\)\+[0-9]\+\) \(.*)[^0-9]$#\1\t\2#p;d' <<< $L )
What does work is below where I replaced the capturing of repeating groups with repeating digits and periods. I would prefer not to do this because it could match lines starting with periods and multiple periods in a row. Also it loses the last char of the captured string but I expect I can figure that part out.
IFS=$'\t' read -r NUM STR < <(sed 's#^\([0-9\.]\+[0-9]\+\) \(.*[^0-9]\)$#\1\t\2#p;d' <<< $L )
Please help me understand what I'm doing wrong. Thank you.
An ERE for that would be:
^([0-9]+(\.[0-9]+)*) (.*[^0-9])$
with \1 and \3 being the capture groups of interest
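To show how that ERE could be plugged back into sed, here is a minimal sketch. It assumes a sed that accepts -E (GNU and BSD sed do); the function name extract_fields is made up for illustration, and a literal tab is spliced in from bash because \t in the replacement is not portable.

```bash
#!/usr/bin/env bash

# extract_fields (hypothetical helper): reads lines on stdin and prints
# "<number><TAB><text>" for valid lines only.
extract_fields() {
    local tab=$'\t'
    # -n with the p flag prints only lines where the substitution succeeded.
    # Group 1 is the numeric section, group 3 the remaining text.
    sed -nE "s/^([0-9]+(\.[0-9]+)*) (.*[^0-9])\$/\1${tab}\3/p"
}

printf '%s\n' \
    '1.1 the quick 1-1' \
    '11.2 brown fox jumped' \
    '1.41.1 over the lazy' \
    '2.1. dog' | extract_fields
# prints (fields separated by a tab):
# 11.2	brown fox jumped
# 1.41.1	over the lazy
```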
But I'm not sure that using sed + read is the best approach for capturing the data in variables; you could just use bash builtins instead:
#!/bin/bash
while IFS=' ' read -r num str
do
    [[ $num =~ ^([0-9]+(\.[0-9]+)*)$ && $str =~ [^0-9]$ ]] || continue
    declare -p num str
done < input.txt
There's a side effect with this solution though: read strips leading and trailing spaces from the line, along with the whole run of spaces that separates the first field from the rest.
If you need those spaces then you can match the whole line instead:
#!/bin/bash
regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'
while IFS='' read -r line
do
    [[ $line =~ $regex ]] || continue
    num=${BASH_REMATCH[1]}
    str=${BASH_REMATCH[3]}
    declare -p num str
done < input.txt

Bash regex overwrite line if multiple match

I have a bash script with 3 regular expressions. Using conditional if statements, I would like to find the lines of the file that match the first pattern.
If there is a match, I then want to test the second pattern, but only against the lines that matched the first pattern.
Finally, I want to test the third pattern only against the lines that matched the second pattern (which are also the ones that had already matched the first).
I have the following code, but I don't know how to overwrite the "line" value on a match so that each step only sees the lines that have matched so far.
#!/bin/bash
pattern1= egrep '^([^,]*,){31}[1-9][0-9].*'
pattern2= egrep '^([^,]*,){16}[0-1].[3-9].*'
pattern3= egrep '^([^,]*,){32}[2-9][0-9].*'
while read line
do
if [[$line == $pattern1]];then
newline == $pattern1
if [[$newline == $pattern2 ]];then
newline2 == $pattern2
if [[$newline2 == $pattern3 ]]; then
echo $pattern3
fi
done < mj1.csv #this is the input file
I will call this script like ./b1.sh <filename>.
Some input data:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,10,10,11/15/1984,21,272,21.74469541,CHI,1,BOS,0,-20,1,33,12,24,0.5,0,1,0,3,3,1,0,2,2,2,2,1,1,4,27,17.1
1985,11,11,11/17/1984,21,274,21.75017112,CHI,1,PHI,0,-9,1,44,4,17,0.235,0,0,,8,8,1,0,5,5,7,5,2,4,5,16,12.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,14,14,11/23/1984,21,280,21.76659822,CHI,0,SEA,1,19,1,30,9,13,0.692,0,0,,5,6,0.833,0,4,4,3,4,1,4,4,23,19.5
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,16,16,11/27/1984,21,284,21.77754962,CHI,0,GSW,0,-6,1,24,6,10,0.6,0,0,,1,1,1,0,2,2,3,3,2,4,1,13,11.1
1985,17,17,11/29/1984,21,286,21.78302533,CHI,0,PHO,0,-5,1,30,9,17,0.529,1,1,1,3,4,0.75,1,2,3,2,2,0,2,5,22,14
1985,18,18,11/30/1984,21,287,21.78576318,CHI,0,LAC,1,4,1,37,9,15,0.6,0,0,,2,4,0.5,2,3,5,5,3,0,4,4,20,15.5
1985,19,19,12/2/1984,21,289,21.79123888,CHI,0,LAL,1,1,1,42,7,13,0.538,0,0,,6,8,0.75,2,0,2,3,1,1,4,3,20,12.9
1985,20,20,12/4/1984,21,291,21.79671458,CHI,1,NJN,1,15,1,35,7,13,0.538,0,0,,6,6,1,1,2,3,6,1,0,3,3,20,16
1985,21,21,12/7/1984,21,294,21.80492813,CHI,1,NYK,1,2,1,43,8,16,0.5,0,1,0,5,7,0.714,1,1,2,3,2,0,6,5,21,9.3
1985,22,22,12/8/1984,21,295,21.80766598,CHI,1,DAL,1,2,1,35,10,23,0.435,0,0,,0,0,,4,3,7,2,0,2,2,3,20,11.2
1985,23,23,12/11/1984,21,298,21.81587953,CHI,1,DET,0,-7,1,37,13,28,0.464,0,1,0,1,3,0.333,1,7,8,6,2,0,3,4,27,16.2
1985,24,24,12/12/1984,21,299,21.81861739,CHI,0,DET,0,-7,1,30,6,17,0.353,0,2,0,9,10,0.9,0,1,1,2,2,1,1,5,21,12.5
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,26,26,12/15/1984,21,302,21.82683094,CHI,1,PHI,0,-12,1,27,7,16,0.438,0,0,,0,0,,1,1,2,2,1,0,1,2,14,7.2
1985,27,27,12/18/1984,21,305,21.83504449,CHI,1,HOU,0,-8,1,45,8,20,0.4,0,1,0,2,4,0.5,1,2,3,8,3,0,1,2,18,14.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
To make things easier, pattern1 matches all rows where column PTS is higher than 10, pattern 2 matches the rows where column FG_PCT is higher than 0.3, and pattern 3 matches all rows where column GmSc is higher than 19.
While an awk solution is going to be a bit faster ... we'll focus on a bash solution per OP's request.
The first issue is that regex matching uses the =~ operator, not the == operator.
The second issue is that keeping a row only if all 3 regexes match means we want to and (&&) the results of all 3 regex matches.
The third issue covers some basic syntax problems in OP's current code (e.g., missing space after [[ and before ]]; improper assignment of the regex patterns to the pattern* variables).
One bash idea:
pattern1='^([^,]*,){31}[1-9][0-9].*'
pattern2='^([^,]*,){16}[0-1].[3-9].*'
pattern3='^([^,]*,){32}[2-9][0-9].*'
head -1 mj1.csv > mj1.new.csv
while read -r line
do
    if [[ "${line}" =~ $pattern1 && "${line}" =~ $pattern2 && "${line}" =~ $pattern3 ]]
    then
        # do whatever with $line, eg:
        echo "${line}"
    fi
done < mj1.csv >> mj1.new.csv
This generates:
$ cat mj1.new.csv
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
NOTE: OP hasn't (yet) provided the expected output so at this point I have to assume OP's regexes are correct
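The && chain above can be generalized to any number of patterns. The following is a sketch only: the helper name matches_all is invented for illustration, and it reuses OP's three regexes unchanged, under the same assumption that they are correct.

```bash
#!/usr/bin/env bash

# matches_all LINE PATTERN...  (hypothetical helper)
# Succeeds only if LINE matches every PATTERN; stops at the first failure,
# which mirrors the short-circuit behaviour of && inside [[ ]].
matches_all() {
    local line=$1 pat
    shift
    for pat in "$@"; do
        [[ $line =~ $pat ]] || return 1
    done
}

pattern1='^([^,]*,){31}[1-9][0-9].*'
pattern2='^([^,]*,){16}[0-1].[3-9].*'
pattern3='^([^,]*,){32}[2-9][0-9].*'

# Two rows taken from the sample data in the question.
row1='1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5'
row3='1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9'

for row in "$row1" "$row3"; do
    if matches_all "$row" "$pattern1" "$pattern2" "$pattern3"; then
        echo "keep: $row"
    fi
done
# prints only the "keep:" line for row3 (the 10/29/1984 game)
```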

Using regex in Bash with mapfile

Edit 2:
Minimal input file: input/input.txt
#-----------
snapshot=83
#-----------
time=30142088
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240480
heap_tree=empty
#-----------
snapshot=84
#-----------
time=30408368
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240552
heap_tree=empty
#-----------
snapshot=85
#-----------
time=30674648
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240464
heap_tree=empty
#-----------
snapshot=86
#-----------
Actual output:
input.txt/*
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
Expected output:
input.txt
time, heap, stack,
30142088, 20224, 240480
30408368, 20224, 240552
30674648, 20224, 240464
Edit: Originally, the problem may have been due to the lack of multiline capability in Bash's regex matching. However, after stripping newlines from the text, the problem remains, with the exception that the output now has between one and five lines instead of zero.
I'm trying to write a Bash script to parse a text file into a desirable CSV file with the needed information.
As part of the script, I iterate through n files. Each of the files contains m matches for a given regex, and each match contains three capture groups.
I want to format the three capture groups into a CSV row, then concatenate all the rows of all the matches of all the files and write them to a *.csv file.
I'm quite comfortable using regex in high-level languages such as Kotlin or C#, but I have no experience with regex in Bash. I used this answer as a starting point, but it doesn't seem to be working for me (mapfile -t matches < <( format_row "$text" "$regex" ) doesn't do anything).
Here's the full code with the relevant portion noted:
#!/bin/bash
# RELEVANT CODE BELOW
regex="time=([0-9]+)\nmem_heap_B=([0-9]+)\n.*\nmem_stacks_B=([0-9]+)"
format_row() {
    local s=$1 regex=$2
    while [[ $s =~ $regex ]]
    do
        time="${BASH_REMATCH[1]}"
        heap="${BASH_REMATCH[2]}"
        stack="${BASH_REMATCH[3]}"
        echo "${time}, ${heap}, ${stack}"
        echo ""
        s=${s#*"${BASH_REMATCH[3]}"}
    done
}
for file in $1/*
do
    echo "Parsing ${file}..."
    echo $file >> $2
    echo "time, heap, stack" >> $2
    text=$(<${file})
    mapfile -t matches < <( format_row "$text" "$regex" )
    printf "%s\n" "${matches[@]}" >> $2
    echo "" >> $2
done
echo ""
echo "Done"
Thanks!
There are two problems here:
Although bash's =~ operator can match newlines, it does not understand the \n escape sequence. You have to use actual newlines in your regex. This can also be achieved by C-style strings $'\n'.
The regex quantifier * is greedy. When matching ...
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.).*b=(.) ]]
... you end up with BASH_REMATCH[1]=1 and BASH_REMATCH[2]=3 instead of 1 and 1.
In other regex dialects like PCRE you could use the non-greedy quantifier *?. However, in bash we have to use a workaround and have to replace .* with something that cannot match more than wanted, for instance
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.)[^=]*b=(.) ]]
In your case we have to make sure that the next mem_stacks is not matched
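The difference between the greedy pattern and the negated-class workaround can be checked directly. This small sketch uses the same test string as above; the variable names greedy, bounded, greedy_pair and bounded_pair are only illustrative.

```bash
#!/usr/bin/env bash

s="a=1,b=1 a=2,b=2 a=3,b=3"

# Patterns are stored in variables so they can be used unquoted with =~.
greedy='a=(.).*b=(.)'
bounded='a=(.)[^=]*b=(.)'

# .* runs to the last "b=" in the string.
[[ $s =~ $greedy ]] && greedy_pair="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"

# [^=]* cannot cross an "=", so the match stays inside the first pair.
[[ $s =~ $bounded ]] && bounded_pair="${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"

printf 'greedy:  %s\nbounded: %s\n' "$greedy_pair" "$bounded_pair"
# prints:
# greedy:  1 3
# bounded: 1 1
```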
As you didn't post any example input and expected output, I can only guess. However, I assume the following regex could work for you:
regex=$'time=([0-9]+)
mem_heap_B=([0-9]+)
([^\n]*\n){TODO set number of lines allowed here}
mem_stacks_B=([0-9]+)'
Note that now you have to use BASH_REMATCH[4] instead of [3].
At the marked location you have to insert the number of lines allowed between mem_heap and mem_stacks. The number can be constant (e.g. {5}) or a range (e.g. {1,10}). In the case of a range you have to make sure that the upper bound is not so high that you could accidentally skip the next mem_stacks and match a later mem_stacks instead. Thus, for ranges it might be more appropriate to use two matches. Something like
regex1='time=([0-9]+)
mem_heap_B=([0-9]+)'
regex2='mem_stacks_B=([0-9]+)'
while
    [[ "$s" =~ $regex1 ]] &&
    time="${BASH_REMATCH[1]}" &&
    heap="${BASH_REMATCH[2]}" &&
    [[ "$s" =~ $regex2 ]] &&
    stack="${BASH_REMATCH[1]}"
do
    echo "$time, $heap, $stack"
    s="${s#*"$stack"}"
done >> "$2"
By the way:
https://www.shellcheck.net/ helps you to make your script more robust.
First and foremost: quote your variables.
You can use do cmd1; cmd2; done >> file instead of do cmd1 >> file; cmd2 >> file; done.
mapfile -t matches < <(format_row "$text" "$regex")
printf "%s\n" "${matches[@]}" >> "$2"
could be written as just
format_row "$text" "$regex" >> "$2"
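Putting the pieces together for the minimal input file from the question, a sketch could look like the following. It assumes that exactly one line (mem_heap_extra_B) sits between mem_heap_B and mem_stacks_B, as in that sample; the function name format_rows and the variable nl are illustrative, not part of the original script.

```bash
#!/usr/bin/env bash

# format_rows TEXT  (hypothetical variant of OP's format_row)
# Prints one "time, heap, stack" row per snapshot found in TEXT.
format_rows() {
    local s=$1 nl=$'\n' regex
    # Real newlines are spliced into the pattern because =~ does not
    # interpret \n. [^<newline>]* skips exactly one intermediate line.
    regex="time=([0-9]+)${nl}mem_heap_B=([0-9]+)${nl}[^${nl}]*${nl}mem_stacks_B=([0-9]+)"
    while [[ $s =~ $regex ]]; do
        printf '%s, %s, %s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
        # Drop everything up to and including the whole match, then retry.
        s=${s#*"${BASH_REMATCH[0]}"}
    done
}

# Usage, following the calling convention of the original script:
#   format_rows "$(<input/input.txt)" >> "$2"
```

Cutting at BASH_REMATCH[0] rather than at the stack value avoids cutting at an earlier, coincidental occurrence of the same digits.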

Regular expression Bash issue

I have to match a string composed only of lowercase characters repeated twice, for example ballball or printprint. The word ball, for example, is not accepted because it is not repeated twice.
For this reason I have this code:
read input
expr='^(([a-z]*){2})$'
if [[ $input =~ $expr ]]; then
echo "OK string"
exit 0
fi
exit 10
but it doesn't work; for example, if I enter ball the script prints "OK string".
What am I doing wrong?
Not all Bash versions support backreferences in regexes natively. If yours doesn't, you can use an external tool such as grep:
read input
re='^\([a-z]\+\)\1$'
if grep -q "$re" <<< "$input"; then
    echo "OK string"
    exit 0
fi
exit 1
grep -q is silent and has a successful exit status if there was a match. Notice how (, + and ) have to be escaped for grep. (grep -E would understand () without escaping.)
Also, I've replaced your * with + so we don't match the empty string.
Alternatively: your requirement means that a matching string has two identical halves, so we can check for just that, without any regexes:
read input
half=$(( ${#input} / 2 ))
if (( half > 0 )) && [[ ${input:0:$half} = "${input:$half}" ]]; then
    echo "OK string"
fi
This uses Substring Expansion; the first check is to make sure that the empty string doesn't match.
Your requirement is to match strings made of two repeated words. This is easy to do by just checking if the first half of your string is equal to the remaining part. No need to use regexps...
$ var="byebye" && len=$((${#var}/2))
$ test ${var:0:$len} = ${var:$len} && { echo ok ; } || echo no
ok
$ var="abcdef" && len=$((${#var}/2))
$ test ${var:0:$len} = ${var:$len} && { echo ok ; } || echo no
no
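The half-comparison shown in the answers above does not by itself check that the input is all lowercase, which the question also requires. A sketch that combines both checks follows; the function name is_doubled_lowercase is invented for illustration.

```bash
#!/usr/bin/env bash

# is_doubled_lowercase STRING  (hypothetical helper)
# Succeeds if STRING is a non-empty run of lowercase letters whose
# first half equals its second half.
is_doubled_lowercase() {
    local s=$1 half=$(( ${#1} / 2 ))
    # Non-empty and lowercase only.
    [[ $s =~ ^[a-z]+$ ]] || return 1
    # An odd length can never split into two equal halves.
    (( ${#s} % 2 == 0 )) || return 1
    # The right-hand side is quoted so it is compared literally.
    [[ ${s:0:half} == "${s:half}" ]]
}

for word in ballball printprint ball; do
    if is_doubled_lowercase "$word"; then
        echo "$word: OK string"
    else
        echo "$word: rejected"
    fi
done
# prints:
# ballball: OK string
# printprint: OK string
# ball: rejected
```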
The regex [a-z]* will match any string of lowercase letters, including the empty string.
([a-z]*){2} will match any two of those.
Ergo, ^(([a-z]*){2})$ will match any string containing zero or more lowercase letters.
Using the suggestion from @hwnd (replacing {2} with \1) will enforce a match on two identical strings.
N.B: You will need a fairly recent version of bash. Tested in bash 4.3.11.

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
    local pattern=$1
    local string=$2
    local result=${string/${pattern}*/}
    [ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
if [[ "$var" =~ $regex ]]; then   # $regex must be unquoted, or it is matched literally
    n=${#BASH_REMATCH[*]}
    i=0
    while [[ $i -lt $n ]]
    do
        echo "capture[$i]: ${BASH_REMATCH[$i]}"
        let i++
    done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. There's a function match which gives you the index you want; it is documented in the awk manual. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
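As an illustration of the -v approach just mentioned, here is a minimal sketch. The function name regex_index is made up; it assumes an awk that supports -v and match(), and note that awk interprets backslash escapes in -v values, so a regex containing a backslash would need it doubled.

```bash
#!/usr/bin/env bash

# regex_index REGEX STRING  (hypothetical helper)
# Prints "<start> <length>" for the first match (start is 1-based,
# as awk's RSTART is), or -1 if there is no match.
regex_index() {
    local regex=$1 string=$2
    # Both values are passed as awk variables, so nothing from the shell
    # is spliced into the awk program text itself.
    awk -v re="$regex" -v s="$string" 'BEGIN {
        if (match(s, re)) { print RSTART, RLENGTH } else { print -1 }
    }'
}

regex_index 'a[0-9][0-9]' 'This is a a123 test'
# prints: 11 3
```

The result agrees with the match_index example earlier in the thread once the 1-based start is converted to a 0-based offset (11 - 1 = 10).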
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill