Check if string contains embedded string in order - regex

I want to check if some string is embedded within another string, with its characters in order but not necessarily adjacent. For example, apple is embedded in pineapple and also in aepprestlse.
This is a simple task if I know the word I want to test against, for example:
if [[ $e == *"a"*"p"*"p"*"l"*"e"* ]]
then
echo "match"
fi
However I do not know the length or contents of what will replace my "apple" variable when I run the script. How can I perform this check with variable sizes/contents?

Here is how you can generate a glob pattern to match:
data='bcdaeppr?estlse'
search='app?le'
# generate a glob pattern using sed, i.e. *\a*\p*\p*\?*\l*\e*
patt="*$(sed 's/./\\&*/g' <<< "$search")"
# now match it
[[ $data == $patt ]] && echo "matched" || echo "nope"
matched
# not matching example
data='bcdaepprestlse'
[[ $data == $patt ]] && echo "matched" || echo "nope"
nope
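The same idea can be wrapped in a function that builds the pattern without calling sed. This is only a sketch: the function name is_subsequence is made up for illustration, and it assumes bash, where a backslash in an unquoted pattern makes the next character literal.

```bash
#!/bin/bash
# is_subsequence NEEDLE HAYSTACK
# Succeeds if the characters of NEEDLE appear in HAYSTACK in order,
# not necessarily next to each other. (Hypothetical helper name.)
is_subsequence() {
    local needle=$1 haystack=$2 pattern='*' i
    for (( i = 0; i < ${#needle}; i++ )); do
        # Escape every character so glob metacharacters such as ? or *
        # in NEEDLE are matched literally, then allow anything after it.
        pattern+="\\${needle:i:1}*"
    done
    # The pattern must stay unquoted so it is treated as a glob.
    [[ $haystack == $pattern ]]
}
```

With this definition, is_subsequence apple pineapple and is_subsequence apple aepprestlse both succeed, while is_subsequence apple aple fails because only one p is available.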

awk to the rescue!
$ awk -v s='pineapple' -v r='apple' '
BEGIN{for(i=1;i<=length(r);i++)
{s=substr(s,k+1);
k=index(s,substr(r,i,1));
if(k==0) exit 1}
exit 0}'; echo $?

Related

Parsing string with two captures in bash

I'm trying to parse a string with regex. A valid string is of the following format:
https://github.com/xyz/abc/a_123/project_14.git
The valid string should contain github.com and xyz or zyx. If the string is valid I want to capture abc/a_123 into $A and project_14 into $B.
What I did:
if [[ "$x" == *"github.com"* ]]; then
if [[ "$x" == *"xyz"* ]]; then
# (1)
elif [[ "$x" == *"zyx"* ]]; then
# (2)
else
return 1 # Invalid
fi
return 0 # Valid
fi
return 1 # Invalid
In both (1) and (2) I want to set $A and $B with the values (same behavior on different cases).
Also, I think that this solution is not good because it will enter the if-else in the case of https://github.com/bla/abc/a_123/xyz.git so I guess we need to change it to be "github.com/xyz". Also, how can I get rid of .git (if exists)?
Another example:
https://github.com/zyx/asdasdas/lalal/asdas/nu.git
# $A = asdasdas/lalal/asdas
# $B = nu
What is the proper way to achieve this goal?
Here is a way using regex:
url='https://github.com/xyz/abc/a_123/project_14.git'
if [[ $url =~ http[s]?:[/]{2}(github.com)[/]([[:alpha:]]+)(/.*)$ ]]
then
A=${BASH_REMATCH[2]}
B=${BASH_REMATCH[3]%.git}
fi
And here is a small proof of concept:
url='https://github.com/xyz/abc/a_123/project_14.git'
if [[ $url =~ http[s]?:[/]{2}(github.com)[/]([[:alpha:]]+)(/.*)$ ]]
then
echo ${BASH_REMATCH[2]} ${BASH_REMATCH[3]%.git}
fi
Resulting in:
xyz /abc/a_123/project_14
I think this does what you want :
#!/bin/bash
repo="https://github.com/xyz/abc/a_123/project_14.git"
[[ ! "$repo" =~ https:\/\/github.com\/[a-z]+\/[a-z]+\/[a-z]_[0-9]+\/.*.git ]] && exit
A=$( echo "$repo" | sed -E "s/(https:\/\/github.com\/[a-z]+)(\/[a-z]+\/[a-z]_[0-9]+\/)(.*.git)/\2/g" )
B=$( echo "$repo" | sed -E "s/(https:\/\/github.com\/[a-z]+)(\/[a-z]+\/[a-z]_[0-9]+\/)(.*.git)/\3/g" )
echo "$A"
echo "${B%%.git}"
Let me know if it helps
Would you please try the following:
strchk() {
local x=$1
if [[ $x =~ github.com/(xyz|zyx)/(.+)/(.+) ]]; then
A="${BASH_REMATCH[2]}"
B="${BASH_REMATCH[3]%.*}"
return 0
else
return 1
fi
}
Results:
strchk "https://github.com/xyz/abc/a_123/project_14.git" && echo "A=$A, B=$B"
=> A=abc/a_123, B=project_14
strchk "https://github.com/bla/abc/a_123/xyz.git" && echo "A=$A, B=$B"
=> <empty>
strchk "https://github.com/zyx/asdasdas/lalal/asdas/nu.git" && echo "A=$A, B=$B"
=> A=asdasdas/lalal/asdas, B=nu
Explanations:
The pattern github.com/(xyz|zyx)/ matches a string which contains
github.com/ followed by xyz/ or zyx/.
The next pattern (.+)/ is greedy, so it matches everything after xyz/ or zyx/
up to the rightmost slash and stores the captured substring in
${BASH_REMATCH[2]}.
The last pattern (.+) captures the remaining substring into
${BASH_REMATCH[3]}.
The parameter expansion ${BASH_REMATCH[3]%.*} removes the extension
after the last dot, if one exists.
Hope this helps.
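A slightly stricter variant of the function above might anchor the URL and strip only a trailing .git instead of any extension. This is a sketch, not a definitive implementation: the function name parse_repo is hypothetical, and it assumes the URL always starts with http:// or https://.

```bash
#!/bin/bash
# parse_repo URL
# Sets the global variables A and B on success, returns 1 otherwise.
# (Hypothetical helper; assumes an http:// or https:// GitHub URL.)
parse_repo() {
    local url=$1
    # Keep the regex in a variable so the shell does not reinterpret it.
    local re='^https?://github\.com/(xyz|zyx)/(.+)/([^/]+)$'
    [[ $url =~ $re ]] || return 1
    A=${BASH_REMATCH[2]}          # everything between the owner and the last slash
    B=${BASH_REMATCH[3]%.git}     # last path segment, minus a trailing .git if present
}
```

Because only .git is removed, a last segment such as project.v2 is kept whole, whereas %.* would cut it down to project.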

Match two consecutive lines using Regex and Bash features only

What Regular Expression(s) can you use to match two consecutive lines?
The aim is not to use any packages like awk or sed but only use pure RegExp inside a shell script.
Example, I would like to ensure the word "hello" is immediately followed by "world" in the next line.
Acceptance criteria:
"hello" is not to have any spaces before it
"world" must have at least 1 or more space before it.
#!/bin/bash
file=./myfile.txt
regex='^hello'
[[ `cat $file` =~ $regex ]] && echo "yes" || echo "no"
myfile.txt
abc is def
hello
world
cde is efg
Here is pure bash way:
file='./myfile.txt'
[[ $(<$file) =~ hello$'\n'[[:blank:]]*world ]] && echo "yes" || echo "no"
yes
Here $'\n' matches a new line and [[:blank:]]* matches 0+ tabs or spaces.
If you want to be more precise then use:
[[ $(<"$file") =~ (^|$'\n')hello$'\n'[[:blank:]]*world($'\n'|$) ]] && echo "yes" || echo "no"
However grep or awk are much better tools for this job.
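The acceptance criteria in the question ask for at least one space before "world", which [[:blank:]]* does not enforce. A sketch of a stricter check, using [[:blank:]]+ and keeping the regex in a variable, could look like this (the function name has_hello_world is made up for illustration):

```bash
#!/bin/bash
# has_hello_world TEXT
# Succeeds if TEXT contains a line that is exactly "hello" with no leading
# blanks, immediately followed by a line made of one or more blanks and "world".
has_hello_world() {
    local text=$1 nl=$'\n'
    # (^|newline) and (newline|$) pin the match to whole lines.
    local re="(^|${nl})hello${nl}[[:blank:]]+world(${nl}|\$)"
    [[ $text =~ $re ]]
}
```

It could be called as has_hello_world "$(<"$file")" && echo "yes" || echo "no".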

Using regular expressions in a ksh Script

I have a file (file.txt) that contains some text like:
000000000+000+0+00
000000001+000+0+00
000000002+000+0+00
and I am trying to check each line to make sure that it follows the format:
character*9, "+", character*3, "+", etc
so far I have:
#!/bin/ksh
file=file.txt
line_number=1
for line in $(cat $file)
do
if [[ "$line" != "[[.]]{9}+[[.]]{3}+[[.]]{1}+[[.]]{2} ]" ]]
then
echo "Invalid number ($line) check line $line_number"
exit 1
fi
let "line_number++"
done
however this does not evaluate correctly, no matter what I put in the lines the program terminates.
When you want the line numbers of the mismatches, you can use grep -vn. With a correctly written regular expression, this becomes
grep -Evn "^.{9}[+].{3}[+].[+].{2}$" file.txt
This is not in the layout that you want, so change the layout with sed:
grep -Evn "^.{9}[+].{3}[+].[+].{2}$" file.txt |
sed -r 's/([^:]*):(.*)/Invalid number (\2) check line number \1./'
EDIT:
I changed .{1} into a single . (dot).
The sed is also over the top. When you need some explanation, you can start with echo "Linenr:Invalid line"
I'm having funny results putting the regex in the condition directly:
$ line='000000000+000+0+00'
$ [[ $line =~ ^.{9}\+.{3}\+.\+..$ ]] && echo ok
ksh: syntax error: `~(E)^.{9}\+.{3}\+.\+..$ ]] && echo ok
' unexpected
But if I save the regex in a variable:
$ re="^.{9}\+.{3}\+.\+..$"
$ [[ $line =~ $re ]] && echo ok
ok
So you can do
#!/bin/ksh
file=file.txt
line_number=1
re="^.{9}\+.{3}\+.\+..$"
while IFS= read -r line; do
if [[ ! $line =~ $re ]]; then
echo "Invalid number ($line) check line $line_number"
exit 1
fi
let "line_number++"
done < "$file"
You can also use a plain glob pattern:
if [[ $line != ?????????+???+?+?? ]]; then echo error; fi
ksh glob patterns have some regex-like syntax. If there's an optional space in there, you can handle that with the ?(sub-pattern) syntax
pattern="?????????+???+?( )?+??"
line1="000000000+000+0+00"
line2="000000000+000+ 0+00"
[[ $line1 == $pattern ]] && echo match || echo no match # => match
[[ $line2 == $pattern ]] && echo match || echo no match # => match
Read the "File Name Generation" section of the ksh man page.
Your regex looks bad - using sites like https://regex101.com/ is very helpful. From your description, I suspect it should look more like one of these:
^.{9}\+.{3}\+.{1}\+.{2}$
^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$
^[0-9]{9}\+[0-9]{3}\+[0-9]{1}\+[0-9]{2}$
From the ksh manpage section on [[ - you would probably want to be using =~.
string =~ ere
True if string matches the pattern ~(E)ere where ere is an extended regular expression.
Note: As far as I know, ksh regex doesn't follow the normal syntax
You may have better luck with using grep:
# X="000000000+000+0+00"
# grep -qE "^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$" <<<"${X}" && echo true
true
Or:
if ! grep -qE "^[^\+]{9}\+[^\+]{3}\+[^\+]{1}\+[^\+]{2}$" <<<"${line}"
then
exit 1
fi
You may also prefer to use a construct like below for handling files:
while IFS= read -r line; do
echo "${line}";
done < "${file}"
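Putting the pieces from these answers together, a validator that reports every bad line instead of stopping at the first one might look like the sketch below. It is written for bash rather than ksh, the function name validate_lines is hypothetical, and it assumes every field is made of digits, which the question does not actually state.

```bash
#!/bin/bash
# validate_lines: read lines from stdin, print a message for each line that
# does not match 9 digits + 3 digits + 1 digit + 2 digits, and return 1 if
# any line was invalid. (Hypothetical helper; digits-only is an assumption.)
validate_lines() {
    local re='^[0-9]{9}\+[0-9]{3}\+[0-9]\+[0-9]{2}$'
    local line n=0 status=0
    # The "|| [[ -n $line ]]" part also handles a last line without a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        n=$((n + 1))
        if [[ ! $line =~ $re ]]; then
            echo "Invalid number ($line) check line $n"
            status=1
        fi
    done
    return "$status"
}
```

Usage would be validate_lines < file.txt, checking the exit status afterwards.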

How to match this string in bash?

I'm reading a file in bash, line by line. I need to print lines that have the following format:
don't care <<< at least one character >>> don't care.
These are all the way which I have tried and none of them work:
if [[ $line =~ .*<<<.+>>>.* ]]; then
echo "$line"
fi
This has incorrect syntax.
These two have correct syntax but don't work:
if [[ $line =~ '.*<<<.+>>>.*' ]]; then
echo "$line"
fi
And this:
if [[ $line == '*<<<*>>>*' ]]; then
echo "$line"
fi
So how do I tell bash to only print lines with that format? PS: I have tested and printing all lines works just fine.
You don't need a regular expression; filename patterns will work just fine:
if [[ $line == *"<<<"?*">>>"* ]]; then ...
* - match zero or more characters
? - match exactly one character
"<<<" and ">>>" - literal strings: The angle brackets need to be quoted so bash does not interpret them as a here-string redirection.
$ line=foobar
$ [[ $line == *"<<<"?*">>>"* ]] && echo y || echo n
n
$ line='foo<<<>>>bar'
$ [[ $line == *"<<<"?*">>>"* ]] && echo y || echo n
n
$ line='foo<<<x>>>bar'
$ [[ $line == *"<<<"?*">>>"* ]] && echo y || echo n
y
$ line='foo<<<xyz>>>bar'
$ [[ $line == *"<<<"?*">>>"* ]] && echo y || echo n
y
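Applied to the original task of reading input line by line, the pattern above could be used as in this sketch. The function name print_marked_lines is made up for illustration, and it reads from standard input.

```bash
#!/bin/bash
# print_marked_lines: copy to stdout only those stdin lines that contain
# "<<<", then at least one character, then ">>>". (Hypothetical helper.)
print_marked_lines() {
    local line
    # The "|| [[ -n $line ]]" part also handles a last line without a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == *"<<<"?*">>>"* ]]; then
            printf '%s\n' "$line"
        fi
    done
    return 0
}
```

It would be called as print_marked_lines < input-file.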
For maximum compatibility, it's always a good idea to define your regex pattern as a separate variable in single quotes, then use it unquoted. This works for me:
re='<<<.+>>>'
if [[ $line =~ $re ]]; then
echo "$line"
fi
I got rid of the redundant leading/trailing .*, by the way.
Of course, I'm assuming that you have a valid reason to process the file in native bash (if not, just use grep -E '<<<.+>>>' file)
<, <<, <<<, >, and >> are special in the shell and need quoting:
[[ $line =~ '<<<'.+'>>>' ]]
. and + shouldn't be quoted, though, to keep their special meaning.
You don't need the leading and trailing .* in =~ matching, but you need them (or their equivalents) in patterns:
[[ $line == *'<<<'?*'>>>'* ]]
It's faster to use grep to extract lines:
grep -E '<<<.+>>>' input-file
I don't even understand why you are reading the file line by line. I have just run the following command at the bash prompt and it works fine (note the -E: without it, grep treats + as a literal character):
grep -E "<<<<.+>>>>" test.txt
where test.txt contains following data:
<<<<>>>>
<<<<a>>>>
<<<<aa>>>>
The result of the command was:
<<<<a>>>>
<<<<aa>>>>

match leading dots in bash if using regex

Say I want to match the leading dot in a string ".a"
So I type
[[ ".a" =~ ^\. ]] && echo "ha"
ha
[[ "a" =~ ^\. ]] && echo "ha"
ha
Why am I getting the same result here?
You need to escape the dot; it has meaning beyond just a period, because it is a metacharacter in regex.
[[ "a" =~ ^\. ]] && echo "ha"
Make the change in the other example as well.
Check your bash version - you need 4.0 or higher I believe.
There are some compatibility issues with =~ between Bash versions after 3.0. The safest way to use =~ in Bash is to put the RE pattern in a var:
$ pat='^\.foo'
$ [[ .foo =~ $pat ]] && echo yes || echo no
yes
$ [[ foo =~ $pat ]] && echo yes || echo no
no
$
For more details, see E14 on the Bash FAQ page.
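As a sketch of the pattern-in-a-variable approach applied to the original question, the check could be wrapped in a small function. Both function names below are hypothetical. The second one uses a glob instead of a regex, which avoids the =~ quoting question entirely.

```bash
#!/bin/bash
# starts_with_dot STRING: succeeds if STRING begins with a literal dot.
starts_with_dot() {
    # Single quotes keep the backslash, so the regex engine sees ^\.
    local re='^\.'
    [[ $1 =~ $re ]]
}

# Same test with a glob pattern: a dot is not special in globs.
starts_with_dot_glob() {
    [[ $1 == .* ]]
}
```

Both functions succeed for .a and fail for a.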
Probably it's because bash tries to treat "\." as an escape sequence, like \n, \r, etc.
To have \ and . treated as two separate characters, try
[[ "a" =~ ^\\. ]] && echo ha