Bash: need to find text within matching braces (parentheses) in text - regex

I have some text that looks like this:
(something1)something2
However something1 and something2 might also have some parentheses inside them such as
(some(thing)1)something(2)
I want to extract something1 (including internal parentheses if there are any) to a variable. Since I can count on the text always starting with an opening parenthesis, I'm hoping that I can match the first parenthesis to the correct closing parenthesis and extract the middle.
Everything I have tried so far has the potential to match the wrong ending parentheses.

If you have perl, then:
perl -MText::Balanced -nlE 'say [Text::Balanced::extract_bracketed( $_, "()" )]->[0]' <<EOF
(something1)something2
(some(thing)1)something(2)
(some(t()()hing)()1)()something(2)
EOF
prints:
(something1)
(some(thing)1)
(some(t()()hing)()1)

Since this is apparently impossible with regular expressions, I have resorted to picking up the characters one by one:
first=""
count=0
while test -n "$string"
do
    char=${string:0:1}      # Get the first character
    if [[ "$char" == ")" ]]
    then
        count=$(( count - 1 ))
    fi
    if (( count > 0 ))      # Numeric test; [[ $count > 0 ]] would compare as strings
    then
        first="$first$char"
    fi
    if [[ "$char" == "(" ]]
    then
        count=$(( count + 1 ))
    fi
    string=${string:1}      # Trim the first character
    if (( count == 0 ))
    then
        second="$string"    # Everything after the matching ")"
        string=""
    fi
done
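The same counting idea can be wrapped in a function so the result is easy to capture in a variable. This is only a sketch: extract_group is a made-up name, and it assumes (as in the question) that the string starts with an opening parenthesis.

```shell
#!/usr/bin/env bash

# extract_group STRING
# Prints the text between the first "(" and its matching ")".
# extract_group is a hypothetical name; STRING is assumed to start with "(".
extract_group() {
    local s=$1 depth=0 i char
    [[ $s == "("* ]] || return 1          # precondition taken from the question
    for (( i = 0; i < ${#s}; i++ )); do
        char=${s:i:1}
        case $char in
            "(") depth=$(( depth + 1 )) ;;
            ")") depth=$(( depth - 1 )) ;;
        esac
        if (( depth == 0 )); then
            # s[0] is the opening "(" and s[i] is its partner
            printf '%s\n' "${s:1:i-1}"
            return 0
        fi
    done
    return 1                              # input ended before the group closed
}
```

Usage would look like first=$(extract_group "$string"). The function returns non-zero when the parentheses never balance, so the caller can tell a failed match from an empty group.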

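You can do it with perl, using a recursive pattern (perl 5.10 or later). (?1) recurses into capture group 1, so every nested (...) has to be balanced as well:
echo "(some(thing)1)something(2)" | perl -nle 'print $1 if /^(\((?:[^()]++|(?1))*\))/'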

awk can do it:
#!/bin/awk -f
{
    numLeft = numRight = 0                  # reset the counters for every line
    for (i = 1; i <= length($0); ++i) {
        c = substr($0, i, 1)
        if (c == "(") {
            if (numLeft == 0) leftPos = i   # position of the first "("
            ++numLeft
        } else if (c == ")" && numLeft) {
            ++numRight
        }
        if (numLeft && numLeft == numRight) {
            print substr($0, leftPos, i - leftPos + 1)
            next
        }
    }
}
Input:
(something1)something2
(some(thing)1)something(2)
Output:
(something1)
(some(thing)1)

Related

detect string case and apply to another one

How can I detect the case (lowercase, UPPERCASE, CamelCase [, maybe WhATevERcAse]) of a string to apply to another one?
I would like to do it as a one-liner with sed or whatever.
This is used for a spell checker which proposes corrections.
Let's say I get something like string_to_fix:correction:
BEHAVIOUR:behavior => get BEHAVIOUR:BEHAVIOR
Behaviour:behavior => get Behaviour:Behavior
behaviour:behavior => remains behaviour:behavior
Extra case to be handled:
MySpecalCase:myspecialcase => MySpecalCase:MySpecialCase (so character would be the point of reference and not the position in the word)
With awk you can use the POSIX character classes to detect case:
$ cat case.awk
/^[[:lower:]]+$/ { print "lower"; next }
/^[[:upper:]]+$/ { print "upper"; next }
/^[[:upper:]][[:lower:]]+$/ { print "capitalized"; next }
/^[[:alpha:]]+$/ { print "mixed case"; next }
{ print "non alphabetic" }
$ echo chihuahua | awk -f case.awk
lower
$ echo WOLFHOUND | awk -f case.awk
upper
$ echo London | awk -f case.awk
capitalized
$ echo LaTeX | awk -f case.awk
mixed case
$ echo "Jaws 2" | awk -f case.awk
non alphabetic
Here's an example taking two strings and applying the case of the first to the second:
BEGIN { OFS = FS = ":" }
$1 ~ /^[[:lower:]]+$/ { print $1, tolower($2); next }
$1 ~ /^[[:upper:]]+$/ { print $1, toupper($2); next }
$1 ~ /^[[:upper:]][[:lower:]]+$/ { print $1, toupper(substr($2,1,1)) tolower(substr($2,2)); next }
$1 ~ /^[[:alpha:]]+$/ { print $1, $2; next }
{ print $1, $2 }
$ echo BEHAVIOUR:behavior | awk -f case.awk
BEHAVIOUR:BEHAVIOR
$ echo Behaviour:behavior | awk -f case.awk
Behaviour:Behavior
$ echo behaviour:behavior | awk -f case.awk
behaviour:behavior
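If you would rather stay in the shell, the same three rules can be sketched with bash's case-modification expansions. This assumes bash 4 or later (for ${var^^}, ${var,,} and ${var^}); apply_case is a made-up name, and mixed-case references are passed through unchanged, as in the awk version.

```shell
#!/usr/bin/env bash

# apply_case REFERENCE WORD
# Prints WORD using the case style of REFERENCE.
# apply_case is a hypothetical helper; requires bash 4+.
apply_case() {
    local ref=$1 word=$2
    if [[ $ref =~ ^[[:upper:]]+$ ]]; then
        printf '%s\n' "${word^^}"            # all upper case
    elif [[ $ref =~ ^[[:lower:]]+$ ]]; then
        printf '%s\n' "${word,,}"            # all lower case
    elif [[ $ref =~ ^[[:upper:]][[:lower:]]+$ ]]; then
        word=${word,,}
        printf '%s\n' "${word^}"             # capitalized
    else
        printf '%s\n' "$word"                # mixed or non-alphabetic: leave as is
    fi
}
```

For a string_to_fix:correction line you could split on the colon first, for example IFS=: read -r bad good, and then print "$bad:$(apply_case "$bad" "$good")".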
With GNU sed:
sed -r 's/([A-Z]+):(.*)/\1:\U\2/;s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/' file
Explanations:
s/([A-Z]+):(.*)/\1:\U\2/: looks for a run of uppercase letters immediately before the : and, using a backreference and the uppercase modifier \U, converts everything after the : to uppercase
s/([A-Z][a-z]+):([a-z])/\1:\U\2\L/: looks for a word that starts with an uppercase letter followed by lowercase letters and, if found, converts the first letter after the : to uppercase
awk -F ':' '
{
    # read the pattern to reproduce
    Pat = $1
    printf("%s:", Pat)
    # generic: the pattern is entirely upper case or entirely lower case
    if ($1 ~ /^[[:upper:]]*$/) { print toupper($2); next }
    if ($1 ~ /^[[:lower:]]*$/) { print tolower($2); next }
    # specific: turn the pattern into one directive per character
    gsub(/[^[:upper:][:lower:]]/, "~:", Pat)
    gsub(/[[:upper:]]/, "U:", Pat)
    gsub(/[[:lower:]]/, "l:", Pat)
    # Pat ends with ":", so drop the empty last element that split() produces
    LengPat = split(Pat, aDir, /:/) - 1
    # print with the corresponding pattern
    LenSec = length($2)
    for (i = 1; i <= LenSec; i++) {
        ThisChar = substr($2, i, 1)
        Dir = aDir[(i - 1) % LengPat + 1]
        if (Dir == "U") printf("%s", toupper(ThisChar))
        else if (Dir == "l") printf("%s", tolower(ThisChar))
        else printf("%s", ThisChar)
    }
    printf("\n")
}' YourFile
This handles any case pattern (and uses the same idea as #Jas for the quick all-upper or all-lower test).
It only works for this structure (fields separated by :).
The second part (the text) may be longer than the first; the pattern is then reused cyclically.
This might work for you (GNU sed):
sed -r '/^([^:]*):\1$/Is//\1:\1/' file
This uses the I flag to do a caseless match and then replaces both instances of the match with the first. It therefore only applies when both fields are the same word apart from case.

How should I use exact keyword matching as a condition in the case statement?

I was trying to write myself some handy scripts in order to legitimately slack off at work more efficiently, and this question suddenly popped up:
Given a very long string $LONGEST_EVER_STRING and several keywords strings like $A='foo bar' , $B='omg bbq' and $C='stack overflow'
How should I use exact keyword matching as a condition in the case statement?
for word in $LONGEST_EVER_STRING; do
case $word in
any exact match in $A) do something ;;
any exact match in $B) do something ;;
any exact match in $C) do something ;;
*) do something else;;
esac
done
I know I can write in this way but it looks really ugly:
for word in $LONGEST_EVER_STRING; do
if [[ -n $(echo $A | fgrep -w $word) ]]; then
do something;
elif [[ -n $(echo $B | fgrep -w $word) ]]; then
do something;
elif [[ -n $(echo $C | fgrep -w $word) ]]; then
do something;
else
do something else;
fi
done
Does anyone have an elegant solution? Many thanks!
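You could use a function to transform your A, B, C variables into extglob patterns and then match on those. Note that @(...) matches exactly one of the alternatives, whereas +(...) would also accept repeats such as foofoo:
shopt -s extglob
Ax="@(foo|bar)"
Bx="@(omg|bbq)"
Cx="@(stack|overflow)"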
for word in $LONGEST_EVER_STRING; do
case $word in
$Ax) do something ;;
$Bx) do something ;;
$Cx) do something ;;
*) do something else;;
esac
done
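The transform itself is not shown above, so here is one way it might look. It is a sketch under two assumptions: the keyword lists are separated by single spaces, and the words contain no glob characters. to_pattern and classify are made-up names.

```shell
#!/usr/bin/env bash
shopt -s extglob    # needed for @(...) patterns in case

A='foo bar'
B='omg bbq'
C='stack overflow'

# to_pattern LIST: turn "foo bar" into "@(foo|bar)".
# Hypothetical helper; assumes single spaces between words.
to_pattern() {
    printf '@(%s)' "${1// /|}"
}

Ax=$(to_pattern "$A")
Bx=$(to_pattern "$B")
Cx=$(to_pattern "$C")

# classify WORD: report which list WORD belongs to.
classify() {
    case $1 in
        $Ax) echo A ;;       # unquoted on purpose so it is used as a pattern
        $Bx) echo B ;;
        $Cx) echo C ;;
        *)   echo other ;;
    esac
}
```

Inside the loop from the question you would then call classify "$word", or inline the case statement and replace the echo commands with the real actions.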
I would just define a function for this. It'll be slower than grep for large wordlists, but faster than starting up grep many times.
##
# Success if the first arg is one of the later args.
has() {
    [[ $1 = "$2" ]] || {
        (( $# > 2 )) && has "$1" "${@:3}"
    }
}
$ has a b c && echo t || echo f
f
$ has a b c a d e f && echo t || echo f
t
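To tie this back to the question, here is a sketch of how such a function could slot into the original loop. It uses an iterative variant of has rather than the recursive one, and it assumes A, B and C hold space-separated words without glob characters; classify_words is a made-up name.

```shell
#!/usr/bin/env bash

# has NEEDLE WORD...: succeed if NEEDLE equals one of the WORDs.
# Iterative variant of the recursive function above.
has() {
    local needle=$1 candidate
    shift
    for candidate in "$@"; do
        [[ $needle == "$candidate" ]] && return 0
    done
    return 1
}

A='foo bar'
B='omg bbq'
C='stack overflow'

# classify_words TEXT: print "word:list" for each word in TEXT.
classify_words() {
    local word
    for word in $1; do            # intentionally unquoted: split on whitespace
        if   has "$word" $A; then echo "$word:A"
        elif has "$word" $B; then echo "$word:B"
        elif has "$word" $C; then echo "$word:C"
        else                      echo "$word:none"
        fi
    done
}
```

Calling classify_words "$LONGEST_EVER_STRING" would then print one line per word; the echo commands stand in for whatever "do something" is in the real script.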
A variation on /etc/bashrc's "pathmunge"
for word in $LONGEST_EVER_STRING; do
found_word=false
for list in " $A " " $B " " $C "; do
if [[ $list == *" $word "* ]]; then
found_word=true
stuff with $list and $word
break
fi
done
$found_word || stuff when not found
done

Bash Regex for empty string returns true

if [[ " " =~ ^[0-9]*$ ]]; then echo "si"; else echo "no"; fi # prints "no"
if [[ "" =~ ^[0-9]*$ ]]; then echo "si"; else echo "no"; fi # prints "si"
Is this a bug or am I missing something?
This is as expected. You specified 0 or more times (*) a digit ([0-9]). An empty string is 0 times that.
Use a + (which means "1 or more times") instead of a *:
if [[ " " =~ ^[0-9]+$ ]]; then echo "si"; else echo "no"; fi # prints "no"
if [[ "" =~ ^[0-9]+$ ]]; then echo "si"; else echo "no"; fi # prints "no"
The first one is a space, which does not match the ^[0-9]*$ regex.
The second is empty, which matches [0-9]* because * also allows 0 occurrences. If you require at least one occurrence with +, then the match fails:
$ if [[ " " =~ ^[0-9]+$ ]]; then echo "si"; else echo "no"; fi;
no
* in a regex means "0 or more", so with nothing in the target string, the regex trivially matches.
[0-9]* matches zero or more digits, so yes, it matches the empty string. If you don't want to match the empty string, use [0-9]+, which matches one or more digits.
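If the check is needed in more than one place, it could be wrapped in a small function. A minimal sketch, with is_digits as a made-up name:

```shell
#!/usr/bin/env bash

# is_digits STRING: succeed only if STRING is one or more ASCII digits.
# Hypothetical helper; the + quantifier is what rejects the empty string.
is_digits() {
    [[ $1 =~ ^[0-9]+$ ]]
}
```

Then if is_digits "$input"; then ...; fi reads the same as the inline test, and both the empty string and a lone space take the else branch.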

How can I check the last character in a string in bash?

I need to ensure that the last character in a string is a /
x="test.com/"
if [[ $x =~ //$/ ]] ; then
x=$x"extention"
else
x=$x"/extention"
fi
at the moment, false always fires.
Like this, for example:
$ x="test.com/"
$ [[ "$x" == */ ]] && echo "yes"
yes
$ x="test.com"
$ [[ "$x" == */ ]] && echo "yes"
$
$ x="test.c/om"
$ [[ "$x" == */ ]] && echo "yes"
$
$ x="test.c/om/"
$ [[ "$x" == */ ]] && echo "yes"
yes
$ x="test.c//om/"
$ [[ "$x" == */ ]] && echo "yes"
yes
You can take substrings in Bash with ${var:offset} and get the length of a string with ${#var}. A negative offset counts from the end of the string, so -1 selects the last character; it needs a leading space, as in ${x: -1}, so that it is not parsed as the ${x:-default} expansion. The version below computes the offset from the length instead:
if [[ "${x:${#x}-1}" == "/" ]]; then
    : # last character of x is /
fi
Your condition was slightly incorrect. When using =~, the right-hand side is treated as a regular expression, so you write pattern and not /pattern/.
You'd have got expected results if you said
if [[ $x =~ /$ ]] ; then
instead of
if [[ $x =~ //$/ ]] ; then
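Applied to the original snippet, the whole if/else can shrink to "add the slash only when it is missing". A sketch, with with_trailing_slash as a made-up name:

```shell
#!/usr/bin/env bash

# with_trailing_slash STRING: print STRING, adding a "/" if it lacks one.
# Hypothetical helper built on the [[ $x == */ ]] pattern test.
with_trailing_slash() {
    local s=$1
    [[ $s == */ ]] || s+=/
    printf '%s\n' "$s"
}

x="test.com"
x="$(with_trailing_slash "$x")extention"   # x is now test.com/extention
```

The same call leaves test.com/ untouched, so x ends up as test.com/extention either way.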
You can do this generically using bash substrings, ${string:offset:length}, where length is optional.
${#x} is the length of x.
Therefore
n=1 # 1 character
last_char=${x:${#x}-n}
For future reference,
$ man bash
has all the magic
${parameter:offset:length}
Substring Expansion. Expands to up to length characters of parameter
starting at the character specified by offset. If length is
omitted, expands to the substring of parameter starting at the
character specified by offset. length and offset are arithmetic
expressions ...
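Because offset is an arithmetic expression, the "last n characters" idea generalizes directly. A sketch with a made-up function name, last_chars, that clamps n so strings shorter than n are returned whole:

```shell
#!/usr/bin/env bash

# last_chars STRING [N]: print the last N characters of STRING (default 1).
# Hypothetical helper; N is clamped to the length of STRING.
last_chars() {
    local s=$1 n=${2:-1}
    if (( n > ${#s} )); then
        n=${#s}
    fi
    printf '%s\n' "${s:${#s}-n}"
}
```

For example, [[ $(last_chars "$x") == / ]] is another way to write the trailing-slash test.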

regex doubt in gawk

my csv data file is like this
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
As you can see, I want to reject all records like lines 2 and 3 (i.e. the name must contain no white space and must be at least 3 characters long):
MRS.,RAJ KUMAR,male
MR.,N,Male
and place it in a file called rejected_list.csv, rest all go in a file called clean_list.csv
hence here is my gawk script for it
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that this script does not recognise the '\S' character class (any character except white space). It selects all names that start with or contain an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add a limit of {3,} the script fails.
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+', etc., but I can't get what I want.
Can anyone tell me what the problem is?
Use [[:graph:]] instead of \S for all printable, visible characters. Older gawk releases (before 4.0) do not recognize \S, so there it will not work as a character class.
Additionally, in those older releases the {3,} interval expression only works in --posix or --re-interval mode.
I added a rejection condition (not exactly 3 fields) and anchored the title and gender patterns so that they have to match the whole field:
gawk -F, '
BEGIN {
    titles = "^(MRS|MR|MS|MISS|MASTER|SMT|DR|BABY|PROF)[.]$"
    genders = "^(M|F|Male|Female)$"
}
NF != 3 || $1 !~ titles || $2 ~ /[[:space:]]/ || length($2) < 3 || $3 !~ genders {
    print > "rejected_list.csv"
    next
}
{ print > "clean_list.csv" }
' < DATA_file.csv