regex match square brackets once - regex

There is a text file with the following info:
[[parent]]
[son]
[daughter]
How to get only [son] and [daughter]?
$0 ~ /\[([a-z])*\]/ ???

Your regex is almost right. Just put the * inside the round brackets (in order to have the whole text inside the only group) and remember to use the ^ and $ delimiters (to avoid matching [[parent]]):
^\[([a-z]*)\]$

Match any square bracket at beginning of line where the next character is an alphabetic.
awk '/^\[[a-z]/' file
You might want to add uppercase and/or numbers to the character class, depending on what your real requiements are. (Your examples show only lowercase, so I have assumed that's a valid generalization.)

You can use this awk command:
awk -F '[][]+' 'NF && !/\[\[/{print $2}' file
son
daughter
awk command breakup:
-F '[][]+' # set input field separator as 1 or more of [ or ]
NF # only if at least one field is found
!/\[\[/ # when input doesn't start with [[

Related

extract substring with SED

I have the next strings:
for example:
input1 = abc-def-ghi-jkl
input2 = mno-pqr-stu-vwy
I want extract the first word between "-"
for the fisrt string I want to get: def
if the input is the second string, I want to get: pqr
I want to use the command SED, Could you help me please?
Use
sed 's,^[^-]*-\([^-]*\).*,\1,' file
The string after the first - will be captured up to the second - and the rest will be matched, then the matched line will be replaced with the group text.
With bash:
var='input1 = abc-def-ghi-jkl'
var=${var#*-} # remove shortest prefix `*-`, this removes `input1 = abc-`
echo "${var%%-*}" # remove longest suffix `-*`, this removes `-ghi-jkl`
Or with awk:
awk -F'-' '{print $2}' <<<'input1 = abc-def-ghi-jkl'
Use - as input field separator and print the second field.
Or with cut:
cut -d'-' -f2 <<<'input1 = abc-def-ghi-jkl'
When you want to use sed, you can choose between solutions like
# Double processing
echo "$input1" | sed 's/[^-]*-//;s/-.*//'
# Normal approach
echo "$input1" | sed -r 's/^[^-]*-([^-]*)|-.*)/\1/g'
# Funny alternative
echo "$input1" | sed -r 's/(^[^-]*-|-.*)//g'
The obvious "external" tool would be cut. You can also look at a Bash builtin solution like
[[ ${input1} =~ ([^-]*)-([^-]*) ]] && printf %s "${BASH_REMATCH[2]}"
grep solution (in my opinion this is the most natural approach, as you are only trying to find matches to a regular expression - you are not looking to edit anything, so there should be no need for the more advanced command sed)
grep -oP '^[^-]*-\K[^-]*(?=-)' << EOF
> abc-qrs-bobo-the-clown
> 123-45-6789
> blah-blah-blah
> no dashes here
> mahi-mahi
> EOF
Output
qrs
45
blah
Explanation
Look at the inputs first, included here for completeness as a heredoc (more likely you would name your file as the last argument to grep.) The solution requires at least two dashes to be present in the string; in particular, for mahi-mahi it will find no match. If you want to find the second mahi as a match, you can remove the lookahead assertion at the end of the regular expression (see below).
The regular expression does this. First note the command options: -o to return only the matched substring, not the entire line; and -P to use Perl extensions. Then, the regular expression: start from the beginning of the line (^); look for zero or more non-dash characters followed by dash, and then (\K) discard this part of the required match from the substrings found to match the pattern. Then look for zero or more non-dash characters again - this will be returned by the command. Finally, require a dash following this pattern, but do not include it in the match. This is done with a lookahead (marked by (?= ... )).

Gawk - Regexp - unable to get results

I have a two column file named names.csv. Field 1 has names with alphabet characters in them. I am trying to find out names where a character repeats e.g. Viijay (and not Vijay)
The command below works and returns all the rows in Field 1
gawk "$1 ~ /[a-z]/ {print $0}" names.csv
To meet the requirement stated above (viz. repeating characters), I have actually used the command below, which does not return any rows
gawk "$1 ~ /[a-z]{1,}/ {print $0}" names.csv
What is the correction needed to get what I am looking for?
To further elaborate, if the values in Column 1/Field 1 are Vijay, Viijay and Vijayini, i want only Viijay to be returned. That is, only values where a character ("i" in the example here) is repeated (not "recurring" as in Vijayini where the character "i" is recurring in the string but not clustered together.)
Requested sample data is:
Vijay 1
Viijay 2
Vijayini 3
and the expected output:
Viijay 2
As awk regex doesn't support backreferences in matching, you need to find the duplicated characters some other way. This one duplicates every character in $1 and adds them to a variable which is then matched against the original string in, ie. Viijay -> re="(VV|ii|ii|jj|aa|yy)"; if($1~re)... (notice, that it does not test if the entry is already in re, you might want to consider adding some checking, more checking considerations in the comments):
$ awk '
{ # you should test for empty $1
re="(" # reset re
for(i=1;i<=length($1);i++) # for each char in $1
re=re (i==1?"":"|") (b=substr($1,i,1)) b # generate dublicated re entry
re=re ")" # terminating )
if($1~re) # match
print # and print if needed
}' file
Output:
Viijay 2
Ironically or exemplarily it fails on Busybox awk—in which the backreferences can be used Ɑ:
$ busybox awk '$1~"(.)\\1" {print $0}' file
Viijay,2
Since awk doesn't support backreferences in a regexp you're better off using grep or sed for this:
$ grep '^[^[:space:]]*\([a-z]\)\1' file
Viijay 2
$ sed -n '/^[^[:space:]]*\([a-z]\)\1/p' file
Viijay 2
That might be GNU-only, google to check.
With awk you'd have to do something like the following to first create a regexp that matches 2 repetitions of any character in your specific character set of a-z:
$ awk '{re=$1; gsub(/[^a-z]/,"",re); gsub(/./,"&{2}|",re); sub(/\|$/,"",re)} $1 ~ re' file
Viijay 2
FYI to create a regexp from $1 that would match 2 repetitions of any character it contains, not just a-z, would be:
re=$1; gsub(/[^\\^]/,"[&]{2}|",re); gsub(/[\\^]/,"\\\\&{2}|",re); sub(/\|$/,"",re);
You have to handle ^ differently from other characters as that's the only character that has a different meaning than literal when it's the first character in a bracket expression (i.e. negation) so you have to escape it with a backslash rather than putting it inside a bracket expression to make it literal. You have to handle \ different because [\] means the same as [] which is an unterminated bracket expression because [ is the start but ] is just the first character inside the bracket expression, it's not the ] needed to terminate it.

Regex: find elements regardless of order

If I have the string:
geo:FR, host:www.example.com
(In reality the string is more complicated and has more fields.)
And I want to extract the "geo" value and the "host" value, I am facing a problem when the order of the keys change, as in the following:
host:www.example.com, geo:FR
I tried this line:
sed 's/.\*geo:\([^ ]*\).\*host:\([^ ]*\).*/\1,\2/'
But it only works on the first string.
Is there a way to do it in a single regex, and if not, what's the best approach?
I suggest extracting each text you need with a separate sed command:
s="geo:FR, host:www.example.com"
host="$(sed -n 's/.*host:\([^[:space:],]*\).*/\1/p' <<< "$s")"
geo="$(sed -n 's/.*geo:\([^[:space:],]*\).*/\1/p' <<< "$s")"
See the online demo, echo "$host and $geo" prints
www.example.com and FR
for both inputs.
Details
-n suppresses line output and p prints the matches
.* - matches any 0+ chars up the last...
host: - host: substring and then
\([^[:space:],]*\) - captures into Group 1 any 0 or more chars other than whitespace and a comma
.* - the rest of the line.
The result is just the contents of Group 1 (see \1 in the replacement pattern).
Whenever you have tag/name to value pairs in your input I find it best (clearest, simplest, most robust,, easiest to enhance, etc.) to first create an array that contains that mapping (f[] below) and then you can simply access the values by their tags:
$ cat file
geo:FR, host:www.example.com
host:www.example.com, geo:FR
foo:bar, host:www.example.com, stuff:nonsense, badgeo:uhoh, geo:FR, nastygeo:wahwahwah
$ cat tst.awk
BEGIN { FS=":|, *"; OFS="," }
{
for (i=1; i<=NF; i+=2) {
f[$i] = $(i+1)
}
print f["geo"], f["host"]
}
$ awk -f tst.awk file
FR,www.example.com
FR,www.example.com
FR,www.example.com
The above will work using any awk in any shell on every UNIX box.
Here I've used GNU Awk to convert your delimited key:value pairs to valid shell assignment. With Bash, you can load these assignments into your current shell using <(process substitution):
# source the file descriptor generated by proc sub
. < <(
# use comma-space as field separator, literal apostrophe as variable q
awk -F', ' -vq=\' '
# change every foo:bar in line to foo='bar' on its own line
{for(f=1;f<=NF;f++) print gensub(/:(.*)/, "=" q "\\1" q, 1, $f)}
# use here-string to load text; remove everything but first quote to use standard input
' <<< 'host:www.example.com, geo:FR'
)

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. this is really a pattern (which is the letter 'b' followed by 'n' number of digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in example above). Is there an easy way to do so? Perhaps by using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will seach for a word starting with b followed by '3' digits.
regex101 demo
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
what this does is reads the file when it matches the pattern HEROKU_POSTGRESQL_AQUA_URL print the 4th token in this case b687
your other option is to use sed
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned b####... and only print that pattern the -n tells sed not to print line that do not have that pattern. the rest of the sed command is a substitution .* is any string at the beginning. followed by a (...) which forms a group in which we put the regex that will match your b##### the second part says out of all that match only print the group 1 and the p at the end tells sed to print the result (since by default we told sed not to print with the -n)

Grep Regex: List all lines except

I'm trying to automagically remove all lines from a text file that contains a letter "T" that is not immediately followed by a "H". I've been using grep and sending the output to another file, but I can't come up with the magic regex that will help me do this.
I don't mind using awk, sed, or some other linux tool if grep isn't the right tool to be using.
That should do it:
grep -v 'T[^H]'
-v : print lines not matching
[^H]: matches any character but H
You can do:
grep -v 'T[^H]' input
-v is the inverse match option of grep it does not list the lines that match the pattern.
The regex used is T[^H] which matches any lines that as a T followed by any character other than a H.
Read lines from file exclude EMPTY Lines and Lines starting with #
grep -v '^$\|^#' folderlist.txt
folderlist.txt
# This is list of folders
folder1/test
folder2
# This is comment
folder3
folder4/backup
folder5/backup
Results will be:
folder1/test
folder2
folder3
folder4/backup
folder5/backup
Adding 2 awk solutions to the mix here.
1st solution(simpler solution): With simple awk and any version of awk.
awk '!/T/ || /TH/' Input_file
Checking 2 conditions:
If a line doesn't contain T OR
If a line contains TH then:
If any of above condition is TRUE then print that line simply.
2nd solution(GNU awk specific): Using GNU awk using match function where mentioning regex (T)(.|$) and using match function's array creation capability.
awk '
!/T/{
print
next
}
match($0,/(T)(.|$)/,arr) && arr[1]=="T" && arr[2]=="H"
' Input_file
Explanation: firstly checking if a line doesn't have T then print that simply. Then using match function of awk to match T followed by any character OR end of the line. Since these are getting stored into 2 capturing groups so checking if array arr's 1st element is T and 2nd element is H then print that line.