How to exactly repeat the n matched pattern in result string - regex

How to exactly repeat the n matched pattern in result string?
For example, suppose I have the following text:
++ '[' -f /etc/bashrc ']'
++ . /etc/bashrc
+++ '[' '[\u#\h \W]\$ ' ']'
+++ '[' -z 'printf "\033]0;%s#%s:%s\007" "${USER}" "${HOSTNAME%%.*}" "${PWD/#$HOME/~}"' ']'
+++ shopt -s checkwinsize
+++ '[' '[\u#\h \W]\$ ' = '\s-\v\$ ' ']'
+++ shopt -q login_shell
+++ '[' 506 -gt 199 ']'
++++ id -gn
Now I want to substitute three spaces for every '+', but only for the run of '+' at the beginning of the line. I would use something like :<range>s/^<pattern> or :%s/+/   /g, but if there were a '+' in the rest of the text I would simply mess it up.
The question:
How do I match every + at the beginning of the line and repeat the replacement once for each + found?
Expected (each leading + becomes three spaces):
^++$ -> ^      $
^+++$ -> ^         $
^+$ -> ^   $
Thanks

Try this:
:%s/^+*/\=repeat('   ',strlen(submatch(0)))/
submatch(0) contains all the + characters matched at the start of the line, and strlen counts them. repeat then inserts three spaces for every plus sign at the start of the line.
For more information:
:help sub-replace-expression
:help repeat()
:help submatch()
:help strlen()
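Outside Vim, the same count-and-repeat idea can be sketched in plain shell. This is only an illustration: the function name indent_plus is made up here, and it assumes each leading + should become exactly three spaces, as in the question.

```sh
# Hypothetical helper (not from the answers above): replace each leading '+'
# with three spaces and leave any later '+' untouched.
indent_plus() {
    while IFS= read -r line; do
        pad=
        # Peel off one leading '+' at a time, adding three spaces per '+'.
        while [ "${line#[+]}" != "$line" ]; do
            line=${line#[+]}
            pad="$pad   "
        done
        printf '%s%s\n' "$pad" "$line"
    done
}

printf '%s\n' "++ . /etc/bashrc" "+++ '[' 506 -gt 199 ']'" | indent_plus
```

The loop counts the leading plus signs one at a time, so a + that appears later in the line is never touched.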

An elegant substitution command for this case is the following:
:%s/\%(^+*\)\@<=+/   /g

I think you'll have to run the substitution several times, if that is acceptable.
You'll want to search for something like this (minus the single quotes, which are only there to show whitespace):
'^(\s*)\+'
and replace it with something like (again minus the single quotes):
'$1   '
Not every problem that can be solved with regular expressions can be solved with a single regular expression, and I'm pretty sure this is one of those cases.
This expression/replacement pair needs to be run once for each plus sign on the line with the most leading plus signs (four times in your example above). N.B.: as written, this will mess up any lines that are supposed to begin with whitespace followed by plus signs, so I hope that doesn't happen anywhere.
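If a sed with labels and the t command is available, the repeated runs can be automated with a loop. A minimal sketch, assuming three spaces per + and carrying the same caveat about lines that begin with whitespace followed by a plus sign (strip_plus is a name invented for this example):

```sh
# Repeat the substitution until no '+' is left in the leading run.
# ':a' defines a label; 'ta' jumps back to it whenever the s command succeeded.
strip_plus() {
    sed -e ':a' -e 's/^\( *\)+/\1   /' -e 'ta'
}

printf '%s\n' "++ . /etc/bashrc" "++++ id -gn" | strip_plus
```

Each pass turns one more leading + into three spaces; the loop stops as soon as the spaces at the start of the line are no longer followed by a +.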

Related

Bash parameter substitution mess (pdfgrep, regex, newline and more)

I need to match a pattern across multiple lines with pdfgrep
pdfgrep -in -C line 'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE' ~/temp.pdf
works ok and outputs
12: CHAPTER 1
THIS IS THE TITLE
Now
$ pattern="CHAPTER 1 - THIS IS THE TITLE"
$ echo "'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'"
'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE'
$ pdfgrep -in -C line "'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'" ~/temp.pdf
doesn't work anymore; it gives me nothing. I guess there is something going on with the parameter substitution, but I can't figure out what's happening. Can anyone help?
Background info:
From "man pdfgrep"
pdfgrep works much like grep, with one distinction: It operates on pages and not on lines.
"." matches any character, line breaks INCLUDED.
You are using extra ' characters:
"'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'"
^ ^ ^ ^
Also, you are using $'\n' and $' ' inside double quotes, and this prevents their expansion.
The correct expression is:
"${pattern:0:9}"[$'\n'][$' ']*"${pattern:12:${#pattern}}"
In fact:
$ echo 'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE'
CHAPTER 1[
][ ]*THIS IS THE TITLE
$ pattern="CHAPTER 1 - THIS IS THE TITLE"
$ echo "${pattern:0:9}"[$'\n'][$' ']*"${pattern:12:${#pattern}}"
CHAPTER 1[
][ ]*THIS IS THE TITLE
Note that echo prints the same output for both expressions (if you did things right, echo should not print a Bash expression; it should print the final string).
It's not required, but as a best practice you should quote the *, [ and ] characters (thanks chepner for noticing). Also, $' ' is pretty useless here:
"${pattern:0:9}["$'\n'"][ ]*${pattern:12:${#pattern}}"
^ ^ ^
This will prevent glob expansion (which is unlikely to happen in your case, but still something to care about).
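The difference between the working and the failing form can be checked without pdfgrep by comparing the strings Bash actually produces. A small sketch, assuming Bash (for $'\n' and substring expansion); the variable names literal, built and broken are made up for this illustration:

```bash
#!/usr/bin/env bash
pattern="CHAPTER 1 - THIS IS THE TITLE"

# The hand-written pattern, with the bracket characters quoted.
literal='CHAPTER 1['$'\n''][ ]*THIS IS THE TITLE'

# Built from the variable: $'\n' stays outside the double quotes so it expands.
built="${pattern:0:9}["$'\n'"][ ]*${pattern:12}"

# The failing form: inside double quotes, ' and $'\n' are kept literally.
broken="'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12}'"

[[ $built == "$literal" ]] && echo "built matches the hand-written pattern"
[[ $broken != "$literal" ]] && echo "broken differs from it"

# printf %q shows each string unambiguously, including the embedded newline.
printf '%q\n' "$built" "$broken"
```

The string in built is what would then be passed as the pattern argument, for example pdfgrep -in -C line "$built" ~/temp.pdf.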
$'\n' doesn't expand to a line feed when it appears inside a double-quoted string:
prompt $ echo "$'\n'"
$'\n'
prompt $ echo $'\n'
Don't put double quotes around the $'\n' part of the string:
prompt $ a='abcd'$'\n''efgc'
prompt $ echo "$a"
abcd
efgc
P.S. Your regular expression looks very strange. Why do you use square brackets around the \n and \s?

How to match until the last occurrence of a character in bash shell

I am using curl and cut on output like the one below.
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | grep - | head -n1 | cut -d'-' -f -1, -3)
The variable var gets one of two kinds of values (one at a time).
HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b
HIX-R1-1-3b5126629f67892110165c524gbc5d5g1808c9b5
I am actually trying to get everything up to the last '-', i.e. HIX_MAIN or HIX-R1-1.
The command shown works fine to get HIX-R1-1.
But I figured out this is the wrong approach when there is only one - in the variable: it gives me the entire variable value (e.g. HIX_MAIN-7ae526629f6939f717165c526dad3b7f0819d85b).
How do I go about getting everything up to the last '-' into the variable var?
This removes everything from the last - to the end:
sed 's/\(.*\)-.*/\1/'
As examples:
$ echo HIX_MAIN-7ae52 | sed 's/\(.*\)-.*/\1/'
HIX_MAIN
$ echo HIX-R1-1-3b5126629f67 | sed 's/\(.*\)-.*/\1/'
HIX-R1-1
How it works
The sed substitute command has the form s/old/new/ where old is a regular expression. In this case, the regex is \(.*\)-.*. This works because the .* inside \(.*\)- is greedy: it matches everything up to the last -. Because of the escaped parens, \(...\), everything before the last - is saved in group 1, which we can refer to as \1. The final .* matches everything after the last -. Thus, as long as the line contains a -, this regex matches the whole line and the substitute command replaces the whole line with \1. A line with no - does not match and is left unchanged.
You can use bash string manipulation:
$ foo=a-b-c-def-ghi
$ echo "${foo%-*}"
a-b-c-def
The operators, # and % are on either side of $ on a QWERTY keyboard, which helps to remember how they modify the variable:
#pattern trims off the shortest prefix matching "pattern".
##pattern trims off the longest prefix matching "pattern".
%pattern trims off the shortest suffix matching "pattern".
%%pattern trims off the longest suffix matching "pattern".
where pattern matches the bash pattern matching rules, including ? (one character) and * (zero or more characters).
Here, we're trimming off the shortest suffix matching the pattern -*, so ${foo%-*} will get you what you want.
Of course, there are many ways to do this using awk or sed, possibly reusing the sed command you're already running. Variable manipulation, however, can be done natively in bash without launching another process.
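Applied to the two kinds of values from the question (shortened here for readability), the four operators behave as follows. This is just a quick illustration; the variable names are arbitrary.

```sh
# Shortest suffix matching '-*' removed: everything up to the last '-'.
for var in HIX_MAIN-7ae526629f69 HIX-R1-1-3b5126629f67; do
    printf '%s -> %s\n' "$var" "${var%-*}"
done
# HIX_MAIN-7ae526629f69 -> HIX_MAIN
# HIX-R1-1-3b5126629f67 -> HIX-R1-1

v=HIX-R1-1-3b51
echo "${v%%-*}"   # longest suffix removed:  HIX
echo "${v#*-}"    # shortest prefix removed: R1-1-3b51
echo "${v##*-}"   # longest prefix removed:  3b51
```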
You can reverse the string with rev, cut from the second field and then rev again:
rev <<< "$VARIABLE" | cut -d"-" -f2- | rev
For HIX-R1-1----3b5126629f67892110165c524gbc5d5g1808c9b5, prints:
HIX-R1-1---
I think you should be using sed, at least after the tr:
var=$(curl https://avc.com/actuator/info | tr '"' '\n' | sed -n '/-/{s/-[^-]*$//;p;q}')
The -n means "don't print by default". The /-/ looks for a line containing a dash; it then executes s/-[^-]*$// to delete the last dash and everything after it, followed by p to print and q to quit (so it only prints the first such line).
I'm assuming that the output from curl intrinsically contains multiple lines, some of them with unwanted double quotes in them, and that you need to match only the first line that contains a dash at all (which might very well not be the first line). Once you've whittled the input down to the sole interesting line, you could use pure shell techniques to get the result that's desired, but getting the sole interesting line is not as trivial as some of the answers seem to be assuming.
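To try the tr and sed part without hitting the network, the pipeline can be fed a stand-in string. The JSON below is invented for this sketch (the question does not show the real curl output), and first_dash_prefix is a made-up name:

```sh
# Hypothetical stand-in for the curl output; the real JSON shape is not shown
# in the question.
sample='{"build":{"version":"HIX_MAIN-7ae526629f6939f7","name":"demo-app"}}'

# Split on double quotes, then print the first dash-containing line with its
# last dash and everything after it removed.
first_dash_prefix() {
    tr '"' '\n' | sed -n '/-/{s/-[^-]*$//;p;q;}'
}

var=$(printf '%s\n' "$sample" | first_dash_prefix)
echo "$var"
```

Because of the q, the later line demo-app is never reached, which is the "first such line" behaviour described above.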

How can I match square bracket in regex with grep?

I am trying to match both [ and ] with grep, but only succeeded to match [. No matter how I try, I can't seem to get it right to match ].
Here's a code sample:
echo "fdsl[]" | grep -o "[ a-z]\+" #this prints fdsl
echo "fdsl[]" | grep -o "[ \[a-z]\+" #this prints fdsl[
echo "fdsl[]" | grep -o "[ \]a-z]\+" #this prints nothing
echo "fdsl[]" | grep -o "[ \[\]a-z]\+" #this prints nothing
Edit: My original regex, on which I need to do this, is this one:
echo "fdsl[]" | grep -o "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]\+"
#this prints nothing
N.B: I have tried all the answers from this post but that didn't work on this particular case. And I need to use those brackets inside [].
According to BRE/ERE Bracketed Expression section of POSIX regex specification:
[...] The right-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
and
[...] If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.
Therefore, your regex should be:
echo "fdsl[]" | grep -Eo "[][ a-z]+"
Note the E flag, which specifies to use ERE, which supports + quantifier. + quantifier is not supported in BRE (the default mode).
The solution in Mike Holt's answer "[][a-z ]\+" with escaped + works because it's run on GNU grep, which extends the grammar to support \+ to mean repeat once or more. It's actually undefined behavior according to POSIX standard (which means that the implementation can give meaningful behavior and document it, or throw a syntax error, or whatever).
If you are fine with the assumption that your code can only be run on GNU environment, then it's totally fine to use Mike Holt's answer. Using sed as example, you are stuck with BRE when you use POSIX sed (no flag to switch over to ERE), and it's cumbersome to write even simple regular expression with POSIX BRE, where the only defined quantifier is *.
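The two placement rules quoted above (']' first, '-' last) can be checked quickly with grep -E. A short sketch; the input strings are made up for the illustration:

```sh
# ']' first and '-' last: matches runs of brackets, lowercase letters,
# spaces and hyphens.
echo "fdsl[]-x" | grep -Eo '[][a-z -]+'
# fdsl[]-x

# In a negated class, ']' goes right after the '^'.
# This prints the three pieces outside and between the brackets.
echo "ab[cd]ef" | grep -Eo '[^][]+'
```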
Original regex
Note that grep consumes the input file line by line, then checks whether the line matches the regex. Therefore, even if you use P flag with your original regex, \n is always redundant, as the regex can't match across lines.
While it is possible to match horizontal tab without P flag, I think it is more natural to use P flag for this task.
Given this input:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89"
fds l[]kSAJD<>?,./:";'{}|[]\!##$%^&*()_+-=~`89
The original regex in the question works with little modification (unescape + at the end):
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\t\na-zA-Z\/:\.0-9_~\"'+,;*\=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
Though we can remove \n (since it is redundant, as explained above), and a few other unnecessary escapes:
$ echo -e "fds\tl[]kSAJD<>?,./:\";'{}|[]\\!##$%^&*()_+-=~\`89" | grep -Po "[ \[\]\ta-zA-Z/:.0-9_~\"'+,;*=()$\!##&?-]+"
fds l[]kSAJD
?,./:";'
[]
!##$
&*()_+-=~
89
One issue is that [ is a special character in the expression and cannot be escaped with \ (at least not in my flavors of grep). The solution is to write it as [[].
According to regular-expressions.info:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
... and ...
The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
So, assuming that the particular flavor of regular expressions syntax supported by grep conforms to this, then I would have expected that "[ a-z[\]]\+" should have worked.
However, my version of grep (GNU grep 2.14) only matches the "[]" at the end of "fdsl[]" with this regex.
However, I tried the other technique mentioned in that quote (putting the ] in a position within the character class where it cannot take on its special meaning), and it seems to have worked:
$ echo "fdsl[]" | grep -o "[][a-z ]\+"
fdsl[]

Shell script linux, validating integer

This code checks whether a character is an integer or not (I think). I found it on the internet and I'm trying to understand what each part of the line means. I have checked the grep man pages, but it's really difficult for me. Could anyone explain the grep part, and what each piece of it means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall, it matches strings like +123, -98765, or just 9.
Here -E enables extended regex support, and -q makes grep quiet: it prints nothing and only sets the exit status.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful; with it you can easily break down
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: escape character (because + means something; see below)
+|-: plus OR minus sign
?: 0 or 1 repetition
[0-9]: range of digits from 0 to 9
+: one or more repetitions
$: end of line (no more characters allowed)
So it accepts strings like -312353243, +1243, or 5678,
but does not accept 3 456, 6.789, or 56$ (with a trailing dollar sign).
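The accepted and rejected cases listed in the answers above can be run through the pure-Bash form of the check. A minimal sketch, assuming Bash (for [[ =~ ]]); the function name is_integer is made up here:

```bash
#!/usr/bin/env bash
# Returns success (0) if the argument is an optionally signed run of digits.
is_integer() {
    local re='^(\+|-)?[0-9]+$'
    [[ $1 =~ $re ]]
}

for s in +123 -98765 9 "3 456" 6.789 '56$' ""; do
    if is_integer "$s"; then
        printf '[%s] is an integer\n' "$s"
    else
        printf '[%s] is not an integer\n' "$s"
    fi
done
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ matters: a quoted right-hand side would be matched as a literal string.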

RegEx, colon separated list

I am trying to match a list of semicolon-separated emails. To keep things simple, I am going to leave the email expression out of the mix and match each item as any number of characters with no spaces in between them.
The following will be matched...
somevalues ;somevalues; somevalues;
or
somevalues; somevalues ;somevalues
The ending ; shouldn't be necessary.
The following would not be matched.
somevalues ; some values somevalues;
or
some values; somevalues some values
I have gotten this far, but it doesn't work. Since I allow spaces around the semicolons, the expression can't tell whether a space is inside a word or next to a semicolon.
([a-zA-Z]*\s*\;?\s*)*
The following is matched (which it shouldn't be):
somevalue ; somevalues some values;
How do I make the expression only allow spaces if there is a ; to the left or right of it?
Why not just split on the semicolon and then regex out the email addresses?
This following PCRE Expression should work.
\w+\s*(?:(?:;(?:\s*\w+\s*)?)+)?
However, adding email address validation to this will require
replacing \w+ with (?:<your email validation regex>)
This is probably exactly what you want; tested on http://regexr.com?2rnce
EDIT: Depending on the language, you might need to escape ; as \;
The problem comes from the ? in \;?
[a-zA-Z]*(\s*;\s*[a-zA-Z]*)*
should work.
Try
([a-zA-Z]+\s*;\s*)*([a-zA-Z]+\s*)?
Note that I changed * to + on the e-mail pattern since I assume you don't want strings like ; to match.
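The examples from the question can be used to check a pattern of this shape. The sketch below is an assumption-laden variant, not any answerer's exact regex: it is anchored with ^ and $, uses [[:space:]] instead of \s so that it stays within POSIX ERE, and allows one optional trailing semicolon. The names re and is_list are made up.

```sh
# Words separated by ';' with optional whitespace around each ';',
# plus an optional trailing ';'. No whitespace is allowed inside a word.
re='^[a-zA-Z]+([[:space:]]*;[[:space:]]*[a-zA-Z]+)*[[:space:]]*;?[[:space:]]*$'

is_list() {
    printf '%s\n' "$1" | grep -Eq "$re"
}

for s in "somevalues ;somevalues; somevalues;" \
         "somevalues; somevalues ;somevalues" \
         "somevalues ; some values somevalues;" \
         "some values; somevalues some values"; do
    if is_list "$s"; then echo "match:    $s"; else echo "no match: $s"; fi
done
```

Anchoring is what makes the test meaningful: without ^ and $, an unanchored pattern can succeed on just a part of a bad line.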
To solve this with a regex, you must prepend and append the delimiter to the input line; otherwise you cannot easily detect the first and last items.
#!/bin/bash
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" =~ ":$needle:" ]]
then
    echo found
else
    echo not found
fi
# -> found
This takes 45 nanoseconds.
Bash globbing is faster, at 35 nanoseconds:
input=a:aa:aaa:aaaa
needle=aa
if [[ ":$input:" == *":$needle:"* ]]
then
    echo found
else
    echo not found
fi
# -> found
Naive solution: split by delimiter and match whole lines. This one is really slow, at 5100 nanoseconds:
echo a:aa:aaa:aaaa | tr ':' $'\n' | grep "^aa$"
# -> aa