Looking for single occurrence between '{' and ':' in a large text - regex

I'm new to the Regex world, so please be kind on the tantrums :-)
I would like to print only the first occurrence of a string between { and :.
Example in the following string:
({TRIGGER.VALUE}=0 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>75)
or
({TRIGGER.VALUE}=1 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>65)
I want it to output only Zabbix windows
how is that possible?
I tried {([a-zA-Z0-9 ]*): it is printing : and doing it twice.
Thanks for reading!
Srini

You may use a PCRE regex with -o option (extracting the matches rather than returning the whole lines) to grab the text you need and use head -1 to only have the first match:
s='({TRIGGER.VALUE}=0 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>75) or ({TRIGGER.VALUE}=1 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>65)'
echo "$s" | grep -oP '(?<={)[\w\s]+(?=:)' | head -1
Pattern details:
(?<={) - there must be a { immediately to the left of the current location
[\w\s]+ - 1+ word and/or whitespace chars
(?=:) - there must be a : immediately to the right of the current location.
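If grep -P is not available, the same lookaround pattern can be tried in Python. This is only a sketch: the function name first_between_brace_and_colon is made up for illustration, and re.search returning at the first match stands in for head -1.

```python
import re

s = ("({TRIGGER.VALUE}=0 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>75) or "
     "({TRIGGER.VALUE}=1 and {Zabbix windows:zabbix[process,discoverer,avg,busy].avg(10m)}>65)")


def first_between_brace_and_colon(text):
    """Return the first run of word/space chars sitting between '{' and ':'."""
    # (?<=\{)  a '{' must be immediately to the left
    # [\w\s]+  one or more word or whitespace characters
    # (?=:)    a ':' must be immediately to the right
    match = re.search(r'(?<=\{)[\w\s]+(?=:)', text)
    return match.group(0) if match else None


print(first_between_brace_and_colon(s))
```

{TRIGGER.VALUE} is skipped because the run after { is followed by a . rather than a :.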

Related

How can I delete the rest of the line after the second pipe character "|" for every line with python?

I am using Notepad++ and I want to get rid of everything after the second pipe character (including the second pipe character itself) on every line in my txt file.
Basically, the txt file has the following format:
3.1_1.wav|I like apples.|I like apples|I like bananas
3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....
The result should be:
3.1_1.wav|I like apples.
3.1_2.wav|Isn't today a lovely day?
I have tried using \|.* but then everything after the first pipe character is matched.
In Notepad++ do this:
Find what: ^([^\|]*\|[^\|]*).*
Replace with: $1
check "Regular expression", and "Replace All"
Explanation:
^ - anchor at start of line
( - start group, can be referenced as $1
[^\|]* - scan over any character other than |
\| - scan over |
[^\|]* - scan over any character other than |
) - end group
.* - scan over everything until end of line
in replace reference the captured group with $1
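Since the title mentions Python, here is a rough Python equivalent of the same find/replace. It is a sketch under assumptions: keep_two_fields is a hypothetical name, and \n is added to the negated classes so that a line without enough pipes cannot swallow the next line.

```python
import re


def keep_two_fields(text):
    """Drop everything from the second '|' onward on every line."""
    # ^                    start of each line (because of re.MULTILINE)
    # ([^|\n]*\|[^|\n]*)   first field, a pipe, second field -> group 1
    # .*$                  the rest of the line, which is discarded
    return re.sub(r'^([^|\n]*\|[^|\n]*).*$', r'\1', text, flags=re.MULTILINE)


sample = (
    "3.1_1.wav|I like apples.|I like apples|I like bananas\n"
    "3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|...."
)
print(keep_two_fields(sample))
```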
I'm not sure if this is the best way to do it, but try this:
[^wav]\|.*

Regex match until third occurrence of a char is found, counting occurrence of said char starting from the end of string

Let's dive in : Input :
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
Desired output :
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Starting from the beginning of my string, I need to match until the third occurrence of " _ " (underscore) is found, but I need to count " _ " (underscore) occurrence starting from end of string.
Any tips are appreciated,
Best regards
I believe this regex should do the trick!
^.*?(?=_[^_]*_[^_]*_[^_]*$)
Explanation:
^ the start of the line
.*? matches as few characters as possible (lazy quantifier)
(?=...) asserts that its contents follow our match
_[^_]*_[^_]*_[^_]* Looks for exactly three underscores after our match.
$ the end of the line
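To see the lookahead in action, here is a small Python sketch of the same pattern. The helper name prefix_before_third_last_underscore is invented for this example; the inputs are taken from the question.

```python
import re

# Lazy prefix, followed (without consuming it) by exactly three
# underscore-led chunks that run to the end of the string.
PREFIX = re.compile(r'^.*?(?=_[^_]*_[^_]*_[^_]*$)')


def prefix_before_third_last_underscore(name):
    match = PREFIX.match(name)
    return match.group(0) if match else None


for name in ('p9_rec_tonly_.cr_called.seg',
             'p9_tonly_.cr_called.seg',
             'p26_rec_nor_nor_.cr_called.seg'):
    print(prefix_before_third_last_underscore(name))
```

A name with fewer than three underscores gives no match at all, so the helper returns None in that case.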
You should think beyond regex to solve this problem. For example, if you are using Python just use rsplit with a limit of 3 and get the first resulting string:
>>> data = [
...     'p9_rec_tonly_.cr_called.seg',
...     'p9_tonly_.cr_called.seg',
...     'p10_nor_nor_.cr_called.seg',
...     'p10_rec_tn_.cr_called.seg',
...     'p10_tn_.cr_called.seg',
...     'p26_rec_nor_nor_.cr_called.seg',
...     'p26_rec_tn_.cr_called.seg',
...     'p26_tn_.cr_called.seg',
... ]
>>> for d in data:
...     print(d.rsplit('_', 3)[0])
...
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Bash, you say? Well, it's not a regular expression, but you can do pattern substitution (or stripping) with bash:
while read -r var ; do echo "${var%_*_*_*}" ; done <<EOT
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
EOT
${var%_*_*_*} expands the variable var, stripping the shortest suffix that matches _*_*_*.
Otherwise to perform regex operations in shell, you could normally ask a utility like sed for help and feed your lines through for instance this:
sed -e 's#_[^_]*_[^_]*_[^_]*$##'
or for short:
sed -e 's#\(_[^_]*\)\{3\}$##'
Find three groups of _ and zero or more characters of not _ at the end of line $ replacing them with nothing ('').
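For comparison, the short sed form translates almost directly to Python's re.sub. This is a sketch only, and strip_last_three_chunks is a made-up name.

```python
import re


def strip_last_three_chunks(name):
    """Remove the last three '_'-led chunks, like sed 's#\\(_[^_]*\\)\\{3\\}$##'."""
    # (?:_[^_]*){3}  three groups of '_' followed by zero or more non-'_' chars
    # $              anchored at the end of the string
    return re.sub(r'(?:_[^_]*){3}$', '', name)


print(strip_last_three_chunks('p10_rec_tn_.cr_called.seg'))
print(strip_last_three_chunks('p10_tn_.cr_called.seg'))
```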

Confusion in regex pattern for search

Learning regex in bash, I am trying to fetch all lines which end with .com
Initially I did:
cat patternNpara.txt | egrep "^[[:alnum:]]+(.com)$"
Why: + matches one or more occurrences, so placing it after alnum should fetch any run of digits, letters or signs, but apparently this logic is failing.
Then I did this (purely trial and error, not applying any logic really) and it worked:
cat patternNpara.txt | egrep "^[[:alnum:]].+(.com)$"
What's confusing me: . matches only a single character, so how am I getting the output? I mean, how is it really matching the pattern?
Question: what's the difference between [[:alnum:]]+ and [[:alnum:]].+ (this one has . in it) in the above pattern, and how does it work?
PS: I am looking for an explanation, not a "try it this way" answer :)
Some test lines from the file patternNpara.txt, which are fetched as output:
valid email = abc#abc.com
invalid email = ab#abccom
another invalid = abc#.com
1 : abc,s,11#gmail.com
2: abc.s.11#gmail.com
Looking at your screenshot, it seems you're trying to match email addresses that also contain a # character, which is not included in your regex. You can use this regex:
egrep "[#[:alnum:]]+(\.com)" patternNpara.txt
Difference between the two regexes:
[[:alnum:]] matches only [a-zA-Z0-9]. If you have # or , then you need to include them in the character class as well.
Your 2nd case includes the .+ pattern, which means one or more of ANY CHARACTER.
If you want to match any lines that end with '.com', you should use
egrep ".*\.com$" file.txt
To match all the following lines
valid email = abc#abc.com
invalid email = ab#abccom
another invalid = abc#.com
1 : abc,s,11#gmail.com
2: abc.s.11#gmail.com
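^[[:alnum:]].+(.com)$ will work, but ^[[:alnum:]]+(.com)$ will not. Here are the reasons:
^[[:alnum:]].+(.com)$ matches strings that start with one character from a-zA-Z or 0-9, followed by two or more arbitrary characters (one or more from .+, plus one from the unescaped .), and end with 'com' (not necessarily '.com').
^[[:alnum:]]+(.com)$ matches strings that start with one or more characters from a-zA-Z or 0-9, followed by exactly one character that could be anything, and end with 'com' (not necessarily '.com'). Every test line contains spaces or characters such as = and #, which [[:alnum:]] does not match, so this pattern fails on all of them.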
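The difference is easy to check on the test lines. The sketch below uses Python rather than egrep; Python's re module has no [[:alnum:]], so [A-Za-z0-9] is substituted as an assumption, and all variable names are invented for the example.

```python
import re

lines = [
    "valid email = abc#abc.com",
    "invalid email = ab#abccom",
    "another invalid = abc#.com",
    "1 : abc,s,11#gmail.com",
    "2: abc.s.11#gmail.com",
]

# [A-Za-z0-9] stands in for the POSIX class [[:alnum:]].
with_dot_plus = re.compile(r'^[A-Za-z0-9].+(.com)$')   # one alnum, then anything
plus_only = re.compile(r'^[A-Za-z0-9]+(.com)$')        # alnum only, up to the last 4 chars
literal_dot_com = re.compile(r'\.com$')                # escaped dot: a real '.com'

matched_with_dot_plus = [l for l in lines if with_dot_plus.search(l)]
matched_plus_only = [l for l in lines if plus_only.search(l)]
matched_literal = [l for l in lines if literal_dot_com.search(l)]

print(len(matched_with_dot_plus), len(matched_plus_only), len(matched_literal))
```

Note that the unescaped . lets "ab#abccom" through the first pattern; only the escaped \. rejects it.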
Try this (with "positive-lookahead") :
.+(?=\.com)
Demo :
http://regexr.com?38bo0

How can I use regex to ignore strings if they contain a certain string

I am trying to use regex to scan through some log files. In particular, I am looking to pick out lines that meet this format:
IP address or random number "banned.", so for example, "111.111.111.111 banned." or "0320932 banned.", etc.
There should only be 2 groups of characters (the number/IP address and "banned." There may be more than one space in between the words or before them), the string should also not contain "client", "[private]", or "request". For the most part I am just confused about how to go about detecting the groups of characters and avoiding strings that contain those words.
Thanks for any help that you may have to offer
egrep -v '^ *[0-9]+((\.[0-9]+){3})? +banned\.$'
Allows optional leading spaces at the beginning of the line.
Must be followed by an all-digit sequence OR an IP-like address.
Must be followed by at least one space.
Line must end in 'banned.'
Finally, the -v option ensures that only lines NOT matching the regex are returned.
With these constraints you needn't worry about ruling out additional words such as 'client'.
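To see which lines the pattern itself accepts, here is a Python sketch using the same regex. The sample lines are invented for illustration. With egrep -v you get the second list; without -v you get the first.

```python
import re

# Same pattern as the egrep command above.
BANNED = re.compile(r'^ *[0-9]+((\.[0-9]+){3})? +banned\.$')

# Hypothetical log lines, made up for this example.
sample_lines = [
    "111.111.111.111 banned.",
    "  0320932   banned.",
    "client 111.111.111.111 banned.",
    "111.111.111.111 banned. [private]",
    "0320932 banned",
]

matching = [l for l in sample_lines if BANNED.search(l)]
non_matching = [l for l in sample_lines if not BANNED.search(l)]

print(matching)
print(non_matching)
```

Because the pattern is anchored at both ends, lines with extra words such as "client" or "[private]" fail to match without any explicit exclusion.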
I'm assuming in the following input data lines 1 and 3 should be dropped:
111.111.111.111 banned.
2.2.2.2 wibble
0320932 banned
1434324 wobble
You can drop them with this grep expression:
$ grep -E -v "[0-9.]+ +banned" logfile.log
2.2.2.2 wibble
1434324 wobble
$
This regular expression matches 1 or more numbers and periods followed by 1 or more spaces followed by the word "banned". Passing -v to grep will cause it to display all lines that do not match the regular expression. Add -i to the grep command to make it case-insensitive.
You want a negating match, which looks like:
/^((?!([\d.\s]+banned\.)).)*$/
See it in action: http://regex101.com/r/bY7pK4
Note your example shows a period after banned. If you don't want it, remove \. from the expression.
Try this RegExp
String regex = "\\d+.\\d+.\\d+.\\d+ banned.";
This lets you match both kinds of string.
Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; any name works.
public class BannedDemo {
    public static void main(String[] args) {
        System.out.println("start");
        String src = "657 hi tis is 111.111.111.111 banned. 57 happy i9";
        //String src = "87 working is 0320932 banned. Its ending str 08";
        String regex = "\\d+.\\d+.\\d+.\\d+ banned.";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(src);
        // Print the start offset and text of every match found in src.
        while (matcher.find()) {
            System.out.println(matcher.start() + " : " + matcher.group());
        }
    }
}
Let me know if it is not working for you.
trying to match IP address or random number "banned."
This egrep should work for you:
egrep '(([0-9]{1,3}\.){3}[0-9]{1,3}|[0-9]+) +banned' logfile
The following will work:
\s*\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d\s*banned\s*

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*\'([^\']|\'[^\']*\')*$)/', "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
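For anyone who wants to try the lookahead approach in Python, here is a minimal sketch on the example from the question. It assumes balanced single quotes, as the quoted explanation does; remove_dashes_with_lookahead is a hypothetical name.

```python
import re

# '--' followed by: non-quotes, one quote, then only single characters or
# complete quoted groups up to the end of the string. That is only possible
# when an odd number of quotes remains, i.e. when we are inside a quoted region.
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")


def remove_dashes_with_lookahead(text):
    return INSIDE_QUOTES.sub("", text)


txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_dashes_with_lookahead(txt))
```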
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
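A minimal sketch of that character-by-character scan, in Python. It assumes quotes are never escaped inside a quoted region, and the function name remove_dashes_by_scanning is made up for the example.

```python
def remove_dashes_by_scanning(text):
    """Drop every '--' that appears between a pair of single quotes."""
    out = []
    inside = False  # True while we are between an opening and closing quote
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "'":
            inside = not inside      # every quote toggles the state
            out.append(ch)
            i += 1
        elif inside and text.startswith("--", i):
            i += 2                   # skip the pair of dashes
        else:
            out.append(ch)
            i += 1
    return "".join(out)


txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(remove_dashes_by_scanning(txt))
```

A run like '---' inside quotes comes out as '-', because the scan removes dashes two at a time from the left.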
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside these single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix & 1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.