Conditional matching in Regex [duplicate] - regex

I'm setting up some goals in Google Analytics and could use a little regex help.
Let's say I have 4 URLs:
http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1
http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1
I want to create an expression that will identify any URL that contains the string selector=size but does NOT contain details.cfm
I know that to find a string that does NOT contain another string I can use this expression:
(^((?!details.cfm).)*$)
But, I'm not sure how to add in the selector=size portion.
Any help would be greatly appreciated!

This should do it:
^(?!.*details\.cfm).*selector=size.*$
^.*selector=size.*$ should be clear enough. The first bit, (?!.*details\.cfm), is a negative look-ahead: before matching the string, it checks that the string does not contain "details.cfm" (with any number of characters before it).
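As a quick sanity check, here is a minimal sketch that runs the pattern against the four URLs from the question. It assumes GNU grep built with PCRE support (the -P switch), which is only a stand-in for the Google Analytics engine; filter_urls is a made-up helper name.

```shell
# Hypothetical helper: keep only lines that contain selector=size
# but do not contain details.cfm. Assumes GNU grep with -P (PCRE).
filter_urls() {
  grep -P '^(?!.*details\.cfm).*selector=size.*$'
}

printf '%s\n' \
  'http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1' \
  'http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1' \
  'http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1' \
  'http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1' |
  filter_urls
```

Only the first and third URLs should be printed: the second lacks selector=size and the fourth contains details.cfm.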

^(?=.*selector=size)(?:(?!details\.cfm).)+$
If your regex engine supports possessive quantifiers (though I suspect Google Analytics does not), then I guess this will perform better for large input sets:
^[^?]*+(?<!details\.cfm).*?selector=size.*$

regex could be (perl syntax):
`/^(?!.*details\.cfm).*selector=size.*$/`

There is a problem with the regex in the accepted answer. It also matches abcselector=size, selector=sizeabc etc.
A correct regex can be ^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$
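To see what the \b word boundaries change, here is a small sketch comparing both patterns on a made-up URL whose parameter is named subselector. It assumes GNU grep with -P; the matches helper and the sample URL are hypothetical.

```shell
loose='^(?!.*details\.cfm).*selector=size.*$'
strict='^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$'

# Hypothetical helper: matches PATTERN STRING, prints yes or no.
matches() {
  if printf '%s\n' "$2" | grep -qP -- "$1"; then echo yes; else echo no; fi
}

url='http://www.anydotcom.com/test/search.cfm?subselector=sizes'
matches "$loose" "$url"    # yes: selector=size occurs as a substring
matches "$strict" "$url"   # no: no word boundary before "selector" or after "size"
```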

I was looking for a way to avoid --line-buffered on a tail in a similar situation to the OP's, and Kobi's solution works great for me. In my case, I exclude lines containing either "bot" or "spider" while including ' / ' (for my root document).
My original command:
tail -f mylogfile | grep --line-buffered -v 'bot\|spider' | grep ' / '
Now becomes (with -P perl switch):
tail -f mylogfile | grep -P '^(?!.*(bot|spider)).*\s\/\s.*$'

Related

Select a single character in an alphanumeric string in bash

I have an issue with string manipulation in bash. I have a list of names, each name being composed of two parts, chars and numbers: for example
abcdef01234
I want to cut the last character before the numeric part starts, in this case
f
I think there is a regular expression to help me with this but just can't figure it out. AWK/sed solutions are accepted too. Hope someone can help.
Thank you.
In bash it can be done with parameter expansion with substring removal and string indexes, e.g.,
a=abcdef01234 # your string
tmp=${a%%[0-9]*} # remove everything from the first digit onward
echo ${tmp:(-1)} # output last of remaining chars
Output: f
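The same idea can be wrapped in a small function. This is only a sketch: last_alpha is a made-up name, and instead of the bash-only ${tmp:(-1)} it uses a nested expansion that should also work in a plain POSIX sh.

```shell
# Hypothetical helper: print the last character before the first digit.
last_alpha() {
  prefix=${1%%[0-9]*}                        # drop everything from the first digit onward
  printf '%s\n' "${prefix#"${prefix%?}"}"    # strip all but the final character
}

last_alpha abcdef01234   # prints: f
```

If the name starts with a digit, the prefix is empty and an empty line is printed.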
You can use a regexp like [a-zA-Z]+([a-zA-Z])[0-9]+. If you know how to use sed, it is pretty easy.
Check https://regex101.com/r/XCkKM5/1
The match will be the letter you want.
^\w+([a-zA-Z])\d+$
As a sed command (on OSX) this will be :
echo "abcdef12345" | sed -E "s#^[a-zA-Z]+([a-zA-Z])[0-9]+\$#\1#"
Try the following too:
echo "abcdef01234" | awk '{match($0,/[a-zA-Z]+/);print substr($0,RSTART+RLENGTH-1,1)}'
I assume "I have a list of names" means a file, here called file. Using grep's PCRE mode and a (positive) lookahead:
$ grep -oP "[a-z](?=[^a-z])" file
f
It prints out the first (lowercase) letter followed by a non-(lowercase)-letter.

How can I use sed to regex string and number in bash script

I want to separate string and number in a file to get a specific number in bash script, such as:
Branches executed:75.38% of 1190
I want to get only the number
75.38
I have tried the code below
$new_value=value | sed -r 's/.*_([0-9]*)\..*/\1/g'
but it was incorrect and failed.
How should it work? Thank you in advance for your help.
You can use the following regex to extract the first number in a line:
^[^0-9]*\([0-9.]*\).*$
Usage:
% echo 'Branches executed:75.38% of 1190' | sed 's/^[^0-9]*\([0-9.]*\).*$/\1/'
75.38
Give this a try:
value=$(sed "s/^Branches executed:\([0-9][.0-9]*[0-9]*\)%.*$/\1/" afile)
It is assumed that the line appears only once in afile.
The value is stored in the value variable.
There are several things here that we could improve. One is that you need to escape the parentheses in sed: \(...\)
Another one is that it would be good to have a full specification of the input strings as well as a good script that can help us to play with this.
Anyway, this is my first attempt:
Update: I added a little more bash around this regex so it'll be easier to play with:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]*\.[0-9]*\).*/\1/g'`
echo $new_value
Update 2: as john pointed out, it will match only numbers that contain a decimal dot. We can fix it with an optional group: \(\.[0-9]\+\)\?.
An explanation for the optional group:
\(...\) is a group.
\(...\)\? is a group that appears zero or one times (mind the escaped question mark).
\.[0-9]\+ is the pattern for a dot and one or more digits.
Putting all together:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]\+\(\.[0-9]\+\)\?\).*/\1/g'`
echo $new_value
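For comparison, the same optional-group pattern can be written with extended regular expressions (sed -E), where the parentheses, + and ? need no backslashes. This is a sketch, assuming a sed that accepts -E (GNU and BSD sed both do); first_number is a made-up helper and the second input line is invented to show the integer case.

```shell
# Hypothetical helper: print the first integer or decimal number on each line.
first_number() {
  sed -E 's/[^0-9]*([0-9]+(\.[0-9]+)?).*/\1/'
}

echo 'Branches executed:75.38% of 1190' | first_number   # prints: 75.38
echo 'Lines executed:80% of 500' | first_number          # prints: 80
```

A line without any digits does not match the pattern and is printed unchanged.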

String simplification for regex (BASH)

I'm looking for a way to simplify multiple strings for the purpose of regular expression searching, Here's an example:
I have a list of several thousand strings, similar to the ones below (text.#######):
area.202264
area.202265
area.202266
area.202267
area.202268
area.202269
area.202270
area.204517
area.204518
area.204519
area.207171
area.207338
area.208842
I've been trying to figure out an automated way to simplify it into something like this:
area.20226(4|5|6|7|8|9)|area.202270|area.20451(7|8|9)|area.207171|area.207338|area.208842
The purpose of this would be to reduce string length when searching these areas. I have absolutely no idea how to approach something like this in a simple, re-usable way.
Thanks in advance! Any solutions or tips on where to start would be appreciated :)
echo "area.202264 area.202265 area.202266 area.202267 area.202268 area.202269 area.202270 area.204517 area.204518 area.204519 area.207171 area.207338 area.208842" | tr ' ' '\n' > list.txt
cat list.txt | grep -v "^$" | sed -e "s/[0-9] *$//g" | sort -u | while read p; do l=`grep $p list.txt | sed -e "s/.*\([0-9]\)$/\1/g" | xargs | tr ' ' '|'` ;echo "$p($l)" ; done | sed -e "s/(\(.\))/\1/g"| xargs| tr ' ' '|'
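The same last-digit grouping can be sketched with a single awk pass, which may be easier to read and reuse. This is only an illustration under some assumptions: simplify is a made-up name, the input has one string per line, strings are grouped only on their final character, and the dot is left unescaped as in the question's desired output.

```shell
# Hypothetical helper: group strings that differ only in their last character.
simplify() {
  awk '
    length($0) == 0 { next }                      # skip empty lines
    {
      prefix = substr($0, 1, length($0) - 1)      # everything except the last character
      last   = substr($0, length($0), 1)          # the last character
      if (prefix in digits) {
        digits[prefix] = digits[prefix] "|" last
      } else {
        order[++n] = prefix                       # remember first-seen order
        digits[prefix] = last
      }
    }
    END {
      for (i = 1; i <= n; i++) {
        p = order[i]
        # Several endings become prefix(a|b|c); a single ending stays as is.
        if (index(digits[p], "|") > 0) item = p "(" digits[p] ")"
        else item = p digits[p]
        out = (i == 1) ? item : (out "|" item)
      }
      print out
    }'
}

printf '%s\n' area.202268 area.202269 area.202270 | simplify
# prints: area.20226(8|9)|area.202270
```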
Put the search strings into a file named "filter", one per line:
area.202264
area.202265
area.202266
area.202267
Then you can search fast enough with:
fgrep -f filter file-to-search-in
I see no easy way to produce a regexp from samples, and I'm not sure the regexp approach will be faster.
Here are a couple of things you should know:
Nearly all regex engines build a state machine from their patterns. You can probably just put the various names between vertical bars and get good performance. (It won't look nice, but it will work.)
That is, something like:
(area.202264|area.202265|area.202266|...|area.207338|area.208842)
Even with 4k items, the right engine will just compile it down. (I don't think bash will handle it, because of the length. But perl, grep, fgrep as mentioned elsewhere can do it.)
You say "BASH", so it's worth pointing out there is a difference between regex and file globbing. If the things you are working with are text, then regex (^area.\d+$) is the way to go. If the things you are working with are filenames, then globbing (*.c) has different rules.
You can simplify greatly if you don't care at all about the numbers, only the format. For regexes:
area\.\d+ # area, dot, one or more digits (0-9)
area\.\d{1,6} # area, dot no less than 1, no more than 6 digits
area\.\d{6} # area, dot, exactly 6 digits
area\.20[234]\d{3} # area, dot, 20 {2,3,4} then 3 more digits
If you can use Perl and the Regexp::Assemble module, it can convert multiple patterns into a single, optimized, regular expression. For instance, using it on the list of strings in the question yields:
(?-xism:area\.20(?:22(?:6[456789]|70)|7(?:171|338)|451[789]|8842))
That only works if the database plugin can accept Perl regular expressions.

Regex to extract everything until it encounters a number after a slash

I am looking to extract everything from a string but ignore everything after encountering a number after a slash (alphanumeric segments are allowed).
Examples:
http://www.test.com/products/cards/product_code100/12345/something_else
http://www.test.com/products/123abc/45678/
Desired output -
http://www.test.com/products/cards/product_code100/
http://www.test.com/products/123abc/
The following regex gives me everything in backreferences but it'll be great if I could get rid of numbers after a slash-
^(.*:)//([a-z\-.]+)(:[0-9]+)?(.*)
Additional Information - Language independent regex needed.
Many Thanks
this should work with most languages and should produce the desired output
(http://.*)(?=/\d+(?!\w+))
It takes every character until it finds (via lookahead) a / followed by a number.
If you'd try to match
http://www.test.com/products/123abc/
or
http://www.test.com/products/123abc
it just would not find a match, and you could be sure the string checked doesn't contain a pure number after a slash.
Example in Perl:
echo "http://...." | perl -pe 's/(.*\/)\d+\/.*/$1/'
or:
echo "http://...." | perl -ne 'print "$1\n" if /(.*\/)\d+\/.*/'
Edit: It's true what @creinig noted in his comment - there is no such thing as generic regex. Nonetheless, Perl is widely used, so it's an option.
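If Perl is not at hand, roughly the same substitution can be sketched with sed -E. This assumes a sed that supports -E and URLs with a single purely numeric path segment; with several numeric segments, the greedy match would cut at the last one. strip_after_numeric is a made-up name.

```shell
# Hypothetical helper: cut the URL right before the purely numeric path segment.
strip_after_numeric() {
  sed -E 's#^(.*/)[0-9]+(/.*)?$#\1#'
}

echo 'http://www.test.com/products/cards/product_code100/12345/something_else' | strip_after_numeric
# prints: http://www.test.com/products/cards/product_code100/
echo 'http://www.test.com/products/123abc/45678/' | strip_after_numeric
# prints: http://www.test.com/products/123abc/
```

A URL without a purely numeric segment does not match and is printed unchanged.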
