QUERY RegEx and & - regex

I've got a long google sheets QUERY, part of which is this:
=QUERY(LOOKUP!$A$4:$H,"Select count(B) where UPPER(D) matches 'OK' and UPPER(H) matches '.*(?:^|,|,\s)"&REGEXEXTRACT(REGEXREPLACE($Q3,"\s|-","")," \w+ ")&"(?:,\s|,|$).*' and (UPPER(C) contains '"&REGEXEXTRACT($Q3, "\{(\w+)\}")&"' or UPPER(F) contains '"&REGEXEXTRACT($Q3, "\{(\w+)\}")&"') limit 1 label count(B) ''",0)
Basically if I have an entry like apple {pear}, I only want the apple bit to be matched as part of the query. This works absolutely fine except if I put an & in the bit to match eg. apple&banana {pear}the match fails even though apple&pearis definetely present in the lookup so I think the issue is with my RegEx. I've tried just replacing \w+ selector at the seperated spot in in the RegEx with .* above but no luck.
Any help would be much appreciated

It would make more sense to replace \w+ with \w+(?:&\w+)*.
The \w+(?:&\w+)* pattern matches
\w+ - 1 or more word chars, letters, digits or _
(?:&\w+)* - matches 0 or more occurrences of:
& - a & char
\w+ - 1 or more word chars, letters, digits or _
If you do not want to match _, use
[A-Za-z0-9]+(?:&[A-Za-z0-9]+)*
Note that [A-Za-z0-9&]+ can also be used if you do not care if there are consectuvie & chars in the input (and want to match it).

Related

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is explanation of your regex pattern if you only want the solution, then go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group1: match the word scan followed by a literal -, you escaped the hyphen with a \, but if you keep it without escaping it also means a literal - in this case, so you don't have to escape it here, the - followed by \d+ which means one more digit from 0-9 there must be at least one digit, then the value inside the group will be saved inside the first capturing group.
(?:\w)+ non-capturing group, \w one character which is equal to [A-Za-z0-9_], but the the plus + sign after the non-capturing group (?:\w)+, means match the whole group one or more times, the group contains only \w which means it will match one or more word character, note the non-capturing group here is redundant and we can use \w+ directly in this case.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan will match the word scan in scan-02 and the \- will match the hyphen after scan scan-, the \d+ which means match one or more digit at first it will match the 02 after scan- and the value would be scan-02, then the (?:\w)+ part, the plus + means match one or more word character, at least match one, it will try to match the period . but it will fail, because the period . is not a word character, at this point, do you think it is over ? No , the regex engine will return back to the previous \d+, and this time it will only match the 0 in scan-02, and the value scan-0 will be saved inside the first capturing group, then the (?:\w)+ part will match the 2 in scan-02, but why the engine returns back to \d+ ? this is because you used the + sign after \d+, (?:\w)+ which means match at least one digit, and one word character respectively, so it will try to do what it is asked to do literally.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) will match scan-2, (?:\w)+ will try to match the period after scan-2 but it fails and this is the important point here, then it will go back to the beginning of the string scan-2.shadowserver.org and try to match (scan\-\d+) again but starting from the character c in the string , so s in (scan\-\d+) faild to match c, and it will continue trying, at the end it will fail.
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?), Group1: will capture the word scan, followed by a literal -, followed by \d+ one or more digits, followed by an optional small letter [a-z]? the ? make the [a-z] part optional, if not used, then the [a-z] means that there must be only one small letter.
See regex demo

Notepad++ Search for and replace Underscore Characters in "GUIDs"

A colleague has written some C# code that outputs GUIDs to a CSV file. The code has been running for a while but it has been discovered that the GUIDs contain underscore characters, instead of hyphens :-(
There are several files which have been produced already and rather than regenerate these, I'm thinking that we could use the Search and Replace facility in Notepad++ to search across the files for "GUIDs" in this format:
{89695C16_C0FF_4E7C_9BB2_8B50FAC9D371}
and replace it with a properly formatted GUID like this:
{89695C16-C0FF-4E7C-9BB2-8B50FAC9D371}.
I have a RegEx to find the offending GUIDs (probably not very efficient):
(([A-Z]|[0-9]){8}_)(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4}_(([A-Z]|[0-9]){12}))
but I don't know what RegEx to use to replace the underscores with. Does anybody know how to do this?
You can use the following solution:
Find What: (?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\}))[a-f\d]*\K_
Replace with: -
Match case: OFF
See the settings and demo:
See the regex demo online. Details:
(?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\})) - either the end of the previous match or a { char immediately followed with eight alphanumeric chars, four repetitions of an underscore and then four alphanumeric chars and then eight alphanumeric chars and a } char
[a-f\d]* - zero or more alphanumeric chars
\K - match reset operator that discards the text matched so far from the overall match memory buffer
_ - an underscore.
You can match the pattern with 5 capture groups where you would match the underscores in between.
Then you can use the capture groups in the replacement with $1-$2-$3-$4-$5
{\K([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})(?=})
{ Match {
\K Clear the match buffer (forget what is matched so far)
([A-Z0-9]{8})_ Capture group 1, match 8 times a char A-Z0-9
([A-Z0-9]{4})_ Capture 4 times a char A-Z0-9 in group 2
([A-Z0-9]{4})_ Same for group 3
([A-Z0-9]{4})_ Same for group 4
([A-Z0-9]{12}) Capture 12 times a char A-Z0-9 in group 5
(?=}) Positive lookahead, assert } to the right
Regex demo
If the pattern should also match without matching the curly's { and } you can append word boundaries
\b([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})\b
Regex demo

How to extract a word that could possibly be followed with another word

I want to extract [games, games, things, things] from
the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?
Which will give me the extra "freq"
And some other combinations, but I couldn't figure out how to get just after the first hyphen.
You can use
Today_(\w+?)(?:_freq)?$
See the regex demo. This matches Today_, then captures any one or more word chars (as few as possible) into Group 1 (with (\w+?)), and then (?:_freq)?$ matches an optional occurrence of a _freq substring and asserts the position at the end of string.
Or,
Today_([^\W_]+)
See this regex demo.
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).

Using regex replacement in Sublime 3

I am trying to use replace in Sublime using regular expressions but I'm stuck. I tried various combinations but don't seem to be getting there.
This is the input and my desired output:
Input: N_BBP_c_46137_n
Output : BBP
I tried combinations of:
[^BBP]+\b
\*BBP*+\g
But none of the above (and many others) don't seem to work.
To turn N_BBP_c_46137_n into BBP and according to the comment just want that entire long name such as N_BBP_ to be replaced by only BBP* you might also use a capture group to keep BBP.
\bN_(BBP)_\S*
\bN_ Match N preceded by a word boundary
(BBP) Capture group 1, match BBP (or use [A-Z]+ to match 1+ uppercase chars)
_\S* Match _ followed by 0+ times a non whitespace char
In the replacement use the first capturing group $1
Regex demo
You may use
(N_)[^_]*(_c_\d+_n)
Replace with ${1}some new value$2.
Details
(N_) - Group 1 ($1 or ${1} if the next char is a digit): N_
[^_]* - any 0 or more chars other than _
-(_c_\d+_n) - Group 2 ($2): _c_, 1 or more digits and then _n.
See the regex demo.

regextract in google sheets to find twitter usernames

I have a spreadsheet of tweets and want to isolate username mentions in Google Sheets. Somehow, regexps that work in R or other languages are not doing the job there.
An example:
RT #Neromoto: #cazainfractor inconsciente agresiva y poco ciudadana conductora
Desired output:
#Neromoto
#cazainfractor
I have tried this: REGEXEXTRACT(B1,(^|[^#\w])#(\w{1,15})\b).
First of all, your (^|[^#\w])#(\w{1,15})\b regex pattern must be put inside a string literal, i.e. double quotes. Then note that every capturing group will be output, you may want to make the first group non-capturing by replacing ( with (?:. Also, the last \b is redundant, after the last \w matched, there will be either the end of string, or the non-word char.
I'd rather suggest
=REGEXEXTRACT(B1,"\B#\w{1,15}")
Or
=REGEXREPLACE(B1,"(\B#\w{1,15})\s*|.","$1 ")
Details:
\B - a non-word boundary (that is, before #, there can be either start of string or a non-word char)
# - a # char
\w{1,15} - 1 to 15 word chars (if you do not care about the length, replace {1,15} with +)
And the second regex details:
(\B#\w{1,15})\s* - Group 1 capturing # at the non-word boundary position, 1 to 15 word chars and then 0+ whitespaces (in the replacement, the $1 backreference inserts the found mentions back into the resulting string)
| - or
. - any 1 char.