Regex expression Giving all letters - regex

I need all groups of 4 capital letters in a string.
So I am using REGEXP_REPLACE([Description],'\b(?![A-Z]{4}\b)\w+\b',' ')
in Tableau to replace all lowercase letters and extra characters. I want to keep only the runs of exactly 4 capital letters.
From searching Google I learned that I cannot use REGEXP_EXTRACT (since /g is not supported).
My String:
"The following trials have no study data-available, in the RBM mart. It appears as is this because they were . In y HIWEThe trials currently missing data are:
JADA, JPBD, JVCS, JADQ, JVDI, JVDO, JVTZ"
I have written [^A-Z]{4}/g.
I want:
HIWE JADA JPBD JVCS JADQ JVDI JVDO JVTZ
But this also gives me single capital letters, and spaces are included.
Thanks

You can use this regex:
((?<=[A-Z]{4})|^).*?(?=[A-Z]{4}|$)
Explaining:
( # one of:
^ # the starting position
| # or
(?<=[A-Z]{4}) # any position right after four uppercase letters
) #
.*? # lazily match anything up to the first
(?= # position that is followed by
[A-Z]{4} # four uppercase letters
| # or
$ # the end of the string
) #
Any doubt feel free to ask :)
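For a quick check outside Tableau, the pattern above can be tried with Python's re module, which also supports the fixed-width lookbehind it uses. This is only a sketch under assumptions: each match is replaced with a single space (as in the question's REGEXP_REPLACE call), the DOTALL flag is added so a gap can span the line break in the sample, and Tableau's engine is assumed, not verified, to behave the same way. The helper name keep_four_caps is made up for illustration.

```python
import re

# Pattern from the answer: it matches the text *between* runs of four capitals.
# re.DOTALL (an addition for this sketch) lets a gap run across line breaks.
GAP = re.compile(r"((?<=[A-Z]{4})|^).*?(?=[A-Z]{4}|$)", re.DOTALL)

def keep_four_caps(text: str) -> str:
    """Replace every gap with a space, then collapse repeated whitespace."""
    spaced = GAP.sub(" ", text)
    return " ".join(spaced.split())

# Shortened version of the question's string, including the glued "HIWEThe".
sample = "In y HIWEThe trials currently missing data are:\nJADA, JPBD, JVCS"
print(keep_four_caps(sample))  # HIWE JADA JPBD JVCS
```

The final whitespace clean-up is there because the replacement leaves a leading and a trailing space.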

Related

How to negate string pattern using re2 regex?

I'm using Google RE2 regex to query Prometheus from a Grafana dashboard. I'm trying to extract the value of key from the 3 possible types of input string below:
1. object{one="ab-vwxc",two="value1",key="abcd-eest-ed-xyz-bnn",four="obsoleteValues"}
2. object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn",four="obsoleteValues"}
3. object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn-ed",four="obsoleteValues"}
..with the validation rules listed below:
should contain abcd-
shouldn't contain -ed
Somehow this regex
\bkey="(abcd(?:-\w+)*[^-][^e][^d]\w)"
..satisfies the first condition (contains abcd-) but not the second (no -ed).
The expected output would be abcd-eest-xyz-bnn from the 2nd input option. Any help would be really appreciated. Thanks a lot.
If I understand your requirements correctly, the following pattern should work:
\bkey="(abcd(?:-e|-(?:[^e\W]|e[^d\W])\w*)*)"
Demo.
Breakdown for the important part:
(?: # Start a non-capturing group.
-e # Match '-e' literally.
| # Or the following...
- # Match '-' literally.
(?: # Start a second non-capturing group.
[^e\W] # Match any word character except 'e'.
| # Or...
e[^d\W] # Match 'e' followed by any word character except 'd'.
) # Close non-capturing group.
\w* # Match zero or more additional word characters.
) # Close non-capturing group.
Or in simple terms:
Match a hyphen followed by:
only the letter 'e'. Or..
a word* not starting with 'e'. Or..
a word starting with 'e' not followed by 'd'.
*A "word" here means a string of word characters as defined in regex.
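The pattern can be sanity-checked against the three inputs from the question. The sketch below uses Python's re module rather than RE2 itself, on the assumption that both engines agree here because the pattern uses no lookarounds; extract_key is a hypothetical helper name.

```python
import re

# The pattern from the answer above; no lookarounds are used.
KEY = re.compile(r'\bkey="(abcd(?:-e|-(?:[^e\W]|e[^d\W])\w*)*)"')

inputs = [
    'object{one="ab-vwxc",two="value1",key="abcd-eest-ed-xyz-bnn",four="obsoleteValues"}',
    'object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn",four="obsoleteValues"}',
    'object{one="ab-vwxc",two="value1",key="abcd-eest-xyz-bnn-ed",four="obsoleteValues"}',
]

def extract_key(line):
    """Return the captured key value, or None when the line is rejected."""
    m = KEY.search(line)
    return m.group(1) if m else None

results = [extract_key(line) for line in inputs]
print(results)  # [None, 'abcd-eest-xyz-bnn', None]
```

Only the second input yields a value, because the other two contain an element that starts with "ed".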
Maybe have a go with:
\bkey="((?:ktm-(?:(?:e-|[^e]\w*-|e[^d]\w*-)*)abcd(?:(?:-e|-[^e]\w*|-e[^d]\w*)*)|abcd(?:(?:-e|-[^e]\w*|-e[^d]\w*)*)))"
This would ensure that:
String starts with either ktm- or abcd.
If starts with ktm-, there should at least be an element called abcd.
If starts with abcd, there doesn't have to be another element.
Both options check that there must not be an element starting with -ed.
See the online demo
The struggle without lookarounds...

Conditional replace depending on which character is found

This is NOT a duplicate of How to use conditionals when replacing in Notepad++ via regex as I am asking something very specific here which I cannot implement following the info in that question. So kindly allow this question.
I want to replace a range of characters with a corresponding range of characters. So far, I can only do it with multiple operations.
For example, match any word that starts with a capital Latin character in the range [ABEZHIKMNOPTYXZ] and is followed by a Greek lowercase letter [α-ωά-ώ] and replace the character in the first matched group with a similar-looking character but in the Greek range [ΑΒΕΖΗΙΚΜΝΟΡΤΥΧΖ] (note, they look the same but are different characters).
What I have come up with so far is multiple replacements, i.e.
(A)([α-ωά-ώ])
Α\2
(B)([α-ωά-ώ])
Β\2
....
So that for example:
Aνθρώπινος would become Ανθρώπινος
Bάτος would become Βάτος
Preferably this should work in EmEditor, Notepad++ being the 2nd option.
Notepad++ supports conditional replacement, you can use it like:
Find what: (?:(A)|(B)|(E)|(Z)|(H)|(I)|(K)|(M)|(N)|(O)|(P)|(T)|(Y)|(X)|(Z))(?=[α-ωά-ώ])
Replace with: (?{1}Α:(?{2}Β:(?{3}Ε:(?{4}Ζ:)))) (add the other Greek letters similarly)
Explanation of the search pattern:
(?: # start non-capturing group
(A)|(B)|... # one capture group per Latin letter
) # end non-capturing group
(?= # positive lookahead, make sure what follows is:
[α-ωά-ώ] # a lowercase Greek letter
) # end lookahead
Explanation of the replacement:
(?{1} # if group 1 ("A") matched
Α # replace with the Greek letter
: # else
(?{2} # if group 2 ("B") matched
Β # replace with the Greek letter
: # else
(?{3} # and so on ...
Ε
:
(?{4}
Ζ
:
)
)
)
)
I've made a test for only 2 letters, "A" and "B", replacing them with the more visually distinct letters "X" and "Y" just to show the way it works.
Screen capture (before):
Screen capture (after):
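Outside an editor, the same one-pass idea can be sketched in Python with a lookup table and a replacement callback instead of nested conditionals. This is a different technique from the Notepad++ (?{n}...) syntax, shown only for comparison. The names LATIN_TO_GREEK and fix_capitals are made up, and the Greek letters are written as \u escapes on the assumption that this makes the two alphabets easier to tell apart in source code.

```python
import re

# Latin capitals and their Greek look-alikes (Α Β Ε Ζ Η Ι Κ Μ Ν Ο Ρ Τ Υ Χ).
LATIN = "ABEZHIKMNOPTYX"
GREEK = "\u0391\u0392\u0395\u0396\u0397\u0399\u039a\u039c\u039d\u039f\u03a1\u03a4\u03a5\u03a7"
LATIN_TO_GREEK = dict(zip(LATIN, GREEK))

# A listed Latin capital, but only when a lowercase Greek letter follows.
# \u03b1-\u03c9 is α-ω and \u03ac-\u03ce is ά-ώ, as in the question.
PATTERN = re.compile("[" + LATIN + "](?=[\u03b1-\u03c9\u03ac-\u03ce])")

def fix_capitals(text: str) -> str:
    """Swap each matched Latin capital for its Greek counterpart."""
    return PATTERN.sub(lambda m: LATIN_TO_GREEK[m.group(0)], text)

# Latin "A" followed by the Greek letters of "νθρώπινος".
word = "A\u03bd\u03b8\u03c1\u03ce\u03c0\u03b9\u03bd\u03bf\u03c2"
fixed = fix_capitals(word)
print(fixed, hex(ord(fixed[0])))  # the first character is now U+0391
```

Words such as "Apple" are left alone, because the capital is not followed by a Greek letter.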

How does this regex for FQDNs (excluding .arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with a character that isn't one of ., a, r, p' - it's a negated character class.
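A short Python check makes the effect of the negated class visible. It is a sketch using the question's original pattern unchanged; the test names other than google.uk are invented for illustration. Because [^.arpa] consumes one extra character, the last label needs at least three characters, and that character may be anything except ., a, r or p.

```python
import re

# The question's original pattern, unchanged.
ORIGINAL = re.compile(
    r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)"
)

names = ["google.uk", "google.com", "example.spa", "google.co!"]
verdicts = {name: bool(ORIGINAL.search(name)) for name in names}
for name, ok in verdicts.items():
    print(name, ok)
# google.uk False   (two letters leave nothing for [^.arpa] to consume)
# google.com True
# example.spa False (ends in 'a', although it is not .arpa)
# google.co! True   ('!' is "not one of . a r p")
```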
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
A lookahead (positive or negative) asserts what does or does not follow the current position without consuming any characters. That can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though, you'll capture www, but you won't match test (e.g. with the g flag), because the dot after www isn't followed by arpa, but the dot after test is.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
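Both behaviours, and the suggested split into two tests, can be reproduced in Python. This is a sketch: SHAPE reuses the question's third pattern, is_allowed is a hypothetical helper name, and lowercasing before the suffix check is an assumption about how .arpa should be compared.

```python
import re

name = "www.test.arpa"

# Greedy [\w.]+ swallows the whole string, so the lookahead is evaluated at
# the end of the input, where nothing (and therefore no "arpa") follows.
whole = re.search(r"([\w.]+)(?!arpa)$", name)
print(whole.group(1))  # www.test.arpa

# Here the lookahead is evaluated right after each dot instead.
labels = re.findall(r"(\w+)\.(?!arpa)", name)
print(labels)  # ['www']

# Two separate tests: one regex for the overall shape, one plain suffix check.
SHAPE = re.compile(r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)")

def is_allowed(fqdn: str) -> bool:
    return bool(SHAPE.match(fqdn)) and not fqdn.lower().endswith(".arpa")

print(is_allowed("google.uk"), is_allowed(name))  # True False
```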
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa you need to use a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
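The modified pattern can be exercised in Python as a rough check. The test names are taken from this answer's later snippet plus one invented uppercase case. One thing the sketch shows is that the lookahead is case-sensitive as written, so an uppercase .ARPA is still accepted unless a case-insensitive flag is added.

```python
import re

# The pattern from the line above, unchanged.
FIXED = re.compile(
    r"(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)"
)

names = [
    "google.uk",
    "domain.arpa",
    "domain.arpanet",
    "another.domain.arpa.net",
    "domain.ARPA",  # invented case: the lookahead does not ignore case
]
verdicts = {name: bool(FIXED.search(name)) for name in names}
for name, ok in verdicts.items():
    print(name, ok)
```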
Update:
I've reworked the regex to simplify it (I've also incorporated Sobrique's suggestion, adding an important detail):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # JS regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing group 1 (CG1); a non-capturing (?:...) would work as well
[a-z0-9]{1,63} # any sequence of letters or digits, 1 to 63 chars long
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE CG1 - repeat its content one or more times
(?!arpa$) # after the last literal dot '.', the string must not end with 'arpa' (I added '$' to Sobrique's suggestion; without it, '.arpanet' would be rejected too)
[a-z]{2,63} # a sequence of letters, 2 to 63 chars long
$ # end of the string
/i # close the regex delimiter and add the case-insensitive flag, so [a-z] also matches [A-Z]
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk', 'domain.arpa', 'domain.arpa2', 'another.domain.arpa.net', 'domain.arpanet'];
var out = document.getElementById("r");
var t;
// Pop each test domain (last to first) and report whether it validates.
while ((t = tests.pop())) {
out.innerHTML += '"' + t + '"<br/>';
out.innerHTML += 'Valid domain? ' + (re.test(t) ? '<span style="color:green">YES</span>' : '<span style="color:red">NO</span>') + '<br/><br/>';
}
<div id="r"></div>

Regex to find strings not containing a specified value

I'm using Notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the example below), where all strings begin with the same value (ABC in the example below).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The third and fifth strings both lack "HIJ" and are therefore included in the output; all others are excluded.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future. My job description changed earlier this week when someone else in the company left suddenly, so I have been doing this manually by searching for (ABC|HIJ) and going through the search results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.
You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?: # non-capturing group
[^HA]+ # all that is not a H or an A
| # OR
H(?!IJ) # an H not followed by IJ
|
A(?!BC) # an A not followed by BC
)*+ # repeat the group zero or more times (possessive: no backtracking into it)
(?=ABC|$) # followed by "ABC" or the end of the string
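The pattern can be tried on the question's sample input in Python. This is a sketch with one deliberate change: the possessive *+ is replaced by a plain *, an assumption made so that it also runs on Python versions before 3.11 (where possessive quantifiers are not available). On this short input the result is the same, though the possessive form is the safer choice against heavy backtracking on long inputs.

```python
import re

# Same alternation as above, with * instead of the possessive *+.
ITEMS = re.compile(r"ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*(?=ABC|$)")

data = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
found = ITEMS.findall(data)
print(found)  # ['ABC89', 'ABC12~34HI456J']
```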
Note: if you want to remove everything that is not one of the items you want, you can use this search and replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n
You could use this pattern:
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
Demo
( # Capturing Group (1)
ABC # "ABC"
(?: # Non Capturing Group
(?! # Negative Look-Ahead
HIJ # "HIJ"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
*? # (zero or more)(lazy)
) # End of Capturing Group (1)
(?= # Look-Ahead
ABC # "ABC"
| # OR
\R # <line break>
) # End of Look-Ahead
You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This starts with ABC and uses a negative lookahead at every character so that HIJ never appears before the end of the line. The pattern works for the separated strings.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
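Both variants can be compared in Python on the question's data. This is a sketch: the separated input is the question's list of strings joined with line breaks, and re.MULTILINE is added for the anchored variant on the assumption that ^ and $ are meant to match at each line.

```python
import re

# Single-line variant: every item ends where the next "ABC" starts, or at the end.
single_line = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
from_single = re.findall(r"(ABC(?:(?!HIJ).)*?)(?=ABC|$)", single_line)
print(from_single)

# Anchored variant: one candidate string per line.
separated = "ABC1234HIJ56\nABC7#HIJ\nABC89\nABCHIJ0ABE:HIJ\nABC12~34HI456J"
from_lines = re.findall(r"(^ABC(?:(?!HIJ).)*$)", separated, flags=re.MULTILINE)
print(from_lines)
```

Both calls return the two strings that lack HIJ, ABC89 and ABC12~34HI456J.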

remove initial period and text after final period in string

I have a regex edge case that I am unable to solve. I need to grep to remove the leading period (if it exists) and the text following the last period (if it exists) from a string.
That is, given a vector:
x <- c("abc.txt", "abc.com.plist", ".abc.com")
I'd like to get the output:
[1] "abc" "abc.com" "abc"
The first two cases are already solved; I obtained help in this related question. However, the third case, with a leading ., is not.
I am sure it is trivial, but i'm not making the connections.
This regex does what you want:
^\.+|\.[^.]*$
Replace its matches with the empty string.
In R:
gsub("^\\.+|\\.[^.]*$", "", x, perl=TRUE)
Explanation:
^ # Anchor the match to the start of the string
\.+ # and match one or more dots
| # OR
\. # Match a dot
[^.]* # plus any characters except dots
$ # anchored to the end of the string.
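For readers who want to check the pattern outside R, here is the same substitution as a Python sketch. The function name strip_ends is made up, and the input is the vector from the question.

```python
import re

def strip_ends(name: str) -> str:
    """Drop leading dots and the text from the last dot onwards."""
    return re.sub(r"^\.+|\.[^.]*$", "", name)

x = ["abc.txt", "abc.com.plist", ".abc.com"]
cleaned = [strip_ends(s) for s in x]
print(cleaned)  # ['abc', 'abc.com', 'abc']
```

In the third case both alternatives fire: the leading dot is removed first, then the trailing ".com".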