Are there any characters that are not allowed/used in regex

I have the somewhat odd requirement that several regexes should be passed as one single string to a Jenkins plugin.
They are entered in a single text field, and I have to split this string into a list of regexes later on.
The issue is that I can't think of a way to delimit the regexes in the string so that I can split it later, because a character like , could also be part of a regex itself.
E.g. if I'd use a , for the two regex "(\d+,?\s+\d{1})\.xls" and "\w+\.exe" :
"(\d+,?\s+\d{1})\.xls,\w+\.exe"
would be split into 3 regexes: "(\d+", "?\s+\d{1})\.xls" and "\w+\.exe"
where the first 2 are obviously invalid.
So my actual question is, are there any characters, that can never appear in a regex which I could use to delimit my regexes?

No, any and all characters can appear in a regex. Use any serialisation format to serialise your list of strings into a clearly expressed list format, e.g. JSON:
["(\\d+,?\\s+\\d{1})\\.xls", "\\w+\\.exe"]
Alternatively CSV or anything else that can express a list of things and properly escapes characters used to denote item separators.
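As a rough illustration of the JSON approach, here is a minimal Python sketch. The variable names are hypothetical, and it assumes the plugin side can parse JSON; the two patterns are the ones from the question.

```python
import json
import re

# The two patterns from the question (raw strings, so backslashes stay literal).
patterns = [r"(\d+,?\s+\d{1})\.xls", r"\w+\.exe"]

# Serialise the list into one string that fits in a single text field.
field_value = json.dumps(patterns)

# Later: parse the field back into a list and compile each regex.
compiled = [re.compile(p) for p in json.loads(field_value)]
```

Because JSON escapes the quote and backslash characters itself, the comma inside the first pattern is never confused with the separator between list items.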

Related

Regex for removing spaces and random trailing chars

I am successfully validating an ID such as:
ZFA1G2H34J5K6L7P5
using this regex:
([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]){17}$
This ID sometimes arrives corrupted (it comes from an OCR process), so the previous regex does not work. I need to support the most common form of corruption, which is a space within the ID:
ZFA1G2H34 J5K6L7P5
The regex should remove the space and compose just the allowed 17 chars of the ID.
Please note I cannot use scripting (.replace for example) because the software where this regex is used does not support it.
As a bonus, sometimes the ID contains trailing chars which I would like to remove as well:
ZFA1G2H34 J5K6L7P5...ç
You can use one of the following regular expressions to validate the query:
^(?:(?![iIoO])[ ç0-9a-zA-Z]){17,}$
^([ ça-hA-Hj-nJ-Np-zP-Z0-9]){17,}$
And then, you can use the following regular expression to only match characters you like:
(?:(?![iIoO])[0-9a-zA-Z])
[a-hA-Hj-nJ-Np-zP-Z0-9]
Don't use , in a set like [A-Z,a-z]: inside a character class the comma is a literal member of the set, not a separator between the character ranges, so such a set also matches a comma.
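The effect of the stray commas, and the "match only the characters you like" idea, can be checked with a small Python sketch. This is for illustration only: the question rules out scripting in the target software, so the joining step below assumes an environment where post-processing is available.

```python
import re

# A comma inside a character class is just another member of the set.
with_commas = re.compile(r"[a-h,A-H]")
without_commas = re.compile(r"[a-hA-H]")

comma_matches_bad = with_commas.fullmatch(",") is not None
comma_matches_good = without_commas.fullmatch(",") is not None

# Collect only the allowed characters and keep the first 17 of them.
allowed = re.compile(r"[a-hA-Hj-nJ-Np-zP-Z0-9]")
raw = "ZFA1G2H34 J5K6L7P5...ç"
cleaned = "".join(allowed.findall(raw))[:17]
```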

Regex to match string except when part of specific string

I am trying to match a specific string, but only when it's not part of a couple specific literal strings. I wish to exclude results falling within the literal strings <span class='highlight'> and </span>. So if I search for "light", "high", "pan", "an", etc. I want to match any other occurrences that are not part of those two literals.
I'm not trying to parse full HTML, only those two strings listed, which will never change. The class value will never change from 'highlight'.
I have tried all manners of lookarounds, capturing groups, non-capturing groups, etc that I can think of and have come up with nothing. Lookarounds don't seem to be working, I'm betting because the position(s) of the string in relation to the cases to be excluded are not guaranteed to be in a certain order.
Is this possible with only regex?
Would this method work for you?
Search-and-replace those two tags with the empty string:
s/(<span class='highlight'>|<\/span>)//g
Search for your string
Of course you might end up with your search string being "around" one of those bits, e.g. searching for abcd and matching ab</span>cd. You could get around that by replacing the tags with some character sequence that you are sure can never be searched for.
You'll also lose the context of the situation of the string you're looking for relative to those tags, but not knowing what you're trying to achieve exactly, it's difficult to say whether that is important for you or not.
Oops, I thought I was properly simplifying my question, but it turns out I was wrong. I inherited code that was taking a string and doing a regex replace on a list of search terms by looping through them one at a time and wrapping matches in <span class="highlight"></span>. That resulted in a phrase like "Look into the light" ending up looking incorrect if you searched for "the light". "the" was matched and replaced, then "light" was matched, but would match the newly replaced tag for "the". The trick wasn't to fix the regex that got run on each individual word, but to change it into a regex that processed all of them together. Rather than regex replace using the, then light, the regex just needed to be the|light.
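A minimal Python sketch of the single-pass idea, assuming the search terms are plain strings. The function name highlight is hypothetical, and sorting the terms longest-first is an added assumption so that a longer term is preferred over a shorter one that is its prefix.

```python
import re

def highlight(text, terms):
    # Escape each term so it is matched literally, then join into one alternation.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # One pass over the text: inserted tags are never re-scanned.
    return pattern.sub(
        lambda m: '<span class="highlight">' + m.group(0) + "</span>", text
    )

result = highlight("Look into the light", ["the", "light"])
```

Since the replacement text is produced in the same pass, the "light" inside class="highlight" is never seen by the search.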

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below but since I need to add other controls (e.g. words can contain spaces) I would like to know if I'm on the right way.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome since I just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=...) is a positive lookahead. It requires its contents, here !!, to follow immediately after the text matched so far, but it does not include them in the resulting match.
Validating the string
If you however want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.
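Putting the validation and lookahead ideas together with the question's constraints (letters only, 1 to 5 characters per word), a Python sketch could look like this. The names VALID, WORD and parse are hypothetical, and fullmatch is used in place of explicit ^ and $ anchors.

```python
import re

# Whole-string check: one or more words of 1-5 letters, each followed by "!!".
VALID = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")
# Word extraction: the lookahead keeps "!!" out of the match.
WORD = re.compile(r"[a-zA-Z]{1,5}(?=!!)")

def parse(s):
    """Return the list of words, or None if the string is not valid."""
    if VALID.fullmatch(s) is None:
        return None
    return WORD.findall(s)

words = parse("word!!cat!!DOG!!")
rejected = [parse(s) for s in ("a!!b!!aaaaaa!!", "a123!!b!!c!!", "aAaa!!bbb", "aAaa!!bbb!")]
```

Validating first and extracting second keeps each pattern simple; the four strings from the question that should not match all come back as None.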

Regex to get everything between 2 words

I am trying to get through a lot of content and to extract some data from it. Therefore I need to pick the information between 2 set of characters.
It looks like this
***some text*** li> ***data to capture*** </li ***more text***
What regex can I use to get everything that is enclosed between li> and </li ?
Basically it will be like this:
li>(.*?)(?:</li)
Depending on your language environment, certain characters may need to be escaped or the way of retrieving the matched string may differ. Typically you would need to escape / by prepending a backslash, resulting in this new version:
li>(.*?)(?:<\/li)
Here's a live demo:
https://regex101.com/r/zV4uN6/1
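For example, in Python the forward slash needs no escaping, and the captured text is returned directly by re.findall. This is a sketch using the sample line from the question; the re.DOTALL flag is an added assumption so that the captured data may span several lines.

```python
import re

text = "***some text*** li> ***data to capture*** </li ***more text***"

# Lazy .*? stops at the first "</li" after each "li>".
captured = re.findall(r"li>(.*?)</li", text, flags=re.DOTALL)
```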

Split a string based on each time a Deterministic Finite Automata reaches a final state?

I have a problem that can be solved by iteration, but I'm wondering if there's a more elegant solution using regular expressions and split().
I have a string (which Excel puts on the clipboard) that is, in essence, comma delimited. The caveat is that when a cell value contains a comma, the whole cell is surrounded with quotation marks (presumably to escape the commas within that string). An example string is as follows:
123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"
Now, I want to elegantly split this string into individual cells, but the catch is that I cannot use a normal split expression with a comma as the delimiter, because it would divide cells that contain a comma in their value. Another way of looking at this problem is that I can ONLY split on a comma if there is an EVEN number of quotation marks preceding it.
This is easy to solve with a loop, but I'm wondering if there's a regular expression.split function capable of capturing this logic. In an attempt to solve this problem, I constructed the Deterministic Finite Automata (DFA) for the logic.
The question now is reduced to the following: is there a way to split this string such that a new array element (corresponding to /s) is produced each time the final state (state 4 here) is reached in a DFA?
Using regex (unescaped): (?:(?:"[^"]*")|(?:[^,]*))
Use that with Regex.Matches() in .NET, or its equivalent on other platforms.
You could further expand the above to this: ^(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*))(?:,(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*)))*$
This will parse the whole string in 1 shot, but you need named groups and multi-capture per group for this to work (.NET supports it).
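The same "match each cell rather than split" idea can be sketched in Python. This is an adaptation, not the .NET pattern itself: Python's re module does not allow two groups with the same name, so two numbered groups are used, and a leading (?:^|,) is an added assumption that ties each match to its delimiter. The names CELL and split_cells are hypothetical.

```python
import re

# Each match is a cell preceded by the start of the line or a comma.
# Group 1 holds a quoted cell's content, group 2 an unquoted cell.
CELL = re.compile(r'(?:^|,)(?:"([^"]*)"|([^,]*))')

def split_cells(line):
    cells = []
    for m in CELL.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        cells.append(quoted if quoted is not None else bare)
    return cells

line = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'
cells = split_cells(line)
```

This sketch does not handle doubled quotes inside a quoted cell; for real clipboard data a CSV parser is likely the more robust choice.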
Eligible commas are also followed by an even number of quotes, and VBScript does support lookaheads. Try splitting on this:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"