Regex for removing spaces and random trailing chars

I am successfully validating an ID such as:
ZFA1G2H34J5K6L7P5
using this regex:
([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]){17}$
This ID sometimes arrives corrupted (it comes from an OCR process), so the previous regex no longer matches. I need to support the most common kind of corruption, which is a space within the ID:
ZFA1G2H34 J5K6L7P5
The regex should remove the space and compose just the allowed 17 chars of the ID.
Please note I cannot use scripting (.replace for example) because the software where this regex is used does not support it.
As a bonus, sometimes the ID contains trailing chars which I would like to remove as well:
ZFA1G2H34 J5K6L7P5...ç

You can use one of the following regular expressions to validate the query:
^(?:(?![iIoO])[ ç0-9a-zA-Z]){17,}$
^([ ça-hA-Hj-nJ-Np-zP-Z0-9]){17,}$
And then, you can use the following regular expression to only match characters you like:
(?:(?![iIoO])[0-9a-zA-Z])
[a-hA-Hj-nJ-Np-zP-Z0-9]
Don't use , in a set like [A-Z,a-z], because commas are actually part of the set and not a separator between the character ranges.
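As a rough illustration of the second step, here is a minimal Python sketch. It assumes a scripting language is available, which the question says is not the case in the original tool, so it only serves to show what the character class keeps. The function name extract_id is hypothetical.

```python
import re

# Character class from the answer: letters and digits, excluding i/I and o/O.
ALLOWED = re.compile(r"[a-hA-Hj-nJ-Np-zP-Z0-9]")

def extract_id(raw):
    """Keep only allowed characters; return the first 17, or None if fewer remain."""
    kept = "".join(ALLOWED.findall(raw))
    return kept[:17] if len(kept) >= 17 else None

print(extract_id("ZFA1G2H34 J5K6L7P5...ç"))  # ZFA1G2H34J5K6L7P5

# A comma inside a set is a literal member of the set, not a separator.
print(re.fullmatch(r"[A-Z,a-z]", ",") is not None)  # True
```

Because the space, the dots, and ç fall outside the class, they are skipped and the 17 remaining characters form the ID.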

Regular expression to check strings containing a set of words separated by a delimiter

As the title says, I'm trying to build up a regular expression that can recognize strings with this format:
word!!cat!!DOG!! ... Phone!!home!!
where !! is used as a delimiter. Each word must have a length between 1 and 5 characters. Empty words are not allowed, i.e. no strings like !!,!!!! etc.
A word can only contain alphabetical characters between a and z (case insensitive). After each word I expect to find the special delimiter !!.
I came up with the solution below, but since I need to add other rules (e.g. words can contain spaces), I would like to know if I'm on the right track.
(([a-zA-Z]{1,5})([!]{2}))+
Also note that empty strings are not allowed, hence the use of +
Help and advice are very welcome, since I have just started learning how to build regular expressions. I ran some tests using http://regexr.com/ and it seems to be okay, but I want to be sure. Thank you!
Examples that shouldn't match:
a!!b!!aaaaaa!!
a123!!b!!c!!
aAaa!!bbb
aAaa!!bbb!
Splitting the string and using the values between the !!
It depends on what you want to do with the regular expression. If you want to match the values between the !!, here are two ways:
Matching with groups
([^!]+)!!
[^!]+ requires at least 1 character other than !
!! instead of [!]{2} because it is the same but much more readable
Matching with lookahead
If you only want to match the actual word (and not the two !), you can do this by using a positive lookahead:
[^!]+(?=!!)
(?=) is a positive lookahead. It requires everything inside, i.e. here !!, to be directly after the previous match. It however won't be in the resulting match.
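The difference between the two approaches can be sketched in Python (chosen here only for illustration; the sample string is taken from the question):

```python
import re

text = "word!!cat!!DOG!!"

# Capture group: the delimiter is consumed by the match, but only the
# group content is returned by findall.
with_groups = re.findall(r"([^!]+)!!", text)

# Lookahead: the delimiter is checked but never becomes part of the match.
with_lookahead = [m.group(0) for m in re.finditer(r"[^!]+(?=!!)", text)]

print(with_groups)     # ['word', 'cat', 'DOG']
print(with_lookahead)  # ['word', 'cat', 'DOG']

# The full match differs: the group version includes the delimiter.
print(re.search(r"([^!]+)!!", text).group(0))  # word!!
```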
Validating the string
If you however want to check the validity of the whole string, then you need something like this:
^([^!]+!!)+$
^ start of the string
$ end of the string
It requires the whole string to consist only of ([^!]+!!), repeated one or more times.
If [^!] does not fit your requirements, you can of course replace it with [a-zA-Z] or similar.
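A minimal Python sketch of the whole-string check, with [^!]+ swapped for [a-zA-Z]{1,5} to reflect the 1 to 5 letter rule from the question. The helper name is_valid is hypothetical, and fullmatch is used in place of explicit ^ and $ anchors.

```python
import re

# Assumption: words are 1 to 5 ASCII letters, each followed by "!!".
VALID = re.compile(r"(?:[a-zA-Z]{1,5}!!)+")

def is_valid(s):
    """True if the whole string is one or more word-plus-delimiter pairs."""
    return VALID.fullmatch(s) is not None

print(is_valid("word!!cat!!DOG!!"))  # True
print(is_valid("a!!b!!aaaaaa!!"))    # False: six-letter word
print(is_valid("aAaa!!bbb"))         # False: missing final delimiter
```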

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, and it matches the second and third correctly. The problem is that for the fourth and fifth strings, only the Dxxx and the last string are captured. I need each of the strings between the '|' to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
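Since the question mentions Python as a possible target, here is a minimal sketch of the repeated-group idea. MAX_PARTS and parse are hypothetical names, and the limit of four parts is an assumption matching the pattern above.

```python
import re

MAX_PARTS = 4  # assumed upper limit on the number of |-separated parts

# Prefix once, then the optional-part pattern repeated MAX_PARTS times.
PATTERN = re.compile(r"^D([0-9]{3})(?:$|\|)" + r"(.*?)(?:$|\|)" * MAX_PARTS)

def parse(s):
    """Return the captured groups as a tuple, or None if the prefix is absent."""
    m = PATTERN.match(s)
    return m.groups() if m else None

print(parse("D197|what.org|when.net"))  # ('197', 'what.org', 'when.net', '', '')
print(parse("something arbitrary"))     # None
```

Unused groups come back as empty strings, so callers would typically filter those out. A string with more parts than MAX_PARTS still matches, but the extra parts are not captured.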
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
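Python's built-in re module does not support the branch-reset group (?|...) used in the pattern above, so the following sketch reaches a similar list of matches in two steps instead. The function name split_record is hypothetical, and requiring '|' or the end of the string right after Dxxx is an assumption, not something the question states.

```python
import re

def split_record(s):
    """Return ['Dxxx', part1, part2, ...] or None if the prefix is missing."""
    # Assumption: the Dxxx prefix must be followed by '|' or the end of the string.
    head = re.match(r"D[0-9]{3}(?=\||$)", s)
    if head is None:
        return None
    # Global match over the remainder: one result per |-separated part.
    return [head.group(0)] + re.findall(r"\|([^|\n]*)", s[head.end():])

print(split_record("D197|what.org|when.net"))  # ['D197', 'what.org', 'when.net']
print(split_record("something arbitrary"))     # None
```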

Regex to get everything between 2 words

I am trying to get through a lot of content and extract some data from it. For that, I need to pick out the information between two sets of characters.
It looks like this
***some text*** li> ***data to capture*** </li ***more text***
What regex can I use to get everything that is enclosed between li> and </li ?
Basically it will be like this:
li>(.*?)(?:</li)
Depending on your language environment, certain characters may need to be escaped or the way of retrieving the matched string may differ. Typically you would need to escape / by prepending a backslash, resulting in this new version:
li>(.*?)(?:<\/li)
Here's a live demo:
https://regex101.com/r/zV4uN6/1
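As one example of how retrieval differs per environment, here is a minimal Python sketch. Python string patterns need no escaping for /, and the re.DOTALL flag is an added assumption so that captured data may span several lines.

```python
import re

text = "***some text*** li> ***data to capture*** </li ***more text***"

# The lazy (.*?) stops at the first "</li" after each "li>".
captured = re.findall(r"li>(.*?)</li", text, flags=re.DOTALL)

print(captured)  # [' ***data to capture*** ']
```

The surrounding spaces are part of the capture; call strip() on each item if they are not wanted.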

How to evaluate a RegExp in an array with match groups?

I need to parse an array-like text with regular expression and get the match groups.
One example of the text I want to parse is this:
['red','green', 'blue']
I want to use match groups, because I want to extract them.
I am using this regular expression, but the groups found by it are not like what I expected:
\[ *('.+?')( *, *('.+?'))* *\]
The idea is to parse in this order:
A square bracket
Any number of spaces
A group with:
Single quote
Any character
Single quote
Zero or more groups of:
Any number of spaces
A comma
Any number of spaces
A group with
Single quote
Any character
Single quote
Any number of spaces
A square bracket
And get one group with each parsed array element.
Can you help me?
Hint: an easy way to test a regexp is the site http://rubular.com
This isn't going to be a complete answer, but I'm fairly certain you can't check for whitespace with " *"; at least, it may depend on the language you're using.
Here's a C# regex example that shows some of the language requirements to check for whitespace: regex check for white space in middle of string
Edit: I see you added Ruby as your language. Unfortunately I'm not well versed in Ruby, so I cannot help you with specifics, sorry.
Edit2: Seeing as you're forcing yourself into Ruby to debug your regex statement, might I suggest: http://www.debuggex.com/ which tries to stay language independent?
Try this regex: '([^']+)'. According to rubular.com, it should give you the match groups red, green, blue.
You can match an arbitrary number of groups with one regex:
^\[\s*|(?:\G'([^']+)'\s*(?:,\s*|]$))+
or like this (should be more performant):
^\[\s*+|(?>\G'([^']++)'\s*+(?>,\s*+|]$))++
This works in Ruby, as asked; I don't know about Delphi.
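Python's built-in re module has no \G anchor, so the single-pattern approach above does not carry over directly. A rough Python equivalent, offered only as a sketch, validates the whole string first and then extracts the items in a second pass. The names ARRAY, ITEM and parse_array are hypothetical.

```python
import re

# Whole-string shape: bracket, quoted items separated by commas, bracket.
ARRAY = re.compile(r"\[\s*'[^']+'(?:\s*,\s*'[^']+')*\s*\]")
# One quoted item; the group holds the text between the quotes.
ITEM = re.compile(r"'([^']+)'")

def parse_array(text):
    """Return the list of items, or None if the text is not a valid array."""
    if ARRAY.fullmatch(text) is None:
        return None
    return ITEM.findall(text)

print(parse_array("['red','green', 'blue']"))  # ['red', 'green', 'blue']
print(parse_array("['red' 'green']"))          # None
```

Splitting validation from extraction avoids the issue in the question, where a repeated capturing group keeps only its last iteration.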

Including Regular Expressions in AutoHotKey Script

I am currently developing a very "simple" script in AutoHotKey, but it involves using hotstrings following the format:
::btw::by the way
which would detect whenever a user types "btw" and replace it with "by the way".
However, whenever I try to put a regular expression in between the colons, it interprets it literally. Is there any way to use regular expressions with hotstrings? Workarounds are accepted.
Hotstrings don't natively support RegEx,
but there is RegEx Powered Dynamic Hotstrings which I've never tried.
Your other option is a Loop with the Input command inside of it.
That would require an end character, such as space.
Then you would have the script analyze what the Input command returns with RegExReplace.
Place the number in the regular expression in a capturing group and use it as a back-reference in the replacement. But unless the pattern always has the digit in the same place I think it would require two steps (with RegExMatch) as shown in this working example:
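loop
{
    ; Wait for typed input; V keeps it visible, and space ends the capture.
    Input, retrieved, V, {space}
    ; Step 1: find a run of six alphanumeric characters.
    RegExMatch(retrieved, "[a-zA-Z0-9]{6}", match)
    ; Step 2: find the first digit inside that run.
    RegExMatch(match, "\d", output)
    ; Send seven backspaces (six characters plus the space), then the digit.
    If (output != "")
        Sendinput, {bs 7}%output%
}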
Type any sequence of six characters made up of five letters and one digit,
then press space, and the script will replace the sequence with only the digit.
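The two-step matching logic of the script, separate from the keyboard handling, can be sketched in Python for clarity. This is not AutoHotkey code, and the function name digit_from_sequence is hypothetical.

```python
import re

def digit_from_sequence(typed):
    """Mirror the two RegExMatch calls: find a six-character run, then a digit in it."""
    run = re.search(r"[a-zA-Z0-9]{6}", typed)
    if run is None:
        return None
    digit = re.search(r"\d", run.group(0))
    return digit.group(0) if digit else None

print(digit_from_sequence("abc4de"))  # 4
print(digit_from_sequence("abcdef"))  # None
```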