Regex for searching strings matching the following one - regex

I am searching strings matching the following one in my source code:
<CONSTANT_STRING_1> <CONSTANT_STRING_2> <VARIABLE_DIGITS> <CONSTANT_STRING_3>
where
<CONSTANT_STRING_1>, <CONSTANT_STRING_2> and <CONSTANT_STRING_3> are constant strings like "ABC", "DEF" and "GHI".
<VARIABLE_DIGITS> is a random number of 14 digits like "12345678901234"
Note: there are white spaces between words.
What I am looking for is to search <CONSTANT_STRING_1> <CONSTANT_STRING_2> <WHATEVER> <CONSTANT_STRING_3>. How can I build the Regex?

I take it that by "constant string" you mean strings of letters? If so, the regex below should find the full string you are looking for. By the way, the website linked below is really great for visualizing this type of problem... give it a try :)
(([a-zA-Z]+\s){2})[0-9]{14}\s([a-zA-Z]+)$
Debuggex Demo
To break it down...
(([a-zA-Z]+\s){2}) means one or more lowercase or uppercase letters followed by a whitespace character, with that whole unit (letters + whitespace) repeated twice
[0-9]{14}\s 14 digits followed by a whitespace character. As @Avinash said, \d{14}\s is another way of writing this portion
([a-zA-Z]+)$ Another string of one or more characters. The $ indicates that this ends the string you are searching for
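
The breakdown above can be checked with a short Python sketch. The sample strings and the helper name find_match are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer above: two letter-only words, a 14-digit number,
# then a final letter-only word at the end of the string.
PATTERN = re.compile(r"(([a-zA-Z]+\s){2})[0-9]{14}\s([a-zA-Z]+)$")

def find_match(line):
    """Return the matched text, or None when the line does not fit the pattern."""
    m = PATTERN.search(line)
    return m.group(0) if m else None

print(find_match("ABC DEF 12345678901234 GHI"))  # full string matches
print(find_match("ABC DEF 1234 GHI"))            # too few digits, so None
```

Because of the trailing $, this only matches when the last word ends the string being searched, which may or may not suit a search through source code.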

You could try the below regex.
<CONSTANT_STRING_1> <CONSTANT_STRING_2> \d{14} <CONSTANT_STRING_3>
Where, \d{14} matches exactly the 14 digit number.
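
If the constant strings may contain regex metacharacters, one way to build this pattern is to escape them first. The following is a minimal Python sketch under that assumption; build_pattern and the sample constants are hypothetical, and joining the parts with \s+ (one or more whitespace characters) is my choice rather than something the question specifies.

```python
import re

def build_pattern(first, second, third, middle=r"\d{14}"):
    """Join the escaped constants and the variable middle part with whitespace."""
    parts = [re.escape(first), re.escape(second), middle, re.escape(third)]
    return re.compile(r"\s+".join(parts))

exact = build_pattern("ABC", "DEF", "GHI")                    # 14 digits in the middle
loose = build_pattern("ABC", "DEF", "GHI", middle=r"\S+")     # any non-space run

print(bool(exact.search('log("ABC DEF 12345678901234 GHI")')))
print(bool(loose.search("ABC DEF whatever GHI")))
```

Passing middle=r"\S+" corresponds to the <WHATEVER> variant the question asks about, as long as that part contains no whitespace.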

Is there a more efficient RegEx for solving Wordle?

So I have a list of all 5 letter words in the English language that I can interrogate when I'm really stuck at Wordle. I found this an excellent exercise for brushing up on my Regular Expressions in BBEDIT, which is what I tell myself I'm doing.
The way wordle works, I can have three conditions.
A letter that is somewhere in the word (and must be present)
A letter that is not present in the word
A letter that is correct in presence and position
Condition 3 is easy. If my start word "crone" has the n in the right place, my pattern is
...n.
And I can add condition 2 fairly easily with
^(?!.*[croe])...n.
If my next guess is "burns" I'll know there's an "s"
^(?!.*[croebur])^(?=.*s)...n.
And that it's not in the last position:
^(?!.*[croebur])^(?=.*s)...n[^s]
If my next (very poor) guess is 'stone' I'll know there's a 't'.
^(?!.*[croebur])^(?=.*s)^(?=.*t)sa.n.
So that's a workable formula.
But if my next guess were "wimpy" I'd know there was an 'i' in the answer, but I have to add an additional ^(?=.*i) which just feels inefficient. I tried grouping the letters that must be in the word by using a bracket set, ^(?=.*[ist]) but of course that will match targets that contain any one of those characters rather than all.
Is there a more efficient way to express the phrase "the word must contain all of the following letters to match" than a series of "start at the beginning, scan for occurrence of this single character until the end" phrases?
If you enter a word into Wordle, it displays all the matched characters in your word. It also shows the characters which exist in the word but not in the correct order.
Considering these requirements, I think you should create a separate rule for each letter position. This way, your regex pattern stays simple, and you get the search results quickly. Let me give an example:
Input word: crone
Matched Characters: ...n.
Characters in the wrong place: -
Next regex search pattern: ^[^crone][^crone][^crone]n[^crone]$
Input word: burns
Matched Characters: ...n.
Characters in the wrong place: s
Next regex search pattern: ^(?=\S*[s]\S*)[^bucrone][^bucrone][^bucrone]n[^bucrones]$ (Be careful, there is an "s" character in the last bracket set because we know it does not belong in that position.)
Input word: stone
Matched Characters: s..n.
Characters in the wrong place: t
Next regex search pattern: ^(?=\S*[t]\S*)s[^tsbucrone][^sbucrone]n[^sbucrones]$ (Be careful, there is a "t" character in the first bracket set because we know it does not belong in that position.)
^ => Start of the line
[^abc] => Any character except "a", "b", or "c"
(?=\S*[t]\S*)=> There must be a "t" character in the given string
(?=\S*[t]\S*)(?=\S*[u]\S*)=> There must be "t" and "u" characters in the given string
$ => End of the line
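
The per-position idea can also be generated instead of typed by hand. Below is a hedged Python sketch: build_wordle_pattern, candidates, and the small word list are hypothetical names and data for illustration, and the generated pattern uses (?=.*x) lookaheads rather than the (?=\S*[x]\S*) form shown above. It still emits one lookahead per required letter, so it automates the pattern rather than shortening it.

```python
import re

def build_wordle_pattern(greens, yellows, greys, length=5):
    """greens: {position: letter} for letters in the right place.
    yellows: {letter: set of positions where that letter is known NOT to be}.
    greys: letters known to be absent from the word."""
    # One lookahead per letter that must appear somewhere in the word.
    lookaheads = "".join(f"(?=.*{ch})" for ch in sorted(yellows))
    cells = []
    for pos in range(length):
        if pos in greens:
            cells.append(greens[pos])
            continue
        # Exclude absent letters plus any yellow letter ruled out at this position.
        banned = set(greys) | {ch for ch, spots in yellows.items() if pos in spots}
        cells.append("[^" + "".join(sorted(banned)) + "]" if banned else ".")
    return re.compile("^" + lookaheads + "".join(cells) + "$")

WORDS = ["saint", "stunt", "slant", "shunt", "crone"]

def candidates(pattern, words=WORDS):
    return [w for w in words if pattern.match(w)]

# State after guessing crone, burns and stone, then after wimpy as well.
after_stone = build_wordle_pattern({0: "s", 3: "n"}, {"t": {1}}, "croebu")
after_wimpy = build_wordle_pattern({0: "s", 3: "n"}, {"t": {1}, "i": {1}}, "croebuwmpy")

print(after_stone.pattern)
print(candidates(after_stone))
print(candidates(after_wimpy))
```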
In a performance test of the two regex patterns on a seven-word sample, my pattern found the result in 130 steps, whereas yours took 175 steps. The performance difference will increase as the word list grows. You can review it at the following links:
Suggested pattern: https://regex101.com/r/mvHL3J/1
Your pattern: https://regex101.com/r/Nn8EwL/1
Note: You need to click the "Regex Debugger" link in the left sidebar to see the steps.
Note 2: I updated my response to fix the bug in the following comment.

End a regular expression pattern with a string

Hi all. I have spent some time learning regular expressions, but I have run into a problem I cannot solve properly.
Let's assume the following 'string' (HTML extract):
"{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', '2016-07-31' random_text XYCCC Letters and 55565798 ]}"
My intention is to extract all values from '2018-05-02' up to (and excluding) random_text. I tried to achieve this with the "anything but" construct, [^a] (not a):
\'[^random]*
The above does not do the job, because [^random] is treated not as a string but as a set of characters, hence the 'r' in the string will split my extracted value.
If there is no r in the text before the word random_text, this would work fine:
\'[^r]*
Is there any way to include a specific string as the end of my sequence? E.g.:
start: \'
repeated characters unlike string: [^{my_string}]*
Appreciate any insight :)
This regex will do the job:
'.+'(?= random)
Just replace random with the string you want to exclude at the end.
Demo & explanation
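
As a quick check of the lookahead approach, here is a minimal Python sketch run against the sample string from the question. The helper extract_before is a hypothetical name, and pulling the individual dates out with a second regex is my addition, not part of the answer.

```python
import re

TEXT = ("{'2018-05-02', '2018-01-05', r, '2018-07-01', '2017-07-02', "
        "'2016-07-31' random_text XYCCC Letters and 55565798 ]}")

def extract_before(text, stop_word):
    """Match from the first quote to the last quote that is followed by ' <stop_word>'."""
    pattern = r"'.+'(?= " + re.escape(stop_word) + ")"
    m = re.search(pattern, text)
    return m.group(0) if m else None

values = extract_before(TEXT, "random")
print(values)

# Optional second step (an assumption about what is wanted): list the dates.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", values)
print(dates)
```

The lookahead checks that " random" follows without including it in the match, which is why the stray r inside the list does not cut the result short.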

RegEx for finding strings with chars and numbers

I am trying to match strings that are part numbers mixed with normal text.
Here are a few examples.
Towing Cntrl Ecu,Gl3t-19H378-Ac
Assy,Pwr,Tested Gd,Priv-M50t3
Left,Rear,Brn-Tan,Pwr,4DR,Mju1
T-Case Ecu,56029590AE
Right,Blind Spot Module,284K0 9HS0F
In these examples I am trying to match:
Gl3t-19H378-Ac
Priv-M50t3
Mju1
56029590AE
284K0 and 9HS0F
I am in .Net and this is the Regex I have been using.
(\b[a-zA-Z0-9][a-zA-Z0-9\-]{1,32}(\b|$)(?<=[0-9]))
It works for what I need if the match ends in a number. The rule I want is to match any string between word boundaries that is either all numbers or numbers and chars mixed, but never just chars.
This should do it:
\b[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b
If you need to restrict the length to a maximum of 32, add a look ahead:
\b(?=[a-zA-Z0-9-]{1,32}\b)[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b
If the underscore character is OK too, you can use [\w-] instead of [a-zA-Z0-9-].
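
The two patterns can be tried against the sample lines with a short Python sketch (the question uses .NET, but these constructs behave the same way in Python's re module as far as I know). The names PART, LIMITED, and SAMPLES are made up for illustration. Note that under the stated rule, 4DR in the third line also counts as a match, since it mixes digits and letters.

```python
import re

# Token made of letters, digits and hyphens that contains at least one digit.
PART = re.compile(r"\b[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b")
# Same, with a lookahead limiting the token to at most 32 characters.
LIMITED = re.compile(r"\b(?=[a-zA-Z0-9-]{1,32}\b)[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b")

SAMPLES = [
    "Towing Cntrl Ecu,Gl3t-19H378-Ac",
    "Assy,Pwr,Tested Gd,Priv-M50t3",
    "Left,Rear,Brn-Tan,Pwr,4DR,Mju1",
    "T-Case Ecu,56029590AE",
    "Right,Blind Spot Module,284K0 9HS0F",
]

for line in SAMPLES:
    print(PART.findall(line))
```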

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match, because regex patterns match contiguous chunks only.
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
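
For comparison, the capture-group and substitution approaches described above look like this as a Python sketch. The question is about PowerShell, so this only illustrates the regex logic; the function names are hypothetical.

```python
import re

NAME = "my_file_name_01012013_111546.xls"

def strip_timestamp_groups(name):
    """Capture the name and the extension separately, then glue them together."""
    m = re.match(r"^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$", name)
    return m.group(1) + m.group(2) if m else name

def strip_timestamp_sub(name):
    """Replace the _8digits_6digits part with nothing."""
    return re.sub(r"_[0-9]{8}_[0-9]{6}", "", name)

print(strip_timestamp_groups(NAME))
print(strip_timestamp_sub(NAME))
```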
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by the 16 characters matched by _.{8}_.{6} (an underscore, 8 characters, an underscore, and 6 characters).
Once you append .{3}, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better if you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
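
The zero-width behaviour is easy to observe in a small Python sketch. It uses the file name from the question; the variable names are made up for illustration.

```python
import re

NAME = "my_file_name_01012013_111546.xls"

# The lookahead only checks what follows; it does not consume it.
without_tail = re.search(r".*(?=_.{8}_.{6})", NAME).group(0)

# Adding .{3} consumes the 3 characters right after the match,
# which are the start of the date part, not the extension.
with_tail = re.search(r".*(?=_.{8}_.{6}).{3}", NAME).group(0)

print(without_tail)  # my_file_name
print(with_tail)     # my_file_name_01
```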
As for a regex that would fit your needs, others have posted viable alternatives.

RegEx matching variable names but not string values

This is hard to search for. I need it in order to write a lexer and tokenizer.
I've got a problem in finding a regex which matches variable names but not string values.
The following should not be matched:
"ala ma kota"
5aalaas
This should be matched:
ala_ma_KOTA999653
l90
a
I already got something like this:
[a-zA-z]\w+
but I don't know how to exclude " chars from the beginning and end of a match.
Thanks for any reply or Google links (I couldn't find it; it can be from lmgtfy ;)).
I interpret variable names as all word character sequences with a min length of 1 and starting with a letter. Your regexp was almost correct then:
^[A-Za-z]\w*$
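
A minimal Python check of this pattern against the examples from the question follows; is_identifier is a hypothetical helper name. It also shows two differences from the [a-zA-z]\w+ attempt: \w+ requires at least two characters in total, and the range A-z includes a few punctuation characters such as the underscore.

```python
import re

IDENT = re.compile(r"^[A-Za-z]\w*$")
ATTEMPT = re.compile(r"[a-zA-z]\w+")  # the pattern from the question

def is_identifier(token):
    return IDENT.match(token) is not None

for token in ['"ala ma kota"', "5aalaas", "ala_ma_KOTA999653", "l90", "a"]:
    print(repr(token), is_identifier(token))

# Differences from the original attempt:
print(ATTEMPT.fullmatch("a"))         # None: \w+ needs a second character
print(bool(ATTEMPT.fullmatch("_x")))  # True: A-z also covers "_"
```

Because of the ^ and $ anchors, this checks one whole token at a time, so it assumes the input has already been split into tokens.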