Using a regex to identify EQUIPMENTID numbers - VBA - regex

I'm struggling to construct a RegExp that identifies equipment numbers in multiple formats, including pooled equipment numbers, e.g. AFD21101 or AFD21101-02-03 or AFD21101-2-3, with various prefixes as per the test data.
Any tips or feedback welcome. It may be easier to use a separate RegExp for each scenario, but I had hoped to have a master pattern that would identify any of these formats and extract them from a string for further processing in a more detailed order, possibly converting to long format etc.
Any assistance is greatly appreciated. Hopefully I can return the favour.
What I've tried so far:
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]|[0-9xX-][0-9]|[0-9]
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]
^(BLM)|(SUB)|
(CVR)|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT|[0-9][0-9xX][0-9xX][0-9xX][0-9xX]
Test data: the pattern will have to handle multiple IDs separated by commas or spread over multiple lines, as per the examples below
// Example test data 1: (CSV+)
CRN21003 (CB-3), CRN21004 (CB-4)
// Example test data 2: (CSV)
CVR21404, CHU21437, AFD21401
// Example test data 3: (Multi-line)
MGD22401 - 16
DEC22401 - 16
// Example test data 4: (In string)
AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA
//Additional matches
AFD21101-03
AFD21101_03
AFD21101-02-03
AFD21101_02_03
AFD21101-2-3
AFD21101_2_3
FDR21407-08
BLM21401
SUB21601
CVR21601
Fdr21601
SMP21501
CRU21501
HXC21501
AFD21501
FTS21X01
DIX21301
DIT22501
FIT21X0X
FCV21501
Pattern:
Base is max 8 characters:
1-3 letters (A-Z)
5 Digits (0-9) including X as wildcard
Followed by pooled equipment IDs
e.g. AFD21101-2-3, AFD21101-02-03 or AFD21101_02_03
_ or - are delimiters indicating abbreviated subsequent equipment IDs or ranges.
AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103 in full form
Possible prefixes, continued
KV
CHU
PLW
BCR
DEC
CTR
CWR
V
DSS
PNL
MTR
LUB
LAU
CCL
DBB
TNK
THK
PIT
AGM2XXXX - valid
Some Invalid matches would be something like
AGM211011 or AGMXXXXX or 21101 or 2110 or AGM21101-094-034 or AGM (prefix only without a trailing 5 digit number/ X wildcard)

If I understand your issue, you need to get the strings which start with one of the given prefixes and contain numbers.
You could try the following regex.
^(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT)[0-9_-]+
Details:
^: start of string
?:: non capturing group
(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT): list of prefixes.

It isn't 100% clear what you're intending to do because:
1. The test data you've supplied consists wholly of expected matches
2. The expected output is unclear, although this largely relates back to point 1!
However, there are many ways of getting the information you require. They all depend on how your source data is organised, though...
// Example test data 1:
AFD11122 SOME OTHER RANDOM DATA
WDC11121_22 SOME OTHER RANDOM DATA
// Example test Data 2:
SOME RANDOM DATA AFD11122 AND SOME MORE RANDOM DATA WDC11121_22 WITH SOME MORE
Assuming that the data is at the start of the string AND that you want to capture each string as a whole:
// Option 1
/^(.*?)\s/
^ : Start of string
(.*?) : Non-greedy capture group
\s : First space (first because the capture group was non-greedy)
// Option 2
/^([ABCDEFHIKLMNPRSTUVWX][ABCDEFHILMNRSTUVWX]?[BCDKLMPRSTUVWX]?[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[ABCDEFHIKLMNPRSTUVWX] : Capture any letter in character set
[ABCDEFHILMNRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[BCDKLMPRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[x\d]{5} : Capture any number or x 5 times
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 3
/^((?:AFD|BCR|BLM....TNK|V)[\d_\-]*)/i
^ : Start of string
( : Start of capture group
(?: : Start of non-capturing group
AFD|BCR|BLM....TNK|V : List of prefixes separated with "|"
) : End of non-capturing group
[\d_\-]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 4
/^([a-z]{1,3}[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[a-z]{1,3} : Capture any letter [range: a-z] 1 to 3 times {1,3}
[x\d]{5} : Capture any number [\d] or x [x] 5 times {5}
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
Based on your updates to the main question I would stick with option 4 unless you specifically need to make sure that only the set prefixes are matched.
In the event that your data looks more like Example Data 2 then the above expressions will need to be altered accordingly; some examples below:
/([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Remove the ^
/\b([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Add a word boundary to the start of the expression
/[^a-z]([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Start the expression with anything BUT a letter
How you alter it will depend on the data that you're searching through.
Updated RegEx based on latest question edits
/([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)/ig
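As a quick way to check the updated expression against the test data from the question, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way as the VBA/VBScript engine does; `find_ids` is a hypothetical helper name, not something from the original answer.

```python
import re

# Same pattern as the updated expression above; the i flag becomes
# re.IGNORECASE, and findall() plays the role of the g flag.
EQUIP_RE = re.compile(
    r"[a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*",
    re.IGNORECASE,
)


def find_ids(text):
    """Return every equipment-ID-like token found in text, in order."""
    return EQUIP_RE.findall(text)


if __name__ == "__main__":
    print(find_ids("CVR21404, CHU21437, AFD21401"))
    print(find_ids("AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA"))
    print(find_ids("AGM211011"))  # six digits: rejected by (?!\d)
    print(find_ids("AGMXXXXX"))   # all wildcards: rejected by (?!xxxxx)
```

Note that the trailing `[_\-\d]*` is deliberately loose, so this sketch does not reject long suffixes such as `-094-034`; that would need a tighter suffix pattern.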

Try this:
[A-Z]{1,3}[\dX]{5}([_-])0?\d(\10?\d)?
This requires the separator to be consistent, i.e. either both - or both _, by capturing the separator and using a back reference to it (\1); the second "pooled ID" is optional.
As far as I can tell, this matches all of your examples.
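The question also mentions converting pooled IDs to their long form. A minimal sketch of that step in Python follows. It assumes each suffix is an individual ID that replaces the trailing digits of the base number (not the end of a range), and that suffixes are one or two digits long; `expand_pooled` is a hypothetical name.

```python
import re

# prefix (1-3 letters), 5-character base (digits or X), then zero or more
# one- or two-digit suffixes, each introduced by - or _.
POOLED_RE = re.compile(r"([A-Za-z]{1,3})([0-9Xx]{5})((?:[-_][0-9]{1,2})*)")


def expand_pooled(equipment_id):
    """Expand e.g. 'AFD21101-02-03' into its full-form equipment IDs."""
    m = POOLED_RE.fullmatch(equipment_id)
    if m is None:
        raise ValueError(f"not a recognised equipment ID: {equipment_id!r}")
    prefix, base, tail = m.groups()
    ids = [prefix + base]
    for suffix in re.findall(r"[0-9]{1,2}", tail):
        # Overwrite the last len(suffix) characters of the base number.
        ids.append(prefix + base[:-len(suffix)] + suffix)
    return ids


if __name__ == "__main__":
    print(expand_pooled("AFD21101-02-03"))
    print(expand_pooled("AFD21101_2_3"))
```

Because the suffix is limited to two digits, an input such as AGM21101-094-034 raises `ValueError` instead of being expanded.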

Related

Regex for SQL Query

Hello everyone, I have the following problem:
I have a long list of SQL queries that I would like to adapt to one of my changes. Ultimately it is a renaming problem, and I'm afraid I am trying to solve it in a more complicated way than necessary.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
The INSERT column list had another column, which I removed via Notepad++ by searching for the term "example, " and replacing it with an empty string. I can't remove the corresponding entry in VALUES with this method, because the text varies there. So far I have only worked with the text file in which I adjusted the list of queries.
So, as you can see, there is one more entry in VALUES than in the column list (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove it, including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th value (counting by commas) should be removed. However, even after research I do not know how to realize this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)
As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bVALUES +\( - Match a word boundary, the literal 'VALUES', at least one space, and a literal opening parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
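If the queries live in a text file, the same find/replace can also be scripted. The following is a minimal sketch in Python, assuming (as the pattern above does) that none of the first 14 values contains a comma; `drop_14th_value` is a hypothetical helper name.

```python
import re

# Group 1 keeps "VALUES (" plus the first 13 comma-terminated values;
# the 14th value and its trailing comma are matched outside the group.
DROP_14TH = re.compile(r"(\bVALUES +\((?:[^,]+,){13})[^,]+,")


def drop_14th_value(sql):
    """Remove the 14th entry of the VALUES list, keeping everything else."""
    return DROP_14TH.sub(r"\1", sql, count=1)


if __name__ == "__main__":
    values = ", ".join(f"N'v{i}'" for i in range(1, 17))
    print(drop_14th_value(f"INSERT t (a) VALUES ({values});"))
```

In Python the replacement `\1` takes the place of `$1` from the editor's replace field.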
You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing group indexes 0-2, and none returns an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the first capture group (index 1) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an upper-case letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
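To see how the capture-group and lookbehind variants behave on the sample data outside Hive, here is a minimal sketch in Python. It is only an illustration: Python raw strings need a single backslash where the Hive string literal needs `\\d`, and `extract_id` and `extract_id_lookbehind` are hypothetical names.

```python
import re

# Variant 1: match the underscore, capture only the ID (group 1).
WITH_GROUP = re.compile(r"_([A-Z]\d{7})")
# Variant 2: the lookbehind requires the underscore but leaves it
# out of the match, so the whole match (group 0) is the ID.
WITH_LOOKBEHIND = re.compile(r"(?<=_)[A-Z]\d{7}")


def extract_id(text):
    m = WITH_GROUP.search(text)
    return m.group(1) if m else None


def extract_id_lookbehind(text):
    m = WITH_LOOKBEHIND.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for row in ("Sept Wk 5 Sunny Sailing_R7080075_12345",
                "Holiday_Wk2_Smiles_X1234567_ABC"):
        print(extract_id(row), extract_id_lookbehind(row))
```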

Regular expression for specific file mask

I want to have 2 regex patterns that check file names against a specific file mask. The way I would like to do it is described below.
Pattern 1:
check that the left side of _ has 7 digits.
check that the right side of _ is numeric.
check that the specified extension is there.
The input will look like this: 1234567_1.jpg
Pattern 2:
check that there are 10 digits to the left of a space character
check that there are 4 digits to the right of the space character
check that the right side of _ is numeric
check that the specified extension is there.
The input will look like this: 1234567891 1234_1.png
As stated above, this is to be used to check for a specific file mask.
I have been playing around with ideas like ^[0-9][0-9].jpg$
and ^[0-9] [0-9][0-9].jpg$ as my first tries.
I do apologise for not providing my tries.
I suggest combining patterns with | (or):
// requires: using System.IO; using System.Text.RegularExpressions;
string pattern = string.Join("|",
    @"(^[0-9]{7}_[0-9]+\.jpg$)",             // 1st possibility
    @"(^[0-9]{10} [0-9]{4}_[0-9]+\.png$)");  // 2nd one
....
string fileName = @"c:\myFiles\1234567_1.jpg";
// RegexOptions.IgnoreCase - let's accept ".JPG" or ".Jpg" files
if (Regex.IsMatch(Path.GetFileName(fileName), pattern, RegexOptions.IgnoreCase)) {
    ...
}
Let's explain the second pattern: (^[0-9]{10} [0-9]{4}_[0-9]+\.png$)
^ - anchor (string start)
[0-9]{10} - 10 digits - 0-9
(space) - single space
[0-9]{4} - 4 digits
_ - single underscore
[0-9]+ - one or more digits
\.png - .png (. is escaped)
$ - anchor (string end)
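For comparison, the same combined check can be sketched in Python. This is an illustration only, not a translation the answer provides: it uses `fullmatch` instead of the ^ and $ anchors, and `PureWindowsPath` so that the backslash-separated example path is split the same way on any OS. `matches_mask` is a hypothetical name.

```python
import re
from pathlib import PureWindowsPath

# Either 7 digits, "_", digits, ".jpg"
# or 10 digits, space, 4 digits, "_", digits, ".png".
FILE_MASK = re.compile(
    r"[0-9]{7}_[0-9]+\.jpg"
    r"|[0-9]{10} [0-9]{4}_[0-9]+\.png",
    re.IGNORECASE,  # accept ".JPG" or ".Png" as well
)


def matches_mask(path):
    """Check only the file-name part of path against the two masks."""
    name = PureWindowsPath(path).name
    return FILE_MASK.fullmatch(name) is not None


if __name__ == "__main__":
    print(matches_mask(r"c:\myFiles\1234567_1.jpg"))
    print(matches_mask("1234567891 1234_1.png"))
    print(matches_mask("1234567_1xjpg"))
```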
This should work for the first regex:
\d{7}_\d+\.(jpg|png)
This should work for the second regex:
\d{10}\s\d{4}_\d+\.(jpg|png)
If you want to use them together, just do it like below:
(\d{7}_\d+\.(jpg|png)|\d{10}\s\d{4}_\d+\.(jpg|png))
In the group (jpg|png) you can add other extensions by separating them with | (or).
You can check if it works for you at: https://regex101.com/
Cheers!

regex get string within two string

I have a query and want to get the table names between from and where.
If it's a single line and a single table without an alias, I can do this:
(?<=from )([^#]\w*)(?=.*where)
I need to get each table except the prefixed table, i.e. course c, marks s.
But I can't figure out a regex for the following query.
(The where clause could be on the same line or a new line, at the start of the line or preceded by a space or tab.)
from #prefix#student, course c, marks m
where ....
There are also subqueries in some places; it would help if that case could be handled as well.
select ... from course c
where id = (select ... from student where ...)
I'm trying to find & replace in the Sublime Text 3 editor.
Test case queries:
//output [course]
select ... from course
where ...
//output [course c] [marks s]
select ... from course c, marks s
where ....
//output [marks m]
select ... from #prefix#course c, marks m
where ...
//output [student s]
select ... from #prefix#course c
where id = (select ... from student s where ...)
You can use the following regex:
\bfrom\b(?!\s*#)([^w]*(?:\bw(?!here\b)[^w]*)*)\bwhere\b
See the regex demo
Check Case sensitive option in case you need that.
If you need to just highlight all between from and where, use lookarounds:
(?<=\bfrom\b)(?!\s*#)([^w]*(?:\bw(?!here\b)[^w]*)*)(?=\bwhere\b)
Regex breakdown:
(?<=\bfrom\b) - check if there is a whole word from before the next...
(?!\s*#) - make sure there is no 0 or more whitespaces followed by #
([^w]*(?:\bw(?!here\b)[^w]*)*) - match any text that is not where up to...
(?=\bwhere\b) - a whole word where.
UPDATE
Since you need to get comma-separated values excluding prefixed names with their aliases, you need a boundary-constrained regex. It can be achieved with \G operator:
(?:\bfrom\b(?:\s*#\w+(?:\s*\w+))*+|(?!^)\G),?\s*\K(?!(?:\w+ )?\bwhere\b)([\w ]+)(?=[^w]*(?:\bw(?!here\b)[^w]*)*\bwhere\b)
Here,
(?:\bfrom\b(?:\s*#\w+(?:\s*\w+))*+|(?!^)\G) - matches from (as a whole word) followed by optional whitespace followed by # and 1 or more alphanumerics that are followed by whitespaces+alphanumerics (alias)
,?\s*\K - optional (1 or 0) commas followed by 0 or more whitespaces that are followed by \K that forces the engine to omit the whole chunk of text matched so far
(?!(?:\w+ )?\bwhere\b) - a restrictive lookahead with which we forbid the next or the word following the next word to be equal to where
([\w ]+) - our match, 1 or more alphanumerics or space (may be replaced with [\w\h]+)
(?=[^w]*(?:\bw(?!here\b)[^w]*)*\bwhere\b) - a trailing boundary: there must be text other than where up to the first where.
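Outside an editor, the same result can be reached without \G or \K. The sketch below uses a different, simpler technique than the pattern above: it grabs each from ... where span with a lazy match and then splits on commas in code, dropping entries that start with #. Python is an assumption here (the question targets Sublime Text), `table_refs` is a hypothetical name, and the sketch does not handle a subquery placed inside the table list itself.

```python
import re

# Lazy match of everything between a whole-word "from" and the next
# whole-word "where"; DOTALL lets the span cross line breaks.
FROM_WHERE = re.compile(r"\bfrom\b(.*?)\bwhere\b", re.IGNORECASE | re.DOTALL)


def table_refs(sql):
    """Return 'table alias' entries, skipping #prefix# tables."""
    refs = []
    for clause in FROM_WHERE.findall(sql):
        for item in clause.split(","):
            item = " ".join(item.split())  # normalise whitespace
            if item and not item.startswith("#"):
                refs.append(item)
    return refs


if __name__ == "__main__":
    print(table_refs("select ... from #prefix#course c, marks m\nwhere ..."))
    print(table_refs("select ... from #prefix#course c\n"
                     "where id = (select ... from student s where ...)"))
```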

How to identify a given string is hex color format

I'm looking for a regular expression to validate hex colors in ASP.NET C#, and I am also looking for code to validate them on the server side.
For instance: #CCCCCC
Note: This is strictly for validation, i.e. accepting a valid hex color. For actual parsing you won't get the individual parts out of this.
^#(?:[0-9a-fA-F]{3}){1,2}$
For ARGB:
^#(?:[0-9a-fA-F]{3,4}){1,2}$
Dissection:
^ anchor for start of string
# the literal #
( start of group
?: indicate a non-capturing group that doesn't generate backreferences
[0-9a-fA-F] hexadecimal digit
{3} three times
) end of group
{1,2} repeat either once or twice
$ anchor for end of string
This will match an arbitrary hexadecimal color value that can be used in CSS, such as #91bf4a or #f13.
Minor disagreement with the other solution. I'd say
^#(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
The reason is that this (correctly) captures the individual RGB components. The other expression broke #112233 in three parts, '#' 112 233. The syntax is actually '#' (RR GG BB) | (R G B)
The slight disadvantage is more backtracking is required. When parsing #CCC you don't know that the second C is the green component until you hit the end of the string; when parsing #CCCCCC you don't know that the second C is still part of the red component until you see the 4th C.
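If the individual components are actually needed, one option is to give each component its own capture group. The following is a minimal sketch in Python under that assumption; it is not the expression from the answer (which repeats a single group), and `parse_hex_color` is a hypothetical name.

```python
import re

# Groups 1-3: RR GG BB (two hex digits each).
# Groups 4-6: R G B (one hex digit each, doubled when expanded).
HEX_COLOR = re.compile(
    r"#(?:([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})"
    r"|([0-9a-f])([0-9a-f])([0-9a-f]))",
    re.IGNORECASE,
)


def parse_hex_color(text):
    """Return (r, g, b) as integers, or None if text is not #rgb/#rrggbb."""
    m = HEX_COLOR.fullmatch(text)
    if m is None:
        return None
    if m.group(1) is not None:
        parts = m.group(1, 2, 3)
    else:
        parts = tuple(ch * 2 for ch in m.group(4, 5, 6))
    return tuple(int(part, 16) for part in parts)


if __name__ == "__main__":
    print(parse_hex_color("#CCCCCC"))
    print(parse_hex_color("#f13"))
```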
It also works for RGBA, whereas the other solution doesn't:
const thisRegex = /#(([0-9a-fA-F]{2}){3,4}|([0-9a-fA-F]){3,4})/g
document.write("#fff;ae#rvaerv c #fffaff---#afd #ffff".match(thisRegex))
// #fff,#fffaff,#afd,#ffff
The other solution doesn't handle #fffaff correctly:
const theOtherSolutionRegex = /#(?:[0-9a-fA-F]{3,4}){1,2}/g
document.write("#fff;ae#rvaerv c #fffaff---#afd #ffff".match(theOtherSolutionRegex))
// #fff,#fffa,#afd,#ffff
All answers so far mention the RGB format;
here is a regex for the ARGB format (grouped so that both anchors apply to every alternative):
^#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{4}|[0-9a-fA-F]{3})$
This should match any #rgb, #rgba, #rrggbb, and #rrggbbaa syntax:
/^#(?:(?:[\da-f]{3}){1,2}|(?:[\da-f]{4}){1,2})$/i
break down:
^ // start of line
# // literal pound sign, followed by
(?: // either:
(?: // a non-capturing group of:
[\da-f]{3} // exactly 3 of: a single digit or a letter 'a'–'f'
){1,2} // repeated exactly 1 or 2 times
| // or:
(?: // a non-capturing group of:
[\da-f]{4} // exactly 4 of: a single digit or a letter 'a'–'f'
){1,2} // repeated exactly 1 or 2 times
)
$ // end of line
i // ignore case (let 'A'–'F' match 'a'–'f')
Notice that the above is not equivalent to this syntax, which is incorrect:
/^#(?:[\da-f]{3,4}){1,2}$/i
This would allow a group of 3 followed by a group of 4, such as #1234567, which is not a valid hex color.
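The difference between the two expressions is easy to check in code. Here is a minimal sketch in Python, using `fullmatch` in place of the ^ and $ anchors and `re.IGNORECASE` in place of the i flag; `STRICT`, `LOOSE` and `is_hex_color` are hypothetical names.

```python
import re

# Either 3 or 6 hex digits, or 4 or 8 hex digits.
STRICT = re.compile(
    r"#(?:(?:[0-9a-f]{3}){1,2}|(?:[0-9a-f]{4}){1,2})", re.IGNORECASE
)
# The incorrect shortcut: one or two groups of 3 OR 4 digits each,
# which also admits the mixed lengths 3+4 and 4+3.
LOOSE = re.compile(r"#(?:[0-9a-f]{3,4}){1,2}", re.IGNORECASE)


def is_hex_color(text):
    """True for #rgb, #rgba, #rrggbb and #rrggbbaa only."""
    return STRICT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("#f13", "#91bf4a", "#ffff", "#11223344", "#1234567"):
        print(sample, is_hex_color(sample), LOOSE.fullmatch(sample) is not None)
```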
Use these if you want to accept named colors and rgb(a,b,c) too. The final "i" makes the match case-insensitive.
HTML colors (#123, rgb not accepted)
/^(#[a-f0-9]{6}|black|green|silver|gray|olive|white|yellow|maroon|navy|red|blue|purple|teal|fuchsia|aqua)$/i
CSS colors (#123, rgb accepted)
/^(#[a-f0-9]{6}|#[a-f0-9]{3}|rgb *\( *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *\)|rgba *\( *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *, *[0-9]{1,3}%? *\)|black|green|silver|gray|olive|white|yellow|maroon|navy|red|blue|purple|teal|fuchsia|aqua)$/i
Based on MSalters' answer, but preventing an incorrect match, the following works
^#(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
Or for an optional hash # symbol:
^#?(([0-9a-fA-F]{2}){3}|([0-9a-fA-F]){3})$
And without back references being generated:
^#?(?:(?:[0-9a-fA-F]{2}){3}|(?:[0-9a-fA-F]){3})$
Ruby
In Ruby, you have access to the \h (hexadecimal) character class. You also have to take more care of line endings, hence the \A...\z instead of the more common ^...$
/\A#(\h{3}){1,2}\z/
This will match 3 or 6 hexadecimal characters following a #. So no RGBA. It's also case-insensitive, despite not having the i flag.