Regular expression for specific file mask - regex

I want two regex patterns that check file names against a specific file mask, as described below.
Pattern 1:
checks that the left side of _ has 7 digits.
checks that the right side of _ is numeric.
checks that the specified extension is present.
The input will look like this: 1234567_1.jpg
Pattern 2:
checks that there are 10 digits to the left of a space character.
checks that there are 4 digits to the right of the space character.
checks that the right side of _ is numeric.
checks that the specified extension is present.
The input will look like this: 1234567891 1234_1.png
As stated above, this is to be used to check for a specific file mask.
My first tries were ^[0-9][0-9].jpg$
and ^[0-9] [0-9][0-9].jpg$.
I apologize for not providing my attempts earlier.

I suggest combining patterns with | (or):
string pattern = string.Join("|",
  @"(^[0-9]{7}_[0-9]+\.jpg$)", // 1st possibility
  @"(^[0-9]{10} [0-9]{4}_[0-9]+\.png$)"); // 2nd one
....
string fileName = @"c:\myFiles\1234567_1.jpg";
// RegexOptions.IgnoreCase - let's accept ".JPG" or ".Jpg" files
if (Regex.IsMatch(Path.GetFileName(fileName), pattern, RegexOptions.IgnoreCase)) {
...
}
Let's explain the second pattern: (^[0-9]{10} [0-9]{4}_[0-9]+\.png$)
^ - anchor (string start)
[0-9]{10} - 10 digits - 0-9
' ' - single space
[0-9]{4} - 4 digits
_ - single underscore
[0-9]+ - one or more digits
\.png - .png (. is escaped)
$ - anchor (string end)
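For readers who want to try the combined mask outside of C#, here is a minimal Python sketch of the same idea. The names FILE_MASK and matches_mask are made up for illustration, and PureWindowsPath is used only so that the Windows-style path is split the same way on any platform.

```python
import re
from pathlib import PureWindowsPath

# The same two alternatives as above, joined with | (or).
FILE_MASK = re.compile(
    "|".join([
        r"(^[0-9]{7}_[0-9]+\.jpg$)",            # 1st possibility
        r"(^[0-9]{10} [0-9]{4}_[0-9]+\.png$)",  # 2nd one
    ]),
    re.IGNORECASE,  # accept ".JPG" or ".Jpg" files as well
)


def matches_mask(path: str) -> bool:
    """Return True if the file name part of a Windows-style path fits either mask."""
    file_name = PureWindowsPath(path).name  # strip the directory part
    return FILE_MASK.search(file_name) is not None


print(matches_mask(r"c:\myFiles\1234567_1.jpg"))
print(matches_mask("1234567891 1234_1.PNG"))
print(matches_mask("123456_1.jpg"))
```

Because each alternative carries its own extension, 1234567_1.png is rejected here; that follows the patterns as written above rather than a shared extension list.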

This should work for the first pattern:
^\d{7}_\d+\.(jpg|png)$
This should work for the second pattern:
^\d{10}\s\d{4}_\d+\.(jpg|png)$
If you want to use them together, combine them like this:
^(\d{7}_\d+\.(jpg|png)|\d{10}\s\d{4}_\d+\.(jpg|png))$
To allow more extensions, add them to the (jpg|png) group, separated by | (or).
You can check if it works for you at: https://regex101.com/
Cheers!

Snippets VS Code Regex

I need your help. I am building a snippet, and I need to transform the path of the file, which is this:
D:\Project\test\src\EnsLib\File\aaa\bbb
so that it looks like this:
EnsLib\File\aaa\bbb
That is, keep only the part after "src" and replace each \ with a dot.
Example: D:\Project\test\src\EnsLib\File\aaa\bbb
Result: EnsLib.File.aaa.bbb
The starting point is always whatever follows the src folder.
My test regexes are these:
"${TM_DIRECTORY/(.*\\\\{4})/$1/}",
"${TM_DIRECTORY/.*src\\\\(.*)\\\\(.*)$/.$2/}.${TM_FILENAME_BASE}",
// "${TM_DIRECTORY/.*\\\\(.*)\\\\(.*)$/$1.$2/}.${TM_FILENAME_BASE}",
// "${RELATIVE_FILEPATH/\\D{4}(\\W)\\..+$/$1/g}",
// "${TM_DIRECTORY/(.*src\\\\)//g}.${TM_FILENAME_BASE}",
// "${RELATIVE_FILEPATH/(\\D{3})\\W|(\\..+$)/$1.$2/g}",
// "${RELATIVE_FILEPATH/\\W/./g}",
It seems you want
"${TM_DIRECTORY/^.*?\\\\src\\\\|(\\\\)/${1:+.}/g}"
The regex is ^.*?\\src\\|(\\), it matches
^ - start of string
.*? - any zero or more chars other than line break chars, as few as possible
\\src\\ - \src\ string
| - or
(\\) - Group 1 ($1): a \ char.
If Group 1 matches, the replacement is a ., else, the replacement is an empty string, i.e. the text from the start of string till \src\ is simply removed.
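The same alternation can be tried outside VS Code. The sketch below is only a Python emulation of the transform, not snippet syntax: the function name to_dotted is hypothetical, and the conditional replacement ${1:+.} is imitated with a replacement function.

```python
import re

# Same regex as in the snippet, without the extra JSON escaping:
# either the leading "...\src\" part, or a single backslash (Group 1).
PATTERN = re.compile(r"^.*?\\src\\|(\\)")


def to_dotted(directory: str) -> str:
    """Drop everything up to and including \\src\\ and turn the remaining \\ into dots."""
    # "." when Group 1 matched (a lone backslash), "" for the leading "...\src\" part
    return PATTERN.sub(lambda m: "." if m.group(1) else "", directory)


print(to_dotted(r"D:\Project\test\src\EnsLib\File\aaa\bbb"))
```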

Using a regex to identify EQUIPMENTID numbers - VBA

I am struggling to construct a RegExp to identify equipment numbers. It needs to identify equipment numbers in multiple formats, including pooled equipment numbers, e.g. AFD21101 or AFD21101-02-03 or AFD21101-2-3, with the various prefixes shown in the test data.
Any tips or feedback are welcome. It may be easier to use a separate RegExp for each scenario, but I had hoped to have a master pattern that would identify any of these formats and extract them from a string for further, more detailed processing, possibly converting to long format etc.
Any assistance is greatly appreciated. Hopefully I can return the favour.
What I've tried so far:
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]|[0-9xX-][0-9]|[0-9]
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]
^(BLM)|(SUB)|
(CVR)|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT|[0-9][0-9xX][0-9xX][0-9xX][0-9xX]
Test data: the pattern will have to handle multiple IDs separated by commas or spread over multiple lines, as in the examples below
// Example test data 1: (CSV+)
CRN21003 (CB-3), CRN21004 (CB-4)
// Example test data 2: (CSV)
CVR21404, CHU21437, AFD21401
// Example test data 3: (Multi-line)
MGD22401 - 16
DEC22401 - 16
// Example test data 4: (In string)
AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA
//Additional matches
AFD21101-03
AFD21101_03
AFD21101-02-03
AFD21101_02_03
AFD21101-2-3
AFD21101_2_3
FDR21407-08
BLM21401
SUB21601
CVR21601
Fdr21601
SMP21501
CRU21501
HXC21501
AFD21501
FTS21X01
DIX21301
DIT22501
FIT21X0X
FCV21501
Pattern:
Base is max 8 characters
1-3 letters (A-Z)
5 Digits (0-9) including X as wildcard
Followed by pooled EQUIPMENT ID's
e.g. AFD21101-2-3, AFD21101-02-03 or AFD21101_02_03
_ or - are delimiters indicating abbreviated subsequent equipment id's or ranges.
AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103 in full form
Possible prefixes, continued
KV
CHU
PLW
BCR
DEC
CTR
CWR
V
DSS
PNL
MTR
LUB
LAU
CCL
DBB
TNK
THK
PIT
AGM2XXXX - valid
Some Invalid matches would be something like
AGM211011 or AGMXXXXX or 21101 or 2110 or AGM21101-094-034 or AGM (prefix only without a trailing 5 digit number/ X wildcard)
If I understand your issue, you need to get the strings that start with one of the provided prefixes and contain numbers.
You could try the following regex.
^(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT)[0-9_-]+
Details:
^: start of string
(?:...): non-capturing group
(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT): list of prefixes.
It isn't 100% clear what you're intending to do because:
The test data you've supplied is comprised wholly of expected matches
The expected output is unclear, although this largely relates back to point 1!
However, there are many ways of getting the information you require. They all depend on how your source data is organised though...
// Example test data 1:
AFD11122 SOME OTHER RANDOM DATA
WDC11121_22 SOME OTHER RANDOM DATA
// Example test Data 2:
SOME RANDOM DATA AFD11122 AND SOME MORE RANDOM DATA WDC11121_22 WITH SOME MORE
Assuming that the data is at the start of the string AND that you want to capture each string as a whole:
// Option 1
/^(.*?)\s/
^ : Start of string
(.*?) : Non-greedy capture group
\s : First space (first because the capture group was non-greedy)
// Option 2
/^([ABCDEFHIKLMNPRSTUVWX][ABCDEFHILMNRSTUVWX]?[BCDKLMPRSTUVWX]?[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[ABCDEFHIKLMNPRSTUVWX] : Capture any letter in character set
[ABCDEFHILMNRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[BCDKLMPRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[x\d]{5} : Capture any number or x 5 times
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 3
/^((?:AFD|BCR|BLM....TNK|V)[\d_\-]*)/i
^ : Start of string
( : Start of capture group
(?: : Start of non-capturing group
AFD|BCR|BLM....TNK|V : List of prefixes separated with "|"
) : End of non-capturing group
[\d_\-]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 4
/^([a-z]{1,3}[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[a-z]{1,3} : Capture any letter [range: a-z] 1 to 3 times {1,3}
[x\d]{5} : Capture any number [\d] or x [x] 5 times {5}
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
Based on your updates to the main question I would stick with option 4 unless you specifically need to make sure that only the set prefixes are matched.
In the event that your data looks more like Example Data 2 then the above expressions will need to be altered accordingly; some examples below:
/([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Remove the ^
/\b([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Add a word boundary to the start of the expression
/[^a-z]([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Start the expression with anything BUT a letter
How you alter it will depend on the data that you're searching through.
Updated RegEx based on latest question edits
/([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)/ig
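To see how the updated expression behaves on the sample data, here is a small Python sketch. It assumes the pattern can be carried over unchanged to Python's re module (the VBA RegExp object may differ in details), and the names EQUIPMENT_ID and find_equipment_ids are made up for the example.

```python
import re

# The updated expression from above; re.IGNORECASE plays the role of the /i flag,
# and findall() plays the role of the /g flag.
EQUIPMENT_ID = re.compile(r"([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)", re.IGNORECASE)


def find_equipment_ids(text: str) -> list[str]:
    """Return every equipment ID candidate found anywhere in the text."""
    return EQUIPMENT_ID.findall(text)


print(find_equipment_ids("AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA"))
print(find_equipment_ids("CVR21404, CHU21437, AFD21401"))
print(find_equipment_ids("AGM211011"))  # six digits after the prefix
print(find_equipment_ids("AGMXXXXX"))   # wildcards only
```

The two lookaheads do the rejecting here: (?!xxxxx) refuses an all-wildcard number and (?!\d) refuses a sixth digit.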
Try this:
[A-Z]{1,3}[\dX]{5}([_-])0?\d(\10?\d)?
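This requires the separator to be consistent, i.e. either both - or both _, by capturing the separator and using a back reference to it (\1); the second "pooled ID" is optional. Note that \10 is meant as \1 followed by an optional 0, which relies on the engine not reading it as a reference to group 10.
As far as I can tell, this matches all of your pooled examples (it requires at least one -NN or _NN suffix).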

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1,2, or 3 digits! To make the string above clear, this is how it looks like separated by a space 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma, but after the third digit beyond the dot (wrong; it should have been after the second digit), and I cannot get it to insert the comma in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
                     ^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on @CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
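Both approaches can be compared side by side. The Python sketch below is an illustration only (it is not Oracle SQL): the replacement string "\1," for the lookahead variant is an assumption, since the answers above only give the pattern, and the second call mirrors the regexp_replace arguments.

```python
import re

corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"

# Lookahead variant: the greedy \d{1,3} backtracks until 4 digits follow,
# then a comma is appended after the token.
with_lookahead = re.sub(r"(\d{4}\w{4}\.\d{1,3})(?=\d{4})", r"\1,", corpus)

# Lookahead-free variant: split each digit run into "1-3 digits" + "4 digits"
# and put the comma between the two groups.
without_lookahead = re.sub(r"(\d{1,3})(\d{4})", r"\1,\2", corpus)

print(with_lookahead)
print(without_lookahead)
```

The lookahead-free variant works on this corpus because every run of digits is bounded by letters or a dot, so a run that holds both a fraction and the next 4-digit prefix can only be split one way.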
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on reversed String with a reversed substitution (comma first),
// then drop the comma that ends up in front of the first reversed token
var res = rts.replace(rgx, `,$1`).slice(1);
// Revert the result back to normal direction
var ser = res.split("").reverse().join("");
console.log(ser);

How to match the whole expression only, even when there are sub parts that match?

I am trying to write an input validation pattern that allows wildcard characters. The input field is 9 characters max and should follow these rules:
* + 1-8 characters
1-8 characters + *
* + 1-7 characters + *
I've written this regex using the regex documentation and testing it on one of the regex testers.
\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9}
It matches all these correctly
123456789
*1*
*12*
*123*
*1234*
*12345*
*123456*
*1234567*
1234567*
123456*
12345*
1234*
123*
12*
1*
*1
*12
*123
*1234
*12345
*123456
*1234567
*12345678
But it also matches when I don't want it to. For example, it finds 2 matches in *123456789*: the first match is *12345678 and the second one is 9*.
I don't want in this case to find any matches. Either the whole string matches one of the patterns or not. How does one do that?
Use anchors that make sure the regex always matches the entire string:
^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$
Note the parentheses to make sure that the alternation is contained within the group:
^
(
\*[0-9]{1,7}\*
|
[0-9]{1,8}\*
|
\*[0-9]{1,8}
|
[0-9]{9}
)
$
Also, {1} is always superfluous - one match per token is the default.
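As a quick check of the anchored version, here is a minimal Python sketch. The names WILDCARD_INPUT and is_valid are hypothetical, and the question does not say which regex engine is used, so treat this as an illustration of the anchoring idea only.

```python
import re

# Anchored pattern from above: the group keeps ^ and $ applied to every alternative.
WILDCARD_INPUT = re.compile(r"^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$")


def is_valid(value: str) -> bool:
    """Return True only if the whole input matches one of the allowed shapes."""
    return WILDCARD_INPUT.search(value) is not None


print(is_valid("*1234567*"))    # star + 7 digits + star
print(is_valid("*12345678"))    # star + 8 digits
print(is_valid("*123456789*"))  # too long, so no partial matches are reported
```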
You could use start and end string anchors:
http://www.regular-expressions.info/anchors.html
So, your regex would be something like this (note first and last symbol):
^(\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9})$

Regex for Regex validation decimal[19,3]

I want to validate a decimal number (decimal[19,3]). I used this
@"[\d]{1,16}|[\d]{1,16}[\.]\d{1,3}"
but it didn't work.
Below are valid values:
1234567890123456.123
1234567890123456.12
1234567890123456.1
1234567890123456
1234567
0.0
.1
Simplification:
The \d doesn't have to be in []. Use [] only when you want to check whether a character is one of multiple characters or character classes.
. doesn't need to be escaped inside []. [\.] should just match a literal ., although whether it could also let a \ appear in place of the . may be language dependent. Alternatively, take it out of the [] and keep it escaped.
So we get to:
\d{1,16}|\d{1,16}\.\d{1,3}
(which can be shortened using the optional / "once or not at all" quantifier (?)
to \d{1,16}(\.\d{1,3})?)
Corrections:
You probably want to make the second \d{1,16} optional, or equivalently simply make it \d{0,16}, so something like .1 is allowed:
\d{1,16}|\d{0,16}\.\d{1,3}
If something like 1. should also be allowed, you'll need to add an optional . to the first part:
\d{1,16}\.?|\d{0,16}\.\d{1,3}
Edit: I was under the impression [\d] matches \ or d, but it actually matches the character class \d (corrected above).
This would match your 3 scenarios
^(\d{1,16}|(\d{0,16}\.)?\d{1,3})$
first part: a 1 to 16 digit number
second: a 0 to 16 digit number, a dot, and then 1 to 3 decimals
third: nothing before the dot and then 1 to 3 decimals
The ^ and $ are anchors that match the start and end of the line, so if you need to search for numbers inside lines of text, you should remove them.
Usage in C#
string resultString = null;
try {
resultString = Regex.Match(subjectString, @"\d{1,16}\.?|\d{0,16}\.\d{1,3}").Value;
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Slight optimization
A slightly more complicated, but slightly more correct, regex would use the ?: notation on the inner group to make it a non-capturing group, if you are not using the captured value:
^(\d{1,16}|(?:\d{0,16}\.)?\d{1,3})$
Following Regex will help you out -
@"^(\d{1,16}(\.\d{1,3})?|\.\d{1,3})$"
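To check this pattern against the values listed in the question, here is a small Python sketch. The question is about C#, so this only illustrates the pattern itself; the names DECIMAL_19_3 and is_decimal_19_3 are made up, and the rejected inputs are examples chosen here, not taken from the question.

```python
import re

# Either 1-16 digits with an optional ".ddd" part, or a bare ".ddd".
DECIMAL_19_3 = re.compile(r"^(\d{1,16}(\.\d{1,3})?|\.\d{1,3})$")


def is_decimal_19_3(text: str) -> bool:
    """Return True if the text has at most 16 integer digits and at most 3 decimals."""
    return DECIMAL_19_3.match(text) is not None


valid = ["1234567890123456.123", "1234567890123456.12", "1234567890123456.1",
         "1234567890123456", "1234567", "0.0", ".1"]
print(all(is_decimal_19_3(v) for v in valid))

# Illustrative rejects: 17 integer digits, 4 decimals, trailing dot, lone dot.
print([is_decimal_19_3(v) for v in ["12345678901234567", "1.1234", "1.", "."]])
```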
Try something like this:
(\d{0,16}\.\d{0,3})|(\d{0,16})
It works with all your examples.
edit. new version ;)
You can try:
^\d{0,16}(?:\.|$)(?:\d{0,3}|)$
match 0 to 16 digits
then match a dot or end of string
and then match up to 3 more digits