I am using Visual Studio File search with a regular expression to find an alphanumeric string of 7 characters, starting with an S or s and followed by 6 digits. For example:
s123456
S012458
s004580
Is there any easy way of searching for it?
I have already used this one, although I am not sure if it is getting everything as there are lots of files:
[sS]{1}\d{6}
The first remark: your proposed regex [sS]{1}\d{6} contains an unnecessary
{1}, because an element without a quantifier is matched exactly once anyway.
Another remark: The "shortened" regex [sS]\d{6} can capture a fragment
of a longer word, like xs123456 (extra chars before s) or S01245856
(more than 6 digits).
To protect against such cases, you should add a word boundary marker - \b,
both at the start and at the end of the regex.
So the final version is: \b[sS]\d{6}\b.
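A minimal sketch of the difference the word boundaries make, using Python's re module purely for illustration. The sample text is invented: it holds the three IDs from the question plus the two near-misses mentioned above.

```python
import re

# Invented sample: three valid IDs, then two strings that only contain a fragment.
text = "s123456 S012458 s004580 xs123456 S01245856"

# Without boundaries the pattern also picks fragments out of longer words.
loose = re.findall(r"[sS]\d{6}", text)

# With \b on both sides only stand-alone 7-character IDs are reported.
strict = re.findall(r"\b[sS]\d{6}\b", text)

print(loose)
print(strict)
```

The loose pattern reports five hits here, the strict one only the first three.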
I use regex sparingly only when it comes up in projects. I've tried to learn, but I'm stuck. Any help would be much appreciated!
Expected Behavior
FNBR6.03202021.Default_Title.mp4 -> FNBR | 6 | 03202021 | Default_Title | mp4
GEVY230.03182021.Project76.mp4 -> GEVY | 230 | 03182021 | Project76 | mp4
Sample Array: [FNBR, 6, 03202021, Default_Title, mp4]
What I have
I have gotten this far: /(^.+?(?=\d))|(\d+)|([^.]+)/gm
Expression | Description
^.+?(?=\d) | This should match everything up until a digit.
(\d+) | This should match digits in blocks (next to each other).
([^.]+) | This should match everything except .
So I tried to put it together using | OR operator. I got a list like so [FNBR, 6, ., 30202021, .Default_Title.mp, 4, ]. Which I was expecting, but I don't know how to continue.
I know I need basically what ([^.]+) does, with the exception of the first block of digits. I only care about separating the first block of digits before the first . decimal/period. So in FNBR6.03202021.Default_Title.mp4 I only care about separating FNBR6 into FNBR and 6, after that everything is just split like normal using .
I am not using a standard programming language like Python or Java, so it would be a hassle to do this with two regex expressions, such as splitting everything by . and then splitting index[0] (FNBR6) separately. I don't want that.
Any help, feedback, and criticism would be much appreciated! I tried not to make this too long, but I also wanted to ensure I thoroughly explained the situation. Thanks!
Edit 1: Helpful Suggestions
#41686d6564 points out that I should mention what platform I'm using. I'm testing Microsoft Power Automate Desktop for work. That's what I'm running the regex expressions in.
#trincot listed a bunch of very helpful examples in the comments. However, none of them seem to work properly in this program.
Using /(^.+?(?=\d))|(\d+)|([^.]+)/gm:
Using any of trincot's examples like [A-Z]{2,}|\w+ give:
Edit 2: Reliable Answer
#knittl's example worked reliably and works as needed. Their deconstruction and explanation were also very helpful. Thanks!
Power Automate seems to make the first and last index empty when using regular expressions to create a list. I just simply remove the first and last index.
I tried everyone's comments/suggestions in the order in which they were posted. knittl's was the first that worked reliably. I want to thank everyone for their quick, and thoughtful help! Thanks everyone!
This does not sound too complicated. It only becomes complicated when people start to use lookahead and lookbehinds when they are not required. With the information given, each part can be matched directly with simple character classes and quantifiers.
Input: FNBR6.03202021.Default_Title.mp4
Regex:
([^0-9]+)([0-9]+)\.([0-9]+)\.([^.]+)\.(.+)
Deconstructed:
([^0-9]+) match any number of non-digits and capture in group 1 (but match at least one character)
([0-9]+) match any number of digits and capture in group 2 (at least 1 digit)
\. match one literal dot
([0-9]+) match any number of digits and capture in group 3 (at least 1 digit)
\. match one literal dot
([^.]+) match any number of characters which are not the dot and capture in group 4 (at least one character)
\. match one literal dot
(.+) match the rest of the input and capture in group 5. Can be anything (but again: at least 1 character)
Try it online on regex101.
NB. depending on your regex flavor, [0-9] can be shortened to \d and [^0-9] can be replaced with \D, resulting in: (\D+)(\d+)\.(\d+)\.([^.]+)\.(.+)
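As a quick check of the pattern above, here is a small Python sketch that applies the \D / \d form to the two file names from the question. The helper name split_name is made up for this example, and Power Automate's own list output may look different.

```python
import re

# The pattern from this answer in its \D / \d form.
PATTERN = re.compile(r"(\D+)(\d+)\.(\d+)\.([^.]+)\.(.+)")

def split_name(filename):
    """Return the five captured parts, or None if the name does not fit."""
    m = PATTERN.fullmatch(filename)
    return list(m.groups()) if m else None

print(split_name("FNBR6.03202021.Default_Title.mp4"))
print(split_name("GEVY230.03182021.Project76.mp4"))
```

A name that lacks the leading letters-then-digits block, such as 03202021.Default_Title.mp4, returns None instead of a partial list.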
Or, if you can already split the string into an array/multiple variables, split first, then apply the regex to the first part (pseudo code):
var allparts = input.split('.');
var firstpart = allparts[0].match(/^(\D+)(\d+)$/);
// firstpart[0] = FNBR6 (the whole match)
// firstpart[1] = FNBR
// firstpart[2] = 6
Is it possible to have multiple quantifiers in a regex?
Say I have the following regex:
[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30}
I want to match any string which is either 30, 36 or 44 chars long. Is it possible to write this shorter in any way? Something like the following:
[A-Z0-9<]{30|36|44}?
Edit: Seeing the answers I assume there is not really a way in which you can write the above shorter. The best solution would be to solve it programmatically I guess. Thanks for the input.
Brief
Note that your regex performs much better than any other answers you'll get on your question, but since your question is actually about simplifying/shortening your regex, you can use this.
Your original regex (38 characters):
[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30}
Your original regex with modifications so that we can use it to test against multiline input (44 characters):
^(?:[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30})$
Code
My original regex (32 characters):
([A-Z0-9]){44}|(?1){36}|(?1){30}
My original regex with modifications so that we can use it to test against multiline input (38 characters):
^(?:([A-Z0-9]){44}|(?1){36}|(?1){30})$
See regex in use here
Explanation
([A-Z0-9]){44}|(?1){36}|(?1){30} Match either of the following
([A-Z0-9]){44} Match any character in the set (A-Z or 0-9) exactly 44 times. This also captures a single character in the set into capture group 1. We will later use this capture group through recursion.
(?1){36} Recurse the first subpattern exactly 36 times
(?1){30} Recurse the first subpattern exactly 30 times
Looks like you want
[A-Z0-9]{30}([A-Z0-9]{6}([A-Z0-9]{8})?)?
This isn't actually simpler, mind you.
If you only need to test the length of the string, you don't need to check that it contains only uppercase letters [A-Z] and digits [0-9], so you can drop the [A-Z0-9] part. You can then list the allowed lengths as follows:
^(?:.{30}|.{36}|.{44})$
If you do need that check, you can use this regex without typing [A-Z0-9] three times:
^(?=[A-Z0-9]*$)(?:.{30}|.{36}|.{44})$
You have the [A-Z0-9] part only once and a generic . to check the length of string.
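To compare the variants in this thread, the sketch below runs them in Python next to a plain non-regex check (the "solve it programmatically" option from the question's edit). The (?1) version is left out because Python's built-in re module does not support subroutine calls; the names verdicts and plain_check are invented for this example.

```python
import re

# The three patterns from this thread that Python's re module can run.
ALTERNATION = re.compile(r"[A-Z0-9]{44}|[A-Z0-9]{36}|[A-Z0-9]{30}")
NESTED = re.compile(r"[A-Z0-9]{30}([A-Z0-9]{6}([A-Z0-9]{8})?)?")
LOOKAHEAD = re.compile(r"^(?=[A-Z0-9]*$)(?:.{30}|.{36}|.{44})$")

ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def plain_check(s):
    """Non-regex version: test length and character set separately."""
    return len(s) in (30, 36, 44) and set(s) <= ALLOWED

def verdicts(s):
    """Return the answer of each approach for the whole string s."""
    return (
        ALTERNATION.fullmatch(s) is not None,
        NESTED.fullmatch(s) is not None,
        LOOKAHEAD.fullmatch(s) is not None,
        plain_check(s),
    )

# Build strings of various lengths from allowed characters and compare.
for n in (29, 30, 33, 36, 44, 45):
    print(n, verdicts("A1" * (n // 2) + "Z" * (n % 2)))
```

All four approaches agree on these inputs; which one reads best is a matter of taste.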
Just getting into regex and I am trying to write a regex for a UK National Insurance number, for example ab123456c.
I've currently got this, which works:
^[jJ]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[a-zA-Z]$
but I was wondering if there is a shorter version, for example:
^[jJ]{2} [ [\-\s]{0,1}[0-9]{2} ]{3} [\-\s]{0,1}[a-zA-Z]$
So repeat the [-\s]{0,1}[0-9]{2} three times by wrapping it in some sort of group [ * ]{3}
If I got you right, your insurance numbers are always two letters, six digits, and a final letter (A, B, C or D)? Wouldn't the easiest way be to try something like this:
/\w{2}\d{6}[A-D]/
You catch the first two characters with \w{2} (note that \w also accepts digits and underscores, so [A-Za-z]{2} is stricter), then you get six digits with \d{6}, and you end with a letter from A to D with [A-D].
Or, if blanks are important, try this:
/\w{2}\d\d \d\d \d\d [A-D]/
It can be shortened further with a quantified group: (\d\d ){3} repeats the pattern, not the matched text, so it matches any three two-digit blocks (e.g. 12 34 56), not only a repeated one like 23 23 23.
If you really want to learn regex, I suggest this tutorial; it helped me a lot when I was starting out with regular expressions.
A simple search for a regex tutorial in your favorite search engine (duckduckgo for sure) would give you the answer faster than asking in a forum!
So what you are looking for is a non-capturing group (?:...). You can rewrite your pattern like this:
^[jJ]{2}(?:[-\s]?[0-9]{2}){3}[-\s]?[a-zA-Z]$
or like this if you use a case insensitive flag/option:
^J{2}(?:[-\s]?[0-9]{2}){3}[-\s]?[A-Z]$
Another possible way is to first remove everything that is not a letter or a digit (and possibly apply an uppercase function). Then you only need:
^J{2}[0-9]{6}[A-Z]$
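A minimal Python sketch of that clean-then-match idea. The function name is_ni_number is invented here, and the JJ prefix is kept only because the question's pattern uses it.

```python
import re

def is_ni_number(raw):
    """Strip separators, uppercase, then test against the simple pattern."""
    # Drop everything that is not an ASCII letter or digit, then uppercase.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return re.fullmatch(r"J{2}[0-9]{6}[A-Z]", cleaned) is not None

for candidate in ("jj 12 34 56 a", "JJ-12-34-56-C", "jj12345a", "ab123456c"):
    print(candidate, is_ni_number(candidate))
```

Note that this is more lenient than the single-regex versions: because separators are removed before matching, they are accepted anywhere in the input, not just between the digit pairs.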
As an aside, I don't understand why you start your pattern with J for the first two letters, since many others letters are allowed according to this article: https://en.wikipedia.org/wiki/National_Insurance_number
One other thing: short and efficient are two different things in computing.
For example, this pattern is efficient too, and more restrictive:
^(?!N[KT]|BG|GB|[KT]N|ZZ)[ABCEGHJ-PRSTW-Z][ABCEGHJ-NPRSTW-Z][0-9][0-9][-\s]?[0-9][0-9][-\s]?[0-9][0-9][-\s]?[A-D]$
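The following Python sketch simply exercises that pattern on a few made-up inputs to show what it accepts as written; it makes no claim about which prefixes are officially valid.

```python
import re

# The restrictive pattern from above, split over three lines for readability.
NI_STRICT = re.compile(
    r"^(?!N[KT]|BG|GB|[KT]N|ZZ)"
    r"[ABCEGHJ-PRSTW-Z][ABCEGHJ-NPRSTW-Z]"
    r"[0-9][0-9][-\s]?[0-9][0-9][-\s]?[0-9][0-9][-\s]?[A-D]$"
)

samples = ["AB123456C", "AB12-34-56-C", "BG123456A", "QQ123456C",
           "AB123456E", "AB 12 34 56 C"]
for candidate in samples:
    print(candidate, NI_STRICT.match(candidate) is not None)
```

As written, the pattern allows no separator between the two letters and the first digit pair, so AB 12 34 56 C is rejected while AB12-34-56-C is accepted. It is also case-sensitive unless a case-insensitive flag is added.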
A shorter version:
/^j{2}(?:[-\s]?\d{2}){3}[-\s]?[a-zA-Z]$/i
See the regex online demo
Note that
you do not need to escape - inside the character class if it is at the beginning or end of the class (see Metacharacters Inside Character Classes)
you can use a \d as a shorthand character class for a digit (see Shorthand Character Classes)
{0,1} limiting quantifier can usually be represented as a ? quantifier (1 or zero occurrences) (see Limiting Repetition)
The /i (or inline modifier version (?i) - depending on the engine) can be used to turn [jJ] to just j or J (see Specifying Modes Inside The Regular Expression)
A limiting quantifier can be applied to a whole (better non-capturing) group: (?:[-\s]?\d{2}){3} (see Limiting Repetition)
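To see that the shortened pattern behaves like the original one, here is a small Python comparison on a handful of invented inputs. In Python the /i flag is spelled re.IGNORECASE.

```python
import re

# The pattern from the question, unchanged.
LONG_FORM = re.compile(
    r"^[jJ]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[0-9]{2}[\-\s]{0,1}[a-zA-Z]$"
)

# The shortened pattern: ? for {0,1}, \d for [0-9], one group repeated 3 times.
SHORT_FORM = re.compile(r"^j{2}(?:[-\s]?\d{2}){3}[-\s]?[a-zA-Z]$", re.IGNORECASE)

samples = ["jj123456a", "JJ 12 34 56 B", "Jj-12-34-56-c", "jj1234567", "ab123456c"]
for s in samples:
    print(s, bool(LONG_FORM.match(s)), bool(SHORT_FORM.match(s)))
```

Both patterns give the same verdict for each of these ASCII samples.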
I am trying to understand the following regex, I understand the initial part but I'm not able to figure out what {3,19} is doing here:
/[A-Z][A-Za-z0-9\s]{3,19}$/
That is the custom repetition operation known as the Quantifier.
\d{3} will find exactly three digits.
[a-c]{1,3} will find any occurrence of a, b, or c at least one time, but up to three times.
\w{0,1} means a word character will be optionally found. This is the same as placing a Question Mark, e.g.: \w?
(\d\w){1,} will find any combination of a Digit followed by a Word Character at least One time, but up to infinite times. So it would match 1k1k2k4k1k5j2j9k4h1k5k This is the same as a Plus symbol, e.g.: (\d\w)+
b{0,}\d will optionally find the letter b followed by a digit, but could also match infinite letter b's followed by a digit. So it will match 5, b5, or even bbbbbbb5. This is the same as an Asterisk. e.g.: b*\d
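The equivalences in this list can be checked with a short Python sketch. The helper matches is made up for this example and tests whether a whole string fits a pattern.

```python
import re

def matches(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Exact and ranged counts.
print(matches(r"\d{3}", "123"), matches(r"\d{3}", "12"))
print(matches(r"[a-c]{1,3}", "cab"), matches(r"[a-c]{1,3}", "abca"))

# Each brace form paired with the shorthand it is equivalent to.
pairs = [(r"\w{0,1}", r"\w?"), (r"(\d\w){1,}", r"(\d\w)+"), (r"b{0,}\d", r"b*\d")]
samples = ["", "x", "1k1k2k4k1k5j2j9k4h1k5k", "5", "b5", "bbbbbbb5", "bb"]
for braces, shorthand in pairs:
    same = all(matches(braces, s) == matches(shorthand, s) for s in samples)
    print(braces, "behaves like", shorthand, "on the samples:", same)
```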
Quantifiers
They are 'quantifiers' - it means 'match previous pattern between 3 and 19 times'
When you are learning regular expressions, it's really useful to play with them in an interactive tool which can highlight the matches. I've always liked a tool called Regex Coach, but it is Windows only. Plenty of online tools though - have a play with your regex here, for example.
{n,m} means "repeat the previous element at least n times and at most m times", so the expression
[A-Za-z0-9\s]{3,19} means "match between 3 and 19 characters that are letters, digits, or whitespace". Note that repetition is greedy by default, so this will try to match as many characters as possible within that range (this doesn't come into play here, since the end of line anchor makes it so that there is really only one possibility for each match).
The regular expression you have there /[A-Z][A-Za-z0-9\s]{3,19}$/ breaks up to mean this:
[A-Z] We are looking for a Capital letter
Followed by
[A-Za-z0-9\s]{3,19} a series of letters, digits, or white space that is between 3 and 19 characters
$ Then the end of the line.
It will have to match [A-Za-z0-9\s] between 3 and 19 times.
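A short Python sketch of the expression from the question applied to a few invented strings. It also shows a side effect of the pattern having no ^ anchor: it can match just the tail of a longer string.

```python
import re

pattern = re.compile(r"[A-Z][A-Za-z0-9\s]{3,19}$")

samples = ["Hello", "Hi", "Room 42b", "A" + "b" * 24, "xxx Hello"]
for text in samples:
    m = pattern.search(text)
    # Show the matched part, or None when nothing matches.
    print(repr(text), "->", m.group(0) if m else None)
```

Hi is too short (one character after the capital instead of at least three), and the 25-character string is too long (24 characters after the capital instead of at most 19). In xxx Hello only the trailing Hello is matched.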
Here's a good regex reference guide:
http://www.regular-expressions.info/reference.html
What do comma-separated numbers in curly braces at the end of a regex mean?
They denote a quantifier with the range specified in the curly braces.
The curly braces are analogous to a function with arguments: we can specify a single integer, or two integers that act as the bounds of a range.
/[A-Z][A-Za-z0-9\s]{3,19}$/
Online regex websites can help explain the expression:
https://regex101.com/
I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
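The match lists above can be reproduced with a small Python sketch that runs the pattern over the sample text from the question (without the dashed separator lines).

```python
import re

SAMPLE = """\
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
"""

pattern = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

# Every match in the sample text, in order of appearance.
found = pattern.findall(SAMPLE)
print(found)

# The two extra strings mentioned above, plus one with 11 digits in total.
for extra in ("22-7777777-1", "1-88888888-1", "22-88888888-1"):
    print(extra, pattern.fullmatch(extra) is not None)
```

The last string has more than 10 digits and is rejected by the look-ahead.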
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
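The alternation above follows a simple rule: the first group takes 1 to 8 digits and the second group gets whatever is left of the 9 digits available before the final one. As a sketch, the same pattern can be generated in Python and compared with the hand-written one. The variable names are invented for this example.

```python
import re

HAND_WRITTEN = (r"\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}"
                r"|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b")

# One alternative per length of the first group (1..8 digits); the second
# group may use at most the remaining digits out of 9.
alternatives = []
for first in range(1, 9):
    head = r"\d" if first == 1 else r"\d{%d}" % first
    rest = 9 - first
    tail = r"\d" if rest == 1 else r"\d{1,%d}" % rest
    alternatives.append(head + r"\-" + tail)
generated = r"\b(?:" + "|".join(alternatives) + r")\-\d\b"

print(generated == HAND_WRITTEN)

# The two strings the question's regex misses are matched by this one.
questions_regex = re.compile(r"\b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b")
for s in ("1-12345678-1", "123456-1-1"):
    print(s, bool(re.search(generated, s)), bool(questions_regex.search(s)))
```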
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so there is no need to put a {5,n} quantifier anywhere; it would be useless).
The patterns are also built to fail faster:
I have chosen to start them with a digit \d; this way each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, having consumed exactly one digit, I know the maximum length of the remaining string.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum length (if there are 12 characters at this position, there are at least 13 characters in the string). There is no need for anything more descriptive than the dot metacharacter here; the goal is to test the length quickly.
Finally, I describe the end of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter, since the overwhelming majority of unnecessary positions have already been discarded.
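A minimal Python sketch of the multiline variant run against the sample text from the question (without the dashed separator lines). In Python the m modifier is spelled re.MULTILINE.

```python
import re

SAMPLE = """\
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
"""

# ^ and $ match at line boundaries because of re.MULTILINE; the dot never
# crosses a newline, so the length test stays within one line.
pattern = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)

found = pattern.findall(SAMPLE)
print(found)
```

Lines longer than 12 characters (10 digits plus two hyphens) are discarded right after their first digit by the negative lookahead.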