RegEx for finding strings with chars and numbers

I am trying to match strings that are part numbers mixed with normal text.
Here are a few examples.
Towing Cntrl Ecu,Gl3t-19H378-Ac
Assy,Pwr,Tested Gd,Priv-M50t3
Left,Rear,Brn-Tan,Pwr,4DR,Mju1
T-Case Ecu,56029590AE
Right,Blind Spot Module,284K0 9HS0F
In these examples I am trying to match:
Gl3t-19H378-Ac
Priv-M50t3
Mju1
56029590AE
284K0 and 9HS0F
I am in .Net and this is the Regex I have been using.
(\b[a-zA-Z0-9][a-zA-Z0-9\-]{1,32}(\b|$)(?<=[0-9]))
It works for what I need if the match ends in a number. The rule I want is to match any string between word boundaries that is either all numbers or numbers and chars mixed, but never just chars.

This should do it:
\b[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b
If you need to restrict the length to a maximum of 32, add a look ahead:
\b(?=[a-zA-Z0-9-]{1,32}\b)[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*\b
If the underscore character is OK too, you can use [\w-] instead of [a-zA-Z0-9-].
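To check the pattern against the sample lines from the question, here is a minimal sketch in Python rather than .NET (for this ASCII input the character classes and \b should behave the same way). The function name find_part_numbers and its max_len parameter are made up for illustration. Note that 4DR is reported as well, because it satisfies the stated rule of letters mixed with digits.

```python
import re

# A token of letters, digits and hyphens that contains at least one digit.
PART = r"[a-zA-Z0-9-]*\d[a-zA-Z0-9-]*"

def find_part_numbers(text, max_len=None):
    """Return tokens containing a digit (hypothetical helper, not from the answer)."""
    if max_len is None:
        pattern = rf"\b{PART}\b"
    else:
        # Lookahead: the whole token must be at most max_len characters long.
        pattern = rf"\b(?=[a-zA-Z0-9-]{{1,{max_len}}}\b){PART}\b"
    return re.findall(pattern, text)

samples = [
    "Towing Cntrl Ecu,Gl3t-19H378-Ac",
    "Assy,Pwr,Tested Gd,Priv-M50t3",
    "Left,Rear,Brn-Tan,Pwr,4DR,Mju1",
    "T-Case Ecu,56029590AE",
    "Right,Blind Spot Module,284K0 9HS0F",
]
for line in samples:
    print(find_part_numbers(line, max_len=32))
```

If tokens such as 4DR should be excluded, that needs an extra rule beyond "must contain a digit", which the question does not spell out.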

Related

Using Gsub to get matched strings in R

I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows that have two words. However, row [9] "Eulamprus tympanum marnieae" has three words, and my code returns only the last word in the string, "marnieae". How can I extract the words after the first space, so that I get "tympanum marnieae" instead of "marnieae", with the results stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match one or more words after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or one or more spaces after the first word even if a second word doesn't exist. "Eulamprus" followed by six spaces, for example, will successfully match the pattern and capture five spaces. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
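As a quick way to try the more reliable pattern outside regex101, here is a minimal sketch in Python rather than R. The helper name words_after_first is hypothetical, the surrounding double quotes are left out of the pattern because the values are plain strings here, and using fullmatch (so that stray trailing spaces are rejected) is a choice made for this example.

```python
import re

# One word, a space, then one or more further words separated by single spaces.
NAME = re.compile(r"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)")

def words_after_first(name):
    """Return everything after the first word, or None if the format differs."""
    m = NAME.fullmatch(name)
    return m.group(1) if m else None

print(words_after_first("Eulamprus tympanum marnieae"))
print(words_after_first("Eulamprus "))
```

Returning None for malformed input makes bad rows easy to spot instead of silently passing them through.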

Regex: This range OR that range

So I am trying match a certain postcode range:
CB1 *, CB2 *, CB3 *, CB4 *, CB5 *, CB21 *, CB22 *, CB23 *, CB24 *, CB25 *
So I am trying to use range 1-5 OR 21-25.
This is my current regex:
^[CBcb].([1-5]|[21-25]).+$
I want to make sure the post code parts contains the following
[CB OR cb],[1-5 OR 21-25] and [Any combination]
Have a tinker: https://regex101.com/r/aP9uG3/2
How do you specify two ranges?
Since the patterns are the same and it is just the 2 that may or may not occur, you can say something like:
CB2?[1-5] # add ^ and $ if required
If you want to specify two ranges, you can always group them with parentheses common_pattern(pattern1|pattern2).
Your Regex pattern:
^[CBcb].([1-5]|[21-25]).+$
is being interpreted as:
^[CBcb].([12345]|[2125]).+$
You need:
^CB2?[1-5].+$
here ? means zero or one match of the preceding token, 2 in this case.
^cb2?[1-5].+$ and use the i flag as well.
The first error was that you were only matching one character from the list [cbCB]. The second is that there's a stray . in the middle. And the third is that you did not specify a range of numbers, but a range of characters. 21 is not a character, it is a sequence of characters. A pattern matching any non-negative integer would be [0-9]+. What you want is an optional 2 followed by a character from the range [1-5].
You should read up on what lists and ranges are and mean in Regular Expressions because you misused both of them! Everyone makes mistakes obviously, but this is one of the basics you should get the hang of.
Having characters inside [] makes it a character class. This means that it matches any single character inside the brackets (unless it's negated). It doesn't understand numbers, only characters.
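The point about character classes can be checked directly. This is a small sketch in Python (the asker's regex flavor is not stated, but character classes work the same way here); it lists which single digits [21-25] accepts.

```python
import re

# [21-25] is a set of single characters: '2', the range '1'-'2', and '5'.
digits = "0123456789"
accepted = [d for d in digits if re.fullmatch(r"[21-25]", d)]
print(accepted)

# A two-character string can never match a single character class.
print(re.fullmatch(r"[21-25]", "21"))
```

So the class never sees "21" or "25" as numbers, only the individual characters 1, 2 and 5.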
If you want to match CB or cb, you separate them by | like CB|cb. Or even better - make your regex case independent. This is done in different ways in different regex flavors. In javascript for example, attach the character i to the regex: /cb/i.
As for the rest of the pattern, if 1-5 and 21-25 is literally what you want, matching 1-5 is done with a character class (which you now are familiar with ;) like [1-5] meaning match any character in the ASCII range between the characters 1 and 5 inclusive.
Make the preceding 2 optional, and your regex looks like this
CB2?[1-5]
It matches your postcode and without a terminating $, it allows for your [Any combination].
Hope this helps.
Regards
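Here is a minimal Python sketch of the CB2?[1-5] idea, using the case-insensitive flag. One caveat worth knowing: with nothing marking the end of the district number, ^CB2?[1-5].+$ also accepts a string starting with CB26, because it reads it as CB2 followed by more characters. The (?!\d) lookahead below is an addition for this example, not part of the answers above, and the postcodes are made-up sample strings.

```python
import re

# CB, an optional 2, a digit 1-5, then no further digit.
# The (?!\d) lookahead is an added assumption, see the note above.
DISTRICT = re.compile(r"^CB2?[1-5](?!\d)", re.IGNORECASE)

def in_range(postcode):
    """True if the postcode starts with CB1-CB5 or CB21-CB25 (hypothetical helper)."""
    return DISTRICT.match(postcode) is not None

for code in ["CB1 2AB", "cb3 9xx", "CB21 5XY", "CB25 0AA", "CB6 1AB", "CB26 1AB"]:
    print(code, in_range(code))
```

If the input always has a space after the district, matching that space literally would work just as well as the lookahead.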

Regex for searching strings matching the following one

I am searching strings matching the following one in my source code:
<CONSTANT_STRING_1> <CONSTANT_STRING_2> <VARIABLE_DIGITS> <CONSTANT_STRING_3>
where
<CONSTANT_STRING_1>, <CONSTANT_STRING_2> and <CONSTANT_STRING_3> are constant strings like "ABC", "DEF" and "GHI".
<VARIABLE_DIGITS> is a random number of 14 digits like "12345678901234"
Note: there are white spaces between words.
What I am looking for is to search <CONSTANT_STRING_1> <CONSTANT_STRING_2> <WHATEVER> <CONSTANT_STRING_3>. How can I build the Regex?
I am reading that by "constant string" you mean character strings? If so the below should work to find that full string you are looking for. Btw the website linked below is really great for visualizing this type of problem... give it a try :)
(([a-zA-Z]+\s){2})[0-9]{14}\s([a-zA-Z]+)$
Debuggex Demo
To break it down...
(([a-zA-Z]+\s){2}) means a string of one or more characters comprised of either LC or UC letters followed by a space and that whole thing (chars + space) repeated twice
[0-9]{14}\s 14 digits followed by a space. As @Avinash said, \d{14}\s is another way of writing this portion
([a-zA-Z]+)$ Another string of one or more characters. The $ indicates that this ends the string you are searching for
You could try the below regex.
<CONSTANT_STRING_1> <CONSTANT_STRING_2> \d{14} <CONSTANT_STRING_3>
Where, \d{14} matches exactly the 14 digit number.
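A minimal sketch of that second suggestion in Python, assuming the three constants are known up front. The function name build_pattern and the sample strings are placeholders, and passing the constants through re.escape is an added precaution in case they contain regex metacharacters.

```python
import re

def build_pattern(first, second, third):
    """Regex for: first second <14 digits> third, separated by single spaces."""
    # re.escape guards against metacharacters in the constants (an added precaution).
    return re.compile(
        rf"{re.escape(first)} {re.escape(second)} \d{{14}} {re.escape(third)}"
    )

pattern = build_pattern("ABC", "DEF", "GHI")
print(bool(pattern.search('log("ABC DEF 12345678901234 GHI");')))
print(bool(pattern.search("ABC DEF 1234 GHI")))
```

Using search rather than a full-string match suits scanning source code, where the target text sits inside a longer line.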

split text into words and exclude hyphens

I want to split a text into its individual words using regular expressions. The obvious solution would be to use the regex \\b, but unfortunately this one also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String[] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone who needs this for another purpose (e.g. inserting something rather than splitting), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
You can use this:
(?<!-)\\b(?!-)
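To see what the [^\w\-]+ split produces on the sample sentence, here is a minimal sketch in Python rather than Java. Python's re.split keeps an empty string after the final full stop, so the example filters it out. Unlike the \b output in the question, punctuation marks are discarded rather than returned as separate pieces.

```python
import re

text = ("This is my text! It uses some odd words like user-generated "
        "and need therefore a special regex.")

# Split on runs of characters that are neither word characters nor hyphens;
# the filter drops the empty string left after the final full stop.
words = [w for w in re.split(r"[^\w\-]+", text) if w]
print(words)
```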

Comma Separated Numbers Regex

I am trying to validate a comma separated list for numbers 1-8.
i.e. 2,4,6,8,1 is valid input.
I tried [0-8,]* but it seems to accept 1234 as valid. It is not requiring a comma and it is letting me type in a number larger than 8. I am not sure why.
[0-8,]* will match zero or more consecutive instances of 0 through 8 or ,, anywhere in your string. You want something more like this:
^[1-8](,[1-8])*$
^ matches the start of the string, and $ matches the end, ensuring that you're examining the entire string. It will match a single digit, plus zero or more instances of a comma followed by a digit after it.
/^\d+(,\d+)*$/
for at least one digit, otherwise you will accept 1,,,,,4
[0-9]+(,[0-9]+)+
This works better for me for comma separated numbers in general, like: 1,234,933
You can try with this Regex:
^[1-8](,[1-8])+$
If you are using Python and want to find all matching strings such as
XX,XX,XXX or X,XX,XXX
(for example 12,000 or 1,20,000) with a regex:
import re
string = "I spent 1,20,000 on new project "
re.findall(r'(\b[1-8]*(,[0-9]*[0-9])+\b)', string, re.IGNORECASE)
Result will be ---> [('1,20,000', ',000')]
You need a number + comma combination that can repeat:
^[1-8](,[1-8])*$
If you don't want remembering parentheses add ?: to the parens, like so:
^[1-8](?:,[1-8])*$
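A minimal sketch of the validation in Python, using the non-capturing form of the pattern. The helper name is_valid is made up for illustration, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# A digit 1-8, then zero or more ",digit" pairs.
LIST_1_TO_8 = re.compile(r"[1-8](?:,[1-8])*")

def is_valid(value):
    """True if value is a comma separated list of single digits 1-8."""
    # fullmatch plays the role of the ^ and $ anchors.
    return LIST_1_TO_8.fullmatch(value) is not None

for value in ["2,4,6,8,1", "5", "1234", "1,,4", "9", ""]:
    print(repr(value), is_valid(value))
```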