My regex knowledge is pretty limited, but I'm trying to write/find an expression that will capture the following string types in a document:
DO match:
ADY123
AD12ADY
1HGER_2
145-DE-FR2
Bicycle1
2Bicycle
128D
128878P
DON'T match:
BICYCLE
183-329-193
3123123
Is such an expression possible? Basically, it should find any string containing letters AND digits, regardless of whether the string contains a dash or underscore. I can find the first two using the following two regex:
/([A-Z][0-9])\w+/g
/([0-9][A-Z])\w+/g
But searching for possible dashes and hyphens makes it more complicated...
Thanks for any help you can provide! :)
MORE INFO:
I've made slight progress with: ([A-Z|a-z][0-9]+-*_*\w+) but it doesn't capture strings with more than one hyphen.
I had a document with a lot of text strings and number strings, which I don't want to capture. What I do want is any product code, which could be any length string with or without hyphens and underscores but will always include at least one digit and at least one letter.
You can use the following expression with the case-insensitive mode:
\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b
Explanation:
\b # Assert position at a word boundary
( # Beginning of capturing group 1
(?: # Beginning of the non-capturing group
[a-z]+\S*\d+ # Match letters followed by numbers
| # OR
\d+\S*[a-z]+ # Match numbers followed by letters
) # End of the group
[a-z\d_-]* # Match letter, digit, '_', or '-' 0 or more times
) # End of capturing group 1
\b # Assert position at a word boundary
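As a quick check of the expression above, here is a minimal sketch using Python's re module (an assumption; the question does not name an engine). re.IGNORECASE stands in for the case-insensitive mode, and the sample text is simply the question's examples joined by spaces.

```python
import re

# Pattern from the answer above; re.IGNORECASE plays the role of the
# case-insensitive mode.
PRODUCT_CODE = re.compile(
    r"\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b",
    re.IGNORECASE,
)

# Hypothetical sample text: the question's examples separated by spaces.
text = (
    "ADY123 AD12ADY 1HGER_2 145-DE-FR2 Bicycle1 2Bicycle 128D 128878P "
    "BICYCLE 183-329-193 3123123"
)

# With a single capturing group, findall returns the group 1 strings.
codes = PRODUCT_CODE.findall(text)
print(codes)
```

One caveat: \S* accepts any non-whitespace character, so a token such as abc,def1 would be matched across the comma. If that matters for your document, replacing \S with [a-z\d_-] is one possible tightening.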
I have a list of various values, but I'm only interested in the number after the # in those starting with XXX_:
ABC
XXX_YYY
XXX_YYY#12235
XXX_YYY#12281
XXX_YYY#12318
I have tried several things but have not quite hit the nail on the head :-(
(?<!XXX_)#
and
(?<=XXX_)*[^#]+$ - closest, but it also matches the values without a # :-(
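To get the number after #, you can adapt the Python code below (a holds one input value):
import re
a = "XXX_YYY#12235"  # sample input taken from the question
result = re.findall(r"(?<=#)(.*?)(?=$)", a)  # everything after the # up to the end of the string
print(result[0])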
Neither pattern takes digits into account. What they actually match:
(?<!XXX_)# only matches a single # when not directly preceded by XXX_
(?<=XXX_)*[^#]+$ Optionally repeats a lookbehind assertion, and then matches 1+ chars other than # till the end of the string.
If there is a single # char in the string before the numbers, you can match XXX_ followed by any char except # using a negated character class and then match # followed by capturing the digits at the end of the string in group 1.
XXX_[^\n#]*#(\d+)$
The pattern matches:
XXX_ Match literally
[^\n#]*# Match optional chars other than # or a newline, then match #
(\d+) Capture 1+ digits in group 1
$ End of string
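The pattern can be tried against the values from the question with a short Python sketch (Python is an assumption here, chosen because the other answer uses it). Each value is treated as its own string, so $ anchors at the end of that value.

```python
import re

# Pattern from the answer above: XXX_, anything but # or newline, #, digits.
NUMBER_AFTER_HASH = re.compile(r"XXX_[^\n#]*#(\d+)$")

values = ["ABC", "XXX_YYY", "XXX_YYY#12235", "XXX_YYY#12281", "XXX_YYY#12318"]

numbers = []
for value in values:
    match = NUMBER_AFTER_HASH.search(value)
    if match:  # values without "#<digits>" at the end are skipped
        numbers.append(match.group(1))

print(numbers)
```

If the values arrive as one multi-line string instead, adding re.MULTILINE and using finditer would be the usual adjustment.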
There are 5 examples below, and I am trying to match 3, 4 and 5 while excluding 1 and 2.
1. ABC-abc
2. abc-ABC
3. ABC-ABC
4. ABC
5. vABC-ABC-ABCv
The current expression I use is:
(?!(\w*[A-Z]{2,}-[a-z]+\w*|\w*[a-z]+-[A-Z]{2,}\w*))(\w*-?[A-Z]{2,}-?\w*)
I use (\w*-?[A-Z]{2,}-?\w*) to match every example first.
I then use (?!...|...) to add two exclusion conditions.
The first exclusion condition is \w*[A-Z]{2,}-[a-z]+\w* and the second is \w*[a-z]+-[A-Z]{2,}\w*.
This expression excludes example 1 (ABC-abc) but not example 2 (abc-ABC).
I searched a lot and found some people saying this is not something regex is "good" at. Is there any solution or improvement that would let me exclude abc-ABC as well?
I'd appreciate any help or opinions.
As I understand it, strings are to be rejected if they contain a hyphen that is preceded by a lower-case letter and followed by an upper-case letter, or vice versa; otherwise they are to be accepted. If so, the following regular expression could be used.
^(?!.*(?:[a-z]-[A-Z]|[A-Z]-[a-z]))
The regex engine performs the following operations.
^ # match beginning of line
(?! # begin a negative lookahead
.* # match 0+ characters
(?: # begin a non-capture group
[a-z]-[A-Z] # match a lc letter, '-', uc letter
| # or
[A-Z]-[a-z] # match an uc letter, '-', lc letter
) # end non-capture group
) # end negative lookahead
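To see the lookahead at work on the five examples from the question, here is a small sketch in Python (an assumption; the question does not say which engine is in use). The expression matches an empty string at the start of every accepted input, so the test is simply whether a match exists.

```python
import re

# Pattern from the answer above: fail if any hyphen sits between letters of
# different case.
NO_MIXED_CASE_HYPHEN = re.compile(r"^(?!.*(?:[a-z]-[A-Z]|[A-Z]-[a-z]))")

examples = ["ABC-abc", "abc-ABC", "ABC-ABC", "ABC", "vABC-ABC-ABCv"]

# A match object is always truthy, even when the matched text is empty.
accepted = [s for s in examples if NO_MIXED_CASE_HYPHEN.search(s)]
rejected = [s for s in examples if not NO_MIXED_CASE_HYPHEN.search(s)]

print(accepted)
print(rejected)
```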
I have a string like Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}], and I want to keep only the information before each ":", that is, Taxi and HeavyTruck. Can somebody help me with this?
This will capture a single word if it's followed by :[ allowing spaces before and after :.
[A-Za-z]+(?=\s*:\s*\[)
You'll need to set the regex global flag to capture all occurrences.
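For illustration, a minimal Python sketch of this lookahead (Python is an assumption; the question does not name a language). Python has no global flag, so re.findall plays that role and returns every occurrence.

```python
import re

text = "Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}]"

# Letters only, kept when followed by optional spaces, a colon, optional
# spaces and an opening bracket. The lookahead itself consumes nothing.
names = re.findall(r"[A-Za-z]+(?=\s*:\s*\[)", text)

print(names)
```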
I think this will do the trick in your case: (?=\s)*\w+(?=\s*:)
Explanation:
(?=\s)* - Searches for 0 or more spaces at the beginning of the word without including them in the selection.
\w+ - Selects one or more word characters.
(?=\s*:) - Searches for 0 or more white spaces after the word followed by a colon without including them in the selection.
To match the information in your provided data before the : you could try [A-Za-z]+(?= ?:) which matches upper or lowercase characters one or more times and uses a positive lookahead to assert that what follows is an optional whitespace and a :.
If the pattern after the colon should match as well, you could try: [A-Za-z]+(?= ?:\[\(h\d+\){h\d+}])
Explanation
Match one or more upper or lowercase characters [A-Za-z]+
A positive lookahead (?= which asserts that what follows
An optional white space ?
Is a colon with the pattern after the colon using \d+ to match one or more digits (if you want to match a valid time you could update this with a pattern that matches your time format) :\[\(h\d+\){h\d+}]
Close the positive lookahead )
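A sketch of the stricter variant in Python (an assumption, as the question names no language). The braces and the closing bracket are escaped here for readability, which is my addition, and the Bus entry is a hypothetical value added only to show a name that is skipped because the text after the colon has a different shape.

```python
import re

# Hypothetical input: the question's string plus one entry that should not match.
text = "Taxi:[(h19){h12}], HeavyTruck :[(h19){h12}], Bus:[other]"

# Name, then (inside the lookahead) an optional space, a colon and the
# exact "[(h<digits>){h<digits>}]" shape.
STRICT_NAME = re.compile(r"[A-Za-z]+(?= ?:\[\(h\d+\)\{h\d+\}\])")

names = STRICT_NAME.findall(text)
print(names)
```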
I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well, except that:
it captures the term 'hello+' without the '+', but this term should not be captured at all
the last term 'hello^-1212121' is captured as 2 groups, 'hello' and '-1212121', but both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to leave out the terms that should not be captured at all, you'll have to use lookarounds. I don't know of any other way to check the delimiters without consuming them and hindering the following matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
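And lastly, if capturing the words is sufficient, we can remove the lookarounds. Without the trailing lookahead, the ^ alternative has to come first, otherwise the plain-word alternative matches hello on its own and the exponent is never tried:
([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)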
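Once the lookarounds are gone, the order of the alternatives decides what is captured, because a backtracking engine takes the first alternative that succeeds. A minimal sketch comparing both orders, written with Python's re module (an assumption; the question targets .NET, which also tries alternatives left to right). The sample string is hypothetical.

```python
import re

# Plain-word alternative first: the word matches before the ^ part is tried.
WORD_FIRST = re.compile(r"([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)")
# Exponent alternative first: the longer form gets its chance.
EXPONENT_FIRST = re.compile(r"([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)")

sample = "hello^2, +hello, hello^.25"  # hypothetical input

word_first = WORD_FIRST.findall(sample)
exponent_first = EXPONENT_FIRST.findall(sample)

print(word_first)
print(exponent_first)
```

Neither order rejects terms such as hello+ or hello^-1212121; both still yield the bare word from them, which is the price of dropping the lookarounds.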
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal caret
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal caret
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both ordinary non-negative numbers (e.g. 0, 1, 1.2, 1.23) and those lacking a leading digit (e.g. .1, .12).
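One way to apply the expression while also rejecting terms like hello+ is to split the input on commas first and require each term to match in full. The sketch below does that in Python (an assumption; the question uses .NET). The split-then-fullmatch approach and the is_valid_term helper are my own illustration, not part of the answer.

```python
import re

# Expression from the breakdown above.
TERM = re.compile(r"([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)")


def is_valid_term(term: str) -> bool:
    """Hypothetical helper: True when the whole term matches the expression."""
    return TERM.fullmatch(term) is not None


# Sample input from the question.
raw = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121"

# Terms are separated by commas and surrounded by optional whitespace.
terms = [t.strip() for t in raw.split(",")]
valid = [t for t in terms if is_valid_term(t)]

print(valid)
```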
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
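So, in JavaScript, you could write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    m;
// With the g flag, exec() resumes from RE.lastIndex, so each call returns the next match
while ((m = RE.exec(src)) !== null) {
    console.log(m[1]); // first capture group
}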
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
I need a regular expression that will match this pattern (case doesn't matter):
066B-E77B-CE41-4279
4 groups of letters or numbers 4 characters long per group, hyphens in between each group.
Any help would be greatly appreciated.
^(?:\w{4}-){3}\w{4}$
Explanation:
^ # must match beginning of string
(?: # make a non-capturing group (for duplicating entry)
\w{4} # a-z, A-Z, 0-9 or _ matching 4 times
- # hyphen
){3} # this group matches 3 times
\w{4} # 4 more of the letters numbers or underscore
$ # must match end of string
Would be my best bet. Then you can use Regex Match (static).
P.S. More info on regex can be found here.
P.P.S. If you don't want to match underscores, the \w above can be replaced (both times) with [a-zA-Z0-9] (known as a class matching lowercase and uppercase letters and numbers). e.g.
^(?:[a-zA-Z0-9]{4}-){3}[a-zA-Z0-9]{4}$
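The difference between the two variants is easy to check. A minimal sketch in Python (an assumption; the thread is about .NET, where the same patterns apply), using the key from the question plus a hypothetical key containing an underscore:

```python
import re

# \w version: letters, digits and underscore are all accepted.
KEY_WORD_CHARS = re.compile(r"^(?:\w{4}-){3}\w{4}$")
# Explicit class: letters and digits only.
KEY_ALNUM = re.compile(r"^(?:[a-zA-Z0-9]{4}-){3}[a-zA-Z0-9]{4}$")

key = "066B-E77B-CE41-4279"              # from the question
key_with_underscore = "06_B-E77B-CE41-4279"  # hypothetical

word_results = [bool(KEY_WORD_CHARS.match(k)) for k in (key, key_with_underscore)]
alnum_results = [bool(KEY_ALNUM.match(k)) for k in (key, key_with_underscore)]

print(word_results)
print(alnum_results)
```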
Try:
[A-Za-z0-9]{4}\-[A-Za-z0-9]{4}\-[A-Za-z0-9]{4}\-[A-Za-z0-9]{4}
With such a small sample of data, it's not easy to be certain what you actually want.
I'm going to assume that all the characters in that string are hex digits, and that's what you need to search for.
In that case, you would need a regular expression something like this:
^[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}$
If they can be any letter, then replace the fs with zs.
Oh, and use myRE.IgnoreCase = True to make it case insensitive.
If you need further advice on regular expressions, I'd recommend http://www.regular-expressions.info/ as a good site. They even have a VB.net-specific page.
Assuming from your example:
There are four groups of letters, separated by dashes.
Each group is four letters.
The letters are hexadecimal digits.
This pattern would match that:
^[\dA-F]{4}-[\dA-F]{4}-[\dA-F]{4}-[\dA-F]{4}$
Note that ^ and $ match the beginning and end of the string, which is important if you want to match the entire string and not check if the pattern occurs inside a string.
You could also make use of the repetitions in the pattern:
^(?:[\dA-F]{4}-){3}[\dA-F]{4}$
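To illustrate both the case-insensitive matching and the effect of the anchors, here is a short sketch in Python (an assumption; the question is about VB.NET). re.IGNORECASE stands in for the engine's ignore-case option, and the surrounding text in the last check is hypothetical.

```python
import re

# Anchored: the whole string must be the key.
HEX_KEY = re.compile(r"^(?:[\dA-F]{4}-){3}[\dA-F]{4}$", re.IGNORECASE)
# Unanchored: the key may occur anywhere inside a longer string.
HEX_KEY_ANYWHERE = re.compile(r"(?:[\dA-F]{4}-){3}[\dA-F]{4}", re.IGNORECASE)

upper_ok = HEX_KEY.search("066B-E77B-CE41-4279") is not None
lower_ok = HEX_KEY.search("066b-e77b-ce41-4279") is not None
non_hex_ok = HEX_KEY.search("066G-E77B-CE41-4279") is not None  # G is not a hex digit

embedded = "key: 066B-E77B-CE41-4279!"  # hypothetical surrounding text
anchored_hit = HEX_KEY.search(embedded) is not None
unanchored_hit = HEX_KEY_ANYWHERE.search(embedded) is not None

print(upper_ok, lower_ok, non_hex_ok, anchored_hit, unanchored_hit)
```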