Regular expression captures unwanted string - regex

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+' : this group should not be captured at all
the last term 'hello^-1212121' as 2 groups 'hello' and '-1212121' both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)

Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
^ ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to take into consideration of the completely not capturing groups, you'll have to use lookarounds. I don't know of anyway otherwise to get to the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)

It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal
\.? # optional period
\d+ # one or more digits
| # OR
[+-]? # optional plus or minus
[a-zA-Z]+ # one or more letters or underscores
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both true non-negative numbers (i.e. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (i.e. .1, .12)
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+.\d+, or .\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:.\d+|\d+(?:.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
So, in Javascript, you'd:
var src="hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
RE=/([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g;
while(RE.test(src)){
console.log(RegExp.$1)
}
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1

Related

Regex to add underscore between number and unit (or replace whitespace with underscore between number and unit)

I have a long text that contains data like:
23cm,
23m,
60 cm,
60 m,
So sometimes there is a space between number and unit. Sometimes there isn't one.
How to add an underscore in each case, so the result would be:
23_cm,
23_m,
60_cm,
60_m
The search pattern for a part of it is probably (\d) (?:cm|m), but I can figure out the rest.
We can use capturing groups. The following example uses \2 and \3 for the capturing groups. Some languages would use $2 and $3.
See https://regex101.com/r/KxYyrb/1
input string
23cm, 23m, 60 cm, 60 m,
pattern
((\d+)\s?(m|cm))
replace using
\2_\3
output
23_cm, 23_m, 60_cm, 60_m,
You can use
(\d)\s?(c?m)\b
The replacement pattern is $1_$2.
See the regex demo.
Details:
(\d) - Capturing group 1: a digit
\s? - an optional whitespace char
(c?m) - Capturing group 2: an optional c and an m
\b - a word boundary (else, the regex will match m in men, for example).
I suggest replacing matches of
(?<=\d) ?(?=c?m,)
with an underscore. If a space is present it is matched; else the (zero-width) location between the last digit and 'cm' or 'm' is matched.
Demo
The regular expression can be broken down as follows. (I have enclosed the space in a character class to make it visible to the reader.)
(?<= # begin a positive lookbehind
\d # match a digit
) # end positive lookbehind
[ ]? # optionally match a space
(?= # begin a positive lookahead
c?m, # optionally match a 'c' followed by 'm,'
) # end positive lookahead
If the comma is not always present replace (?=c?m,) with (?=c?m\b), \b being a word boundary.

Regex to capture optional characters

I want to pull out a base string (Wax) or (noWax) from a longer string, along with potentially any data before and after if the string is Wax. I'm having trouble getting the last item in my list below (noWax) to match.
Can anyone flex their regex muscles? I'm fairly new to regex so advice on optimization is welcome as long as all matches below are found.
What I'm working with in Regex101:
/(?<Wax>Wax(?:Only|-?\d+))/mg
Original string
need to extract in a capturing group
Loc3_341001_WaxOnly_S212
WaxOnly
Loc4_34412-a_Wax4_S231
Wax4
Loc3a_231121-a_Wax-4-S451
Wax-4
Loc3_34112_noWax_S311
noWax
Here is one way to do so, using a conditional:
(?<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))
See the online demo.
(no)?: Optional capture group.
(? If.
(2): Test if capture group 2 exists ((no)). If it does, do nothing.
|: Or.
(?:Only|-?\d+)
I assume the following match is desired.
the match must include 'Wax'
'Wax' is to be preceded by '_' or by '_no'. If the latter 'no' is included in the match.
'Wax' may be followed by:
'Only' followed by '_', in which case 'Only' is part of the match, or
one or more digits, followed by '_', in which case the digits are part of the match, or
'-' followed by one or more digits, followed by '-', in which case
'-' followed by one or more digits is part of the match.
If these assumptions are correct the string can be matched against the following regular expression:
(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))
Demo
The regular expression can be broken down as follows.
(?<=_) # positive lookbehind asserts previous character is '_'
(?: # begin non-capture group
(?:no)? # optionally match 'no'
Wax # match literal
(?: # begin non-capture group
(?:Only|\d+)? # optionally match 'Only' or >=1 digits
(?=_) # positive lookahead asserts next character is '_'
| # or
\-\d+ # match '-' followed by >= 1 digits
(?=-) # positive lookahead asserts next character is '-'
) # end non-capture group
) # end non-capture group

Regex - All before an underscore, and all between second underscore and the last period?

How do I get everything before the first underscore, and everything between the last underscore and the period in the file extension?
So far, I have everything before the first underscore, not sure what to do after that.
.+?(?=_)
EXAMPLES:
111111_SMITH, JIM_END TLD 6-01-20 THR LEWISHS.pdf
222222_JONES, MIKE_G URS TO 7.25 2-28-19 SA COOPSHS.pdf
DESIRED RESULTS:
111111_END TLD 6-01-20 THR LEWISHS
222222_G URS TO 7.25 2-28-19 SA COOPSHS
You can match the following regular expression that contains no capture groups.
^[^_]*|(?!.*_).*(?=\.)
Demo
This expression can be broken down as follows.
^ # match the beginning of the string
[^_]* # match zero or more characters other than an underscore
| # or
(?! # begin negative lookahead
.*_ # match zero or more characters followed by an underscore
) # end negative lookahead
.* # match zero or more characters greedily
(?= # begin positive lookahead
\. # match a period
) # end positive lookahead
.*_ means to match zero or more characters greedily, followed by an underscore. To match greedily (the default) means to match as many characters as possible. Here that includes all underscores (if there are any) before the last one. Similarly, .* followed by (?=\.) means to match zero or more characters, possibly including periods, up to the last period.
Had I written .*?_ (incorrectly) it would match zero or more characters lazily, followed by an underscore. That means it would match as few characters as possible before matching an underscore; that is, it would match zero or more characters up to, but not including, the first underscore.
If instead of capturing the two parts of the string of interest you wanted to remove the two parts of the string you don't want (as suggested by the desired results of your example), you could substitute matches of the following regular expression with empty strings.
_.*_|\.[^.]*$
Demo
This regular expression reads, "Match an underscore followed by zero of more characters followed by an underscore, or match a period followed by zero or more characters that are not periods, followed by the end of the string".
You could use 2 capture groups:
^([^_\n]+_).*\b([^\s_]*_.*)(?=\.)
^ Start of string
([^_\n]+_) Capture group 1, match any char except _ or a newline followed by matching a _
.*\b Match the rest of the line and match a word boundary
([^\s_]*_.*) Capture group 2, optionally match any char except _ or a whitespace char, then match _ and the rest of the line
(?=\.) Positive lookahead, assert a . to the right
See a regex demo.
Another option could be using a non greedy version to get to the first _ and make sure that there are no following underscores and then match the last dot:
^([^_\n]+_).*?(\S*_[^_\n]+)\.[^.\n]+$
See another regex demo.
Looks like you're very close. You could eliminate the names between the underscores by finding this
(_.+?_)
and replacing the returned value with a single underscore.
I am assuming that you did not intend your second result to include the name MIKE.

Regex code , Python-2 alphanumeric [duplicate]

My regex knowledge is pretty limited, but I'm trying to write/find an expression that will capture the following string types in a document:
DO match:
ADY123
AD12ADY
1HGER_2
145-DE-FR2
Bicycle1
2Bicycle
128D
128878P
DON'T match:
BICYCLE
183-329-193
3123123
Is such an expression possible? Basically, it should find any string containing letters AND digits, regardless of whether the string contains a dash or underscore. I can find the first two using the following two regex:
/([A-Z][0-9])\w+/g
/([0-9][A-Z)\w+/g
But searching for possible dashes and hyphens makes it more complicated...
Thanks for any help you can provide! :)
MORE INFO:
I've made slight progress with: ([A-Z|a-z][0-9]+-*_*\w+) but it doesn't capture strings with more than one hyphen.
I had a document with a lot of text strings and number strings, which I don't want to capture. What I do want is any product code, which could be any length string with or without hyphens and underscores but will always include at least one digit and at least one letter.
You can use the following expression with the case-insensitive mode:
\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b
Explanation:
\b # Assert position at a word boundary
( # Beginning of capturing group 1
(?: # Beginning of the non-capturing group
[a-z]+\S*\d+ # Match letters followed by numbers
| # OR
\d+\S*[a-z]+ # Match numbers followed by letters
) # End of the group
[a-z\d_-]* # Match letter, digit, '_', or '-' 0 or more times
) # End of capturing group 1
\b # Assert position at a word boundary
Regex101 Demo

Regex between a string

Example:
I have the following string
a125A##THISSTRING##.test123
I need to find THISSTRING. There are many strings which are nearly the same so I'd like to check if there is a digit or letter before the ## and also if there is a dot (.) after the ##.
I have tried something like:
([a-zA-Z0-9]+##?)(.+?)(.##)
But I am unable to get it working
You can use look behind and look ahead:
(?<=[a-zA-Z0-9]##).*?(?=##\.)
https://regex101.com/r/i3RzFJ/2
But I am unable to get it working.
Let's deconstruct what your regex ([a-zA-Z0-9]+##?)(.+?)(.##) says.
([a-zA-Z0-9]+##?) match as many [a-zA-Z0-9] followed by a # followed by optional #.
(.+?) any character as much as possible but fewer times.
(.##) any character followed by two #. Now . consumes G and then ##. Hence THISSTRING is not completely captured in group.
Lookaround assertions are great but are little expensive.
You can easily search for such patterns by matching wanted and unwanted and capturing wanted stuff in a capturing group.
Regex: (?:[a-zA-Z0-9]##)([^#]+)(?:##\.)
Explanation:
(?:[a-zA-Z0-9]##) Non-capturing group matching ## preceded by a letter or digit.
([^#]+) Capturing as many characters other than #. Stops before a # is met.
(?:##\.) Non-capturing group matching ##. literally.
Regex101 Demo
Javascript Example
var myString = "a125A##THISSTRING##.test123";
var myRegexp = /(?:[a-zA-Z0-9]##)([^#]+)(?:##\.)/g;
var match = myRegexp.exec(myString);
console.log(match[1]);
You wrote:
check if there is a digit or letter before the ##
I assume you mean a digit / letter before the first ## and
check for a dot after the second ## (as in your example).
You can use the following regex:
[a-z0-9]+ # Chars before "##", except the last
(?: # Last char before "##"
(\d) # either a digit - group 1
| # or
([a-z]) # a letter - group 2
)
##? # 1 or 2 "at" chars
([^#]+) # "Central" part - group 3
##? # 1 or 2 "at" chars
(?: # Check for a dot
(\.) # Captured - group 4
| # or nothing captured
)
[a-z0-9]+ # The last part
# Flags:
# i - case insensitive
# x - ignore blanks and comments
How it works:
Group 1 or 2 captures the last char before the first ##
(either group 1 captures a digit or group 2 captures a letter).
Group 3 catches the "central" part (THISSTRING,
a sequence of chars other than #).
Group 4 catches a dot, if any.
You can test it at https://regex101.com/r/ATjprp/1
Your regex has such an error that a dot matches any char.
If you want to check for a literal dot, you must escape it
with a backslash (compare with group 4 in my solution).