Parsing and extracting variables from math/algebraic expression - regex

Given a string representing a math expression
(((x + temp) * temp_2) / 2_temp) + x3
I need to extract all the variables, except numbers and operators.
Only round brackets are permitted.
In particular, valid variables that can be extracted are:
x
x23
x_23
2_temp
2temp
2_
temp
temp_2
temp
but not
2
15
So, in general, any variable can start with any character but, if it starts with a number, then it must contain at least one letter.
I've tried this regex expression
matcher=Pattern.compile("([a-zA-Z0-9_]+\\w*?[a-zA-Z0-9_]*\\w*?)").matcher(equation);
but, for example, 15 is extracted.
Anyone can help me?
Thanks in advance

Based on your description, you seem to want to extract entities starting with an alphanumeric and then having 0+ word chars.
You may use
\b(?!\d+\b)[a-zA-Z0-9]\w*\b
See the regex demo
If you allow these variables to start with an underscore, just use \b(?!\d+\b)\w+\b.
The point here is that the (?!\d+\b) negative lookahead does not allow the string between word boundaries to only contain digits.
Details:
\b - word boundary
(?!\d+\b) - restriction: there must be no 1+ digits followed with a word boundary
[a-zA-Z0-9] - an alphanumeric (in Java, you may also use \p{Alnum})
\w* - 0 or more word chars
\b - trailing word boundary.
In Java, do not forget to use double backslashes when defining a pattern:
String pat = "\\b(?!\\d+\\b)[a-zA-Z0-9]\\w*\\b";

An alphanumeric string that contains at least one letter:
[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*

Related

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceeded or followed by a word e.g A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 2 & 3 but still fails on condition 3 as it matches the colon present in the string representation of time. How do I fix this?
edit: it has to be in a single regex statement if possible
You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
Word characters \w include numbers [a-zA-Z0-9_]
So just use [a-ZA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
Test Here

Extend regular expression

I want to find invoice numbers with a regex. The string has be longer than 3 char. It may contain signs like {., , /, _}, all numbers and it may contain one or two capital letters - those can stay alone or after each other. That is, what I'm currently trying, without success.
`([0-9-\.\\\/_]{,3})([A-Z]{0,2})?`
Here I have two examples, which should be matched:
019S836/03717008
DR094255
This should not be matched:
DRF094255
Can somebody help me please?
You can use
^(?!(?:[^A-Z]*[A-Z]){3})(?=\D*\d)[0-9A-Z.\\\/_-]{3,}$
See the regex demo.
Details:
^ - start of string
(?!(?:[^A-Z]*[A-Z]){3}) - a negative lookahead that fails the match if, immediately to the right of the current location (i.e. from the start of string), there are three occurrences of any zero or more chars other than uppercase ASCII letters followed with one uppercase ASCII letter
(?=\D*\d) - there must be at least one digit in the string
[0-9A-Z.\\\/_-]{4,} - four or more occurrences of digits, uppercase letters, ., \, /, _ or -
$ - end of string.

Is there a way to use Regex to capture numbers out of a string based on a specific leading letters?

I need to extract any number between 4-10 digits that following directly after 'PO#' OR 'PO# ' (with a whitespace). I do not want to include the PO# with the actual value that is extracted, however I do need it as criteria to target the value within a string. If the digits are less than 4 or greater than 10, I do not wish to capture the value and would like to otherwise ignore it.
A sample string would look like this:
PO#12445 for Vendor Enterprise
or
Invoice# 21412556 for Vendor Enterprise for PO# 12445
My current RegEX expression captures PO# with '#' and I use additional logic after the fact to remove the '#', however my expression is also capturing Invoice# and Inv# which I don't want it to do. I'd like it to only target PO#.
Current Expression: [P][O][#]\s*[0-9]{3,9}\d+\w
Any help would be greatly appreciated!
If you need only the digits, you can use \b(?<=PO#)\s?(\d{4,10})\b, with:
(?<=PO#): positivive lookbehind, be sure that this pattern is present before the needed pattern (PO followed by #)
\s?: 0 or 1 whitespace
(\d{4,10}): between 4 and 10 digits
\b: word boundaries to avoid ie. the 10 first digits of a 11 digits pattern match or 'SPO#' to match
Edit: Alexander Mashin is right about the lookbehind having to be fixed width, so \b(?<=PO#)\s?(\d{4,10})\b is better https://regex101.com/r/1KBQd1/5
Edit: added word boundaries
You can use a capturing group and repeat matching the digits 4-10 times using [0-9]{4,10}.
Note that [P][O][#] is the same as PO#
\bPO#\s*([0-9]{4,10})\b
\bPO#\s* Match PO# preceded by a word boundary and match 0+ whitespace chars
( Capture group 1
[0-9]{4,10} Match 4 - 10 digits
)\b Close group followed by a word boundary to prevent the match being part of a larger word
Regex demo
If PCRE is available, how about:
PO#\s*\K\d{4,10}(?=\D|$)
PO#\s* matches the leading substring "PO#" followed by 0 or more whitespaces.
\K resets the starting position of the match and works as a positive (zero length) lookbehind.
\d{4,10} matches a sequence of digits of 4 <= length <= 10.
(?=\D|$) is the positive lookahead to match a non-digit character or the end of the string.

Matching exactly 8 digits with 0 or more dashes

I am trying to match an exactly 8 digit phone number that has 0 or more dashes in it. For example, the following should all match:
12345678
123456-78
1234-5678
1-2-3-4-5-6-7-8
If I ignore the dashes, it is rather simple. I can just use:
[\d]{8}
If I want to match a string containing at least 8 characters (digits and dashes) I can use:
[\d-]{8,}
However, here I can't put an upper bound on the number of characters because I don't know how many dashes the number would have.
The only way I thought of would be to use:
[0-9][-]?[0-9][-]?[0-9][-]?[0-9][-]?[0-9][-]?[0-9][-]?[0-9][-]?[0-9]
However, this seems really messy for something that (at least in my mind) seems simple. Is there an easier way to do this?
You can use this regex with optional - after each digit:
^([0-9]-?){8}$
If your regex supports \d then use:
^(\d-?){8}$
RegEx Demo
You should use
^[0-9](-?[0-9]){7}$
^([0-9]-?){8}\b$
See the regex demo #1 and regex demo #2, where \b is used to make sure the last char is a digit (that is a word char).
Details
^ - start of string
[0-9] to match a digit since \d in various regex flavors may match more than just ASCII digits from 0 to 9.
(-?[0-9]){7} - matches 7 sequences of an optional hyphen and a digit, and will not allow trailing hyphen at the end of the string.
([0-9]-?){8} - matches eight occurrences of a digit followed with an optional - char
\b$ - is a trick to make sure the last char is of a word type. Since the pattern can only match a - (a non-word char) or a digit at the end, \b automatically makes sure the last char is a digit.

extract substring with regular expression

I have a string, actually is a directory file name.
str='\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3'
I need to extract the target substring 'UA0001A' with matlab (well I would like think all tools should have same syntax).
It does not necessary to be exact 'UA0001A', it is arbitrary alphabet-number combination.
To make it more general, I would like to think the substring (or the word) shall satisfy
it is a alphabet-number combination word
it cannot be pure alphabet word or pure number word
it cannot include 'midd' or 'midd3' or 'Midd3' or 'MIDD3', etc, so may use case-intensive method to exclude word begin with 'midd'
it cannot include 'y[0-9]{2,4}m[0-9]{1,2}d[0-9]{1,2}\w*'
How to write the regular expression to find the target substring?
Thanks in advance!
You can use
s = '\\198.168.0.10\share\ccdfiles\UA-midd3-files\UA0001A_15_Jun_2014_08.17.49\Midd3\y12m05d25h03m16.midd3';
res = regexp(s, '(?i)\\(?![^\W_]*(midd|y\d+m\d+))(?=[^\W_]*\d)(?=[^\W_]*[a-zA-Z])([^\W_]+)','tokens');
disp(res{1}{1})
See the regex demo
Pattern explanation:
(?i) - the case-insensitive modifier
\\ - a literal backslash
(?![^\W_]*(midd|y\d+m\d+)) - a negative lookahead that will fail a match if there are midd or y+digits+m+digits after 0+ letters or digits
(?=[^\W_]*\d) - a positive lookahead that requires at least 1 digit after 0+ digits or letters ([^\W_]*)
(?=[^\W_]*[a-zA-Z]) - there must be at least 1 letter after 0+ letters or digits
([^\W_]+) - Group 1 (what will extract) matching 1+ letters or digits (or 1+ characters other than non-word chars and _).
The 'tokens' "mode" will let you extract the captured value rather than the whole match.
See the IDEONE demo
this should get you started:
[\\](?i)(?!.*midd.*)([a-z]+[0-9]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*)
[\\] : match a backslash
(?i) : rest of regex is case insensitive
?! following match can not match this
(?!.*midd.*) : following match can not be a word wich has any character, midd, any character
([a-z]+[0-9]+[a-z0-9]*|[a-z]+[0-9]+[a-z0-9]*) match at least one number followed by at least one letter OR at least one letter followed by at least one number followed by any amount of letters and numbers (remember, cannot match the ?! group so no word which contains mid )