How to find digits in String by regular expression? [duplicate] - regex

I would like to match positive and negative numbers (no decimal or thousand separators) inside a string using .NET, but I want to match whole words only.
So if a string looks like
redeem: -1234
paid: 234432
then I'd like to match -1234 and 234432
But if text is
LS022-1234-5678
FA123245
then I want no match returned. I tried
\b\-?\d+\b
but it will only match 1234 in the first scenario, not returning the "-" sign.
Any help is appreciated. Thank you.

Well, I'm sure this is far from perfect, but it works for your examples:
(?<=\W)-?(?<!\w-)\d+
If you want to allow underscores just before the number, then I'd use this modification:
(?i)(?<=[^a-z0-9])-?(?<![a-z0-9]-)\d+
Let me know of any issues and I'll try and help. If you'd like me to explain either of them, let me know that too.
EDIT
To only match if there is a space or tab just before the number / negative sign (as noted in the comment below), this could be used:
(?<=[ \t])-?\d+
Note that it will match e.g. on the first number series of a telephone number, time or date value, and will not match if the number is at the beginning of the line (after a newline) - make sure this is what you intend :D
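To make that caveat concrete, here is a minimal Python sketch. It uses Python's re module rather than .NET, which I'd expect to behave the same for this particular pattern; the phone-like string and the line-start string are made-up samples, not from the question.

```python
import re

# The "space or tab just before the number" variant from the EDIT above.
space_before = re.compile(r"(?<=[ \t])-?\d+")

wanted = space_before.findall("redeem: -1234")      # the case from the question
phone = space_before.findall("call 555-1234")       # hypothetical phone-like text
line_start = space_before.findall("42 is first")    # number at the start of the line

print(wanted)      # ['-1234']
print(phone)       # ['555']  (only the first digit group is picked up)
print(line_start)  # []       (nothing precedes the number, so no match)
```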

There is no word boundary between a space and -, thus you can't use \b there.
You could use:
(?<!\S)-?\d+\b
or
(?<![\w-])-?\d+\b
depending on your requirements (which aren't fully specified).
Both will work for your examples, though.
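A minimal Python sketch of the difference between the two, assuming Python's re module handles these lookbehinds like .NET does (true for ASCII input such as this). The "(42)" string is a made-up sample to show where the variants diverge.

```python
import re

# Sample lines from the question.
text = "redeem: -1234\npaid: 234432\nLS022-1234-5678\nFA123245"

# Variant 1: the number must start the text or follow whitespace.
after_whitespace = re.compile(r"(?<!\S)-?\d+\b")
# Variant 2: the number must not follow a word character or a hyphen.
no_word_or_hyphen = re.compile(r"(?<![\w-])-?\d+\b")

matches_ws = after_whitespace.findall(text)
matches_wh = no_word_or_hyphen.findall(text)
print(matches_ws)  # ['-1234', '234432']
print(matches_wh)  # ['-1234', '234432']

# They differ once punctuation touches the number.
paren_ws = after_whitespace.findall("(42)")
paren_wh = no_word_or_hyphen.findall("(42)")
print(paren_ws, paren_wh)  # [] ['42']
```

So the first variant is the stricter one: pick it if a number glued to punctuation should be ignored.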

The \b-?\d+\b pattern is wrong because \b before an optional -? pattern will require a word char to appear immediately to the left of the hyphen. In general, do not use word boundaries next to optional patterns (unless you know what you are doing of course).
You might use -?\b\d+\b to match 123 or -123 like numbers as whole words. However, here, you are looking for something a bit different, because the 1234 and 5678 are whole words inside LS022-1234-5678 since they are enclosed with non-word chars (namely, a hyphen).
In this case, you need to extend whole word matching \b with extra lookbehind check on the left:
-?\b(?<!\d-)\d+\b
See the regex demo. Details:
-? - an optional hyphen
\b - a word boundary
(?<!\d-) - a negative lookbehind that fails the match if there is a digit + - immediately to the left of the current location.
\d+ - one or more digits
\b - a word boundary.
See the C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var text = "LS022-1234-5678, FA123245, redeem: -1234, paid: 234432";
        // The verbatim string literal (@"...") keeps the backslashes as written.
        var matches = Regex.Matches(text, @"-?\b(?<!\d-)\d+\b")
            .Cast<Match>()
            .Select(x => x.Value)
            .ToList();
        foreach (var s in matches)
            Console.WriteLine(s);
    }
}
Output:
-1234
234432
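For comparison, here is a rough Python sketch that runs the three patterns discussed in this answer against the same text. It assumes Python's re module treats \b and this fixed-width lookbehind the same way .NET does, which holds for ASCII input like this; the dictionary keys are just labels I made up.

```python
import re

text = "LS022-1234-5678, FA123245, redeem: -1234, paid: 234432"

patterns = {
    "original": r"\b-?\d+\b",                   # \b in front of the optional hyphen
    "hyphen_first": r"-?\b\d+\b",               # hyphen moved before the boundary
    "with_lookbehind": r"-?\b(?<!\d-)\d+\b",    # extra check for "digit + hyphen"
}

results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
# original ['-1234', '-5678', '1234', '234432']
# hyphen_first ['-1234', '-5678', '-1234', '234432']
# with_lookbehind ['-1234', '234432']
```

The first pattern loses the sign on "redeem: -1234", the second keeps it but still picks up the pieces of LS022-1234-5678, and only the third one rejects those.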

Related

Regular Expression to extract alphanumeric parts of a URL?

Given any URL, like:
https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981
How do I extract the numeric or alphanumeric part of the URL? I.e. the following strings from the url given above:
1. v1
2. 1243PQ
3. P1
4. 9981
To rephrase, a regex to extract strings from a string (URL) which have at least 1 digit and 0 or more alphabet characters, separated by '/'.
I tried to capture a repeating group (^[a-zA-Z0-9]+)+ and ([a-zA-Z]{0,100}[0-9]{1,100})+ but it didn't work. In hindsight, intuition says this shouldn't work. I am unsure how to match patterns over a group and not just a single character.
If I understand what you really want:
Extracting the parts that consist only of digits, or of digits with letters before or after them
then I can suggest this regex:
\b[a-zA-Z]*[0-9]+[a-zA-Z]*\b
I use \b to assert position of a word boundary or a part.
Since digits are required and letters can come before or after them, I use the regex above.
If letters and digits can be mixed in any order, then I can suggest this regex:
\b[a-zA-Z0-9]*[0-9]+[a-zA-Z0-9]*\b
I believe this should work for you:
(\d*\w+\d+\w*)
EDIT: actually, this should be sufficient
(\w+\d+\w*)
or
(\w*\d+\w*)
Well, you could do this:
(\w*\d+\w*) with the g (global) regex option
On the example URL, it would look like this:
const regex = /(\w*\d+\w*)/g;
const url = 'https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981';
console.log(url.match(regex))
Try \/[a-zA-Z]*\d+[a-zA-Z0-9]*
Explanation:
\/ - match / literally
[a-zA-Z]* - 0+ letters
\d+ - 1+ digits - thanks to this, we require at least one digits
[a-zA-Z0-9]* - 0+ letters or digits
It will capture the leading / as well, so you need to trim it.
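One way to avoid the trimming step, sketched in Python (my choice of language, not the asker's): wrap everything after the slash in a capture group, so that findall() returns only the group. The variable names are made up for illustration.

```python
import re

url = "https://stackoverflow.com/v1/summary/1243PQ/details/P1/9981"

# Same pattern as above, with the part after "/" in a capture group.
# findall() returns the group contents, i.e. the segment without the slash.
segment = re.compile(r"/([a-zA-Z]*\d+[a-zA-Z0-9]*)")

parts = segment.findall(url)
print(parts)  # ['v1', '1243PQ', 'P1', '9981']
```

Note that the pattern does not check where the segment ends, so a segment with other characters after the digits would be cut short rather than rejected.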

Regex lookahead part of group accepted

I'm using regex in PowerShell 5.1.
I need it to detect groups of numbers, but ignore groups followed or preceded by /, so from this it should detect only 9876.
[regex]::matches('9876 1234/56','(?<!/)([0-9]{1,}(?!(\/[0-9])))').value
As it is now, the result is:
9876
123
6
More examples: "13 17 10/20" should only match 13 and 17.
Tried using something like (?!(\/([0-9]{1,}))), but it did not help.
You may use
\b(?<!/)[0-9]+\b(?!/[0-9])
See the regex demo
Alternatively, if the numbers can be glued to text:
(?<![/0-9])[0-9]+(?!/?[0-9])
See this regex demo.
The first pattern is based on word boundaries \b that make sure there are no letters, digits and _ right before and after an expected match. The second one just makes sure there are no digits and / on both ends of the match.
Details
(?<![/0-9]) - a negative lookbehind making sure there is no digit or / immediately to the left of the current location
[0-9]+ - one or more digits
(?!/?[0-9]) - a negative lookahead making sure there is no optional / followed with a digit immediately to the right of the current location.
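Both patterns can be checked quickly with a small Python sketch. This is Python's re module rather than the .NET engine PowerShell uses, which should agree for these patterns; "abc123 45/67" is a made-up sample to show the "glued to text" case.

```python
import re

# First pattern: relies on word boundaries, so digits glued to letters are skipped.
word_bounded = re.compile(r"\b(?<!/)[0-9]+\b(?!/[0-9])")
# Second pattern: only forbids a digit or "/" on either side of the match.
glued_ok = re.compile(r"(?<![/0-9])[0-9]+(?!/?[0-9])")

samples = ["9876 1234/56", "13 17 10/20", "abc123 45/67"]
table = {s: (word_bounded.findall(s), glued_ok.findall(s)) for s in samples}
for sample, (first, second) in table.items():
    print(sample, first, second)
# 9876 1234/56 ['9876'] ['9876']
# 13 17 10/20 ['13', '17'] ['13', '17']
# abc123 45/67 [] ['123']
```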

Only match unique string occurrences

We are doing some Data Loss Prevention for emails, but the issue is when people reply to emails multiple times sometimes the credit card number or account number will appear multiple times.
How can we get a Java regex to match each string only once?
So for example, we are using the following regex to catch account numbers that consist of 2 letters followed by 5 or 6 digits. It also excludes numbers starting with CR or cr.
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (Ignore as its already found it above)
LM45855
Unique string occurrence can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern whose unique occurrence you are searching for. Note that both will work with the .NET regex library, but the second one is not supported by most other libraries (only the PyPI regex library for Python and JavaScript engines implementing ECMAScript 2018 support it). Note that . does not match line break chars by default, so you need to pass a modifier like DOTALL: in most libraries, you may add the (?s) modifier inside the pattern (only in Ruby, (?m) does the same), or use specific flags that you pass to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
Details:
\b - a leading word boundary
((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing
(?!CR|cr) - the next two characters cannot be CR or cr, the negative lookahead check
[A-Za-z]{2} - 2 ASCII letters
\d{5,6} - 5 to 6 digits
\b - trailing word boundary
(?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed with a word boundary (\b), same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.
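To try this outside a regex tester, here is a short Python sketch (Python rather than Java; the sample body is built from the question's list, and the CR12345 line is my own addition to exercise the exclusion). Note that the repeated value is reported at its last position, not its first.

```python
import re

body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855\nCR12345"

# Match an account number only if the same value does not occur again later.
last_only = re.compile(r"\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)")

unique_ids = [m.group(1) for m in last_only.finditer(body)]
print(unique_ids)
# ['CX14584', 'JB145888', 'JD748452', 'CX12345', 'LM45855']
```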
I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:
String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
ccMap.put(ccNumber, null);
}
Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
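The same keep-a-collection idea, sketched in Python instead of Java for brevity. It assumes the whole email body is available as one string (the sample below is made up from the question's list) and uses dict.fromkeys, which preserves insertion order, so each account number is kept at its first position.

```python
import re

body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855\nCR12345"

# The question's pattern: 2 letters (not CR/cr) followed by 5 or 6 digits.
account = re.compile(r"\b(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}\b")

# findall() returns every hit, duplicates included; dict.fromkeys() drops
# the repeats while keeping the order of first appearance.
first_seen = list(dict.fromkeys(account.findall(body)))
print(first_seen)
# ['CX12345', 'CX14584', 'JB145888', 'JD748452', 'LM45855']
```

Doing the de-duplication outside the regex also avoids rescanning the rest of the text for every candidate match.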

RegExp: How do I include 'avoid non-numeric characters' from a pattern search?

I want to filter out all .+[0-9]. (correct way?) patterns to avoid duplicate decimal points within a numeral: (e.g., .12345.); but allow non-numerals to include duplicate decimal points: (e.g. .12345*.) where * is any NON-NUMERAL.
How do I include a non-numeral negation value into the regexp pattern? Again,
.12345. <-- error: erroneous numeral.
'.12345(.' or '.12345*.' <-- Good.
I think you are looking for
^\d*(?:\.\d+)?(?:(?<=\d)[^.\d\n]+\.)?$
Remember to escape the regex properly in Swift:
let rx = "^\\d*(?:\\.\\d+)?(?:(?<=\\d)[^.\\d\\n]+\\.)?$"
REGEX EXPLANATION:
^ - Start of string
\d* - Match zero or more digits
(?:\.\d+)? - Match decimal part, 0 or 1 time (due to ?)
(?:(?<=\d)[^.\d\n]+\.)? - Optionally (due to ? at the end) matches 1 or more symbols preceded with a digit (due to (?<=\d) lookbehind) other than a digit ([^\d]), a full stop ([^.]) or a linebreak ([^\n]) (this one is more for demo purposes) and then followed by a full stop (\.).
$ - End of string
I am using non-capturing groups (?:...) for better performance and usability.
UPDATE:
If you prefer an opposite approach, that is, matching the invalid strings, you can use a much simpler regex:
\.[0-9]+\.
In Swift, let rx = "\\.[0-9]+\\.". It matches any substrings starting with a dot, then 1 or more digits from 0 to 9 range, and then again a dot.
See another regex demo
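Both approaches can be compared side by side with a minimal Python sketch (Python's re module instead of Swift's NSRegularExpression, assuming they agree on these constructs). The "12.5" sample is my own addition; the others come from the question.

```python
import re

# Approach 1: the whole string must look valid.
valid = re.compile(r"^\d*(?:\.\d+)?(?:(?<=\d)[^.\d\n]+\.)?$")
# Approach 2: look for the invalid "dot, digits, dot" fragment instead.
invalid_part = re.compile(r"\.[0-9]+\.")

samples = [".12345.", ".12345*.", ".12345(.", "12.5"]
report = {
    s: (valid.search(s) is not None, invalid_part.search(s) is not None)
    for s in samples
}
for s, (is_valid, has_invalid_part) in report.items():
    print(s, is_valid, has_invalid_part)
# .12345. False True
# .12345*. True False
# .12345(. True False
# 12.5 True False
```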
The regex character class for non-digits is \D. Conversely, if you're looking for digits only, \d would work.
Without further context of what you're trying to achieve it's hard to suggest how to build a regex for it, though based on your example, (I think) this should work: .+\d+\D+

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
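If Ruby is not an option, the same capture-group approach might look like this in Python. It is a sketch only; the helper name street_name and the "no match" sample are made up for illustration.

```python
import re
from typing import Optional

# Number part, space, street name (group 1), space, final word, optional period.
street = re.compile(r"^[-\d]+ ([\w ']+) \w+\.?$")

def street_name(address: str) -> Optional[str]:
    """Return the street name, or None if the address has an unexpected shape."""
    m = street.match(address)
    return m.group(1) if m else None

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]
names = [street_name(a) for a in addresses]
print(names)  # ['Barrel', 'Old Mill', "Howard's"]
print(street_name("no number here"))  # None
```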
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
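Combined with the asker's lookbehind, that lookahead could be tried like this in Python. This is a sketch assuming Python's re module, and I replaced the asker's (\w|\s|\')+ alternation with the equivalent character class [\w\s']+. The address with trailing spaces is a made-up sample.

```python
import re

# Lookbehind for "digit + whitespace", then the name, then the lookahead
# that requires exactly one more word (with optional period) before the end.
strict = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")
# Variant that tolerates trailing whitespace after the last word.
tolerant = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?\s*$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]
names = [strict.search(a).group(0) for a in addresses]
print(names)  # ['Barrel', 'Old Mill', "Howard's"]

padded = "23 Barrel Rd.  "
strict_hit = strict.search(padded)
tolerant_hit = tolerant.search(padded)
print(strict_hit)             # None
print(tolerant_hit.group(0))  # Barrel
```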
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
That will leave what you want in \1 for each address.
If you want the match itself to be exactly that text, you may use \K (not supported by every regex engine). Example:
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = ' '.join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract. split() alone returns a list of words, so they are joined back with spaces.)