Regex numbers from string - regex

I am trying to write a regex that can find only numbers from given string. What I mean is:
Input: My number is +12 345 678. I have galaxy s3, its symbol 34abc.
Output: 345 and 678 (but not +12, 3 from word s3 or 34 from 34abc)
I tried just numbers (\d+) and combinations with whitespace and word characters. The closest was ^\d$, but that doesn't work because my numbers are part of a bigger string, not the whole string themselves. Can you give me a hint?
------- EDIT
Looks like I just don't know how to check a character without actually getting it into the result. Like "digits that follow a space character (without the space)".

In general case, you can make use of lookbehind and lookahead:
(?<=^|\s)\d+(?=$|\s)
The part which makes it into the captured output is \d+.
Lookbehind and lookahead are not included in the match.
I just included spaces as delimiters in the regex, but you may replace \s with any character class, as defined by your requirements. For example, to allow dots as separators (both in front and after the digits), use the following regex:
(?<=^|[\s.])\d+(?=$|[\s.])
The (?<=^|\s) should be read as follows:
(?<= ... ) defines the lookbehind group.
The expression which must precede the \d+ is ^|\s, meaning "either start of the line (^) or whitespace".
Similarly, (?=$|\s) defines the lookahead group (it must follow the captured digits), which is either end of the line ($) or whitespace.
A note on \b mentioned in other answers: it is a nice feature, means "word boundary", but the "word characters" are not customizable. This means that, for example, the "+" character is considered to be a separator and you can't change this if you use \b. With lookaround, you can customize the separators to your needs.
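A minimal Python sketch of the customizable-delimiter idea, run against the sample string from the question. One assumption to note: Python's built-in re module only accepts fixed-width lookbehind, so (?<=^|\s) raises an error there. The sketch therefore uses the equivalent double-negative form, e.g. (?<![^\s.]), read as "not preceded by a character other than whitespace or a dot". The variable names are illustrative only.

```python
import re

text = "My number is +12 345 678. I have galaxy s3, its symbol 34abc."

# Whitespace-only delimiters: same effect as (?<=^|\s)\d+(?=$|\s)
ws_only = re.findall(r"(?<!\S)\d+(?!\S)", text)

# Whitespace or dot as delimiters: same effect as (?<=^|[\s.])\d+(?=$|[\s.])
ws_or_dot = re.findall(r"(?<![^\s.])\d+(?![^\s.])", text)

print(ws_only)    # ['345']  (678 is followed by a dot, so it is rejected)
print(ws_or_dot)  # ['345', '678']
```

In engines that allow alternation with ^ inside a lookbehind (Java, for instance), the patterns can be used as written above.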

What you seem to want is a sequence of digits (\d+) that is preceded by a whitespace (\s) or the start of the string (^), and followed by a whitespace or punctuation character ([\s.,:;!?]) or the end of the string ($), but the preceding/following whitespace or punctuation character should not be included in the match, so you need positive lookahead ((?=xxx)) and lookbehind ((?<=xxx)).
(?<=^|\s)\d+(?=[\s.,:;!?]|$)
See regex101 for demo.
Remember to double the backslashes in a Java literal.

Safer RegEx
Try this:
(?<=\s|^)\d+(?=\s|\b)
Live Demo on Regex101
How it works:
(?<=\s|^) # Start of String OR Whitespace (will not select +)
# Positive Lookbehind ensures the data is not included in the match
\d+ # Digit(s)
(?=\s|\b) # Whitespace OR Word Boundary
# Positive Lookahead ensures the data is not included in the match
Lookarounds do not take up any characters in the match, so they can be used in place of Capture Groups. For example:
# Regex /.*barbaz/
barbaz # Matched Data Result: barbaz
foobarbaz # Matched Data Result: foobarbaz
# Regex (with Positive Lookahead) /.*bar(?=baz)/
barbaz # Matched Data Result: bar
foobarbaz # Matched Data Result: foobar
As you can see with the second RegEx, baz is never included in the matched data result, even though it was required in the string for the RegEx to match. The RegEx above works on the same principle.
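The barbaz comparison can be reproduced with a short Python sketch (the results dictionary is just an illustrative container, not part of the answer):

```python
import re

results = {}
for s in ("barbaz", "foobarbaz"):
    consumed = re.search(r".*barbaz", s).group()      # baz is part of the match
    asserted = re.search(r".*bar(?=baz)", s).group()  # baz is only checked
    results[s] = (consumed, asserted)

# Without a following "baz" the lookahead fails, so there is no match at all.
no_match = re.search(r".*bar(?=baz)", "foobar")

print(results["barbaz"])     # ('barbaz', 'bar')
print(results["foobarbaz"])  # ('foobarbaz', 'foobar')
print(no_match)              # None
```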
Not as Safe (Old) RegEx
You can try this RegEx:
\b\d+\b
\b is a Word Boundary. This will, however, select 12 from +12.
You can change the RegEx to this to stop 12 from being selected:
(?<!\+)\b\d+\b
This uses a Negative Lookbehind and will fail if there is a + before the digits.
Live Demo on Regex101
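As a rough check in Python, the two word-boundary variants behave like this on the sample string from the question (variable names are illustrative):

```python
import re

text = "My number is +12 345 678. I have galaxy s3, its symbol 34abc."

# Plain word boundaries: "+" counts as a separator, so 12 is picked up.
plain = re.findall(r"\b\d+\b", text)

# The negative lookbehind rejects digits that directly follow a "+".
no_plus = re.findall(r"(?<!\+)\b\d+\b", text)

print(plain)    # ['12', '345', '678']
print(no_plus)  # ['345', '678']
```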

Related

Regex pattern reads correctly but doesn't produce desired result

I am testing the following regex:
(?<=\d{3}).+(?!',')
I tested this at regex101.
Test string:
187 SURNAME First Names 7 Every Street, Welltown Racing Driver
The sequence I require is:
Begin after 3 digit numeral
Read all characters
Don't read the comma
In other words:
SURNAME First Names 7 Every Street
But as demo shows the negative lookahead to the comma has no bearing on the result. I can't see anything wrong with my lookarounds.
You could match the 3 digits, and make use of a capture group capturing any character except a comma.
\b\d{3}\b\s*([^,]+)
Explanation
\b\d{3}\b Match 3 digits between word boundaries to prevent partial word matches
\s* Match optional whitespace chars
([^,]+) Capture group 1, match 1+ chars other than a comma
Regex demo
.+ is greedy and consumes everything up to the end of the line.
At that position nothing follows, so (?!,) is guaranteed to be true.
I'm not sure if using quotes is correct for whichever flavour of regex you are using. Bare comma seems more correct.
Try:
(?<=\d{3})[^,]+
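A small Python sketch comparing the original pattern (with a bare comma in the lookahead) against the two suggested fixes, using the test string from the question. The variable names are illustrative.

```python
import re

s = "187 SURNAME First Names 7 Every Street, Welltown Racing Driver"

# Greedy .+ runs to the end of the string, where (?!,) trivially succeeds.
greedy = re.search(r"(?<=\d{3}).+(?!,)", s).group()

# A negated character class stops at the first comma.
fixed = re.search(r"(?<=\d{3})[^,]+", s).group()

# The capture-group variant also drops the space after the number.
grouped = re.search(r"\b\d{3}\b\s*([^,]+)", s).group(1)

print(repr(greedy))   # everything after "187", comma included
print(repr(fixed))    # ' SURNAME First Names 7 Every Street'
print(repr(grouped))  # 'SURNAME First Names 7 Every Street'
```

Note that the lookbehind version keeps the leading space, because \d{3} is only asserted and the match starts right after the digits.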

Regex : Find a number between parentheses and a specific string

I would like to find lines where there is a number between parentheses and the string BAC after it.
For example:
ABABBAB (87490), BAC ===> OK
BLABLABLA (65688), BIC ===> Not OK
ABABBAB (75664), EEE ===> Not OK
I have found an answer to get numbers between parentheses:
^.*?\([^\d]*(\d+)[^\d]*\).*$
Now I would like to add the condition to match also the BAC string
Something like this should work:
^.*?\([^\d]*(\d+)[^\d]*\),\s+BAC\s*$
, — direct match
\s+ — one or more spaces
BAC — direct match
\s* — zero or more spaces
If you'd like to match and report an arbitrary word, this should work:
^.*?\([^\d]*(\d+)[^\d]*\),\s+(\S+).*$
\S+ — one or more non-space characters
To match BAC, followed by anything:
^.*?\([^\d]*(\d+)[^\d]*\),\s+BAC.*$
You could avoid using a capture group with the following regex:
(?<=\()\d+(?=\))(?=.*\bBAC\b)
Demo
Each string of one or more digits surrounded by parentheses and followed by the word BAC (but not BACK or ABAC, for example) is matched.
This regex works with PCRE (PHP), Python, JavaScript, Onigmo regex engines, and others that support fixed-length positive lookbehinds and positive lookaheads. See the comparison chart here.
The regex engine performs the following operations.
(?<=\() # match '(' in a positive lookbehind
\d+ # match 1+ digits
(?=\)) # match ')' in a positive lookahead
(?=.*\bBAC\b) # match 0+ chars followed by `BAC` with word breaks fore and aft
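To compare the capture-group and lookaround approaches, here is a short Python sketch run over the three sample lines from the question (the results list is only an illustrative container):

```python
import re

lines = [
    "ABABBAB (87490), BAC",
    "BLABLABLA (65688), BIC",
    "ABABBAB (75664), EEE",
]

# Capture-group version: the number is in group 1.
with_group = re.compile(r"^.*?\([^\d]*(\d+)[^\d]*\),\s+BAC\s*$")
# Lookaround version: the number is the whole match.
lookaround = re.compile(r"(?<=\()\d+(?=\))(?=.*\bBAC\b)")

results = []
for line in lines:
    m1 = with_group.match(line)
    m2 = lookaround.search(line)
    results.append((m1.group(1) if m1 else None, m2.group() if m2 else None))

print(results)  # [('87490', '87490'), (None, None), (None, None)]
```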

Unmatch complete words if a negative lookahead is satisfied

I need to match only those words which don't have special characters like # and :.
For example:
git#github.com shouldn't match
list should return a valid match
show should also return a valid match
I tried it using a negative lookahead \w+(?![#:])
But it matches gi out of git#github.com, and it shouldn't match that either.
You may add \w to the lookahead:
\w+(?![\w#:])
The equivalent is using a word boundary:
\w+\b(?![#:])
Besides, you may consider adding a left-hand boundary to avoid matching words inside non-word non-whitespace chunks of text:
^\w+(?![\w#:])
Or
(?<!\S)\w+(?![\w#:])
The ^ will match the word at the start of the string and (?<!\S) will match only if the word is preceded by whitespace or the start of the string.
See the regex demo.
Why not (?<!\S)\w+(?!\S), the whitespace boundaries? Since you are building a lexer, you most probably have to deal with natural language sentences where words are likely to be followed by punctuation, and the (?!\S) negative lookahead would make the \w+ match only when it is followed by whitespace or at the end of the string.
You can use negative lookbehind and negative lookahead patterns around a word pattern to make sure that the word is not preceded or followed by a non-space character, or in other words, to make sure that it is surrounded by either a space or a string boundary:
(?<!\S)\w+(?!\S)
Demo: https://regex101.com/r/cjhUUM/2
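The difference between these variants shows up in a short Python sketch. The sample sentence is made up for illustration (it adds a word followed by a comma to the question's examples):

```python
import re

# The original attempt: backtracking lets \w+ give up the "t",
# after which the lookahead no longer sees "#".
partial = re.search(r"\w+(?![#:])", "git#github.com").group()

sample = "git#github.com list show, done"

# Left boundary plus \w in the lookahead: whole words only, punctuation tolerated.
words = re.findall(r"(?<!\S)\w+(?![\w#:])", sample)

# Whitespace boundaries on both sides: "show," is rejected because of the comma.
strict = re.findall(r"(?<!\S)\w+(?!\S)", sample)

print(partial)  # gi
print(words)    # ['list', 'show', 'done']
print(strict)   # ['list', 'done']
```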

Regex to find string with only numbers, but match only when preceded with # or \s and followed by space

I am attempting to find a regex that will find a string of numbers and only match if they are preceded by white space or a pound sign and followed by either white space or a line break. For example, the following would match:
#1234
#001234
000123
1234
But the following would not:
123-456
#1234
123kok
Using one of those online regex sandboxes, I tried to use a negative look behind:
\d*(?<=#|\s)\d{1,10} but I can't get the following to work. So out of these:
123-456
#1234
123kok
456 would match
(?<=...) is a lookbehind (preceded by ...), (?<!...) is a negative lookbehind (not preceded by ...). Writing \d*(?<=#|\s) doesn't make sense and behaves like (?<=#|\s) alone, since the character before a given position can't be a digit and a # or whitespace at the same time. But that isn't the problem. All you need is an assertion for the condition after the digits: a lookahead (negative here).
(?<![^\s#])\d+(?!\S)
The double negation: not preceded by a character that is not a whitespace or a #, is useful to include the start of the string. Same thing for the negative lookahead (not followed by a character that is not a whitespace) to include the end of the string.
Obviously:
(?<=^|\s|#)\d+(?=\s|$)
is correct too but longer.
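A quick Python check of the double-negation pattern, using one input per line taken from the question's examples (the list and dictionary names are illustrative):

```python
import re

lines = ["#1234", "#001234", "000123", "1234", "123-456", "123kok"]

# Not preceded by anything other than whitespace or "#",
# and not followed by a non-whitespace character.
pattern = re.compile(r"(?<![^\s#])\d+(?!\S)")

matches = {line: pattern.findall(line) for line in lines}

for line, found in matches.items():
    print(line, "->", found)
```

The first four inputs each yield one match (the digits without the #), while 123-456 and 123kok yield none. The longer (?<=^|\s|#) form is not accepted by Python's re module, which requires fixed-width lookbehind, so the double negation is also the more portable choice.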

Why is this regex selecting this text

I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
Now the online tester says that with the above regex the mentioned string is allowed (selected). My understanding is that it should not be allowed, because there are two numeric digits, 2 and 3, while the regex has only one numeric digit in it, i.e. \d. It should have been \d+. My current expression reads: zero or more of any character, followed by one numeric digit, followed by .txt. My question is: why is the above string passing the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input: MyFile23.txt here is the breakup:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
The pattern you have has one group, (.*), which in your example matches MyFile2
because the . allows any character.
Furthermore the . in the pattern after this group is not escaped which will result in allowing another character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
Here is the explanation, your "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \. else it will match "any character".
And finally, (.*) matches all the string from the beginning to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = beginning of a line/string, non-digit character, any number of repetitions, a digit, a literal period, a literal txt, and the end of the string/line (depending on the m switch, which depends on the input string, whether you have a list of words on separate lines, or just a separate file name).
Here is a working example.
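The behaviour described in these answers can be verified with a brief Python sketch. The file name MyFile2xtxt is a made-up input that shows the effect of the unescaped dot:

```python
import re

loose = re.compile(r"(.*)\d.txt")      # the pattern from the question
strict = re.compile(r"^\D*\d\.txt$")   # the suggested fix

# Greedy (.*) swallows the first digit; \d then matches the second one.
captured = loose.match("MyFile23.txt").group(1)

# The unescaped dot accepts any character in place of the period.
loose_accepts_x = loose.match("MyFile2xtxt") is not None

strict_two_digits = strict.match("MyFile23.txt") is not None
strict_one_digit = strict.match("MyFile2.txt") is not None

print(captured)           # MyFile2
print(loose_accepts_x)    # True
print(strict_two_digits)  # False
print(strict_one_digit)   # True
```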