Optional thousand-separator processes incomplete string


I need to process numbers that may have optional thousand-separators, such as 1234567 and 1,234,567
I naively assumed I could achieve this with
(\d{1,3}([,]?(\d{3}))*)
This, however, matches only 123456 of 1234567 (not the 7), while it matches 1,234,567 correctly.
However, if I specify an explicit number of matches (2 in this case)
(\d{1,3}([,]?(\d{3})){2})
or a bound (such as \b)
(\d{1,3}([,]?(\d{3}))*)\b
the full match is performed.
Why does the “greedy” * quantifier stop after the first match in the first regex?

If you want to match both numbers with, and without, proper comma thousands separators, then I would use an alternation:
^(\d{1,3}(?:,\d{3})*|\d+)$
Demo
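As a quick check of the alternation, here is a minimal Python sketch. The helper name `is_number` and the sample inputs are illustrative assumptions, not part of the question.

```python
import re

# First branch: correctly grouped commas; second branch: plain digits only.
NUMBER = re.compile(r'^(\d{1,3}(?:,\d{3})*|\d+)$')

def is_number(text):
    """Return True if text is a plain or correctly comma-grouped number."""
    return NUMBER.match(text) is not None

for sample in ['1234567', '1,234,567', '12,34', '1,2345']:
    print(sample, is_number(sample))
```

The two anchors matter here: without ^ and $ the first branch would happily accept a 3-digit prefix of a longer number.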

The reason is that \d{1,3} is greedy, so it matches 123 at the beginning of the number. Then the rest of the regexp will only match groups of exactly 3 digits because it uses \d{3}. A regular expression doesn't try to match the longest possible string, so it won't backtrack and shorten the match for \d{1,3} to make the rest of the regexp go further.
But if you add a word boundary \b at the end, it no longer matches with that 3-digit prefix. That causes it to backtrack until it's able to match groups of 3 digits ending with a word boundary.
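The behaviour described above can be reproduced with Python's re module. This is only a sketch using the three patterns from the question; the variable names are made up for illustration.

```python
import re

text = '1234567'

# No constraint after the group: the first successful match is kept.
unbounded = re.match(r'(\d{1,3}([,]?(\d{3}))*)', text)
# Exactly two groups required: forces \d{1,3} to back off to a single digit.
counted = re.match(r'(\d{1,3}([,]?(\d{3})){2})', text)
# Word boundary required: same backtracking effect.
bounded = re.match(r'(\d{1,3}([,]?(\d{3}))*)\b', text)

print(unbounded.group(1))  # 123456
print(counted.group(1))    # 1234567
print(bounded.group(1))    # 1234567
```

In the last two cases the engine first tries 123 + 456, fails on what follows, and only then shortens the \d{1,3} part until the whole pattern fits.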

Related

Regex stopped matching after the first match

I need some help here
Here is an example of what I'm trying to match:
1 ScreenMail Enable friendly none Internal any 5
I need to match everything except the last digits (5), that is, the first digit (1), the spaces, letters, special characters, and so on. I tried using /^(\d), but it stopped after matching the first digit. Your assistance would be appreciated.

The simplest way is probably to remove the last digits with
\d+$
or, if whitespace may follow them,
\d+\s*$
See the regex demo.
You may want to use a matching regex like
^.*[^\d\s]
that matches any zero or more chars other than line break chars (.*) as many as possible and then a char other than a digit and whitespace. See this regex demo.
However, if you allow any text after the last digits, this will match too much, because the final [^\d\s] can then land on a character after them. You can then use
^.*[^\d\s](?=\s*\d)
See this regex demo. The (?=\s*\d) positive lookahead requires zero or more whitespaces and then a digit immediately to the right of the current location.
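The three options can be compared side by side in Python. This is a rough sketch; the second input string (`with_tail`) is an invented example, not taken from the question.

```python
import re

line = '1 ScreenMail Enable friendly none Internal any 5'

# Option 1: delete the trailing digits and any whitespace after them.
stripped = re.sub(r'\d+\s*$', '', line)

# Option 2: match up to the last char that is neither a digit nor whitespace.
simple = re.match(r'^.*[^\d\s]', line).group(0)

# Option 3: additionally require optional whitespace and a digit to follow.
with_tail = '1 ScreenMail any 5 trailing text'  # hypothetical input
unguarded = re.match(r'^.*[^\d\s]', with_tail).group(0)
guarded = re.match(r'^.*[^\d\s](?=\s*\d)', with_tail).group(0)

print(repr(stripped))
print(repr(simple))
print(repr(unguarded))
print(repr(guarded))
```

Note that option 1 leaves the space before the removed digits in place, while options 2 and 3 stop at the last non-space character.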

Regex to match 4 letters in a string

I am trying to write some regex that will match a string that contains 4 or more letters in it that are not necessarily in sequence.
The input string can have a mix of upper and lowercase letters, numbers, non-alpha chars etc, but I only want it to pass the regex test if it contains at least 4 upper or lowercase letters.
An example of what I would like to be a valid input can be seen below:
a124Gh0st
I have currently written this piece of regex:
(?(?=[a-zA-Z])([a-zA-Z])| )
This returns 5 matches successfully, but it will currently always pass as long as I have greater than 1 letter in the input string. If I add {4,} to the end of it, then it works, but only in situations where there are 4 letters in a row.
I am using the following website to test what I have been doing: regex101
Any help on this would be greatly appreciated.

You may use
(?s)^([^a-zA-Z]*[A-Za-z]){4}.*
or
^([^a-zA-Z]*[A-Za-z]){4}[\s\S]*
See the regex demo.
Details:
^ - start of string
([^a-zA-Z]*[A-Za-z]){4} - exactly 4 sequences of:
[^a-zA-Z]* - 0+ chars other than ASCII letters
[A-Za-z] - an ASCII letter
[\S\s]* - any 0+ chars (same as .* if the DOTALL modifier is enabled).
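A small Python sketch of the first pattern, wrapped in a helper. The function name `has_four_letters` and the test strings other than a124Gh0st are assumptions for illustration.

```python
import re

# (?s) lets . match newlines, so the trailing .* consumes the rest of the input.
FOUR_LETTERS = re.compile(r'(?s)^([^a-zA-Z]*[A-Za-z]){4}.*')

def has_four_letters(text):
    """Return True if text contains at least 4 ASCII letters anywhere."""
    return FOUR_LETTERS.match(text) is not None

for sample in ['a124Gh0st', 'a1b2c3', '1234']:
    print(sample, has_four_letters(sample))
```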

Why don't you just match the zero or more characters between each letter? For example,
(?:[A-Za-z].*){4}
You'll recognize the [A-Za-z]. The . matches any character, so .* is a run of any number (including zero) of any character. The group of a letter followed by any number of any characters is repeated four times, so this pattern matches if and only if at least four letters appear in the string. (Note that the trailing .* of the fourth repeat of the pattern is mostly inconsequential, since it can match zero characters).
If you are using a regex language that supports reluctant quantifiers, then using them will make this pattern considerably more efficient. For example, in Java or Perl, one might prefer to use
(?:[A-Za-z].*?){4}
The .*? still matches any number of any character, but the matching algorithm will match as few characters as possible with each such run. This will reduce the amount of backtracking it needs to perform. For this particular pattern, it will reduce the needed backtracking to zero.
If you do not have reluctant quantifiers in your regex dialect, then you can achieve the same desirable effect a bit more verbosely:
(?:[A-Za-z][^A-Za-z]*){4}
There, only non-letters are matched for the runs between letters.
Even with this, the pattern uses some regex features not present in all regex flavors -- non-capturing groups, enumerated quantifiers -- but these are present in your original regex. For a maximally-compatible form, you might write
[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]
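To see how these variants differ in what they consume, here is a hedged Python comparison. The dictionary keys and the helper `matched_text` are invented names, and the negated-class entry uses a plain greedy quantifier.

```python
import re

variants = {
    'greedy': r'(?:[A-Za-z].*){4}',
    'reluctant': r'(?:[A-Za-z].*?){4}',
    'negated class': r'(?:[A-Za-z][^A-Za-z]*){4}',
    'unrolled': r'[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z][^A-Za-z]*[A-Za-z]',
}

def matched_text(pattern, text):
    """Return the text matched by pattern anywhere in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

for name, pattern in variants.items():
    print(name, matched_text(pattern, 'a124Gh0st'), matched_text(pattern, '12ab3c'))
```

All four agree on whether a string passes; they differ only in how far the match extends. The greedy form runs to the end of the line, while the others stop at the fourth letter.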

How to find words that contain string with a limited size

I need to find all the words in an inputted text that have (?i:val) in them and are no longer than 5 characters.
So far I got: \b([a-zA-Z]*(?i:val)[a-zA-Z]*){1,4}\b
If we take this sample text to look in: In computer science, a value is an expression which cannot be evaluated any further (a normal form). Val is also a match
I get 3 matches (value, evaluated and Val), however evaluated should not match the pattern, as it is too long. What is the right way to get this straight?

Your pattern does not account for the length of the words matched.
Use word boundaries and a lookahead like this:
(?i)\b(?=\w*val)\w{1,5}\b
See regex demo
The regex matches:
\b - a leading word boundary since the next pattern is \w
(?=\w*val) - a lookahead making sure there is a val substring after zero or more word characters
\w{1,5} - matches 1 to 5 word characters
\b - trailing word boundary that stops words of more than 5 characters long from matching
You may use an ASCII JS version of the regex:
/\b(?=[a-z]*val)[a-z]{1,5}\b/i
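The lookahead pattern can be tried against the sample text from the question. This is a minimal Python sketch; the variable names are illustrative.

```python
import re

text = ('In computer science, a value is an expression which cannot be '
        'evaluated any further (a normal form). Val is also a match')

# The lookahead checks for "val"; \w{1,5}\b then enforces the length limit.
short_val_words = re.findall(r'(?i)\b(?=\w*val)\w{1,5}\b', text)
print(short_val_words)  # ['value', 'Val']
```

"evaluated" is rejected because no run of 1 to 5 word characters starting at its first letter ends on a word boundary.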

It's important to understand why "evaluated" was matched. Note:
[a-zA-Z]* matches the "e"
(?i:val) matches "val"
[a-zA-Z]* matches "uated"
Actually, there's no repetition here! The pattern was matched in only one iteration.
You can achieve what you want using lookarounds, but I think that regex is not the best tool for this task. I highly recommend using other functions, depending on what you have available.
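As one possible non-regex route, assuming Python is available, the words can be split and filtered with ordinary string functions. The function name `short_words_containing` and its parameters are hypothetical.

```python
import string

def short_words_containing(text, needle='val', max_len=5):
    """Return words that contain needle (case-insensitively) and fit max_len."""
    # Split on whitespace, then trim surrounding punctuation such as "(" or ")."
    words = (w.strip(string.punctuation) for w in text.split())
    return [w for w in words if needle in w.lower() and len(w) <= max_len]

text = ('In computer science, a value is an expression which cannot be '
        'evaluated any further (a normal form). Val is also a match')
print(short_words_containing(text))  # ['value', 'Val']
```

Here the length check is an explicit comparison, so there is no quantifier to misread.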

Match only exact numbers, not pre- or suffixed with slash/dash etc

I need a regular expression that matches only numbers of length 7 (they can have leading zeros). I used the following super easy regex: \b[0-9]{7}\b. However, this regex also matches numbers in e.g. 5254-6408499 and (0241)4013999 (see https://regex101.com/r/zF5hV7/1).
How can I prevent them from being matched? I only want numbers of length 7 having leading and/or trailing spaces.

Depending on the regular expression flavor, you could create your own boundaries:
(?<=^| )\d{7}(?= |$)
This asserts that the seven digits are preceded by either the start of the string or a space, and followed by either a space or the end of the string. Because both checks are lookarounds, the surrounding spaces are not part of the match.
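Whether that exact pattern compiles does depend on the flavor: Python's re module requires fixed-width lookbehind and rejects (?<=^| ). The sketch below therefore swaps in a related technique, negative lookarounds on \S, which treats any whitespace (not only a space) or a string edge as the boundary. The sample text is made up from the numbers in the question.

```python
import re

# (?<!\S): no non-whitespace char before; (?!\S): no non-whitespace char after.
SEVEN_DIGITS = re.compile(r'(?<!\S)\d{7}(?!\S)')

text = '0012345 5254-6408499 (0241)4013999 7654321'
print(SEVEN_DIGITS.findall(text))  # ['0012345', '7654321']
```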

You can use this regex:
(?:^|\s)([0-9]{7})(?:\s|$)
and grab captured group #1
Updated RegEx Demo
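A short Python sketch of the capturing-group version, using an invented sample string built from the numbers in the question:

```python
import re

# Group 1 holds the digits; the separators are matched but not captured.
pattern = re.compile(r'(?:^|\s)([0-9]{7})(?:\s|$)')

text = '0012345 5254-6408499 (0241)4013999 7654321'
print(pattern.findall(text))  # ['0012345', '7654321']

# Caveat: the separator is consumed, so a second number that shares
# a single space with the first one is skipped.
print(pattern.findall('1234567 7654321'))  # ['1234567']
```

If adjacent numbers separated by a single space matter, a lookaround-based boundary avoids consuming the shared separator.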

Regular expression to match last number in a string

I need to extract the last number that is inside a string. I'm trying to do this with regex and negative lookaheads, but it's not working. This is the regex that I have:
\d+(?!\d+)
And these are some strings, just to give you an idea, and what the regex should match:
ARRAY[123] matches 123
ARRAY[123].ITEM[4] matches 4
B:1000 matches 1000
B:1000.10 matches 10
And so on. The regex matches the numbers, but all of them. I don't get why the negative lookahead is not working. Any one care to explain?

Your regex \d+(?!\d+) says
match any number if it is not immediately followed by a number.
which is incorrect. A number is last if it is not followed (following it anywhere, not just immediately) by any other number.
When translated to regex we have:
(\d+)(?!.*\d)
Rubular Link
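For comparison, a minimal Python sketch that runs both the original and the corrected pattern on the strings from the question. The helper name `last_number` is an assumption.

```python
import re

def last_number(text):
    """Return the last run of digits in text, or None if there is none."""
    m = re.search(r'(\d+)(?!.*\d)', text)
    return m.group(1) if m else None

for sample in ['ARRAY[123]', 'ARRAY[123].ITEM[4]', 'B:1000', 'B:1000.10']:
    print(sample, last_number(sample))

# The original pattern only rejects digits directly after the match,
# so every number in the string still qualifies.
print(re.findall(r'\d+(?!\d+)', 'ARRAY[123].ITEM[4]'))  # ['123', '4']
```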

I took it this way: you need to make sure the match is close enough to the end of the string; close enough in the sense that only non-digits may intervene. What I suggest is the following:
/(\d+)\D*\z/
\z at the end means that that is the end of the string.
\D* before that means that an arbitrary number of non-digits can intervene between the match and the end of the string.
(\d+) is the matching part. It is in parentheses so that you can pick it up, as was pointed out by Cameron.
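The same idea in Python, as a sketch: Python's re module has traditionally spelled the end-of-string anchor \Z, so that is used here in place of \z. The function name `last_number_anchored` is hypothetical.

```python
import re

def last_number_anchored(text):
    """Return the digits that only non-digits separate from the end of text."""
    # \Z is Python's end-of-string anchor, standing in for \z.
    m = re.search(r'(\d+)\D*\Z', text)
    return m.group(1) if m else None

for sample in ['ARRAY[123].ITEM[4]', 'B:1000.10']:
    print(sample, last_number_anchored(sample))
```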

You can use
.*(?:\D|^)(\d+)
to get the last number; this is because the matcher will gobble up all the characters with .*, then backtrack to the first non-digit character or the start of the string, then match the final group of digits.
Your negative lookahead isn't working because on the string "1 3", for example, the 1 is matched by the \d+, then the space matches the negative lookahead (since it's not a sequence of one or more digits). The 3 is never even looked at.
Note that your example regex doesn't have any groups in it, so I'm not sure how you were extracting the number.
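Both points can be checked in Python. This is a rough sketch; `last_number_greedy` is an invented helper name.

```python
import re

def last_number_greedy(text):
    """Let .* swallow the string, then backtrack to the final digit run."""
    m = re.match(r'.*(?:\D|^)(\d+)', text)
    return m.group(1) if m else None

for sample in ['ARRAY[123]', 'B:1000.10', '1000', '1 3']:
    print(sample, last_number_greedy(sample))

# The original pattern on "1 3": both numbers match, because the lookahead
# only inspects the character right after each number.
print(re.findall(r'\d+(?!\d+)', '1 3'))  # ['1', '3']
```

The (?:\D|^) part is what keeps the group from capturing only the tail of a number: the digits must start right after a non-digit or at the start of the string.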

I still had issues with managing the capture groups
(for example, if using Inline Modifiers (?imsxXU)).
This worked for my purposes -
.(?:\D|^)\d(\D)
