Extract numbers out of text with inconsistent line breaks - regex

I have text with 6 numbers, typically stored on one line:
SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData
SomeData\n0.00 0.00 0.00 485,007.24 0.00 485,007.24\nSomeData
This regex worked fine on it:
\n[0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]*\n
I noticed that every once in a while I get this:
SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n0.00 1,358,323.83\nSomeData
Note how the line breaks are inserted at random, after a sign or in the middle of a number, as if the system stored the values without filtering line breaks.
I am struggling to extract these. I tried various expressions; the most successful one was [0-9,.-][\n]{0,1}[0-9,.-][ ]{0,1} to match an individual number.
What expression can I use to match both variations of the number format, preferably already stripping out the inconsistent line breaks?
Update: Going with
[-\n]{0,2}[0-9,]+[\n.0-9]{3,4}[\n ]{0,1}
Please let me know if there's a better way.

One way is to write an exact representation of what constitutes a number; in your case [-+]?[0-9]+[0-9,]*(?:\.[0-9]+)? would do the trick. This helps because the search then knows where a number starts and where it ends (from rules such as: a sign only appears at the start, a dot cannot appear more than once, etc.). Next, you want to match groups of six numbers, each delimited by either a newline or a space, so wrap the number in a capture group and repeat it exactly six times: (...[ \n]*){6,6}. This helps because the regex engine can then work out by backtracking what to consider a number, since it knows how many it has to match. Then you want to allow newlines in pretty much any position, so add the newline to each character class. You might also want to anchor the numbers on both sides; this is not strictly necessary, because the regex engine will already try to identify valid tuples of 6 numbers. The end result is:
SomeData\n([-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?[ \n]){6,6}SomeData
This will find tuples of 6 numbers no matter where the line breaks fall. Here is an example: https://regex101.com/r/jD5nT8/1
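As a rough Python sketch of the same idea (assuming the \n in the samples are real newline characters and that every record is framed by the literal text SomeData, as in the question): writing the number group out six times instead of using {6,6} gives each number its own capture group, and the stray newlines can then be stripped from each one. The names NUMBER, SIX_NUMBERS and extract_six are made up for this illustration.

```python
import re

# One number, with "\n" allowed inside every character class (as in the answer).
NUMBER = r"[-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?"

# Six captured numbers, each followed by one space or newline, framed by SomeData.
SIX_NUMBERS = re.compile(
    r"SomeData\n" + (r"(" + NUMBER + r")[ \n]") * 6 + r"SomeData"
)


def extract_six(text):
    """Return the six numbers as strings with stray newlines removed, or None."""
    match = SIX_NUMBERS.search(text)
    if match is None:
        return None
    return [group.replace("\n", "") for group in match.groups()]


clean = "SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData"
broken = ("SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n"
          "0.00 1,358,323.83\nSomeData")

print(extract_six(clean))
print(extract_six(broken))
```

The engine has to backtrack to find a split that yields exactly six numbers, so this relies on the count being fixed; with a variable number of values per record the split could be ambiguous.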

Related

regex for number between numbers

I need a regex that takes a minimum and a maximum number to determine valid input, and I want the maximum and minimum to be dynamic.
I have been trying to get this done using this link:
https://stackoverflow.com/a/13473595/1866676
But I couldn't get it to work. Can someone please let me know how to do this?
Let's say I want to make an HTML5 input box, and I want it to accept only numbers from 100 to 1999.
What would a regex for this look like?
First off, while it is possible to do this, if there is a simpler way to restrict the number range, such as <input type="number" min="1" max="100">, that way is preferable.
Having said that, here's how the kind of regex you requested works:
ones: ^[0-9]$ // just set the digits -- matches 0 to 9
tens: ^[1-3]?[0-9]$ // set max tens and max ones -- matches 0 to 39
tens where max does not end in 9: ^[1-2]?[0-9]$|^[3][0-4]$ // 0 to 34
only tens: ^[1][5-9]$|^[2-3][0-9]$|^[4][0-5]$ // 15 to 45
Here, let's pick an arbitrary range, 1234 to 2345:
^[1][2][3][4-9]$|
^[1][2][4-9][0-9]$|
^[1][3-9][0-9][0-9]$|
^[2][0-2][0-9][0-9]$|
^[2][3][0-3][0-9]$|
^[2][3][4][0-5]$
https://regex101.com/r/pP8rQ7/4
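A hand-built range like this is easy to get subtly wrong, so it can be worth checking it by brute force against plain numeric comparison. Here is a small Python sketch that does so for the 1234 to 2345 pattern above; the names RANGE_1234_2345 and matched_numbers are only illustrative.

```python
import re

# The six alternatives from the answer, joined into a single pattern.
RANGE_1234_2345 = re.compile(
    r"^[1][2][3][4-9]$|"
    r"^[1][2][4-9][0-9]$|"
    r"^[1][3-9][0-9][0-9]$|"
    r"^[2][0-2][0-9][0-9]$|"
    r"^[2][3][0-3][0-9]$|"
    r"^[2][3][4][0-5]$"
)


def matched_numbers(pattern, limit=100000):
    """Return every integer below limit whose decimal form the pattern accepts."""
    return [n for n in range(limit) if pattern.match(str(n))]


matched = matched_numbers(RANGE_1234_2345)
print(matched[0], matched[-1], len(matched))  # 1234 2345 1112
```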
Basically, each alternative in the middle of the series has to be a full range whose trailing digits run all the way from 0 to 9 (the ones place is the exception). When that is not possible, you build up towards the middle: add one alternative for each place whose value cannot start at 0, and once you reach a value that cannot end in 9, stop early and handle it in the next alternative.
Notice the pattern as each place solidifies. Also keep in mind that when going from lower to higher places, the optional operator ? should be used.
It's a bit complex, but it's nowhere near impossible to design a custom range with a bit of thought.
If you are more specific, we can craft an exact example, but this is generally how it is done: beginning-range|middle-range|end-range
You should only need beginning or end ranges in certain cases, such as when the min does not end in 0 or the max does not end in 9. The ? means that the element that comes before it is optional (so, for example, in the tens case above it lets us match both one-digit and two-digit numbers).
So for 100 - 1999 it's actually quite simple, because you have lots of 9's and 0's:
/^[1-9][0-9][0-9]$|^[1][0-9][0-9][0-9]$/
https://regex101.com/r/pP8rQ7/1
Note: single values don't need a character class [n]; I just added the brackets for readability.
Edit: There used to be a regex range generator at: http://gamon.webfactional.com/regexnumericrangegenerator/. It appears to be offline now.
Essentially, you can't.
For every numeric range there exists a regex that matches the numbers in that range, so it is possible to write code that generates such a regex. But such a regex is not a simple reformatting of the range ends.
However, such code would take colossal effort and complexity to write compared to code that simply checks the number using numeric methods.
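For comparison, the numeric check is only a few lines. A minimal Python sketch, assuming the input arrives as a string and only plain ASCII digits should be accepted (the function name in_range is made up here):

```python
def in_range(text, minimum, maximum):
    """Return True if text is a plain decimal integer within [minimum, maximum]."""
    # isascii() rules out characters such as superscript digits, which
    # isdigit() alone would accept but int() cannot parse.
    if not (text.isascii() and text.isdigit()):
        return False
    return minimum <= int(text) <= maximum


print(in_range("100", 100, 1999))   # True
print(in_range("2000", 100, 1999))  # False
```

Note that this accepts leading zeros ("0100" counts as 100); reject those separately if that matters. The bounds are ordinary arguments, so making the range dynamic costs nothing.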
With HTML5, simply use a number input with min and max:
<form>
Quantity (between 100 and 1999):
<input type="number" name="quantity" min="100" max="1999">
</form>
With regex:
^([1-9])(\d)(\d)$|^(1)(\d)(\d)(\d)$
So if you need to create the regex dynamically, it's possible, but a bit tricky and complex.

regex for a real number in flex ignoring leading zeros

I have the following sets:
NUMBER [0-9]+
DECIMAL ("."{NUMBER})|({NUMBER}("."{NUMBER}?)?)
REAL {DECIMAL}([eE][+-]?{NUMBER})?
and I want my lexer to accept real numbers like:
0.002 or 0.004e-10 or .01
The problem is that I want it to ignore the leading zeros but keep the rest of the number. For example:
when I give 000.0002 I want to keep 0.0002, and when I give 0.2e-0100 I want to keep 0.2e-100.
So I was thinking of something like the atof function, but I do not know exactly how to do it.
Any thoughts?
Thanks in advance
lex will return the complete token that your pattern matches as one string. You cannot change that. At the expense of considerable complexity you could use start conditions to match a leading zero (which may be the only digit), and collect tokens for the pieces, e.g.,
0.2e-0100
as
0.2e-
0
100
and glue the first/last tokens together, but you would find it much simpler to write your own string function that filters out the unwanted leading zeroes.
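To illustrate what such a string function might do, here is a sketch of the logic in Python (in a real lexer it would be written in C and called from the rule's action on the matched text). It assumes the token has already been accepted by the REAL pattern from the question; the name strip_leading_zeros and the choice to leave a missing integer part alone (.01 stays .01) are assumptions of this sketch.

```python
import re

# Integer digits, optional fraction, optional exponent (marker and digits split).
REAL = re.compile(r"(\d*)(\.\d*)?(?:([eE][+-]?)(\d+))?")


def strip_leading_zeros(token):
    """Drop leading zeros from the integer part and the exponent of a real literal."""
    match = REAL.fullmatch(token)
    if match is None:
        raise ValueError(f"not a real literal: {token!r}")
    int_part, frac_part, exp_mark, exp_digits = match.groups()
    if not any(ch.isdigit() for ch in int_part + (frac_part or "")):
        raise ValueError(f"not a real literal: {token!r}")
    if int_part:
        # Keep a single zero if the integer part was all zeros.
        int_part = int_part.lstrip("0") or "0"
    result = int_part + (frac_part or "")
    if exp_mark:
        result += exp_mark + (exp_digits.lstrip("0") or "0")
    return result


print(strip_leading_zeros("000.0002"))   # 0.0002
print(strip_leading_zeros("0.2e-0100"))  # 0.2e-100
```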

RegEx to find numbers over certain value with commas, and another text value, on same line

I'm new to Regular Expressions, and I have been trying to figure out how to code this: I need to find numbers greater than 25000 where the same line also has the number " 19" somewhere on that line (that's a space then 19). The problem is that the numbers have commas in them. I tried a few options:
This finds lines with any numbers over 25000:
^.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines with both " 19" and 26 (but without the comma after the 26):
^.*( 19.*26).*$
Any help is appreciated!
Numbers over 25000 can be represented as follows:
\d{6,}|2[5-9]\d{3}|[3-9]\d{4}
That is, in English:
numbers of 6 digits or more
numbers of 5 digits starting with 2 followed by a digit equal to or greater than 5
numbers of 5 digits starting with a digit greater than 2
So the complete regex would look like this:
.*(\d{6,}|2[5-9]\d{3,}|[3-9]\d{4,}).* 19.*
That is, such a number somewhere in the line, followed by " 19" later in the line.
Here is a test run on regex101 for you to test with your data.
I also second the comment that this isn't a job for regular expressions, which as you can see work on characters rather than numbers.
I would try something like this:
^(([0-9,]*([3-9][0-9]|2[5-9]),?[0-9]{3})\s?)$
That should handle the numeric part. You didn't really explain if the " 19" would come before or after that, and what would delimit that from the numeric part, but just insert (\s19) wherever that bit needs to go.
Thanks everyone. The following RegEx worked for me:
^.* 19.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines that have " 19" first in the line and then a number greater than 25K later in the line, when the numbers have commas in them. I couldn't use the shortcut "number ranges" that were suggested, because there are other numbers on the lines, without commas, that are over 25K and that I don't want to flag. Maybe there's an easier way than my brute-force method, but if not, at least this works. Thanks again!
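If a script is an option instead of a single regex, the comparison itself can be done numerically. The following Python sketch is only an illustration of that idea: it pulls out the comma-formatted numbers (ignoring numbers without commas, as required above), converts them to integers and compares them with the threshold. The names COMMA_NUMBER and should_flag are made up, and unlike the regex above it does not require " 19" to come before the number.

```python
import re

# A number written with thousands separators, e.g. 26,500 or 1,358,323.
COMMA_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")


def should_flag(line, threshold=25000, marker=" 19"):
    """True if the line contains marker and a comma-formatted number above threshold."""
    if marker not in line:
        return False
    values = (
        int(match.group().replace(",", ""))
        for match in COMMA_NUMBER.finditer(line)
    )
    return any(value > threshold for value in values)


print(should_flag("acct 19 total 26,500"))            # True
print(should_flag("acct 19 total 24,999 ref 30000"))  # False
```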

Integer range and multiple of

I have a number of fields I want to validate on text entry with a regex that both matches a range (0..120) and requires a multiple of 5.
For example, 0, 5, 25, 120 are valid. 1, 16, 123, 130 are not valid.
I think I have the regex for multiple of 5:
^\d*\d?((5)|(0))\.?((0)|(00))?$
and the regex for the range:
120|1[01][0-9]|[2-9][0-9]
However, I don't know how to combine these. Any help much appreciated!
You can't do that with a simple regex. At least not the range-part (especially if the range should be generic/changeable).
And even if you manage to write the regex, it will be very complex and unreadable.
Write the validation on your own, using a parseStringToInt() function of your language and simple < and > checks.
Update: added another regex (see below) to be used when the range of values is not 0..120 (it can even be dynamic).
The second regex in the question does not match numbers smaller than 20. You can change it to also match smaller numbers and to require the last digit to be 0 or 5, so that the number is a multiple of 5:
\b(120|(1[01]|[0-9])?[05])\b
How it works (starting from inside):
(1[01]|[0-9])? matches 10, 11 or any one-digit number (0 to 9); these are the hundreds and tens in the final number; the question mark (?) after the sub-expression makes it match 0 or 1 times; this way the regex can also match numbers having only one digit (0..9);
[05] that follows matches 0 or 5 on the last digit (the units); only the numbers that end in 0 or 5 are multiple of 5;
everything is enclosed in parentheses because | has lower precedence than \b; without them each \b would apply to only one of the alternatives;
the outer \b match word boundaries; they prevent the regex from matching only 1..3 digits of a longer number, or numbers that are embedded in strings; for example, they prevent it from matching 15 in 150 or 120 in abc120.
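The explanation above can be checked mechanically. This short Python sketch compares the regex with plain arithmetic for every integer below 1000 and also tries the two embedded cases mentioned above; the names MULTIPLE_OF_5_UP_TO_120 and accepted are only illustrative.

```python
import re

MULTIPLE_OF_5_UP_TO_120 = re.compile(r"\b(120|(1[01]|[0-9])?[05])\b")


def accepted(limit=1000):
    """Integers below limit whose whole decimal form the regex accepts."""
    return [
        n for n in range(limit)
        if MULTIPLE_OF_5_UP_TO_120.fullmatch(str(n))
    ]


# Same set as "0 <= n <= 120 and n % 5 == 0".
print(accepted() == list(range(0, 121, 5)))  # True

# The word boundaries reject numbers embedded in longer tokens.
print(MULTIPLE_OF_5_UP_TO_120.search("150"))     # None
print(MULTIPLE_OF_5_UP_TO_120.search("abc120"))  # None
```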
Using dynamic range of values
The regex above is not very complex, and it can be used to match numbers between 0 and 120 that are multiples of 5. When the range of values is different it cannot be used any more. It can be modified to match, let's say, numbers between 20 and 120 (as the OP asked in a comment below), but it will become harder to read.
Moreover, if the range of allowed values is dynamic, then a regex cannot be used at all to match the values inside the range. Divisibility by 5, however, can still be checked with a regex :-)
For a dynamic range of values that are multiples of 5 you can use this expression:
\b([1-9][0-9]*)?[05]\b
Parse the matched string as integer (the language you use probably provides such a function or a library that contains it) then use the comparison operators (<, >) of the host language to check if the matched value is inside the desired range.
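In Python, that combination might look like the sketch below: the regex only checks the shape (last digit 0 or 5, no leading zeros) and the range check is ordinary integer comparison. The function name is_valid and its parameters are assumptions made for this example.

```python
import re

ENDS_IN_0_OR_5 = re.compile(r"\b([1-9][0-9]*)?[05]\b")


def is_valid(text, minimum, maximum):
    """True if text is a multiple of 5 (by its last digit) within [minimum, maximum]."""
    if ENDS_IN_0_OR_5.fullmatch(text) is None:
        return False
    # The range can now be checked numerically, with any bounds.
    return minimum <= int(text) <= maximum


print(is_valid("25", 20, 120))   # True
print(is_valid("15", 20, 120))   # False
print(is_valid("123", 20, 120))  # False
```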
At the risk of being painfully obvious
120|1[01][05]|[2-9][05]
Also, why the 2?

JSP input validate amount - possibly with REGEX

This is probably a simple question that has been solved many times. I am new to front end dev, so struggling with the validation part. I have a currency input that I used the following statement in JavaScript to only allow numbers. Can I just edit this or add a line to also only allow two decimals as you type?
$("input#amountToSave").on("blur keyup", function() {
this.value=this.value.replace(/[^0-9.]+/,'');
});
You can try something like this
^\$?([1-9]{1}[0-9]{0,2}(\,[0-9]{3})*(\.[0-9]{0,2})?|[1-9]{1}[0-9]{0,}(\.[0-9]{0,2})?|0(\.[0-9]{0,2})?|(\.[0-9]{1,2})?)$
Many currency expressions allow leading zeros, so $01.40 passes through them. This expression rejects them, except for a 0 in the ones column. It works with or without commas and/or a dollar sign. Decimals are not mandatory, unless there is no zero in the ones column and a decimal point is placed. It allows $0.00 and .0
E.g.,
$1,234.50 | $0.70 | .7
Okay - I think I see the problem. I need a regex that is inverted (it can be a simple one for the two decimals; for now I don't need anything too complex, and no currency symbol is necessary). Reading up on it, I now see that my code tests whether a character is not 0-9 or a dot and, if so, replaces it with an empty string. So I need to add to that regex that a 3rd decimal digit should also be removed.
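One way to think about the "inverted" rule is as a small clean-up function rather than a single regex: strip everything that is not a digit or a dot, keep only the first dot, and cut the fraction to two digits. The sketch below shows that logic in Python purely as an illustration; it would have to be ported to the JavaScript handler above, and the name sanitize_amount is made up.

```python
import re


def sanitize_amount(raw):
    """Keep digits and the first dot only, with at most two decimal digits."""
    cleaned = re.sub(r"[^0-9.]", "", raw)
    whole, dot, fraction = cleaned.partition(".")
    # Drop any further dots, then cut the fraction to two digits.
    fraction = fraction.replace(".", "")[:2]
    return whole + dot + fraction


print(sanitize_amount("12.345"))    # 12.34
print(sanitize_amount("1a2.3.45"))  # 12.34
print(sanitize_amount("12."))       # 12.
```

Keeping a trailing dot ("12.") matters for as-you-type validation, because otherwise the user could never type the decimal point.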