Regex for a real number in flex, ignoring leading zeros

I have the following sets:
NUMBER [0-9]+
DECIMAL ("."{NUMBER})|({NUMBER}("."{NUMBER}?)?)
REAL {DECIMAL}([eE][+-]?{NUMBER})?
and I want my lexer to accept real numbers like:
0.002 or 0.004e-10 or .01
The problem is that I want it to ignore the leading zeros but keep the rest of the number. For example:
when I give 000.0002 I want to keep 0.0002, and when I give 0.2e-0100 I want to keep 0.2e-100.
So I was thinking something like the atof function but I do not know how to do it exactly.
Any thoughts?
Thanks in advance

lex will return the complete token that your pattern matches as one string. You cannot change that. At the expense of considerable complexity you could use start conditions to match a leading zero (which may be the only digit), and collect tokens for the pieces, e.g.,
0.2e-0100
as
0.2e-
0
100
and glue the first and last tokens together, but you would find it much simpler to write your own string function that filters out the unwanted leading zeros.
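A minimal sketch of such a string function follows. Python is used here only for brevity; in a flex action the same logic would operate on yytext. The function name strip_leading_zeros and the named groups are made up for illustration, and the pattern is an assumed restatement of the REAL definition from the question.

```python
import re

# Assumed restatement of the REAL pattern from the question:
# digits with an optional fraction, or a bare ".digits", then an optional exponent.
REAL = re.compile(
    r"(?:(?P<int>\d+)(?P<frac>\.\d*)?|(?P<frac_only>\.\d+))"
    r"(?:(?P<mark>[eE][+-]?)(?P<exp>\d+))?"
)

def strip_leading_zeros(token: str) -> str:
    """Drop leading zeros from the integer part and from the exponent."""
    m = REAL.fullmatch(token)
    if m is None:
        raise ValueError(f"not a real number: {token!r}")
    # Keep a single "0" when the integer part consists only of zeros.
    int_part = (m["int"].lstrip("0") or "0") if m["int"] else ""
    frac = m["frac"] or m["frac_only"] or ""
    result = int_part + frac
    if m["mark"]:
        result += m["mark"] + (m["exp"].lstrip("0") or "0")
    return result
```

With this sketch, strip_leading_zeros("000.0002") gives "0.0002" and strip_leading_zeros("0.2e-0100") gives "0.2e-100"; an input such as ".01" is returned unchanged.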

Replace trailing ".1" to ".2"

I am assuming you would need a regex for this. The best I could come up with is
=REGEXREPLACE(C2, "\.(?=[^.]*$)", ".2")
but it only detects the final period, and Google Sheets returns #REF!
Other approaches, such as directly changing the cells C2:C5, are also welcome.
You can just check if the trailing 2 characters from the right are equal to .1
get two chars from the right
test equality
RIGHT(A1,2)=".1"
Then, to convert matching values, you can slice off the last two chars (length-2) and append the .2
LEFT(A1,LEN(A1)-2)&".2"
All together
=IF(RIGHT(A1,2)=".1",LEFT(A1,LEN(A1)-2)&".2",A1)
If you actually want to increment arbitrary values (and not just .1), you can skip the equality check and add 0.1 intermediately
=LEFT(C3,LEN(C3)-2)&((RIGHT(C3,2)+0.1)&"")
If you have values with more than a single digit, find them in an intermediate column so you can use their length to
add the right power of ten (.5+0.1, .993+0.001, etc.)
exclude the right number of chars when appending
If you want a full version parser, consider VBA or passing the column to a more practical language
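If the column is passed to another language, the increment could look like the Python sketch below. It assumes the last dot-separated field is an integer version component (so .9 becomes .10, and leading zeros in that field are not preserved), which is one possible reading of the question rather than the only one. The function name bump_last_component is hypothetical.

```python
import re

def bump_last_component(value: str) -> str:
    """Increment the final dot-separated numeric field, e.g. 'a.1' -> 'a.2'."""
    # Greedy (.*\.) captures everything up to and including the last dot.
    m = re.fullmatch(r"(.*\.)(\d+)", value)
    if m is None:
        return value  # no trailing ".<digits>", so leave the value unchanged
    head, last = m.groups()
    return head + str(int(last) + 1)
```

For example, bump_last_component("1.2.1") gives "1.2.2", and a value without a trailing numeric field is returned as-is.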

Extract numbers out of text with inconsistent linebreaks

I have text with 6 numbers typically stored in one line
SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData
SomeData\n0.00 0.00 0.00 485,007.24 0.00 485,007.24\nSomeData
This regex worked fine on it:
\n[0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]*\n
I noticed that every once in a while I get this:
SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n0.00 1,358,323.83\nSomeData
Note how the linebreaks are randomly inserted after a sign or between numbers as if the system stored the values without filtering linebreaks.
I am struggling to get this extracted. I tried various expressions but my more successful one was [0-9,.-][\n]{0,1}[0-9,.-][ ]{0,1} to match an individual number.
What expression can I use to match both variations of the number formats, preferably already stripping out the inconsistent line breaks?
Update: Going with
[-\n]{0,2}[0-9,]+[\n.0-9]{3,4}[\n ]{0,1}
Please let me know if there's a better way.
One way would be to write an exact representation of what constitutes a number; in your case [-+]?[0-9]+[0-9,]*(?:\.[0-9]+)? would do the trick. This helps because the search then knows where a number starts and where it ends (from rules like: a sign is always at the start, a dot cannot appear more than once, etc.). Next, you want to match groups of six numbers delimited by either a newline or a space, so wrap the number in a capture group and repeat it exactly 6 times: (...[ \n]*){6,6}. This helps because the regex engine can then work out by backtracking what to consider a number, since it knows how many it has to match. You also want to allow newlines in pretty much any position, so add the newline to each character class. You might also want to anchor the numbers on both sides, but this is not strictly necessary, because the regex engine will already try to identify valid tuples of 6 numbers. The end result is:
SomeData\n([-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?[ \n]){6,6}SomeData
This will find tuples of 6 numbers no matter where the line breaks are. Here is an example: https://regex101.com/r/jD5nT8/1
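A repeated group such as (...){6,6} only retains its last repetition, so to get all six values out in Python one option is to spell the number pattern out six times with one capture group each. The sketch below does that and then strips the stray newlines; the names NUMBER, ROW and extract_row are made up for illustration, and the SomeData anchors are taken literally from the samples.

```python
import re

# One number, with newlines tolerated inside it (same idea as the pattern above).
NUMBER = r"([-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?)"
# Six numbers, each followed by a space or newline, between the two anchors.
ROW = re.compile(r"SomeData\n" + r"[ \n]".join([NUMBER] * 6) + r"[ \n]SomeData")

def extract_row(text: str) -> list[str]:
    """Return the six numbers as strings with embedded newlines removed."""
    m = ROW.search(text)
    if m is None:
        return []
    return [g.replace("\n", "") for g in m.groups()]

broken = ("SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n"
          "0.00 1,358,323.83\nSomeData")
print(extract_row(broken))
# ['0.00', '0.00', '10,921,594.89', '-9,563,271.06', '0.00', '1,358,323.83']
```

The same function also handles the well-formed single-line rows, since they are just the case where no extra newlines occur.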

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second period were an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters:
[^a-zA-Z0-9']+
In this case, that would give you a few strings like this:
HAT18178
890909
098070313
1
From there you can simply discard the last one if it is never needed, and continue processing the remaining fields.
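A minimal Python sketch of this idea, assuming the last field is always the discardable one: split_fields shows the split itself, and drop_last_field removes the final separator together with the field after it, so the original separators in the rest of the name are kept. Both function names are hypothetical.

```python
import re

def split_fields(name: str) -> list[str]:
    """Split a name on runs of non-alphanumeric characters."""
    return re.split(r"[^a-zA-Z0-9']+", name)

def drop_last_field(name: str) -> str:
    """Remove the final separator run and the alphanumeric field after it."""
    return re.sub(r"[^a-zA-Z0-9]+[a-zA-Z0-9]+$", "", name)

print(split_fields("HAT18178_890909.098070313.1"))
# ['HAT18178', '890909', '098070313', '1']
print(drop_last_field("HAT18178_890909.098070313.1"))
# HAT18178_890909.098070313
```

Because the pattern does not care which separator is used or how long the last field is, a name ending in _12 is handled the same way as one ending in .1.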

JSP input validate amount - possibly with REGEX

This is probably a simple question that has been solved many times. I am new to front-end dev, so I am struggling with the validation part. I have a currency input, and I use the following JavaScript statement to allow only numbers. Can I just edit this, or add a line, to also allow only two decimals as you type?
$("input#amountToSave").on("blur keyup", function() {
this.value=this.value.replace(/[^0-9.]+/,'');
});
You can try something like this
^\$?([1-9]{1}[0-9]{0,2}(\,[0-9]{3})*(\.[0-9]{0,2})?|[1-9]{1}[0-9]{0,}(\.[0-9]{0,2})?|0(\.[0-9]{0,2})?|(\.[0-9]{1,2})?)$
Many currency expressions allow leading zeros, so $01.40 passes through them. This expression rejects them, except for a 0 in the ones column. It works with or without commas and/or a dollar sign. Decimals are not mandatory, unless there is no zero in the ones column and a decimal point is present. It allows $0.00 and .0
E.g.,
$1,234.50 | $0.70 | .7
Okay, I think I see the problem. I need a regex that is inverted (it can be a simple one for the two decimals; for now I don't need anything complex, and no currency symbol is necessary). From reading up, I now see that my code tests whether a character is not 0-9 or a dot and, if so, replaces it with an empty string. So I need to extend that regex so that a third decimal digit is also replaced with an empty string.
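The cleaning steps described here can be sketched as follows. This is written in Python only to show the logic, on the assumption that it would be ported to the JavaScript handler with the equivalent string methods; the function name clean_amount is hypothetical.

```python
import re

def clean_amount(raw: str) -> str:
    """Keep digits, the first decimal point, and at most two decimals."""
    # Step 1: drop everything that is not a digit or a dot.
    s = re.sub(r"[^0-9.]", "", raw)
    # Step 2: split at the first dot; any further dots are discarded.
    whole, dot, rest = s.partition(".")
    # Step 3: truncate the decimals to two digits.
    decimals = rest.replace(".", "")[:2]
    return whole + dot + decimals
```

With this sketch, clean_amount("12.345") gives "12.34", and a half-typed value such as "12." is left alone so the user can keep typing.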

Can someone provide a regex for validating and parsing a csv of integers and reals

I am new to regex and struggling to create an expression to parse a csv containing 1 to n values. The values can be integers or real numbers. The sample inputs would be:
1
1,2,3,4,5
1,2.456, 3.08, 0.5, 7
This would be used in C#.
Thanks,
Jerry
Use a CSV parser instead of RegEx.
There are several options: see this SO question and its answers, and this one, for the different options (built into the BCL and in third-party libraries).
The BCL provides the TextFieldParser (within the VisualBasic namespace, but don't let that put you off it).
A third party library that is liked by many is filehelpers.
Using REGEX for CSV parsing has been a 10 year jihad for me. I have found it remarkably frustrating, due to the boundary cases:
Numbers come in a variety of forms (here in the US, Canada):
1
1.
1.0
1000
1000.
1,000
1e3
1.0e3
1.0e+3
1.0e+003
-1
-1.0 (etc)
But of course, Europe has traditionally been different with regard to commas and decimal points:
1
1,0
1000
1.000e3
1e3
1,0e3
1,0e+3
1,0e+003
Which just ruins everything. So we ignore the German, French, and other Continental conventions, because it is simply impossible to work out whether a comma is separating values or is part of a value. (The Continent likes TAB instead of COMMA.)
I'll assume that you're "just" looking for numerical values separated from each other by commas and possible space-padding. The expression:
\s*(\-?\d+(?:\.\d*)?(?:[eE][\-+]?\d*)?)\s*
is a pretty fair parser of A NUMBER. It catches just about every reasonable case. It doesn't deal with embedded commas, though! It also trims off spaces on either side of the number.
From there, you can either build an iterative CSV string decomposer (walking each field, absorbing commas, assigning to an array, say), or use the scanf type function to do the same thing. I do prefer the iterative decomposition method - as it also allows you to parse out strings, hexadecimal, and virtually any other pattern you find in the data.
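A minimal sketch of the iterative approach, in Python rather than C#, walking the fields one by one and validating each against the number pattern. It assumes commas only ever separate values, and it tightens the exponent to require at least one digit (\d+ instead of \d*), which is a deliberate change from the expression above. The names NUMBER and parse_numbers are made up for illustration.

```python
import re

# The number pattern from above, with the exponent requiring at least one digit.
NUMBER = re.compile(r"\s*(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)\s*")

def parse_numbers(line: str) -> list[float]:
    """Split a line on commas and convert each field, rejecting bad fields."""
    values = []
    for field in line.split(","):
        m = NUMBER.fullmatch(field)
        if m is None:
            raise ValueError(f"not a number: {field!r}")
        values.append(float(m.group(1)))
    return values

print(parse_numbers("1,2.456, 3.08, 0.5, 7"))
# [1.0, 2.456, 3.08, 0.5, 7.0]
```

Validating field by field means an empty or malformed field raises an error instead of being silently skipped.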
The regex you want is
@"([+-]?\d+(?:\.\d+)?)(?:$|,\s*)"
...from which you'll want capture group 1. However, don't use regex for something like this. String manipulation is much better when the input is in a very static, predictable format:
// Split on commas and spaces, dropping empty entries, then parse each piece.
string[] nums = strInput.Split(", ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
List<float> results = (from n in nums
                       select float.Parse(n)).ToList();
If you do use regex, make sure you do a global capture.
I think you would have to loop it to check for an unknown number of ints... or else something like this:
/ *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) */
and you could keep that going ",?([0-9]*)" as far as you wanted to, to account for a lot of numbers. The result would be an array of numbers....
http://jsfiddle.net/8URvL/1/