Extract Float from Specific String Using Regular Expression - regex

What regular expression do I use to extract, for example, 1.09487 from the following text contained in a .txt file? Also, how would I modify the regular expression to account for the case where the float is negative (for example, -1.948)?
I tried several suggestions on Google as well as a regular expression generator, but none seem to work. It seems I want to use an anchor (such as ^) to start searching for digits at the word "serial" and then stop at "(", but this doesn't seem to work.
Output in .txt file:
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 1.09487 (totally uncorrelated = 0.0).
Thanks for any help.

This should be sufficient:
(?<=Serial correlation coefficient is )[-\d.]+
Unless you're expecting garbage, this will work fine.

Try this:
(-?\d+\.\d+)(?=\s\(totally)
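Both patterns above can be tried out with a short script. The following is a minimal Python sketch, not part of the original answers; the sample text is copied from the question, and the names `LOOKBEHIND`, `LOOKAHEAD` and `extract_coefficient` are made up for illustration.

```python
import re

SAMPLE = (
    "Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).\n"
    "Serial correlation coefficient is 1.09487 (totally uncorrelated = 0.0).\n"
)

# First answer: a fixed-width lookbehind anchors the match right after the label.
LOOKBEHIND = re.compile(r"(?<=Serial correlation coefficient is )[-\d.]+")

# Second answer: an optional minus sign, then a lookahead for " (totally".
LOOKAHEAD = re.compile(r"(-?\d+\.\d+)(?=\s\(totally)")


def extract_coefficient(text):
    """Return the serial correlation coefficient as a float, or None if absent."""
    match = LOOKBEHIND.search(text)
    return float(match.group(0)) if match else None


if __name__ == "__main__":
    print(extract_coefficient(SAMPLE))
    print(LOOKAHEAD.search(SAMPLE).group(1))
```

Because `[-\d.]+` already includes the minus sign, the lookbehind version handles a negative value such as -1.948 without any change.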

Related

regex for a real number in flex ignoring leading zeros

I have the following sets:
NUMBER [0-9]+
DECIMAL ("."{NUMBER})|({NUMBER}("."{NUMBER}?)?)
REAL {DECIMAL}([eE][+-]?{NUMBER})?
and I want my lexer to accept real numbers like:
0.002 or 0.004e-10 or .01
The problem is that I want it to ignore the leading zeros but keep the rest of the number, for example:
when I give 000.0002 I want to keep 0.0002 and when I give 0.2e-0100 I want to keep 0.2e-100
So I was thinking something like the atof function but I do not know how to do it exactly.
Any thoughts?
Thanks in advance
lex will return the complete token that your pattern matches as one string. You cannot change that. At the expense of considerable complexity you could use start conditions to match a leading zero (which may be the only digit), and collect tokens for the pieces, e.g.,
0.2e-0100
as
0.2e-
0
100
and glue the first/last tokens together but you would find it much simpler to develop your own string function which filters out the unwanted leading zeroes.
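The string function suggested above could look roughly like this. It is a Python sketch of the logic only (in a flex scanner the same steps would be written in C inside the rule's action), and the helper name `strip_leading_zeros` is hypothetical. It assumes the token has already been matched by the REAL pattern from the question.

```python
import re

# Integer part, optional fraction, optional exponent (marker+sign, then digits).
_REAL = re.compile(r"(\d*)(\.\d*)?(?:([eE][+-]?)(\d+))?")


def strip_leading_zeros(token):
    """Drop leading zeros from the integer part and from the exponent digits."""
    match = _REAL.fullmatch(token)
    if match is None:
        raise ValueError(f"not a real number: {token!r}")
    int_part = match.group(1)
    frac = match.group(2) or ""
    if not re.search(r"\d", int_part + frac):
        raise ValueError(f"no digits in mantissa: {token!r}")
    if int_part:
        # Keep a single zero when the integer part is all zeros.
        int_part = int_part.lstrip("0") or "0"
    result = int_part + frac
    if match.group(3):
        result += match.group(3) + (match.group(4).lstrip("0") or "0")
    return result


if __name__ == "__main__":
    for token in ("000.0002", "0.2e-0100", ".01"):
        print(token, "->", strip_leading_zeros(token))
```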

Generate a list of all possible combinations of a regular expression based string

This is not a straightforward "all possible combinations" question.
EDIT: The regex is just a fixed length string with different combinations of alpha and non alphanumeric for each index...
Given a regular expression of fixed length, what would be the fastest way of computing and storing all combinations in a database, speed of saving to the database included? In other words, how do I get from the regular expression to a database holding every combination?
What I did, successfully but ridiculously slow, was just create an array the length of the fixed length regular expression and each element contained every possible character at that position, I generated this with some script. And then just did a loopception on the array with an SQL Server connection open from start to finish inserting 10 possibilities at a time. It was extremely slow, we're talking a string of 7/8 characters with a maximum of 36 possibilities in any given location. It took a few days.
So, my question is given this problem what would be the best combination of technologies, languages and algorithm to accomplish this the quickest?
Number of possible strings with length 8 and composed of 36 possible characters:
36^8 = 2821109907456 = about 2.8 trillion
Generating that many strings in any way will take "considerable" time. Let's look at how long it will take to insert them into the DB. Assuming really good DB performance, we can take 20000 inserts/sec. In that case the total insertion time is expected to be:
2.8 * 10^12 / 20000 = 140 million seconds
140 * 10^6 / (60*60*24) = 1620 days
So, this answers your question I guess: 1620 DAYS!
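The estimate above can be reproduced in a few lines. This is a rough Python sketch, not the asker's setup: the 36-character alphabet (uppercase letters plus digits) is an assumption, and `estimate_days` and `batches` are illustrative names. The generator shows one way to walk the per-position character lists lazily, so that only one batch is held in memory at a time.

```python
from itertools import islice, product
import string

ALPHABET = string.ascii_uppercase + string.digits  # 36 characters (assumed)
INSERTS_PER_SECOND = 20_000                        # rate taken from the answer


def estimate_days(length, alphabet_size=len(ALPHABET), rate=INSERTS_PER_SECOND):
    """Return (number of strings, days needed to insert them at `rate`)."""
    total = alphabet_size ** length
    return total, total / rate / 86_400


def batches(position_alphabets, batch_size):
    """Lazily yield lists of up to batch_size strings, one character per position."""
    combos = ("".join(chars) for chars in product(*position_alphabets))
    while True:
        batch = list(islice(combos, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    total, days = estimate_days(8)
    print(f"{total} strings, roughly {days:.0f} days at {INSERTS_PER_SECOND} inserts/sec")
```

Without the rounding to 2.8 trillion the figure comes out a little above 1620 days, which does not change the conclusion. Each batch would normally go to the database's bulk-load path rather than to single-row inserts.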

Extract numbers out of text with inconsistent linebreaks

I have text with 6 numbers typically stored in one line
SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData
SomeData\n0.00 0.00 0.00 485,007.24 0.00 485,007.24\nSomeData
This regex worked fine on it:
\n[0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]*\n
I noticed that every once in a while I get this:
SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n0.00 1,358,323.83\nSomeData
Note how the linebreaks are randomly inserted after a sign or between numbers as if the system stored the values without filtering linebreaks.
I am struggling to get this extracted. I tried various expressions but my more successful one was [0-9,.-][\n]{0,1}[0-9,.-][ ]{0,1} to match an individual number.
What expression can I use to match both variations of the number formats, preferably already stripping out the inconsistent line breaks?
Update: Going with
[-\n]{0,2}[0-9,]+[\n.0-9]{3,4}[\n ]{0,1}
Please let me know if there's a better way.
One way would be to write an exact representation of what constitutes a number; in your case [-+]?[0-9]+[0-9,]*(?:\.[0-9]+)? would do the trick. This helps because the search then knows where a number starts and where it ends (from rules like: a sign is always at the start, a dot cannot appear more than once, etc.). Next, you want to match groups of six numbers delimited by either a newline or a space, so wrap the pattern in a capture group and limit it to 6: (...[ \n]*){6,6}. This helps because the regex engine can then work out by backtracking what to consider a number, knowing how many it should match. Finally, you want to allow newlines in pretty much any position, so place the newline in each character class. You might also want to anchor the numbers on both sides, but this is not strictly necessary, because the regex engine will now try to identify valid tuples of 6 numbers. The end result is:
SomeData\n([-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?[ \n]){6,6}SomeData
This will find tuples of 6 numbers no matter where the line breaks are. Here is an example: https://regex101.com/r/jD5nT8/1
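A repeated capture group only keeps its last repetition, so pulling out all six values takes a second pass. The Python sketch below is one possible way to do that and is not from the answer: it wraps the answer's pattern in an outer group to grab the whole block, deletes the newlines, and then relies on the assumption that every value has exactly two decimal places and comma-grouped thousands (true of the samples in the question, but not guaranteed). The names `BLOCK`, `VALUE` and `extract_six` are illustrative.

```python
import re

# The answer's pattern, with an outer group added to capture all six numbers.
BLOCK = re.compile(
    r"SomeData\n((?:[-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?[ \n]){6})SomeData"
)

# Assumption: comma-grouped thousands and exactly two decimal places.
VALUE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}")


def extract_six(text):
    """Return the six values of the first matching block as floats."""
    match = BLOCK.search(text)
    if match is None:
        return []
    # Stray newlines carry no information once the two-decimal rule is applied.
    flattened = match.group(1).replace("\n", "")
    return [float(v.replace(",", "")) for v in VALUE.findall(flattened)]


if __name__ == "__main__":
    broken = ("SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n"
              "0.00 1,358,323.83\nSomeData")
    print(extract_six(broken))
```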

Can we validate Min and Max Value for a floating Number using RegExp?

Hi I know that we can validate Min and Max Length of a number using Regex.
But can we validate Min and Max Value for a floating point number using the same?
Min Value : 0.00
Max Value :100,000,000.00
Could anyone please just apply Min and Max Value to following Regex:
^(?=.*\d)(?!.*?\.[^.\n]*,)\d*(,\d*,?)*(\.\d*)?$
Above Regex matches a floating number with optional decimal point and commas.
Regex is for strings. You try to compare floats. It's just not the right tool. It's worse than eating your soup with a fork. It's like writing on paper with a knife or cutting your hair with a teaspoon.
Look here for a solution with positive integers without thousands separator :
Using regular expressions to compare numbers
I leave the task to you to extend that solution to using floats, thousands separator and negative numbers.
I guess this should help you. This regex will match 0.00 to 100,000,000.00 with up to 2 decimal places.
^(?:(?=[1])(10{0,8})|(?=[^0])(\d{1,8})|0)\.[0-9]{1,2}$
But keep in mind that it's always best to compare numbers numerically rather than with a regex.
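Comparing numerically, as both answers recommend, might look like the following. This is a Python sketch under stated assumptions rather than a drop-in for the asker's environment: the regex only checks the shape of the input (digits, optional comma grouping, at most two decimals), and the range check is done on a `Decimal`. The names `SHAPE` and `is_valid_amount` are illustrative.

```python
import re
from decimal import Decimal

# Shape only: either comma-grouped thousands or plain digits, then up to 2 decimals.
SHAPE = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?")

MIN_VALUE = Decimal("0.00")
MAX_VALUE = Decimal("100000000.00")


def is_valid_amount(text):
    """True if text is a well-formed amount within [MIN_VALUE, MAX_VALUE]."""
    if SHAPE.fullmatch(text) is None:
        return False
    # The range check is a numeric comparison, not a pattern match.
    return MIN_VALUE <= Decimal(text.replace(",", "")) <= MAX_VALUE


if __name__ == "__main__":
    for candidate in ("0.00", "100,000,000.00", "100,000,000.01", "1,00.00"):
        print(candidate, is_valid_amount(candidate))
```

Keeping the two checks separate means the limits can change without touching the pattern.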

Can someone provide a regex for validating and parsing a csv of integers and reals

I am new to regex and struggling to create an expression to parse a csv containing 1 to n values. The values can be integers or real numbers. The sample inputs would be:
1
1,2,3,4,5
1,2.456, 3.08, 0.5, 7
This would be used in c#.
Thanks,
Jerry
Use a CSV parser instead of RegEx.
There are several options; see this SO question and its answers, and this one, for the different options (built into the BCL and third-party libraries).
The BCL provides the TextFieldParser (within the VisualBasic namespace, but don't let that put you off it).
A third party library that is liked by many is filehelpers.
Using REGEX for CSV parsing has been a 10 year jihad for me. I have found it remarkably frustrating, due to the boundary cases:
Numbers come in a variety of forms (here in the US, Canada):
1
1.
1.0
1000
1000.
1,000
1e3
1.0e3
1.0e+3
1.0e+003
-1
-1.0 (etc)
But of course, Europe has traditionally been different with regard to commas and decimal points:
1
1,0
1000
1.000e3
1e3
1,0e3
1,0e+3
1,0e+003
Which just ruins everything. So, we ignore the German and French and Continental standard because the comma just is impossible to work out whether it is separating values, or part of values. (The Continent likes TAB instead of COMMA)
I'll assume that you're "just" looking for numerical values separated from each other by commas and possible space-padding. The expression:
\s*(\-?\d+(?:\.\d*)?(?:[eE][\-+]?\d*)?)\s*
is a pretty fair parser of A NUMBER. Catches just about every reasonable case. Doesn't deal with embedded commas though! It also trims off spaces on either side of the number.
From there, you can either build an iterative CSV string decomposer (walking each field, absorbing commas, assigning to an array, say), or use the scanf type function to do the same thing. I do prefer the iterative decomposition method - as it also allows you to parse out strings, hexadecimal, and virtually any other pattern you find in the data.
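The iterative decomposition described above can be sketched as follows. This is an illustrative Python version (the question targets C#), with one deliberate change from the expression in the answer: the exponent requires at least one digit (`\d+` instead of `\d*`), so that a field like "1e" is rejected instead of failing later in the float conversion. The names `NUMBER` and `parse_numbers` are made up for the example.

```python
import re

# The answer's number pattern, with the exponent digits made mandatory.
NUMBER = re.compile(r"\s*(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)\s*")


def parse_numbers(line):
    """Walk the comma-separated fields, validating and converting each one."""
    values = []
    for field in line.split(","):
        match = NUMBER.fullmatch(field)
        if match is None:
            raise ValueError(f"not a number: {field!r}")
        values.append(float(match.group(1)))
    return values


if __name__ == "__main__":
    print(parse_numbers("1,2.456, 3.08, 0.5, 7"))
```

Validating field by field also gives a natural place to report which value was malformed.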
The regex you want is
@"([+-]?\d+(?:\.\d+)?)(?:$|,\s*)"
...from which you'll want capture group 1. However, don't use regex for something like this. String manipulation is much better when the input is in a very static, predictable format:
// Split on commas and spaces, dropping the empty entries left between ", " pairs.
string[] nums = strInput.Split(", ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
List<float> results = (from n in nums
                       select float.Parse(n)).ToList();
If you do use regex, make sure you do a global capture.
I think you would have to loop it to check for an unknown number of ints... or else something like this:
/ *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) */
and you could keep that going ",?([0-9]*)" as far as you wanted to, to account for a lot of numbers. The result would be an array of numbers....
http://jsfiddle.net/8URvL/1/