Parsing a numerical string into a numerical vector with regex

I have a set of numerical strings (used in filenames) which I would like to parse into vectors.
Here is an example:
-0_01_-1_0_23_0_52_-0_25
Which should be parsed into:
-0.01 -1 0.23 0.52 -0.25
The rules are:
There are 5 numbers in the range [-1, 1].
Numbers are separated by '_'.
The decimal point is replaced by '_'.
Integer values {-1, 0, 1} have no decimal point.
How can I use regex (preferably MATLAB) to convert the string into a vector?
I tried some regular expressions, but got stuck on the integer rule.

Use this code:
a = '-0_01_-1_0_23_0_52_-0_25';
a = strrep(a, '0_', '0.');
res = regexp(a, '(-?[0-9]+(?:\.[0-9]+)?)','match');
res = cellfun(@str2num, res)
First, replace 0_ with 0. and then use the -?[0-9]+(?:\.[0-9]+)? regex to match the numbers only.
The regex matches an optional -, then 1+ digits, and then an optional substring consisting of . and 1+ digits.
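
For readers outside MATLAB, the same two-step idea can be sketched in Python. The function name parse_underscored is made up for illustration, and the sketch inherits the answer's assumption that every "0_" marks a decimal point, so a bare 0 followed directly by another number (for example "0_0_5") would be misread.

```python
import re

def parse_underscored(s):
    # Step 1: restore the decimal point after each leading zero ("0_" -> "0.")
    dotted = s.replace("0_", "0.")
    # Step 2: extract optionally signed numbers with an optional fractional part
    tokens = re.findall(r"-?[0-9]+(?:\.[0-9]+)?", dotted)
    return [float(t) for t in tokens]

print(parse_underscored("-0_01_-1_0_23_0_52_-0_25"))
```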

Related

Add leading zeros after "0x" to make all numbers in a list the same length (8 digits)

I have a long list of data pairs that look like this:
{0x1023350, 0x3014},
{0x1023954, 0x3007},
{0x1023960, 0x10F},
{0x102396C, 0x2FF},
{0x10219, 0x16},
The numbers here can be anywhere from 2 to 8 digits long, but my requirement is to pad them with leading zeros so that in the final output all the numbers are 8 digits long:
{0x01023350, 0x00003014},
{0x01023954, 0x00003007},
{0x01023960, 0x0000010F},
{0x0102396C, 0x000002FF},
{0x00010219, 0x00000016},
How can I do it using regular expressions? (I am using Notepad++, but I am open to another tool if I can't do it in Notepad++.)
I am not fluent enough in regex to have tried a solution yet.
Do it in two steps:
Step 1 - Add excess left padding
Search: 0x
Replace: 0x00000000
Step 2 - Match excess exactly and delete it
Search: 0x\d+(?=[\dA-F]{8}[,}])
Replace: 0x
The regex:
\d+ means one or more digits
(?=[\dA-F]{8}[,}]) means followed by 8 hex chars then a comma or }
Some python code to demonstrate:
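import re

# "text" rather than "str", so the built-in str type is not shadowed
text = """{0x1023350, 0x3014},
{0x1023954, 0x3007},
{0x1023960, 0x10F},
{0x102396C, 0x2FF},
{0x10219, 0x16},
"""
# Step 1: add eight extra zeros after every 0x
text = text.replace("0x", "0x00000000")
# Step 2: drop the excess zeros (raw string, so \d is not read as a string escape)
text = re.sub(r"0x\d+(?=[\dA-F]{8}[,}])", "0x", text)
print(text)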
Output:
{0x01023350, 0x00003014},
{0x01023954, 0x00003007},
{0x01023960, 0x0000010F},
{0x0102396C, 0x000002FF},
{0x00010219, 0x00000016},
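
If a scripting language is an option anyway, a single pass with a replacement callback avoids the pad-then-trim trick. This is only a rough Python sketch, not the Notepad++ approach above; pad_hex and its width parameter are names invented here, and unlike the regex above it also accepts lowercase hex digits.

```python
import re

def pad_hex(text, width=8):
    # Capture the hex digits after each "0x" and left-pad them with zeros.
    # Numbers already at least `width` digits long are left unchanged by zfill.
    return re.sub(
        r"0x([0-9A-Fa-f]+)",
        lambda m: "0x" + m.group(1).zfill(width),
        text,
    )

print(pad_hex("{0x10219, 0x16},"))
```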

RegExp. I need to advance my expression with leading zeros

I've got my RegExp: '^[0-9]{0,6}$|^[0-9]\d{0,6}[.,]\d{0,2}'.
I need to extend the expression above so that it also handles an input like '000', which should be formatted as '0.00'.
Here is the list of inputs and the outputs I expect:
Inputs:
[5555,
55.5,
55.55,
0.50,
555555.55,
000005,
005]
Outputs:
[5555,
55.5,
55.55,
0.50,
555555.55,
0.00005,
0.05]
When working with RegExps it's important to describe, in prose, what it is you want it to match, and then what you want to do with the match.
In this case your RegExp matches a string consisting entirely of 0-6 digits, or a string starting with 1-7 digits, a . or , and then 0-2 digits.
That is: either a string with digits and no ,/., or one with digits and ,/. and as many (up to 2) digits afterwards.
You then ask to convert 000 to 0.00. I'm guessing that you want to normalize numbers to no unnecessary leading zeros, and two decimal digits.
(Now with more examples, I'm guessing that a number with a leading zero and no decimal point should have decimal point added after the first zero).
I agree that using a proper number formatter is probably the way to go, but since we are talking RegExps, here's what I'd do if I had to use RegExps:
Use capture groups, so you can easily see which part matched.
Use a regexp which doesn't capture leading zeros.
Don't try to count in RegExps. Do that in code on the side (if necessary).
Something like:
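final _re = RegExp(r"^\d{0,6}$|^(\d{1,7}[,.]\d{0,2})");
// Returns null when the input does not match at all.
String? format(String number) {
  var match = _re.firstMatch(number);
  if (match == null) return null;
  // Group 1 is only set when the input contains a decimal separator.
  var decimals = match[1];
  if (decimals != null) return decimals;
  var noDecimals = match[0]!;
  if (!noDecimals.startsWith('0')) return noDecimals;
  // Leading zero without a separator: put the point after the first zero.
  return "0.${noDecimals.substring(1)}";
}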
This matches the same strings as your RegExp.
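
To check the logic against the list in the question, here is a line-by-line port to Python. It is only a sketch for experimenting: format_number is a name chosen here, and the regex is copied unchanged from the code above.

```python
import re

_RE = re.compile(r"^\d{0,6}$|^(\d{1,7}[,.]\d{0,2})")

def format_number(number):
    match = _RE.match(number)
    if match is None:
        return None
    # Group 1 is set only when a decimal separator was present.
    decimals = match.group(1)
    if decimals is not None:
        return decimals
    no_decimals = match.group(0)
    if not no_decimals.startswith("0"):
        return no_decimals
    # Leading zero and no separator: insert the point after the first zero.
    return "0." + no_decimals[1:]

for s in ["5555", "55.5", "55.55", "0.50", "555555.55", "000005", "005"]:
    print(s, "->", format_number(s))
```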

Valid regex for number(a,b) format

How can I express number(a,b) in regex? Example:
number(5,2) can be 123.45 but also 2.44
The best I got is: ([0-9]{1,5}.[0-9]{1,2}) but it isn't enough because it wrongly accepts 12345.22.
I thought about doing multiple OR (|) but that can be too long in case of a long format such as number(15,5)
You might use
(?<!\S)(?!(?:[0-9.]*[0-9]){6})[0-9]{1,5}(?:\.[0-9]{1,2})?(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert what is on the left is not a non whitespace char
(?! Negative lookahead, assert what is on the right is not
(?:[0-9.]*[0-9]){6} Match 6 digits, allowing dots in between
) Close lookahead
[0-9]{1,5} Match 1 - 5 times a digit 0-9
(?:\.[0-9]{1,2})? Optionally match a dot and 1 - 2 digits
(?!\S) Negative lookahead, assert what is on the right is not a non whitespace char
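
Since the question worries about long formats such as number(15,5), the pattern above can be generated from a and b instead of being written by hand. A minimal Python sketch follows; number_pattern is a made-up helper name, and it reproduces this pattern's reading of the format (at most a digits in total, at most b after the dot), which is an assumption about what number(a,b) should accept.

```python
import re

def number_pattern(a, b):
    # Doubled braces are literal braces in an f-string; {a + 1}, {a}, {b} are filled in.
    return re.compile(
        rf"(?<!\S)(?!(?:[0-9.]*[0-9]){{{a + 1}}})"
        rf"[0-9]{{1,{a}}}(?:\.[0-9]{{1,{b}}})?(?!\S)"
    )

p = number_pattern(5, 2)
for s in ["123.45", "2.44", "12345.22"]:
    print(s, bool(p.fullmatch(s)))
```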
I don't know Scala, but you would need to input those numbers when building your regular expression.
val a = 5
val b = 2
val regex = (raw"\((?=\d{1," + a + raw"}(?:\.0+)?|(?:(?=.{1," + (a + 1) + raw"}0*)(?:\d+\.\d{1," + b + raw"}))).+\)").r
For number(5,2), this checks that the value has either at most 5 digits in total, or at most 6 characters including the decimal point, with at most 2 digits after the decimal point. Because a and b are set in code, the same construction covers other formats.

Regex for string *11F23H3*: Start and end with *, 7 Uppercase literals or numbers in between

I need to check strings like *11F23H3* that start and end with a * and have 7 uppercase letters or digits in between. So far I have:
if (!barcode.match('[*A-Z0-9*]')) {
console.error(`ERROR: Barcode not valid`);
process.exitCode = 1;
}
But this does not reject strings like *11111111111*. What would the correct regex look like?
I need to check strings like *11F23H3* that start and end with a * and have 7 uppercase letters or digits in between
You can use this regex:
/\*[A-Z0-9]{7}\*/
* is a regex metacharacter that needs to be escaped outside a character class
[A-Z0-9]{7} will match exactly 7 characters, each an uppercase letter or a digit
Code:
var re = /\*[A-Z0-9]{7}\*/;
if (!re.test(barcode)) {
console.error(`ERROR: Barcode ${barcode} in row ${row} is not valid`);
process.exitCode = 1;
}
Note that if barcode is only going to have this string then you should also use anchors like this to avoid matching any other text on either side of *:
var re = /^\*[A-Z0-9]{7}\*$/;
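
The anchored pattern carries over unchanged to other regex engines. As a small sketch in Python (is_valid_barcode is a hypothetical helper name), re.fullmatch plays the role of the ^ and $ anchors:

```python
import re

BARCODE_RE = re.compile(r"\*[A-Z0-9]{7}\*")

def is_valid_barcode(barcode):
    # fullmatch requires the whole string to match, like ^...$ in the JavaScript version
    return BARCODE_RE.fullmatch(barcode) is not None

print(is_valid_barcode("*11F23H3*"))
print(is_valid_barcode("*11111111111*"))
```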

Regex for numbers on scientific notation?

I'm loading a .obj file that has lines like
vn 8.67548e-017 1 -1.55211e-016
for the vertex normals. How can I detect them and convert them to doubles?
A regex that would work pretty well would be:
-?[\d.]+(?:e-?\d+)?
Converting to a number can be done like this: String in scientific notation C++ to double conversion, I guess.
The regex is
-? # an optional -
[\d.]+ # a series of digits or dots (see *1)
(?: # start non capturing group
e # "e"
-? # an optional -
\d+ # digits
)? # end non-capturing group, make optional
*1) This is not 100% correct: technically there can be only one dot, and before it only one (or no) digit. But practically, this should not happen. So the regex is a good approximation and false positives should be very unlikely. Feel free to make the regex more specific.
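
To see what this approximation picks up on the sample line, here is a short sketch in Python (chosen purely for brevity; extract_numbers is an illustrative name). It relies on float() accepting exponents with leading zeros such as e-017. As footnote *1 warns, a token like a lone "." would also match, and float() would then raise.

```python
import re

NUMBER_RE = re.compile(r"-?[\d.]+(?:e-?\d+)?")

def extract_numbers(line):
    # The only group is non-capturing, so findall returns the full matches
    return [float(tok) for tok in NUMBER_RE.findall(line)]

print(extract_numbers("vn 8.67548e-017 1 -1.55211e-016"))
```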
You can identify the scientific values using: -?\d*\.?\d+e[+-]?\d+ regex.
I tried a number of the other solutions to no avail, so I came up with this.
^(-?\d+)\.?\d+(e-|e\+|e|\d+)\d+$
Anything that matches is considered to be valid Scientific Notation.
Please note: This accepts e+, e- and e; if you don't want to accept e, use this: ^(-?\d+)\.?\d+(e-|e\+|\d+)\d+$
I'm not sure if it works for C++, but in C# you can add (?i) between the ^ and (- in the regex to toggle inline case-insensitivity. Without it, exponents written like 1.05E+10 will not be recognised.
Edit: My previous regex was a little buggy, so I've replaced it with the one above.
The standard library function strtod handles the exponential component just fine (so does atof, but strtod allows you to differentiate between a failed parse and parsing the value zero).
If you can be sure that the format of the double is scientific, you can try something like the following:
string inp("8.67548e-017");
istringstream str(inp);
double v;
str >> scientific >> v;
cout << "v: " << v << endl;
If you want to detect whether there is a floating point number of that format, then the regexes above will do the trick.
EDIT: the scientific manipulator is actually not needed, when you stream in a double, it will automatically do the handling for you (whether it's fixed or scientific)
Well this is not exactly what you asked for since it isn't Perl (gak) and it is a regular definition not a regular expression, but it's what I use to recognize an extension of C floating point literals (the extension is permitting "_" in digit strings), I'm sure you can convert it to an unreadable regexp if you want:
/* floats: Follows ISO C89, except that we allow underscores */
let decimal_string = digit (underscore? digit) *
let hexadecimal_string = hexdigit (underscore? hexdigit) *
let decimal_fractional_constant =
decimal_string '.' decimal_string?
| '.' decimal_string
let hexadecimal_fractional_constant =
("0x" |"0X")
(hexadecimal_string '.' hexadecimal_string?
| '.' hexadecimal_string)
let decimal_exponent = ('E'|'e') ('+'|'-')? decimal_string
let binary_exponent = ('P'|'p') ('+'|'-')? decimal_string
let floating_suffix = 'L' | 'l' | 'F' | 'f' | 'D' | 'd'
let floating_literal =
(
decimal_fractional_constant decimal_exponent? |
hexadecimal_fractional_constant binary_exponent?
)
floating_suffix?
C format is designed for programming languages not data, so it may support things your input does not require.
For extracting numbers in scientific notation in C++ with std::regex I normally use
((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?((e|E)((\\+|-)?)[[:digit:]]+)?
which corresponds to
((\+|-)?\d+)(\.((\d+)?))?((e|E)((\+|-)?)\d+)?
This will match any number of the form +12.3456e-78 where
the sign can be either + or - and is optional
the decimal point as well as the digits after it are optional
the exponent is optional and can be written with a lower- or upper-case letter
A corresponding code for parsing might look like this:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::regex const scientific_regex {"((\\+|-)?[[:digit:]]+)(\\.(([[:digit:]]+)?))?((e|E)((\\+|-)?)[[:digit:]]+)?"};
    std::string const str {"8.67548e-017 1 -1.55211e-016"};
    for (auto it = std::sregex_iterator(str.begin(), str.end(), scientific_regex); it != std::sregex_iterator(); ++it) {
        std::string const match {it->str()};
        std::cout << match << std::endl;
    }
}
If you want to convert the found sub-strings to a double number std::stod should handle the conversion correctly as already pointed out by Ben Voigt.