Matlab Regexp for nested groups and captured tokens - regex

Is there a way to capture tokens inside a non-captured group in Matlab regular expressions? Here is the specific problem:
InputString = 'Identifiers: 10 12 1 3 8 6 4 2'
Expression = 'Identifiers:\s(?:(\d*)\t?)+'
regexp(InputString, Expression, 'tokens')
I need to find the numbers after 'Identifiers:'. The string InputString is part of a big character array with lines before and after this line, separated by \r\n characters. The character after the colon is a whitespace; the numbers are separated by tabs. The last number has no trailing tab. The number of numbers can vary, but there is always at least one, and they are all integers with one or more digits.
I had the following idea for my Expression: identify the line by Identifiers:\s, find numbers with n>=1 digits as a captured token with a possible trailing tab by (\d*)\t?, and repeat this one or more times with +. To repeat the digit part of the expression, I need to put it in a group. I don't want to capture the token of the outer group (?:(\d*)\t?), only the token of the inner group (\d*). That's why I used ?:. When I remove ?:, only one token containing 1012138642 is returned.
Isn't it possible to capture tokens inside a non-capturing group? Do you have any solution to return the numbers in a single statement?
In my current solution I find the line by
Expression = 'Identifiers:.+?\r\n'
Line = regexp(InputString, Expression, 'match')
and get the digits with
regexp(Line, '(\d+)\t+', 'tokens')
I spent so much time trying to find a single-statement solution that I now really need to know whether it's possible or not! I am not sure whether I am thinking about it wrong, my head is not working as intended, or it's just impossible.

MATLAB doesn't support nested tokens, even if you mark the outer group as non-capturing.
Starting in R2016b, there are some new text manipulation functions that make this easier:
str = "Identifiers: 10 12 1 3 8 6 4 2" + newline + "Blah";
str = str.extractBetween("Identifiers: ",newline).split
str =
8×1 string array
"10"
"12"
"1"
"3"
"8"
"6"
"4"
"2"
If your goal is one statement with regexp, using split might get you closer.
str = regexp(str,'(?<=Identifiers[^\n]*)\s*(?=[^\n]*)','split')
str =
1×10 string array
"Identifiers:" "10" "12" "1" "3" "8" "6" "4" "2" "Blah"
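For comparison, here is a minimal sketch of the same two-step idea in Python's re module. It is not MATLAB, and the two engines may treat repeated groups differently. The sample text and the helper name identifiers are assumptions made for illustration. In Python, a capturing group inside a repeated group keeps only its last iteration, so the sketch first isolates the line and then collects the numbers.

```python
import re

# Assumed sample: the line of interest sits between other lines, numbers are tab-separated.
text = "Header line\r\nIdentifiers: 10\t12\t1\t3\t8\t6\t4\t2\r\nFooter line\r\n"

# A capturing group inside a repeated group retains only its final iteration in Python.
m = re.search(r"Identifiers:\s(?:(\d+)\t?)+", text)
last_only = m.group(1)

def identifiers(text):
    """Hypothetical helper: isolate the Identifiers line, then pull out every number."""
    line = re.search(r"Identifiers:\s*([^\r\n]*)", text)
    return re.findall(r"\d+", line.group(1)) if line else []

print(last_only)
print(identifiers(text))
```

The second step is a plain findall over the isolated line, which mirrors the question's existing two-statement solution.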

Related

matlab: truncate large text and append '...'

I have a large array of text (stored as a cell array) that I want to truncate in MATLAB, say to 5 characters. Truncating with regexprep is quite efficient, but now I would love to append a '...' at the end of every truncated match (and only there).
(How) can this be achieved within MATLAB's regexprep?
>> text = {'123456780','1','12'}; %<- small representative sample
>> regexprep(text,'(^.{0,5})(.*)','$1') %capture first 5 characters or less in first group (and replace the text with first group captures)
ans =
1×3 cell array
{'12345'} {'1'} {'12'}
it should read:
ans =
1×3 cell array
{'12345...'} {'1'} {'12'}
You need to use
regexprep(text,'^(.{5}).+','$1...')
The main point is that you need to trigger the replacement only if a string is longer than five chars (otherwise, you do not even need to truncate the string).
Note that regexprep returns the input string as is if there was no regex match found, thus you do not need to worry about strings that are zero to five chars long.
Details:
^ - start of string
(.{5}) - Capturing group 1 ($1): any five chars
.+ - any one or more chars, as many as possible.
Note that the string 12345... is in fact 8 characters long. You don't want to make the mistake of truncating 1234567 to 12345..., as the truncated version is longer and therefore shouldn't be truncated in the first place.
A solution that takes this into account is:
regexprep(text,'^(.{5}).{3}.+','$1...')
which will only truncate if there are more than 8 characters and, if so, will display the first 5 with the trailing ellipsis.
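The same rule can be sketched in Python as a rough cross-check. The function name truncate and its parameters are assumptions for illustration, not part of the MATLAB answer. The pattern fires only when the string is longer than the kept prefix plus the ellipsis, so the result is never longer than the input.

```python
import re

def truncate(s, keep=5, ellipsis="..."):
    """Hypothetical helper: keep the first `keep` chars and append the ellipsis,
    but only when that actually shortens the string."""
    # ^(.{keep}) captures the prefix; .{len(ellipsis)}.+ requires enough extra characters.
    pattern = r"^(.{%d}).{%d}.+" % (keep, len(ellipsis))
    # DOTALL lets '.' also cover newlines, so multi-line strings are handled as one unit.
    return re.sub(pattern, lambda m: m.group(1) + ellipsis, s, flags=re.DOTALL)

for sample in ["123456780", "12345678", "1234567", "1"]:
    print(repr(sample), "->", repr(truncate(sample)))
```

Strings with no match are returned unchanged by re.sub, just as regexprep leaves unmatched input as is.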

Search regex expression whose line begins with

Take for example these two lines:
# ManyRandomCharacters 1 2 3
ManyRandomDifferentCharacters 4 5 6
I'd like a regex such that it finds the numbers at the end but only for the line that doesn't begin with #. I just want to match the numbers, not the whole line (i.e., I just want "4", "5" and "6", not "1", "2" or "3"). That's the tricky part, because everything I tried selects all the line up to the numbers. Is there a way to do this? Thanks!
^ matches the start of the string, and if the multiline flag is set (this depends on the implementation, usually m), it also matches the start of each line.
So, something like /^(?:[^#].*)(\d+ \d+ \d+)/gm would match any line whose first character isn't # and capture the three trailing numbers in group 1.
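If a regex engine is not strictly required for the whole job, a two-step variant avoids matching the start of the line at all. The sketch below is in Python and is an alternative to the single pattern above, not the answer's own method. The function name trailing_numbers is an assumption. It skips lines that begin with # and then reads the whitespace-separated integers at the end of each remaining line.

```python
import re

text = "# ManyRandomCharacters 1 2 3\nManyRandomDifferentCharacters 4 5 6"

def trailing_numbers(text):
    """Hypothetical helper: numbers at the end of every line that does not start with '#'."""
    numbers = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # commented line, ignore it entirely
        # One or more whitespace-separated integers anchored to the end of the line.
        m = re.search(r"(?:\s+\d+)+\s*$", line)
        if m:
            numbers.extend(m.group(0).split())
    return numbers

print(trailing_numbers(text))
```

Because the numbers are split after matching, this also works for multi-digit values and for any count of trailing numbers.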

concatenating regex pattern in C#

I have a C# project that requires me to capture a string value from a html stream.
The pattern I need to match is:
XXXX-abc
Where:
XXXX = a 4 character integer
followed by a -
abc = a 3 character alphanumeric.
I looked at txt2re.com and got
string re1="(\\d)"; // Any Single Digit 1
string re2="(\\d)"; // Any Single Digit 2
string re3="(\\d)"; // Any Single Digit 3
string re4="(\\d)"; // Any Single Digit 4
string re5="(-)"; // Any Single Character 1
string re6="((?:[a-z][a-z]*[0-9]+[a-z0-9]*))"; // Alphanum 1
The thing I am having difficulty with is combining it into one expression instead of 6.
I know I can do:
Regex r = new Regex(re1+re2+re3+re4+re5+re6,RegexOptions.IgnoreCase|RegexOptions.Singleline);
However, my OCD cringes at this method :)
You can use the expression \d{4}-\w{3}: 4 digits, followed by -, followed by 3 alphanumeric characters (note that \w also matches the underscore). Here is a good site to test and learn about regular expressions.
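A short Python sketch of the single-pattern approach follows. It is not C#, though the pattern syntax is the same in .NET's Regex class. The sample HTML string and the name find_codes are made up for illustration. The sketch uses [A-Za-z0-9] instead of \w to exclude the underscore, and adds \b on both sides so that longer digit or letter runs are not partially matched. Both choices are assumptions about what the asker wants.

```python
import re

# Four digits, a hyphen, then exactly three ASCII letters or digits.
CODE = re.compile(r"\b\d{4}-[A-Za-z0-9]{3}\b")

def find_codes(html):
    """Hypothetical helper: return every XXXX-abc style code in the given text."""
    return CODE.findall(html)

sample = "<td>1234-ab9</td><td>99999-abcd</td><td>0042-XYZ</td>"
print(find_codes(sample))
```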

Regex that matches integers in between whitespace or start/end of string only

I'm currently using the pattern: \b\d+\b, testing it with these entries:
numb3r
2
3454
3.214
test
I only want it to catch 2 and 3454. It works great for catching number words, except that the word boundary (\b) treats "." as a word separator, so the parts of 3.214 are matched as well. I tried excluding the period, but had trouble writing the pattern.
Basically I want to remove integer words, and just them alone.
All you want is the below regex:
^\d+$
Similar to manojlds but includes the optional negative/positive numbers:
var regex = /^[-+]?\d+$/;
EDIT
If you don't want to allow zeros in the front (023 becomes invalid), you could write it this way:
var regex = /^[-+]?[1-9]\d*$/;
EDIT 2
As @DmitriyLezhnev pointed out, you may want the number 0 to be valid by itself but still invalid in front of other digits (example: 0 is valid, but 023 is invalid). In that case you could use
var regex = /^([+-]?[1-9]\d*|0)$/;
You could use lookaround instead if all you want to match is whitespace:
(?<=\s|^)\d+(?=\s|$)
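Not every engine accepts the lookbehind above. Python's re module, for example, requires lookbehinds to have a fixed width and rejects (?<=\s|^). An equivalent formulation, sketched here in Python, uses negative lookarounds: "not preceded by a non-space" and "not followed by a non-space". The sample string is an assumption built from the question's test entries.

```python
import re

# Standalone integers: no non-whitespace character directly before or after the digits.
STANDALONE_INT = re.compile(r"(?<!\S)\d+(?!\S)")

entries = "numb3r 2 3454 3.214 test"
found = STANDALONE_INT.findall(entries)
print(found)

# Removing the integer "words" and collapsing the leftover whitespace.
cleaned = " ".join(STANDALONE_INT.sub("", entries).split())
print(cleaned)
```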
This allows only positive integers.
^[0-9]*[1-9][0-9]*$
I would add this as a comment to the other good answers, but I need more reputation to do so. Be sure to allow for scientific notation if necessary, i.e. 3e4 = 30000. This is default behavior in many languages. I found the following regex to work:
/^[-+]?\d+([Ee][+-]?\d+)?$/;
//          ^^             If 'e' is present to denote exp notation, get it
//             ^^^^^       along with optional sign of exponent
//                  ^^^    and the exponent itself
//        ^            ^^  The entire exponent expression is optional
This solution matches integers:
Negative integers are matched (-1,-2,etc)
Single zeroes are matched (0)
Negative zeroes are not (-0, -01, -02)
Empty spaces are not matched ('')
/^(0|-?[1-9][0-9]*)$/
^([+-]?[0-9]\d*|0)$
will accept numbers with a leading "+", a leading "-", and leading "0"s.
Try /^(?:-?[1-9]\d*$)|(?:^0)$/.
It matches positive, negative numbers as well as zeros.
It doesn't match input like 00, -0, +0, -00, +00, 01.
Online testing available at http://rubular.com/r/FlnXVL6SOq
^[-+]?[1-9][0-9]*$
starts with a - or + zero or one time, then a nonzero digit (because there is no such thing as -0 or +0), and then continues with any digits from 0 to 9
This worked in my case where I needed positive and negative integers that should NOT include zero-starting numbers like 01258 but should of course include 0
^(-?[1-9]+\d*)$|^0$
Example of valid values:
"3",
"-3",
"0",
"-555",
"945465464654"
Example of not valid values:
"0.0",
"1.0",
"0.7",
"690.7",
"0.0001",
"a",
"",
" ",
".",
"-",
"001",
"00.2",
"000.5",
".3",
"3.",
" -1",
"+100",
"--1",
"-.1",
"-0",
"00099",
"099"
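The two lists above make a convenient test table. The Python sketch below checks an equivalent pattern against them. It uses re.fullmatch rather than ^...$ because in Python $ also matches just before a trailing newline, and [0-9] rather than \d because \d matches non-ASCII digits in Python 3 by default. The name is_integer is an assumption for illustration.

```python
import re

# Optional minus, then a number without leading zeros, or a lone zero.
INTEGER = re.compile(r"-?[1-9][0-9]*|0")

def is_integer(s):
    """Hypothetical helper: True only if the whole string is a canonical integer."""
    return INTEGER.fullmatch(s) is not None

valid = ["3", "-3", "0", "-555", "945465464654"]
invalid = ["0.0", "1.0", "0.7", "690.7", "0.0001", "a", "", " ", ".", "-",
           "001", "00.2", "000.5", ".3", "3.", " -1", "+100", "--1", "-.1",
           "-0", "00099", "099"]

print(all(is_integer(s) for s in valid))
print(any(is_integer(s) for s in invalid))
```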

to search for consecutive list elements prefixed by number and dot in plain text

The text looks like this:
"Beginning. 1. The container is 1.5 meters long 2. It can hold up to 2lt of fluid. 3. It has 4 holes."
There may not be a dot at the end of each list element.
How can I split this text into a list as shown below?
"Beginning."
"The container is 1.5 meters long"
"It can hold up to 2lt of fluid."
"It has 4 holes."
In other words I need to match (\d+)\. such that all (\d+) are consecutive integers so that I can split and trim the text between them. Is it possible with regex? How far do I have to venture into the realm of computer science?
Use
\d+\.(?!\d)
as the splitting regex, e.g. in PHP
$result = preg_split('/\d+\.(?!\d)/', $subject);
The negative lookahead (?!\d) ensures that no digit follows after the dot has been matched.
Or make the spaces mandatory - if that's an option:
$result = preg_split('/\s+\d+\.\s+/', $subject);
This is working C# code:
string s = "Beginning. 1. The container is 1.5 meters long 2. It can hold up to 2lt of fluid. 3. It has 4 holes.";
string[] res = Regex.Split(s, @"\s*\d+\.\s+");
foreach (var r in res)
{
Console.WriteLine(r);
}
Console.ReadLine();
I split on \s*\d+\.\s+, which means optional whitespace, followed by at least one digit, followed by a dot, then at least one whitespace character.
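None of the patterns above checks that the list numbers are consecutive, which the question also asks about. A regex alone cannot count, but a small loop around one can. The Python sketch below is one possible approach under that assumption. The name split_numbered is made up for illustration. It accepts a "N." marker only when N is the next expected list number and treats every other match as ordinary text.

```python
import re

# A number followed by a dot that is not part of a decimal such as 1.5.
MARKER = re.compile(r"(\d+)\.(?!\d)")

def split_numbered(text):
    """Hypothetical helper: split on markers 1., 2., 3., ... taken in consecutive order."""
    items, start, expected = [], 0, 1
    for m in MARKER.finditer(text):
        if int(m.group(1)) != expected:
            continue  # not the next list number, leave it inside the current item
        items.append(text[start:m.start()].strip())
        start = m.end()
        expected += 1
    items.append(text[start:].strip())
    return items

s = ("Beginning. 1. The container is 1.5 meters long 2. It can hold up to "
     "2lt of fluid. 3. It has 4 holes.")
for item in split_numbered(s):
    print(item)
```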