Reading unknown length line - regex

I have a line of unknown length in this format:
bobaboao dsaas : 5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13 ...
I need to read each "X->Y" pair and send it to the function Dist(X,Y) until the end of the line.
How can I do this in MATLAB?

I'd use regexp with 'tokens', which pulls out the bits of the match in between parentheses (()):
>> C = regexp(s,'(\d*)->(\d*)','tokens')
C =
{1x2 cell} {1x2 cell} {1x2 cell} {1x2 cell} {1x2 cell} ...
{1x2 cell} {1x2 cell} {1x2 cell} {1x2 cell}
>> xy = str2double(vertcat(C{:})).'
xy =
5 2 4 7 1 5 8 222 23
2 3 6 2 4 1 1 1 13
Then you have X = xy(1,:); and Y = xy(2,:);.
Explained: \d is a digit ([0-9]), and \d* means any number of digits, including none (use \d+ to require at least one). Wrapping them in () makes them tokens. The whole pattern defines a match, but the tokens are extracted into a cell array of cell arrays: one 1x2 cell array per match, holding the two token strings. vertcat stacks these into a single 9x2 cell array, str2double converts it to a numeric matrix, and the transpose (.') puts the X values in the first row and the Y values in the second.
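For readers who want to try the token-capture idea outside MATLAB, here is a minimal sketch in Python (an assumption; the question itself is about MATLAB). re.findall with two capture groups returns one tuple per match, much like the 'tokens' option. dist is a hypothetical stand-in for the asker's Dist(X,Y), assumed here to return the absolute difference.

```python
import re

def dist(x, y):
    # Hypothetical stand-in for the asker's Dist(X,Y); the real function is not shown.
    return abs(x - y)

def parse_pairs(line):
    # Each match yields a tuple holding the two captured digit runs.
    return [(int(x), int(y)) for x, y in re.findall(r'(\d+)->(\d+)', line)]

s = 'bobaboao dsaas : 5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13'
pairs = parse_pairs(s)
distances = [dist(x, y) for x, y in pairs]
print(pairs)
print(distances)
```

Using \d+ rather than \d* means a stray "->" with no digits around it is not reported as a match.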

One suggestion is to use regular expressions to search the string for substrings that consist of one ID, followed by ->, followed by another ID. Once we find these patterns, we extract them and place them into a cell array. Supposing the string is stored in s (I'll use your example), do this:
s = 'bobaboao dsaas : 5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13';
g = regexp(s, '[0-9]+->[0-9]+', 'match');
Let's go through this code slowly. s stores the string you're analyzing. The next line finds every substring of s that is a sequence of at least one digit, followed by ->, followed by at least one digit. The 'match' flag returns the substrings that match this pattern. The output g is a cell array with one string per match. We thus get:
g =
Columns 1 through 7
'5->2' '2->3' '4->6' '7->2' '1->4' '5->1' '8->1'
Columns 8 through 9
'222->1' '23->13'
Note that storing into a cell array is important, because the length of each substring may be different.
Once we extract these substrings, what we can do is extract the numbers before and after the ->. We simply apply two more regular expression calls to get the numbers before and after:
X = regexp(g, '^[0-9]+', 'match');
Y = regexp(g, '[0-9]+$', 'match');
The first call looks for a run of digits at the beginning of each string in g, while the second looks for a run of digits at the end of each string. Each result is a cell array of cells, and the numbers inside are still strings, so we convert them to actual numbers and collect them into numeric vectors:
X = cellfun(@str2double, X);
Y = cellfun(@str2double, Y);
cellfun applies a function to each cell of a cell array. Here it applies str2double to convert each string to a double. Once we're done, we have numeric vectors holding the numbers before and after the ->.
We finally get:
X =
5 2 4 7 1 5 8 222 23
Y =
2 3 6 2 4 1 1 1 13
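The two-stage approach (find whole "X->Y" pieces first, then pull the number off each end) can be sketched in Python as well. This is only an illustration under the assumption that Python's re module is acceptable; the variable names mirror the MATLAB ones.

```python
import re

s = 'bobaboao dsaas : 5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13'

# Stage 1: collect every complete "digits->digits" piece.
g = re.findall(r'[0-9]+->[0-9]+', s)

# Stage 2: anchored searches take the number at the start and at the end of each piece.
X = [int(re.search(r'^[0-9]+', piece).group()) for piece in g]
Y = [int(re.search(r'[0-9]+$', piece).group()) for piece in g]

print(X)
print(Y)
```

Because every piece in g already matched the stage-1 pattern, both anchored searches are guaranteed to find a number.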

I found a tricky way, which assumes that Dist(X,Y) is simply X-Y:
0) Delete the alphabetical prefix, up to and including the colon:
str(1:strfind(str,':'))=''
1) Wrap the string in square brackets:
newStr=['[',str,']']
so that the new string is '[5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13]'.
2) Delete every '>':
newStr(newStr=='>')=''
so that you now have '[5-2 2-3 4-6 7-2 1-4 5-1 8-1 222-1 23-13]'.
Note that this string represents a vector containing the differences between the numbers, which leads us to step 3.
3) Evaluate the string:
distances=eval(newStr);
If you want the distances without the sign, apply abs(). Be aware that eval runs whatever text the line contains, so use this only on input you trust.
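The same differences can be computed without evaluating the string as code. The following Python sketch (Python being an assumption, not the asker's language) keeps the answer's premise that the distance is simply X - Y; differences is a hypothetical helper name.

```python
def differences(line):
    # Keep only the part after the first ':' (the alphabetical prefix is dropped).
    _, _, tail = line.partition(':')
    result = []
    for item in tail.split():
        left, sep, right = item.partition('->')
        # Skip anything that is not a well-formed "digits->digits" item, e.g. a trailing "...".
        if sep and left.isdigit() and right.isdigit():
            result.append(int(left) - int(right))
    return result

line = 'bobaboao dsaas : 5->2 2->3 4->6 7->2 1->4 5->1 8->1 222->1 23->13 ...'
print(differences(line))
print([abs(d) for d in differences(line)])
```

Splitting on whitespace and on '->' avoids the need to trust the input, at the cost of a few more lines.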

Related

Removing Measurement Units from Cell Array

I am trying to remove the units out of a column of cell array data i.e.:
cArray =
time temp
2022-05-10 20:19:43 '167 °F'
2022-05-10 20:19:53 '173 °F'
2022-05-10 20:20:03 '177 °F'
...
2022-06-09 20:18:10 '161 °F'
I have tried str2double but get all NaN.
I have found some info on regexp but don't follow exactly as the example is not the same.
Can anyone help me get the temp column to only read the value i.e.:
cArray =
time temp
2022-05-10 20:19:43 167
2022-05-10 20:19:53 173
2022-05-10 20:20:03 177
...
2022-06-09 20:18:10 161

For some cell array of data
cArray = { ...
1, '123 °F'
2, '234 °F'
3, '345 °F'
};
The easiest option is if we can safely assume the temperature data always starts with numeric values, and you want all of the numeric values. Then we can use regex to match only numbers
temps = regexp( cArray(:,2), '\d+', 'match', 'once' );
The match option causes regexp to return the matching string rather than the index of the match, and once means "stop at the first match" so that we ignore everything after the first non-numeric character.
The pattern '\d+' means "one or more digits". You could expand it to match numbers with a decimal part using '\d+(\.\d+)?' instead if that's a requirement.
Then if you want to actually output numbers, you should use str2double. You could do this in a loop, or use cellfun, which is a compact way of achieving the same thing.
temps = cellfun( @str2double, temps, 'uni', 0 ); % 'uni'=0 to retain cell array
Finally you can overwrite the column in cArray
cArray(:,2) = temps;
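As a cross-check of the pattern itself, here is a small Python sketch (an assumption; the answer is MATLAB) that applies the decimal-aware variant of the regex and converts the first match to a number. leading_number and the sample rows are hypothetical.

```python
import re

def leading_number(text):
    # First run of digits, with an optional decimal part; None when nothing matches.
    m = re.search(r'\d+(?:\.\d+)?', text)
    return float(m.group()) if m else None

rows = [(1, '123 °F'), (2, '234 °F'), (3, '345 °F')]
cleaned = [(key, leading_number(temp)) for key, temp in rows]
print(cleaned)
```

Returning None for text without digits plays the role that NaN plays in the str2double version.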

regex: Match strings of numbers up to permutation of ciphers

I am trying to find a regex query, such that, for instance, the following strings match the same expression
"1116.67711..44."
"2224.43322..88."
"9993.35599..22."
"7779.91177..55."
I.e. formally "x1x1x1x2.x2x3x3x1x1..x4x4." where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
Or (another example), the following strings match the same expression, but not the same expression as before:
"94..44.773399.4"
"25..55.886622.5"
"73..33.992277.3"
I.e. formally "x1x2..x2x2.x3x3x4x4x1x1.x2" where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
That is, two strings should be considered equal if they have the same form, with the digits permuted among themselves so that they remain pairwise distinct.
The dots stand for a gap in the sequence, which could be any value that is not a single-digit number, and two "equal" strings should have their gaps in the same places. If it helps, the strings all have the same length of 81 (above they all have a length of 15, to keep the examples short).
That is, if I have some string as above, e.g. "3566.235.225..45", I want a regular expression that I can apply to a database to find out whether such a string already exists.
Is it possible to do this?

The answer is fairly straightforward:
import re
pattern = re.compile(r'^(\d)\1{3}$')
print(pattern.match('1234'))
print(pattern.match('333'))
print(pattern.match('3333'))
print(pattern.match('33333'))
You capture what you need once, then tell the regex engine how often you need to repeat it. You can refer back to it as often as you like, for example for a pattern that would match 11.222.1 you'd use ^(\d)\1{1}\.(\d)\2{2}\.(\1){1}$.
Note that the {1} in there is superfluous, but it shows that the pattern can be very regular. So much so, that it's actually easy to write a function that solves the problem for you:
def make_pattern(grouping, separators='.'):
    regex_chars = '.\\*+[](){}^$?!:'
    groups = {}
    i = 0
    j = 0
    last_group = 0
    result = '^'
    while i < len(grouping):
        if grouping[i] in separators:
            if grouping[i] in regex_chars:
                result += '\\'
            result += grouping[i]
            i += 1
        else:
            while i < len(grouping) and grouping[i] == grouping[j]:
                i += 1
            if grouping[j] in groups:
                group = groups[grouping[j]]
            else:
                last_group += 1
                groups[grouping[j]] = last_group
                group = last_group
                result += '(.)'
                j += 1
            result += f'\\{group}{{{i-j}}}'
        j = i
    return re.compile(result+'$')
print(make_pattern('111.222.11').match('aaa.bbb.aa'))
So, you can give make_pattern a good example of the pattern and it will return the compiled regex for you. If you'd like other separators than '.', you can just pass those in as well:
my_pattern = make_pattern('11,222,11', separators=',')
print(my_pattern.match('aa,bbb,aa'))
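If the goal is only to decide whether two strings share a form, a different technique from the regex builder above is to reduce each string to a canonical form and compare those. The sketch below is an alternative offered under assumptions, not part of the answer: canonical_form is a hypothetical helper that relabels the digits 1-9 in order of first appearance and writes every other character as '.', so distinct digits always receive distinct labels.

```python
def canonical_form(s):
    # Hypothetical helper: relabel digits 1-9 by order of first appearance.
    labels = {}
    out = []
    for ch in s:
        if '1' <= ch <= '9':
            if ch not in labels:
                labels[ch] = str(len(labels) + 1)
            out.append(labels[ch])
        else:
            out.append('.')  # anything else counts as a gap
    return ''.join(out)

a = canonical_form('1116.67711..44.')
b = canonical_form('2224.43322..88.')
c = canonical_form('94..44.773399.4')
print(a == b, a == c)
```

Storing the canonical form alongside each string might let the database lookup be a plain equality test rather than a regex scan.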

Matlab regexp tokens - optional tokens and size of the returned array

Say we have a collection of files with names which can be either myfilename_ABC (type 1) or myfilename_ABC=XYZ(type 2). Providing that at any one time we supply regexp with an array of filenames of only one of these two types, how do I get it to return an array with either 1 (for type 1) or 2 (for type 2) columns containing the 3-letter combinations? I have tried using
'myfilename_(\w+)=?(\w+)?'
but this returns a cell array with 2 columns even for type 1 filenames where the second column contains empty string ''.

It's possible to do this if you create a match expression for each case and combine them with the alternation operator (|). For example:
>> type1 = {'myfilename_ABC'; 'myfilename_DEF'};
>> type2 = {'myfilename_ABC=XYZ'; 'myfilename_DEF=UVW'};
>> matchExpr = 'myfilename_(\w+)=(\w+)|myfilename_(\w+)';
>> results1 = regexp(type1, matchExpr, 'tokens', 'once')
results1 =
2×1 cell array
{1×1 cell} % Each cell contains 1-by-1 results
{1×1 cell}
>> results2 = regexp(type2, matchExpr, 'tokens', 'once')
results2 =
2×1 cell array
{1×2 cell} % Each cell contains 1-by-2 results
{1×2 cell}
Notice that I placed the longer match expression (myfilename_(\w+)=(\w+)) before the shorter one (myfilename_(\w+)) so that it would attempt to match the longer first. I also used the 'once' option (to match the expression only once per input) to remove an extra layer of cell encapsulation.
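In Python's re module the same longer-alternative-first trick behaves a little differently: groups belonging to the alternative that did not match come back as None, so a sketch has to filter them out. The function name tokens is hypothetical, and Python is an assumption here since the question is about MATLAB.

```python
import re

# Longer alternative first, as in the MATLAB expression.
PATTERN = re.compile(r'myfilename_(\w+)=(\w+)|myfilename_(\w+)')

def tokens(name):
    m = PATTERN.fullmatch(name)
    if m is None:
        return []
    # Drop the groups of the alternative that did not take part in the match.
    return [g for g in m.groups() if g is not None]

print(tokens('myfilename_ABC'))
print(tokens('myfilename_ABC=XYZ'))
```

The result has one element for type 1 names and two for type 2 names, which is the shape the question asks for.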

NaN returned when converting string to number within a for loop

for n=1:37
    for m=2:71
        rep1 = regexp(Cell1{n,m}, 'f[0-9]*', 'match')
        rep2 = regexp(rep1, '[0-9]*', 'match')
        rep2 = [rep2{:}]
        cln = str2double(rep2)
        Cell2{n,cln} = Cell1{n,m}
    end
end
Cell1 is a 37x71 cell array and Cell2 is an empty 37x71 cell array.
For example:
Cell1{1,2} = -(f32.*x1.*x6)./v1
If I run each part of the loop above individually, the code works as intended. However, cln comes back as NaN when the whole loop is executed.

You are getting a NaN because your regex doesn't match one of the values of Cell1 and returns an empty string (which str2double converts to a NaN).
But let's take a step back for a second here. You can use regexp on cell arrays so there is no need to loop through all of your elements. Also, you can use a look behind assertion to look for that "f" that precedes your number therefore preventing the use of regexp twice.
stringNumber = regexp(Cell1, '(?<=f)[0-9]*', 'match', 'once');
numbers = str2double(stringNumber);
You can then check for NaNs (isnan(numbers)) and look closer at the elements of Cell1 to see why your regex isn't finding a number in a particular string.
Once you get that sorted out, you can assign to Cell2 like you are doing
Cell2 = cell(37, 71);
for k = 1:numel(numbers)
    % Linear indices run down the columns, so the row comes from the row count
    row = mod(k - 1, size(Cell1, 1)) + 1;
    Cell2(row, numbers(k)) = Cell1(k);
end
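The lookbehind pattern and the "no match becomes NaN" behaviour can be reproduced in a few lines of Python (an assumption; the answer is MATLAB). f_number is a hypothetical helper, and the sketch uses [0-9]+ so that an "f" with no digits after it counts as no match.

```python
import math
import re

def f_number(expr):
    # (?<=f) requires an "f" immediately before the digits without including it in the match.
    m = re.search(r'(?<=f)[0-9]+', expr)
    return float(m.group()) if m else math.nan

print(f_number('-(f32.*x1.*x6)./v1'))
print(f_number('x1.*x6'))
```

Running the helper over every cell and listing the inputs that produce NaN is one way to find the strings the regex fails on.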

decision on regular expression length

I want to accomplish the following requirements using regex only (no C# code can be used):
• If the BTN length is 12 and the BTN starts with 0[123456789], remove one digit from the left and one digit from the right.
WORKING CORRECTLY
• If the BTN length is 12 and it is not the case stated above, always return the 10 rightmost digits by removing 2 from the start (e.g. 491234567891 should be changed to 1234567891).
NOT WORKING CORRECTLY
• If the BTN length is 11, remove one digit from the left. WORKING CORRECTLY
For BTNs of length <= 10, nothing needs to be done; they may remain as they are, or the regex may fail on them, which is acceptable.
Using SQL, this can be achieved like this:
case when len(BTN) = 12 and BTN like '0[123456789]%' then SUBSTRING(BTN,2,10) else RIGHT(BTN,10) end
But how can I do this using regex?
So far I have used the following regex and get some results right:
[0*|\d\d]*(.{10}) but with this regex I cannot correctly remove the first and last character of a BTN like 015732888810 to get 1573288881; it returns 5732888810, which is wrong.
The code is:
string s = "111112573288881,0573288881000,057328888105,005732888810,15732888815,344956345335,004171511326,01777203102,1772576210,015732888810,494956345335";
string[] arr = s.Split(',');
foreach (string ss in arr)
{
    // Match mm = Regex.Match(ss, @"\b(?:00(\d{10})|0(\d{10})\d?|(\d{10}))\b");
    // Match mm = Regex.Match(ss, "0*(.{10})");
    // ([0*|\\d\\d]*(.{10}))|
    Match mm = Regex.Match(ss, "[0*|\\d\\d]*(.{10})");
    // Match mm = Regex.Match(ss, "(?(^\\d{12}$)(.^{12}$)|(.^{10}$))");
    // Match mm = Regex.Match(ss, "(info)[0*|\\d\\d]*(.{10}) (?(1)[0*|\\d\\d]*(.{10})|[0*|\\d\\d]*(.{10}))");
    string m = mm.Groups[1].Value;
    Console.WriteLine("Original BTN :"+ ss + "\t\tModified::" + m);
}

This should work:
(0(\d{10})0|\d\d(\d{10}))
UPDATE:
(0(\d{10})0|\d{1,2}(\d{10}))
The first alternative matches 12 digits with a 0 on the left and a 0 on the right and gives you only the 10 in between.
The second alternative matches 11 or 12 digits and gives you the rightmost 10.
EDIT:
The regex matches the spec, but your code doesn't read the results correctly. Try this:
Match mm = Regex.Match(ss, "(0(\\d{10})0|\\d{1,2}(\\d{10}))");
string m = mm.Groups[2].Value;
if (string.IsNullOrEmpty(m))
m = mm.Groups[3].Value;
Groups are as follows:
index 0: the full match
index 1: everything inside the outer parentheses
index 2: what the parentheses inside the first alternative matched
index 3: what the parentheses inside the second alternative matched
NOTE: This does not deal with anything greater than 12 digits or less than 11. Those entries will either fail or return 10 digits from somewhere. If you want results for those use this:
"(0(\\d{10})0|\\d*(\\d{10}))"
You'll get rightmost 10 digits for more than 12 digits, 10 digits for 10 digits, nothing for less than 10 digits.
EDIT:
This one should cover your additional requirements from the comments:
"^(?:0|\\d*)(\\d{10})0?$"
The (?:) makes a grouping excluded from the Groups returned.
EDIT:
This one might work:
"^(?:0?|\\d*)(\\d{10})\\d?$"

(?(^\d{12}$)(?(^0[1-9])0?(?<digit>.{10})|\d*(?<digit>.{10}))|\d*(?<digit>.{10}))
This does exactly the same thing as the SQL query and always returns the result in Groups[1], so I didn't have to change the code at all :)
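For comparison, the SQL rule from the question can be sketched with Python's re module. Python's conditional syntax only tests whether an earlier group matched, so this sketch swaps in a lookahead plus alternation instead of the .NET conditional, and the result lands in one of two groups rather than always in group 1. normalize_btn is a hypothetical helper, and returning None for inputs shorter than 10 digits is an assumption the question leaves open.

```python
import re

# First alternative: exactly 12 digits starting with 0[1-9]; keep the middle 10.
# Second alternative: any other run of at least 10 digits; keep the rightmost 10.
BTN = re.compile(r'0(?=[1-9])(\d{10})\d|\d*(\d{10})')

def normalize_btn(btn):
    m = BTN.fullmatch(btn)
    if m is None:
        return None  # fewer than 10 digits, or non-digit characters
    return m.group(1) or m.group(2)

for btn in ['015732888810', '491234567891', '01777203102', '1772576210']:
    print(btn, normalize_btn(btn))
```

fullmatch forces the chosen alternative to consume the whole string, which is what restricts the first alternative to 12-digit inputs.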
