Matching specific lengths with regexp in Matlab - regex

A string-matching question in Matlab.
If I have a character array
a = ['thehe'];
str = {'the','he'};
match = regexp(a,str);
The output is
match =
[1] [1x2 double]
because it found 'he' twice and 'the' once.
How can I make it scan my string a from left to right and
match 'the' only once and 'he' only once?

To answer the explicit question, from the documentation for regexp you can specify the once search option:
a = 'thehe';
str = {'the','he'};
match = regexp(a,str, 'once');
Which returns:
match =
[1] [2]
Where match is a 1x2 cell array whose cell value(s) correspond to the first index of the match in a for each cell of str.

As I understand the somewhat ambiguous description, you want the indices of the non-overlapping occurrences of 'the' and 'he', that is, 1 and 4.
a = ['thehe'];
str = {'the';'[^t]he'};
match = regexp(a,str)
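After this, print the two results. Note that match{2} is 3, not 4, because '[^t]he' also consumes the character before 'he'; hence the +1 offset in the second lookup.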
a(match{1}:match{1}+2)
ans =
the
and
a(match{2}+1:match{2}+2)
ans =
he
There is no third occurrence:
a(match{3})
??? Index exceeds matrix dimensions.

How to replace the captured group in Ruby

I would like to replace the captured group of a string with the elements of an array.
I am trying something like this:
part_number = 'R1L16SB#AA'
regex = (/\A(RM|R1)([A-Z])(\d\d+)([A-Z]+)#?([A-Z])([A-Z])\z/)
g = ["X","Y","Z"]
g.each do |i|
ren_m,ch_conf,bit_conf,package_type,packing_val,envo_vals = part_number.match(regex).captures
m = part_number.sub! packing_val,i
puts m
end
My code with array g = ["X","Y","Z"] is giving desired output as:
R1L16SB#XA
R1L16SB#YA
R1L16SB#ZA
The captured group packing_val is replaced with
g = ["X","Y","Z"]
But when the array has elements which are already present in the string then it is not working:
g = ["A","B","C"]
outputs:
R1L16SB#AA
R1L16SB#BA
R1L16SC#BA
But my expected output is:
R1L16SB#AA
R1L16SB#BA
R1L16SB#CA
What is going wrong and what could be the possible solution?
sub! replaces the first match in place, and part_number is defined outside the loop, so every iteration works on the string left behind by the previous one.
What happens is:
In the first iteration, the first A is replaced with A, leaving the string unchanged:
R1L16SB#AA
        ^
In the second iteration, the first A is replaced by B, giving
R1L16SB#BA
        ^
In the third iteration, packing_val is now B, and the first B in the string (the one in SB) is replaced by C, giving
R1L16SC#BA
      ^
One way to get the desired output is to put part_number = 'R1L16SB#AA' inside the loop.
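The trace above can be reproduced with a small sketch. The regex and the input come from the question; the names shared, buggy, and fixed are made up for illustration, and the results are collected in arrays instead of being printed inside the loop.

```ruby
regex = /\A(RM|R1)([A-Z])(\d\d+)([A-Z]+)#?([A-Z])([A-Z])\z/
g = %w[A B C]

# Buggy variant: one string lives outside the loop and sub! mutates it,
# so each iteration starts from the previous iteration's result.
shared = 'R1L16SB#AA'.dup
buggy = g.map do |i|
  packing_val = shared.match(regex).captures[4] # fifth capture group
  shared.sub!(packing_val, i).dup               # snapshot the mutated string
end

# Fixed variant: a fresh string is created on every iteration.
fixed = g.map do |i|
  part_number = 'R1L16SB#AA'.dup
  packing_val = part_number.match(regex).captures[4]
  part_number.sub!(packing_val, i)
end

puts buggy
puts fixed
```

The buggy variant ends with R1L16SC#BA, while the fixed variant yields the three expected part numbers.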
You mutated your part_number every iteration. That's the reason.
Just switch to sub without bang:
m = part_number.sub(packing_val, i)
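Note that sub(packing_val, i) still replaces the first occurrence of the captured text, which coincides with the captured group only when that letter does not appear earlier in the string. A possible alternative, sketched here under the assumption that the fifth group is always the one to replace, uses the offsets reported by MatchData#begin and MatchData#end. The helper name replace_group is hypothetical.

```ruby
# Replace the text of one capture group by position rather than by value.
# Returns a new string; the input is never mutated.
def replace_group(str, regex, group_index, replacement)
  m = regex.match(str)
  return str.dup if m.nil? # no match: hand back an unchanged copy

  result = str.dup
  # begin/end give the character offsets of the chosen capture group
  result[m.begin(group_index)...m.end(group_index)] = replacement
  result
end

regex = /\A(RM|R1)([A-Z])(\d\d+)([A-Z]+)#?([A-Z])([A-Z])\z/
part_number = 'R1L16SB#AA'

results = %w[A B C].map { |i| replace_group(part_number, regex, 5, i) }
puts results

# An input where the captured letter also occurs earlier in the string:
tricky = replace_group('R1L16SA#AA', regex, 5, 'C')
puts tricky
```

For the second input, sub('A', 'C') would change the A in SA, whereas the offset-based version changes only the character captured by the fifth group.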
You can do it without regex:
part_number = 'R1L16SB#AA'
g = %w[X Y Z]
g.each do |i|
pn = part_number.dup
pn[-2] = i
puts pn
end

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000), and I want to find the corresponding file name in the struct.
Please note that 1094000 has no leading 0, and the list may contain other numbers as well. So I want to search for '1094000' and find that name.
I know I can do it with a regex, but I have never used one before, and I am finding it difficult to do this for numbers instead of text with strfind. Any suggestion or alternative method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Before R2016b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
R2016b and later:
match = contains({files.name}, string(1094000))
files(match)
However, plain substring searches such as strfind and contains can give false positives when the number appears in unexpected places, for example when looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in R2016b and later you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)

groovy regex, how to match array items in a string

The string looks like this: "[xx],[xx],[xx]"
where xx is a polygon like this: "(1.0,2.3),(2.0,3)..."
Basically, we are looking for a way to get the string between each pair of square brackets into an array.
E.g. String source = "[hello],[1,2],[(1,2),(2,4)]"
would result in an object a such that:
a[0] == 'hello'
a[1] == '1,2'
a[2] == '(1,2),(2,4)'
We have tried various strategies, including using groovy regex:
def p = "[12],[34]"
def points = p =~ /(\[([^\[]*)\])*/
println points[0][2] // yields 12
However, this yields the following two-dimensional array:
[[12], [12], 12]
[, null, null]
[[34], [34], 34]
So if we took the third item from every even row we would be OK, but this does not look very correct. We are not taking the ',' into account, and we are not sure why we are getting "[12]" twice, when it should be zero times?
Any regex experts out there?
I think that this is what you're looking for:
def p = "[hello],[1,2],[(1,2),(2,4)]"
def points = p.findAll(/\[(.*?)\]/){match, group -> group }
println points[0]
println points[1]
println points[2]
This script prints:
hello
1,2
(1,2),(2,4)
The key is the use of .*? to make the expression non-greedy, so that it matches the minimum number of characters between [ and ]. Otherwise the first [ would pair with the last ], and the single match would be hello],[1,2],[(1,2),(2,4). The closure passed to findAll then returns only the captured group.

Change characters at specific position in a string in SAS

Is there a function to change letters at a given index in SAS?
For example if my string is
string1 = 'abcd1234efgh'
I want to do something like:
string2 = somefunction(string1, 5, 'zzzz');
to produce
'abcdzzzzefgh'
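Yes, substr() on the left side of an assignment is what you're looking for. Copy string1 into string2 first, then overwrite the characters starting at position 5:
string2 = string1;
substr(string2, 5) = 'zzzz';
The substr(variable,position<,length>) = function can also take a third argument to define the length of the segment to be replaced.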

Split a word with regexp in matlab; startIndex for 'split'?

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
I have come so far:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
[match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
%Now I split the 'aa', 'but' and 'er' into single characters:
for i = 1:length(split)
SingleLetters{i} = regexp(split{1,i},'.','match');
end
end
My problem is: How do I put the cells together, such that they are formatted like the desired output? I only have the starting indexes for the match parts ('ch') but not for the split parts ('aa', 'but','er').
Any ideas?
You don't need to work with the indices or lengths. The logic is simple: take the first element from split, then the first from match, then the second from split, and so on.
[match,split] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
You know the length of 'ch': it's 2. You also know where it was found, because regexp returns those indices in startIndex. I'm assuming (please correct me if I'm wrong) that you want every other letter of the word in its own single-letter cell, as in your output above. So you can walk through the word and use the startIndex data to build the output with a conditional, like this:
word = 'aachbutcher';
startIndex = regexp(word,'ch'); % start positions of every 'ch'
output = {};
i = 1;
% A while loop is used because changing the loop variable inside a for loop has no effect in MATLAB.
while i <= length(word)
    if ismember(i, startIndex)
        output{end+1} = 'ch'; % keep 'ch' together
        i = i + 2; % skip both characters of 'ch'
    else
        output{end+1} = word(i); % any other letter gets its own cell
        i = i + 1;
    end
end
I don't have MATLAB right now, so I can't test it. I hope it works for you! If not, let me know and I'll take another shot at it.