Split a word with regexp in matlab; startIndex for 'split'? - regex

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
I have come so far:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
[match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
%Now I split the 'aa', 'but' and 'er' into single characters:
for i = 1:length(split)
SingleLetters{i} = regexp(split{1,i},'.','match');
end
end
My problem is: How do I put the cells together, such that they are formatted like the desired output? I only have the starting indexes for the match parts ('ch') but not for the split parts ('aa', 'but','er').
Any ideas?

You don't need to work with the indices or lengths. The logic is simple: take the first element of split, then the first element of match, then the second element of split, and so on.
[match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
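For comparison, the same interleaving idea can be sketched in Python. This is only an illustration under the assumption that a Python analogue is useful here; the function name split_keeping_digraph is made up for this example and is not part of the answer above.

```python
import re

def split_keeping_digraph(word, digraph="ch"):
    """Split word into single letters, keeping each digraph as one unit."""
    pattern = re.escape(digraph)
    # Pieces between occurrences of the digraph, e.g. ['aa', 'but', 'er'].
    pieces = re.split(pattern, word)
    # The occurrences themselves, e.g. ['ch', 'ch'].
    matches = re.findall(pattern, word)
    # Start with the letters of the first piece, then alternate
    # match, piece, match, piece, ...
    result = list(pieces[0])
    for match, piece in zip(matches, pieces[1:]):
        result.append(match)
        result.extend(piece)
    return result

print(split_keeping_digraph("aachbutcher"))
# ['a', 'a', 'ch', 'b', 'u', 't', 'ch', 'e', 'r']
```

There is always exactly one more piece than there are matches, which is why the first piece is handled before the loop.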

So, you know the length of 'ch', it's 2. You know where you found it from regex, as those indices are stored in startIndex. I'm assuming (Please, correct me if I'm wrong) that you want to split all other letters of the word into single-letter cells, like in your output above. So, you can just use the startIndex data to construct your output, using conditionals, like this:
word = 'aachbutcher';
[match, split, startIndex, endIndex] = regexp(word, 'ch', 'match', 'split');
output = {};
i = 1;
% A while loop is used because changing the loop variable inside a
% for loop has no effect on the next iteration in MATLAB.
while i <= length(word)
    if any(i == startIndex)       % a 'ch' starts at this position
        output{end+1} = 'ch';
        i = i + 2;                % skip both characters of 'ch'
    else
        output{end+1} = word(i);  % any other letter becomes its own cell
        i = i + 1;
    end
end
I don't have MATLAB at hand right now, so I can't test it. I hope it works for you! If not, let me know and I'll take another shot at it.

Related

NaN returned when converting string to number within a for loop

for n=1:37
for m=2:71
rep1 = regexp(Cell1{n,m}, 'f[0-9]*', 'match')
rep2 = regexp(rep1, '[0-9]*', 'match')
rep2 = [rep2{:}]
cln = str2double(rep2)
Cell2{n,cln} = Cell1{n,m}
end
end
Cell 1 is a 37x71 Cell, Cell 2 is a 37x71 empty cell.
Ex
Cell1{1,2} = -(f32.*x1.*x6)./v1
If I run each part of the loop above individually, the function works as intended. However, it returns cln as a NaN when the whole loop is executed.
You are getting a NaN because your regex doesn't match one of the values of Cell1 and returns an empty string (which str2double converts to a NaN).
But let's take a step back for a second here. You can use regexp on cell arrays, so there is no need to loop through all of your elements. You can also use a lookbehind assertion to require the "f" that precedes your number, which avoids calling regexp twice.
stringNumber = regexp(Cell1, '(?<=f)[0-9]*', 'match', 'once');
numbers = str2double(stringNumber);
You can then check for NaNs (isnan(numbers)) and look closer at the elements of Cell1 to see why your regex isn't finding a number in a particular string.
Once you get that sorted out, you can assign to Cell2 like you are doing
Cell2 = cell(37, 71);
for k = 1:numel(numbers)
    % Linear indices run down the columns, so the row comes from the number of rows
    row = mod(k - 1, size(Cell1, 1)) + 1;
    Cell2(row, numbers(k)) = Cell1(k);
end
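The point about unmatched entries can be illustrated outside MATLAB as well. The following Python sketch assumes the same kind of input strings; the helper name number_after_f is hypothetical. It returns None where MATLAB's str2double would give NaN, so the problematic entries are easy to spot.

```python
import re

def number_after_f(text):
    """Return the integer that directly follows an 'f', or None if there is none."""
    # (?<=f) is a lookbehind: the digits must be preceded by 'f',
    # but the 'f' itself is not part of the match.
    m = re.search(r"(?<=f)[0-9]+", text)
    return int(m.group()) if m else None

print(number_after_f("-(f32.*x1.*x6)./v1"))  # 32
print(number_after_f("x1.*x6"))              # None
```

Using + instead of * in the pattern is a deliberate change in this sketch: it prevents an "f" with no digits after it from producing an empty match.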

Matching specific lengths with regexp in Matlab

A string matching question in MATLAB.
If I have a string
a = ['thehe'];
str = {'the','he'};
match = regexp(a,str);
the output is
match =
[1] [1x2 double]
because it found 'the' once and 'he' twice.
How can I make it scan my string a from left to right and match 'the' only once and 'he' only once?
To answer the explicit question, from the documentation for regexp you can specify the once search option:
a = 'thehe';
str = {'the','he'};
match = regexp(a,str, 'once');
Which returns:
match =
[1] [2]
Where match is a 1x2 cell array whose cell value(s) correspond to the first index of the match in a for each cell of str.
From the somewhat ambiguous description, I understand that you want the indices of the non-overlapping occurrences of 'the' and 'he', that is, 1 and 4.
a = ['thehe'];
str = {'the';'[^t]he'};
match = regexp(a,str)
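After this, print the two results. Note that match{2} is 3, the position of the character just before 'he', so the word itself starts at match{2}+1 = 4.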
a(match{1}:match{1}+2)
ans =
the
and
a(match{2}+1:match{2}+2)
ans =
he
There is no third occurrence:
a(match{3})
??? Index exceeds matrix dimensions.
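The left-to-right, non-overlapping reading of the question can also be sketched with a single alternation pattern. This is a Python illustration rather than MATLAB; the function name scan_left_to_right is made up, and positions are reported 1-based to line up with the MATLAB indices above.

```python
import re

def scan_left_to_right(text, words):
    """Return (word, 1-based start) pairs for non-overlapping matches, left to right."""
    # Alternatives are tried in the given order at each position, and the
    # scan resumes after the end of each match, so matches never overlap.
    pattern = "|".join(re.escape(w) for w in words)
    return [(m.group(), m.start() + 1) for m in re.finditer(pattern, text)]

print(scan_left_to_right("thehe", ["the", "he"]))
# [('the', 1), ('he', 4)]
```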

Regex: How to remove English words from sentences using Regex?

I have a number of rows in SQLite; each row has one column that contains data like this:
prosperکامیاب شدن ، موفق شدن ، رونق یافتن
As you can see, the sentence starts with English words. I want to remove the English words at the start of each sentence. Is there any way to do that via a T-SQL query (using regex)?
You may try this :) I have written it as a function that you can call:
create function dbo.RemoveEngChars (@Unicode_string nvarchar(max))
returns nvarchar(max) as
begin
    declare @i int = 1; -- must start from 1, as SubString is 1-based
    -- nvarchar(max) so that long inputs are not truncated
    declare @OriginalString nvarchar(max) = @Unicode_string collate SQL_Latin1_General_Cp1256_CS_AS
    declare @ModifiedString nvarchar(max) = N'';
    while @i <= Len(@OriginalString)
    begin
        -- keep every character that is not a Latin letter
        if SubString(@OriginalString, @i, 1) not like '[a-Z]'
        begin
            set @ModifiedString = @ModifiedString + SubString(@OriginalString, @i, 1);
        end
        set @i = @i + 1;
    end
    return @ModifiedString
end
-- To call the function, run the following script and pass the Unicode string with the N' prefix
select dbo.RemoveEngChars(N'prosperکامیاب شدن ، موفق شدن ، رونق یافتن')
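If the rows can be post-processed outside the database, the "remove the English words at the start" requirement maps onto one anchored regex substitution. The sketch below is Python, not T-SQL, and the function name strip_leading_latin is hypothetical. Unlike the function above, which drops every Latin letter in the string, it only removes the leading run.

```python
import re

def strip_leading_latin(text):
    """Remove a leading run of ASCII letters and whitespace, keep the rest."""
    # ^ anchors the match at the start, so Latin letters that appear
    # later in the string are left untouched.
    return re.sub(r"^[A-Za-z\s]+", "", text)

print(strip_leading_latin("prosperکامیاب شدن"))
```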

create string in Matlab using only some parameters

In Matlab, is it possible to create a string like:
f1-*f2-*f3-*f4-*f5-*f6
giving only as parameters:
f, 1:6 and -* ?
I tried:
for i=1:6; str = strcat(str, sprintf('f%d %s',i,'-* ')); end
but it doesn't work very well and seems inefficient for a larger number of files... Perhaps a regexp would be more suitable here?
This gives you the string with extra trailing separator:
str = sprintf('f%d-*', 1:6)
Perhaps you can just remove the last two characters from this. In general, a single sprintf for an array input is quite efficient.
Use strjoin with -* as the delimiter, and strcat to combine the numbers with f (note that sprintfc is an undocumented function):
>> strjoin(strcat('f',sprintfc('%d',1:6)),'-*')
ans =
f1-*f2-*f3-*f4-*f5-*f6
Because strcat accepts cell arrays, no loop is needed.
%// Data:
letter = 'f';
numbers = 1:6;
separator = '-*';
%// Let's go:
num = mat2cell(num2str(numbers(:)), ones(1,numel(numbers))); %// cell array
%// of strings from the numbers. Those strings may contain spaces.
%// Those will be removed later
s = strcat(letter,num,separator); %// concatenate letter and separator to each number
s = [s{:}]; %// concatenate all
s = s(1:end-numel(separator)); %// remove last separator
s(s==' ') = []; %// remove spaces (in case of several-digit numbers)
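The "build each piece, then join with the separator" idea behind the strjoin answer is not specific to MATLAB. As a rough Python sketch (the function name join_labels is made up for illustration):

```python
def join_labels(letter, numbers, separator):
    """Build e.g. 'f1-*f2-*f3' from a letter, a sequence of numbers and a separator."""
    # Joining puts the separator only between items, so there is
    # no trailing separator to strip afterwards.
    return separator.join(f"{letter}{n}" for n in numbers)

print(join_labels("f", range(1, 7), "-*"))
# f1-*f2-*f3-*f4-*f5-*f6
```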

How can I split a string into an array and determine WHICH character came after the split?

I'm trying to split all the words in a string into an array in AS3. The obvious answer of course would be to simply do this:
str.split(/\s/);
The problem here is that I need to be able to tell whether the split occurred on a newline or a space. I'm trying to put the words of a string into draggable boxes, and I want the ones after a newline to go, well, on a new line.
Any idea the best way to go about this? Clearly, the above split method will get rid of the crucial newline character that will tell me what I need to know. Should I use a regex.exec with a while loop, or is there any way to use split to preserve the characters I need?
Example string :
This is an example string
with spaces as well as newlines
and needs a regex
1/ Split the string on newlines to get array#1.
array#1 = [ "This is an example string","with spaces as well as newlines","and needs a regex" ]
2/ For each element in array#1, split using your current regex, which will now break the strings only on spaces because the newlines have already been dealt with. The resulting 2-D array is array#2.
array#2 = [
["This","is","an","example","string"] ,
["with","spaces","as","well","as","newlines"],
["and","needs","a","regex"]
]
3/ Process elements of array#2 as you want.
First split your string at the newlines:
var lines:Array = str.split("\n");
Now you can loop over the lines and split each of them into separate words:
for (var i:int = 0; i < lines.length; i++) {
    var words:Array = lines[i].split(" ");
    for (var j:int = 0; j < words.length; j++) {
        trace("word", words[j]);
    }
    trace("newline");
}
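The other option raised in the question, keeping the delimiter while splitting, can be sketched with a capturing group in the split pattern. The example is Python rather than AS3, and the function name words_with_breaks is hypothetical; whether AS3's split behaves the same way with capturing groups is not checked here.

```python
import re

def words_with_breaks(text):
    """Return (word, starts_new_line) pairs for every word in text."""
    # The capturing group makes re.split keep each whitespace
    # character as its own token between the words.
    tokens = re.split(r"(\s)", text)
    result = []
    starts_line = True  # the very first word also begins a line
    for token in tokens:
        if token == "\n":
            starts_line = True
        elif token and not token.isspace():
            result.append((token, starts_line))
            starts_line = False
    return result

print(words_with_breaks("This is\nan example"))
# [('This', True), ('is', False), ('an', True), ('example', False)]
```

Each word carries a flag saying whether it follows a newline, which is the information needed to place the draggable boxes on a new row.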