Matlab: regexprep with variable - regex

I have a cell array a containing a list of words to look for in a cell array b, where each occurrence should be replaced with an empty string. newB is the desired result.
The contents of a may vary depending on an input file.
I am trying to use regexprep, but it is not working well.
For example:
a = {'apple';'banana';'orange'}; % a might also contain 'watermelon', 'papaya', etc.
b = {'1 apple = 2 kiwi';'1 fig = 1 banana';'1 orange = 3 strawberry'};
newB = {' = 2 kiwi';'1 fig = ';' = 3 strawberry'};

From your example, it seems that you want to remove each listed word together with the number in front of it. The appropriate regular expression for this (for the word 'apple') is '\d+ apple'. You can build the regular expression from all the words in a using sprintf:
re = sprintf('\\d+ %s|',a{:}); %// adding | operator to select between expressions
re(end)=[]; %// discard the last '|'
The resulting regular expression is
re =
'\d+ apple|\d+ banana|\d+ orange'
Now the actual replacement:
newB = regexprep(b,re,'')
which results in
newB =
' = 2 kiwi'
'1 fig = '
' = 3 strawberry'
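If the words in a can contain characters that are special in regular expressions (an assumption; the example words do not), it may be safer to escape them before building the pattern. A minimal sketch of that variant, using regexptranslate and strjoin (the latter needs R2013a or later) and factoring the shared '\d+ ' prefix out of the alternation:

```matlab
a = {'apple'; 'banana'; 'orange'};
b = {'1 apple = 2 kiwi'; '1 fig = 1 banana'; '1 orange = 3 strawberry'};

% Escape regex metacharacters in every word (a no-op for plain words).
escaped = cellfun(@(w) regexptranslate('escape', w), a, 'UniformOutput', false);

% Build '\d+ (?:apple|banana|orange)': one number followed by any listed word.
re = ['\d+ (?:' strjoin(escaped(:)', '|') ')'];

% Remove every "<number> <word>" occurrence from each entry of b.
newB = regexprep(b, re, '');
```

For the example data this should give the same newB as the sprintf version.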

Related

How extract (changeable variable) word & number using regular expression matlab

I have more than 10k text files that look like this. All of them share the same format but differ in size; some are bigger and some are smaller.
[{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]
I want to extract two variable fields from every text file using regular expressions.
The output should look like this:
box = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = ... (sometimes there is more than one)
and, as a second output:
word = troia
word1 = ... (sometimes there is more than one word)
My code 1: for the word extraction
fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);
C = C{:};
Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));
output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');
Code 1 gives me only the field name:
utf8_string
My code 2: for the box number extraction
s = sprintf('text_.txt');
fid = fopen(s);
tline = fgetl(fid);
C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));
Code 2 runs, but not for every text file. Sometimes I get this error:
Error using cat
Dimensions of matrices being concatenated are not consistent.
If you do not insist on regexp: the input looks almost like JSON, so the following short code does even more than you want:
% Read the whole file
s = fileread('test.txt');
% Remove the odd u'
s = strrep(s, 'u''', '''');
% Replace ' by "
s = strrep(s, '''', '"');
% See http://www.mathworks.com/matlabcentral/fileexchange/20565
t = parse_json(s);
Now t is a cell array containing structs with the data, so
word = t{1}.utf8_string;
box = cell2mat(t{1}.box);
will give you the first word and box. If you have MATLAB R2016b or later, you can probably use the built-in jsondecode instead of parse_json.
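A rough sketch of the jsondecode route, assuming R2016b or later. The sample string is abridged from the question (only some fields kept). Two details here are additions rather than part of the answer above: the lookbehind (?<!\w) keeps the u-prefix removal from damaging words that end in u, and the isstruct check covers the case where all records share the same fields, in which jsondecode returns a struct array instead of a cell array. It still assumes that no string value contains an apostrophe.

```matlab
% Abridged sample in the question's format.
s = ['[{u''utf8_string'': u''troia'', ' ...
     'u''box'': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864]}, ' ...
     '{u''language'': u''na'', ' ...
     'u''box'': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]}]'];

% Drop the u prefix only where it opens a string literal, then use double quotes.
s = regexprep(s, '(?<!\w)u''', '''');
s = strrep(s, '''', '"');

t = jsondecode(s);
if isstruct(t)
    t = num2cell(t);   % normalise to a cell array of structs
end

% Collect every word and every box; not all records have a utf8_string.
words = {};
boxes = zeros(0, 4);
for k = 1:numel(t)
    if isfield(t{k}, 'utf8_string')
        words{end+1} = t{k}.utf8_string; %#ok<SAGROW>
    end
    if isfield(t{k}, 'box')
        boxes(end+1, :) = t{k}.box(:).'; %#ok<SAGROW>
    end
end
```

Here words ends up as a cell array of the legible strings and boxes as an N-by-4 numeric matrix, one row per record, so no cell2mat call is needed.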

Matching specific lengths with regexp in Matlab

A string matching question in MATLAB.
If I have a string
a = ['thehe'];
str = {'the','he'};
match = regexp(a,str);
the output is match =
[1] [1x2 double]
because it found 'he' twice and 'the' once.
How can I make it scan my string a from left to right and
match 'the' only once and 'he' only once?
To answer the explicit question: according to the documentation for regexp, you can specify the 'once' option:
a = 'thehe';
str = {'the','he'};
match = regexp(a,str, 'once');
Which returns:
match =
[1] [2]
Here match is a 1x2 cell array in which each cell holds the start index of the first match in a for the corresponding cell of str.
As I understand the somewhat ambiguous description, you want the indices of the non-overlapping occurrences of 'the' and 'he', that is, 1 and 4.
a = ['thehe'];
str = {'the';'[^t]he'};
match = regexp(a,str)
Then print the two results:
a(match{1}:match{1}+2)
ans =
the
and
a(match{2}+1:match{2}+2)
ans =
he
There is no third occurrence:
a(match{3})
??? Index exceeds matrix dimensions.
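Another way to get non-overlapping matches, sketched here under the assumption that the search words contain no regex metacharacters, is to join them into a single alternation. regexp then scans a once from left to right, and each match consumes its characters, so the 'he' inside 'the' is never reported separately:

```matlab
a = 'thehe';
str = {'the', 'he'};

% Combine the words into one pattern: 'the|he'.
pattern = strjoin(str, '|');

% Outputs are returned in the order the options are listed.
[startIdx, matched] = regexp(a, pattern, 'start', 'match');
% startIdx is [1 4] and matched is {'the', 'he'}
```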

Mysterious no-match in regular expression

Imagine I have a cell array with two filenames:
filenames{1,1} = 'SMCSx0noSat48VTFeLeakTrace.txt';
filenames{2,1} = 'SMCSx0NoSat48VTrace.txt';
I want to get the filename which starts with 'SMCSx0' and contains the filterword 'NoSat48VTrace':
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48VTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
it works, I get:
file =
'SMCSx0NoSat48VTrace.txt'
But for some reason, changing the filterword to 'noSat48VTFeLeakTrace' does not get me the other file:
%// case 2
expression = 'SMCSx0';
filterword = 'noSat48VTFeLeakTrace';
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
mask = ~cellfun(@isempty,regs);
file = filenames(mask)
which is exactly the same as before, but I get
file =
Empty cell array: 0-by-1
I have actually been using these lines in a function for months without problems. But now I have added some files to my folder that are not found, although their names are similar to the earlier ones. Any hints?
It is actually supposed to work without including Trace in the filterword, which it does in the first case; that is why I put .*\ into the regex.
%// case 1
expression = 'SMCSx0';
filterword = 'NoSat48V';
... works
'^' expression '.*\'
The \ at the end of that fragment ends up directly in front of the first letter of filterword, so with the second filterword \n is interpreted as a newline character:
^SMCSx0.*\noSat48VTFeLeakTrace.*\.txt$
This worked with the other filterword because NoSat48VTrace starts with an upper-case N, and \N is interpreted as simply N.
Get rid of the \; you don't need it.
You have an extra backslash in there, just before filterword:
regs = regexp(filenames, ['^' expression '.*\' filterword '.*\.txt$'])
                                           ^^^
                                           |||
Remove it and it should give the expected result.
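To keep pieces such as expression and filterword from being interpreted as regex syntax at all, one option is to pass them through regexptranslate('escape', ...) before concatenating. A minimal sketch with the question's file names (for these particular strings the escaping changes nothing, since they contain no special characters):

```matlab
filenames = {'SMCSx0noSat48VTFeLeakTrace.txt'; 'SMCSx0NoSat48VTrace.txt'};
expression = 'SMCSx0';
filterword = 'noSat48VTFeLeakTrace';

% Treat both pieces as literal text; only '.*' and '\.txt$' remain regex syntax.
pattern = ['^' regexptranslate('escape', expression) ...
           '.*' regexptranslate('escape', filterword) '.*\.txt$'];

% With 'once', each cell holds a start index or [] when nothing matched.
mask = ~cellfun(@isempty, regexp(filenames, pattern, 'once'));
file = filenames(mask);
```

With no stray backslash in front of filterword, the lower-case n is matched literally and the first file is found.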

regexp parsing in matlab

I have a 3x1 cell array like this:
name1 = text1
name2 = text2
name3 = text3
and I want to parse each entry into a separate 1x2 cell, for example name1, text1. Later I want to treat text1 as a string and compare it with other strings. How can I do this? I have been trying regexp with tokens, but I cannot write a proper expression for it. If someone can help me, I will be grateful!
This code
input = {'name1 = text1';
         'name2 = text2';
         'name3 = text3'};
result = cell(size(input, 1), 2);
for row = 1 : size(input, 1)
    tokens = regexp(input{row}, '(.*)=(.*)', 'tokens');
    if ~isempty(tokens)
        result(row, :) = tokens{1};
    end
end
produces the outcome
result =
'name1 ' ' text1'
'name2 ' ' text2'
'name3 ' ' text3'
Note that the whitespace around the equal sign is preserved. You can modify this behaviour by adjusting the regular expression, e.g. also try '([^\s]+) *= *([^\s]+)' giving
result =
'name1' 'text1'
'name2' 'text2'
'name3' 'text3'
Edit: Based on the comments by user1578163.
MATLAB also supports lazy (non-greedy) quantifiers. For example, the regexp '(.*?) *= *(.*)' (note the question mark after the first asterisk) also works when the text contains spaces. It will transform
input = {'my name1 = any text1';
'your name2 = more text2';
'her name3 = another text3'};
into
result =
'my name1' 'any text1'
'your name2' 'more text2'
'her name3' 'another text3'
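The loop can probably be avoided, because regexp accepts a cell array directly. The sketch below is one possible variant, not the code from the answer: the variable names entries and isMatch are made up for illustration, and the pattern trims whitespace on both sides of each part. Note that entries without an equal sign would simply be missing from result rather than leaving an empty row.

```matlab
entries = {'my name1 = any text1';
           'your name2 = more text2';
           'her name3 = another text3'};

% One call for all entries; 'once' yields a 1x2 cell of tokens per entry.
tokens = regexp(entries, '^\s*(.*?)\s*=\s*(.*?)\s*$', 'tokens', 'once');

% Stack the 1x2 token cells into an N-by-2 cell array.
result = vertcat(tokens{:});

% The second column can then be compared with other strings.
isMatch = strcmp(result(:, 2), 'more text2');
```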

How to grab a letter after ';' with regular expressions?

How can I grab a letter after ; using regular expressions? For example:
c ; d
e ; f ; m ; k ; s
import re
f = open('file.txt')
regex = re.compile(r"(?<=\; )\w+")
for line in f:
    match = regex.search(line)
    if match:
        print(match.group())
This code only grabs d and f. I need the outcome to look like:
d
f
m
k
s
Replace all occurrences of "; " with a newline character and trim the spaces from both ends of every line.
Use a regex similar to this if you want to "blacklist" the ";" character:
[;]
I don't know much about Python, but here is how you would use it in JavaScript:
var desired_chars = myString.replace(/[;]/gi, '')
Instead of regex.search use regex.findall. That'll give you a list of matches for each line which you can then manipulate and print on separate lines.