Shortcut to get a statement with certain pattern in R - regex

I have to write the following exactly as shown:
('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3,...........'trial1000' = Ozone1000)
I want to generate this with one command in R. How do I do it?
I tried using paste0.
Let us take only 5 repetitions:
paste0("trial",1:5,"= Ozone", 1:5)
I get this as result.
"trial1= Ozone1" "trial2= Ozone2" "trial3= Ozone3" "trial4= Ozone4" "trial5= Ozone5"
But this is not what I want. I want the output to come out exactly like this, with no surrounding double quotes:
('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)
In other words, the result should not be displayed as a quoted string such as "........"; I want the text exactly as written above.
How do I do it?

This will generate the string you want:
paste0('(', paste0("'trial", 1:1000, "' = Ozone", 1:1000, collapse = ', '), ')')
This will print the string without the surrounding quotes (print still shows the [1] index prefix; cat() omits it as well):
print(paste0('(', paste0("'trial", 1:10, "' = Ozone", 1:10, collapse = ', '), ')'), quote = FALSE)
I hope it answered your question...

Use the collapse argument of paste0. Inside a double-quoted R string the single quotes do not have to be escaped, although writing them as \' is also accepted:
paste0("(", paste0("\'trial",1:5,"\' = Ozone",1:5, collapse=", "), ")")
[1] "('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)"

Related

How extract (changeable variable) word & number using regular expression matlab

I have more than 10k text files that look like this. They all share the same format but differ in size; some are bigger, some smaller.
[{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]
I want to extract two variable fields from every text file using a regular expression.
The first output should look like this:
box = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = .. sometimes there is more than one box
and the second output like this:
word = troia
word1 = ... sometimes there is more than one word
My code 1: for the word extraction
fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);
C = C{:};
Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));
output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');
Code 1 returns only the text
utf8_string
My code 2: for the box number extraction
s = sprintf('text_.txt');
fid = fopen(s);
tline = fgetl(fid);
C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));
Code 2 runs, but not for every file; sometimes I get this error:
Error using cat
Dimensions of matrices being concatenated are not consistent
If you do not insist on regexp: the input looks like JSON, so the following short code does even more than you want:
% Read the whole file
s = fileread('test.txt');
% Remove the odd u'
s = strrep(s, 'u''', '''');
% Replace ' by "
s = strrep(s, '''', '"');
% See http://www.mathworks.com/matlabcentral/fileexchange/20565
t = parse_json(s);
Now t is a cell array containing structs with the data. So
word = t{1}.utf8_string;
box = cell2mat(t{1}.box);
will give you the first word and box. If you have a newer MATLAB version (R2016b or later), you can probably use jsondecode instead of parse_json.
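As a side note, the sample text is syntactically a Python literal (a list of dicts with u'' string prefixes), so if preprocessing outside MATLAB is an option, it can be parsed without any regex. The following is only a sketch under that assumption: the sample is shortened from the question, and the variable names are illustrative.

```python
import ast

# Shortened sample from the question; a real file would be read with open(path).read().
raw = (
    "[{u'language': u'english', u'utf8_string': u'troia', "
    "u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864]}, "
    "{u'language': u'na', "
    "u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]}]"
)

# literal_eval only evaluates literals (lists, dicts, strings, numbers), not arbitrary code.
records = ast.literal_eval(raw)

# Every record has a box; only legible records carry utf8_string.
boxes = [r["box"] for r in records]
words = [r["utf8_string"] for r in records if "utf8_string" in r]

print(words)
print(boxes)
```

The filter on "utf8_string" reflects the sample, where the illegible record has no word at all.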

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000), and I want to find the corresponding file name in the struct.
Please note that 1094000 has no leading 0, while the file name does. The list usually contains other numbers too. So I want to search for '1094000' and find the matching name.
I know this can be done with a regex, but I have never used one before, and I find it difficult to do for numbers rather than text using strfind. Any suggestion or other method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Before R2016b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
R2016b and later:
match = contains({files.name}, string(1094000))
files(match)
However, the strfind approach can give false matches when the number you are looking for also occurs in unexpected places, for example when looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in R2016b and later you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)
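For comparison, here is the same idea as the regexp answer (optional leading zeros, then the exact number, with the dots escaped) as a minimal Python sketch. The function name find_by_number and the third sample filename are made up for illustration.

```python
import re

def find_by_number(filenames, number):
    """Return the filenames of the form ts.<zeros><number>.crest.csv."""
    # 0* absorbs any leading zeros; re.escape guards the number text;
    # the literal dots are escaped so they do not match arbitrary characters.
    pattern = re.compile(r"ts\.0*" + re.escape(str(number)) + r"\.crest\.csv")
    # fullmatch requires the whole name to fit, so 11094000 is not a hit for 1094000.
    return [name for name in filenames if pattern.fullmatch(name)]

filenames = [
    "ts.01100600.crest.csv",
    "ts.01094000.crest.csv",
    "ts.11094000.crest.csv",  # hypothetical near-miss, not from the question
]
hits = find_by_number(filenames, 1094000)
print(hits)
```

Anchoring the whole name is what avoids the substring pitfall mentioned above for strfind.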

Regex to find substring from a string

I have strings like:
str1 = eval(sum(feat(57),feat(57),feat(66))/feat(57));
str2 = eval(sum(feat(47),feat(55),feat(86)));
str3 = eval(feat(47)/sum(feat(51),feat(52),feat(53)));
str4 = eval(feat(63)/sum(feat(57):feat(66)));
I want to write a regex to get out as:
str1_output = (feat(57),feat(57),feat(66))
str2_output = (feat(47),feat(55),feat(86))
str3_output = (feat(51),feat(52),feat(53))
str4_output = (feat(57):feat(66))
I tried it the following way:
output = re.findall(re.compile(r"sum.*"),str_name)
This gives the correct output for every string except str1.
Please suggest a way to get the desired output.
I guess you could try
sum\((?:\([^()]*\)|.)*?\)
It matches sum followed by its matching pair of parentheses and whatever is between them, allowing one level of nested parentheses inside.
Example at regex101.
Regards.
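A minimal Python sketch of this approach follows. It is a variant of the pattern above, not the answerer's exact regex: a capture group is added so that only the parenthesised part is returned, and `.` is replaced by `[^()]`. It assumes at most one level of nesting inside sum(...), as in the four sample strings; the name sum_args is illustrative.

```python
import re

# Group 1 captures "(" ... ")" after "sum", where the body is made of either
# a complete inner "(...)" group or any character that is not a parenthesis.
SUM_ARGS = re.compile(r"sum(\((?:\([^()]*\)|[^()])*\))")

def sum_args(expr):
    """Return the parenthesised argument list of the first sum(...) call, or None."""
    m = SUM_ARGS.search(expr)
    return m.group(1) if m else None

samples = [
    "eval(sum(feat(57),feat(57),feat(66))/feat(57))",
    "eval(sum(feat(47),feat(55),feat(86)))",
    "eval(feat(47)/sum(feat(51),feat(52),feat(53)))",
    "eval(feat(63)/sum(feat(57):feat(66)))",
]
outputs = [sum_args(s) for s in samples]
for out in outputs:
    print(out)
```

The original attempt `sum.*` fails on str1 because `.*` is greedy and runs to the end of the string, swallowing `/feat(57))` as well.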

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB. I am able to do it for a fixed length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp, but if you stick to MATLAB you can use the cellfun function to avoid an explicit loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( @(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %'// Horchler simplification of the previous solution below
%// this does the same as above but is more convoluted
%// strout = cellfun( @(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in a comment, although cellfun lets you compact your code, it is just a loop in disguise. Moreover, since it operates on cells, it is notoriously slow. You won't see the difference on such simple inputs, but reserve this approach for cases where performance is not a major concern.
Now, if you like regex you must like black-magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse the cellfun capabilities to obscure your code and do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can have all your substring with:
strout = cellfun( @(s)cellfun(@(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(@(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array, with one cell for each group of substrings. Each of those cells is itself a column cell array, so to concatenate them all into a single list, stack them vertically: strall = cat(1,strout{:});
I haven't had much experience with MATLAB, but your problem can be solved with a simple string split:
parts = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true );
Now take the last element of parts and iterate over the elements before it:
len = length( parts );
for i = 1:len-1
    disp( [parts{i} ' = ' parts{len}] )
end
Here disp simply prints each line; replace it with whatever output handling you need.
There isn't a single regex that will cover all the cases, as explained in this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
1) Get all the values in the line with regexp, pick the last value, then use a for loop over the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
2) If you want to use regexprep at all costs, write a pattern generator and a replacement-expression generator based on the number of '=' in the input string, and pass their results as arguments to regexprep.
3) Forget about regex and split the input, then loop over the values to generate the output (similar to alternative 1).
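The split-and-loop alternative is language independent. As an illustration only, here is a minimal Python sketch of it; the function name expand_chain is made up, and it assumes each statement has the form X=Y=...=Z; with a single trailing semicolon.

```python
def expand_chain(statement):
    """Turn 'a=b=c=d;' into ['a=d;', 'b=d;', 'c=d;']."""
    # Drop surrounding whitespace and the trailing ';', then split on '='.
    names = [part.strip() for part in statement.strip().rstrip(";").split("=")]
    last = names[-1]
    # Every name except the last is assigned the last one.
    return [f"{name}={last};" for name in names[:-1]]

statements = ["a=b=c=d;", "a=b;", "a=b=c=d=e=f;"]
expanded = [line for s in statements for line in expand_chain(s)]
for line in expanded:
    print(line)
```

A statement with a single '=' passes through unchanged, which matches the a=b; case in the question.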

Simplify regular expression

I want to simplify this regular expression:
0*|0*1(ε|0*1)*00*
I used this identity:
(R+S)*=(R*S*)*=(R*+S*)*
and couldn't get better than this:
0*|0*1(0*1)*00* [(ε|0*1)*=(ε*0*1)*=(0*1)*]
Can this regular expression be simplified even more, and how? I have no clue what else to do. :)
EDIT 1: I changed + to |, because + could also be read as "one or more times"; alternation is now denoted by |.
Explanation of notation:
1) ε stands for empty word
2) * is Kleene star
3) AB is just a concatenation of languages of regular expressions A and B.
EDIT 2: Formal proof that this reduces to (0*1)*0+|ε:
0*|0*1(ε|0*1)*00* =
= 0*|0*1(0*1)*0+ =
= 0*|(0*1)+0+ =
= 0+|ε|(0*1)+0+ =
= ε0+|(0*1)+0+|ε
= (ε|(0*1)+)0+|ε
= (0*1)*0+|ε
Is there any way to reduce it further to (0|1)*0|ε?
I think it reduces to this: (0*1)*0+|ε
(Update: See edit history for long, sad story of previous incorrect attempts).
I (now) believe this reduces to:
ε|(0|1)*0
in other words, either:
The empty string
Any string of ones and zeros ending in 0
Proving this is another matter altogether. ;-)
I managed to formally reduce given regular expression to ε|(0|1)*0.
This is the proof:
0*|0*1(ε|0*1)*00* =
= 0*|0*1(0*1)*0+ =
= 0*|(0*1)+0+ =
= 0+|ε|(0*1)+0+ =
= ε0+|(0*1)+0+|ε =
= (ε|(0*1)+)0+|ε =
= (0*1)*0+|ε =
= (0*1)*0*0|ε = #
= (0|1)*0|ε
The trick was to use the identity (A*B)*A* = (A|B)* of which I wasn't aware when the question was asked, in the step marked with #.
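The derivation can be sanity-checked by brute force. The sketch below is not a proof; it only compares three forms on every binary string up to a chosen length, using Python's re syntax. Two translations are assumptions of the sketch: (ε|0*1)* is written as (?:0*1)*, following the asker's own first reduction, and "X|ε" is written as (?:X)?.

```python
import re
from itertools import product

# Three stages of the derivation, translated to Python regex syntax.
original = re.compile(r"0*|0*1(?:0*1)*00*")      # 0*|0*1(0*1)*00*
intermediate = re.compile(r"(?:(?:0*1)*0+)?")    # (0*1)*0+|ε
simplified = re.compile(r"(?:(?:0|1)*0)?")       # (0|1)*0|ε

def binary_strings(max_len):
    """Yield every string over {0,1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def disagreements(max_len=10):
    """Return the strings on which the three patterns do not all agree."""
    bad = []
    for s in binary_strings(max_len):
        verdicts = {bool(p.fullmatch(s)) for p in (original, intermediate, simplified)}
        if len(verdicts) != 1:
            bad.append(s)
    return bad

mismatches = disagreements()
print(mismatches)
```

An empty list means the three expressions accept exactly the same strings up to length 10, consistent with the language being "the empty string, or any binary string ending in 0".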