How to extract variable words and numbers using regular expressions in MATLAB - regex

I have more than 10k text files that look like this. They all share the same format but differ in size; some are bigger, some smaller.
[{u'language': u'english', u'area': 3825.8953168044045, u'class': u'machine printed', u'utf8_string': u'troia', u'image_id': 428035, u'box': [426.42422762784093, 225.33333055900806, 75.15151515151516, 50.909090909090864], u'legibility': u'legible', u'id': 1056659}, {u'language': u'na', u'area': 24201.285583103767, u'id': 1056660, u'image_id': 428035, u'box': [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606], u'legibility': u'illegible', u'class': u'machine printed'}]
I want to extract two variable fields from every file using regular expressions.
The output should look like this:
box = [223.99998520359847, 249.57575480143228, 172.12121212121215, 140.6060606060606]
box1 = ... (sometimes there is more than one box)
and, as a second output:
word = troia
word1 = ... (sometimes there is more than one word)
My code 1: for the word extraction
fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);
C = C{:};
Lia = ~cellfun(@isempty, strfind(C,'utf8_string'));
output = [C{find(Lia)}];
expression = 'u''utf8_string'': u+'
matchStr = regexp(output, expression,'match');
Code 1 gives me only the literal text:
utf8_string
My code 2: for the box number extraction
s = sprintf('text_.txt');
fid = fopen(s);
tline = fgetl(fid);
C = regexp(tline,'u''box'': +\[([0-9\. ,]+)\]','tokens');
C = cellfun(@(x) x{1},C,'UniformOutput',false)';
M = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));
Code 2 runs, but not for every file; sometimes I get this error:
Error using cat Dimensions of matrices being concatenated are not consistent

If you do not insist on regexp: the input looks almost like JSON, so the following short code does even more than you want:
% Read the whole file
s = fileread('test.txt');
% Remove the odd u'
s = strrep(s, 'u''', '''');
% Replace ' by "
s = strrep(s, '''', '"');
% See http://www.mathworks.com/matlabcentral/fileexchange/20565
t = parse_json(s);
Now t is a cell array containing structs with the data. So
word = t{1}.utf8_string;
box = cell2mat(t{1}.box);
will give you the first word and box. If you have a newer Matlab version you can probably use jsondecode instead of parse_json.
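If you do want to stay with regexp, here is a minimal sketch. It assumes the file content has been read into a single string (for example with fileread) and that words never contain a single quote; the sample string s is shortened from the question. Keeping the boxes in a cell array avoids concatenating results of different sizes.

```matlab
% Shortened sample in the same format as the question's files
s = ['[{u''utf8_string'': u''troia'', u''box'': [426.4, 225.3, 75.1, 50.9]}, ' ...
     '{u''language'': u''na'', u''box'': [223.9, 249.5, 172.1, 140.6]}]'];

% Words: capture the text between the quotes that follow u'utf8_string': u'
wordTok = regexp(s, 'u''utf8_string'':\s*u''([^'']*)''', 'tokens');
words = cellfun(@(c) c{1}, wordTok, 'UniformOutput', false);

% Boxes: capture the text inside [...] that follows u'box':
boxTok = regexp(s, 'u''box'':\s*\[([^\]]*)\]', 'tokens');

% Convert each captured "a, b, c, d" into a numeric row vector
boxes = cellfun(@(c) str2double(strsplit(c{1}, ',')), boxTok, ...
                'UniformOutput', false);
```

words is then a cell array with one entry per utf8_string found, and boxes{k} is the k-th box as a row vector, however many there are in a file.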

Related

RegEx not working in MATLAB

I have not done any RegEx work in MATLAB before. I do not think this is an environment issue, but I am not sure. Here is my task:
Download NASDAQ stock data from ftp://ftp.nasdaqtrader.com/symboldirectory/nasdaqtraded.txt
Extract all stock symbols using a RegEx
Here is the RegEx that I created: ^[A-Z]\|([A-Z]+)\|.+\|[A-Z]\|[A-Z]\|[A-Z]\|\d\d\d\|[A-Z]\|[A-Z]\|.*\|[A-Z]+$
This expression works on some, but not all lines in this file. For example, it works perfectly for this line:
- Y|AAPL|Apple Inc. - Common Stock|Q|Q|N|100|N|N||AAPL
However, it does not match anything in these lines:
- Y|A|Agilent Technologies, Inc. Common Stock|N| |N|100|N||A|A
- Y|AAMC|Altisource Asset Management Corp Com|A| |N|100|N||AAMC|AAMC
Help please...thanks!
Your file seems to be a set of columns delimited with |, with the first line holding the column names.
Here is a solution that directly creates a structure array whose field names are taken from the column names:
function [structArray] = ReadNasdaqTraded(filename)
%[
% For debug
if (nargin < 1), filename = 'nasdaqtraded.txt'; end
% Read full file content
text = fileread(filename);
% Split on newline
text = strsplit(strtrim(text), '\n');
header = text{1}; % Keep header
content = text(2:(end-1)); % Keep content
footer = text{end}; %#ok - We don't care about last line (file creation date)
% Build suitable field names
fieldNames = strsplit(header, '|');
fieldNames = strtrim(fieldNames); % Remove any
fieldNames = strrep(fieldNames, ' ', ''); % spaces (TODO: OR special characters)
% Reformat content into cell matrix
count = length(content);
columnCount = length(fieldNames);
cellArray = cell(count, columnCount);
for ri = 1:count,
cellArray(ri, :) = strsplit(content{ri}, '|', 'CollapseDelimiters', false); % Careful not to collapse empty delimiters
end
% Create structure array from cell content
structArray = cell2struct(cellArray, fieldNames, 2);
%]
It returns some result like this:
>> ReadNasdaqTraded('nasdaqtraded.txt')
ans =
8188x1 struct array with fields:
NasdaqTraded
Symbol
SecurityName
ListingExchange
MarketCategory
ETF
RoundLotSize
TestIssue
FinancialStatus
CQSSymbol
NASDAQSymbol
The result is then easy to use for whatever extra processing you need.
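As for why the original expression fails: in the two non-matching lines the fifth field is a single space and the ninth field is empty, and neither is matched by [A-Z]. If only the symbols are needed, a shorter expression that captures just the second field may be enough. This is a sketch using the three sample lines from the question; the variable names are illustrative.

```matlab
% The three sample lines quoted in the question
lines = {'Y|AAPL|Apple Inc. - Common Stock|Q|Q|N|100|N|N||AAPL', ...
         'Y|A|Agilent Technologies, Inc. Common Stock|N| |N|100|N||A|A', ...
         'Y|AAMC|Altisource Asset Management Corp Com|A| |N|100|N||AAMC|AAMC'};

% Anchor at the start of the line, skip the first one-letter field,
% then capture everything up to the next | as the symbol
tok = regexp(lines, '^[A-Z]\|([^|]+)\|', 'tokens', 'once');

% Each element of tok is a 1x1 cell holding the captured symbol
symbols = cellfun(@(c) c{1}, tok, 'UniformOutput', false);
```

Header or footer lines that do not match would leave an empty cell in tok, so they would need to be filtered out before the cellfun call.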

Matlab: regexprep with variable

I have an array a, a list of identified words, that should be matched in an array b and replaced by the empty string. newB is the expected result.
The value of a might vary according to an input file.
I am trying to use regexprep but it is not working well.
e.g.:
a = {'apple';'banana';'orange'}; % a might be also ‘watermelon’, ‘papaya’ etc
b = {'1 apple = 2 kiwi';'1 fig = 1 banana';'1 orange = 3 strawberry'};
newB = {' = 2 kiwi';'1 fig = ';' = 3 strawberry'};
From your example it seems like you want to remove a special word and a number, the appropriate regular expression for this is (for word = 'apple'): '\d+ apple'. Building the regular expression from all the words in a, using sprintf:
re = sprintf('\\d+ %s|',a{:}); %// adding | operator to select between expressions
re(end)=[]; %// discard the last '|'
The resulting regular expression is
re =
'\d+ apple|\d+ banana|\d+ orange'
Now the actual replacement:
newB = regexprep(b,re,'')
which results in
newB =
' = 2 kiwi'
'1 fig = '
' = 3 strawberry'
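Since the words come from an input file, they might contain characters that are special in regular expressions (such as . or +). A variant of the same idea, as a sketch: escape each word with regexptranslate and join the alternatives with strjoin, so there is no trailing | to remove.

```matlab
a = {'apple'; 'banana'; 'orange'};
b = {'1 apple = 2 kiwi'; '1 fig = 1 banana'; '1 orange = 3 strawberry'};

% Escape regex metacharacters in every word
escaped = cellfun(@(w) regexptranslate('escape', w), a, 'UniformOutput', false);

% Build '\d+ (?:apple|banana|orange)'; strjoin expects a row cell array
re = ['\d+ (?:' strjoin(escaped(:).', '|') ')'];

% Remove "<number> <word>" wherever one of the words occurs
newB = regexprep(b, re, '');
```

For the sample data this gives the same newB as above.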

Shortcut to get a statement with certain pattern in R

I have to write the following as it is.
('trial1' = Ozone1, 'trial2' = Ozone2, trial3 = Ozone3,...........trial1000 = Ozone1000)
I want to write this with one command in R. How do I do it?
I tried it using paste0
Let us take only 5 as number of repetitions:
paste0("trial",1:5,"= Ozone", 1:5)
I get this as result.
"trial1= Ozone1" "trial2= Ozone2" "trial3= Ozone3" "trial4= Ozone4" "trial5= Ozone5"
But this is not what I want. I want the output exactly as written below, without the surrounding double quotes:
('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)
As you can see, it is not shown as a string, i.e. the output should not appear between double quotes as "........".
How do I do it?
This will generate the string you want:
paste0('(',paste0("'trial",1:1000,"' = Ozone",1:1000,collapse=', '),')')
This will print the string without quotes:
print(paste0('(',paste0("'trial",1:10,"' = Ozone",1:10,collapse=', '),')'), quote=FALSE)
I hope it answered your question...
Use the collapse argument of paste0. Escaping the single quotes (\') is optional inside a double-quoted R string, but harmless:
paste0("(", paste0("\'trial",1:5,"\' = Ozone",1:5, collapse=", "), ")")
[1] "('trial1' = Ozone1, 'trial2' = Ozone2, 'trial3' = Ozone3, 'trial4' = Ozone4, 'trial5' = Ozone5)"

Detect specific number using regexp

Any ideas how to detect the 3 in (>3<) and not the 3 in (rank_value_3_months)?
"<span data-bind-domain="rank_value_3_months">3</span>"
rank(i) = str2double(regexp(CharData7,'>(\d)<','match','once'))
Here is my whole code for this part. I would like to detect the number inside (>number<) in the pre-processed file:
%function [feature7] = f7(data)
for i = 1:1
%start read html file
data2=fopen(strcat('DATA\WHOIS\TR\',int2str(i),'.htm'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
%end read html file
register_date = regexp(CharData, '<span data-bind- domain="rank_value_3_months">.*?/span>', 'match'); %checking
%start write only http in image file
fid = fopen(strcat('DATA\PRE-PROCESS_DATA\F23_TR\f23_TR_pdata_',int2str(i)),'w');
for col = 1:numel(register_date)
fprintf(fid,'%s\n',register_date{:,col});
end
fclose(fid);
%end write only http in image file
s = dir(strcat('DATA\PRE-PROCESS_DATA\F23_TR\','f23_TR_pdata_', int2str(i)));
disp(s.bytes);
if s.bytes ~= 0
data7=fopen(strcat('DATA\PRE-PROCESS_DATA\F23_TR\f23_TR_pdata_',int2str(i),''),'r')
CharData7 = fread(data7, '*char')'; %read text file and store data in CharData
fclose(data7);
rank(i) = str2double(regexp(CharData7,'>(\d)<','tokens','once') )
else
end
if rank(i)~=0
feature23(i)=-1;
else
feature23(i)=1;
end
end
Assuming CharData7 is a cell array, you can try this:
%// The find
%// - use 'tokens' to return just the part in brackets
%// - use \s* to make spacing flexible (which is also valid XML/HTML)
tok = regexp(CharData7, '>\s*(\d)\s*<', 'tokens', 'once');
%// Re-format into flat cells
%// ('tokens' returns ALL tokens, which is therefore a cell, regardless
%// of the 'once' setting)
tok = [tok{:}];
%// and convert everything to double
rank(i) = str2double(tok)
So, in a nice illegible one-liner:
rank(i) = str2double([builtin('_brace', regexp(CharData7,'>\s*(\d)\s*<','tokens','once'), :)]);
In case CharData7 is just a single string, you can skip the cell-flattening step:
rank(i) = str2double( regexp(CharData7,'>\s*(\d)\s*<','tokens','once') )
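For the single-string case, a small self-contained sketch on the sample line from the question. It uses \d+ instead of \d on the assumption that the rank can have more than one digit, and falls back to NaN when nothing matches; html and rankValue are illustrative names.

```matlab
html = '<span data-bind-domain="rank_value_3_months">3</span>';

% Only digits directly enclosed by > and < are captured, so the 3 inside
% the attribute value rank_value_3_months is not picked up
tok = regexp(html, '>\s*(\d+)\s*<', 'tokens', 'once');

if isempty(tok)
    rankValue = NaN;               % no >number< found
else
    rankValue = str2double(tok{1});
end
```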

Retrieve particular parts of string from a text file and save it in a new file in MATLAB

I am trying to retrieve particular parts of a string in a text file such as the one below, and I would like to save them in a text file in MATLAB.
Original text file
D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74583
D 1m8eb_ 1m8e B: d.174.1.1 74584 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74584
D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052
D 3e7ib1 3e7i B:77-496 d.174.1.1 158053 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158053
D 2bhja1 2bhj A:77-497 d.174.1.1 128533 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=128533
So basically, I would like to retrieve the pdbcode, e.g. "1m8e", the chainid, e.g. "A", the start value, e.g. "77", and the stop value, e.g. "496", and I would like all of these values to be passed to an fprintf statement.
Is there some way to use RegExp to state which index each value starts at, and to retrieve those strings based on their position in each line of the text file?
In the end, all I want to have in the fprintf statement is 1m8e, A, 77, 496.
So far I have two fopen calls, one that reads a file and one that writes to a new file, a loop that reads line by line, and an fprintf statement:
pdbcode = '';
chainid = '';
start = '';
stop = '';
fin = fopen('dir.cla.scop.txt_1.75.txt', 'r');
fout = fopen('output_scop.txt', 'w');
% TODO: Add error check!
while true
line = fgetl(fin); % Get the next line from the file
if ~ischar(line)
% End of file
break;
end
% Print result into output_scop.txt file
fprintf(fout, 'INSERT INTO cath_domains (scop_pdbcode, scop_chainid, scopbegin, scopend) VALUES("%s", %s, %s, %s);\n', pdbcode, chainid, start, stop);
end
fclose(fin);
fclose(fout);
Thank you.
You should be able to strsplit on whitespace, get the third ("1m8e") and fourth elements ("A:77-496"), then repeat the process on the fourth element using ":" as the split character, and then again on the second of those two arguments using "-" as the split character. That's one approach. For example, you could do:
% split on whitespace (the default delimiter); repeated delimiters are collapsed
tokens = strsplit(strtrim(line));
pdbcode = tokens{3};
% split fourth token from previous split on colon
parts = strsplit(tokens{4}, ':');
chainid = parts{1};
% split second token from previous split on dash
% (lines such as "A:" carry no range, so start and stop are left unchanged there)
if ~isempty(parts{2})
range = strsplit(parts{2}, '-');
start = range{1};
stop = range{2};
end
If you really want to use regular expressions, you could try the following:
pattern = '\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):([0-9]+)-([0-9]+)';
% 'once' returns the tokens of the first match as a flat cell array
tok = regexp(line, pattern, 'tokens', 'once');
if ~isempty(tok) % lines without a start-stop range do not match
pdbcode = tok{1};
chainid = tok{2};
start = tok{3};
stop = tok{4};
end
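Note that some lines in the sample have no range after the chain id ("A:" instead of "A:77-496"). A sketch of a pattern that accepts both forms, by letting the start and stop groups match an empty string; the two sample lines are shortened from the question and parsed is an illustrative name.

```matlab
% Two sample lines: one without and one with a start-stop range
lines = {'D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,px=74583', ...
         'D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,px=158052'};

% Groups 3 and 4 use \d* so they match an empty string when the range is absent
pattern = '^\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):(\d*)-?(\d*)';

parsed = cell(numel(lines), 4);    % columns: pdbcode, chainid, start, stop
for k = 1:numel(lines)
    tok = regexp(lines{k}, pattern, 'tokens', 'once');
    if ~isempty(tok)
        parsed(k, :) = tok;        % tok is a 1x4 cell of character vectors
    end
end
```

Inside the reading loop, the same pattern can be applied to each line returned by fgetl, with empty start and stop values handled before the fprintf call.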