How can I read multiple txt files? - regex

I want to get all the words in my documents, but I have a problem with the variable file in this code.
How do I fill the elements of file with the content of the documents? This is my code:
textfilename=['example' '*' '.txt'];
Alltextfiles = dir(textfilename);
for i=1:length(Alltextfiles)
fileID (i) = fopen(Alltextfiles(i).name,'r+');
file (i) = fscanf(fileID(i), '%c',inf);
words (i) = regexp(file (i), ' ', 'split');
end

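Make file and words cell arrays, since each element holds text of a different length:
for i=1:length(Alltextfiles)
    fileID(i) = fopen(Alltextfiles(i).name,'r+');
    file{i} = fscanf(fileID(i), '%c',inf);
    words{i} = regexp(file{i}, ' ', 'split');
    fclose(fileID(i)); % close each file once it has been read
end
Also, consider splitting on '\s+' (any run of whitespace, which already includes newlines) instead of a single space; I assume your regexp is not giving you the desired output.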

You could do the following to read all the words of a file:
words = textscan(fileread(fname), '%s');
words = words{1};
words will then be an N-by-1 cell array containing all the words of the file (textscan itself returns a 1-by-1 cell that wraps this array).
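For readers following along in Python rather than MATLAB, the same idea (read every file matching a pattern and split its content on whitespace) can be sketched as below. The function name read_words is hypothetical, and the pattern 'example*.txt' is taken from the question.

```python
import glob


def read_words(pattern):
    """Return a dict mapping each matching file name to its list of words."""
    words = {}
    for name in sorted(glob.glob(pattern)):
        with open(name, "r") as handle:
            # str.split() with no argument splits on any run of whitespace,
            # including newlines and tabs.
            words[name] = handle.read().split()
    return words
```

A call such as read_words('example*.txt') would then give one word list per file, much like the words cell array above.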


Python- Writing all results from a loop to a variable

I have a .txt file with dozens of columns and hundreds of rows. I want to write the results of the entirety of two specific columns into two variables. I don't have a great deal of experience with for loops but here is my attempt to loop through the file.
a = open('file.txt', 'r') #<--This puts the file in read mode
header = a.readline() #<-- This skips the strings in the 0th row indicating the labels of each column
for line in a:
    line = line.strip() #removes '\n' characters in text file
    columns = line.split() #Splits the white space between columns
    x = float(columns[0]) # the 1st column of interest
    y = float(columns[1]) # the 2nd column of interest
    print(x, y)
a.close()
Outside of the loop, printing x or y only displays the last value of the text file. I want it to have all the values of the specified columns of the file. I know of the append command but I am unsure how to apply it in this situation within the for loop.
Does anyone have any suggestions or easier methods on how to do this?
Make two lists x and y before you start the loop and append to them in the loop:
a = open('file.txt', 'r') #<--This puts the file in read mode
header = a.readline() #<-- This skips the strings in the 0th row indicating the labels of each column
x = []
y = []
for line in a:
    line = line.strip() #removes '\n' characters in text file
    columns = line.split() #Splits the white space between columns
    x.append(float(columns[0])) # the 1st column of interest
    y.append(float(columns[1])) # the 2nd column of interest
a.close()
print('all x:')
print(x)
print('all y:')
print(y)
Your code rebinds x and y on every iteration, so after the loop they only hold the values from the last line. I'm not sure that is your entire code, but if you want to keep all the values of a column, I would suggest appending them to a list and then printing it outside of the loop.
listx = []
listy = []
a = open('openfile', 'r')
a.readline()  # skip the header
for line in a:
    columns = line.split()  # split the line
    x = float(columns[0])  # set the x and y variables
    y = float(columns[1])
    listx.append(x)
    listy.append(y)
a.close()
print(listx)  # print outside of the loop
print(listy)
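A slightly more defensive variant of the append approach is sketched below. It assumes, as in the question, a whitespace-delimited file with one header row and numeric values in the first two columns; the function name read_two_columns is hypothetical.

```python
def read_two_columns(path):
    """Return two lists holding the first and second column of a whitespace-delimited file."""
    xs, ys = [], []
    # The with statement closes the file even if a line fails to parse.
    with open(path, "r") as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            columns = line.split()
            if len(columns) < 2:
                continue  # skip blank or short lines
            xs.append(float(columns[0]))
            ys.append(float(columns[1]))
    return xs, ys
```

Calling x, y = read_two_columns('file.txt') leaves every value of both columns available after the loop has finished.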

How to insert two file.txt into one file

I have this function that takes two input .txt files, deletes the punctuation marks, and labels each sentence as pos or neg.
I would like the content of these files to be converted to lowercase
and then the two files merged into a single file named union.txt.
But my code does not work:
def extractor (feature_select):
    posFeatures = []
    negFeatures = []
    with open('positive.txt', 'r') as posSentences:
        for i in posSentences:
            posWords = re.findall(r"[\w']+|[(,.;:*##/?!$&)]", i.rstrip())
            posWords = [feature_select(posWords), 'pos']
            posFeatures.append(posWords)
    with open('negative.txt', 'r') as negSentences:
        for i in negSentences:
            negWords = re.findall(r"[\w']+|[(,.;:*##/?!$&)]", i.rstrip())
            negWords = [feature_select(negWords), 'neg']
            negFeatures.append(negWords)
    return posFeature, negFeature
filenames = [posFeature, negFeature]
with open('union.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
The problem is that filenames does not contain file names: posFeature and negFeature hold the contents already read from the input files, so open(fname) tries to open a file named after that content.
filenames = [posFeature, negFeature]  # despite the name, these are lists of entries, not file names
with open('union.txt', 'w') as outfile:
    for i in filenames:  # i is posFeature or negFeature, each a list
        for j in i:  # j is one [features, 'pos'/'neg'] entry from the list i
            outfile.write(str(j) + '\n')  # write() needs a string, so convert the entry first
There is no need to read back the contents that were already read and appended to posFeature and negFeature. The code above writes the entries of both lists directly, so the two files end up merged in union.txt.
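If the goal is only the merged, lowercased, punctuation-free text, the feature lists can be bypassed entirely. The sketch below is one way to do that under stated assumptions: the function name merge_lowercased is hypothetical, punctuation is taken to mean anything other than word characters and apostrophes, and the pos/neg labels from the question are deliberately left out.

```python
import re


def merge_lowercased(input_paths, output_path):
    """Write every line of the input files, lowercased and without punctuation, to output_path."""
    with open(output_path, "w") as outfile:
        for path in input_paths:
            with open(path, "r") as infile:
                for line in infile:
                    # Keep word characters and apostrophes, drop other punctuation.
                    words = re.findall(r"[\w']+", line.lower())
                    if words:
                        outfile.write(" ".join(words) + "\n")
```

For the files in the question this would be called as merge_lowercased(['positive.txt', 'negative.txt'], 'union.txt').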

RegEx not working in MATLAB

I have not done any RegEx work in MATLAB before. I do not think this is an environment issue, but I am not sure. Here is my task:
Download NASDAQ stock data from ftp://ftp.nasdaqtrader.com/symboldirectory/nasdaqtraded.txt
Extract all stock symbols using a RegEx
Here is the RegEx that I created: ^[A-Z]\|([A-Z]+)\|.+\|[A-Z]\|[A-Z]\|[A-Z]\|\d\d\d\|[A-Z]\|[A-Z]\|.*\|[A-Z]+$
This expression works on some, but not all lines in this file. For example, it works perfectly for this line:
- Y|AAPL|Apple Inc. - Common Stock|Q|Q|N|100|N|N||AAPL
However, it does not match anything from these lines:
- Y|A|Agilent Technologies, Inc. Common Stock|N| |N|100|N||A|A
- Y|AAMC|Altisource Asset Management Corp Com|A| |N|100|N||AAMC|AAMC
Help please...thanks!
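Your regex fails on those lines because some columns are blank: the fifth field is a single space and the ninth field is empty, while the pattern requires [A-Z] in both places. Rather than patching the regex, note that your file is a set of columns delimited with |, with the first line holding the column names.
Here is a solution that directly creates a structure array whose field names are taken from the column names: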
function [structArray] = ReadNasdaqTraded(filename)
%[
    % For debug
    if (nargin < 1), filename = 'nasdaqtraded.txt'; end
    % Read full file content
    text = fileread(filename);
    % Split on newline
    text = strsplit(strtrim(text), '\n');
    header = text{1}; % Keep header
    content = text(2:(end-1)); % Keep content
    footer = text{end}; %#ok - We don't care about last line (file creation date)
    % Build suitable field names by removing any spaces
    % (TODO: also remove special characters)
    fieldNames = strsplit(header, '|');
    fieldNames = strtrim(fieldNames);
    fieldNames = strrep(fieldNames, ' ', '');
    % Reformat content into cell matrix
    count = length(content);
    columnCount = length(fieldNames);
    cellArray = cell(count, columnCount);
    for ri = 1:count,
        % Careful not to collapse empty delimiters
        cellArray(ri, :) = strsplit(content{ri}, '|', 'CollapseDelimiters', false);
    end
    % Create structure array from cell content
    structArray = cell2struct(cellArray, fieldNames, 2);
%]
It returns some result like this:
>> ReadNasdaqTraded('nasdaqtraded.txt')
ans =
8188x1 struct array with fields:
NasdaqTraded
Symbol
SecurityName
ListingExchange
MarketCategory
ETF
RoundLotSize
TestIssue
FinancialStatus
CQSSymbol
NASDAQSymbol
From there, the structure array is easy to use for whatever extra processing you need.
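The same split-on-delimiter idea can be sketched in a few lines of Python, using the three sample lines from the question. The function name extract_symbols is hypothetical, and the check that the first field is Y or N is an assumption used here to skip the header and footer lines.

```python
def extract_symbols(lines):
    """Return the Symbol column (second field) of each pipe-delimited record."""
    symbols = []
    for line in lines:
        fields = line.strip().split("|")
        # str.split keeps empty fields, so a blank column does not shift the others.
        # Assumption: data rows start with a Y/N flag; header and footer rows do not.
        if len(fields) > 1 and fields[0] in ("Y", "N"):
            symbols.append(fields[1])
    return symbols


sample = [
    "Y|AAPL|Apple Inc. - Common Stock|Q|Q|N|100|N|N||AAPL",
    "Y|A|Agilent Technologies, Inc. Common Stock|N| |N|100|N||A|A",
    "Y|AAMC|Altisource Asset Management Corp Com|A| |N|100|N||AAMC|AAMC",
]
print(extract_symbols(sample))  # ['AAPL', 'A', 'AAMC']
```

Because the fields are located by position rather than by content, blank or single-space columns no longer matter.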

Read fields from text file and store them in a structure

I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1= fopen('chb01-summary.txt')
data=struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 ==-1
error('File cannot be opened ')
end
tline= fgetl(fid1);
while ischar(tline)
i=1;
disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline=fgetl(fid1);
time = '^Time: .*\d{2]}: \d{2} :\d{2}' ;
data{2}= regexp(tline,time);
tline=getl(fid1);
seizure = '^File: .*\d';
data{4}= regexp(tline,seizure);
if data{4}>0
stime = '^Time: .*\d{5}';
tline=getl(fid1);
data{5}= regexp(tline,seizure);
tline= getl(fid1);
data{6}= regexp(tline,seizure);
end
I tried using a loop to find the line at which file name starts with:
for (firstline<1) || (firstline>1 )
firstline= strfind(tline, 'File Name')
tline=fgetl(fid1);
end
and now I'm stumped.
Suppose that I am at the line at which the information is there, how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it the easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct with the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g., by replacing them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data =
Data_Sampling_Rate: '256 Hz'
Channel_1: 'FP1-F7'
Channel_2: 'F7-T7'
Channel_3: 'T7-P7'
Channel_4: 'P7-O1'
File_Name: 'chb01_03.edf'
File_Start_Time: '13:43:04'
File_End_Time: '14:43:04'
Number_of_Seizures_in_File: '1'
Seizure_Start_Time: '2996 seconds'
Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!
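Note that the output above keeps only the last file's values, because the field names repeat for every file. One way to keep one record per file is sketched below in Python; it reuses the same "name: value" pattern, and the names FIELD and parse_summary are hypothetical. It assumes every per-file block starts with a "File Name" line, as in the sample.

```python
import re

# Field name up to the first colon, then a non-empty value.
FIELD = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_summary(lines):
    """Return (header, files): header-level fields and one dict per 'File Name' block."""
    header, files = {}, []
    current = header
    for line in lines:
        match = FIELD.match(line)
        if not match:
            continue  # separator rows and lines without a value
        name, value = match.groups()
        key = name.replace(" ", "_")
        if key == "File_Name":
            # Assumption: each file block begins with its "File Name" line.
            current = {}
            files.append(current)
        current[key] = value
    return header, files
```

With the sample file, header would hold the sampling rate and channel fields, and files would hold two dictionaries, the second one including the seizure start and end times.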

Retrieve particular parts of string from a text file and save it in a new file in MATLAB

I am trying to retrieve particular parts of each line in a text file such as the one below, and I would like to save them to a text file in MATLAB.
Original text file
D 1m8ea_ 1m8e A: d.174.1.1 74583 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74583
D 1m8eb_ 1m8e B: d.174.1.1 74584 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=74584
D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052
D 3e7ib1 3e7i B:77-496 d.174.1.1 158053 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158053
D 2bhja1 2bhj A:77-497 d.174.1.1 128533 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=128533
So basically, I would like to retrieve the pdbcode, which looks like "1m8e", the chainid, which looks like "A", the start value, which is "77", and the stop value, which is "496", and I would like all of these values to be passed to an fprintf statement.
Is there some way to tell RegExp which position to start at, so that I can retrieve those strings based on their position in each line of the text file?
In the end, all I want to have in the fprintf statement is 1m8e, A, 77, 496.
So far I have two fopen calls, one that reads a file and one that writes to a new file, a loop that reads the input line by line, and an fprintf statement:
pdbcode = '';
chainid = '';
start = '';
stop = '';
fin = fopen('dir.cla.scop.txt_1.75.txt', 'r');
fout = fopen('output_scop.txt', 'w');
% TODO: Add error check!
while true
line = fgetl(fin); % Get the next line from the file
if ~ischar(line)
% End of file
break;
end
% Print result into output_cath.txt file
fprintf(fout, 'INSERT INTO cath_domains (scop_pdbcode, scop_chainid, scopbegin, scopend) VALUES("%s", %s, %s, %s);\n', pdbcode, chainid, start, stop);
Thank you.
You should be able to strsplit on whitespace, get the third ("1m8e") and fourth ("A:77-496") elements, then repeat the process on the fourth element using ":" as the split character, and then again on the second of those two parts using "-" as the split character. That's one approach. For example, you could do:
% split on space and tab; consecutive delimiters are collapsed by default
tokens = strsplit(line, {' ', '\t'});
pdbcode = tokens{3};
% split fourth token from previous split on colon
tokens = strsplit(tokens{4}, ':');
chainid = tokens{1};
% split second token from previous split on dash
% (lines such as "A:" have no range, so check for that case before indexing)
tokens = strsplit(tokens{2}, '-');
start = tokens{1};
stop = tokens{2};
If you really wanted to use regular expressions, you could try the following:
pattern = '\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):([0-9]+)-([0-9]+)';
% with 'once', tok is a cell array of the four captured strings (empty if there is no match)
tok = regexp(line, pattern, 'tokens', 'once');
pdbcode = tok{1};
chainid = tok{2};
start = tok{3};
stop = tok{4};
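The regular-expression approach can be checked quickly against the sample lines from the question. The Python sketch below uses the same capture groups but makes the start-stop range optional, since lines such as "A:" carry no range; the names PATTERN and parse_line are hypothetical.

```python
import re

# Third whitespace-separated field is the pdbcode, the fourth is chain[:start-stop].
PATTERN = re.compile(r"^\S+\s+\S+\s+(\S+)\s+([A-Za-z]+):(?:(\d+)-(\d+))?")


def parse_line(line):
    """Return (pdbcode, chainid, start, stop); start and stop are None when no range is given."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.groups()


line = "D 3e7ia1 3e7i A:77-496 d.174.1.1 158052 cl=53931,cf=56511,sf=56512,fa=56513,dm=56514,sp=56515,px=158052"
print(parse_line(line))  # ('3e7i', 'A', '77', '496')
```

Returning None for the missing range leaves it to the caller to decide what to write into the output file in that case.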