I have this txt file:
BLOCK_START_DATASET
dlcdata L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Parameterfiles\Bladed4.2\DLC-Files\DLCDataFile.txt
simulationdata L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Parameterfiles\Bladed4.2\DLC-Files\BladedFile.txt
outputfolder Pfadangabe\runs_test
windfolder L:\loads2\WEC\1002_50-2\_calc\50-2_D135_HH95_RB-AB66-0O_GL2005_towerdesign_Bladed_v4-2_revA01\_wind
referenzfile_servesea L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\Referencefiles\Bladed4.2\DLC\dlc1-1_04a1.$PJ
referenzfile_generalsea L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\Referencefiles\Bladed4.2\DLC\dlc6-1_000_a_50a_022.$PJ
externalcontrollerdll L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\external_Controller\DisCon_V3_2_22.dll
externalcontrollerparameter L:\loads\confidential\000_Loads_Analysis_Environment\Tools\releases\01_Preprocessor\Version_3.0\Dataset_to_start\external_Controller\ext_Ctrl_Data_V3_2_22.txt
BLOCK_END_DATASET
% ------------------------------------
BLOCK_START_WAVE
% a6*x^6 + a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
factor_hs 0.008105;0.029055;0.153752
factor_tz -0.029956;1.050777;2.731063
factor_tp -0.118161;1.809956;3.452903
spectrum_gamma 3.3
BLOCK_END_WAVE
% ------------------------------------
BLOCK_START_EXTREMEWAVE
height_hs1 7.9
period_hs1 11.8
height_hs50 10.8
period_hs50 13.8
height_hred1 10.43
period_hred1 9.9
height_hred50 14.26
period_hred50 11.60
height_hmax1 14.8
period_hmax1 9.9
height_hmax50 20.1
period_hmax50 11.60
BLOCK_END_EXTREMEWAVE
% ------------------------------------
BLOCK_START_TIDE
normal 0.85
yr1 1.7
yr50 2.4
BLOCK_END_TIDE
% ------------------------------------
BLOCK_START_CURRENT
velocity_normal 1.09
velocity_yr1 1.09
velocity_yr50 1.38
BLOCK_END_CURRENT
% ------------------------------------
BLOCK_START_EXTREMEWIND
velocity_v1 29.7
velocity_v50 44.8
velocity_vred1 32.67
velocity_vred50 49.28
velocity_ve1 37.9
velocity_ve50 57
velocity_Vref 50
BLOCK_END_EXTREMEWIND
% ------------------------------------
Currently I'm parsing it this way:
clc, clear all, close all
%Find all row headers
fid = fopen('test_struct.txt','r');
row_headers = textscan(fid,'%s %*[^\n]','CommentStyle','%','CollectOutput',1);
row_headers = row_headers{1};
fclose(fid);
%Find all attributes
fid1 = fopen('test_struct.txt','r');
attributes = textscan(fid1,'%*s %s','CommentStyle','%','CollectOutput',1);
attributes = attributes{1};
fclose(fid1);
%Collect row headers and attributes in a single cell
parameters = [row_headers,attributes];
%Find all the blocks
startIdx = find(~cellfun(@isempty, regexp(parameters, 'BLOCK_START_', 'match')));
endIdx = find(~cellfun(@isempty, regexp(parameters, 'BLOCK_END_', 'match')));
assert(all(size(startIdx) == size(endIdx)))
%Extract fields between BLOCK_START_ and BLOCK_END_
extract_fields = @(n)(parameters(startIdx(n)+1:endIdx(n)-1,1));
struct_fields = arrayfun(extract_fields, 1:numel(startIdx), 'UniformOutput', false);
%Extract attributes between BLOCK_START_ and BLOCK_END_
extract_attributes = @(n)(parameters(startIdx(n)+1:endIdx(n)-1,2));
struct_attributes = arrayfun(extract_attributes, 1:numel(startIdx), 'UniformOutput', false);
%Get structure names stored after each BLOCK_START_
structures_name = @(n) strrep(parameters{startIdx(n)},'BLOCK_START_','');
structure_names = genvarname(arrayfun(structures_name,1:numel(startIdx),'UniformOutput',false));
%Generate structures
for i=1:numel(structure_names)
eval([structure_names{i} '=cell2struct(struct_attributes{i},struct_fields{i},1);'])
end
It works, but not the way I want. The overall idea is to read the file into one structure (one field per BLOCK_START / BLOCK_END block). Furthermore, I would like the numbers to be read as double rather than char, and delimiters such as whitespace, "," or ";" to be treated as array separators (e.g. 3;4;5 becomes [3;4;5], and so on).
To clarify, take the following block as an example:
BLOCK_START_WAVE
% a6*x^6 + a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0
factor_hs 0.008105;0.029055;0.153752
factor_tz -0.029956;1.050777;2.731063
factor_tp -0.118161;1.809956;3.452903
spectrum_gamma 3.3
BLOCK_END_WAVE
The structure will be called WAVE with
WAVE.factor_hs = [0.008105;0.029055;0.153752]
WAVE.factor_tz = [-0.029956;1.050777;2.731063]
WAVE.factor_tp = [-0.118161;1.809956;3.452903]
WAVE.spectrum_gamma = 3.3
Any suggestion will be greatly appreciated.
Best regards.
You have the answers to your earlier question as a good starting point! To extract everything into a cell array, you do:
%# Read data from input file
fd = fopen('test_struct.txt', 'rt');
C = textscan(fd, '%s', 'Delimiter', '\r\n', 'CommentStyle', '%');
fclose(fd);
%# Extract indices of start and end lines of each block
start_idx = find(~cellfun(@isempty, regexp(C{1}, 'BLOCK_START', 'match')));
end_idx = find(~cellfun(@isempty, regexp(C{1}, 'BLOCK_END', 'match')));
assert(all(size(start_idx) == size(end_idx)))
%# Extract blocks into a cell array
extract_block = @(n)({C{1}{start_idx(n):end_idx(n) - 1}});
cell_blocks = arrayfun(extract_block, 1:numel(start_idx), 'Uniform', false);
Now, to translate that into corresponding structs, do this:
%# Iterate over each block and convert it into a struct
for i = 1:length(cell_blocks)
%# Extract the block
C = strtrim(cell_blocks{i});
C(cellfun(@isempty, C)) = []; %# Ignore empty lines
%# Parse the names and values
params = cellfun(@(s)textscan(s, '%s%s'), {C{2:end}}, 'Uniform', false);
name = strrep(C{1}, 'BLOCK_START_', ''); %# Struct name
fields = cellfun(@(x)x{1}{:}, params, 'Uniform', false);
values = cellfun(@(x)x{2}{:}, params, 'Uniform', false);
%# Create a struct
eval([name, ' = cell2struct(values, fields, 2)'])
end
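To also meet the numeric part of the question (values as double, with ';', ',' or whitespace as array separators) and to gather everything into one top-level struct instead of eval'ing separate variables, the loop body above could be extended along these lines. This is only a sketch: it reuses name, fields and values from the code above, and assumes a container struct DATA = struct() is created before the loop.
to_num = @(s) str2double(regexp(strtrim(s), '[;,\s]+', 'split')).'; %# '3;4;5' -> [3;4;5]
blk = struct();
for j = 1:numel(fields)
    v = to_num(values{j});
    if any(isnan(v)), v = values{j}; end %# keep non-numeric values (the DATASET paths) as char
    blk.(fields{j}) = v;
end
DATA.(name) = blk; %# e.g. DATA.WAVE.factor_hs = [0.008105; 0.029055; 0.153752]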
Well, I've never used MATLAB, but you could use the following regex to find a block:
/BLOCK_START_(\w+).*?BLOCK_END_\1/s
Then for each block, find all the attributes:
/^(?!BLOCK_END_)(\w+)\s+((?:-?\d+\.?\d*)(?:;(?:-?\d+\.?\d*))*)/m
Then, based on the presence of semicolons in the second submatch, you could assign it as either a single- or multiple-value variable. I'm not sure how to translate that into MATLAB, but I hope this helps!
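For reference, roughly the same idea written in MATLAB could look like this (just a sketch, with the value pattern simplified to "rest of the line"; it assumes the whole file is read into one char array with fileread):
txt = fileread('test_struct.txt');
blocks = regexp(txt, 'BLOCK_START_(\w+)(.*?)BLOCK_END_\1', 'tokens'); % one {name, body} pair per block
[name, body] = blocks{1}{:};
pairs = regexp(body, '^(\w+)[ \t]+([^\r\n]+)', 'tokens', 'lineanchors'); % {field, value-string} pairs; comment lines are skipped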
Related
ini_list = "[('G 02', 'UV', '2.73')]"
res = ini_list.strip('[]')
print(res)
('G 02', 'UV', '2.73')
result = res.strip('()')
print(result)
'G 02', 'UV', '2.73'
I have a list: 'G 02', 'UV', '2.73' and I would like to assign variables to this list
so that the outcome is as follows:
Element = G 02
Reason = UV
Time = 2.73
I have numerous lists containing those parameters that I would like to use later to plot various things, so I would like to extract each parameter from the list and associate it with a specific variable.
I tried to do it by:
Results = res
for index, Parameters in enumerate(Results):
    element = Parameters[0]
    print(element)
in the hope that I could extract each item from the list and assign it to a variable as mentioned above. However, when I print element, the list prints vertically downwards, one character at a time, and it also doesn't let me extract individual indexes:
'
G
0
2
'
,
'
U
V
'
,
'
2
.
7
3
'
How do I get it to assign a variable to each parameter, as mentioned above, so that it prints like this:
element = G 02
reason = UV
time = 2.73
If ini_list is a string such as ini_list = "[('G 02', 'UV', '2.73')]", then you need the strip and split methods to do the work, as follows:
ini_list = "[('G 02', 'UV', '2.73')]"
res = ini_list.strip('[]')
result = res.strip('()')
result1 = result.split(',')
result1=[x.strip(" ") for x in result1]
result1=[x.strip("''") for x in result1]
element = result1[0]
print("element:",element)
reason =result1[1]
print("reason:",reason)
time=result1[2]
print("time:",time)
output:
element: G 02
reason: UV
time: 2.73
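As a side note (not part of the answer above): since the string is a valid Python literal, you could also let ast.literal_eval do the parsing and skip the manual stripping entirely:
import ast

ini_list = "[('G 02', 'UV', '2.73')]"
element, reason, time = ast.literal_eval(ini_list)[0]  # the first (and only) tuple in the list
print("element:", element)  # element: G 02
print("reason:", reason)    # reason: UV
print("time:", time)        # time: 2.73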
I tried to export Tickstory data to a CSV file, but it came out as bi5 files.
Anyway, I am now trying to decode the bi5 files.
I tried decompressing them with 7-Zip, which only produced garbage.
What should I use to decode a bi5 file, and how can I do it in C#?
You should find a CSV file in the Tickstory directory; it's created by default along with the bi5 files.
The Bi5 format background
The size of one tick record in a bi5 file is 5 x 4 bytes, and each field is stored big-endian (which is why the byte ranges in the code below are reversed before conversion).
The structure of a bi5 record is as follows:
1st 4 bytes -> Time part of the timestamp
2nd 4 bytes -> Ask
3rd 4 bytes -> Bid
4th 4 bytes -> Ask Volume
5th 4 bytes -> Bid Volume
Code Sample
So the code that translates bi5 records to C# primitives could look like this:
byte[] bytes = new byte[0]; // placeholder for the raw tick data read from the bi5 file
var date = DateTime.Now.AddYears(-1); // the start date/hour that the tick data covers
var i = 0;
var decimals = 5; // for JPY pairs and some other CFDs this can be 3 or another value
var milliseconds = BitConverter.ToInt32(
bytes[new Range(new Index(i), new Index(i + 4))]
.Reverse().ToArray());
var i1 = BitConverter.ToInt32(bytes[new Range(i + 4, i + 8)].Reverse().ToArray());
var i2 = BitConverter.ToInt32(bytes[new Range(i + 8, i + 12)].Reverse().ToArray());
var f1 = BitConverter.ToSingle(bytes[new Range(i + 12, i + 16)].Reverse().ToArray());
var f2 = BitConverter.ToSingle(bytes[new Range(i + 16, i + 20)].Reverse().ToArray());
// resulting data
var tickTimestamp = new DateTime(date.Year, date.Month, date.Day,
date.Hour, 0, 0).AddMilliseconds(milliseconds);
var Ask = i1 / Math.Pow(10, decimals);
var Bid = i2 / Math.Pow(10, decimals);
var AskVolume = f1;
var BidVolume = f2;
Working sample on github
I have a column in a data frame with characters that need to be split into columns. My code seems to break when the string in the column has a length of 12 but it works fine when the string has a length of 11.
S99.ABCD{T}
S99.ABCD{V}
S99.ABCD{W}
S99.ABCD{Y}
Q100.ABCD{A}
Q100.ABCD{C}
Q100.ABCD{D}
Q100.ABCD{E}
An example of the ideal format is on the left; what I'm getting is on the right:
ID WILD RES MUT | ID WILD RES MUT
ABCD S 99 T | ABCD S 99 T
... | ...
ABCD Q 100 A | .ABC Q 100 {
... | ...
My current solution is the following:
data <- data.frame(ID = substr(mdata$substitution,
gregexpr(pattern = "\\.",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] - 1),
WILD = substr(mdata$substitution, 0, 1),
RES = gsub("[^0-9]","", mdata$substitution),
MUT = substr(mdata$substitution,
gregexpr(pattern = "\\{",
mdata$substitution)[[1]] + 1,
gregexpr(pattern = "\\}",
mdata$substitution)[[1]] - 1))
I'm not sure why my code isn't working. I thought that with gregexpr I would be able to find where the pattern occurs in each string and use that to work out the positions of the characters I want to extract, but it breaks when the length of the string changes.
Your code breaks because gregexpr(...)[[1]] only returns the match positions for the first string, so those positions are wrong for any string whose '.' or '{' sits at a different offset. Using this example, you can do what you want:
test=c("S99.ABCD{T}",
"S99.ABCD{V}",
"S99.ABCD{W}",
"S99.ABCD{Y}",
"Q100.ABCD{A}",
"Q100.ABCD{C}",
"Q100.ABCD{D}",
"Q100.ABCD{E}")
library(stringr)
test=str_remove(test,pattern = "\\}")
testdf=as.data.frame(str_split(test,pattern = "\\.|\\{",simplify = T))
testdf$V4=substring(testdf$V1, 1, 1)
testdf$V5=substring(testdf$V1, 2)
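If you then want exactly the ID / WILD / RES / MUT layout from the question, a possible follow-up (a sketch; the column mapping follows from the split above, where V2 is the ID, V4 the wild-type letter, V5 the residue number and V3 the mutation) would be:
data <- data.frame(ID   = testdf$V2,
                   WILD = testdf$V4,
                   RES  = as.numeric(testdf$V5),
                   MUT  = testdf$V3)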
I have seven tab-delimited files; each file has exactly the same number and names of columns, but different data in each. Below is a sample of what any of the seven files looks like:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change)
000001 000001 ZZ 1:1 01 01 NOTEST 0 0 0 0 1 1 no
I am basically trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0
All of the log2(fold_change) columns are obtained from the tenth column of each of the seven files.
What I have so far is below, and I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish the task as explained and still needs some work:
import csv
import glob
from collections import defaultdict

dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to stored output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)
with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti.items():
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([ int(z[0][0:3]) * "\t" + z[1][9] for z in v ]) + "\n")
So, the code above runs, but it is not what I am looking for, and here is why. The output step is the issue: I am writing a tab-delimited output file with the gene in the first column (k), v[1][1][3] is the locus of that particular gene, and finally the part I am having a tough time coding is this piece of the output line:
"".join([ int(z[0][0:3]) * "\t" + z[1][9] for z in v ])
I am trying to write the fold-change value from each of the seven files for that particular gene and locus into the correct column, so I basically multiply the file number by "\t" to make sure the value goes into the right column. The problem is that when the value from the next file comes along, writing continues from wherever the previous write left off, which I don't want; I want to start counting columns again from the beginning of the line:
Here is what I mean for instance,
gene name locus log2(fold change) from file 1 .... log2(fold change) from file7
ZZ 1:3 0
0
The first log2 value is recorded based on its column number, for instance 2; to place it, I multiply that column number (2) by "\t" before the fold_change value, and it is recorded without a problem. But then the last value, which should go to the seventh column for instance, does not end up in column seven, because writing resumes from where the previous value finished.
Here is my first approach:
import glob
import numpy as np
with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0
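Note that this assumes exactly one data row per file, as in the example. For files with several data rows, a rough sketch of the same idea using plain csv (assuming the files really are tab-delimited and all list the genes in the same order) could be:
import csv
import glob

fns = sorted(glob.glob('*.txt'))
per_file = []                                    # one list of (gene, locus, log2) tuples per file
for fn in fns:
    with open(fn) as f:
        rows = [r for r in csv.reader(f, delimiter='\t') if r][1:]   # drop the header row
        per_file.append([(r[2], r[3], r[9]) for r in rows])

with open('output.txt', 'w') as out:
    titles = ['gene_name', 'locus'] + [str(i + 1) + '_log2(fold_change)' for i in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    for grouped in zip(*per_file):               # i-th data row of every file
        gene, locus = grouped[0][0], grouped[0][1]
        out.write('\t'.join([gene, locus] + [g[2] for g in grouped]) + '\n')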
I am using re instead of csv. The main problem with your code is the for loop which writes the output to the file. I am including the complete code; hope this solves the problem you have.
import collections
import glob
import re
dicti = collections.defaultdict(list)
filetag = []
def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))
def main():
    name_of_fold = raw_input("Folder name to stored output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30} {0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([ len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key] ]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')
if __name__ == '__main__':
main()
I also changed the name of the output file from output.txt to output, as the former may interfere with the code, since the code considers all .txt files. I am attaching the output I got, which I assume is the format you wanted.
Thanks
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it just processes one row at a time.
import csv
import os
import glob
# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))
with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]
                file_out.write("\t".join(headers + ['\n']))  # Zero based indexing.
                headers_read = True
            else:
                _ = reader.next()  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0
I am trying to convert the MATLAB code below into C++ using codegen. However, it fails to build and I get the error:
"??? Unless 'rows' is specified, the first input must be a vector. If the vector is variable-size, the either the first dimension or the second must have a fixed length of 1. The input [] is not supported. Use a 1-by-0 or 0-by-1 input (e.g., zeros(1,0) or zeros(0,1)) to represent the empty set."
It then points to [id,m,n] = unique(id); being the culprit. Why doesn't it build and what's the best way to fix it?
function [L,num,sz] = label(I,n) %#codegen
% Check input arguments
error(nargchk(1,2,nargin));
if nargin==1, n=8; end
assert(ndims(I)==2,'The input I must be a 2-D array')
sizI = size(I);
id = reshape(1:prod(sizI),sizI);
sz = ones(sizI);
% Indexes of the adjacent pixels
vec = @(x) x(:);
if n==4 % 4-connected neighborhood
idx1 = [vec(id(:,1:end-1)); vec(id(1:end-1,:))];
idx2 = [vec(id(:,2:end)); vec(id(2:end,:))];
elseif n==8 % 8-connected neighborhood
idx1 = [vec(id(:,1:end-1)); vec(id(1:end-1,:))];
idx2 = [vec(id(:,2:end)); vec(id(2:end,:))];
idx1 = [idx1; vec(id(1:end-1,1:end-1)); vec(id(2:end,1:end-1))];
idx2 = [idx2; vec(id(2:end,2:end)); vec(id(1:end-1,2:end))];
else
error('The second input argument must be either 4 or 8.')
end
% Create the groups and merge them (Union/Find Algorithm)
for k = 1:length(idx1)
root1 = idx1(k);
root2 = idx2(k);
while root1~=id(root1)
id(root1) = id(id(root1));
root1 = id(root1);
end
while root2~=id(root2)
id(root2) = id(id(root2));
root2 = id(root2);
end
if root1==root2, continue, end
% (The two pixels belong to the same group)
N1 = sz(root1); % size of the group belonging to root1
N2 = sz(root2); % size of the group belonging to root2
if I(root1)==I(root2) % then merge the two groups
if N1 < N2
id(root1) = root2;
sz(root2) = N1+N2;
else
id(root2) = root1;
sz(root1) = N1+N2;
end
end
end
while 1
id0 = id;
id = id(id);
if isequal(id0,id), break, end
end
sz = sz(id);
% Label matrix
isNaNI = isnan(I);
id(isNaNI) = NaN;
[id,m,n] = unique(id);
I = 1:length(id);
L = reshape(I(n),sizI);
L(isNaNI) = 0;
if nargout>1, num = nnz(~isnan(id)); end
Just an FYI, if you are using MATLAB R2013b or newer, you can replace error(nargchk(1,2,nargin)) with narginchk(1,2).
As the error message says, for codegen unique requires that the input be a vector unless 'rows' is passed.
If you look at the report (click the "Open report" link that is shown) and hover over id you will likely see that its size is neither 1-by-N nor N-by-1. The requirement for unique can be seen if you search for unique here:
http://www.mathworks.com/help/coder/ug/functions-supported-for-code-generation--alphabetical-list.html
You could do one of a few things:
Make id a vector and treat it as a vector for the computation. Instead of the declaration:
id = reshape(1:prod(sizI),sizI);
you could use:
id = 1:numel(I)
Then id would be a row vector.
You could also keep the code as is and do something like:
[idtemp,m,n] = unique(id(:));
id = reshape(idtemp,size(id));
Obviously, this will cause a copy, idtemp, to be made but it may involve fewer changes to your code.
Remove the anonymous function stored in the variable vec and make vec a subfunction:
function y = vec(x)
coder.inline('always');
y = x(:);
Without the 'rows' option, the input to the unique function is always interpreted as a vector, and the output is always a vector, anyway. So, for example, something like id = unique(id) would have the effect of id = id(:) if all the elements of the matrix id were unique. There is no harm in making the input a vector going in. So change the line
[id,m,n] = unique(id);
to
[id,m,n] = unique(id(:));
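With that one change, the tail end of label stays essentially as it was (a sketch of the resulting lines):
[id,m,n] = unique(id(:)); % column-vector input satisfies the codegen restriction
I = 1:length(id);
L = reshape(I(n),sizI);   % n still has one entry per element of the original id matrix
L(isNaNI) = 0;
if nargout>1, num = nnz(~isnan(id)); end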