MATLAB: How to read PRE tag and create cellarray with NaN - regex

I am trying to read data from an HTML file.
The data are delimited by <PRE></PRE> tags,
e.g.:
<pre>
12.0 29132 -60.3 -91.4 1 0.01 260 753.2 753.3 753.2
10.0 30260 -57.9 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 1009.2 1011.8 1009.3
</pre>
t = regexp(html, '<pre[^>]*>(.*?)</pre>', 'tokens', 'ignorecase');
where t is a cell array of char.
Well, now I would like to replace each blank (missing) field with NaN, to obtain:
12.0 29132 -60.3 -91.4 1 0.01 260 NaN 753.2 753.3 753.2
10.0 30260 -57.9 NaN 1 0.01 260 58 802.4 802.5 802.4
9.8 30387 -57.7 -89.7 1 0.01 261 61 807.8 807.9 807.8
6.0 33631 -40.4 -77.4 1 0.17 260 88 1004.0 1006.5 1004.1
5.9 33746 -40.3 -77.3 1 0.17 NaN NaN 1009.2 1011.8 1009.3
The data will then be saved to a mydata.dat file.

If you have the HTML file hosted somewhere, then:
url = 'http://www.myDomain.com/myFile.html';
html = urlread(url);
% Use regular expressions to remove undesired HTML markup.
txt = regexprep(html,'<script.*?/script>','');
txt = regexprep(txt,'<style.*?/style>','');
txt = regexprep(txt,'<.*?>','');
Now you should have the data in text form in the txt variable (note: do not strip the <pre>...</pre> block itself, since that is where the data lives). You can use textscan to parse txt and scan for either the whitespace or the numbers.
More Info:
- urlread
- regexprep
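If it helps to prototype the extraction outside MATLAB, here is a minimal Python sketch of the same idea; the html string below is a stand-in for the downloaded page:

```python
import re

# Stand-in for the downloaded page; in practice this comes from urlread/urlopen
html = "<html><body><pre>\n12.0 29132 -60.3\n10.0 30260 -57.9\n</pre></body></html>"

# Pull out just the <pre> body, case-insensitively, mirroring the MATLAB regexp
m = re.search(r"<pre[^>]*>(.*?)</pre>", html, re.IGNORECASE | re.DOTALL)
print(m.group(1).strip())
```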

This isn't the perfect solution but it seems to get you there.
Assuming t is one long string, the delimiter is white space, and you know the number of columns:
numcols = 7;
sample = '1 2 3 4 5 7 1 3 5 7';
test = textscan(sample,'%f','delimiter',' ','MultipleDelimsAsOne',false);
test = test{:}; % Pull the double out of the cell array
test(2:2:end) = []; % Dump out extra NaNs
test2 = reshape(test,numcols,length(test)/numcols)'; % Have to mess with it a little to reshape rowwise instead of columnwise
Returns:
test2 =
1 2 3 4 5 NaN 7
1 NaN 3 NaN 5 NaN 7
This is assuming the delimiter is whitespace and constant. textscan doesn't allow you to collapse repeated whitespace into a single delimiter here, so it emits a NaN after each whitespace character where no data is present. In your example data there are two whitespace characters between data points, so every other NaN (or, more generically, every n_whitespace - 1 of them) can be thrown out, leaving you with the NaNs you actually want.
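The same idea translates to plain Python: splitting on every single space (rather than on runs of whitespace) turns each missing field into an empty token, which can then become a NaN. A minimal sketch:

```python
import math

def parse_row(line):
    # Split on every single space: an empty token marks a missing value
    return [float(tok) if tok else math.nan for tok in line.split(" ")]

print(parse_row("1 2 3 4 5  7"))  # [1.0, 2.0, 3.0, 4.0, 5.0, nan, 7.0]
```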

Related

A DAX function that takes the whole number and adds the decimal to the next row in DAX

Please, I need help implementing this logic.
I need help building a DAX function that takes:
A numerical column,
divides each value by 21,
from the result, which is a float, returns the whole number,
then adds the decimal part to the value in the next row,
and continues that way until the last row.
The last row returns both the whole number and the decimal.
Here is a simple table that captures the problem I want to solve.
From the tonnage column: 0.71 has 0 as the whole number, so we return 0, take the decimal .71 and add it to the second row: 2.67 + 0.71 = 3.38.
Return 3, which is the whole number, and add .38 to the third row: 0.76 + 0.38 = 1.14. Again return 1, being the whole number, and add .14 to the fourth row: 1.19 + 0.14 = 1.33. Return 1 and add .33 to the next row, and so on until the end. The last row returns both the whole number and the decimal, if any.
load  tonnage  trips (expected result)
15    0.71     0
56    2.67     3
16    0.76     1
25    1.19     1
19    0.90     1
14    0.67     0
52    2.48     3
75    3.57     3.95
Please help.
Thank you
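The carry logic itself is easy to prototype outside DAX. Here is a hedged Python sketch of the described procedure (the function name trips is made up for illustration):

```python
import math

def trips(tonnages):
    # Carry the fractional remainder of each row forward into the next one
    result, carry = [], 0.0
    for i, t in enumerate(tonnages):
        total = t + carry
        if i == len(tonnages) - 1:
            result.append(round(total, 2))  # last row keeps whole + fraction
        else:
            whole = math.floor(total)
            result.append(whole)
            carry = total - whole
    return result

print(trips([0.71, 2.67, 0.76, 1.19, 0.90, 0.67, 2.48, 3.57]))
# [0, 3, 1, 1, 1, 0, 3, 3.95]
```

Note that the whole number returned at each row equals floor(running total) minus floor(previous running total), which suggests expressing this in DAX as the difference of floors of a running-total measure rather than as row-by-row iteration.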

Replace a number combination with a colon in pyspark dataframe

I have a column in pyspark as:
column_a
force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700 3460
weight is 14N and force is 5N 7000 10000
acceleration due to gravity is 10 and force is 6N 15000 4500
force is 12 4 N and weight is 7N 9000 17000 and second force is 12N
I want to replace numbers that are in the range (1000, 20000) and occur one after another with a colon (;). For example, in the 4th row, 12 and 4 occur one after another, but they do not fall into the range, so we will not replace them with a colon (;).
So my final output will be
column_a
force is 12 N and weight is 5N ; and second force is 12N ;
weight is 14N and force is 5N ;
acceleration due to gravity is 10 and force is 6N ;
force is 12 4 N and weight is 7N ; and second force is 12N
How do I achieve this in pyspark?
You can use regexp_replace to replace the specified format with ;.
The hardest part is coming up with the regex; we can use the Numeric Range Regex Generator to find a regex pattern that matches the condition.
from pyspark.sql import functions as F
data = [("force is 12 N and weight is 5N 4455 6700 and second force is 12N 6700.010 3460",),
("weight is 14N and force is 5N 7000 10000",),
("acceleration due to gravity is 10 and force is 6N 15000 4500.1999999901",),
("force is 12 4 N and weight is 7N 9000 17000 and second force is 12N",),
("handle zero padded decimals 20000.000000 20000.00",),
("Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000",),]
df = spark.createDataFrame(data, ("column_a", ))
# This pattern matches whole and decimal numbers between 1000 and 20000 inclusive
numeric_pattern ="(((100[0-9]|10[1-9][0-9]|1[1-9][0-9]{2}|[2-9][0-9]{3}|1[0-9]{4})(\\.\\d+)?)|(20000)(\\.0*)?)"
# This pattern matches 2 numeric patterns separated by a space
pattern = f".({numeric_pattern}\\s{numeric_pattern})\\b"
df.withColumn("column_a", F.regexp_replace(F.col("column_a"), pattern, " ;")).show(truncate=False)
"""
+----------------------------------------------------------------------------+
|column_a |
+----------------------------------------------------------------------------+
|force is 12 N and weight is 5N ; and second force is 12N ; |
|weight is 14N and force is 5N ; |
|acceleration due to gravity is 10 and force is 6N ; |
|force is 12 4 N and weight is 7N ; and second force is 12N |
|handle zero padded decimals ; |
|Wont be replaced as outside range 20001 17000 even for decimal 20000.01 2000|
+----------------------------------------------------------------------------+
"""

Matlab: find small islands of numbers surrounded by NaN

I have a lengthy vector of numeric data, with some sequences of NaNs here and there. Most of the NaNs come in large chunks, but sometimes the segments of NaNs are close together, creating islands of numbers surrounded by NaNs like this:
...NaN 1 2 3 5 ... 9 4 2 NaN...
I would like to find all islands of data that are between 1 - 15000 elements in size and replace them by a solid block of NaNs.
I've tried a few things, but there are some problems: the data set is HUGE, hence converting it to a string and using a regular expression to do:
[found start end] = regexp(num2str(isnan(data)),'10{1,7}1','match','start','end')
is out of the question because it takes prohibitively long to do num2str(isnan(data)). So I need a numeric way to find all NaN-numbers-Nan where the number of numbers is between 1 and 15000.
Here is an example how you can do that:
% generate random data
data = rand(1,20)
data(data>0.5) = NaN
% add NaN before and after the original array
% for easier detection of the numerical block
% starting at 1st element and finishing at the last one
datatotest = [ NaN data NaN ];
NumBlockStart = find( ~isnan(datatotest(2:end)) & isnan(datatotest(1:end-1)) )+0
NumBlockEnd = find( ~isnan(datatotest(1:end-1)) & isnan(datatotest(2:end)) )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1
In this example, NumBlockStart contains the start index of each numeric block and NumBlockEnd contains its last index. NumBlockLength contains the length of each block.
Now you can do whatever you want to them :)
Here is possible output
data =
0.0382 0.3767 0.8597 0.2743 0.6276 0.2974 0.2587 0.8577 0.8319 0.1408 0.9288 0.0990 0.7653 0.7806 0.8576 0.8032 0.8340 0.1600 0.4937 0.7784
data =
0.0382 0.3767 NaN 0.2743 NaN 0.2974 0.2587 NaN NaN 0.1408 NaN 0.0990 NaN NaN NaN NaN NaN 0.1600 0.4937 NaN
NumBlockStart =
1 4 6 10 12 18
NumBlockEnd =
2 4 7 10 12 19
NumBlockLength =
2 1 2 1 1 2
UPDATE1
This is a more efficient version:
data = rand(1,19)
data(data>0.5) = NaN
test2 = diff( ~isnan([ NaN data NaN ]) );
NumBlockStart = find( test2>0 )-0
NumBlockEnd = find( test2<0 )-1
NumBlockLength = NumBlockEnd - NumBlockStart + 1
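For reference, the same diff-based detection translates to Python/NumPy (a sketch assuming the data is a 1-D float array; indices here are 0-based):

```python
import numpy as np

def island_bounds(data):
    # Pad the numeric mask with 0 on both ends, like the NaN padding above
    mask = np.concatenate(([0], (~np.isnan(data)).astype(int), [0]))
    d = np.diff(mask)
    starts = np.where(d == 1)[0]      # first index of each numeric block
    ends = np.where(d == -1)[0] - 1   # last index of each numeric block
    return starts, ends

data = np.array([np.nan, 1.0, 2.0, np.nan, np.nan, 5.0, np.nan])
starts, ends = island_bounds(data)

# Blank out islands at or below some length threshold (15000 in the question)
for s, e in zip(starts, ends):
    if e - s + 1 <= 2:  # small threshold for this demo
        data[s:e + 1] = np.nan
print(starts, ends)  # [1 5] [2 5]
```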

Optimise conversion to integer - pandas

I have a DataFrame with 80,000 rows. One column 'prod_prom' contains either null values or string representations of numbers, i.e. including ','. I need to convert these to integers. So far I have been doing this:
for row in DF.index:
    if pd.notnull(DF.loc[row, 'prod_prom']):
        DF.loc[row, 'prod_prom'] = int(''.join([char for char in DF.loc[row, 'prod_prom'] if char != ',']))
But it is extremely slow. Would it be quicker to do this in list comprehension, or with an apply function? What is best practice for this kind of operation?
Thanks
So if I understand right, you have data like the following:
data = """
A,B
100,"5,000"
200,"10,000"
300,"100,000"
400,
500,"2,000"
"""
If that is the case, probably the easiest thing is to use the thousands option in read_csv (the column type will be float instead of int because of the missing value):
from io import StringIO
df = pd.read_csv(StringIO(data), header=0, thousands=',')
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
If that is not possible, you can do something like the following:
print(df)
A B
0 100 5,000
1 200 10,000
2 300 100,000
3 400 NaN
4 500 2,000
df['B'] = df['B'].str.replace(',', '', regex=False).astype(float)
print(df)
A B
0 100 5000
1 200 10000
2 300 100000
3 400 NaN
4 500 2000
I changed the type to float because there is no integer NaN in pandas.
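A vectorized version of the cleanup for the asker's single column (a sketch; the column name prod_prom comes from the question):

```python
import pandas as pd

df = pd.DataFrame({"prod_prom": ["5,000", "10,000", None, "2,000"]})
# Strip thousands separators in one vectorized pass, then cast;
# missing values survive as NaN, which forces a float dtype
df["prod_prom"] = pd.to_numeric(df["prod_prom"].str.replace(",", "", regex=False))
print(df["prod_prom"].tolist())
```

This avoids the row-by-row loop entirely, which is where the original code loses its time.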

Matlab: Read in ascii with varying column numbers & different formats

I have been working on a puzzling question that involves reading an ascii file into matlab that contains 2 parts of different formats, the first part also including different column numbers.
MESH2D
MESHNAME "XXX"
E3T 1 1 29 30 1
E4Q 2 2 31 29 1 1
E4Q 3 31 2 3 32 1
...
...
...
ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002
ND 120451 5.30560039e+004 7.49260000e+004 6.84126709e+002
ND 120452 5.32360039e+004 7.49260000e+004 6.97750000e+002
ND 120453 5.34010039e+004 7.49110000e+004 7.67000000e+002
NS 1 2 3 4 5 6 7 8 9 10
NS 11 12 13 14 15 16 17 18 19 20
NS 21 22 23 24 25 26 27 -28
BEGPARAMDEF
GM "Mesh"
I am only interested in the lines that contain the triangles and start with E3T/E4Q, and the corresponding lines that hold the coordinates of the nodes of the triangles and start with ND. For the triangles (E3T/E4Q lines) I am only interested in the first 4 numbers, so I was trying to do something like this:
fileID = fopen(test);
t1 = textscan(fileID, '%s',3);
t2 = textscan(fileID, '%s %d %d %d*[^\n]');
fclose(fileID);
So I read in the header to jump to the data, then read the first string and the following numbers, then jump to the end of the line and restart. But this does not work: I only get a single line of data and not the rest of the file. Also, I do not know how to treat the second part of the file, which starts after an arbitrary number of lines (which I could of course look up manually and feed into matlab, but I would prefer matlab to find this change in format automatically).
Do you have any suggestions?
Cheers!
I suggest that you first read all the lines of the file with textscan as strings, and then filter out whatever you need:
fid = fopen(filename, 'r');
C = textscan(fid, '%s', 'delimiter', '');
fclose(fid);
Then parse only the E3T/E4Q/ND lines using regexp:
C = regexp(C{1}, '(\w*)(.*)', 'tokens');
C = cellfun(@(x){x{1}{1}, str2num(x{1}{2})}, C, 'UniformOutput', false);
C = vertcat(C{:});
And then group corresponding E3T/E4Q and ND lines:
idx1 = strcmp(C(:, 1), 'E3T') | strcmp(C(:, 1), 'E4Q');
idx2 = strcmp(C(:, 1), 'ND');
N = max(nnz(idx1), nnz(idx2));
indices = cellfun(@(x)x(1:4), C(idx1, 2), 'UniformOutput', false);
S = struct('tag', [C(idx1, 1); cell(N - nnz(idx1), 1)], ...
'indices', [indices; cell(N - nnz(idx1), 1)], ...
'nodes', [C(idx2, 2); cell(N - nnz(idx2), 1)]);
I named the E3T/E4Q values "indices" and ND values "nodes". The resulting array S contains structs, each having three fields: tag (either E3T or E4Q), indices and nodes. Note that if you have more "indices" than "nodes" or vice versa, the missing values are indicated by an empty matrix.
I know this is not perfect, but if your files are not too big you can do something like this:
fileID = fopen(test,'r');
while ~feof(fileID)
FileLine = fgetl(fileID);
[LineHead,Rem] = strtok(FileLine); % separate the leading keyword from the numbers
switch LineHead
case 'MESH2D'
% do something here
case 'MESHNAME'
% do something here
case 'E3T'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'E4Q'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
case 'ND'
% parse integer numbers
[Num,NumCount] = sscanf(Rem, '%d');
% or if you prefer to parse first number separately
[strFirst,strOthers] = strtok(Rem);
FirstInteger = str2num(strFirst);
[Floats,FloatsCount] = sscanf(strOthers, '%g');
% and so on...
end
end
fclose(fileID);
Of course, you have to handle the lines starting with MESH2D, MESHNAME, or GM separately.
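As a cross-check, the tag-dispatch approach is easy to mirror in Python (a sketch; only the E3T/E4Q and ND lines are kept, matching what the question asks for):

```python
def parse_mesh(lines):
    # Keep the first 4 numbers of each E3T/E4Q line and the full ND lines
    elements, nodes = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] in ("E3T", "E4Q"):
            elements.append((parts[0], [int(x) for x in parts[1:5]]))
        elif parts[0] == "ND":
            nodes.append((int(parts[1]), [float(x) for x in parts[2:]]))
    return elements, nodes

sample = [
    "E3T 1 1 29 30 1",
    "E4Q 2 2 31 29 1 1",
    "ND 120450 5.28760039e+004 7.49260000e+004 8.05500000e+002",
    "NS 1 2 3 4 5",
]
elems, nds = parse_mesh(sample)
print(elems)  # [('E3T', [1, 1, 29, 30]), ('E4Q', [2, 2, 31, 29])]
```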