I have the following string and I need to extract the patterns from it into a single-column data frame named SIZE:
str <- "N · 0.1 [mm]: N · 0.1 + 0.02 [mm]: N · 0.1 + 0.05 [mm] N · 0.1 + 0.08 [mm] M · 1 [mm]: M · 1 + 0.5 [mm] M · 1 + 0.75 [mm]"
Each pattern is followed by either : or whitespace and always ends in [mm].
The regex I am using to match the patterns is below, and it works, but I'm not sure how to extract the matches into a data frame column.
\S\W+\d\.?\d?\s\+?\s?\d?\.?\d?\d?\s?\[mm\]
Output expected: 1 column named SIZE
N · 0.1 [mm]
N · 0.1 + 0.02 [mm]
N · 0.1 + 0.05 [mm]
N · 0.1 + 0.08 [mm]
M · 1 [mm]
M · 1 + 0.5 [mm]
M · 1 + 0.75 [mm]
Any help appreciated. Thanks.
Perhaps strsplit would make things easier here:
str <- "N · 0.1 [mm]: N · 0.1 + 0.02 [mm]: N · 0.1 + 0.05 [mm] N · 0.1 + 0.08 [mm] M · 1 [mm]: M · 1 + 0.5 [mm] M · 1 + 0.75 [mm]"
vals <- strsplit(str, '(?<=\\])[\\s:]*', perl = TRUE)
data.frame(SIZE = unlist(vals))
Output
SIZE
1 N · 0.1 [mm]
2 N · 0.1 + 0.02 [mm]
3 N · 0.1 + 0.05 [mm]
4 N · 0.1 + 0.08 [mm]
5 M · 1 [mm]
6 M · 1 + 0.5 [mm]
7 M · 1 + 0.75 [mm]
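For completeness, the matches can also be pulled out directly with gregexpr/regmatches, which is closest to what the question asked. A minimal sketch with a simplified pattern; it assumes the leading letter is always N or M, so adjust if other prefixes occur:
m <- regmatches(str, gregexpr('[NM] · [0-9.]+( \\+ [0-9.]+)? \\[mm\\]', str))
data.frame(SIZE = unlist(m))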
Here's one approach to get the data in: replace any instance of "[mm] " with "[mm]: " and scan the text in with ":" as your separator. No fussing with regexes.
scan(what = "", text = gsub("[mm] ", "[mm]: ", str, fixed=TRUE),
sep = ":", strip.white=TRUE)
# Read 7 items
# [1] "N · 0.1 [mm]" "N · 0.1 + 0.02 [mm]" "N · 0.1 + 0.05 [mm]"
# [4] "N · 0.1 + 0.08 [mm]" "M · 1 [mm]" "M · 1 + 0.5 [mm]"
# [7] "M · 1 + 0.75 [mm]"
Just assign the result there to a column in a data.frame or create a data.frame with the output. Or, all in one:
data.frame(
SIZE = scan(text = gsub("[mm] ", "[mm]: ", str, fixed=TRUE),
sep = ":", strip.white=TRUE, what = ""))
I was unsure how to title this question; if somebody knows a more specific title, I'm happy to change it.
I'm trying to develop a model in NetLogo for my thesis where turtles buy water from 18 different wells. Each of the turtles has its distance to the individual wells stored in the breed-own variables adist_w_1, adist_w_2, etc.
What I want to do is model the consumption decisions of commercial establishments by first calculating the price of water (constant + x * adist_w_y), then figuring out what the demand is at this price, and subtracting the demands at cheaper wells from this specific well:
price = 0.75 + 0.15 * distance
individual_demand_at_well_x = f(price_at_well) - demands_at_cheaper_wells
What I've done so far to this regard looks like this:
to calc-price-at-well
set earlier-demands 0
ask commercials [
let price_w_1 ((adist_w_1 / 1000) * 0.15 + 0.75 )
let price_w_2 ((adist_w_2 / 1000) * 0.15 + 0.75 )
let price_w_3 ((adist_w_3 / 1000) * 0.15 + 0.75 )
let price_w_4 ((adist_w_4 / 1000) * 0.15 + 0.75 )
let price_w_5 ((adist_w_5 / 1000) * 0.15 + 0.75 )
let price_w_6 ((adist_w_6 / 1000) * 0.15 + 0.75 )
let price_w_7 ((adist_w_7 / 1000) * 0.15 + 0.75 )
let price_w_8 ((adist_w_8 / 1000) * 0.15 + 0.75 )
let price_w_9 ((adist_w_9 / 1000) * 0.15 + 0.75 )
let price_w_10 ((adist_w_10 / 1000) * 0.15 + 0.75 )
let price_w_11 ((adist_w_11 / 1000) * 0.15 + 0.75 )
let price_w_12 ((adist_w_12 / 1000) * 0.15 + 0.75 )
let price_w_13 ((adist_w_13 / 1000) * 0.15 + 0.75 )
let price_w_14 ((adist_w_14 / 1000) * 0.15 + 0.75 )
let price_w_15 ((adist_w_15 / 1000) * 0.15 + 0.75 )
let price_w_16 ((adist_w_16 / 1000) * 0.15 + 0.75 )
let price_w_17 ((adist_w_17 / 1000) * 0.15 + 0.75 )
let price_w_18 ((adist_w_18 / 1000) * 0.15 + 0.75 )
let demand_w_1 5 * price_w_1 ;DUMMY! include demand function of price
let demand_w_2 5 * price_w_2
let demand_w_3 5 * price_w_3
let demand_w_4 5 * price_w_4
let demand_w_5 5 * price_w_5
let demand_w_6 5 * price_w_6
let demand_w_7 5 * price_w_7
let demand_w_8 5 * price_w_8
let demand_w_9 5 * price_w_9
let demand_w_10 5 * price_w_10
let demand_w_11 5 * price_w_11
let demand_w_12 5 * price_w_12
let demand_w_13 5 * price_w_13
let demand_w_14 5 * price_w_14
let demand_w_15 5 * price_w_15
let demand_w_16 5 * price_w_16
let demand_w_17 5 * price_w_17
let demand_w_18 5 * price_w_18
let demand-list (list demand_w_1 demand_w_2 demand_w_3 demand_w_4 demand_w_5 demand_w_6 demand_w_7 demand_w_8 demand_w_9 demand_w_10 demand_w_11 demand_w_12 demand_w_13 demand_w_14 demand_w_15 demand_w_16 demand_w_17 demand_w_18)
foreach sort demand-list
[ d ->  ; d visits the demands from cheapest to most expensive
set final_t_dem (d - earlier-demands)
set earlier-demands (earlier-demands + d)
]
]
(where earlier-demands is a global and final_t_dem is a commercials-own variable)
This hopefully yields the cheapest well first and reiterates until all wells are "treated". But now I'm clueless as to how to extract which well gets which demand from the individual turtles and, this being the goal, how to sum those demands up.
In the end, I'd like to be able to say that commercial xy has an individual demand of x for well 1 and all commercials together have a demand of y.
I'd be glad to get advice and, since I'm new to NetLogo, also an opinion on whether this is a smart way to go with regard to model efficiency. Many thanks!
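One possibility for the summing part, as a minimal untested sketch in NetLogo 6 syntax; it assumes demand-list is promoted from a local to a commercials-own variable with one entry per well (index 0 to 17):
to-report total-demand-for [ w ]  ; w is a well index, 0..17
  report sum [ item w demand-list ] of commercials
end
;; all 18 per-well totals at once:
;; map [ w -> total-demand-for w ] range 18
Each commercial's individual demand for well w is then simply item w demand-list, and storing the 18 adist_w_* distances in a single list as well would shorten calc-price-at-well considerably.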
I'll use an example:
INPUT:
0.6 0.7 A:0.01 - 0
C:0.01 0.1 - 0.2 0
0.7 0.02 G:0.2 - 0
0.5 0.23 0.1 T:0.05 0
0.1 0.2 0.3 0.58 0
So, if a column has a value starting with A, C, T or G, I would like to change it to "0" or "-", and change the last column to "W" (in the real data these are columns $34 $35 $36 $37 $38).
OUTPUT:
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
I would like to use awk.
awk '{if($34=="^:^");gsub($34,"*","0") && gsub($38,"0","W"); else print}' file
and the same for the other columns.
Thank you.
How about like this:
$ awk '{for(i=1;i<=4;i++){if ($i ~ /A:|C:|T:|G:/){$i=0; $NF="W"}}}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
In a more readable format:
$ awk '{
for(i=1;i<=4;i++) { # Loop through the fields
if ($i ~ /A:|C:|T:|G:/) { # If current field matches pattern
$i=0 # Replace it with zero
$NF="W" # And make the last field a 'W'
}
}
}1' file | column -t
If you want to limit it to specific columns, you can use an array:
awk '{c="1,3";split(c,cols,/,/);for(i in cols){if ($cols[i] ~ /A:|C:|T:|G:/){$cols[i]=0; $NF="W"}}}1' file | column -t
What about something like this:
awk -v OFS="\t" '{if (gsub(/G:|C:|A:|T:/, "0")) print $1,$2,$3,$4,"W"; else print $0}'
You would then still need to replace values starting with 00 (e.g. 00.01) with a plain zero.
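That second step can be avoided by replacing the whole token rather than just the A:/C:/G:/T: prefix. A variant of the same idea, tested only against the sample input:
awk -v OFS="\t" '{ if (gsub(/[GCAT]:[^[:space:]]*/, "0")) print $1, $2, $3, $4, "W"; else print $0 }' file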
If you don't care about spacing:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
If you do:
$ awk 'gsub(/[ACGT][^[:space:]]+/,0){$NF="W"}1' file | column -t
0.6 0.7 0 - W
0 0.1 - 0.2 W
0.7 0.02 0 - W
0.5 0.23 0.1 0 W
0.1 0.2 0.3 0.58 0
I have searched everywhere for this with no solution: I need to round to the nearest multiple of 5.
I don't know how to formulate it; for example, round(0.13) should return 0. Here is the pattern logic, with the value to round first and the expected result after rounding:
0.12 => 0
0.99 => 0
1.01 => 0
4.99 => 5
5.45 => 5
7.00 => 5
8.00 => 10
9.10 => 10
14.34 => 15
17.4 => 15
17.5 => 20
37.6 => 40
Try
float y = roundf(x / 5) * 5;
or, assuming x >= 0 (and, as @JamesKanze noted, x <= INT_MAX)
int n = (int)(roundf(x / 5) * 5 + 0.5);
Try 5 * ((n + 2.5) \ 5), where \ denotes integer division (equivalently, 5 * floor((n + 2.5) / 5) with ordinary division). This is obviously not code, but it should be trivial to translate into any language you like.
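Put into code, that formula might look like this (a minimal C sketch, assuming non-negative input as in the examples):
#include <math.h>
#include <stdio.h>

/* nearest multiple of 5; halfway values round up, matching 17.5 => 20 */
static int round_to_5(double x)
{
    return 5 * (int)floor(x / 5.0 + 0.5);
}

int main(void)
{
    double tests[] = { 0.12, 4.99, 7.00, 8.00, 17.4, 17.5, 37.6 };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; ++i)
        printf("%.2f => %d\n", tests[i], round_to_5(tests[i]));
    return 0;
}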
Can someone explain why 1.000000 <= 1.0f is false?
The code:
#include <iostream>
#include <stdio.h>
using namespace std;
int main(int argc, char **argv)
{
float step = 1.0f / 10;
float t;
for(t = 0; t <= 1.0f; t += step)
{
printf("t = %f\n", t);
cout << "t = " << t << "\n";
cout << "(t <= 1.0f) = " << (t <= 1.0f) << "\n";
}
printf("t = %f\n", t );
cout << "t = " << t << "\n";
cout << "(t <= 1.0f) = " << (t <= 1.0f) << "\n";
cout << "\n(1.000000 <= 1.0f) = " << (1.000000 <= 1.0f) << "\n";
}
The result:
t = 0.000000
t = 0
(t <= 1.0f) = 1
t = 0.100000
t = 0.1
(t <= 1.0f) = 1
t = 0.200000
t = 0.2
(t <= 1.0f) = 1
t = 0.300000
t = 0.3
(t <= 1.0f) = 1
t = 0.400000
t = 0.4
(t <= 1.0f) = 1
t = 0.500000
t = 0.5
(t <= 1.0f) = 1
t = 0.600000
t = 0.6
(t <= 1.0f) = 1
t = 0.700000
t = 0.7
(t <= 1.0f) = 1
t = 0.800000
t = 0.8
(t <= 1.0f) = 1
t = 0.900000
t = 0.9
(t <= 1.0f) = 1
t = 1.000000
t = 1
(t <= 1.0f) = 0
(1.000000 <= 1.0f) = 1
As correctly pointed out in the comments, the value of t is not actually the same 1.000000 that you compare against in the last line.
Printing t at higher precision with std::setprecision(20) reveals its actual value: 1.0000001192092895508.
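You can reproduce this yourself; a small sketch of the idea:
#include <iostream>
#include <iomanip>

int main()
{
    float t = 0;
    for (int i = 0; i < 10; ++i)
        t += 1.0f / 10;  // accumulates a tiny rounding error each step
    std::cout << std::setprecision(20) << t << "\n";  // 1.0000001192092895508
    std::cout << (t <= 1.0f) << "\n";  // prints 0
}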
The common way to avoid these kinds of issues is to compare not with 1, but with 1 + epsilon, where epsilon is a very small number, maybe one or two orders of magnitude greater than your floating point precision.
So you would write your for loop condition as
for(t = 0; t <= 1.000001f; t += step)
Note that in your case, epsilon should be at least ten times greater than the maximum possible floating point error, as the float is added ten times.
As pointed out by Muepe and Alain, the reason for t != 1.0f is that 1/10 cannot be represented precisely in binary floating point numbers.
Floating point types in C++ (and most other languages) are implemented using an approach that uses the available bytes (for example 4 or 8) for the following 3 components:
Sign
Exponent
Mantissa
Let's have a look at a 32-bit (4-byte) type, which is often what you get for float in C++.
The sign is just a single bit being 1 or 0, where 0 means positive and 1 means negative (if you leave every existing standardization aside, you could just as well define 0 -> negative and 1 -> positive).
The exponent uses 8 bits. Unlike in daily life, this exponent is not to base 10 but to base 2: an exponent of 1 corresponds to 2, not 10, and an exponent of 2 means 4 (= 2^2), not 100 (= 10^2).
Another important part is that for floating point variables we also want negative exponents, like 2^-1 being 0.5, 2^-2 being 0.25, and so on. Thus we define a bias value that gets subtracted from the stored exponent to yield the real value. With 8 bits we choose 127, meaning a stored exponent of 0 would give 2^-127 and 255 would give 2^128. There is an exception, though: two exponent values are reserved (for NaN/infinity and for zeros/denormals), so the usable stored exponents run from 1 to 254, giving a range from 2^-126 to 2^127.
The mantissa fills up the remaining 23 bits. If we see the mantissa as a series of 0s and 1s, its value is 1.m, where m is that series of bits, read not in powers of 10 but in powers of 2. So 1.1 would be 1 * 2^0 + 1 * 2^-1 = 1 + 0.5 = 1.5. As an example, let's look at a very short mantissa:
m = 100101 -> 1.100101 in base 2 -> 1 * 2^0 + 1 * 2^-1 + 0 * 2^-2 + 0 * 2^-3 + 1 * 2^-4 + 0 * 2^-5 + 1 * 2^-6 = 1 + 0.5 + 1/16 + 1/64 = 1.578125
The final value of a float is then calculated as:
2^e * 1.m * (sign ? -1 : 1)
where e is the real (unbiased) exponent.
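If you want to see those three components for a concrete value, the bits can be inspected directly. A small sketch, assuming the usual 32-bit IEEE 754 float layout:
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    float f = 0.1f;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to reinterpret the bytes
    std::cout << "sign     = " << (bits >> 31) << "\n"
              << "exponent = " << ((bits >> 23) & 0xFFu) << "\n"  // biased; prints 123 for 0.1f
              << "mantissa = " << (bits & 0x7FFFFFu) << "\n";
}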
What exactly is going wrong in your loop: your step is 0.1, and 0.1 is a very awkward number for base-2 floating point. Let's have a look why:
sign -> 0 (as it's non-negative)
exponent -> the first power of two smaller than 0.1 is 2^-4, so the real exponent is -4, stored as -4 + 127 = 123
mantissa -> we divide 0.1 by 2^-4: 0.1 / 0.0625 = 1.6. Since the mantissa represents 1.m, m has to encode the fractional part 0.6. So let's convert 0.6 to binary:
0.6 = 1 * 2^-1 + 0.1 -> m = 1
0.1 = 0 * 2^-2 + 0.1 -> m = 10
0.1 = 0 * 2^-3 + 0.1 -> m = 100
0.1 = 1 * 2^-4 + 0.0375 -> m = 1001
0.0375 = 1 * 2^-5 + 0.00625 -> m = 10011
0.00625 = 0 * 2^-6 + 0.00625 -> m = 100110
0.00625 = 0 * 2^-7 + 0.00625 -> m = 1001100
0.00625 = 1 * 2^-8 + 0.00234375 -> m = 10011001
We could continue like this until we have all 23 mantissa bits, but you can already see that you get:
m = 10011001100110011001...
Therefore 0.1 in binary floating point is like 1/3 in a base-10 system: a periodically repeating number. Since the space in a float is limited, the expansion has to be cut off at the 23rd bit, and because the discarded remainder is closer to the upper neighbour, the last bit gets rounded up. The stored value is therefore a tiny bit greater than 0.1.
The reason is that 1.0/10.0 = 0.1 cannot be represented exactly in binary, just as 1.0/3.0 = 0.333... cannot be represented exactly in decimal.
If we use
float step = 1.0f / 8;
for example, the result is as expected, because 1/8 = 2^-3 is exactly representable in binary.
To avoid such problems, use a small offset as shown in the answer of mic_e.
I was trying to solve the following problem, using the GLPSOL solver:
Fred has $5000 to invest over the next five years. At the beginning of each year he can invest money in one- or two-year time deposits. The bank pays 4% interest on one-year time deposits and 9% (total) on two-year time deposits. In addition, West World Limited will offer three-year certificates starting at the beginning of the second year. These certificates will return 15% (total). If Fred reinvests his money that is available every year, formulate a linear program to show him how to maximize his total cash on hand at the end of the fifth year.
I came up with the following LP model:
With xij being the amount invested in option i in year j, we want to
maximize z = 1.04 x15 + 1.09 x24 + 1.15 x33,
subject to:
x11 + x21 <= 5000
x31 = x34 = x35 = 0
x12 + x22 + x32 <= 1.04 x11
x13 + x23 + x33 <= 1.04 x12 + 1.09 x21
x14 + x24 <= 1.04 x13 + 1.09 x22
x15 <= 1.04 x14 + 1.09 x23 + 1.15 x32
xij >= 0
And I tried to write it in GMPL:
/* Variables */
var x{i in 1..3, j in 1..5} >= 0;
/* Objective */
maximize money: 1.04*x[1,5] + 1.09*x[2,4] + 1.15*x[3,3];
/* Constraints */
s.t. x[1,1] + x[2,1] <= 5000;
s.t. x[3,1] = x[3,4] = x[3,5] = 0;
s.t. x[1,2] + x[2,2] + x[3,2] <= 1.04 * x[1,1];
s.t. x[1,3] + x[2,3] + x[3,3] <= 1.04 * x[1,2] + 1.09 * x[2,1];
s.t. x[1,4] + x[2,4] <= 1.04 * x[1,3] + 1.09 * x[2,2];
s.t. x[1,5] <= 1.04 * x[1,4] + 1.09 * x[2,3] + 1.15 * x[3,2];
/* Resolve */
solve;
/* Results */
printf{j in 1..5}:"\n* %.2f %.2f %2.f \n", x[1,j], x[2,j], x[3,j];
end;
However, I'm getting the following error:
inv.mod:14: x multiply declared
Context: ...[ 1 , 5 ] + 1.09 * x [ 2 , 4 ] + 1.15 * x [ 3 , 3 ] ; s.t. x
MathProg model processing error
Does anyone have any thoughts about this?
You have to give a unique name to each constraint, and chained equalities such as x[3,1] = x[3,4] = x[3,5] = 0 are not allowed; split them into separate constraints.
This works on my machine:
/* Variables */
var x{i in 1..3, j in 1..5} >= 0;
/* Objective */
maximize money: 1.04*x[1,5] + 1.09*x[2,4] + 1.15*x[3,3];
/* Constraints */
s.t. c1: x[1,1] + x[2,1] <= 5000;
s.t. c2: x[3,1] = 0;
s.t. c3: x[3,4] = 0;
s.t. c4: x[3,5] = 0;
s.t. c5: x[1,2] + x[2,2] + x[3,2] <= 1.04 * x[1,1];
s.t. c6: x[1,3] + x[2,3] + x[3,3] <= 1.04 * x[1,2] + 1.09 * x[2,1];
s.t. c7: x[1,4] + x[2,4] <= 1.04 * x[1,3] + 1.09 * x[2,2];
s.t. c8: x[1,5] <= 1.04 * x[1,4] + 1.09 * x[2,3] + 1.15 * x[3,2];
/* Resolve */
solve;
/* Results */
printf{j in 1..5}:"\n* %.2f %.2f %2.f \n", x[1,j], x[2,j], x[3,j];
end;
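Running it with glpsol (assuming the model is saved as inv.mod):
glpsol --math inv.mod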
It prints:
* 0.00 5000.00 0
* 0.00 0.00 0
* 0.00 0.00 5450
* 0.00 0.00 0
* 0.00 0.00 0
Good luck!