How to add column to a data.table in R that is based on a string in another column?

How to add column to a data.table in R that is based on a string in another column? - regex

I would like to add columns to a data.table based on a string in another column. This is my data and the approach that I have tried:
Params
1: { clientID : 459; time : 1386868908703; version : 6}
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}
3: { clientID : 988; time : 1388939739771}
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}
5: { clientID : 459; time : 1388090530634}
Code to create this table:
DT = data.table(Params=c("{ clientID : 459; time : 1386868908703; version : 6}","{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}","{ clientID : 988; time : 1388939739771}","{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}","{ clientID : 459; time : 1388090530634}"))
I would like to parse the text in the "Params"-column and create new columns based on the text in it. For example I would like to have a new column named "user" that only holds the number after "user:" in the Params string. The added column should look like this:
Params user
1: { clientID : 459; time : 1386868908703; version : 6} NA
2: { clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001} 459001
3: { clientID : 988; time : 1388939739771} NA
4: { clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001} 459001
5: { clientID : 459; time : 1388090530634} 459001
I created the following function to parse (in this case for the "user"):
myparse <- function(searchterm, s) {
s <-gsub("{","",s, fixed = TRUE)
s <-gsub(" ","",s, fixed = TRUE)
s <-gsub("}","",s, fixed = TRUE)
s <-strsplit(s, '[;:]')
s <-unlist(s)
if (length(s[which(s==searchterm)])>0) {s[which(s==searchterm)+1]} else {NA}
}
Then I use the following function to add a column:
DT <- transform(DT, user = myparse("user", Params))
This works in the case of "time" which is included in all the rows but does not work in the case of "user" which is only included in two of the rows. The following error is returned:
Error in data.table(list(Params = c("{ clientID : 459; time : 1386868908703; version : 6}", :
argument 2 (nrow 2) cannot be recycled without remainder to match longest nrow (5)
How can I address this? Thanks!

Here's a way to use regular expressions for this task:
myparse <- function(searchterm, s) {
res <- rep(NA_character_, length(s)) # NA vector
idx <- grepl(searchterm, s) # index for strings including the search term
pattern <- paste0(".*", searchterm, " : ([^;}]+)[;}].*") # regex pattern
res[idx] <- sub(pattern, "\\1", s[idx]) # extract target string
return(res)
}
You can use this function to add new columns, e.g., for user:
DT[, user := myparse("user", Params)]
The new column contains NA for the rows with no user field:
DT[, user]
# [1] NA "459001" NA "459001" NA

I would use some external parser, for example:
library(yaml)
DT = data.frame(
Params=c("{ clientID : 459; time : 1386868908703; version : 6}","{ clientID : 459; id : 52a9ea8b534b2b0b5000575f; time : 1386868824339; user : 459001}","{ clientID : 988; time : 1388939739771}","{ clientID : 459; id : 52a9ec00b73cbf0b210057e9; time : 1386868810519; user : 459001}","{ clientID : 459; time : 1388090530634}"),
stringsAsFactors=F
)
conv.to.yaml <- function(x){
gsub('; ','\n',substr(x, 3, nchar(x)-1))
}
tmp <- lapply( DT$Params, function(x) yaml.load(conv.to.yaml(x)) )
then combine the parsed lists into data frame:
unames <- unique( unlist(sapply( tmp, names) ) )
res <- as.data.frame( do.call(rbind, lapply(tmp, function(x)x[unames]) ) )
colnames( res ) <- unames
res
the result is pretty much close to what you have in mind, but you need to think about better handling for the time values:
> res
clientID time version id user
1 459 -405527905 6 NULL NULL
2 459 -405612269 NULL 52a9ea8b534b2b0b5000575f 459001
3 988 1665303163 NULL NULL NULL
4 459 -405626089 NULL 52a9ec00b73cbf0b210057e9 459001
5 459 816094026 NULL NULL NULL

Related

Error handling in pyomo - division by zero

I am working on a linear optimization problem where I have a set of cities and a set of powerplants. The cities have an electricity demand that needs to be satisfied. However, in the context of my problem, in certain time periods, the cities have no electricity demand (from the power plants because they can produce some of their own). I do not think the specific details are very important so below is my best description of the issue.
The objective function contains the following term:
Term in objective function
I created the appropriate city and month sets and set up my objective function as:
sum(sum(1/model.monthly_demand[c,t]*model.theta[c] for c in model.cities) for t in model.months)
The issue clearly arises when monthly_demand[c,t] = 0 as i get a division by zero error. And I am not sure how to deal with this. Ideally I would like theta[c] to be set to zero in that case but I am unsure how to do this. I tried adding some if/else statements in the sum() function but this is not possible as far as I understand.
I think I can also define a function that is passed into the pyomo objective, so my idea was to try something like an if statement that sets theta[c] to zero when the monthly demand is zero, but this was not successful.
Another idea was to set the demands to something like 0.000001 but I would like this to be a last resort solution because I think it will cause issues.

You should just make a subset wherever convenient of the non-zero elements of demand. I am assuming demand is a parameter which would make sense, but you didn't specify. Then use that subset as the basis of summation. This will avoid including the empty/zero elements.
Note the elements of the objective in the printout of the model.
Code:
import pyomo.environ as pyo
demand_data = { ('LA', 0) : 20,
('LA', 1) : 0,
('SF', 1) : 15,
('NY', 0) : 20,
('NY', 1) : 30}
m = pyo.ConcreteModel('electricity')
m.C = pyo.Set(initialize=['LA', 'SF', 'NY'])
m.T = pyo.Set(initialize=list(range(2)))
# VARS
m.x = pyo.Var(m.C, m.T, domain=pyo.NonNegativeReals)
# PARAMS
m.demand = pyo.Param(m.C, m.T, initialize=demand_data, default=0)
# CONSTRAINTS
# ...
# OBJ
# make a subset of the non-zero demand periods
nonzero_demand_domain = {(c,t) for c in m.C for t in m.T if m.demand[c, t] > 0}
# use that in your objective...
m.obj = pyo.Objective(expr=sum(m.x[c, t] for c, t in nonzero_demand_domain))
m.pprint()
Output:
4 Set Declarations
C : Size=1, Index=None, Ordered=Insertion
Key : Dimen : Domain : Size : Members
None : 1 : Any : 3 : {'LA', 'SF', 'NY'}
T : Size=1, Index=None, Ordered=Insertion
Key : Dimen : Domain : Size : Members
None : 1 : Any : 2 : {0, 1}
demand_index : Size=1, Index=None, Ordered=True
Key : Dimen : Domain : Size : Members
None : 2 : C*T : 6 : {('LA', 0), ('LA', 1), ('SF', 0), ('SF', 1), ('NY', 0), ('NY', 1)}
x_index : Size=1, Index=None, Ordered=True
Key : Dimen : Domain : Size : Members
None : 2 : C*T : 6 : {('LA', 0), ('LA', 1), ('SF', 0), ('SF', 1), ('NY', 0), ('NY', 1)}
1 Param Declarations
demand : Size=6, Index=demand_index, Domain=Any, Default=0, Mutable=False
Key : Value
('LA', 0) : 20
('LA', 1) : 0
('NY', 0) : 20
('NY', 1) : 30
('SF', 1) : 15
1 Var Declarations
x : Size=6, Index=x_index
Key : Lower : Value : Upper : Fixed : Stale : Domain
('LA', 0) : 0 : None : None : False : True : NonNegativeReals
('LA', 1) : 0 : None : None : False : True : NonNegativeReals
('NY', 0) : 0 : None : None : False : True : NonNegativeReals
('NY', 1) : 0 : None : None : False : True : NonNegativeReals
('SF', 0) : 0 : None : None : False : True : NonNegativeReals
('SF', 1) : 0 : None : None : False : True : NonNegativeReals
1 Objective Declarations
obj : Size=1, Index=None, Active=True
Key : Active : Sense : Expression
None : True : minimize : x[NY,0] + x[NY,1] + x[LA,0] + x[SF,1]
7 Declarations: C T x_index x demand_index demand obj

pyomo: an indexed variable should be linear or integer depending on the variable index

I have a indexed variable New_UnitsBuilt[p] and this variabele should be integer for the index "GasPowerplant"
but linear for the index "batterystorage".
new_units_built_set = pyo.Set(initialize=list(params.Installable_units))
model.New_UnitsBuilt = pyo.Var(new_units_built_set, domain=(pyo.NonNegativeIntegers if p="GasPowerplant" else NonNegativeReals)
Please help me how to do this in pyomo.
I am new in pyomo
Best Greetings
Gerhard

There are a couple ways you can accomplish this. For the following, I am assuming that your params.Installable_units = ["GasPowerplant", "batterystorage"]:
If the number of elements in new_units_built_set is small, then you can use a dictionary:
model.new_units_built_set = pyo.Set(initialize=list(params.Installable_units))
model.New_UnitsBuilt = pyo.Var(model.new_units_built_set,
domain={"GasPowerplant": pyo.NonNegativeIntegers, "batterystorage": pyo.NonNegativeReals})
Or if there are a lot – or there is a simple formula to get the return value – you can use a function (rule):
model.new_units_built_set = pyo.Set(initialize=list(params.Installable_units))
def _new_unitsbuilt_domain(m, p):
return pyo.NonNegativeIntegers if p=="GasPowerplant" else pyo.NonNegativeReals
model.New_UnitsBuilt = pyo.Var(model.new_units_built_set, domain=_new_unitsbuilt_domain)
Or you can just set everything to one value and override later (assuming you are using a ConcreteModel):
model.new_units_built_set = pyo.Set(initialize=list(params.Installable_units))
model.New_UnitsBuilt = pyo.Var(model.new_units_built_set, domain=pyo.NonNegativeReals)
model.New_UnitsBuilt["GasPowerplant"].domain = pyo.NonNegativeIntegers
All of these will produce:
>>> model.pprint()
1 Set Declarations
new_units_built_set : Size=1, Index=None, Ordered=Insertion
Key : Dimen : Domain : Size : Members
None : 1 : Any : 2 : {'GasPowerplant', 'batterystorage'}
1 Var Declarations
New_UnitsBuilt : Size=2, Index=new_units_built_set
Key : Lower : Value : Upper : Fixed : Stale : Domain
GasPowerplant : 0 : None : None : False : True : NonNegativeIntegers
batterystorage : 0 : None : None : False : True : NonNegativeReals
2 Declarations: new_units_built_set New_UnitsBuilt

SAS: Why is the string trimmed at position 955

I'm trying to concatenate retained values.
This is my piece of code:
data &_output.;
set &_input.;
by cpn;
retain json_array;
if first.cpn and last.cpn then do;
flag = 'both';
concat = ('subscriptions:[{'||'"mpc" : "'||compress(mpc)||'" , '||'"contract_start_date" : "'||compress(contract_start_date)||'" '||' , "contract_end_date" : "'||compress(contract_end_date)||'"'||' , "subscription_status_code" : "'||compress(subscription_status_code)||'"'||' , '||'"promo_code" : "'||compress(promo_code)||'" '||' , "print_or_digi_flag" : "'||compress(print_or_digi_flag)||'"'||' , "payment_method_selection" : "'||compress(payment_method_selection)||'"'||', "subscription_type_code" : "'||compress(subscription_type_code)||'"'||' , "report_trial_subscription" : "'||compress(report_trial_subscription)||'"'||' , "product_desc" : "'||compress(product_desc)||'"}]');
end;
else if first.cpn then do;
flag = 'first';
concat = ('subscriptions:[{'||'"mpc" : "'||compress(mpc)||'" , '||'"contract_start_date" : "'||compress(contract_start_date)||'" '||' , "contract_end_date" : "'||compress(contract_end_date)||'"'||' , "subscription_status_code" : "'||compress(subscription_status_code)||'"'||' , '||'"promo_code" : "'||compress(promo_code)||'" '||' , "print_or_digi_flag" : "'||compress(print_or_digi_flag)||'"'||' , "payment_method_selection" : "'||compress(payment_method_selection)||'"'||', "subscription_type_code" : "'||compress(subscription_type_code)||'"'||' , "report_trial_subscription" : "'||compress(report_trial_subscription)||'"'||' , "product_desc" : "'||compress(product_desc)||'"} , ');
end;
else if last.cpn then do;
flag = 'last';
concat = ('{"mpc" : "'||compress(mpc)||'" , '||'"contract_start_date" : "'||compress(contract_start_date)||'" '||' , "contract_end_date" : "'||compress(contract_end_date)||'"'||' , "subscription_status_code" : "'||compress(subscription_status_code)||'"'||' , '||'"promo_code" : "'||compress(promo_code)||'" '||' , "print_or_digi_flag" : "'||compress(print_or_digi_flag)||'"'||' , "payment_method_selection" : "'||compress(payment_method_selection)||'"'||', "subscription_type_code" : "'||compress(subscription_type_code)||'"'||' , "report_trial_subscription" : "'||compress(report_trial_subscription)||'"'||' , "product_desc" : "'||compress(product_desc)||'" }]');
end;
else if not first.cpn and not last.cpn then do;
flag = 'none';
concat = trim(('{"mpc" : "'||compress(mpc)||'" , '||'"contract_start_date" : "'||compress(contract_start_date)||'" '||' , "contract_end_date" : "'||compress(contract_end_date)||'"'||' , "subscription_status_code" : "'||compress(subscription_status_code)||'"'||' , '||'"promo_code" : "'||compress(promo_code)||'" '||' , "print_or_digi_flag" : "'||compress(print_or_digi_flag)||'"'||' , "payment_method_selection" : "'||compress(payment_method_selection)||'"'||', "subscription_type_code" : "'||compress(subscription_type_code)||'"'||' , "report_trial_subscription" : "'||compress(report_trial_subscription)||'"'||' , "product_desc" : "'||compress(product_desc)||'"} , '));
end;
if first.cpn then json_array = trim(concat);
else json_array = trim(json_array)||trim(concat);
run;
If, for instance, there are 4 records for a cpn,
and the length of json_array reaches 955 on the third record - this is where the value is trimmed and that is the final result for json_array for that cpn.
Both json_array & concat are set to 10,000 positions.
Why is it trimmed?
Thank you in advance.

From the SAS Documentation on TRIM() (https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#n1io938ofitwnzn18e1hzel3u9ut.htm)
In a DATA step, if the TRIM function returns a value to a variable that has not previously been assigned a length, then that variable is given the length of the argument.
If the size of JSON_ARRAY is not explicitly set, then it will be the length of CONCAT. If you don't set the length of CONCAT, then it will be the length it is first assigned (in this case, 955).
So add a FORMAT, or LENGTH statement and set JSON_ARRAY (you should do CONCAT too) and you should be good to go.

How read separate lines from files?

------------------------------------------------
Artiles for a magazine
------------------------------------------------
There are total 5 articles in the magazine
------------------------------------------------
ID : 3
Description : opis2
Price : 212
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 424
Date : 20324
------------------------------------------------
ID : 3
Description : 54
Price : 123
Incoming amount : 12
Outgoing amount : 0
Taxes : 0
Total : 1476
Date : 120915
------------------------------------------------
ID : 3
Description : opsi2
Price : 12
Incoming amount : 324
Outgoing amount : 0
Taxes : 0
Total : 3888
Date : 570509
------------------------------------------------
ID : 2
Description : vopi
Price : 2
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 4
Date : 951230
------------------------------------------------
ID : 1
Description : opis1
Price : 2
Incoming amount : 2
Outgoing amount : 0
Taxes : 0
Total : 4
Date : 101
------------------------------------------------
I have a file called directory.dat with the contents above. What I'm trying to do is the following.
I want to find all articles with the same ID in a given year and do the following : outgoing amount - incoming amount. So, my problem is how can I find all the articles with same ID in a given year (by the user) and do the outgoing amount-incoming amount for them, by working with the file?
I tried something like this:
ifstream directory("directory.dat");
//directory.open("directory.dat");
string line;
string priceLine = "Price : ";
int price;
while(getline(directory, line)){
if(line.find(priceLine) == 0){
cout << atoi(line.substr(priceLine.size()).c_str()) << endl;
}
}
cout << price << endl;
directory.close();
But I am far away from getting on the right track and I need some help to achieve something like this.

You need to define precisely the format of your input (perhaps as a BNF grammar). A single example is not enough. We can't guess if Artiles for a magazine is meaningful or not.
while(getline(directory, line)){
int colonpos = -1;
if (line.find("----")) {
/// check that line has only dashes, then
process_dash_line();
}
else if ((colonpos=line.find(':'))>0) {
std::string name = line.substr(0, colonpos-1);
std::string value = line.substr(colonpos+1);
process_name_value (name, value);
}
}
Also, study (and perhaps adapt) the source code of some free software C++ parsers for JSON (e.g. jsoncpp) and YAML (e.g. yaml-cpp). They will certainly give you some inspiration.
Learn more about C++ standard libraries, e.g. on cppreference.com & cplusplus.com (both sites are easy to read but are imperfect) and of course by reading the C++11 standard, or at least its draft n3337

Returning list in ANTLR for type checking, language java

I am working on ANLTR to support type checking. I am in trouble at some point. I will try to explain it with an example grammar, suppose that I have the following:
#members {
private java.util.HashMap<String, String> mapping = new java.util.HashMap<String, String>();
}
var_dec
: type_specifiers d=dec_list? SEMICOLON
{
mapping.put($d.ids.get(0).toString(), $type_specifiers.type_name);
System.out.println("identext = " + $d.ids.get(0).toString() + " - " + $type_specifiers.type_name);
};
type_specifiers returns [String type_name]
: 'int' { $type_name = "int";}
| 'float' {$type_name = "float"; }
;
dec_list returns [List ids]
: ( a += ID brackets*) (COMMA ( a += ID brackets* ) )*
{$ids = $a;}
;
brackets : LBRACKET (ICONST | ID) RBRACKET;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
LBRACKET : '[';
RBRACKET : ']';
In rule dec_list, you will see that I am returning List with ids. However, in var_dec when I try to put the first element of the list (I am using only get(0) just to see the return value from dec_list rule, I can iterate it later, that's not my point) into mapping I get a whole string like
[#4,6:6='a',<17>,1:6]
for an input
int a, b;
What I am trying to do is to get text of each ID, in this case a and b in the list of index 0 and 1, respectively.
Does anyone have any idea?

The += operator creates a List of Tokens, not just the text these Tokens match. You'll need to initialize the List in the #init{...} block of the rule and add the inner-text of the tokens yourself.
Also, you don't need to do this:
type_specifiers returns [String type_name]
: 'int' { $type_name = "int";}
| ...
;
simply access type_specifiers's text attribute from the rule you use it in and remove the returns statement, like this:
var_dec
: t=type_specifiers ... {System.out.println($t.text);}
;
type_specifiers
: 'int'
| ...
;
Try something like this:
grammar T;
var_dec
: type dec_list? ';'
{
System.out.println("type = " + $type.text);
System.out.println("ids = " + $dec_list.ids);
}
;
type
: Int
| Float
;
dec_list returns [List ids]
#init{$ids = new ArrayList();}
: a=ID {$ids.add($a.text);} (',' b=ID {$ids.add($b.text);})*
;
Int : 'int';
Float : 'float';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
Space : ' ' {skip();};
which will print the following to the console:
type = int
ids = [a, b, foo]
If you run the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRStringStream("int a, b, foo;"));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.var_dec();
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to add column to a data.table in R that is based on a string in another column? - regex

Related

Error handling in pyomo - division by zero

pyomo: an indexed variable should be linear or integer depending on the variable index

SAS: Why is the string trimmed at position 955

How read separate lines from files?

Returning list in ANTLR for type checking, language java

Categories

Resources