Populating a DataFrame from 2 other DataFrames - python-2.7

I have been trying to combine the information from 2 dataframes into a single new dataframe without luck. I have searched extensively, but still can't find any relevant answer, so apologies if I have missed it in my search.
For an investing strategy, out of a large set of currencies (more than 50) I have picked the top 5 currencies to invest in for every date (in top_n.csv), together with the respective % weight to invest in each currency on each date (in weights.csv).
top_n.csv looks like:
Date          0        1         2       3       4
Aug 12, 2016  bitcoin  ethereum  0       0       0
Aug 11, 2016  bitcoin  ethereum  ripple  steem   litecoin
Aug 10, 2016  bitcoin  ethereum  ripple  0       0
Aug 09, 2016  bitcoin  ethereum  steem   ripple  ethereum-classic
weights.csv looks like:
Date          0      1      2      3      4
Aug 12, 2016  0.859  0.089  nan    nan    nan
Aug 11, 2016  0.856  0.092  0.020  0.016  0.016
Aug 10, 2016  0.853  0.093  0.020  nan    nan
Aug 09, 2016  0.858  0.086  0.020  0.020  0.017
The DataFrame which I am trying to populate contains the same dates (in the index), but has a number of columns corresponding to a larger set of coins (more than 50), as in W.csv.
Is there an efficient way that, for each date, assigns the right weight to every currency that has one and leaves the others at 0? The tricky part is dealing with dates when there are not enough currencies (so top_n.csv has fewer than n currencies, and weights.csv has NaNs in the respective positions).
W.csv looks like:
Date bitcoin ethereum bitcoin-cash ripple litecoin dash neo nem monero ethereum-classic iota qtum omisego lisk cardano zcash bitconnect tether stellar ....
Aug 12, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 11, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 10, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
Aug 09, 2016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ....
My target is to end up with a DataFrame that looks like W_all_target, which I attach as a file since it would not display correctly here (I have edited it by hand for this question).
I have saved three indicative CSVs, as it may help to examine them:
https://drive.google.com/open?id=1olx9ARI0XP5mqbqF1pfRfJyl9wIEWyZj
I am still learning, so I understand this may be a simple question. Sincere thanks!!
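A minimal loading sketch for the options below, assuming the three CSVs are laid out as shown above (the file names and the Date index come from the question; the astype(str) on top_n is an assumption that keeps the 0 placeholders as the string '0', which the masking below relies on):
import pandas as pd

# hypothetical loading step; Date becomes the index of all three frames
top_n = pd.read_csv('top_n.csv', index_col='Date', parse_dates=True).astype(str)
weights = pd.read_csv('weights.csv', index_col='Date', parse_dates=True)
W = pd.read_csv('W.csv', index_col='Date', parse_dates=True)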

Option 0
This option accommodates both the zeros and the NaNs:
dates = top_n.index.repeat(top_n.shape[1])  # one date per (date, column) cell
currs = top_n.values.ravel()                # currency names, flattened row by row
wghts = weights.values.ravel()              # weights, flattened the same way
mask = currs != '0'                         # drop the '0' placeholders
reshaped = pd.Series(wghts[mask], [dates[mask], currs[mask]]).unstack(fill_value=0)
W.update(reshaped)
Option 1
reshaped = pd.concat([d.stack() for d in [top_n, weights]], axis=1) \
    .reset_index(1, drop=True).set_index(0, append=True)[1].unstack(fill_value=0)
reshaped
0           bitcoin  ethereum  ethereum-classic  litecoin  ripple  steem
Date
2016-08-09    0.858     0.086             0.017     0.000    0.02  0.020
2016-08-10    0.853     0.093             0.000     0.016    0.02  0.018
2016-08-11    0.856     0.092             0.000     0.016    0.02  0.016
2016-08-12    0.859     0.089             0.000     0.016    0.02  0.015
Option 2
reshaped = pd.Series(
    weights.values.ravel(),
    [top_n.index.repeat(top_n.shape[1]), top_n.values.ravel()]
).unstack(fill_value=0)
reshaped
            bitcoin  ethereum  ethereum-classic  litecoin  ripple  steem
Date
2016-08-09    0.858     0.086             0.017     0.000    0.02  0.020
2016-08-10    0.853     0.093             0.000     0.016    0.02  0.018
2016-08-11    0.856     0.092             0.000     0.016    0.02  0.016
2016-08-12    0.859     0.089             0.000     0.016    0.02  0.015
Then you should be able to update W with
W.update(reshaped)
W
bitcoin ethereum bitcoin-cash ripple litecoin dash neo nem monero ethereum-classic iota qtum omisego lisk cardano zcash bitconnect tether stellar
Date
2016-08-12 0.859 0.089 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-11 0.856 0.092 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-10 0.853 0.093 0 0.02 0.016 0 0 0 0 0.000 0 0 0 0 0 0 0 0 0
2016-08-09 0.858 0.086 0 0.02 0.000 0 0 0 0 0.017 0 0 0 0 0 0 0 0 0

Related

pandas: count the occurrence of months across years

I have a DataFrame (df_m) with a large number of rows, as below, and I want to plot the number of occurrences of each month, for the years 2010-2017, in the date_m column.
   db   num       date_a      date_m      date_c      zip_b    zip_a
0  old  HKK10032  2010-07-14  2010-07-26  NaT         NaN      NaN
1  old  HKK10109  2011-07-14  2011-09-15  NaT         NaN      NaN
2  old  HNN10167  2012-07-15  2012-08-09  NaT         177-003  NaN
3  old  HKK10190  2013-07-15  2013-09-02  NaT         NaN      NaN
4  old  HKK10251  2014-07-16  2014-05-02  NaT         NaN      NaN
5  old  HKK10253  2015-07-16  2015-05-01  NaT         NaN      NaN
6  old  HNN10275  2017-07-16  2017-07-18  2010-07-18  1070062  NaN
7  old  HKK10282  2017-07-16  2017-08-16  NaT         NaN      NaN
...
First, I extract the number of occurrences of each month (1-12) for every year (2010-2017), but there is an error in my code:
lst_all = []
for i in range(2010, 2018):
    lst_num = [sum(df_m.date_m.dt.month == j & df_m.date_m.dt.year == i) for j in range(1, 13)]
    lst_all.append(lst_num)
print lst_all
You need to add parentheses around the conditions, because & binds more tightly than the == comparisons:
lst_all = []
for i in range(2010, 2018):
    lst_num = [((df_m.date_m.dt.month == j) & (df_m.date_m.dt.year == i)).sum() for j in range(1, 13)]
    lst_all.append(lst_num)
Then get:
df1 = pd.DataFrame(lst_all, index=range(2010, 2018), columns=range(1, 13))
print (df1)
      1  2  3  4  5  6  7  8  9  10  11  12
2010  0  0  0  0  0  0  1  0  0   0   0   0
2011  0  0  0  0  0  0  0  0  1   0   0   0
2012  0  0  0  0  0  0  0  1  0   0   0   0
2013  0  0  0  0  0  0  0  0  1   0   0   0
2014  0  0  0  0  1  0  0  0  0   0   0   0
2015  0  0  0  0  1  0  0  0  0   0   0   0
2016  0  0  0  0  0  0  0  0  0   0   0   0
2017  0  0  0  0  0  0  1  1  0   0   0   0
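As a side note, the same table can be built without the explicit loop; here is a sketch using pd.crosstab (assuming date_m is already a datetime column, and that matplotlib is available for the plot; reindex fills the empty years and months, such as 2016, with zeros):
# one-pass year x month frequency table
df1 = (pd.crosstab(df_m.date_m.dt.year, df_m.date_m.dt.month)
         .reindex(index=range(2010, 2018), columns=range(1, 13), fill_value=0))
df1.plot(kind='bar', stacked=True)  # one bar per year, stacked by month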

How to convert Object to Float in Python

I have the following DataFrame:
     Daily_KWH_System  year  month  day  hour  minute  second
0         4136.900384  2016      9    7     0       0       0
1         3061.657187  2016      9    8     0       0       0
2         4099.614033  2016      9    9     0       0       0
3         3922.490275  2016      9   10     0       0       0
4         3957.128982  2016      9   11     0       0       0
5         4177.014316  2016      9   12     0       0       0
6         3077.103445  2016      9   13     0       0       0
7         4123.103795  2016      9   14     0       0       0
..                ...   ...    ...  ...   ...     ...     ...
551               NaN  2016     11   23     0       0       0
552               NaN  2016     11   24     0       0       0
553               NaN  2016     11   25     0       0       0
..                ...   ...    ...  ...   ...     ...     ...
579               NaN  2016     11   27     0       0       0
580               NaN  2016     11   28     0       0       0
The variable types are as follows:
print(df.dtypes)
Daily_KWH_System    object
year                 int32
month                int32
day                  int32
hour                 int32
minute               int32
second               int32
I need to convert Daily_KWH_System to float so that I can use it in a linear regression model.
I tried the below code, which worked fine.
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
Then I replaced the NaNs with a blank space, to use the data in my model, with the following code:
df = df.replace(np.nan, ' ', regex=True)
But the variable Daily_KWH_System gets converted back to object as soon as I replace the NaNs.
Please let me know how to go about this.
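The likely cause is that ' ' is a string, so replacing the NaNs with it forces the whole column back to object dtype. A sketch of a fix that keeps the column numeric (filling with 0 is only an assumption; the column mean or an interpolation may suit a regression better):
df['Daily_KWH_System'] = pd.to_numeric(df['Daily_KWH_System'], errors='coerce')
# fill the NaNs with a number, not a blank string, so the dtype stays float64
df['Daily_KWH_System'] = df['Daily_KWH_System'].fillna(0)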

Constraint Programming: Scheduling with multiple workers

I'm new to constraint programming. I imagine this is an easy problem but I can't wrap my head around it. Here's the problem:
We have multiple machines (N), each with a limited resource (let's say memory, and it can be the same for all machines.)
We have T tasks, each with a duration and each requiring some amount of the resource.
A machine can work on multiple tasks at the same time as long as its resource isn't exceeded.
A task cannot be split among machines and it has to be done in one shot (i.e. no pausing).
How do we assign the tasks to the machines to minimize the end-time or the number of machines used?
It seems like I should be able to achieve this with the cumulative predicate, but that appears to be limited to scheduling one set of tasks for one worker with a limited global resource, rather than a variable number of workers.
I'm just learning CP and MiniZinc. Any ideas on how to generalize cumulative? Alternatively, is there an existing MiniZinc model I can learn from that does something like this (or close enough)?
Thanks,
PS: I don't have any concrete data, since this is a hypothetical/learning exercise for the most part. Imagine you have 10 machines and 10 tasks with durations (in hours) 2,4,6,5,2,1,4,6,3,12 and memory requirements (in GB) 1,2,4,2,1,8,12,4,1,10. Each machine has 32 GB of RAM.
Here's a model that seems to be correct. However, it doesn't use cumulative at all, since I wanted to visualize as much as possible (see below).
The main idea is that, for each time step in 1..max_time, each machine must carry tasks totalling at most 32 GB. The forall loop checks, for each machine, that the sum of the memory of the tasks active at that time on that machine stays within 32 GB.
The output section shows the solution in different ways. See comments below.
The model is a slightly edited version of http://hakank.org/minizinc/scheduling_with_multiple_workers.mzn
Update: I should also mention that this model allows for different sizes of RAM on the machines, e.g. some machines with 64 GB and some with 32 GB. This is demonstrated, but commented out, in the model on my site. Since the model uses value_precede_chain/2, which ensures that the machines are used in order, it's recommended that the machines be ordered by decreasing RAM size (so the bigger machines are used first).
(Also, I've modeled the problem in Picat: http://hakank.org/picat/scheduling_with_multiple_workers.pi )
include "globals.mzn";
int: num_tasks = 10;
int: num_machines = 10;
array[1..num_tasks] of int: duration = [2,4,6,5,2,1,4,6,3,12]; % duration of tasks
array[1..num_tasks] of int: memory = [1,2,4,2,1,8,12,4,1,10]; % RAM requirements (GB)
int: max_time = 30; % max allowed time
% RAM for each machine (GB)
array[1..num_machines] of int: machines_memory = [32 | i in 1..num_machines];
% decision variables
array[1..num_tasks] of var 1..max_time: start_time; % start time for each task
array[1..num_tasks] of var 1..max_time: end_time; % end time for each task
array[1..num_tasks] of var 1..num_machines: machine; % which machine to use
array[1..num_machines,1..max_time] of var 0..max(machines_memory): machine_used_ram;
var 1..num_machines: machines_used = max(machine);
var 1..max_time: last_time = max(end_time);
% solve :: int_search(start_time ++ machine ++ array1d(machine_used_ram), first_fail, indomain_split, complete) minimize last_time;
solve :: int_search(start_time ++ machine ++ array1d(machine_used_ram), first_fail, indomain_split, complete) minimize machines_used;
constraint
forall(t in 1..num_tasks) (
end_time[t] = start_time[t] + duration[t] -1
)
% /\ cumulative(start_time,duration,[1 | i in 1..num_tasks],machines_used)
/\
forall(m in 1..num_machines) (
% check all the times when a machine is used
forall(tt in 1..max_time) (
machine_used_ram[m,tt] = sum([memory[t]*(machine[t]=m)*(tt in start_time[t]..end_time[t]) | t in 1..num_tasks]) /\
machine_used_ram[m,tt] <= machines_memory[m]
% sum([memory[t]*(machine[t]=m)*(tt in start_time[t]..end_time[t]) | t in 1..num_tasks]) <= machines_memory[m]
)
)
% ensure that machine m is used before machine m+1 (for machine_used)
/\ value_precede_chain([i | i in 1..num_machines],machine)
;
output [
"start_time: \(start_time)\n",
"durations : \(duration)\n",
"end_time : \(end_time)\n",
"memory : \(memory)\n",
"last_time : \(last_time)\n",
"machine : \(machine)\n",
"machines_used: \(machines_used)\n",
]
++
[ "Machine memory per time:\n "]
++
[ show_int(3,tt) | tt in 1..max_time ]
++
[
if tt = 1 then "\n" ++ "M" ++ show_int(2, m) ++ ": " else " " endif ++
show_int(2,machine_used_ram[m,tt])
| m in 1..num_machines, tt in 1..max_time
]
++ ["\n\nTime / task: machine(task's memory)\n Task "] ++
[
show_int(7,t)
| t in 1..num_tasks
]
++
[
if t = 1 then "\nTime " ++ show_int(2,tt) ++ " " else " " endif ++
if tt in fix(start_time[t])..fix(end_time[t]) then
show_int(2,fix(machine[t])) ++ "(" ++ show_int(2,memory[t]) ++ ")"
else
" "
endif
| tt in 1..fix(last_time), t in 1..num_tasks
]
;
The model has two "modes": one to minimize the time ("minimize last_time") and one to minimize the number of machines used ("minimize machines_used").
The result of minimizing the time is:
start_time: [11, 8, 3, 8, 11, 8, 9, 7, 8, 1]
durations : [2, 4, 6, 5, 2, 1, 4, 6, 3, 12]
end_time : [12, 11, 8, 12, 12, 8, 12, 12, 10, 12]
memory : [1, 2, 4, 2, 1, 8, 12, 4, 1, 10]
last_time : 12
machine : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
machines_used: 1
Machine memory per time:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
M 1: 10 10 14 14 14 14 18 31 31 31 32 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 3: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 5: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 6: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 7: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 8: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M10: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Time / task: machine(task's memory)
Task 1 2 3 4 5 6 7 8 9 10
Time 1 1(10)
Time 2 1(10)
Time 3 1( 4) 1(10)
Time 4 1( 4) 1(10)
Time 5 1( 4) 1(10)
Time 6 1( 4) 1(10)
Time 7 1( 4) 1( 4) 1(10)
Time 8 1( 2) 1( 4) 1( 2) 1( 8) 1( 4) 1( 1) 1(10)
Time 9 1( 2) 1( 2) 1(12) 1( 4) 1( 1) 1(10)
Time 10 1( 2) 1( 2) 1(12) 1( 4) 1( 1) 1(10)
Time 11 1( 1) 1( 2) 1( 2) 1( 1) 1(12) 1( 4) 1(10)
Time 12 1( 1) 1( 2) 1( 1) 1(12) 1( 4) 1(10)
----------
==========
The first part, "Machine memory per time", shows how loaded each machine (1..10) is at each time step (1..30).
The second part, "Time / task: machine(task's memory)", shows for each time step (rows) and each task (columns) which machine is used and the task's memory, in the form "machine(memory of the task)".
The second way of using the model, minimizing the number of machines used, gives this result (edited to save space), i.e. one machine is enough to handle all the tasks within the allowed time (time steps 1..22).
start_time: [19, 11, 3, 9, 20, 22, 13, 7, 17, 1]
durations : [2, 4, 6, 5, 2, 1, 4, 6, 3, 12]
end_time : [20, 14, 8, 13, 21, 22, 16, 12, 19, 12]
memory : [1, 2, 4, 2, 1, 8, 12, 4, 1, 10]
last_time : 22
machine : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
machines_used: 1
Machine memory per time:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
M 1: 10 10 14 14 14 14 18 18 16 16 18 18 16 14 12 12 1 1 2 2 1 8 0 0 0 0 0 0 0 0
M 2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
....
Time / task: machine(task's memory)
Task 1 2 3 4 5 6 7 8 9 10
Time 1 1(10)
Time 2 1(10)
Time 3 1( 4) 1(10)
Time 4 1( 4) 1(10)
.....
----------
==========
This is an old question, but here is a CP Optimizer model for this problem (in Python).
In this version, I minimize a lexicographic objective: first minimize the makespan (the optimal value is 12), then, given this makespan, minimize the number of machines used (here, one can execute all the tasks on a single machine and still finish at 12).
DUR = [2,4,6,5,2,1,4,6,3,12]
MEM = [1,2,4,2,1,8,12,4,1,10]
CAP = 32
TASKS = range(len(DUR))
MACHINES = range(10)

from docplex.cp.model import *

model = CpoModel()

# Decision variables: tasks and alloc
task = [interval_var(size=DUR[i]) for i in TASKS]
alloc = [[interval_var(optional=True) for j in MACHINES] for i in TASKS]

# Objective terms
makespan = max(end_of(task[i]) for i in TASKS)
nmachines = sum(max(presence_of(alloc[i][j]) for i in TASKS) for j in MACHINES)

# Objective: minimize makespan, then number of machines used
model.add(minimize_static_lex([makespan, nmachines]))

# Allocation of tasks to machines
model.add([alternative(task[i], [alloc[i][j] for j in MACHINES]) for i in TASKS])

# Machine capacity
model.add([sum(pulse(alloc[i][j], MEM[i]) for i in TASKS) <= CAP for j in MACHINES])

# Resolution
sol = model.solve(trace_log=True)

# Display solution
for i in TASKS:
    for j in MACHINES:
        s = sol.get_var_solution(alloc[i][j])
        if s.is_present():
            print('Task ' + str(i) + ' scheduled on machine ' + str(j) + ' on [' + str(s.get_start()) + ',' + str(s.get_end()) + ')')
And the result is:
! ----------------------------------------------------------------------------
! Minimization problem - 110 variables, 20 constraints
! Initial process time : 0.00s (0.00s extraction + 0.00s propagation)
! . Log search space : 66.4 (before), 66.4 (after)
! . Memory usage : 897.0 kB (before), 897.0 kB (after)
! Using parallel search with 8 workers.
! ----------------------------------------------------------------------------
! Best Branches Non-fixed W Branch decision
0 110 -
+ New bound is 12; 0
* 12 111 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 7
* 12 131 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 6
* 12 151 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 5
* 12 171 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 4
* 12 191 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 3
* 12 211 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 2
* 12 231 0.01s 1 (gap is 100.0% # crit. 2 of 2)
New objective is 12; 1
! ----------------------------------------------------------------------------
! Search completed, 7 solutions found.
! Best objective : 12; 1 (optimal)
! Best bound : 12; 1
! ----------------------------------------------------------------------------
! Number of branches : 1318
! Number of fails : 40
! Total memory usage : 6.7 MB (6.6 MB CP Optimizer + 0.1 MB Concert)
! Time spent in solve : 0.00s (0.00s engine + 0.00s extraction)
! Search speed (br. / s) : 131800.0
! ----------------------------------------------------------------------------
Task 0 scheduled on machine 4 on [4,6)
Task 1 scheduled on machine 4 on [4,8)
Task 2 scheduled on machine 4 on [1,7)
Task 3 scheduled on machine 4 on [0,5)
Task 4 scheduled on machine 4 on [4,6)
Task 5 scheduled on machine 4 on [0,1)
Task 6 scheduled on machine 4 on [0,4)
Task 7 scheduled on machine 4 on [1,7)
Task 8 scheduled on machine 4 on [4,7)
Task 9 scheduled on machine 4 on [0,12)

Vectorized way to create a frequency vector from a set of observations in R?

Question
I have a vector of observations with their year of occurrence, and I want to create a vector of frequencies over a longer period for the purposes of curve fitting. I can do this easily with a function, but is there a simpler method or one that uses inherent vectorization? It may be I'm forgetting something simple.
Reproducible example
Data
Events <- data.frame(c(1991, 1991, 1995, 1999, 2007, 2007, 2010, 2010, 2010, 2014), seq(1100, 2000, 100))
names(Events) <- c("Year", "Loss")
Period <- seq(1990, 2014)
Function
FreqV <- function(Period, Observations){
  n <- length(Period)
  F <- double(n)
  for(i in seq_len(n)) {
    F[i] = sum(Observations == Period[i])
  }
  return(F)
}
Expected Results
FreqV(Period, Events$Year)
[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Post acceptance update
It bothered me why the C++ version of the algorithm (see comments under the accepted answer) was so much slower, and I finally realized that the reason is that it is a naïve translation of FreqV above. If there are n periods and m events, it has to do n*m comparisons. Even in C++ this is slow.
tabulate presumably does a one-pass algorithm, and when I coded a simple one-pass algorithm in C++, it came out between 5 and 8 times faster than tabulate:
Naïve C++ Code
// [[Rcpp::export]]
std::vector<int> FV_C(std::vector<int> P, std::vector<int> O) {
  int n = P.size();
  std::vector<int> F(n);
  for (int i = 0; i < n; ++i){
    F[i] = std::count(O.begin(), O.end(), P[i]);
  }
  return(F);
}
One-pass C++ Code
// [[Rcpp::export]]
std::vector<int> FV_C2(std::vector<int> P, std::vector<int> O) {
  int n = P.size();
  int m = O.size();
  int MinP = *std::min_element(P.begin(), P.end());
  std::vector<int> F(n, 0);
  for (int i = 0; i < m; ++i){
    int offset = O[i] - MinP;
    F[offset] += 1;
  }
  return(F);
}
Speed test
Tests were done on an i7-2600K overclocked to 4.6 GHz with 16 GB RAM, on Windows 7 64-bit, with R 3.1.2 compiled with OpenBLAS 2.13.
set.seed(1)
vals <- sample(sample(10000, 100), 100000, TRUE)
period <- 1:10000
f1a <- function() tabulate(factor(vals, period), nbins = length(period))
f1b <- function() tabulate((vals-period[1])+1, nbins = length(period))
f2 <- function() unname(table(c(period, vals))-1)
library(microbenchmark)
all.equal(f1a(), f1b(), f2(), FV_C(period, vals), FV_C2(period, vals))
[1] TRUE
microbenchmark(f1a(), f1b(), f2(), FV_C(period, vals), FV_C2(period, vals), times = 100L)
Unit: microseconds
expr min lq mean median uq max neval
f1a() 26998.194 27812.6250 29515.375 28167.645 28703.4515 55456.079 100
f1b() 640.049 712.4235 1291.356 800.136 1522.0890 27814.561 100
f2() 34228.449 35746.6655 39686.660 36210.395 36768.3900 65295.374 100
FV_C(period, vals) 647577.794 647927.3040 648729.027 648221.417 648848.5090 659463.813 100
FV_C2(period, vals) 140.877 147.7270 169.085 158.449 170.3625 1095.738 100
I would recommend factor and table or tabulate.
Here's tabulate:
tabulate(factor(Events$Year, Period))
# [1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
It might even be faster to do something like:
tabulate((Events$Year-Period[1])+1)
For both of these, you should probably specify nbins (nbins = length(Period)), in case the maximum value in Events$Year is less than the maximum value in Period.
Here's a performance comparison:
set.seed(1)
vals <- sample(sample(10000, 100), 100000, TRUE)
period <- 1:10000
f1a <- function() tabulate(factor(vals, period), nbins = length(period))
f1b <- function() tabulate((vals-period[1])+1, nbins = length(period))
f2 <- function() unname(table(c(period, vals))-1)
library(microbenchmark)
microbenchmark(f1a(), f1b(), f2())
# Unit: microseconds
# expr min lq mean median uq max neval
# f1a() 41784.904 43665.394 46789.753 44278.093 45654.546 95032.59 100
# f1b() 884.465 1162.254 2261.118 1275.154 2756.922 46641.87 100
# f2() 54837.666 57615.562 71386.516 58863.272 100893.389 130235.33 100
You can solve this problem with table:
table(c(Period,Events$Year))-1
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
# 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0
# 2010 2011 2012 2013 2014
# 3 0 0 0 1
To get rid of the names, use:
unname(table(c(Period,Events$Year))-1)
# [1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
You could try
colSums(Vectorize(function(x) x==Events$Year)(Period))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or
colSums(outer(Events$Year, Period, FUN=function(x,y) x==y))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or using data.table
library(data.table)
CJ(Period, Events$Year)[, V3:=V1][, sum(V1==V2), V3]$V1
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or if it is ordered
c(0,diff(findInterval(Period,Events$Year)))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1
Or using a combination of tabulate with fmatch
library(fastmatch)
tabulate(fmatch(Events$Year, Period), nbins=length(Period))
#[1] 0 2 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 3 0 0 0 1

How to create dummy variables to indicate two values are the same using SAS

My data looks like:
ID YEAR A B
1078 1989 1 0
1078 1999 1 1
1161 1969 0 0
1161 2002 1 1
1230 1995 0 0
1230 2002 0 1
1279 1996 0 0
1279 2003 0 1
1447 1993 1 0
1447 2001 1 1
1487 1967 0 0
1487 2008 1 1
1487 2008 1 0
1487 2009 0 1
1678 1979 1 0
1678 2002 1 1
1690 1989 1 0
1690 1993 0 1
1690 1993 0 0
1690 1996 0 1
1690 1996 0 0
1690 1997 1 1
I'd like to create two dummy variables, new and X. The scenarios are as follows:
within each ID-B pair (a pair is 2 observations, one with B=0 and the other with B=1, with YEARs closest together in sequence),
if the observation with B=1 has a value of 1 for A, then new=1 for both observations in that pair; otherwise it is 0 for both observations in that pair, and
if the pair has the same value in A then X=0, and if they have different values then X=1.
Therefore, the output would be:
ID YEAR A B new X
1078 1989 1 0 1 0
1078 1999 1 1 1 0
1161 1969 0 0 1 1
1161 2002 1 1 1 1
1230 1995 0 0 0 0
1230 2002 0 1 0 0
1279 1996 0 0 0 0
1279 2003 0 1 0 0
1447 1993 1 0 1 1
1447 2001 1 1 1 1
1487 1967 0 0 1 1
1487 2008 1 1 1 1
1487 2008 1 0 0 1
1487 2009 0 1 0 1
1678 1979 1 0 1 0
1678 2002 1 1 1 0
1690 1989 1 0 0 1
1690 1993 0 1 0 1
1690 1993 0 0 0 0
1690 1996 0 1 0 0
1690 1996 0 0 1 1
1690 1997 1 1 1 1
My code is:
data want;
  set have;
  by ID;
  if B=1 and A=1 then new=1;
  else new=0;
run;

proc sql;
  create table out as
  select a.*, max(a.B=a.A & a.B=1) as new, ^(min(A)=max(A)) as X
  from have a
  group by ID;
quit;
The first one doesn't work, and the second one reorders variable B. I am stuck here. Any help will be greatly appreciated.
You need to do some research into first./last. processing and the lag function.
The helpful guys here have already gotten you to this point, maybe take this as an opportunity to read the documentation at SAS' Support Site.
At a high level:
You need a conditional statement to step through each observation in an ID group.
Find out how many observations are in that group (let's say N obs).
Flag up if any obs match the logic you mentioned.
Lag back N obs and set your new variable to 1 or 0 respectively.
A rough sketch of the same pairing logic, in pandas rather than SAS, follows below.
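For illustration only, here is that pairing logic made concrete in pandas (not SAS; the column names come from the sample data, and the data is assumed to be already sorted by ID and YEAR as shown):
import pandas as pd

# a new pair starts at every B == 0 row
df['pair'] = (df['B'] == 0).cumsum()

def flag_pair(g):
    a_at_b1 = g.loc[g['B'] == 1, 'A']           # A on the B == 1 row of the pair
    g['new'] = int(not a_at_b1.empty and a_at_b1.iloc[0] == 1)
    g['X'] = int(g['A'].nunique() > 1)          # 1 if A differs within the pair
    return g

df = df.groupby('pair', group_keys=False).apply(flag_pair)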
A very manual solution: I just used the retain statement to identify the pairs (the dataset is already in the required order).
data start;
  set start;
  retain pair 0;
  if B=0 then pair=pair+1;
run;

data ForNew;
  set start(where=(B=1));
  New=(A=B); /* Boolean variable = 1 if the condition in parentheses is true */
  keep pair New;
run;

/* if A has equal values, the mean will be 0 or 1 */
proc means data=start NWAY NOPRINT;
  class pair;
  var A;
  output out=ForX(drop=_: where=(media in (0,1)) keep=pair media) mean(A)=media;
run;

data end;
  merge start ForNew ForX(in=INX drop=media);
  by pair;
  X=(^INX);
run;