I have a dataframe with 3,000,000 IDs. Each ID has the month range from 2015-01-01
t0 2018-12-01. Each ID has column "A" and "B" with numeric values. I need to create a new column "C:.
For each ID, when Date == '2015-01-01' which is the first month for that ID, column C value equal to exp(column_A value).
For the next month (Date == '2015-02-01'), column C value equal to exp(log(column_C_value in previous month) + column_B_value at this month), so here is exp(log(column C # 2015-01-01) + column_B # 2015-02-01). Each of the following months has the same pattern until it reaches 2018-12-01.
In Python, I can setup the loop for each ID and for each row/month, such as:
for ID in range(xxx):
for month in range(xxxx):
However, such calculation takes long time. Can anyone tell me a faster way to do this calculation? Very appreciated for your help!
Consider that
(1) exp(x+y) == exp(x)*exp(y).
And
(2) exp(log(x)) == x.
So exp(log(c1) + b2) == c1 * exp(b2).
The next value simplifies to c1 * exp(b2) * exp(b3), and so on.
That means, you have to multiply all exp(b) values, which can be turned
into adding all(b)-s and then applying exp() to the result.
And don't forget to multiply it with a1, the initial value.
a1 * exp(b2 * b3 * ...)
Can someone pl tell me what is rolling sum and how to implement it in Informatica?
My requirement is as below:(Given by client)
ETI_DUR :
SUM(CASE WHEN AGENT_EXPNCD_DIM.EXCEPTION_CD='SYS/BLDG ISSUES ETI' THEN IEX_AGENT_DEXPN.SCD_DURATION ELSE 0 END)
ETI_30_DAY :
ROLLING SUM(CASE WHEN (SYSDATE-IEX_AGENT_DEXPN.ROW_DT)<=30 AND AGENT_EXPNCD_DIM.EXCEPTION_CD = 'SYS/BLDG ISSUES ETI'
THEN IEX_AGENT_DEXPN.SCD_DURATION ELSE 0 END)
ETI_30_DAY_OVRG :
CASE WHEN ETI_DUR > 0 THEN
CASe
WHEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) BETWEEN 0 AND 600 AND ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) + ETI_DUR > 600 THEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 30 DAYS) - 600
WHEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) > 600 THEN ETI_DUR
ELSE 0 END
ELSE 0 END
And i have implemented as below in Informatica.
Expression Transformation:
o_ETI_DUR-- IIF(UPPER(EXCEPTION_CD_AGENT_EXPNDIM)='SYS/BLDG ISSUES ETI',SCD_DURATION,0)
o_ETI_29_DAY-- IIF(DATE_DIFF(TRUNC(SYSDATE),trunc(SCHD_DATE),'DD') <=29 AND UPPER(EXCEPTION_CD_AGENT_EXPNDIM) = 'SYS/BLDG ISSUES ETI' ,SCD_DURATION,0)
o_ETI_30_DAY -- IIF(DATE_DIFF(TRUNC(SYSDATE),trunc(SCHD_DATE),'DD') <=30 AND UPPER(EXCEPTION_CD_AGENT_EXPNDIM) = 'SYS/BLDG ISSUES ETI' ,SCD_DURATION,0)
Aggregator transformation:
o_ETI_30_DAY_OVRG:
IIF(sum(i_ETI_DUR) > 0,
IIF((sum(i_ETI_29_DAY)>=0 and sum(i_ETI_29_DAY)<=600) and (sum(i_ETI_29_DAY)+sum(i_ETI_DUR)) > 600,
sum(i_ETI_30_DAY) - 600,
IIF(sum(i_ETI_29_DAY)>600,sum(i_ETI_DUR),0)),0)
But is not working. Pl help ASAP.
Thanks a lot....!
Rolling sum is just the sum of some amount over a fixed duration of time. For example, everyday you can calculate the sum of expense for last 30 days.
I guess you can use an aggregator to calculate ETI_DUR, ETI_30_DAY and ETI_29_DAY. After that, in an expression you can implement the logic for ETI_30_DAY_OVRG. Note that you cannot write an IIF expression like that in an aggregator. Output ports must use an aggregate function.
Here is a rolling sum example:
count, rolling_sum
1,1
2,3
5,8
1,9
1,10
Basically it is the sum of the values listed previously. To implement it in Informatica use 'local variables' (variable port in expression transformation) as follows:
input port: count
variable port: v_sum_count = v_sum_count + count
output port: rolling_sum = v_sum_count
we have a moving sum function defined in Numerical functions in Expression transformation:
MOVINGSUM(n as numeric, i as integer, [where as expression]).
Please check if it helps.
I want to measure the number of periods (here years) since an event occurred (here represented by indicator variable pos) up to a given number of leads and lags (here three).
The following code works, but seems hackish and like I'm missing something fundamental. Is there a more robust solution that takes advantage of built in functions or a better logic? I'm on 11.2. Thanks!
version 11.2
clear
* generate annual data
set obs 40
generate country = cond(_n <= 20, "USA", "UK")
bysort country: generate year = 1766 + _n
generate pos = 1 if (year == 1776)
* generate years since event (up to three)
encode country, generate(countryn)
xtset countryn year
generate time_to_pos = 0 if (pos == 1)
forvalues i = 1/3 {
replace time_to_pos = `i' if (l`i'.pos == 1)
replace time_to_pos = -1 * `i' if (f`i'.pos == 1)
}
Clear question.
This can be shortened. Here is one way. Starting with your code to set up a sandpit
version 11.2
clear
* generate annual data
set obs 40
generate country = cond(_n <= 20, "USA", "UK")
bysort country: generate year = 1766 + _n
Now it is
gen time_to_pos = year - 1776 if abs(1776 - year) <= 3
That is all that seems needed for your example. If you want to generalise to multiple events within each panel, I'd like to know the rules for such events.
I was going to show a trick from http://www.stata-journal.com/article.html?article=dm0055 but it doesn't appear needed.
I have this insane homework where I have to create an expression to validate date with respect to Julian and Gregorian calendar and many other things ...
The problem is that it must be all in one expression, so I can't use any ;
Are there any options of defining variable in expression? Something like
d < 31 && (bool leapyear = y % 4 == 0) || (leapyear ? d % 2 : 3) ....
where I could define and initialize one or more variables and use them in that one expression without using any ;?
Edit: It is explicitly said, that it must be a one-line expression. No functions ..
The way I'm doing it right now is writing macros and expanding them, so I end up with stuff like this
#define isJulian(d, m, y) (y < 1751 || (y == 1752 && (m < 9) || (m == 9 && d <= 2)))
#define isJulianLoopYear(y) (y % 4 == 0)
#define isGregorian(d, m, y) (y > 1573 || (y == 1752 && (m > 9) || (m == 9 && d > 13)))
#define isGregorianLoopYear(y) ((y % 4 == 0) || (y % 400 = 0))
// etc etc ....
looks like this is the only suitable way to solve the problem
edit: Here is original question
Suppose we have variables d m and y containing day, month and year. Task is to write one single expression which decides, if date is valid or not. Value should be true (non-zero value) if date is valid and false (zero) if date is not valid.
This is an example of expression (correct expression would look something like this):
d + 4 == y ^ 85 ? ~m : d * (y-2)
These are examples of wrong answers (not expressions):
if ( log ( d ) == 1752 ) m = 1;
or:
for ( i = 0; i < 32; i ++ ) m = m / 2;
Submit only file containing only one single expression without ; at the end. Don't submit commands or whole program.
Until 2.9.1752 was Julian calendar, after that date is Gregorian calendar
In Julian calendar is every year dividable by 4 a leap year.
In Gregorian calendar is leap year ever year, that is dividible by 4, but is not dividible by 100. Years that are dividable by 400 are another exception and are leap years.
1800, 1801, 1802, 1803, 1805, 1806, ....,1899, 1900, 1901, ... ,2100, ..., 2200 are not loop years.
1896, 1904, 1908, ..., 1996, 2000, 2004, ..., 2396,..., 2396, 2400 are loop years
In september 1752 is another exception, when 2.9.1752 was followed by 14.9.1752, so dates 3.9.1752, 4.9.1752, ..., 13.9.1752 are not valid.
((m >0)&&(m<13)&&(d>0)&&(d<32)&&(y!=0)&&(((d==31)&&
((m==1)||(m==3)||(m==5)||(m==7)||(m==8)||(m==10)||(m==12)))
||((d<31)&&((m!=2)||(d<29)))||((d==29)&&(m==2)&&((y<=1752)?((y%4)==0):
((((y%4)==0)&&((y%100)!=0))
||((y%400)==0)))))&&(((y==1752)&&(m==9))?((d<3)||(d>13)):true))
<evil>
Why would you define a new one, if you can reuse an existing one? errno is a perfect temporary variable.
</evil>
I think the intent of the homework is to ask you to do this without using variables, and what you are trying to do might defeat its purpose.
In standard C++, this is not possible. G++ has an extension known as statement expressions that can do that.
I don't believe you can, but even if you could, it would only have scope inside of the parentheses they are defined in (in your example) and cannot be used outside of them.
Your solution, which I will not provide fully for you, will probably go along these lines:
isJulian ? isJulianLeapyear : isGregorianLeapyear
To make it more specific, it could be like this:
isJulian ? (year % 4) == 0 : ((year % 4) == 0 || (year % 400) == 0)
You'll have to just make sure your algorithm is correct. I'm not a calender expert, so I wouldn't know about that.
First: Don't. It may be cute, but even if there's an extension that allows it, code golf is a dangerous game that will almost always end up causing mmore grief than it solves.
Okay, back to the 'real' question as defined by the homework. Can you make additional functions? If so, instead of capturing whether or not it's a leap year in a variable, make a function isLeapYear(int year) that returns the correct value.
Yes, that means you'll calculate it more than once. If that ends up being a performance issue, I'd be incredibly surprised... and it's a premature optimization to worry about that in the first place.
I'd be very surprised if you weren't allowed to write functions as part of doing this. It seems like that'd be half the point of an exercise like this.
......
Okay, so here's a quick overview of what you'll need to do.
First, basic verification - that month, day, year are possible values at all - month 0-11 (assuming 0-based), day 0-30, year non-negative (assuming that's a constraint).
Once you're past that, I'd probably check for the 1752 special cases.
If that's not relevant, the 'regular' months can be handled pretty simply.
This leaves us with the leap year cases, which can be broken down into two expressions - whether something is a leap year (which will be broken down additionally based on gregorian/julian), and whether the date is valid at that point.
So, at the highest level, your expression looks something like this:
areWithinRange(d,m,y) && passes1752SpecialCases(d,m,y) && passes30DayMonths(d,m,y) && passes31DayMonths(d,m,y) && passesFebruaryChecks(d,m,y)
If we assume that we only return false from our sub-expressions if we actively detect a rule break (31 days in June for the 30DayMonth rule returns false, but 30 days in February is irrelevant and so passes true), then we can pretty much say that the logic at that level is correct.
At this point, I'd write separate functions for the individual pieces (as pure expressions, a single return ... statement). Once you've gotten those in place, you can replace the method call in your top-level expression with the expanded version. Just make sure you parenthesize (is that a word?) everything sufficiently.
I'd also make a test harness program that uses the expression and has a number of valid and invalid inputs, and verifies that you're doing the right thing. You can write that in a function for ease of cut and paste for the final turn-in by doing something like:
bool isValidDate(int d, int m, int y)
{
return
// your expression here
}
Since the expression will be on a line by itself, it'll be easy to cut and paste.
You may find other ways to simplify your logic - excepting the 1752 special cases, days between 1 and 28 are always valid, for instance.
Considering it is homework, I think the best advice would be an approach to deriving your own solution.
If I were tackling this assignment, I would start by breaking the rules.
I would write a c++ function that given the variables d, m, and y, returns a boolean result on the validity of the date. I would use as many non-recursive helper functions as needed, and feel free to use if, else if, and else, but no looping aloud.
I would then Inline all helper functions
I would reduce all if, else if, and else statement to ? : notation
If I was successful at limiting my use of variables, I might be able to reduce all this to a single function with no variables - the body of which will contain the single expression I seek.
Good Luck.
You clearly have to have the date passed in somehow. Beyond that, all you're really going to be doing is chaining && and || (assuming we get the date as a tm struct):
#include <ctime>
bool validate(tm date)
{
return (
// sanity check that all values are positive
date.tm_mday >= 1 && date.tm_mon >= 0 && date.tm_year >= 0
// check for valid days
&& ((date.tm_mon == 0 && date.tm_mday <= 31)
|| (date.tm_mon == 1 && date.tm_mday <= (date.tm_year % 4 ? 28 : 29))
|| (date.tm_mon == 2 && date.tm_mday <= 31)
// yadda yadda for the other months
|| (date.tm_mon == 11 && date.tm_mday <= 31))
);
}
The parenthesis around date.tm_year % 4 ? 28 : 29 actually aren't needed, but I'm including them for readability.
UPDATE
Looking at a comment, you'll also need similar rules to validate dates that don't exist in the Gregorian calendar.
UPDATE II
Since you're dealing with dates in the past you will need to implement a more correct leap year test. However, I generally deal with dates in the future, and this incorrect leap year test will give correct results in 2012, 2016, 2020, 2024, 2028, 2032, 2036, 2040, 2044, 2048, 2052, 2056, 2060, 2064, 2068, 2072, 2076, 2080, 2084, 2088, 2092, and 2096. I will make a prediction that before this test fails in 2100 silicon-based computers will be forgotten relics. I seriously doubt we will use C++ on the quantum computers in use then. Besides, I won't be the programmer assigned to fix the bug.