Merge Time Series in Apache Pig - mapreduce

I think my problem is trivial, but I'm new to Pig and I can't see an obvious answer in the documentation. I have two time series I wish to merge. Let's say one of them is just a stream of events X:
100 A
200 B
300 C
400 D
500 E
600 F
Then another series, call it Y, indicates when some state changes happen:
50 on
250 off
350 on
450 off
I would like to tag the first time series X with the current on/off status from Y. So specifically I want:
100 A on
200 B on
300 C off
400 D on
500 E off
600 F off
If I were writing this in another language I might merge sort X and Y and then take a single pass through the result, remembering the last on/off status and tagging the X entries.
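For illustration, here is a rough sketch of that single-pass idea in Python (assuming both series are lists of (timestamp, value) pairs already sorted by time; the function name tag_with_status is made up for this sketch):
def tag_with_status(x, y, default='off'):
    # Walk X in time order, carrying forward the most recent status seen in Y.
    result = []
    status = default
    j = 0
    for t, value in x:
        while j < len(y) and y[j][0] <= t:
            status = y[j][1]
            j += 1
        result.append((t, value, status))
    return result

X = [(100, 'A'), (200, 'B'), (300, 'C'), (400, 'D'), (500, 'E'), (600, 'F')]
Y = [(50, 'on'), (250, 'off'), (350, 'on'), (450, 'off')]
print(tag_with_status(X, Y))
# [(100, 'A', 'on'), (200, 'B', 'on'), (300, 'C', 'off'),
#  (400, 'D', 'on'), (500, 'E', 'off'), (600, 'F', 'off')]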
What is the best way to do this in Pig? I have received some existing code which uses a JOIN of X and Y and then filters it, but I think the data inflation caused by the join is unnecessary.

I don't think there is a very easy solution. Here is some pseudo-code:
X1 = Rank X;
Y1 = Rank Y;
XY = JOIN X1 BY $0 LEFT OUTER, Y1 BY $0;
SPLIT XY INTO status_known IF status is not null, status_unknown OTHERWISE;
--Y2: Find out last status in Y1 (with Group all, max)
--Y3: Cross status_unknown with Y2
UNION status_known and Y3

Related

How can I generate a square wave plot of a pulse train of multiple signals from the data in a csv file (in Linux)?

For instance, given the data in a text file:
10:37:18.459 1
10:37:18.659 0
10:37:19.559 1
How could this be displayed as an image that looked like a square wave that correctly represented the high time and low time? I am trying both gnuplot and scipy. The result should ultimately include more than one sensor, and all plots would have to be displayed above one another so as to show a time delta.
The code in the following link creates a square wave from the formulas listed: link to waveforms. How can the lower waveform (pwm) be driven by the numbers above if they were in a file (to show a high state for 200 ms, then a low state for 100 ms, and finally a high state)?
If I understood your question correctly, you want to plot a step function based on time data. To avoid further guessing, please specify your question in more detail.
In gnuplot there is the plotting style with steps. Check help steps.
Code:
### display waveform as steps
reset session
$Data <<EOD
10:37:18.459 1
10:37:18.659 0
10:37:19.559 1
10:37:19.789 0
10:37:20.123 1
10:37:20.456 0
10:37:20.789 1
EOD
set yrange [-0.05:1.2]
myTimeFmt = "%H:%M:%S" # input time format
set format x "%M:%.1S" time # output time format on x axis
plot $Data u (timecolumn(1,myTimeFmt)):2 w steps lc rgb "red" lw 2 ti "my square wave"
### end of code
Result: (the resulting red step plot is not reproduced here)
The answer I ended up with was:
import os
import datetime
import numpy as np
import matplotlib.pyplot as plt

# self.__outfile holds the path to the comma-separated file produced by the previous step
file_info = os.stat(self.__outfile)
if file_info.st_size:
    x, y, z, a = np.genfromtxt(self.__outfile, delimiter=',', unpack=True)
    fig = plt.figure(self.__outfile)
    ax = fig.add_subplot(111)
    fig.canvas.draw()
    # turn the float timestamps back into readable tick labels
    test_array = [(datetime.datetime.utcfromtimestamp(e2).strftime('%d_%H:%M:%S.%f')).rstrip('0') for e2 in x]
    plt.xticks(x, test_array)
    # one square wave per sensor, offset vertically so they stack above one another
    l1, = plt.plot(x, y, drawstyle='steps-post')
    l2, = plt.plot(x, a - 2, drawstyle='steps-post')
    l3, = plt.plot(x, z - 4, drawstyle='steps-post')
    ax.grid()
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('HIGH/LOW')
    ax.set_ylim((-6.5, 1.5))
    ax.set_title('Sensor Sequence')
    fig.autofmt_xdate()
    ax.legend([l1, l2, l3], ['sprinkler', 'lights', 'alarm'], loc='lower left')
    plt.show()
I had an input file that contained convertDateToFloat values; that file was passed in to this function. The name (__outfile) is perhaps misleading, but in the previous function it was the output.
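For reference, here is a stripped-down, self-contained sketch of the same steps-post idea for a single signal, assuming input lines in the question's two-column HH:MM:SS.fff format (the file name data.txt is only a placeholder):
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime

# hypothetical input file: one "HH:MM:SS.fff value" pair per line
times, levels = [], []
with open('data.txt') as f:
    for line in f:
        t, v = line.split()
        times.append(datetime.strptime(t, '%H:%M:%S.%f'))
        levels.append(int(v))

fig, ax = plt.subplots()
# steps-post holds each level until the next sample, giving the square-wave look
ax.plot(times, levels, drawstyle='steps-post')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%M:%S'))
ax.set_ylim(-0.2, 1.2)
ax.set_ylabel('HIGH/LOW')
fig.autofmt_xdate()
plt.show()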

Query on plotting Lorenz curves on Stata

I am trying to plot a Lorenz curve, using the following command:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
generate rank1=rank
label variable rank "Cum share of mortality"
label variable rank1 "Equality Line"
twoway (line rank1 rank, sort clwidth(medthin) clpat(longdash))(line yord rank , sort clwidth(medthin) clpat(red)), ///
ytitle(Cumulative share of drug activity, size(medsmall)) yscale(titlegap(2)) xtitle(Cumulative share of mortality (2012), size(medsmall)) ///
legend(rows(5)) xscale(titlegap(5)) legend(region(lwidth(none))) plotregion(margin(zero)) ysize(6.75) xsize(6) plotregion(lcolor(none))
However, in the resulting curves the line of equality does not start from 0. Is there a way to fix this?
Is it recommended to use the following in order to get the perfect 45 degree line of equality:
(function y=x, range(0 1))
Also, how many minimum observations are required to plot the above graph? Does it work well with 2 observations as well?
The reason your line of perfect equality does not pass through (0,0) is that the values of your rank variable do not contain 0.
The smallest value you will have for rank will be 1/_N. Although this value will asymptotically approach 0, it will never actually reach 0.
To see this, try:
quietly sum rank
di r(min)
di 1/_N
Further, by applying the program code to your data (beginning around line 152 in the ado file and removing unnecessary bits), one can easily see that yord cannot take on a value of 0 without values of 0 for drugs:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
sort death drugs , stable
gen double rank1 = _n / _N
qui sum drugs
gen yord1= (sum(drugs) / _N) / r(mean)
The best way to plot your equality line would be the method from your edit, namely:
twoway(function y = x, ra(0 1))
One quick yet (very) crude fix to force the Lorenz curve to start at the origin (if it doesn't already) is to add an observation to the data after obtaining rank and yord, and then delete it after you have your curve:
glcurve drugs, sortvar(death) pvar(rank) glvar(yord) lorenz nograph
expand 2 in 1
replace yord = 0 in 1
replace rank = 0 in 1
twoway (function y = x, ra(0 1)) ///
(line yord rank)
drop in 1
Like I said, this is admittedly crude and even somewhat ill-advised, but I can't see a much better alternative at the moment, and with this method you will not be altering any of the other values of yord by running glcurve on the expanded data.

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column which should be:
1 if FO.variable is inside Variable
0 if FO.variable is not inside Variable
Seems like a simple thing... in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past few hours trying to get this in R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to force grepl into ifelse, which resulted in grepl using only the first value in the column instead of the entire column.
My next attempt was to use transform and grep based off another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there... the code ran for a while, then errored because of an invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach but I'm pretty desperate at this point. I haven't used for loops before as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right; not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){
  if(dd[i,4] %in% dd[i,2])
    dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 if FO.variable was inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time perhaps because they were looping over every iteration across both columns?
This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps the unlisted v back to the row from which it came
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)
    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable),
           as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.
I would go with a simple mapply in your case; as you correctly said, by-row operations will be very slow. Also (as suggested by Martin), setting fixed = TRUE and converting to character beforehand will significantly improve performance.
transform(dd, Keep = mapply(grepl,
                            as.character(FO.variable),
                            as.character(variable),
                            fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match with rn & FO.variable pairs (in the original table, dt).

How to normalize the data

Normalize the data set to make the norm of each data point equal to 1.
x1 (1.5,1.7) [x1 (i,j)]
x2 (2,1.9)
x3 (1.6,1.8)
x4 (1.2,1.5)
x5 (1.5,1.0)
Given a new data point, x = (1.4, 1.6), as a query,
the solution after normalization is:
x(0.6585,0.7526)
x1(0.6616,0.7498 )
x2(0.7250,0.6887)
x3(0.6644,0.7474)
x4(0.6247,0.7809)
x5(0.8321,0.5547)
But I am confused about how the solution is obtained; I tried different formulas and none of them worked.
For x = (1.4, 1.6): norm(x) = sqrt(1.4^2 + 1.6^2) ~ 2.13.
The normalized x is (1.4/2.13, 1.6/2.13) ~ (0.6585, 0.7526); the same works for every other point.
You have been trying column-wise normalization, but the text demands normalizing each data point to unit length.
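A quick sketch of that row-wise (unit-length) normalization, written here in Python with numpy assumed, reproduces the numbers above:
import numpy as np

# data points from the question; the first row is the query x
points = np.array([
    [1.4, 1.6],  # x  (query)
    [1.5, 1.7],  # x1
    [2.0, 1.9],  # x2
    [1.6, 1.8],  # x3
    [1.2, 1.5],  # x4
    [1.5, 1.0],  # x5
])

# divide each row by its Euclidean norm so every data point has length 1
normalized = points / np.linalg.norm(points, axis=1, keepdims=True)
print(np.round(normalized, 4))
# prints (approximately):
# [[0.6585 0.7526]
#  [0.6616 0.7498]
#  [0.725  0.6887]
#  [0.6644 0.7474]
#  [0.6247 0.7809]
#  [0.8321 0.5547]]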

rrd4j archive type

I can't manage to create an archive with the correct type.
What am I missing?
My example is very similar to the official example on https://code.google.com/p/rrd4j/wiki/Tutorial
RRD creation:
rrdDef.setStartTime(L - 300);
rrdDef.addDatasource("speed", DsType.GAUGE, 600, Double.NaN, Double.NaN);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 1, 24);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 6, 10);
I add some values: (1,2,3 for each step)
long x = L;
while (x <= L + 4200) {
    Sample sample = rrdDb.createSample();
    sample.setAndUpdate((x + 11) + ":1");
    sample.setAndUpdate((x + 12) + ":2");
    sample.setAndUpdate((x + 14) + ":3");
    x += 300;
}
And then I fetch it:
FetchRequest fetchRequest = rrdDb.createFetchRequest(ConsolFun.MAX, (L - 600), L + 4500);
FetchData fetchData = fetchRequest.fetchData();
String s = fetchData.dump();
I get the result: (hoping to find the maximum)
920804100: NaN
920804400: NaN
920804700: +1.0000000000E00
920805000: +1.0166666667E00
920805300: +1.0166666667E00
...
920808600: +1.0166666667E00
920808900: +1.0166666667E00
920809200: NaN
I would like to see the maximum value here. I tried it with TOTAL as well, and I get the same result.
What do I have to change so that I get the greatest value sent in one step, or the sum of the values sent in one step?
Thanks
MAX is not the maximum input value but the maximum consolidated data point. What you're telling rrd, given your example, is:
At one point in time I'm going 1MPH
One second later I'm going 2MPH
Two seconds later I'm going 3MPH
rrd now has 3 data points covering 3 seconds of a 300-second interval. What should rrd store: 1, 2, or 3? None of the above; it has to normalize the data in some way to say that between X and X+STEP the rate is Y.
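(Rough back-of-the-envelope illustration, assuming the updates land roughly 11, 12 and 14 seconds into a step as in your code: the 300-second primary data point becomes a time-weighted average of the rates, roughly (11*1 + 1*2 + 2*3 + 286*1)/300 ~ 1.0167, which is where the +1.0166666667E00 values in your fetch output come from.)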
To complicate matters, it's not certain that your 3 data points land in the same 300-second interval. Your first 2 data points could be in one interval and the third could be in the next one. This is because the first data point stored is not exactly at start+step; i.e. if you start at 14090812456, it might be something like 14090812700 even though your step is 300.
The only way to store exact input values with GAUGE is to push updates at the exact step times at which rrd stores the data points: I'm going 1MPH at x, 2MPH at x+300, 3MPH at x+600, where x starts at the first data point.
Here is a bash example showing this working with your rrd settings. I'm using a constant start time, with x starting at what I know is rrd's first data point.
L=1409080000
rrdtool create max.rrd --start=$L DS:speed:GAUGE:600:U:U RRA:MAX:0.5:1:24 RRA:MAX:0.5:6:10
x=$(($L+200))
while [ $x -lt $(($L+3000)) ]; do
    rrdtool update max.rrd "$(($x)):1"
    rrdtool update max.rrd "$(($x+300)):2"
    rrdtool update max.rrd "$(($x+600)):3"
    x=$(($x+900))
done
rrdtool fetch max.rrd MAX -r 600 -s 1409080000
speed
1409080200: 1.0000000000e+00
1409080500: 2.0000000000e+00
1409080800: 3.0000000000e+00
1409081100: 1.0000000000e+00
1409081400: 2.0000000000e+00
1409081700: 3.0000000000e+00
1409082000: 1.0000000000e+00
Not really that useful, but if you increase the resolution to, say, 1200 seconds, you start getting the max over larger time intervals:
rrdtool fetch max.rrd MAX -r 1200 -s 1409080000
speed
1409081400: 3.0000000000e+00
1409083200: 3.0000000000e+00
1409085000: nan
1409086800: nan
1409088600: nan