Hive: How to calculate difference in time? - mapreduce

I am using Hive 0.12.
Data:
customer_name val time
cust1 1 2014-05-19 05:12:43
cust1 2 2014-05-19 05:12:50
cust1 3 2014-05-19 05:13:27
cust1 4 2014-05-19 05:14:14
cust2 1 2014-05-19 05:16:27
cust2 2 2014-05-19 05:17:01
cust2 3 2014-05-19 05:17:05
I want the time difference for each customer from val=1 to val=n.
Expected output:
cust1 00:01:31
cust2 00:00:39
Also, the date could roll over to the next day for a customer, e.g.:
cust3 1 2014-05-19 23:59:00
cust3 1 2014-05-20 00:02:25
expected output:
cust3 00:02:26
First question: can this be done without a UDF?
Second question: if not, how can it be done using a UDF?

Before I answer, I am making two assumptions (correct me if they do not match your needs). First, the timestamps are sorted in the same order as val, i.e. for cust1 the timestamp of val 1 is earlier than the timestamp of val 2, and so on.
Second, the output is in seconds. Apply a formatting function on top of it to convert it to your desired format.
Here is the data in correct format:
cust(string),val(string),ts(timestamp)
cust1,1,2014-05-19 05:12:43
cust1,2,2014-05-19 05:12:50
cust1,3,2014-05-19 05:13:27
cust1,4,2014-05-19 05:14:14
cust2,1,2014-05-19 05:16:27
cust2,2,2014-05-19 05:17:01
cust2,3,2014-05-19 05:17:05
Query:
select cust, unix_timestamp(max(ts)) - unix_timestamp(min(ts)) from
temp_txns GROUP BY cust
Output:
cust1 91
cust2 38
Hope this works for you.
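As a cross-check of the arithmetic, here is a minimal Python sketch of the same max-minus-min idea, plus the seconds-to-HH:MM:SS formatting the question asks for. The names `span_seconds` and `format_hms` are illustrative only and are not Hive functions; the rows are the sample data above.

```python
from datetime import datetime

# Sample rows from the question: (customer, val, timestamp string).
ROWS = [
    ("cust1", 1, "2014-05-19 05:12:43"),
    ("cust1", 2, "2014-05-19 05:12:50"),
    ("cust1", 3, "2014-05-19 05:13:27"),
    ("cust1", 4, "2014-05-19 05:14:14"),
    ("cust2", 1, "2014-05-19 05:16:27"),
    ("cust2", 2, "2014-05-19 05:17:01"),
    ("cust2", 3, "2014-05-19 05:17:05"),
]


def span_seconds(rows):
    """Return {customer: seconds between earliest and latest timestamp}."""
    first, last = {}, {}
    for cust, _val, ts in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        first[cust] = min(first.get(cust, t), t)
        last[cust] = max(last.get(cust, t), t)
    return {c: int((last[c] - first[c]).total_seconds()) for c in first}


def format_hms(seconds):
    """Format a non-negative number of seconds as HH:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    for cust, secs in sorted(span_seconds(ROWS).items()):
        print(cust, format_hms(secs))
```

Because the subtraction works on full timestamps rather than on the time of day, a span that crosses midnight needs no special handling.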

Related

PowerBI DAX - Sum table by criteria and date

I am relatively new to PowerBI/PowerQuery/DAX and have become stuck on the following problem. I am unsure which road to go down to get the best outcome and would appreciate any help.
My data table is connected to a time tracking application. A user enters a time entry every time they complete a task. The task can be either a Project task or an Admin task. Each of these has multiple sub-categories beneath it, each with its own ID. This translates to my table as follows:
User ProjectID AdminID Hours Date
John 1 2 01/01/22
John 11 1 01/01/22
John 4 1 01/01/22
John 12 3 01/01/22
John 13 1 01/01/22
Pete 7 1 01/01/22
Pete 2 4 01/01/22
Pete 3 2 01/01/22
Mike 1 6 01/01/22
Mike 9 1 01/01/22
Mike 10 1 01/01/22
My objective is, for each Date in the table, to calculate the total hours spent either doing Project tasks or Admin tasks. I am not concerned about the specific breakdown (ie the sum of the unique IDs), rather the overall total. The above example covers just one day, in reality my data covers multiple years. My expected output will look like this :
User TotalProject TotalAdmin Date
John 3 5 01/01/22
John 3 4 01/02/22
John 5 2 01/03/22
Pete 5 1 01/01/22
Pete 1 8 01/02/22
Pete 6 2 01/03/22
Mike 6 2 01/01/22
Mike 6 1 01/02/22
Mike 7 2 01/03/22
I am unsure of the best method to achieve this: should I create some kind of column in the table through PowerQuery, or a calculated column using DAX? And if so, what would the SUM syntax look like?
I am very willing to learn, so any tips would be greatly appreciated!
For your sample input, just create 2 measures.
Total Admin = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[AdminID])))
Total Project = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[ProjectID])))
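To make the logic of the two measures concrete, here is a minimal Python sketch of the same aggregation: hours are summed per user and date, and each row counts towards Project or Admin depending on which ID column is filled in. The rows and the function name `totals_by_user_and_date` are hypothetical, since the sample table above does not show which ID column each value belongs to.

```python
from collections import defaultdict

# Hypothetical rows: (user, project_id, admin_id, hours, date).
# None marks a blank ID, mirroring ISBLANK in the DAX measures.
ENTRIES = [
    ("John", 1, None, 2, "01/01/22"),
    ("John", None, 11, 1, "01/01/22"),
    ("John", 4, None, 1, "01/01/22"),
    ("Pete", None, 7, 1, "01/01/22"),
    ("Pete", 2, None, 4, "01/01/22"),
]


def totals_by_user_and_date(entries):
    """Sum hours per (user, date), split by which ID column is non-blank."""
    totals = defaultdict(lambda: {"TotalProject": 0, "TotalAdmin": 0})
    for user, project_id, admin_id, hours, date in entries:
        if project_id is not None:
            totals[(user, date)]["TotalProject"] += hours
        if admin_id is not None:
            totals[(user, date)]["TotalAdmin"] += hours
    return dict(totals)


if __name__ == "__main__":
    for (user, date), sums in sorted(totals_by_user_and_date(ENTRIES).items()):
        print(user, sums["TotalProject"], sums["TotalAdmin"], date)
```

In Power BI the grouping by user and date comes from the visual's rows, which is why the measures themselves only need the blank check.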

PowerBI running Total formula

I have a dataset OvertimeHours with EMPLID, checkdate and NumberOfHours (and other fields). I need a running total NumberOfHours for each employee by checkdate. I tried using the Quick Measure option but that only allows for a single column and I have two. I do not want the measure to recalculate when filters are applied. Ultimately what I am trying to do is identify the records for the first 6 hours of overtime worked on each check so that they can get a category of OCB and all overtime over the first 6 hours is OTP and it does not have to be exact (as demonstrated in the output below). I have only been working with Power BI for about a month and this is a pretty complex (for me) formula to figure out...
EMPLID CheckDate WkDate NumberOfHours RunningTotal Category
124 1/1/19 12/20/18 5 5 OCB
124 1/1/19 12/21/18 9 14 OTP
125 1/1/19 12/20/18 3 3 OCB
125 1/1/19 12/20/18 2 5 OCB
125 1/1/19 12/22/18 2 7 OTP
124 1/15/19 1/8/19 3 3 OCB
*Edited to add the WkDate.
Edit:
I have tweaked my query so that I have the running total and a sequential counter now:
Using the first 12 records, I am looking to get the following results:
I can either do it in a query if that is the easiest way or if there is a way to use DAX in PowerBI with this dataset now that I have the sequential piece, I can do that too.
I got it in the query:
select r.CheckDate,
       r.EMPLID,
       case
           when PayrollRunningOTHours <= 6
           then PayrollRunningOTHours
           else 6
       end as OCBHours,
       case
           when PayRollRunningOTHours > 6
           then PayRollRunningOTHours - 6
       end as OTPHours
from #rollingtotal r
inner join lastone l
    on r.CheckDate = l.CheckDate
   and r.EMPLID = l.EMPLID
   and r.OTCounter = l.lastRec
order by r.emplid,
         r.CheckDate,
         r.OTCounter
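The running total and the 6-hour split can also be sketched in plain Python. This is only an illustration under stated assumptions: rows are taken from the sample table in the question and are assumed to arrive in work-date order, and the function names are hypothetical. Unlike the query, which leaves OTPHours as NULL when the total is 6 or less, this sketch returns 0.

```python
from collections import defaultdict

OCB_LIMIT = 6  # the first 6 overtime hours on each check count as OCB

# Rows from the question: (EMPLID, CheckDate, WkDate, NumberOfHours).
ROWS = [
    (124, "1/1/19", "12/20/18", 5),
    (124, "1/1/19", "12/21/18", 9),
    (125, "1/1/19", "12/20/18", 3),
    (125, "1/1/19", "12/20/18", 2),
    (125, "1/1/19", "12/22/18", 2),
    (124, "1/15/19", "1/8/19", 3),
]


def running_totals(rows):
    """Return one (running_total, category) pair per row, in input order."""
    totals = defaultdict(int)
    out = []
    for emplid, check_date, _wk_date, hours in rows:
        totals[(emplid, check_date)] += hours
        running = totals[(emplid, check_date)]
        out.append((running, "OCB" if running <= OCB_LIMIT else "OTP"))
    return out


def split_per_check(rows):
    """Return {(EMPLID, CheckDate): (ocb_hours, otp_hours)} from the final total."""
    totals = defaultdict(int)
    for emplid, check_date, _wk_date, hours in rows:
        totals[(emplid, check_date)] += hours
    return {
        key: (min(total, OCB_LIMIT), max(total - OCB_LIMIT, 0))
        for key, total in totals.items()
    }


if __name__ == "__main__":
    for row, (running, category) in zip(ROWS, running_totals(ROWS)):
        print(*row, running, category)
    print(split_per_check(ROWS))
```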

Dataset transpose ideas using AppEngine python without pandas

I need to convert this set of data [bigquery response]:
country metric quarter
Argentina 34174 1
Argentina 83961 2
Argentina 96373 3
Argentina 103782 4
Chile 7636 1
Chile 23434 2
Chile 19103 3
Chile 21729 4
to this:
Quarter Argentina Chile
1 83961 19103
2 96373 21729
3 103782 23434
4 34174 7636
I use AppEngine (Python). My idea is to use numpy 1.6.1, but I'm open to other ideas.
Edit query used:
SELECT country, activity, sum(metric), quarter, year
From *table* where country IN (*countries-parameters*)...
group by 1,2,4,5
order by 1,4 ASC
One solution, done directly in the BigQuery query:
Select
quarter,
SUM(IF(country=*country*,metric,0)) AS *country*,
...
From *table*
Where quarter IN (1,2,3,4) and country in (*countries-parameters*)
group by 1
order by 1 ASC;
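If the pivot has to happen in application code instead, it can be done with plain dictionaries, without numpy or pandas. A minimal sketch, assuming the response has already been unpacked into (country, metric, quarter) tuples; the function name `pivot_by_quarter` is illustrative.

```python
# Rows as they might look once unpacked from the query response:
# (country, metric, quarter).
ROWS = [
    ("Argentina", 34174, 1),
    ("Argentina", 83961, 2),
    ("Argentina", 96373, 3),
    ("Argentina", 103782, 4),
    ("Chile", 7636, 1),
    ("Chile", 23434, 2),
    ("Chile", 19103, 3),
    ("Chile", 21729, 4),
]


def pivot_by_quarter(rows):
    """Return (header, table) with one row per quarter, one column per country."""
    countries = sorted({country for country, _metric, _quarter in rows})
    quarters = sorted({quarter for _country, _metric, quarter in rows})
    cells = {}
    for country, metric, quarter in rows:
        # Sum in case a (quarter, country) pair appears more than once.
        cells[(quarter, country)] = cells.get((quarter, country), 0) + metric
    header = ["Quarter"] + countries
    table = [[q] + [cells.get((q, c), 0) for c in countries] for q in quarters]
    return header, table


if __name__ == "__main__":
    header, table = pivot_by_quarter(ROWS)
    print(*header)
    for row in table:
        print(*row)
```

Missing (quarter, country) combinations are filled with 0 here, matching the `IF(..., metric, 0)` default in the query.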

Converting daily data in to weekly in Pandas

I have a dataframe as given below:
Index Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
I want to convert the daily data into weekly data, grouped by country, with sum as the aggregation method.
I tried resampling, but the output was a MultiIndex data frame from which I was not able to access the "Country" and "Date" columns (please refer to the table above).
The desired output is given below:
Date Country Occurence
Week1 India 4
Week2
Week1 US 2
Week2
Week5 Germany 5
You can group by country and resample by week:
In [63]: df
Out[63]:
Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
In [64]: df.set_index('Date').groupby('Country').resample('W', how='sum')
Out[64]:
Occurence
Country Date
India 2014-01-05 3
2014-01-12 NaN
2014-01-19 1
UK 2014-02-09 5
US 2014-01-05 1
2014-01-12 1
You can then use reset_index() to turn the index levels back into ordinary "Country" and "Date" columns:
In [65]: df.set_index('Date').groupby('Country').resample('W', how='sum').reset_index()
Out[65]:
Country Date Occurence
0 India 2014-01-05 3
1 India 2014-01-12 NaN
2 India 2014-01-19 1
3 UK 2014-02-09 5
4 US 2014-01-05 1
5 US 2014-01-12 1
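The week labels in the output above are the Sundays that close each week. To show what that bucketing does, here is a minimal pandas-free sketch using only the standard library; the helper names are hypothetical. Unlike the resample output, it only lists weeks that contain data, so there is no NaN row for India on 2014-01-12.

```python
from collections import defaultdict
from datetime import date, timedelta

# The sample frame from the question as (Date, Country, Occurence) tuples.
ROWS = [
    (date(2013, 12, 30), "US", 1),
    (date(2013, 12, 30), "India", 3),
    (date(2014, 1, 10), "US", 1),
    (date(2014, 1, 15), "India", 1),
    (date(2014, 2, 5), "UK", 5),
]


def week_ending_sunday(day):
    """Return the Sunday closing the week that contains `day`.

    date.weekday() is 0 for Monday and 6 for Sunday, so a Sunday maps to itself.
    """
    return day + timedelta(days=(6 - day.weekday()) % 7)


def weekly_sums(rows):
    """Sum occurrences per (country, week-ending Sunday), sorted by key."""
    totals = defaultdict(int)
    for day, country, occurrence in rows:
        totals[(country, week_ending_sunday(day))] += occurrence
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    for (country, week_end), total in weekly_sums(ROWS).items():
        print(country, week_end.isoformat(), total)
```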

Summarizing data in SAS across groups

My data set is in this format as mentioned below:
NEWID
Age
H_PERS
Income
OCCU
FAMTYPE
REGION
Metro(Yes/No)
Exp_alcohol
population sample-(This is the weighted population each new id represents) etc.
I would like to generate a summarized view like below:
average expenditure value (This should be sum of (exp_alcohol/population sample))
% of population sample across Region Metro and each demographic variable
Please help me with your ideas.
Since I can't see your data set and your description was not very clear, I'm going to guess that your data looks something like this and that you would like to add some new variables that summarize it:
data alcohol;
input NEWID Age H_PERS Income OCCU $ FAMTYPE $ REGION $ Metro $
Exp_alcohol population_sample;
datalines;
1234 32 4 65000 abc m CA Yes 2 4
5678 23 5 35000 xyz s WA Yes 3 6
9923 34 3 49000 def d OR No 3 9
8844 26 4 54000 gdp m CA No 1 5
;
run;
data summar;
set alcohol;
retain TotalAvg_expend metro_count total_pop;
Divide = exp_alcohol/population_sample;
TotalAvg_expend + Divide;
total_pop + population_sample;
if metro = 'Yes' then metro_count + population_sample;
percent_metro = (metro_count/total_pop)*100;
drop NEWID Age H_PERS Income OCCU FAMTYPE REGION Divide;
run;
Output:
Metro  Exp_alcohol  population_sample  TotalAvg_expend  metro_count  total_pop  percent_metro
Yes    2            4                  0.50000          4            4          100.000
Yes    3            6                  1.00000          10           10         100.000
No     3            9                  1.33333          10           19         52.632
No     1            5                  1.53333          10           24         41.667
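The running columns in that output can be reproduced outside SAS to check the arithmetic. A minimal Python sketch, using the same four guessed sample rows; the function name `running_summary` is illustrative, and the rounding only mirrors the number of decimals in the listing.

```python
# Sample rows from the DATA step above: (Metro, Exp_alcohol, population_sample).
ROWS = [("Yes", 2, 4), ("Yes", 3, 6), ("No", 3, 9), ("No", 1, 5)]


def running_summary(rows):
    """Return (TotalAvg_expend, metro_count, total_pop, percent_metro) per row."""
    total_avg_expend = 0.0
    metro_count = 0
    total_pop = 0
    out = []
    for metro, exp_alcohol, population_sample in rows:
        # Same accumulators as the retained variables in the DATA step.
        total_avg_expend += exp_alcohol / population_sample
        total_pop += population_sample
        if metro == "Yes":
            metro_count += population_sample
        percent_metro = metro_count / total_pop * 100
        out.append(
            (round(total_avg_expend, 5), metro_count, total_pop, round(percent_metro, 3))
        )
    return out


if __name__ == "__main__":
    for row in running_summary(ROWS):
        print(*row)
```

As in the SAS output, only the last row carries the totals for the whole data set; the earlier rows are intermediate running values.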