I have been using the below query to create a table within Athena,
CREATE EXTERNAL TABLE IF NOT EXISTS test.test_table (
  `converteddate` string,
  `userid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3:XXXX'
TBLPROPERTIES ('has_encrypted_data'='false',"skip.header.line.count"="1")
This returns me:
converteddate | userid
-------------------------------------
2017-11-29T05:00:00 | 00001
2017-11-27T04:00:00 | 00002
2017-11-26T03:00:00 | 00003
2017-11-25T02:00:00 | 00004
2017-11-24T01:00:00 | 00005
I would like to return:
converteddate | userid
-------------------------------------
2017-11-29 05:00:00 | 00001
2017-11-27 04:00:00 | 00002
2017-11-26 03:00:00 | 00003
2017-11-25 02:00:00 | 00004
2017-11-24 01:00:00 | 00005
and have converteddate as a datetime and not a string.
It is not possible to convert the data during table creation, but you can convert it while querying.
You can use the date_parse(string, format) -> timestamp function; see the Presto date and time functions documentation for details.
For your use case you can do something like the following:
select date_parse(converteddate, '%Y-%m-%dT%H:%i:%s') as converted_timestamp, userid
from test_table
Note: based on the shape of your strings, you have to choose the proper specifier for month (always two digits or not), day, hour (12- or 24-hour format), etc.
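If you want the conversion applied automatically for every query, one option (a sketch, reusing the table and column names from the question) is to wrap it in a view:

CREATE OR REPLACE VIEW test.test_table_ts AS
SELECT
  date_parse(converteddate, '%Y-%m-%dT%H:%i:%s') AS converteddate,
  userid
FROM test.test_table;

Querying the view then returns converteddate as a proper timestamp without repeating the parse everywhere.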
(My answer has one premise: you are using OpenCSVSerDe. It doesn't apply to LazySimpleSerDe, for instance.)
If you have the option of changing the format of your input CSV file, you should convert your timestamp to UNIX Epoch Time. That's the format that OpenCSVSerDe is expecting.
For instance, your sample CSV looks like this:
"converteddate","userid"
"2017-11-29T05:00:00","00001"
"2017-11-27T04:00:00","00002"
"2017-11-26T03:00:00","00003"
"2017-11-25T02:00:00","00004"
"2017-11-24T01:00:00","00005"
It should be:
"converteddate","userid"
"1511931600000","00001"
"1511755200000","00002"
"1511665200000","00003"
"1511575200000","00004"
"1511485200000","00005"
Those integers are the number of milliseconds since midnight, January 1, 1970 (UTC) for each of your original dates.
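If you need to generate those values, any tool that emits epoch milliseconds will do; for example, Presto/Athena itself can compute them (a sketch; assumes the session time zone is UTC, which is Athena's default):

SELECT CAST(to_unixtime(TIMESTAMP '2017-11-29 05:00:00') AS bigint) * 1000;
-- 1511931600000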
Then you can run a slightly modified version of your CREATE TABLE statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test.test_table (
converteddate timestamp,
userid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3:XXXX'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test_table, this will be the result:
converteddate userid
------------------------- --------
2017-11-29 05:00:00.000 00001
2017-11-27 04:00:00.000 00002
2017-11-26 03:00:00.000 00003
2017-11-25 02:00:00.000 00004
2017-11-24 01:00:00.000 00005
As you can see, type TIMESTAMP on Athena includes milliseconds.
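If you ever want the display without the milliseconds, as in the desired output above, you can format the timestamp back to a string at query time (a sketch using Presto's date_format, reusing the table from the question):

SELECT date_format(converteddate, '%Y-%m-%d %H:%i:%s') AS converteddate, userid
FROM test.test_table;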
I wrote a more comprehensive explanation on using the types TIMESTAMP and DATE with OpenCSVSerDe.
Related
I have several tables in my data model. Among others I have the table "Sales" (with all sold products by customer, date, and marketing source) and the table "Media Spend" (with all marketing sources and their budget per month).
TABLE "SALES“
Product | Contract ID | Customer ID | Sales Date | Source
A | 001 | C1 | dd.mm.yyyy hh.mm.ss | Source A
B | 002 | C2 | dd.mm.yyyy hh.mm.ss | Source B
B | 003 | C1 | dd.mm.yyyy hh.mm.ss | Source B
C | 004 | C3 | dd.mm.yyyy hh.mm.ss | Source C
D | 005 | C6 | dd.mm.yyyy hh.mm.ss | Source F
TABLE "MEDIA SPEND"
Source | Spend | Campaign Month
Source A | 500 € | mm.yyyy
Source B | 600 € | mm.yyyy
Source C | 300 € | mm.yyyy
Source D | 100 € | mm.yyyy
Source E | 550 € | mm.yyyy
Source F | 1,000 € | mm.yyyy
The tables are connected by the relation "Source".
It should be mentioned that the "Sales Date" is much more detailed (dd.mm.yyyy hh.mm.ss) than the "Campaign Month" (mm.yyyy).
The relation allows me to filter both customers and marketing budgets by "Source". But at the same time I want to calculate and filter by date (e.g. "Sales Date"), and this is not possible.
How can I proceed to relate the two different columns in different tables?
I have already tried the following:
1. Building the relation based on both date columns => the problem just shifts from date to source.
2. Creating a second connection (besides "Source") for the columns "Sales Date" and "Campaign Month" => the data model shows the connection as dashed (inactive); otherwise there is no effect.
Thanks!
You can create a dates table by selecting New Table when you're in Power BI.
Copy and paste the below and your dates table will be created:
DimDate = CALENDAR( DATE( 2018, 1, 1 ) , DATE( 2024, 12, 31 ) )
or
DimDate = CALENDARAUTO( 3 )
Either of these will provide you with a single-column table of dates.
After creating this table, you can create additional columns using the following dax for each column:
CalendarYear = YEAR( DimDate[Date] )
CalendarMonthInt = MONTH( DimDate[Date] )
CalendarDay = DAY( DimDate[Date] )
CalendarMonthName = FORMAT( DimDate[Date], "mmmm" )
CalendarShortMonth = LEFT( DimDate[CalendarMonthName], 3 )
YearMonth = CONCATENATE( YEAR( DimDate[Date] ), FORMAT( MONTH( DimDate[Date] ), "00" ) )
After adding these columns, your date table will contain the year, month number, day, month name, short month name, and a year-month key.
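To bridge the granularity gap from the original question, a possible next step (a sketch, not tested against your model) is to add matching year-month keys to the fact tables and relate everything through DimDate. This assumes Sales[Sales Date] and 'Media Spend'[Campaign Month] are stored as dates; if Campaign Month is text like "11.2017", parse it into a date first:

SalesYearMonth = FORMAT( Sales[Sales Date], "YYYYMM" )
SpendYearMonth = FORMAT( 'Media Spend'[Campaign Month], "YYYYMM" )

With both tables keyed to the same YearMonth format as DimDate[YearMonth], you can filter sales and media spend by the same period.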
I have a dataset where I wish to place the data in two groups, based on the value name. I then wish to average the result for one group and sum the result for the other. Finally, I wish to sum these two to create a barchart.
Updated
Here is the data:
Id Total Avail Date group used
A 10 5 9/1/2020 Group1 5
A 40 20 9/1/2020 Group1 20
B 20 10 9/1/2020 Group2 10
B 10 5 9/1/2020 Group2 5
B 10 5 9/1/2020 Group2 5
A 20 10 9/1/2020 Group1 10
A 20 10 9/1/2020 Group1 10
B 10 5 9/1/2020 Group2 5
B 10 5 9/1/2020 Group2 5
This is what I have done:
1. First, create a calculated field named group to split the data into groups:
IF [Id] = 'A' THEN 'Group1'
ELSEIF [Id] = 'B' THEN 'Group2'
ELSE 'none'
END
2. Then create a calculated field sum_avg to sum the Group1 (A) values and average the Group2 (B) values:
CASE [group]
WHEN 'Group1' THEN { FIXED [Id]: SUM([Avail])}
WHEN 'Group1' THEN {FIXED [Id]: SUM([used])}
WHEN 'Group1' THEN {FIXED [Id]: SUM([Total])}
WHEN 'Group2' THEN { FIXED [Id]: AVG([Avail])}
WHEN 'Group2' THEN {FIXED [Id]: AVG([used])}
WHEN 'Group2' THEN {FIXED [Id]: AVG([Total])}
END
Desired result:
I wish to add the Group1 (sum) and the Group2 (avg) for Avail, Used, and Total so that the final chart combines the two blue values and combines the green values.
The SUM of Group1 Avail = 45 and the AVERAGE of Group2 Avail = 6, so I wish the Avail section (blue) of the bar chart to be 51, and the Used (green) should be 51 as well, with the total as 102 (I'll add it to the tooltip).
I am still researching this, and any suggestion is appreciated.
UPDATE
I have a dataset where I wish to reflect the totals from the custom SQL query. Here is some sample data:
Size Tb Val type Group Sum_AVG SKU Last_refreshed
270 90.5 Free_Space_TB Group2 90.5 Excel 9/1/2020
270 179.5 Used Group2 179.5 Excel 9/1/2020
The avail and used values appear when I hover over my view, but how would I include the total?
This is the calculation I am using (thanks to help from a SO member):
{SUM({Fixed [type]: ZN(sum(if [Group]= 'Group1' then [Val] end))})
+
sum({Fixed [type]: zn(avg(if [Group] = 'Group2' then [Val] end))})}
SUM_AVG is:
zn(sum(if [Group]= 'Group1' then [Val] end))
+
zn(avg(if [Group] = 'Group2' then [Val] end))
I am doing something wrong, because it is totaling up across all the columns, when I just want the total for each column.
(Used was created using a custom query.)
The following solution is proposed:
The total column is of no use; please drop it, as it will unnecessarily increase the size of your data. (For the tooltip, I'll show below how to get the total.)
sample data used
+----+-------+--------+------+
| Id | Avail | group | used |
+----+-------+--------+------+
| A | 5 | Group1 | 5 |
+----+-------+--------+------+
| A | 20 | Group1 | 20 |
+----+-------+--------+------+
| B | 10 | Group2 | 10 |
+----+-------+--------+------+
| B | 5 | Group2 | 5 |
+----+-------+--------+------+
| B | 5 | Group2 | 5 |
+----+-------+--------+------+
| A | 10 | Group1 | 10 |
+----+-------+--------+------+
| A | 10 | Group1 | 10 |
+----+-------+--------+------+
| B | 5 | Group2 | 5 |
+----+-------+--------+------+
| B | 5 | Group2 | 5 |
+----+-------+--------+------+
Pivot the two columns (used and avail) in Tableau.
Create a calculated field CF as
zn(sum(if [Group]= 'Group1' then [Val] end))
+
zn(avg(if [Group] = 'Group2' then [Val] end))
Drag AGG(CF) to the Rows shelf and also to Text, and put Type on Colors in the Marks card; you'll get your desired view with values of 51 for both types, used and avail. (zn() turns a null aggregate into 0, so the two terms can safely be added.)
For the total in the tooltip (i.e. 102), create a calculated field total as:
{SUM({Fixed [Type]: ZN(sum(if [Group]= 'Group1' then [Val] end))})
+
sum({Fixed [Type]: zn(avg(if [Group] = 'Group2' then [Val] end))})}
Add this field to Detail on the Marks card (instead of Tooltip, as you normally would).
Then click Tooltip and edit the text there to taste. I have edited it like this:
Out of total <total> TB SKU, <AGG(CF)> was <Type>
P.S./EDIT: This is regarding pivoting data in Tableau. Instead of pivoting in Tableau after connecting to the complete data, you can modify the SQL query itself when creating the connection. Two options:
Option 1: create the used column in SQL and pivot in Tableau. While importing the data/creating the connection, use this query:
select ID, date, avail, total - avail AS used, group from table_name
Option 2: pivot the data in SQL itself. Use this query instead:
select ID, date, avail as val, 'avail' as type, group from table_name
UNION ALL
select ID, date, total - avail as val, 'used' as type, group from table_name
(UNION ALL rather than UNION, so that legitimate duplicate rows are not collapsed.)
Thereafter you can proceed in Tableau as described above.
Note that Tableau expects long (tidy) data, and your data is in wide format. You should keep your SKU memory allocation in rows instead of columns; then, whenever you need a breakdown, adding the extra field type to the Marks card will do the job. If instead you keep it in columns, you always have two separate measures, whereas in reality it is one measure. I suggest you read some articles about the tidy data format; they explain clearly what belongs in rows, in columns, and in column names. Reshaping data into a correct, tidy format solves a lot of problems.
Remember: if there is any value in a variable (column) name, you have to pivot the data. Here used and avail are variable values, and therefore they cannot be column names (i.e. variable names).
Good luck!
I have a variable containing dates.
I want to calculate how many days have passed since, say, Jan 1, 1960.
What I've been trying is basically looking up every single calendar, calculating how many days there are in each year, recognizing a string like jan as month 1, and so on. However, this is tedious, and leap years (in which February has 29 days instead of 28) complicate it.
Is there any short and efficient way to do this?
You need to use the daily() or date() function:
display date("1/1/2012", "DMY") - date("1/1/1960", "DMY")
18993
More generally, if you have a string variable with dates:
clear
input str10 date1
"01/01/2012"
"01/01/2011"
"01/01/2014"
"19/12/2014"
end
generate date2 = date(date1, "DMY") - date("1/1/1960", "DMY")
list
+--------------------+
| date1 date2 |
|--------------------|
1. | 01/01/2012 18993 |
2. | 01/01/2011 18628 |
3. | 01/01/2014 19724 |
4. | 19/12/2014 20076 |
+--------------------+
If the variable containing the dates is numeric:
clear
input date1
18993
18628
19724
20076
end
format %tdDD/NN/CCYY date1
generate date2 = date1 - date("1/1/1960", "DMY")
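Worth noting: Stata stores %td daily dates as the number of days since 01jan1960, so date("1/1/1960", "DMY") is 0 and the subtraction above only makes the reference date explicit. A quick check:

display date("1/1/1960", "DMY")
0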
In Stata, how do I convert date in the form of:
09mar2005 00:00:00
to a month-year variable?
If it matters, the date format is %tc.
What I have in mind is to plot monthly averages (instead of the daily average I have) of variables across time.
To get where you are now, you or somebody else may have done something like this:
clear
set obs 1
gen earlier = "09mar2005 00:00:00"
gen double nowhave = clock(earlier, "DMY hms")
format nowhave %tc
list
+-----------------------------------------+
| earlier nowhave |
|-----------------------------------------|
1. | 09mar2005 00:00:00 09mar2005 00:00:00 |
+-----------------------------------------+
Note that a string date and a numeric date-time variable with appropriate date-time format %tc just look the same when you list them, but they are quite different beasts.
To get where you want to be -- with a monthly date -- you convert from clock (date-time) to daily to monthly:
gen mdate = mofd(dofc(nowhave))
format mdate %tm
list
+--------------------------------------------------+
| earlier nowhave mdate |
|--------------------------------------------------|
1. | 09mar2005 00:00:00 09mar2005 00:00:00 2005m3 |
+--------------------------------------------------+
All is documented at help datetime. The function names stand for month of daily date and daily date of clock.
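Since the stated goal was to plot monthly averages, here is a sketch of the next step (price is a hypothetical variable standing in for whatever you want to average):

gen mdate = mofd(dofc(nowhave))
format mdate %tm
collapse (mean) price, by(mdate)
twoway line price mdate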
I am working on a large text file of 34 GB. I have successfully parsed the file using GraphLab Create. There is a column in the dataset containing dates, displayed as Unix timestamps. How can I convert a Unix timestamp from the input file (converted to an SFrame) into a human-readable format?
It's a bit quirky, but you can just cast the column of unix timestamps to type "datetime.datetime" (provided you imported datetime).
This code:
import graphlab as gl
import datetime as dt

# integers are interpreted as Unix timestamps (seconds since the epoch)
sf = gl.SFrame({'a': [1, 2, 3]})
sf['a'] = sf['a'].astype(dt.datetime)
Produces this:
Columns:
a datetime
Rows: 3
Data:
+---------------------+
| a |
+---------------------+
| 1970-01-01 00:00:01 |
| 1970-01-01 00:00:02 |
| 1970-01-01 00:00:03 |
+---------------------+
[3 rows x 1 columns]
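One caveat: this cast treats the integers as seconds since the epoch. If your column actually holds epoch milliseconds, scale it down before casting (a sketch; the column name 'a' follows the example above):

import graphlab as gl
import datetime as dt

# hypothetical column of epoch milliseconds, not seconds
sf = gl.SFrame({'a': [1511931600000, 1511845200000]})
# divide down to seconds, truncate to int, then cast to datetime
sf['a'] = (sf['a'] / 1000).astype(int).astype(dt.datetime)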