I have a query that is working as expected, but the result is coming back as an array of models, rather than the values. The query transforms a date, and then groups by that date.
The ActiveRecord query statement:
"date_trunc('month', participants.completed_at) as Date").
which returns:
[ #<Participant id: nil>,
#<Participant id: nil>,
#<Participant id: nil>,
#<Participant id: nil> ]
Interestingly, the relevant data is there on the Participant... for example, if I look at the model it shows me the information I need:
pry(main)> results.first.attributes
=> {"count"=>70, "date"=>2014-04-01 00:00:00 UTC, "id"=>nil}
But it's sort of strange, since neither count nor date is an attribute of that model.
The raw SQL query:
date_trunc('month', participants.completed_at) as Date
FROM "participants"
WHERE "participants"."deleted_at" IS NULL
Which returns:
count | date
5742 | [NULL]
590 | 2016-05-01 00:00:00
798 | 2016-06-01 00:00:00
293 | 2017-01-01 00:00:00
I know that I can simply map the results to format it the way I want, but I would like to get the data directly from the ActiveRecord statement as an array or hash, rather than on a model object.
Objective: I would like to obtain the difference between current and previous sessions based on date slicers.
I want the output to be 4 columns as such:
Current Sessions (see measure below)
Previous Sessions (see measure below)
Difference (no measure calculated yet).
I currently have two measures
Current Sessions: SUM(Sales[Sessions])
Previous Sessions (thanks to @Alexis Olson):
VAR datediffs = DATEDIFF(
CALCULATE (MAX ( 'Date'[Date] ) ),
CALCULATE (MAX ('Previous Date'[Date])),
USERELATIONSHIP('Previous Date'[Date],'Date'[Date]),
I have three tables.
Previous Date (carbon copy of Date table)
My Previous Date table has a 1:1 inactive relationship with the Date table. The Date table has a one-to-many active relationship with my Sales table.
I have two slicers at all times, comparing the same number of days from different time periods (e.g. Jan 1st to Jan 7th 2019 vs Dec 25th to Dec 31st 2019)
If I put current sessions, previous sessions and a date column from any of the three tables
| date | current sessions | previous sessions | difference |
| Jan 8th | 10000 | 7000 | 3000 |
| Jan 9th | 20000 | 10000 | 10000 |
| Jan 10th | 15000 | 16000 | -1000 |
| Jan 11th | 14000 | 12000 | 2000 |
| Jan 12th | 12000 | 14000 | -2000 |
| Jan 13th | 11000 | 16000 | -5000 |
| Jan 14th | 15000 | 18000 | -3000 |
When I put the Sessions date on the table along with sessions and previous sessions, I get the right session amounts for each day, but the previous session amounts don't calculate correctly. I assume this is because they are being filtered by the date rows.
How can I override that table filter and force it to get the exact previous session amounts? Basically, I want both results appended to each other. The following shows my problem. The previous session value is the same on each day and is basically the amount of Dec 31st Jan 2018, because the max date is different for each row, but I want it to be based on the slicer.
The mistake came in the first part of the VAR Datediffs variable within the previous session formula:
This forces the calculation to always use the last day for each row, overriding the date value in each row.
We have a history table that keeps all instances of a record, and flags which is the current record and when it was changed. Here is a cut-down version of it:
CREATE TABLE *schema*.hist_temp
(
record_id VARCHAR
,record_created_date DATE
,current_flag BOOLEAN
,value int
);
INSERT INTO hist_temp VALUES ('Record A','2018-06-01',1,1000);
INSERT INTO hist_temp VALUES ('Record A','2018-04-12',0,900);
INSERT INTO hist_temp VALUES ('Record A','2018-03-13',0,800);
INSERT INTO hist_temp VALUES ('Record A','2018-01-13',0,700);
So what we have is Record A, which has been updated 3 times, the latest record is flagged with a 1 but we want to see all 4 instances of the history.
Then we have a dates table which holds, among other things, month end dates:
SELECT calendar_date
,trunc(month_start) as month_start
FROM common.calendar
WHERE calendar_year = '2018'
and calendar_date < trunc(sysdate)
ORDER BY 1 desc
Sample data:
calendar_date month_start
2018-06-03 2018-06-01
2018-06-02 2018-06-01
2018-06-01 2018-06-01
2018-05-31 2018-05-01
2018-05-30 2018-05-01
2018-05-29 2018-05-01
2018-05-28 2018-05-01
2018-05-27 2018-05-01
2018-05-26 2018-05-01
2018-05-25 2018-05-01
Required results:
I would like to be able to display the following - show the month start / end position for Record A for 2018
record_id, month_start, value
Record A, '2018-06-01', 1000
Record A, '2018-05-01', 900
Record A, '2018-04-01', 800
Record A, '2018-03-01', 700
Record A, '2018-02-01', 700
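The required results above amount to an "as of" lookup: for each month start, take the value of the latest history row created on or before that date. A minimal Python sketch of that logic, using the sample rows from the question (the helper name `value_as_of` is hypothetical, and this only illustrates the rule, not a finished SQL query):

```python
from datetime import date

# History rows for Record A, copied from the INSERT statements above:
# (record_created_date, value)
history = [
    (date(2018, 6, 1), 1000),
    (date(2018, 4, 12), 900),
    (date(2018, 3, 13), 800),
    (date(2018, 1, 13), 700),
]

# Month starts from June down to February 2018, as in the required results.
month_starts = [date(2018, month, 1) for month in range(6, 1, -1)]


def value_as_of(day, rows):
    """Return the value of the latest row created on or before `day`, or None."""
    candidates = [(created, value) for created, value in rows if created <= day]
    return max(candidates)[1] if candidates else None


positions = [
    ("Record A", start.isoformat(), value_as_of(start, history))
    for start in month_starts
]

for row in positions:
    print(row)
```

In SQL the same rule would typically be expressed as a join between the calendar and the history table on `record_created_date <= month_start`, keeping only the most recent matching row per month.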
I am trying to write this query. I have something, but I know it is wrong because the value is summed up incorrectly. Can someone please help me work out how to get the correct values?
date_trunc('month', record_created_date)::date AS month_start,
FROM hist_temp
Record A 2018-06-01 1000
Record A 2018-04-01 900
Record A 2018-01-01 700
Record A 2018-03-01 800
I have been using the below query to create a table within Athena,
`converteddate` string,
`userid` string,
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
'serialization.format' = ',',
'field.delim' = ','
TBLPROPERTIES ('has_encrypted_data'='false',"skip.header.line.count"="1")
This returns:
converteddate | userid
2017-11-29T05:00:00 | 00001
2017-11-27T04:00:00 | 00002
2017-11-26T03:00:00 | 00003
2017-11-25T02:00:00 | 00004
2017-11-24T01:00:00 | 00005
I would like to return:
converteddate | userid
2017-11-29 05:00:00 | 00001
2017-11-27 04:00:00 | 00002
2017-11-26 03:00:00 | 00003
2017-11-25 02:00:00 | 00004
2017-11-24 01:00:00 | 00005
and have converteddate as a datetime and not a string.
It is not possible to convert the data during table creation, but you can convert it when querying.
You can use the date_parse(string, format) -> timestamp function. More details are mentioned here.
For your use case you can do something like the following:
select date_parse(converteddate, '%Y-%m-%dT%H:%i:%s') as converted_timestamp, userid
from test_table
Note: depending on the format of your string, you have to choose the proper specifier for the year (two or four digits), month (always two digits or not), day, hour (12- or 24-hour format), etc.
(My answer has one premise: you are using OpenCSVSerDe. It doesn't apply to LazySimpleSerDe, for instance.)
If you have the option of changing the format of your input CSV file, you should convert your timestamp to UNIX Epoch Time. That's the format that OpenCSVSerDe is expecting.
For instance, your sample CSV looks like this:
It should be:
Those integers are the number of milliseconds since Midnight January 1, 1970 for each one of your original dates.
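If you need to produce those integers yourself, the conversion could look like the Python sketch below. It assumes the original timestamps are in UTC, which the question does not state, and the helper name `to_epoch_ms` is only illustrative.

```python
from datetime import datetime, timezone


def to_epoch_ms(iso_string):
    """Convert 'YYYY-MM-DDTHH:MM:SS' (assumed to be UTC) to epoch milliseconds."""
    parsed = datetime.strptime(iso_string, "%Y-%m-%dT%H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


# Sample rows taken from the question's output.
rows = [
    ("2017-11-29T05:00:00", "00001"),
    ("2017-11-27T04:00:00", "00002"),
    ("2017-11-26T03:00:00", "00003"),
]

converted = [(to_epoch_ms(timestamp), userid) for timestamp, userid in rows]

for epoch_ms, userid in converted:
    print(f"{epoch_ms},{userid}")
```

If the source timestamps are in a local time zone instead, the `tzinfo` passed to `replace` would need to change accordingly.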
Then you can run a slightly modified version of your CREATE TABLE statement:
converteddate timestamp,
userid string
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test_table, this will be the result:
converteddate userid
------------------------- --------
2017-11-29 05:00:00.000 00001
2017-11-27 04:00:00.000 00002
2017-11-26 03:00:00.000 00003
2017-11-25 02:00:00.000 00004
2017-11-24 01:00:00.000 00005
As you can see, type TIMESTAMP on Athena includes milliseconds.
I wrote a more comprehensive explanation on using types TIMESTAMP and DATE with OpenCSVSerDe. You can read it here.
We have a system written in Django to track patients recruited to clinical trials.
Spreadsheets are used to record the number of patients recruited each month throughout a financial year, so the sheet only contains 12 months of data even though a study may run for years.
There is a table in a Django database into which the spreadsheets are imported each month. The data includes the month/year, a count of patients, and some other fields. Each import will include all the previous months' data; we need this to make sure no data has been changed on the import sheet since the last import.
For example, the import table containing two imports (the first up to January and the second up to February) would look like this:
id | study_id | data_date | patient_count | [other fields] -->
100 5456 2016-04-01 10 ...
101 5456 2016-05-01 8 ...
102 5456 2016-06-01 5 ...
... all months in between ...
109 5456 2016-01-01 12 ...
110 5456 2016-02-01 NULL ...
111 5456 2016-03-01 NULL ...
112 5456 2016-04-01 10 ...
113 5456 2016-05-01 8 ...
114 5456 2016-06-01 5 ...
... all months in between ...
121 5456 2016-01-01 12 ...
122 5456 2016-02-01 6 ...
123 5456 2016-03-01 NULL ...
The other fields include a foreign key to another table containing the actual study identification number (iras_number), so I have to join to that to select the rows for a particular study.
I want the most recent values of data_date and patient_count for a study, which may span more than one financial year, so I tried this query (iras_number is passed to the function performing this query):
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(import_study__iras_number=iras_number) \
However, this produces a SQL query which includes patient_count in the GROUP BY, resulting in duplicate rows:
data_date | patient_count | max_id
2016-04-01 10 100
2016-04-01 10 112
2016-05-01 8 101
2016-05-01 8 113
2016-01-01 12 109
2016-01-01 12 121
2016-02-01 NULL 110
2016-02-01 6 122
How do I select the most recent data_date and patient_count from the table using the ORM?
If I were writing the SQL I would do an inner select of the max(id) grouped by data_date and then use that to join, or use an IN query, to select the fields I require from the table; such as:
SELECT data_date, patient_count
FROM importstudydata
WHERE id IN (
SELECT MAX(id) AS "max_id"
FROM importstudydata INNER JOIN importstudy
ON importstudydata.import_study_id = importstudy.id
WHERE importstudy.iras_number = 5456
GROUP BY importstudydata.data_date
)
ORDER BY data_date ASC
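To check that the inner-select approach gives one row per data_date, here is a small self-contained sketch using Python's sqlite3 module. The two-table schema and the handful of rows are assumptions made up for the illustration, loosely following the sample table above; the real models will have more fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed minimal schema: one importstudy row per import of a study.
    CREATE TABLE importstudy (id INTEGER PRIMARY KEY, iras_number INTEGER);
    CREATE TABLE importstudydata (
        id INTEGER PRIMARY KEY,
        import_study_id INTEGER REFERENCES importstudy(id),
        data_date TEXT,
        patient_count INTEGER
    );
    INSERT INTO importstudy VALUES (1, 5456), (2, 5456);
    INSERT INTO importstudydata VALUES
        (109, 1, '2016-01-01', 12),
        (110, 1, '2016-02-01', NULL),
        (121, 2, '2016-01-01', 12),
        (122, 2, '2016-02-01', 6);
    """
)

# Keep only the row with the highest id for each data_date of the study.
query = """
    SELECT data_date, patient_count
    FROM importstudydata
    WHERE id IN (
        SELECT MAX(d.id)
        FROM importstudydata AS d
        INNER JOIN importstudy AS s ON d.import_study_id = s.id
        WHERE s.iras_number = ?
        GROUP BY d.data_date
    )
    ORDER BY data_date ASC
"""
rows = conn.execute(query, (5456,)).fetchall()
print(rows)
conn.close()
```

The February row from the later import (id 122) replaces the NULL from the earlier one, which is the behaviour the question is after.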
I've tried to create an inner select to replicate the SQL query; however, the inner select returns more than one field (column) and causes the query to fail:
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=ImportStudyData.objects.values('data_date') \
.filter(import_study__iras_number=iras_number) \
I can't get the inner select to return only the max(id) grouped by data_date while still being performed in a single SQL query.
For now I'm splitting the query into a number of steps to get the result I want.
First I query for the most recent id of all rows related to the study
id_qry = ImportStudyData.objects.values('data_date')\
To get a list of just the numbers, stripping out the date, I use list comprehension:
id_list = [x['max_id'] for x in id_qry]
This list is then used as a filter for the final query to get the number of patients
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
It hits the database twice, and is computationally more expensive, but for now it works and I need to move on.
I'll come back to this problem at a later date.
Use: distinct=True
totals = ImportStudyData.objects.values('data_date', 'patient_count').filter(import_study__iras_number=iras_number).annotate(max_id=Max('id')).order_by('data_date').distinct()
I have a DataFrame as follows, where Id is a string and Date is a datetime:
Id Date
1 3-1-2012
1 4-8-2013
2 1-17-2013
2 5-4-2013
2 10-30-2012
3 1-3-2013
I'd like to consolidate the table to just show one row for each Id which has the most recent date.
Any thoughts on how to do this?
You can groupby the Id field:
In [11]: df
Id Date
0 1 2012-03-01 00:00:00
1 1 2013-04-08 00:00:00
2 2 2013-01-17 00:00:00
3 2 2013-05-04 00:00:00
4 2 2012-10-30 00:00:00
5 3 2013-01-03 00:00:00
In [12]: g = df.groupby('Id')
If you are not certain about the ordering, you could do something along these lines:
In [13]: g.agg(lambda x: x.iloc[x.Date.argmax()])
1 2013-04-08 00:00:00
2 2013-05-04 00:00:00
3 2013-01-03 00:00:00
which for each group grabs the row with largest (latest) date (the argmax part).
If you knew they were in order you could take the last (or first) entry:
In [14]: g.last()
1 2013-04-08 00:00:00
2 2012-10-30 00:00:00
3 2013-01-03 00:00:00
(Note: they're not in order, so this doesn't work in this case!)
In Hayden's answer, I think that using x.loc in place of x.iloc is better, as the index of the df dataframe could be sparse (and in this case the iloc will not work).
(I don't have enough points on Stack Overflow to post this as a comment on that answer.)
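A label-based variant of the same idea can be sketched as follows. It uses `idxmax`, which returns index labels, together with `loc`, so it should not depend on the index being contiguous. The non-contiguous index below is made up to illustrate that case, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Id": ["1", "1", "2", "2", "2", "3"],
        "Date": pd.to_datetime(
            ["3-1-2012", "4-8-2013", "1-17-2013", "5-4-2013", "10-30-2012", "1-3-2013"],
            format="%m-%d-%Y",
        ),
    },
    index=[10, 20, 30, 40, 50, 60],  # deliberately non-contiguous labels
)

# For each Id, find the index label of the row with the latest Date ...
latest_labels = df.groupby("Id")["Date"].idxmax()

# ... then select those rows by label.
latest = df.loc[latest_labels].reset_index(drop=True)
print(latest)
```

Because the selection goes through labels rather than positions, the rows do not need to be sorted beforehand.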