Django "Join" and "count" - Easy in psql, not so trivial in Django

User id 8a0615d2-b123-4714-b76e-a9607a518979 has many entries in the mylog table, each with an ip_id field. I'd like to see a weighted list of these ip_id values.
In SQL I use:
select distinct(ip_id), count(ip_id) from mylog
where user_id = '8a0615d2-b123-4714-b76e-a9607a518979'
group by ip_id
This gets me:
ip_id count
--------------------------------------+--------
84285515-0855-41f4-91fb-bcae6bf840a2 | 187
fc212052-71e3-4489-86ff-eb71b73c54d9 | 102
687ab635-1ec9-4c0a-acf1-3a20d0550b7f | 84
26d76a90-df12-4fb7-8f9e-a5f9af933706 | 18
389a4ae4-1822-40d2-a4cb-ab4880df6444 | 10
b5438f47-0f3a-428b-acc4-1eb9eae13c9e | 3
Now I am trying to get the same result in Django. It's surprisingly elusive.
Getting the user:
u = User.objects.get(id='8a0615d2-b123-4714-b76e-a9607a518979') #this works fine.
I tried:
logs = MyLog.objects.filter(Q(user=u) & Q(ip__isnull=False)).values('ip').annotate(total=Count('ip', distinct=True))
I am getting 6 rows in logs which is fine, but the count is always 6, not the weight of the unique ip as it is in the SQL response above.
What am I doing wrong?

You seem to be mistaken about what the keyword argument distinct does in the Count function. With distinct=True, Count counts only the distinct values within each group, which is not what you want here because each group already holds a single ip value. The distinct(ip_id) in your SQL query is also redundant, since the group by clause already yields one row per ip_id.
Also make sure the call is .values('ip') and not .value('ip'), which would be a typo.
So your ORM query should be:
logs = MyLog.objects.filter(Q(user=u) & Q(ip__isnull=False)).values('ip').annotate(total=Count('ip'))
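To see the difference between the two counts without a Django project, here is a minimal sketch using Python's built-in sqlite3 module. The table contents, the user id "u1" and the ip ids are made up for illustration; only the shape of the query follows the question.

```python
import sqlite3

# In-memory table standing in for mylog (invented sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mylog (user_id TEXT, ip_id TEXT)")
rows = [("u1", "ip-a")] * 3 + [("u1", "ip-b")] * 2 + [("u2", "ip-a")]
conn.executemany("INSERT INTO mylog VALUES (?, ?)", rows)


def counts(count_expr):
    """Run the grouped query for user u1 with the given count expression."""
    sql = (
        f"SELECT ip_id, {count_expr} FROM mylog "
        "WHERE user_id = ? AND ip_id IS NOT NULL "
        "GROUP BY ip_id ORDER BY ip_id"
    )
    return conn.execute(sql, ("u1",)).fetchall()


# Roughly what Count('ip') asks for: rows per group.
plain = counts("COUNT(ip_id)")
# Roughly what Count('ip', distinct=True) asks for: distinct values per group.
distinct = counts("COUNT(DISTINCT ip_id)")

print(plain)
print(distinct)
```

The plain count gives the per-ip weight, while the distinct count can only ever see one value per group.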

Related

Amazon DynamoDB multiple scan conditions with multiple BeginsWith

I have a table in Amazon DynamoDB with a partition key and a range key.
Table structure
Subscriber ID (partition key) | Item Id (Range Key) | Date |...
123 | P_345 | some date 1 | ...
123 | I_456 | some date 2 |
123 | A_678 | some date 3 | ...
Now I want to retrieve data from the table using QueryAsync in the C# library, with multiple conditions:
HashKey = 123
Condition 1: Date is between 'some date 1' and 'some date 2'
Condition 2: Range key begins with I_ or P_
Is there any way I can achieve this using the C# DynamoDB APIs?
Please help.
You'll need to do the following (I'm not a C# expert, but you can use the following instructions to find the right C# syntax to do it):
Because you are looking for a specific hashkey, this will be a Query request, not a Scan.
You have a begins_with() condition on the range key. You specify that using the KeyConditionExpression parameter to the Query. The KeyConditionExpression will ask for HashKey=123 AND begins_with(RangeKey,"P_").
However, KeyConditionExpression does not allow an "OR" (rangekey begins with either "P_" or "I_"). You'll just need to run two separate queries - one with "I_" and one with "P_" (you can even do the two queries in parallel, if you wish).
The date is not one of the key columns, so you will need to filter it with a FilterExpression parameter to the query. Note that filtering happens only in the last step, after DynamoDB has already read all the items matching the KeyConditionExpression above. You pay for every item read, including the ones the filter removes, so this can increase your costs if filtering discards a lot of items.
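The steps above can be sketched as plain request parameters. This is Python rather than C#, and it only builds the parameter dictionaries without calling AWS. The table name "Subscriptions", the string types, and the attribute names (copied from the table header in the question) are assumptions.

```python
def build_query_params(table, subscriber_id, prefix, date_from, date_to):
    """Build the parameters for one Query request (no network call is made)."""
    return {
        "TableName": table,
        # Key condition: exact hash key plus a single begins_with on the range key.
        "KeyConditionExpression": "#pk = :pk AND begins_with(#sk, :prefix)",
        # Non-key attribute, so it is filtered after the items are read.
        "FilterExpression": "#d BETWEEN :d1 AND :d2",
        # Placeholders avoid problems with spaces and reserved words in names.
        "ExpressionAttributeNames": {
            "#pk": "Subscriber ID",  # assumed attribute name
            "#sk": "Item Id",        # assumed attribute name
            "#d": "Date",            # assumed attribute name
        },
        "ExpressionAttributeValues": {
            ":pk": {"S": subscriber_id},  # assumed to be stored as a string
            ":prefix": {"S": prefix},
            ":d1": {"S": date_from},
            ":d2": {"S": date_to},
        },
    }


# One request per prefix, because the key condition cannot express an OR.
query_requests = [
    build_query_params("Subscriptions", "123", prefix, "2020-01-01", "2020-01-31")
    for prefix in ("I_", "P_")
]

for params in query_requests:
    print(params["ExpressionAttributeValues"][":prefix"])
```

The results of the two requests would then be concatenated on the client. The same expressions should carry over to the C# SDK's query request, though the exact property names there are not shown in this thread.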

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {TableName: UserLogins,ConsistentRead: true,ReturnConsumedCapacity: TOTAL,FilterExpression: #1 = :1,KeyConditionExpression: #0 = :0,ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction: if maxResultSize=10 means that DynamoDB reads only the first 10 items (out of 10,001) and only then applies the filter active=true, the query should often return 0 results. It seems very unlikely that the record with active=true happened to be among the first 10 records that DynamoDB read.
This is happening to hundreds of customers that are running similar queries. It works great, when according to the documentation it shouldn't be working.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device, it would overwrite the existing item. Or put another way, I think you are saying your users have 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filterexpression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what results you get with that.
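One possible explanation, offered as an assumption rather than something confirmed in this thread, is that maxResultSize is a client-side cap on the total number of results collected across pages, while the Limit described in the documentation applies to each individual request. The simulation below uses invented data and hypothetical helper names, with no AWS calls, to show how those two behaviors would differ.

```python
def query_page(items, start, limit, predicate):
    """Simulate one request: read up to `limit` items, then apply the filter."""
    page = items[start:start + limit]
    next_start = start + len(page)
    # None plays the role of "no more pages to read".
    last_key = next_start if next_start < len(items) else None
    return [item for item in page if predicate(item)], last_key


def query_all(items, page_limit, max_results, predicate):
    """Keep requesting pages until enough matches are found or data runs out."""
    results, start, calls = [], 0, 0
    while start is not None and len(results) < max_results:
        matched, start = query_page(items, start, page_limit, predicate)
        calls += 1
        results.extend(matched)
    return results[:max_results], calls


def is_active(item):
    return item["ActiveLogin"]


# 10,000 inactive logins and one active login somewhere in the middle.
logins = [{"DeviceId": i, "ActiveLogin": False} for i in range(10_000)]
logins.insert(7_500, {"DeviceId": "active", "ActiveLogin": True})

# A single request with a limit of 10: the filter sees only 10 inactive items.
single_page, _ = query_page(logins, 0, 10, is_active)
# Following every page with a cap of 10 results: the active login is found.
all_pages, calls = query_all(logins, 10, 10, is_active)

print(len(single_page), len(all_pages), calls)
```

If the client library pages through results this way, getting the single active row every time would be consistent with the documentation, at the cost of reading every item for that user.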

Getting table information for Redshift `stl_load_errors` errors

I am using the Redshift COPY command to load data into a Redshift table from S3. When something goes wrong, I typically get the error ERROR: Load into table 'example' failed. Check 'stl_load_errors' system table for details. I can always look up stl_load_errors manually to get details. Now I am trying to figure out how to do that automatically.
From the documentation it looks like the following query should give me all the details I need:
SELECT *
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id
AND info.schema = '<schema-name>'
AND info.table = '<table-name>'
However it always returns nothing. I also tried using stv_tbl_perm instead of svv_table_info, and still nothing.
After some troubleshooting, I see two things I don't understand:
I see multiple different IDs in stv_tbl_perm and svv_table_info for the same exact table. Why is that?
I see the tbl field on stl_load_errors referencing ids that do not exist in stv_tbl_perm or svv_table_info. Again, why?
It feels like I don't understand something about the structure of these tables, but it completely escapes me what.
This is because tbl and table_id have different types. The first one is an integer, the second one is an oid.
When you cast the oid to integer, the columns have the same values. You could check this with the query:
SELECT table_id::integer, table_id
FROM SVV_TABLE_INFO
I get results when I execute:
SELECT errors.tbl, info.table_id::integer, info.table_id, *
FROM stl_load_errors errors
INNER JOIN svv_table_info info
ON errors.tbl = info.table_id
Please note that inner join is ON errors.tbl = info.table_id
I finally got to the bottom of it, and it is surprisingly boring and probably not useful to many ...
I had an existing table. My code that was creating the table was wrapped in a transaction, and it was dropping the table inside the transaction. The code that was querying stl_load_errors was outside the transaction. So the table_id outside and inside the transaction were different, as it was a different table.
You could try looking by filename. Doesn't really answer the question about joining the various tables, but I use a query like so to group up files that are part of the same manifest file and let me compare it to the maxerror setting:
select min(starttime) over (partition by substring(filename, 1, 53)) as starttime,
substring(filename, 1, 53) as filename, btrim(err_reason) as err_reason, count(*)
from stl_load_errors where filename like '%/some_s3_path/%'
group by starttime, filename, err_reason order by starttime desc;
This worked for me without any casting:
schemaz=# select i.database, e.err_code from stl_load_errors e join svv_table_info i on e.tbl=i.table_id limit 5
schemaz-# ;
database | err_code
-----------+----------
schemaz | 1204
schemaz | 1204
schemaz | 1204
schemaz | 1204
schemaz | 1204

Amazon RedShift: Unique Column not being honored

I use the following query to create my table.
create table t1 (url varchar(250) unique);
Then I insert about 500 urls, twice. I am expecting that the second time I add the URLs, no new entries show up in my table, but instead my count value doubles for:
select count(*) from t1;
What I want is that when I try and add a url that is already in my table, it is skipped.
Have I declared something in my table declaration incorrectly?
I am using RedShift from AWS.
Sample
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
(1 row)
urlenrich=# insert into seed(url, source) select 'http://www.google.com', '1';
INSERT 0 1
urlenrich=# select * from seed;
url | wascrawled | source | date_crawled
-----------------------+------------+--------+--------------
http://www.google.com | 0 | 1 |
http://www.google.com | 0 | 1 |
(2 rows)
Output of \d seed
urlenrich=# \d seed
Table "public.seed"
Column | Type | Modifiers
--------------+-----------------------------+-----------
url | character varying(250) |
wascrawled | integer | default 0
source | integer | not null
date_crawled | timestamp without time zone |
Indexes:
"seed_url_key" UNIQUE, btree (url)
Figured out the problem
Amazon RedShift does not enforce constraints...
As explained here
http://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
They said they may get around to changing it at some point.
NEW 11/21/2013
RDS has added support for PostgreSQL. If you need unique constraints and the like, a PostgreSQL RDS instance is now the best way to go.
In Redshift, constraints are recommended but are not enforced. They only help the query planner choose better ways to perform the query.
Usually, columnar databases do not manage indexes or constraints.
Although Amazon Redshift doesn't support unique constraints, there are some ways to delete duplicated records that can be helpful.
See the following link for the details.
copy data from Amazon s3 to Red Shift and avoid duplicate rows
Primary and unique key enforcement in distributed systems, never mind column store systems, is difficult. Both Redshift (ParAccel) and Vertica face the same problems.
The challenge with a column store is that the question that is being asked is "does this table row have a relevant entry in another table row" but column stores are not designed for row operations.
In HP Vertica there is an explicit command to report on constraint violations.
In Redshift it appears that you have to roll your own.
SELECT COUNT(*) AS TotalRecords, COUNT(DISTINCT {your PK_Column}) AS UniqueRecords
FROM {Your table}
HAVING COUNT(*)> COUNT(DISTINCT {your PK_Column})
Obviously, if you have a multi-column PK you have to do something more heavyweight.
SELECT COUNT(*)
FROM (
SELECT {PkColumns}
FROM {Your Table}
GROUP BY {PKColumns}
HAVING COUNT(*)>1
) AS DT
If the above returns a value greater than zero then you have a primary key violation.
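The same duplicate check can be tried locally. This is a small sketch using Python's sqlite3 module in place of Redshift, with an invented t1 table and made-up URLs, so it only illustrates the GROUP BY ... HAVING pattern and not Redshift behavior.

```python
import sqlite3

# Stand-in table with no enforced uniqueness (invented sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (url TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?)",
    [("http://a.example",), ("http://b.example",), ("http://a.example",)],
)

# Which key values occur more than once, and how often.
duplicates = conn.execute(
    "SELECT url, COUNT(*) FROM t1 GROUP BY url HAVING COUNT(*) > 1"
).fetchall()

# How many key values are violated (greater than zero means duplicates exist).
violations = conn.execute(
    "SELECT COUNT(*) FROM ("
    "SELECT url FROM t1 GROUP BY url HAVING COUNT(*) > 1"
    ") AS dt"
).fetchone()[0]

print(duplicates, violations)
```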
For anyone who:
Needs to use redshift
Wants unique inserts in a single query
Doesn't care too much about query performance
Only really cares about inserting a single unique value at a time
Here's an easy way to get it done
INSERT INTO MY_TABLE (MY_COLUMNS)
SELECT MY_UNIQUE_VALUE WHERE MY_UNIQUE_VALUE NOT IN (
SELECT MY_UNIQUE_VALUE FROM MY_TABLE
WHERE MY_UNIQUE_COLUMN = MY_UNIQUE_VALUE
)
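Here is a runnable sketch of the same insert-only-if-missing idea, using Python's sqlite3 module instead of Redshift. It uses NOT EXISTS rather than the NOT IN form above, and the seed table, the insert_unique helper and the URLs are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed (url TEXT)")  # no unique constraint on purpose


def insert_unique(url):
    """Insert url only when no row with the same url is already present."""
    conn.execute(
        "INSERT INTO seed (url) "
        "SELECT ? WHERE NOT EXISTS (SELECT 1 FROM seed WHERE url = ?)",
        (url, url),
    )


# The second google.com insert is skipped by the WHERE clause.
for candidate in ["http://www.google.com", "http://www.google.com", "http://example.com"]:
    insert_unique(candidate)

urls = [row[0] for row in conn.execute("SELECT url FROM seed ORDER BY url")]
print(urls)
```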

SQL Where between word value ranges (eg., "low" to "high")

I have a field in my database that has 5 possible values: fair, good, very good, ideal, signature ideal
I have a ColdFusion form that has 2 drop-downs, each with all the values. What I am looking to do is let the user select a range. For example dropdown1 = Fair, dropdown2 = Very Good. So this would somehow generate the SQL WHERE statement:
grade IN ('fair', 'good', 'very good')
Can you think of a smart way to program this, given that the values have to be this way? I think maybe I could put them in an array and then loop through it or something. I'm a little stumped on this, so any help would be appreciated.
As others mentioned, redesigning is ultimately the better course of action, both in terms of efficiency and data integrity. However, if you absolutely cannot change the structure, a possible workaround is to create a lookup table of the allowable grade descriptions, along with a numeric rating value for each one:
GradeID | GradeText | Rating
1 | Fair | 0
2 | Good | 1
3 | Very Good | 2
4 | Ideal | 3
5 | Signature Ideal | 4
Then populate your select list from a query on the lookup table. Be sure to ORDER BY Rating ASC and use the rating number as the list value. Then on your action page, use the selected values to filter by range. (Obviously validate the selected range is valid as well)
SELECT t.ColumnName1, t.ColumnName2
FROM SomeTable t INNER JOIN YourLookupTable lt ON lt.GradeText = t.grade
WHERE lt.Rating BETWEEN <cfqueryparam value="#form.dropdown1#" cfsqltype="cf_sql_integer">
AND <cfqueryparam value="#form.dropdown2#" cfsqltype="cf_sql_integer">
Again, I would recommend restructuring instead. However, the above should work if that is really not an option.
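If the lookup table is not available, the array idea from the question can also work. This is a minimal sketch in Python rather than ColdFusion. The names GRADES, grades_between and in_clause are hypothetical, and the ordering is assumed to match the Rating column above.

```python
# Grades in ascending order (assumed to match the Rating column).
GRADES = ["fair", "good", "very good", "ideal", "signature ideal"]
RATING = {name: position for position, name in enumerate(GRADES)}


def grades_between(low, high):
    """Return every grade from low to high inclusive, in rating order."""
    # A KeyError here means the form sent a value that is not a known grade.
    first, last = sorted((RATING[low.lower()], RATING[high.lower()]))
    return GRADES[first:last + 1]


def in_clause(values):
    """Build a parameterized IN clause plus its bound values."""
    placeholders = ", ".join("?" for _ in values)
    return f"grade IN ({placeholders})", list(values)


selected = grades_between("Fair", "Very Good")
clause, params = in_clause(selected)
print(clause, params)
```

Sorting the two ratings means the range still works if the user picks the higher grade in the first drop-down.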