Question:
"For each user, show me the groups they belong (1) to and the tables they can query as a result of being part of that group (2)."
Details: This is on redshift.
I'm curious if anyone has done anything like this.
The first part isn't hard, I did that with this:
SELECT usename, groname
FROM pg_user,
pg_group
WHERE pg_user.usesysid = ANY (pg_group.grolist)
AND pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group)
The second part is much harder I think. Showing table permissions isn't, but that's not really what's being asked. That can be accomplished with this:
SELECT has_table_privilege(user, table, 'select'::text) as select...
But that does not show which group gives the user that permission.
Any advice is great!
Related
I am trying to implement row-level security on one table based on a separate users table. I've seen this talked about in places like this, but haven't been able to get things working for my case.
Users table:
Transactions table:
The table I'd like to secure is called Transactions. One of the fields in each row is CompanyID. The Users table contains three columns: AccountID, UserEmail, and CompanyID. What I'd like is for only users assigned to given CompanyIDs to be able to view rows in the Transactions table with those CompanyIDs.
Since more than one User may view a given row, I established a one-to-many relationship from Transactions to Users on the CompanyID field.
I created a DAX expression that filters on the Users table with the following:
[UserEmail] = USERPRINCIPALNAME()
When I select "View As -> Other User" in Power BI Desktop and enter a random email, though, I can still see the entire report. Any idea what I'm leaving out?
EDIT:
I left out an important stipulation: Any user associated with a CompanyID of 1 can view all the records of the Transaction table. I've tried approaches similar to this
[UserEmail] = USERPRINCIPALNAME() ||
COUNTROWS(FILTER('Users', [UserEmail] = USERPRINCIPALNAME() && [CompanyId] = 1)) = 1
but they don't work. Even users with CompanyId of 1 are prohibited from viewing the table.
From the docs:
By default, row-level security filtering uses single-directional
filters, whether the relationships are set to single direction or
bi-directional. You can manually enable bi-directional cross-filtering
with row-level security by selecting the relationship and checking the
Apply security filter in both directions checkbox. Select this option
when you've also implemented dynamic row-level security at the server
level, where row-level security is based on username or login ID.
https://learn.microsoft.com/en-us/power-bi/admin/service-admin-rls
I am getting error when trying to use listagg function.
Query
select
a.user_name,
listagg(a.group_name::text)
within group (order by a.group_name) as group_name
from (
SELECT
usename as user_name,
groname as group_name
FROM
pg_user
join
pg_group
on
pg_user.usesysid = ANY(pg_group.grolist) AND
pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group)
)a
group by user_name
Error
[Code: 500310, SQL State: XX000] Amazon Invalid operation: One or more of the used functions must be applied on at least one user created tables. Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc;
None of the value is null.
Just like there are some functions that can only be run on the leader node there are some that can only be run on compute nodes - listagg() is one of these. If you need to run listagg() on leader data there are a few approaches you can use: (sorry I'm not on a cluster now so cannot test these directly - I saw your question was aging and thought I'd get you started. Grain of salt as I also cannot directly observe your issue but I think I've know what is going on.)
You can use a cursor to save the data from the leader node and use
this as the source for listagg(). A stored procedure can
streamline this. There are examples of this on stackoverflow.
You can make a temp table out of the leader node data and use this
in listagg() but I expect you will need to exit(unload) and
reenter(copy) the cluster to do this.
There just isn't a direct path from leader-node-only results to the compute nodes without some sort of this kind of push-up. Consequence of the large networked cluster architecture of Redshift.
UPDATE
I got some cluster time and there are several unexpected issues with this one. grolist is an array type that isn't generally support cluster wide and the need to user pg_group as source are key ones. So this is going to require #1 AND #2 from above.
The process goes like this:
Define cursor to hold the result of the pg_user / pg_group join select statement
Move cursor results to temp table
Use temp table as source to outer (list_agg()) select
A stored procedure can be written to do #1 and #2 which streamlines things. So you end up with the following SQL:
CREATE OR REPLACE procedure make_user_group()
AS
$$
DECLARE
row record;
BEGIN
create temp table user_group (user_name varchar(256),group_name varchar(256));
for row in SELECT
usename::text as user_name,
groname::text as group_name
FROM
pg_user
join
pg_group
on
pg_user.usesysid = ANY(pg_group.grolist) AND
pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group)
LOOP
INSERT INTO user_group(user_name,group_name) VALUES (row.user_name,row.group_name);
END LOOP;
END;
$$ LANGUAGE plpgsql;
call make_user_group();
select
user_name,
listagg(group_name::text, ', ')
within group (order by group_name) as group_name
from user_group
group by user_name;
Clearly the stored procedure only needs to be created once but called every time the temp table needs to be created.
Getting the list of users belonging to a group in Redshift seems to be a fairly common task but I don't know how to interpret BLOB in grolist field.
I am literally getting "BLOB" in grolist field from TeamSQL. Not so sure this is specific to TeamSQL but I kind of remember thatI got a list of IDs there instead previously in other tool
This worked for me:
select usename
from pg_user , pg_group
where pg_user.usesysid = ANY(pg_group.grolist) and
pg_group.groname='<YOUR_GROUP_NAME>';
SELECT usename, groname
FROM pg_user, pg_group
WHERE pg_user.usesysid = ANY(pg_group.grolist)
AND pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group);
This will provide the usernames along with the respective groups.
this worked better for me:
SELECT
pu.usename,
pg.groname
FROM
pg_user pu
left join pg_group pg
on pu.usesysid = ANY(pg.grolist)
order by pu.usename
From
https://cloud.google.com/bigquery/docs/partitioned-tables:
you can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD
This enables me to do:
SELECT count(*) FROM `xxx.xxx.xxx_*`
and query across all the shards. Is there a special notation that queries only the latest shard? For example say I had:
xxx_20180726
xxx_20180801
could I do something along the lines of
SELECT count(*) FROM `xxx.xxx.xxx_{{ latest }}`
to query xxx_20180801?
SINGLE QUERY INSPIRED BY Mikhail Berlyant:
SELECT count(*) as c FROM `XXX.PREFIX_*` WHERE _TABLE_SUFFIX IN ( SELECT
SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 2)
FROM
`XXX.__TABLES_SUMMARY__`
WHERE
table_id LIKE 'PREFIX_%')
If you do care about cost (meaning how many tables will be scaned by your query) - the only way to do so is to do in two steps like below
First query
#standardSQL
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX') + 1)
FROM `xxx.xxx.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX%'
Second Query
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '<result of first query>'
so, if result of first query is 20180801 so, second query will obviously look like below
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '20180801'
If you don't care about cost but rather need just result - you can easily combine above two queries into one - but - again - remember - even though result will be out of last table - cost will be as you query all table that match xxx.xxx.PREFIX_*
Forgot to mention (even though it should be obvious): of course when you have only COUNT(1) in your SELECT - the cost will be 0(zero) for both options - but in reality - most likely you will have something more valuable than just count(1)
I know this is a kind of an old thread but I was surprised why no one offers an answer using Variables.
"Héctor Neri" already mentioned this in the comments but I thought might be better to have an actual answer with a sample code posted.
#standardSQL
DECLARE SHARD_DATE STRING;
SET SHARD_DATE=(
SELECT MAX(REPLACE(table_name,'{TABLE}_',''))
FROM `{PRJ}.{DATASET}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{TABLE}_20%'
);
SELECT * FROM `{PRJ}.{DATASET}.{TABLE}_*`
WHERE _TABLE_SUFFIX = SHARD_DATE
Make sure to replace {PRJ}, {DATASET}, and {TABLE} values with your table location.
If you run this on BigQuery Web UI, you will see this message:
WARNING: Could not compute bytes processed estimate for script.
But you can see that variable properly reduce the table scan to the latest partition and does not cause any extra cost after running the script.
class Log:
project = ForeignKey(Project)
msg = CharField(...)
date = DateField(...)
I want to select the four most recent Log entries where each Log entry must have a unique project foreign key. I've tries the solutions on google search but none of them works and the django documentation isn't that very good for lookup..
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But this either return nothing or Log entries of the same project..
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id').distinct()[:4]
entries = Log.objects.filter(id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(log.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
SELECT project_id, max(log.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
SELECT logs.field1, logs.field2, logs.field3, logs.date
rank() over ( partition by project_id
order by "date" DESC ) as dateorder
FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY logs.date DESC LIMIT 1;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
latest_ids_per_project = Log.objects.values_list(
'project').annotate(latest=Max('date')).order_by(
'-latest').values_list('project')
log_objects = Log.objects.filter(
id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg"
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC