Use a PCollection as input to another PCollection - MapReduce

Using the Python SDK on Google Dataflow, I would like to run a query like this:
query_a_and_b = "SELECT a, b FROM TableA"
This query returns a list of tuples I'd like to use to perform more queries:
query_param = "SELECT * FROM TableA WHERE a = {} AND b = {}"  # formatted with .format(a, b) per tuple
(Here I use TableA, but the same will be done with TableB, C, and D, which are inner joined with TableA...)
So what I am trying to do:
coll = (p
        | 'read a_b_tuples' >> beam.io.Read(beam.io.BigQuerySource(query=query_a_and_b, use_standard_sql=True))
        | 'Build SQL' >> beam.Map(lambda x: query_param.format(x['a'], x['b']))
        | 'Query pardo' >> beam.ParDo(lambda q: [beam.io.Read(beam.io.BigQuerySource(query=q, use_standard_sql=True))])
        | 'Save' >> beam.io.WriteToText('results.csv')
        )
I am not sure this is the best approach, and it does not work. What is the preferred way to achieve this in Dataflow?
Ultimately, each of these queries will return a small number of rows (fewer than 5k) that I'd like to load into a pandas dataframe for filtering/processing, then combine TableA, B, C, and D for every tuple (a, b) and write each tuple's dataframe to a CSV file.
I might be map-reducing the problem incorrectly, in the sense that I could use the Beam functions to group by a and b and then do my processing...?

Beam doesn't directly support this for BigQuery yet. Some other transforms support similar use cases, e.g. JdbcIO.readAll() can query a database for a collection of query parameters and TextIO.readAll() can read a collection of filenames - but BigQueryIO doesn't do this yet, in either the Java or the Python SDK.
In your "Query pardo", you can instead explicitly talk to the BigQuery REST API - it should be fine because your queries return a small number of results.
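For illustration, here is a minimal sketch of that ParDo using the google-cloud-bigquery client library. The DoFn and its name are my own illustration, not a built-in BigQueryIO feature, and it assumes the google-cloud-bigquery package is installed on the workers:

import apache_beam as beam
from google.cloud import bigquery

class RunQueryFn(beam.DoFn):
    """Runs one BigQuery query per input element and emits the result rows as dicts."""
    def start_bundle(self):
        # Create the client lazily so the DoFn stays picklable.
        self.client = bigquery.Client()

    def process(self, query):
        for row in self.client.query(query).result():
            yield dict(row)

coll = (p
        | 'read a_b_tuples' >> beam.io.Read(beam.io.BigQuerySource(query=query_a_and_b, use_standard_sql=True))
        | 'Build SQL' >> beam.Map(lambda x: query_param.format(x['a'], x['b']))
        | 'Run queries' >> beam.ParDo(RunQueryFn())
        | 'Save' >> beam.io.WriteToText('results.csv'))

Since each emitted element is a plain dict, you can replace the final write with whatever pandas-based processing you need before writing the CSVs.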

Related

Optimizing Database Population with Django

I have a 10 GB CSV file (34 million rows) of data (without a column description/header) that needs to be loaded into a Postgres database.
The row data has columns that need to go into different models.
I have the following DB schema:
Currently what I do is:
Loop through the rows:
  Create instance B with specific columns from the row and append it to array_b
  Create instance C with specific columns from the row and append it to array_c
  Create instance A with specific columns from the row and relations to B and C, and append it to array_a
Bulk create in order: B, C, and A
This works perfectly fine; however, it takes 4 hours to populate the DB.
I wanted to optimize the population and came across the psql COPY FROM command.
So I thought I would be able to do something like:
1. Create instance A with specific columns from the file:
   for the foreign key to C, create instance C with specific columns from the row;
   for the foreign key to B, create instance B with specific columns from the row.
2. Go to 1.
After some brief research on how to do that, I found out that COPY does not allow table manipulations while copying data (such as looking up another table to fetch the proper foreign keys to insert).
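The closest workaround I have found so far is to COPY the raw file into a staging table first and then resolve the foreign keys afterwards with set-based INSERT ... SELECT statements. A rough psycopg2 sketch of that idea (the staging table, target table, and column names below are made up), though I am not sure it is the right direction:

import psycopg2

conn = psycopg2.connect(dbname='mydb')  # illustrative connection settings
with conn, conn.cursor() as cur:
    # 1. Bulk-load the raw CSV into an unconstrained staging table.
    cur.execute("CREATE UNLOGGED TABLE staging (col_a text, col_b text, col_c text)")
    with open('data.csv') as f:
        cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)

    # 2. Resolve foreign keys with set-based inserts instead of per-row lookups.
    cur.execute("INSERT INTO app_b (some_data) SELECT DISTINCT col_b FROM staging")
    cur.execute("INSERT INTO app_c (some_data) SELECT DISTINCT col_c FROM staging")
    cur.execute("""
        INSERT INTO app_a (some_data, b_id, c_id)
        SELECT s.col_a, b.id, c.id
        FROM staging s
        JOIN app_b b ON b.some_data = s.col_b
        JOIN app_c c ON c.some_data = s.col_c
    """)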
Can anyone guide me on what to look for - any other method or 'hack' to optimize the data population?
Thanks in advance.
Management command:
with open(settings.DATA_PATH, 'r') as file:
    csvreader = csv.reader(file)
    next(csvreader)
    b_array = []
    c_array = []
    a_array = []
    for row in csvreader:
        some_data = row[0]
        some_data = row[1]
        # ...
        b = B(
            some_data=some_data
        )
        b_array.append(b)
        c = C(
            some_data=some_data
        )
        c_array.append(c)
        a = A(
            some_data=some_data,
            b=b,
            c=c
        )
        a_array.append(a)
    B.objects.bulk_create(b_array)
    C.objects.bulk_create(c_array)
    A.objects.bulk_create(a_array)
References:
Use COPY FROM command in PostgreSQL to insert in multiple tables
Postgres: \copy syntax
Create multiple tables using single .sql script file
https://www.sqlshack.com/different-approaches-to-sql-join-multiple-tables/

GCP MQL: Rename a line in monitoring chart

I have a simple Google Cloud Monitoring Query Language (MQL) query to show the count of all requests to all containers in Kubernetes, based on log-based metrics. The query is below.
k8s_container::logging.googleapis.com/user/service-api-gateway-prod-request-in-count | sum
The widget will look like below
I would like to rename the long label for the line chart to something shorter like "request count". How do I do it?
So the best I can do is to add a new column to the table and map the column.
In my example, I append add [p: 'error count'] | map [p] to the query, and it becomes like this.
k8s_container::logging.googleapis.com/user/service-api-gateway-prod-request-in-count | sum | add [p: 'error count'] | map [p]
This works in my case.
References
https://cloud.google.com/monitoring/mql/reference#map
Instead of using MQL (Monitoring Query Language), try the Advanced tab. Just as an example, I will be using a metric named mysite-container-exited; you can name it whatever you want.
Select your resource type and the metric that you created as a log-based metric.
Select No preprocessing step.
Select SUM as the alignment function.
Now the widget will just show the name that you entered in the log-based metric details tab.
The sum is actually a shortcut for the group_by table operation with the sum aggregator. Using the complete form of group_by allows you to control the output value column name.
k8s_container::logging.googleapis.com/user/service-api-gateway-prod-request-in-count
| group_by [], [request_count: sum(val())]
You can try renaming the value column with | value [request_count: val()].
Reference entry for the value operator

Amazon DynamoDB multiple scan conditions with multiple BeginsWith

I have a table in Amazon DynamoDB with a partition key and a range key.
Table structure
Subscriber ID (partition key) | Item Id (Range Key) | Date |...
123 | P_345 | some date 1 | ...
123 | I_456 | some date 2 |
123 | A_678 | some date 3 | ...
Now I want to retrieve the data from the table using the QueryAsync C# API with multiple scan conditions.
HashKey = 123
Condition 1: Date is between 'some date 1' and 'some date 2'.
Condition 2: Range key begins_with I_ and P_.
Is there any way I can achieve this using the C# DynamoDB APIs?
Please help
You'll need to do the following (I'm not a C# expert, but you can use the following instructions to find the right C# syntax to do it):
Because you are looking for a specific hashkey, this will be a Query request, not a Scan.
You have a begins_with() condition on the range key. You specify that using the KeyConditionExpression parameter to the Query. The KeyConditionExpression will ask for HashKey=123 AND begins_with(RangeKey,"P_").
However, KeyConditionExpression does not allow an "OR" (rangekey begins with either "P_" or "I_"). You'll just need to run two separate queries - one with "I_" and one with "P_" (you can even do the two queries in parallel, if you wish).
The date is not one of the key columns, so you will need to filter it with a FilterExpression parameter to the Query. Note that filtering only happens in the last step, after DynamoDB has already read all the items matching the KeyConditionExpression above (this may increase your costs if filtering removes a lot of items, since you still pay to read them).
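For illustration, here is a rough sketch of those two queries in Python with boto3 (the table and attribute names are adapted from the question and purely illustrative); the low-level C# QueryRequest exposes the same KeyConditionExpression and FilterExpression parameters for QueryAsync:

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('Subscriptions')  # illustrative table name

def query_prefix(prefix):
    # One Query per range-key prefix, since KeyConditionExpression cannot express OR.
    resp = table.query(
        KeyConditionExpression=Key('SubscriberId').eq('123') & Key('ItemId').begins_with(prefix),
        FilterExpression=Attr('Date').between('some date 1', 'some date 2'),
    )
    return resp['Items']

items = query_prefix('P_') + query_prefix('I_')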

Django: Distinct foreign keys

class Log:
    project = ForeignKey(Project)
    msg = CharField(...)
    date = DateField(...)
I want to select the four most recent Log entries, where each Log entry must have a unique project foreign key. I've tried the solutions found through Google search but none of them work, and the Django documentation isn't very good for this kind of lookup.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But these either return nothing or Log entries of the same project.
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id', flat=True).distinct()[:4]
entries = Log.objects.filter(project_id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
    SELECT project_id, max(logs.date) as max_date
    FROM logs
    GROUP BY project_id
    ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
    SELECT logs.field1, logs.field2, logs.field3, logs.date,
           rank() over ( partition by project_id
                         order by "date" DESC ) as dateorder
    FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY logsort.date DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
from django.db.models import Max

latest_ids_per_project = Log.objects.values_list(
    'project').annotate(latest=Max('date')).order_by(
    '-latest').values_list('project')

log_objects = Log.objects.filter(
    id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg"
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC

From a one-to-many SQL dataset, can I return a comma-delimited list in SSRS?

I am returning a SQL dataset in SSRS (Microsoft SQL Server Reporting Services) with a one-to-many relationship, like this:
ID REV Event
6117 B FTG-06a
6117 B FTG-06a PMT
6117 B GTI-04b
6124 A GBI-40
6124 A GTI-04b
6124 A GTD-04c
6136 M GBI-40
6141 C GBI-40
I would like to display it as a comma-delimited field in the last column [Event] like so:
ID REV Event
6117 B FTG-06a,FTG-06a PMT,GTI-04b
6124 A GBI-40, GTI-04b, GTD-04c
6136 M GBI-40
6141 C GBI-40
Is there a way to do this on the SSRS side of things?
You want to concatenate on the SQL side, not the SSRS side; that way you can combine these results in, say, a stored procedure and then send them to the reporting layer.
Remember, databases are there to work with the data. The report should just be the presentation layer, so there is no need to tire yourself out trying to get a function to parse this data in the report.
The best thing to do is to handle this at the sproc level and push the data from the sproc to the report.
Based on your edit this is how you would do it:
To concatenate fields, take a look at COALESCE.
You will then get a string concatenation of all the values you have listed.
Here's an example:
use Northwind
declare @CategoryList varchar(1000)
select @CategoryList = coalesce(@CategoryList + ', ', '') + CategoryName from Categories
select 'Results = ' + @CategoryList
Now, because you have an additional field, namely the ID value, you cannot just bolt extra columns onto the query; you will need to use a CURSOR with this, otherwise you will get the notorious error about including additional fields in an aggregated query.
Take a look here for more help, and make sure you look at the comment at the bottom, specifically the one posted by an 'Alberto'; he has a similar issue to yours and you should be able to figure it out using his comment.