Is there a way to add an index to a dataflow without breaking query folding? Using the Add Index function breaks it, and I was hoping for a workaround.
I am trying to query BigQuery to get the max date of two columns in Google Data Fusion and pass the result into another pipeline as runtime arguments.
select max(datecolumn1) as passargmnt1, max(datecolumn2) as passargmnt2 from dummy table;
Upon research, it looks like the BigQuery Argument Setter might help, but the documentation is not much help.
Can anyone provide some detail on how to achieve this? A better alternative solution would also be welcome.
DK
I tried the BigQuery Execute plugin and chose RUN AS ARGUMENTS, but that didn't help.
I have a client application which queries data in Spanner.
Let's say I have a table with 10 columns and my client application can search on a combination of columns. Let's say I've added 5 indexes to optimise searching.
According to https://cloud.google.com/spanner/docs/sql-best-practices#secondary-indexes
it says:
In this scenario, Spanner automatically uses the secondary index SingersByLastName when executing the query (as long as three days have passed since database creation; see A note about new databases). However, it's best to explicitly tell Spanner to use that index by specifying an index directive in the FROM clause:
And also https://cloud.google.com/spanner/docs/secondary-indexes#index-directive suggests
When you use SQL to query a Spanner table, Spanner automatically uses any indexes that are likely to make the query more efficient. As a result, you don't need to specify an index for SQL queries. However, for queries that are critical for your workload, Google advises you to use FORCE_INDEX directives in your SQL statements for more consistent performance.
Both links suggest that YOU (the developer) should be supplying Force_Index on your queries. This means I now need business logic in my client to say something like:
If (object.SearchTermOne)
queryBuilder.IndexToUse = "Idx_SearchTermOne"
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use. It also means that if I add an extra index, I need a code change to make use of it.
So what are the best practices when it comes to using Force_Index in Spanner queries?
The best practice is to use the Force_Index as described in the documentation at this time.
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use..
I feel the same.
https://cloud.google.com/spanner/docs/secondary-indexes#index-directive
Note: The query optimizer requires up to three days to collect the databases statistics required to select a secondary index for a SQL query. During this time, Cloud Spanner will not automatically use any indexes.
As this note says, even after enough data has been added for the index to be effective, it may take up to three days for the optimizer to pick it up.
Queries during that time will probably be full scans.
If you want to avoid this without using Force_Index, you will need to run the ANALYZE DDL statement manually.
https://cloud.google.com/blog/products/databases/a-technical-overview-of-cloud-spanners-query-optimizer
But none of this changes the fact that we are essentially trying to do the optimizer's job...
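If the directive does have to live in client code, one way to soften the code-change concern is to keep the column-to-index mapping in a single lookup instead of scattered if-statements. Below is a minimal Java sketch under stated assumptions: the table name MyTable, the second index Idx_SearchTermTwo, and the helper class and method names are all hypothetical; only the @{FORCE_INDEX=...} table hint syntax is taken from the Spanner documentation linked above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that keeps the column-to-index mapping in one place. */
final class IndexDirectives {
    // Assumed mapping from a searchable column to the secondary index that covers it.
    private static final Map<String, String> INDEX_BY_COLUMN = new LinkedHashMap<>();
    static {
        INDEX_BY_COLUMN.put("SearchTermOne", "Idx_SearchTermOne");
        INDEX_BY_COLUMN.put("SearchTermTwo", "Idx_SearchTermTwo"); // hypothetical index
    }

    /**
     * Builds a parameterized query string. The FORCE_INDEX hint is added only
     * when the column has a mapped index; otherwise the optimizer decides.
     * Table and column must be trusted identifiers, never raw user input.
     */
    static String buildQuery(String table, String column) {
        String index = INDEX_BY_COLUMN.get(column);
        String hint = (index == null) ? "" : "@{FORCE_INDEX=" + index + "}";
        return "SELECT * FROM " + table + hint + " WHERE " + column + " = @value";
    }
}
```

With this shape, adding an index means adding one map entry (or loading the map from configuration), and columns without an entry fall back to whatever the optimizer picks.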
I have a Target transformation set to insert, and I'd like to delete the row before inserting it (a weird niche case that may pop up).
I know the update override allows for :TU.xyz to point at incoming data, but Pre/Post SQL doesn't have the same configure menu.
How would I accomplish this correctly?
From what I recall, Pre- and Post-SQL use a separate connection, so there is no way of referring to incoming data.
One thing you could do is flag or store the key somewhere and then use that flag in the Post SQL query, for example.
Maciejg is correct, there is no dynamic use of Pre and Post SQL.
I would normally recommend an Upsert approach.
However, I found that with an MS SQL target, IICS has a bug when doing Insert and Update off a Router. The workaround of using a data driven operation removes batch loading on your insert, so I now recommend a full data load approach.
From a target with the operation set to Insert, I do batch deletes with Pre SQL.
I found this faster and more cost-effective than doing delete/insert/update operations individually.
This is my code to write rows to a new BigQuery table:
PCollection<TableRow> lhReports = results.get(BigQueryImport.LIGHTHOUSE_TAG);
lhReports.apply(BigQueryIO.Write
    .named("write-lighthouse")
    .to(getBigQueryOutput(options, "lighthouse"))
    .withSchema(lhReportSchema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
Half of the time my pipeline will not have any rows to write, but the tables will be created regardless. I'd like to ensure that I don't create any empty tables.
The first thing I checked was how CREATE_IF_NEEDED works. It just specifies that the table can be created if it doesn't already exist. There is no other CreateDisposition enum that depends on the output length.
I'm not super sophisticated with Dataflow, so my next thought was to wrap the pipeline in a condition that first checks the size of the PCollection, lhReports. But I'm not seeing any kind of size/length methods in the API.
Am I on the right track?
I currently have a package pulling data from an Excel file, but when pulling the data out I get rows I do not want. So I need to extract everything from the 'ID' field that has any sort of letter in it.
I need to be able to run a RegEx command such as "%[a-zA-Z]%" to pull out that data. But with the current limitation of conditional split it's not letting me do that. Any ideas on how this can be done?
At the core of the logic, you would use a Script Transformation, as that's the only place you can access regular expressions.
You could simply add a second column to your data flow, IDCleaned, and that column would only contain cleaned values or a NULL. You could then use the Conditional Split to filter good rows vs bad. See also the related question "System.Text.RegularExpressions.Regex.Replace error in C# for SSIS".
If you don't want to add another column, you can set your current ID column to be ReadWrite for the Script and then update in place. Perhaps adding a boolean column might make the Conditional Split logic easier at this point.
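The per-row logic described above can be sketched as follows. This is written in Java only to illustrate the check; an SSIS Script Component would be written in C# or VB.NET, where System.Text.RegularExpressions.Regex.IsMatch plays the same role. The class and method names are hypothetical, and the IDCleaned behaviour (NULL for bad rows) follows the suggestion above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper mirroring what the script would compute for each row. */
final class IdCheck {
    // Matches an ASCII letter anywhere in the value, like "%[a-zA-Z]%" in the question.
    private static final Pattern HAS_LETTER = Pattern.compile("[a-zA-Z]");

    /** Boolean flag column: true when the ID contains at least one letter. */
    static boolean containsLetter(String id) {
        return id != null && HAS_LETTER.matcher(id).find();
    }

    /** IDCleaned column: the original ID when it has no letters, otherwise null. */
    static String cleanedId(String id) {
        return containsLetter(id) ? null : id;
    }
}
```

The Conditional Split can then route rows on the boolean flag, or on whether IDCleaned is NULL, without needing any regex support of its own.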