Does the Spanner API support DML statements? For example, is the following supported:
UPDATE SET foo="bar" WHERE foo="baz"
Update as of mid-October 2018:
Cloud Spanner now supports INSERT, UPDATE, and DELETE via direct DML:
Blog post about the change:
https://cloud.google.com/blog/products/databases/develop-and-deploy-apps-more-easily-with-cloud-spanner-and-cloud-bigtable-updates
Docs:
https://cloud.google.com/spanner/docs/dml-tasks
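For illustration, here is a minimal sketch of issuing such an UPDATE as a DML statement with the Python client library; the instance, database, table, and column names are placeholders:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

def update_foo(transaction):
    # Execute the DML statement; the return value is the number of updated rows.
    row_count = transaction.execute_update(
        "UPDATE MyTable SET foo = 'bar' WHERE foo = 'baz'"
    )
    print("Rows updated:", row_count)

# run_in_transaction retries on transient aborts and commits on success.
database.run_in_transaction(update_foo)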
Cloud Spanner does not support INSERT/UPDATE/DELETE DML operations; however, you can achieve the same effect by using read-write transactions. All mutations to your data must go through the transaction commit method (in either REST or gRPC), which accepts Mutation objects.
In your example, you would:
1. Start a read-write transaction and execute a SQL statement such as: SELECT <key> FROM MyTable WHERE foo="baz".
2. Commit the transaction and include a list of Mutation objects (one for each row returned by your SELECT) with the update property to set all the values to "bar".
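A minimal sketch of that pattern with the Python client library (the key column, table, and database names below are assumptions for illustration):

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

def update_rows(transaction):
    # Read the keys of every row that currently has foo = "baz".
    results = transaction.execute_sql(
        "SELECT id FROM MyTable WHERE foo = 'baz'"
    )
    keys = [row[0] for row in results]

    # Buffer one update mutation per row; they are all applied atomically
    # when the transaction commits.
    if keys:
        transaction.update(
            table="MyTable",
            columns=("id", "foo"),
            values=[(key, "bar") for key in keys],
        )

# The commit happens automatically when the function returns successfully.
database.run_in_transaction(update_rows)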
Google Cloud Spanner itself does not support this, but this JDBC Driver https://github.com/olavloite/spanner-jdbc does support it by parsing the supplied SQL and calling the read/write API of Google Cloud Spanner. Have a look at the code in CloudSpannerPreparedStatement to see how it's done. The driver relies on the SQL parsing offered by https://github.com/JSQLParser/JSqlParser.
As of version 0.16 and newer of the above-mentioned JDBC driver, full DML statements operating on multiple rows are supported. You can use the driver in combination with a tool like SQuirreL or DBVisualizer to send the statements to Cloud Spanner.
Have a look here for some examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
My scenario is:
I have 3 Dataflows:
Recent Data (from SQL Server. Refreshes 8 times a day)
Historical Data (does not refresh, just once initially)
Sharepoint Excel file Data
In my Dataset, I want to have a single Fact table that is a "union all" of all 3 sources.
Instead of an Append transformation, I want to create 3 custom Partitions (well explained here: https://www.youtube.com/watch?v=6CRqdsLjHNA&t=127s).
I want to somehow tell the scheduled refresh to process only the Recent Data and Excel Data partitions.
The reasoning is: if I do Append, then the dataset will reprocess the Historical Data every time.
Now 2 questions:
How do I tell the scheduled refresh to only process two of 3 partitions? (I can do it manually via XMLA endpoint, but I need it scheduled)
What if I change something in my report (like visuals) - how do I deploy the changes without needing to recreate the partitions?
See Advanced Refresh Scenarios which includes Metadata Only Deployment, and Automate Premium workspace and dataset tasks with service principals.
The easiest way to generate the TMSL scripts for the advanced refresh scenarios is with SQL Server Management Studio (SSMS) which has wizards for configuring refresh, and can generate the script for you. Then you use the script through PowerShell cmdlets or using ADOMD.NET, which in turn can be automated with Azure Automation or an Azure Function.
If you don't need full TMSL scripting capabilities, Power Automate has connectors that hit the Power BI REST APIs, but they don't currently support partition-based refresh.
But you can call the REST Refresh API directly through any programming language, or the Power Automate HTTP Action.
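For example, a partition-scoped refresh through the enhanced refresh REST API could look roughly like the following Python sketch; the workspace and dataset IDs, the table and partition names, and the token acquisition are all placeholders you would replace, and partition objects in the request body require a Premium or PPU workspace:

import requests

GROUP_ID = "<workspace-id>"        # placeholder
DATASET_ID = "<dataset-id>"        # placeholder
ACCESS_TOKEN = "<azure-ad-token>"  # placeholder: token for a service principal or user

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)

# Refresh only the two partitions that actually change; the historical
# partition is simply left out of the request.
body = {
    "type": "full",
    "objects": [
        {"table": "FactTable", "partition": "RecentData"},
        {"table": "FactTable", "partition": "ExcelData"},
    ],
}

response = requests.post(
    url,
    json=body,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()
print("Refresh accepted:", response.status_code)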
Also you should take a look at the new (Preview) Hybrid Tables feature which would enable you to have the recent data in a DirectQuery partition, while the historical data is in Import mode.
I am trying to use pre-aggregations over Cloud SQL on Google Cloud Platform, but the database is denying access and giving the error "Statement violates GTID consistency".
Any help is appreciated.
Cube.js does pre-aggregation with CREATE TABLE ... SELECT, but you are using MySQL on top of Google Cloud SQL with --enforce-gtid-consistency (which has limitations).
Since only transactionally safe statements can be logged, there is a limitation on using CREATE TABLE ... SELECT (and some other SQL), because this statement is actually logged as two separate events.
There are two ways to solve this issue:
1. Use pre-aggregations to an external database (the recommended way):
https://cube.dev/docs/pre-aggregations/#read-only-data-source-pre-aggregations
2. Use the undocumented flag loadPreAggregationWithoutMetaLock.
Attention: this flag is experimental and can be removed or changed in the future.
Take a look at the source code
You can pass it directly in the driver constructor. This will produce two SQL statements to work around the limitation:
CREATE TABLE
INSERT INTO
I am working on a project which crunching data and doing a lot of processing. So I chose to work with BigQuery as it has good support to run analytical queries. However, the final result that is computed is stored in a table that has to power my webpage (used as a Transactional/OLTP). My understanding is, BigQuery is not suitable for transactional queries. I was looking more into other alternatives and I realized I can use DataFlow to do analytical processing and move the data to Cloud SQL (relationalDb fits my purpose).
However, it's not as straightforward as it seems: first I have to create a pipeline to move the data to a Cloud Storage bucket and then move it into Cloud SQL.
Is there a better way to manage it? Can I use "Create Job from SQL" in Dataflow to do it? I haven't found any examples that use "Create Job From SQL" to process and move data to GCP Cloud SQL.
Consider a simple example on Robinhood:
Compute the user's returns by looking at his portfolio and show the graph with the returns for every month.
There are other options besides using a pipeline, but in all cases you cannot export table data to a local file, to Sheets, or to Drive. The only supported export location is Cloud Storage, as stated on the Exporting table data documentation page.
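As an illustration, one simple (if two-step) path is to export the result table to Cloud Storage and then import that file into Cloud SQL. A rough sketch with the Python BigQuery client, where the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Export the computed results table to CSV files in Cloud Storage.
table_id = "my-project.analytics.monthly_returns"               # placeholder
destination_uri = "gs://my-bucket/exports/monthly_returns-*.csv"

extract_job = client.extract_table(table_id, destination_uri)
extract_job.result()  # Wait for the export to finish.

# The CSV files can then be imported into Cloud SQL, for example with:
#   gcloud sql import csv MY_INSTANCE gs://my-bucket/exports/monthly_returns-000000000000.csv \
#       --database=mydb --table=monthly_returns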
Hive Partitioned Tables have a folder structure with the partition date as the folder name. I have explored loading externally partitioned tables directly into BigQuery, which is possible.
What I would like to know is whether this is possible with Dataflow, since I am going to be running some feature transforms and such using Dataflow before loading the data into BigQuery. What I have found is that if I add the partition date as a column, then partitioning is possible, but I am looking for a direct method where I wouldn't add the column during transforms and the partitioning would instead happen while loading the data into BigQuery.
Is such a thing possible?
Hive partitioning is a beta feature in BigQuery that was released on Oct 31st, 2019. The latest version of the Apache Beam SDK supported by Dataflow is 2.16.0, which was released on Oct 7th, 2019. At the moment, neither the Java nor the Python SDK supports this feature directly. So, if you want to use it from Dataflow, maybe you could try calling the BigQuery API directly.
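For what it's worth, here is a hedged sketch of calling the API directly from Python (e.g. as a step after your Dataflow job), assuming a google-cloud-bigquery release recent enough to expose HivePartitioningOptions; the bucket layout, source format, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Describe the Hive-style layout, e.g. gs://my-bucket/table/dt=2019-10-31/part-0.parquet
hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"  # Infer the partition keys and types from the paths.
hive_options.source_uri_prefix = "gs://my-bucket/table/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_options,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/table/*",
    "my-project.my_dataset.my_table",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # Wait for the load job to complete.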
Suppose I have a million rows in a table. I want to flip a flag in a column from true to false. How do I do that in spanner with a single statement?
That is, I want to achieve the following DML statement.
Update mytable set myflag=true where 1=1;
Cloud Spanner doesn't currently support DML, but we are working on a Dataflow connector (Apache Beam) that would allow you to do bulk mutations.
You can use this open source JDBC driver in combination with a standard JDBC tool like for example SQuirreL or SQL Workbench. Have a look here for a short tutorial on how to use the driver with these tools: http://www.googlecloudspanner.com/2017/10/using-standard-database-tools-with.html
The JDBC driver supports both DML and DDL statements, so this statement should work out of the box:
Update mytable set myflag=true
DML statements operating on a large number of rows are supported, but the underlying transaction quotas of Cloud Spanner continue to apply (max 20,000 mutations in one transaction). You can bypass this by setting the AllowExtendedMode=true connection property (see the Wiki pages of the driver). This breaks a large update into several smaller updates and executes each of these in its own transaction. You can also do this batching yourself by dividing your update into several smaller parts (see the sketch below for the same idea with the plain client library).
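If you would rather stay with the plain client library than the JDBC driver, the same chunking idea can be sketched in Python: read the keys first, then apply the mutations in batches small enough to stay under the per-transaction mutation limit. The table, key column, and chunk size below are assumptions for illustration, and unlike a single statement the chunks do not commit atomically as one unit:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# Read all keys with a snapshot read, outside of the write transactions.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT id FROM mytable")
    keys = [row[0] for row in rows]

# Apply the updates in chunks; each batch commits as its own transaction.
# 5,000 rows x 2 columns = 10,000 mutations, well under the 20,000 limit.
CHUNK = 5000
for start in range(0, len(keys), CHUNK):
    chunk = keys[start:start + CHUNK]
    with database.batch() as batch:
        batch.update(
            table="mytable",
            columns=("id", "myflag"),
            values=[(key, True) for key in chunk],
        )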