Suppose I have a million rows in a table. I want to flip a flag in a column from true to false. How do I do that in spanner with a single statement?
That is, I want to achieve the following DML statement.
UPDATE mytable SET myflag = false WHERE 1=1;
Cloud Spanner doesn't currently support DML, but we are working on a Dataflow connector (Apache Beam) that would allow you to do bulk mutations.
You can use this open source JDBC driver in combination with a standard JDBC tool such as SQuirreL or SQL Workbench. Have a look here for a short tutorial on how to use the driver with these tools: http://www.googlecloudspanner.com/2017/10/using-standard-database-tools-with.html
The JDBC driver supports both DML and DDL statements, so this statement should work out of the box:
UPDATE mytable SET myflag = false
DML statements operating on a large number of rows are supported, but the underlying transaction quotas of Cloud Spanner continue to apply (a maximum of 20,000 mutations in one transaction). You can work around this by setting the AllowExtendedMode=true connection property (see the wiki pages of the driver). This breaks a large update into several smaller updates and executes each of them in its own transaction. You can also do this batching yourself by dividing your update statement into several different parts.
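As a minimal sketch, here is how running that statement through the driver from Java could look. The project, instance, database, and key-file names are placeholders, and the exact connection URL properties should be checked against the driver's wiki; this is an illustration, not the driver's documented canonical form.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FlipFlag {
    public static void main(String[] args) throws Exception {
        // Connection properties are illustrative; see the driver's wiki for the exact format.
        String url = "jdbc:cloudspanner://localhost;Project=my-project;Instance=my-instance;"
                   + "Database=my-db;PvtKeyPath=/path/to/key.json;AllowExtendedMode=true";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // With AllowExtendedMode=true the driver splits this into several smaller
            // transactions so the 20,000-mutation limit per commit is not exceeded.
            int updated = stmt.executeUpdate("UPDATE mytable SET myflag = false");
            System.out.println("Rows updated: " + updated);
        }
    }
}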
The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."
An interactive query system is basically a user interface that translates user input into SQL queries. These are then sent to Presto, which processes the queries and sends the results back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The user expects to be able to, for example, change one criterion and get the updated chart or visualization in near real time, just as they would with any typical application, even if producing that analysis involves processing a lot of data.
Presto can do that because it can query massive distributed storage systems like HDFS, many other cloud storage systems, RDBMSs, and so on. It can also be set up with a large cluster of workers that query the sources in parallel, so it can process huge amounts of data for an analysis and still be fast enough to meet user expectations.
A typical application to use for the visualization is Apache Superset. You hook Presto up to it through its Presto connector, point Presto at the underlying data sources, and you are ready to go.
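If you want to run queries against Presto directly from code rather than through a BI tool, its JDBC driver behaves like any other. A minimal sketch follows; the coordinator host, catalog, schema, and the query itself are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoQuery {
    public static void main(String[] args) throws Exception {
        // Coordinator host, catalog (hive) and schema (default) are placeholders.
        String url = "jdbc:presto://presto-coordinator.example.com:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, count(*) AS orders FROM orders GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getLong("orders"));
            }
        }
    }
}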
Does the Spanner API support DML statements? For example, is the following supported:
UPDATE MyTable SET foo="bar" WHERE foo="baz"
Update as per mid-October 2018:
Cloud Spanner now does support INSERT, UPDATE, and DELETE using direct DML:
Blog post about the change:
https://cloud.google.com/blog/products/databases/develop-and-deploy-apps-more-easily-with-cloud-spanner-and-cloud-bigtable-updates
Docs:
https://cloud.google.com/spanner/docs/dml-tasks
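For example, with the Java client library a DML statement now runs inside a read-write transaction. A minimal sketch, where the instance, database, table, and column names are placeholders:

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

public class SpannerDmlExample {
    public static void main(String[] args) {
        SpannerOptions options = SpannerOptions.newBuilder().build();
        Spanner spanner = options.getService();
        try {
            DatabaseClient dbClient = spanner.getDatabaseClient(
                DatabaseId.of(options.getProjectId(), "my-instance", "my-database"));
            // Regular DML runs inside a read-write transaction.
            long rowCount = dbClient.readWriteTransaction().run(
                txn -> txn.executeUpdate(
                    Statement.of("UPDATE MyTable SET foo = 'bar' WHERE foo = 'baz'")));
            System.out.println("Updated " + rowCount + " rows");
        } finally {
            spanner.close();
        }
    }
}

For updates that touch millions of rows, the client also offers executePartitionedUpdate, which runs the statement as Partitioned DML so it is not constrained by a single transaction's mutation limits.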
Cloud Spanner does not support INSERT/UPDATE/DELETE DML operations; however, you can achieve the same effect by using read-write transactions. All mutations to your data must go through the transaction commit method (in either REST or gRPC), which accepts Mutation objects.
In your example, you would
Start a read-write transaction and execute a SQL statement such as: SELECT <key> FROM MyTable WHERE foo="baz".
Then commit the transaction and include a list of Mutation objects (one for each row you got back from your select) with the update property to set all the values to "bar".
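A rough sketch of that flow with the Java client library, where the key and column names are assumptions for illustration (and the 20,000-mutations-per-commit limit still applies):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

public class MutationUpdate {
    static void updateFooToBar(DatabaseClient dbClient) {
        dbClient.readWriteTransaction().run(txn -> {
            List<Mutation> mutations = new ArrayList<>();
            // 1) Read the keys of the rows that need to change.
            try (ResultSet rs = txn.executeQuery(
                    Statement.of("SELECT key FROM MyTable WHERE foo = 'baz'"))) {
                while (rs.next()) {
                    // 2) Buffer one update mutation per row.
                    mutations.add(Mutation.newUpdateBuilder("MyTable")
                        .set("key").to(rs.getString("key"))
                        .set("foo").to("bar")
                        .build());
                }
            }
            // 3) The buffered mutations are applied when the transaction commits.
            txn.buffer(mutations);
            return null;
        });
    }
}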
Google Cloud Spanner itself does not support this, but this JDBC Driver https://github.com/olavloite/spanner-jdbc does support it by parsing the supplied SQL and calling the read/write API of Google Cloud Spanner. Have a look at the code in CloudSpannerPreparedStatement to see how it's done. The driver relies on the SQL parsing offered by https://github.com/JSQLParser/JSqlParser.
As of version 0.16 and newer of the above-mentioned JDBC driver, full DML statements operating on multiple rows are supported. You can use the driver in combination with a tool like SQuirreL or DbVisualizer to send the statements to Cloud Spanner.
Have a look here for some examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
I'm using AWS Redshift as a back end for Tableau Desktop. The AWS cluster is running with two dc1.large nodes, and the table I'm analyzing is about 30 GB (with Redshift compression enabled). I chose Redshift over a Tableau extract for performance reasons, but the live Redshift connection seems much slower than an extract. Any suggestions on where I should look?
To use Redshift as a backend for a BI platform like Tableau, there are four things you can do to address latency:
1) Concurrency: Redshift is not great at running multiple queries at the same time, so before you start tuning the database, make sure your query is not waiting in line behind other queries. (If you are the only one on the cluster, this shouldn't be a problem.)
2) Table size: Whenever you can, use aggregate tables for better performance. Fewer rows to scan means less I/O and faster turnaround!
3) Query complexity: Ideally, you want your BI tool to issue simple, fast performing queries. Make sure your source tables are fast, and that Tableau isn't being forced to do a bunch of joins. Also, if your query does need to join multiple tables, make sure any large tables have the same distribution key.
4) "Indexing": Technically, Redshift does not support true indexing, but you can get close to the same thing by using "interleaved" sort keys. Traditional compound sort keys won't help, but an interleaved sort key can allow you to quickly access rows from multiple vectors (date and customer_id, for instance) without having to scan the entire table.
Reality Check
After all of these things are optimized, you will often find that you still can't be as fast as a Tableau extract. Simply stated, a "fast" Tableau dashboard needs to return data to its user in under 5 seconds. If you have 7 visuals on your dashboard, and each of the underlying queries takes 800 milliseconds to return (which is very fast for a database query), you are still only just hitting your target performance. And if just one of those queries takes 5 seconds or more, your dashboard is going to feel "slow" no matter what you do.
In Summary
Redshift can be tuned using the approach above, but it may or may not be worth the effort. The best applications for using a live Redshift query instead of Tableau Extracts are in cases where the data is physically too large to create an extract of, and when you require data at a level of granularity that makes pre-aggregation infeasible. One good strategy is to create your main dashboard using an extract so that exploration/discovery is as fast as possible, and then use direct (live) Redshift queries for your drill-through reports (for instance, when you want to see exactly which customers roll up into your totals).
A few pointers:
1) Run VACUUM and ANALYZE once your ETL completes (a small sketch follows this list)
2) Make sure the table was created with an appropriate distribution key and sort key
3) Pre-aggregate the data if that is acceptable given the required granularity and other requirements
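A minimal sketch of point 1 over JDBC; the table name is a placeholder, and note that VACUUM has to run outside an explicit transaction, hence auto-commit:

import java.sql.Connection;
import java.sql.Statement;

public class PostEtlMaintenance {
    // Run after the ETL finishes; conn is an open JDBC connection to Redshift.
    static void vacuumAndAnalyze(Connection conn) throws Exception {
        conn.setAutoCommit(true);                   // VACUUM cannot run inside a transaction block
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("VACUUM FULL my_table");   // reclaim space and re-sort rows
            stmt.execute("ANALYZE my_table");       // refresh planner statistics
        }
    }
}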
1. Remove cursors: Tableau reads data from the Redshift leader node through a cursor, and cursors work iteratively, which hurts performance.
2. Perform a manual ANALYZE on the table after running heavy load operations: https://docs.aws.amazon.com/redshift/latest/dg/r_ANALYZE.html
3. Check the distribution key to avoid data skew and improve performance.
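For point 3, one way to spot skew is to look at the skew_rows column of the SVV_TABLE_INFO system view (higher values mean a more uneven spread of rows across slices). A small sketch over JDBC:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckSkew {
    // Lists tables ordered by row skew; conn is an open JDBC connection to Redshift.
    static void printSkew(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT \"table\", diststyle, skew_rows "
               + "FROM svv_table_info ORDER BY skew_rows DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " (" + rs.getString(2)
                    + "): skew_rows=" + rs.getDouble(3));
            }
        }
    }
}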
I have an application that requires me to pull certain information from DB#1 and push it to DB#2 every time a certain entry in a table from DB#1 is updated. The polling rate doesn't need to be extremely fast, but it probably shouldn't be any slower than 1 second.
I was planning on writing a small service using the C++ Connector library, but I am worried about putting too much load on DB#1. Is there a more efficient way of doing this, such as built-in functionality within a SQL script?
There are many ways to accomplish this, so other factors you care about may end up driving the approach.
If the SQL Server databases are on the same server instance:
A trigger on the DB1 tables that pushes changes to the DB2 tables
A stored procedure (in DB1 or DB2) that uses MERGE to identify changes and sync them to DB2, with a SQL Agent job calling the procedure on your schedule (a rough sketch of driving this from code is at the end of this answer)
Change Tracking enabled on the database and the desired tables, then a stored proc + SQL Agent job to send changes without having to scan the source tables
If on different instances or servers (can also work if on same instance though):
SSIS package to identify changes and push them to DB2 (bonus: works well with Change Data Capture)
Merge Replication to synchronize changes
AlwaysOn Availability Groups to synchronize entire databases
Microsoft Sync Framework
Knowing nothing about your preferences or comfort level, I would probably start with Merge Replication - it can be a bit tricky and tedious to set up, but it performs very well.
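If you would rather keep the scheduling inside your own small service (as you were planning) instead of a SQL Agent job, the MERGE option above can be driven from code. A rough sketch with made-up table, column, and connection names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SyncService {
    // Hypothetical MERGE that copies changed rows from DB1 to DB2 on the same instance.
    private static final String MERGE_SQL =
        "MERGE DB2.dbo.Target AS t "
      + "USING DB1.dbo.Source AS s ON t.id = s.id "
      + "WHEN MATCHED AND t.payload <> s.payload THEN UPDATE SET t.payload = s.payload "
      + "WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);";

    public static void main(String[] args) {
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=DB1";  // illustrative URL
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(MERGE_SQL);   // sync once per second
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}

In practice you would combine this with Change Tracking (the third option above) or a rowversion/timestamp filter so each run only touches changed rows instead of scanning the whole source table every second.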
You can create a trigger in DB1 and a database link between DB1 and DB2. That way the trigger fires natively within DB1 and transfers the data directly to DB2.
Is it possible to execute 2 INSERT or UPDATE statements using cfquery?
If yes, how?
If no, what is the best way to execute multiple queries in ColdFusion while opening only one connection to the DB?
I think every time we call cfquery we are opening a new connection to the DB.
Is it possible to execute 2 INSERT or UPDATE statements using cfquery?
Most likely yes, but whether you can run multiple statements is determined by your database type and driver/connection settings. For example, when you create an MS SQL datasource, IIRC multiple statements are allowed by default, whereas MySQL drivers often disable multiple statements by default to help avoid SQL injection. In that case you must enable multiple statements explicitly in your connection settings; otherwise, you cannot use them. There are also some databases (usually desktop ones like MS Access) that do not support multiple statements at all. So I do not think there is a blanket answer to this question.
If the two insert/update statements are related, you should definitely use cftransaction as Sam suggested. That ensures the statements are treated as a single unit: either they all succeed or they all fail, so you are not left with partial or inconsistent data. To accomplish that, a single connection will be used for both queries in the transaction.
I think every time we call cfquery we are opening a new connection to the DB.
As Sam mentioned, that depends on your settings and whether you are using cftransaction. If you enable Maintain Connections (under Datasource settings in the CF Administrator), CF will maintain a pool of open connections. So when you run a query, CF just grabs an open connection from the pool rather than opening a new one each time. When using cftransaction, the same connection should be used for all queries, regardless of whether Maintain Connections is enabled.
Within the data source settings you can tell it whether to keep connections open or not with the Maintain Connections setting.
Starting with, I believe, ColdFusion 8, datasources are set up to run only one query at a time due to concerns about SQL injection. To change this you would need to modify the connection string.
Your best bet is to turn on Maintain Connections and if needed use cftransaction:
<cftransaction>
<cfquery name="ins" datasource="dsn">
insert into table1 values(<cfqueryparam value="#url.x#">)
</cfquery>
<cfquery name="ins" datasource="dsn">
insert into table2 values(<cfqueryparam value="#url.x#">)
</cfquery>
</cftransaction>
And always, always use cfqueryparam for values submitted by users.
The MySQL driver in CF8 does now allow multiple statements.
As Sam says, you can use <cftransaction> to group many statements together,
or in the ColdFusion Administrator | Data & Services | Data Sources,
add
allowMultiQueries=true
to the Connection String field.
I don't have a CF server to try it on, but it should work fine IIRC.
something like:
<cfquery name="doubleInsert" datasource="dsn">
insert into table1 values(x,y,z);
insert into table1 values(a,b,c);
</cfquery>
if you want a more specific example you will have to give more specific information.
Edit: Thanks to @SamFarmer: Newer versions of CF than I have used may prevent this.
Sorry for the Necro (I'm new to the site).
You didn't mention what DB you're using. If you happen to use MySQL you can add as many records as the max heap size will allow.
I regularly insert up to ~4500 records at a time with the default heap size (but that'll depend on the amount of data you have).
INSERT INTO yourTable (x,y,z) VALUES ('a','b','c'),('d','e','f'),('g','h','i')
All DBs should do this IMO.
HTH
Use CFTRANSACTION to group multiple queries into a single unit.
Any queries executed with CFQUERY and placed between <cftransaction> and </cftransaction> tags are treated as a single transaction. Changes to data requested by these queries are not committed to the database until all actions within the transaction block have executed successfully. If an error occurs in a query, all changes made by previous queries within the transaction block are rolled back.
Use the ISOLATION attribute for additional control over how the database engine performs locking during the transaction.
For more information visit http://www.adobe.com/livedocs/coldfusion/5.0/CFML_Reference/Tags103.htm