Impala: when to refresh tables?

Using Impala, I have noticed a deterioration in performance when I repeatedly perform truncate and insert operations on internal tables.
The question is: can refreshing the tables avoid the problem?
So far I have used refresh only for external tables, every time I copied files into HDFS to be loaded into those tables.
Many thanks in advance!
Moreno

You can use compute stats instead of refresh.
Refresh is normally used when you add a data file or change something in the table metadata, such as adding a column or partition or changing a column. It quickly reloads the metadata. There is another related command, invalidate metadata, but it is more expensive than refresh and forces Impala to reload the table's metadata the next time the table is referenced in a query.
compute stats - this computes statistics for the table or its columns; run it when around 30% of the data has changed. It is an expensive operation but effective when you do frequent truncate-and-load cycles.
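For illustration, a minimal sketch of the three commands, assuming a hypothetical internal table my_db.sales that is truncated and reloaded frequently:

    -- Cheap: reload metadata after files change underneath the table
    REFRESH my_db.sales;

    -- More expensive: discard cached metadata and reload it on next access
    INVALIDATE METADATA my_db.sales;

    -- Recompute table and column statistics after a truncate-and-load cycle
    COMPUTE STATS my_db.sales;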

Related

Billions of rows table as input to Quicksight dataset

There are two Redshift tables named A and B, and a QuickSight dashboard that runs A MINUS B as the query to display content for a visual. If we use the direct query option, it times out because the query does not complete within 2 minutes (QuickSight has a hard limit of 2 minutes per query). Is there a way to use such large datasets as input to a QuickSight dashboard visual?
We can't use the SPICE engine because it has a 1-billion-row / 1 TB size limit. It also has a 15-minute delay to refresh data.
You will likely need to provide more information to fully resolve this. MINUS can be a very expensive operation, especially if you haven't optimized the tables for it. Can you provide information about your table setup and the EXPLAIN plan of the query you are running?
Barring improving the query, one way to work around a poorly performing query behind QuickSight is to move that query into a materialized view. This way the result of the query is stored for later retrieval, but it needs to be refreshed when the source data changes. If your data only changes every 15 minutes (did I get that right?), then this may be an option.
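As a minimal sketch, assuming hypothetical tables table_a and table_b with identical column lists (note that set operators like MINUS generally rule out incremental refresh, so the view would be fully recomputed each time):

    -- Precompute the expensive set difference so the visual reads stored rows
    CREATE MATERIALIZED VIEW a_minus_b AS
    SELECT * FROM table_a
    MINUS
    SELECT * FROM table_b;

    -- Refresh on a schedule that matches how often the source data changes
    REFRESH MATERIALIZED VIEW a_minus_b;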

How best cache bigquery table for fast lookup of individual row?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6 GB) that may be expected to grow slowly to approximately double its current size.
I need a way to get quick, one-row-at-a-time lookups by id against that aggregate table in a separate event-driven pipeline. I.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups. So I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of agg table to GCS. Kick off a dataflow job to stream contents of gcs dump to pubsub. Create a serverless function to listen to pubsub topic and insert rows into firestore.
2) A long running script on compute engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1)
3) Schedule a dump of the agg table to GCS. Format it in such a way that it can be directly imported into Firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of dataflow job that performs lookups directly against the BigQuery table? I haven't played with this approach before, so I have no idea how costly or performant it would be.
5) Some other option I've not considered?
The ideal solution would allow me quick access in milliseconds to an agg row which would allow me to append data to the real time event.
Is there a clear best winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and less data-consuming. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
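A minimal sketch of that approach, assuming a hypothetical aggregate table mydataset.agg_table with an id column:

    -- Rebuild the aggregate as a clustered table so point lookups prune blocks
    CREATE TABLE mydataset.agg_table_by_id
    CLUSTER BY id
    AS
    SELECT * FROM mydataset.agg_table;

    -- A lookup by id scans only the blocks that can contain that id
    SELECT *
    FROM mydataset.agg_table_by_id
    WHERE id = 'person_a';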
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
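As a rough sketch of such a federated read (assuming a hypothetical Cloud SQL connection resource and an agg_lookup table living on the Cloud SQL side):

    -- Run a query on Cloud SQL from BigQuery and return its result set
    SELECT *
    FROM EXTERNAL_QUERY(
      'projects/my-project/locations/us/connections/my-cloudsql-conn',
      'SELECT id, history FROM agg_lookup WHERE id = "person_a"');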

Temporary blocks in update command making DISK FULL redshift

I am new to Redshift and struggling to update a column in a Redshift table. I have a huge data table and added an empty column to it. I am trying to fill this empty column by joining the table with another table using the UPDATE command. What worries me is that even though there is 291 GB of space left, the temporary blocks created by this UPDATE statement produce a DISK FULL error. Any solutions or suggestions are appreciated. Thanks in advance!
It is not recommended to perform a large UPDATE command in Amazon Redshift tables.
The reason is that updating even just one column in a row causes the following:
The existing row will be marked as Deleted, but still occupies disk space until the table is VACUUMed
A new row is added to the end of the table storage, which is then out of sort order
If you are updating every row in the table, this means that the storage required for the table is twice as much, possibly more due to less-efficient compression. This is possibly what is consuming your disk space.
The suggested alternate method is to select the joined data into a new table. Yes, this will also require more disk space, but it will be more efficiently organized. You can then delete the original table and rename the new table to the old table name.
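A minimal sketch of that approach, assuming a hypothetical big_table whose new column is populated by joining lookup_table on id:

    -- Build a fresh, compactly stored copy with the new column filled in
    CREATE TABLE big_table_new AS
    SELECT b.*, l.new_value
    FROM big_table b
    LEFT JOIN lookup_table l ON b.id = l.id;

    -- Once the new table is verified, swap it in for the original
    DROP TABLE big_table;
    ALTER TABLE big_table_new RENAME TO big_table;

Note that CREATE TABLE AS does not necessarily carry over the original distribution style and sort key, so you may want to declare DISTKEY and SORTKEY explicitly on the new table.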
Some resources:
Updating and Inserting New Data - Amazon Redshift
How to Improve Amazon Redshift Upload Performance

BigQuery - querying only a subset of keys in a table with key value schema

So I have a table with the following schema:
timestamp: TIMESTAMP
key: STRING
value: FLOAT
There are around 200 unique keys. I am partitioning the dataset by date.
I want to run several (5-6 currently, but I expect to add at least 15 more) queries on a daily basis on this database. Brute forcing these would cost me a lot daily, which I want to avoid.
The issue is that because of this key-value format, and because BigQuery is a columnar database, each query scans the whole day's data, despite each query actually using a maximum of 4 keys. What is the best way to optimize this?
I am thinking the best way I can go about it right now is to create separate temp tables for each key as a daily batch process, run my queries on them and then delete them.
The ideal way I would want to go about it is partitioning by key, but I am not sure there is any such provision?
You can try using the recently introduced clustering of partitioned tables.
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
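For example, a minimal sketch assuming a hypothetical source table mydataset.kv_raw with the timestamp/key/value schema described in the question:

    -- Partition by day and cluster by key so per-key queries prune blocks
    CREATE TABLE mydataset.kv_clustered
    PARTITION BY DATE(timestamp)
    CLUSTER BY key
    AS
    SELECT timestamp, key, value
    FROM mydataset.kv_raw;

    -- A daily query touching only a handful of keys scans far less data
    SELECT key, AVG(value) AS avg_value
    FROM mydataset.kv_clustered
    WHERE DATE(timestamp) = '2024-01-01'
      AND key IN ('key_a', 'key_b')
    GROUP BY key;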
Update (moved from comments)
Also keep in mind the comparison below:
Feature           Partitioning    Clustering
----------------  --------------  --------------
Cardinality       Less than 10k   Unlimited
Dry Run Pricing   Available       Not available
Query Pricing     Exact           Best Effort
Pay special attention to Dry Run Pricing: unfortunately, clustered tables do not support a dry run (validation) based on the clustering keys; validation is shown only based on partitions. But if you set up your clustering properly, the actual run will end up with a lower cost. You should try it with smaller data first to get comfortable with this.
See more at Clustering partitioned tables

Reflecting changes on big tables in hdfs

I have an order table in the OLTP system.
Each order record has an OrderStatus field.
When end users create an order, the OrderStatus field is set to "Open".
When somebody cancels the order, the OrderStatus field is set to "Canceled".
When the order process is finished (the order is transformed into an invoice), the OrderStatus field is set to "Close".
There are more than one hundred million records in the table in the OLTP system.
I want to design and populate a data warehouse and data marts on the HDFS layer.
In order to design the data marts, I need to import the whole order table into HDFS and then reflect changes to the table continuously.
First, I can import the whole table into HDFS in the initial load process by using Sqoop. It may take a long time, but I will do this only once.
When an order record is updated or a new order record is entered, I need to reflect the changes in HDFS. How can I achieve this for such a big transaction table?
Thanks
One of the easier ways is to work with database triggers in your OLTP source database: every time an update happens, use that trigger to push an update event to your Hadoop environment.
On the other hand (this depends on the requirements of your data users), it might be enough to reload the whole data dump every night.
Also, if there is some kind of last-changed timestamp, it might be possible to load only the newest data and do some kind of delta check.
This all depends on your data structure, your requirements and the resources at hand.
There are several other ways to do this, but usually those involve messaging, development and new servers, and I suppose in your case this infrastructure or those resources are not available.
EDIT
Since you have a last changed date, you might be able to pull the data with a statement like
SELECT columns FROM table WHERE lastchangedate > (now - 24 hours)
or whatever your interval for loading might be.
Then process the data with Sqoop or ETL tools or the like. If the records are already available in your Hadoop environment, you want to UPDATE them. If the records are not available, INSERT them with your appropriate mechanism. This is also sometimes called upserting.
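As one possible sketch of that upsert step, assuming a hypothetical Hive ACID (transactional) table orders_dw as the warehouse copy and a staging table orders_delta holding the newly pulled records:

    -- Update existing orders and insert new ones in a single statement
    MERGE INTO orders_dw AS t
    USING orders_delta AS d
      ON t.order_id = d.order_id
    WHEN MATCHED THEN
      UPDATE SET orderstatus = d.orderstatus, lastchangedate = d.lastchangedate
    WHEN NOT MATCHED THEN
      INSERT VALUES (d.order_id, d.orderstatus, d.lastchangedate);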