How to drop a PolyBase external table if it exists? - azure-sqldw

I have loaded file data from Azure Blob Storage into an Azure SQL DW external table through PolyBase. The file in the blob container has since been updated, and now I want to load the fresh data. Can anyone suggest how the fresh data can be loaded into the external table through PolyBase? I am trying to find a way to drop the external table if it exists and create it again to load the fresh data.

There is no need to drop external tables to view new data. However, you can use the DROP EXTERNAL TABLE syntax to drop PolyBase / external tables if required, e.g. to change the definition or REJECT_TYPE. You can also check the DMV sys.external_tables for their existence prior to dropping them, e.g.:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE object_id = OBJECT_ID('yourSchema.yourTable') )
DROP EXTERNAL TABLE yourSchema.yourTable
GO
Azure SQL Data Warehouse does not yet support the DROP ... IF EXISTS (DIE) syntax available in SQL Server 2016. However, as mentioned, there should be no need to drop external tables just to view new data: if the blob file has been updated, the new data will simply appear in the external table the next time you query it.
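For reference, the DIE form referred to above looks like this for an ordinary table in SQL Server 2016 (shown purely as a syntax illustration against a hypothetical table; on Azure SQL Data Warehouse external tables you still need the OBJECT_ID check shown earlier):
-- SQL Server 2016 "drop if exists" shorthand for a regular table
DROP TABLE IF EXISTS dbo.SomeStagingTable;
GO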
Another approach is to supply a directory name in your external table definition. Then, by simply dropping new files into that folder, the data will appear the next time you query the table, e.g.:
CREATE EXTERNAL TABLE dbo.DimDate2External (
DateId INT NOT NULL,
CalendarQuarter TINYINT NOT NULL,
FiscalQuarter TINYINT NOT NULL
)
WITH (
LOCATION='/textfiles/dimDate/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile
);
So, if you had an initial file in that folder called DimDate1.txt and then added a new file called DimDate2.txt, the data from both files would appear in the table as one.
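As a quick sanity check (a minimal sketch against the hypothetical table above), you can simply re-query the external table after dropping the new file into the folder and confirm that the row count has grown:
-- Rows from every file under /textfiles/dimDate/ are returned together
SELECT COUNT(*) AS TotalRows
FROM dbo.DimDate2External;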

If you have created an external table as specified in
https://msdn.microsoft.com/en-us/library/dn935021.aspx, then there shouldn't be anything to do.
The external table will be a "pointer" to your file, and every time you query the table, data will be read from the original file. This way if you update the file there are no actions to take on Azure SQL DW.
If you have imported your data INTO Azure SQL DW using the CREATE TABLE AS SELECT syntax (see https://msdn.microsoft.com/en-us/library/mt204041.aspx), reading from an external table, you will need to drop that table, but not the external one; the above applies here as well, and when you query the external table the updated file will be read.
So:
--creating an external table (using defined external data source and file format):
CREATE EXTERNAL TABLE ClickStream (
url varchar(50),
event_date date,
user_IP varchar(50)
)
WITH (
LOCATION='/webdata/employee.tbl',
DATA_SOURCE = mydatasource,
FILE_FORMAT = myfileformat
)
;
When you select from ClickStream it will always read content from /webdata/employee.tbl file. If you only update the employee.tbl file with new data, there are no actions to take.
Instead:
--Use CREATE TABLE AS SELECT to import the Azure blob storage data into a new
--SQL Data Warehouse table called ClickStreamData
CREATE TABLE ClickStreamData
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH (user_IP)
)
AS SELECT * FROM ClickStream
;
Data will be copied to the ClickStreamData table in the instance, and updates to the file will not be reflected. In this case you will need to drop ClickStreamData and re-create it, but you can still use ClickStream as the source, since that external table will read data from the updated file.
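A minimal sketch of that refresh, reusing the CTAS definition from above (drop the materialised copy and rebuild it from the external table, which will read the updated file):
-- Rebuild the imported copy after the underlying blob file has changed
DROP TABLE ClickStreamData;
CREATE TABLE ClickStreamData
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH (user_IP)
)
AS SELECT * FROM ClickStream
;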

Related

BigQuery Hive external partitioning and partition column names

I have created an external BigQuery partitioned table on GCS using the Hive partition strategy with Parquet export. All works fine with key:value naming on GCS. BQ picks up the partitioned stripes as external columns and shows these columns as added to the table schema.
Let us assume that I have a column called reporting_date in the BQ native table. I cannot call my partition stripe reporting_date=2021-08-31, otherwise BQ does not allow the external table to be created. So I call it reportingdate=2021-08-31.
So far so good. I can use a predicate like
select * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reportingdate = '2021-08-31'
This will use partition pruning.
However, I can also do the following as well
select * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reporting_date = '2021-08-31'
Note that I am using the existing column name, NOT the external partition column name. Surprisingly, this predicate also uses partition pruning. I was wondering how this works, as the external partition is built on reportingdate and NOT reporting_date.
My assumption is that underneath the bonnet these two columns use the same storage? Somehow BigQuery, at the time of creating the external table, remembers the partition column name. Any ideas or clarification would be appreciated.
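For context, a hive-partitioned external table of this kind is typically declared roughly as follows (a hedged sketch; the dataset name, bucket path and options are assumptions based on the question, not the actual DDL used):
-- reportingdate comes from the gs://.../reportingdate=YYYY-MM-DD/ folder names,
-- while reporting_date is an ordinary column inside the Parquet files
CREATE EXTERNAL TABLE my_dataset.my_external_table
WITH PARTITION COLUMNS (
reportingdate DATE
)
OPTIONS (
format = 'PARQUET',
uris = ['gs://my-bucket/exports/*'],
hive_partition_uri_prefix = 'gs://my-bucket/exports'
);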

Create a table using `pg_table_def` data in Redshift or DBT

To create a table from all the data in pg_table_def that is visible to my user, I tried:
create table adhoc_schema.pg_table_dump as (
select *
from pg_table_def
);
But it throws an error:
Column "schemaname" has unsupported type "name"
Any way to create a table from pg_table_def or information_schema.columns?
Found this from another thread, which seems like it would help.
Amazon considers the internal functions that INFORMATION_SCHEMA.COLUMNS uses to be leader-node-only functions. Rather than being sensible and redefining the standardized INFORMATION_SCHEMA.COLUMNS, Amazon sought to define their own proprietary version. For that they made available another system table, PG_TABLE_DEF, which seems to address the same need. Pay attention to the note in its documentation about adding the schema to search_path.
Stores information about table columns.
PG_TABLE_DEF only returns information about tables that are visible to the user. If PG_TABLE_DEF does not return the expected results, verify that the search_path parameter is set correctly to include the relevant schemas.
You can use SVV_TABLE_INFO to view more comprehensive information about a table, including data distribution skew, key distribution skew, table size, and statistics.
So using your example code (rewritten to use NOT EXISTS for clarity),
SET SEARCH_PATH to '$user', 'public', 'target_schema';
SELECT "column"
FROM dev.fields f
WHERE NOT EXISTS (
SELECT 1
FROM PG_TABLE_DEF pgtd
WHERE pgtd."column" = f.field
AND schemaname = 'target_schema'
);
See also,
Official docs on Querying Redshift System Tables: https://docs.aws.amazon.com/redshift/latest/dg/t_querying_redshift_system_tables.html
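Since the quoted passage also mentions SVV_TABLE_INFO, a small hedged example of that view (the schema name is a placeholder):
-- Per-table size, row count and skew for one schema
SELECT "table", diststyle, size, tbl_rows, skew_rows
FROM svv_table_info
WHERE "schema" = 'target_schema'
ORDER BY size DESC;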
pg_table_def is a leader-node system table with some data types (such as name) that are not supported in user tables. You will need to cast those columns to text first.
However, this still won't work, because pg_table_def is a leader-node table and the table you are creating is a user table stored on the compute nodes. You will need to pull your source information from tables that are stored on the compute nodes. Since I don't know what information you are looking for from pg_table_def, I cannot say exactly which ones you need, but you can start with stv_tbl_perm and join in pg_class and other tables as more info is needed.
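A minimal sketch of that idea, assuming all you need is schema, table name and row counts (which columns you actually need from pg_table_def may change the joins required):
-- stv_tbl_perm is stored on the compute nodes, so its data can be written into a user table
CREATE TABLE adhoc_schema.table_inventory AS
SELECT n.nspname::text AS schema_name,
c.relname::text AS table_name,
SUM(t.rows) AS row_count
FROM stv_tbl_perm t
JOIN pg_class c ON c.oid = t.id
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY 1, 2;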

How to insert Billing Data from one Table into another Table in BigQuery

I have two tables, both containing billing data from GCP, in two different regions. I want to insert one table into the other. Both tables are partitioned by day, and the larger one is being written to by GCP for billing exports, which is why I want to insert the data into the larger table.
I am attempting the following:
Export the smaller table to Google Cloud Storage (GCS) so it can be imported into the other region.
Import the table from GCS into Big Query.
Use Big Query SQL to run INSERT INTO dataset.big_billing_table SELECT * FROM dataset.small_billing_table
However, I am getting a lot of issues as it won't just let me insert (as there are repeated fields in the schema etc). An example of the dataset can be found here https://bigquery.cloud.google.com/table/data-analytics-pocs:public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1
Thanks :)
## Update ##
So the issue was exporting and importing the data in Avro format and using schema auto-detect when importing the table back in (timestamps were getting confused with integer types).
Solution
Export the small table in JSON format to GCS, use GCS to do the regional transfer of the files, and then import the JSON file into a BigQuery table and DON'T use schema auto-detect (i.e. specify the schema manually). Then you can use INSERT INTO with no problems.
I was able to reproduce your case with the example data set you provided. I used dummy tables, generated from the below queries, in order to corroborate the cases:
Table 1: billing_bigquery
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='BigQuery' limit 1000
Table 2: billing_pubsub
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='Cloud Pub/Sub' limit 1000
I will propose two methods for performing this task. However, I must point out that the target and the source table must have the same column names, at least for the columns you are going to insert.
First, I used the INSERT INTO method. However, I would like to stress that, according to the documentation, if your table is partitioned you must include the column names that will be used to insert new rows. Therefore, using the dummy data already shown, it will be as follows:
INSERT INTO `billing_bigquery` ( billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits ) # invoice and cost_type intentionally left out
SELECT billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits
FROM `billing_pubsub`
Notice that for nested fields I just write down the field name, for instance service and not service.description, because the whole nested field gets inserted. Furthermore, I did not select all the columns in the target table, but all the columns I selected in the target table are required to be in the source table's selection as well.
The second method: you can simply use the Query settings button to append the small_billing_table to the big_billing_table. In the BigQuery console, click More >> Query settings. When the settings window appears, go to Destination table, check Set a destination table for query results, and fill in the fields Project name, Dataset name and Table name (these are the destination table's details). Subsequently, under
Destination table write preference check Append to table, which according to the documentation:
Append to table — Appends the query results to an existing table
Then you run the following query:
Select * from <project.dataset.source_table>
Then, after running it, the source table's data should be appended to the target table.

Is there any way to keep only one week of data in a Redshift table

I have a source where a file is populated every day, and every day it is loaded into a Redshift table.
However, I want to keep only one week of data in the table; after one week, the older data should be deleted.
Please suggest a way to do that.
A common method is:
Load each day's data into a separate table
Use CREATE VIEW to create a combined view of the past week's tables
For example:
CREATE VIEW data
AS
SELECT * FROM monday_table
UNION ALL
SELECT * FROM tuesday_table
UNION ALL
SELECT * FROM wednesday_table
...etc
Your users can simply use the View as a normal table.
Then, each day when new data has arrived, DROP or TRUNCATE the oldest table and load the new data
Either load the new data into a table with the same name as the one you dropped/truncated, or re-create the View to include this new table and exclude the dropped one.
There is no automatic process to do the above steps, but you could make them part of the script that runs your load process.
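A hedged sketch of what that daily step might look like inside the load script (the table name, S3 path and IAM role are placeholders):
-- 1. Clear the oldest day's table
TRUNCATE TABLE monday_table;
-- 2. Reload it with the newest day's file
COPY monday_table
FROM 's3://my-bucket/daily/latest/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS CSV;
-- 3. If you rotate table names instead of reusing them, re-create the view here as well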

First time Updating a table

I was recently given permission to update a single table in our database, but this is not something I have done before and I do not want to mess anything up. I have tried searching online for something similar to what I am wanting to do, without success.
The table name is dbo.Player_Miles and it only has two columns of data, Player_ID and Miles, both of which are set as (int, null).
Currently there are about 300K records in this table, and I have a csv file I need to use to update it. The file has 500K records, so I need to be able to:
INSERT the new records ~250k records
UPDATE the records that have new information, ~200K records
Leave untouched any record that has the same information (although updating those to the same values would not hurt the database, it would be a resource hog I would guess), ~50K records
Also leave untouched any records currently in the table that are not in the update file, ~50K records
I am using SSMS 2008 but the Server is 2000.
You could approach this in stages...
1) Backup the database
2) Create a temporary SQL table to hold your update records
create table Player_Miles_Updates (
PlayerId int not null,
Miles int null)
3) Load the records from your text file into your temporary table
bulk insert Player_Miles_Updates
from 'c:\temp\myTextRecords.csv'
with
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
4) Begin a transaction
begin transaction
5) Insert your new data
insert into Player_Miles
select PlayerId, Miles
from Player_Miles_Updates
where PlayerId not in (select Player_ID from Player_Miles)
6) Update your existing data
update pm
set pm.Miles = pmu.Miles
from Player_Miles pm join Player_Miles_Updates pmu on pm.Player_ID = pmu.PlayerId
7) Select a few rows to make sure what you wanted to happen, happened
select *
from Player_Miles
where Player_Id in (1,45,86,14,83) -- use id's that you have seen in the csv file
8a) If all went well
commit transaction
8b) If all didn't go well
rollback transaction
9) Delete the temporary table
drop table Player_Miles_Updates
You should use SSIS (or DTS, which was replaced by SSIS in SQL Server 2005).
Use the CSV as your source and "upsert" the data to your destination table.
In SSIS there are different ways to get this task done.
An easy way would be to use a lookup task on Player_ID.
If there's a match update the value and if there's no match just insert the new value.
See this link for more information on the lookup-pattern-upsert
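If you end up doing this in plain T-SQL rather than SSIS, the match/no-match branching that the lookup pattern describes is essentially the update-then-insert pair from steps 5 and 6 in the previous answer; a condensed sketch against the same staging table:
-- Match found: update the existing row
UPDATE pm
SET pm.Miles = pmu.Miles
FROM Player_Miles pm
JOIN Player_Miles_Updates pmu ON pm.Player_ID = pmu.PlayerId;
-- No match: insert the new row
INSERT INTO Player_Miles (Player_ID, Miles)
SELECT pmu.PlayerId, pmu.Miles
FROM Player_Miles_Updates pmu
WHERE NOT EXISTS (SELECT 1 FROM Player_Miles pm WHERE pm.Player_ID = pmu.PlayerId);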