How to load multibyte characters into a SQL DW external table pointing to BLOB - azure-sqldw

I am trying to load multibyte character data longer than 7,000 characters from my external tables into SQL DW internal tables. The data is stored in a compressed format in Blob Storage, and the external tables point to that Blob Storage location.
An external table varchar column only supports up to 4,000 characters. Is there any other approach for this?

If you use PolyBase to load the data directly into the SQL DW production tables (dbo.) from Azure Blob Storage via the linked external (ext.) tables, you should be able to get around the external table limitation.
This tutorial walks you through the process: Tutorial: Load New York Taxicab data to Azure SQL Data Warehouse

How have you defined your database column?
The limit for a varchar is 8,000 characters, but for an nvarchar it is 4,000 characters. Because you're using multi-byte characters, I guess you're using nvarchar.
Consider using nvarchar(max) as your target type for this column. (EDIT) As pointed out in the comments, an EXTERNAL table does not support (max).
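To make the load pattern concrete, here is a minimal T-SQL sketch of a CTAS load from an external table into an internal table, widening the column to nvarchar(max) on the way in. The names (ext.Documents, dbo.Documents, DocText) are hypothetical, and the external column itself remains bound by the nvarchar(4000) limit noted above.

-- Hypothetical names; the external table column stays capped at nvarchar(4000).
CREATE TABLE dbo.Documents
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT
    DocId,
    CAST(DocText AS nvarchar(max)) AS DocText
FROM ext.Documents;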

Related

Redshift - What is the CREATE EXTERNAL TABLE command for?

When I run CREATE EXTERNAL SCHEMA, I automatically see the tables inside that schema.
Recently, I've been watching some tutorial videos where CREATE EXTERNAL TABLE is run after CREATE EXTERNAL SCHEMA. What's the point of CREATE EXTERNAL TABLE if CREATE EXTERNAL SCHEMA already links to those tables?
External schemas can be of various types. You can create an external schema from the Glue Data Catalog, from another Redshift local database, or from a remote Postgres or MySQL database, etc.
Some of these are read-only both in terms of data and metadata: you cannot insert a new record into any of the tables, and you cannot create another table in the schema. The second and third types mentioned above fall into this category.
On the other hand, a schema created from the Glue Catalog is read-only only in terms of data, but you can add new tables to it. To create tables on top of files in such a schema, you need the CREATE EXTERNAL TABLE statement.
You can either choose to create these tables through Redshift, or you can create them through Athena, Glue Crawlers, etc.
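As a rough illustration, here is a hedged Redshift Spectrum sketch of the two statements working together; the schema name, IAM role ARN, table, and S3 path are all made-up placeholders.

-- Register a Glue-backed external schema (hypothetical names throughout).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define a new table in that schema on top of files in S3.
CREATE EXTERNAL TABLE spectrum_schema.clicks (
    user_id  varchar(64),
    event_ts timestamp
)
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';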

GCS to BigQuery table using stored Proc

I want to create a stored proc which can read data from a GCS bucket and store it into a table in BigQuery. I was able to do it using Python by connecting to GCS and creating a BigQuery client.
from google.oauth2 import service_account
from google.cloud import bigquery

credentials = service_account.Credentials.from_service_account_file(path_to_key)
bq_client = bigquery.Client(credentials=credentials, project=project_id)
Can we achieve the same using stored procedure?
Have you looked into using an external table? You can query directly off of Google Cloud Storage without needing to load anything. Just define the schema of the expected data along with the GCS URIs, and once the data is in GCS it is accessible via SQL in BigQuery.
Otherwise, no. There is no LOAD statement that you can execute via BigQuery SQL. See the docs here for all of the ways to load data into a table.
You could create the external table and then a stored procedure that does an INSERT into another table using the data from that external table, if you are really set on having a stored procedure to "load" data into a normal BigQuery table. Otherwise, external tables are an excellent option that removes the need to even load the data in the first place.
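For reference, a hedged sketch of that external-table-plus-procedure pattern in BigQuery SQL; the dataset, table, and column names are placeholders, and mydataset.events is assumed to already exist.

-- External table over files in GCS (hypothetical names and schema).
CREATE OR REPLACE EXTERNAL TABLE mydataset.ext_events (
  user_id  STRING,
  event_ts TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/events/*.csv']
);

-- Stored procedure that copies the external data into a native table.
CREATE OR REPLACE PROCEDURE mydataset.load_events()
BEGIN
  INSERT INTO mydataset.events (user_id, event_ts)
  SELECT user_id, event_ts FROM mydataset.ext_events;
END;

CALL mydataset.load_events();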

Update BigQuery permanent external tables

I'm using BigQuery both to store data within "native" BigQuery tables and to query data stored in Google Cloud Storage. According to the documentation, it is possible to query external sources using two types of tables: permanent and temporary external tables.
Consider the following scenario: every day some Parquet files are written to GCS, and with a certain frequency I want to do a JOIN between the data stored in a BigQuery table and the data stored in the Parquet files. If I create a permanent external table and then update the underlying files, is the content of the table automatically updated as well, or do I have to recreate it from the new files?
What are the best practices for such a scenario?
You don't have to re-create the external table when you add new files to the Cloud Storage bucket. The only exception is that if the number of columns is different in a new file, the external table will not work as expected.
You need to use a wildcard to read files that match a specific pattern rather than providing a static file name, for example "gs://bucketName/*.csv".
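A hedged sketch of what that looks like for the Parquet scenario in the question (all names and paths are hypothetical); files that later land under the wildcard are picked up at query time.

-- Permanent external table over a wildcard Parquet URI (hypothetical names).
CREATE OR REPLACE EXTERNAL TABLE mydataset.daily_events
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
);

-- JOIN it with a native BigQuery table, as in the scenario above.
SELECT n.id, n.name, e.value
FROM mydataset.native_table AS n
JOIN mydataset.daily_events AS e
  ON n.id = e.id;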

Is it possible to query log data stored in Cloud Storage without cleaning it, using BigQuery?

I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in the BigQuery dataset I am getting:
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks, and it contains a large number of unique column names. Some of these names aren't valid column names for BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load (put) my data into BigQuery storage, just query the data that is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for a clearer understanding.
I think you can go down one of these routes, based on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq --location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then simply load the data in its unstructured raw format into BigQuery. Once your data is in there, you can go about doing all sorts of stuff.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset in the table providing appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, now you can process it using SQL queries, and split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING@CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.
In the case you pick B or C, your table now has rows from your dataset as singular column rows. You can now choose to split these singular column rows into multiple column rows, based on your delimiter requirements.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now it's all in the form of a single column named data, which you can separate by the delimiter ,:
select
  splitted[safe_offset(0)] as a,
  splitted[safe_offset(1)] as b,
  splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on @khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw into BigQuery - as if it were a CSV - and then parse it in BigQuery.
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
You can then use JSON_EXTRACT_SCALAR to parse the JSON in BigQuery and transform the existing field names into BigQuery-compatible ones.
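A hedged example of that parsing step, reusing the single-column rawlogstable from the earlier answer; the JSON paths and output column names are guesses for illustration, not the actual Stackdriver schema.

-- Parse raw JSON rows stored in a single STRING column named data (hypothetical paths).
SELECT
  JSON_EXTRACT_SCALAR(data, '$.timestamp')   AS log_timestamp,
  JSON_EXTRACT_SCALAR(data, '$.textPayload') AS text_payload,
  JSON_EXTRACT_SCALAR(data, "$['k8s-app']")  AS k8s_app  -- bracket form for names with dashes
FROM `mydataset.rawlogstable`;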
Unfortunately, no!
As part of log analytics, it is common to reshape the log data and run a few ETL steps before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need for log analytics, and there is no rationale for writing additional ETL code, all metrics can be derived from the REST API endpoints of Stackdriver Monitoring.
If you do not need the fields containing "-", you can set ignore_unknown_values. You have to provide the schema you want, and with ignore_unknown_values any field not matching the schema will be ignored.
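For illustration, a hedged external-table sketch of that approach: declare only the fields you want and set ignore_unknown_values so the rest are dropped. The dataset, table, fields, and GCS path are placeholders.

-- Keep only the declared fields; anything else in the JSON is ignored (hypothetical names).
CREATE OR REPLACE EXTERNAL TABLE mydataset.logs_subset (
  insertId    STRING,
  textPayload STRING
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://mybucket/path/*'],
  ignore_unknown_values = true
);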

Converting data in AWS S3 to another schema structure (also in S3)

quite a beginner's question -
I have log data stored in S3 files, in zipped JSON format.
The files reside in a directory hierarchy which reflects partitioning, in the following way: s3://bucket_name/year=2018/month=201805/day=201805/some_more_partitions/file.json.gz
I recently changed the schema of the logging to a slightly different directory structure. I added some more partition levels; the fields currently reside inside the JSON and I want to move them to the folder hierarchy. Also, I changed the inner JSON schema slightly. The new logs reside in a different S3 bucket.
I wish to convert the old logs to the new format, because I have Athena mapping over the new schema structure.
Is AWS EMR the tool for this? If so, what's the simplest way to achieve this? I thought I would need an EMR cluster of type step execution, but doesn't that create just one output file?
Thanks
Yes, Amazon EMR is an appropriate tool to use.
You could use Hive, which has similar-ish syntax to Athena:
Create an External Table pointing to your existing data, using your old schema
Create an External Table pointing to where you wish to store the data, using your new schema
INSERT INTO new-table SELECT * FROM old-table
If your intention is to query the data with Amazon Athena, you can use Amazon EMR to convert the data into Parquet format, which will give even better query performance.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Yes, EMR can be used for such a conversion.
Here's sample code that converts data arriving in CSV format (in the stg, i.e. source, folder) to the ORC file format. You may want to use insert overwrite in case you have overlapping partitions between your staging (source) files and target files.
DROP TABLE IF EXISTS db_stg.stg_table;

CREATE EXTERNAL TABLE db_stg.stg_table(
  GEO_KEY    string,
  WK_BEG_DT  string,
  FIS_WK_NUM double,
  AMOUNT1    double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://bucket.name/stg_folder_name/'
TBLPROPERTIES ('has_encrypted_data'='false');

DROP TABLE IF EXISTS db_tgt.target_table;

-- The partition column is declared only in PARTITIONED BY (with its type),
-- not repeated in the regular column list.
CREATE EXTERNAL TABLE db_tgt.target_table(
  GEO_KEY   string,
  WK_BEG_DT string,
  AMOUNT1   double
)
PARTITIONED BY (FIS_WK_NUM double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://bucket.name/tgt_folder_name/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Allow dynamic partitioning; the partition column goes last in the SELECT list.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE db_tgt.target_table PARTITION (FIS_WK_NUM)
SELECT
  GEO_KEY,
  WK_BEG_DT,
  AMOUNT1,
  FIS_WK_NUM
FROM db_stg.stg_table;
Agree with John that converting to a columnar file format like Parquet or ORC (along with compression like SNAPPY) will give you the best performance with AWS Athena.
Remember, the key to using Athena is to optimize the amount of data you scan and read. Hence, if the data is in columnar format and you are reading only certain partitions, your AWS Athena cost will go down significantly. All you need to do is make sure you are using filter conditions in your Athena queries that select the required partitions.
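As a hedged illustration of that last point, a query against the target table from the sample above that filters on the partition column, so Athena scans only the matching partition. It assumes the table is registered in the Athena/Glue catalog, and the literal week value is made up.

-- Partition pruning: only the FIS_WK_NUM partitions matching the filter are scanned.
SELECT GEO_KEY, AMOUNT1
FROM db_tgt.target_table
WHERE FIS_WK_NUM = 202105;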