Access Data from Azure Data Lake Store using Polybase with Azure Data Warehouse - azure-sqldw

I get an error when creating an external table, following this guide:
https://exoticbaryon.anset.org/2017/06/26/access-data-from-azure-data-lake-store-using-polybase-with-azure-data-warehouse/#comment-157
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'xxxxx';
CREATE DATABASE SCOPED CREDENTIAL ADLUser
WITH IDENTITY = 'xxxxx#/https://login.microsoftonline.com/xxxxx/oauth2/v2.0/token',
SECRET = 'xxxxx';
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (TYPE = HADOOP,
CREDENTIAL = ADLUser,
LOCATION = N'adl://xxxxx.azuredatalakestore.net'
)
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =',',
STRING_DELIMITER = '"',
USE_TYPE_DEFAULT = TRUE)
);
CREATE EXTERNAL TABLE [dbo].[xxxxx_external](
[EventMonth] [nvarchar](10) NULL,
[UserCount] [bigint] NULL,
[UserType] [nchar](8) NULL,
[StageType] [bigint] NULL,
[StageName] [nvarchar](9) NULL)
WITH
(
LOCATION=N'/test/xxxxx.csv',
DATA_SOURCE = AzureDataLakeStore ,
FILE_FORMAT = TextFileFormat
) ;
CREATE TABLE [dbo].[xxxxx]
WITH (DISTRIBUTION = HASH([EventMonth] ) )
AS SELECT * FROM
[dbo].[xxxxx_external] ;
When I run CREATE EXTERNAL TABLE, it fails with:
Failed to execute query. Error: External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: MalformedURLException: no protocol: /https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/v2.0/token'

You have to modify your external data source to match the following format:
CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
( LOCATION = '<prefix>://<path>[:<port>]'
[, CONNECTION_OPTIONS = '<name_value_pairs>']
[, CREDENTIAL = <credential_name> ]
[, PUSHDOWN = ON | OFF]
[, TYPE = HADOOP | BLOB_STORAGE ]
[, RESOURCE_MANAGER_LOCATION = '<resource_manager>[:<port>]'
)
[;]
You can find more info at the following link: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15
As you are accessing Azure Data Lake, you need to specify the prefix as 'wasbs'.
For the first load, try uploading a single file to your folder/container, do not reference any specific .csv file name in the LOCATION, and load it into the external table.
Later you can reference your specific file name and test your code.
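As an illustration of that suggestion, here is a minimal sketch of a data source using the 'wasbs' prefix; the data source, container, storage account, and credential names are placeholders and not from the original post:
-- Hedged sketch: external data source pointing at a blob storage container
-- via the 'wasbs' prefix. All names below are placeholders.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);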


BQ - No permission to INFORMATION_SCHEMA.JOBS

Issue description
I am trying to insert DML statistics from BigQuery system tables into BQ native tables for monitoring purposes, using an Airflow task.
For this, I am using the query below:
INSERT INTO
`my-project-id.my_dataset.my_table_metrics`
(table_name,
row_count,
inserted_row_count,
updated_row_count,
creation_time)
SELECT
b.table_name,
a.row_count AS row_count,
b.inserted_row_count,
b.updated_row_count,
b.creation_time
FROM
`my-project-id.my_dataset`.__TABLES__ a
JOIN (
SELECT
tables.table_id AS table_name,
dml_statistics.inserted_row_count AS inserted_row_count,
dml_statistics.updated_row_count AS updated_row_count,
creation_time AS creation_time
FROM
`my-project-id`.`region-europe-west3`.INFORMATION_SCHEMA.JOBS,
UNNEST(referenced_tables) AS tables
WHERE
DATE(creation_time) = current_date ) b
ON
a.table_id = b.table_name
WHERE
a.table_id = 'my_bq_table'
The query works in the BigQuery console but not via Airflow.
Error as per Airflow:
python_http_client.exceptions.UnauthorizedError: HTTP Error 401:
Unauthorized
[2022-10-21, 12:18:30 UTC] {standard_task_runner.py:93}
ERROR - Failed to execute job 313922 for task load_metrics (403 Access
Denied: Table
my-project-id:region-europe-west3.INFORMATION_SCHEMA.JOBS: User does
not have permission to query table
my-project-id:region-europe-west3.INFORMATION_SCHEMA.JOBS, or perhaps
it does not exist in location europe-west3.
How can I fix this access issue?
I appreciate your help.

Change location of the glue table

I've created an external Glue table via Terraform without setting the table's location.
The location is supposed to be updated after the app runs, but when the app runs it receives this exception:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Error: ',', ':', or ';' expected at position 291 from 'bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:string:string:smallint:smallint:smallint:decimal(12,2):decimal(12,2):decimal(12,2):bigint:string:bigint:string:timestamp:timestamp:bigint:bigint:bigint:bigint:bigint:string:string:decimal(12,2) :bigint:timestamp:string:bigint:decimal(12,2):string:bigint:bigint:timestamp:int' [0:bigint, 6::, 7:bigint, 13::, 14:bigint, 20::, 21:bigint, 27::, 28:bigint, 34::, 35:bigint, 41::, 42:bigint, 48::, 49:bigint, 55::, 56:bigint, 62::, 63:bigint, 69::, 70:bigint, 76::, 77:bigint, 83::, 84:bigint, 90::, 91:bigint, 97::, 98:string, 104::, 105:string, 111::, 112:smallint, 120::, 121:smallint, 129::, 130:smallint, 138::, 139:decimal, 146:(, 147:12, 149:,, 150:2, 151:), 152::, 153:decimal, 160:(, 161:12, 163:,, 164:2, 165:), 166::, 167:decimal, 174:(, 175:12, 177:,, 178:2, 179:), 180::, 181:bigint, 187::, 188:string, 194::, 195:bigint, 201::, 202:string, 208::, 209:timestamp, 218::, 219:timestamp, 228::, 229:bigint, 235::, 236:bigint, 242::, 243:bigint, 249::, 250:bigint, 256::, 257:bigint, 263::, 264:string, 270::, 271:string, 277::, 278:decimal, 285:(, 286:12, 288:,, 289:2, 290:), 291: , 292::, 293:bigint, 299::, 300:timestamp, 309::, 310:string, 316::, 317:bigint, 323::, 324:decimal, 331:(, 332:12, 334:,, 335:2, 336:), 337::, 338:string, 344::, 345:bigint, 351::, 352:bigint, 358::, 359:timestamp, 368::, 369:int]
This exception roughly lists the fields that were defined in Terraform.
From the AWS console I couldn't set the location after the table was created. When I connected to AWS EMR, which uses the Glue metastore, and tried to execute the same query, I received the same exception.
So I have several questions:
Does anybody know how to alter the empty location of an external Glue table?
The default location of the table should look like hive/warehouse/dbname.db/tablename, so what is the correct path in that case in EMR?
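For the first question, one possibility (a hedged sketch, not a confirmed fix) is to set the location through Hive DDL on the EMR cluster that uses the Glue Data Catalog as its metastore; the database, table, and bucket names below are placeholders:
-- Hedged sketch: point the existing external table at an S3 location.
-- All names here are placeholders, not from the original post.
ALTER TABLE my_database.my_table SET LOCATION 's3://my-bucket/path/to/data/';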

How to export SnowFlake S3 data file to my AWS S3?

The Snowflake S3 data is in .txt.bz2 format. I need to export the data files present in this Snowflake S3 stage to my own AWS S3, and the exported results must be in the same format as in the source location. This is what I tried:
COPY INTO @mystage/folder from
(select $1||'|'||$2||'|'|| $3||'|'|| $4||'|'|| $5||'|'||$6||'|'|| $7||'|'|| $8||'|'|| $9||'|'|| $10||'|'|| $11||'|'|| $12||'|'|| $13||'|'|| $14||'|'||$15||'|'|| $16||'|'|| $17||'|'||$18||'|'||$19||'|'|| $20||'|'|| $21||'|'|| $22||'|'|| $23||'|'|| $24||'|'|| $25||'|'|| $26||'|'|| $27||'|'|| $28||'|'|| $29||'|'|| $30||'|'|| $31||'|'|| $32||'|'|| $33||'|'|| $34||'|'|| $35||'|'|| $36||'|'|| $37||'|'|| $38||'|'|| $39||'|'|| $40||'|'|| $41||'|'|| $42||'|'|| $43
from @databasename)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****' )
file_format=(TYPE='CSV' COMPRESSION='BZ2');
PATTERN='*/*.txt.bz2'
Right now Snowflake does not support exporting data to files in bz2.
My suggestion is to set COMPRESSION='gzip'; then you can export the data to your S3 in gzip.
If exporting files in bz2 is a high priority for you, please contact Snowflake support.
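As a minimal sketch of that gzip suggestion (the bucket, table, and credential values are placeholders, not from the original answer):
-- Hedged sketch: unload a table to your own S3 bucket as gzip-compressed,
-- pipe-delimited CSV. Replace the bucket, table, and credential values.
COPY INTO 's3://my-bucket/folder/'
FROM my_database.my_schema.my_table
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****')
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' COMPRESSION = 'GZIP');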
If you instead want to move the existing bz2 files from a Snowflake stage to your own S3, you can do something like this:
COPY INTO @myS3stage/folder from
(select $1||'|'||$2||'|'|| $3||'|'|| $4||'|'|| $5||'|'||$6||'|'|| $7||'|'|| $8||'|'|| $9||'|'|| $10||'|'|| $11||'|'|| $12||'|'|| $13||'|'|| $14||'|'||$15||'|'|| $16||'|'|| $17||'|'||$18||'|'||$19||'|'|| $20||'|'|| $21||'|'|| $22||'|'|| $23||'|'|| $24||'|'|| $25||'|'|| $26||'|'|| $27||'|'|| $28||'|'|| $29||'|'|| $30||'|'|| $31||'|'|| $32||'|'|| $33||'|'|| $34||'|'|| $35||'|'|| $36||'|'|| $37||'|'|| $38||'|'|| $39||'|'|| $40||'|'|| $41||'|'|| $42||'|'|| $43
from @snowflakeStage(PATTERN => '*/*.txt.bz2'))
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****' )
file_format=(TYPE='CSV');

AWS SDK php S3 refuses to access bucket name xx.my_domain.com

I want to use AWS S3 to store image files for my website. I created a bucket named images.mydomain.com, which is referenced by a DNS CNAME record images.mydomain.com in AWS Route 53.
I want to check whether a folder or file exists; if not I will create one.
The following PHP code works fine for a regular bucket name using the stream wrapper, but it fails for this type of bucket name, such as xxxx.mydomain.com. This kind of bucket name fails in the doesObjectExist() method too.
// $new_dir = "s3://aaaa/akak3/kk1/yy3/ww4" ; // this line works !
$new_dir = "s3://images.mydomain.com/us000000/10000" ; // this line fails !
if( !file_exists( $new_dir) ){
if( !mkdir( $new_dir , 0777 , true ) ) {
echo "create new dir $new_dir failed ! <br>" ;
} else {
echo "SUCCEED in creating new dir $new_dir <br" ;
}
} else {
echo "dir $new_dir already exists. Skip creating dir ! <br>" ;
}
I got the following message:
Warning: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint: "images.mydomain.com.s3.amazonaws.com". in C:\AppServ\www\ecity\vendor\aws\aws-sdk-php\src\Aws\S3\StreamWrapper.php on line 737
What is the problem here?
Any advice on what to do in this case?
Thanks!

WSO2 BAM - Analytics Framework

In the Cassandra cluster's EVENT_KS keyspace, I have a bookTicket1 stream with the columns
payload_provider and payload_totalNoTickets. When I tried to run a new analytics script as below,
CREATE EXTERNAL TABLE IF NOT EXISTS BusTicketTable
(provider STRING, totalNoTickets STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9160" ,
"cassandra.ks.name" = "EVENT_KS" ,
"cassandra.ks.username" = "admin" ,
"cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "bookTicket1" ,
"cassandra.columns.mapping" = ":payload_provider,payload_totalNoTickets, Version" );
It returns the error:
ERROR: Error while executing Hive script.Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask "
Consider this line:
"cassandra.columns.mapping" = ":payload_provider,payload_totalNoTickets, Version"
Here the key is not mapped. I am not sure, but I think you have to map the key as well, because the row key is mandatory for a Cassandra column family.
e.g.:
"cassandra.columns.mapping" = ":key, payload_provider,payload_totalNoTickets, Version"
You may need to set a unique field as the key.
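A minimal sketch of what the full corrected script could look like, assuming an extra Hive column (here called messageRowID, a placeholder name) is added to receive the Cassandra row key:
-- Hedged sketch: same table definition as in the question, plus a column
-- mapped to the Cassandra row key via ':key'. The messageRowID name is an
-- assumption, not from the original post.
CREATE EXTERNAL TABLE IF NOT EXISTS BusTicketTable
(messageRowID STRING, provider STRING, totalNoTickets STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9160" ,
"cassandra.ks.name" = "EVENT_KS" ,
"cassandra.ks.username" = "admin" ,
"cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "bookTicket1" ,
"cassandra.columns.mapping" = ":key, payload_provider, payload_totalNoTickets, Version" );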