Azure Data Warehouse External Table - azure-sqldw

How can I read only a specific file from an external table that points to a folder in ADLS containing thousands of files?

You can't do that with external tables / PolyBase once the external table has already been created, but what you can do is create your own external table that specifies the filename in its definition. E.g. if your table definition looks like this (where no filename is specified):
CREATE EXTERNAL TABLE ext.LINEITEM (
    L_ORDERKEY BIGINT NOT NULL,
    ...
)
WITH (
    LOCATION = 'input/lineitem/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat
);
You could copy it and create your own table, e.g.
CREATE EXTERNAL TABLE ext.LINEITEM_42 (
    L_ORDERKEY BIGINT NOT NULL,
    ...
)
WITH (
    LOCATION = 'input/lineitem/lineitem42.txt',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat
);
See the difference? The LOCATION now points at a single file rather than a folder. Another alternative would be to use one of the languages / platforms that can easily access the lake, e.g. U-SQL or Databricks, to write a query against it. A little U-SQL:
@input =
    EXTRACT
        l_orderkey int
        ...
    FROM "/input/lineitem/lineitem42.txt"
    USING Extractors.Csv(skipFirstNRows : 1);
A little Scala:
val lineitem42 = "/mnt/lineitem/lineitem42.txt"
val df42 = spark.read
  .option("sep", "|") // Use pipe separator
  .csv(lineitem42)
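Back on the external table route: once the single-file table exists, it behaves like any other table. As a minimal sketch, a query against the ext.LINEITEM_42 definition above:
SELECT TOP 10 *
FROM ext.LINEITEM_42;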

Related

How to define filegroup for SQL server in AWS?

I've been tasked with moving an on-prem SQL Server database to AWS RDS SQL Server and from there migrating the data to PostgreSQL.
While preparing for this, I found the term "FILEGROUP" in the DB creation script; each filegroup is given a path on the local server, like below.
CREATE DATABASE [PROD_MIGTN]
CONTAINMENT = NONE
ON PRIMARY
( NAME = N'SDL_DEV', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\PROD_MIGTN.mdf' , SIZE = 503269504KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FG_EDU_GROUP]
( NAME = N'INV_EDU_GROUP', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\INV_EDU_GROUP.ndf' , SIZE = 13393920KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FG_PAYMNT_HISTORY]
( NAME = N'EXT_PAYMNT_HISTORY', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\INV_PAYMNT_HISTORY.ndf' , SIZE = 16516736KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'PROD_MIGTN_DEV_log', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\PROD_MIGTN_1.LDF' , SIZE = 133711872KB , MAXSIZE = 2048GB , FILEGROWTH = 51200KB )
I also noticed that the filegroups were tied to a partition scheme, like below
CREATE PARTITION SCHEME [PS_SALE] AS PARTITION [PF_WHOLE_SALE] TO ([FG_EDU_GROUP])
GO
and the partition scheme in turn is tied to a table
CREATE TABLE [dbo].[T_SALE](
    [SALE_ID] [varchar](50) NOT NULL,
    [REF_NO] [varchar](50) NOT NULL,
    [COMPY_ID] [varchar](50) NOT NULL,
    [STATUS_CODE] [nvarchar](128) NOT NULL,
    CONSTRAINT [PK_T_SALE_SALE_ID] PRIMARY KEY CLUSTERED
    (
        [SALE_ID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PS_SALE]([SALE_ID])
) ON [PS_SALE]([SALE_ID])
GO
I want to know:
What is the significance of the filegroups, given that I'm only going to migrate the data to PostgreSQL and then archive the database?
Will removing the filegroups and their associated partition scheme mess up the data in any way?
If avoiding them both will create issues, how do I define a filegroup in the cloud database, i.e. how do I give a file path? Do I need to allocate an S3 bucket or some other kind of storage separately? (See the sketch after this question.)
Thanks
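One note on the partition scheme: it doesn't have to reference user filegroups at all. As a minimal sketch (assuming the partition function [PF_WHOLE_SALE] from the script above is kept), the scheme can map every partition to the PRIMARY filegroup, removing the dependency on custom file paths:
CREATE PARTITION SCHEME [PS_SALE]
AS PARTITION [PF_WHOLE_SALE]
ALL TO ([PRIMARY]);
This keeps the table partitioned while avoiding any per-filegroup file placement.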

How to load key-value pairs (MAP) into Athena from Parquet file?

I have an S3 bucket full of .gz.parquet files and I want to make them accessible in Athena. To do this I am creating a table in Athena that points at the S3 bucket:
CREATE EXTERNAL TABLE user_db.table (
    pan_id bigint,
    dev_id bigint,
    parameters ?????,
    start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression" = "GZIP")
;
How do I correctly specify the data type for the parameters column?
Using # parquet-tools schema, I see the following schema of the data files:
optional int64 pan_id;
optional int64 dev_id;
optional group parameters (MAP) {
    repeated group key_value {
        required binary key (UTF8);
        optional binary value (UTF8);
    }
}
optional int96 start_time_local;
Using # parquet-tools head, I see the following value for one row of data:
pan_id = 1668490
dev_id = 6843371
parameters:
.key_value:
..key = doc_id
..value = c2bd3593d7015fb912d4de229a302379babcf6a00a203fcf
.key_value:
..key = variables
..value = {"video_id":"2313675068886132","surface":"post"}
start_time_local = QFOHvvYvAAAzhCUA
I appreciate any help you can give. I have not been able to find good documentation for the MAP datatype being used in CREATE TABLE.
Maps are declared as map<string,string> (for string-to-string maps; other key and value types are also possible). In your case the whole table DDL would be:
CREATE EXTERNAL TABLE user_db.table (
    pan_id bigint,
    dev_id bigint,
    parameters map<string,string>,
    start_time_local bigint
)
STORED AS PARQUET
LOCATION 's3://bucket/path/to/folder/containing_files/'
tblproperties ("parquet.compression" = "GZIP")
The map type is second to last in the list of Athena data types.
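Once the table exists, individual keys can be read with the map subscript operator. A short sketch using the key names from the sample row above:
SELECT pan_id,
       parameters['doc_id'] AS doc_id,
       parameters['variables'] AS variables_json
FROM user_db."table" -- "table" must be double-quoted in DML since it is a reserved word
LIMIT 10;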
You can use an AWS Glue Crawler to automatically derive the schema from your Parquet files.
Defining AWS Glue Crawlers: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

Athena: Possible To Use Alias For Column?

My AWS Athena table contains a schema as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS .... (
name STRING,
address STRING,
phone STRING,
...
)
However, when querying this table I want to be able to refer to the name column as, for example, personName.
Ideally I'd like to be able to do this
CREATE EXTERNAL TABLE IF NOT EXISTS .... (
name STRING as personName,
address STRING as personAddress,
phone STRING as personPhone,
...
)
...but I don't see a way to achieve this in the documentation. (I am using Avro.)
How might I achieve this without having two tables?
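For reference, one common way to get aliased column names without a second copy of the data is a view over the original table. A minimal sketch (the view and table names here are hypothetical, since the real table name is elided above):
CREATE VIEW person_view AS
SELECT name    AS personName,
       address AS personAddress,
       phone   AS personPhone
FROM original_table; -- placeholder for the real table name
Queries can then target person_view with the new column names while the data remains stored only once.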

AWS Athena - Create external table skipping first row

I'm trying to create an external table over CSV files with AWS Athena using the code below, but the line TBLPROPERTIES ("skip.header.line.count"="1") doesn't work: it doesn't skip the first line (the header) of the CSV file.
CREATE EXTERNAL TABLE mytable
(
    colA string,
    colB int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '\"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES (
    "skip.header.line.count" = "1"
)
Any advice?
Just tried the "skip.header.line.count"="1" setting and it seems to be working fine now.
On the AWS Console you can specify it as a SerDe parameters key-value pair.
If you manage your infrastructure as code with Terraform, you can use the ser_de_info parameter "skip.header.line.count" = 1. Example below:
resource "aws_glue_catalog_table" "banana_datalake_table" {
name = "mapping"
database_name = "banana_datalake"
table_type = "EXTERNAL_TABLE"
owner = "owner"
storage_descriptor {
location = "s3://banana_bucket/"
input_format = "org.apache.hadoop.mapred.TextInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
compressed = "false"
number_of_buckets = -1
ser_de_info {
name = "SerDeCsv"
serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
parameters {
"field.delim" = ","
"skip.header.line.count" = 1 # Skip file headers
}
}
columns {
name = "column_1"
type = "string"
}
columns {
name = "column_2"
type = "string"
}
columns {
name = "column_3"
type = "string"
}
}
}
This is a feature that has not yet been implemented. See Abhishek#AWS' response here:
"We are working on it and will report back as soon as we have an
outcome. Sorry for this again. This ended up taking longer than what
we anticipated."
My workaround has been to preprocess the data before creating the table:
download the csv file from S3
strip the header using bash sed -e 1d -e 's/\"//g' file.csv > file-2.csv
upload the results to its own folder on S3
create the table
I recently tried:
TBLPROPERTIES ('skip.header.line.count'='1')
And it works fine now. This issue arose when I had the column header as a string (timestamp) and the records were actual timestamps. My queries would bomb as they would scan the table and find a string instead of a timestamp.
Something like this:
ts
2015-06-14 14:45:19.537
2015-06-14 14:50:20.546
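In that situation, a query like the following (a hypothetical sketch, assuming the table is named mytable and ts is declared as a timestamp column) fails while the header row is still present, because the literal string "ts" in the data cannot be parsed as a timestamp:
SELECT *
FROM mytable
WHERE ts > timestamp '2015-06-14 14:00:00';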
When this question was asked there was no support for skipping headers, and when it was later introduced it was only for the OpenCSVSerDe, not for LazySimpleSerDe, which is what you get when you specify ROW FORMAT DELIMITED FIELDS …. I think this is what has caused some confusion about whether or not it works in the answers to this question.
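For reference, this is the LazySimpleSerDe form being referred to; a minimal sketch (the table and column names are hypothetical) of the ROW FORMAT DELIMITED variant for which the property originally had no effect:
CREATE EXTERNAL TABLE mytable_delimited (
    colA string,
    colB int
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES ('skip.header.line.count' = '1');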

WSO2 - Table created using Analytic Script Invisible in Gadget Generation Tool

My use case: push data from a stream configured in the ESB to BAM and create a report using the “Gadget Generation Tool”.
Publishing the stream from ESB to BAM after adding an agent to the proxy service worked fine.
From the stream I created a table using the Analytics->Add screen, and the table seems to persist, as I am able to do a select and see results from the same screen.
Now I am trying to generate a dashboard using the Gadget Generation Tool, but the table is not available there; the JDBC connection works fine, yet the table is nowhere to be found.
Script for the analytic table, run from the Analytics->Add screen:
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(creditkey STRING, creditFlag STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1" ,
    "cassandra.port" = "9163" ,
    "cassandra.ks.name" = "EVENT_KS" ,
    "cassandra.ks.username" = "admin" ,
    "cassandra.ks.password" = "admin" ,
    "cassandra.cf.name" = "firstStream" ,
    "cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
I tried looking for the table in the following databases:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Have not done any custom db configurations.
Did you try jdbc:h2:repository/database/samples/WSO2CARBON_DB;AUTO_SERVER=TRUE? Also, what you have pasted is the Cassandra storage definition, probably used for getting the input, not for persisting the output. If you give the full Hive query, that would help to figure out the problem.
Why did I not see the table in the Gadget Generation tool?
The table I created using the Hive script is a Cassandra distributed database table, whereas the references I gave in the Gadget Generation tool while looking up the table pointed to H2 RDBMS database tables.
Below are the references to the H2 RDBMS database that comes out of the box with WSO2:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Resolution ----- How to get tables listed in the Gadget Generation tool?
To get the tables listed in the Gadget Generation tool you have to use Hive scripts to complete the following 3 steps:
Create a Hive table reference for the Cassandra data stream to which data is pushed (from the ESB, in my case).
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(
payload_creditkey STRING, payload_creditFlag STRING, payload_version STRING) STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" , "cassandra.ks.username" = "admin" , "cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" , "cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
Using a Hive script, create an H2 RDBMS table and a reference to it, into which the data from the Cassandra stream will be copied.
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLEh2summary(
creditFlg STRING,
verSion STRING
)
STORED BY
'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
'mapred.jdbc.driver.class' = 'org.h2.Driver' ,
'mapred.jdbc.url' = 'jdbc:h2:C:/wso2bam-2.2.0/repository/samples/database/BAM_STATS_DB' ,
'mapred.jdbc.username' = 'wso2carbon' ,
'mapred.jdbc.password' = 'wso2carbon' ,
'hive.jdbc.update.on.duplicate' = 'true' ,
'hive.jdbc.primary.key.fields' = 'creditFlg' ,
'hive.jdbc.table.create.query' = 'CREATE TABLE CREDITTABLE_newh2(creditFlg VARCHAR(100), version VARCHAR(100))' );
Write a Hive query that copies the data from Cassandra to the H2 RDBMS table.
insert overwrite table CREDITTABLEh2summary select a.payload_creditFlag,a.payload_version from CREDITTABLE a;
On doing this I was able to see the table in the Gadget Generation tool; however, I also had to change the reference to the H2 database in the JDBC URL value I passed to an absolute path.
Observation:
I was wondering if the Gadget Generation tool could point directly at the Cassandra stream without having to copy the tables to an RDBMS database.