Redshift copy doesn't insert data into my table - amazon-web-services

I have a table SampleTable
and run the following Redshift COPY command through a SQL client (JackDB):
copy SampleTable
from 's3://bucket-name/backup/data.csv.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
csv;
The command does return
Executed successfully Updated 0 rows in 2.771 seconds.
but no data is inserted into the empty table SampleTable:
select count(*)
from SampleTable
returns 0.
There are 100 MB of data in data.csv.gz.

Solved it myself: the data did not match the COPY options. I needed to include a DELIMITER to override the default and IGNOREHEADER 1 to skip the CSV header.
I am just bothered by the fact that no stl_load_errors entry is recorded in this case.
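A corrected COPY along those lines might look like the sketch below; the '|' delimiter is only an assumption, since the actual field separator of data.csv.gz is not shown in the question:
copy SampleTable
from 's3://bucket-name/backup/data.csv.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
gzip
csv
delimiter '|' -- assumption: replace with the file's actual separator
ignoreheader 1;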

Related

How to export hive table data into csv.gz format stored in s3

So I have two Hive queries: one that creates the table, and another that reads parquet data from a different table and inserts the relevant columns into my new table. I would like this new Hive table to export its data to an S3 location, with the data in csv.gz format. My Hive queries running on EMR currently output 00000_0.gz files, and I have to rename them to csv.gz using a bash script. This is quite hacky, as I have to mount my S3 directory into my terminal, and it would be ideal if my queries could do this directly. Could someone please review my queries to see if there is any fault? Many thanks.
CREATE TABLE db.test (
app_id string,
app_account_id string,
sdk_ts BIGINT,
device_id string)
PARTITIONED BY (
load_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION "s3://test_unload/";
set hive.execution.engine=tez;
set hive.cli.print.header=true;
set hive.exec.compress.output=true;
set hive.merge.tezfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=1024000000;
set hive.merge.size.per.task=1024000000;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into db.test
partition(load_date)
select
'' as app_id,
'288' as app_account_id,
from_unixtime(CAST(event_epoch as BIGINT), 'yyyy-MM-dd HH:mm:ss') as sdk_ts,
device_id,
'20221106' as load_date
FROM processed_events.test
where load_date = '20221106';

Snowflake table is not accepting null values in date field

I have a table in Snowflake into which I am performing a bulk load.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in the date column.
The flow of data is:
sql_server --> S3 bucket --> snowflake_table
I am able to run the Sqoop job in EMR, but I am not able to load the data into the Snowflake table, as it is not accepting null values in the date column.
The error is :
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help with what I am missing?
Using the command below, you can see the values from the staged file:
select t.$1, t.$2 from @mystage1 (file_format => 'myformat') t;
Based on the data, you can change your copy command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from @mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this a few different ways.
The cleanest way is to use the TRY_TO_DATE function on your COPY INTO statement for that column. This function will return database null when trying to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
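For illustration, a minimal query showing the behavior described above (the date literal is just a made-up sample value):
select try_to_date('') as from_blank_string,          -- returns NULL instead of raising an error
       try_to_date('2022-11-06') as from_valid_string; -- returns the date 2022-11-06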

Does AWS Athena support Sequence Files

Has anyone tried creating an AWS Athena table on top of Sequence Files? As per the documentation, it looks like it is possible. I was able to execute the create table statement below.
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The sequence files are valid ones. The issue here is that no delimiter is defined, i.e. row format delimited fields terminated by is missing.
If, in your case, tab is the column delimiter and each row of data is on its own line, it will be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';

Amazon Athena: How to store results after querying while skipping column headers?

I ran a simple query using the Athena dashboard on data in CSV format. The result was a CSV with column headers.
When storing the results, Athena stores them with the column headers in S3. How can I skip storing the header column names, as I have to make a new table from the results and it is repetitive?
Try "skip.header.line.count"="1", This feature has been available on AWS Athena since 2018-01-19, here's a sample:
CREATE EXTERNAL TABLE IF NOT EXISTS tableName (
`field1` string,
`field2` string,
`field3` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://fileLocation/'
TBLPROPERTIES ('skip.header.line.count'='1')
You can refer to this question:
Aws Athena - Create external table skipping first row
From an Eric Hammond post on AWS Forums:
...
WHERE
date NOT LIKE '#%'
...
I found this works! The steps I took:
Ran an Athena query, with the output going to Amazon S3
Created a new table pointing to this output based on How do I use the results of my Amazon Athena query in another query?, changing the path to the correct S3 location
Ran a query on the new table with the above WHERE <datefield> NOT LIKE '#%'
However, subsequent queries store even more data in that S3 directory, so it confuses any subsequent executions.
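As a sketch of step 3, assuming the table created over the query output is named athena_results and its first field is named request_date (both names are hypothetical):
SELECT *
FROM athena_results
WHERE request_date NOT LIKE '#%'; -- filters out rows whose date field starts with '#'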

How to copy data when delimiters are missing in source file

Let's say I have a table in Redshift with 4 columns:
Create Table m.mytab(
col_1 BIGINT NOT NULL,
col_2 Varchar(200),
col_3 Varchar(200),
col_4 INT
);
And my source raw file contains data like:
col_1^col_2^col_3^col_4
myrowdata1^myrowdata2
myrowdata3^myrowdata4
.....
I want to load this data into mytab. I tried the Redshift COPY command as:
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS;
Since the last 2 delimiters are missing in each row, I am unable to load the data here. Can someone suggest how to resolve this issue?
Thanks
1) Try adding the FILLRECORD parameter to your COPY statement (see the sketch after this answer).
For more information, see the Data Conversion Parameters documentation.
2) If all rows are missing col_3 and col_4, you can just create a staging table with col_1 and col_2 only, copy the data into the staging table, and then issue:
ALTER TABLE target_tablename
APPEND FROM staging_tablename
FILLTARGET;
This will move the data to target_tablename very efficiently (just changing the pointer, without writing or deleting data) and take care of the missing col_3 and col_4.
More information about the command: ALTER TABLE APPEND
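A sketch of option 1, i.e. the COPY from the question with FILLRECORD added (bucket path and credentials are the placeholders from the question):
copy m.mytab
from 's3://mybucket/folder/fileA.gz'
credentials 'aws_access_key_id=somexxx;aws_secret_access_key=somexxx'
DELIMITER '^'
GZIP
IGNOREHEADER 1
ACCEPTINVCHARS
FILLRECORD; -- loads rows with missing trailing columns, filling them with NULLs or empty strings as appropriate for the column types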