I created the table below with ';' as the delimiter, but the 'value' column only shows numbers that have no decimal comma; numbers containing a comma (e.g. '24,96') do not appear at all.
CREATE EXTERNAL TABLE `table` (
  `value` double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Inside CSV: | Output:
value       | value
24          | 24.0
24,96       |

Expected:

Inside CSV: | Output:
value       | value
24          | 24.0
24,96       | 24.96
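A double column is parsed with a period as the decimal separator, so a value like 24,96 cannot be parsed and comes back as NULL. A minimal sketch of a workaround, assuming you redefine the column as a string and convert at query time (the table name table_raw, column name value_raw, and S3 location are hypothetical):

CREATE EXTERNAL TABLE `table_raw` (
  `value_raw` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/';  -- hypothetical location

-- Normalize the decimal comma, then cast to double.
SELECT CAST(replace(value_raw, ',', '.') AS double) AS value
FROM table_raw;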
I am trying to combine several CSV files that I created and uploaded to S3 into a single CSV file. I am using AWS Athena queries to do this, but I'm getting errors like
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code - the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10
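Judging by the error, the rows are being split on the commas inside the HTML, so a fragment like 'apples</li>...' lands in column 0 (item_no, a double). If the description values are wrapped in quotes in the source files, a hedged fix is to configure OpenCSVSerde through the properties it actually reads (separatorChar, quoteChar, escapeChar) rather than field.delim:

CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
  `item_no` double,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

If the source files do not quote fields that contain commas, no SerDe setting can recover the column boundaries; the CSVs would need to be re-exported with quoting enabled.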
The following query returns this error:
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
CREATE EXTERNAL TABLE IF NOT EXISTS adult_data_clean(
age bigint,
workclass string,
education string,
relationship string,
occupation string,
country string,
income_cat string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
STORED AS OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://census-income-example/clean.data'
TBLPROPERTIES (
'classification'='csv'
'skip.header.line.count'='1')
There are two errors:

1. TBLPROPERTIES is missing a comma between the parameters.
2. OUTPUTFORMAT should not have STORED AS before it, since it simply continues the STORED AS INPUTFORMAT clause:
CREATE EXTERNAL TABLE IF NOT EXISTS adult_data_clean(
age bigint,
workclass string,
education string,
relationship string,
occupation string,
country string,
income_cat string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://census-income-example/clean.data'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')
Yes, those error messages were confusing and misleading.
I am trying to create an external table in AWS Athena from a csv file that is stored in my S3.
The CSV file looks as follows. As you can see, the data is not enclosed in quotation marks (") and is delimited by commas (,).
ID,PERSON_ID,DATECOL,GMAT
612766604,54723367,2020-01-15,637
615921503,158634997,2020-01-25,607
610656030,90359154,2020-01-07,670
I tried the following code to create a table:
CREATE EXTERNAL TABLE my_table
(
ID string,
PERSON_ID int,
DATE_COL date,
GMAT int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES
(
'skip.header.line.count'='1'
)
;
I tried to preview the table with the following code:
select
*
from
my_table
limit 10
Which raises this error:
HIVE_BAD_DATA: Error parsing field value '2020-01-15' for field 2: For input string: "2020-01-15"
My question is: Am I passing the correct serde? And if so, how can I format the date column (DATE_COL) such that it reads and displays days in YYYY-MM-DD?
I replaced ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' and enclosed the column names in backticks (`). The following code creates the table correctly:
CREATE EXTERNAL TABLE my_table
(
`ID` string,
`PERSON_ID` int,
`DATE_COL` date,
`GMAT` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
I do not understand the concept of a serde, but I suppose I did not need one to begin with.
Per the documentation, with the OpenCSVSerDe a column of type DATE must contain values representing the number of days since January 1, 1970. For example, the date on row 1 after your header would have to be stored as 18276; when the table is queried, that value is then rendered as 2020-01-15.
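If you did want to keep the OpenCSVSerDe (for example, because some fields are quoted), a hedged alternative is to declare the date column as string and cast at query time:

CREATE EXTERNAL TABLE my_table
(
  `ID` string,
  `PERSON_ID` int,
  `DATE_COL` string,  -- kept as text; cast when querying
  `GMAT` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1');

SELECT id, person_id, CAST(date_col AS date) AS date_col, gmat
FROM my_table
LIMIT 10;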
I tried the query below to create a bucketed table, but it failed. However, if I remove the CLUSTERED BY clause, the query succeeds. Any suggestions? Thank you.
The error message: no viable alternative at input create external
CREATE EXTERNAL TABLE nation5(
n_nationkey bigint,
n_name string,
n_rgionkey int,
n_comment string)
CLUSTERED BY
n_regionkey INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
The CLUSTERED BY column needs to be in parentheses. The following works:
CREATE EXTERNAL TABLE test.nation5(
n_nationkey bigint,
n_name string,
n_regionkey int,
n_comment string)
CLUSTERED BY
(n_regionkey) INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
(You also have a spelling mistake in the column definition: n_rgionkey should be n_regionkey.)
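A side note on usage (hedged): the value of bucketing on n_regionkey is that queries with an equality filter on that column can typically skip buckets that cannot contain matching rows, for example:

SELECT n_name
FROM test.nation5
WHERE n_regionkey = 3;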
Task: I am trying to load a number of JSON files from S3 buckets into Redshift using Redshift Spectrum.
Problem: In some files the JSON objects have the data wrapped in square brackets, but other JSON files have the same objects without square brackets. Is there a way to consume the data both with and without square brackets ("[ ]") when creating an external table with Redshift Spectrum?
JSON files to be consumed:
File 1: {"x":{"y":{"z":["ABCD"]}}}
File 2: {"x":{"y":{"z":"EFGH"}}}
Case 1
When column z is defined as an array, I am missing the data from the JSON files without square brackets:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:array<varchar(256)>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select c from spectrum.table t , t.x.y.z c;
Case 2
When column z is defined as varchar (without declaring it as an array), this is the error:
Create Statement:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:varchar(256)>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select regexp_replace( t.x.y.z ,'\\([\\"])', '' ) from spectrum.table t;
or Select t.x.y.z from spectrum.table t;
[XX000][500310] [Amazon](500310) Invalid operation: Spectrum Scan Error
Details:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Unsupported implicit cast: Column ('x' 'y' 'z'), From Type: LIST, To Type: VARCHAR,
----------------------------------------------
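No single column type accepts both shapes: with z as an array, the scalar files yield nothing (Case 1), and with z as varchar, the array files fail with the LIST-to-VARCHAR cast error (Case 2). One hedged workaround, assuming the two file shapes can be separated under different S3 prefixes (the prefixes below are hypothetical), is to define one external table per shape and normalize them in a late-binding view:

-- Files whose z is an array, e.g. {"x":{"y":{"z":["ABCD"]}}}
CREATE EXTERNAL TABLE spectrum.table_arr
(x struct<y:struct<z:array<varchar(256)>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/with-brackets/';   -- hypothetical prefix

-- Files whose z is a scalar, e.g. {"x":{"y":{"z":"EFGH"}}}
CREATE EXTERNAL TABLE spectrum.table_scalar
(x struct<y:struct<z:varchar(256)>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/without-brackets/';  -- hypothetical prefix

-- Normalize both shapes into a single relation.
CREATE VIEW spectrum_z AS
SELECT c AS z FROM spectrum.table_arr t, t.x.y.z c   -- unnest the array form
UNION ALL
SELECT t.x.y.z AS z FROM spectrum.table_scalar t
WITH NO SCHEMA BINDING;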