I have saved monthly data in a given S3 bucket and can run Athena queries against it without any problem. But if I use a symlink file, Athena reads only the January and July data. This is very strange.
My symlink file looks something like this...
s3://some_bucket/sub_bucket/no_details_201801.csv.gz
s3://some_bucket/sub_bucket/no_details_201802.csv.gz
s3://some_bucket/sub_bucket/no_details_201803.csv.gz
s3://some_bucket/sub_bucket/no_details_201804.csv.gz
s3://some_bucket/sub_bucket/no_details_201805.csv.gz
s3://some_bucket/sub_bucket/no_details_201806.csv.gz
s3://some_bucket/sub_bucket/no_details_201807.csv.gz
s3://some_bucket/sub_bucket/no_details_201808.csv.gz
s3://some_bucket/sub_bucket/no_details_201808.csv.gz
s3://some_bucket/sub_bucket/no_details_201810.csv.gz
s3://some_bucket/sub_bucket/no_details_201811.csv.gz
s3://some_bucket/sub_bucket/no_details_201812.csv.gz
Out of these 12 entries, the data files for 2 of the months are missing in S3, and Athena does not complain about that, which is fine. But instead of reading the remaining 10 files, it seems to be reading only 2 of them (apparently selected at random), which is not acceptable.
Has anyone experienced this with Athena symlink files?
I assume you are using the SymlinkTextInputFormat. If any of the files listed in the symlink file are missing, both Athena and Presto on EMR should fail the query. I was not able to reproduce the issue with the following table definition:
CREATE EXTERNAL TABLE `symlink_test`(
`col1` string,
`col2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='`',
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://my-bucket/datasets/symlink'
If any of the files referenced in the symlink does not exist, Athena and Presto give an error message similar to:
HIVE_UNKNOWN_ERROR: Input path does not exist:[...]
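As a further diagnostic, you can ask Athena which underlying files it actually read by selecting the "$path" pseudo-column (shown here against the example table above; substitute your own table name):

SELECT DISTINCT "$path" FROM symlink_test;

Comparing that list with the entries in your symlink file should show whether the problem lies in the manifest itself or in how the query reads it.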
I have a CSV file with data that looks like "John Doe",Washington,100,22,.... The values that have quotes around them are the ones that contain whitespace. The values that don't have quotes don't have whitespace. The data has been processed by an AWS Glue Crawler, and when queried by AWS Athena, it returns all values, including the quotes. I don't want the quotes returned in my queries. I've tried looking at https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv to try and fix this problem. However, that method only works if all values in the CSV contain quotes around them. Is there any way to fix this problem?
CREATE EXTERNAL TABLE `weatherdata_output`(
`name` string,
`state` string,
`lat` double,
`lng` double,
...)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucket>/output'
TBLPROPERTIES (
'transient_lastDdlTime'='1630559293')
To test your situation, I uploaded this data file:
"John Doe",Washington,100,22
"Peter Smith",Sydney,200,88
"Mary Moss",Tokyo,300,44
I then ran the Glue crawler, and it gave a DDL similar to yours. When I queried the data, it also had the problem of including quotation marks in the first column.
I then ran this command instead:
CREATE EXTERNAL TABLE `city2`(
`col0` string,
`col1` string,
`col2` bigint,
`col3` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION
's3://bucket/city/'
Then, when I queried the table, the values came out correctly, without the quotation marks.
Therefore, I would recommend using ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' as the table format.
I gave up dealing with the peculiarities of AWS Glue Crawler's CSV deserializer(s), and with other schema-less file formats in general.
Instead, I prefer to convert CSV (or other such) datasets into Parquet first - it's very easy to do, for example via this short Python script:
import pandas as pd  # to_parquet() also needs pyarrow or fastparquet installed

df = pd.read_csv('dataset.csv', sep=',', dtype=str)  # read every column as a string
df.info()                             # print the resulting dtypes for verification and further tuning
print(df.head(30))                    # print a sample of dataframe records, for verification
print(f"records count = {len(df)}")   # total number of rows
df.to_parquet('dataset.parquet')      # write the dataset out as Parquet
This way I can have full control of my schema and even enforce a custom schema if not happy with Pandas's autodetected one. As a bonus, Parquet is faster for querying than CSV. Avro and ORC are good alternatives too, each one is better optimized for a specific use case (for example, columnar vs row data access). Most importantly, all of these formats support embedding the schema inside the file itself, unlike CSV and similar dumb plain file formats.
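For example, a minimal sketch of enforcing a custom schema before writing the Parquet file (the column names 'id', 'price' and 'created_at' below are just placeholders for illustration):

import pandas as pd

df = pd.read_csv('dataset.csv', sep=',', dtype=str)

# Cast selected columns to the types you actually want in the Parquet schema
# instead of relying on inference (placeholder column names).
df = df.astype({'id': 'int64', 'price': 'float64'})
df['created_at'] = pd.to_datetime(df['created_at'])

df.to_parquet('dataset.parquet')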
CSV files (and other schema-less file formats) should really burn in hell and never be used for data exchange purposes :-)
Make your life easier and simpler and just switch to a smarter format which supports embedded schemas, such as Parquet, ORC, or Avro. Glue Crawlers work perfectly with all of those, and you'll save yourself lots of time and quite a few headaches.
I'm attempting to do some analysis on one of our S3 buckets using Athena and I'm getting some errors that I can't explain or find solutions for anywhere I look.
The guide I'm following is https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html.
I created my S3 inventory yesterday and have now received the first report in S3. The format is Apache ORC, the last export shows as yesterday, and the additional fields stored are Size, Last modified, Storage class, and Encryption.
I can see the data stored under s3://{my-inventory-bucket}/{my-bucket}/{my-inventory} so I know there is data there.
The default encryption on the inventory bucket and inventory configuration both have SSE-S3 encryption enabled.
To create the table, I am using the following query:
CREATE EXTERNAL TABLE my_table (
`bucket` string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';
Once the table has been created, I load the data using:
MSCK REPAIR TABLE my_table;
The results from loading the data show that data has been loaded:
Partitions not in metastore: my_table=2021-07-17-00-00
Repair: Added partition to metastore my_table=2021-07-17-00-00
Once that's loaded, I verify the data is available using:
SELECT DISTINCT dt FROM my_table ORDER BY 1 DESC limit 10;
Which outputs:
1 2021-07-17-00-00
Now if I run something like the below, everything runs fine and I get the expected results:
SELECT key FROM my_table ORDER BY 1 DESC limit 10;
But as soon as I include the size column, I receive an error:
SELECT key, size FROM my_table ORDER BY 1 DESC limit 10;
Your query has the following error(s):
HIVE_CURSOR_ERROR: Failed to read ORC file: s3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/data/{UUID}.orc
This query ran against the "my_table" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: {UUID}.
I feel like I've got something wrong with my size column. Can anyone help figure this out?
So frustrating. I think I found the answer here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html
IsLatest – Set to True if the object is the current version of the object. (This field is not included if the list is only for the current version of objects.)
Removing that column fixed the problem.
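For reference, a sketch of the table definition with that column dropped. The same page indicates that version_id and is_delete_marker are likewise only included when the inventory lists all object versions, so depending on your inventory configuration those may need to go as well (commented out below):

CREATE EXTERNAL TABLE my_table (
  `bucket` string,
  key string,
  -- version_id string,        -- only present when the inventory lists all versions
  -- is_delete_marker boolean, -- likewise only for all-versions inventories
  size bigint
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://{my-inventory-bucket}/{my-bucket}/{my-inventory}/hive/';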
I have a Hive partitioned table populated by Hive and stored on S3 as Parquet. The data size for a specific partition is 3GB. Then I make a copy with Athena with:
CREATE TABLE tmp_partition
AS SELECT *
FROM original_table
where hour=11
The resulting data size is less than half (1.4GB). What could be the reason?
EDIT: related hive table definition statement:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://...'
TBLPROPERTIES (
'parquet.compress'='SNAPPY',
'transient_lastDdlTime'='1558011438'
)
Different compression settings are one possible explanation. If your original files were not compressed, or were compressed with Snappy, that could explain it. If you don't specify what compression to use, Athena will default to gzip, which compresses better than Snappy.
If you want a more thorough answer than that, you will have to give us some more details: how did you create the original files, are they compressed, with what compression, what does the data look like, etc.
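For example, a CTAS along these lines (a sketch based on the query in the question) writes the copy as Snappy-compressed Parquet, which makes the size comparison like for like:

CREATE TABLE tmp_partition
WITH (format = 'PARQUET', parquet_compression = 'SNAPPY')
AS SELECT *
FROM original_table
WHERE hour = 11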
I have a file in S3 with the following data:
name,age,gender
jill,30,f
jack,32,m
And a redshift external table to query that data using spectrum:
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 's3://...';
When querying the data I get the following result:
select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m
Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?
Answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.
This works in Redshift:
You want to use table properties ('skip.header.line.count'='1'), along with other properties if you want, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
Currently, AWS Redshift Spectrum does not support skipping header rows. If you can, raise a support case so that you can track the availability of this feature; the request can then be forwarded to the development team for consideration.
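Until then, the only option is the one mentioned in the question: filter the header row out in the SELECT itself, for example (a sketch against the table from the question):

select *
from spectrum.customers
where name <> 'name';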
I'm trying to convert a bunch of multi-part avro files stored on HDFS (100s of GBs) to parquet files (preserving all data).
Hive can read the avro files as an external table using:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<location>'
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
But when I try to create a parquet table:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
it throws an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>
Is it possible to convert uniontype to something that is a valid datatype for the external parquet table?
I'm open to alternative, simpler methods as well. MR? Pig?
Looking for a way that's fast, simple and has minimal dependencies to bother about.
Thanks
Try splitting this:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
into 2 steps:
CREATE EXTERNAL TABLE as_parquet (col1 col1_type, ... , coln coln_type) STORED AS parquet LOCATION 'hdfs:///xyz.parquet';
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;
Or, if you have partitions, which I guess you have for this amount of data:
INSERT INTO TABLE as_parquet PARTITION (year=2016, month=07, day=13) SELECT <all_columns_except_partition_cols> FROM as_avro WHERE year='2016' and month='07' and day='13';
Note:
For step 1, in order to avoid any typos or small mistakes in column types and such, you can (see the sketch after these steps):
Run SHOW CREATE TABLE as_avro and copy the create statement of the as_avro table
Replace the table name, file format and location of the table
Run the new create statement.
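For example, a sketch of what the edited statement might end up looking like, based on the as_avro definition in the question (the column list is just a placeholder):

-- Output of SHOW CREATE TABLE as_avro, edited: new table name,
-- Parquet storage clause, and a new location.
CREATE EXTERNAL TABLE as_parquet (
  col1 string,
  col2 bigint
)
STORED AS PARQUET
LOCATION 'hdfs:///xyz.parquet';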
This works for me...