Converting Avro to Parquet (using Hive maybe?) - MapReduce

I'm trying to convert a bunch of multi-part Avro files stored on HDFS (100s of GBs) to Parquet files, preserving all the data.
Hive can read the Avro files as an external table using:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<location>'
TBLPROPERTIES ('avro.schema.url'='<schema.avsc>');
But when I try to create a parquet table:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
it throws an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Unknown field type: uniontype<...>
Is it possible to convert uniontype to something that is a valid datatype for the external parquet table?
I'm open to alternative, simpler methods as well. MR? Pig?
Looking for a way that's fast, simple and has minimal dependencies to worry about.
Thanks

Try splitting this:
create external table as_parquet like as_avro stored as parquet location 'hdfs:///xyz.parquet'
into 2 steps:
CREATE EXTERNAL TABLE as_parquet (col1 col1_type, ... , coln coln_type) STORED AS parquet LOCATION 'hdfs:///xyz.parquet';
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;
Or, if you have partitions, which I guess you have for this amount of data:
INSERT INTO TABLE as_parquet PARTITION (year=2016, month=07, day=13) SELECT <all_columns_except_partition_cols> FROM as_avro WHERE year='2016' and month='07' and day='13';
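If there are many year/month/day partitions, a dynamic-partitioning variant can load them all in one statement. A rough sketch, assuming as_parquet was declared with PARTITIONED BY (year, month, day) and that the partition columns come last in the select list:
-- Allow Hive to derive the partition values from the query output
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE as_parquet PARTITION (year, month, day)
SELECT <all_columns_except_partition_cols>, year, month, day
FROM as_avro;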
Note:
For step 1, in order to avoid typos or small mistakes in column types and such, you can:
Run SHOW CREATE TABLE as_avro and copy the create statement of the as_avro table
Replace the table name, file format and location of the table
Run the new create statement (see the sketch below).
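A minimal sketch of those three steps, assuming the Avro table only has two columns (the column names and types below are placeholders, not the real schema):
-- Dump the existing definition so the column list can be copied verbatim
SHOW CREATE TABLE as_avro;
-- Reuse the column list, changing only the table name, format and location
CREATE EXTERNAL TABLE as_parquet (
  col1 string,
  col2 bigint
)
STORED AS parquet
LOCATION 'hdfs:///xyz.parquet';
-- Load the data (this is step 2 above); Hive rewrites it as Parquet on the way in
INSERT INTO TABLE as_parquet SELECT * FROM as_avro;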
This works for me...

Related

Redshift external catalog error when copying parquet from s3

I am trying to copy Google Analytics data into Redshift via the Parquet format. When I limit the columns to a few select fields, I am able to copy the data. But when I include a few specific columns I get an error:
ERROR: External Catalog Error.
Detail:
-----------------------------------------------
error: External Catalog Error.
code: 16000
context: Unsupported column type found for column: 6. Remove the column from the projection to continue.
query: 18669834
location: s3_request_builder.cpp:2070
process: padbmaster [pid=23607]
-----------------------------------------------
I know the issue is most probably with the data, but I am not sure how I can debug it, as this error is not helpful in any way. I have tried changing the data types of the columns to super, but without any success. I am not using Redshift Spectrum here.
I found the solution. The error message says Unsupported column type found for column: 6. Redshift column ordinality starts from 0, and I was counting columns from 1 instead of 0 (my mistake). So the issue was with column 6 (which I had been reading as column 7), which was a string or varchar column in my case. I created a table with just this column and tried uploading data into just this column. Then I got:
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': '\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15001\n context: The length of the data column display_name is longer than the length defined in the table. Table: 256, Data: 1020
Recreating those columns as varchar(max) solved the issue.
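A rough sketch of that fix, with a made-up table name, bucket path and IAM role; display_name is the column from the error above, and the 256 vs 1020 figures come from that same message:
-- Recreate the table with VARCHAR(MAX) for the long string column
CREATE TABLE ga_sessions (
    visit_id     BIGINT,
    display_name VARCHAR(MAX)  -- was VARCHAR(256); the Parquet data holds values up to 1020 bytes
);
COPY ga_sessions
FROM 's3://my-bucket/ga/parquet/'                      -- placeholder path
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift'  -- placeholder role
FORMAT AS PARQUET;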
I assume you have semistructured data in your parquet (like an array).
In this case, you can have a look at the very bottom of this page: https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html
It says:
If your semistructured or nested data is already available in either Apache Parquet or Apache ORC format, you can use the COPY command to ingest data into Amazon Redshift.
The Amazon Redshift table structure should match the number of columns and the column data types of the Parquet or ORC files. By specifying SERIALIZETOJSON in the COPY command, you can load any column type in the file that aligns with a SUPER column in the table as SUPER. This includes structure and array types.
COPY foo FROM 's3://bucket/somewhere'
...
FORMAT PARQUET SERIALIZETOJSON;
For me, the last line
...
FORMAT PARQUET SERIALIZETOJSON;
did the trick.
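For reference, a complete COPY along those lines might look like the sketch below; the bucket path and IAM role are placeholders, not values from the question:
COPY foo
FROM 's3://my-bucket/somewhere/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift'
FORMAT PARQUET SERIALIZETOJSON;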

Why does Parquet file size get smaller when copied with Amazon Athena

I have a partitioned Hive table, populated by Hive and stored on S3 as Parquet. The data size for a specific partition is 3 GB. Then I make a copy with Athena:
CREATE TABLE tmp_partition
AS SELECT *
FROM original_table
where hour=11
The resulting data size is less than half (1.4GB). What could be the reason?
EDIT: the relevant Hive table definition:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://...'
TBLPROPERTIES (
'parquet.compress'='SNAPPY',
'transient_lastDdlTime'='1558011438'
)
Different compression settings are one possible explanation. If your original files were not compressed, or were compressed with Snappy, that could explain it. If you don't specify which compression to use, Athena defaults to gzip, which compresses better than Snappy.
If you want a more thorough answer than that, you will have to give us some more details: how did you create the original files, are they compressed, with what codec, what does the data look like, etc.
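One way to make the comparison apples-to-apples is to pin the codec in the CTAS itself. A sketch (the external_location path is a placeholder, and parquet_compression may be spelled write_compression on newer Athena engine versions):
CREATE TABLE tmp_partition
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',   -- match the source table's parquet.compress=SNAPPY
  external_location = 's3://my-bucket/tmp_partition/'
)
AS SELECT *
FROM original_table
WHERE hour = 11;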

Redshift Spectrum : Getting no values/ empty while select using Parquet

I have tried using a text file and it works perfectly. I am using Redshift Spectrum. To increase performance, I am trying to use PARQUET. The table gets created but I get no values returned when firing a SELECT query. Below are my queries:
CREATE EXTERNAL TABLE gf_spectrum.order_headers
(
header_id numeric(38,18) ,
org_id numeric(38,18) ,
order_type_id numeric(38,18) )
partitioned by (partition1 VARCHAR(240))
stored as PARQUET
location 's3://aws-bucket/Spectrum/order_headers';
Select * from gf_spectrum.order_headers limit 1000;
Also, is partitioning compulsory for PARQUET? I tried that as well, and the table got created, but while retrieving data I got an S3 fetch error about an invalid version number, which did not happen with the text file. Is it something related to the PARQUET format?
Thanks for your help.

Skipping header rows in AWS Redshift External Tables

I have a file in S3 with the following data:
name,age,gender
jill,30,f
jack,32,m
And a Redshift external table to query that data using Spectrum:
create external table spectrum.customers (
"name" varchar(50),
"age" int,
"gender" varchar(1))
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location 's3://...';
When querying the data I get the following result:
select * from spectrum.customers;
name,age,g
jill,30,f
jack,32,m
Is there an elegant way to skip the header row as part of the external table definition, similar to the tblproperties ("skip.header.line.count"="1") option in Hive? Or is my only option (at least for now) to filter out the header rows as part of the select statement?
Answered this in: How to skip headers when we are reading data from a csv file in s3 and creating a table in aws athena.
This works in Redshift:
You want to use table properties ('skip.header.line.count'='1'), along with other properties if you want, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
Currently, AWS Redshift Spectrum does not support skipping header rows. If you can, raise a support case so you can track the availability of this feature; the request can then be forwarded to the development team for consideration.
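Until that is available, a crude workaround (assuming the header text never occurs as real data) is to filter the header row out in the query itself, as the question suggests:
-- Drop the header row by excluding rows whose "name" column holds the header text
SELECT *
FROM spectrum.customers
WHERE "name" <> 'name';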

Parquet: read particular columns into memory

I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?
If you can use Hive, creating a Hive table and issuing a simple select query would be by far the easiest option.
create external table tbl1(<columns>) location '<file_path>' stored as parquet;
select col1,col2 from tbl1;
-- this works in Hive 0.14
You can use the JDBC driver to do that from a Java program as well.
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields but the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure that the modified schema is compatible with the original schema.
Options:
Use a Hive table: create the table with all columns, with Parquet as the storage format, and read the required columns by specifying the column names
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala)
You can also use Apache Drill, which natively parses Parquet files (see the example below).
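For example, with Drill's dfs storage plugin you can query the file in place and project only the columns you need (the path and column names below are placeholders):
SELECT col1, col2
FROM dfs.`/data/exported/table.parquet`;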