Querying Parquet file in HDFS using Impala

I'm trying to read a parquet file with Impala.
impala-shell> SELECT * FROM `/path/in/hdfs/*.parquet`
I know I can do that using Spark or Drill, but I wonder if it's possible with Impala?
Thanks

You would need to create a structured table on top of the parquet files to query via Impala.
Here is a general example of an external table pointing to a Parquet directory; the Cloudera docs cover all the methods here:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet.html#parquet_ddl
-- Infer the column definitions from one existing Parquet data file,
-- then point the table at the directory holding the files:
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET
LOCATION '/user/etl/destination';
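Once the table exists, Impala queries the Parquet files in place. A minimal follow-up sketch, assuming the ingest_existing_files table above and the default database:
-- If the table or new files were created outside the current impala-shell session,
-- refresh Impala's view of the catalog first:
INVALIDATE METADATA ingest_existing_files;
-- Then query the Parquet data directly:
SELECT * FROM ingest_existing_files LIMIT 10;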

Related

create table and load data in redshift from Parquet file in S3

I have my Parquet file in S3. I want to load this to the redshift table. I don't know the schema of the Parquet file.
Is there any command to create a table and then copy parquet data to it?
Also, I want to add the default time column date timestamp DEFAULT to_char(CURRDATE, 'YYYY-MM-DD').
You first need to create an external schema; normal columnar schemas do not support Parquet files.
Then you need to create an external table that maps the columns in the S3 file, so look for a good match between the file and the data types. I usually avoid small ints, for example; for floats, create the column as the REAL data type.
Then run the COPY command below:
COPY schema_name.table_name
FROM 's3://bucket/object_prefix'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name'
FORMAT AS PARQUET;
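For the external schema and external table steps mentioned above, a rough sketch assuming Redshift Spectrum; the schema, database, role ARN, columns and S3 path are all placeholders:
-- External schema backed by a Glue/Athena data catalog database
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::account-id:role/spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table whose columns mirror the Parquet file
CREATE EXTERNAL TABLE spectrum_schema.my_table (
    id         BIGINT,
    event_time TIMESTAMP,
    amount     REAL   -- REAL for floats, as suggested above
)
STORED AS PARQUET
LOCATION 's3://bucket/object_prefix/';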

How to query hdfs file with presto

I am trying to query an HDFS file with Presto, like Apache Drill. I have searched but found nothing due to the lack of Presto resources. I can query HDFS data with the Hive connector, no problem with that. But I want to query a file in HDFS that is not managed by Hive. Is it possible?
Yes, it is possible. You can use the Hive connector with
hive.metastore=file
hive.metastore.catalog.dir=/home/youruser/metastore
This will create an "embedded metastore" with the /home/youruser/metastore directory as its storage. Then you can declare your table as if you were using a Hive metastore and read from it.
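As a rough sketch of the "declare your table" step, assuming the catalog is named hive, the Parquet files already sit under a hypothetical HDFS path, and the columns are placeholders:
-- Register an external table over the existing HDFS directory
CREATE TABLE hive.default.raw_events (
    id   BIGINT,
    name VARCHAR
)
WITH (
    format = 'PARQUET',
    external_location = 'hdfs://namenode:8020/data/raw_events'
);

-- Then query it like any other table
SELECT count(*) FROM hive.default.raw_events;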

Load Parquet files into Redshift

I have a bunch of Parquet files on S3, and I want to load them into Redshift in the most optimal way.
Each file is split into multiple chunks... what is the most optimal way to load data from S3 into Redshift?
Also, how do you create the target table definition in Redshift? Is there a way to infer the schema from Parquet and create the table programmatically? I believe there is a way to do this using Redshift Spectrum, but I want to know if this can be done in scripting.
Appreciate your help!
I am considering all AWS tools such as Glue, Lambda, etc. to do this in the most optimal way (in terms of performance, security and cost).
The Amazon Redshift COPY command can natively load Parquet files by using the parameter:
FORMAT AS PARQUET
See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats
The table must be pre-created; it cannot be created automatically.
Also note from COPY from Columnar Data Formats - Amazon Redshift:
COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.
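Putting those two points together, a minimal sketch; the table, bucket and IAM role are hypothetical, and the column order must match the order in the Parquet files:
-- Pre-create the target table with columns in the same order as the Parquet file
CREATE TABLE public.my_table (
    id         BIGINT,
    event_time TIMESTAMP,
    amount     DOUBLE PRECISION
);

COPY public.my_table
FROM 's3://my-bucket/parquet-prefix/'
IAM_ROLE 'arn:aws:iam::account-id:role/my-redshift-role'
FORMAT AS PARQUET;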
Use parquet-tools from GitHub to dissect the file:
parquet-tools schema <filename>   # dumps the schema with data types
parquet-tools head <filename>     # dumps the first 5 records
Use the jsonpaths file to specify mappings

Remember last filename created by Hive on S3

Hi, I would like to know if there is a way to get the name of the last Parquet file that Hive created on S3 after I insert new data into a table?
Look at the Hive warehouse directory for changes after you write the data to Hive.

Create a Hive (0.10) table for schema data using Parquet Fileformat

I want to export data from a server to Hive.
I have 3-level nested data in the form of Java classes.
I was able to create an Avro schema using Avro Tools' ReflectData and write out the data to Avro files using ReflectDatumWriter. In Hive I was able to create a table and specify the schema using
TBLPROPERTIES
('avro.schema.url'='hdfs:///schema.avsc');
I can see there are ways to export the same data in Parquet format:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
Let's say I get that done and have the same data in Parquet files.
How do I query this exported Parquet data in Hive?
But how do I specify the schema for Hive?
I don't want to write a huge CREATE TABLE statement in Hive with the whole nested schema. How do I specify null values for some members in the schema?
Is there a way I can directly create a Parquet schema, like the Avro schema, and give it to Hive in a CREATE TABLE statement?
To query the data in Hive you can create a Hive external table and specify the location of the files, like this:
CREATE EXTERNAL TABLE XXX (...)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/your/location/here';
There is no way to avoid this statement, because the files are generated independently of the Hive metastore, and all you can do with Avro is generate the data files.
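For the nested part of the question, a hedged sketch of what the column list could look like; the field names are hypothetical, and Hive expresses the nesting with STRUCT and ARRAY types:
CREATE EXTERNAL TABLE events (
    id   BIGINT,
    usr  STRUCT<name:STRING, address:STRUCT<city:STRING, zip:STRING>>,
    tags ARRAY<STRING>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '/your/location/here';

-- Nested fields are then addressed with dot notation:
SELECT id, usr.address.city FROM events LIMIT 10;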