I am trying to query hdfs file with Presto like Apache Drill. I have searched but found anything due to lack of Presto resources. I can query hdfs data with hive connector, no problem with that. But I want to query a file in hdfs that not controlled by hive. Is it possible?
Yes, it is possible. You can use Hive connector with
hive.metastore=file
hive.metastore.catalog.dir=/home/youruser/metastore
This will create "embedded metastore" with /home/youruser/metastore directory as the storage. Then you can declare your table as if you used Hive metastore and read from it.
Related
We are currently using Apache sqoop once daily to export an oracle DB table containing a CLOB column into HDFS. As part of this we first map the CLOB column to java string(using --map-column-java) and have the imported data to be saved in the format of parquet. We have this scheduled as an oozie workflow.
There is a plan to move from apache hive to bigquery. I am not able to find a way to get this table into bigquery and would like help on the best approach to get this done.
If we go withreal time streaming from oracle DB into bigquery using google datastream, can you tell me if the clob column will get streamed correctly, as it has some malformed xml data (close to xml structure but might have some discrepancies in obeying the structure).
Another option i read was to have the table extracted as a csv file,and have it transferred to GCS and have the bigquery table refer it there.But since mydata in CLOB column is very large and is wild with multiple commas and special chsracters in between, i think there will be issues with parsing or exporting. Any options to do it in parquet or ORC formats?
The preferred approach is to have a scheduled batch upload performed daily from oracle to bigquery. Appreciate any inputs on how to achieve the same.
We can convert CLOB data from Oracle DB to desired format like ORC, Parquet, TSV, Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query moving from apache hive to bigquery-
The fastest way to import to BQ is using GCP resources. Dataflow is a scalable solution to read and write. Dataproc is also another option that is more flexible and you can use more open source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as a temporary storage and uses BigQuery Storage API to move data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
I'm trying to read a parquet file with Impala.
impala-shell> SELECT * FROM `/path/in/hdfs/*.parquet`
I know I can do that using Spark or Drill, but I wonder if it's possible with Impala ?
Thanks
You would need to create a structured table on top of the parquet files to query via Impala.
General example of external table pointing to parquet directory ... Cloudera docs provide all methods here:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet.html#parquet_ddl
CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat'
STORED AS PARQUET
LOCATION '/user/etl/destination';
What is the best practice for working with Vertica and Parquet
my application architecture is:
Kafka Topic (Avro Data).
Vertica DB.
Vertica's scheduler consumed the data from Kafka and ingest it into a managed table in Vertica.
let's say I have Vertica's Storage only for one month of data.
As far as I understood I can create an external table on HDFS using parquet and Vertica API enables me to query these tables as well.
What is the best practice for this scenario? can I add some Vertica scheduler for coping the date from managed tables to external tables (as parquet).
how do I configure the rolling data in Vertica (dropped 30 days ago every day )
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
Using the unload command in any database, we can get the data into a flat file and we can copy that to HDFS.
What is the advantage of using sqoop over unload command in this regard?
Please explain. Thanks in advance.
Sqoop will translate data with the type of you DB to a Java type wich can be then used on a process. Sqoop also allow you to offload various relationnal db to Hive or Hbase. Or it will provide a parquet output ( one of the most used format for datasets under hadoop )
Sqoop will also split and distribute the export file folowing a provided key.
I was looking for ways to move data onto a HDFS system, wanted to know if Apache Sqoop can be used to pull/extract data from an external REST service?
My favorite way to pull data from a REST service:
curl http:// | hdfs -put - /my/hdfs/directory
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
So it doesn't support import data from REST service.