What is the best practice for working with Vertica and Parquet
my application architecture is:
Kafka Topic (Avro Data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have storage in Vertica for only one month of data.
As far as I understand, I can create an external table on HDFS using Parquet, and Vertica lets me query these tables as well.
What is the best practice for this scenario? Can I add a Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure rolling retention in Vertica (every day, drop the data that is more than 30 days old)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
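A minimal sketch of what such a nightly script could generate, assuming a hypothetical table `events` with a timestamp column `event_ts` and an archive directory that every node can reach. The helper names (`build_export_sql`, `day_to_export`) are made up for illustration; only the `EXPORT TO PARQUET(directory = ...) AS SELECT ...` statement shape comes from Vertica.

```python
from datetime import date, timedelta


def day_to_export(today, retention_days=30):
    """Return the calendar day that has just aged out of the retention window."""
    return today - timedelta(days=retention_days)


def build_export_sql(table, date_column, day, base_dir):
    """Build an EXPORT TO PARQUET statement covering exactly one day of rows.

    table, date_column and base_dir are hypothetical names supplied by the caller.
    Each day is written to its own directory under base_dir.
    """
    next_day = day + timedelta(days=1)
    directory = f"{base_dir}/{table}/{day.isoformat()}"
    return (
        f"EXPORT TO PARQUET(directory = '{directory}') AS "
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= '{day.isoformat()}' "
        f"AND {date_column} < '{next_day.isoformat()}';"
    )


if __name__ == "__main__":
    # Print the statement so a wrapper can redirect it into a .sql file.
    print(build_export_sql("events", "event_ts", day_to_export(date.today()), "/data/archive"))
```

A cron job could redirect the printed statement into a file and run it with vsql -f. Removing the exported rows from the managed table afterwards is a separate step that this sketch does not cover.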
I am writing an AWS Glue ETL job and I have two options for constructing the Spark DataFrame:
Use the AWS Glue Data Catalog as the metastore for Spark SQL
df = spark.sql("select name from bronze_db.table_tbl")
df.write.save("s3://silver/...")
The other option is to read directly from the S3 location, like this:
df = spark.read.format("parquet").load(["s3://bronze/table_tbl/1.parquet", "s3://bronze/table_tbl/2.parquet"])
df.write.save("s3://silver/...")
Should I consider reading the files directly, either to save cost, to avoid any limit on the number of queries (select name from bronze_db.table_tbl), or to get better read performance?
I am not sure whether this query will be run on Athena to return the results.
If you only have one file and you know the schema there is no need for a table. A table is useful when there are multiple files, you don't know the schema (e.g. the table was set up and is populated by another process), or if you are querying the data from multiple engines (Athena, EMR, Redshift Spectrum, etc.)
Think of tables as an interoperability thing. Interoperability with other processes, other engines, etc.
We have an AWS EMR cluster running Hive, with the metastore backed by Aurora and the data stored in S3. Several programs create databases and tables in Hive and populate them with data.
After a while (say, one year), these databases are no longer needed, and we want to delete them automatically after a set period. The usual way is to set up a cron job that runs every month or so, finds the databases older than one year in an internal metadata table, and programmatically fires the Hive queries that delete them. But this has some drawbacks; for example, manually created tables are not covered.
Is there any hive built-in feature that does the above?
Hive is essentially a metadata layer that defines how data should be interpreted. For external tables in particular, it does not manage the underlying data. (This is a major difference between Hive and a conventional database, and it is why Hive can use multiple storage backends, such as HDFS and S3, in the same Hive instance.)
I'm going to guess you are using an S3 bucket for your data, so you likely want to look into expiring objects. This will do exactly what you want: delete data after a period of time. It will not disrupt Hive.
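As a rough sketch, an expiration rule for S3 could be built like this. The prefix, the rule ID and the bucket name are hypothetical; the dictionary follows the shape that boto3's put_bucket_lifecycle_configuration expects, and the actual call is left commented out because it needs boto3 and AWS credentials.

```python
def expiration_rule(prefix, days, rule_id="expire-old-hive-data"):
    """Build an S3 lifecycle configuration that expires objects under a prefix.

    prefix and rule_id are hypothetical values chosen by the caller.
    """
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Example: expire everything under one (hypothetical) database directory after a year.
config = expiration_rule("warehouse/old_db.db/", 365)

# To apply it (assumes boto3 is installed and credentials are configured):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

Scoping the rule to a prefix keeps it from touching other databases stored in the same bucket.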
If you are using partitions you may wish to do some additional cleanup.
MSCK REPAIR TABLE will help maintain the partitions in Hive, but it is really slow on S3 and can periodically time out. YMMV.
It's better to drop partitions:
ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;
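If there are many old partitions, the statements can be generated rather than typed by hand. This is only a sketch: it assumes month partitions named like '2022-02' (so string order matches chronological order), and the list of existing partitions is supplied by the caller, for example from SHOW PARTITIONS output. The helper names are hypothetical.

```python
def months_before(cutoff, partitions):
    """Return the partition values strictly older than cutoff, oldest first.

    Values are 'YYYY-MM' strings, so plain string comparison is chronological.
    """
    return sorted(p for p in partitions if p < cutoff)


def drop_statements(table, column, partitions, cutoff):
    """Build one ALTER TABLE ... DROP PARTITION statement per expired partition."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({column}='{p}') PURGE;"
        for p in months_before(cutoff, partitions)
    ]


if __name__ == "__main__":
    existing = ["2022-03", "2022-01", "2022-02"]  # hypothetical partition values
    for statement in drop_statements("bills", "mydate", existing, "2022-03"):
        print(statement)
```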
In Hive you can implement partition retention (since Hive 3.1.0).
For example to drop partitions and their data after 7 days:
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
Hive has no built-in tool that removes databases according to a retention period.
You have been doing this for a while so you are likely well aware of the risks of deleting metadata older than a year.
There are several ways to define retention on data, but none that I'm aware of for removing metadata.
Things you could look at:
You could add a trigger in Aurora that deletes tables directly from the Hive metadata. Hive tables have values for their create time and their last access time, so you could build some logic that works at that level.
We currently use Apache Sqoop once a day to import an Oracle DB table containing a CLOB column into HDFS. As part of this, we first map the CLOB column to a Java String (using --map-column-java) and save the imported data in Parquet format. This is scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from the Oracle DB into BigQuery using Google Datastream, will the CLOB column be streamed correctly? It contains some malformed XML data (close to XML structure, but with some discrepancies in obeying that structure).
Another option I read about is to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since the data in the CLOB column is very large and full of commas and special characters, I think there will be issues with parsing or exporting. Are there any options to do this in Parquet or ORC format?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I would appreciate any input on how to achieve this.
We can convert CLOB data from Oracle DB to desired format like ORC, Parquet, TSV, Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
As for your other question, moving from Apache Hive to BigQuery:
The fastest way to import to BQ is using GCP resources. Dataflow is a scalable solution to read and write. Dataproc is also another option that is more flexible and you can use more open source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as a temporary storage and uses BigQuery Storage API to move data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
I have read the AWS Glue documentation, but one thing is still not clear to me. Below is what I understood.
Regarding crawlers: a crawler creates a metadata table for either an S3 location or a DynamoDB table. What I don't understand is how a Scala/Python script is able to retrieve data from the actual source (say DynamoDB or S3) using the tables created from that metadata.
val input = glueContext
.getCatalogSource(database = "my_data_base", tableName = "my_table")
.getDynamicFrame()
Does the code above retrieve data from the actual source via the metadata tables?
I would be glad if someone could explain what happens behind the scenes when a Glue script retrieves data via metadata tables.
When you run a Glue crawler, it fetches metadata from S3 or JDBC (depending on your source) and creates tables in the AWS Glue Data Catalog.
Now, if you want to connect to this data from a Glue ETL job, you can do it in multiple ways depending on your requirement:
[from_options][1]: use this if you want to load directly from S3/JDBC without going through the Glue catalog.
[from_catalog][1]: use this if you want to load data through the Glue catalog. You link to the catalog with the getCatalogSource method, as shown in your code. As the name implies, it uses the Glue Data Catalog as the source and loads the table that you pass to the method.
Glue looks at the table definition, which points to a location, then connects to that location and loads the data present in the source.
Yes you need to use getCatalogSource if you want to load tables from Glue catalog.
Does Catalog look into Crawler and refer to actual source and load data?
Check out the diagram in this [link][2] . It will give you an idea about the flow.
What if the crawler is deleted before I run getCatalogSource? Will I still be able to load the data?
The crawler and the table are two different components, so deleting the crawler does not affect the table. What matters is when the table is deleted: if you delete it after your job starts executing, there will not be any problem; if you delete it before execution starts, you will encounter an error.
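To make the flow concrete, here is a toy model in plain Python. It is not the Glue API: ToyCatalog, TableNotFound and the S3 location are invented for illustration. It only mirrors the idea that the catalog stores a table definition pointing at a location, that a job resolves the table to that location before reading, and that the lookup fails once the table entry is gone.

```python
class TableNotFound(Exception):
    """Raised when a table definition is missing from the toy catalog."""


class ToyCatalog:
    """Stand-in for a metadata catalog: it holds definitions, never the data."""

    def __init__(self):
        self._tables = {}

    def register(self, database, table, location, fmt):
        # Roughly what a crawler does: record where the data lives and its format.
        self._tables[(database, table)] = {"location": location, "format": fmt}

    def delete(self, database, table):
        self._tables.pop((database, table), None)

    def resolve(self, database, table):
        # Roughly what a catalog-based read does first: look up the definition.
        try:
            return self._tables[(database, table)]
        except KeyError:
            raise TableNotFound(f"{database}.{table}") from None


catalog = ToyCatalog()
catalog.register("my_data_base", "my_table", "s3://example-bucket/my_table/", "parquet")

# A job would now read from the resolved location, not from the catalog itself.
definition = catalog.resolve("my_data_base", "my_table")

# Deleting the table entry before the job starts makes the lookup fail.
catalog.delete("my_data_base", "my_table")
try:
    catalog.resolve("my_data_base", "my_table")
    lookup_failed = False
except TableNotFound:
    lookup_failed = True
```

Note that nothing in this model refers to the crawler after register has run, which matches the point that the crawler and the table are separate components.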
What if my source has many millions of records? Will this load all of them, and how?
It is good to have large files in the source, since that avoids most of the small-files problem. Glue is based on Spark, which reads the data in chunks that fit in memory and then does the computations. Check this [answer][3] and [this][4] for best practices when reading large files in AWS Glue.
[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets
I have been looking at options to load (basically empty and restore) a Parquet file from S3 into DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline.
The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I believe the boto API (even with batch write) might not be that efficient?
Are there any other alternatives ?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, a bit of Scala that takes a row, builds an entry for DynamoDB, and PUTs it should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.
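The per-worker part can be sketched in Python as follows. DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call, so rows are grouped accordingly. The send_batch callable is a hypothetical stand-in for whatever actually talks to DynamoDB (for example a boto3 call); it is injected so the sketch stays runnable without AWS access.

```python
from itertools import islice

DYNAMODB_BATCH_LIMIT = 25  # maximum requests per BatchWriteItem call


def batches(items, size=DYNAMODB_BATCH_LIMIT):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_partition(rows, send_batch, batch_size=DYNAMODB_BATCH_LIMIT):
    """Send one partition's rows in batches and return how many rows were sent.

    send_batch is a caller-supplied function (hypothetical here) that writes
    one list of items to DynamoDB.
    """
    sent = 0
    for chunk in batches(rows, batch_size):
        send_batch(chunk)
        sent += len(chunk)
    return sent


if __name__ == "__main__":
    # Fake sender that only records batch sizes, to show the grouping.
    sizes = []
    total = write_partition(({"id": i} for i in range(60)), lambda b: sizes.append(len(b)))
    print(total, sizes)
```

In a Spark job, a function like write_partition could be passed to foreachPartition so that each worker creates its client once per partition rather than once per row. With boto3, the table resource's batch_writer() context manager is an alternative that does the batching for you.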
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create a new column and psf.collect_list to convert the data to the desired collection/JSON format in that new column of the same DataFrame.
Drop all unnecessary (tabular) columns and keep only the JSON-format columns in the Spark DataFrame.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDDs. The RDD interface, even in Scala, is 2-3 times slower than the DataFrame API in any language.
The DataFrame API's performance is programming-language agnostic, as long as you don't use UDFs.