Best storage format to back up a Hive internal table - amazon-web-services

I have one Hive internal table with around 500 million records.
My Hive is deployed on top of AWS EMR. I do not want to keep the EMR cluster running all the time, so I want to back up the Hive internal table data.
One easy way of doing this is to create an external table pointing to an S3 location and then copy all records into that external table using an INSERT command.
Whenever I need the internal table back, I can use this external S3 table to restore all the data.
Since this table's only purpose is backup, which STORED AS format would be the best choice for me?
Hive currently supports the following formats:
TEXTFILE
SEQUENCEFILE
ORC
PARQUET
AVRO
RCFILE
Also, is there any other way to back up internal tables besides the approach mentioned above?

In short
I'd think changing the file format (among those you listed) will not make much difference in size. But the file size and the type of access you want on that file play a crucial role in your cloud bill.
So consider the following:
Compression - to reduce the size
Amazon Glacier - a more cost-effective option than S3 in AWS, since the data is unlikely to be accessed often (archival)
Things to consider when choosing a solution: how much time can you afford
to access the files from archival storage,
to convert the data format back to a Hive managed table (if you change it during archival),
to uncompress the data (each compression codec is a trade-off between time and size).
Extended answer
Here are some of the file formats with their decompression speed and space efficiency; pick a compression format that is balanced (in time/space, per the questions above) and available to you.
more compress and compress benchmarks at
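The time/size trade-off described above can be measured on a sample of your own data before committing to a codec. The sketch below is only an illustration: it uses the codecs in the Python standard library (gzip, bz2, lzma), which are not necessarily the ones available in your Hive setup, and the sample rows are hypothetical.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> dict:
    """Return {codec: (compressed_size, decompress_seconds)} for stdlib codecs."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        assert restored == data  # the round trip must be lossless
        results[name] = (len(packed), elapsed)
    return results


# Hypothetical sample: repetitive CSV-like rows, loosely resembling a table export.
sample = b"".join(b"%d,customer_%d,US\n" % (i, i % 100) for i in range(20000))
report = compare_codecs(sample)
for name, (size, seconds) in report.items():
    ratio = size / len(sample)
    print(f"{name}: {size} bytes ({ratio:.1%} of original), decompress {seconds:.4f}s")
```

Timings vary by machine and data, so treat the numbers as relative rather than absolute.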

Related

Push RedShift table to S3 by doing some aggregation as CSV

I have been looking for the best way to programmatically export a Redshift table (the table needs to be aggregated) to S3.
What would be the best solution? For Athena to S3 I found this article; however, I could not find any information on doing it from Redshift to S3.
https://www.datastackpros.com/2020/07/export-athena-view-as-csv-to-aws-s3.html
It would be a daily ingestion, and the CSV file should be overwritten.
Thanks
There are 2 ways that come to mind right away - UNLOAD and CREATE EXTERNAL TABLE. Each has its pros and cons. Your use case isn't completely clear as to what you need the resulting file(s) to look like but let me take a guess.
I expect you need a single CSV file (with or without a header row?) for other tools to read / use. In this case I'd use UNLOAD with PARALLEL OFF to save the result of the query to S3. This will produce 1 file in S3 ONLY IF the resulting size is below the maximum file size (6.2 GB by default); larger results are split across multiple files.
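As a rough sketch of the UNLOAD approach, the helper below only builds the SQL text; running it against Redshift is left to whatever client you already use. The query, bucket prefix, and IAM role ARN are hypothetical placeholders. ALLOWOVERWRITE is included on the assumption that the daily file should replace the previous one.

```python
def build_unload(query: str, s3_prefix: str, iam_role: str, header: bool = True) -> str:
    """Build a Redshift UNLOAD statement that writes CSV output serially."""
    # UNLOAD takes the query as a quoted string, so inner single quotes are doubled.
    escaped = query.replace("'", "''")
    options = ["CSV"]
    if header:
        options.append("HEADER")
    options += ["PARALLEL OFF", "ALLOWOVERWRITE"]
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + "\n".join(options)
        + ";"
    )


# Hypothetical aggregation query, bucket, and role.
statement = build_unload(
    "SELECT region, SUM(amount) AS total FROM sales WHERE status = 'paid' GROUP BY region",
    "s3://my-bucket/exports/daily_",
    "arn:aws:iam::123456789012:role/MyUnloadRole",
)
print(statement)
```

Note that UNLOAD treats the TO value as a prefix and appends a suffix to the object name, so downstream tools should look for the generated key rather than the bare prefix.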

BigQuery Pricing Comparison: Loading data into BigQuery vs Using Create External Table

My team is developing a data platform on Google Cloud Platform.
We uploaded our company's data to Google Cloud Storage and are trying to build a data mart on BigQuery.
However, to save on GCP usage costs, we are deciding whether to load all the data from GCS into BigQuery or to create external tables in BigQuery.
Which way is more cost-efficient?
BigQuery's external table capability makes the border between a data lake (files) and a data warehouse (structured data) blurry, so your question is relevant.
When you use external tables, several features are missing, such as clustering and partitioning, and your files are parsed on the fly (with type casting), so processing is slower and you can't control or limit the volume of data that you process. In addition, errors in the files can break your query.
When you use native tables, the data storage is optimized for BigQuery processing: the data is already clean and parsed, and the table can be partitioned and clustered.
The question of cost has several parts. First, data storage: if you have the files in GCS and the same data in BigQuery, you pay for storage twice. However, after 90 consecutive days without modification, the data in BigQuery moves to long-term storage pricing and is about 2 times cheaper. In addition, you can move your GCS files to a colder storage class after loading them into BigQuery.
That's for storage. Then there is processing. First of all, processing costs roughly 10 times more than storage, so it's the most important thing to focus on. When you run a BigQuery query, you pay for the volume of data that your query scans. With partitions or clusters on BigQuery native tables, you can limit the amount of data that you scan and therefore reduce the cost a lot. With external tables, you can't use the partitioning and clustering features, so you always pay for the full amount of data.
Therefore, it depends (as always) on your volume of data and the frequency of the requests.
Don't forget one more thing: with external tables, errors in the files can break your queries. In production, that can be dramatic. Think carefully about that.
Finally, querying an external table is slower than querying a native table (no partitioning, therefore more data to process, plus parsing/casting time). Time is money (if you have time-critical queries), and that immaterial cost can also influence your choice.
@guillaume blaquiere's answer is okay, but he forgot to mention something important: partitioned queries are possible. You can create partitioned external tables linked to a bucket in Cloud Storage. E.g.:
gs://myBucket/myTable/dt=2019-10-31/lang=en/foo
gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar
Then, you can use "dt" or "lang" filters in SQL queries from BigQuery.
https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs
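To make the layout above concrete, here is a small sketch of how key=value path segments map to filterable columns. It is only an illustration of the naming convention, not of how BigQuery implements pruning; the helper names are hypothetical.

```python
from urllib.parse import urlparse


def partition_keys(uri: str) -> dict:
    """Extract key=value path segments from a Hive-partitioned object URI."""
    path = urlparse(uri).path
    keys = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def prune(uris: list, **filters: str) -> list:
    """Keep only the objects whose partition keys match every filter."""
    return [
        uri for uri in uris
        if all(partition_keys(uri).get(k) == v for k, v in filters.items())
    ]


objects = [
    "gs://myBucket/myTable/dt=2019-10-31/lang=en/foo",
    "gs://myBucket/myTable/dt=2018-10-31/lang=fr/bar",
]
print(partition_keys(objects[0]))  # {'dt': '2019-10-31', 'lang': 'en'}
print(prune(objects, lang="fr"))   # only the second object remains
```

A filter on "dt" or "lang" therefore lets the engine skip whole prefixes instead of reading every file.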

Athena query timeout for bucket containing too many log entries

I am running a simple Athena query as in
SELECT * FROM "logs"
WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')
BETWEEN parse_datetime('2021-12-01:00:00:00','yyyy-MM-dd:HH:mm:ss')
AND
parse_datetime('2021-12-21:19:00:00','yyyy-MM-dd:HH:mm:ss');
However this times out due to the default DML 30 min timeout.
The path I am querying contains a few million entries.
Is there a way to address this in Athena or is there a better suited alternative for this purpose?
This is normally solved with partitioning. For data that's organized by date, partition projection is the way to go (versus an explicit partition list that's updated manually or via Glue crawler).
That, of course, assumes that your data is organized by the partition (eg, s3://mybucket/2021/12/21/xxx.csv). If not, then I recommend changing your ingest process as a first step.
You may want to change your ingest process anyway: Athena isn't very good at dealing with a large number of small files. While the tuning guide doesn't give an optimal file size, I recommend at least a few tens of megabytes. If you're getting a steady stream of small files, use a scheduled Lambda to combine them into a single file. If you're using Firehose to aggregate files, increase the buffer sizes / time limits.
And while you're doing that, consider moving to a columnar format such as Parquet if you're not already using it.
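To illustrate why a date-organized layout helps, the sketch below lists the daily prefixes a query over the date range from the question would need to touch, assuming the s3://mybucket/YYYY/MM/DD/ layout mentioned above. Athena does this pruning itself once partitions are defined; the function name here is hypothetical and exists only to show the idea.

```python
from datetime import date, timedelta


def daily_prefixes(bucket: str, start: date, end: date) -> list:
    """Return one S3 prefix per day in the inclusive range [start, end]."""
    days = (end - start).days
    return [
        f"s3://{bucket}/{(start + timedelta(days=n)):%Y/%m/%d}/"
        for n in range(days + 1)
    ]


prefixes = daily_prefixes("mybucket", date(2021, 12, 1), date(2021, 12, 21))
print(len(prefixes))  # 21 daily prefixes instead of the whole bucket
print(prefixes[0], prefixes[-1])
```

With partitioning, the WHERE clause filters on the partition columns, so only these prefixes are read rather than every log entry.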

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days).
Only the customerIds/deviceIds that had their countryCode change since the last aggregation (from an hour ago) should be written to DynamoDB, to keep the required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job run's aggregated results against the current run's aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
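The two aggregation steps in the list above can be sketched in plain Python to pin down the intended semantics. This is only an in-memory illustration of the logic, not a Glue or Redshift implementation; the tuple layout and the use of a day string instead of a full timestamp are assumptions.

```python
def aggregate(events):
    """Two-stage MAX(countryCode): per (customer, device, day), then overall.

    events: iterable of (customer_id, device_id, country_code, day) tuples.
    Returns {(customer_id, device_id): max_country_code}.
    """
    # Stage 1: MAX(countryCode) per customer, device, and day.
    daily = {}
    for customer_id, device_id, country_code, day in events:
        key = (customer_id, device_id, day)
        daily[key] = max(daily.get(key, country_code), country_code)

    # Stage 2: MAX over the daily maxima per customer and device.
    overall = {}
    for (customer_id, device_id, _day), country_code in daily.items():
        key = (customer_id, device_id)
        overall[key] = max(overall.get(key, country_code), country_code)
    return overall


# Hypothetical events.
events = [
    ("c1", "d1", "DE", "2021-12-01"),
    ("c1", "d1", "US", "2021-12-01"),
    ("c1", "d1", "FR", "2021-12-02"),
    ("c2", "d9", "CA", "2021-12-02"),
]
print(aggregate(events))  # {('c1', 'd1'): 'US', ('c2', 'd9'): 'CA'}
```

As in SQL, MAX on a string column compares values lexicographically.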
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous run's aggregated results S3 file and another table from the current run's aggregated results S3 file, then run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use Step Functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise your overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old ones, which are in S3. This is easily done with Redshift Spectrum or Athena, as they can make files (or sets of files) the source for a table. Then it is just a matter of running the queries.
In my opinion Glue is an ETL tool that can do high-power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock yourself out; if not, I would avoid it.
As for data management, yes, you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said, if you organize your data in S3 in "month" folders, then removing unneeded months of data is just a matter of deleting those folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
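The comparison itself is small. Below is a minimal sketch of the change-detection logic, assuming both aggregates fit in memory as dictionaries keyed by (customerId, deviceId); in SQL the same result comes from joining the two aggregate tables on those keys. The function name is hypothetical.

```python
def changed_rows(previous: dict, current: dict) -> dict:
    """Return entries of `current` that are new or whose countryCode differs."""
    return {
        key: code
        for key, code in current.items()
        if previous.get(key) != code
    }


# Hypothetical aggregates from the previous and current hourly runs.
previous = {("c1", "d1"): "US", ("c2", "d9"): "CA"}
current = {("c1", "d1"): "US", ("c2", "d9"): "MX", ("c3", "d4"): "FR"}
print(changed_rows(previous, current))
# {('c2', 'd9'): 'MX', ('c3', 'd4'): 'FR'}
```

This sketch ignores keys that disappeared from the current run; if those need handling in DynamoDB, the comparison has to look at both sides, which is what a FULL OUTER JOIN would give you.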
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you ask "Is Redshift the best storage choice here?". You seem to recognize the importance of where the data resides with respect to the compute elements, which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again, then Redshift is the best option: the data is moved once to the place where the analytics need to run. However, Redshift is an expensive storage solution; storage is not what it is meant for. Redshift Spectrum is very interesting in that the initial aggregations of data are done in S3 and much-reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution, and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described one area where you need a solution, and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. That would change the optimization point.

Database suggestion for large unstructured datasets to integrate with elasticsearch

We have a scenario with millions of records saved in a database. Currently I use DynamoDB for saving metadata (and for write, update, and delete operations on objects), S3 for storing files (e.g. images, whose associated metadata is stored in DynamoDB), and Elasticsearch for indexing and searching. But because of DynamoDB's 400 KB limit per item (a single object), it is not sufficient for the data I need to save. I thought about saving an object as several versions in DynamoDB itself, but that would be too complicated.
So I was thinking of replacing DynamoDB with some better storage:
Amazon DocumentDB
S3 for saving metadata as well, along with the object files
Which of the two is the better option in your opinion, and why? It should also be cost-effective. (It should also be easy to sync with Elasticsearch, but syncing is not much of an issue, as it is possible with both.)
If you have any better suggestions than these two, please share them as well.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Storing the data in S3 would cost $0.023 per GB per month for Standard and $0.0125 for Infrequent Access (whereas DocumentDB is $0.10 per GB-month); depending on your data size, the difference could add up significantly. If you use IA, be aware that your retrieval costs could also add up.
With S3 you would not fetch the data directly; you would use either Athena or S3 Select to filter it. Depending on the size of the data being queried, this would take from a few seconds to possibly minutes (not the milliseconds you requested).
S3 storage for unstructured data, and the querying technologies around it, are more targeted at a data lake used for analysis, whereas DocumentDB is geared toward performance within live applications (it is a MongoDB-compatible data store, after all).
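For a rough sense of scale, the storage prices quoted above can be plugged into a small calculation. The per-GB prices are the ones from this answer and may be out of date; the 500 GB data size is a hypothetical example, and request, retrieval, and DocumentDB instance costs are deliberately left out.

```python
# Per GB-month storage prices as quoted in the answer (may be outdated).
PRICE_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "S3 Infrequent Access": 0.0125,
    "DocumentDB": 0.10,
}


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Storage-only cost for one month; excludes requests and retrieval."""
    return size_gb * price_per_gb_month


size_gb = 500  # hypothetical data size
costs = {
    name: monthly_storage_cost(size_gb, price)
    for name, price in PRICE_PER_GB_MONTH.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per month")
```

Storage is only one part of the bill, so this comparison should be read alongside the latency and query-pattern points above.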