Dynamic partitioning in Hive

Dynamic partitioning in Hive - mapreduce

I am new to hive.
My input file is of the form
(ID, Date(YYYY-MM-DD), hour(HH), key, value).Table is partitioned on (date, hour)
the input file contains data for seven days(24 hours for each day). When i load this data into hive table, i need the data to be loaded in respective partitions of the table.
Can some please help me out.
Thanks,
Sudhakar.

one way is to first load the data into an unpartitioned table (e.g. tmp_some_table in the example below). Then you can do something like:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
from tmp_some_table tt
insert overwrite table some_table partition(day, hour)
select
id,
key,
value,
day,
hour
the partitions need to be the last columns in your select clause. The above works on hive 0.7.1. See the wiki for more info. Note that if you have too many partitions you'll get errors.

Related

How to backfill partitioned data in BigQuery?

I am trying to backfill data from GCP billing export table to another table say T1.
Both tables are partitioned.
Below scheduled query runs everyday to get yesterday’s data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data , say for 15th May - how do I do that ?
I tried the backfill feature with the below query - expecting the backfill utility to take the past date i.e. May 15th as a param for the #run_date but that didn’t help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = #run_date
The data is pulled for 15th May from the source table(gcp_billing_export_v1) but is populated against current date in the destination table i.e May 15th May data is populated against June 22nd in the destination table T1. Where am I going wrong ?
Any guidance ?

Looks like you're using ingestion partitioning.
You would need to create a new table with the partitioning you want ie EventDate and populate that new table with historical and new daily data - as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table

As #Lemon already pointed out that you're using Ingestion time partitioned tables(both source and dest), you need to understand how it works. Ingestion time partitioned tables are different from the Regular partitioned tables.
From the Documentation-
When you create a table partitioned by ingestion time,BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudo-column named _PARTITIONTIME.The value of this column is the ingestion time for each row.
Since you are using the SELECT * FROM gcp_billing_export_v1 you are getting all the data but without any _PARTITIONTIME column. And when you are saving the same result into the destination table , it is updating the _PARTITIONTIME column as per destintaion table's data ingestion-time.
Thus you have old data with the current date in _PARTITIONTIME
To avoid this your destination table needs to be either a normal table or regular partitioned table.
Also you need to have an extra column to hold the Datetime value from the source's_PARTITIONTIME column.You can create a regular partition on this new column.
Then to get _PARTITIONTIME in your result set , in your quer you need to mention the column name specifically in your query.
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = #run_date
The above query will return all the data from the gcp_billing_export_v1 table with 1 extra column ingestionTime.
Now you can backfill the data for 15th,May and save it to the new table.
You can also tweak around this below query to achieve the same
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(#run_date, INTERVAL -1 DAY)
It will run daily as per your need .Now if you want to pull data for 15th,May then you have to schedule the backfill for 16th,May(as per the where clause)

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data(stored as parquet files in partitions) in S3.
Some of the things we are interested to find out is how does adding additional columns in where clause of the query affect the query run time.
For example, we have 10million records in one hive partition(partition based on column 'date')
And all queries below return same volume - 10million. would all these queries take same time or does it reduce query run when we add additional columns in where clause(as parquet is columnar fomar)?
I tried to test this but results were not consistent as there was some queuing time as well I guess
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'

Parquet file contains page "indexes" (min, max and bloom filters.) If you sorting the data by columns in question during insert for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
type,
subtype,
dt
distribute by dt
sort by type, subtype
then these indexes may work efficiently because data withe the same type, subtype will be loaded into the same pages, data pages will be selected using indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Switch-on predicate-push-down: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual json files I am querying are at the 'hour' level with all year, month, day directories only containing sub directories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its sub directories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour level directories since I can create a table and define the column name and type represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with sub directories instead of actual files.

To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up, the traditional, and the new way. The new way uses a feature that was released just a couple of days ago and if you try to find more examples of it you may not find many, so I'm going to show you the traditional too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date (please note that you need to quote this in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word, if you want to avoid having to quote it use another name, dt seems popular, also note that it's escaped with backticks in DDL and with double quotes in DML statements). When you query this table and specify a criteria for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, which is a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data – here I'm telling Athena to assume that there exists data on S3 from 2020-06-01 up until the time the query runs (adjust the start date necessary), which means that if you specify a date before that time, or after "now" Athena will know that there is no such data and not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05' Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020-06-06, s3://cszlos-data/is/here/2020-06-07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straight forward for interactive use is to use ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.

What could make these two queries returning different results?

I have an ELB logs table on Amazon Athena, and I'm trying to request daily requests by url. The table is structure is the one described here, but I'm also added partitions for day, month and year for querying logs by day, month, etc...
I'm partitioning my table with a query like this:
ALTER TABLE elb_logs ADD IF NOT EXISTS PARTITION (year='2019',month='03',day='*') location 's3://my-logs-bucket/my-load-balancer/AWSLogs/526654419886/elasticloadbalancing/eu-west-1/2019/03/'
Then I ask for the log entries on the first of March 2019 like this:
SELECT count(*)
FROM elb_logs
WHERE year='2019'
AND month='03'
AND day='01'
and get 590 results, then if I execute this query:
SELECT count(*), DATE(from_iso8601_timestamp(time))
FROM elb_logs
WHERE year='2019'
AND month='03'
AND day='*'
GROUP BY DATE(from_iso8601_timestamp(time))
I get 590 as the count as well for the first of March, BUT if I execute this one (without the day condition):
SELECT count(*), DATE(from_iso8601_timestamp(time))
FROM elb_logs
WHERE year='2019'
AND month='03'
GROUP BY DATE(from_iso8601_timestamp(time))
I get 1180 as a resulting count, which is incorrect. Why is this? What's the difference from specifying DAY='*' and not specifying DAY? Shouldn't they be equivalent?

There are partition names and partition locations.
Partitions:
month=03,day=01
month=03,day=*
When you query without condition on the day column, both partitions match.
As it happens, they contains same files (since they share their physical location).
As there is (apparently) no deduplication of the files being read (the partitions are supposed to be non-overlapping), the same data files are being read twice.

Athena: Minimize data scanned by query including JOIN operation

Let there be an external table in Athena which points to a large amount of data stored in parquet format on s3. It contains a lot of columns and is partitioned on a field called 'timeid'. Now, there's another external table (small one) which maps timeid to date.
When the smaller table is also partitioned on timeid and we join them on their partition id (timeid) and put date into where clause, only those specific records are scanned from large table which contain timeids corresponding to that date. The entire data is not scanned here.
However, if the smaller table is not partitioned on timeid, full data scan takes place even in the presence of condition on date column.
Is there a way to avoid full data scan even when the large partitioned table is joined with an unpartitioned small table? This is required because the small table contains only one record per timeid and it might not be expected to create a separate file for each.

That's an interesting discovery!
You might be able to avoid the large scan by using a sub-query instead of a join.
Instead of:
SELECT ...
FROM large-table
JOIN small-table
WHERE small-table.date > '2017-08-03'
you might be able to use:
SELECT ...
FROM large-table
WHERE large-table.date IN
(SELECT date from small-table
WHERE date > '2017-08-03')
I haven't tested it, but that would avoid the JOIN you mention.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js