Get Hive partition's values using Map Reduce (HCatalog) - mapreduce

Given a Hive table as shown below which is partitioned on gender, I would like to know the partition column's values. So, for this table the values for gender are (male, female). :
── table
├── gender=male
└── gender=female
Can we achieve this using Map Reduce program using HCatalog related libraries.

This can be achieved in Hive by running a show partitions statement.
show partitions yourDatabaseName.yourTableName;

Related

BigQuery Hive external partitioning and partition column names

I have created an external BigQuery partitioned table on GCS using Hive partition strategy with Parquet export. All works fine on key:value name on CGS. BQ picks up the partitoned stripes as external columns and shows these colums as added to the table schema.
Let us assume that I have a column called reporting_date on BQ native table. I cannot call my partition stripe as reporting_date=2021-08-31. Otherwise BQ does not allow the external table to be created. So I call it reportingdate=2021-08-31.
so far so good. So I can use a predicate like
select * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reportingdate = '2021-08-31'
This will use partition pruning.
However, I can also do the following as well
select * FROM <MY_DATASET>.<MYEXTERNAL__TABLE> WHERE reporting_date = '2021-08-31'
Note that I am using the existing column name NOT the external partitioned column name. Surprisingly this predicate also uses partition pruning. I was wondering how this works as the external partition is built on reportingdate and NOT reporting_date.
My assumption is that underneath the bonnet these two columns use the same storage space? Somehow BigQuery at time of creating the external table remembers the partition column name. Any ideas or clarification will be appreciated.

AWS Athena query on parquet file - using columns in where clause

We are planning to use Athena as a backend service for our data(stored as parquet files in partitions) in S3.
Some of the things we are interested to find out is how does adding additional columns in where clause of the query affect the query run time.
For example, we have 10million records in one hive partition(partition based on column 'date')
And all queries below return same volume - 10million. would all these queries take same time or does it reduce query run when we add additional columns in where clause(as parquet is columnar fomar)?
I tried to test this but results were not consistent as there was some queuing time as well I guess
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet file contains page "indexes" (min, max and bloom filters.) If you sorting the data by columns in question during insert for example like this:
insert overwrite table mytable partition (dt)
select col1, --some columns
type,
subtype,
dt
distribute by dt
sort by type, subtype
then these indexes may work efficiently because data withe the same type, subtype will be loaded into the same pages, data pages will be selected using indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Switch-on predicate-push-down: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual json files I am querying are at the 'hour' level with all year, month, day directories only containing sub directories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its sub directories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour level directories since I can create a table and define the column name and type represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with sub directories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up, the traditional, and the new way. The new way uses a feature that was released just a couple of days ago and if you try to find more examples of it you may not find many, so I'm going to show you the traditional too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date (please note that you need to quote this in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word, if you want to avoid having to quote it use another name, dt seems popular, also note that it's escaped with backticks in DDL and with double quotes in DML statements). When you query this table and specify a criteria for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, which is a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data – here I'm telling Athena to assume that there exists data on S3 from 2020-06-01 up until the time the query runs (adjust the start date necessary), which means that if you specify a date before that time, or after "now" Athena will know that there is no such data and not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05' Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020-06-06, s3://cszlos-data/is/here/2020-06-07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straight forward for interactive use is to use ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.

How to insert Billing Data from one Table into another Table in BigQuery

I have two tables both billing data from GCP in two different regions. I want to insert one table into the other. Both tables are partitioned by day, and the larger one is being written to by GCP for billing exports, which is why I want to insert the data into the larger table.
I am attempting the following:
Export the smaller table to Google Cloud Storage (GCS) so it can be imported into the other region.
Import the table from GCS into Big Query.
Use Big Query SQL to run INSERT INTO dataset.big_billing_table SELECT * FROM dataset.small_billing_table
However, I am getting a lot of issues as it won't just let me insert (as there are repeated fields in the schema etc). An example of the dataset can be found here https://bigquery.cloud.google.com/table/data-analytics-pocs:public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1
Thanks :)
## Update ##
So the issue was exporting and importing the data with the Avro format and using the auto-detect schema when importing the table back in (Timestamps were getting confused with integer types).
Solution
Export the small table in JSON format to GCS, use GCS to do the regional transfer of the files and then import the JSON file into a Bigquery table and DONT use schema auto detect (e.g specify the schema manually). Then you can use INSERT INTO no problems etc.
I was able to reproduce your case with the example data set you provided. I used dummy tables, generated from the below queries, in order to corroborate the cases:
Table 1: billing_bigquery
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='BigQuery' limit 1000
Table 2: billing_pubsub
SELECT * FROM `data-analytics-pocs.public.gcp_billing_export_v1_EXAMPL_E0XD3A_DB33F1`
where service.description ='Cloud Pub/Sub' limit 1000
I will propose two methods for performing this task. However, I must point that the target and the source table must have the same columns names, at least the ones you are going to insert.
First, I used INSERT TO method. However, I would like to stress that, according to the documentation, if your table is partitioned you must include the columns names which will be used to insert new rows. Therefore, using the dummy data already shown, it will be as following:
INSERT INTO `billing_bigquery` ( billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits )#invoice, cost_type
SELECT billing_account_id, service, sku, usage_start_time, usage_end_time, project, labels, system_labels, location, export_time, cost, currency, currency_conversion_rate, usage, credits
FROM `billing_pubsub`
Notice that for nested fields I just write down the fields name, for instance: service and not service.description, because they will already be used. Furthermore, I did not select all the columns in the target dataset but all the columns I selected in the target's tables are required to be in the source's table selection as well.
The second method, you can simply use the Query settings button to append the small_billing_table to the big_billing_table. In BigQuery Console, click in More >> Query settings. Then the settings window will appear and you go to Destination table, check Set a destination table for query results, fill the fields: Project name,
Dataset name and Table name -these are the destination table's information-. Subsequently, in
Destination table write preference check Append to table, which according to the documentation:
Append to table — Appends the query results to an existing table
Then you run the following query:
Select * from <project.dataset.source_table>
Then after running it, the source's table data should be appended in the target's table.

Athena equivalent to information_schema

For background, I come from a SQLServer background and make heavy use of the system tables & information_schema, to tell me all about my tables and columns.
I didn't expect the exact same power in Athena, but currently very shocked and frustrated with what little seems to be available - unless I've missed something ?
For example, 'describe mytable' - just describes 1 table at a time.
How about showing the columns for ALL tables in one result ?
It also does not output the table name, nor allow you to manually add that in as a custom column.
All the results of these "show/list/describe" commands seem to produce a text list - not a recordset, so you cannot take the results and join them to other tables or views to make more complex outputs.
Is there any other way to query the contents of my databases ?
Thanks in advance
Athena is based on Presto. Presto provides information_schema schema and I checked and it is accessible in Athena.
You can run e.g. a query like:
SELECT * FROM information_schema.columns;
to get a list of columns of all tables.
You can filter this by "database":
SELECT * FROM information_schema.columns WHERE table_schema = '<databasename>';
Note however that these types of queries are not necessarily very performant.