Parquet: read particular columns into memory - mapreduce

I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?

If you can use Hive, creating a Hive table and issuing a simple select query would be by far the easiest option.
create external table tbl1(<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
-- this works in Hive 0.14
You can use the Hive JDBC driver to do that from a Java program as well.
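For example, a minimal sketch of the JDBC route, assuming the hive-jdbc driver is on the classpath and HiveServer2 is reachable at localhost:10000 with the default database (the URL, credentials, and the tbl1/col1/col2 names are placeholders to adjust for your cluster):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveColumnReader {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint and database are assumptions; adjust for your setup
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // select only the columns you need from the external table defined above
             ResultSet rs = stmt.executeQuery("select col1, col2 from tbl1")) {
            while (rs.next()) {
                System.out.println(rs.getString("col1") + "\t" + rs.getString("col2"));
            }
        }
    }
}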
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields but the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure that the modified schema is compatible with the original schema.
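As a rough sketch of that approach with parquet-avro, you can build the cut-down schema with SchemaBuilder and pass it as a requested projection. The record name, namespace, field names, and file path below are placeholders; they must match the names in the file's original Avro schema:
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetProjectionReader {
    public static void main(String[] args) throws Exception {
        // Build a reader schema that keeps only the columns you want.
        // Record name, namespace, and field names are placeholders; they must
        // match the names used in the file's original Avro schema.
        Schema projection = SchemaBuilder.record("tbl1").namespace("my.ns")
                .fields()
                .optionalString("col1")
                .optionalLong("col2")
                .endRecord();

        Configuration conf = new Configuration();
        // Ask parquet-avro to materialize only the projected columns.
        AvroReadSupport.setRequestedProjection(conf, projection);

        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path("/path/to/file.parquet"))
                             .withConf(conf)
                             .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record.get("col1") + "\t" + record.get("col2"));
            }
        }
    }
}
Each record that comes back carries only the projected columns; if you need a 2D array, you still have to copy the values out of the records yourself.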

Options:
Create a Hive table over the file with all columns and Parquet as the storage format, then read the required columns by specifying the column names in the query
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala)

You can also use Apache Drill, which natively parses Parquet files.
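If you go the Drill route, you can query the file over Drill's JDBC driver from Java. The sketch below is an assumption-heavy example: it uses an embedded Drill connection (jdbc:drill:zk=local), the default dfs storage plugin, and placeholder column names and file path; adjust all of these for your installation.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillParquetQuery {
    public static void main(String[] args) throws Exception {
        // Embedded Drill URL; for a cluster use jdbc:drill:zk=<zk-hosts> instead
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Drill queries the Parquet file directly, no table definition needed
             ResultSet rs = stmt.executeQuery(
                     "select col1, col2 from dfs.`/path/to/file.parquet`")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}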

Related

How to determine if my AWS Glue Custom CSV Classifier is working?

I am using AWS Glue to catalog (and hopefully eventually transform) data. I am trying to create a Custom CSV Classifier for the crawler so that I can provide a known set of column headers to the table. The data is in TSV (tab separated value) format, and the files themselves have no header row. There is no 'quote' character in the data, but there are 1 or 2 columns which use a double quote in the data, so I've indicated in the Classifier that it should use a single quote (').
To ensure I start clean, I delete the AWS Glue Catalog Table and then run the Crawler with the Classifier attached. When I subsequently check the created table, it lists csv as the classification, and the column names specified in the Classifier are not associated with the table (the columns are instead labelled col0, col1, col2, col3, etc.). Further, when inspecting a few rows in the table, it appears as though the data associated with the columns does not use the same column ordering as in the raw data itself, which I can confirm because I have a copy of the raw data open locally on my computer.
AWS Glue Classifier documentation indicates that a crawler will attempt to use the Custom Classifiers associated with a Crawler in the order they are specified in the Crawler definition, and if no match is found with certainty 1.0, it will use Built-in Classifiers. In the event a match with certainty 1.0 is still not found, the Classifier with the highest certainty will be used.
My questions are these:
How do I determine if my Custom CSV Classifier (which I have specifically named, say for sake of argument customClassifier) is actually being used, or if it is defaulting to the Built-In CSV Classifier?
More importantly, given the situation above (having the columns known but separate from the data, and having double quotes be used in the actual data but no quoted values), how do I get the Crawler to use the specified column names for the table schema?
Why does it appear as though my data in the Catalog is not using the column order specified in the file (even if it is the generic column names)?
If it is even possible, how could I use an ApplyMapping transform to rename the columns for the workflow (which would be sufficient for my case)? I need to do so without enabling script-only mode (by modifying an AWS Glue Studio Workflow), and without manually entering over 200 columns.

When storing an Impala table as textfile, is it possible to tell it to save column names in the textfile?

I have created an Impala table as
create table my_schema.my_table stored as textfile as select ...
As per the definition, the table has its data stored in text files somewhere in HDFS. Now when I run an HDFS command such as:
hadoop fs -cat path_to_file | head
I do not see any column names. I suppose Impala stores the column names somewhere else, but since I would like to work with these text files outside of Impala as well, it would be great if the files included the headers.
Is there some option I can set when creating the table to add the headers to the text files? Or do I need to figure out the names by parsing the results of show create table?

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation, and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual JSON files I am querying are at the 'hour' level, with the year, month, and day directories only containing subdirectories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its subdirectories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour-level directories, since I can create a table and define the column names and types represented in my .json files, but I am not sure how to create a table in Athena (if possible) to represent a day directory with subdirectories instead of actual files.
To query only the data for a day without making Athena read all the data for all days, you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys, Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table; which one is best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense and will work great in this case.
There are two ways to set this up: the traditional way, and the new way. The new way uses a feature that was released just a couple of days ago, and if you try to find more examples of it you may not find many, so I'm going to show you the traditional way too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date (note that you need to quote it in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word; if you want to avoid having to quote it, use another name, dt seems popular; also note that it's escaped with backticks in DDL and with double quotes in DML statements). When you query this table and specify a criterion for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data. Here I'm telling Athena to assume that there is data on S3 from 2020-06-01 up until the time the query runs (adjust the start date as necessary), which means that if you specify a date before that time, or after "now", Athena will know that there is no such data and will not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05', Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020/06/06, s3://cszlos-data/is/here/2020/06/07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straightforward for interactive use is ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.

Cannot read and write the query result back to BigQuery

I am using BigQueryIO.readTableRows().fromQuery(...) to read rows from BigQuery and then writing the TableRows back to BigQuery using BigQueryIO.writeTableRows(). I already have a table with the correct schema created, so I am using CreateDisposition.CREATE_NEVER and do not have to set the schema in the Beam client. The problem is that all RECORD fields are flattened (with an underscore appended) in the query result and do not match the schema of the table, which is not in flattened form. Using .withoutResultFlattening() on the read did not help unflatten the records, so I cannot get around this discrepancy. How do I query without flattening the result?
You can use Standard SQL, since the results will not be flattened, as explained here.
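A minimal Beam sketch of that, assuming the Beam Java SDK with the Google Cloud Platform IO module; the query and table references are placeholders. With usingStandardSql() the nested RECORD fields come back unflattened, so the TableRows should line up with the existing table schema:
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class UnflattenedCopy {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // usingStandardSql() keeps nested RECORD fields intact; the query and
        // destination table below are placeholders.
        PCollection<TableRow> rows = p.apply("Read",
                BigQueryIO.readTableRows()
                        .fromQuery("SELECT * FROM `my_project.my_dataset.source_table`")
                        .usingStandardSql());

        rows.apply("Write",
                BigQueryIO.writeTableRows()
                        .to("my_project:my_dataset.target_table")
                        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
    }
}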

Insert CSV file values into table (APEX)

I am trying to insert bulk values into a table from an Excel (.csv) file.
I have created a file browser item on the page; now, in the process, I have to write insert code to load the Excel values into the table.
I have created the following table: NON_DYNAMIC_USER_GROUPS
Columns: ID, NAME, GROUP, GROUP_TYPE.
I need to create the insert process code for this.
I prefer the Excel2Collection plugin for converting any form of Excel document into rows in an Oracle table.
http://www.apex-plugin.com/oracle-apex-plugins/process-type-plugin/excel2collections_271.html
The PL/SQL is already written and packaged into an APEX plugin, making it easy to use.
It is possible to unpack the code and convert it to use your own table instead of apex_collections, which are limited to 50 columns/fields.