How can I explicitly specify the output file size or the number of files? - amazon-athena

Situation:
If I only specify the partition clause, the data is split into multiple files, each smaller than 1 MB (about 40 files in total).
What I am thinking of:
I want to explicitly specify the output file size or the number of output files when registering data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Using the bucketing method (as described in the article above) can help me control the number of files or the file size. However, the article also says: "Note: The INSERT INTO statement isn't supported on bucketed tables". I would like to register data daily in the data mart with Athena's INSERT INTO.
What is the best way to build a partitioned data mart without compromising search efficiency? Is it best to register the data with Glue and save it as one file?

Related

The source files structure will be changed on daily basis in informatica cloud

The requirement is that the source file structure changes daily / dynamically. How can we achieve this in Informatica Cloud?
For example,
Let's say the source is a flat file that arrives in different formats: with a header, without a header, with different metadata (today the file has 4 columns, tomorrow it has 7 different columns, the day after it comes without a header, and on another day the file contains a record count).
I need to consume all of these dynamically changing files in one Informatica Cloud mapping. Could you please help me with this?
This is a tricky situation. I know it's not a perfect solution, but here is my idea:
Create a source file structure with the maximum number of columns you expect, all of type text, say 50. Read the file and apply a filter to clean up header data, etc. Then use a router to treat the files according to their structure - maybe the filename can give you a hint about what each contains. Once you identify the type of file, convert the columns to their proper data types and load them into the correct target.
The mapping would look like: Source -> SQ -> EXP -> FIL -> RTR -> TGT1, TGT2
There has to be a pattern to identify the dynamic file structure.
HTH...
To summarise my understanding of the problem:
You have a random number of file formats
You don't know the file formats in advance
The files don't contain the necessary information to determine their format.
If this is correct, then I don't believe this is a solvable problem in Informatica or in any other tool, coding language, etc. You don't have enough information available to define a solution.
The only solution is to change your source files. Possibilities include:
a standard format (or one of a small number of standard formats with information in the file that allows you to programmatically determine the format being used)
a self-documenting file type such as JSON

How to set an upper bound to BigQuery's extracted file parts?

Say I have a BigQuery table that contains 3M rows, and I want to export it to gcs.
What I do is standard bq extract <flags> ... <project_id>:<dataset_id>.<table_id> gs://<bucket>/file_name_*.<extension>
I am bound by a limit on the number of rows a file (part) can have. Is there a way to set a hard limit to the size of a file part?
For example, can I require each file part to be no more than 10 MB, or even better, set the maximum number of rows allowed in a file part? The documentation doesn't seem to mention any flags for this purpose.
You can't do it with the BigQuery extract API.
But you can script it (export a few thousand rows at a time in a loop), although you will have to pay for the processed data (the extract itself is free!). You can also set up a Dataflow job for this (but that isn't free either!).
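For what it's worth, a minimal Python sketch of that scripted loop might look like the following, assuming the google-cloud-bigquery client; the project, dataset, table, bucket, ordering column, and chunk size are placeholders, and each chunk is billed as a regular query while the extract itself stays free:

from google.cloud import bigquery

# Hypothetical chunked export: materialize N rows at a time into a staging
# table, then extract that staging table to its own GCS file.
client = bigquery.Client(project="my-project")                      # placeholder project
source = "my-project.my_dataset.big_table"                           # placeholder table
staging = bigquery.TableReference.from_string("my-project.my_dataset._export_chunk")
rows_per_file = 100_000                                               # placeholder chunk size

total_rows = client.get_table(source).num_rows
offset, part = 0, 0
while offset < total_rows:
    sql = f"""
        SELECT * FROM `{source}`
        ORDER BY some_stable_key                 -- placeholder ordering column
        LIMIT {rows_per_file} OFFSET {offset}
    """
    job_config = bigquery.QueryJobConfig(destination=staging,
                                         write_disposition="WRITE_TRUNCATE")
    client.query(sql, job_config=job_config).result()                 # billed as a query
    client.extract_table(staging,
                         f"gs://my-bucket/export/part-{part:05d}.csv").result()  # free
    offset += rows_per_file
    part += 1

Ordering by a stable key keeps the chunks deterministic; without it, LIMIT/OFFSET can return overlapping rows from one query to the next.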

AWS Athena - how to process huge results file

I'm looking for a way to process a ~4 GB file that is the result of an Athena query, and I would like to know:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible on the Athena side. It also looks like it is not possible to split it with Lambda - the file is too large, and s3.open(input_file, 'r') does not seem to work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests).
You can use CTAS with Athena and take advantage of its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advisable to store the newly created table in a compressed format such as Parquet; however, you can also define it as CSV ('TEXTFILE').
Lastly, it is advisable to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The meaningful partitioning is based on your use case and the way you want to split your data. The most common way is to use time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
    format = 'TEXTFILE',
    external_location = 's3://bucket/folder/',
    partitioned_by = ARRAY['year','month']
)
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might get files of different sizes depending on the amount of data in each partition. If this is a problem, or you don't have any meaningful partitioning logic, you can add a random column to the data and partition by it:
substr(to_base64(sha256(to_utf8(cast(some_column_in_your_data as varchar)))), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
CREATE TABLE random_table_name
WITH (
    format = 'TEXTFILE',
    external_location = 's3://bucket/folder/',
    bucketed_by = ARRAY['column_with_high_cardinality'],
    bucket_count = 100
)
AS SELECT ...
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is recurring and needs to be automated, and you want to stay "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specific of how to use split, this other stack overflow question may help:
How to split CSV files as per number of rows specified?
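If you end up scripting this yourself (for example inside the Fargate container suggested above) instead of calling split, a rough Python equivalent by row count could look like the sketch below; the file names and chunk size are placeholders:

import csv

# Hypothetical splitter: write chunks of at most MAX_ROWS data rows each,
# repeating the header in every chunk so each part is a valid CSV on its own.
MAX_ROWS = 50_000                                      # placeholder chunk size

def flush(part, header, rows):
    with open(f"part-{part:05d}.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

with open("athena_result.csv", newline="") as src:     # placeholder input file
    reader = csv.reader(src)
    header = next(reader)
    part, rows = 0, []
    for row in reader:
        rows.append(row)
        if len(rows) >= MAX_ROWS:
            flush(part, header, rows)
            part, rows = part + 1, []
    if rows:                                            # flush the final partial chunk
        flush(part, header, rows)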
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range of about the size you think you can fit in memory, process it, and then request the next. Process each chunk only up to its last line break, and prepend the remaining partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't accumulate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk with that amount (both at the beginning and the end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding - with special handling of the first and last chunk, obviously.
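A minimal boto3 sketch of this ranged-read approach, assuming the result is a newline-delimited CSV in S3; the bucket, key, and chunk size are placeholders:

import boto3

# Hypothetical ranged reader: fetch the object in byte ranges and only
# process complete lines, carrying the trailing partial line forward.
s3 = boto3.client("s3")
bucket, key = "my-bucket", "athena-results/result.csv"    # placeholders
chunk_size = 64 * 1024 * 1024                              # ~64 MB per request

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
offset, carry = 0, b""
while offset < size:
    end = min(offset + chunk_size, size) - 1               # Range end is inclusive
    body = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={offset}-{end}")["Body"].read()
    data = carry + body
    cut = data.rfind(b"\n")                                 # last complete line in this chunk
    complete, carry = data[:cut + 1], data[cut + 1:]
    for line in complete.splitlines():
        pass  # process / buffer the line, e.g. build a 3-4 MB POST payload
    offset = end + 1
if carry:
    pass  # process the final line if the file doesn't end with a newline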

Parquet partitioning and HDFS filesize

My data are in the form of relatively small Avro records, written to Parquet files (on average < 1 MB).
Up to now I used my local filesystem to do some tests with Spark.
I partitioned the data using a hierarchy of directories.
I wonder if it would be better to "build" the partitioning into the Avro record and accumulate bigger files... However, I imagine that partitioned Parquet files would "map" onto partitioned HDFS files too.
What approach would be best?
Edit (clarifying based on comments):
"build the partitioning onto the Avro record": imagine that my directory structure is P1=/P2=/file.avro and that the Avro record contains fields F1 and F2. I could save all of that in a single Avro file containing the fields P1, P2, F1 and F2. Ie there is no need for a partitioning structure with directories as it is all present in the Avro records
about Parquet partitions and HDFS partitions: will HDFS split a big Parquet file on different machines, will that correspond to distinct Parquet partitions ? (I don't know if that is clarifying my question - if not that means I don't really understand)
The main reason for partitioning at the folder level is that when Spark, for instance, reads the data and there is a filter on the partitioned column (extracted from the folder name, as long as the format is path/partitionName=value), it will only read the needed folders (instead of reading everything and then applying the filter). So if you want to use this mechanism, use a hierarchy in your folder structure (I use it often).
Generally speaking, I would recommend avoiding many folders with little data in them (not sure if that is the case here).
Regarding Spark input partitioning (same word, different meaning): when reading from HDFS, Spark will try to read files so that its partitions match the files on HDFS (to prevent shuffling), so if the data is partitioned on HDFS, Spark will match the same partitions. To my knowledge, HDFS does not partition files but rather replicates them (to increase reliability), so I think a single large Parquet file will translate to a single file on HDFS, which will be read into a single partition unless you repartition it or define the number of partitions when reading (there are several ways to do this depending on the Spark version; see this).
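To make the folder-level mechanism concrete, here is a small PySpark sketch; the paths and column names (P1, P2) are placeholders. Writing with partitionBy produces the path/P1=value/P2=value layout, and a filter on P1 lets Spark read only the matching folders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Placeholder input: an existing DataFrame with columns P1, P2, F1, F2.
df = spark.read.parquet("hdfs:///data/raw")                 # placeholder path

# Write one folder per (P1, P2) combination: .../P1=<value>/P2=<value>/part-*.parquet
df.write.mode("overwrite").partitionBy("P1", "P2").parquet("hdfs:///data/partitioned")

# A filter on the partition column prunes folders, so only P1=x is read from disk.
subset = spark.read.parquet("hdfs:///data/partitioned").where(col("P1") == "x")
subset.show()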

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, 10 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way to split into multiple output files is to have Hive insert the results of your query into a partitioned table. This will result in at least one file per partition. For this to make sense, you must have a reasonable column to partition on. For example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach guarantees at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode, it just needs to be set for this query to work).
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE table_to_export_to_redshift (
    id INT,
    value INT
)
PARTITIONED BY (country STRING);

INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine-grained control, you can write your own reduce script to pass to Hive and have that script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
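Purely as an illustration of that route (this script is not from the answer above), a streaming reducer invoked via Hive's TRANSFORM ... USING clause could read tab-separated rows from stdin and roll over to a new output file every so many rows; the file names and row limit are placeholders, and in practice you would likely copy each finished chunk from the task's working directory to HDFS or S3 before the task ends:

#!/usr/bin/env python3
# Hypothetical streaming reducer: reads tab-separated rows from stdin and
# writes them to numbered local files, starting a new file every MAX_ROWS rows.
import sys

MAX_ROWS = 1_000_000            # placeholder rows per output file
part, count, out = 0, 0, None
for line in sys.stdin:
    if out is None or count >= MAX_ROWS:
        if out:
            out.close()
        out = open(f"chunk-{part:05d}.tsv", "w")
        part, count = part + 1, 0
    out.write(line)
    count += 1
if out:
    out.close()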
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, or Pig, or pull the files into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: splitting data into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.