How to set Vora table partition size?

I have set the 'partitionSize' option to multiple different values, but I seem to get the same number of partitions no matter what value I use. According to the documentation, this should correspond to the HDFS block size. Is there something that I am missing?
HDFS block size 64M
CREATE TABLE TABLE_TEST (DEFINITION_INFO)
USING com.sap.spark.vora
OPTIONS (
tablename "TABLE_TEST",
partitionSize "64",
paths "/load_from_here/combined.csv",
eagerLoad "true"
)
The CSV is about 680M.

The name of the parameter is a bit misleading: it does not control table partitioning, but rather influences load performance when data is loaded into tables. In newer versions it might be renamed to avoid this confusion.

Related

How to avoid AWS Athena CTAS query creating small files?

I'm unable to figure out what is wrong with my CTAS query: it breaks the data into smaller files while storing them inside a partition, even though I haven't specified any bucketing columns. Is there a way to avoid these small files and store one single file per partition, since files smaller than 128 MB cause additional overhead?
CREATE TABLE sampledb.yellow_trip_data_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    external_location = 's3://mybucket/Athena/tables/parquet/',
    partitioned_by = ARRAY['year','month']
)
AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;
I was able to overcome the issue by creating a bucketing column, month_a. Below is the code:
CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
    format = 'AVRO',
    external_location = 's3://a4189e1npss3001/Athena/internal_tables/avro/',
    partitioned_by = ARRAY['year','month'],
    bucketed_by = ARRAY['month_a'],
    bucket_count = 12
) AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;
Athena is a distributed system, and it scales the execution of your query by some unobservable mechanism. It looks like it decided to use five workers for your CTAS query, which results in five files in each partition.
You could try explicitly specifying a bucket count of one, but you might still get multiple files, if I remember correctly.
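If you want to try that, here is a minimal sketch reusing the question's tables and columns (the output location is hypothetical, and the SELECT list is abbreviated; the partition columns still go last):
CREATE TABLE sampledb.yellow_trip_data_single
WITH (
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    external_location = 's3://mybucket/Athena/tables/parquet_single/',
    partitioned_by = ARRAY['year','month'],
    bucketed_by = ARRAY['month_a'],
    bucket_count = 1  -- a single bucket, so usually one file per partition
) AS SELECT
    VendorID,
    tpep_pickup_datetime,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;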

AWS Athena MSCK REPAIR TABLE takes too long for a small dataset

I am having issues with Amazon Athena. I have a small bucket (36,430 objects, 9.7 MB) with 4 levels of partitioning (my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv), but when I run the command
MSCK REPAIR TABLE db.table
it takes over 25 minutes. I have plans to put data on the order of terabytes into Athena, and I won't do it if this issue remains.
Does anybody know why it is taking so long?
Thanks in advance
MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories. Assuming all potential combinations of partition values occur in the data set, this can turn into a combinatorial explosion.
If you are adding new partitions to an existing table, then you may find that it's more efficient to run ALTER TABLE ADD PARTITION commands for the individual new partitions. This avoids the need to scan the table's entire sub-tree in the file system. It is less convenient than simply running MSCK REPAIR TABLE, but sometimes the optimization is worth it. A viable strategy is often to use MSCK REPAIR TABLE for an initial import, and then use ALTER TABLE ADD PARTITION for ongoing maintenance as new data gets added into the table.
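A hedged sketch of that, using the example path from the question (Athena also accepts multiple PARTITION clauses in a single statement):
ALTER TABLE db.table ADD IF NOT EXISTS
    PARTITION (p1 = 'ab', p2 = 'cd', p3 = 'ef', p4 = 'gh')
    LOCATION 's3://my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/';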
If it's really not feasible to use ALTER TABLE ADD PARTITION to manage the partitions directly, then the execution time might be unavoidable. Reducing the number of partitions might reduce execution time, because it won't need to traverse as many directories in the file system. Of course, then the partitioning is different, which might impact query execution time, so it's a trade-off.
While the marked answer is technically correct, it doesn't address your real issue, which is that you have too many files.
I have a small bucket (36,430 objects, 9.7 MB) with 4 levels of partitioning (my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv)
For such a small table, 36,430 files create a huge amount of overhead on S3, and partitioning with 4 levels is severe overkill. The partitioning has hindered query performance rather than optimizing it. MSCK is slow because it is waiting for S3 listing, among other things.
Athena could read the entire 9.7 MB table faster if it were in one single file than it can even list that huge directory structure.
I recommend removing the partitions completely, or, if you really must have them, removing the p2, p3 and p4 levels. Also consider processing the data into another table to compact the files into larger ones (a sketch follows below).
Some suggest optimal file sizes are between 64 MB and 4 GB, which relates to the native block sizes on S3. It also helps to have a number of files that is some multiple of the number of workers in the cluster, although that is unknown with Athena. Your data is smaller than that range, so 1 file, or at most 8 files, would be appropriate.
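One way to do that compaction inside Athena itself is a CTAS into a new table; a minimal sketch, with hypothetical table and location names:
CREATE TABLE db.table_compacted
WITH (
    format = 'PARQUET',  -- a columnar format also reduces the amount of data scanned
    external_location = 's3://my-bucket/compacted/'
) AS
SELECT * FROM db.table;  -- no partitioned_by, so the result is a handful of larger files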
Some references:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/#OptimizeFileSizes
https://www.upsolver.com/blog/small-file-problem-hdfs-s3
Use Athena partition projection to manage partitions automatically.
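With partition projection, Athena derives partition values from table properties instead of the metastore, so MSCK REPAIR TABLE becomes unnecessary. A hedged sketch for the layout above (the value list is hypothetical, and p2 through p4 would be configured the same way as p1):
ALTER TABLE db.table SET TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.p1.type' = 'enum',
    'projection.p1.values' = 'ab,cd',  -- hypothetical list of p1 values
    'storage.location.template' = 's3://my-bucket/p1=${p1}/p2=${p2}/p3=${p3}/p4=${p4}/'
);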

Why is Amazon Redshift UNLOAD performance much better for fresh data?

I wonder why unloading from a big table (>100 bln rows), when filtering on a column that is NOT a sort key or part of the sort key, is immensely faster for newly added data. How does Redshift know when it can stop the sequential scan in the second scenario?
Time the query spent executing was 39m 37.02s:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-01-15\\' AND \\'2017-01-16\\'') TO ...
vs.
Time the query spent executing was 23.01s:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-06-24\\' AND \\'2017-06-25\\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum values stored in each 1 MB block on disk. Each block only stores data related to a single column (e.g. daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If a SORTKEY on daytime is used, then a particular date will only appear in a minimal number of blocks.
Your second query probably executes faster, even without a SORTKEY, because you are querying data that was added recently and is therefore all stored together in just a few blocks. The historical data might be spread across many blocks because a VACUUM probably reordered the data based upon the table's actual SORTKEY. In fact, if you ran a VACUUM now, you might find that your second query becomes slower.
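If daytime is the column you usually filter on, one possible fix is to make it the sort key; a sketch, assuming a recent Redshift version that supports changing the sort key in place:
ALTER TABLE production.some_table ALTER SORTKEY (daytime);  -- zone maps become selective on daytime
VACUUM FULL production.some_table;  -- re-sorts existing rows into the new order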

VoltDB is exhausting the RAM while loading the data

I am trying to load database tables into VoltDB using its csvloader utility. When I try to load one table of size 5 GB, VoltDB eats RAM so fast that free RAM drops from 55 GB to 200 MB, and then the VoltDB process gets killed by the system.
What can be the reason for this, and what are the recommended settings for VoltDB to avoid it?
Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hash of each row's partitioning-key column value.
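For reference, partitioning is declared in the DDL. A minimal sketch with a hypothetical table (note that VoltDB requires the partitioning column to be NOT NULL):
CREATE TABLE readings (
    device_id INTEGER NOT NULL,  -- partitioning column, must be NOT NULL
    reading FLOAT,
    taken_at TIMESTAMP
);
PARTITION TABLE readings ON COLUMN device_id;  -- rows are hashed on device_id across partitions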
If it's partitioned and you still can't load all of the data, the next thing to look at is the schema. There are formulas in the Planning Guide that describe memory usage for the various datatypes and for indexes. The VMC interface also has a sizing worksheet that gives you minimums and maximums based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, 10 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not use the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way to split into multiple output files is to have Hive insert the results of your query into a partitioned table. This results in at least one file per partition. For this to make sense you must have some reasonable column to partition on; for example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach guarantees at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it just needs to be set for this query to work).
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE table_to_export_to_redshift (
    id INT,
    value INT
)
PARTITIONED BY (country STRING);

INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine-grained control, you can write your own reduce script to pass to Hive and have that script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
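A hedged sketch of that approach, assuming a hypothetical my_reducer.py that reads tab-separated rows on stdin and writes tab-separated rows to stdout; the CLUSTER BY in the inner query forces the script to run reduce-side:
ADD FILE my_reducer.py;

FROM (
    SELECT id, value, country
    FROM some_table
    CLUSTER BY country  -- shuffle so the transform runs in the reduce stage
) clustered
SELECT TRANSFORM (clustered.id, clustered.value, clustered.country)
    USING 'python my_reducer.py'
    AS (id INT, value INT, country STRING);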
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, or Pig, or pull the files into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.), you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.