Pyspark pandas partition issues - pyspark-pandas

I am using pyspark.pandas to read a parquet file which is partitioned. However when I check using spark.explain(), it says no partition defined. ps.read_parquet does not seem to have an optin to specify partition explicitly and some how it is also not automatically inferring the partition from parquet file.
import pyspark.pandas as ps
df=ps.read_parquet(path_to_partitioned_file)
df.spark.explain()
22/05/19 21:06:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
22/05/19 21:06:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Related

Q: AWS Redshift: ANALYZE COMPRESSION 'Table Name' - How to save the result set into a table / Join to other Table

I am looking for a way to save the result set of an ANALYZE Compression to a table / Joining it to another table in order to automate compression scripts.
Is it Possible and how?
You can always run analyze compression from an external program (bash script is my go to), read the results and store them back up to Redshift with inserts. This is usually the easiest and fastest way when I run into these type of "no route from leader to compute node" issues on Redshift. These are often one-off scripts that don't need automation or support.
If I need something programatic I'll usually write a Lambda function (or possibly a python program on an ec2). Fairly easy and execution speed is high but does require an external tool and some users are not happy running things outside of the database.
If it needs to be completely Redshift internal then I make a procedure that keeps the results of the leader only query in cursor and then loops on the cursor inserting the data into a table. Basically the same as reading it out and then inserting back in but the data never leaves Redshift. This isn't too difficult but is slow to execute. Looping on a cursor and inserting 1 row at a time is not efficient. Last one of these I did took 25 sec for 1000 rows. It was fast enough for the application but if you need to do this on 100,000 rows you will be waiting a while. I've never done this with analyze compression before so there could be some issue but definitely worth a shot if this needs to be SQL initiated.

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is quite high number of partitions.
The old version had a database and tables with only 1 partition level, say id=x. Let's take one table; for example, where we store payment parameters per id (product), and there are not plenty of IDs. Assume its around 1000-5000. Now while querying that table with passing id number on where clause like ".. where id = 10". The queries were returned pretty fast actually. Assume we update the data twice a day.
Lately, we've been thinking to add another partition level for day like, "../id=x/dt=yyyy-mm-dd/..". This means that partition number grows xID times per day if a month passes and if we have 3000 IDs, we'd approximately get 3000x30=90000 partitions a month. Thus, a rapid grow in number of partitions.
On, say 3 months old data (~270k partitions), we'd like to see a query like the following would return in at most 20 seconds or so.
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
The Real Case
It turns out Athena first fetches the all partitions (metadata) and s3 paths (regardless the usage of where clause) and then filter those s3 paths that you would like to see on where condition. The first part (fetching all s3 paths by partitions lasts long proportionally to the number of partitions)
The more partitions you have, the slower the query executed.
Intuitively, I expected that Athena fetches only s3 paths stated on where clause, I mean this would be the one way of magic of the partitioning. Maybe it fetches all paths
Does anybody know a work around, or do we use Athena in a wrong way ?
Should Athena be used only with small number of partitions ?
Edit
In order to clarify the statement above, I add a piece from support mail.
from Support
...
You mentioned that your new system has 360000 which is a huge number.
So when you are doing select * from <partitioned table>, Athena first download all partition metadata and searched S3 path mapped with
those partitions. This process of fetching data for each partition
lead to longer time in query execution.
...
Update
An issue opened on AWS forums. The linked issue raised on aws forums is here.
Thanks.
This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL; DR I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have a temporal partitioning, on date or even time, depending on query patterns. If you should have partitioning on other properties depends on a lot of factors and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there is more than 1000 files in a partition because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.

AWS Athena MSCK REPAIR TABLE takes too long for a small dataset

I am having issues with amazon athena, I have a small bucket ( 36430 objects , 9.7 mb ) with 4 levels of partition ( my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv ) but when I run the command
MSCK REPAIR TABLE db.table
is taking over 25 minutes, and I have plans to put data of the magnitude of TB on Athena and I won't do it if this issue remains
Does anybody know why is taking too long?
Thanks in advance
MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories. Assuming all potential combinations of partition values occur in the data set, this can turn into a combinatorial explosion.
If you are adding new partitions to an existing table, then you may find that it's more efficient to run ALTER TABLE ADD PARTITION commands for the individual new partitions. This avoids the need to scan the table's entire sub-tree in the file system. It is less convenient than simply running MSCK REPAIR TABLE, but sometimes the optimization is worth it. A viable strategy is often to use MSCK REPAIR TABLE for an initial import, and then use ALTER TABLE ADD PARTITION for ongoing maintenance as new data gets added into the table.
If it's really not feasible to use ALTER TABLE ADD PARTITION to manage the partitions directly, then the execution time might be unavoidable. Reducing the number of partitions might reduce execution time, because it won't need to traverse as many directories in the file system. Of course, then the partitioning is different, which might impact query execution time, so it's a trade-off.
While the marked answer is technically correct, it doesn't address your real issue, which is that you have too many files.
I have a small bucket ( 36430 objects , 9.7 mb ) with 4 levels of
partition ( my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv )
For such a small table, 36430 files creates a huge amount of overhead on S3, and the partitioning with 4 levels is super-overkill. The partitioning has hindered query performance rather than optimizing it. MSCK is slow because it is waiting for S3 listing among other things.
Athena would read the entire 9.7MB table if it were in one file faster than it would be able to list that huge directory structure.
I recommend removing the partitions completely, or if you really must have them then remove p2, p3 and p4 levels. Also consider processing it into another table to compact the files into larger ones.
Some suggest optimal file sizes are between 64MB and 4GB, which relates to the native block sizes on S3. It's also helpful to have a number of files that is some multiple of the workers in the cluster, although that is unknown with Athena. Your data is smaller than that range, so 1 or perhaps 8 files at most would be appropriate.
Some references:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/#OptimizeFileSizes
https://www.upsolver.com/blog/small-file-problem-hdfs-s3
Use Athena Projection to manage partitions automatically.

Why Amazon Redshift UNLOAD performance is much better for fresh data?

I wonder why unloading from a big table (>100 bln rows) when selecting by a column, which is NOT a sort key or a part of sort key, is immensely faster for newly added data. How Redshift understands that it is time to stop sequential scan in the second scenario?
Time the query spent executing. 39m 37.02s:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-01-15\\' AND \\'2017-01-16\\'') TO ...
vs.
Time the query spent executing. 23.01s :
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\\'2017-06-24\\' AND \\'2017-06-25\\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1MB block on disk. Each block only stores data related to a single column (eg daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread in many blocks because a VACUUM probably reordered the data based upon the correct SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.

Redshift unload's file name

I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number to be the end of the file name (it'd be great if it can be eliminated completely), I want to add a file extension at end of the file name. I'd like to see either of the following:
mybucket/foo-000.txt.gz
mybucket/foo.txt.gz
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As it says in Amazon Redshift UNLOAD document, if you do not want it to be split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000.[EXT] suffix (when the [EXT] exists only when the compression is enabled), because there is a limit to a file size that Redshift can output, as says in the documentation:
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will alway add at least the prefix 000, because Redshift doesn't know what size of the file he is going to output in the first place, so he's adding this suffix in case the output will reach the size of 6.2 GB.
If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:
The most important reason is because of the way a Redshift cluster designed. Each cluster includes at least 2 servers, when one of them is a leader node and the rest are data nodes. The purpose of leader node, is to control the data nodes, it hold the necessary information to work with all data in Redshift, either read or write.
When you unload data from Redshift while the flag PARALLEL is TRUE, it will create at least X files, when X is the number of nodes you choose to construct the Redshift cluster of, in the first place. It means, that the data is written directly from the data nodes themselves, which is much faster because it's doing it in parallel and skips the leader node.
When you decide to turn this flag to off, all data is gathered from all of the data nodes into a single node, the leader node, because it needs to reorganize the sorting of the rows to output and also compress it if needed as a single stream. This action causes you data to be written much slower.
Also, this is significantly decreases Redshift cluster performance in a matter of reading and writing data, because everything (read and write queries) goes through the leader node, and as it says above, when the leader node is overloaded, there will be a performance issue.
The queries COPY and UNLOAD work directly with the data nodes, therefore, they behave almost the same way as if you would use PARALLEL TRUE. In the contrary, queries like SELECT, UPDATE, DELETE and INSERT, are processed by the leader node, that's why they suffer from the leader node loads.