I am currently working with BigQuery and understand that there is a limit of up to 4,000 partitions.
Does anyone know whether this limit applies to the active storage tier only, or to both the active and long-term storage tiers?
I am asking because I have a table partitioned by hour that I have been using for more than 6 months, yet we never get an error about exceeding the 4,000-partition limit when we insert new data.
I counted the number of partitions (see the attached image):
As we can see, the total is 6,401 partitions, and we are still able to insert new data.
At the same time, we created a new partitioned table and tried moving data into it, but we got an error saying we had exceeded the limit of 4,000.
In addition, I tried to insert the data incrementally, but I still get the following error:
Steps to reproduce the error:
1. Create a partitioned table (partitioned by hour).
2. Start moving data by month from another table.
My finding:
The partition limit mentioned above applies only to the active storage tier.
Can anyone confirm this?
As I understand the limitation, you can't modify more than 4,000 partitions in a single job. The jobs you describe first presumably work because each of them modifies only a few partitions.
When you try to move more than 4,000 partitions in one go, you hit the limitation as you described.
I noticed I was hitting this limitation on both active storage and long-term storage. This is a BigQuery-wide limitation.
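To make the "few partitions per job" idea concrete, here is a minimal sketch that splits a time range into windows so that each copy job touches at most a given number of hourly partitions. The function name `hourly_windows` and the default of 4000 are assumptions for illustration (4000 is simply the figure quoted in this thread), and the sketch assumes `start` is aligned to an hour boundary.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def hourly_windows(
    start: datetime,
    end: datetime,
    max_partitions_per_job: int = 4000,  # assumption: figure quoted in this thread
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [window_start, window_end) ranges, each spanning
    at most max_partitions_per_job hourly partitions."""
    if max_partitions_per_job < 1:
        raise ValueError("max_partitions_per_job must be positive")
    step = timedelta(hours=max_partitions_per_job)
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


if __name__ == "__main__":
    # Each printed window could become the WHERE clause of one copy job.
    for lo, hi in hourly_windows(datetime(2022, 1, 1), datetime(2022, 10, 1)):
        print(lo.isoformat(), "->", hi.isoformat())
```

Each window can then drive one INSERT ... SELECT filtered on the partitioning column, so that no single job modifies more partitions than the limit allows.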
I have a problem related to storing data in GCP BigQuery. I have a partitioned table that is 10 terabytes in size and growing day by day. How can I store this data with minimum cost and maximum performance?
First option: I can store the last month's data in BigQuery and the rest of the data in GCS.
Second option: Delete everything older than the last month, but this option seems illogical to me.
What do you think about this issue?
BigQuery Table
The best solution is to use a BigQuery table partitioned by a useful date column. A large part of this table will then be charged at the lower long-term storage rate. If your organisation allows it, consider choosing a lower-cost region for the whole project, because all the data you query together needs to be in the same region.
For each query only the needed time and the needed columns are charged.
GCS files
There is an option for using external tables for files stored in GCS. These have some drawbacks: for each query the complete data is read and charged. There are some partition possibilities using hive partition keys (https://cloud.google.com/bigquery/docs/hive-partitioned-queries). It is also not possible to precalculate the cost of a query, which is very bad for testing and debugging.
Use cases
If you only need the last month of data for daily reports, it is enough to keep that data in BigQuery and the rest in GCS. If you only need to query longer time ranges once a month, you can load the data from GCS into BigQuery and delete the table after your queries.
I was importing some old data to the Timestream table successfully for a while but then it started to give the error:
Timestream error: com.amazonaws.services.timestreamwrite.model.ThrottlingException: Your magnetic store writes to Timestream are throttled for this database. Refer to Timestream documentation for 'ActiveMagneticStorePartitions' metric to prevent recurrence of this issue or contact AWS support.
The metric it refers to rises to the limit of 250 and then drops to 0 after a while. Even after that, as soon as I start the import again it immediately hits the limit and the error is raised, so nothing is imported at all.
I am not running imports in parallel, only one at a time, but the error is still raised.
As a workaround, I increased the memory store retention period for this table, but for some reason I still get the same error, even when importing data that falls within the new retention period.
If you're ingesting old data, you should try to sort your data by timestamp. This will help to create fewer active partitions.
Then, before inserting the old data into Timestream, you should check the active partitions.
I met with the AWS support team several times to understand the best way to ingest data into the magnetic store (the memory store doesn't have this constraint). They suggested ingesting data sorted by timestamp. So if you have multiple devices, you should ingest the data by timestamp instead of by device.
The criteria behind an active partition are not clear, and they always talk about likelihood...
I've run load tests to ingest the same data into the magnetic store and ended up with different numbers of active partitions.
Here are the results of my load tests:
I ingested 2142288 records belonging to January 2022, which are written to the magnetic store under my current Timestream configuration. Between executions, I increased the record version to overwrite the previous records.
January (total active partitions: 0)
Ingest 2142288 records -> new 16 active partitions (new: 16)
Ingest 2142288 records -> new 16 active partitions (new: 16, total: 32)
Ingest 2142288 records -> new 16 active partitions (new: 16, total: 48)
Ingest 2142288 records -> new 0 active partitions (new: 0, total: 48)
Ingest 2142288 records -> new 0 active partitions (new: 0, total: 48)
Without waiting for the active partitions to drop to zero, I ingested 1922784 records belonging to February 2022.
February (total active partitions: 48)
Ingest 1922784 records -> new 0 active partitions (new: 0, total: 48)
I waited until the active partitions decreased to zero, increased the record version and ran the same tests.
February (total active partitions: 0)
Ingest 1922784 records -> new 82 active partitions (new: 82, total: 82)
As you can see, there is no clear pattern in how active partitions are created, but if you sort your data by timestamp you have a better likelihood of success when ingesting data into the magnetic store.
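The "ingest by timestamp instead of by device" advice can be sketched as follows. This is only an illustration: the record layout (a dict with a `time` key) and the function name `batches_sorted_by_time` are hypothetical, and the default batch size of 100 is an assumption about the per-request record limit that you should check against your own quotas.

```python
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]  # hypothetical record shape: {"device": ..., "time": ...}


def batches_sorted_by_time(
    records: Iterable[Record],
    batch_size: int = 100,   # assumption: records per write request
    time_key: str = "time",  # hypothetical field holding the timestamp
) -> Iterator[List[Record]]:
    """Sort all records globally by timestamp (not per device) and
    yield them in consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    ordered = sorted(records, key=lambda r: r[time_key])
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]


if __name__ == "__main__":
    # Records grouped by device, as an export per device would produce them.
    by_device = [
        {"device": "a", "time": 1}, {"device": "a", "time": 3},
        {"device": "b", "time": 2}, {"device": "b", "time": 4},
    ]
    for batch in batches_sorted_by_time(by_device, batch_size=2):
        print([(r["device"], r["time"]) for r in batch])
```

Each batch would then be passed to one write call, so that consecutive requests cover adjacent time ranges rather than jumping back to the start of the range for every device.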
have the exact same issue.
have a table that we were ingesting historical records to that was working fine until we got past some threshold. (not sure if its worth mentioning but this table is also being written to in realtime with current data as it arrives). we got ~500m rows into the table without ever hitting the 250 active partitions limit, the data is ordered, etc.
then a few weeks ago something changed and ever since then whenever writing historical rows to the table, it almost immediately jumps from 0 to 250 active magnetic partitions and historical ingestion is halted. we've been battling this for weeks.
our solution was create another empty table, import historical records to that, and then every 50m rows or so use a scheduled query to copy all the data from this "temp" table to the actual table we want to use.
this temp table's settings are basically minimal memory store and maximal magnetic store since we're only writing historical data that's 6 months old at a minimum.
for some reason this works fine, all rows are accounted for and it never hits the 250 active partition limit. it costs a bit more, but not much more in our case and its the only thing we've found that works.
if we write the same data to our original table, it immediately hits 250 active mag partitions. pause the process, change the target table, run it again and the new target table barely gets beyond the 8-12 active magnetic partition range for the same data.
running the scheduled query to copy the data from the temp table to the target table seems to have zero impact on the target tables active partition counter that i can see, im assuming its just happening behind the scenes somewhere.
at present this seems to be the only path to finishing our historical data import.
streaming realtime or "present day" data to the mem-store always works fine. this is specifically only happening when writing historical data to the magnetic storage.
I'm writing a script that should fill a new table with data (a ~650 GB table) as quickly as possible.
The partition (hash) key is different for every record, so I can't think of a better key.
I've set the provisioned WCU for this table at 4k.
While the script runs, 16 independent threads put different data into the table at a high rate. During execution, I receive ProvisionedThroughputExceededException. The CloudWatch graphs show that consumed WCU is capped at 1000 WCU.
It could happen if all data is put to one partition.
As I understand it, DynamoDB creates a new partition when the data size exceeds the 10 GB limit. Is that so?
So, during this data fill operation, I have only 1 partition and the limit of 1000 WCU is understandable.
I've checked the https://aws.amazon.com/ru/premiumsupport/knowledge-center/dynamodb-table-throttled/
But it seems that these suggestions apply to tables that are already filled, when you try to add a lot of new data to them.
So I have 3 questions:
1. How can I speed up the process of inserting data into the new empty table?
2. When does DynamoDB decide to create a new partition?
3. Can I set a minimum number of partitions (e.g. 4) to use all of the provisioned WCU (4k)?
UPD Cloudwatch graph:
UPD2 the HASH key is long number. Actually it's not strongly unique. But max rows with same HASH key but different RANGE keys is 2.
You can't manually specify the number of partitions used by DDB. It's automatically handled behind the scenes.
However, the way it's handled is laid out in the link provided by F_SO_K.
1 for every 10GB of data
1 for every 3000RCU and/or 1000WCU provisioned.
If you've provisioned 4000WCU, then you should have at least 4 partitions and you should be seeing 4000WCU consumed. Especially given that you said your hash key is unique for every record, you should have data uniformly spread out and not be running into a "hot" partition.
You mentioned CloudWatch showing consumed WCU at 1000. Does CloudWatch also show provisioned capacity at 4000 WCU?
If so, I'm not sure what's going on, and you may have to contact AWS.
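As a rough illustration of the rule of thumb above, here is a small sketch that estimates the partition count from table size and provisioned capacity. This is only my reading of the numbers quoted in this answer (10 GB, 3000 RCU, 1000 WCU per partition); the function name is hypothetical and the real allocation is internal to DynamoDB, so treat the result as an estimate.

```python
import math


def estimated_partitions(size_gb: float, rcu: int, wcu: int) -> int:
    """Estimate the partition count from the rule of thumb quoted above:
    one partition per 10 GB of data, and one per 3000 RCU / 1000 WCU."""
    by_size = math.ceil(size_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    # A table always has at least one partition.
    return max(by_size, by_throughput, 1)


if __name__ == "__main__":
    # New empty table with 4000 WCU provisioned, as in the question.
    print(estimated_partitions(size_gb=0, rcu=0, wcu=4000))
    # Same table once the ~650 GB have been loaded.
    print(estimated_partitions(size_gb=650, rcu=0, wcu=4000))
```

Under these assumptions the empty table would already start with 4 partitions because of the 4000 WCU, which is why a cap at 1000 WCU looks unexpected.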
I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler updates a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account and I don't want to be scanning all account data for every query. I need to find a scalable way to query only the relevant data. I did look into using Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported, and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.
With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
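A minimal sketch of the prefix idea, assuming account IDs are strings of decimal digits that are uniformly distributed. The function names are hypothetical; the point is only to show how the partition value is derived and how the 90% and 99% figures come about.

```python
def account_id_prefix(account_id: str, length: int = 1) -> str:
    """Return the partition value for an account: its first `length` characters."""
    if length < 1 or len(account_id) < length:
        raise ValueError("account_id is shorter than the requested prefix")
    return account_id[:length]


def scanned_fraction(length: int, alphabet_size: int = 10) -> float:
    """Fraction of the data a single-account query has to scan, assuming IDs
    are uniformly distributed over alphabet_size symbols per position."""
    return 1 / alphabet_size ** length


if __name__ == "__main__":
    print(account_id_prefix("48213907", 2))  # partition value for this account
    print(scanned_fraction(1))               # one digit: a tenth of the data
    print(scanned_fraction(2))               # two digits: a hundredth of the data
```

With two digits there are at most 100 distinct prefix values, so the number of partitions per day stays small even with more than 10,000 accounts.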
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that corresponds to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work, but you can run a GetTable call to get it). Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29') LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) of the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation; this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed, list the partitions created in the temporary table with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartition.
Finally delete all files that belong to the partitions of the query table you deleted and drop the temporary table created by the CTAS operation.
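For the first step above (adding the daily partition to the input table), the partition definition could be built roughly like this. It is a sketch, not tested against a real catalog: the helper name `partition_input_for` is hypothetical, the bucket path is the example one used above, and the storage descriptor is assumed to have been copied from a GetTable response.

```python
from copy import deepcopy
from datetime import date
from typing import Any, Dict


def partition_input_for(
    day: date,
    base_location: str,
    table_storage_descriptor: Dict[str, Any],
) -> Dict[str, Any]:
    """Build the partition definition for one day: a single DATE partition
    value, pointing at the year=/month=/day= directory written by Firehose."""
    descriptor = deepcopy(table_storage_descriptor)  # don't mutate the table's copy
    base = base_location.rstrip("/")
    descriptor["Location"] = f"{base}/year={day:%Y}/month={day:%m}/day={day:%d}"
    return {"Values": [day.isoformat()], "StorageDescriptor": descriptor}


if __name__ == "__main__":
    # In a Lambda function the descriptor would come from something like
    #   glue.get_table(DatabaseName=..., Name=...)["Table"]["StorageDescriptor"]
    # and the result would be passed as PartitionInput to glue.create_partition.
    # Here a stub descriptor is used so the sketch runs without AWS access.
    stub_descriptor = {"Columns": [], "Location": "s3://some-bucket/firehose/"}
    print(partition_input_for(date(2019, 4, 29), "s3://some-bucket/firehose", stub_descriptor))
```

Copying the table's storage descriptor and only overriding Location keeps the input and output formats and SerDe settings of the partition in line with the table.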
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
This looks like a lot, and looks like what Glue Crawlers and Glue ETL do – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is to convert your raw data from JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply don't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing with Glue ETL (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.
I have approximately 100 TB of data that I need to backfill by running a query against it to transform fields, then writing the result to another table. This table is partitioned by ingestion timestamp. Both actions are part of a single query, as you can see below. I am planning to run this query manually multiple times in smaller chunks, split by ingestion timestamp ranges.
Is there a better way to handle this process than running the query in manual chunks? For example, using Dataflow or another framework.
CREATE TABLE IF NOT EXISTS dataset.table
PARTITION BY DATE(timestamp) AS
with load as (SELECT *, _TABLE_SUFFIX as tableId
FROM `project.dataset.table_*`
WHERE _TABLE_SUFFIX BETWEEN '1' AND '1531835999999'
),................
...................
You need to size the queries you run carefully, as the quotas enforced are very limiting.
Partitioned tables
Maximum number of partitions per partitioned table — 4,000
Maximum number of partitions modified by a single job — 2,000
Each job operation (query or load) can affect a maximum of 2,000 partitions. Any query or load job that affects more than 2,000 partitions is rejected by Google BigQuery.
Maximum number of partition modifications per day per table — 5,000
You are limited to a total of 5,000 partition modifications per day for a partitioned table. A partition can be modified by using an operation that appends to or overwrites data in the partition. Operations that modify partitions include: a load job, a query that writes results to a partition, or a DML statement (INSERT, DELETE, UPDATE, or MERGE) that modifies data in a partition.
More than one partition may be affected by a single job. For example, a DML statement can update data in multiple partitions (for both ingestion-time and partitioned tables). Query jobs and load jobs can also write to multiple partitions but only for partitioned tables. Google BigQuery uses the number of partitions affected by a job when determining how much of the quota the job consumes. Streaming inserts do not affect this quota.
Maximum rate of partition operations — 50 partition operations every 10 seconds
Most of the time you hit the second limitation, single job no more than 2000, and if you parallelise further you hit the last one, 50 partition operations every 10 seconds.
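To see how these quotas shape a backfill, here is a small planning sketch. The defaults are simply the figures quoted above (2,000 partitions per job, 5,000 partition modifications per day per table), which may differ from the current quotas; the names are hypothetical, and the sketch assumes each partition is written exactly once.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BackfillPlan:
    jobs: int      # minimum number of jobs needed
    min_days: int  # minimum number of days, given the daily quota


def plan_backfill(
    total_partitions: int,
    per_job_limit: int = 2000,  # assumption: figure quoted above
    per_day_limit: int = 5000,  # assumption: figure quoted above
) -> BackfillPlan:
    """Lower bounds on jobs and days for writing total_partitions partitions,
    assuming each partition is written exactly once."""
    if min(total_partitions, per_job_limit, per_day_limit) < 1:
        raise ValueError("all arguments must be positive")
    jobs = math.ceil(total_partitions / per_job_limit)
    min_days = math.ceil(total_partitions / per_day_limit)
    return BackfillPlan(jobs=jobs, min_days=min_days)


if __name__ == "__main__":
    # Example: three years of daily partitions.
    print(plan_backfill(3 * 365))
```

If the plan shows more than one job, the chunk boundaries should follow partition boundaries so that no partition is written by two jobs and counted twice against the daily quota.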
On the other hand, the DML MERGE statement could help you here.
If you have a sales representative, reach out to the BigQuery team through them; if they can increase some of your quotas, they will respond positively.
I've also seen people use multiple projects to run jobs beyond the quotas.