Cloud Dataflow - Heap space error while using PCollectionList - google-cloud-platform

I have to partition the data by a date field. I am doing this with the Partition transform.
When I divide yearly data by month, Partition returns a PCollectionList containing 12 PCollections. This works fine.
When I divide it by day, I have to create 1*12*31 PCollections in the PCollectionList. This throws a heap space error. I tried with only 2 months of data, that is, a PCollectionList of 2*31 PCollections.
I tried n1-highmem-4 and n1-highmem-8 machines with more than 10 workers. It still throws the heap space error. I am testing with a file of only 2.0 MiB, so I believe data size should not be the problem. The screenshots are below.
Please help me fix this. A workaround would also be most welcome.
Thanks in advance.

It sounds like you're trying to get time-based divisions of your data. Have you looked at windowing? It should allow you to do monthly/daily/hourly windowing without needing to perform the partition. If windowing isn't applicable, could you explain why you need to partition by day?
How are you consuming the partitioned results? You may be hitting a known bug in which pipelines with many sinks run into OOM errors because of the byte buffers allocated for each sink.
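To illustrate the alternative to a fixed list of 2*31 outputs, here is a minimal plain-Python sketch (not Dataflow SDK code) that keys each record by its day, so only days that actually occur produce a group. The function name, the record layout, and the sample data are hypothetical; in a pipeline the analogous step would be windowing or grouping by a date key.

```python
from collections import defaultdict
from datetime import date


def group_by_day(records, date_of):
    """Group records under their calendar day instead of a fixed partition slot."""
    groups = defaultdict(list)
    for record in records:
        groups[date_of(record)].append(record)
    return dict(groups)


# Hypothetical sample records: (date, payload)
records = [
    (date(2016, 1, 1), "a"),
    (date(2016, 1, 1), "b"),
    (date(2016, 2, 15), "c"),
]

by_day = group_by_day(records, lambda r: r[0])
for day, items in sorted(by_day.items()):
    print(day, len(items))
```

The number of groups follows the data rather than the calendar, so no empty per-day outputs need to be allocated up front.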

Related

Redshift: experiencing slow query performance between 2 segments

We’re experiencing slow query performance on AWS Redshift. Frequently we see that queries take around 12 seconds to run, but only very little of that time (<500ms) is spent actually executing the query (according to the AWS Redshift console for an individual query).
Querying svl_compile, we can confirm that the query plan is already compiled.
In svl_query_report we see a long delay between the start times of 2 segments, which accounts for the majority of the run time, although the segments themselves all execute very quickly (milliseconds).
There are a number of things that could be going on but I suspect network distribution is involved. Check STL_DIST.
Another possibility is that Redshift broke the query up and a subquery is running during that window. This can happen with very complex queries. Review the plan and see if there are any references to computer-generated table names (I think they begin with 't', but this is just from memory).
Spilling to disk could be happening, but this seems unlikely given what you have said so far. Queuing delays don't seem like a match either. Both are possible but not likely.
If you post more info about how the query is running, things will narrow down. The actual execution report, explain plan, and/or logging table info would help home in on what is happening during this time window.
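As a rough aid for locating that window, the sketch below finds the longest idle gap between one segment ending and the next one starting. It assumes the segment number, start time, and end time have already been fetched from svl_query_report into Python tuples; the function name and the sample timings are hypothetical.

```python
from datetime import datetime, timedelta


def largest_gap(segments):
    """Return (gap, previous_segment, next_segment) for the longest idle gap.

    `segments` is an iterable of (segment, start_time, end_time) tuples.
    """
    ordered = sorted(segments, key=lambda s: s[1])
    best = (timedelta(0), None, None)
    if not ordered:
        return best
    latest_seg, _, latest_end = ordered[0]
    for seg, start, end in ordered[1:]:
        gap = start - latest_end
        if gap > best[0]:
            best = (gap, latest_seg, seg)
        if end > latest_end:
            latest_seg, latest_end = seg, end
    return best


# Hypothetical timings: three fast segments with a long pause before the last.
t0 = datetime(2020, 1, 1, 12, 0, 0)
rows = [
    (0, t0, t0 + timedelta(milliseconds=10)),
    (1, t0 + timedelta(milliseconds=20), t0 + timedelta(milliseconds=30)),
    (2, t0 + timedelta(seconds=11.5), t0 + timedelta(seconds=11.52)),
]

gap, before, after = largest_gap(rows)
print(f"{gap.total_seconds():.2f}s idle between segment {before} and segment {after}")
```

Knowing which two segments bracket the gap makes it easier to match the window against distribution or subquery activity in the other system tables.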

AWS Elasticsearch indexing memory usage issue

The problem: very frequent "403 Request throttled due to too many requests" errors during data indexing which should be a memory usage issue.
The infrastructure:
Elasticsearch version: 7.8
t3.small.elasticsearch instance (2 vCPU, 2 GB memory)
Default settings
Single domain, 1 node, 1 shard per index, no replicas
There are 3 indices with searchable data. 2 of them have roughly 1 million documents (500-600 MB) each and one has 25k (~20 MB). Indexing is not very simple (it has history tracking), so I've been testing refresh with true and wait_for values, or calling it separately when needed. The process uses search and bulk queries (I've been trying sizes of 500 and 1000). There should be a limit of 10MB on the AWS side, so these are safely below that. I've also tested adding delays of 0.5 or 1 second between requests, but none of this fiddling has had any noticeable benefit.
The project is currently in development so there is basically no traffic besides the indexing process itself. The smallest index generally needs an update once every 24 hours, larger ones once a week. Upscaling the infrastructure is not something we want to do just because indexing is so brittle. Even only updating the 25k data index twice in a row tends to fail with the above mentioned error. Any ideas how to reasonably solve this issue?
Update 2020-11-10
Did some digging in past logs and found that we used to have 429 circuit_breaking_exception-s (instead of the current 403) with a reason along the lines of [parent] Data too large, data for [<http_request>] would be [1017018726/969.9mb], which is larger than the limit of [1011774259/964.9mb], real usage: [1016820856/969.7mb], new bytes reserved: [197870/193.2kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=197870/193.2kb, accounting=4309694/4.1mb]. Used the cluster stats API to track memory usage during indexing, but didn't find anything that I could identify as a direct cause of the issue.
Ended up creating a solution based on the information that I could find. After some searching and reading it seemed like just trying again when running into errors is a valid approach with Elasticsearch. For example:
Make sure to watch for TOO_MANY_REQUESTS (429) response codes
(EsRejectedExecutionException with the Java client), which is the way
that Elasticsearch tells you that it cannot keep up with the current
indexing rate. When it happens, you should pause indexing a bit before
trying again, ideally with randomized exponential backoff.
The same guide also has useful information about refreshes:
The operation that consists of making changes visible to search -
called a refresh - is costly, and calling it often while there is
ongoing indexing activity can hurt indexing speed.
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds.
In my use case indexing is a single linear process that does not occur frequently so this is what I did:
Disabled automatic refreshes (index.refresh_interval set to -1)
Using refresh API and refresh parameter (with true value) when and where needed
When running into a "403 Request throttled due to too many requests" error the program will keep trying every 15 seconds until it succeeds or the time limit (currently 60 seconds) is hit. Will adjust the numbers/functionality if needed, but results have been good so far.
This way the indexing is still fast, but will slow down when needed to provide better stability.
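The retry approach described above can be sketched roughly as follows. This is a generic illustration, not Elasticsearch client code: the Throttled exception, the function name, and the default delays are assumptions, and it uses the randomized exponential backoff that the quoted guide recommends rather than a fixed 15-second interval. The 60-second time limit matches the one mentioned above.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error the caller raises on a 403/429 throttling response."""


def call_with_backoff(request, base_delay=1.0, max_delay=15.0, time_limit=60.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call request() until it succeeds or the time limit would be exceeded."""
    deadline = clock() + time_limit
    attempt = 0
    while True:
        try:
            return request()
        except Throttled:
            # Exponential growth capped at max_delay, with full random jitter.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(0, ceiling)
            if clock() + delay > deadline:
                raise  # give up: the next wait would pass the time limit
            sleep(delay)
            attempt += 1
```

The sleep and clock parameters are injectable only so that the behaviour can be checked without real waiting; in normal use the defaults apply and request would wrap a single bulk call.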

Google AutoML Importing text items very slow

I'm importing text items to Google's AutoML. Each row contains around 5000 characters and I'm adding 70K of these rows. This is a multi-label data set. There is no progress bar or indication of how long this process will take. It's been running for a couple of hours. Is there any way to calculate the time remaining or the total estimated time? I'd like to add additional data sets, but I'm worried that this will be a very long process before the training even begins. Any sort of formula to create even a semi-wild guess would be great.
-Thanks!
I don't think that's possible today, but I filed a feature request [1] that you can follow for updates. I asked for both training and importing data, as for training it could be useful too.
I tried training with 50K records (~ 300 bytes/record) and the load took more than 20 mins after which I killed it. I retried with 1K, which ran for 20 mins and then emailed me an error message saying I had multiple labels per input (yes, so what? training data is going to have some of those) and I had >100 labels. I simplified the classification buckets and re-ran. It took another 20 mins and was successful. Then I ran 'training' which took 3 hours and billed me $11. That maps to $550 for 50K recs, assuming linear behavior. The prediction results were not bad for a first pass, but I got the feeling that it is throwing a super large neural net at the problem. Would help if they said what NN it was and its dimensions. They do say "beta" :)
Don't waste your time trying to use Google for text classification. I am a heavy GCP user, but Microsoft LUIS is far better, more precise, and so much faster that I can't believe both products are trying to solve the same problem.
LUIS has much better documentation, supports more languages, has a much better test interface, and is way faster. I don't know yet if it is cheaper because the pricing model is different, but we are willing to pay more.

Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’

ETL developer reports they have been trying to run our weekly and daily processes on ADW consistently. While for the most part they are executing without exception, I am now getting this error:
“Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.”
Is there a limit on TEMPDB space associated with the DWU setting?
The database is limited to 100TB (per the portal) and not full.
Azure SQL Data Warehouse does allocate space for a tempdb, at around 399 GB per 100 DWU. Reference here.
What DWU are you using at the moment? Consider temporarily raising your DWU (aka service objective) and lowering it again when your batch process is finished, or refactoring your job to be less dependent on tempdb.
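As a back-of-the-envelope aid, the figure quoted above can be turned into a small estimate. This is only a sketch: it assumes tempdb space scales linearly with DWU at roughly 399 GB per 100 DWU, and the function name is hypothetical.

```python
# Approximate figure quoted in the answer above, not an exact service limit.
TEMPDB_GB_PER_100_DWU = 399


def approx_tempdb_gb(dwu):
    """Estimate tempdb space in GB, assuming linear scaling with DWU."""
    return TEMPDB_GB_PER_100_DWU * dwu / 100


for dwu in (100, 400, 1000):
    print(f"DWU {dwu}: about {approx_tempdb_gb(dwu):.0f} GB of tempdb")
```

Comparing the estimate with the amount of data a job sorts or stages in temp tables gives a rough idea of whether a temporary DWU increase would be enough.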
It might also be worth checking your workload for anything like cartesian products, excessive sorting, over-dependency on temp tables etc to see if any optimisation can be done.
Have a look at the Explain Plans for your code, and see whether you have a lot more data movement going on than you expect. If you find that one query moves a lot more data into Q tables, you can probably tune it to avoid the data movement (which may mean redesigning tables to distribute on a different key).

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100_000_000 records in our database of which approximately 1_000_000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed storage (e.g. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join new information to existing records?
The record size is O(200 Bytes). The user data has a fixed length but should be extendable. The log events have a similar but not equal structure. The number of user records is likely to grow. Near real-time updates are desirable, i.e. a 3 hour time gap is not acceptable, a few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the pig JOIN fast enough? Is it a bottleneck always to read all records? Is Cassandra able to hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are they fixed length, fixed field number, likely to change format over time? Are we talking 100 byte records or 100,000 byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100 million (1 server will probably suffice) or will it grow 100% per year (probably multiple servers, adding new ones over time)?
How you access records for updating depends on whether you need to update them in real-time or whether you can run a batch job. Will updates be every minute, or hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.
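As a starting point for that kind of experiment, here is a minimal single-machine sketch of the join-based strategy from the question: index the smaller update set by key, then stream the existing records once. It assumes the updates fit in memory (about 1,000,000 records of roughly 200 bytes is on the order of 200 MB of raw data); the field names and the function name are hypothetical, and a real MapReduce or Pig join would distribute the same idea across nodes.

```python
def apply_updates(existing, updates, key="user_id"):
    """Merge updates into existing records with one pass over the existing data."""
    pending = {u[key]: u for u in updates}  # index the small side by key
    for record in existing:
        update = pending.pop(record[key], None)
        if update is None:
            yield record
        else:
            yield {**record, **update}  # updated fields override old values
    # Updates whose key matched no existing record are emitted as new records.
    yield from pending.values()


# Hypothetical sample data
existing = [
    {"user_id": 1, "name": "a", "visits": 1},
    {"user_id": 2, "name": "b", "visits": 5},
]
updates = [
    {"user_id": 2, "visits": 6},
    {"user_id": 3, "name": "c", "visits": 1},
]

merged = list(apply_updates(existing, updates))
print(merged)
```

This approach reads all existing records on every run, which is the cost the question asks about; an indexed store avoids the full scan at the price of random writes, so timing both on realistic data is the only reliable comparison.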