GCP Datastore times out on large download - google-cloud-platform

I'm using Objectify to access my GCP Datastore set of Entites. I have a full list of around 22000 items that I need to load into the frontend:
List<Record> recs = ofy().load().type(Record.class).order("-sync").list();
The number of records has recently increased and I get an error from the backend:
com.google.apphosting.runtime.HardDeadlineExceededError: This request (00000185caff7b0c) started at 2023/01/19 17:06:58.956 UTC and was still executing at 2023/01/19 17:08:02.545 UTC.
I thought that the move to Cloud Firestore in Datastore mode last year would have fixed this problem.
My only solution is to break down the load() into batches using 2 or 3 calls to my Ofy Service.
Is there a better way to grab all these Entities in one go?
Thanks
Tim

Related

Delete data in Elasticache Redis after 2 days

I'm using Redis to store json messages, which need to be consumed by other services. I want the data to store in Redis for 2 days and the data need to be deleted after the 2 days.
Is there any approach to achieve it?
Assuming that you are storing the data in a String—which is likely if you're using ElastiCache—you can SET the value and EXPIRE it in a single command:
SETEX foo 172800 value
You can set a timeout on key using the EXPIRE command. See the doc for details.

Max file count using big query data transfer job

I have about 54 000 files in my GCP bucket. When I try to schedule a big query data transfer job to move files from GCP bucket to big query, I am getting the following error:
Error code 9 : Transfer Run limits exceeded. Max size: 15.00 TB. Max file count: 10000. Found: size = 267065994 B (0.00 TB) ; file count = 54824.
I thought the max file count was 10 million.
I think that BigQuery transfer service lists all the files matching the wildcard and then use the list to load them. So it will be same that providing the full list to bq load ... therefore reaching the 10,000 URIs limit.
This is probably necessary because BigQuery transfer service will skip already loaded files, so it needs to look them one by one to decide which to actually load.
I think that your only option is to schedule a job yourself and load them directly into BigQuery. For example using Cloud Composer or writing a little cloud run service that can be invoked by Cloud Scheduler.
The Error message Transfer Run limits exceeded as mentioned before is related to a known limit for Load jobs in BigQuery. Unfortunately this is a hard limit and cannot be changed. There is an ongoing Feature Request to increase this limit but for now there is no ETA for it to be implemented.
The main recommendation for this issue is to split a single operation in multiple processes that will send data in requests that don't exceed this limit. With this we could cover the main question: "Why I see this Error message and how to avoid it?".
Is is normal to ask now "how to automate or perform these actions easier?" I can think of involve more products:
Dataflow, which will help you to process the data that will be added to BigQuery. Here is where you can send multiple requests.
Pub/Sub, will help to listen to events and automate the times where the processing will start.
Please, take a look at this suggested implementation where the aforementioned scenario is wider described.
Hope this is helpful! :)

Errors importing large CSV file to DynamoDB using Lambda

I want to import a large csv file (around 1gb with 2.5m rows and 50 columns) into a DynamoDb, so have been following this blog from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think from reading the CloudWatch log that the timeout is occurring due to the boto3 read on the CSV file (it opens the entire file first, iterates through and batches up for writing)... I tried to reduce the file size (3 columns, 10,000 rows as a test), and I got a timeout after 2500 rows.
Any thoughts here?!
TIA :)
I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
Lambda Function - as per the above this fails with a timeout.
AWS Pipeline - Doesn't have a template for importing CSV to DynamoDB
Manual Entry - of 2.5m items? no thanks! :)
Use an EC2 instance to load the data to RDS and use DMS to migrate to DynamoDB
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and created a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: Firstly Sudo yum install MySQL to get the tools needed, then use mysqlimport with the --local option to import the CSV file to the rds MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to do this against, and it was only one table. I needed the cheap speed of the dynamodb, otherwise, I'd have stuck to the RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.

deleting old indexes in amazon elasticsearch

We are using AWS Elasticsearch for logs. The logs are streamed via Logstash continuously. What is the best way to periodically remove the old indexes?
I have searched and various approaches recommended are:
Use lambda to delete old indexes - https://medium.com/#egonbraun/periodically-cleaning-elasticsearch-indexes-using-aws-lambda-f8df0ebf4d9f
Use scheduled docker containers - http://www.tothenew.com/blog/running-curator-in-docker-container-to-remove-old-elasticsearch-indexes/
These approaches seem like an overkill for such a basic requirement as "delete indexes older than 15 days"
What is the best way to achieve that? Does AWS provide any setting that I can tweak?
Elasticsearch 6.6 brings a new technology called Index Lifecycle Manager See here. Each index is assigned a lifecycle policy, which governs how the index transitions through specific stages until they are deleted.
For example, if you are indexing metrics data from a fleet of ATMs into Elasticsearch, you might define a policy that says:
When the index reaches 50GB, roll over to a new index.
Move the old index into the warm stage, mark it read only, and shrink it down to a single shard.
After 7 days, move the index into the cold stage and move it to less expensive hardware.
Delete the index once the required 30 day retention period is reached.
The technology is in beta stage yet, however is probably the way to go from now on.
Running curator is pretty light and easy.
Here you can find a Dockerfile, config and action-file.
https://github.com/zakkg3/curator
Also, Curator can help you if you need to (among others):
Add or remove indices (or both!) from an alias
Change shard routing allocation
Delete snapshots
Open closed indices
forceMerge indices
reindex indices, including from remote clusters
Change the number of replicas per shard for indices
rollover indices
Take a snapshot (backup) of indices
Restore snapshots
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html
Here is a typical action file for delete indices older than 15 days:
actions:
1:
action: delete_indices
description: >-
Delete indices older than 15 days (based on index name), for logstash-
prefixed indices. Ignore the error if the filter does not result in an
actionable list of indices (ignore_empty_list) and exit cleanly.
options:
ignore_empty_list: True
disable_action: True
filters:
- filtertype: pattern
kind: prefix
value: logstash-
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 15
I followed the elasticsearch-curator documentation to install the package:
https://www.elastic.co/guide/en/elasticsearch/client/curator/current/pip.html
Then I used the AWS base example of how to automate the indexes cleanup using the signed based authentication provided by requests_aws4auth package:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/curator.html
It worked like a charm.
You can decide to run this inside a lambda, docker or include it in your own DevOps cli.

Amazon redshift query aborts automatically after 1 hour

I have around 500GB compressed data in amazon s3. I wanted to load this data to Amazon Redshift. For that, I have created an internal table in AWS Athena and I am trying to load data in the internal table of Amazon Redshift.
Loading of this big data into Amazon Redshift is taking more than an hour. The problem is when I fired a query to load data it gets aborted after 1hour. I tried it 2-3 times but it's getting aborted after 1 hour. I am using Aginity Tool to fire the query. Also, in Aginity tool it is showing that query is currently running and the loader is spinning.
More Details:
Redshift cluster has 12 nodes with 2TB space for each node and I used 1.7 TB space.
S3 files are not the same size. One of them is 250GB. Some of them in MB.
I am using the command
create table table_name as select * from athena_schema.table_name
it stops exactly after 1hr.
Note: I have set the current query timeout in Aginity to 90000 sec.
I know this is an old thread, but for anyone coming here because of the same issue, I've realised that, at least for my case, the problem was the Aginity client; so, it's not related with Redshift or its Workload Manager, but only with such third party client called Aginity. In summary, use a different client like SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information, about my environment:
Redshift:
Cluster TypeThe cluster's type: Multi Node
Cluster: ds2.xlarge
NodesThe cluster's type: 4
Cluster Version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
I'm not sure if following answer will solve your exact problem of timeout at exactly 1 Hr.
But, based on my experience, in case of Redshift loading data via Copy command is best and fast way. SO I feel that timeout issue shouldn't happen at all in your case.
The copy command in RedShift could load data from S3 or via SSH.
e.g.
Simple copy
copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role
'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using Menifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: Even if you do it using Menifest and divide your data into Multiple files, it will be more faster as RedShift loads data in parallel.