Elasticsearch indices keep getting lost - amazon-web-services

I have an Elasticsearch cluster on an AWS EC2 instance, a t3.small with 2 vCPUs and 2 GB of RAM. I have installed Elasticsearch and Kibana, plus Heartbeat and Metricbeat as extensions. My source database is MongoDB and all my data is NoSQL; I feed the engine with a script from a MongoDB cluster on my local machine, and I run queries from my app and from the console. So far so good, everything works, although the cluster is always yellow, never green.
The problem starts after hitting the engine with multiple requests. After 50 or 60 search queries the data just disappears: somehow the engine is dropping my indices, and since I have no snapshot and no restore point, I cannot recover the data and I keep losing it. I have to feed the engine manually again and again. At first I had 1 GB of RAM, so I thought upgrading would fix the issue, but after upgrading to 2 GB of RAM it didn't stop; the data just stays around a bit longer.
So here is my setup:
I have 70K+ NoSQL documents,
which contain text and geo_point fields.
I make POST requests to the engine from my front-end application.
I don't have Logstash installed, and Metricbeat is not showing any error logs.
The whole Elasticsearch setup is for testing purposes; this is not production.
We will upgrade when we go to production.
So I need to know:
what is causing this, and
how can I prevent this data loss?
Please help me, or just suggest how to solve this problem.
Thank you

Ideally, the first thing you should do is get the cluster to green.
To see the exact Elasticsearch error that is causing this situation, look at the elasticsearch.log file; it will contain the exact error.
One way to keep cluster data safe is to take regular snapshots and restore them in case of data loss. Details of the snapshot procedure can be found here.
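For example, you can ask Elasticsearch directly why it is yellow; a minimal sketch in Python, assuming the cluster is reachable on localhost:9200 with no authentication:

```
import requests

ES = "http://localhost:9200"

# Overall health: look at "status" and "unassigned_shards".
print(requests.get(f"{ES}/_cluster/health?pretty").text)

# Explain the first unassigned shard Elasticsearch finds
# (this call returns an error if there are no unassigned shards).
print(requests.get(f"{ES}/_cluster/allocation/explain?pretty").text)
```

On a single-node cluster, yellow often just means replica shards have no second node to live on, which is a separate issue from the disappearing data.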
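For example, registering a filesystem snapshot repository and taking a snapshot is just two API calls; a minimal sketch where the repository path and names are placeholders, and the path must be listed under path.repo in elasticsearch.yml:

```
import requests

ES = "http://localhost:9200"

# Register a shared filesystem repository. An S3 repository via the
# repository-s3 plugin works the same way with "type": "s3".
requests.put(
    f"{ES}/_snapshot/my_backup",
    json={"type": "fs", "settings": {"location": "/mnt/es-backups"}},
)

# Take a snapshot of all indices and wait for it to finish.
requests.put(f"{ES}/_snapshot/my_backup/snapshot_1?wait_for_completion=true")

# Restore after data loss:
# requests.post(f"{ES}/_snapshot/my_backup/snapshot_1/_restore")
```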

Related

Aurora MySQL Serverless V1 to V2 Migration Doesn't Migrate - Rolls Back Version During Process

This is driving me insane, so hopefully someone can help.
I am attempting to upgrade/migrate an Aurora MySQL Serverless instance from V1 to V2 utilizing the process found in the documentation. When I reach step 4...
Restore the snapshot to create a new, provisioned DB cluster running Aurora MySQL version 3 that's compatible with Aurora Serverless v2, for example, 3.02.0.
... the database that results from the restored snapshot is Aurora v2 again, even though the cluster was v3 until the database was created. This means that I can't change it to Serverless V2 (I hate how confusing these version numbers are...).
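For reference, the restore call from step 4 looks roughly like this in boto3 (the identifiers and the exact engine version string are placeholders for my real ones):

```
import boto3

rds = boto3.client("rds")

# Restore the Serverless v1 snapshot into a new provisioned cluster
# running Aurora MySQL version 3 (identifiers/version are placeholders).
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="my-upgraded-cluster",
    SnapshotIdentifier="my-serverless-v1-snapshot",
    Engine="aurora-mysql",
    EngineVersion="8.0.mysql_aurora.3.02.0",
)
```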
I've tried several different tiers and types of provisioned databases for the interim copies, and I've tried using the CLI tool in case it was an issue with the GUI, and I get the same result every time.
Has anyone run into this? Am I just missing something? I'm pretty much at a complete loss here, so any help is appreciated.
I'm not entirely sure what happened initially, but trying again on a different day resulted in an error log. Previously, no logs were coming through on the migrated instance. It may have been my impatience then, but at least now I have an answer.
In my case, it was some corrupted views, part of some legacy code, that were blocking the migration. If anyone else runs into this, make sure you give the log files time to generate, and look at the upgrade_prechecks.log file to see what the actual errors are.
More information about the logs and how to find them can be found in the official documentation.
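If it helps, the logs can also be pulled without the console; a rough boto3 sketch (the instance identifier is a placeholder, and the exact log file name may differ slightly in your case):

```
import boto3

rds = boto3.client("rds")
instance = "my-upgrade-target-instance"  # placeholder

# List the log files the instance has produced; the prechecks log shows up
# here once the engine has had time to write it.
for log in rds.describe_db_log_files(DBInstanceIdentifier=instance)["DescribeDBLogFiles"]:
    print(log["LogFileName"])

# Download the prechecks log and look for the failing objects.
portion = rds.download_db_log_file_portion(
    DBInstanceIdentifier=instance,
    LogFileName="upgrade_prechecks.log",  # name as seen in my case
)
print(portion["LogFileData"])
```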

Pyspark job freezes with too many vcpus

TLDR: I have a pyspark job that finishes in 10 minutes when I run it on an EC2 instance with 16 vCPUs, but freezes (it doesn't fail, it just never finishes) if I use an instance with 20 or more vCPUs. I have tried everything I could think of and I just don't know why this happens.
Full story:
I have around 200 small pyspark jobs that, for cost and flexibility reasons, I execute using AWS Batch with Spark Docker images instead of EMR. Recently I decided to experiment to find the best configuration for those jobs and noticed something weird: a job that finished quickly (around 10 minutes) with 16 vCPUs or fewer would just never end with 20 or more (I waited for 3 hours). First I thought it could be a problem with Batch or the way the ECS agent manages the task, so I tried running the Docker container directly on an EC2 instance and had the same problem. Then I thought the problem was with the Docker image, so I tried creating a new one:
The first one used Spark installed from the AWS Glue compatible build (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz)
The new one was Ubuntu 20 based, with Spark installed from the Apache mirror (https://apache.mirror.digitalpacific.com.au/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz)
The same thing happened. Then I decided the problem was with using Docker at all, so I installed everything directly on the EC2 instance; same result. I tried changing the Spark version; same thing. I thought it could be a hardware issue with blocking too many threads, so I switched to an AMD instance; nothing changed. I tried modifying some configurations, such as the amount of memory used by the driver, but it always ends the same way: with 16 vCPUs it works, with more than that it stops.
Other details:
According to the logs it seems to always stop at the same point: a parquet read operation from S3. But the parquet file is very small (around 1 MB), so I don't think that is the actual problem.
After that it still logs sometimes, but nothing really useful, just "INFO ContextCleaner: Cleaned accumulator".
I use s3a to read the files from s3.
I don't get any errors or spark logs.
I appreciate any help on the matter!
Stop using the Hadoop 2.7 binaries. They are woefully obsolete, especially for S3 connectivity. Replace all the Hadoop 2.7 artifacts with Hadoop 2.8 ones, or, preferably, Hadoop 3.2 or later, with consistent dependencies.
Set `spark.hadoop.fs.s3a.experimental.fadvise` to `random`.
If you still see problems, see if you can replicate them on hadoop 3.3.x, and if so: file a bug.
(advice correct as of 2021-03-09; the longer it stays on SO unedited, the less it should be believed)
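For example, in PySpark that setting can be applied when building the session; a minimal sketch (the bucket and path are placeholders, and the Hadoop/AWS jars must match your Spark build):

```
from pyspark.sql import SparkSession

# Minimal sketch: apply the s3a random-read setting when building the session.
spark = (
    SparkSession.builder
    .appName("s3a-random-read")
    .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    .getOrCreate()
)

# Placeholder path; the read that was previously hanging.
df = spark.read.parquet("s3a://my-bucket/path/to/file.parquet")
df.show()
```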

Optimal way to sync NoSQL (DynamoDB) with Elasticsearch (not the AWS managed one) in a local environment

I currently have AWS DynamoDB (port 8000) and Elasticsearch (port 9200) + Kibana (port 5601) running locally. I pretty much spent half a day figuring out how to sync these services together. For context, I have a Next.js app that's integrated with both clients; however, when I upsert data into DynamoDB, I'd like it synced to Elasticsearch right away.
Note: for upper environments I plan on using AWS Lambdas integrated with DynamoDB Streams to update indexes in Elasticsearch instances (I am not a fan of the AWS Elasticsearch managed service).
Here is what I have tried so far for local-env syncing:
Logstash with the DynamoDB plugin (https://github.com/awslabs/logstash-input-dynamodb) - the repo hasn't been updated in 4 years and there are open issues about supporting newer versions of ES. It's been a headache all morning trying to get the thing running because of this and other issues --> https://github.com/awslabs/logstash-input-dynamodb/issues/10
Result: it's a lost cause as far as I can tell... I've even tried their Docker images, but no luck.
node-scheduler: oh jeez... given I'm using Next.js, I'd have to create a custom server just to get this extra piece synced... not to mention it removes a lot of the important features of Next.js -> https://nextjs.org/docs/advanced-features/custom-server
Host the development DynamoDB in AWS (not locally) + provision ES on EC2 and point the local app there. I think this is overkill, since now pretty much everything is hosted except my app, which could be OK. Let me know your thoughts, as I already have a dev environment using a separate Cognito pool to authenticate.
Just chain ES calls to any updates made on DynamoDB (rough sketch below). Can someone tell me why we can't do this, locally and in general? I get that data can get out of sync, but maybe we could run hourly, or even twice-a-day (off-peak), cron jobs to bulk update the DynamoDB records into ES.
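For illustration, here's roughly what I mean by option 4 in my local setup (the table name, index name, and sample item are placeholders, and error handling is omitted):

```
import boto3
import requests

# Local DynamoDB on 8000, local Elasticsearch on 9200.
dynamodb = boto3.resource("dynamodb", endpoint_url="http://localhost:8000")
table = dynamodb.Table("products")          # placeholder table name
ES = "http://localhost:9200/products/_doc"  # placeholder index name

def upsert(item):
    # Write to DynamoDB first, then mirror the document into Elasticsearch.
    table.put_item(Item=item)
    requests.put(f"{ES}/{item['id']}", json=item)

upsert({"id": "123", "name": "widget", "price": 10})
```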
Would really appreciate your thoughts. Thanks!

Inconsistent RDS query results using Elastic Beanstalk + flask-security + sqlalchemy

I have a Flask app running on Elastic Beanstalk, using flask-security and plain SQLAlchemy. The app's data is stored in an RDS instance living outside of EB. I am following the flask-security Quick Start with a session, which doesn't use flask-sqlalchemy. Everything is free tier.
Everything is up and running, but I've encountered two issues:
After a DB insert of a certain type, a view that reads all objects of that type gives alternating good/bad results. Literally, on odd refreshes I get a read that includes the newly inserted object, and on even refreshes the newly inserted object is missing. This persists for as long as I've tried refreshing (dozens of times... several minutes).
After going away for an afternoon and coming back, the app's DB connection is broken. I feel like my free-tier resources are going to sleep and the example code is not recovering well.
I'm looking for pointers on how to debug this, a more robust starting code example, or suggestions on what else to try.
I may try switching to flask-sqlalchemy (perhaps getting better session handling), or dropping flask-security for flask-login (a downgrade in functionality... sniff).
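Another thing I may try, in case idle connections are being dropped, is enabling pre-ping/recycling on the plain SQLAlchemy engine; a rough sketch (the database URL and timings are placeholders):

```
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# pool_pre_ping tests a connection before handing it out; pool_recycle drops
# connections older than the idle timeout. URL and timings are placeholders.
engine = create_engine(
    "postgresql://user:password@my-rds-endpoint:5432/mydb",
    pool_pre_ping=True,
    pool_recycle=280,
)
db_session = scoped_session(sessionmaker(bind=engine))
```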

Django pages loading very slowly in EC2

I'm at a loss here.
I am attempting to transfer a Django application to EC2. I have moved the DB to RDS (Postgres) and have static and media files on S3.
However, for some reason all my pages are taking 25-30 seconds to load. I have checked the instance, and CPU and memory barely blip. I turned off KeepAlive in Apache and changed WSGI to run in daemon mode, but neither made any difference. I have gone into the shell on the machine and accessed the DB, and that appears to respond fine as well. I have also increased the EC2 instance size, with no effect.
S3 items are also being delivered quickly and without issue. Only the rendering of the HTML is taking a long time.
On our current live and test servers there are no such issues, and the pages load in milliseconds.
Can anyone point me to where or what I should be looking at?
Marc
The issue appeared to be connected to using RDS. I installed Postgres on the EC2 instance and, apart from a little mucking around, it worked fine there.
I'm going to try building a new RDS instance, but that was the issue here. Strange that it worked OK directly via manage.py shell.
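For anyone debugging something similar, a quick way to compare raw round-trip latency to whichever database Django is pointed at is to time a trivial query from manage.py shell (a small sketch; the query is just an example):

```
# Run inside `python manage.py shell` to time one round trip to the
# database currently configured in settings.py.
import time
from django.db import connection

start = time.perf_counter()
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    cursor.fetchone()
print(f"round trip: {(time.perf_counter() - start) * 1000:.1f} ms")
```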