I am trying to understand how code deployments work with AWS blue/green stacks. I have been reading a lot of documentation on it and the concept makes sense. I understand that we have two environments (stacks) that are replicas of each other. When I make changes and deploy them, the changes get applied to the inactive stack, and when I activate it, the inactive stack becomes active and the active stack becomes inactive.
However, I am not sure about the rollback process.
So my questions are:
If I am using a load balancer and, let's say, after I go to production some calls start to fail, but I know that my inactive stack (the old version) is still healthy, how would I go about manually telling the traffic to go to my inactive stack, or how do I make that stack active again with the old changes?
Is there a better way to do it using Route 53? Can I just change the DNS record to point to the DNS name of the old stack, and will that work?
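To make it concrete, here is roughly what I imagine the Route 53 change would look like with boto3 (the hosted zone ID, record name and load balancer DNS name are all made up); is this on the right track?

    # Sketch of the Route 53 rollback idea (boto3).
    # The zone ID, record name and load balancer DNS name are placeholders.
    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z1234567890ABC",
        ChangeBatch={
            "Comment": "Roll back: send traffic to the previous (still healthy) stack",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com.",
                    "Type": "CNAME",
                    "TTL": 60,  # a low TTL makes the switch take effect quickly
                    "ResourceRecords": [
                        {"Value": "old-stack-alb-123456.us-east-1.elb.amazonaws.com"}
                    ],
                },
            }],
        },
    )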
I am still new to AWS and trying my best to understand it, so if I didn't explain things clearly please let me know in the comments and I will try to clarify further.
I am a total beginner in cloud service management, so this is a very basic question.
I have inherited a Kubernetes-based project running in Google Cloud. I recently discovered that there are millions of errors I was unaware of in the APIs & Services > Compute Engine API > Metrics menu.
I have tried searching for these values both on Google and in the docs, to no avail. With no link to the list of logs and hundreds of sub-menu items, I feel completely lost on where to start.
How can I get more information about these errors?
How can I navigate to the relevant logs?
Your question is rather general so I will make some assumptions and educated guesses about your project and try to explain.
This level of errors in API calls is of course unusually high and suggests that some things don't work (for example, someone deleted a backend service but left the load balancer without any healthy backend behind it, so it's accepting requests from the outside but there's nothing in the backend to process them).
That is just an example; without more details I can't even speculate further.
If you want to read more about the messages, take the second one from the top: the documentation for compute.v1.BackendServicesService.delete.
You can also explore other Compute Engine API methods to see what they do, which will give you more insight into what is happening in your project.
This should give you a good starting point to explore the API.
Now, regarding logs: just navigate to the Logs Viewer and select as a resource whatever you want to analyse (everything, or a single VM, load balancer, firewall rule, etc.). You can also include (or exclude) certain log levels (warning, error, etc.). The possibilities are endless.
Your query may look something like this:
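For instance, a filter along these lines narrows the view down to Compute Engine errors. The resource type and severity are only an illustration, and the same filter string also works from the Cloud Logging Python client if you prefer to query from code:

    # Illustration only: list recent Compute Engine error entries.
    # The filter uses the same syntax as the Logs Viewer query box.
    from google.cloud import logging

    client = logging.Client()
    log_filter = 'resource.type="gce_instance" AND severity>=ERROR'

    for i, entry in enumerate(client.list_entries(filter_=log_filter,
                                                  order_by=logging.DESCENDING)):
        print(entry.timestamp, entry.severity, entry.payload)
        if i >= 19:  # only peek at the 20 most recent entries
            break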
Here's more documentation on GCP Logs Viewer to help you out.
We recently came across an issue with one of our cluster pods, which caused an outage in our application and impacted our customers.
Here is the thing: we were able to pull the gke.gcr.io/istio/operator:1.6.3 image from GCR, but it started failing overnight.
Eventually, we noticed that this image is no longer available in the public istio-release registry on gcr.io, causing an ImagePullBackOff failure. However, we are still able to find it on docker.io.
Having said that, we're sticking with the workaround of pulling the image from docker.io/istio/operator:1.6.3, which is straightforward enough for now. Nevertheless, we're still skeptical and wondering why this image suddenly vanished from gcr.io.
Has anyone been facing something similar?
Best regards.
I did some research but couldn't find anything related.
As I mentioned in the comments, I strongly suggest you keep all critical images in a private container registry. Using this approach you can avoid incidents like this one and gain some extra control over the images, such as versioning, security, etc.
There are many guides on the internet on setting up your own private container registry, such as Nexus; if you would rather use a managed service, you can try Google Container Registry.
Keep in mind that when you are working in a critical environment, you need to try to minimize the variables to keep your service as resilient as possible.
I noticed a short downtime in one of our services deployed to GKE and saw that istio-operator was listed with a red warning.
The log was:
Back-off pulling image "gke.gcr.io/istio/operator:1.6.4": ImagePullBackOff
Since istio-operator is a workload that GKE manages I was hesitant, but the downtime repeated a couple of times for a couple of minutes each, so I also edited the service YAML and updated the image to the Docker Hub one.
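For what it's worth, the change itself can be as small as repointing the container image at the Docker Hub copy. Below is a rough sketch using the Kubernetes Python client; the deployment, namespace and container names are assumptions, so check what your cluster actually calls them (and keep in mind GKE may reconcile managed workloads back):

    # Rough sketch: repoint the istio-operator image at the Docker Hub copy.
    # Deployment, namespace and container names are assumptions; verify them first.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "istio-operator",
                        "image": "docker.io/istio/operator:1.6.4",
                    }]
                }
            }
        }
    }

    apps.patch_namespaced_deployment(
        name="istio-operator", namespace="istio-operator", body=patch
    )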
I am using AWS for my cloud infrastructure, with ECS Fargate as my compute layer. I currently maintain 10-20 APIs which interact with members who have my application downloaded on their phones. Naturally, one or two of these APIs are my "main" APIs: they are the ones which are really personalised to my users and, honestly, the only two APIs which members really access (by navigating to those screens).
My business team wants to send push notifications to members to alert them about certain new events, which lands them on a screen where these APIs need to be called. Because of this, my application has mini crashes during these periods.
I've thought of a couple of ideas, but since this is obviously an issue across industries and a solved problem, I wanted to know the standard solutions.
The ideas I have:
Sending notifications in batches. This seems like the best solution, though it requires a bit of effort; I'm not sure how much.
Having serverless compute (AWS Lambda functions) serve the requests for those APIs which need to scale instantly. I keep a lot of my other APIs in Fargate because I don't want my Lambda functions to be too heavy and then take a while to start up.
Scaling machines all the time to handle the load I get during push notifications. This seems suboptimal for cost reasons.
Scaling machines up just during the periods when I want to send push notifications and then scaling them back down. This seems like a decent solution if I can automate the entire process: I could have a flow for each push notification which scales the system up and then starts sending the notifications.
Is there a better way to do this? This seems like a relatively straightforward problem for people to have, but I don't see much information on the topic.
I like your second option best because it's by far the easiest to manage (because you don't have to manage it). After that I'd go with your last option. I would use Step Functions to orchestrate this, where the first step is to scale up the number of tasks in Fargate. Once that has reached the desired level, you send the notifications. Add autoscaling to your services in Fargate to have it scale back down automatically.
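As a rough illustration of that first step (the cluster and service names and the task count below are placeholders), scaling the service up and waiting for it to stabilise before firing off the notifications could look something like this with boto3; a Step Functions state machine would simply call the equivalent of these steps:

    # Rough sketch of the "scale up, then notify" step (boto3).
    # Cluster/service names and the desired count are placeholders.
    import boto3

    ecs = boto3.client("ecs")

    # Bump the service up ahead of the push-notification burst.
    ecs.update_service(cluster="prod-cluster", service="main-api", desiredCount=10)

    # Wait until the new tasks are running and the service is steady.
    ecs.get_waiter("services_stable").wait(cluster="prod-cluster", services=["main-api"])

    # ...now trigger the push notifications; service autoscaling can bring
    # the task count back down afterwards.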
(disclaimer: new to AWS, a developer trying to pivot into DevOps)
During one interview/screening, I did a take-home challenge which asked me to provision a complex piece of infrastructure using their own script, which relied on Terraform and created a little cosmos of intertwined AWS resources. Unfortunately, I manually (yuck!) deleted the S3 bucket Terraform used to keep track of the state, so destroying the resources automatically is no longer possible and I need to clean it all up by hand.
Most of the things are cleared now, but there's a Security Group that's left over. Deleting it fails because it is attached to something called a Network Interface. Looking at that Network Interface, I found that the "Delete" button is greyed out but "Detach" is active; alas, it said I can't "Detach" it because I "lack permissions". Given that I'm logged into the console as the root of my AWS account, I don't buy it.
Does anyone know what these beasts are, and what the possible problems with killing them are? I suppose it's kind of like a connection between A and B, and if either end of the connection is plugged in you can't "kill" it; but what should I look for?
Got the bugger!
As I was clearing out the roles created by Terraform, I discovered a service-owned role; it had "RDS" in its name. Surprised (I thought I had killed those among the first), I went to check, and indeed there was an RDS instance lurking. After killing that, removing the network interface and the VPC (and the role) was unblocked.
It's interesting to me now: how come the error didn't mention the RDS instance as a blocker? All it could tell me was that the network interface was blocking it; now I guess I know who the owner of the attachment was: that RDS instance. But why the heck was it a "permission" issue for me? 'Force' should have dealt with it!
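For anyone stuck at the same point, one way to find out who is still holding on to a security group is to ask EC2 which network interfaces reference it; the description and requester of each interface usually reveal the owning service (RDS, an ELB, and so on). The group ID below is a placeholder:

    # List the network interfaces that still reference a security group (boto3).
    # The security group ID is a placeholder.
    import boto3

    ec2 = boto3.client("ec2")

    resp = ec2.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": ["sg-0123456789abcdef0"]}]
    )

    for eni in resp["NetworkInterfaces"]:
        # The description/requester usually name the owner, e.g. an RDS instance.
        print(eni["NetworkInterfaceId"], eni.get("Description"), eni.get("RequesterId"))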
I have a PHP-based application on one Amazon EC2 instance for uploading and transcoding audio files. The application first uploads the file, then transcodes it, and finally puts it in an S3 bucket. At the moment the application shows the progress of uploading and transcoding through repeated AJAX requests that monitor the file size in a temporary folder.
I keep wondering what happens if users rush to my service tomorrow and I need to scale it in whatever way AWS allows.
A: What will happen to my upload and transcoding technique?
B: If I add more instances, does that mean I will have different files in different temporary conversion folders in different physical places?
C: If I want to get the file size by AJAX from http://www.example.com/filesize until the process finishes, do I need the real address of each EC2 instance (I mean IP or DNS), or access to all of the instances' folders (or a single folder)?
D: When we scale, what happens to the temporary folder? Is it correct that all of the instances, apart from their own LAMP stack, point to one root folder on the main instance?
I have some basic knowledge of scaling with other hosting techniques, but with Amazon these questions are on my mind.
Thanks for any advice.
It is difficult to answer your questions without knowing considerably more about your application architecture, but given that you're using temporary files, here's a guess:
A: Your ability to scale depends entirely on your architecture, and of course on having a wallet deep enough to pay for it.
B: Yes. If you're generating temporary files on individual machines, they won't be stored in a shared place the way you currently describe it.
C: Yes. You need some way to know where the files are stored. You might be able to get around this with an ELB stickiness policy (i.e. traffic through the ELB gets routed to the same instance), but stickiness policies are kind of a pain and won't necessarily solve your problem.
D: Not quite sure what the question is here.
As it sounds like you're in the early days of your application, give these two tutorials a peek: the first describes a thumbnailing service built on Amazon SQS, the second a video-processing one. They'll help you design with AWS best practices in mind and avoid many of the issues you're worried about now.
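To give a flavour of the queue-based pattern those tutorials use (the queue URL and message fields below are made up), the upload handler simply drops a job onto a queue, and separately scaled workers pick it up and do the transcoding:

    # Sketch of a queue-based transcoding pipeline (boto3 + SQS).
    # The queue URL and message fields are invented for illustration.
    import json
    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs"

    # The upload handler enqueues a job instead of transcoding in place.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"s3_key": "uploads/audio-123.wav", "user_id": "42"}),
    )

    # A separate worker fleet polls the queue and writes results back to S3.
    jobs = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=10)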
One way you could get around scaling and session stickiness is to have the transcoding process update a database with its current progress. Any returning user checks the database to see the progress of their upload. There's no need to keep track of where the transcoding is taking place since the progress is stored in a single place.
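As a small sketch of that idea, using DynamoDB purely as an example store (the table and attribute names are invented): the transcoding worker writes its progress, and any web instance can answer the AJAX polling request from the same table:

    # Sketch of storing transcoding progress centrally (boto3 + DynamoDB).
    # Table and attribute names are invented for illustration.
    import boto3

    table = boto3.resource("dynamodb").Table("transcode-progress")

    # The worker doing the transcoding updates progress as it goes...
    table.put_item(Item={"upload_id": "audio-123", "percent_done": 40})

    # ...and any web instance can serve the AJAX progress request from here.
    item = table.get_item(Key={"upload_id": "audio-123"}).get("Item")
    print(item)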
However, like Christopher said, we don't really know anything about your application; any advice we give is really looking from the outside in, and we don't have a good idea of what would be the easiest thing for you to do. This seems like a pretty simple solution, but I could be missing something because I don't know anything about your application or architecture.