How to concatenate multiple log files from multiple EC2 servers?

So I'm running nginx on three EC2 servers, all in different locations (US, EU, Asia). I want to execute a Perl script every day on the joined log files (each EC2 server holds an nginx log in /var/log/nginx/access.log).
It seems Amazon's CloudWatch has some similar abilities, but then again I'm reading about pushing each log to an S3 location. What is the easiest way to accomplish this?

I have been amazed at the cost, performance and search capabilities of log aggregator services like Papertrail for these kinds of problems.
We have 30 instances of all types running Windows, with NXLog configured on each. Any time we spin up an instance, its logs are immediately captured by the Papertrail syslogd service. I cannot imagine running cloud services without a log aggregator.
The searching and archiving is great. Papertrail has a free plan with 100 MB/month, 48-hour search and 7-day archiving.
Disclaimer: Not related to Papertrail, just a happy customer.
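If you would rather keep the daily script and do the joining yourself, the merge step can be sketched as below. This is only a sketch: it assumes the three access.log files have already been copied to one machine (for example via S3), that they use nginx's default combined log format with a bracketed timestamp, and that each file is already in time order. The names line_time and merge_logs are made up for illustration.

```python
import heapq
import re
from datetime import datetime

# Matches the bracketed timestamp of nginx's default "combined" log format,
# e.g. [10/Oct/2023:13:55:36 +0000]. Assumption: the default format is in use.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def line_time(line):
    """Return the timezone-aware timestamp of one access log line."""
    match = TIMESTAMP.search(line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*sources):
    """Merge several already-sorted iterables of log lines into one time-ordered list.

    Timestamps carry their UTC offset, so servers logging in different
    local time zones (US, EU, Asia) still interleave correctly.
    """
    return list(heapq.merge(*sources, key=line_time))
```

Open file objects can be passed directly as the sources, since iterating a file yields its lines; the merged output can then be written to one file for the Perl script to read.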

Related

Best way to run a script that takes a long time (48 hours) to run on AWS?

I am an academic researcher. I need to get data from a social media platform for a large number of users. Due to API restrictions, it takes a very long time (~48 hours) to get this data for all users. As of now, I write this data to a CSV file as I go with one line per user.
My lab has access and many credits for AWS. Assuming this script just needs to be run once a week, what is the best way to do it in AWS? And I assume I should use a database instead of a CSV file -- what options are there for setting that up?
There are several ways you can achieve this:
You could set up a cron job on an instance you keep deployed throughout. This could be a very small, inexpensive instance like t2.small or t3.medium. The job runs once a week.
If you don't want to keep the instance deployed, write a small script that creates an EC2 instance every week and puts your script on the instance, where it runs. The script itself can then send AWS the request to terminate the instance on a successful run (not recommended).
If you can break your task into steps, AWS Lambda is the way to go. Look at Step Functions.
I would recommend looking at other solutions, e.g. AWS Fargate. It is serverless and allows running arbitrary long tasks in Docker containers.
Furthermore, a handy workaround (usually) to the API restriction is to ping from multiple clients or IPs and merge the results later.
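To make the "break your task into steps" option more concrete, here is a rough sketch of splitting the user list into batches that each fit inside a single Lambda invocation. It relies on Lambda's maximum timeout of 15 minutes; the function name plan_batches, the safety margin and the per-user timing are assumptions for illustration, and the orchestration (for example Step Functions iterating over the batches) is not shown.

```python
LAMBDA_MAX_SECONDS = 900  # Lambda's maximum timeout is 15 minutes


def plan_batches(user_ids, seconds_per_user, safety_margin=0.8):
    """Split user_ids into batches that should finish within one invocation.

    seconds_per_user is an estimate of how long the API takes per user
    (including rate-limit waits); safety_margin keeps some headroom.
    """
    budget = LAMBDA_MAX_SECONDS * safety_margin
    batch_size = max(1, int(budget // seconds_per_user))
    return [
        user_ids[start:start + batch_size]
        for start in range(0, len(user_ids), batch_size)
    ]
```

Each batch then becomes one step, and a failed step can be retried without redoing the other batches.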

Shifting/Migrating PHP Codeigniter project to AWS

Our company has a software product consisting of a web app plus Android and iOS apps.
We have more than 350 clients, that is, we have more than 350 databases (MySQL), one per client, and one code repository (PHP CodeIgniter). When a new client purchases our software, we just copy the old empty database and the client is able to use the software. This is our architecture.
Now we are planning to shift to AWS, but we do not know which AWS services we really need for this type of architecture.
We use CodeIgniter 3.1, PHP 7 and MySQL.
You can implement this sort of system on a single EC2 instance, simply installing the same software as you have on your current server. However in this case you are likely better off to host it somewhere cheaper than AWS.
However, what I recommend is that you implement it using RDS, EC2, S3 and Cloudfront.
RDS
I recommend to run your database on RDS:
the database server competes over completely different resources than PHP, so if you run into performance problems, it is impossible to figure out what is happening when database and PHP are on the same instance. A lack of CPU can lead to a lack of memory and vice versa.
built-in point-in-time recovery for up to 35 days has saved my bacon many many times and is great when you have a bug that is hard to reproduce or when someone (you) has accidentally deleted a large amount of data
On top of this I recommend to also go for Aurora for MySQL instead of MySQL RDS, especially as I expect your database size on disk to be smaller than 50GB:
On MySQL RDS you need to commission at least 100GB of disk to get good enough performance for production. 100GB gives you 100x50kb per second on the EBS disks that are used.
By comparison, on AWS Aurora you get the read performance of 6 different storage locations without having to commit to any amount of disk space. This saves money and is more performant
Aurora is also much faster in restoring point in time as well as with "dumb" queries, ie. table scans.
EC2
I recommend to look at nothing older than the t3, c5 or m5 instances, as they have the new "nitro hypervisor" and are significantly faster, while being cheaper. From experience you can go down a notch from your existing CPU count with these instances
If you can, use c6/m6/t4 instances
I also found c5a and equivalents to be just as performant
AWS recommends to always use auto-scaling, but if you are coming from a single server somewhere else you are already winning because you can restore within minutes.
Once you hit $600 per month in EC2 charges, definitely look at autoscaling. Virtually every webapp can be written in a way that allows for a server to be replaced at any point in time. With auto scaling you can then use Spot instances at 50-90% discount for your 2nd/3rd etc instance and save serious money.
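As a rough illustration of the Spot saving, the arithmetic looks like the sketch below. The prices are hypothetical and the function name is made up; the only inputs taken from the text above are the idea of running the first instance on demand and the further ones on Spot at a discount.

```python
def monthly_ec2_cost(on_demand_monthly, instance_count, spot_discount):
    """Monthly cost with the first instance on demand and the rest on Spot.

    on_demand_monthly: hypothetical on-demand price of one instance per month
    spot_discount:     fraction saved on Spot, e.g. 0.5 for a 50% discount
    """
    if instance_count < 1:
        raise ValueError("need at least one instance")
    spot_monthly = on_demand_monthly * (1 - spot_discount)
    return on_demand_monthly + (instance_count - 1) * spot_monthly
```

With a hypothetical $200 instance, three instances cost $600 all on demand, while monthly_ec2_cost(200, 3, 0.5) gives 400.0.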
S3
Store all customer provided files on S3, DO NOT get into a shared file system.
This is much cheaper than any disk or file system and has numerous automation features, such as versioning, cross-region backup, archiving, event triggers etc.
do not ever make your bucket publicly accessible.
Cloudfront
The key benefit of storing all customer provided files on S3 is that you can serve them with Cloudfront without paying for CPU. Cloudfront only charges for traffic delivered. S3 only charges for space used. Every file delivered through Cloudfront does not use your server's CPU, sockets, network bandwidth. On top of this transfer from EC2 to S3 and from S3 to Cloudfront is free of charge. You are only charged for the traffic you already had to pay for anyway.
You need to secure your clients' files properly with signed URLs or signed cookies. For this you can either create a separate S3 bucket for each client or use one single bucket.
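For the single-bucket variant, one possible layout is a key prefix per client, so that a signed URL or cookie is only ever issued for keys under that client's own prefix. The sketch below shows just that key handling; the "clients/<id>/" layout and both function names are assumptions, and the actual signing is left out.

```python
import posixpath


def object_key(client_id, filename):
    """Build the S3 key for a client's uploaded file (hypothetical layout)."""
    # basename drops any directory parts smuggled into the uploaded name
    safe_name = posixpath.basename(filename.replace("\\", "/"))
    if not safe_name or safe_name in (".", ".."):
        raise ValueError("invalid filename")
    return f"clients/{client_id}/{safe_name}"


def may_access(client_id, key):
    """True only if the key lives under this client's own prefix."""
    return key.startswith(f"clients/{client_id}/")
```

The trailing slash in the prefix matters: without it, client 4 would match keys belonging to client 42.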
Bonus: SQS
Many things in a web application do not need to be done right now. They can wait a bit, sometimes a couple of hundred milliseconds, sometimes minutes or hours.
For anything that can wait, I recommend implementing a background process that reads from an SQS queue. Your web application will need minimal time to push the work required and its parameters into an SQS queue. Your background process can then work on it in (rough) order of entry into the queue. When you use your normal web servers to process the background queues you are already getting a better distribution of server load over time. This is because you cannot control the number of web requests, but you can control the speed at which you process background items (to a degree of course).
Later, when you have a lot of background processing and a lot of traffic, you can consider using different servers for background processing.
There are also lots of ways of how you can hook other event driven code onto the items that go into your queue, including monitoring for limits exceeded for certain items etc.
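The shape of that pattern can be sketched as follows. Note that this uses Python's in-process queue.Queue purely as a stand-in for SQS so the example runs anywhere; the names enqueue and drain and the message layout are assumptions, not an AWS API.

```python
import json
import queue

jobs = queue.Queue()  # stand-in for an SQS queue, NOT the real service


def enqueue(job_type, **params):
    """Web request side: push the work and its parameters, then return at once."""
    jobs.put(json.dumps({"type": job_type, "params": params}))


def drain(handlers, max_items=10):
    """Background side: process at most max_items jobs per run.

    max_items is the knob that controls how fast background work is consumed.
    """
    results = []
    for _ in range(max_items):
        try:
            message = jobs.get_nowait()
        except queue.Empty:
            break
        job = json.loads(message)
        results.append(handlers[job["type"]](**job["params"]))
    return results
```

Serialising each job to JSON mirrors the fact that a real queue carries text messages rather than live objects, which keeps the move to separate worker servers simple later on.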

Collectd on AWS

We have instances setup in an autoscale group on AWS. We want to collect the metrics in order to determine our scalability needs. Collectd, so far I know that it collects the stats in the same machine and puts it all in RRD files. However, in a scenario of an autoscale cluster, if another instance is spawned and assuming the AMI from which it has been spawned already has collectd, how are we supposed to gather the stats of that second instance in the group? It might just stay up for five to six minutes and go down, but we would need the logs before it goes down. Any way by which we can club these logs for the same cluster or something similar? Or if collectd can make it report somewhere online?
Found the answer: this can be done using the client-server architecture of collectd. More details can be found in the collectd documentation.

Will my current AWS architecture scale to 20,000 visitors per day? How can I improve it?

The site I'm working on will potentially get 20,000 visitors per day. It's no guarantee, but it's supposed to be used everyday by each employee in an organisation.
In the past I've just used a single t2.micro EC2 instance with an attached EBS volume to host sites, which has always been enough because these sites don't get a lot of traffic. But with 20,000 visitors a day how could I improve my AWS architecture to scale?
The site is going to have a profile for each user, including a profile picture - so potentially 20,000 image files. Should I be writing these to an S3 bucket instead of to the EBS?
Would a t2.micro ec2 instance cope with the scale, or should I be using a t2.small, t2.medium or even t2.large?
My MySQL databases are currently on the EBS volume, but should I use RDS?
All the users are in the UK, so I'm assuming using CloudFront is overkill?
You're right to assume CloudFront is overkill since all your users are localized to UK.
Update: using a CDN will take a lot of stress off your servers by caching the files rather than processing them each time a call is made.
Look at it this way, if you get 100,000 hits a day, and 90% of those hits are cached and served by the CDN, then your server only has to process 10,000 hits a day. That could be the difference between needing a m4.xlarge versus just needing a t2.small.
(quoting @mark-b)
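The arithmetic in that quote is simple enough to play with for your own numbers. A minimal sketch, where the function name is made up and the 90% cache hit rate is the quote's example figure rather than a measured one:

```python
def origin_hits(total_hits, cache_hit_percent):
    """Requests per day that still reach your own server behind a CDN."""
    return total_hits * (100 - cache_hit_percent) // 100
```

origin_hits(100_000, 90) returns 10000, matching the quote; at the same hit rate the 20,000 daily hits from the question would leave 2000 for the server.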
Use the Ireland region (and soon you can copy over to the UK region)
If you want to keep your database on your instance I would highly recommend a bit bigger one. As for a quick and easy solution, start up the smallest T series instance with EBS, beta test with 1000-5000 users, see how that goes. Notify the select group all their crap will disappear so don't invest a bunch o' time.
Next, get your analytics on the system and see if that will work times 4-5 more users. For SQL DB stuff you'll eventually want a M series instance I believe.
Also, you could always create a load-balanced fleet. You do this in Elastic Beanstalk by choosing Load Balanced instead of Single Instance. Create an auto scaling group and boom sauce - check that off.
As for the images, yeah I would recommend S3. Don't really want to dump the whole amount in i/o cause DB, hits, i/o, all on one instance is a lot.
Lastly, if you do plan on going to the UK region (not positive if that's been rolled out yet) I would recommend sectioning all the pieces of your application. This is really good practice to use all the resources they provide.
For a very fault tolerant system:
EC2 fleet (m or c series) with an ELB
S3 the images
RDS the users
CloudWatch the stats
Tenancy the users with IAM groups
Authenticate with STS or AD or whatever (kinda been in the cognito only recently)
Store their session and authenticated crap in ElastiCache - Redis
Keep tabs on them with Kinesis (optional)
And let them search each other with CloudSearch (also optional)
Boss system right there!
And that's if you want to spend a bunch o' cash but have a sweet sweet system. If you want to spend next to nothing, make it serverless. A broad question asked with hundreds of combinations so this is up to interpretation.
Hope this helps!

need some guidance on usage of Amazon AWS

every once in a while i read/hear about AWS and now i tried reading the docs.
But such docs seem to be written for people who already know which AWS they need to use and only search for how it can be used.
So, for myself, to understand AWS better i try to sketch a hypothetical Webapplication with a few questions.
The app's purpose is to modify content like videos or images. So a user has some kind of web interface where he can upload his files and choose some settings, and a server grabs the file and modifies it (e.g. re-encoding). The service also extracts the audio track of a video and tries to index the spoken words so the customer can search within his videos. (Well, it's just hypothetical.)
So my questions:
given my own domain 'oneofmydomains.com' is it possible to host the complete webinterface on AWS? i thought about using GWT to create the interface and just deliver the JS/images via AWS, but which one, simple storage? what about some kind of index.html, is there an EC2 instance needed to host a webserver which has to run 24/7 causing costs?
now the user has the interface with a login form, is it possible to manage logins with an AWS? here i also think about an EC2 instance hosting a database, but it would also cause costs and im not sure if there is a better way?
the user has logged in and uploads a file. which storage solution could be used to save the customers original and modified content?
now the user wants to browse the status of his uploads, this means i need some kind of ACL, so that the customer only sees his own files. do i need to use a database (e.g. EC2) for this, or does amazon provide some kind of ACL, so the GWT webinterface will be secure without any EC2?
the customers files are reencoded and the audio track is indexed. so he wants to search for a video. Which service could be used to create and maintain the index for each customer?
hope someone can give a few answers so i understand AWS better on how one could use it
thx!
Amazon AWS offers a whole ecosystem of services which should cover all aspects of a given architecture, from hosting to data storage, or messaging, etc. Whether they're the best fit for purpose will have to be decided on a case by case basis. Seeing as your question is quite broad I'll just cover some of the basics of what AWS has to offer and what the different types of services are for:
EC2 (Elastic Compute Cloud)
Amazon's cloud solution, which is basically the same as older virtual machine technology, but the 'cloud' offers additional bells and whistles such as automated provisioning, scaling, billing etc.
you pay for what you use (by the hour); the basic instance (single CPU, 1.7GB RAM) would probably cost you just under $3 a day if you run it 24/7 (on a Windows instance, that is)
there's a number of different OS to choose from including linux and windows, linux instances are cheaper to run without the license cost associated with windows
once you've set up the server the way you want, including any server updates/patches, you can create your own AMI (Amazon Machine Image) which you can then use to bring up another identical instance
however, if all your html are baked into the image it'll make updates difficult, so normal approach is to include a service (windows service for instance) which will pull the latest deployment package from a storage (see S3 later) service and update the site at start up and at intervals
there's the Elastic Load Balancer (which has its own cost but only one is needed in most cases) which you can put in front of all your web servers
there's also the Cloud Watch (again, extra cost) service which you can enable on a per instance basis to help you monitor the CPU, network in/out, etc. of your running instance
you can set up AutoScalers which can automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization is less than 50% for 5 mins, bring up 1 instance at a time if average CPU goes beyond 70% for 5 mins
you can use the instances as web servers, use them to run a DB, or a Memcache cluster, etc. choice is yours
typically, I wouldn't recommend having Amazon instances talk to a DB outside of Amazon because the round trip is much longer; the usual approach is to use SimpleDB (see below) as the database
the AmazonSDK contains enough classes to help you write some custom monitor/scaling service if you ever need to, but the AWS console allows you to do most of your configuration anyway
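The auto scaling rule mentioned in the list above (remove one instance below 50% average CPU over 5 minutes, add one above 70%) boils down to a small decision function. This is only a sketch of the logic, not how the AWS service is configured; the function name and the way samples are passed in are assumptions.

```python
def scaling_decision(cpu_samples, low=50.0, high=70.0):
    """Decide how many instances to add or remove.

    cpu_samples: average CPU utilisation readings (percent) covering the
    last 5 minutes. Returns +1 to add an instance, -1 to terminate one,
    0 to leave the group alone.
    """
    if not cpu_samples:
        raise ValueError("need at least one CPU sample")
    average = sum(cpu_samples) / len(cpu_samples)
    if average > high:
        return 1
    if average < low:
        return -1
    return 0
```

Keeping a gap between the two thresholds avoids the group flapping between adding and removing instances when load hovers around a single cut-off.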
SimpleDB
Amazon's non-relational, key-value data store, compared to a traditional database you tend to pay a penalty on per query performance but get high scalability without having to do any extra work.
you pay for usage, i.e. how much work it takes to execute your query
extremely scalable by default, Amazon scales up SimpleDB instances based on traffic without you having to do anything, AND any control for that matter
data are partitioned in to 'domains' (equivalent to a table in normal SQL DB)
data are non-relational; if you need a relational model then check out Amazon RDS. I don't have any experience with it, so I'm not the best person to comment on it.
you can execute SQL like query against the database still, usually through some plugin or tool, Amazon doesn't provide a front end for this at the moment
be aware of 'eventual consistency', data are duplicated on multiple instances after Amazon scales up your database, and synchronization is not guaranteed when you do an update so it's possible (though highly unlikely) to update some data then read it back straight away and get the old data back
there's 'Consistent Read' and 'Conditional Update' mechanisms available to guard against the eventual consistency problem, if you're developing in .Net, I suggest using SimpleSavant client to talk to SimpleDB
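The idea behind a conditional update can be shown without SimpleDB at all: a write only succeeds if the item is still at the version the writer last read. The sketch below uses an in-memory dictionary as a stand-in; the class and method names are hypothetical and are not the SimpleDB API.

```python
class ConflictError(Exception):
    """Raised when the item changed since the caller last read it."""


class TinyStore:
    """In-memory stand-in for a key-value store with conditional updates."""

    def __init__(self):
        self._items = {}  # key -> (value, version)

    def get(self, key):
        """Return (value, version), or (None, 0) if the key does not exist."""
        return self._items.get(key, (None, 0))

    def put_if_version(self, key, value, expected_version):
        """Write only if the stored version still equals expected_version."""
        _, current_version = self.get(key)
        if current_version != expected_version:
            raise ConflictError(f"{key!r} is at version {current_version}")
        self._items[key] = (value, current_version + 1)
        return current_version + 1
```

A writer that gets ConflictError re-reads the item and retries, so a stale read can never silently overwrite newer data.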
S3 (Simple Storage Service)
Amazon's storage service, again, extremely scalable, and safe too - when you save a file on S3 it's replicated across multiple nodes so you get some DR ability straight away.
you only pay for what you use: the storage itself, plus requests and data transfer
files are stored against a key
you create 'buckets' to hold your files, and each bucket has a unique url (unique across all of Amazon, and therefore S3 accounts)
CloudBerry S3 Explorer is the best UI client I've used in Windows
using the AmazonSDK you can write your own repository layer which utilizes S3
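A repository layer of that kind might look roughly like the sketch below. To keep it runnable without an AWS account, the bucket is replaced by an in-memory stand-in; every class and method name here is hypothetical, and a real implementation would make the corresponding SDK calls inside the bucket class instead.

```python
class InMemoryBucket:
    """Stand-in for an S3 bucket: a flat mapping from key to bytes."""

    def __init__(self):
        self._objects = {}

    def write(self, key, body):
        self._objects[key] = body

    def read(self, key):
        return self._objects[key]

    def keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


class FileRepository:
    """Application-facing layer: callers never build storage keys themselves."""

    def __init__(self, bucket):
        self._bucket = bucket

    def save(self, customer, name, data):
        key = f"{customer}/{name}"
        self._bucket.write(key, data)
        return key

    def load(self, customer, name):
        return self._bucket.read(f"{customer}/{name}")

    def list_for(self, customer):
        return self._bucket.keys(prefix=f"{customer}/")
```

Because the application only talks to FileRepository, swapping the in-memory bucket for a real one (or for another storage provider) stays a local change.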
Sorry if this is a bit long winded, but that's the 3 most popular web services that Amazon provides and should cover all the requirements you've mentioned. We've been using Amazon AWS for some time now and there's still some kinks and bugs there but it's generally moving forward and pretty stable.
One downside to using something like AWS is vendor lock-in: whilst you could run your services outside of Amazon in your own datacenter, or move files out of S3 (at a cost though), getting out of SimpleDB will likely represent the bulk of the work during migration.