Storing solr data with amazon s3 - amazon-web-services

I am using Solr on Amazon EC2, and I am hoping to configure the Solr instance so that it automatically stores its data in Amazon S3 instead of anywhere on the server. However, I couldn't find any useful information on how to implement this. Does anyone know how? If this can't be achieved using Amazon S3, what cloud storage do you recommend?
Thanks in advance.

You will want to store the Solr indexes on an EBS volume, which you can attach to the server. S3 is meant for serving files directly out to the internet (such as images and CSS files), or for general file storage (such as backups). It is not meant to be used as a mounted disk for a database.
Solr likes very high IO, so the SSD backed EBS volumes are great for this. You can also make snapshots of an EBS volume to backup its data.
If you set up Solr slaves, you can also get away with using the server's ephemeral storage. This is a large partition that comes with most instance types. It is volatile storage, meaning all of the data is lost if the server is shut down. However, it is free and quite fast. It is perfect for a slave which replicates its data from a master Solr instance backed by EBS.
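If you want to script that setup, a minimal sketch with boto3 might look like the following; the region, Availability Zone, instance ID, size and device name are all placeholders:

```python
import boto3  # assumes AWS credentials are configured (environment, ~/.aws, or an IAM role)

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# Create an SSD-backed volume for the Solr index (gp2/gp3; size and AZ are placeholders).
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp2")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to the Solr server; /dev/sdf is a typical secondary device name.
ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId="i-0123456789abcdef0",
                  Device="/dev/sdf")

# Later, back the index up by snapshotting the volume.
ec2.create_snapshot(VolumeId=volume["VolumeId"], Description="Solr index backup")
```

After attaching, you would format and mount the volume on the instance and point Solr's dataDir (in solrconfig.xml) at the mount point.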

Related

AWS Import-Snapshot with shared S3 Bucket

I am currently looking for a way of easily distributing customised volumes to clients.
An approach I am looking at is creating RAW disk images, saving them to S3 and having clients import them as snapshots using the AWS CLI.
My question is - who pays for the data access request/data transfer?
...I'm assuming it's the bucket owner, as there is no "requester pays" option for the Import-Snapshot command. Has anybody done anything similar?
Another approach is directly sharing snapshots to a client's account - but this involves an added charge on our part to create the ideal-sized volumes + generate the snapshots to share.
Is there a better method of generating + sharing data (essentially what would become EBS volumes) of varying sizes and content?
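(For reference, the import-from-S3 approach described above would look roughly like this with boto3; the bucket and key names are made up:)

```python
import boto3

ec2 = boto3.client("ec2")

# Import a RAW disk image that was previously uploaded to S3 (names are placeholders).
task = ec2.import_snapshot(
    Description="Customised client volume",
    DiskContainer={
        "Description": "RAW disk image",
        "Format": "RAW",
        "UserBucket": {"S3Bucket": "my-distribution-bucket", "S3Key": "images/client-a.raw"},
    },
)

# The import runs asynchronously; poll the task until a snapshot ID appears.
print(ec2.describe_import_snapshot_tasks(ImportTaskIds=[task["ImportTaskId"]]))
```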
The easiest method would be to create an Amazon Machine Image (AMI), which is a bootable snapshot. You can list it as a Community AMI.
Your clients can select the AMI when launching an Amazon EC2 instance. The boot disk will be exactly as you configured -- with the operating system, your application and all configurations that were saved on the disk.
There is no cost to you when a client uses the AMI.
See: Make an AMI public - Amazon Elastic Compute Cloud
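If you prefer to script it, making an AMI public is a single API call; a minimal boto3 sketch (the image ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Adding the "all" launch-permission group makes the AMI public (a Community AMI).
ec2.modify_image_attribute(
    ImageId="ami-0123456789abcdef0",            # placeholder image ID
    LaunchPermission={"Add": [{"Group": "all"}]},
)
```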

Elastic Beanstalk with EFS or S3

Basically, I'm trying to figure out what design to use. I'm collecting 1 TB of data per month using an EC2 instance mounted to EBS. I created another Elastic Beanstalk instance serving as the website, and I wanted to figure out whether it's better to access this EC2 instance's data through EFS or S3. Also, the amount of data that the Elastic Beanstalk webpage would access may be 10-50 GB occasionally, from a web application.
Basically, it depends upon the type of data you want to store.
EFS - Amazon EFS is automatically scalable, which means your running applications won't have any problems if the workload suddenly becomes higher; the storage scales itself automatically. If the workload decreases, the storage scales down, so you don't pay for storage you don't use. It is good for shareable applications and workloads, and it is faster than S3.
S3 - Amazon S3 provides simple object storage and also allows hosting static website content. It is useful for hosting website images and videos, data analytics, and both mobile and web applications. Object storage manages data as objects, meaning all data types are stored in their native formats.
So, as you are collecting 1 TB of data and the webpage would only access 10-50 GB of it occasionally, I would suggest EFS: going through the S3 APIs would slow your process down, and with EFS you pay only for the disk space you actually use.
And as you are talking about 1 TB, if the data grows beyond that, the storage will scale and the application will remain highly available.
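To make the access-pattern difference concrete, here is a small, hedged sketch; the mount point, bucket and key names are made up:

```python
import boto3

# EFS: once the file system is mounted on the instance (e.g. at /mnt/efs),
# it is accessed with ordinary file I/O.
with open("/mnt/efs/datasets/latest.csv", "rb") as f:   # hypothetical mount point and file
    data_from_efs = f.read()

# S3: every read goes through an API call over HTTPS, which adds latency per object.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-data-bucket", Key="datasets/latest.csv")  # placeholder names
data_from_s3 = obj["Body"].read()
```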

Comparative Application ebs vs s3

I am new to cloud environment. I do understand the definition and storage types EBS and S3. I wanted to understand the application of EBS as compared to S3.
I understand that EBS looks like a device for heavy-throughput operations, but I cannot find any application where it would be used in preference to S3. I could think of putting server logs on an EBS magnetic volume, as one EBS volume can be attached to one instance.
With S3 you can use the scaling property to add heavy data and expand in real time. We can deploy our self-managed databases on this service.
Please correct me if I am wrong, and please help me understand what each is best suited for, and their applications in comparison with one another.
As you stated, they are primarily different types of storage:
Amazon Elastic Block Store (EBS) is a persistent disk-storage service, which provides storage volumes to a virtual machine (similar to VMDK files in VMWare)
Amazon Simple Storage Service (S3) is an object store system that stores files as objects and optionally makes them available across the Internet.
So, how do people choose which to use? It's quite simple... If they need a volume mounted on an Amazon EC2 instance, they need to use Amazon EBS. It gives them a C:, D: drive, etc in Windows and a mountable volume in Linux. Computers traditionally expect to have locally-attached disk storage. Put simply: If the operating system or an application running on an Amazon EC2 instance wants to store data locally, it will use EBS.
EBS Volumes are actually stored on two physical devices in case of failure, but an EBS volume appears as a single volume. The volume size must be selected when the volume is created. The volume exists in a single Availability Zone and can only be attached to EC2 instances in the same Availability Zone. EBS Volumes persist even when the attached EC2 instance is Stopped; when the instance is Started again, the disk remains attached and all data has been preserved.
Amazon S3, however, is something quite different. It is a storage service that allows files to be uploaded/downloaded (PutObject, GetObject) and files are replicated across at least three data centers. Files can optionally be accessed via the Internet via HTTP/HTTPS without requiring a web server. There are no limits on the amount of data that can be stored. Access can be granted per-object, per-bucket via a Bucket Policy, or via IAM Users and Groups.
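As a minimal illustration of that upload/download model with boto3 (the bucket and key names are made up):

```python
import boto3

s3 = boto3.client("s3")  # assumes credentials are already configured

# Upload (PutObject) and download (GetObject) an object; names are placeholders.
with open("report-2023.pdf", "rb") as f:
    s3.put_object(Bucket="example-bucket", Key="reports/2023.pdf", Body=f)

body = s3.get_object(Bucket="example-bucket", Key="reports/2023.pdf")["Body"].read()

# A pre-signed URL grants time-limited HTTP access to a single object without
# making the bucket public.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-bucket", "Key": "reports/2023.pdf"},
    ExpiresIn=3600,
)
```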
Amazon S3 is a good option when data needs to be shared (both locally and across the Internet), retained for long periods, backed-up (or even for storing backups) and made accessible to other systems. However, applications need to be specifically coded to use Amazon S3, and many traditional applications expect to store data on a local drive rather than on a separate storage service.
While Amazon S3 has many benefits, there are still situations where Amazon EBS is a better storage choice:
When using applications that expect to store data locally
For storing temporary files
When applications want to partially update files, because the smallest storage unit in S3 is a file and updating a part of a file requires re-uploading the whole file
For very High-IO situations, such as databases (EBS Provisioned IOPS can provide volumes up to 20,000 IOPS)
For creating volume snapshots as backups
For creating Amazon Machine Images (AMIs) that can be used to boot EC2 instances
Bottom line: They are primarily different types of storage and each have their own usage sweet-spot, just like a Database is a good form of storage depending upon the situation.

amazon ec2 free server with persistent data

I will install a website on the free EC2 tier from Amazon, but I read something not good: I have a simple website which uses a database. Users come to my website and post information, send comments... if for some reason the instance breaks or Amazon shuts it down, will I lose all information posted on my website and in the database? Will all the files users uploaded and all the information saved be gone?
If so, why would someone use EC2, if you lose all your data when some problem happens? And because problems always happen, at some point I will certainly lose my data!
I know I can save an image of my current OS in AWS, but do I need to save the image every time a user posts something to my website? That's ridiculous. I know I am missing something here, but I looked on Google and people all the time say I should use EBS, but it's not in the free plan. So how is it a good idea to use the AWS EC2 free plan if my data will always be at risk of being lost?
Typically you would want to use an EBS-backed instance. Since the free tier does not support that, but does offer EBS storage, create your database on an EBS partition for data you cannot lose:
30 GB of Amazon Elastic Block Storage, plus 2 million I/Os and 1 GB of snapshot storage*
http://aws.amazon.com/free/
You should have a means to quickly launch a new instance, and you should back up the data on your EBS partition because EBS volumes can and do fail from time to time.
UPDATE
It seems that Micro instances are in fact EBS backed.
It is still advisable to attach a separate EBS volume, because it makes it much more convenient to backup the database (you create a snapshot of the EBS volume... you can find scripts online to accomplish that, which vary a bit depending on your choice of database and file system).
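One of those backup scripts, reduced to its essence with boto3 (the volume ID is a placeholder; for a consistent backup you would typically flush and lock the database or freeze the file system first):

```python
import datetime
import boto3

ec2 = boto3.client("ec2")

volume_id = "vol-0123456789abcdef0"   # placeholder: the EBS volume holding the database

# Create a dated snapshot of the data volume. For consistency, quiesce the
# database or freeze the file system before taking the snapshot.
snapshot = ec2.create_snapshot(
    VolumeId=volume_id,
    Description="db-backup-" + datetime.date.today().isoformat(),
)
print(snapshot["SnapshotId"])
```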

Best setup to work with amazon AWS

I have a website which gets backups from different social media services, stores the data on the server, and then displays it on my website. Content includes videos, images, and text data.
Currently I am using an EC2 instance with RDS and EBS. Data is stored in EBS volumes, but the amount of data is already more than 1 TB and is increasing. Every time my EBS volume gets full I attach another volume.
Then I added S3 to my setup. Cron jobs run and store data on S3, and the EC2 instance displays data from S3. I am using the PHP SDK for this purpose.
The problem I am facing is that S3 is very slow in my current setup.
Please suggest whether my setup is good or whether I need to change it, and also how I can speed up S3, or whether I should opt for some other approach.
The EC2 instance is a large reserved instance running CentOS.
I have heard a little about s3fs, which mounts an S3 bucket to EC2 as a volume. Is this a good choice? When I mounted the S3 bucket to the EC2 instance, the transfer rate was very slow.
I am new to AWS. My users do not access files directly from S3; they access them through my website, which is running on an EC2 instance.
RDS is a good choice for storing metadata such as tags, comments and other relevant information about your multimedia files. S3 is good for storing static content such as Video, Audio and Pictures. I think your approach with RDS and S3 is good enough.
EBS backed instances are good for persistence. If you store your metadata on RDS and static content on S3, the only reason why you should use EBS backed EC2 instances is that you have some configuration files which are unversioned right now. If that's not the case, assuming that your configuration is checked into version control and can be pulled on-demand for a fresh instance every time, then you might want to ditch EBS volumes in favor of ephemeral storage. That may give you some performance boost, nothing significant though.
Regarding your concern with S3's latency, yes, S3 is slow. While all your writes may happen directly to S3, I would highly recommend that you set up Amazon CloudFront for your S3 buckets and let your website consume multimedia content from the CloudFront. CloudFront is a Content Delivery Network (CDN) which works with disk volumes (EBS backed or ephemeral) as well as with S3. Setting it up would take not more than a few minutes. CloudFront also supports streaming media files over RTMP. You may need a library like GPAC for hinting multimedia files to make them streamable if not being done already. You might then want to consider creating one distribution for Video/Audio files for streaming and another distribution for Images, Javascript, Stylesheets and other text files.
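As a rough sketch of what setting that up programmatically might look like with boto3 (the bucket name is made up, and a production setup would normally use an origin access identity rather than a public bucket):

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Put a CloudFront distribution in front of a (hypothetical) public media bucket.
response = cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": str(time.time()),            # any unique string
        "Comment": "CDN in front of the media bucket",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [{
                "Id": "s3-media-origin",
                "DomainName": "my-media-bucket.s3.amazonaws.com",
                "S3OriginConfig": {"OriginAccessIdentity": ""},
            }],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "s3-media-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
            "ForwardedValues": {"QueryString": False, "Cookies": {"Forward": "none"}},
            "MinTTL": 0,
        },
    }
)

# The website then serves media from this domain instead of hitting S3 directly.
print(response["Distribution"]["DomainName"])
```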
Hope this helps.
For faster downloading and uploading of files with Amazon S3 I use batch(), found here.
You can also use CloudFront for faster file retrieval. I think 9gag uses CloudFront as well.