EBS Snapshots full backups - aws-ebs

Is it possible to take a full backup (snapshot) of a volume again after the incremental backup?
E.g. Day 1 : Full backup
Day 2-6 : Incremental backups
Day 7 : Full backup again.
The reason: the client wants to keep their RTO low, as restoring from an incremental backup will take more time. Any solutions?

I don't think it's possible to take a full backup after an incremental backup.
You can back up the data on your Amazon EBS volumes to Amazon S3 by taking point-in-time snapshots. Snapshots are incremental backups, which means that only the blocks on the device that have changed after your most recent snapshot are saved.
For example:
In State 1, the volume has 10 GiB of data. Because Snap A is the first snapshot taken of the volume, the entire 10 GiB of data must be copied.
In State 2, the volume still contains 10 GiB of data, but 4 GiB have changed. Snap B needs to copy and store only the 4 GiB that changed after Snap A was taken. The other 6 GiB of unchanged data, which are already copied and stored in Snap A, are referenced by Snap B rather than (again) copied. This is indicated by the dashed arrow.
In State 3, 2 GiB of data have been added to the volume, for a total of 12 GiB. Snap C needs to copy the 2 GiB that were added after Snap B was taken. As shown by the dashed arrows, Snap C also references 4 GiB of data stored in Snap B, and 6 GiB of data stored in Snap A.
The total storage required for the three snapshots is 16 GiB.
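For reference, this is roughly how the snapshot workflow described above looks from the AWS CLI (a minimal sketch; the volume ID vol-0abcd1234 is a hypothetical placeholder):
# Take a point-in-time snapshot of the volume (stored incrementally, as described above)
aws ec2 create-snapshot --volume-id vol-0abcd1234 --description "Day 7 backup"
# List all snapshots taken of that volume
aws ec2 describe-snapshots --filters Name=volume-id,Values=vol-0abcd1234
Any snapshot in that list can be turned back into a volume with aws ec2 create-volume --snapshot-id ... --availability-zone ..., regardless of where it sits in the chain.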

Related

Data ingestion configuration for Spark in AWS

I am working on batch jobs where we receive a 1 GB CSV input file at a time into EMR. What is the ideal Master and Core configuration for 1 GB of data, how do you arrive at that conclusion, and is there a standard procedure? I am using the setup below and want to downgrade to 1 core instance. My concern is: if more data comes in, how can I upgrade my configuration?
1 instance - Master - 4 vCore, 16 GiB memory, 64 GB EBS
2 instances - Core - 4 vCore, 16 GiB memory, 64 GB EBS
The ingestion code does a simple transformation and converts the data to Parquet.
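For what it's worth, a minimal sketch of how the corresponding Spark settings might be passed at submit time; the script name ingest.py and the executor split are assumptions for illustration, not a sizing recommendation:
# Hypothetical submit for a small 1 GB CSV-to-Parquet job on the 4 vCore / 16 GiB core nodes above
spark-submit \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  ingest.py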

Costs of enabling versioning in Amazon S3

I have a question about the costs of versioning in Amazon S3 that doesn't seem to be answered in the guide. There is a cost for every PUT/POST, but for versioned objects (especially when you keep older versions in alternative storage such as Glacier), does each PUT/POST cost 2x the PUT/POST price: one for the new version and one to move the old version to Glacier?
You can refer to the FAQ page: https://aws.amazon.com/s3/faqs/?nc1=h_ls
Q: How am I charged for using Versioning?
Normal Amazon S3 rates apply for every version of an object stored or requested. For example, let's look at the following scenario to illustrate storage costs when utilizing Versioning (let's assume the current month is 31 days long):
1) Day 1 of the month: You perform a PUT of 4 GB (4,294,967,296 bytes) on your bucket.
2) Day 16 of the month: You perform a PUT of 5 GB (5,368,709,120 bytes) within the same bucket, using the same key as the original PUT on Day 1.
When analyzing the storage costs of the above operations, please note that the 4 GB object from Day 1 is not deleted from the bucket when the 5 GB object is written on Day 16. Instead, the 4 GB object is preserved as an older version and the 5 GB object becomes the most recently written version of the object within your bucket. At the end of the month:
Total Byte-Hour usage = [4,294,967,296 bytes x 31 days x (24 hours/day)] + [5,368,709,120 bytes x 16 days x (24 hours/day)] = 5,257,039,970,304 Byte-Hours.
Conversion to Total GB-Months: 5,257,039,970,304 Byte-Hours x (1 GB / 1,073,741,824 bytes) x (1 month / 744 hours) = 6.581 GB-Months.
The fee is calculated based on the current rates for your region on the Amazon S3 Pricing page.
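As a sketch of the Glacier part of the question: versioning is enabled per bucket, and older (noncurrent) versions are moved to Glacier by a lifecycle rule rather than by an extra PUT from you; the move is billed as a lifecycle transition request. The bucket name and the 30-day threshold below are placeholders:
# Enable versioning on the bucket
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled
# lifecycle.json - transition noncurrent versions to Glacier after 30 days
# {
#   "Rules": [{
#     "ID": "archive-old-versions",
#     "Status": "Enabled",
#     "Filter": {"Prefix": ""},
#     "NoncurrentVersionTransitions": [{"NoncurrentDays": 30, "StorageClass": "GLACIER"}]
#   }]
# }
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json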

Subscribe Google Pub/sub topic to Cloud Storage Avro file gives me "quota exceeded" error - in a beginners tutorial?

I'm going through Google's Firestore to BigQuery pipeline tutorial and I've come to step 10, where I should set up an export from my topic to an Avro file saved on Cloud Storage.
However, when I try running the job, after doing exactly what's mentioned in the tutorial, I get an error telling me that my project has insufficient quota to execute the workflow. In the quota summary of the message, I notice that it says 1230/818 disk GB. Does that mean that the job requires 1230 GB of disk space? Currently, there are only 100 documents in Firestore. This seems wrong to me.
All my Cloud Storage buckets are empty.
But when I look at the resources used by the first export job I set up (Pub/Sub topic to BigQuery) on page 9, I'm even more confused.
It seems like it's using CRAZY amounts of resources:
Current vCPUs: 4
Total vCPU time: 2.511 vCPU hr
Current memory: 15 GB
Total memory time: 9.417 GB hr
Current PD: 1.2 TB
Total PD time: 772.181 GB hr
Current SSD PD: 0 B
Total SSD PD time: 0 GB hr
Can this be real, or have I done something completely wrong, given that all these resources are used? I mean, there's no activity at all; it's just a subscription, right?
Under the hood, that step is calling a Cloud Dataflow template (this one to be exact) to read from Pub/Sub and write to GCS. In turn, Cloud Dataflow is using GCE instances (VMs) for its worker pool. Cloud Dataflow is requesting too many resources (GCE instances which need disk, ram, vCPUs etc) and is hitting your project's limit/quota.
You can override the default number of workers (try 1 to start with) and also set the smallest VM type (n1-standard-1) when configuring the job under optional parameters. This should save you some money too. Bonus!
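For example, if you launch the same job from the command line instead of the console, the worker count and machine type can be pinned like this (a sketch: the project, topic, bucket, and region names are placeholders, and the template path and parameter names are assumed to match the Google-provided Pub/Sub-to-Avro template linked above):
gcloud dataflow jobs run pubsub-to-avro \
  --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Avro \
  --region=us-central1 \
  --num-workers=1 \
  --max-workers=1 \
  --worker-machine-type=n1-standard-1 \
  --parameters=inputTopic=projects/MY_PROJECT/topics/MY_TOPIC,outputDirectory=gs://MY_BUCKET/avro/,avroTempDirectory=gs://MY_BUCKET/avro-temp/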

Best AWS Instance for Partitioning Big Data

The problem that I am having right now is trying to find the best AWS instance for partitioning large data (scaling to greater than 1 TB).
The data that I am receiving is structured data, and I am hoping to partition it by either /year/month/day/ or /year/month/day/hour of the created-at time. So far I have tried using EMR with the following configurations to partition 260 GB of Parquet data by /year/month/day (spark.dynamicAllocation.enabled == true):
3 r5.2xlarge (8 vCPU, 64GB) --> > 1 hour to just write to HDFS
2 c5.4xlarge (16 vCPU, 32GB) --> >> 1 hour to just write to HDFS (was 28% slower than the 3 r5.2xlarge)
2 r5d.4xlarge (16 vCPU, 128GB) --> 54 minutes to just write to HDFS (note, HDFS is on NVMe SSD)
This is a graph of what the 3 r5.2xlarge is producing:
This is a graph of what the 2 c5.4xlarge is producing (note, the two peaks are due to running the job twice):
This is a graph of what the 2 r5d.4xlarge is producing:
Is it possible for me to reach ~10 minutes? If so, would that mean adding more nodes or a different instance type?

What happens when I increase the size of a running volume of an EC2 instance

My question is simple:
What happens when I increase the size of a running volume of an EC2 instance?
1) Does all my data get wiped?
2) Will the space on my instance also be modified to the new size?
My instance currently has 8 GB of storage and it is almost full. I want to increase the space so that I can save more files to my instance.
I have found this option in my console.
I have found the connected EC2 volume. Will directly modifying the volume size automatically be reflected in my instance's space after a reboot?
I know this is quite simple. I am just worried about my existing data.
Thank you for your help!
Assuming you have found the option in the console to modify the size of the volume, and that the instance here is a Linux instance: what the other answer forgets to mention is an important point from the AWS documentation:
Modifying volume size has no practical effect until you also extend the volume's file system to make use of the new storage capacity. For more information, see Extending a Linux File System after Resizing the Volume.
For ext2, ext3, and ext4 file systems, this command is resize2fs. For XFS file systems, this command is xfs_growfs.
Note:
If the volume you are extending has been partitioned, you need to increase the size of the partition before you can resize the file system.
To check if your volume partition needs resizing:
Use the lsblk command to list the block devices attached to your instance. For example, lsblk might show three volumes: /dev/xvda, /dev/xvdb, and /dev/xvdf.
If the partition occupies all of the room on the device, it does not need resizing.
However, if /dev/xvdf1 is an 8-GiB partition on a 35-GiB device and there are no other partitions on the volume, then the partition must be resized in order to use the remaining space on the volume.
To extend a Linux file system
Log in to the instance via SSH.
Use the df -h command to report the existing disk space usage on the file system.
Expand the modified partition using growpart (and note the unusual syntax of separating the device name from the partition number):
sudo growpart /dev/xvdf 1
Then use a file-system-specific command (resize2fs or xfs_growfs) to resize each file system to the new volume capacity.
Finally, use the df -h command again to report the new file system disk space usage.
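Putting those steps together, a minimal sketch assuming the /dev/xvdf example device from above and an ext4 file system (for XFS, run xfs_growfs on the mount point instead):
df -h                          # current file system usage
lsblk                          # confirm the partition is smaller than the device
sudo growpart /dev/xvdf 1      # grow partition 1 to fill the device
sudo resize2fs /dev/xvdf1      # grow the ext4 file system to fill the partition
df -h                          # verify the new size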
Note: It is recommended to take a snapshot of the EBS volume before making any changes.
Please refer to this AWS documentation.
Well, you can just modify the volume directly and this will not affect any files; it will take around a minute or so to upgrade the size, or you might want to restart your instance.
To ensure data safety, you can create a snapshot of that volume, create a new volume of whatever size you want from that snapshot, and delete the old volume, which now contains the old data.
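If you prefer the CLI over the console, the direct modification mentioned above looks roughly like this (a sketch; the volume ID and target size are placeholders):
# Optionally take a snapshot first, as suggested above
aws ec2 create-snapshot --volume-id vol-0abcd1234 --description "before resize"
# Grow the volume in place; existing data is preserved
aws ec2 modify-volume --volume-id vol-0abcd1234 --size 20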