Estimating the size of an AWS Neptune graph database - amazon-web-services

I am currently building a graph using AWS Neptune. Is there a way of determining or calculating the size of a filled database with AWS Neptune?

There is already an answer in this post, but I am posting another with a bit more detail, because the previous answer does not say whether the reported storage includes space used by replication, deleted data, etc.
As @Morinaga already pointed out, CloudWatch exposes the number of bytes used by actual data pages under AWS/Neptune -> By Cluster -> VolumeBytesUsed. This shows the exact storage that you get charged for. Internally, Neptune uses distributed storage for the data, which includes multiple copies, some additional storage for metadata, etc. None of that affects how you get billed, so it is not included in VolumeBytesUsed.
Neptune also supports copy-on-write, where you can create a cloned volume from another cluster. One thing to note with cloned volumes is that the new cluster only takes up space for pages that have diverged from the source. So when you plot the VolumeBytesUsed metric for a clone, you will see a much smaller number for the clone as long as the source cluster still exists. If you delete the source cluster, the space is then re-adjusted in the clones. Do make a note of this, to avoid any possible confusion later on.
The last thing to note is that Neptune, as of Sept 2020, does not do volume shrinking. VolumeBytesUsed is pretty much a high watermark of how many data pages were used, and deleting a lot of data just clears the data in the data pages; it does not remove the pages from the volume. So if you create a cluster, add a bunch of data and then delete everything, your VolumeBytesUsed will still show the high watermark. When you insert new data, we reuse the available data pages first, so you don't end up paying for new data pages.
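The high-watermark behaviour described above can be illustrated with a toy model. This is only a sketch of the accounting idea, not Neptune's actual storage engine; the class name `ToyVolume` and the page counts are made up for illustration.

```python
class ToyVolume:
    """Toy model of a volume whose reported size never shrinks."""

    def __init__(self):
        self.allocated_pages = 0  # high watermark (what the metric would report)
        self.live_pages = 0       # pages currently holding data

    def insert(self, pages):
        self.live_pages += pages
        # New pages are allocated only once the freed ones are used up.
        self.allocated_pages = max(self.allocated_pages, self.live_pages)

    def delete(self, pages):
        # Deleting frees pages for reuse but does not shrink the volume.
        self.live_pages = max(0, self.live_pages - pages)


vol = ToyVolume()
vol.insert(100)
vol.delete(100)
after_delete = vol.allocated_pages    # still at the watermark
vol.insert(60)
after_reinsert = vol.allocated_pages  # freed pages were reused
vol.insert(70)
after_growth = vol.allocated_pages    # grows only past the old watermark
```

In this model the reported size stays at 100 pages after the delete and after the first re-insert, and only rises once live data exceeds the previous peak.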

Amazon CloudWatch can be used to find the exact size of your populated database.
Under Metrics, select Neptune and search for MetricName='VolumeBytesUsed'. This shows the amount of storage your database is currently using.
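The same metric can also be read programmatically. Below is a minimal sketch using boto3's CloudWatch `get_metric_statistics` call. The cluster identifier, region, one-hour window and 5-minute period are assumptions for illustration, and the `DBClusterIdentifier` dimension is assumed to match how your cluster's metrics are published. If no datapoints come back, try widening the time window.

```python
from datetime import datetime, timedelta, timezone


def volume_bytes_used_request(cluster_id, hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Neptune",
        "MetricName": "VolumeBytesUsed",
        # Assumed dimension: metrics grouped by cluster.
        "Dimensions": [{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Maximum"],
    }


def latest_gib(datapoints):
    """Return the most recent datapoint in GiB, or None if there are none."""
    if not datapoints:
        return None
    newest = max(datapoints, key=lambda d: d["Timestamp"])
    return newest["Maximum"] / 1024 ** 3


def fetch_volume_gib(cluster_id, region):
    # Requires boto3 and AWS credentials; imported lazily so the helpers
    # above can be used and tested without it.
    import boto3
    cw = boto3.client("cloudwatch", region_name=region)
    resp = cw.get_metric_statistics(**volume_bytes_used_request(cluster_id))
    return latest_gib(resp["Datapoints"])
```

A call such as `fetch_volume_gib("my-neptune-cluster", "us-east-1")` (both values hypothetical) would return the latest reported size in GiB.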

It really depends on how much data you store in vertex and edge properties. Taylor's answer here explains more, since storage capacity is dynamically allocated in Amazon Neptune.

Related

Data Storage and Analytics on AWS

I have a data analytics requirement on AWS. I have limited knowledge of Big Data processing, but based on my analysis, I have figured out some options.
The requirement is to collect data by calling a Provider API every 30 mins. (data ingestion)
The data is mainly structured.
This data needs to be stored somewhere (an S3 data lake or Redshift, not sure), and various aggregations/dimensions of this data are to be provided through a REST API.
There is a future requirement to run ML algorithms on the original data, so the storage needs to be chosen accordingly. Based on this, can you suggest:
How to ingest data (Lambda to run at a scheduled interval and pull data, store in the storage OR any better way to pull data in AWS)
How to store (store in S3 or RedShift)
Data Analytics (currently some monthly, weekly aggregations), what tools can be used? What tools to use if I am storing data in S3.
Expose the analytics results through an API. (Hope I can use Lambda to query the Analytics engine in the previous step)
Ingestion is simple. If the retrieval is relatively quick, then scheduling an AWS Lambda function is a good idea.
However, all the answers to your other questions really depend upon how you are going to use the data, and then work backwards.
For Storage, Amazon S3 makes sense at least for the initial storage of the retrieved data, but might (or might not) be appropriate for the API and Analytics.
If you are going to provide an API, then you will need to consider how the API code (e.g. using Amazon API Gateway) will retrieve the data. For example, is the response identical to the blob of data originally retrieved, or are complex transformations required, or perhaps combining data from other locations and time intervals? This will help determine how the data should be stored so that it is easily retrieved.
Data Analytics needs will also drive how your data is stored. Consider whether an SQL database is sufficient. If there are millions or billions of rows, you could consider using Amazon Redshift. If the data is kept in Amazon S3, then you might be able to use Amazon Athena. The correct answer depends completely upon how you intend to access and process the data.
Bottom line: Consider first how you will use the data, then determine the most appropriate place to store it. There is no generic answer that we can provide.
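The scheduled-ingestion idea from the start of this answer could be sketched roughly as follows. This is a minimal, hedged example: the provider URL, bucket name and key layout are hypothetical, and the function is assumed to be triggered by a schedule every 30 minutes.

```python
import urllib.request
from datetime import datetime, timezone

PROVIDER_URL = "https://provider.example.com/data"  # hypothetical endpoint
BUCKET = "my-ingest-bucket"                          # hypothetical bucket


def object_key(ts):
    """Partition raw payloads by date so later tools can filter by prefix."""
    return ts.strftime("raw/year=%Y/month=%m/day=%d/%H%M%S.json")


def handler(event, context):
    # Intended to be invoked on a schedule (e.g. every 30 minutes).
    import boto3  # imported lazily; needs AWS credentials at run time
    with urllib.request.urlopen(PROVIDER_URL, timeout=30) as resp:
        body = resp.read()
    key = object_key(datetime.now(timezone.utc))
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body)
    return {"stored": key, "bytes": len(body)}
```

Storing the raw payload unchanged keeps the later storage and analytics choices open, which fits the "work backwards from usage" advice above.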

How does snapshot size affect the restore process in Amazon Redshift?

I am doing a POC around creating a cluster from a snapshot, but I am uncertain about the time it takes to restore from an existing snapshot. Sometimes it takes around 10 minutes, but sometimes it takes as long as 30 minutes.
Is there any breakdown available of snapshot size vs. restore time?
What operations does Redshift perform in the background during the restore process?
A Redshift restore from snapshot does not require a full repopulation of the data before the cluster is available. Cluster availability is based on having the hardware, OS, and application up, along with populating the leader node (mostly the blocklist). Once these are in place the cluster can take queries. If the table data a query needs has not yet been loaded into the cluster from the snapshot, the restore of those data blocks is prioritized and the query runs slowly until the blocks are populated. Since most queries rely on a minority of "hot" blocks, query speed for most of them will be back to normal fairly quickly.
I know this just complicates the analysis you are performing but this is how restore works. I expect you are seeing variability based on many factors and a small one of these is the size of the blocklist table on the leader node. How does the time for creating an empty cluster compare? How variable is this?
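One way to get comparable numbers for the questions above is to time how long a cluster takes to report itself as available, for both a restored and an empty cluster. The following is a rough sketch: `seconds_until` is a hypothetical helper, the poll interval and timeout are arbitrary, and the status check assumes boto3's Redshift `describe_clusters` call with a cluster identifier of your choosing.

```python
import time


def seconds_until(is_ready, poll_seconds=30, timeout=3600,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() and return elapsed seconds once it returns True."""
    start = clock()
    while not is_ready():
        if clock() - start > timeout:
            raise TimeoutError("cluster did not become available in time")
        sleep(poll_seconds)
    return clock() - start


def redshift_is_available(cluster_id):
    """Return a callable that reports whether the cluster is 'available'."""
    import boto3  # requires credentials; imported lazily
    client = boto3.client("redshift")

    def check():
        resp = client.describe_clusters(ClusterIdentifier=cluster_id)
        return resp["Clusters"][0]["ClusterStatus"] == "available"

    return check
```

Note that, per the explanation above, "available" only means the cluster accepts queries; data blocks may still be loading in the background, so this measures time to availability rather than time to a fully restored cluster.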

Database suggestion for large unstructured datasets to integrate with elasticsearch

We have a scenario with millions of records saved in a database. Currently I am using DynamoDB for saving metadata (with write, update and delete operations on objects), S3 for storing files (e.g. images, whose associated metadata is stored in DynamoDB) and Elasticsearch for indexing and searching. But because of DynamoDB's 400 KB limit per row (a single object), it is not sufficient for the data I need to save. I thought about splitting an object into several versions within DynamoDB itself, but that would be too complicated.
So I am thinking of replacing DynamoDB with a better storage option:
AWS DocumentDb
S3 for saving metadata also, along with object files
Which of the two is the better option in your opinion, and why? It should also be cost-effective. (It should also be easy to sync with Elasticsearch, but ES syncing is not much of an issue, as it is possible with both.)
If you have any better suggestions than these two, please tell me about those as well.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Pricing for storing the data in S3 would be $0.023 per GB per month for Standard and $0.0125 for Infrequent Access (whereas DocumentDB is $0.10 per GB-month); depending on your data size this difference could add up considerably. If you use IA, be aware that retrieval costs could also add up.
With S3 you would not fetch the data directly; you would use either Athena or S3 Select to filter it. Depending on the size of the data being queried, this would take from a few seconds to possibly minutes (not the milliseconds you requested).
S3 storage for unstructured data, and the query technologies around it, are aimed more at data lakes used for analysis, whereas DocumentDB is geared toward performance in live applications (it is a MongoDB-compatible data store, after all).
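To get a feel for how the storage prices quoted above scale, here is a small arithmetic sketch. It uses only the per-GB-month figures from this answer, which may be out of date, and it ignores request, retrieval, instance and I/O charges; the 500 GB data size is an arbitrary example.

```python
# Per-GB-month storage prices in USD, as quoted in the answer above.
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "s3_infrequent_access": 0.0125,
    "documentdb": 0.10,
}


def monthly_storage_cost(size_gb):
    """Storage-only monthly cost in USD for each option."""
    return {name: round(size_gb * price, 2)
            for name, price in PRICE_PER_GB_MONTH.items()}


costs = monthly_storage_cost(500)  # 500 GB is an arbitrary example size
```

Storage alone is therefore several times cheaper on S3, so the decision mostly comes down to whether the query latency described above is acceptable.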

Impact of On-Demand mode on Audit table data for Amazon DynamoDB

I am working on Amazon DynamoDB audit table.
The read/write mode was set to "Provisioning". Now, the mode is changed to "On-Demand". I have an "Audit Table" (which captures the audit information like date and time of operation, user details, etc) associated with DynamoDB.
My questions on this are:
1) How is it impacting the data that gets created in the "Audit Table"?
2) Will the data be deleted automatically on a periodic basis?
3) If not, what is the maximum limit of data that a table (audit table in this case) can persist?
Please let me know if you need any more information from my side.
Waiting for your answers on my questions.
Thanks and regards,
Mahesh Bongale
1) Provisioning just means that the table is initialized with whatever read/write capacity you set, or with On-Demand capacity if you set it to that mode (similar to an auto-scaling mode where it will always deliver the throughput needed by your application). More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html
2) No, absolutely not, unless you specifically add code that deletes old data OR set a TTL on your data. More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TTL.html
3) There is no specific limit on the number of rows in a given table. It can hold as many as you want. There are limits on a few other things, though; some can be lifted if you ask AWS, some cannot: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
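If automatic expiry of audit rows is wanted, the TTL option mentioned above could look roughly like this. It is a sketch under assumptions: the attribute name `expires_at` and the retention period are hypothetical, and the call shown is boto3's DynamoDB `update_time_to_live`. The TTL attribute must hold a Unix epoch time in seconds.

```python
from datetime import datetime, timedelta, timezone


def expires_at(created, keep_days):
    """TTL value: epoch seconds after which the item may be deleted."""
    return int((created + timedelta(days=keep_days)).timestamp())


def enable_ttl(table_name, attribute="expires_at"):
    # One-time table setting; requires boto3 and AWS credentials.
    import boto3
    boto3.client("dynamodb").update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": attribute},
    )
```

Each audit item would then carry an `expires_at` number computed when it is written. Expired items are removed in the background rather than at the exact expiry time.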

AWS hosted data storage for storing simple entities

I need to choose data storage for a simple system. The main purpose of the system is storing events: simple entities with a timestamp, user id and type. No joins, just a single table.
Stored data will be fetched rarely (compared with writes). I expect the following read operations:
get latest events for a list of users
get latest events of a type for a list of users
I expect about 0.5-1 million writes a day. Data older than 2 years can be removed.
I'm looking for the best-fitting service provided by AWS. I wonder whether using Redshift would be like taking a sledgehammer to crack a nut?
For your requirement you can use Amazon DynamoDB and also define TTL values to remove older items automatically. You get the following advantages:
Fully managed data storage
Able to scale with the need for write throughput (Though it can be costly)
Use sort key with timestamp to query latest items.
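A possible shape for such a table is sketched below, using the low-level DynamoDB request format. The attribute names (`user_id` as partition key, `ts` as numeric sort key, `type`, `expires_at`) and the two-year retention are assumptions for illustration. The functions only build request dictionaries that could be passed to boto3's `put_item` and `query`.

```python
TWO_YEARS_SECONDS = 2 * 365 * 24 * 3600


def event_item(user_id, event_type, ts_epoch, keep_seconds=TWO_YEARS_SECONDS):
    """Build an item keyed by user (partition key) and timestamp (sort key)."""
    return {
        "user_id": {"S": user_id},
        "ts": {"N": str(ts_epoch)},
        "type": {"S": event_type},
        # TTL attribute: epoch seconds after which the item may be removed.
        "expires_at": {"N": str(ts_epoch + keep_seconds)},
    }


def latest_events_query(table, user_id, limit=10, event_type=None):
    """Build query arguments returning one user's newest events first."""
    kwargs = {
        "TableName": table,
        "KeyConditionExpression": "user_id = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "ScanIndexForward": False,  # descending sort-key order
        "Limit": limit,
    }
    if event_type is not None:
        # Placeholder name avoids clashing with reserved words.
        kwargs["FilterExpression"] = "#t = :t"
        kwargs["ExpressionAttributeNames"] = {"#t": "type"}
        kwargs["ExpressionAttributeValues"][":t"] = {"S": event_type}
    return kwargs
```

Two caveats with this sketch: a list of users needs one query per user, and `Limit` is applied before the filter, so the type-filtered query may return fewer than `limit` items. If filtering by type is frequent, a secondary index keyed on user and type may be a better fit.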
I would also check Amazon SimpleDB, as at first glance it looks like a better fit for your requirements.
Please refer to this article, which describes some practical user experience:
http://www.masonzhang.com/2013/06/2-reasons-why-we-select-simpledb.html