I know that the blockchain stores transactional data and that it is immutable, but where is the actual data stored?
If there is a use case to replace a centralized data center solution with a blockchain solution, where will the data be stored in blockchain?
Data centers usually have petabytes of raw data, hence I am assuming that a decentralized solution like blockchain won't be able to accommodate large amounts of data.
Note: Many links on Google say that blockchain is not an ideal solution for large data, but any solution will eventually produce an ever-increasing amount of data.
In most blockchains, such as Bitcoin, every full node contains a complete copy of all data, which allows every node to verify previous and new transactions. The data itself is normally stored in a local database, typically LevelDB.
As you assumed, for this reason, distributed databases (blockchains) that require a full copy of the dataset on every node are not ideal for petabytes of data. The Bitcoin blockchain is currently roughly 270 GB.
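To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. The 270 GB figure comes from the answer above; the node count of 10,000 and the 1 PB dataset are made-up numbers chosen only for illustration.

```python
def replicated_storage_bytes(dataset_bytes: int, full_nodes: int) -> int:
    """Total bytes stored network-wide when every full node keeps a complete copy."""
    return dataset_bytes * full_nodes


GB = 10**9
PB = 10**15

# 270 GB is the figure quoted above; the node count is a hypothetical example.
bitcoin_total = replicated_storage_bytes(270 * GB, full_nodes=10_000)
# A hypothetical 1 PB data-center dataset replicated to the same number of nodes.
datacenter_total = replicated_storage_bytes(1 * PB, full_nodes=10_000)

print(f"270 GB chain on 10,000 nodes: {bitcoin_total / PB:.1f} PB network-wide")
print(f"1 PB dataset on 10,000 nodes: {datacenter_total / PB:,.0f} PB network-wide")
```

Under these assumptions the cost of full replication grows linearly with the number of nodes, which is why petabyte-scale data is usually kept off-chain.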
I am currently building a graph using AWS Neptune. Is there a way of determining or calculating the size of a filled database with AWS Neptune?
There is an answer already in this post, but posting one more with a bit more details, as the previous answer does not mention if the storage includes space used by replication, deleted data etc.
As @Morinaga already pointed out, CloudWatch exposes the number of bytes used by actual data pages under AWS/Neptune -> By Cluster -> VolumeBytesUsed. This shows the exact storage that you get charged for. Internally, Neptune uses distributed storage for the data, which includes multiple copies, some additional storage for metadata, and so on. None of that affects how you get billed, so it is not included in VolumeBytesUsed.
Neptune also supports copy-on-write, where you can create a cloned volume from another cluster. One thing to note with cloned volumes is that the new cluster only takes up space for pages that have diverged from the source. So when you plot the VolumeBytesUsed metric for a clone, you will see a much smaller number for the clone as long as the source cluster still exists. If you delete the source cluster, the space is then re-adjusted in the clones. Do make a note of this to avoid any possible confusion later on.
The last thing to note is that Neptune, as of Sept 2020, does not do volume shrinking. VolumeBytesUsed is pretty much a high watermark of how many data pages were used. Deleting a lot of data just clears the data in the data pages; it does not remove the pages from the volume. So if you create a cluster, add a bunch of data and then delete everything, your VolumeBytesUsed will still show the high watermark. When you insert new data, we reuse the available data pages first, so you don't end up paying for new data pages.
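The high-watermark behaviour described above can be illustrated with a toy model. This is only a sketch of the accounting as described in this answer, not Neptune's actual implementation; the `Volume` class and its page counts are hypothetical.

```python
class Volume:
    """Toy model of a volume that never shrinks (not Neptune's real implementation)."""

    def __init__(self) -> None:
        self.allocated_pages = 0  # high watermark: pages ever allocated
        self.free_pages = 0       # allocated pages whose data was deleted

    def insert(self, pages: int) -> None:
        reused = min(pages, self.free_pages)     # reuse cleared pages first
        self.free_pages -= reused
        self.allocated_pages += pages - reused   # allocate only the shortfall

    def delete(self, pages: int) -> None:
        in_use = self.allocated_pages - self.free_pages
        self.free_pages += min(pages, in_use)    # pages are cleared, not released


vol = Volume()
vol.insert(100)
vol.delete(100)
after_delete = vol.allocated_pages    # deleting does not lower the watermark
vol.insert(60)
after_reinsert = vol.allocated_pages  # new data fits into the cleared pages
vol.insert(70)
after_growth = vol.allocated_pages    # 40 cleared pages reused, 30 newly allocated
print(after_delete, after_reinsert, after_growth)
```

In this model the billed size only grows once the previously cleared pages have been used up.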
AWS CloudWatch can be used to figure out the exact size of your filled database.
Under Metrics you can select Neptune and search for the MetricName='VolumeBytesUsed'. This will show you the amount of data that has been uploaded to your database.
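The same lookup can be scripted. Below is a minimal sketch using boto3's CloudWatch `get_metric_statistics` call. The `AWS/Neptune` namespace and the `VolumeBytesUsed` metric name come from the answers here; the `DBClusterIdentifier` dimension name, the 5-minute period, the 3-hour window and the cluster name in the usage comment are assumptions you should check against your own account.

```python
from datetime import datetime, timedelta, timezone


def latest_volume_bytes_used(cloudwatch, cluster_id: str, hours: int = 3):
    """Return the most recent VolumeBytesUsed value for a cluster, or None.

    `cloudwatch` is expected to be a boto3 CloudWatch client.
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Neptune",
        MetricName="VolumeBytesUsed",
        # Assumed dimension name for the "By Cluster" view.
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to be ordered, so pick the newest explicitly.
    newest = max(datapoints, key=lambda point: point["Timestamp"])
    return newest["Average"]


# Usage (needs boto3 and AWS credentials; the cluster name is a placeholder):
#   import boto3
#   print(latest_volume_bytes_used(boto3.client("cloudwatch"), "my-neptune-cluster"))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub when no AWS credentials are available.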
It really depends on how much data you store in vertex and edge properties. Taylor's answer here explains more, as storage capacity is dynamically allocated in Amazon Neptune.
Since Bitcoin is a blockchain and a blockchain has been described as a kind of database, what would the data schema of Bitcoin look like? Is it a single-table database? If yes, which columns are in this table?
The data is stored in an application-specific format optimized for compact storage, and wasn't really intended to be easily parsed by other applications.
See https://bitcoin.stackexchange.com/q/10814
For this custom format, see https://en.bitcoin.it/wiki/Protocol_documentation#block
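As an illustration of that custom format, here is a minimal Python sketch that decodes the fixed 80-byte block header (version, previous block hash, Merkle root, timestamp, bits, nonce) with the standard `struct` module. It covers only the header, not the transactions that follow it, and the function name `parse_block_header` is my own. The sample bytes are the header of the Bitcoin genesis block.

```python
import hashlib
import struct


def parse_block_header(raw: bytes) -> dict:
    """Decode the 80-byte block header from the protocol documentation."""
    if len(raw) != 80:
        raise ValueError("a block header is exactly 80 bytes")
    # Little-endian: int32 version, 32-byte previous block hash, 32-byte Merkle
    # root, then uint32 timestamp, uint32 compact target (bits), uint32 nonce.
    version, prev_block, merkle_root, timestamp, bits, nonce = struct.unpack(
        "<i32s32sIII", raw
    )
    # The block hash is the double SHA-256 of the header; hashes are
    # conventionally displayed byte-reversed.
    block_hash = hashlib.sha256(hashlib.sha256(raw).digest()).digest()
    return {
        "hash": block_hash[::-1].hex(),
        "version": version,
        "prev_block": prev_block[::-1].hex(),
        "merkle_root": merkle_root[::-1].hex(),
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
    }


# Header of the Bitcoin genesis block, assembled field by field.
GENESIS_HEADER = bytes.fromhex(
    "01000000"
    + "00" * 32
    + "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"
    + "29ab5f49"
    + "ffff001d"
    + "1dac2b7c"
)

header = parse_block_header(GENESIS_HEADER)
print(header["hash"], header["timestamp"], header["nonce"])
```

So rather than rows and columns, a block is a packed binary record: this header followed by a variable-length list of serialized transactions.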
There are various databases for various purposes. I will use Bitcoin Core, the reference client, and describe the standard structure it stores. It uses LevelDB and Berkeley DB 4.8 for storing all kinds of data.
Wallet database
Saves your transactions, generated public/private keys. That is usually encrypted ;)
Source: Wallets
Index Database
It's OPTIONAL, but when enabled it stores a list of all transactions and the block in which each occurred.
Block Database
It's the most important database. It is stored locally and shared via the network so that nodes can communicate about newly created blocks and verify them. Every client has a copy of it.
It usually stores all blocks that have ever occurred, including forked-off and obsolete blocks.
Source: Blockchain / Transactions
Peers Database
Obviously there also is a database for all peers you have seen in the past. It rates each peer by giving it a ban-score, stores their IP addresses, ports and last seen status.
Conclusion:
That would be all databases. They mostly have "one table" which includes exactly the previously described data structures.
More information about the p2p network structure can be found right here.
I've got an external table in BigQuery that pulls its data from Avro files on Google Cloud Storage. I'm currently hive partitioning the data on date as every query will use the date, with an emphasis on newer data. I'm considering also partitioning further on organisation.
I'm not finding much information on best practices for partitioning to maintain performance and keep costs low. Should I be aiming to keep the number of file reads low (i.e. have a small number of larger files), or should I be looking to keep the number of bytes read by BigQuery low (more, smaller files with a fine-grained partition strategy)? Or perhaps it's more nuanced and there's a balance to be kept?
I know this is a tough question without understanding the dataset and queries but I just want to find somewhere to start from rather than just guessing and having to change it later.
There is no general prescription for getting the best performance when querying externally stored (federated) data through BigQuery, as it mostly depends on the use case and the customer's purpose. The GCP documentation cites these use cases:
Loading and cleaning your data in one pass by querying the data from an external data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated.
As I mentioned in the comment, because of the limitations of external data sources, if query performance is the leading factor, then it is recommended to switch to the classic approach of loading the data into BigQuery:
Query performance for external data sources may not be as high as querying data in a native BigQuery table. If query speed is a priority, load the data into BigQuery instead of setting up an external data source.
Having said this, there is no specific enhancement of I/O operations with GCS when it is used as a BigQuery external data source:
In general, query performance for external data sources should be equivalent to reading the data directly from the external storage.
I found a paper which is talking about a way to store data off-chain using the blockchain. The data are sent to the blockchain with a transaction which subsequently routes it to an off-blockchain store, while retaining only a pointer to the data on the public ledger.
In particular the paper says:
Consider the following example: a user installs an application that uses our platform for preserving her privacy. As the user signs up for the first time, a new shared (user, service) identity is generated and sent, along with the associated permissions, to the blockchain in a Taccess transaction. Data collected on the phone (e.g., sensor data such as location) is encrypted using a shared encryption key and sent to the blockchain in a Tdata transaction, which subsequently routes it to an off-blockchain key-value store, while retaining only a pointer to the data on the public ledger (the pointer is the SHA-256 hash of the data).
What I cannot understand is how they do it! If all the nodes on the blockchain have to execute that very transaction, it means that they all have to save that information off-blockchain, causing the content to be duplicated. Did I get it wrong?
After a quick glance at the paper in question, it makes no mention of storage replication. The use case they are describing here is to use blockchain transactions as references to physical data that is stored somewhere else. The data can be accessed by anyone who has the reference to it, i.e. anyone with access to that particular blockchain system. However, the data is encrypted so that only parties with the encryption key can actually decipher it. This approach allows for quick validation of data integrity while maintaining privacy.
From the perspective of a blockchain node, all it sees is a transaction that will be added to its local ledger. The nodes don't actually save the data themselves.
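The pointer scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the paper's implementation: a dict stands in for the off-blockchain key-value store, a list stands in for the public ledger, and the function names are hypothetical. Encryption is assumed to have happened before `submit_data` is called.

```python
import hashlib

off_chain_store = {}  # stand-in for the off-blockchain key-value store
ledger = []           # stand-in for the public ledger every node replicates


def submit_data(encrypted_blob: bytes) -> str:
    """Store the blob off-chain and record only its SHA-256 hash on the ledger."""
    pointer = hashlib.sha256(encrypted_blob).hexdigest()
    off_chain_store[pointer] = encrypted_blob
    ledger.append({"type": "Tdata", "pointer": pointer})  # no payload on-chain
    return pointer


def fetch_and_verify(pointer: str) -> bytes:
    """Fetch the blob and check that it still matches the on-chain pointer."""
    blob = off_chain_store[pointer]
    if hashlib.sha256(blob).hexdigest() != pointer:
        raise ValueError("off-chain data does not match the on-chain pointer")
    return blob


pointer = submit_data(b"already-encrypted sensor reading")
print(ledger[-1])
print(fetch_and_verify(pointer))
```

Each node replicates only the 32-byte hash in the ledger entry, while the payload lives once in the off-chain store. Because the pointer is the hash of the data, any tampering with the stored blob is detected on retrieval.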
I want to store personal data on a blockchain for a company. We want to prove that the data cannot be changed. A customer on the blockchain must not be able to access or see any other customer's data.
But the company must be able to access all customer data, perform any operation on it, and track every operation and every access log.
The company will store a new form type (personal data) and flag it as a personal data card.
Is it possible with Blockchain?
The best method would be to encrypt the data, but it really depends on what you are doing with it. If you need to perform operations on it, then you will have to use zk-SNARKs, but these are a new field and you would have to do a lot of research to get them working. If you aren't using the data for anything and it's just metadata, then why would you need it to be on a public ledger and validated?
Plus, there is one big problem with storing sensitive data on the blockchain: the blockchain is immutable, and once something is on the blockchain, it is stored forever. So what if there comes a time when quantum computers become so powerful that they can break all the encryption we have today? Then all your users' personal data will be public on the blockchain.