What are the files that require backup (ledger) in sawtooth validator - blockchain

What is the main set of files required to orchestrate a new network from the data of an old Sawtooth network (I don't want to extend the old Sawtooth network)?
I want to back up the essential files that are crucial for operating the network from the last block in the ledger.
Here is the list of files that were generated by the Sawtooth validator with PoET consensus:
block-00.lmdb
poet-key-state-0371cbed.lmdb
block-00.lmdb-lock
poet_consensus_state-020a4912.lmdb
block-chain-id
poet_consensus_state-020a4912.lmdb-lock
merkle-00.lmdb
poet_consensus_state-0371cbed.lmdb
merkle-00.lmdb-lock
txn_receipts-00.lmdb
poet-key-state-020a4912.lmdb
txn_receipts-00.lmdb-lock
poet-key-state-020a4912.lmdb-lock
What is the significance of each file, and what are the consequences if it is not included when restarting the network or creating a new network with the old ledger data?

The answer to this question could get long, so I will cover the main parts here for the benefit of folks who have the same question; this will especially help when they want to deploy the network through Kubernetes. Similar questions are also asked frequently in the official RocketChat channel.
The essential files for the Validator and PoET are stored by default in the /etc/sawtooth (keys and config) and /var/lib/sawtooth (data) directories, unless changed. Create a mounted volume for these directories so that they can be reused when a new instance is orchestrated.
Here is the file through which the default validator paths can be changed: https://github.com/hyperledger/sawtooth-core/blob/master/validator/packaging/path.toml.example
Note that you have missed the keys in your list of essential files, and they play an important role in the network. In the case of PoET, each enclave's registration information is stored in the Validator Registry against the validator's public key. In the case of the Raft/PBFT consensus engines, the keys (member list information) are used to send peer-to-peer messages.
In the case of Raft, the data directory is /var/lib/sawtooth-raft-engine.
The significance of each file you listed may not matter to most people, but here is an explanation of the important ones.
The *-lock files you see are system generated. If you see them, one of the processes must have opened the corresponding file for writing.
block-00.lmdb is the block store/blockchain; it holds key-value pairs of block ID to block information. It is also possible to index blocks by other keys. The Hyperledger Sawtooth documentation is the right place to understand the complete details.
merkle-00.lmdb stores the state root hash/global state; it is a Merkle tree representation in key-value pairs.
txn_receipts-00.lmdb is where transaction execution status is stored upon success. It also holds information about any events associated with those transactions.
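If you want to sanity-check a backed-up copy before orchestrating the new network, the LMDB files can be opened read-only. Here is a minimal sketch using the py-lmdb Python package; the package choice, the default /var/lib/sawtooth path, and the subdir=False flag for file-backed environments are my assumptions, not a documented Sawtooth interface.

import lmdb

# Sawtooth stores each database as a single file, so open with subdir=False.
for name in ["block-00.lmdb", "merkle-00.lmdb", "txn_receipts-00.lmdb"]:
    env = lmdb.open("/var/lib/sawtooth/" + name, subdir=False, readonly=True, lock=False)
    print(name, env.stat()["entries"], "entries in the main database")
    env.close()

A non-zero entry count for each file is a quick sign that the copy is not truncated or corrupted.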

Here is a list of files from the Sawtooth FAQ:
https://sawtooth.hyperledger.org/faq/validator/#what-files-does-sawtooth-use

Related

Does IPFS use a linked list system like blockchain networks do?

I'm in the process of developing a blockchain-based application for a client that wishes to store files securely. For this purpose I am using IPFS to store the files and the blockchain (more specifically, an Ethereum network) to store the hashes of the files, as is the case in most such applications.
However, the client is insistent on storing the files directly on the blockchain because of the linked-list property that makes the hash of every block on the blockchain depend on the previous block, so that all the hashes depend on one another.
Does IPFS have a similar feature in its data structure? I realize that the Merkle tree system ensures that any tampering with any of the data chunks referenced by the root hash will change the root hash, and as such allows verification of shared files. However, is there any feature that makes the hashes of files dependent on each other?
Perhaps if the files were in some sort of directory structure?
IPFS blocks form a DAG - Directed Acyclic Graph. A blockchain is a specific kind of DAG where each node has only one child. As you say, the root block of a file contains an array of the hashes of the component blocks. Similarly, a directory object contains a dictionary that maps filenames to the hashes of those root blocks. So, if you add a directory to ipfs, you will have a single hash that validates the entire contents of the directory.
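To see that in practice, here is a minimal sketch using the ipfshttpclient Python package against a local IPFS daemon; the package, the default API endpoint, and the directory name are assumptions on my part.

import ipfshttpclient

# Assumes a local IPFS daemon listening on the default API port.
client = ipfshttpclient.connect()

# Adding a directory recursively returns one entry per file plus a final entry
# for the directory object itself; that last hash commits to every child hash.
entries = client.add("my_dir", recursive=True)
print("directory root hash:", entries[-1]["Hash"])

# Changing any file under my_dir and adding it again produces a different root
# hash, because the directory object stores the (now different) child hashes.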

How to make bitcoin hard fork

I have been exploring the Bitcoin source code for some time and have successfully created a local Bitcoin network with a new genesis block.
Now I am trying to understand the process of hard forks (if I am using the wrong terms here, I am referring to the kind where the blockchain is split instead of mining a new genesis block).
I am trying to find this approach in the Bitcoin Cash source code, but I haven't gotten anywhere so far except the checkpoints.
//UAHF fork block.
{478558, uint256S("0000000000000000011865af4122fe3b144e2cbeea86"
"142e8ff2fb4107352d43") }
So far I understand that the above checkpoint is responsible for the chain split, but I am unable to find the location in the source code where this rule is enforced, i.e. the code where it is specified to have a different block than Bitcoin after block number 478558.
Can anyone point me in the right direction here?
There is not a specific rule that you put in the source code that says "this is where the fork starts". The checkpoints are just for bootstrapping a new node; they are checked to make sure the right chain is being downloaded and verified.
A hard fork is, by definition, just a change in consensus rules. By nature, if you introduce new consensus-breaking rules, any nodes running Bitcoin will reject the blocks that are incompatible, and as soon as one block is rejected (and mined on the other chain) you will have two different chains.
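To make that concrete, here is a toy sketch in Python, not Bitcoin code; the block sizes, heights, and size limits are invented for illustration. The split simply happens at the first block that one rule set accepts and the other rejects.

# Hypothetical rule change: raise the maximum block size from 1 MB to 8 MB.
LEGACY_MAX = 1_000_000
FORKED_MAX = 8_000_000

def first_divergence(block_sizes, start_height):
    """Return the height of the first block legacy nodes reject but upgraded nodes accept."""
    for height, size in enumerate(block_sizes, start=start_height):
        if LEGACY_MAX < size <= FORKED_MAX:
            return height            # legacy nodes reject this block; the chains split here
    return None                      # no rule-breaking block yet, still one chain

# Invented block sizes mined after the activation height quoted in the question.
print(first_divergence([900_000, 2_000_000, 700_000], start_height=478_559))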
As a side note, you should probably change the default P2P ports and P2P message headers in chainparams.cpp so it doesn't try to connect with other Bitcoin nodes.

Is it possible to store images on the Ethereum blockchain?

I'm ramping up on learning Solidity, and have some ideas. At the moment I am curious if files/images can be put on the blockchain. I'm thinking an alternative would be some hybrid approach where some data is on the blockchain and some is in more traditional file storage, with address references used to grab it. One issue I foresee is the gas price of file uploads.
Is it possible to store images on the Ethereum blockchain?
It's absolutely possible!
Should you do it? Almost certainly not!
One issue I foresee is gas price of file uploads.
The cost of data storage is 640k gas per kilobyte of data.
The current gas price is approximately 15 Gwei (or 0.000000015 ETH).
At today's price, 1 ETH is approximately $200.
That works out at just under $2 per kilobyte.
It's not for me to tell you if this is too expensive for your application, but you should also consider that the prices of both gas and Ether vary dramatically over time, and you should expect periods when this number will be significantly higher.
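To spell out the arithmetic behind those figures (a Python sketch; the gas price and ETH price are the assumptions stated above and change constantly):

# 1 kB = 32 storage slots of 32 bytes; each new slot costs 20,000 gas to write.
GAS_PER_KB = 32 * 20_000          # = 640,000 gas per kilobyte
GAS_PRICE_ETH = 15e-9             # 15 gwei per gas, as assumed above
ETH_USD = 200                     # assumed ETH price

print(GAS_PER_KB * GAS_PRICE_ETH * ETH_USD)   # ~1.92 USD per kilobyte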
Note: I tried to store the 10,000+ character base64 string of a 100 kB image, but it was not accepted; when I tried a 1 kB image, it worked.
Yes. This is Solidity code to do it:
// SPDX-License-Identifier: GPL-3.0
pragma solidity >=0.7.0 <0.9.0;

contract ImgStorage {
    uint i = 0;                                       // index of the next stored image
    mapping(uint => string[]) public base64_images;   // index => base64-encoded image data

    // Store one base64-encoded image under the next free index.
    function push(string memory base64_img) public {
        base64_images[i].push(base64_img);
        i++;
    }

    // Return the image(s) stored under index n.
    function returnImage(uint n) public view returns (string[] memory) {
        return base64_images[n];
    }
}
You can convert an image to base64 and vice versa online.
Here is Node.js code to convert an image to a base64 string:
const imageToBase64 = require('image-to-base64');
const fs = require('fs');

imageToBase64("img/1kb.png")                                  // path to the image file
    .then(data => fs.writeFile('1kb_png.md', data, (err) => { if (err) console.log(err); }))
    .catch(err => console.log(err));
I totally agree with @Peter Hall that storing the image on Ethereum is too costly.
So, what can you do instead?
You can store the image on IPFS. IPFS gives you a fixed-length hash. You can then store this hash on Ethereum, which costs much less.
Technically, yes, you could store very small images. But you shouldn't.
Preferred Alternative
Store the image in a distributed file store (for example, Swarm or IPFS), and store a hash of the image on-chain, if it's really important for the image to be provably untampered. If that's not important, then maybe don't put anything on chain.
What technical limit is there?
Primarily, the block's gas limit. Currently, Ethereum mainnet has an 8M gas block limit. Every new 32 bytes of storage uses 20k gas, so you can't store data that sums to more than 12.8 kB, because it doesn't fit in the block.
Why shouldn't I use it for small files?
The blockchain wasn't designed for that usage (which is why other projects like Swarm and IPFS exist). It bloats and slows everything down, without providing you any benefit over other file storage systems. By analogy, you typically don't store files in a SQL database, either.
You can store images on the Ethereum blockchain, but it is too expensive because of the "blockspace premium" of a high-quality blockchain.
Other, more affordable decentralised storage solutions include
Chia
Filecoin (where Filecoin VM smart contracts can manipulate files)
Arweave
Storj
Storing images on-chain is an emphatic NO!
Storing images in a database is also not good practice; I'm assuming you just mean file storage solutions like S3 or Firebase. Storing images on a central server is okay, but it depends on what you want to achieve; there are decentralized storage solutions like IPFS and Swarm that you could look into.
Ethereum is too heavy as well as too expensive to store large blobs like images, video, and so on. Hence, some external storage is necessary to store bigger objects. This is where the InterPlanetary File System (IPFS) comes into the picture. An Ethereum dapp can hold a small amount of data, whereas for saving anything bigger such as images, Word documents, PDF files, and so on, we use IPFS.
IPFS is an open-source protocol and network designed to create a peer-to-peer method of storing and sharing data. It is similar to BitTorrent.
If you want to upload a PDF, Word, or image file to IPFS:
1- You put the PDF, Word, or image file in your working directory.
2- You inform IPFS to add this file, which generates a hash of the file. Note an IPFS hash always starts with “Qm....”
3- Your file is available on the IPFS network.
Now you have uploaded the file and want to share it with Bob. You send the hash of the file to Bob; Bob uses the hash and requests the file from IPFS. The file is then downloaded at Bob's end. The issue here is that anyone who gets access to the hash will also be able to access the file.
Sharing Data on IPFS with Asymmetric Cryptography
Let's say you uploaded a file to IPFS and you want to share it only with Bob.
Bob sends you his public key. You encrypt the file with Bob's public key and then upload it to the IPFS network.
You send the hash of the file to Bob. Bob uses this hash and gets the file.
Bob decrypts the file using the private key corresponding to the public key that was used to encrypt it.
In asymmetric cryptography, the public key is derived from the private key, and if you lock something with a public key, the only key that will unlock it is the private key from which that public key was derived.
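Here is a minimal sketch of that flow in Python using PyNaCl sealed boxes; the library choice and the file names are my assumptions, and any asymmetric scheme would work the same way.

from nacl.public import PrivateKey, SealedBox

# Bob generates a key pair and shares only the public key with you.
bob_private = PrivateKey.generate()
bob_public = bob_private.public_key

# You encrypt the file with Bob's public key before adding it to IPFS.
with open("report.pdf", "rb") as f:                 # hypothetical file name
    ciphertext = SealedBox(bob_public).encrypt(f.read())
with open("report.pdf.enc", "wb") as f:
    f.write(ciphertext)                             # this encrypted blob is what you add to IPFS

# Bob fetches the blob by its IPFS hash and decrypts it with his private key.
plaintext = SealedBox(bob_private).decrypt(ciphertext)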
A better way of dealing with files on the blockchain is to use some sort of file storage service like AWS S3, IPFS, Swarm, etc.
You can upload the files to one of these storage services, generate the hash key (which is used to access the file), and keep this key on the blockchain.
The advantages of using this method are:
Low-cost solution
Easy and fast access to files using file storage searching algorithms
Lightweight
Flexibility to move from blockchain to DB or vice-a-versa
If the file storage system has good protection against tampering, you can be assured that these files will not be accessed without the right hash key
Easy to perform migration of file storage from one service to another

S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.5.jar to write Parquet files to an S3 bucket.
We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in S3 (d7 being a text file). We also have keys 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file).
When we run a Spark job to write a b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (i.e. we would like 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' to be created in S3), we get the error below:
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
As discussed in HADOOP-15542, you can't have files under other files in a "normal" FS; you don't get them in the S3A connector either, at least where it does enough due diligence.
It just confuses every single tree-walking algorithm, renames, deletes, anything which scans for files. This includes the Spark partitioning logic. The new directory tree you are trying to create would probably be invisible to callers. (You could test this by creating it, doing the PUT of that text file into place, and seeing what happens.)
We try to define what an FS should do in The Hadoop Filesystem Specification, including defining things "so obvious" that nobody bothered to write them down or write tests for, such as
Only directories can have children
All children must have a parent
Only files can have data (exception: ReiserFS)
Files are as long as they say they are (this is why S3A doesn't support client-side encryption, BTW).
Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out the box, but which object stores don't. Then we add tests, try our best to maintain the metaphor except when the performance impact would make it unusable. Then we opt not to fix things and hope nobody notices. Generally, because people working with data in the hadoop/hive/spark space have those same preconceptions of what a filesystem does, those ambiguities don't actually cause problems in production.
Except of course eventual consistency, which is why you shouldn't be writing data straight to S3 from spark without a consistency service (S3Guard, consistent EMRFS), or a commit protocol designed for this world (S3A Committer, databricks DBIO).
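To find which ancestor prefix is the offending plain object, here is a hedged diagnostic sketch using boto3; the library and the example bucket/key are assumptions, and it simply issues a HEAD request for each parent prefix of the target key.

import boto3
from botocore.exceptions import ClientError

def file_like_parents(bucket, key):
    """Return ancestor prefixes of `key` that already exist as plain S3 objects."""
    s3 = boto3.client("s3")
    conflicts = []
    parts = key.strip("/").split("/")
    for i in range(1, len(parts)):
        parent = "/".join(parts[:i])
        try:
            s3.head_object(Bucket=bucket, Key=parent)   # succeeds only if an object exists here
            conflicts.append(parent)
        except ClientError:
            pass                                        # no object at this prefix; keep walking
    return conflicts

# Example with the paths from the question:
# print(file_like_parents("mybucket", "d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet"))

In the scenario above this would report 'd1/d2/d3/d4/d5/d6/d7', which has to be deleted or renamed before the directory tree can be created.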

Inotify-like feature in a distributed file system

As the title says, I want to trigger a notification when certain events happen.
An event can be user-defined, such as specified files being updated within one minute.
If the files are stored locally, I can easily do this with the inotify system call, but the files are located on a distributed file system such as mfs.
How can I do this? I would like to know if there are solutions or open-source projects that solve this problem. Thanks.
If you have only black-box access (e.g. the NFS protocol) to the remote system(s), you don't have many options unless the protocol supports what you need. So I'll assume you have control over the remote systems.
The "trivial" approach is running a local inotify/fanotify listener on each computer that would forward the notification over the network. FAM can do this over NFS.
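A minimal sketch of that trivial approach in Python, assuming the watchdog package, a hypothetical UDP collector address, and an assumed export path:

import json, socket
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

COLLECTOR = ("notify.example.internal", 9999)   # hypothetical central collector

class Forwarder(FileSystemEventHandler):
    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def on_any_event(self, event):
        msg = {"type": event.event_type, "path": event.src_path}
        self.sock.sendto(json.dumps(msg).encode(), COLLECTOR)   # fire-and-forget; can be lost

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(Forwarder(), "/export/shared", recursive=True)   # assumed export path
    observer.start()
    observer.join()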
A problem with all notification-based systems is the risk of lost notifications in various edge cases. This becomes much more acute over a network, e.g. a client confirms receipt of a notification, then immediately crashes. There are reliable message queues you can build on, but IMHO this way lies madness...
A saner approach is a stateless hash-based scan.
I like to call the following design "hnotify" but that's not an established term. The ideas are widely used by many version control and backup systems, dating back to Plan 9.
The core idea is if you know cryptographic hashes for files, you can compose a single hash that represents a directory of files - it changes if any of the files changed - and you can build these bottom-up to represent the whole filesystem's state.
(Git stores things this way and is very efficient at it.)
Why are hash trees cool? If you have two hash trees, one representing the filesystem state you saw at some point in the past and one representing the current state, you can easily find out what changed between them (see the sketch after this list):
1. Start at the roots. If they are different, read the two root directories and compare the hashes of their subdirectories.
2. If a subdirectory has the same hash in both trees, then nothing under it changed. No point going there.
3. If a subdirectory's hash changed, compare its contents recursively, i.e. go back to step 1.
4. If one tree has a subdirectory the other doesn't, well, that's a change. With some global table you can also detect moves/renames.
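Here is a minimal sketch of that comparison in Python; the hashing scheme and helper names are my own, not from any particular tool.

import hashlib, os

def hash_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(path):
    """Return (hash, children): a Merkle-style hash for `path` plus the hashed subtree."""
    if not os.path.isdir(path):
        return hash_file(path), {}
    children = {}
    lines = []
    for name in sorted(os.listdir(path)):
        child_hash, grandchildren = hash_tree(os.path.join(path, name))
        children[name] = (child_hash, grandchildren)
        lines.append(name + ":" + child_hash)
    dir_hash = hashlib.sha256("\n".join(lines).encode()).hexdigest()
    return dir_hash, children

def changed_paths(old, new, prefix="."):
    """Walk two hash trees top-down, descending only where the hashes differ."""
    if old[0] == new[0]:
        return                                    # identical hash: nothing below changed (step 2)
    yield prefix
    old_children, new_children = old[1], new[1]
    for name in sorted(set(old_children) | set(new_children)):
        if name in old_children and name in new_children:
            yield from changed_paths(old_children[name], new_children[name], prefix + "/" + name)
        else:
            yield prefix + "/" + name             # added or removed entry (step 4)

# Usage: snapshot once, snapshot again later, then diff.
# before = hash_tree("/srv/export")
# after = hash_tree("/srv/export")
# print(list(changed_paths(before, after)))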
Note that if few files changed, you only read a small portion of the current state. So the remote system doesn't have to send you the whole tree of hashes; it can be an interactive ping-pong of "give me hashes for this directory; ok now for this...".
(This is akin to how Git's dumb http protocol worked; there is a newer protocol with less round trips.)
This is as robust and bug-proof as polling the whole filesystem for changes — you can't miss anything — but reasonably efficient!
But how does the server track current hashes?
Unfortunately, fully hashing all disk writes is too expensive for most people. You may get it for free if you're lucky enough to be running a deduplicating filesystem, e.g. ZFS or Btrfs.
Otherwise you're stuck with re-reading all changed files (which is even more expensive than doing it in the filesystem layer) or using fake file hashes: upon any change to a file, invent a new random "hash" to invalidate it (and try to keep the fake hashes on moves). Still compute real hashes up the tree. Now you may have false positives — you "detect a change" when the content is the same — but never false negatives.
Anyway, the point is that whatever stateful hacks you do (e.g. inotify with periodic scans to be sure), you only do them locally on the server. Across the network, you only ever send hashes that represent snapshots of current state (or its subtrees)! This way you can have a distributed system with many servers and clients, intermittent connectivity, and still keep your sanity.
P.S. Btrfs can efficiently find differences from an older snapshot. But this is a snapshot taken on the server (and causing all data to be preserved!), less flexible than a client-side lightweight tree-of-hashes.
P.S. One of your tags is HadoopFS. I'm not really familiar with it, but I suspect a lot of its files are write-once-then-immutable, and it might be able to natively give you some kind of file/chunk ids that can serve as fake hashes?
Existing tools
The first tool that springs to my mind is bup index. bup is a very clever deduplicating backup tool built on git (only scalable to huge data), so it sits on the foundation described above. In theory, indexing data in bup on the server and doing git fetch over the network would even implement the hash-walking comparison of what's new that I described above — unfortunately the git repositories that bup produces are too big for git itself to cope with. Also you probably don't want bup to read and store all your data. But bup index is a separate subsystem that quickly scans a filesystem for potential changes, without yet reading the changed files.
Currently bup doesn't use inotify but it's been discussed in depth.
Oh, and bup uses Bloom filters, which are a nearly optimal way to represent sets with false positives. I'm almost certain Bloom filters have a role to play in optimizing stateless notification protocols ("here is a compressed bitmap of all I have; you should be able to narrow your queries with it" or "here is a compressed bitmap of what I want to be notified about"). Not sure if the way bup uses them is directly useful to you, but this data structure should definitely be in your toolbelt.
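For reference, a tiny Bloom filter sketch in Python (the bit count and hash count are arbitrary illustrations, not tuned values):

import hashlib

class BloomFilter:
    """Tiny Bloom filter: set membership with false positives, never false negatives."""
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))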
Another tool is git annex. It's also based on Git (are you noticing a trend?) but is designed to keep the data itself out of Git repos (so git fetch should just work!) and has a "WORM" option that uses fake hashes for faster performance.
Alternative design: compressed replayable journal
I used to think the above is the only sane stateless approach for clients to check what's changed. But I just read http://arstechnica.com/apple/2007/10/mac-os-x-10-5/7/ about OS X's FSEvents framework, which has a perhaps simpler design:
ALL changes are logged to a file. It's kept forever.
Clients can ask "replay for me everything since event 51348".
The magic trick is the log has coarse granularity ("something in this directory changed, go re-scan it to find out what", repeated changes within 30 seconds are combined) so this journal file is very compact.
At the low level you might resort to similar techniques — e.g. hashes — but the top-level interface is different: instead of snapshots you deal with a timeline of events. It may be an easier fit for some applications.
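A tiny sketch of such a journal in Python, under the assumptions above (the 30-second coalescing window and the class name are illustrative):

import time

class CoarseJournal:
    """Append-only change log in the spirit of FSEvents: one entry per directory,
    with repeated changes to the same directory coalesced inside a time window."""

    def __init__(self, coalesce_seconds=30):
        self.coalesce = coalesce_seconds
        self.events = []          # list of (event_id, timestamp, directory); kept forever
        self.next_id = 1

    def record_change(self, directory):
        now = time.time()
        for event_id, ts, d in reversed(self.events):
            if now - ts > self.coalesce:
                break                              # older than the window; stop scanning
            if d == directory:
                return event_id                    # coalesced into the recent entry
        self.events.append((self.next_id, now, directory))
        self.next_id += 1
        return self.next_id - 1

    def replay_since(self, last_seen_id):
        """Client API: 'replay for me everything since event N'; returns directories to re-scan."""
        return [d for event_id, ts, d in self.events if event_id > last_seen_id]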