What exactly is low-level storage management like iRODS for (in fedora-commons)?

I am not clear about the actual advantage of having iRODS or any other low-level storage management. What exactly are its benefits, and when should we use it?
In Fedora Commons with normal file-system low-level storage:
a datastream created on May 8th, 2009 might be located in the 2009/0508/20/48/ directory.
How is iRODS helpful here?

I wanted to close the loop here, for other Stack Overflow users.
You posted the same question to our Google Group https://groups.google.com/d/msg/irod-chat/fti4ZHvmS-Y/LU8CQCZQHwAJ The question was answered there, and, thanks to you, the response is now also posted on the iRODS.org FAQ: http://irods.org/faq/
Here it is, once again, for posterity:
Don’t think of iRODS as simply low level storage management.
iRODS is really the only platform for policy managed data preservation. It does indeed virtualize storage, providing a global, logical namespace over heterogeneous types of storage, but it also allows you to enforce preservation policies at each storage location, no matter what client or access method is used. It also provides a global metadata catalog that is automatically maintained and reflects the application of your preservation policies, allowing audit and verification of your preservation policies.
iRODS is developing a powerful metadata management capability, with pluggable indexing and query capabilities that allow synchronization with external indices (e.g. Elasticsearch, MAUI, Jena triple store).
With the pluggable rule engine and asynchronous messaging architecture, it becomes rather straightforward to generate audit and provenance metadata that will track every single (pre- and post-) operation on your data, including any plugins you may develop or utilize.
iRODS is middleware, rather than a prepackaged solution. This middleware supports plugins and configurable policies at all points, so you are not limited by a pre-defined set of tools. iRODS can also be connected to a wide range of preservation, computation, and enterprise services, can manage large amounts of data (both in number of objects and size of those objects), and can efficiently move and manage data using high-performance protocols, including third-party data transfer protocols.
iRODS is built to support federation, so that your preservation environment may share data with other institutions or organizations while remaining under your own audit and policy control. Many organizations are doing this for many millions of objects, many thousands of users, and with a large range of object sizes.

Related

What are the true purposes of a managed private blockchain service such as Azure Blockchain Service in terms of data interoperability and provenance

I ask this question because I want to facilitate a workflow that utilizes a managed blockchain service such as the Azure or AWS blockchain service.
Is the true purpose attestations, provenance and interoperability?
In that respect, aren't regular (legacy and/or current) methodologies sufficient for data interoperability and for the transfer and consumption of said data?
Lastly, if all this is effectively doing is creating a ledger account of data flow, would a true advantage be encryption of the data across the entire flow, including up to the edge?
If it cannot be encrypted up to the edge, so that it is not readable at any point in the data flow into the data archive/traditional store, is it effectively worth any of the previously described gains of provenance and interoperability?
I think there is some nuance to this answer. The purpose of Azure Blockchain Service is to allow enterprises to build networks (consortiums) that enable their business workflows. The unique value that blockchain adds to business workflows is a logical data model/flow with infrastructure shared among the participants (businesses). That is not easy to do with a traditional database model.
With regards to the encryption you mentioned above, the value of blockchain is in providing a digital signature for every change in the system that is shared between enterprises. This is typically done at the client to provide the least chance for manipulation. Privacy, which can use encryption techniques, is something that can be used to allow participants to control access to change details. The fact that changes were made is still cryptographically verifiable, without sharing all the data details with everyone.
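To make the client-side signing idea concrete, here is a minimal sketch using Node's built-in crypto module; it is not tied to Azure Blockchain Service, and the change record, curve, and key handling are illustrative assumptions only:
import { generateKeyPairSync, createSign, createVerify } from "crypto";

// Hypothetical change record; in a real consortium the schema comes from the workflow.
const change = JSON.stringify({ asset: "PO-1234", field: "quantity", newValue: 50 });

// Each participant would hold its own long-lived key pair; generated here only for the example.
const { privateKey, publicKey } = generateKeyPairSync("ec", { namedCurve: "P-256" });

// Sign at the client, before the change leaves the participant's control.
const signer = createSign("SHA256");
signer.update(change);
const signature = signer.sign(privateKey, "base64");

// Any other participant can verify that the change is authentic and untampered,
// without needing access to every data detail behind it.
const verifier = createVerify("SHA256");
verifier.update(change);
console.log("signature valid:", verifier.verify(publicKey, signature, "base64"));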
If you look at something like EDI that is done today with supply chains, it is essentially a complex network of enterprises synchronizing databases. This typically suffers breakage from trying to keep all those databases in sync. With a blockchain-based system, the "syncing" is abstracted away and the focus is more on the business logic, which is always cryptographically signed and verifiable. So it functions like a single "logical" data store, but is actually distributed.

Simple, editable data in AWS

I am working on a project that deals with execution of several models one after the other. For this, users need to upload a lot of files (mostly CSV) for each workflow, and each file has several columns.
Since understanding each file is difficult for users of our application, we want to provide friendly names, small descriptions, help texts, etc. for each file, and display them on our website.
These names and descriptions should be editable by people who are not developers (but who will have access to the AWS account). So we would prefer storage that provides some convenient user interface for this.
In the world of AWS, what would you recommend as storage for this use case? Is DynamoDB overkill / inconvenient for this?
Should we have a separate user interface and service to implement this feature?
Your choices of User Interface and Storage are completely independent.
Storage should be selected based upon the type of data and how it will be accessed. It might be relational if you are querying and joining a lot of data, or it might be NoSQL (DynamoDB or even Amazon S3) if you need fast, predictable performance but no complex querying.
The User Interface should not be impacted by the choice of storage. It should present the data for viewing/editing in a way that is most convenient for users. There is no reason to have UI drive the storage choice (unless you simply want to use Google Sheets as your frontend).
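If DynamoDB were chosen, each description would be a tiny item keyed by workflow and file name. A minimal sketch, assuming a hypothetical FileDescriptions table and the AWS SDK for JavaScript v3 (table, key, and attribute names are illustrative only):
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Store the friendly name/description for one uploaded file.
async function saveFileDescription() {
  await ddb.send(new PutCommand({
    TableName: "FileDescriptions",
    Item: {
      workflowId: "demand-forecast",   // partition key (assumed)
      fileName: "sales_history.csv",   // sort key (assumed)
      friendlyName: "Historical sales",
      description: "Monthly sales per store, used by the forecasting model",
      helpText: "One row per store per month",
    },
  }));
}

// Read it back when rendering the upload page.
async function getFileDescription() {
  const { Item } = await ddb.send(new GetCommand({
    TableName: "FileDescriptions",
    Key: { workflowId: "demand-forecast", fileName: "sales_history.csv" },
  }));
  return Item;
}
Non-developer editing could then happen either in the DynamoDB console's item editor or via a small admin page, which is exactly the UI/storage separation described above.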

What is the pattern for Google Cloud Functions to implement a mutex

I'm using HTTPS-triggered Google Cloud Functions to handle client requests to perform database writes. The data is structured in a way that most parallel writes will not result in corruption.
There are a few cases where I need to prevent multiple write actions from happening at once for the same item. What are the common patterns to lock access to some resource at the function level? I'm looking for some "mutex-like" functionality.
I was thinking of some external service that could grant or deny access to the resource for requesting function instances, but the connection overhead would be huge - handshake each time etc.
Added an example as requested. In this specific case, restructuring the data to keep track of updates isn't a suitable solution.
import * as admin from "firebase-admin";

function updateUserState(userId: string) {
  // Read-modify-write: query the current state...
  return admin
    .database()
    .ref()
    .child(`/users/${userId}/state`)
    .once("value")
    .then(snapshot => snapshot.val() || 0)
    .then(currentState =>
      // ...perform some operation on it...
      modifyStateAsync(currentState)
    )
    .then(newState =>
      // ...and write the result back (another call may have changed the state in between).
      admin
        .database()
        .ref()
        .child(`/users/${userId}/state`)
        .set(newState)
    );
}
This is not a pattern that you want to implement in Cloud Functions. Restricting the parallelism of Cloud Functions would limit its scalability, which is counter to the way Cloud Functions works. To learn more about how Cloud Functions scales, watch this video.
If you have a database that needs to have some protection against concurrent access, you should be using the database's own transaction features. Pretty much every database that provides concurrent access to data also provides some ability to perform atomic transactions. Use these transactions, and let the serverless container scale up and down in the way it sees fit.
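Applied to the example above, the Realtime Database already provides such a primitive: ref.transaction() re-runs your update function against the latest value until the write commits, so no function-level mutex is needed. A minimal sketch; note that the update function must be synchronous, so modifyState here is an assumed synchronous variant of the questioner's modifyStateAsync:
import * as admin from "firebase-admin";

// Stand-in for the real operation; transaction update functions must be synchronous.
const modifyState = (state: number) => state + 1;

function updateUserStateAtomically(userId: string) {
  return admin
    .database()
    .ref(`/users/${userId}/state`)
    .transaction(currentState => {
      // May be invoked more than once under contention; keep it free of side effects.
      return modifyState(currentState || 0);
    });
}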
In the Google Cloud there is an elegant way to have a global distributed mutex for a critical section in a Cloud Function:
gcslock
This is a library written in the Go language, and hence available for Cloud Functions written in Go, that utilises the atomicity guarantees of the Google Cloud Storage service. This approach is apparently not available on AWS because of the lack of such guarantees in the S3 service.
The tool is not applicable for every use case. Acquiring and releasing the lock are operations on the order of 10 ms, which might be too much for high-speed processing use cases.
For a typical batch process that is not time critical, the tool provides a pretty interesting option for guaranteeing that your Cloud Function is not running concurrently over the same target resource. Just create the lock file in GCS with a name that is unique for the operation that you'd like to put into the critical section, and release it once it's done (or rely on GCS object lifecycle management to clean the locks up).
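For Cloud Functions written in Node/TypeScript rather than Go, the same idea can be sketched directly against the Storage client, using an ifGenerationMatch: 0 precondition so that creating the lock object succeeds for exactly one caller. This is only a sketch: the bucket and lock names are assumptions, and it presumes a recent @google-cloud/storage release that accepts preconditionOpts on save():
import { Storage } from "@google-cloud/storage";

const storage = new Storage();
const bucket = storage.bucket("my-lock-bucket"); // assumed bucket name

// Acquire: creating the object succeeds only if it does not exist yet (generation 0).
async function acquireLock(lockName: string): Promise<boolean> {
  try {
    await bucket.file(lockName).save("", { preconditionOpts: { ifGenerationMatch: 0 } });
    return true;
  } catch (err: any) {
    if (err.code === 412) return false; // precondition failed: someone else holds the lock
    throw err;
  }
}

// Release: delete the lock object (or let a lifecycle rule expire stale locks).
async function releaseLock(lockName: string): Promise<void> {
  await bucket.file(lockName).delete();
}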
Please see more considerations and pros and cons in the original tool GitHub project.
There is also apparently an implementation of the same in Python.
Here is a nice article that summarises use cases for distributed locking on GCP in particular.

NuoDB and HDFS as storage

Using HDFS as storage for NuoDB. Would this have a performance impact?
If I understand correctly, HDFS is better suited for batch-mode or write-once, read-many types of applications. Would it not increase the latency for a record to be fetched in case it needs to be read from storage?
On top of that, given the HDFS block-size concept, keeping file sizes small would increase network traffic while data is being fetched. Am I missing something here? Please point it out.
How would NuoDB manage these kinds of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (the Archive Directory).
All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.
Ok, so now lets talk about Storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local File System
Amazon Web Services: Simple Storage Service (S3), Elastic Block Store (EBS)
Hadoop Distributed File System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing: whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency, because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do its magic of divvying up, distributing, retrieving and reassembling data. Add to that discrepancies in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't mentioned that TEs and SMs both have the ability to cache data in local memory. The size of this cache is something you can set when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near-constant stream of communication between all of the processes to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage collection also kicks in and clears out atoms in least-recently-used order when the cache grows close to hitting its limit.
All of this helps reduce latency, because the TEs can hold onto the data they reference most often and grab copies of data they don't have from sibling TEs. When they do resort to asking the SMs for data, there's a chance that the SM (or one of its sibling SMs) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.
If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB

What is the disadvantage of just using Redis instead of an RDBMS?

So if, for example, I am trying to implement something that looks like Facebook's Graph API that needs to be very quick and support millions of users, what is the disadvantage of just using Redis instead of an RDBMS?
Thanks!
Jonathan
There are plenty of potential benefits and potential drawbacks of using Redis instead of a classical RDBMS. They are very different beasts indeed.
Focusing only on the potential drawbacks:
Redis is an in-memory store: all your data must fit in memory. An RDBMS usually stores the data on disk and caches part of it in memory. With an RDBMS, you can manage more data than you have memory. With Redis, you cannot.
Redis is a data structure server. There is no query language (only commands) and no support for relational algebra. You cannot submit ad-hoc queries (like you can using SQL on an RDBMS). All data accesses have to be anticipated by the developer, and proper data access paths must be designed; a lot of flexibility is lost (see the sketch after this list).
Redis offers 2 options for persistency: regular snapshotting and append-only files. Neither of them is as secure as a real transactional server providing redo/undo logging, block checksumming, point-in-time recovery, flashback capabilities, etc.
Redis only offers basic security (in terms of access rights) at the instance level. RDBMSs all provide fine-grained per-object access control lists (or role management).
A single Redis instance is not scalable. It runs on only one CPU core in single-threaded mode. To get scalability, several Redis instances must be deployed and started. Distribution and sharding are done on the client side (i.e. the developer has to take care of them). Compared to a single Redis instance, most RDBMSs provide more scalability (typically with parallelism at the connection level). They are multi-processed (Oracle, PostgreSQL, ...) or multi-threaded (MySQL, Microsoft SQL Server, ...), taking advantage of multi-core machines.
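To illustrate the point about access paths having to be designed up front, here is a minimal sketch using the ioredis client (key names are assumptions): the secondary index that an RDBMS would give you via a simple WHERE/ORDER BY has to be maintained by hand at write time, and each new query shape needs yet another hand-built structure.
import Redis from "ioredis";

const redis = new Redis();

// Write path: store the record and maintain, by hand, the index we will need later.
async function saveUser(id: string, name: string, signupTs: number) {
  await redis.hset(`user:${id}`, "name", name, "signupTs", signupTs); // the record itself
  await redis.zadd("users:by_signup", signupTs, id);                  // hand-maintained "index"
}

// Read path: "the 10 most recent users" works only because the index above was planned.
// An ad-hoc query such as "users whose name starts with A" would need another structure.
async function latestUsers(count = 10) {
  const ids = await redis.zrevrange("users:by_signup", 0, count - 1);
  return Promise.all(ids.map(id => redis.hgetall(`user:${id}`)));
}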
Here I have only described the main drawbacks, but keep in mind there are also plenty of benefits in using Redis (very fast, good concurrency support, low latency, protocol pipelining, good for easily implementing optimistic concurrency patterns, good usability/complexity ratio, excellent support from Salvatore and Pieter, pragmatic no-nonsense approach, ...).
For your specific problem (a graph), I would suggest having a look at Neo4j or OrientDB, which are specifically designed to store graph-oriented data.
I have some additions:
There is a value length limitation in Redis: a string value can be at most 512 MB. When using Redis, you always have to think about your key and value sizes, especially in a Redis Cluster.