Google Cloud BlobListOption only iterates one level below currentDir - google-cloud-platform

I was testing some of the Cloud Storage functionality and just noticed that the iterative approach only lists one level underneath the current directory:
Page<Blob> blobs = STORAGE_INSTANCE.list(bucket, Storage.BlobListOption.currentDirectory(),
Storage.BlobListOption.prefix(getBucketKey(GS_SCHEMA, prefix).concat(URI_DELIMITER)));
Suppose .prefix() receives, for example, /dir/, and that prefix contains two nested levels such as /dir/content/ and /dir/content/mycontent.txt.
If the call is executed with /dir/, only /dir/content/ is listed, and nothing deeper.
So, whenever I want to iterate through everything below /dir/, I always have to list /dir/content/ again just to see /dir/content/mycontent.txt.
Is there an easy way to fix this or am I not using the API properly?

Remove the Storage.BlobListOption.currentDirectory() parameter from the list() method. The following code snippet displayed all Blobs whose names start with a specific prefix for me:
Page<Blob> blobs = storage.list(bucketName, BlobListOption.prefix(prefix));
for (Blob blob : blobs.iterateAll()) {
    System.out.println(blob);
}
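For reference, here is a minimal self-contained sketch of that approach (the bucket name and prefix below are placeholders, and it assumes default application credentials are available):
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobListOption;
import com.google.cloud.storage.StorageOptions;

public class ListAllUnderPrefix {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        // Without currentDirectory(), list() returns every object whose name
        // starts with the prefix, no matter how deeply it is nested.
        for (Blob blob : storage.list("my-bucket", BlobListOption.prefix("dir/")).iterateAll()) {
            System.out.println(blob.getName());
        }
    }
}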

Related

Slow insertion using Neptune and Gremlin

I'm having problems with insertion into Neptune using Gremlin.
I am trying to insert many nodes and edges, potentially hundreds of thousands of them, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and where steps - they take more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, …]
g.inject(properties_list).unfold().as_('node')
    .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
    .coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I'd like to hear your suggestions and to know the best practices for inserting many nodes and edges (with checking for existence).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also passing a plain list of those IDs to the query and calculating up front which of them already exist. Then, having calculated the set that needs inserting, you can apply the changes in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual vertex IDs rather than properties (uid in your case); if that is not an option, the same technique will work for property-based IDs. The steps are:
Using inject or sideEffect, insert the IDs to be found as one list and the corresponding map of changes to be conditionally applied as a separate map.
Find which of those IDs already exist and which do not.
Using the set of non-existing ones, apply the updates, using the values in that set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
As a side note, we are adding two new steps to Gremlin, mergeV and mergeE, that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.
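For what it's worth, a rough sketch of what a mergeV-based upsert might look like (the exact syntax may still change before release, and the 'test' label and property values simply mirror the examples above):
g.mergeV(['uid': '1642']).
    option(Merge.onCreate, [(T.label): 'test', 'uid': '1642'])
This creates a vertex with label 'test' and uid '1642' only if no vertex with that uid already exists; otherwise it leaves the existing vertex untouched.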

Compare values with a static and global table in C++

I am working on statistical analyses in my field, using C++. I am implementing several tests, and some of them need to compare the calculated value against a table, for example a distribution table like this one.
I want my different functions in my different classes to be able to access a specific value to evaluate the significance of my result; for example, something like this:
float F = fisherTest(serie1, serie2);
auto tableValue = findValue(serie1.size(), serie2.size());
if (tableValue < F) {
    cout << "Not significant";
    return -1;
}
This is just an example, as this test actually makes no sense. But I just want to be able to read values from a predefined table.
Do you have an idea of how I can achieve this? Can I store this in a "resource file"?
I hope my question is clear! Thank you.
You can have some data files and pass a configuration (e.g. on the command line) to the application during startup so it can find the files and read them. The resulting data structure can then be fed to the test.
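For example, a minimal sketch of that approach (the file format and names here are made up; it just assumes one row of whitespace-separated numbers per line):
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Reads a whitespace-separated numeric table, one row per line.
std::vector<std::vector<float>> loadTable(const std::string& path) {
    std::vector<std::vector<float>> table;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); ) {
        std::istringstream row(line);
        table.emplace_back(std::istream_iterator<float>(row),
                           std::istream_iterator<float>());
    }
    return table;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;           // usage: ./tests path/to/table.txt
    auto table = loadTable(argv[1]);  // hand this structure to the tests
}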
It is possible to get the predefined data from several sources:
Hard coded tables in your program.
One or more functions that can compute the data on demand.
Files on your local disk.
Data stored in a database server.
You and your team need to decide which makes the most sense for your application.
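If you go with hard-coded tables, one simple way is a constexpr table in a header plus a small lookup function that any class can call. A sketch (the numbers are placeholders rather than real critical values, and findValue just mirrors the function name from the question):
#include <algorithm>
#include <array>
#include <cstddef>

namespace stats_tables {

// Placeholder table indexed by the two sample sizes (clamped to its bounds).
constexpr std::size_t kMaxN = 4;
constexpr std::array<std::array<float, kMaxN>, kMaxN> kTable{{
    {161.4f, 199.5f, 215.7f, 224.6f},
    { 18.5f,  19.0f,  19.2f,  19.2f},
    { 10.1f,   9.6f,   9.3f,   9.1f},
    {  7.7f,   6.9f,   6.6f,   6.4f},
}};

// Returns the tabulated value for the given sample sizes, clamping
// out-of-range requests to the edge of the table.
inline float findValue(std::size_t n1, std::size_t n2) {
    const std::size_t i = std::clamp<std::size_t>(n1, 1, kMaxN) - 1;
    const std::size_t j = std::clamp<std::size_t>(n2, 1, kMaxN) - 1;
    return kTable[i][j];
}

}  // namespace stats_tables
Because everything is constexpr/inline in a single header, every class that includes it sees the same table without any globals being defined twice.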

sequential/online kmeans clustering, how does it work? Existing codes?

I'm a little confused about online kmeans clustering. I know that it allows me to cluster with just one data point at a time. But is this all limited to one session? Suppose I have a bunch of data clustered via this method and I get the clustering result; would I be able to add more data to the clusters in the future?
I've also been looking for implementations of this, to no avail. Does anyone know of any?
Update:
To clarify more. Here is how my code works right now:
An image is taken from the live video feed; once enough pictures are saved, get the k-means of their SIFT features.
Repeat step 1 with a new batch of live-feed pictures and get k-means again. Combine the new k-means vectors with the previous ones, like [A B].
You can see that this is bad, because I quickly get too many clusters, and each batch of clusters will definitely overlap with another batch.
What I want:
An image is taken from the live video feed; once pictures are saved, get k-means.
Repeat step 1 and get k-means again, which updates the previous clusters and adds new ones.
Nothing that I've seen can accommodate that, unless I'm just not understanding it correctly.
If you look at the original (!) publications, the method proposed by MacQueen - where the name k-means comes from - was in fact an online algorithm. I'm not sure if MacQueen did multiple passes over the data to improve the result; I believe he used a single pass, and objects were never reassigned to a different cluster. If so, it was already an online algorithm.
Means are commonly computed as sum / count, which is not very sensible from a numerical point of view. In the classic Knuth book, for example, you can find a method for incrementally updating means; Wikipedia has it as well.
Things get slightly more complicated once you actually want to reassign earlier points. But usually in a streaming context you do not know the previous points, so you cannot do that anyway.
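To make the incremental-mean idea concrete, here is a minimal sketch of a single online k-means update step in plain NumPy (the initial centroids are placeholders, and no particular library is assumed):
import numpy as np

def online_kmeans_update(centroids, counts, x):
    """Assign one new point x to its nearest centroid and move that centroid
    using the incremental mean: mean_new = mean + (x - mean) / count."""
    k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    counts[k] += 1
    centroids[k] += (x - centroids[k]) / counts[k]
    return k

# Start from some initial centroids and feed points one at a time, e.g. one
# SIFT descriptor per update, across as many sessions as you like.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
counts = np.array([1.0, 1.0])
for point in np.random.rand(10, 2):
    online_kmeans_update(centroids, counts, point)
Because the centroids and counts are plain arrays, you can persist them between sessions and keep feeding new points later, which is exactly the "add more data in the future" case from the question.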

Check a fingerprint in the database

I am saving the fingerprints in a blob field, and I wonder whether the only way to compare these prints is to retrieve all the prints saved in the database and then build a vector to check against, using the identify_finger function. Can you check directly from the database using a SELECT?
I'm working with libfprint. In this code the verification is done against a vector:
def test_identify():
    cur = DB.cursor()
    cur.execute('select id, fp from print')
    id = []
    gallary = []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr=data))
        id.append(row['id'])
    n, fp, img = FingerDevice.identify_finger(gallary)
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a MySQL plugin that knows how to work with fingerprint images and can select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. calling:
FingerDevice.identify_finger(gallary)
in parallel, on different cores/machines.
You can't check directly from the database using a SELECT because each scan is different and will produce different blobs. libfprint does the hard work of comparing different scans and judging whether they are from the same person or not.
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison) you can develop a method for storing the relevant data for the process (the minutiae, maybe?) in the database, and then a method for fetching the relevant values -- maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with all stored fingerprints in a loop.
Other solutions for speeding up your program:
Use C:
I only know enough C to write hello-world-style programs, but it was not hard to write pure C code that uses libfprint's fp_identify_finger_img function, and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part in Python; I do.
Use a time/location-based SELECT:
If you know your users are more likely to scan their fingerprints at certain times or in certain places (perhaps arriving at work at a given time, or leaving, or entering the building by one gate rather than another), you can collect data at each scan to estimate those probabilities and build auxiliary tables that sort the users by their probability of arriving at each time and location.
We know that identify_finger loops over the fingerprint objects you provide in a list, so we can exploit that and hand it the objects sorted so that the most likely user for that time and location comes first in the list, and so on.
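A rough sketch of that idea, reusing the pyfprint calls from the question; the scan_stats table, its columns, and the ordering query are all hypothetical and would need to match whatever data you actually collect:
def build_sorted_gallery(cur, hour, gate):
    # Hypothetical query: users most likely to scan at this hour and gate
    # come first, so identify_finger tries them earliest in its loop.
    cur.execute(
        'select p.id, p.fp from print p '
        'join scan_stats s on s.user_id = p.id '
        'where s.hour = %s and s.gate = %s '
        'order by s.scan_count desc', (hour, gate))
    ids, gallary = [], []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr=data))
        ids.append(row['id'])
    return ids, gallary

# ids, gallary = build_sorted_gallery(DB.cursor(), current_hour, current_gate)
# n, fp, img = FingerDevice.identify_finger(gallary)   # ids[n] is the match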

CERN ROOT Extract Data from TNtuple

I am using CERN's ROOT framework (required), and I want to take data from a TNtuple and graph it. I can either graph the data when I create the TNtuple, or after I write it to a .root file. Some of the support documentation suggested that I create a TTree, but that seemed like it might be overkill/roundabout since I wouldn't be using it for anything else (and the TNtuple fulfills all of my other requirements). Does anyone have a better suggestion for how to extract data from the TNtuple and graph it?
As TNtuple inherits from TTree, you can use all the methods presented in the support documentation for TTrees directly on the TNtuple.
This especially means that you can use TTree::Draw() which is typically more than sufficient for quickly graphing the data. This function is documented here.
For more elaborate plots you will have to read the data from the TNtuple event by event and feed it to your favorite graphing tool in ROOT. This again follows the basic principles from a tree. The best example I could find on the ROOT homepage is in the user manual, section trees in the paragraph "Reading the Tree".
The methods used to create histograms and plots for TNtuples are essentially the same as for TTrees. The code:
ntuple->Draw("var");
will create a histogram of the variable var stored in the Ntuple. If you want to plot one variable in the Ntuple as a function of another (the first variable is drawn on the y axis, the second on the x axis), use
ntuple->Draw("xVar:yVar");
You can do fancier things such as creating plots only when a logical condition is satisfied. For example, suppose you want a histogram of var1 only when var2 is greater than 2 and var3 is less than 0.
ntuple->Draw("var1","var2 > 2 && var3 < 0");
By plotting in this way, ROOT automatically sets the binning and range for the x-axis. If you wish to control these features yourself, use
ntuple->Draw("var >> hist(Nbins,xmin,xmax)");
This creates the object hist, which you can treat as a usual histogram object in ROOT. As stated in the previous answer, this is documented in the ROOT manual along with several other features and tools; unfortunately, the manual doesn't always give clear explanations.
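If you want the selected points back as a TGraph rather than a histogram, you can combine Draw with the "goff" option and the GetSelectedRows()/GetV1()/GetV2() accessors, as in the macro below: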
{
    // Fill the internal arrays without drawing anything ("goff" = graphics off)
    ntuple->Draw("py:px", "px>py", "goff");
    // GetV1() holds the first expression (py), GetV2() the second (px)
    TGraph *gr = new TGraph(ntuple->GetSelectedRows(), ntuple->GetV2(), ntuple->GetV1());
    gr->Draw("AP");
}