Is there a faster way to query from NDB using list? - python-2.7

I have a list, and I need to query the corresponding information for each of its elements.
I can do:
for i in list:
    database.query(infomation == i).fetch()
but this is very slow, because for every element in the list it has to make a round trip to the database, instead of querying everything at once. Is there a way to speed this process up?

You can use the ndb async operations to speed up your code. Basically you would launch all your queries pretty much in parallel, then process the results as they come in, which would result in potentially much faster overall execution, especially if your list is long. Something along these lines:
futures = []
for i in list:
    futures.append(
        database.query(infomation == i).fetch_async())
for future in futures:
    results = future.get_result()
    # do something with your results
There are more advanced ways of using the async operations described in the mentioned doc which you may find interesting, depending on the structure of your actual code.
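For instance, one of those more advanced options is a tasklet, which makes the fan-out explicit and lets NDB batch the underlying RPCs behind the scenes. A minimal sketch, reusing the database model and infomation property from the question:
from google.appengine.ext import ndb

@ndb.tasklet
def query_one(value):
    # Yielding the future suspends this tasklet, so NDB can overlap and
    # batch the outstanding RPCs across all running tasklets.
    results = yield database.query(infomation == value).fetch_async()
    raise ndb.Return(results)

# Start every query first; nothing blocks until get_result() is called.
futures = [query_one(i) for i in list]
all_results = [f.get_result() for f in futures]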

Mapping into multiple maps in parallel with Java 8 Streams

I'm iterating over a CloseableIterator and currently adding each element to a HashMap, dealing with conflicts as needed. My goal is to do this in parallel: add to multiple hashmaps in chunks, using parallelism to speed up the process, then reduce them to a single hashmap.
I'm not sure how to do the first step, using streams to map into multiple hashmaps in parallel. I'd appreciate help.
Parallel streams collected into Collectors.toMap will already process the stream on multiple threads and then combine per-thread maps as a final step. Or in the case of toConcurrentMap multiple threads will process the stream and combine data into a thread-safe map.
If you only have an Iterator (as opposed to an Iterable or a Spliterator), it's probably not worth parallelizing. In Effective Java, Josh Bloch states that:
Even under the best of circumstances, parallelizing a pipeline is unlikely to increase its performance if the source is from Stream.iterate, or the intermediate operation limit is used.
An Iterator has only a next method, which (typically) must be called sequentially. Thus, any attempt to parallelize would be doing essentially what Stream.iterate does: sequentially starting the stream and then sending the data to other threads. There's a lot of overhead that comes with this transfer, and the cache is not on your side at all. There's a good chance that it wouldn't be worth it, except maybe if you have few elements to iterate over and you have a lot of work to do on each one. In this case, you may as well put them all into an ArrayList and parallelize from there.
It's a different story if you can get a reasonably parallelizable Stream. You can get these if you have a good Iterable or Spliterator. If you have a good Spliterator, you can get a Stream using the StreamSupport.stream methods. Any Iterable has a spliterator method. If you have a Collection, use the parallelStream method.
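For completeness, here is a minimal sketch of that wrapping (the helper name parallelStreamOf is mine, not a standard API); as noted above, the splits are poor, so don't expect much parallel speedup:
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Wraps an Iterator in a (parallel) Stream via a spliterator of unknown size.
static <T> Stream<T> parallelStreamOf(Iterator<T> iterator) {
    Spliterator<T> split =
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
    return StreamSupport.stream(split, /* parallel = */ true);
}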
A Map in Java has key-value pairs, so I'm not exactly sure what you mean by "putting into a HashMap." For this answer, I'll assume you mean that you're calling the put method where the key is one of the elements and the value is Boolean.TRUE. If you update your question, I can give a more specific answer.
In this case, your code could look something like this:
public static <E> Map<E, Boolean> putInMap(Stream<E> elements) {
    return elements.parallel()
                   .collect(Collectors.toConcurrentMap(
                           e -> e, e -> Boolean.TRUE, (a, b) -> Boolean.TRUE));
}
e -> e is the key mapper, making it so that the keys are the elements.
e -> Boolean.TRUE is the value mapper, making it so that every value is true.
(a, b) -> Boolean.TRUE is the merge function, deciding how to combine the values of two entries that have the same key.
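Hypothetical usage, with strings as the elements:
Map<String, Boolean> seen = putInMap(Stream.of("a", "b", "a"));
// seen now maps both "a" and "b" to true; the duplicate "a" was merged.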

What is the scope of result rows in PDI Kettle?

Working with result rows in Kettle is the only way to pass lists internally in a program. But how exactly does this work? The topic is not well documented, and there are a lot of questions.
For example, in a job containing two transformations, result rows can be sent from the first to the second. But what if there's a third transformation receiving the result rows? What is the scope? Can you pass result rows to a sub-job as well? Can you clear the result rows based on logic inside a transformation?
Working with lists and arrays is useful and necessary in programming, but confusing in PDI Kettle.
I agree that working with result rows may be confusing, but you can be confident: it works.
Yes, you can pass them to a sub-job, and through a series of sub-jobs (define the scope as "valid in the Java virtual machine" for a first test).
And no, there is no way to clear the results inside a transformation (and certainly not based on a formula). That would mean a terrible maintenance burden.
Kettle is not an imperative language; it belongs to the data-flow family. That makes it closer to the way you think when developing an ETL, and much, much more performant. The drawback is that lists and arrays have no meaning, only flows of data.
And that is what a result set is: a flow of data, like the result set of a SQL query. The next job has to open it, pass each row to the transformation, and close it after the last row.

How to search for a value in a std::map when using CUDA?

I have something stored in a std::map, which maps strings to vectors. Its keys and values look like:
key     value
"a"     [1, 2, 3]
"b"     [8, 100]
"cde"   [7, 10]
For each thread, it needs to process one query. The query looks like
["a", "b"]
or
["cde", "a"]
So I need to get the values from the map and then do some other work, like combining them. For the first query, the result will be
[1,2,3,8,100]
The problem is, how can threads access the map and find the value by a key?
At first, I tried to store it in global memory. However, it looks like only arrays can be passed from host to device.
Then I tried to use Thrust, but it can only store vectors.
Is there any other way I can do this? Or maybe I have overlooked some method in Thrust? Thanks!
PS: I do not need to modify the map; I just need to read data from it.
I believe it's unlikely you will benefit from doing any of this on the GPU, unless you have a huge number of queries which are all available to you at once, or at least in batches.
If you do not have that many queries, then just transferring the data (regardless of its exact format/structure) will likely be a waste.
If you do have that many queries, the benefit is still entirely unclear, and depends on a lot of parameters. The fact that you've been trying to use std::map for anything suggests (see below for the reason) that you haven't been seriously concerned with performance so far. If that's indeed the case, just don't make your life difficult by using a GPU.
So what's wrong with std::map? Nothing functionally, but a node-based container like std::map is slow even on the CPU, since every lookup chases pointers through memory, and that access pattern is an even worse fit for the GPU.
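If, despite the above, you do decide to move the lookups to the GPU, the usual workaround for the array-only transfer you ran into is to flatten the map on the host into contiguous arrays (a CSR-like layout) that can then be copied with cudaMemcpy. A minimal host-side sketch, with illustrative names:
#include <map>
#include <string>
#include <vector>

// Flattened form of std::map<std::string, std::vector<int>>: all values
// concatenated, plus per-key offsets. For the map in the question this
// gives values = {1,2,3,8,100,7,10} and offsets = {0,3,5,7}.
struct FlatMap {
    std::vector<int> values;
    std::vector<int> offsets;
    // Keys would be resolved to integer indices on the host beforehand,
    // since per-thread string lookups are a poor fit for the GPU.
};

FlatMap flatten(const std::map<std::string, std::vector<int>>& m) {
    FlatMap flat;
    flat.offsets.push_back(0);
    for (const auto& kv : m) {
        flat.values.insert(flat.values.end(),
                           kv.second.begin(), kv.second.end());
        flat.offsets.push_back(static_cast<int>(flat.values.size()));
    }
    return flat;
}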

Selecting an appropriate STL container for logging data

I require a logging and filtering mechanism in my client-server application, where a client may request log data based on certain parameters.
The log will have MAC ID, date and time, command type, and direction as fields.
The server can filter log data on these parameters as well.
The size of the log is capped at 10 MB; after that, the log overwrites messages from the beginning.
My approach is to log data to a file as well as into an STL container "in memory", so that when the client requests data, the server can filter the log data on any criterion.
So the process is: the server will first sort the vector<> on a particular criterion and then filter it using binary search.
I am planning to use vector as the STL container for the in-memory log data, but I am a bit confused about whether vector is appropriate here, since the data in it can grow to 10 MB.
My question: is vector good enough for this case or not?
I'd go with a deque (double-ended queue). It's like a vector, but you can add and remove elements efficiently at both ends.
First, I would use a logging library, since there are many and I assure you they will do a better job (log4cxx, for example). If you insist on doing this yourself, a vector is an appropriate mechanism, but you will have to sort the data manually based on user requests. Another idea is to use SQLite and let it manage storing, sorting, and filtering your data.
The actual answer will depend a lot on the usage pattern and the interface. If you are using a graphical UI, chances are that there is already a widget that implements this feature to some extent (the ability to sort by different columns, and even to filter). If you really want to implement this outside of the UI, it will depend on the usage pattern: will the user want one particular view more than the others? Does she need only filtering, or also sorting?
If there is one view of the data that will be used in most cases, and you only need to show a different order a few times, I would keep an std::vector or std::deque of the elements and filter with remove_copy_if when needed. If a different sort is required, I would copy and sort the copy, to avoid having to re-sort back to time order to continue adding elements to the log. Beware that if the application keeps pushing data, you will need to update the copy with the new elements in place (or provide a fixed view and rerun the operation periodically).
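As a concrete illustration of that filtering step, a sketch using remove_copy_if with a hypothetical record type matching the fields in the question:
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

struct LogEntry {
    std::string macId;
    long long   timestamp;    // date and time, e.g. seconds since epoch
    int         commandType;
    bool        outgoing;     // direction
};

// Copies every entry for one MAC ID, leaving the time-ordered log intact.
std::vector<LogEntry> filterByMac(const std::vector<LogEntry>& log,
                                  const std::string& mac) {
    std::vector<LogEntry> out;
    std::remove_copy_if(log.begin(), log.end(), std::back_inserter(out),
                        [&](const LogEntry& e) { return e.macId != mac; });
    return out;
}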
If there is no particular view that occurs much more often than the rest, or if you don't want to go through the pain of implementing the above, take a look at Boost Multi-Index containers. They keep synchronized views of the same data under different criteria. That will probably be the most efficient approach in this last case, and even if it might be less efficient in the general case of one dominating view, it might make things simpler, so it could still be worth it.
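A minimal sketch of that last option: one container holding the log entries, with two synchronized orderings (by time and by MAC ID); the field names are again hypothetical:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <string>

struct LogEntry {
    std::string macId;
    long long   timestamp;
};

namespace bmi = boost::multi_index;

// Index 0 orders entries by timestamp, index 1 by MAC ID; both views
// stay synchronized as entries are inserted.
typedef bmi::multi_index_container<
    LogEntry,
    bmi::indexed_by<
        bmi::ordered_non_unique<
            bmi::member<LogEntry, long long, &LogEntry::timestamp> >,
        bmi::ordered_non_unique<
            bmi::member<LogEntry, std::string, &LogEntry::macId> > >
> LogIndex;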

How do we compare two query result sets in ColdFusion?

I need to build a generic method in ColdFusion to compare two query result sets. Any ideas?
If you are looking to simply decide whether two queries are exactly alike, then you can do this:
if(serializeJSON(query1) eq serializeJSON(query2)) ...
This will convert both queries to strings and compare the strings.
If you're looking for more nuance, I believe Sergii's approach (convert to a struct, compare keys) is probably the right one. You could "guard" it by adding simple checks first: do the column lists match? Is the recordCount the same? That way, if either check fails, you know the queries can't possibly be equivalent, and it's safe to return false, avoiding the performance hit of a full comparison.
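A minimal sketch of that guarded comparison (the function name is mine):
<cfscript>
function queriesAreEqual(query1, query2) {
    // Cheap guards first: different columns or row counts mean "not equal".
    if (query1.columnList neq query2.columnList) return false;
    if (query1.recordCount neq query2.recordCount) return false;
    // Only now pay for the full string comparison.
    return serializeJSON(query1) eq serializeJSON(query2);
}
</cfscript>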
If I understand you correctly, you have two result sets with the same structure but different datasets (as when selecting with different clauses).
If this is correct, I believe a better (more efficient) way is to solve this task at the database level, perhaps with temporary/cumulative tables and/or a stored procedure.
Doing it in CF will almost certainly need a ton of loops, which can be inappropriate for large datasets. That said, I did something like this for small datasets using intermediate storage: I converted one result set into a structure, then looped over the second query, checking the structure keys.
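For reference, a sketch of that intermediate-storage approach (the key column "id" is hypothetical):
<cfscript>
function keysMissingFromFirst(query1, query2) {
    var seen = {};
    var missing = [];
    var i = 0;
    // Index the first result set by its key column.
    for (i = 1; i lte query1.recordCount; i++) {
        seen[query1.id[i]] = true;
    }
    // Collect keys of the second result set that the first one lacks.
    for (i = 1; i lte query2.recordCount; i++) {
        if (!structKeyExists(seen, query2.id[i])) {
            arrayAppend(missing, query2.id[i]);
        }
    }
    return missing;
}
</cfscript>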