How do we compare two query result sets in ColdFusion?

I need to build a generic method in ColdFusion to compare two query result sets... Any ideas?

If you are looking to simply decide whether two queries are exactly alike, then you can do this:
if(serializeJSON(query1) eq serializeJSON(query2)) ...
This will convert both queries to strings and compare the strings.
If you're looking for more nuance, I believe Sergii's approach (convert to a struct, compare keys) is probably the right one. You could "guard" it by adding simple checks first: do the column lists match? Is the recordcount the same? That way, if either of those checks fails, you know the queries can't possibly be equivalent, so it's safe to return false and avoid the performance hit of a full compare.

If I understand you correctly, you have two result sets with the same structure but different data (e.g., selected with different WHERE clauses).
If that is correct, I believe the better (more efficient) way is to solve this task at the database level, perhaps with temporary/accumulator tables and/or a stored procedure.
Doing it in CF will almost certainly require a ton of loops, which can be inappropriate for large datasets. That said, I have done something like this for small datasets using intermediate storage: convert one result set into a structure, then loop over the second query and check the structure keys.
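To make the intermediate-storage idea concrete, here is a rough, language-agnostic sketch (written in C++ only for illustration; the row serialization and types are assumptions, and in CF you would serialize each query row instead): do the cheap guard check on row counts first, build a lookup of serialized rows from the first result set, then probe it with each row of the second.

#include <string>
#include <unordered_map>
#include <vector>

using Row = std::vector<std::string>;
using ResultSet = std::vector<Row>;

// Hypothetical helper: join a row's column values into one comparable string.
static std::string serializeRow(const Row& row) {
    std::string s;
    for (const auto& col : row) { s += col; s += '\x1f'; } // unit separator
    return s;
}

// True if both result sets contain the same rows (ignoring row order).
bool sameRows(const ResultSet& a, const ResultSet& b) {
    if (a.size() != b.size()) return false;                // cheap guard check
    std::unordered_map<std::string, int> counts;
    for (const auto& row : a) ++counts[serializeRow(row)]; // intermediate storage
    for (const auto& row : b) {
        auto it = counts.find(serializeRow(row));
        if (it == counts.end() || it->second == 0) return false;
        --it->second;
    }
    return true;
}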

What is the scope of result rows in PDI Kettle?

Working with result rows in Kettle is the only way to pass lists internally in the program. But how exactly does this work? The topic is not well documented and there are a lot of questions.
For example, a job containing two transformations can have result rows sent from the first to the second. But what if there's a third transformation getting the result rows? What is the scope? Can you pass result rows to a sub-job as well? Can you clear the result rows based on logic inside a transformation?
Working with lists and arrays is useful and necessary in programming, but confusing in PDI Kettle.
I agree that working with result rows may be confusing, but you can be confident: it works.
Yes, you can pass them to a sub-job, and through a series of sub-jobs (define the scope as "valid in the java machine" for a first test).
And no, there is no way to clear the results in a transformation (and certainly not based on a formula). That would mean a terrible maintenance overhead.
Kettle is not an imperative language; it belongs more to the data-flow family. That means it is closer to the way you think when developing an ETL, and much, much more performant. The drawback is that lists and arrays have no meaning, only flows of data.
And that is what a result set is: a flow of data, like the result set of a SQL query. The next job has to open it, pass each row to the transformation, and close it after the last row.

How to search for a value in a std::map when using CUDA?

I have something stored in a std::map, which maps strings to vectors. Its keys and values look like this:
key     value
"a"     [1,2,3]
"b"     [8,100]
"cde"   [7,10]
Each thread needs to process one query. A query looks like
["a", "b"]
or
["cde", "a"]
So I need to get the values from the map and then do some other work, like combining them. For the first query, the result would be
[1,2,3,8,100]
The problem is, how can threads access the map and find the value by a key?
At first, I tried to store it in global memory. However, it looks like only arrays can be passed from host to device.
Then I tried to use Thrust, but I can only use it to store vectors.
Is there another approach I can use? Or maybe I missed some method in Thrust? Thanks!
PS: I do not need to modify the map, I just need to read data from it.
I believe it's unlikely you will benefit from doing any of this on the GPU, unless you have a huge number of queries which are all available to you at once, or at least in batches.
If you do not have that many queries, then just transferring the data (regardless of its exact format/structure) will likely be a waste.
If you do have that many queries, the benefit is still entirely unclear, and depends on a lot of parameters. The fact that you've been trying to use std::map for anything suggests (see below for the reason) that you haven't been seriously concerned with performance so far. If that's indeed the case, just don't make your life difficult by using a GPU.
So what's wrong with std::map? Nothing, but it's extremely slow even on the CPU, and it's even worse on the GPU.
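If you decide to push the lookup onto the device despite the caveats above, the usual workaround for "only arrays can be copied" is to flatten the map on the host into contiguous buffers (all values concatenated, plus per-key offsets) and translate string keys into integer indices before launching any kernels. A rough host-side sketch under those assumptions:

#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct FlatMap {
    std::unordered_map<std::string, int> keyToIndex; // host-only key lookup
    std::vector<int> values;   // all values, concatenated key by key
    std::vector<int> offsets;  // offsets[k]..offsets[k+1] = values of key k
};

FlatMap flatten(const std::map<std::string, std::vector<int>>& m) {
    FlatMap f;
    f.offsets.push_back(0);
    int index = 0;
    for (const auto& kv : m) {
        f.keyToIndex[kv.first] = index++;
        f.values.insert(f.values.end(), kv.second.begin(), kv.second.end());
        f.offsets.push_back(static_cast<int>(f.values.size()));
    }
    return f;
}

// f.values and f.offsets are plain arrays, so they can be copied to the GPU
// with cudaMemcpy (or wrapped in thrust::device_vector); each thread then
// reads values[offsets[k] .. offsets[k+1]) for its key index k, which was
// resolved on the host via keyToIndex.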

C++ Complicated look-up table

I have around 400,000 "items".
Each "item" consists of 16 double values.
At runtime I need to compare items with each other. To do that I am multiplying their double values, which is quite time-consuming.
I have run some tests and found that there are only 40,000 possible return values, no matter which items I compare with each other.
I would like to store these values in a look-up table so that I can easily retrieve them without doing any real calculation at runtime.
My question would be how to efficiently store the data in a look-up table.
The problem is that if I create a look-up table, it gets amazingly huge, for example like this:
item-id  item-id  compare return value
1        1        499483.49834
1        2        -0.0928
1        3        499483.49834
(...)
It would sum up to around 120 million combinations.
That just looks too big for a real-world application.
But I am not sure how to avoid that.
Can anybody please share some cool ideas?
Thank you very much!
Assuming I understand you correctly, you have two inputs with 400K possibilities each, so 400K * 400K = 160B entries. Assuming you index them sequentially and store your 40K possible results in a way that needs 2 octets each, you're looking at a table size of roughly 300 GB, which is pretty clearly beyond everyday computing. So you might instead research whether there is any correlation between the 400K "items", and if so, whether you can assign some kind of function to that correlation that gives you a clue (read: hash function) as to which of the 40K results might/could/should come out. Clearly your hash function and lookup need to be cheaper than just doing the multiplication in the first place. Or maybe you can reduce the comparison time with some kind of intelligent shortcut, like knowing the result in advance under certain scenarios. Or perhaps some of your math can be optimized using integer math or boolean comparisons. Just a few thoughts...
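One concrete way to act on the "clue" idea without precomputing the full table is to memoize lazily: compute a result the first time a given pair of items is compared, cache it keyed on the canonical (smaller ID, larger ID) pair, and turn repeats into hash lookups. This only pays off if the number of distinct pairs actually compared at runtime is far smaller than the theoretical 160B. A rough sketch; compare() is just a stand-in for the real 16-value comparison:

#include <array>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Item = std::array<double, 16>;

// Placeholder for the real comparison of two items.
double compare(const Item& a, const Item& b) {
    double r = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) r += a[k] * b[k];
    return r;
}

double compareMemoized(std::uint32_t i, std::uint32_t j,
                       const std::vector<Item>& items,
                       std::unordered_map<std::uint64_t, double>& cache) {
    if (i > j) std::swap(i, j);                            // canonical order
    std::uint64_t key = (static_cast<std::uint64_t>(i) << 32) | j;
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;              // cache hit
    double result = compare(items[i], items[j]);            // compute once
    cache[key] = result;
    return result;
}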
To speed things up, you should probably compute all of the possible answers, and store the inputs to each answer.
Then, I would recommend building some sort of lookup table that uses the answer as the key (since the answers will all be unique), and storing all of the possible inputs that produce that result.
To help visualize:
Say you have a table called 'Table'. Inside Table you have keys, and associated with those keys are values. You make the keys have whatever type your answers are in (the keys will be all of your answers). Now give each of your 400k inputs a unique identifier. You then store the pair of identifiers for a multiplication as one value associated with that particular key. When you compute the same answer again, you just add that pair as another set of inputs that produces that key.
Example:
std::map<AnswerType, std::vector<Input>> table;  // AnswerType = your answer type
Define Input like:
struct Input { IDType one; IDType two; };  // IDType identifies one of the 400k items
Where one 'Input' might hold the IDs 12384 and 128, meaning that the objects identified by 12384 and 128, when multiplied, give that answer.
So, in your lookup, you'll have something that looks like:
AnswerType lookup(IDType first, IDType second)
{
    for (const auto& entry : table)
    {
        if (contains(entry.second, first, second))
            return entry.first;
    }
    return AnswerType{}; // not found
}
// Defined elsewhere
bool contains(const std::vector<Input>& inputs, IDType first, IDType second)
{
    for (const Input& i : inputs)
    {
        if ((i.one == first && i.two == second) ||
            (i.two == first && i.one == second))
            return true;
    }
    return false;
}
This is only a rough cut as-is, but it might be a place to start.
While the outer loop is limited to a linear search, you can make the contains check use a binary search by keeping the stored inputs sorted.
In all, you're looking at a run-once precomputation that takes O(n^2) time, and a lookup that runs in O(n log n). I'm not entirely sure how the memory will look after all of that, though. Of course, I don't know much about the math behind it, so you might be able to speed up the linear search if you can somehow sort the keys as well.

What's the correct way to generate random strings without duplicates

I'm thinking about generating random strings without producing any duplicates.
My first thought was to use a binary tree: create each string and look for a duplicate in the tree, if any.
But this may not be very effective.
My second thought was to use an MD5-like hash method that creates messages based only on time, but this may introduce another problem: different machines have different time accuracy.
And on a modern processor, more than one string could be created within a single timestamp.
Is there any better way to do this?
Generate N sequential strings, then do a random shuffle to pull them out in random order. If they need to be unique across separate generators, mix a unique generator ID into the string.
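A minimal sketch of that idea in C++ (the "id-" prefix and the string format are arbitrary placeholders):

#include <algorithm>
#include <random>
#include <string>
#include <vector>

// Generate N sequential strings, then shuffle so they come out in random
// order; uniqueness is guaranteed by construction.
std::vector<std::string> uniqueRandomOrder(std::size_t n) {
    std::vector<std::string> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back("id-" + std::to_string(i));
    std::mt19937 rng(std::random_device{}());
    std::shuffle(out.begin(), out.end(), rng);
    return out;
}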
Beware of MD5, there's no guarantee that two different Strings won't generate the same hash.
As for your problem, it depends on a number of constraints: are the strings short or long? Do they have to be meaningful? Etc. Two solutions off the top of my head:
1. Generate UUIDs, then turn them into Strings with a binary representation or a base-64 algorithm.
2. Simply generate random Strings and put them into a searchable structure (a HashMap) so that you can check very quickly (O(1) to O(log n)) whether a newly generated String is a duplicate, in which case it is discarded.
A tree probably won't be the most efficient option, especially for insertions, as it will have to constantly re-balance itself (a somewhat "expensive" operation).
I'd recommend using a HashSet type data structure. The hashing algorithm should already be quite efficient (much more so than something like MD5), and all operations are constant-time. Insert all your Strings into the Set. If you create a new String, check to see if it already exists in the Set.
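In C++ terms the same generate-and-check loop might look like this (std::unordered_set playing the role of the HashSet; the alphabet and length are arbitrary):

#include <random>
#include <string>
#include <unordered_set>

std::string nextUniqueString(std::unordered_set<std::string>& seen,
                             std::size_t length, std::mt19937& rng) {
    static const std::string alphabet =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    while (true) {
        std::string s;
        for (std::size_t i = 0; i < length; ++i)
            s += alphabet[pick(rng)];
        if (seen.insert(s).second)   // insert() reports whether s was new
            return s;                // duplicate -> loop and try again
    }
}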
It sounds like you want to generate a uuid? See http://docs.python.org/library/uuid.html
>>> import uuid
>>> uuid.uuid4()
UUID('dafd3cb8-3163-4734-906b-a33671ce52fe')
You should specify what programming language you're coding in. For instance, in Java this will work nicely: UUID.randomUUID().toString(). UUID identifiers are unique in practice, as stated on Wikipedia:
The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. In this context the word unique should be taken to mean "practically unique" rather than "guaranteed unique". Since the identifiers have a finite size it is possible for two differing items to share the same identifier. The identifier size and generation process need to be selected so as to make this sufficiently improbable in practice.
A binary tree is probably better than usual here: no rebalancing is necessary, because your strings are random, and binary trees work their best on random data. However, it's still O(log n) for lookup and insertion.
But maybe more efficient, if you know in advance how many random strings you'll need and don't mind a little probability in the mix, is to use a bloom filter.
Bloom filters give an efficient, probabilistic set membership test with memory requirements as low as one bit per element saved in the set. Basically, a bloom filter can say with 100% certainty that a member does not belong to a set, but with a high but not quite 100% certainty that a member is in a set. In your case, throwing out an extra candidate or two shouldn't hurt at all, so the probabilistic nature shouldn't hurt a bit.
Bloom filters are also relatively unique in that they can test for set membership in constant time.
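A minimal sketch of the idea (the bit-array size, hash count, and the cheap second-hash trick are illustrative; a real filter would derive them from the expected number of elements and the target false-positive rate):

#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter: k probe positions derived from two hashes of the
// string (double hashing).
class BloomFilter {
public:
    BloomFilter(std::size_t bits, std::size_t hashes)
        : bits_(bits), hashes_(hashes), table_(bits, false) {}

    void add(const std::string& s) {
        for (std::size_t i = 0; i < hashes_; ++i)
            table_[probe(s, i)] = true;
    }

    // false => definitely not present; true => probably present
    bool possiblyContains(const std::string& s) const {
        for (std::size_t i = 0; i < hashes_; ++i)
            if (!table_[probe(s, i)])
                return false;
        return true;
    }

private:
    std::size_t probe(const std::string& s, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(s);
        std::size_t h2 = std::hash<std::string>{}(s + "#"); // cheap second hash
        return (h1 + i * h2) % bits_;
    }

    std::size_t bits_;
    std::size_t hashes_;
    std::vector<bool> table_;
};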
For a while, I listed treaps here, but that's silly: they do a lot of operations in O(log n) again, and would only be relevant if your data isn't truly random.
If you don't need your strings to be saved in order for some reason (and it sounds like you probably don't), a traditional hash table is a good way to go. They like to know how big your final dataset will be in advance (to avoid slow hash table resizes), but they too are constant time for insertion and lookup.
http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

What is the best data structure to store FIX messages?

What's the best way to store the following message into a data structure for easy access?
"A=abc,B=156,F=3,G=1,H=10,G=2,H=20,G=3,H=30,X=23.50,Y=xyz"
The above consists of key/value pairs of the following:
A=abc
B=156
F=3
G=1
H=10
G=2
H=20
G=3
H=30
X=23.50
Y=xyz
The tricky part is the keys F, G and H. F indicates the number of items in a group, where each item consists of G and H.
For example if F=3, there are three items in this group:
Item 1: G=1, H=10
Item 2: G=2, H=20
Item 3: G=3, H=30
In the above example, each item consists of two key/value pairs: G and H. I would like the data structure to be flexible enough to handle items that gain additional key/value pairs. As much as possible, I would like to maintain the order in which the keys appear in the string.
UPDATE: I would like to store the key/value pairs as strings, like a map, even though the values are often floats or other data types.
May not be what you're looking for, but I'd simply recommend using QuickFIX (quickfixengine.org), which is a very high quality C++ FIX library. It has the type "FIX::Message" which does everything you're looking for, I believe.
I work with FIX a lot in Python and Perl, and I tend to use a dictionary or hash. Your keys should be unique within the message. For C++, you could look at std::map or the STL extension std::hash_map.
If you have a subset of FIX messages you have to support (most exchanges usually use 10-20 types), you can roll your own classes to parse messages into. If you're trying to be more generic, I would suggest creating something like a FIXChunk class. The entirety of the message could be stored in this class, organized into keys and their values, as well as lists of repeating groups. Each of the repeating groups would itself be a FIXChunk.
A simple solution: you could use a std::multimap<std::string, std::string> to store the data. That allows you to have multiple entries with the same key.
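Note that a multimap orders entries by key rather than by position in the message, so if preserving the original order matters, a vector of key/value pairs is a closer fit. A rough sketch of an order-preserving parse of the example string (no error handling; the repeating group is kept simply as ordered pairs, with F telling you how many G/H items follow):

#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Field = std::pair<std::string, std::string>;

// Split "A=abc,B=156,..." into key/value pairs, preserving message order.
std::vector<Field> parseMessage(const std::string& msg) {
    std::vector<Field> fields;
    std::istringstream in(msg);
    std::string token;
    while (std::getline(in, token, ',')) {
        std::size_t eq = token.find('=');
        if (eq != std::string::npos)
            fields.emplace_back(token.substr(0, eq), token.substr(eq + 1));
    }
    return fields;
}

int main() {
    auto fields = parseMessage(
        "A=abc,B=156,F=3,G=1,H=10,G=2,H=20,G=3,H=30,X=23.50,Y=xyz");
    for (const auto& f : fields)
        std::cout << f.first << " = " << f.second << '\n';
    // The F field still tells you how many (G,H) items follow, so repeating
    // group items can be reassembled by scanning the fields in order.
}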
In my experience, FIX messages are usually stored either in their original form (as a stream of bytes) or as a complex data structure providing a full API that can handle their intricacies. After all, a FIX message can sometimes represent a tree of data.
The problem with the latter solution is that the conversion is expensive in terms of computation cost in high-speed trading systems. If you are building a trading system, you may prefer to lazily parse only the parts of the FIX message that you need, which is admittedly easier said than done.
I am not familiar with efficient open-source implementations; companies like the one I work for usually have proprietary implementations.