How can <cfqueryparam> affect performance for constants and null values?

Consider the following:
<cfquery name="aQuery" datasource="#aSource#">
SELECT aColumn
FROM aTable
WHERE bColumn = <cfqueryparam value="#aVariable#" cfsqltype="#aSqlType#" />
AND cColumn = 'someConstant'
AND dColumn is null
</cfquery>
If I change
AND cColumn = 'someConstant'
to
AND cColumn = <cfqueryparam value="someConstant" cfsqltype="#aSqlType#" />
Is there a potential performance improvement? Is there potential to hurt performance?
What if I do the same (use cfqueryparam) with AND dColumn is null?
My findings have been inconclusive.
If it's important, assume ColdFusion9 and Oracle db 11g.
EDIT:
I'd just like to reiterate that I'm asking specifically about cfqueryparam tags used with constants and/or null values, and the performance ramifications, if any.

Is there a potential performance improvement?
No. Bind variables are most useful when the parameter values vary. Without them, the database generates a new execution plan every time a query parameter changes, which is expensive. Bind variables encourage the database to cache and reuse a single execution plan even as the values change, saving the cost of compiling and boosting performance. With a constant there is no such benefit: since the value never changes, the database will reuse the execution plan anyway, so there is little reason to parameterize it.
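Roughly, this is the difference as the database sees it (a sketch; :1 is Oracle-style bind placeholder notation):
SELECT aColumn
FROM aTable
WHERE bColumn = :1
AND cColumn = 'someConstant'
AND dColumn IS NULL
One cached plan serves every value bound to :1. If aVariable were inlined as a literal instead, each distinct value would hard-parse a fresh statement, while the 'someConstant' literal never changes and so never triggers a re-parse.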
Is there potential to hurt performance?
I have seen a few mentions of special cases where using bind variables on constants may actually degrade performance, but that is really a case-by-case matter.
http://decipherinfosys.wordpress.com/2007/04/02/scenario-for-using-a-constant-instead-of-a-bind-variable-in-an-oltp-system/
http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:24110#121568

I don't think so, since you already have one <cfqueryparam> and that will turn it into a prepared statement, but I'm not sure.

Using a query param will help in 2 ways.
First, it will protect you from SQL injection. It adds a level of protection by making sure the data in the param is what is expected.
Secondly, you will see a performance increase. However, the increase depends on the data schema and indexes. The param allows the database to cache the query plan, which cuts the initial overhead of query execution. The more complex the query, the more important it becomes to cache the query plan.
Also, make sure you have a covering index for all the columns in the where clause and that they are in the correct order. If not, the query optimizer may opt to ignore the index and go directly to table scans.
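For example, a covering index for the query in the question might look like this (index name hypothetical; match the column order to your actual predicates and verify with the execution plan):
CREATE INDEX idx_atable_cover
ON aTable (bColumn, cColumn, dColumn, aColumn);
Appending aColumn lets Oracle answer the query from the index alone, and because bColumn and cColumn are indexed alongside dColumn, rows where dColumn is null still appear in the index.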

Related

SAS Hash Tables: Is there a way to find/join on different keys or have optional keys

I frequently work with data for which the keys are not perfect, and I need to join data from a different source. I want to continue using hash objects for the speed advantage; however, when I am using a lot of data I can run into crashes (memory constraints).
A simplified overview: I have 2 different keys which are both unique but not present for every record; we will call them Key1 and Key2.
My current solution, which is not very elegant (but it works), is to do the following:
data want;
  if _N_ = 1 then do;
    /* load the same lookup dataset twice, keyed two different ways */
    declare hash h1(Dataset:"DataSet1");
    h1.DefineKey("key1");
    h1.DefineData("Value");
    h1.DefineDone();
    declare hash h2(Dataset:"DataSet1");
    h2.DefineKey("key2");
    h2.DefineData("Value");
    h2.DefineDone();
  end;
  set DataSet2;
  rc = h1.find();          /* try the first key */
  if rc NE 0 then do;
    rc = h2.find();        /* fall back to the second */
  end;
run;
So I have exactly the same dataset in two hash tables, but with 2 different keys defined, if the first key is not found, then I try to find the second key.
Does anyone know of a way to make this more efficient/easier to read/less memory intensive?
Apologies if this seems a bad way to accomplish the task, I absolutely welcome criticism so I can learn!
Thanks in advance,
Adam.
I am a huge proponent of hash table lookups - they've helped me do some massive multi-hundred-million row joins in minutes that otherwise could have taken hours.
The way you're doing it isn't a bad route. If you find yourself running low on memory, the first thing to identify is how much memory your hash table is actually using. This article by sasnrd shows exactly how to do this.
Once you've figured out how much it's using and have a benchmark, or if it doesn't even run at all because it runs out of memory, you can play around with some options to see how they improve your memory usage and performance.
1. Include only the keys and data you need
When loading your hash table, exclude any unnecessary variables. You can do this before loading the hash table, or during. You can use dataset options to help reduce table size, such as where, keep, and drop.
dcl hash h1(dataset: 'mydata(keep=key var1)');
2. Reduce the variable lengths
Long character variables take up more memory. Decreasing the length to their minimum required value will help reduce memory usage. Use the %squeeze() macro to automatically reduce all variables to their minimum required size before loading. You can find that macro here.
%squeeze(mydata, mydata_smaller);
3. Adjust the hashexp option
hashexp helps improve performance and reduce hash collisions. Larger values of hashexp increase memory usage but may improve performance; smaller values reduce memory usage. I recommend reading the link above, and also the link at the top of this post by sasnrd, to get an idea of how it will affect your join. This value should be sized appropriately to your table. There's no hard and fast answer as to what value you should use; my recommendation is as big as your system can handle.
dcl hash h1(dataset: 'mydata', hashexp:2);
4. Allocate more memory to your SAS session
If you often run out of memory with your hash tables, your memsize may be set too low. Many machines have plenty of RAM nowadays, and SAS does a really great job of juggling multiple hard-hitting SAS sessions even on moderately equipped machines. Increasing this can make a huge difference, but treat it as a last resort.
The default memsize option is 2GB. Try increasing it to 4GB, 8GB, 16GB, etc., but don't go overboard, like setting it to 0 to use as much memory as it wants. You don't want your SAS session to eat up all the memory on the machine if other users are also on it.
Temporarily setting it to 0 can be a helpful troubleshooting tool to see how much memory your hash object actually occupies if it's not running. But if it's your own machine and you're the only one using it, you can just go ham and set it to 0.
memsize can be adjusted at SAS invocation or within the SAS Configuration File directly (sasv9.cfg on 9.4, or SASV9_Option environment variable in Viya).
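For example (8G is just an illustration; size it to your machine):
At invocation:
sas -memsize 8G
Or in sasv9.cfg:
-MEMSIZE 8G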
I have a fairly similar problem that I approached slightly differently.
First: all of what Stu says is good to keep in mind, regardless of the issue.
If you are in a situation though where you can't really reduce the character variable size (remember, all numerics are 8 bytes in RAM no matter what the dataset size, so don't try to shrink them for this reason), you can approach it this way.
1. Build a hash table with key1 as key, and key2 as data along with your actual data. Make sure that key1 is the "better" key - the one that is more fully populated. Rename key2 to some other variable name, to make sure you don't overwrite your real key2.
2. Search on key1. If key1 is found, great! Move on.
3. If key1 is missing, then use a hiter object (hash iterator) to iterate over all of the records searching for your key2.
This is not very efficient if key2 is needed a lot. Step 3 might also be better done another way than with a hiter - you could do a keyed set or something else for those records, for example. In my particular case, both the table and the lookup were missing key1, so it was possible to simply iterate over the much smaller subset missing key1; if that's not true in your case, and your master table is fully populated for both keys, then this is going to be a lot slower.
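A minimal sketch of this approach, reusing the question's names (the key2_lk rename and the output dataset name are mine):
data want;
  if _N_ = 1 then do;
    if 0 then set DataSet1(rename=(key2=key2_lk));  /* define lookup vars in the PDV */
    declare hash h1(dataset: "DataSet1(rename=(key2=key2_lk))");
    h1.DefineKey("key1");
    h1.DefineData("key2_lk", "Value");
    h1.DefineDone();
    declare hiter it("h1");
  end;
  set DataSet2;
  if h1.find() ne 0 then do;
    /* key1 missed: walk the table looking for a key2 match */
    found = 0;
    rc = it.first();
    do while (rc = 0 and not found);
      if key2_lk = key2 then found = 1;
      else rc = it.next();
    end;
    if not found then call missing(Value);
  end;
run;
Note that first()/next() only update the data portion of the hash, so key1 from the current DataSet2 row is not clobbered while iterating; key2_lk is retrievable because it was defined as data.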
The other thing I'd consider is abandoning hash tables and using a keyed set, or a format, or something else that doesn't use RAM.
Or split your dataset:
data haskey1 nokey1;
  set yourdata;
  if missing(key1) then output nokey1;
  else output haskey1;
run;
Then run two data steps, one using a hash keyed on key1 and one using a hash keyed on key2, and combine the two back together.
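A rough sketch of that pattern (the output dataset names are mine):
data haskey1_j;
  if _N_ = 1 then do;
    if 0 then set DataSet1;      /* define Value in the PDV */
    declare hash h(dataset: "DataSet1");
    h.DefineKey("key1");
    h.DefineData("Value");
    h.DefineDone();
  end;
  set haskey1;
  if h.find() ne 0 then call missing(Value);
run;
/* repeat keyed on key2 against nokey1 to make nokey1_j, then: */
data combined;
  set haskey1_j nokey1_j;
run;
Only one hash is in memory at a time this way, which is the point of the split.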
Which of these is the most efficient depends heavily on your dataset sizes (both master and lookup) and on the missingness of key1.

PDI aka Kettle: is it better to use "add constant" or a literal string in the table input?

In a Kettle/PDI transformation, I need to write to a table the values from another table plus some other static strings.
1. Table input: read the records;
2. Add constants: add "status" = "A" (and other static strings);
3. Table output: write the old values plus status and the other constants.
Is it better to add the literal in the table input "select" (select id,field1, 'A' as status from ...) or better to use an Add Constants step?
I suppose it's better to reduce the number of steps, because "Add constants" means instantiating one more step.
EDIT: By "better" I mean faster and less memory-consuming.
My opinion is to do the minimum of transformation in the Table input step, because the philosophy of PDI is to make all of the transformation visible.
Now, if you're an expert in SQL, or have a legacy select of 200 lines with complex computation, my answer would be different.
Creating one more step in a transformation leads to a separate thread allocation, since every step is a separate thread, plus the allocation of at least one BlockingQueue, since rows are distributed between steps in memory through these structures.
Using one more step, even one as simple as Add constants, will therefore cause additional resource allocation.
Happily, PDI is still open source.
If you are curious how this is done, here is the base transformation step implementation (it has been there a long time) -> https://github.com/pentaho/pentaho-kettle/blob/master/engine/src/main/java/org/pentaho/di/trans/step/BaseStep.java
And this is an example of the code used to distribute rows between steps -> https://github.com/pentaho/pentaho-kettle/blob/master/core/src/main/java/org/pentaho/di/core/BlockingRowSet.java#L54
Sure, for a simple added constant, PDI will be overhead compared to doing it in the SQL query. There are lots of ways to make particular operations faster or less memory-consuming, but what about the GUI and the other features that PDI is actually famous for?

Performance implications of using a flatter schema

I'm using FlatBuffers (C++) to store metadata information about a file. This includes EXIF, IPTC, GPS and various other metadata values.
In my current schema, I have a fairly normalized definition whereby each of the groups listed above has its own table. The root table just includes properties for each sub-table.
Basic Example:
table GPSProperties {
  latitude:double;
  longitude:double;
}

table ContactProperties {
  name:string;
  email:string;
}

table EXIFProperties {
  camera:string;
  lens:string;
  gps:GPSProperties;
}

table IPTCProperties {
  city:string;
  country:string;
  contact:ContactProperties;
}

table Registry {
  exifProperties:EXIFProperties;
  iptcProperties:IPTCProperties;
}

root_type Registry;
This works, but the nesting restrictions when building a buffer are starting to make the code pretty messy. As well, breaking up the properties into separate tables is only for clarity in the schema.
I'm considering just "flattening" the entire schema into a single table but I was wondering if there are any performance or memory implications of doing that. This single table could have a few hundred fields, though most would be empty.
Proposal:
table Registry {
  exif_camera:string;
  exif_lens:string;
  exif_gps_latitude:double;
  exif_gps_longitude:double;
  iptc_city:string;
  iptc_country:string;
  iptc_contact_name:string;
  iptc_contact_email:string;
}

root_type Registry;
Since properties that are either not set or set to their default value don't take up any memory, I'm inclined to believe that a flattened schema might not be a problem. But I'm not certain.
(Note that performance is my primary concern, followed closely by memory usage. The normalized schema is performing excellently, but I think a flattened schema would really help me clean up my codebase.)
Some basics you should be clear on first:
Every table has a vtable at the top of it which tells the offset at which each field of the table can be found. If there are many fields in a table, this vtable will grow large, whether or not you store data in those fields.
If you create a hierarchy of tables, you are creating extra vtables and also adding indirection cost to the design.
On the other hand, vtables are shared if similar data is stored across multiple objects - for example, if you are creating many objects with only the exif_camera field set.
So it depends: if your data is going to be huge and heterogeneous, use the more organized hierarchy; if your data is going to be homogeneous, prefer a flattened table.
Since most of your data is strings, the size and speed of both of these designs will be very similar, so you should probably choose based on what works better for you from a software engineering perspective.
That said, the flat version will likely be slightly more efficient in size (fewer vtables) and certainly will be faster to access (though again, that is marginal given that it is mostly string data).
The only way in which the flat version could be less efficient is if you were to store a lot of them in one buffer, where which fields are set varies wildly between each table. Then the non-flat version may generate more vtable sharing.
In the non-flat version, tables like GPSProperties could be a struct if the fields are unlikely to ever change, which would be more efficient.
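For instance, a minimal sketch of that struct variant (valid only if the coordinates are always present, since struct fields are required and stored inline):
struct GPSProperties {
  latitude:double;
  longitude:double;
}

table EXIFProperties {
  camera:string;
  lens:string;
  gps:GPSProperties;
}
A struct costs no vtable entries of its own and no indirection, at the price of losing optionality and forward-compatible evolution of its fields.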
This single table could have a few hundred fields, though most would be empty.
The performance cost is likely to be so small you won't notice, but the quote above is, to me, the deciding factor in which design to use.
While others are talking about the cost of vtables, I wouldn't worry about that at all: there's a single vtable per class, prepared once per run, and it will not be expensive.
Having hundreds of strings that are empty and unused, however, is going to be very expensive memory-wise and a drain on every object you create; in addition, reading your fields becomes much more complex, since you can no longer assume that all the data for the class is there as you read it.
If most / all the fields were always there, then I can see the attraction of making a single class; but they're not.

difference between GL_SAMPLES_PASSED and GL_ANY_SAMPLES_PASSED in occlusion query

I know what a query object is, but I don't quite fully understand the difference between GL_SAMPLES_PASSED and GL_ANY_SAMPLES_PASSED. The reference page says this about GL_ANY_SAMPLES_PASSED: subsequent rendering causes the flag to be set to GL_TRUE if "any" sample passes the depth test.
So does this mean the only differences are that a query object with GL_ANY_SAMPLES_PASSED is much faster, because it does not have to count the number of samples that passed and simply returns true/false instead of a count, and that it is also more helpful in conditional rendering (because of the true/false value)?
Whether GL_ANY_SAMPLES_PASSED is faster than GL_SAMPLES_PASSED or not is unknown. Neither way will be faster in terms of rendering, as you can't know what the answer is until the entire test has completed rendering through the pipeline.
It's not even really for conditional rendering scenarios, because they can both be used for that. You can use conditional rendering with GL_SAMPLES_PASSED and achieve the same effect; they use the same true/false condition (ie: conditional with GL_SAMPLES_PASSED is considered to pass if the sample count is > 0).
The difference is that one gives you more information than the other. The any query might be more efficient to compute; that is, GL_SAMPLES_PASSED might have some non-trivial rasterization overhead that GL_ANY_SAMPLES_PASSED does not. Then again, it may not. It would vary with the hardware.
Use whichever one serves your needs. If you need a sample count, ask for one. If all you need to know is whether it passed, then use that.
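For reference, a minimal sketch of both uses (C, desktop GL 3.3+; assumes a current context and loaded function pointers; the two draw functions are hypothetical):
GLuint query;
glGenQueries(1, &query);

/* boolean query: did anything pass the depth test? */
glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
drawBoundingBox();                       /* hypothetical occluder test draw */
glEndQuery(GL_ANY_SAMPLES_PASSED);

GLuint anyPassed;
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &anyPassed);  /* blocks until the result is ready */

/* or skip the CPU readback and let the GPU gate the draw itself: */
glBeginConditionalRender(query, GL_QUERY_WAIT);
drawFullDetailModel();                   /* hypothetical real draw */
glEndConditionalRender();
Swapping GL_ANY_SAMPLES_PASSED for GL_SAMPLES_PASSED above works identically with conditional rendering, which is the point made earlier: both reduce to the same pass/fail condition.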

ref-set vs commute vs alter

What is the difference in the 3 ways to set the value of a ref in Clojure? I've read the docs several times about ref-set, commute, and alter. I'm rather confused which ones to use at what times. Can someone provide me a short description of what the differences are and why each is needed?
As a super simple explanation of how the software transactional memory (STM) system works in Clojure: it retries transactions until every one of them gets through without having its values changed out from under it. You can help it make this decision by using ref-changing functions that give it hints about which interactions between transactions are safe.
ref-set is for when you don't care about the current value. Just set it to this! ref-set saves you the angst of writing something like (alter my-ref (fn [_] 4)) just to set the value of my-ref to 4. (ref-set my-ref 4) sure does look a lot better :).
Use ref-set to simply set the value.
alter is the most normal standard one. Use this function to alter the value. This is the meat of the STM. It uses the function you pass to change the value and retries if it cannot guarantee that the value was unchanged from the start of the transaction. This is very safe, even in some cases where you don't need it to be that safe, like incrementing a counter.
You probably want to use alter most of the time.
commute is an optimized version of alter for those times when the order of operations really does not matter: it makes no difference who added which +1 to the counter, the result is the same. If the STM is deciding whether your transaction is safe to commit, and it only has conflicts on commute operations and none on alter operations, then it can go ahead and commit the new values without having to restart anyone. This can save the occasional transaction retry, though you're not going to see huge gains from this in normal code.
Use commute when you can.
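A minimal sketch contrasting the three (the counter ref is just for illustration):
(def counter (ref 0))

;; don't care about the current value - just set it
(dosync (ref-set counter 42))

;; read-modify-write that must see a consistent value; retried on conflict
(dosync (alter counter inc))

;; order-independent update; the STM may commit it without a full retry
(dosync (commute counter inc))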