How to cast an expression in a select clause with high performance

I have a table and want to convert a column of type decimal to an integer based on a condition. I need to do this as efficiently as possible.
My currently working query looks like this:
select *, (case when C_CUSTKEY < 20 then 1 else 0 end) as bit
from TPCH.CUSTOMER
However, I'm trying to improve this. Is it possible to do the conversion on the fly, e.g. something like this:
select *, cast((C_CUSTKEY < 20) as integer) as bit
from TPCH.CUSTOMER
Or even a simpler version like:
select *, (C_CUSTKEY < 20) as bit
from TPCH.CUSTOMER
No matter what functionality I use, I always get the following error:
sql syntax error: incorrect syntax near "<"
Update 1
So for better understanding, an actual use case of what I want to do is the following:
select ( (case when col1 < x then 1 else 0 end)
       + (case when col2 > y then 2 else 0 end) ) as bitset
from TPCH.CUSTOMER
In general there can be a large number of 'case when' expressions (>100).
Now the first problem is that the expression is very slow and I need to improve performance, so I wanted to know whether it's possible to do the conversion on the fly, something like 2 * to_int(col2 > y), but I cannot find a way to do this.
The second problem is that when I have many 'case when' expressions, I get the following error: SQL internal parse tree depth exceeds its maximum: parse tree depth exceeds its maximum:255

Not quite sure about the use case here.
Generally speaking, you cannot have a different data type for the same column at the same time. That's a fundamental constraint of the relational data model.
If, however, the requirement is merely to have an indicator for "value in C_CUSTKEY < 20", then using a CASE expression is a straightforward way to do this.
In order to avoid this evaluation at query time, you could create a calculated column for this.
alter table TPCH.CUSTOMER add ("BIT" integer
    generated always as
        case
            when ("C_CUSTKEY" < 20)
                then 1
            else 0
        end);
Concerning the syntax errors: you're simply not using the correct syntax. The statement above shows a correctly working example.
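As for the parse tree depth error from the update: the depth comes from chaining 100+ CASE terms into a single expression. One possible workaround, sketched here against standard TPC-H columns (the thresholds and the two-way split are purely illustrative, and I haven't verified the resulting parse depth on HANA), is to compute partial sums in a subquery and combine them in the outer query, so that no single expression nests too deeply:
select C_CUSTKEY, part1 + part2 as bitset
from (select C_CUSTKEY,
             (case when C_CUSTKEY < 20 then 1 else 0 end)
           + (case when C_NATIONKEY > 10 then 2 else 0 end) as part1,
             (case when C_ACCTBAL > 0 then 4 else 0 end)
           + (case when C_ACCTBAL > 1000 then 8 else 0 end) as part2
      from TPCH.CUSTOMER) t;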

Related

Make =IF Function Output Numbers For "Scoring": Google Sheets

I am exploring methods of giving scores to different data points within a dataset. These points come from a mix of numbers and text-string attributes checked for certain characteristics, e.g. if Col. A contains more than X number of "|", then give it a 1. If not, it gets a 0 for that category. I also have some that give the point when the value is >X.
I have been trying to do this with =IF, for example, =IF([sheet] = [Text], "1","0").
I can get it to give me 1 or 0, but I am unable to get a point total with sum.
I have tried changing the formatting of the cells to "number" and to "plain text", and have also left it as automatic, but I can't get it to sum. Thoughts? Is there maybe a better way to do this?
FWIW - I'm trying to score based on about 12 factors.
Best,
Alex
The issue here might be that you're having the cell evaluate to either the string "0" or the string "1" rather than the number 0 or the number 1. That would explain why you're seeing the right things but the math isn't coming out right - the cell contents look like numbers, but they're really text, which the summation would then ignore.
One option would be to drop the quotation marks and write something like this:
=IF(condition, 1, 0)
This has the condition evaluate to 1 if it's true and 0 if it's false.
Alternatively, you could write something like this:
=(condition) * 1
This will take the boolean TRUE or FALSE returned by condition and convert it to either the numeric value 1 (true) or the numeric value 0 (false).
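Once the cells evaluate to real numbers, a plain SUM should produce the point total. As a sketch, assuming your 12 factor cells sit in B2:M2 (adjust the range to match your sheet):
=SUM(B2:M2)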

cloudsearch query to boost exact match on range

In a CloudSearch structured query, I have a couple of fields I am searching on.
On field one, the user selects "2"
On field two the user selects "1"
I want to run this as a range query, so that the results returned are within -1 to +1 of the selected values,
e.g. on field one the range would be [1,3] and on field two it would be [0,2].
What I want is to sort the results so that those matching both field one and field two exactly are at the top, and the rest under them,
e.g. rows where field one = 2 and field two = 1 would be at the top, and the rest in no specific order.
Note: I do need to end up sorting the results by distance, so that all the exact-matching results are in distance order, then all the rest are ordered by distance.
I am sure I can do this with 2 queries; I'm just trying to make it work in one query, if at all possible, to lighten the load.
Say your fields are 'a' and 'b', and the specified values are a=2 and b=1 (as in your example, except I've named the fields 'a' and 'b' instead of 'one' and 'two'). Here are the various terms of your query.
Range Query
This is the query for the range a±1 and b±1 where a=2 and b=1:
q=(and (range field=a[1,3]) (range field=b[0,2]))
Rank Expression
For your rank expression, compute a distance-based score using absolute value, so that the 'a' and 'b' offsets can't cancel each other out (as a=3,b=0 would, for example):
expr.rank1=abs(a-2)+abs(b-1)
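For example, with a=3 and b=0 this gives abs(3-2)+abs(0-1) = 2, whereas the plain (a-2)+(b-1) would give 0 and falsely look like an exact match.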
Sort by Rank
That defined a ranking expression named rank1, which we now want to sort by, starting with the lowest values ('0' means a=2,b=1):
sort=rank1 asc
Return the Rank
For debugging purposes, you may want to return the ranking score:
return=rank1
Put all those terms together and you've got your query.
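Assembled into one request (URL encoding aside, and assuming the structured parser is selected via q.parser=structured), that gives something like:
q=(and (range field=a[1,3]) (range field=b[0,2]))&q.parser=structured&expr.rank1=abs(a-2)+abs(b-1)&sort=rank1 asc&return=rank1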
Further Potentially-Useful Things
If you want to get fancy and penalize things in a non-linear way, you can use exp. For example, if you want to differentiate between 'a' and 'b' both being off by 1 vs 'a' being an exact match and 'b' being off by 2 (eg a=3,b=2 will rank ahead of a=2,b=3 even though the previous ranker would give them both a score of 2):
expr.rank1=exp(abs(a-2))+exp(abs(b-1))
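For example, a=3,b=2 now scores exp(1)+exp(1) ≈ 5.44 while a=2,b=3 scores exp(0)+exp(2) ≈ 8.39, so the former sorts ahead, whereas the linear ranker scored both as 2.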
And you can use boolean logic and the ternary operator to detect and prefer certain results that meet certain criteria, e.g. to give a big boost when 'a' and 'b' are both on-target, a smaller boost when 'a' or 'b' is on target, etc (since we're sorting low-to-high, a boost in rank is actually achieved by adding less to the result):
((a==2&&b==1)?0:100)+((a==2||b==1)?0:1000)+abs(a-2)+abs(b-1)
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-expressions.html

How to distribute values into groups in Python

I have a dataset of actions performed over time, with an attribute 'Hour' (containing values from 0 to 23). Now I want to create another attribute, say 'PartOfDay', which groups the 24 hours into 4 parts. For tuples with an 'Hour' value of 0 to 5, the 'PartOfDay' value should be 1; if the 'Hour' value is in [6,11], the 'PartOfDay' value should be 2; and so on. How can I do this?
The following code would do it:
train['PartOfDay'] = 1
train.loc[(train.Hour >= 6) & (train.Hour <= 11), 'PartOfDay'] = 2
train.loc[(train.Hour >= 12) & (train.Hour <= 17), 'PartOfDay'] = 3
train.loc[(train.Hour >= 18) & (train.Hour <= 23), 'PartOfDay'] = 4
but it seems not so beautiful; I would like to know a more elegant way if possible.
Thank you for all your support!!
While it is not clear what train.loc represents, a general approach to your problem is to use integer division to set the right-hand side, since each part of the day spans exactly six hours:
1 + train.Hour // 6
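If train is a pandas DataFrame, an alternative that also copes with uneven bins is pd.cut; a sketch, assuming Hour holds integers 0 through 23:
import pandas as pd

# Right-inclusive bins: (-1,5] -> 1, (5,11] -> 2, (11,17] -> 3, (17,23] -> 4
train['PartOfDay'] = pd.cut(train['Hour'], bins=[-1, 5, 11, 17, 23], labels=[1, 2, 3, 4])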

PL/SQL optimize searching a date in varchar

I have a table that contains a date field (let it be date s_date) and a description field (varchar2(n) desc). What I need is to write a script (or a single query, if possible) that will parse the desc field and, if it contains a valid Oracle date, cut this date out and use it to update s_date, if s_date is null.
But there is one more condition: there must be exactly one occurrence of a date in desc. If there are 0 or more than 1, nothing should be updated.
So far I have come up with this pretty ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
    l_date date;
    pRegEx varchar(150);
    pResStr varchar(150);
begin
    pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
    pResStr := regexp_substr(p_date_str, pRegEx);
    if not (length(pResStr) = 10) then
        return null;
    end if;
    l_date := to_date(pResStr, 'dd.mm.yyyy');
    return l_date;
exception
    when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it's working extremely slowly (more than a second for each record, and I need to update about 30000 records). Is it possible to optimize the function somehow? Maybe there is a way to do this without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression checks for a valid date (DD.MM.YYYY), taking into account the number of days in each month, including the check for leap years:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query suggested by @David (see accepted answer), but I tried select instead of update (so it's one regexp less per row, because we don't do the regexp_substr), just for "benchmarking" purposes.
Numbers probably won't tell much here, because it all depends on hardware, software and the specific DB design, but it took about 2 minutes to select 36K records for me. The update will be slower, but I think it'll still take a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the where clause to find rows for which a first occurrence of the match occurs and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
      regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
        when 0 then 'N'
        else case regexp_instr(desc,pattern,1,2)
               when 0 then 'Y'
               else 'N'
             end
      end = 'Y'
... as it only evaluates the second regexp if the first is non-zero. The first query might also do that but the optimiser might choose to evaluate the second predicate first because it is an equality condition, under the assumption that it's more selective.
Or reordering the Case expression might be better -- it's a trade-off that's difficult to judge and probably very dependent on the data.
I think there's no way to speed this task up. Actually, in order to achieve what you want, it has to get even slower.
Your regular expression matches text like 31.02.2013 or 31.04.2013, dates that are invalid for the month. If you bring the year into play,
it gets even worse: 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test whether the result is a valid date.
Since there isn't a full regular expression for that, you would have to do it in PL/SQL.
In your to_date_single function you return null when an invalid date is found.
But that doesn't mean there won't be other valid dates further on in the text.
So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
    l_date date;
    pRegEx varchar(150);
    pResStr varchar(150);
    vn_findings number;
    vn_loop number;
begin
    vn_findings := 0;
    vn_loop := 1;
    pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
    loop
        pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
        if pResStr is null then exit; end if;
        begin
            l_date := to_date(pResStr, 'dd.mm.yyyy');
            vn_findings := vn_findings + 1;
            -- your crazy requirement :)
            if vn_findings = 2 then
                return null;
            end if;
        exception when others then
            null;
        end;
        -- you have to keep trying :)
        vn_loop := vn_loop + 1;
    end loop;
    return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1 --ok
     , fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2 --ok, 2nd is invalid
     , fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c3 --null, both are valid
  from dual
As you are going to have to do trial and error anyway, one idea would be to use a simpler regular expression.
Something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That depends on your data, of course.
Using @David's idea, you could filter the number of rows on which to apply your slow function,
but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date(desc)
where regexp_instr(desc, pattern, 1, 1) > 0

SQL and regular expression to check if string is a substring of larger string?

I have a database filled with some codes like
EE789323
990
78000
These numbers are ALWAYS endings of a larger code. Now I have a function that needs to check if the larger code contains the subcode.
So if I have codes 90 and 990 and my full code is EX888990, it should match both of them.
However I need to do it in the following way:
SELECT * FROM tableWithRecordsWithSubcode
WHERE subcode MATCHES [reg exp with full code];
Is a regular expression like this even possible?
EDIT:
To clarify the issue I'm having: I'm not actually using SQL here; I just used it to give an example of the type of query I'm after.
In fact I'm using iOS with CoreData, and I need a predicate that fetches me only the records that match,
in the way mentioned below.
Given the observations from a comment:
Do you have two tables, one called tableWithRecordsWithSubcode and another that might be tableWithFullCodeColumn? So the matching condition is in part a join - you need to know which subcodes match any of the full codes in the second table? But you're only interested in the information in the tableWithRecordsWithSubcode table, not in which rows it matches in the other table?
and the laconic "you're correct" response, we have to rewrite the query somewhat.
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON F.Fullcode ...ends-with... S.Subcode
or maybe using an EXISTS sub-query:
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(SELECT * FROM tableWithFullCodeColumn AS F
WHERE F.Fullcode ...ends-with... S.Subcode)
This uses a correlated sub-query but avoids the DISTINCT operation; it may mean the optimizer can work more efficiently.
That just leaves the magical 'X ...ends-with... T' operator to be defined. One possible way to do that is with LENGTH and SUBSTR. However, SUBSTR does not behave the same way in all DBMS, so you may have to tinker with this (possibly adding a third argument, LENGTH(s.subcode)):
LENGTH(f.fullcode) >= LENGTH(s.subcode) AND
SUBSTR(f.fullcode, LENGTH(f.fullcode) - LENGTH(s.subcode) + 1) = s.subcode
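For example, with Fullcode 'EX888990' (length 8) and Subcode '990' (length 3), the start position is 8 - 3 + 1 = 6, so SUBSTR returns '990' and the comparison succeeds.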
This leads to two possible formulations:
SELECT DISTINCT S.*
FROM tableWithRecordsWithSubcode AS S
JOIN tableWithFullCodeColumn AS F
ON LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode) + 1) = S.Subcode;
and
SELECT S.*
FROM tableWithRecordsWithSubcode AS S
WHERE EXISTS(
SELECT * FROM tableWithFullCodeColumn AS F
WHERE LENGTH(F.Fullcode) >= LENGTH(S.Subcode)
AND SUBSTR(F.Fullcode, LENGTH(F.Fullcode) - LENGTH(S.Subcode) + 1) = S.Subcode);
This is not going to be a fast operation; joins on computed results such as required by this query seldom are.
I'm not sure why you think that you need a regular expression... Just use the charindex function:
select something
from table
where charindex(subcode, code) <> 0
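For example, with the full code 'EX888990' and a stored subcode '990', charindex('990', 'EX888990') returns 6, so the row qualifies.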
Edit:
To find strings at the end, you can create a pattern with the % wildcard from the subcode:
select something
from table
where code like '%' + subcode
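With subcode '990', for instance, the pattern becomes '%990', which 'EX888990' matches only because it ends with '990'.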