Compare duplicates for 4 fields in Open SQL - combinations

I want to check whether there are duplicates across 4 fields in Open SQL.
Scenario: The user has 4 fields to fill in: first name (N1), last name (N2), additional first name (N3) and additional last name (N4).
Right now the algorithm works this way: it concatenates N1 + N2 + '%' and also N2 + N1 + '%'. So whatever the user enters in the fields, the query looks for N1N2% or N2N1%. This means that for 2 fields there are 2! combinations to check. With 2 additional fields this algorithm explodes, since there will be 4! combinations. Any ideas how to tackle this?
Note: We do this kind of combination check because the user could enter data in any of the given input fields, so we check all combinations of the fields. Unfortunately, this cannot be changed.
EDIT:
I cannot assume the order, because of the way this was previously designed. Hence the complications with combinations.
Edit2:
I like the idea of checking the individual parts. But what we ideally want to do is concatenate all strings together and check for a substring in the DB; in Open SQL this is done with the LIKE statement. Our DB table already stores such a concatenated string for the N1+N2 combination, and this now needs to be extended to 4 fields.
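To make the scale of the problem concrete, here is a small Python sketch (purely illustrative, not the actual ABAP) that enumerates the LIKE patterns the current permutation approach would need once all four fields participate:
from itertools import permutations

# Hypothetical placeholder values for the four input fields.
parts = ["N1", "N2", "N3", "N4"]

# One LIKE pattern per ordering of the parts: 4! = 24 patterns.
patterns = ["".join(p) + "%" for p in permutations(parts)]

print(len(patterns))  # 24
print(patterns[:2])   # ['N1N2N3N4%', 'N1N2N4N3%']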

The key to your problem is checking all name parts individually with leading and trailing '%', and checking the total length of the DB entry against the summed length of the name parts:
field LIKE '%' + N1 + '%' AND field LIKE '%' + N2 + '%' AND field LIKE '%' + N3 + '%' AND field LIKE '%' + N4 + '%' AND LENGTH(field) = LENGTH(N1 + N2 + N3 + N4)
This will find a match. You could also use it to SELECT a normalized concatenation of the names and use GROUP BY with HAVING COUNT(*) > 1 to search for duplicates.
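To see the rule in action, here is a small Python sketch of the same check (hypothetical values, not Open SQL): every part must occur in the stored value, and the total length must equal the sum of the part lengths:
def is_same_name(db_value, parts):
    # Keep only the fields the user actually filled in.
    parts = [p for p in parts if p]
    # Every part must occur in the stored value, and the lengths must add up.
    return (all(p in db_value for p in parts)
            and len(db_value) == sum(len(p) for p in parts))

print(is_same_name("MariaAnnaMeierHuber", ["Anna", "Maria", "Huber", "Meier"]))  # True
print(is_same_name("MariannaMeier", ["Anna", "Maria", "Meier"]))                 # False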

If the user does not care about the order and you want to check for duplicates, then the following condition should meet your criteria, I think.
SELECT ...
  FROM ...
  INTO TABLE ...
  WHERE N1 IN (#INPUT_N1, #INPUT_N2, #INPUT_N3, #INPUT_N4)
    AND N2 IN (#INPUT_N1, #INPUT_N2, #INPUT_N3, #INPUT_N4)
    AND N3 IN (#INPUT_N1, #INPUT_N2, #INPUT_N3, #INPUT_N4)
    AND N4 IN (#INPUT_N1, #INPUT_N2, #INPUT_N3, #INPUT_N4).

IF sy-dbcnt > 0.
  "duplicates found, do something...
ENDIF.
Of course, if there is garbage in the database where, for example, all four fields contain the same value, this will match even though it is not a real duplicate.
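The criterion behind that SELECT ("same four values, order irrelevant") is essentially a multiset comparison. Here is a small Python sketch of that criterion, purely for illustration; it also shows why the all-identical garbage row mentioned above slips through the IN conditions but not through a sorted comparison:
def is_duplicate(db_fields, input_fields):
    # Same four values, order irrelevant: compare as sorted sequences (multisets).
    return sorted(db_fields) == sorted(input_fields)

print(is_duplicate(("Anna", "Meier", "", ""), ("Meier", "Anna", "", "")))         # True
print(is_duplicate(("Anna", "Anna", "Anna", "Anna"), ("Anna", "Meier", "", "")))  # False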

Related

Google Sheets / Google Data Studio - RegEx

I have cells in a Google Sheet that contain combined data for tracking workout progress. They look something like this:
80kg-3x5, 100kg-1x3
For a given exercise (e.g. the hang snatch above), this records the actual work loads I did for that exercise on a given date, with the weight and the related sets x reps, separated by commas. So for one exercise I might have only one work load, or several (which are then comma-separated). I keep them in a single cell to keep the data tidy and to reduce the time needed to enter the data after a workout.
Now, to analyze the data, I need to somehow separate the comma-separated values. An example, using the sample cell data above, would be the total volume for that exercise, with an expression like this:
Sum( (digits before 'kg') * (digits before 'x') * (digits after 'x') + the same expression again for each further load, if a comma ',' follows the first expression (multiple loads for the exercise) )
It should be a trivial task, but I haven't used the functions in Google Sheets or Data Studio much, and I had a surprisingly difficult time figuring out a way to either loop through the content of a cell with an appropriate regex or do it some other way. I could do this easily in Python and then use any other visualization software, but the point of going this way with the Drive tools is that it saves a lot of time (if it works...). I can implement it either in Google Sheets or in Data Studio as a new calculated column on the import, whichever makes it possible.
If you are looking to write a custom function, something like this may do the trick (though it needs work for better error-handling)
function workoutProgress(string) {
  if (string == '' || string == null || string == undefined) { return 'error'; }
  var stringArray = string.split(",");
  var sum = 0;
  var digitsArray, digitsProduct;
  if (stringArray.length > 0) {
    for (var element in stringArray) {
      digitsArray = stringArray[element].match(/\d{1,}/g);
      digitsProduct = digitsArray.reduce(function(product, digit) { return product * digit; });
      sum += digitsProduct;
    }
  }
  return sum;
}
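For example, once this is saved in the sheet's Apps Script editor, calling =workoutProgress(A2) on a cell containing the sample value above should return 80*3*5 + 100*1*3 = 1500.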
It can be achieved using the RegEx Calculated Field below, where Field represents the respective field name; each line of the formula handles a single workload (for example 80kg-3x5), thus the formula below accounts for 5 workloads (more can be added; for example, a 6th could be added by copy-pasting the 5th line and incrementing the number in curly brackets by one, that is, changing {4} to {5}):
(CAST(REGEXP_EXTRACT(Field,"^(\\d+)kg")AS NUMBER) * CAST(REGEXP_EXTRACT(Field,"^\\d+kg-(\\d+)")AS NUMBER) * CAST(REGEXP_EXTRACT(Field,"^\\d+kg-\\d+x(\\d+)")AS NUMBER)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){1}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){2}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){3}\\d+kg-\\d+x(\\d+)")AS NUMBER),0)) +
(NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}(\\d+)kg")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}\\d+kg-(\\d+)")AS NUMBER),0) * NARY_MAX(CAST(REGEXP_EXTRACT(Field,"^(?:\\d+kg-\\d+x\\d+,\\s){4}\\d+kg-\\d+x(\\d+)")AS NUMBER),0))
An editable Google Data Studio report, an embedded data source, an editable data set (Google Sheets) and a GIF are available to elaborate; feel free to change the name of the field (at the data source) to adapt the calculated field.

CAST numeric to varchar is giving scientific notation

I have three fields that I am trying to concatenate into one large field. Two of the fields are varchar, but one is a float. In certain situations, the concatenated field shows scientific notation. The concatenated field should be a varchar and show the combination of the three fields regardless of how they are formatted. I am even seeing scientific notation when I just concatenate the two varchar fields, when their values are all digits. Why is this occurring and how can I fix it? Here are some of the ways I have tried to do the concatenation:
Field1 = e.DocumentNo + e.Assignment + CAST(CAST([Amount in LC] as int) as nvarchar(50))
Field2 = CAST(e.DocumentNo + e.Assignment as varchar(255))
I have also tried using CONVERT and it does not produce the expected result. DocumentNo is a varchar(255) and Assignment is a varchar(255), yet when they hold the values 5115146916 and 1610000 respectively, Field2 comes out as 5.11515E+16.
I also tried to use CONCAT() with the fields and it produces the same undesired result.
Here you go:
IF OBJECT_ID('TEMPDB..#ConcatData','U') IS NOT NULL
    DROP TABLE #ConcatData;

CREATE TABLE #ConcatData(
    [Amount in LC] [float] NULL,
    [Assignment]   [varchar](255) NULL,
    [DocumentNo]   [varchar](255) NULL)

INSERT INTO #ConcatData
VALUES
    (-27.08, '20120295', '4820110172'),
    (10625451.5124, '20140701', '4810122475'),
    (205.5, 'TPE035948900001', '8200022827'),
    (10000000, 'TPE035948900001', '8200022827')

SELECT DocumentNo +
       Assignment +
       CASE WHEN RIGHT(STR([Amount in LC],50,4),4) = '0000'
            THEN LTRIM(LEFT(STR([Amount in LC],50,4),LEN(STR([Amount in LC],50,4))-5))
            WHEN RIGHT(STR([Amount in LC],50,4),3) = '000'
            THEN LTRIM(LEFT(STR([Amount in LC],50,4),LEN(STR([Amount in LC],50,4))-3))
            WHEN RIGHT(STR([Amount in LC],50,4),2) = '00'
            THEN LTRIM(LEFT(STR([Amount in LC],50,4),LEN(STR([Amount in LC],50,4))-2))
            WHEN RIGHT(STR([Amount in LC],50,4),1) = '0'
            THEN LTRIM(LEFT(STR([Amount in LC],50,4),LEN(STR([Amount in LC],50,4))-1))
            ELSE LTRIM(STR([Amount in LC],50,4))
       END
FROM #ConcatData
Moral of the story here: float isn't the right datatype for your column. I actually don't know when float is the right datatype...
Anyway, the obnoxious CASE expression is needed to remove the excess decimal-place zeroes produced by STR(). You might even need more branches, but this covers you up to 4 decimal places and I think you'll get the idea.
One note: the first THEN removes 5 characters instead of 4. This is to drop the '.' as well.
Output:
482011017220120295-27.08
48101224752014070110625451.5124
8200022827TPE035948900001205.5
8200022827TPE03594890000110000000
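The idea is language-independent: render the float as fixed-point text (never scientific notation), then strip trailing zeroes and a dangling decimal point. A minimal Python sketch of the same trimming, just for comparison with the T-SQL above:
def amount_to_text(amount):
    # Fixed-point formatting with 4 decimals, then trim trailing zeroes
    # and a dangling decimal point.
    return f"{amount:.4f}".rstrip("0").rstrip(".")

for amount in (-27.08, 10625451.5124, 205.5, 10000000):
    print(amount_to_text(amount))
# -27.08
# 10625451.5124
# 205.5
# 10000000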

BIRT: Align rows in list element

I'm using the BIRT list element to display my data from left to right (see this question as reference), e.g. a list element with a grid in the details section and the grid set to inline.
The issue I'm facing now is that the different rows in the grid are not aligned left to right (probably because some rows have empty values in some fields). How can I force BIRT to align them properly?
EDIT:
This is especially a problem with longer text that wraps onto more than one line. The wrapping / multiple lines should be reflected by all list elements in that "row" of the output.
Unfortunately, I don't see any way to accomplish this easily in the generic case, that is, if the number of records is unknown in advance, so that you need more than one line:
student1 student2 student3
student4 student5
Let's call those lines "main lines". One main line can contain up to 3 records. The number 3 may be different in your case, but we can assume it is a constant, since (at least for PDF reports) the paper width is limited.
A possible solution could work like this:
In your data set, add two columns for each row: MAIN_LINE_NUM and COLUMN_NUM, where the meaning is obvious. For example, this could be done with pure SQL using analytic functions (untested):
select ...,
       trunc((row_number() over (order by ...whatever...) - 1) / 3) + 1 as MAIN_LINE_NUM,
       mod(row_number() over (order by ...whatever...) - 1, 3) + 1 as COLUMN_NUM
  from ...
 order by ...whatever... -- The order must be the same as above.
Now you know where each record should go.
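The numbering itself is just integer division and remainder. A tiny Python sketch of the same assignment, using hypothetical records, may make the two SQL expressions easier to read:
records = ["student1", "student2", "student3", "student4", "student5"]

for i, rec in enumerate(records):        # i corresponds to row_number() - 1
    main_line, column = divmod(i, 3)     # 3 records per main line
    print(rec, main_line + 1, column + 1)

# student1 1 1
# student2 1 2
# student3 1 3
# student4 2 1
# student5 2 2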
The next task is to transform the result set into a form where each record looks like this (for the example, assume you have 3 properties STUDENT_ID, NAME and ADDRESS for each student):
MAIN_LINE_NUM
STUDENT_ID_1
NAME_1
ADDRESS_1
STUDENT_ID_2
NAME_2
ADDRESS_2
STUDENT_ID_3
NAME_3
ADDRESS_3
You get the picture...
The SQL trick to achieve this (conditional aggregation) is one worth knowing.
I'll show it for the STUDENT_ID_1, STUDENT_ID_2 and NAME_1 columns as an example:
with my_data as
( ... the query shown above including MAIN_LINE_NUM and COLUMN_NUM ...
)
select MAIN_LINE_NUM,
max(case when COLUMN_NUM=1 then STUDENT_ID else null end) as STUDENT_ID_1,
max(case when COLUMN_NUM=2 then STUDENT_ID else null end) as STUDENT_ID_2,
...
max(case when COLUMN_NUM=1 then NAME else null end) as NAME_1,
...
from my_data
group by MAIN_LINE_NUM
order by MAIN_LINE_NUM
As you see, this is quite clumsy if you need a lot of different columns.
On the other hand, this makes the output a lot easier.
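Outside SQL, the reshaping amounts to chunking the ordered records into groups of 3 and padding the last group. A short Python sketch with hypothetical data, only to illustrate what the conditional aggregation produces:
# Hypothetical rows, already ordered: (STUDENT_ID, NAME, ADDRESS)
students = [("1", "Ann", "Street 1"), ("2", "Bob", "Street 2"),
            ("3", "Cem", "Street 3"), ("4", "Dee", "Street 4"),
            ("5", "Eva", "Street 5")]

# Chunk into main lines of 3 and pad the last chunk, like the max(case ...)
# columns that stay NULL when a main line has fewer than 3 records.
main_lines = [students[i:i + 3] for i in range(0, len(students), 3)]
for line in main_lines:
    padded = line + [(None, None, None)] * (3 - len(line))
    print(padded)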
Create a table item for your data set, with 3 columns (for 1, 2, 3). It's best not to drag the data set into the layout; instead, use the "Insert element" context menu.
You need a detail row for each of the properties (STUDENT_ID, NAME, ADDRESS). So, add two more detail rows (the default is one detail row).
Add header labels manually, if you like, or remove the header row if you don't need it (which is what I assume).
Remove the footer row, as you probably don't need it.
Drag the columns to the corresponding position in your table.
The table item should look like this now (data items):
+--------------+--------------+-------------+
+ STUDENT_ID_1 | STUDENT_ID_2 | STUDENT_ID3 |
+--------------+--------------+-------------+
+ NAME_1 | NAME_2 | NAME_3 |
+--------------+--------------+-------------+
+ ADDRESS_1 | ADDRESS_2 | ADDRESS_3 |
+--------------+--------------+-------------+
That's it!
This is one of the few examples where BIRT sucks IMHO in comparison to other tools like e.g. Oracle Reports - excuse my Klatchian.

CloudSearch query to boost exact match on range

In a CloudSearch structured query, I have a couple of fields I am searching on.
On field one, the user selects "2"; on field two, the user selects "1".
I want to run this as a range query, so that the results returned are within -1 to +1 of each selected value,
e.g. on field one the range would be 1 to 3 and on field two it would be 0 to 2.
What I want to do is sort the results so that the ones matching both field one and field two exactly are at the top, and the rest under them,
e.g. rows where field one = 2 and field two = 1 would be at the top, and the rest in no specific order.
Note: I do need to end up sorting the results by distance, so that all the exactly matching results are in distance order, followed by the rest, also ordered by distance.
I am sure I could do this with 2 queries; I'm just trying to make it work in one query, if at all possible, to lighten the load.
Say your fields are 'a' and 'b', and the specified values are a=2 and b=1 (as in your example, except I've named the fields 'a' and 'b' instead of 'one' and 'two'). Here are the various terms of your query.
Range Query
This is the query for the range a±1 and b±1 where a=2 and b=1:
q=(and (range field=a[1,3]) (range field=b[0,2]))
Rank Expression
For your rank expression, compute a distance-based score using absolute values so that the offsets for 'a' and 'b' can't cancel each other out (as a=3, b=0 otherwise would, for example):
expr.rank1=abs(a-2)+abs(b-1)
Sort by Rank
That defines a ranking expression named rank1, which we now want to sort by, starting with the lowest values (0 means an exact match, a=2 and b=1):
sort=rank1 asc
Return the Rank
For debugging purposes, you may want to return the ranking score:
return=rank1
Put all those terms together and you've got your query.
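To see what the sort does, here is a small Python simulation of rank1 over a few hypothetical documents (CloudSearch evaluates the expression server-side; this is only to illustrate the resulting order):
docs = [{"a": 2, "b": 1}, {"a": 3, "b": 0}, {"a": 1, "b": 1}, {"a": 3, "b": 2}]

def rank1(doc):
    # 0 means an exact match on both fields; larger values are further away.
    return abs(doc["a"] - 2) + abs(doc["b"] - 1)

for doc in sorted(docs, key=rank1):
    print(doc, rank1(doc))

# {'a': 2, 'b': 1} 0   <- exact match on both fields sorts first
# {'a': 1, 'b': 1} 1
# {'a': 3, 'b': 0} 2   <- abs() keeps the two offsets from cancelling out
# {'a': 3, 'b': 2} 2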
Further Potentially-Useful Things
If you want to get fancy and penalize things in a non-linear way, you can use exp. For example, if you want to differentiate between 'a' and 'b' both being off by 1 vs 'a' being an exact match and 'b' being off by 2 (eg a=3,b=2 will rank ahead of a=2,b=3 even though the previous ranker would give them both a score of 2):
expr.rank1=exp(abs(a-2))+exp(abs(b-1))
And you can use boolean logic and the ternary operator to detect and prefer certain results that meet certain criteria, e.g. to give a big boost when both 'a' and 'b' are on target, a smaller boost when either 'a' or 'b' is on target, etc. (since we're sorting low-to-high, a boost in rank is actually achieved by adding less to the result):
((a==2&&b==1)?0:100)+((a==2||b==1)?0:1000)+abs(a-2)+abs(b-1)
See http://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-expressions.html

Create prioritization log in Excel - Two lists

I am trying to create a prioritization list. I have 6 distinct values that the user inputs into a worksheet (by way of a VBA GUI). Excel calculates a prioritization number from these values. I need to list the projects (through a function or functions) in two tables. The problem comes into play when there are duplicate values (e.g. ProjA = 23 and ProjB = 23).
I don't care which one is listed first, but everything I have tried has had secondary issues. There are two sheets in my workbook: the first is where the "raw" data is entered and the second is where I would like the two lists to be located. (I do not want to use pivot tables for these lists.)
Priority Number   Proj Name
57                Project Alpha c
57                DUI Button Project
56                asdf
57                asdfsdfg
56                asdfasdf
56                Project Alpha a
56                Project Alpha b
18                Project BAS
List A would include a value range of 1-20, and
List B would include a value range of 20 to infinity.
So, I want it to look like this:
Table 1 (High Priority)   Table 2 (Low Priority)
Project BAS               Project Alpha c
                          DUI Button Project
                          Etc.
Generally, these open-ended questions aren't well received on Stack Overflow. You should make an attempt to demonstrate what you've tried so far and exactly where you're getting stuck; otherwise people are doing your work for you rather than helping you solve specific errors.
However, because you're new here, I've made an exception.
You can begin solving your issue by looping through the priority list and copying the values into the appropriate lists. For starters, I assumed that priority values begin at cell A2 and project names begin at cell B2 (cells A1 and B1 being the headers). I also assumed we're using a worksheet called Sheet1.
Now I need to know the length of the priority/project name list. I determine this with an Integer called maxRows, calculated by .Cells(1, 1).End(xlDown).Row inside a With block for the worksheet. This gives the number of rows in the main table (including the header row, A1/B1).
I continue by setting the columns for each priority list (high/low). In my example, I set these to columns 3 and 4. Then I clear these columns to remove any values that already existed there.
Then I create some tracking variables that will help me determine how many items I've already added to each list (highPriorityCount and lowPriorityCount).
Finally, I loop through the original list and check whether the priority value falls into the high-priority range (below 20, your List A) or the low-priority range (the Else condition). The project names are placed into the appropriate column, using the tracking variables created above.
Note: anywhere a 2 is used as an offset, it is because I am accounting for the header cells (row 1).
Option Explicit

Sub CreatePriorityTables()
    With Worksheets("Sheet1")
        ' Determine the length of the main table
        Dim maxRows As Integer
        maxRows = .Cells(1, 1).End(xlDown).Row

        ' Set the location of the priority lists
        Dim highPriorityColumn As Integer
        Dim lowPriorityColumn As Integer
        highPriorityColumn = 3
        lowPriorityColumn = 4

        ' Empty the priority lists
        .Columns(highPriorityColumn).Clear
        .Columns(lowPriorityColumn).Clear

        ' Create headers for priority lists
        .Cells(1, highPriorityColumn).Value = "Table 1 (High Priority)"
        .Cells(1, lowPriorityColumn).Value = "Table 2 (Low Priority)"

        ' Create some useful counts to track
        Dim highPriorityCount As Integer
        Dim lowPriorityCount As Integer
        highPriorityCount = 0
        lowPriorityCount = 0

        ' Loop through all values and copy into priority lists
        Dim i As Integer
        For i = 2 To maxRows
            ' Determine column by priority value
            If (.Cells(i, 1) < 20) Then
                ' Priority numbers below 20 (your List A) go to Table 1 (High Priority)
                .Cells(highPriorityCount + 2, highPriorityColumn).Value = .Cells(i, 2)
                highPriorityCount = highPriorityCount + 1
            Else
                ' Priority numbers of 20 and above (your List B) go to Table 2 (Low Priority)
                .Cells(lowPriorityCount + 2, lowPriorityColumn).Value = .Cells(i, 2)
                lowPriorityCount = lowPriorityCount + 1
            End If
        Next i
    End With
End Sub
This should produce the expected behavior.