Remove duplicate values between arrays in BigQuery - google-cloud-platform

Let's say that I have the following arrays:
SELECT ['A', 'B', 'C', 'A', 'A', 'A'] AS origin_array
UNION ALL
SELECT ['A', 'A', 'B'] AS secondary_array
And I want to remove all duplicate values between the arrays (as opposed to within the arrays), so that the final result will be:
SELECT ['C', 'A', 'A'] AS result_array
Any idea how this can be done?

Below is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION DEDUP_ARRAYS(arr1 ANY TYPE, arr2 ANY TYPE) AS ((ARRAY(
SELECT item FROM (
SELECT item, ROW_NUMBER() OVER(PARTITION BY item) pos FROM UNNEST(arr1) item UNION ALL
SELECT item, ROW_NUMBER() OVER(PARTITION BY item) pos FROM UNNEST(arr2) item
)
GROUP BY item, pos
HAVING COUNT(1) = 1
)));
WITH `project.dataset.table` AS (
SELECT ['A', 'B', 'C', 'A', 'A', 'A'] AS origin_array, ['A', 'A', 'B'] AS secondary_array
)
SELECT DEDUP_ARRAYS(origin_array, secondary_array) AS result_array
FROM `project.dataset.table`
with result
Row result_array
1 A
A
C
which contains the same elements that SELECT ['C', 'A', 'A'] AS result_array would return
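The pairing logic of the function above can be sketched outside BigQuery as well. The snippet below is only an illustration in Python, not part of the original answer; the name `dedup_arrays` is hypothetical. It assumes the intent is that each occurrence of a value in one array cancels one occurrence of the same value in the other, which is what grouping by `(item, pos)` and keeping groups of size 1 achieves.

```python
from collections import Counter

def dedup_arrays(arr1, arr2):
    """Cancel matching occurrences between two lists (hypothetical helper).

    Mirrors the GROUP BY item, pos / HAVING COUNT(1) = 1 idea: the n-th
    occurrence of a value in one list cancels the n-th occurrence in the other.
    """
    c1, c2 = Counter(arr1), Counter(arr2)
    # Counter subtraction keeps only positive counts, so adding both
    # directions yields abs(count1 - count2) copies of every value.
    leftover = (c1 - c2) + (c2 - c1)
    # Sorted only to make the output deterministic for this sketch.
    return sorted(leftover.elements())

origin_array = ['A', 'B', 'C', 'A', 'A', 'A']
secondary_array = ['A', 'A', 'B']
print(dedup_arrays(origin_array, secondary_array))
```

Like the SQL version, this is symmetric: values left over from either array end up in the result.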

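If you use UNION DISTINCT instead of UNION ALL, duplicate rows are dropped (BigQuery Standard SQL does not accept a bare UNION). Note that this removes duplicates rather than cancelling matched pairs between the two arrays.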

Filter and limit on a python dictionary

Given:
obj = {}
obj['a'] = ['x', 'y', 'z']
obj['b'] = ['x', 'y', 'z', 'u', 't']
obj['c'] = ['x']
obj['d'] = ['y', 'u']
How do you select (e.g. print) the top 2 entries in this dictionary, sorted by the length of each list?
"the top 2 entries in this dictionary, sorted by the length of each list"
print(sorted(obj.values(), key=len)[:2])
The output:
[['x'], ['y', 'u']]
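The one-liner above returns the two shortest lists and drops the keys. If "top 2" is meant as the two longest lists, or if the keys are needed, a variation along these lines may help; it is a sketch under that assumption, and the names `shortest_two` and `longest_two` are illustrative only.

```python
obj = {
    'a': ['x', 'y', 'z'],
    'b': ['x', 'y', 'z', 'u', 't'],
    'c': ['x'],
    'd': ['y', 'u'],
}

# Sort (key, list) pairs by list length so the keys are kept.
shortest_two = sorted(obj.items(), key=lambda kv: len(kv[1]))[:2]

# reverse=True puts the longest lists first.
longest_two = sorted(obj.items(), key=lambda kv: len(kv[1]), reverse=True)[:2]

print(shortest_two)
print(longest_two)
```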

How to merge dictionaries by key when the values are several different lists?

Someone asked what my input looks like:
The input is the output of a preceding function.
And when I do
print(H1_dict)
The following information is printed to the screen:
defaultdict(<class 'list'>, {2480: ['A', 'C', 'C'], 2651: ['T', 'A', 'G']})
which means the data type is defaultdict with (keys, values) as (class, list)
So something like this:
H1dict = {2480: ['A', 'C', 'C'], 2651: ['T', 'A', 'G'].....}
H2dict = {2480: ['C', 'T', 'T'], 2651: ['C', 'C', 'A'].....}
H1_p1_values = {2480: ['0.25', '0.1', '0.083'], 2651: ['0.43', '0.11', '0.23']....}
H1_p2_values = {2480: ['0.15', '0.15', '0.6'], 2651: ['0.26', '0.083', '0.23']....}
H2_p1_values = {2480: ['0.3', '0.19', '0.5'], 2651: ['0.43', '0.17', '0.083']....}
H2_p2_values = {2480: ['0.3', '0.3', '0.1'], 2651: ['0.39', '0.26', '0.21']....}
I want to merge these dictionaries as:
merged_dict (class, list) or (key, values)= {2480: h1['A', 'C', 'C'], h2 ['C', 'T', 'T'], h1_p1['0.25', '0.1', '0.083'], h1_p2['0.15', '0.15', '0.6'], h2_p1['0.3', '0.19', '0.5'], h2_p2['0.3', '0.3', '0.1'], 2651: h1['T', 'A', 'G'], h2['C', 'C', 'A']....}
So, I want to merge several dictionaries by key while maintaining the order in which the dictionaries are supplied.
For merging the dictionary I am able to do it partially using:
merged = [haplotype_A, haplotype_B, hapA_freq_My, hapB_freq_My....]
merged_dict = {}
for k in haplotype_A:
    merged_dict[k] = tuple(d[k] for d in merged)
But I want to add a second level of keys in front of each list, so I can access specific items in a large file when needed.
Downstream I want to access the values inside this merged dictionary using keys each time with for-loop. Something like:
for k, v in merged_dict.items():
    h1_p1_sum = sum(float(x) for x in v['h1_p1'])  # or v[index]
    h1_p1_prod = reduce(mul, (float(x) for x in v['h1_p1']))  # needs functools.reduce and operator.mul
    h1_string = "-".join(str(x) for x in v['h1'])
and the ability to print or write it to the file line by line
print (h1_string)
print (h1_p1_sum)
I have read several examples of defaultdict and other dicts, but I am not able to wrap my head around the process. I have been able to do simple operations, but something like this seems a little complicated. I would really appreciate any explanation that you can add to each step of the process.
Thank you in advance !
If I understand you correctly, you want this:
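from collections import defaultdict

merged = {'h1': haplotype_A, 'h2': haplotype_B, 'h3': hapA_freq_My, ...}
merged_dict = defaultdict(dict)
for var_name in merged:
    for k in merged[var_name]:
        # outer key: original dict key; inner key: label of the source dict
        merged_dict[k][var_name] = merged[var_name][k]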
This should give you an output of:
>>>merged_dict
{2480: {'h1': ['A', 'C', 'C'], 'h2': ['C', 'T', 'T'], ..}, 2651: {...}}
assuming, of course, that the variables hold the same data as in your example.
You can access them via nested for loops:
for k in merged_dict:
    for sub_key in merged_dict[k]:
        print(merged_dict[k][sub_key])  # print entire list
        for item in merged_dict[k][sub_key]:
            print(item)  # prints item in list
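To tie the merge and the downstream access together, here is a small end-to-end sketch built on a subset of the question's sample data. The labels `h1`, `h2` and `h1_p1` and the variable names are chosen for illustration, and the sketch assumes Python 3.7+, where plain dicts preserve insertion order, so the labels come back in the order the dictionaries were supplied.

```python
from collections import defaultdict
from functools import reduce
from operator import mul

# Subset of the sample data from the question.
h1 = {2480: ['A', 'C', 'C'], 2651: ['T', 'A', 'G']}
h2 = {2480: ['C', 'T', 'T'], 2651: ['C', 'C', 'A']}
h1_p1 = {2480: ['0.25', '0.1', '0.083'], 2651: ['0.43', '0.11', '0.23']}

# Label -> source dict; insertion order is the order of supply.
sources = {'h1': h1, 'h2': h2, 'h1_p1': h1_p1}

merged_dict = defaultdict(dict)
for label, source in sources.items():
    for key, values in source.items():
        merged_dict[key][label] = values

# Downstream access by label, one output line per key.
for key, row in merged_dict.items():
    h1_string = "-".join(row['h1'])
    h1_p1_sum = sum(float(x) for x in row['h1_p1'])
    h1_p1_prod = reduce(mul, (float(x) for x in row['h1_p1']))
    print(key, h1_string, h1_p1_sum, h1_p1_prod)
```

Note that iterating with `.items()` is what yields `(key, row)` pairs; iterating over the dictionary directly yields only the keys.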

Calculating Overlapping time-span count in SQL Server 2008 R2

I need to get a concurrency count in SQL Server 2008 R2. A provider supervises certain records on a given date, and I need to count the records whose times overlap. For example:
Sample data:
DECLARE @Concurrency Table
(
    [ConcurrencyTimeID] BigInt,
    [SupervisingProviderID] Int,
    [SupervisingProviderFirstName] Varchar(50),
    [SupervisingProviderLastName] Varchar(50),
    [DateOfService] DateTime,
    [PatientFirstName] Varchar(50),
    [PatientLastName] Varchar(50),
    [StartTime] Time,
    [StopTime] Time,
    [DStartTime] DateTime,
    [DStopTime] DateTime,
    [IsNextDay] Bit
)
INSERT INTO @Concurrency
(
    [ConcurrencyTimeID],
    [SupervisingProviderID],
    [SupervisingProviderFirstName],
    [SupervisingProviderLastName],
    [DateOfService],
    [PatientFirstName],
    [PatientLastName],
    [StartTime],
    [StopTime],
    [IsNextDay]
)
SELECT 25, 4, 'hardik', 'Patel', '05/30/2016', 'a', 'a', '8:00:00 PM', '11:00:00 PM', 0
UNION ALL SELECT 25, 4, 'hardik', 'Patel', '05/30/2016', 'b', 'b', '8:30:00 PM', '9:30:00 PM', 0
UNION ALL SELECT 25, 4, 'hardik', 'Patel', '05/30/2016', 'c', 'c', '9:00:00 PM', '11:30:00 PM', 0
UNION ALL SELECT 25, 4, 'hardik', 'Patel', '05/30/2016', 'd', 'd', '10:00:00 PM', '2:00:00 AM', 1
UNION ALL SELECT 25, 4, 'hardik', 'Patel', '05/31/2016', 'e', 'e', '1:00:00 AM', '3:00:00 AM', 0
UNION ALL SELECT 25, 4, 'hardik', 'Patel', '05/31/2016', 'f', 'f', '2:30:00 AM', '3:30:00 AM', 0
UPDATE @Concurrency
SET [DStartTime] = ( Convert(Varchar, [c].[DateOfService], 112) + Cast([c].[StartTime] AS DateTime) ),
    [DStopTime] = CASE WHEN [c].[IsNextDay] = 1 THEN ( Convert(Varchar, [c].[DateOfService], 112) + Cast('23:59' AS DateTime) )
                       ELSE ( Convert(Varchar, [c].[DateOfService], 112) + Cast([c].[StopTime] AS DateTime) )
                  END
FROM @Concurrency AS [c]
My query:
SELECT [c].[SupervisingProviderID],
       [ca].[Ccount]
FROM @Concurrency AS [c]
CROSS APPLY (
    SELECT [Ccount] = Count(*)
    FROM @Concurrency AS [c2]
    WHERE [c].[DateOfService] = [c2].[DateOfService]
      AND [c].[SupervisingProviderID] = [c2].[SupervisingProviderID]
      AND [c].[DStartTime] <= [c2].[DStopTime]
      AND [c2].[DStartTime] <= [c].[DStopTime]
) AS ca
Expected Result
At present I am getting Concurrency = 4 for Patient A. Can anyone help me see what I am doing wrong?
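To see where the 4 comes from, the overlap test in the query can be replayed on the 05/30/2016 rows. The sketch below is in Python rather than T-SQL and is only an illustration: the expected result is not shown above, so treating "concurrency" as the peak number of simultaneous records is an assumption, and the helper names `pairwise_count` and `peak_concurrency` are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# 05/30/2016 rows from the sample data. Row d stops at 23:59 because the
# UPDATE sets DStopTime to 23:59 when IsNextDay = 1.
rows = {
    "a": ("2016-05-30 20:00", "2016-05-30 23:00"),
    "b": ("2016-05-30 20:30", "2016-05-30 21:30"),
    "c": ("2016-05-30 21:00", "2016-05-30 23:30"),
    "d": ("2016-05-30 22:00", "2016-05-30 23:59"),
}
intervals = {
    name: (datetime.strptime(start, FMT), datetime.strptime(stop, FMT))
    for name, (start, stop) in rows.items()
}

def pairwise_count(name):
    """Same test as the CROSS APPLY: every row overlapping `name`, itself included."""
    start, stop = intervals[name]
    return sum(1 for s2, e2 in intervals.values() if start <= e2 and s2 <= stop)

def peak_concurrency(name):
    """Largest number of rows active at one instant within `name`'s window."""
    start, stop = intervals[name]
    # The peak is always reached at the start of some interval in the window.
    candidates = [start] + [s2 for s2, _ in intervals.values() if start <= s2 <= stop]
    return max(
        sum(1 for s2, e2 in intervals.values() if s2 <= t <= e2)
        for t in candidates
    )

print(pairwise_count("a"), peak_concurrency("a"))
```

Under these assumptions, the pairwise count for Patient A includes A itself plus b, c and d, even though b has already ended before d starts, so no more than three records are ever active at the same moment.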

Counter in Python

Is there a way to look up the values in a Counter by their number of occurrences?
Example:
Let's say I have a list:
list = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd']
When I do the Counter:
Counterlist = Counter(list)
I'll get:
Counter({'b': 3, 'a': 3, 'c': 3, 'd': 2})
Then I can select let's say a and get 3:
Counterlist['a'] = 3
But how can I select by the number of occurrences, '3'?
Something like:
Counterlist[3] = ['b', 'a', 'c']
Is that possible?
You can write the following
import collections
my_data = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd']
result = collections.defaultdict(list)
for k, v in collections.Counter(my_data).items():
    result[v].append(k)
and then access result[3] to obtain the characters with that count.
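A self-contained run of the same idea, with slightly more descriptive names (`by_count` is just an illustrative name), might look like the following sketch. It uses `.get()` for lookups so that asking for a count that never occurs does not add an empty list to the defaultdict.

```python
from collections import Counter, defaultdict

my_data = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd']

# Invert the Counter: occurrence count -> list of values with that count.
by_count = defaultdict(list)
for value, count in Counter(my_data).items():
    by_count[count].append(value)

print(sorted(by_count[3]))   # values that occur three times
print(by_count.get(5, []))   # no value occurs five times
```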

How to generate each possible combination of members from two lists (in Python)

I am a Python newbie and I've been trying to find the way to generate each possible combination of members from two lists:
left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']
The resulting list should be something like:
af ag ah ai aj bf bg bh bi bj cf cg ch ci cj etc...
I made several experiments with loops but I can't get it right:
I tried the zip function, but it wasn't useful since it only pairs members one to one:
for x in zip(left, right):
    print(x)
and looping one list for the other just returns the members of one list repeated as many times as the number of members of the second list :(
Any help will be appreciated. Thanks in advance.
You can use, for example, a list comprehension:
left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']
result = [lc + rc for lc in left for rc in right]
print(result)
The result will look like:
['af', 'ag', 'ah', 'ai', 'aj', 'bf', 'bg', 'bh', 'bi', 'bj', 'cf', 'cg', 'ch', 'ci', 'cj', 'df', 'dg', 'dh', 'di', 'dj', 'ef', 'eg', 'eh', 'ei', 'ej']
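As an alternative to the nested comprehension, the standard library's `itertools.product` yields the same pairs in the same order. This is a sketch of that swapped-in technique, not part of the original answer.

```python
from itertools import product

left = ['a', 'b', 'c', 'd', 'e']
right = ['f', 'g', 'h', 'i', 'j']

# product(left, right) yields every (left_item, right_item) pair,
# varying the right-hand item fastest.
result = [lc + rc for lc, rc in product(left, right)]
print(result)
```

`product` also accepts more than two iterables, which avoids adding another `for` clause for each extra list.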