Convert SPARQL query to SQL query in PySpark DataFrame - python-2.7

I have a dataset that consists of three columns: subject, predicate, and object
subject predicate object
c1 B V3
c1 A V3
c1 T V2
c2 A V2
c2 A V3
c2 T V1
c2 B V3
c3 B V3
c3 A V3
c3 T V1
c4 A V3
c4 T V1
c5 B V3
c5 T V2
c6 B V3
c6 T V1
I want to apply association rule mining to this data using SQL queries.
I took this idea from the paper "Association Rule Mining on Semantic Data by Using SPARQL" (SAG algorithm).
First, the user has to specify T (the target predicate) and a minimum support, then query whether this T is frequent or not:
SELECT ?pt ?ot (COUNT(*) AS ?Yent)
WHERE { ?s ?pt ?ot .
        FILTER (regex(str(?pt), 'T', 'i')) . }
GROUP BY ?pt ?ot
HAVING (?Yent >= 2)
I tried the following code and got the same result:
q=mtcars1.select('s','p','o').where(mtcars1['p']=='T')
q1=q.groupBy('p','o').count()
q1.filter(q1['count']>=2).show()
result
+---+---+-----+
| p| o|count|
+---+---+-----+
| T| V2| 2|
| T| V1| 4|
+---+---+-----+
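For reference, roughly the same check can also be written as a plain SQL query with spark.sql. This is only a sketch, assuming a SparkSession named spark is available and the DataFrame is registered as a temporary view (the view name triples is just for illustration):
# Sketch: register the triples DataFrame as a temp view and run the frequency check in SQL
mtcars1.createOrReplaceTempView("triples")

spark.sql("""
    SELECT p, o, COUNT(*) AS cnt
    FROM triples
    WHERE p = 'T'
    GROUP BY p, o
    HAVING COUNT(*) >= 2
""").show()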
The second query calculates whether the other predicates and objects are frequent:
q2=mtcars1.select('s','p','o').where(mtcars1['p']!='T')
q3=q2.groupBy('p','o').count()
q3.filter(q3['count']>=2).show()
result
+---+---+-----+
| p| o|count|
+---+---+-----+
| A| V3| 4|
| B| V3| 5|
+---+---+-----+
In order to find rules between the two queries above, we scan the dataset again and check whether the pairs occur together at least the minimum support number of times:
SELECT ?pe ?oe ?pt ?ot (COUNT(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
        FILTER (regex(str(?pt), 'T', 'i')) .
        ?s ?pe ?oe .
        FILTER (!regex(str(?pe), 'T', 'i')) . }
GROUP BY ?pe ?oe ?pt ?ot
HAVING (?supCNT >= 2)
ORDER BY ?pt ?ot
I tried to store the subjects in a list and then join between the items, but this took a long time, and it would take far longer if the data were very large.
from pyspark.sql.functions import collect_list

w = mtcars1.select('s', 'p', 'o').where(mtcars1['p'] == 'T')
w1 = w.groupBy('p', 'o').agg(collect_list('s'))
w1.show()
result
+---+---+----------------+
| p| o| collect_list(s)|
+---+---+----------------+
| T| V2| [c1, c5]|
| T| V1|[c2, c3, c4, c6]|
+---+---+----------------+
w2 = mtcars1.select('s', 'p', 'o').where(mtcars1['p'] != 'T')
w3 = w2.groupBy('p', 'o').agg(collect_list('s'))
w3.show()
result
+---+---+--------------------+
| p| o| collect_list(s)|
+---+---+--------------------+
| A| V3| [c1, c2, c3, c4]|
| B| V3|[c1, c2, c3, c5, c6]|
| A| V2| [c2]|
+---+---+--------------------+
Join code:
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType, StringType

# UDFs used below (their definitions were not shown in the question; these are assumed):
# the intersection of two subject lists and the length of that intersection
intersection_udf = f.udf(lambda a, b: list(set(a) & set(b)), ArrayType(StringType()))
intersection_length_udf = f.udf(lambda a, b: len(set(a) & set(b)), IntegerType())

w44 = w1.alias("l") \
    .crossJoin(w3.alias("r")) \
    .select(
        f.col('l.p').alias('lp'),
        f.col('l.o').alias('lo'),
        f.col('r.p').alias('rp'),
        f.col('r.o').alias('ro'),
        intersection_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('TID'),
        intersection_length_udf(f.col('l.collect_list(s)'), f.col('r.collect_list(s)')).alias('len')
    ) \
    .where(f.col('len') > 1) \
    .select(
        f.struct(f.struct('lp', 'lo'), f.struct('rp', 'ro')).alias('2-Itemset'),
        'TID'
    )
w44.show()
result
+---------------+------------+
| 2-Itemset| TID|
+---------------+------------+
|[[T,V2],[B,V3]]| [c1, c5]|
|[[T,V1],[A,V3]]|[c3, c2, c4]|
|[[T,V1],[B,V3]]|[c3, c2, c6]|
+---------------+------------+
So I have to rescan the dataset to find association rules between the items, and rescan again to find further rules.
The following query is used to construct the 3-itemsets:
SELECT ?pe1 ?oe1 ?pe2 ?oe2 ?pt ?ot (COUNT(*) AS ?supCNT)
WHERE { ?s ?pt ?ot .
        FILTER (regex(str(?pt), 'T', 'i')) .
        ?s ?pe1 ?oe1 .
        FILTER (!regex(str(?pe1), 'T', 'i')) .
        ?s ?pe2 ?oe2 .
        FILTER (!regex(str(?pe2), 'T', 'i') && !regex(str(?pe2), str(?pe1), 'i')) . }
GROUP BY ?pe1 ?oe1 ?pe2 ?oe2 ?pt ?ot
HAVING (?supCNT >= 2)
ORDER BY ?pt ?ot
The result for this query should be
{[(A, V3), (B, V3), (T, V1)], 2}
and we repeat such queries until no further rules between items are found.
Can anyone help me with how to build association rules using SQL queries, where the subject is used as the ID and predicate + object = item?
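One way to avoid the collect_list/UDF join is to compute the 2-itemset supports with a plain join on the subject followed by a group-by. The sketch below assumes the same mtcars1 DataFrame with columns s, p, o; the names t, e, and min_sup are mine:
import pyspark.sql.functions as f

min_sup = 2

# target triples (predicate T) and all other triples
t = mtcars1.where(f.col('p') == 'T').select('s', f.col('p').alias('pt'), f.col('o').alias('ot'))
e = mtcars1.where(f.col('p') != 'T').select('s', f.col('p').alias('pe'), f.col('o').alias('oe'))

# join on the subject (the ID), then count how often each (pe, oe, pt, ot) pair co-occurs
pairs = (t.join(e, 's')
          .groupBy('pe', 'oe', 'pt', 'ot')
          .agg(f.count('*').alias('supCNT'), f.collect_list('s').alias('TID'))
          .where(f.col('supCNT') >= min_sup)
          .orderBy('pt', 'ot'))
pairs.show(truncate=False)
Larger itemsets could be built the same way, by joining e once more on s (with a condition such as pe2 > pe to avoid duplicate combinations) before grouping.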

Related

How can I compare pairs of columns in a PySpark dataframe and get the number of records changed?

I have a situation where I need to compare multiple pairs of columns (the number of pairs will vary and can come from a list, as shown in the code snippet below) and get a 1/0 flag for match/mismatch respectively. Eventually I use this to identify the number of records/rows with a mismatch and the percentage of records mismatched.
NONKEYCOLS= ['Marks', 'Qualification']
The first image is the source df and the second image is the expected df.
Since this happens for multiple pairs in a loop, it is very slow for about a billion records; I need something more efficient.
I have the code below, but the part that calculates changed records is taking a long time.
for ind, cols in enumerate(NONKEYCOLS):
    print(ind)
    print(cols)
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed" + str(ind),
                    F.sum(col("records_ch_flag_" + str(ind)))
                     .over(w1))
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed_cnt" + str(ind),
                    F.count(col("records_ch_flag_" + str(ind)))
                     .over(w1))
I'm not sure what loop you are running, but here's an implementation with a list comprehension inside a select.
from pyspark.sql import functions as func

data_ls = [
    (10, 11, 'foo', 'foo'),
    (12, 12, 'bar', 'bar'),
    (10, 12, 'foo', 'bar')
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['marks_1', 'marks_2', 'qualification_1', 'qualification_2'])

col_pairs = ['marks', 'qualification']

data_sdf. \
    select('*',
           *[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
           ). \
    show()
# +-------+-------+---------------+---------------+-----------+-------------------+
# |marks_1|marks_2|qualification_1|qualification_2|marks_check|qualification_check|
# +-------+-------+---------------+---------------+-----------+-------------------+
# | 10| 11| foo| foo| 0| 1|
# | 12| 12| bar| bar| 1| 1|
# | 10| 12| foo| bar| 0| 0|
# +-------+-------+---------------+---------------+-----------+-------------------+
where the list comprehension would yield the following
[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
# [Column<'CAST((marks_1 = marks_2) AS INT) AS `marks_check`'>,
# Column<'CAST((qualification_1 = qualification_2) AS INT) AS `qualification_check`'>]
EDIT
Based on the additional (updated) info, you need the count of unmatched records for each pair and then want to calculate the unmatched percentage.
Reversing the aforementioned logic to count the unmatched records:
col_pairs = ['marks', 'qualification']

data_sdf. \
    agg(*[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs],
        func.count('*').alias('row_cnt')
        ). \
    select('*',
           *[(func.col(c+'_unmatch') / func.col('row_cnt')).alias(c+'_unmatch_perc') for c in col_pairs]
           ). \
    show()
# +-------------+---------------------+-------+------------------+--------------------------+
# |marks_unmatch|qualification_unmatch|row_cnt|marks_unmatch_perc|qualification_unmatch_perc|
# +-------------+---------------------+-------+------------------+--------------------------+
# | 2| 1| 3|0.6666666666666666| 0.3333333333333333|
# +-------------+---------------------+-------+------------------+--------------------------+
The code flags (as 1) the records where the pair does not match and takes a sum of the flag, which gives us the pair's unmatched record count. Dividing that by the total row count gives the percentage.
The list comprehension will yield the following:
[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs]
# [Column<'sum(CAST((NOT (marks_1 = marks_2)) AS INT)) AS `marks_unmatch`'>,
# Column<'sum(CAST((NOT (qualification_1 = qualification_2)) AS INT)) AS `qualification_unmatch`'>]
This is very efficient, as all of it happens in a single select statement, which projects only once in the Spark plan, as opposed to your approach, which projects every time you call withColumn - and that is inefficient in Spark.
df.colRegex may serve you well. If all the values in columns which match the regex are equal, you get 1. The script is efficient, as everything is done in one select.
Inputs:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('p', 1, 2, 'g', 'm'),
('a', 3, 3, 'g', 'g'),
('b', 4, 5, 'g', 'g'),
('r', 8, 8, 'm', 'm'),
('d', 2, 1, 'u', 'g')],
['Name', 'Marks_1', 'Marks_2', 'Qualification_1', 'Qualification_2'])
col_pairs = ['Marks', 'Qualification']
Script:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) == 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
)
df.show()
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |Name|Marks_1|Marks_2|Qualification_1|Qualification_2|Marks_result|Qualification_result|
# +----+-------+-------+---------------+---------------+------------+--------------------+
# | p| 1| 2| g| m| 0| 0|
# | a| 3| 3| g| g| 1| 1|
# | b| 4| 5| g| g| 0| 1|
# | r| 8| 8| m| m| 1| 1|
# | d| 2| 1| u| g| 0| 0|
# +----+-------+-------+---------------+---------------+------------+--------------------+
Proof of efficiency:
df.explain()
# == Physical Plan ==
# *(1) Project [Name#636, Marks_1#637L, Marks_2#638L, Qualification_1#639, Qualification_2#640, cast((size(array_distinct(array(Marks_1#637L, Marks_2#638L)), true) = 1) as int) AS Marks_result#646, cast((size(array_distinct(array(Qualification_1#639, Qualification_2#640)), true) = 1) as int) AS Qualification_result#647]
# +- Scan ExistingRDD[Name#636,Marks_1#637L,Marks_2#638L,Qualification_1#639,Qualification_2#640]
Edit:
def equals(*cols):
    return (F.size(F.array_distinct(F.array(*cols))) != 1).cast('int')

df = df.select(
    '*',
    *[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
).agg(
    *[F.sum(f'{c}_result').alias(f'rec_changed_{c}') for c in col_pairs],
    *[(F.sum(f'{c}_result') / F.count(f'{c}_result')).alias(f'{c}_%_rec_changed') for c in col_pairs]
)
df.show()
# +-----------------+-------------------------+-------------------+---------------------------+
# |rec_changed_Marks|rec_changed_Qualification|Marks_%_rec_changed|Qualification_%_rec_changed|
# +-----------------+-------------------------+-------------------+---------------------------+
# | 3| 2| 0.6| 0.4|
# +-----------------+-------------------------+-------------------+---------------------------+

Group by two fields and get the values of the third

I have a Django model with three CharFields. I want to run a query that returns the existing values for two of them and, for each combination, the existing values of the third field.
a = models.CharField(null=False, max_length=8000)
b = models.CharField(null=False, max_length=8000)
c = models.CharField(null=False, max_length=8000)
If we assume that these values are in the database:
a | b | c |
---------------
a1 | b2 | c3 |
a1 | b2 | c1 |
a2 | b2 | c3 |
a1 | b3 | c3 |
a1 | b2 | c2 |
I want a result in this form:
{"a1-b2" : [c3, c1, c2], "a2-b2" : [c3], "a1-b3" : [c3]}
or
{"a1" : {"b2":[c3, c1, c2], "b3": [c3]}, "a2": {"b2" : [c3]}}
TLDR:
from django.db.models import Value
from django.db.models.functions import Concat

items = MyModel.objects.annotate(custom_field=Concat('a', Value('-'), 'b')).values('custom_field', 'c')
Explanation
With the .annotate(custom_field=Concat('a', Value('-'), 'b')) part, you are essentially building a group-by key in SQL: it creates a temporary new column named custom_field in your queryset which holds the value a-b.
This gives you the following structure:
a | b | c | custom_field
a1 b1 c1 a1-b1
a2 b2 c2 a2-b2
a1 b1 c3 a1-b1
The .values('custom_field', 'c') portion fetches only the custom_field and c columns from this queryset. Now all you have to do is serialize your data.
EDIT
If you want your data in that specific format, you can concatenate column c. Please read the accepted answer in the SO post "Django making a list of a field grouping by another field in model". You can then create a new field during serialization that will split() the concatenated c field into a list.
I couldn't think of a good pure SQL solution, but here's a Pythonic one using groupby:
from itertools import groupby
# Order by key fields so it will be easier to group later
items = YOUR_MODEL.objects.order_by('a', 'b')
# Group items by 'a' and 'b' fields as key
groups = groupby(items, lambda item: (item.a, item.b))
# Create dictionary with values as 'c' field from each item
res = {
'-'.join(key): list(map(lambda item: item.c, group))
for key, group in groups
}
# {'a1-b2': ['c3', 'c1', 'c2'], 'a1-b3': ['c3'], 'a2-b2': ['c3']}
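If the nested form from the question ({"a1": {"b2": [...]}}) is wanted instead, a small variation of the same groupby approach can build it; a sketch along the same lines:
from collections import defaultdict
from itertools import groupby

# Group by 'a', then by 'b', collecting the 'c' values (same ordering as above)
items = YOUR_MODEL.objects.order_by('a', 'b')
nested = defaultdict(dict)
for (a, b), group in groupby(items, lambda item: (item.a, item.b)):
    nested[a][b] = [item.c for item in group]
# dict(nested) -> {'a1': {'b2': ['c3', 'c1', 'c2'], 'b3': ['c3']}, 'a2': {'b2': ['c3']}}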

Power BI: How to use OR between 2 different filters?

I have some table like this:
+------+------+------+
| Lvl1 | Lvl2 | Lvl3 |
+------+------+------+
| A1 | B1 | C1 |
| A1 | B1 | C2 |
| A1 | B2 | C3 |
| A2 | B3 | C4 |
| A2 | B3 | C5 |
| A2 | B4 | C6 |
| A3 | B5 | C7 |
+------+------+------+
It is something like a hierarchy.
When the user selects A1, he actually selects the first 3 rows; B1 selects the first 2 rows; and C1 selects only the first row.
That is, A is the highest level and C is the lowest. Note that IDs from different levels are unique, since they have a special prefix: A, B, C.
The problem is that when filtering on more than one level, I may get an empty result set.
E.g. filtering on Lvl1=A1 & Lvl2=B3 (no intersection) will return nothing. What I need is to get the first 5 rows (Lvl1=A1 or Lvl2=B3).
const lvl1Filter: IBasicFilter = {
    $schema: "http://powerbi.com/product/schema#basic",
    target: {
        table: "Hierarchy",
        column: "Lvl1"
    },
    operator: "In",
    values: ['A1'],
    filterType: FilterType.BasicFilter
}

const lvl2Filter: IBasicFilter = {
    $schema: "http://powerbi.com/product/schema#basic",
    target: {
        table: "Hierarchy",
        column: "Lvl2"
    },
    operator: "In",
    values: ['B3'],
    filterType: FilterType.BasicFilter
}
report.setFilters([lvl1Filter, lvl2Filter]);
The problem is that the filters are independent of each other and both will be applied, i.e. with an AND operation between them.
So, is there a way to send the filters with an OR operation between them, or is there a way to simulate it?
PS: I tried to put all the data in a single column (like the following table), and it worked, but the data was very large (millions of records) and so it was very, very slow, so I need something more efficient.
All data in single column:
+--------------+
| AllHierarchy |
+--------------+
| A1 |
| A2 |
| A3 |
| B1 |
| B2 |
| B3 |
| B4 |
| B5 |
| C1 |
| C2 |
| C3 |
| C4 |
| C5 |
| C6 |
| C7 |
+--------------+
Set Filter:
const allHierarchyFilter: IBasicFilter = {
    $schema: "http://powerbi.com/product/schema#basic",
    target: {
        table: "Hierarchy",
        column: "AllHierarchy"
    },
    operator: "In",
    values: ['A1', 'B3'],
    filterType: FilterType.BasicFilter
}
report.setFilters([allHierarchyFilter]);
It isn't directly possible to make an "or" filter between multiple columns in Power BI, so you were right to try to combine all the values in a single column. But instead of appending all possible values by unioning them into one column, which gives you a long list, you can also try to combine their values "per row". For example, concatenate all values in the current row, maybe with some unique separator (this depends on your actual values, which are not shown). If all columns will always have values, keep it simple - create a new DAX column (not a measure!):
All Levels = 'Table'[Lvl1] & "-" & 'Table'[Lvl2] & "-" & 'Table'[Lvl3]
If it is possible for some of the levels to be blank, you can handle that if you want:
All Levels = 'Table'[Lvl1] &
IF('Table'[Lvl2] = BLANK(); ""; "-" & 'Table'[Lvl2]) &
IF('Table'[Lvl3] = BLANK(); ""; "-" & 'Table'[Lvl3])
Note that depending on your regional settings, you may have to replace semicolons in the above code with commas.
This will give you a new column which contains all the values from the current row, e.g. A1-B2-C3. Now you can make a filter All Levels contains A1 or All Levels contains B3, which is a filter on a single column, where we can easily use or.
When embedding, your JavaScript code should create an advanced filter, like this:
const allLevelsFilter: IAdvancedFilter = {
    $schema: "http://powerbi.com/product/schema#advanced",
    target: {
        table: "Hierarchy",
        column: "All Levels"
    },
    logicalOperator: "Or",
    conditions: [
        {
            operator: "Contains",
            value: "A1"
        },
        {
            operator: "Contains",
            value: "B3"
        }
    ],
    filterType: FilterType.AdvancedFilter
}
report.setFilters([allLevelsFilter]);
If you need exact match (e.g. the above code will also return rows with A11 or B35), then add the separator at the start and the end of the column too (i.e. to get -A1-B2-C3-) and in your JavaScript code append it before and after your search string (i.e. search for -A1- and -B3-).
Hope this helps!

Google BigQuery - Execute dynamically generated queries from a select statement

I have a huge table in Google BigQuery with the following structure (> 100 million rows):
name | departments
abc | 1,2,5,6
xyz | 4,5
pqr | 3,4,6
I want to convert the data into the following format:
name | 1 | 2 | 3 | 4 | 5 | 6
abc | 1 | 1 | | | 1 | 1
xyz | | | | 1 | 1 |
pqr | | | 1 | 1 | | 1
As of now, I am able to generate the queries required to prepare the dataset in this format by using the CONCAT and REGEXP_REPLACE functions:
SELECT ' insert into dataset.output ( name, ' +
CONCAT(
'_' , replace(departments,',',',_') )
+ ' ) values( \'' + name +'\','+ REGEXP_REPLACE(departments, "([^,\n]+)", "1") +')'
FROM (
select name, departments from dataset.input )
This generates the output with the 100 M insert queries, which can be used to create the data in the required structure.
However, now my questions are:
Can we execute the output of this query (100 M insert queries) directly using BigQuery SQL, or would we need to fire each insert one by one?
I believe there is no way of pivoting or transposing the data in a column with multiple comma-separated values. Is that right?
Is there a more optimal way of achieving this using BigQuery SQL and not writing custom Java code?
Thanks.
Below is an example for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
with the result
Row name d1 d2 d3 d4 d5 d6
1 abc 1 1 0 0 1 1
2 xyz 0 0 0 1 1 0
3 pqr 0 0 1 1 0 1
So you need to run the above with the destination set to whatever new table you prepared.
Note, the above assumes you have just 6 departments and, most importantly, that there is no ambiguity in the numbers, e.g. 1 does not conflict with 10.
If you do have such a case, you need to transform lines like
IF(departments LIKE '%2%', 1, 0) AS d2,
into
IF(CONCAT(',', departments, ',') LIKE '%,2,%', 1, 0) AS d2 ...
And of course, you can use just one simple INSERT statement
INSERT `project.dataset.new_table` (name, d1, d2, d3, d4, d5, d6)
SELECT
name,
IF(departments LIKE '%1%', 1, 0) AS d1,
IF(departments LIKE '%2%', 1, 0) AS d2,
IF(departments LIKE '%3%', 1, 0) AS d3,
IF(departments LIKE '%4%', 1, 0) AS d4,
IF(departments LIKE '%5%', 1, 0) AS d5,
IF(departments LIKE '%6%', 1, 0) AS d6
FROM `project.dataset.table`
So the final point of all this is:
instead of generating an INSERT statement for each and every row of the original table, you should generate a simple SELECT statement that does the "pivoting".
Update for "extreme" minimization of the generated code
See an example:
#standardSQL
CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (
IF(departments LIKE CONCAT('%',CAST(department AS STRING),'%'), 1, 0)
);
WITH `project.dataset.table` AS (
SELECT 'abc' name, '1,2,5,6' departments UNION ALL
SELECT 'xyz', '4,5' UNION ALL
SELECT 'pqr', '3,4,6'
), temp AS (
SELECT name, departments AS d
FROM `project.dataset.table`
)
SELECT
name,
c(d,1)d1,
c(d,2)d2,
c(d,3)d3,
c(d,4)d4,
c(d,5)d5,
c(d,6)d6
FROM temp
As you can see, each of your 10000 lines will now look like c(d,N)dN, with the longest being c(d,10000)d10000, so you have a chance of fitting within the query size limit.
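If the SELECT itself still has to be generated programmatically for many departments, here is a rough Python sketch; the helper name build_pivot_query is mine, and it uses the ambiguity-safe comma matching mentioned above:
# Sketch only: generates the pivoting SELECT text for departments 1..n,
# using the temp function c() with comma-delimited matching to avoid 1/10 ambiguity.
def build_pivot_query(n_departments, table="project.dataset.table"):
    cols = ",\n  ".join("c(d,{i})d{i}".format(i=i) for i in range(1, n_departments + 1))
    return (
        "#standardSQL\n"
        "CREATE TEMP FUNCTION c(departments STRING, department INT64) AS (\n"
        "  IF(CONCAT(',', departments, ',') LIKE CONCAT('%,', CAST(department AS STRING), ',%'), 1, 0)\n"
        ");\n"
        "SELECT\n  name,\n  " + cols + "\n"
        "FROM (SELECT name, departments AS d FROM `" + table + "`)"
    )

print(build_pivot_query(6))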

How to change order of string based on dates

I received data with a string variable that looks something like:
var_name
25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30
01-APR-11: A25, B82, C65
04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82
12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54
27-OCT-15: A22, B95, C08
And so on. My goal is to split these strings up into different variable names. The variable names would be v1_date, v1_A, v1_B, v1_C, v2_date, v2_A, v2_B, v2_C, v3_date, v3_A, v3_B, v3_C.
I can use split var_name, p(";"), rename to v1, v2, and v3, and then split again to do this. But the problem is that I want v1, v2, and v3 to be in chronological order based on the date, and the data is not currently arranged that way. How can I make it so that the date of v1 comes before v2, and the date of v2 comes before the date of v3? For example, in the first observation I want 25-DEC-99: A11, B14, C89 to be associated with v2 and 28-FEB-94: A27, B94, C30 to be associated with v1.
The following gets you close, I believe. It uses both split and reshape.
clear
set more off
input ///
str100 myvar
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split myvar, p(;)
drop myvar
gen obs = _n
reshape long myvar, i(obs)
drop if missing(myvar)
split myvar, p(:)
drop myvar
gen myvar11 = date(myvar1, "DMY", 2020)
format %td myvar11
drop myvar1
rename (myvar11 myvar2) (mydate mycells)
order mydate, before(mycells)
bysort obs (mydate) : gen neworder = _n
drop _j
reshape wide mydate mycells, i(obs) j(neworder)
list
You can loop over the mycells variables if you need to further split them.
In general, please consider using dataex (SSC) to create easy data examples.
You don't give all the (not trivial) code you used to split the variables. As it happens, I don't think your variable names are easy to work with, so I re-created the split in my own fashion. If you reshape long the split data, then sorting by date is easy, but I have pulled up short of the reverse reshape wide, as I suspect the long structure is much easier to work with.
clear
input str80 data
"25-DEC-99: A11, B14, C89; 28-FEB-94: A27, B94, C30"
"01-APR-11: A25, B82, C65"
"04-JUL-09: A21, B55, C26; 12-MAR-03: A11, B72, C68; 08-JUN-11: A62, B47, C82"
"12-JUN-00: A77, B19, C73; 03-JUL-12: A99, B04, C54"
"27-OCT-15: A22, B95, C08"
end
split data, p(;) gen(x)
local j = 1
gen work = ""
foreach x of var x* {
    replace work = substr(`x', 1, strpos(`x', ":") - 1)
    gen date`j' = daily(work, "DMY", 2050)
    replace work = substr(`x', strpos(`x', ":") + 1, .)
    split work, p(,)
    rename (work1 work2 work3) (vA`j' vB`j' vC`j')
    local ++j
}
drop work
drop x*
drop data
gen id = _n
edit
reshape long date vA vB vC, i(id) j(which)
drop if missing(date)
bysort id (date): replace which = _n
list, sepby(id)
+----------------------------------------+
| id which date vA vB vC |
|----------------------------------------|
1. | 1 1 12477 A27 B94 C30 |
2. | 1 2 14603 A11 B14 C89 |
|----------------------------------------|
3. | 2 1 18718 A25 B82 C65 |
|----------------------------------------|
4. | 3 1 15776 A11 B72 C68 |
5. | 3 2 18082 A21 B55 C26 |
6. | 3 3 18786 A62 B47 C82 |
|----------------------------------------|
7. | 4 1 14773 A77 B19 C73 |
8. | 4 2 19177 A99 B04 C54 |
|----------------------------------------|
9. | 5 1 20388 A22 B95 C08 |
+----------------------------------------+