How to use "select concat(col)" in a Siddhi aggregation query? - wso2

How can I get all the values of one column in a Siddhi aggregation query? For example, I have this data:
column1  column2  column_uuid
1        a        uuid1
2        a        uuid2
3        a        uuid3
4        b        uuid4
I want to use a Siddhi query like this:
define stream Input (column1 int, column2 string, column_uuid string);
define stream Output (column2 string, amount long, uuid string);
@info(name='query')
from Input#window.time(30 sec)
select column2, count() as amount, concat(column_uuid) as uuid
group by column2
having amount > 2
insert into Output;
and I want to get a result like:
Event{timestamp=xxx, data=[a, 3, "uuid1,uuid2,uuid3"]}

In Siddhi, there are two types of functions: regular functions and aggregate functions. Since str:concat() is not an aggregate function, you can't use it to get your desired outcome.
Therefore, you may have to write your own custom:concat() aggregate function to get the expected result. Please refer to the Siddhi samples on writing custom aggregate functions. To get the above output, you can simply keep a global string variable within your custom attribute aggregator and append to it in the processAdd(Object data) method.
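For illustration, here is a minimal sketch of such an aggregator, assuming the Siddhi 4.x AttributeAggregator extension API (the base class, init signature, and state-handling methods differ between Siddhi versions, and custom:concat is just an example name — verify against the official samples):

import java.util.HashMap;
import java.util.Map;

import org.wso2.siddhi.annotation.Extension;
import org.wso2.siddhi.core.config.SiddhiAppContext;
import org.wso2.siddhi.core.executor.ExpressionExecutor;
import org.wso2.siddhi.core.query.selector.attribute.aggregator.AttributeAggregator;
import org.wso2.siddhi.core.util.config.ConfigReader;
import org.wso2.siddhi.query.api.definition.Attribute;

// Sketch only: check class names and method signatures for your Siddhi version.
@Extension(name = "concat", namespace = "custom",
        description = "Concatenates the string values seen by the aggregator.")
public class ConcatAttributeAggregator extends AttributeAggregator {

    // The "global string variable" mentioned above; appended to on every event.
    private StringBuilder values = new StringBuilder();

    @Override
    protected void init(ExpressionExecutor[] attributeExpressionExecutors,
                        ConfigReader configReader, SiddhiAppContext siddhiAppContext) {
        // Nothing to initialise for this simple sketch.
    }

    @Override
    public Attribute.Type getReturnType() {
        return Attribute.Type.STRING;
    }

    @Override
    public Object processAdd(Object data) {
        if (values.length() > 0) {
            values.append(',');
        }
        values.append(data);
        return values.toString();
    }

    @Override
    public Object processAdd(Object[] data) {
        return processAdd(data[0]); // this sketch only handles a single argument
    }

    @Override
    public Object processRemove(Object data) {
        // A production version should remove the expired value here; a sliding
        // window such as #window.time(30 sec) calls this when events expire.
        return values.toString();
    }

    @Override
    public Object processRemove(Object[] data) {
        return processRemove(data[0]);
    }

    @Override
    public Object reset() {
        values.setLength(0);
        return values.toString();
    }

    @Override
    public boolean canDestroy() {
        return values.length() == 0;
    }

    @Override
    public Map<String, Object> currentState() {
        Map<String, Object> state = new HashMap<>();
        state.put("values", values.toString());
        return state;
    }

    @Override
    public void restoreState(Map<String, Object> state) {
        values = new StringBuilder((String) state.get("values"));
    }
}

Once the extension jar is deployed, the query above can select custom:concat(column_uuid) as uuid.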

Related

How can I correct AWS Glue Crawler/Data Catalog inferring all fields in CSV as strings when they're clearly not?

I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files is always the same, the formatting is the same, and the naming conventions are the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:
id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489
When I run a Crawler with everything default, querying with Athena like so:
select * from tb_csv_data
...the results in Athena are thus:
id       date        string       int_values    double_values
"6F87U"  2021-03-21  "Text"       0             1.1483
"8DU87"  2021-03-22  "More text   oh yes"       1
"79LO2"  2021-03-23  "Moar        give me moar  text
The problem at this level seems to be with proper detection (read: ignoring) of commas as delimiters within quotation marks. So I created a CSV classifier with the following characteristics, attached it to the Crawler, and ran the Crawler again. The resulting table properties are:
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters
quoteChar "
separatorChar ,
Table properties
sizeKey 4356512114
objectCount 3
UPDATED_BY_CRAWLER crawler-name
CrawlerSchemaSerializerVersion 1.0
recordCount 3145398
averageRecordSize 1384
CrawlerSchemaDeserializerVersion 1.0
compressionType none
columnsOrdered true
areColumnsQuoted true
delimiter ,
typeOfData file
The resulting table with the same simple Athena query as above seems to be correct:
id     date        string                    int_values  double_values
6F87U  2021-03-21  Text, yes                 0           1.1483
8DU87  2021-03-22  More text, oh yes         1           2.525
79LO2  2021-03-23  Moar, give me moar, text  2           3.485489
The expected automatic inference of data types is supposed to be this (let's simplify and presume the date is correct as a string):
Column name    Data type
id             string
date           string
string         string
int_values     bigint (or long)
double_values  double
...but instead they're all strings!
Column name    Data type
id             string
date           string
string         string
int_values     string
double_values  string
I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console, but is that really correct when the entire pipeline needs to be automated? I also want to avoid casting types 80+ times in each query, as most of these columns are numeric. What can I do?
Thank you!
The limitation arises from the SerDe you are using in your query. Refer to the note section in this doc, which has the following explanation:
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.
For a date type to be detected, it has to be in UNIX numeric format, such as 1562112000, according to the doc.
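As a hedged workaround sketch (not from the doc above; the view name tb_csv_data_typed is made up, while the table and column names come from the question), you could define the casts once in an Athena view so downstream queries don't repeat them:

-- Cast once in a view so queries against the view need no per-column casts.
-- "date" and "string" are quoted because they clash with reserved/type names.
CREATE OR REPLACE VIEW tb_csv_data_typed AS
SELECT id,
       "date",    -- left as string, per the question's simplification
       "string",
       CAST(int_values AS BIGINT) AS int_values,
       CAST(double_values AS DOUBLE) AS double_values
FROM tb_csv_data;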

Parse data from table Greenplum

I have a table scheme2.central_id__new_numbers in Greenplum.
I need to select data from scheme2.central_id__new_numbers in the form of a many-to-many relationship.
I wrote the following code, but I must have made a wrong turn somewhere (it doesn't work):
CREATE FUNCTION my_scheme.parse_new_numbers (varchar) RETURNS SETOF varchar as
$BODY$
declare
    i int;
BEGIN
    FOR i IN 1..10 LOOP
        select
            central_id,
            (select regexp_split_to_table((select new_numbers
                from scheme2.central_id__new_numbers limit 1 offset i), '\s+'))
        from scheme2.central_id__new_numbers limit 1 offset i
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
I'd recommend using the UNNEST() function instead. Assuming the new_numbers column is of int[] data type:
SELECT central_id
, UNNEST(new_numbers) AS new_numbers
FROM central_id__new_numbers;
If the new_numbers column is not an array data type, then you need to use e.g. string_to_array() or similar before using UNNEST().
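If new_numbers is actually whitespace-separated text (which the regexp_split_to_table(..., '\s+') in the question suggests), a hedged sketch for recent Greenplum/PostgreSQL versions would split first and then expand:

-- Split the delimited text into an array, then expand to one row per value;
-- schema and column names are taken from the question.
SELECT central_id
     , UNNEST(regexp_split_to_array(new_numbers, '\s+')) AS new_number
FROM scheme2.central_id__new_numbers;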

Can we create Dynamic Date table in mapping Data Flow?

I have a query in Power BI that takes two parameters: Start Date and End Date.
Whenever I pass these dates, it returns a date table containing a few columns derived from that date range, such as Date, QuarterOfYear, Year, MonthName, etc.
Can we create a mapping data flow in ADF that takes two parameters as input and returns a calculated table according to the provided dates?
Is there any function that returns a range of dates?
For your request: "I want to pass two dates, Start Date and End Date, into an ADF Mapping Data Flow, and have the data flow create a column such as "Date" containing that number of date rows. Is there any function for this? Example: Start Date = 20-01-2019, End Date = 20-01-2020; then the Date column values should be: 20-01-2019, 21-01-2019, ..., 20-02-2020". According to the Data Factory documents and my experience, the answer is no; we can't achieve this in Data Flow.
There is a solution to this, but it is a bit tricky.
TL;DR
The general data flow looks like this:
We need a dummy source with exactly one row; its content doesn't matter.
Then we derive a column, using the mapLoop() expression to create an array of all the dates we want rows for.
Finally, we flatten the array column, which results in one row per array entry and thus one row per date.
Walkthrough
Source dummy
Each dataflow needs a source and we need exactly one row to make our dataflow work. To achieve this I've created a dataset called empty of type CSV in my data lake which has this content:
empty
""
This is our source definition:
And its result looks like this:
Derived column days
This is where the magic happens!
We create a new column dates which is an array of all the dates we want to have in our date table:
In this scenario we want a date table starting on 2019-01-01 and reaching one year into the future. The full expression looks like this:
mapLoop(
    addDays(currentDate(), 365) - toDate('2019-01-01'),
    addDays(toDate('2019-01-01'), #index)
)
This is what happens here:
the mapLoop() function builds an array of elements: you specify the number of elements you want and the lambda expression that calculates each of them. (The related mapIndex() function works the same way over an existing array; for example, mapIndex([1, 2, 3, 4], #item + 2 + #index) results in [4, 6, 8, 10].)
addDays(currentDate(), 365) - toDate('2019-01-01') is the number of days between our start (2019-01-01) and end date (1 year in the future from now) and thus the number of dates we want to have in our resulting array.
addDays(toDate('2019-01-01'), #index) calculates each array item by adding #index days to our start date. This is executed once for each of the positions we calculated before, with #index being the array position. Thus, the first element of the array will be 2019-01-01 + 1, the second 2019-01-01 + 2, and so on.
Our stream now has these columns:
Flatten
Finally, you need a flatten transformation which will expand each item in your array to its dedicated row. We can also dismiss the useless empty column in this step:
And this finally results in what we wanted to achieve:
References
Data transformation expressions in mapping data flow

Power query append multiple tables with single column regardless column names

I have the following query in M:
= Table.Combine({
    Table.Distinct(Table.SelectColumns(Tab1, {"item"})),
    Table.Distinct(Table.SelectColumns(Tab2, {"Column1"}))
})
Is it possible to get this working without changing the column names first?
I want to get something similar to SQL syntax:
select item from Tab1 union all
select Column1 from Tab2
If you need just one column from each table then you may use this code:
// Splitter.SplitByNothing keeps each list value as one row; the default
// splitter would split any text value containing commas.
= Table.FromList(
    List.Distinct(Tab1[item]) & List.Distinct(Tab2[Column1]),
    Splitter.SplitByNothing())
If you use M (as in your example, or the Append Queries option), the column names must be the same, otherwise it won't work.
But it works in DAX with the command
=UNION(Table1; Table2)
https://learn.microsoft.com/en-us/dax/union-function-dax
It's not possible in Power Query M. Table.Combine makes a union of the columns whose names match. If you want to keep everything in the same step, you can rename the column inline for Tab2, just as you wrapped it in Table.SelectColumns.
Matching on column names is how Table.Combine unions correctly.
Hopefully you can manage it in a single step if that's what you want.
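For completeness, a hedged sketch of that rename-in-one-step variant (assuming it is acceptable to expose Tab2's values under the item column name):

// Rename Tab2's column inline so Table.Combine sees matching column names.
= Table.Combine({
    Table.Distinct(Table.SelectColumns(Tab1, {"item"})),
    Table.Distinct(
        Table.RenameColumns(
            Table.SelectColumns(Tab2, {"Column1"}),
            {{"Column1", "item"}}))
})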

Splitting column values with Sybase and ColdFusion

I need to trim the date from a text string in the function call of my app.
The string comes out as text//date, and I would like to trim the date or replace it with blank space. The column name is overall_model and the values look like ford//1911 or chevy//2011, but I need the date part removed so I can loop over the array or list to get an accurate count of all the models.
The problem is that if there is a chevy//2011 and a chevy//2010, I get two rows back in my table because of the date. If I can remove the date and loop over the results, I can get my answer, e.g. for chevy there are 23 chevy models.
I have not used Sybase in a while, but I remember its string functions are very similar to MS SQL Server's.
If overall_model always contains "//", use charindex to return the position of the delimiter and substring to retrieve the "text" before it, then combine that with a COUNT. (If the "//" is not always present, you will need to add a CASE statement as well; see the sketch after the query.)
SELECT SUBSTRING(overall_model, 1, CHARINDEX('/', overall_model)-1) AS Model
, COUNT(*) AS NumberOfRecords
FROM YourTable
GROUP BY SUBSTRING(overall_model, 1, CHARINDEX('/', overall_model)-1)
However, ideally the "text" and "date" should be stored separately. That would offer greater flexibility and generally better performance.
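For reference, a hedged sketch of the CASE variant mentioned above, for data where "//" is not always present (YourTable remains a placeholder):

-- Fall back to the whole value when no delimiter is found.
SELECT CASE WHEN CHARINDEX('/', overall_model) > 0
            THEN SUBSTRING(overall_model, 1, CHARINDEX('/', overall_model) - 1)
            ELSE overall_model
       END AS Model
     , COUNT(*) AS NumberOfRecords
FROM YourTable
GROUP BY CASE WHEN CHARINDEX('/', overall_model) > 0
              THEN SUBSTRING(overall_model, 1, CHARINDEX('/', overall_model) - 1)
              ELSE overall_model
         END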