WSO2 Stream processor: Siddhi App to calculate sum - wso2

I am working on Stream Processor 4.3.0. I have come across a scenario where I am putting some data feeds into an RDBMS table using a Siddhi app. Using the Siddhi app, I am entering the data in the RDBMS table as below
Now, I am using another Siddhi app to retrieve the data, but I would like to fetch the data as shown below
The common columns are collapsed into one row, and the column that holds the counts is summed to get the final sum of all counts.
Can someone please guide me on how to proceed here?
Thanks in advance
here is the app to get the total sum
@App:name("IncomingStream3")
@App:description("Description of the plan")
-- Please refer to https://docs.wso2.com/display/SP400/Quick+Start+Guide on getting started with SP editor.
--@store(type = 'rdbms', datasource = 'APIM_ANALYTICS_DB')
--@purge(enable='false', interval='60 min', @retentionPeriod(sec='1 day', min='72 hours', hours='90 days', days='1 year', months='2 years', years='3 years'))
define stream TempStatsStream (AGG_TIMESTAMP long, AGG_EVENT_TIMESTAMP long, apiName string, apiVersion string, apiResourcePath string,apiCreator string,username string, applicationConsumerKey string, AGG_LAST_EVENT_TIMESTAMP long, applicationName string, dateTime string, AGG_COUNT int);
define aggregation StatsToCal
from TempStatsStream
select apiName, apiVersion, apiResourcePath, apiCreator, username, applicationName,
applicationConsumerKey, SUM (AGG_COUNT) as totalRequestCount, dateTime
group by apiName, apiVersion, apiResourcePath, username, applicationConsumerKey
aggregate by dateTime every days;
The only change I have made here is that, instead of fetching the values from the DB table, I am treating the data as a stream (as aggregation can be done only on a stream, I suppose).

It seems you have to group by API, Name1, Name2 and ID. You can use group by, similar to SQL's GROUP BY:
from TriggerStream join APITable
select APIName, Name1, Name2, ID, sum(Count) as totalCount
group by API, Name1, Name2, ID
insert into OutputStream;
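To make the intended result concrete, here is a minimal Python sketch of what the group by / sum combination computes. The row layout and the sample values are hypothetical, since the original tables are not shown; only the idea (collapse rows that share the key columns, add up the counts) comes from the answer above.

```python
from collections import defaultdict

# Hypothetical rows: (api, name1, name2, id, count). The column layout is assumed.
rows = [
    ("API1", "N1", "N2", 1, 5),
    ("API1", "N1", "N2", 1, 7),
    ("API2", "N3", "N4", 2, 3),
]

def sum_counts(rows):
    """Collapse rows that share the same key columns and sum their counts."""
    totals = defaultdict(int)
    for api, name1, name2, id_, count in rows:
        totals[(api, name1, name2, id_)] += count
    return dict(totals)

totals = sum_counts(rows)
for key, total in totals.items():
    print(key, total)
```

Each distinct combination of the key columns ends up as one output row, which is the behaviour the group by clause gives in the Siddhi query.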


Replace Traffic Source from raw Google analytics session data in Bigquery?

Recently we observed that when a user tries to complete a transaction on our website using an iOS device, Apple ends the current session and begins a new one. The difficulty is that if the user came through a paid source or email, the current session ends and a new session starts with apple.com as the traffic source.
For instance:
google->appleid.apple.com
(direct)->appleid.apple.com
email->appleid.apple.com
ios->appleid.apple.com->appleid.apple.com->appleid.apple.com
Since we have this raw data coming into BQ, we are looking at replacing appleid.apple.com with the actual traffic source, i.e. google, direct, email, ios.
Any help with the logic or a function to work around this problem would be appreciated.
This is the code I tried implementing:
WITH DATA AS (
SELECT
PARSE_DATE("%Y%m%d",date) AS Date,
clientId as ClientId,
fullVisitorId AS fullvisitorid,
visitNumber AS visitnumber,
trafficSource.medium as medium,
CONCAT(fullvisitorid,"-",CAST(visitStartTime AS STRING)) AS Session_ID,
trafficsource.source AS Traffic_Source,
MAX((CASE WHEN (hits.eventInfo.eventLabel="complete") THEN 1 ELSE 0 END)) AS ConversionComplete
FROM `project.dataset.ga_sessions_20*`
,UNNEST(hits) AS hits
WHERE totals.visits=1
GROUP BY
1,2,3,4,5,6,7
),
Source_Replace AS (
SELECT
Date AS Date,
IF(Traffic_Source LIKE "%apple.com" ,(CASE WHEN Traffic_Source NOT LIKE "%apple.com%" THEN LAG(Traffic_Source,1) OVER (PARTITION BY ClientId ORDER BY visitnumber ASC)end), Traffic_Source) AS traffic_source_1,
medium AS Medium,
fullvisitorid AS User_ID,
Session_ID AS SessionID,
ConversionComplete AS ConversionComplete
FROM
DATA
)
SELECT
Date AS Date,
traffic_source_1 AS TrafficSource,
Medium AS TrafficMedium,
COUNT(DISTINCT User_ID) AS Users,
COUNT(DISTINCT SessionID) AS Sessions,
SUM(ConversionComplete) AS ConversionComplete
FROM
Source_Replace
GROUP BY
1,2,3
Thanks
Would treating visitStartTime as the key for identifying the session start help? Maybe something like:
source_replaced as (
select *,
min(Traffic_Source) over (
partition by date, clientid, fullvisitorid, visitnumber order by visitStartTime
) as originating_source
from data
)
Then you can do your aggregation over originating_source. It's kind of difficult to say more without looking at a sample of the data to see what's going on.
Hope it helps.
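As a way to reason about the replacement logic outside of SQL, here is a small Python sketch that carries the last non-Apple source forward per client. The field names (client_id, visit_number, source) and the sample sessions are hypothetical; it also assumes visit numbers order a client's sessions correctly.

```python
def replace_apple_source(sessions):
    """Replace *apple.com sources with the client's last non-Apple source.

    sessions: list of dicts with hypothetical keys
    client_id, visit_number and source.
    """
    last_real = {}  # client_id -> most recent non-Apple source
    result = []
    ordered = sorted(sessions, key=lambda s: (s["client_id"], s["visit_number"]))
    for s in ordered:
        source = s["source"]
        if source.endswith("apple.com"):
            # Fall back to the original value if nothing came before it.
            source = last_real.get(s["client_id"], source)
        else:
            last_real[s["client_id"]] = source
        result.append({**s, "source": source})
    return result

sessions = [
    {"client_id": "A", "visit_number": 1, "source": "google"},
    {"client_id": "A", "visit_number": 2, "source": "appleid.apple.com"},
    {"client_id": "B", "visit_number": 1, "source": "ios"},
    {"client_id": "B", "visit_number": 2, "source": "appleid.apple.com"},
    {"client_id": "B", "visit_number": 3, "source": "appleid.apple.com"},
]
fixed = replace_apple_source(sessions)
print([s["source"] for s in fixed])
```

Unlike a LAG of exactly one row, this also covers several consecutive appleid.apple.com sessions. In BigQuery the same idea could probably be expressed by nulling out the Apple sources and taking LAST_VALUE(... IGNORE NULLS) over the client's sessions, but that is a suggestion to test against real data, not a verified query.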

Kettle database lookup case insensitive

I've a table "City" with more than 100k records.
The field "name" contains strings like "Roma", "La Valletta".
I receive a file with the city name, all in upper case as in "ROMA".
I need to get the id of the record that contains "Roma" when I search for "ROMA".
In SQL, I must do something like:
select id from city where upper(name) = upper(%name%)
How can I do this in kettle?
Note: if the city is not found, I use an Insert/update field to create it, so I must avoid duplicates generated by case-sensitive names.
You can make use of the String Operations step in Pentaho Kettle. Set the Lower/Upper option to Y.
Pass the city name from the City table through the String Operations step, which will upper-case that field in your data stream. Then join/look up against the received file and get the required id.
More on String Operations step in pentaho wiki.
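The idea behind this answer (normalise the case on one side, then look up) can be sketched in a few lines of Python. This is only an illustration of the logic, not Kettle configuration; the sample rows and function names are hypothetical.

```python
def build_index(cities):
    """Map upper-cased city name -> id, from (id, name) rows."""
    return {name.upper(): city_id for city_id, name in cities}

def lookup(index, name):
    """Return the id for a city name regardless of case, or None if absent."""
    return index.get(name.upper())

# Hypothetical rows standing in for the City table.
cities = [(1, "Roma"), (2, "La Valletta")]
index = build_index(cities)

print(lookup(index, "ROMA"))         # id of "Roma"
print(lookup(index, "la valletta"))  # id of "La Valletta"
print(lookup(index, "MILANO"))       # None, so the city would be inserted
```

Because the table side is upper-cased once and then matched in memory, no query is issued per incoming row.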
You can use a 'Database join' step. Here you can write the sql:
select id from city where upper(name) = upper(?)
and specify the city field name from the text file as parameter. With 'Number of rows to return' and 'Outer join?' you can control the join behaviour.
This solution doesn't work well with a large number of rows, as it will execute one query per row. In those cases Rishu's solution is better.
This is how I did it:
First, a "Modified JavaScript value" step to create a query:
var queryDest="select coalesce( (select id as idcity from city where upper(name) = upper('"+replace(mycity,"'","\'\'")+"') and upper(cap) = upper('"+mycap+"') ), 0) as idcitydest";
Then I use this string as a query in a Dynamic SQL row.
After that,
IF idcitydest == 0 then
insert new city;
else
use the found record
This approach runs one query per row of the file, but it uses little cache memory.

Regex QueryString Parsing for a specific in BigQuery

So last week I was able to begin to stream my Appengine logs into BigQuery and am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested, with the querystring parameters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another entry before device_ID, I am not getting it. As you can see, I am not great with regex, but it is the only way I can think of to parse the data in the query. To get just the device ID from the first example, I was able to use the following query, which works great. My next challenge is to get the data when the second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56
Assuming you have just 1 query string per record then you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable
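To check the second pattern against the sample values from the question, here is a short Python sketch. Python's re module is not the same engine as BigQuery's, but for a pattern this simple the behaviour should match; the function name is made up for the example.

```python
import re

# Capture everything after "device_ID=" up to the next "&" (or end of string).
DEVICE_ID = re.compile(r"device_ID=([^&]*)")

def extract_device_id(resource):
    """Return the device_ID value from a request path, or None if missing."""
    match = DEVICE_ID.search(resource)
    return match.group(1) if match else None

samples = [
    "/service.html?device_ID=123456",
    "/service.html?v=2&device_ID=78ec9b4a56",
    "/service.html?device_ID=abc&v=2",
]
for sample in samples:
    print(extract_device_id(sample))
```

The third sample is an added case showing why the character class matters: with (.*)$ the capture would also swallow "&v=2".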
One approach is to split protoPayload.resource into multiple service entries, and then apply regexp - this way it will support arbitrary number of device_id, i.e.
select regexp_extract(service_entry, r'device_ID=(.*$)') from
(select split(protoPayload.resource, ' ') service_entry from
(select
'/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
as protoPayload.resource))

Analyzing tweeter with hive, regex extract

I am trying to find the most popular hashtags of July. So far I am able to select tweets from July, or display the most popular hashtags, but I haven't succeeded in putting the two together. I am thinking about creating an intermediate table with the July tweets and then displaying the popular hashtags, but I don't know how. Can you help me? What about a two-level select (select a from select b from table)?
SELECT hashtags.text, count(*) as total FROM tweets
WHERE regexp_extract(created_at, "(Tue) (Jul)*", 2) = "Jul"
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text), created_at
ORDER BY total_count DESC
LIMIT 200
Regards, K.
So far I have done this, which is pretty much what I want, but is there any other way to achieve it?
Working nested query:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
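For comparison, here is a minimal Python sketch of the same filter-then-count logic. It assumes created_at uses Twitter's usual layout (for example "Tue Jul 02 10:00:00 +0000 2013") and that each tweet is a dict shaped like the Hive columns; the sample tweets are invented. Note that, as far as I can tell, a pattern anchored on "Tue Jul" only keeps tweets posted on a Tuesday, so the sketch matches any weekday followed by "Jul".

```python
import re
from collections import Counter

# Any three-letter weekday followed by "Jul" at the start of created_at.
JULY = re.compile(r"^\w{3} Jul ")

def top_hashtags(tweets, n=15):
    """Count lower-cased hashtags of July tweets, most frequent first."""
    counts = Counter()
    for tweet in tweets:
        if JULY.match(tweet["created_at"]):
            for tag in tweet["entities"]["hashtags"]:
                counts[tag["text"].lower()] += 1
    return counts.most_common(n)

# Invented sample data.
tweets = [
    {"created_at": "Tue Jul 02 10:00:00 +0000 2013",
     "entities": {"hashtags": [{"text": "Hive"}, {"text": "BigData"}]}},
    {"created_at": "Wed Jul 03 11:00:00 +0000 2013",
     "entities": {"hashtags": [{"text": "hive"}]}},
    {"created_at": "Thu Aug 01 12:00:00 +0000 2013",
     "entities": {"hashtags": [{"text": "hive"}]}},
]
print(top_hashtags(tweets))
```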
EDIT:
OK, so if you want, you can also do it with a temporary table:
CREATE TABLE tmpdb (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
Then you populate it:
INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
And the query becomes as simple as this:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM tmpdb
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
Pros and cons of the second method:
You need to refresh the table if you want accurate results, so it is not suited to one-off queries. But if you need to run multiple queries against the current state of the database, this method is better.
Don't forget that copying a database is a costly operation! So know when to use it :)

How to tweak LISTAGG to support more than 4000 character in select query?

Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production.
I have a table in the below format.
Name   Department
Johny  Dep1
Jacky  Dep2
Ramu   Dep1
I need an output in the below format.
Dep1 - Johny,Ramu
Dep2 - Jacky
I have tried the LISTAGG function, but there is a hard limit of 4000 characters. Since my DB table is huge, this cannot be used in the app. The other option is to use
SELECT CAST(COLLECT(Name)
But my framework allows me to execute only SELECT queries and no PL/SQL scripts. Hence I don't see any way to create a type using the CREATE TYPE command, which is required for the COLLECT approach.
Is there any alternate way to achieve the above result using select query ?
You should add GetClobVal, and you also need RTRIM, as the result will have a delimiter at the end.
SELECT RTRIM(XMLAGG(XMLELEMENT(E,colname,',').EXTRACT('//text()')
ORDER BY colname).GetClobVal(),',') from tablename;
If you can't create types (can't you just use SQL*Plus to create one as a one-off?) but you're OK with COLLECT, then use a built-in array type. There are several knocking around in the RDBMS. Run this query:
select owner, type_name, coll_type, elem_type_name, upper_bound, length
from all_coll_types
where elem_type_name = 'VARCHAR2';
e.g. on my db, I can use sys.DBMSOUTPUT_LINESARRAY which is a varray of considerable size.
select department,
cast(collect(name) as sys.DBMSOUTPUT_LINESARRAY)
from emp
group by department;
A derivative of @anuu_online's answer, but one that handles unescaping the XML in the result.
dbms_xmlgen.convert(xmlagg(xmlelement(E, name||',')).extract('//text()').getclobval(),1)
For IBM DB2, casting the column to VARCHAR(10000) inside LISTAGG allows results longer than 4000 characters.
select column1, listagg(CAST(column2 AS VARCHAR(10000)), x'0A') AS "Concat column"...
I ended up with another approach using the XMLAGG function, which doesn't have the hard limit of 4000.
select department,
XMLAGG(XMLELEMENT(E,name||',')).EXTRACT('//text()')
from emp
group by department;
You can use:
SELECT department
, REGEXP_REPLACE(XMLCAST(XMLAGG(XMLELEMENT(x, name, ',')) AS CLOB), ',$')
FROM emp
GROUP BY department
It returns a CLOB, which has no size limit, and it correctly handles XML entity escapes and separators.
Instead of REGEXP_REPLACE(..., ',$') you can use RTRIM(..., ','), which should be faster but will remove all separators from the end of the result (including those that appear at the end of a name, or earlier ones if the last names are empty).
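The difference between the two trimming strategies is easy to see on a small string. The following Python sketch uses str.rstrip and re.sub as stand-ins for Oracle's RTRIM and REGEXP_REPLACE; it illustrates the behaviour described above and is not Oracle code. The sample value imagines names aggregated with a trailing separator where the last name is empty.

```python
import re

def trim_one(value):
    """Like REGEXP_REPLACE(value, ',$'): drop only the final separator."""
    return re.sub(r",$", "", value)

def trim_all(value):
    """Like RTRIM(value, ','): drop every trailing separator."""
    return value.rstrip(",")

# "Johny", "Ramu" and an empty name, each followed by a separator.
aggregated = "Johny,Ramu,,"

print(trim_one(aggregated))  # keeps the separator that belongs to the empty name
print(trim_all(aggregated))  # removes both trailing commas
```

If empty or comma-terminated names cannot occur in the data, the two give the same result and the cheaper trim is the natural choice.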