Beam stateful processing: how to build the state

I have a PCollection of KV pairs (UserId, UserName):
1, name1
2, name2
3, name3
From another source I have another PCollection, from which I generate a view of the following:
A PCollectionView of KV pairs (OfficeId, UserInfo)
where UserInfo is:
UserId, UserName (at least one of them must have a value)
office1, 1, name1
office2, 2, null
office3, null, name3
Eventually I would like to match each UserId with the relevant OfficeId.
It looks like, for each event in the first PCollection, I'll have to run a full scan of the view in order to find the data.
Sometimes the match will be done via UserId, and in case it is null it will be by UserName.
Is there any better way to build the state so I can avoid a full scan?
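As a hedged sketch of one option: instead of per-key state, the Beam Java SDK lets you materialize the office PCollection as two View.asMap() side inputs, one keyed by UserId and one keyed by UserName, so each element does two hash lookups instead of scanning the whole view. The matchOffices wrapper, the UserInfo accessors getUserId()/getUserName(), the use of String keys, and the premise that UserIds and UserNames are unique (View.asMap() requires unique keys) are all assumptions for illustration, not from the question.

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

static PCollection<KV<String, String>> matchOffices(
    PCollection<KV<String, String>> users,        // (UserId, UserName)
    PCollection<KV<String, UserInfo>> offices) {  // (OfficeId, UserInfo)

  // Index the office data twice: OfficeId by UserId, and OfficeId by UserName.
  PCollectionView<Map<String, String>> officeByUserId = offices
      .apply("HasUserId",
          Filter.by((KV<String, UserInfo> kv) -> kv.getValue().getUserId() != null))
      .apply("KeyByUserId", MapElements
          .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
          .via((KV<String, UserInfo> kv) -> KV.of(kv.getValue().getUserId(), kv.getKey())))
      .apply(View.asMap());

  PCollectionView<Map<String, String>> officeByUserName = offices
      .apply("HasUserName",
          Filter.by((KV<String, UserInfo> kv) -> kv.getValue().getUserName() != null))
      .apply("KeyByUserName", MapElements
          .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
          .via((KV<String, UserInfo> kv) -> KV.of(kv.getValue().getUserName(), kv.getKey())))
      .apply(View.asMap());

  // Per element: two O(1) map lookups instead of a full scan of the view.
  return users.apply("MatchOffice", ParDo.of(
      new DoFn<KV<String, String>, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          // Try the UserId index first; fall back to the UserName index.
          String officeId = c.sideInput(officeByUserId).get(c.element().getKey());
          if (officeId == null) {
            officeId = c.sideInput(officeByUserName).get(c.element().getValue());
          }
          if (officeId != null) {
            c.output(KV.of(c.element().getKey(), officeId));  // (UserId, OfficeId)
          }
        }
      }).withSideInputs(officeByUserId, officeByUserName));
}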

Related

WSO2 Stream processor: Siddhi App to calculate sum

I am working on Stream Processor 4.3.0. I have come across a scenario where I am putting some data feeds into an RDBMS table using a Siddhi app. Using the Siddhi app, I am entering the data in the RDBMS table as below.
Now I am using another Siddhi app to retrieve the data, but I would like to fetch the data in the form shown below,
where the common columns are collapsed into one row and the counts column is summed to get the final sum of all counts.
Can someone please guide me on how to proceed here?
Thanks in advance.
Here is the app to get the total sum:
@App:name("IncomingStream3")
@App:description("Description of the plan")
-- Please refer to https://docs.wso2.com/display/SP400/Quick+Start+Guide on getting started with SP editor.
--#store(type = 'rdbms', datasource = 'APIM_ANALYTICS_DB')
--#purge(enable='false', interval='60 min', #retentionPeriod(sec='1 day', min='72 hours', hours='90 days', days='1 year', months='2 years', years='3 years'))
define stream TempStatsStream (AGG_TIMESTAMP long, AGG_EVENT_TIMESTAMP long, apiName string, apiVersion string, apiResourcePath string, apiCreator string, username string, applicationConsumerKey string, AGG_LAST_EVENT_TIMESTAMP long, applicationName string, dateTime string, AGG_COUNT int);
define aggregation StatsToCal
from TempStatsStream
select apiName, apiVersion, apiResourcePath, apiCreator, username, applicationName,
applicationConsumerKey, sum(AGG_COUNT) as totalRequestCount, dateTime
group by apiName, apiVersion, apiResourcePath, username, applicationConsumerKey
aggregate by dateTime every days;
The only change I have made here is that instead of fetching the values from the DB table, I am treating them as a stream (as the aggregation can be done only over a stream, I suppose).
It seems like you have to group by API, Name1, Name2 and ID? You can use group by, similar to SQL's GROUP BY:
from TriggerStream join APITable
select APIName, Name1, Name2, ID, sum(Count) as totalCount
group by API, Name1, Name2, ID
insert into OutputStream;

concatenate attributes in search expression

I am trying to build a FilterExpression in a query for searching data in DynamoDB.
var params = {
    TableName: "ContactsTable",
    ExpressionAttributeNames: {
        "#lastName": "LastName",
        "#firstName": "FirstName",
        "#contactType": "ContactType"
    },
    FilterExpression: "contains(#lastName, :searchedName) or contains(#firstName, :searchedName)",
    ExpressionAttributeValues: {
        ":companyContactType": event.query.companyContactType,
        ":searchedName": event.query.searchedValue
    },
    KeyConditionExpression: "#contactType = :companyContactType"
};
Users generally search for "LastName, FirstName" (they append a comma to the last name as a common search pattern). However, the data is stored in separate attributes named LastName and FirstName so that they can search by either one as well.
Is there a way by which I can dynamically concatenate these two fields something like contains(#lastName<append comma>#firstName, :searchedName)?
You should remove the comma from the user input, split it into words and, for each word, check whether it is contained in either attribute (first name or last name), then OR everything together; you could even use begins_with instead of contains.
For example, "john smith" will result in:
contains(#lastName, "john") or
contains(#lastName, "smith" ) or
contains(#firstName, "john") or
contains(#firstName, "smith")
Also, contains() is case sensitive as far as I know, so you might want to store the first name and last name as lowercase as well, and lowercase the user's search term.
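Building on that, here is a minimal sketch of the dynamic expression building. It is in Java with the AWS SDK v2 rather than the JavaScript SDK used in the question, and the class name, helper name, and the :w0, :w1 placeholders are invented for illustration:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

public class ContactSearch {
    // Hypothetical helper: turns "Smith, John" into one contains() clause
    // per word per attribute, OR'd together.
    public static QueryRequest buildSearchRequest(String contactType, String searchedValue) {
        // Drop the comma and lowercase the term; this assumes lowercase copies
        // of the names are what is stored, since contains() is case sensitive.
        String[] words = searchedValue.replace(",", " ").toLowerCase().trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":companyContactType", AttributeValue.builder().s(contactType).build());
        for (int i = 0; i < words.length; i++) {
            String placeholder = ":w" + i;
            values.put(placeholder, AttributeValue.builder().s(words[i]).build());
            clauses.add("contains(#lastName, " + placeholder + ")");
            clauses.add("contains(#firstName, " + placeholder + ")");
        }
        return QueryRequest.builder()
            .tableName("ContactsTable")
            .expressionAttributeNames(Map.of(
                "#lastName", "LastName",
                "#firstName", "FirstName",
                "#contactType", "ContactType"))
            .keyConditionExpression("#contactType = :companyContactType")
            .filterExpression(String.join(" or ", clauses))
            .expressionAttributeValues(values)
            .build();
    }
}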

Kettle database lookup case insensitive

I have a table "City" with more than 100k records.
The field "name" contains strings like "Roma", "La Valletta".
I receive a file with the city name, all in upper case as in "ROMA".
I need to get the id of the record that contains "Roma" when I search for "ROMA".
In SQL, I must do something like:
select id from city where upper(name) = upper(%name%)
How can I do this in Kettle?
Note: if the city is not found, I use an Insert/Update step to create it, so I must avoid duplicates generated by case-sensitive names.
You can make use of the String Operations step in Pentaho Kettle: set its Lower/Upper option to upper case.
Pass the city name from the City table to the String Operations step, which will upper-case that field in your data stream. Then join/look up against the received file and get the required id.
There is more on the String Operations step in the Pentaho wiki.
You can use a 'Database join' step. Here you can write the SQL:
select id from city where upper(name) = upper(?)
and specify the city field name from the text file as parameter. With 'Number of rows to return' and 'Outer join?' you can control the join behaviour.
This solution doesn't work well with a large number of rows, as it will execute one query per row. In those cases Rishu's solution is better.
This is how I did it:
First, a "Modified Java Script Value" step to create the query:
var queryDest = "select coalesce( (select id as idcity from city where upper(name) = upper('"
    + replace(mycity, "'", "''") + "') and upper(cap) = upper('" + mycap + "') ), 0) as idcitydest";
Then I use this string as the query in a "Dynamic SQL row" step.
After that,
IF idcitydest == 0 then
insert new city;
else
use the found record
This approach runs one query per file row, but it uses very little memory.

Trying to insert data into a specific column and row in a query

I have a second query from which the column names come, and I want to insert its data into the main query. I have managed to bring all the columns of the second query into the main query, but the data is empty for all the newly added columns.
Now I am trying to loop over the first query, find the uuid that exists, and insert the specific data at the specific column and the specific row based upon the uuid search. This is my attempt so far:
<cfset lstusers = '51840915-e570-430d-9911-7247d076f6e7,5200915-g675-430d-9911-7247d076f6e7,56674915-e570-430d-9911-7247d076f6e7,2134563-e570-430d-9911-7247d076f6e7'>
<cfloop query="quserList">
<cfdump var="#quserList.uuid[currentRow]#">
<cfif ListContainsNoCase(lstusers,quserList.uuid[currentRow])>
<cfset QuerySetCell(quserList,"BUFFEREDRANGENOTES","name",quserList[uuid][currentRow])>
<cfset QuerySetCell(quserList,"BufferNotes","name2",quserList[uuid][currentRow])>
</cfif>
</cfloop>
However, it is giving me an error on the quserList[uuid][currentRow] line that says it is not indexable:
coldfusion.sql.QueryColumn#276249a2] ] is not indexable by
51840915-e570-430d-9911-7247d076f6e7
If I try it the other way:
quserList.uuid[currentRow]
I still get an error, but it says "cannot convert to int ...". How do I fix it?
Update:
In image 1, I am creating a column in the main query for each product_type of the first query. Based upon the userid of the first query and the uuid of the second query, I want to insert the data in the correct column and the correct row for that user. Image 2 shows the uuid in the second table:
In both queries, the userid you see in the first section is common, meaning the same userid exists below, which tells us that this user has completed these trainings. Now I want the first query to be merged into the second one so that it adds the correct data to the correct row, and that is what is messing me up.
SQL:
Query #1:
SELECT ct.trainingid,
ct.userid,
ct.trainingtype,
ct.trainingstatus,
ct.trainingscore,
ct.trainingdate,
dbo.Fn_stripcharacters(ctt.product_type, '^a-z0-9') AS product_type,
ctt.product_type AS oldName
FROM clienttraining AS ct
INNER JOIN clienttraningtypes AS ctt ON ct.trainingtype = ctt.typeid
WHERE 1 = 1
AND userid IN (
'51840915-e570-430d-9911-7247d076f6e7'
, '51927ada-6370-4433-8a06-30d2d076f6e7'
)
AND trainingtype IN (
SELECT typeid
FROM complaincetestlinks
WHERE pid = 1039
AND isactive = 1
AND isdeleted = 0
)
Query #2:
SELECT id,
NAME,
username,
email,
password,
first_name,
last_name,
usertype,
block,
sendemail,
registerdate,
lastvisitdate,
activation,
params,
uuid
FROM users
WHERE uuid IN (
'51840915-e570-430d-9911-7247d076f6e7'
, '51912193-6694-4ca5-94c9-9f31d076f6e7'
, '51927ada-6370-4433-8a06-30d2d076f6e7'
, '51c05ad7-d1d0-4eb6-bc6b-424bd076f6e7'
, 'd047adf1-a6af-891e-94a2d0b225dcd1b6'
, '2aba38f2-d7a7-0a7a-eff2be3440e3b763'
)
Update:
Looking at it again with fresh eyes, it still seems like maybe you are over-complicating things? A simple JOIN should return the information needed, i.e. all users and the completed training info (if any).
Runnable SQLFiddle
SELECT u.id
, u.first_name
, u.last_name
, ct.trainingid
, ct.userid
, ct.trainingtype
, ct.trainingstatus
, ct.trainingscore
, ct.trainingdate
, ctt.product_type
, ctt.product_type AS oldName
FROM users u
LEFT JOIN clientTraining AS ct ON ct.UserID = u.UUID
LEFT JOIN clientTraningTypes AS ctt ON ct.trainingtype = ctt.typeid
LEFT JOIN (
SELECT typeID
FROM complainceTestLinks
WHERE parent_client_id = 1039
AND isactive = 1
AND isdeleted = 0
) ctl ON ctl.TypeID = ct.trainingType
WHERE u.uuid IN
(
'51840915-e570-430d-9911-7247d076f6e7'
, '51912193-6694-4ca5-94c9-9f31d076f6e7'
, '51927ada-6370-4433-8a06-30d2d076f6e7'
, '51c05ad7-d1d0-4eb6-bc6b-424bd076f6e7'
, 'd047adf1-a6af-891e-94a2d0b225dcd1b6'
, '2aba38f2-d7a7-0a7a-eff2be3440e3b763'
)
ORDER BY last_name, first_name, product_type
;
How you want to present the information on the front end is a different question. For example, you could use <cfoutput group="..."> to only display each user's name once, with a list of completed training courses beneath it (see below). If you need more specific advice, please post an example of the desired output.
Smith, John
Course 1
Course 2
Allen, Mark
Course 2
Course 3
...
coldfusion.sql.QueryColumn#276249a2] ] is not indexable by
51840915-e570-430d-9911-7247d076f6e7
That just means you are referencing a column name that does not exist. By omitting the quotes around uuid here quserList[uuid][currentRow], you are actually passing in the variable value as the column name, NOT the literal string "UUID". Obviously the query does not contain a column named "51840915-e570-430d-9911-7247d076f6e7". Hence the error.
it says cannot convert to int
That is pretty self-explanatory: you are trying to populate a numeric column with a non-numeric value. Clearly the UUID string, i.e. "51840915-e570-430d-9911-7247d076f6e7", is not an integer. Either you are using the wrong value or you need to change the column type.
It may also be related to the fact that your QuerySetCell call is passing in the wrong parameters. The third and fourth parameters should be the "value" and the query "row number". However, your code is passing in a hard-coded string for the value, and the UUID string instead of a row number:
QuerySetCell(quserList,"BUFFEREDRANGENOTES","name",quserList[uuid][currentRow])
That said, technically you do not even need that function. Just use associative array notation to "set" the values, i.e. <cfset queryName["columnName"][currentRow] = "some value here">.
<cfif ListContainsNoCase(lstusers,quserList.uuid[currentRow])>
Nothing to do with the error, but ListContainsNoCase is the wrong function here, as it searches for partial matches. To match whole elements only, use ListFindNoCase.

How to use MapReduce when extracting a group of document id's by some criteria from CouchDB

I'm in my first week of CouchDB experimentation and trying to stop thinking in SQL. I have a collection of documents (5000 event files) that all have some ID value that will be common to groups of documents. So there might be 10 that all have TheID: 'foobar'.
(In case someone asks - TheID is not an auto-increment value from a relational database - it is a unique id assigned by a partner company of ours. I cannot redesign my source data to identify itself some other way, I have to use this TheID field to recognise groups of documents.)
I want to query my list of documents:
{ _id: 'document1', Message: { TheID: 'foobar' } }
{ _id: 'document2', Message: { TheID: 'xyz' } }
{ _id: 'document3', Message: { TheID: 'xyz' } }
{ _id: 'document4', Message: { TheID: 'foobar' } }
{ _id: 'document5', Message: { TheID: 'wibble' } }
{ _id: 'document6', Message: { TheID: 'foobar' } }
I want the results:
'foobar': [ 'document1', 'document4', 'document6' ]
'xyz': [ 'document2', 'document3' ]
'wibble': [ 'document5' ]
The aim is to represent groups of documents on our UI grouped by TheID, so the user can see all documents for a specific TheID together, and select that TheID to drill into the data querying just by that TheID value. Yes, the string id of each document is useful - in our case, the _id value of each document is the source event identifier, so it is a unique and useful value that the user is going to want to see in the list on screen.
In SQL one might order by or group by the TheID field and iterate the result set appropriately. I doubt this thinking is any use at all with a CouchDB query.
I know that I can use a map function to extract the TheID value for each document, for example:
function (doc) {
emit(doc.Message.TheID, 1);
}
or perhaps
function (doc) {
emit(doc._id, doc.Message.TheID);
}
I'm not sure exactly what I should emit as the key and value. Even if this is useful, I'm getting the feeling that I should not use a reduce function to try to 'reduce' the large map output (1 result row per document in the database) to what I want (3 results each with a list of document id's).
http://guide.couchdb.org/draft/views.html says "A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each."
I thought I might be able to use reduce to scan the results of the map and somehow collect all results that have a common TheID value into a single result object. What I see when reading the reduce documentation is that it will be given arrays of keys and values that contain fairly unpredictable collections, driven by the structure of the btree underlying the map results. It won't be given arrays guaranteed to contain all similar TheID values that I could scan for. This approach seems completely broken.
So, is a map/reduce pair the right thing to do here? Should I look at using a 'show' or 'list' instead? I'm intending to build a mustache-based HTML template engine around the results, so 'list' seems the wrong way to go.
Thanks in advance for any guidance.
EDIT I have done some local dev and come up with what I think is a broken solution. Hopefully this will show you the direction I'm trying to go in. See a public cloud based CouchDB I created at https://neek.iriscouch.com/_utils/database.html?test/_design/test/_view/collectByTheID
This is public. If you would like to play, please copy it to a new view, don't pollute this one in case others come in and want to see the original.
map function:
function(doc) {
emit(doc.Message.TheID, doc._id);
}
reduce function:
function(keys, values, rereduce) {
if (!rereduce) {
return values;
} else {
var ret = [];
values.forEach(function (ar) {
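// Note: concat() returns a new array rather than mutating `ret`, so as
// written the next line discards its result (`ret = ret.concat(ar);` was
// likely intended); that alone could explain the empty arrays described below.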
ret.concat(ar);
});
return ret;
}
}
Results:
"foobar" ["document6", "document4", "document1"]
"wibble" ["document5"]
"xyz" ["document3", "document2"]
The reduce function first leaves the array of values alone, and on the second pass concatenates them together. However, when I run this on my larger 5000+ document database, it comes up with some TheID values with empty document id arrays. I believe this suffers from the problem I mentioned before, where the arrays of values passed to reduce are built depending on the btree structure underlying the map they are extracted from, and are not guaranteed to contain a complete set of values for a given key.
Make use of the group_level feature:
Map:
emit([doc.Message.TheID, doc._id], null);
Reduce:
You must include a reduce function to use group_level. It can be a no-op as below, or a built-in such as _count:
function(keys, values){
return null;
}
A query with group_level=1 would return:
/_design/d/_view/v?group_level=1
[
{key: ["foobar"], value: null},
{key: ["xyz"], value: null},
{key: ["wibble"], value: null}
]
You would use this query to populate the top level in your grouping UI. When the user expands a category, you would do another query with group_level 2 and start and end keys:
/_design/d/_view/v?group_level=2&startkey=["foobar"]&endkey=["foobar",{}]
[
{key: ["foobar", "document6"], value: null},
{key: ["foobar", "document4"], value: null},
{key: ["foobar", "document1"], value: null}
]
This doesn't produce the output exactly as you are requesting; however, I think you'll find it flexible enough.
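If it helps, here is a sketch of issuing those two queries from Java using java.net.http (JDK 11+). The host, database name, and design document path are assumptions following the /_design/d/_view/v placeholders above:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GroupLevelQueries {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String view = "http://localhost:5984/test/_design/d/_view/v";  // assumed location
        // group_level=1: one row per TheID, to populate the top level of the UI.
        String topLevel = view + "?group_level=1";
        // group_level=2 plus key bounds: all document ids under "foobar".
        // startkey=["foobar"] and endkey=["foobar",{}], URL-encoded as JSON.
        String drillDown = view + "?group_level=2"
                + "&startkey=%5B%22foobar%22%5D"
                + "&endkey=%5B%22foobar%22%2C%7B%7D%5D";
        for (String url : new String[] {topLevel, drillDown}) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }
}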