What is the key used in Session based windowing on Google dataflow

I am new to dataflow. I came across this example in the google documentation.
PCollection<String> items = ...;
PCollection<String> session_windowed_items = items.apply(
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));
1) In the above example, what key does Dataflow use to create the session windows?
2) If my input source is Pub/Sub, do I need to set any message attributes, and how do I specify which key Dataflow should use with session-based windowing?

Elements are assigned to sessions at the first grouping operation after the Window.into. The grouping key is whatever key that operation uses: GroupByKey, Combine.perKey, Sum.perKey, CoGroupByKey, and so on.
You do not need to set message attributes to specify the key. Instead, write a ParDo that transforms the existing elements into KV<K, V> values; the sessions are then built per key K.
You may want to read about group-by-key for more info.
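To make the "sessions are per key" point concrete, here is a minimal plain-Java sketch that does not use the Dataflow/Beam SDK at all. It only imitates the idea: events are grouped by key first, and within one key a new session starts when the time since the previous event reaches the gap duration. The class name SessionSketch, the method sessionize, and the sample keys are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SessionSketch {
    /**
     * Groups (key, timestampMillis) events by key, then splits each key's
     * timestamps into sessions. Each session is returned as {first, last}.
     * Hypothetical helper: it imitates per-key sessions, it is not the SDK.
     */
    static Map<String, List<long[]>> sessionize(
            List<Map.Entry<String, Long>> events, long gapMillis) {
        // Step 1: group by key. A session never spans two different keys.
        Map<String, List<Long>> byKey = new TreeMap<>();
        for (Map.Entry<String, Long> e : events) {
            byKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Step 2: within one key, start a new session once the gap is reached.
        Map<String, List<long[]>> sessions = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : byKey.entrySet()) {
            List<Long> timestamps = entry.getValue();
            Collections.sort(timestamps);
            List<long[]> out = new ArrayList<>();
            long start = timestamps.get(0);
            long last = start;
            for (long t : timestamps) {
                if (t - last >= gapMillis) {
                    out.add(new long[] {start, last});
                    start = t;
                }
                last = t;
            }
            out.add(new long[] {start, last});
            sessions.put(entry.getKey(), out);
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> events = List.of(
                Map.entry("alice", 0L),
                Map.entry("bob", 60_000L),
                Map.entry("alice", 300_000L),
                Map.entry("alice", 1_200_000L));
        // 10-minute gap, as in the question's Sessions.withGapDuration(...).
        sessionize(events, 600_000L).forEach((key, list) -> {
            for (long[] s : list) {
                System.out.println(key + ": " + s[0] + " .. " + s[1]);
            }
        });
    }
}
```

In a real pipeline the key would come from the KV<K, V> produced by your ParDo, and the grouping transform would do the merging for you.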

Related

How to get history of a state in corda?

In my Corda project, a state may evolve over time, so I made it a LinearState. Now I want to retrieve the history of a state, that is, how it evolved over time. How can I see the evolution history of a particular state in Corda?
In particular, I want to access the complete transaction chain of a state.

Of course, without access to your code this answer will vary, but there are two pieces of documentation to be aware of here.
What you want to perform is essentially a vault query (depending on what information you're looking to get).
From the docs on LinearState:
Whenever a node records a new transaction, it also decides whether it should store each of the transaction’s output states in its vault. The default vault implementation makes the decision based on the following rules.
source: https://docs.corda.net/docs/corda-os/4.6/api-states.html#the-vault
That being said, you perform this vault query just as you would for other states. Here are the docs on the vault query API: https://docs.corda.net/docs/corda-os/4.6/api-vault-query.html
If you have the linear ID, you can do it from the node shell, or by using H2 and looking in places like the VAULT_LINEAR_STATES table.
If you want an example of querying in code, take a look at the obligation CorDapp, whose flow takes the linear ID as a parameter.
// 1. Retrieve the IOU State from the vault using LinearStateQueryCriteria
List<UUID> listOfLinearIds = Arrays.asList(stateLinearId.getId());
QueryCriteria queryCriteria = new QueryCriteria.LinearStateQueryCriteria(null, listOfLinearIds);
Vault.Page results = getServiceHub().getVaultService().queryBy(IOUState.class, queryCriteria);
StateAndRef inputStateAndRefToSettle = (StateAndRef) results.getStates().get(0);
IOUState inputStateToSettle = (IOUState) ((StateAndRef) results.getStates().get(0)).getState().getData();
Source Code example here: https://github.com/corda/samples-java/blob/master/Advanced/obligation-cordapp/workflows/src/main/java/net/corda/samples/flows/IOUSettleFlow.java#L56-L61

Difference between RangeKeyCondition and FilterKeyCondition in aws DynamoDb

I am new to AWS. While reading the docs and an example, I learned that the sort key is used not only to sort the data within a partition but also to narrow the search criteria on a DynamoDB table. But we can do the same with a filter condition, so what is the difference?
Also, according to the example, we can use the sort/range key in withKeyConditionExpression("CreateDate = :v_date and begins_with(IssueId, :v_issue)"),
but when I tried it, it gave me this exception:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Query key condition not supported
Thanks

There are two ways to limit the Items returned, rather than returning all Items with a particular HASH key.
The ideal way is to build the element we want to query into the RANGE key. This allows us to use Key Expressions to query our data, allowing DynamoDB to quickly find the Items that satisfy our Query.
A second way is to filter on non-key attributes. This is less efficient than Key Expressions but can still be helpful in the right situations. Filter expressions apply server-side filters on Item attributes before the Items are returned to the client making the call. The filter is applied after the DynamoDB Query has completed: if you retrieve 100KB of data in the Query step but filter it down to 1KB, you still consume the Read Capacity Units for 100KB of data.
The moral: filtering and projection expressions aren't a magic bullet. They won't make it easy to quickly query your data in additional ways. However, they can save network transfer time by limiting the number and size of items transferred back to your network. They can also simplify application complexity by pre-filtering your results rather than requiring application-side filtering.
From dynamodbguide
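The cost difference described above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the AWS SDK: one partition is modelled as a sorted map from sort key to a single non-key attribute, and "items read" stands in for consumed read capacity. The names QueryCostSketch, queryWithKeyCondition, and queryWithFilter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class QueryCostSketch {
    /** Outcome of a simulated query: the matching sort keys and how many items were read. */
    static final class Result {
        final List<String> sortKeys;
        final int itemsRead;

        Result(List<String> sortKeys, int itemsRead) {
            this.sortKeys = sortKeys;
            this.itemsRead = itemsRead;
        }
    }

    /** Key condition (begins_with on the sort key): only the matching range is read. */
    static Result queryWithKeyCondition(NavigableMap<String, String> partition, String prefix) {
        // Every key starting with the prefix sorts inside [prefix, prefix + '\uffff').
        Map<String, String> range =
                partition.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        return new Result(new ArrayList<>(range.keySet()), range.size());
    }

    /** Filter on a non-key attribute: the whole partition is read, then trimmed. */
    static Result queryWithFilter(NavigableMap<String, String> partition, String wantedValue) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> item : partition.entrySet()) {
            if (item.getValue().equals(wantedValue)) {
                kept.add(item.getKey());
            }
        }
        return new Result(kept, partition.size());
    }

    public static void main(String[] args) {
        // Sort key -> value of a made-up non-key attribute "status".
        NavigableMap<String, String> partition = new TreeMap<>();
        partition.put("A-1", "open");
        partition.put("A-2", "closed");
        partition.put("B-1", "open");

        Result byKey = queryWithKeyCondition(partition, "A-");
        Result byFilter = queryWithFilter(partition, "open");
        System.out.println(byKey.sortKeys + " read=" + byKey.itemsRead);
        System.out.println(byFilter.sortKeys + " read=" + byFilter.itemsRead);
    }
}
```

Both calls return two items here, but the filter version touches every item in the partition to do so.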

Cassandra driver querying with multiple keys

I'm trying to query Cassandra from a C++ application and return the values for a set of keys. I am using the DataStax driver described here: http://datastax.github.io/cpp-driver/api/
The cassandra query string is something like this:
SELECT value from my_table WHERE key IN (?);
If I prepare a separate query string for each number of parameters, I can use cass_statement_bind_string_n, but is there a way to use one query string regardless of the number of keys I wish to query?

There are several things here:
The syntax IN (?) means that you are always asking for only one item: your list has a single entry.
If you want to query multiple items, change the syntax to IN ? and bind it with cass_statement_bind_collection_by_name to a value of LIST type. See the docs on how to create collection types.
Using IN for a query on the partition key is really an anti-pattern: it adds load to the node performing the query and makes your queries slower, because the coordinating node has to send requests to other nodes, wait for the results, collect them, and send them back. It will be faster to issue a separate request for each partition key and collect the answers in your application.
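The last suggestion (one request per partition key, collected in the application) is a plain fan-out pattern. Below is a rough sketch of that pattern in Java rather than with the C++ driver, so nothing in it is a driver API: fetchOne is a hypothetical stand-in for "execute the prepared single-key statement asynchronously", and FanOutSketch and fetchAll are made-up names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class FanOutSketch {
    /**
     * Issues one asynchronous lookup per key and collects the answers.
     * fetchOne is an assumption: it represents a single-key query such as
     * "SELECT value FROM my_table WHERE key = ?" executed asynchronously.
     */
    static <K, V> Map<K, V> fetchAll(List<K> keys, Function<K, CompletableFuture<V>> fetchOne) {
        // Start every request before waiting on any of them.
        List<CompletableFuture<V>> pending = new ArrayList<>();
        for (K key : keys) {
            pending.add(fetchOne.apply(key));
        }
        // Collect the results in the same order as the keys.
        Map<K, V> results = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            results.put(keys.get(i), pending.get(i).join());
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake lookup that just echoes the key; a real one would hit the database.
        Function<String, CompletableFuture<String>> fakeLookup =
                key -> CompletableFuture.supplyAsync(() -> "value-for-" + key);
        System.out.println(fetchAll(List.of("k1", "k2", "k3"), fakeLookup));
    }
}
```

With the C++ driver the same shape would presumably use the driver's own asynchronous futures; the point is only that the single-key statement is prepared once and reused for any number of keys.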

DynamoDB 1 big table or multiple small tables?

I'm currently facing some questions regarding my database design. I'm developing an API which lets users do the following:
Create an Account ( 1 User owns 1 Account)
Create a Profile ( 1 Account owns 1-n Profiles)
Let a profile upload 2 types of items ( 1 Profile owns 0-n Items ; the items differ in type and purpose)
Calling the API methods triggers AWS Lambda to perform the requested operations in the DynamoDB tables.
My current plan looks like this:
It should be possible to query items by specifying a time frame and the Profile ID. But I think my design completely defeats the purpose of DynamoDB. The AWS documentation says that a well-designed product only requires one table.
What would be a good way to realise this architecture in one table?
Are there any drawbacks on using the current design?
What would you specify as Primary/Partition/sort key/secondary indexes in both the current design and a one-table-approach?

I’m going to give this answer assuming that you need to be able to do the following queries.
Given an Account, find all profiles
Given a Profile, find all Items
Given a Profile and a specific ItemType, find all Items
Given an Item, find the owning Profile
Given a Profile, find the owning account
One of the beauties of DynamoDB (and also a bane, perhaps) is that it is mostly schema-less. You need to have the mandatory Primary Key attributes for every item in the table, but all of the other attributes can be anything you like. In order to have a DynamoDB design with only one table, you usually need to get used to the idea of having mixed types of objects in the same table.
That being said, here’s a possible schema for your use case. My suggestion assumes that you are using something like UUIDs for your identifiers.
The partition key is a field that is simply called pkey (or whatever you want). We’ll also call the sort key skey (but again, it doesn’t really matter). Now, for an Account, the value of pkey is Account-{{uuid}} and the value of skey would be the same. For a Profile, the pkey value is also Account-{{uuid}}, but the skey value is Profile-{{uuid}}. Finally, for an Item, the pkey is Profile-{{uuid}} and the skey is Item-{{type}}-{{uuid}}. For all of the attributes of an item, don’t worry about it, just use whatever attributes you want to use.
Since the “parent” object is always the partition key, you can get any of the “child” objects simply by querying for the ID of the parent. For example, your key condition expression to get all the ‘ItemType2’s for a Profile would be
pkey = “Profile-{{uuid}}” AND begins_with(skey, “Item-Type2”)
For the reverse lookups, add a global secondary index (GSI) that has the same keys as the table, but reversed: skey is its partition key and pkey is its sort key. You can query the GSI for ‘Item-{{type}}-{{uuid}}’ to get the owning Profile, and similarly for ‘Profile-{{uuid}}’ to get the owning Account.
What I have illustrated here is the adjacency list pattern. DynamoDB also has an article describing how to use composite sort keys for hierarchical data, which would also be suitable for your data, and depending on your expected queries, it might be more suitable than using the adjacency list.
You don’t have to put everything in a single table. Yes, DynamoDB recommends it, but it is far more important to make sure that your application is correct and maintainable. If having multiple tables means it’s easier to write a defect free application, then use multiple tables.
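As a rough illustration of the key scheme described in this answer, here is an in-memory sketch in plain Java (no AWS SDK). It builds the pkey/skey strings the way the answer lays them out and emulates "pkey = :p AND begins_with(skey, :prefix)" over a sorted map. The class SingleTableSketch, its methods, and the short IDs used in place of UUIDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SingleTableSketch {
    // pkey -> (skey -> payload); each inner map is sorted like a DynamoDB partition.
    private final Map<String, TreeMap<String, String>> table = new TreeMap<>();

    static String accountKey(String id) { return "Account-" + id; }
    static String profileKey(String id) { return "Profile-" + id; }
    static String itemKey(String type, String id) { return "Item-" + type + "-" + id; }

    void put(String pkey, String skey, String payload) {
        table.computeIfAbsent(pkey, k -> new TreeMap<>()).put(skey, payload);
    }

    /** Emulates the key condition: pkey = :p AND begins_with(skey, :prefix). */
    List<String> query(String pkey, String skeyPrefix) {
        List<String> out = new ArrayList<>();
        TreeMap<String, String> partition = table.get(pkey);
        if (partition == null) {
            return out;
        }
        // Keys sharing a prefix are contiguous in sorted order, so stop at the first miss.
        for (String skey : partition.tailMap(skeyPrefix, true).keySet()) {
            if (!skey.startsWith(skeyPrefix)) {
                break;
            }
            out.add(skey);
        }
        return out;
    }

    public static void main(String[] args) {
        SingleTableSketch db = new SingleTableSketch();
        // "a1", "p1", "x" ... are placeholders for UUIDs.
        db.put(accountKey("a1"), accountKey("a1"), "account data");
        db.put(accountKey("a1"), profileKey("p1"), "profile data");
        db.put(profileKey("p1"), itemKey("Type1", "x"), "item data");
        db.put(profileKey("p1"), itemKey("Type2", "y"), "item data");
        db.put(profileKey("p1"), itemKey("Type2", "z"), "item data");

        System.out.println(db.query(accountKey("a1"), "Profile-"));   // profiles of an account
        System.out.println(db.query(profileKey("p1"), "Item-"));      // all items of a profile
        System.out.println(db.query(profileKey("p1"), "Item-Type2")); // items of one type
    }
}
```

The reverse lookups (Item to Profile, Profile to Account) would go through the GSI and are not modelled here.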

How can i query to get the multiple values in SimpleDB (AWS)

In the screenshot (not preserved here) I have highlighted one part. I have an attribute called "deviceModel" that can contain more than one value. I want to query my domain for the items (by ItemName()) whose deviceModel attribute has more than one value.
Thanks,
Senthil Raja

There is no direct way to get what you are asking for. You need to write your own code: running a SELECT query gives you each item's attribute-value pairs, so traverse each itemName() and count the values of the attribute you are interested in.
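The client-side counting could look roughly like the sketch below. It is plain Java with no SimpleDB SDK calls, and it assumes the SELECT results have already been read into a map from itemName() to a flat list of (attribute name, value) pairs, where a multi-valued attribute simply shows up as several pairs with the same name. MultiValueSketch and itemsWithMultipleValues are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MultiValueSketch {
    /**
     * Returns the names of the items for which attributeName occurs more than once.
     * Assumed input shape: itemName -> list of (attribute name, value) pairs.
     */
    static List<String> itemsWithMultipleValues(
            Map<String, List<Map.Entry<String, String>>> items, String attributeName) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<Map.Entry<String, String>>> item : items.entrySet()) {
            int count = 0;
            for (Map.Entry<String, String> attribute : item.getValue()) {
                if (attribute.getKey().equals(attributeName)) {
                    count++;
                }
            }
            if (count > 1) {
                matches.add(item.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample data invented for the example.
        Map<String, List<Map.Entry<String, String>>> items = new TreeMap<>();
        items.put("item1", List.of(
                Map.entry("deviceModel", "A"),
                Map.entry("deviceModel", "B")));
        items.put("item2", List.of(Map.entry("deviceModel", "A")));
        System.out.println(itemsWithMultipleValues(items, "deviceModel"));
    }
}
```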

I think what you are referring to is called multi-valued attributes. When you put a value in an attribute without replacing the existing value, the values accumulate, giving you an array of values under that attribute name.
How you create them depends on the SDK/language you are using for your REST calls; look for the Replace=true/false flag when you set the attribute's value.
Here is the documentation page on retrieving them: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/ (look under Using Amazon SimpleDB -> Using Select to Create Amazon SimpleDB Queries -> Queries on Attributes with Multiple Values)
