Power Query | Loop with delay - powerbi

I'm new to PQ and trying to do the following:
Get updates from the server.
Transform them.
Post data back.
While the code works just fine, I'd like it to be performed every N minutes until the application closes.
Also, the LastMessageId variable should be re-evaluated after each call of GetUpdates(), and I need to somehow call GetUpdates() again with it.
I've tried Function.InvokeAfter but didn't figure out how to run it more than once.
Recursion blows the stack, of course.
The only solution I see is to use List.Generate, but I struggle to understand how it can be used with a delay.
let
    // Get list of records (stub standing in for the real web call)
    GetUpdates = (optional offset as number) as list => {[update_id = 1]},
    Updates = GetUpdates(),
    // Store last update_id
    LastMessageId = List.Last(Updates)[update_id],
    // Prepare a response (stub standing in for the real transformation)
    Process = (item as record) as record => item,
    // Map the Process function to each item in the list of records
    Map = List.Transform(Updates, each Process(_))
in
    Map

Power BI does not support continuous automatic re-loading of data in the Desktop application.
Online, you can enforce a refresh as fast as every 15 minutes using DirectQuery [1].
Alternative methods:
You could do this in Excel and use VBA to re-execute the query on a schedule.
Streaming data in Power BI [2]
Streaming data with Flow and Power BI [3]
[1]: Supported DirectQuery Sources
[2]: Realtime Streaming in Power BI
[3]: Streaming data with Flow
[4]: Don't forget to enable historic logging!

Related

Handle 3 Pub/Sub messages coming at the same time, combine all 3 and store them in Firestore

I have 3 Pub/Sub-triggered Cloud Functions which receive 3 different messages. These are published at the same time:
Main Data.
Sub Data 1.
Sub Data 2.
These messages have to be written into Firestore based on some logic.
Intended goal:
Sub Data 1 and Sub Data 2 have to be combined, and the outcome should be inserted into the Main Data document. After combining Sub Data 1 & 2, it knows the main data path (Firestore document path) where it needs to attach itself.
Issues:
The Cloud Functions have to store their respective message data into Firestore before it gets attached to the main data. Also, the main data has to be inserted before Sub Data 1 & 2 are combined, so that the combined Sub Data 1 & 2 can be attached to the main data as an extra.
What I have tried:
I tried an orphan/parent storage logic. It works like this:
Sub Data 1 comes -> looks for Sub Data 2 in its orphaned path -> combines with it -> knows the main data path -> attaches to it; if the main data document is not available yet, it gets stored into the orphaned collection.
or
Sub Data 1 comes -> looks for Sub Data 2; Sub Data 2 is not available, so Sub Data 1 is stored into the orphaned collection.
When Sub Data 2 comes:
Sub Data 2 comes -> looks for Sub Data 1 -> combines with it -> knows the main data path -> attaches to it; if the main data document is not available yet, it is stored into the orphaned collection.
When the main data is inserted into Firestore, it will look for this sub data in the orphaned collection; if it is available, it will attach it.
Since these messages come at the same time, within milliseconds of each other, this logic is not working as expected.
There are two approaches worth considering here. If you know how long you need to wait, you can start functions that need to wait with a "sleep" function and just have them tread water for a bit (this is a bit hacky and could be costly at scale, but it works):
// Pause for the given number of milliseconds
async function sleep(milliseconds) {
  return new Promise((resolve) => setTimeout(resolve, milliseconds));
}

// inside an async function:
await sleep(5000);
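For context, here is a hypothetical sketch of how that could look in one of the Pub/Sub-triggered functions, using the sleep helper above to wait briefly before re-checking Firestore for its sibling message. The topic, collection and field names (sub-data-1, orphans, mainDataId) are placeholders for illustration, not from the question:
const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

exports.onSubData1 = functions.pubsub.topic("sub-data-1").onPublish(async (message) => {
  const data = message.json;

  // Give the sibling function a moment to write its document first
  await sleep(5000);

  const sibling = await admin
    .firestore()
    .collection("orphans")
    .doc(`subData2_${data.mainDataId}`)
    .get();

  if (sibling.exists) {
    // combine both sub data payloads and attach them to the main data document here
  } else {
    // store this message as an orphan so the other function can pick it up later
    await admin
      .firestore()
      .collection("orphans")
      .doc(`subData1_${data.mainDataId}`)
      .set(data);
  }
});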
Likely a better solution would be to ditch the Pub/Sub messages for the subsequent functions, switch them to HTTP functions instead, and have the first function call them via a simple fetch request.
From whichever function you want to call first:
// ... do some stuff, then fetch when ready
let res = await fetch(
  "https://us-central1-projectName.cloudfunctions.net/secondFunction",
  {
    method: "post",
    body: JSON.stringify(body),
    headers: {
      Authorization: `bearer ${token}`,
      "Content-Type": "application/json",
    },
  }
);
You can secure these functions with a bearer token, as documented here.
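As a rough sketch of the receiving side, the called HTTP function can reject requests whose bearer token doesn't match what it expects. The function name, the EXPECTED_TOKEN environment variable and the plain string comparison below are placeholders for illustration; the linked documentation describes the more robust approach of verifying a Google-signed ID token:
const functions = require("firebase-functions");

exports.secondFunction = functions.https.onRequest(async (req, res) => {
  // Pull the token out of the "Authorization: bearer <token>" header
  const authHeader = req.headers.authorization || "";
  const token = authHeader.replace(/^bearer\s+/i, "");

  if (!token || token !== process.env.EXPECTED_TOKEN) {
    res.status(403).send("Unauthorized");
    return;
  }

  const body = req.body; // already parsed when Content-Type is application/json
  // ... combine sub data and write it to Firestore here ...
  res.status(200).send("ok");
});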

How to query big data in DynamoDB following best practices

I have a scenario: query the list of students in a school, by year, and then use that information to do some other tasks, let's say printing a certificate for each student.
I'm using the Serverless Framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
  var params = {
    TableName: `schoolTable`,
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: {
      ':school_id': _school_id,
      ':year': _year,
    },
  };
  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      // Accumulate each page instead of overwriting the previous one
      _students = _students.concat(items.Items);
      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey != 'undefined');
    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};
const mainHandler = async (event, context) => {
  …
  let students = await queryStudent(body.school_id, body.year);
  await printCertificate(students);
  …
};
So far, it's working well with about 5k students (just sample data).
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As far as I know, Lambda has a limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best-practice approach for this scenario is very much appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather with the data (see the sketch after this list).
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes, and if it fails before it's completely through, you lose the information about how far you've come. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check if it exists on new Lambda invocations, and start from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function in which you do the querying. This won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic into your code.
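As a minimal sketch of the read-sharding idea mentioned above (the ${schoolId}#${shard} key layout, the shard count and the attribute names are assumptions for illustration, not something the question's table actually uses), a scatter-gather query could look roughly like this:
const NUM_SHARDS = 10; // must match the suffix range used when writing

const queryStudentsSharded = async (schoolId, year) => {
  // Scatter: one query per shard suffix of the partition key
  const perShard = await Promise.all(
    Array.from({ length: NUM_SHARDS }, (_, shard) =>
      dynamoClient
        .query({
          TableName: 'schoolTable',
          KeyConditionExpression:
            'partition_key = :pk AND begins_with(sort_key, :year)',
          ExpressionAttributeValues: {
            ':pk': `${schoolId}#${shard}`,
            ':year': year,
          },
        })
        .promise()
    )
  );
  // Gather: flatten the per-shard results into one list
  // (pagination within each shard is omitted here for brevity)
  return perShard.flatMap((page) => page.Items);
};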
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across sharded partition keys and writes each student into an SQS queue for certificate-printing tasks. If the Lambda function notices it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset (a sketch of this query-and-enqueue step follows below). The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
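A hypothetical sketch of that query-and-enqueue Lambda function; the queue URL environment variable, the event shape and the 60-second safety margin are assumptions for illustration:
const AWS = require('aws-sdk');
const dynamoClient = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

exports.handler = async (event, context) => {
  const params = {
    TableName: 'schoolTable',
    KeyConditionExpression: 'partition_key = :pk AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':pk': event.school_id, ':year': event.year },
    ExclusiveStartKey: event.lastEvaluatedKey, // undefined on the first invocation
  };

  let items;
  do {
    items = await dynamoClient.query(params).promise();

    // Enqueue one certificate-printing task per student (SQS batches max out at 10 entries)
    for (let i = 0; i < items.Items.length; i += 10) {
      await sqs.sendMessageBatch({
        QueueUrl: process.env.CERTIFICATE_QUEUE_URL,
        Entries: items.Items.slice(i, i + 10).map((student, idx) => ({
          Id: `${i + idx}`,
          MessageBody: JSON.stringify(student),
        })),
      }).promise();
    }

    params.ExclusiveStartKey = items.LastEvaluatedKey;

    // Checkpoint: hand the LastEvaluatedKey back to the Step Function
    // if we are getting close to the Lambda time limit
    if (items.LastEvaluatedKey && context.getRemainingTimeInMillis() < 60000) {
      return { ...event, done: false, lastEvaluatedKey: items.LastEvaluatedKey };
    }
  } while (items.LastEvaluatedKey);

  return { ...event, done: true, lastEvaluatedKey: null };
};
A Choice state in the Step Function can then loop back to this function while done is false.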
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out if a million students per school per year is a realistic number :-)

Update Dataflow streaming job with Session and Sliding window embedded in DF

In my use case, I'm performing a Session window as well as a Sliding window inside a Dataflow job. Basically, my sliding window duration is 10 hours with a sliding period of 4 minutes. Since I'm applying grouping and performing a max function on top of that, at every 3-minute interval the window will fire the pane, and it will go into the Session window with triggering logic on it. Below is the code for the same.
Window<Map<String, String>> windowMap = Window.<Map<String, String>>into(
    SlidingWindows.of(Duration.standardHours(10)).every(Duration.standardMinutes(4)));

Window<Map<String, String>> windowSession = Window
    .<Map<String, String>>into(Sessions.withGapDuration(Duration.standardHours(10)))
    .discardingFiredPanes()
    .triggering(Repeatedly
        .forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5))))
    .withAllowedLateness(Duration.standardSeconds(10));
I would like to add a logger on some steps for debugging, so I'm trying to update the current streaming job using the code below:
options.setRegion("asia-east1");
options.setUpdate(true);
options.setStreaming(true);
Previously I had around 10k elements, and I updated the existing pipeline using the above config; now I'm not able to see that much data in the steps of the updated Dataflow job. So help me understand whether the update preserves the previous job's data or not, as I'm not seeing the previous step counts in the updated job.

Timeout when deleting vertices by looping over a dataframe in Glue

I want to delete vertices by looping over one dataframe.
Suppose I will delete the vertex based on some columns of the dataframe.
My function is written in the following way, and it times out:
def delete_vertices_for_label(rows):
    conn = self.remote_connection()
    g = self.traversal_source(conn)
    for row in rows:
        entries = row.asDict()
        create_traversal = __.hasLabel(str(entries["~label"]))
        for key, value in entries.iteritems():
            if key == '~id':
                pass
            elif key == '~label':
                pass
            else:
                # build up the filter one property at a time
                create_traversal = create_traversal.has(key, value)
        g.V().coalesce(create_traversal).drop().iterate()
I have succeeded in using this function locally on TinkerGraph; however, when I try to run the above function in Glue, which manipulates data in AWS Neptune, it fails.
I also created a Lambda function, shown below, and it still hits the same timeout issue:
def run_sample_gremlin_basedon_property():
    remoteConn = DriverRemoteConnection('ws://' + CLUSTER_ENDPOINT + ":" +
                                        CLUSTER_PORT + '/gremlin', 'g')
    graph = Graph()
    g = graph.traversal().withRemote(remoteConn)
    create_traversal = __.hasLabel("Media")
    create_traversal.has("Media_ID", "99999")
    create_traversal.has("src_name", "NET")
    print("create_traversal:", create_traversal)
    g.V().coalesce(create_traversal).drop().iterate()
Dropping a vertex involves dropping its associated properties and edges as well, and hence, depending on the data, it could take a large amount of time. The drop step was optimized in one of the engine releases [1], so ensure that you are on a version newer than that. If you still get timeouts, set an appropriate timeout value on the cluster using the cluster parameter for timeouts.
Note: This answer is based on EmmaYang's communication with AWS Support. It looks like the Glue job was configured in a manner that needs a high timeout. I'm not familiar enough with Glue to comment more on that (Emma, can you please elaborate on that?).
[1] https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.1.0.200296.0.html

Event Log Oldest Record Number

I'm trying to use the new event log API to get the oldest record number from a Windows event log, but cannot get the API to return the same answer as Event Viewer displays (looking at the EventRecordID in the details). Some sample code I'm using is below:
EVT_HANDLE log = EvtOpenLog(NULL, _logName, EvtOpenChannelPath);
EVT_VARIANT buf;
DWORD need = 0;
int vlen = sizeof(EVT_VARIANT);
ZeroMemory(&buf, vlen);
EvtGetLogInfo(log, EvtLogOldestRecordNumber, vlen, &buf, &need);
UINT64 old = buf.UInt64Val;
EvtClose(log);
What the API appears to be doing is returning the record number of the oldest event in the log, but not the oldest accessible event... What I mean by that is, let's say you have 10 records in your log, 1-10, and you clear your log. The next 10 events inserted will be 11-20. If you use the API, it will return 1, not 11 like Event Viewer displays. If you try to retrieve event 1 using EvtQuery/EvtNext, it will fail and not return an event -- as I would expect.
Does anyone have experience with this method? What am I doing wrong? I have used the method successfully with other properties (e.g. EvtLogNumberOfLogRecords), but cannot get this property (EvtLogOldestRecordNumber) to behave as expected.
http://msdn.microsoft.com/en-us/library/aa385385(v=VS.85).aspx
I was not able to get the new API to work for the oldest record number and had to revert to using the legacy API to retrieve it.
msdn.microsoft.com/en-us/library/aa363665(VS.85).aspx