I've created a custom connector that uses an authorization flow to connect to third-party APIs, and I use an enterprise gateway to schedule the dataset refresh. The problem is that the old data is replaced by the new data after every refresh. The refresh is scheduled to run every hour, so by the end of the day I have lost all of the earlier data in my reports. As a workaround I created a push dataset, which I believe is backed by a database, and I use the REST API to push the refreshed data into it. Below is the code for that.
pushdataset = (data) =>
    let
        headers = [
            RelativePath = "https://api.powerbi.com",
            IsRetry = true,
            Headers = [#"Content-Type" = "application/json", Accept = "application/json"],
            Content = Json.FromValue(data)
        ],
        response = Web.Contents("/beta/77777/datasets/66789900/rows?key=ccccc", headers)
    in
        response;
expandTable = () =>
let
TableFromList = Table.FromList(reportEntries, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
ExpandColumn = Table.ExpandRecordColumn(TableFromList, "Column1", {"category"},{ "Column1.category"})
in
ExpandColumn
When I execute the connector I get an "Access is forbidden (403)" error. It seems like a simple HTTP request; I can reach the same dataset from Python code and from Postman.
I have been stuck on this for a long time. How do I connect to the push dataset from the custom connector? Also, if there are other ways to keep the existing data and append the new data to the dataset after every refresh, please let me know.
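For reference, this is roughly what the working call looks like from Python (dataset ID, push key, and column names are placeholders):
import requests

# Hypothetical push URL; in practice it is the dataset's "Push URL" from the Power BI service
PUSH_URL = "https://api.powerbi.com/beta/77777/datasets/66789900/rows?key=ccccc"

rows = {"rows": [{"Category": "Bike", "Total Item": 1}, {"Category": "Mobile", "Total Item": 2}]}

# POSTing JSON rows to the push dataset appends them to the existing data
response = requests.post(PUSH_URL, json=rows)
print(response.status_code, response.text)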
Example:
Scheduler runs at 9am
Data stored in the dataset
Category Total Item
Bike 1
Mobile 2
Scheduler runs at 10am
Data stored in the dataset
Category Total Item
Watch 10
Books 2
What is expected:
Category Total Item
Bike 1
Mobile 2
Watch 10
Books 2
Related
I created a Studio notebook in Kinesis Data Analytics, and in the legacy SQL analytics application I can see the MQTT data coming in, so I know the data is being received.
When I open the notebook in Apache Zeppelin, I create the table:
%flink.ssql
CREATE TABLE `ppgsignal0903` ( `timestamp` BIGINT,`[Heart Rate Measurement]` DOUBLE,
`[Energy Expended]` DOUBLE,
`RR-Interval` DOUBLE,
`iso_time` as TO_TIMESTAMP(FROM_UNIXTIME(`timestamp`)) )
WITH ( 'connector' = 'kinesis',
'stream' = 'PPG_PW',
'aws.region' = 'eu-central-1',
'scan.stream.initpos' = 'LATEST',
'format' = 'json' )
Data is coming in all the time. When I go to look at my table:
%flink.ssql(type=update)
SELECT * FROM ppgsignal0903;
I have the following error:
Fail to run sql command: SELECT * FROM ppgsignal0903
Unable to create a source for reading table
'hive.ppgdatabase.ppgsignal0903'.
Table options are:
'aws.region'='eu-central-1'
'connector'='kinesis'
'format'='json'
'scan.stream.initpos'='LATEST'
'stream'='PPG_PW'
Does anyone have a tip?
I need to do some analytics and manipulate the data in real-time charts (for example, heart beats per second, or the time between diastolic and systolic blood pressure over the last 10 minutes), so I need several separate paragraphs that I can run independently against the real-time data.
I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/
As I see it, what I need to do is the following (a rough sketch of the first two steps follows the list):
Download data from the API (https://github.com/opendata-stuttgart/meta/wiki/EN-APIs); in particular I have this endpoint in mind: https://data.sensor.community/static/v2/data.1h.json, the average of all measurements per sensor over the last hour. (Technology: Python bot)
Set up a bot to transform the data slightly, tailoring it to our needs. (Technology: Python)
Upload the data to a database. (Technology: Google Big-Query or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: Probably Dash in Python)
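A minimal sketch of steps 1 and 2, assuming the hourly JSON endpoint above; the field names come from the API's JSON, but the flattening itself is just illustrative:
import requests

DATA_URL = "https://data.sensor.community/static/v2/data.1h.json"

def fetch_and_flatten():
    """Download the hourly averages and keep only the fields needed for the dashboard."""
    records = requests.get(DATA_URL, timeout=60).json()
    rows = []
    for rec in records:
        for value in rec.get("sensordatavalues", []):
            rows.append({
                "sensor_id": rec["sensor"]["id"],
                "timestamp": rec["timestamp"],
                "latitude": rec["location"]["latitude"],
                "longitude": rec["location"]["longitude"],
                "value_type": value["value_type"],  # e.g. P1, P2, humidity
                "value": value["value"],
            })
    return rows

if __name__ == "__main__":
    print(len(fetch_and_flatten()), "rows fetched")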
My questions are the following.
1. Do you agree with my thought process, or would you change some element to make it more efficient?
2. What do you think about running a python script to transform the data? Is there any simpler idea?
3. Which technology would you suggest to set up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can write new queries there and get the results you want to analyze. I was checking the "data.1h.json" dataset, and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
id NUMERIC,
sampling_rate STRING,
timestamp TIMESTAMP,
location STRUCT<
id NUMERIC,
latitude FLOAT64,
longitude FLOAT64,
altitude FLOAT64,
country STRING,
exact_location INT64,
indoor INT64
>,
sensor STRUCT<
id NUMERIC,
pin STRING,
sensor_type STRUCT<
id INT64,
name STRING,
manufacturer STRING
>
>,
sensordatavalues ARRAY<STRUCT<
id NUMERIC,
value FLOAT64,
value_type STRING
>>
)
Ok, we have already created our table, so now we need to insert all the data from the JSON file into it. To do that, and since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, and transform it before uploading it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()
table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and point it at the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')

table = client.get_table(table_id)

# Download the contents of the blob as a string and then parse it using json.loads()
data = json.loads(blob.download_as_string(client=None))

# Partition the request in order to avoid reaching quotas
partition = len(data) // 4
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Insert any rows left over from the last partial batch
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors:
        print(errors)
As you can see above, I partition the requests in order to avoid exceeding the quota on the size of a single request; you can see the relevant quotas in [3].
Also, some data in the location field has empty values, so it is necessary to clean them to avoid insert errors.
And since your data is then stored in BigQuery, to create a new dashboard I would use Data Studio [4] to visualize the BigQuery data and build queries over the columns you want to display.
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio
I notice that when you look up a table in Data Catalog on Google Cloud Platform, it shows a statistic for the number of times the table has been queried:
Queried (Past 30 days): 5332
This is extremely useful information, and I was wondering where it is actually stored and whether it can be retrieved for all the tables in a project or a dataset.
I have trawled the Data Catalog tutorials and written some Python scripts, but these just return table entry names in an iterator, which is not what I am looking for.
Likewise, I cannot see this data in the INFORMATION_SCHEMA metadata.
You can retrieve the number of completed queries for any table or dataset by exporting log entries to BigQuery. Every query generates some logging in Stackdriver, so you can use advanced filters to select the logs you are interested in and store them as a new table in BigQuery.
However, the retention period for data-access logs in GCP is 30 days, so you can only export logs from the past 30 days.
For instance, use the following advanced filter to get the logs corresponding to all the completed jobs on a specific table:
resource.type="bigquery_resource" AND
log_name="projects/<project_name>/logs/cloudaudit.googleapis.com%2Fdata_access" AND
protoPayload.methodName="jobservice.jobcompleted"
"<table_name>"
Then select BigQuery as the sink service, and specify the dataset where the exported log table will be stored.
All completed jobs on this table performed after the sink is established will appear in a new table in BigQuery. You can query this table to get information about the logs (for instance, a COUNT over any column gives the total number of successful jobs).
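As a minimal sketch, assuming the sink writes into a dataset called logs_sink and that Cloud Logging creates a table named cloudaudit_googleapis_com_data_access there (both names are placeholders), the count could be retrieved like this:
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project/dataset/table names; replace them with the table created by your sink
query = """
    SELECT COUNT(*) AS completed_jobs
    FROM `my_project.logs_sink.cloudaudit_googleapis_com_data_access`
"""
for row in client.query(query).result():
    print("Completed jobs logged for the table:", row.completed_jobs)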
This information is available via the projects.locations.entryGroups.entries/get API. It is exposed as a UsageSignal and contains usage information for the past 24 hours, 7 days, and 30 days.
Sample output:
"usageSignal": {
"updateTime": "2021-05-23T06:59:59.971Z",
"usageWithinTimeRange": {
"30D": {
"totalCompletions": 156890,
"totalFailures": 3,
"totalCancellations": 1,
"totalExecutionTimeForCompletionsMillis": 6.973312e+08
},
"7D": {
"totalCompletions": 44318,
"totalFailures": 1,
"totalExecutionTimeForCompletionsMillis": 2.0592365e+08
},
"24H": {
"totalCompletions": 6302,
"totalExecutionTimeForCompletionsMillis": 25763162
}
}
}
Reference:
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries/get
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries#UsageSignal
With the Python Data Catalog client, you first need to search the Data Catalog; each search result contains a linked_resource.
Pass this linked_resource in a request to lookup_entry and you can fetch the "Queried (past 30 days)" count:
from google.cloud import datacatalog_v1

dc_client = datacatalog_v1.DataCatalogClient()

# Search scoped to your project (the project ID and query below are placeholders; adjust as needed)
scope = datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["my-project"])
request = datacatalog_v1.SearchCatalogRequest(scope=scope, query="system=bigquery type=table")

results = dc_client.search_catalog(request=request, timeout=120.0)
for result in results:
    linked_resource = result.linked_resource
    # Get the entry and the number of times the table was queried in the last 30 days
    table_entry = dc_client.lookup_entry(request={"linked_resource": linked_resource})
    queried_past_30_days = table_entry.usage_signal.usage_within_time_range.get("30D")
    if queried_past_30_days is not None:
        dc_num_queried_past_30_days = int(queried_past_30_days.total_completions)
    else:
        dc_num_queried_past_30_days = 0
I have an S3 bucket which is constantly being filled with new data, and I am using Athena and Glue to query that data. The issue is that if Glue doesn't know a new partition has been created, it doesn't search in it. Making an API call to run the Glue crawler every time I need a new partition is too expensive, so the best solution would be to tell Glue that a new partition was added, i.e. to create the partition in its table properties. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following.
Step 1 - Fetch the table information and parse the necessary information from it which is required to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
response = l_client.get_table(
CatalogId=l_catalog_id,
DatabaseName=l_database,
Name=l_table
)
except Exception as error:
logger.error("Exception while fetching table info for {}.{} - {}"
.format(l_database, l_table, error))
sys.exit(-1)
# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All entries have the same structure, but their partition-specific values (year, month, day, hour) change.
def generate_partition_input_list(start_date, num_of_days, table_location,
input_format, output_format, serde_info):
input_list = [] # Initializing empty list
today = datetime.utcnow().date()
if start_date > today: # To handle scenarios if any future partitions are created manually
start_date = today
end_date = today + timedelta(days=num_of_days) # Getting end date till which partitions needs to be created
logger.info("Partitions to be created from {} to {}".format(start_date, end_date))
for input_date in date_range(start_date, end_date):
# Formatting partition values by padding required zeroes and converting into string
year = str(input_date)[0:4].zfill(4)
month = str(input_date)[5:7].zfill(2)
day = str(input_date)[8:10].zfill(2)
for hour in range(24): # Looping over 24 hours to generate partition input for 24 hours for a day
hour = str('{:02d}'.format(hour)) # Padding zero to make sure that hour is in two digits
part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
input_dict = {
'Values': [
year, month, day, hour
],
'StorageDescriptor': {
'Location': part_location,
'InputFormat': input_format,
'OutputFormat': output_format,
'SerdeInfo': serde_info
}
}
input_list.append(input_dict.copy())
return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
create_partition_response = client.batch_create_partition(
CatalogId=catalog_id,
DatabaseName=l_database,
TableName=l_table,
PartitionInputList=each_input
)
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over them (a sketch of such a helper follows the reference link below).
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
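The break_list_into_chunks helper used in step 3 is not part of the Glue API; a minimal sketch of it could look like this:
def break_list_into_chunks(items, chunk_size):
    """Yield successive chunks of at most chunk_size elements from items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]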
You can configure your Glue crawler to be triggered every 5 minutes.
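For example, a minimal sketch using boto3 to put an every-5-minutes schedule on an existing crawler (the crawler name is a placeholder):
import boto3

glue = boto3.client("glue")

# Hypothetical crawler name; the cron expression runs the crawler every 5 minutes
glue.update_crawler(
    Name="my-existing-crawler",
    Schedule="cron(0/5 * * * ? *)",
)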
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function can call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old, but I wanted to point out that you could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives in S3. You could even extend the function to handle deprecations based on object deletes, and so on. Here is an AWS blog post which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
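A minimal sketch of such a Lambda, assuming the S3 keys are laid out as year/month/day/hour prefixes and that the partition is registered by running an Athena ALTER TABLE statement (table name, partition columns, and bucket names are placeholders):
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. "2021/05/23/06/part-0000.json"
        year, month, day, hour = key.split("/")[:4]
        location = f"s3://{bucket}/{year}/{month}/{day}/{hour}/"
        # IF NOT EXISTS keeps the call idempotent when several objects land in the same partition
        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE mytable ADD IF NOT EXISTS "
                f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
                f"LOCATION '{location}'"
            ),
            ResultConfiguration={"OutputLocation": "s3://some-bucket/_athena_results"},
        )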
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you avoid crawling all of the data again and again. From what I read, you can enable incremental crawls while setting up your crawler, or by editing an existing one. One thing to note, however, is that incremental crawls require the schema of the new data to be more or less the same as the existing schema.
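A minimal sketch of enabling this with boto3 on an existing crawler (the crawler name is a placeholder); CRAWL_NEW_FOLDERS_ONLY is the incremental-crawl behavior:
import boto3

glue = boto3.client("glue")

# Hypothetical crawler name; subsequent runs will only crawl newly added S3 folders
glue.update_crawler(
    Name="my-existing-crawler",
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)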
I'm looking for sample code which gets data from SQL Server and pushes it to Power BI in real time, basically using the push dataset option.
I am not sure how to push the data from SQL.
Thanks
Why not create a custom streaming dataset and 'push' your SQL data directly? In this case you can either use Power Apps (create a flow with a trigger on insert) or simply write some code to push your data in the form of a POST request.
For instance, say your SQL table contains a value you want to push. The steps would be the following:
Create a dashboard
Add a tile
Choose 'Custom Streaming Dataset' as the source
Define the data columns to be pushed (for instance train_number and departure_time)
Copy the API URL
From your code (Python, for example) get the data, convert it to JSON, and publish it via a POST request
Go back to Power BI, add a tile from the newly created streaming dataset, and choose the visual type. Important: the visuals are quite limited
Here is a sample code in python:
import time
import pandas as pd
import requests

PowerBI_REST_API_URL = ""  # paste the push URL copied in the "Copy the API" step

def data_generation(counter=None):
    # get your SQL data and save it into 2 variables (row by row);
    # placeholder values are used here for illustration
    train_number = 100 + (counter or 0)
    departure_time = time.strftime("%H:%M:%S")
    return [train_number, departure_time]

counter = 0
while True:
    data_raw = []
    # simple counter increment
    counter += 1
    for i in range(1):
        row = data_generation(counter)
        data_raw.append(row)
    # set the header record
    HEADER = ["train_number", "departure_time"]
    # generate a temp data frame to convert it to json
    data_df = pd.DataFrame(data_raw, columns=HEADER)
    # prepare data for the post request (to be sent to Power BI)
    data_json = bytes(data_df.to_json(orient='records'), encoding='utf-8')
    # Post the data to the Power BI API
    req = requests.post(PowerBI_REST_API_URL, data_json)
    print("Data posted to the Power BI API")
    print(data_json)
    # wait 5 seconds
    time.sleep(5)
Microsoft published a similar walk-through; it has to be slightly expanded with SQL Server calls though:
Push data into a Power BI dataset
---> Create Dataset
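A minimal sketch of that expansion, assuming pyodbc for the SQL Server call and the dataset's push-rows URL (server, database, column names, and URL are placeholders based on the other answers here):
import pyodbc
import requests

# Placeholders: connection details and the push-rows URL of your dataset
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword")
PUSH_ROWS_URL = "https://api.powerbi.com/beta/<group>/datasets/<dataset>/rows?key=<key>"

# Read the rows from SQL Server
with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT Date, Name, prdt FROM abc")
    rows = [{"Date": str(r.Date), "First Name": r.Name, "Production": r.prdt}
            for r in cursor.fetchall()]

# Push the rows to the Power BI dataset in one request
response = requests.post(PUSH_ROWS_URL, json={"rows": rows})
print(response.status_code)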
You can't 'push' data from SQL, but you can use DirectQuery instead of Import; then your data will always be up to date.
Just connect to the SQL Server, choose 'DirectQuery', and you'll be ready to go.
Edit:
As @Alexander Volok pointed out, with an application and/or API calls you can of course push data into Power BI. My bad.
You can push the data by using PowerShell. In the script below you need to add your API link (the push URL for your dataset) and your SQL connection string, and you can fire a query against the same dataset by declaring it in the code. Once you run the PowerShell script, the data will be pushed to the Power BI dataset and you can see your live streaming chart.
$SqlServer = '';   # your server name
$SqlDatabase = ''; # your database name
$uid = ''          # user id
$pwd = '*****'     # your password
$SqlConnectionString = 'Data Source={0};Initial Catalog={1};uid={2};Password={3}' -f $SqlServer, $SqlDatabase, $uid, $pwd;
$SqlQuery = "SELECT * FROM abc;";
$SqlCommand = New-Object System.Data.SqlClient.SqlCommand;
$SqlCommand.CommandText = $SqlQuery;
$SqlConnection = New-Object System.Data.SqlClient.SqlConnection -ArgumentList $SqlConnectionString;
$SqlCommand.Connection = $SqlConnection;
$SqlConnection.Open();
$SqlDataReader = $SqlCommand.ExecuteReader();

## you would find your own endpoint (push URL) in the Power BI service
$endpoint = "" ## add your API link between the quotes

# Fetch data from your table and push each row to the dataset
while ($SqlDataReader.Read()) {
    $payload = @{
        "Date"       = $SqlDataReader['Date']
        "First Name" = $SqlDataReader['Name']
        "Production" = $SqlDataReader['prdt']
    }
    Invoke-RestMethod -Method Post -Uri "$endpoint" -Body (ConvertTo-Json @($payload))
}
$SqlConnection.Close();
$SqlConnection.Dispose();

## every time you run the script, the data will automatically be pushed from SQL Server to your Power BI report