Join Pub/Sub data with BigQuery data and then save the result into BigQuery using the Dataflow SDK in Python - google-cloud-platform

I have a requirement to read streaming data from a Pub/Sub topic (PubSubTopic1), join it with a BigQuery table (BQTable1) using Dataflow, and then save the result into a new BigQuery table (ResultBQTable).
PubSubTopic1: has ItemID, UnitPrice
BQTable1: has ItemID, ItemName, OfferPrice
ResultBQTable: needs to have the columns ItemID, ItemName, UnitPrice, OfferPrice, TotalCost
I am able to create a Dataflow job using 'Dataflow SQL Workbench', but that is a one-off and I cannot automate it. Hence I want to write Python code using the Apache Beam SDK and Dataflow to automate this, so that it can be shared with anyone to implement the same thing.
I am new to Dataflow, so my approach might be tedious. Better and more optimal approaches are welcome.
I am thinking of the two things below, but do not know how to implement the second:
I can try to apply windowing to the Pub/Sub topic to read it in small time-based batches.
Can we read the PubSubTopic1 streaming data into one PCollection and the data from BQTable1 into another PCollection, and then join these?

Can we read the PubSubTopic1 streaming data into one PCollection and the data from BQTable1 into another PCollection, and then join these?
Yes, that's exactly the right way to be thinking about it!
The Side Input Patterns page on the Beam docs contains an example of enriching streaming data with a slowly-changing side input. This is a slightly modified version to match your input and output types:
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.transforms.window import TimestampedValue
from apache_beam.transforms import window
# from apache_beam.utils.timestamp import MAX_TIMESTAMP
# last_timestamp = MAX_TIMESTAMP to go on indefinitely

# Any user-defined function.
# A cross join is used as an example.
def cross_join(left, rights):
    for x in rights:
        yield (left, x)

# Create pipeline.
pipeline = beam.Pipeline()
side_input = (
    pipeline
    | 'PeriodicImpulse' >> PeriodicImpulse(
        first_timestamp, last_timestamp, interval, True)
    | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(table='mytable'))
main_input = (
    pipeline
    | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic='mytopic')
    | 'WindowMpInto' >> beam.WindowInto(
        window.FixedWindows(main_input_windowing_interval)))
result = (
    main_input
    | 'ApplyCrossJoin' >> beam.FlatMap(
        cross_join, rights=beam.pvalue.AsIter(side_input)))
result | beam.io.WriteToBigQuery(table='my_output_table')
result here is a windowed, unbounded PCollection, and Beam will use streaming inserts to send the data to BigQuery as windows are processed.
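If you need an actual key-based join rather than a cross join, one option is to key the BigQuery rows by ItemID and pass them as an AsDict side input. This is only a minimal sketch building on the pipeline above: the column names come from your question, the Pub/Sub messages are assumed to be JSON, and the TotalCost formula is a placeholder assumption.
import json

def enrich(element, offers_by_item):
    """Join one Pub/Sub message with the BigQuery side input on ItemID."""
    msg = json.loads(element)                 # assumes JSON-encoded messages
    offer = offers_by_item.get(msg['ItemID'])
    if offer is None:
        return                                # drop items with no BQ match
    yield {
        'ItemID': msg['ItemID'],
        'ItemName': offer['ItemName'],
        'UnitPrice': msg['UnitPrice'],
        'OfferPrice': offer['OfferPrice'],
        # TotalCost formula is an assumption -- replace with your business rule.
        'TotalCost': msg['UnitPrice'] + offer['OfferPrice'],
    }

enriched = (
    main_input
    | 'EnrichWithBQ' >> beam.FlatMap(
        enrich,
        offers_by_item=beam.pvalue.AsDict(
            side_input
            | 'KeyByItemID' >> beam.Map(lambda row: (row['ItemID'], row)))))

enriched | 'WriteResult' >> beam.io.WriteToBigQuery(
    table='my_project:my_dataset.ResultBQTable',  # placeholder table spec
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)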

Related

How to trigger Cloud Data Fusion from Airflow with dynamic parameters

I am trying to create a DAG in Airflow 2+ which will trigger multiple Data Fusion pipelines in parallel using the CloudDataFusionStartPipelineOperator.
However, I want to assign the parameter values (like pipeline name, runtime arguments, etc.) for each Data Fusion pipeline dynamically, based on the output of a previous Python task.
The flow I am trying is something like this:
start - read_bq - [df_1, ... df_n]
Here, read_bq is a Python task which will read the values from a BigQuery table as a list (values like pipeline name, runtime arguments, etc.).
Then, looping over that list, I will determine how many Data Fusion pipelines to trigger and assign the values returned from BigQuery to those pipelines.
The problem I am facing: the CloudDataFusionStartPipelineOperator does not have a task_instance option that can be used for an XCom pull, nor can I run a loop within the DAG by doing an XCom pull (as that only works inside a task).
Any technical help or suggestion is appreciated.
Thanks,
Santanu
If I understand your goal correctly, the main issue here is to create a DAG that will be dynamic based on the output of your BQ query.
Airflow has this functionality (though it's quite limited); it is called dynamic task mapping.
But there are a few limitations:
firstly, not all parameters are mappable (e.g. for BashOperator you can map bash_command but not task_id);
secondly, if you pass multiple parameters it will create a cross product.
I was able to solve the issue, but I am not happy with this solution, as the execution is not visible in the logs section of the Airflow UI:
from airflow.decorators import dag, task
from datetime import datetime
from airflow.operators.bash import BashOperator


@dag(
    schedule=None,
    start_date=datetime(2022, 10, 29, hour=8),
    catchup=False,
    tags=['stack'],
)
def dynamic_dag():
    """
    ### Template dag"""

    @task()
    def add(x, y):
        print(f'adding {x} to {y}')
        return x + y

    @task()
    def query_bq():
        return [('first', 'echo df1'), ('second', 'echo df2')]

    @task()
    def run_bash(inp):
        first, second = inp
        b = BashOperator(
            task_id=first,
            bash_command=second)
        b.execute(dict())

    # this is to show multiple parameters
    added_vals = add.expand(x=[2, 3], y=[3, 5])
    # this works as intended but leaves no logs in ui
    run_this = run_bash.expand(inp=query_bq())


dynamic_dag()
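If you want the Data Fusion runs to show up as proper mapped tasks (with their own logs) instead of being executed inside a wrapper task, another option is to expand the operator itself over the BigQuery output. This is only a sketch, assuming Airflow 2.4+ (for expand_kwargs) and the Google provider's CloudDataFusionStartPipelineOperator; the instance name, location and returned values are placeholders:
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator,
)


@dag(schedule=None, start_date=datetime(2022, 10, 29), catchup=False)
def fusion_fanout():

    @task()
    def read_bq():
        # Placeholder for the real BigQuery read: return one dict of operator
        # kwargs per Data Fusion pipeline that should be triggered.
        return [
            {'pipeline_name': 'df_1', 'runtime_args': {'param': 'a'}},
            {'pipeline_name': 'df_2', 'runtime_args': {'param': 'b'}},
        ]

    # One mapped task instance (with its own logs) is created per dict.
    CloudDataFusionStartPipelineOperator.partial(
        task_id='start_df_pipeline',
        instance_name='my-instance',   # placeholder
        location='europe-west1',       # placeholder
    ).expand_kwargs(read_bq())


fusion_fanout()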
As a last resort I would try to create two DAGs, a main one and a worker one. The main DAG would pass the number of Data Fusion pipelines (df_n) and all the params to the worker, something similar to here.
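A rough sketch of that main/worker split, assuming a separate worker DAG with the id df_worker that reads its parameters from dag_run.conf (all names here are hypothetical):
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


@dag(schedule=None, start_date=datetime(2022, 10, 29), catchup=False)
def main_dag():

    @task()
    def read_bq():
        # Placeholder: one conf dict per worker run, e.g. pipeline name
        # and runtime arguments read from BigQuery.
        return [{'pipeline_name': 'df_1'}, {'pipeline_name': 'df_2'}]

    # Trigger one df_worker run per conf dict; the worker DAG picks the
    # values up via dag_run.conf.
    TriggerDagRunOperator.partial(
        task_id='trigger_worker',
        trigger_dag_id='df_worker',
        wait_for_completion=True,
    ).expand(conf=read_bq())


main_dag()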

Scalable way to read large numbers of files with Apache Beam?

I'm writing a pipeline where I need to read the metadata files (500,000+ files) from the Sentinel2 dataset located in my Google Cloud Storage bucket with apache_beam.io.ReadFromTextWithFilename.
It works fine on a small subset, but when I run it on the full dataset it seems to block on "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json').
It doesn't even show up in the Dataflow jobs list.
The pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "Read Metadata" >> ReadFromTextWithFilename(
            f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json')
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
        )
    )
I'm wondering:
Is there a smarter way to read large numbers of files?
Will copying the metadata files into one folder be more performant? (How much more does it cost to traverse sub-folders, as opposed to files in one folder?)
Is the way to go to first match the file names with apache_beam.io.fileio.MatchAll and then read and extract them in one or two subsequent ParDos?
This is probably due to the pipeline running into Dataflow API limits when splitting the text source glob into a large number of sources.
The current solution is to use the ReadAllFromText transform, which should not run into this.
In the future we hope to update the ReadFromText transform for this case as well, by using the Splittable DoFn framework.
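A minimal sketch of that suggestion, using the same glob as in the question. Note that, unlike ReadFromTextWithFilename, ReadAllFromText emits only the lines and not the filenames, so ExtractMetaData may need to be adjusted:
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | "File pattern" >> beam.Create(
            [f'gs://{BUCKET}/{DATA_FOLDER}/**/*metadata.json'])
        # ReadAllFromText expands and reads the pattern as regular work items,
        # so it avoids splitting the glob into a huge number of sources up front.
        | "Read all files" >> ReadAllFromText()
        | "Extract metadata" >> beam.ParDo(ExtractMetaData())  # may need adapting: no filename here
    )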
It looks like I was suffering from a case of unwanted fusion.
Drawing inspiration from the page about file-processing on the Apache Beam website, I tried to add a Reshuffle to the pipeline.
I also upgraded to a paid Google Cloud account, thus getting higher quotas.
That resulted in Dataflow handling the job a lot better.
In fact, Dataflow wanted to scale to 251 workers for my BigQuery write job. At first it didn't provision more workers, so I stopped the job and set --num_workers=NUM_WORKERS and --max_num_workers=NUM_WORKERS, where NUM_WORKERS was the max quota for my project. When running with those parameters it scaled up automatically.
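For reference, a minimal way of setting those worker flags directly in the pipeline options from Python (all values below are placeholders):
from apache_beam.options.pipeline_options import PipelineOptions

NUM_WORKERS = 32  # placeholder: your project's worker quota
pipeline_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='europe-west1',               # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
    num_workers=NUM_WORKERS,
    max_num_workers=NUM_WORKERS,
)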
My final pipeline looks like this:
with beam.Pipeline(options=pipeline_options) as pipeline:
    meta = (
        pipeline
        | MatchFiles(f'gs://{BUCKET}/{DATA_FOLDER}/*metadata.json')
        | ReadMatches()
        | beam.Reshuffle()
        | beam.Map(lambda x: (x.metadata.path, x.read_utf8()))
        | beam.ParDo(ExtractMetaData())
    )
    table_spec = bigquery.TableReference(
        datasetId="sentinel_2",
        tableId="image_labels",
    )
    (
        meta
        | "Write To BigQuery" >> beam.io.WriteToBigQuery(
            table_spec,
            schema=table_schema(),
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE
        )
    )
Appendix
I also got a hint that Splittable DoFns might be a solution, but I have not tested it.

Computing GroupBy once then passing it to multiple transformations in Google DataFlow (Python SDK)

I am using the Python SDK for Apache Beam to run a feature extraction pipeline on Google Dataflow. I need to run multiple transformations, all of which expect items to be grouped by key.
Based on the answer to this question, Dataflow is unable to automatically spot and reuse repeated transformations like GroupBy, so I hoped to run GroupBy first and then feed the resulting PCollection to the other transformations (see sample code below).
I wonder if this is supposed to work efficiently in Dataflow. If not, what is the recommended workaround in the Python SDK? Is there an efficient way to have multiple Map or Write transformations take the results of the same GroupBy? In my case, I observe Dataflow scale to the maximum number of workers at 5% utilization and make no progress at the steps following the GroupBy, as described in this question.
Sample code. For simplicity, only 2 transformations are shown.
# Group by key once.
items_by_key = raw_items | GroupByKey()
# Write grouped items to a file.
(items_by_key | FlatMap(format_item) | WriteToText(path))
# Run another transformation over the same group.
features = (items_by_key | Map(extract_features))
Feeding the output of a single GroupByKey step into multiple transforms should work fine. But the amount of parallelization you can get depends on the total number of keys available in the original GroupByKey step. If any of the downstream steps is high fan-out, consider adding a Reshuffle step after those steps, which will allow Dataflow to further parallelize execution.
For example,
pipeline | Create([<list of globs>]) | ParDo(ExpandGlobDoFn()) | Reshuffle() | ParDo(MyReadDoFn()) | Reshuffle() | ParDo(MyProcessDoFn())
Here,
ExpandGlobDoFn: expands input globs and generates files
MyReadDoFn: reads a given file
MyProcessDoFn: processes an element read from a file
I used two Reshuffles here (note that Reshuffle has a GroupByKey in it) to allow (1) parallelizing the reading of files from a given glob and (2) parallelizing the processing of elements from a given file.
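Applied to the sample code in the question, that advice might look roughly like this (just a sketch; whether the extra Reshuffle helps depends on how high the fan-out of format_item really is):
import apache_beam as beam

items_by_key = raw_items | beam.GroupByKey()

# Branch 1: formatting is high fan-out (many lines per key), so reshuffle
# before the write to let Dataflow redistribute the work.
(items_by_key
 | beam.FlatMap(format_item)
 | beam.Reshuffle()
 | beam.io.WriteToText(path))

# Branch 2: one output element per key, usually fine without a Reshuffle.
features = items_by_key | beam.Map(extract_features)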
Based on my experience in troubleshooting this SO question, reusing GroupBy output in more than one transformation can make your pipeline extremely slow. At least this was my experience with Apache Beam SDK 2.11.0 for Python.
Common sense told me that branching out from a single GroupBy in the execution graph should make my pipeline run faster. After 23 hours of running on 120+ workers, the pipeline was not able to make any significant progress. I tried adding Reshuffles, using a combiner where possible, and disabling the experimental shuffle service.
Nothing helped until I split the pipeline into two. The first pipeline computes the GroupBy and stores it in a file (I need to ingest it "as is" into the DB). The second reads the file with the GroupBy output, reads additional inputs, and runs further transformations. The result: all transformations successfully finished in under 2 hours. I think that if I had just duplicated the GroupBy in my original pipeline, I would probably have achieved the same results.
I wonder whether this is a bug in the Dataflow execution engine or the Python SDK, or whether it works as intended. If it is by design, then it should at least be documented, and a pipeline like this should either not be accepted when submitted or should trigger a warning.
You can spot this issue by looking at the 2 branches coming out of the "Group keywords" step. It looks like the solution is to rerun GroupBy for each branch separately.
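A rough outline of the two-pipeline split described above (the answer does not show code for this; the paths, the keying step and parse_grouped_line are hypothetical placeholders):
import apache_beam as beam

# Pipeline 1: compute the grouping once and persist it.
with beam.Pipeline(options=pipeline_options) as p1:
    (p1
     | 'ReadRawItems' >> beam.io.ReadFromText('gs://my-bucket/raw/*')   # placeholder source
     | 'KeyItems' >> beam.Map(key_item)                                 # hypothetical keying function
     | 'GroupOnce' >> beam.GroupByKey()
     | 'Format' >> beam.FlatMap(format_item)
     | 'WriteGrouped' >> beam.io.WriteToText('gs://my-bucket/grouped/part'))

# Pipeline 2 (run after the first finishes): read the grouped output back
# and apply the remaining transformations.
with beam.Pipeline(options=pipeline_options) as p2:
    (p2
     | 'ReadGrouped' >> beam.io.ReadFromText('gs://my-bucket/grouped/part*')
     | 'Parse' >> beam.Map(parse_grouped_line)                          # hypothetical parser
     | 'Features' >> beam.Map(extract_features))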

Best way to read BigQuery Table

For reading from BigQuery and filtering the data, I have two ways:
Read the whole data from BigQuery in Dataflow (using BigQueryIO.readTableRows().from(ValueProvider)) and then filter on the basis of a condition, like the max date.
Read from BigQuery in Dataflow using NestedValueProvider, with a query that fetches only the required data; this is much slower.
There would be an issue if I read the whole data, since my table is in append mode, which will increase the reading time as the days pass.
But if I read only a particular date's data, my pipeline's reading time stays consistent.
However, for 200 records the NestedValueProvider approach takes much more time than reading the whole data using BigQueryIO.readTableRows().from(ValueProvider).
What am I missing? Can anyone help?
My snippet is below.
Snippet:
PCollection<TableRow> targetTable = input.apply("Read TRUSTED_LAYER_TABLE_DESCRIPTION",
    BigQueryIO
        .readTableRows()
        .withoutValidation()
        .withTemplateCompatibility()
        .fromQuery(NestedValueProvider.of(
            options.get(Constants.TABLE_DESCRIPTION.toString()),
            new QueryTranslator(options.get(Constants.ETL_BATCH_ID.toString()))))
        .usingStandardSql());
Nested Value Provider Class Snippet:
public class QueryTranslator implements SerializableFunction<String, String> {

    /**
     * Read data with max etlbatchid from query.
     */
    ValueProvider<String> etlbatchid;

    public QueryTranslator(ValueProvider<String> etlbatchid) {
        this.etlbatchid = etlbatchid;
    }

    private static final long serialVersionUID = -2754362391392873056L;

    @Override
    public String apply(String input) {
        String batchId = this.etlbatchid.get();
        if (batchId.equals("-1"))
            return String.format("SELECT * from `%s`", input);
        else
            return String.format("SELECT * from `%s` where etlbatchid = %s;", input, batchId);
    }
}
Depending on your use case, either of the two ways can be employed, and you should consider the pros and cons of each before choosing.
The first one (reading the whole table) will be very fast, as Dataflow can easily split the workload into multiple shards and process them in parallel, hence the speed. The downside is that the cost is likely to be higher due to intensive CPU use.
The second option is expected to be slower due to the extra work BigQuery has to perform, but it will be more cost effective. The con of this option is that you will probably hit one or several BigQuery quotas and limits, which will require elaborate coding to work around.
You can also check whether you can implement these examples for reading the whole table, using a string query, and using a filter method (inspired by this StackOverflow thread).
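For illustration, here is a minimal Beam sketch of the "read the whole table, then filter" variant. It is in Python even though the question uses the Java SDK, and the table spec and batch id are placeholders; the etlbatchid column name comes from the snippet above:
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p
        | 'ReadWholeTable' >> beam.io.ReadFromBigQuery(
            table='my_project:my_dataset.my_table')        # placeholder table
        | 'FilterBatch' >> beam.Filter(
            lambda row: str(row['etlbatchid']) == '123')    # placeholder batch id
    )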

Joining inputs for a complicated output

I'm new to Azure Stream Analytics. I'm using it to get feedback from users. I'm sending about 50 events per second to Azure, and I'm trying to get a combined result from two inputs, but I couldn't get a working output. My problem is the SQL query for the output.
These are the inputs I'm currently sending.
Recommandations:
{"appId":"1","sequentialId":"28","ItemId":"1589018","similaristyValue":"0.104257207028537","orderId":"0"}
ShownLog:
{"appId":"1","sequentialId":"28","ItemId":"1589018"}
I need to join them on sequentialId and ItemId and calculate the difference between two ordered sums, as in the example below.
For example: I send 10 Recommandations events and after that (say, 2 seconds later) I send 3 ShownLog events. What I need to do is get the sum of the first 3 events' similarityValue (because I sent 3 ShownLog events), ordered by "orderId", from "Recommandations". I also need to get the sum of similarityValues from "ShownLog". At the end I need an output like this (for every sequential ID):
sequentialID Difference
168 1.21
What I've done so far: I save all the inputs to my Azure SQL database, and I've managed to write the SQL I want. You can find the MSSQL query below:
declare @sumofSimValue float;
declare @totalItemCount int;
declare @seqId float;

select
    @sumofSimValue = sum(b.[similarityValue]),
    @totalItemCount = count(*),
    @seqId = a.sequentialId
from EventHubShownLog a
inner join EventHubResult b on a.sequentialId = b.sequentialId and a.ItemId = b.ItemId
group by a.sequentialId

--select @sumofSimValue, @totalItemCount, @seqId

SELECT @seqId, SUM([similarityValue]) - @sumofSimValue
FROM (
    SELECT TOP(@totalItemCount) [similarityValue]
    FROM [EventHubResult]
    WHERE sequentialId = @seqId
    ORDER BY orderId
) AS T
But it gives lots of errors in Stream Analytics, and it doesn't really fit the logic of Azure Stream Analytics. I hope I have explained the problem.
Can you tell me how I can do such a job for my system? How can I use time windows, or how can I join the inputs properly?
For every shown log, you have to select the sum of the similarity values. Is that the intention? Why not just join and select the sum? It would only select as many rows as there are shown logs.
One thing to decide is the maximum time difference between recommendation events and shown-log events; with that, you can use an Azure Stream Analytics join: https://msdn.microsoft.com/en-us/library/azure/dn835026.aspx