Interprocedural dataflow analysis with Roslyn

Is it possible to perform data flow analysis at an interprocedural level with Roslyn? As far as I can tell, the data flow API only performs analysis within the context of a single block.

Related

Connecting data from BigQuery to Cloud Functions to perform NLP

I wish to perform sentiment analysis using the Google Natural Language API.
I found documentation that performs sentiment analysis directly on a file located in Cloud Storage: https://cloud.google.com/natural-language/docs/analyzing-sentiment#language-sentiment-string-python.
However, the data I am working with is located in BigQuery instead. How do I read the data directly from a BigQuery table to do the sentiment analysis?
An example of the BigQuery table schema:
I wish to do NLP on the tweet columns of the table.
I tried to search for documentation on this but could not find anything.
I would appreciate any help or references. Thank you.
You can take a look at BigQuery Remote Functions, which provide a direct integration with Cloud Functions and Cloud Run. The columns returned by a BigQuery SQL query can be passed to the remote function, and custom code can be executed as per the requirements. Please note that Remote Functions are still in preview and might not be suitable for production systems.
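For illustration, here is a minimal sketch of the Cloud Function side of such a remote function, assuming a single STRING column (the tweet text) is passed in and a sentiment score is returned. The function name, and the use of functions_framework and the google-cloud-language client, are my own assumptions rather than part of the answer above:

```python
# main.py - hypothetical HTTP Cloud Function backing a BigQuery remote function.
import json

import functions_framework
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()


@functions_framework.http
def sentiment(request):
    # BigQuery remote functions POST a JSON body with a "calls" array
    # (one inner list of column values per row) and expect a "replies"
    # array of the same length in the response.
    calls = request.get_json()["calls"]
    replies = []
    for (text,) in calls:
        document = language_v1.Document(
            content=text, type_=language_v1.Document.Type.PLAIN_TEXT)
        result = client.analyze_sentiment(request={"document": document})
        replies.append(result.document_sentiment.score)
    return json.dumps({"replies": replies})
```

Once deployed, the function would be registered in BigQuery with a CREATE FUNCTION ... REMOTE WITH CONNECTION statement whose OPTIONS clause points at the function's HTTPS endpoint, and could then be called in SQL directly over the tweet column.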
This should be fairly straightforward to do with Dataflow - you could write a pipeline that reads from BigQuery, followed by a DoFn that uses Google's NLP libraries, and then writes the results back to BigQuery.
Some wrappers are already provided for you in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/ml/gcp/naturallanguageml.py
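As a rough sketch of that pipeline shape (calling the Natural Language API directly from a DoFn rather than via the wrapper module above), assuming a hypothetical tweets table with an id and a tweet column:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import language_v1


class AnalyzeSentiment(beam.DoFn):
    """Calls the Natural Language API for each row's tweet text."""

    def setup(self):
        # One client per DoFn instance, created on the worker.
        self.client = language_v1.LanguageServiceClient()

    def process(self, row):
        document = language_v1.Document(
            content=row["tweet"],  # hypothetical column name
            type_=language_v1.Document.Type.PLAIN_TEXT)
        sentiment = self.client.analyze_sentiment(
            request={"document": document}).document_sentiment
        yield {"id": row["id"],
               "tweet": row["tweet"],
               "sentiment_score": sentiment.score,
               "sentiment_magnitude": sentiment.magnitude}


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
               query="SELECT id, tweet FROM `my_project.my_dataset.tweets`",
               use_standard_sql=True)
         | "AnalyzeSentiment" >> beam.ParDo(AnalyzeSentiment())
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my_project:my_dataset.tweets_with_sentiment",
               schema=("id:STRING,tweet:STRING,"
                       "sentiment_score:FLOAT,sentiment_magnitude:FLOAT"),
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))


if __name__ == "__main__":
    run()
```

The Dataflow runner, project, region and temporary locations would be supplied as standard pipeline options on the command line; the table names and schema are placeholders.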

How can I use the Google Natural Language API to enrich data in a BigQuery table?

I want to use data stored in a BigQuery table as input to Google's Natural Language API, perform entity extraction and sentiment analysis, and persist the results back to BigQuery. What tools/services could I use to handle this in GCP? Performance is not a concern, and running this as an overnight batch would be acceptable for this use case.
This should be fairly straightforward to do with Dataflow - you could write a pipeline that reads from BigQuery, followed by a DoFn that uses Google's NLP libraries, and then writes the results back to BigQuery.
Some wrappers are already provided for you in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/ml/gcp/naturallanguageml.py

Will IGC allow me to trace where data has been sourced from or how data is being consumed, for any ETL or Data Transformation Tool?

As part of our Governance initiative and regulatory requirements, we need to produce a Lineage (traceability) report outlining the flow of data into our Warehouse and the Reports or Services consuming its data. We are aware that Information Governance Catalog can produce such a report automatically when DataStage is writing data to the Warehouse. Can Information Governance Catalog do the same when we use SQL scripts or other tooling to read or write information to our Warehouse? Can I view a complete Lineage report that incorporates such different information?
What are the steps within IGC to document or otherwise define the usage of information to support Data Lineage and Regulatory reporting?
Yes. While we can automate the production of Lineage (traceability) reports for DataStage, IGC also offers a facility to document the flow of data for other data movement scripts, tools or processes. This will produce the same Lineage reports, which can be used to satisfy compliance needs, or to build confidence and trust in the use or consumption of data.
At its simplest, IGC allows one to draft a Mapping Document: essentially a spreadsheet that delineates the Data Source and Data Target, with documentation to support the transformation, aggregation or other logic. The spreadsheet can be authored directly in IGC, or loaded from Excel (text file), which further supports automation of the process. Documentation for Extension Mapping Documents can be found here: https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.5.0/com.ibm.swg.im.iis.mdwb.doc/topics/c_extensionMappings.html (though we suggest creating such a document from IGC and exporting the results to Excel).
In addition, IGC supports a more formal process for extending the Catalog and introducing new types of Assets. This goes one step further, properly documenting and cataloging the Data Processes (SQL commands, other ETL tooling) and mapping the data movement through those Processes. This allows users to identify the Data Process and even to include operational data (as is supported for IGC). More information on this process can be found here: https://www-01.ibm.com/support/docview.wss?uid=swg21699130
We suggest reviewing the absolute requirements and what information is needed for the resulting traceability report. Starting with the Extension Mapping Document should suffice, and would be the simplest to implement while driving immediate benefit.

Google Dataprep integration with message brokers

Is it possible to read from Kafka or Google Pub/Sub in a Dataprep job?
If so, are there any 'best practice' deployment considerations I should expect when samples are edited in an "oh so snappy, live and responsive" Visual Studio-like environment (minus the ability to purchase or download the tool), whereas debugging the production flow (on the same "type" of data) is performed on top of anything but such tools (coding Scala/Java in our favorite IDE)?
There is no native way to read from a messaging system, like Kafka or Pub/Sub, directly into Cloud Dataprep.
I'd recommend an alternative approach:
1. Stream the data into BigQuery and then read the data from BQ.
2. Write the streamed data to Cloud Storage and then load the data from there.
Both approaches require writing the data to an intermediate location first. I'd recommend BQ if you need low latency, performance, or query-ability in the future; I'd recommend GCS for lower cost when speed is not critical.
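As a rough illustration of the first option, a small streaming Beam (Python) job run on Dataflow could move Pub/Sub messages into a staging BigQuery table that Dataprep then reads. The subscription, table and schema below are made-up placeholders, and the messages are assumed to be JSON objects matching that schema:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # streaming=True makes this an unbounded (streaming) pipeline.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/my-sub")
         # Each Pub/Sub payload is assumed to be a JSON object whose keys
         # match the BigQuery schema below.
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:staging.events",  # table Dataprep would then read
               schema="event_id:STRING,payload:STRING,ts:TIMESTAMP",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))


if __name__ == "__main__":
    run()
```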

Cache a dataset in Dataflow

I was wondering whether I can directly cache a dataset on the Google Dataflow platform (like caching RDDs in Spark).
If there is no such feature, how does Dataflow pick hot datasets in an application, especially if there are multiple hot datasets and you want to prioritize caching based on the importance of the datasets?
Dataflow has a very different execution model than Spark. In Spark, the central concept is an RDD and the typical mode of working with an RDD is to interactively query it in unpredictable ways; hence, RDDs need caching, potentially controllable by the user.
In Dataflow (Apache Beam), the central concept is a Pipeline, built and optimized and executed as a monolithic whole, where PCollection (the closest analogue to RDD) is merely a logical node in the pipeline.
Both of these approaches have their advantages, but with Dataflow's approach, Dataflow knows exactly how a PCollection will be used in the pipeline, so there is no unpredictability involved and there is no need for a caching strategy.
Dataflow currently materializes some intermediate PCollections in temporary files on Google Cloud Storage, trying to avoid materialization whenever possible by using fusion. If a PCollection is materialized, then a pipeline stage that processes this collection will need to read it from Cloud Storage; otherwise (if the stage is fused with the stage producing the dataset), it will be processing elements of the dataset in-memory, immediately as they are produced, co-located on the worker that produces them.
GroupByKey operations and the like (e.g. Combine) are special: Dataflow has several implementations of GroupByKey, differing between batch and streaming pipelines; they either use local disk on the VMs to store the data, or use high-performance Google-internal infrastructure.
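To make the fusion behaviour concrete, here is a small Beam Python sketch of my own (not from the answer above): the element-wise steps before the GroupByKey can typically be fused and processed in memory, while the GroupByKey introduces the shuffle/materialization boundary described above.

```python
import apache_beam as beam

# Runs on the DirectRunner by default; the fusion behaviour described above
# applies when the same pipeline is run on Dataflow.
with beam.Pipeline() as p:
    pairs = (p
             | "Create" >> beam.Create(["the quick brown fox",
                                        "jumps over the lazy dog"])
             # No shuffle between these steps, so the runner can fuse them:
             # each element flows through both transforms in memory on the
             # worker that produced it, without being materialized.
             | "Split" >> beam.FlatMap(str.split)
             | "PairWithOne" >> beam.Map(lambda word: (word, 1)))

    # GroupByKey (and Combine transforms built on it) is a fusion boundary:
    # data is redistributed by key via the runner's shuffle rather than
    # kept in the producing worker's memory.
    (pairs
     | "Group" >> beam.GroupByKey()
     | "Count" >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
     | "Print" >> beam.Map(print))
```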