I recently read about a new open-source project from Google, hosted on Google Code, that lets you perform data mining and analytics on various input files. There was even a video showing a user importing an Excel file and filtering on various conditions. However, I cannot find it now, even after looking for several hours. Does anyone know the name of this project?
Original
Google Refine http://code.google.com/p/google-refine/
Update (2013/12/02)
Google Refine is now OpenRefine http://openrefine.org/
Please note that since October 2nd, 2012, Google no longer actively supports this project, which has been rebranded as OpenRefine. Project development, documentation and promotion are now fully supported by volunteers.
Repository location:
https://github.com/OpenRefine
For Classification and Prediction, Google Prediction API
For operations on very large data sets, BigQuery
For Data Cleaning and Transformation, Google Refine
Related
Assuming a proper database management system (e.g., BigQuery) is used to store input data in Step 1, what would be an efficient approach to automating the workflow outlined below, and what would be some ideal tools / programming languages to use?
For context, I am not a software developer by trade and currently use Excel to handle the following invoicing workflow:
1. Store all client booking & utility data for multiple properties;
2. Calculate appropriate utility charges due for each individual client;
3. Pull the data & calculations to a well-formatted “Utility Statement” tab within the same Excel workbook (changing the client name will automatically update the calculations); and
4. For every single client:
   a. Update the Utility Statement calculations (by pasting in a new client name);
   b. Export the worksheet as a PDF;
   c. Rename the PDF file; and
   d. Email the PDF file to the appropriate client
The problem here pertains to the time and work required to complete Step 4 (currently an entire workday), which increases linearly as the number of clients increases. An important consideration is that I do have an Adobe subscription (perhaps the Adobe API is relevant here) and have access to most of the Google Cloud Platform tools (e.g., BigQuery, Cloud Composer, etc.)
What have I done to research and solve this problem myself?
I have spent many hours researching an ideal solution to this problem, but have been consistently overwhelmed due to my lack of familiarity with the data and software engineering fields. Out-of-the-box PDF generation software with hooks to Airtable costs an excessive amount of money for what it does and lacks the robust formatting capabilities provided by Microsoft Office products like Excel. I am also not confident that these tools would handle the entirety of this workflow. I am now consulting Stack Overflow (i.e., a community of data engineers & software developers) as a last resort, because I feel as though a well-educated engineer could guide me down the right path.
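The per-client loop in Step 4 is the part that lends itself most directly to scripting. Below is a minimal Python sketch, assuming the booking and utility data can be exported as rows with the hypothetical fields `client`, `email`, `kwh_used` and `rate_per_kwh` (these names, the flat usage-times-rate billing rule and the sample data are all assumptions, not taken from the workbook described above). It computes each charge and builds one uniquely named statement per client; rendering to PDF and sending the email are left as a comment because they depend on the library or service chosen.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass
class Statement:
    client: str
    email: str
    filename: str
    body: str


def utility_charge(kwh_used: str, rate_per_kwh: str) -> Decimal:
    """Charge = usage x rate, rounded to cents (hypothetical billing rule)."""
    amount = Decimal(kwh_used) * Decimal(rate_per_kwh)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def build_statements(rows, period: str):
    """Build one statement per client row, replacing the manual paste/export/rename loop."""
    statements = []
    for row in rows:
        charge = utility_charge(row["kwh_used"], row["rate_per_kwh"])
        safe_name = row["client"].strip().replace(" ", "_")
        body = (
            f"Utility Statement for {row['client']} ({period})\n"
            f"Usage: {row['kwh_used']} kWh at {row['rate_per_kwh']} per kWh\n"
            f"Amount due: {charge}\n"
        )
        statements.append(
            Statement(row["client"], row["email"], f"{period}_{safe_name}.pdf", body)
        )
        # A real script would render `body` to a PDF here and email it to row["email"].
    return statements


# Made-up sample rows; in practice these would come from the database or a CSV export.
rows = [
    {"client": "Jane Doe", "email": "jane@example.com", "kwh_used": "120.5", "rate_per_kwh": "0.20"},
    {"client": "John Roe", "email": "john@example.com", "kwh_used": "80", "rate_per_kwh": "0.20"},
]
for s in build_statements(rows, "2024-01"):
    print(s.filename, "->", s.email)
```

Because each client is handled inside one loop iteration, adding clients adds rows to the input rather than manual work.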
I wish to perform sentiment analysis using the Google Natural Language API.
I found documentation that performs sentiment analysis directly on a file located in Cloud Storage: https://cloud.google.com/natural-language/docs/analyzing-sentiment#language-sentiment-string-python.
However, the data I am working with is located in BigQuery instead. How do I read the data directly from a BigQuery table to run the sentiment analysis?
An example of the BigQuery table schema:
I wish to do NLP on the tweet columns of the table.
I searched for documentation on this but could not find anything.
I would appreciate any help or references. Thank you.
You can take a look at BigQuery Remote Functions, which provide a direct integration with Cloud Functions and Cloud Run. The columns returned from BigQuery SQL can be passed to a remote function, which can then run custom code as required. Please note that Remote Functions are still in preview and might not be suitable for production systems.
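To make the request/response shape concrete, here is a minimal Python sketch of the body-handling part of such a function. BigQuery sends a JSON body with a `calls` array (one list of arguments per row) and expects a JSON body with a `replies` array of the same length and order. The function name `handle_remote_function` and the word-list scorer `toy_sentiment` are hypothetical placeholders; a real deployment would wrap this in a Cloud Functions or Cloud Run HTTP handler and call the Natural Language API instead of the toy scorer.

```python
import json


def toy_sentiment(text: str) -> float:
    """Placeholder scorer (hypothetical); a real function would call the Natural Language API."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return hits / len(words)


def handle_remote_function(request_body: str, scorer=toy_sentiment) -> str:
    """Map a BigQuery remote-function request body to its response body.

    BigQuery sends {"calls": [[arg, ...], ...]} and expects {"replies": [...]}
    with exactly one reply per call, in the same order.
    """
    calls = json.loads(request_body)["calls"]
    # NULL column values arrive as JSON null; pass them through unchanged.
    replies = [scorer(args[0]) if args[0] is not None else None for args in calls]
    return json.dumps({"replies": replies})


request = json.dumps({"calls": [["I love this"], ["bad bad day"], [None]]})
print(handle_remote_function(request))
```

Batching many rows into a single `calls` array is what keeps the per-row overhead low, so the handler should process the whole array rather than expect one row per request.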
This should be fairly straightforward to do with Dataflow - you could write a pipeline that reads from BigQuery followed by a DoFn that uses Google's NLP Libraries, and then writes the results to BigQuery.
Some wrappers are already provided for you in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/ml/gcp/naturallanguageml.py
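The pipeline described above could look roughly like the following Python sketch. The table names, the `tweet` column and the output schema are assumptions for illustration, and the sketch calls the Natural Language client directly instead of using the wrappers linked above. The Beam and client-library imports sit inside `run()` so that the small row-shaping helper can be tried without those packages installed; nothing is launched until `run()` is called with suitable pipeline options (runner, project, temp location).

```python
def to_output_row(row: dict, score: float, magnitude: float) -> dict:
    """Shape one BigQuery output row from an input row and its sentiment values."""
    return {"tweet": row["tweet"], "score": score, "magnitude": magnitude}


def run(argv=None):
    # Local imports: these packages are only needed when the pipeline actually runs.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import language_v1

    class AnalyzeSentiment(beam.DoFn):
        def setup(self):
            # One API client per worker instance, not one per element.
            self.client = language_v1.LanguageServiceClient()

        def process(self, row):
            document = language_v1.Document(
                content=row["tweet"],
                type_=language_v1.Document.Type.PLAIN_TEXT,
            )
            response = self.client.analyze_sentiment(request={"document": document})
            sentiment = response.document_sentiment
            yield to_output_row(row, sentiment.score, sentiment.magnitude)

    with beam.Pipeline(options=PipelineOptions(argv)) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromBigQuery(
                # Hypothetical source table.
                query="SELECT tweet FROM `my-project.my_dataset.tweets`",
                use_standard_sql=True,
            )
            | "Score" >> beam.ParDo(AnalyzeSentiment())
            | "Write" >> beam.io.WriteToBigQuery(
                # Hypothetical destination table.
                "my-project:my_dataset.tweet_sentiment",
                schema="tweet:STRING,score:FLOAT,magnitude:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


print(to_output_row({"tweet": "hello"}, 0.5, 0.5))
```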
I need to automatically extract raw data of a PowerBI visualisation across multiple published reports.
Why not just pull the underlying dataset? Because the visualisations are using anomaly detection features of PowerBI, which include anomaly flags not available in the underlying dataset (basically, the visualisations contain calculated columns that are not included in main PowerBI data model)
Ideally a REST API solution would be best, but dumping CSV files or other more roundabout methods are ok.
So far, the closest functionality I can see is in the Javascript API here - https://learn.microsoft.com/en-us/javascript/api/overview/powerbi/export-data, which allows a website to communicate with an embedded PowerBI report and pass in and out information. But this doesn't seem to match my implementation needs.
I have also seen this https://learn.microsoft.com/en-us/azure/cognitive-services/anomaly-detector/tutorials/batch-anomaly-detection-powerbi which is to manually implement anomaly detection via Azure Services rather than the native PowerBI functionality, however this means abandoning the simplicity of the PowerBI anomaly function that is so attractive in the first place.
I have also seen the StackOverflow question "PowerBI Report Export in csv format via Rest API", which mentions using XMLA endpoints. However, the client applications don't seem to be able to connect to visualisations. For example, I tried DAX Studio and it doesn't appear to have any way to query the data at the visualisation level.
I'm afraid all information on PowerBI says this is not possible. The API only supports PDF, PPTX and PNG options, and as such the integration with Power Automate doesn't do any better.
The StackOverflow question you link has some information on retrieving the Dataset but that's before the anomaly detection has processed the data.
I'm afraid your best bet is to, indeed, use the Azure service. I'd suggest ditching PowerBI and going to an ETL tool like DataFactory or even into the AzureML propositions Microsoft offers. You'll be more flexible than in PowerBI as well since you'll have the full power of Python/R notebooks at your disposal.
Sorry I can't give you a better answer.
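If you do move the detection into a Python notebook, the flags become ordinary columns you can export. As a rough illustration only, here is a trailing-window z-score rule. This is a deliberately simple stand-in and not the algorithm PowerBI or the Azure Anomaly Detector uses; the function name, window size, threshold and sample series are all assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=5, threshold=3.0):
    """Flag points further than `threshold` standard deviations from the trailing-window mean."""
    flags = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 2:
            flags.append(False)  # not enough history to estimate the spread
            continue
        mu, sigma = mean(history), stdev(history)
        flags.append(sigma > 0 and abs(value - mu) > threshold * sigma)
    return flags


# Made-up series with one obvious spike.
series = [10, 11, 10, 11, 10, 11, 50, 10, 11]
for value, is_anomaly in zip(series, flag_anomalies(series)):
    print(value, is_anomaly)
```

The resulting flags will not match what the PowerBI visual shows, so this only helps if reproducing PowerBI's exact output is not a requirement.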
We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
You can probably use Dataflow to help with this problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
Its programming model was open sourced as the Apache Beam project, and Dataflow is fully compatible with that SDK. This allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts: a PCollection, which is basically the data being handled by the tool, and pipelines, which describe the steps needed to read the data, the transformations to perform, and how and where the results should be written.
It provides support for Java, Python and Go, and a rich feature set and variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow provides support for reading, writing and deleting data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could also be interesting: 1 2.
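The core of such a pipeline is a small function applied to every entity independently, which is what lets the runner spread the work across workers instead of iterating in series. Here is a minimal Python sketch of that per-entity step on plain dictionaries. The field names and the `schema_version` marker are hypothetical; in a Beam pipeline the same logic would be applied to each entity's properties in a `beam.Map` step placed between `ReadFromDatastore` and `WriteToDatastore`.

```python
def migrate_entity(props: dict) -> dict:
    """Return a copy of an entity's properties in the new shape.

    Hypothetical change: rename `name` to `display_name` and stamp a `schema_version`.
    """
    migrated = dict(props)
    if "name" in migrated:
        migrated["display_name"] = migrated.pop("name")
    migrated.setdefault("schema_version", 2)
    return migrated


# Locally this is a plain loop; the second entity is already in the new shape.
entities = [{"name": "Widget", "price": 5}, {"display_name": "Gadget", "price": 7}]
print([migrate_entity(e) for e in entities])
```

Writing the step so that it can safely run twice on the same entity is useful here, because a failed batch job over millions of entities is usually just rerun.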
I would presume that you have to loop through each one and update it, as it's a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that loop through all the records and update them.
I'm working on a project where I am tasked with using Google Cloud services to process and visualize fitness data. For example, I have exported some Apple Health data from my watch, and it is in .xml format. At a high level, I envision this .xml file starting off in object storage, being converted to .csv by a Cloud Function (triggered by the creation of the .xml object in storage), and being stored again in object storage (in a different bucket). These .csv files would then be processed by a Dataflow pipeline, which reformats the data to match the template schema I would like the data to follow. This pipeline will write the resulting data to BigQuery, which will then be designated as a data source for Data Studio. I will then configure Data Studio to produce some simple reports that compare the health data to recommended values. I would also like this report to be accessible as a .pdf in object storage, if possible. Am I on the right track, or am I missing some key services to accomplish this?
Also, I'm new to posting on StackOverflow, so if this question is against the rules or not welcome, please let me know.
Any feedback is greatly appreciated, as I have not been able to bounce these ideas off of other experienced cloud architects/developers.
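For reference, the .xml to .csv step I have in mind for the Cloud Function would be something like the sketch below. It assumes the export consists of `Record` elements carrying `type`, `unit`, `value`, `startDate` and `endDate` attributes; that layout and the sample values are assumptions for illustration, and the Cloud Storage trigger, download and upload are left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["type", "unit", "value", "startDate", "endDate"]


def health_xml_to_csv(xml_text: str) -> str:
    """Flatten <Record> elements into CSV rows, one column per attribute in FIELDS."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in root.iter("Record"):
        # Missing attributes become empty cells rather than raising an error.
        writer.writerow({field: record.get(field, "") for field in FIELDS})
    return out.getvalue()


# Made-up sample in the assumed export layout.
sample = """<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" unit="count/min" value="62"
          startDate="2024-01-01 08:00:00 +0000" endDate="2024-01-01 08:00:00 +0000"/>
  <Record type="HKQuantityTypeIdentifierStepCount" unit="count" value="340"
          startDate="2024-01-01 08:00:00 +0000" endDate="2024-01-01 08:10:00 +0000"/>
</HealthData>"""
print(health_xml_to_csv(sample))
```

For a large export, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document into memory.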
This question is currently off-topic under StackOverflow's rules, as it does not contain a specific problem to resolve. See points 4-5.
As high-level advice, I do not see why it should not be possible with the services you mentioned, but you would need to implement it, try it on your side, and evaluate the features of each service in your workflow.
In terms of solution or architecture advice, those are generally paid services, and you would most likely find little help here unless you have a specific problem to solve with said services. You might find some help elsewhere on the internet as well, e.g. Cloud Solutions, Built it on GCP, etc.
You might find this interesting to review as well as it mimics your solution. Hope this helps.