I am working on an application where we receive CSV files from a government department, each with approximately 1.5 million rows, on a monthly basis. We have to load this data into Azure Table Storage. We are trying to avoid provisioning VMs for this and are wondering whether WebJobs are a good choice for such a large dataset?
Thanks.
Yes, they should work. WebJobs are nothing more than a process running on the website's machine.
You'll probably want to turn on the "Always On" feature if your WebJob will take a long time to complete.
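Since a WebJob is just a console process, the load itself is ordinary batch-insert code. Here's a minimal sketch, assuming the azure-data-tables Python SDK, a connection string in an environment variable, and placeholder file/table/partition names; Table Storage transactions are limited to 100 entities that share a PartitionKey, so rows are written in chunks of 100:

```python
import csv
import os
from itertools import islice

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient  # pip install azure-data-tables

CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
TABLE_NAME = "MonthlyImport"   # hypothetical table name
PARTITION = "2024-01"          # hypothetical partition scheme: one partition per monthly file
BATCH_SIZE = 100               # Table Storage transactions are capped at 100 entities


def chunks(rows, size):
    """Yield successive lists of `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def load(csv_path: str) -> None:
    table = TableClient.from_connection_string(CONN_STR, table_name=TABLE_NAME)
    try:
        table.create_table()
    except ResourceExistsError:
        pass

    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        row_num = 0
        for chunk in chunks(reader, BATCH_SIZE):
            ops = []
            for row in chunk:
                # All entities in one transaction must share the same PartitionKey.
                entity = {"PartitionKey": PARTITION, "RowKey": str(row_num), **row}
                ops.append(("upsert", entity))
                row_num += 1
            table.submit_transaction(ops)


if __name__ == "__main__":
    load("monthly_export.csv")  # placeholder path
```

At roughly 1.5 million rows that's around 15,000 transactions, which is exactly the kind of long-running work where "Always On" matters.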
I'm looking for some advice on the best / most cost effective solutions to use for my use case on Google Cloud (described below).
Currently, I'm using Cloud Composer, and it's way too expensive. It seems like this is the result of Composer always running, so I'm looking for something that either isn't constantly running or is much cheaper to run / can accomplish the same thing.
Use case / process: I have a process set up that follows the steps below:
There is a site built with Firebase that has a file drop / upload (CSV) functionality to import data into Google Storage
That file drop triggers a cloud function that starts the Cloud Composer DAG
The DAG moves the CSV from Cloud Storage to BigQuery while also performing a bunch of modifications to the dataset using Python / SQL queries.
Any advice on what would potentially be a better solution?
It seems like Dataflow might be an option, but I'm pretty new to it and wanted a second opinion.
Appreciate the help!
If your file is not too big, you can process it with Python and a pandas DataFrame; in my experience this works very well with files of around 1,000,000 rows.
Then, with the BigQuery API, you can upload the transformed DataFrame directly into BigQuery, all within your Cloud Function. Remember that a Cloud Function can run for at most 9 minutes. Best of all, this approach costs next to nothing.
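A minimal sketch of that pattern, assuming a Cloud Function triggered by a Cloud Storage finalize event; the project, dataset, and table names are placeholders:

```python
import pandas as pd
from google.cloud import bigquery  # pip install google-cloud-bigquery pandas pyarrow gcsfs


def load_csv_to_bq(event, context):
    """Cloud Function entry point for a GCS 'finalize' trigger."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    df = pd.read_csv(uri)  # reading gs:// paths requires gcsfs

    # ...apply whatever transformations you need, e.g. tidy up column names...
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # placeholder destination
    job = client.load_table_from_dataframe(
        df,
        table_id,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()  # wait for the load; everything must fit within the 9-minute limit
```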
Was looking into it recently myself. I'm pretty sure Dataflow can be used for this case, but I doubt it will be cheaper (also considering time you will spend learning and migrating to Dataflow if you are not an expert already).
Depending on the complexity of the transformations you apply to the file, you can look into data integration solutions such as https://fivetran.com/, https://www.stitchdata.com/, https://hevodata.com/, etc. They are mainly built to just transfer your data from one place to another, but most of them can also perform some transformations on the data. If I'm not mistaken, in Fivetran the transformations are SQL-based and in Hevo they are Python-based.
There's also this article about scaling Composer nodes up and down: https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60 . Maybe it will help you save some cost. I didn't notice any significant cost reduction, to be honest, but maybe it works for you.
The data is a few hundred MB up to a few GB. The job might run some BigQuery procedures and, at the end, a SELECT. The resulting values need to be transferred as a valid CSV to an SFTP server.
Cloud Functions could be problematic because of the 9-minute timeout limit and the 2 GB RAM limit.
Is there a serverless solution or do I have to run manual instances?
There are two scenarios I would consider:
Export the table with standard BigQuery export options (here) into a Cloud Storage bucket. Then you can pick it up and upload it to the SFTP server with Cloud Run; there are containers built for this, e.g. this one (a rough sketch of the idea follows after this list).
Run a pipelining project. Considering you just want a simple export, I would suggest Dataflow. You can write a small piece of Python or Java code to pick up the file and upload it to SFTP. If you would like more complex processing logic, have a look at Dataproc.
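For scenario 1, a rough sketch of what the Cloud Run side could look like, with placeholder bucket, table, and SFTP details (rather than any specific pre-built container):

```python
import paramiko                                # pip install paramiko
from google.cloud import bigquery, storage    # pip install google-cloud-bigquery google-cloud-storage

PROJECT = "my-project"                         # placeholder project
TABLE = "my-project.my_dataset.my_results"     # table produced by your BQ procedures
BUCKET = "my-export-bucket"                    # placeholder bucket
GCS_URI = f"gs://{BUCKET}/export-*.csv"        # wildcard lets BigQuery shard large exports
SFTP_HOST, SFTP_PORT = "sftp.example.com", 22  # placeholder SFTP endpoint


def export_and_upload():
    # 1. Export the BigQuery table to Cloud Storage as CSV (sharded if large).
    bq = bigquery.Client(project=PROJECT)
    bq.extract_table(TABLE, GCS_URI).result()

    # 2. Push each exported shard from GCS to the SFTP server.
    gcs = storage.Client(project=PROJECT)
    transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
    transport.connect(username="user", password="secret")  # use Secret Manager in practice
    sftp = paramiko.SFTPClient.from_transport(transport)
    for blob in gcs.list_blobs(BUCKET, prefix="export-"):
        local_path = f"/tmp/{blob.name}"
        blob.download_to_filename(local_path)
        sftp.put(local_path, f"/upload/{blob.name}")
    sftp.close()
    transport.close()
```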
Report authoring in Power BI is done in Power BI Desktop, which is installed on users' workstations. Report sharing in Power BI is done in the Power BI cloud service (either shared or dedicated capacity). This means that different resources (i.e., memory, CPU, disk) are available during report authoring and report sharing, particularly for data load (dataset refresh). So, it seems impossible to test a report's data load / ETL performance prior to releasing to production (i.e., publish to the cloud service). And, usually, data load performance is faster in the cloud service than in Desktop. Because my reports contain a lot of data and transformations, data loads in Desktop can take a long time. How can I make the resources available to Desktop identical to the resources in the cloud service, so that I can reduce data load times in Desktop (during development) and to predict performance in the cloud service?
Perhaps a better question to ask is, should I even be doing this? That is, should I be trying to predict (in Desktop) a report's refresh performance in the cloud service (and / or load production-level data volumes into Desktop during development)?
Microsoft do not specify what hardware (CPU/memory) is used in the Power BI Service. It is also a shared service, so more than one Power BI tenancy could be hosted on the same cluster. They do mention that you may suffer from noisy-neighbour issues, so if some other tenancy is hitting it hard, your performance may suffer.
I know from experience that the memory available is greater than 25 GB, as queries that would not run on Premium P1 nodes have run OK in the shared service. With dedicated nodes, you can use the admin reports to see what's going on in the background: query times, refresh times, and CPU/memory usage.
There are a few issues with trying to performance-test Desktop vs. the Service. For example, a SQL query in Desktop will run twice: first to check the structure and data, then again to get the data. This doesn't happen when the report is deployed to the Service, so in that example your load will be quicker.
If you are accessing on-premises data, it will be quicker in Desktop than in the Service, as the Service has to go via a gateway. Also, if you are connecting to an Azure SQL Database, the connection and bandwidth between the Azure services will be slightly quicker when you deploy to the Service than from a Desktop connection, as the data has to travel outside the data centre to get to you.
So, for imported datasets, you can look at the dataset refresh start and end times and work out how long the refresh took.
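One way to get those timings programmatically is the Power BI REST API's refresh-history endpoint; a small sketch, with the dataset ID and access token left as placeholders (token acquisition, e.g. via MSAL, is out of scope here):

```python
import requests  # pip install requests

DATASET_ID = "00000000-0000-0000-0000-000000000000"    # placeholder dataset ID
TOKEN = "<access token with Dataset.Read.All scope>"   # placeholder token

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}/refreshes?$top=10"
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

for refresh in resp.json()["value"]:
    # startTime/endTime let you work out how long each service-side refresh took.
    print(refresh["status"], refresh.get("startTime"), refresh.get("endTime"))
```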
For a baseline test, generate 1 million rows of data; it doesn't have to be complex (a quick generator script is sketched below). Test the load time in Desktop a few times to get an average, then deploy and try it in the Service. Then keep adding 1 million rows to see if there is a linear relationship between the row count and the time taken.
However, it will not be a fully like-for-like comparison, since it depends on the type of data, the location, and the network speed, but it should give you a fair indication of any performance increase you may get in the Service compared to your desktop spec.
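If it helps, here's a quick sketch for generating that kind of throwaway test data; the column names and value ranges are arbitrary:

```python
import csv
import random


def write_rows(path: str, rows: int) -> None:
    """Write a CSV with `rows` rows of simple dummy data."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "category", "amount", "flag"])
        for i in range(rows):
            writer.writerow([i, f"cat_{i % 50}", round(random.uniform(0, 1000), 2), i % 2])


# 1M, 2M, 3M... rows, to check whether load time scales linearly.
for millions in (1, 2, 3):
    write_rows(f"baseline_{millions}m.csv", millions * 1_000_000)
```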
At some point I developed a tool that uses Microsoft's PowerBI-Tools-For-Capacities under the hood.
We are running into quota limits for a small dataset (less than 1 GB) in BigQuery. Google Cloud gives us no indication of which queries are running on the backend, which means we cannot tune the setup. We have a BigQuery dataset and a dashboard built in Data Studio that queries that dataset.
I've used relational databases like Oracle in the past, and they have excellent tooling to diagnose issues. But with BigQuery, I feel like I am staring into the dark.
I'd appreciate any help/pointers you can give.
The concurrent queries limit refers to the number of statements that are executed simultaneously in BigQuery. The quota limit for on-demand, interactive queries is 100 concurrent queries (updated).
Based on this, it seems that your Data Studio dashboard is hitting this quota when running your reports, in which case it is suggested to redesign the dashboard so that it avoids exceeding those limits.
Additionally, you can use the bq ls -j -a PROJECTNAME command to list the jobs that have been run in your project in order to identify the queries you need to work with, as mentioned by Elliott Brossard.
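If you prefer the client library over the CLI, the same listing can be done with the BigQuery Python client; a small sketch (the project ID is a placeholder):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID
for job in client.list_jobs(max_results=50, all_users=True):
    if job.job_type == "query":
        # created/ended give you the wall-clock time of each query.
        print(job.job_id, job.user_email, job.state, job.created, job.ended)
```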
We are trying to run an ETL process in a High I/O instance on Amazon EC2. The same process locally, on a very well equipped laptop (with an SSD), takes about 1/6th the time. This process basically transforms data (30 million rows or so) from flat tables into a third-normal-form schema in the same Oracle instance.
Any ideas on what might be slowing us down?
Or another option is to simply move off of AWS and rent beefy boxes (raw hardware) with SSDs in something like Rackspace.
We have moved most of our ETL processes off of AWS/EMR. We host most of them on Rackspace and get a lot more CPU/storage/performance for the money. Don't get me wrong, AWS is awesome, but there comes a point where it's not cost-effective. On top of that, you never know how they are really managing/virtualizing the hardware underneath your specific application.
My two cents.