Kettle hangs on post data using web service step - web-services

I am using Kettle to bulk load data and I am running into an issue with the Web Service step. After a few thousand calls the Web Service step becomes unresponsive: only the time counter keeps increasing, but no progress is made. I can see that all previous steps have finished and the transformation is stuck at the Web Service step. I am looking for a solution or workaround to this problem.

I have found the solution to this problem. The Web Service step was becoming a bottleneck and therefore required some tweaks to the transformation. I followed the approach described at the following link:
http://type-exit.org/adventures-with-open-source-bi/2010/06/parallel-processing-in-pentaho-kettle-jobs/
and now the job completes. You just need to set "Nr of rows in rowset" on the transformation to manage the buffer in front of the bottleneck step, and run multiple copies (threads) of the bottleneck step to get past it.
See the attached screenshots.

Related

Is there a way to compute this amount of data and still serve a responsive website?

Currently I am developing a Django + React website that will (I hope) serve a decent number of users. The project demo is mostly complete, and I am starting to think about the scale required to put this thing into production.
The website essentially does three things:
Grab data from external APIs (e.g. Twitter) for 50,000 unique keywords (the keywords don't change). This process happens every 30 minutes.
Run computation on all of the data, and save the results to the database. Assume that the algorithm is as optimized as possible.
When a user visits the website, it should serve a pretty graph/chart of the computed data for each keyword.
The issue is that this is far too intensive a task to be done by the same application that serves the website; users would be waiting decades to see their data. My current plan is to build a separate API that supplies the website with the data, which the website can then store in its database. This separate API would process the data without fear of affecting users, and it should be able to finish its computation in under 30 minutes, in time for the next round of data.
Can anyone help me understand how I can better equip my project to handle the scale? I'd love some ideas.
As a 4th-year CS student I figured it's time to put a real project out into the world, and I am very excited about it and the progress I've made so far. My main worry is that the end users will be negatively affected if I don't figure out some kind of pipeline to make this process happen.
To reiterate my idea:
Django + React - This is the forward facing website
External API - Grabs the data off the internet and processes it, and waits for a GET request from the website
Is there a better way to do this? Or, on the other hand, am I severely overestimating how computationally heavy this is?
Edit: Including current research
Handling computationally intensive tasks in a Django webapp
Separation of business logic and data access in django
What you want is to have the computation task executed by a different process in the background.
The most straightforward and popular solution is to use Celery, see here.
The Celery worker(s), which perform the background task, can either run on the same machine as the web application or, when scale becomes an issue, you can change the configuration so that they run on an entirely different machine.
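For concreteness, here is a minimal sketch of what that could look like, assuming a Django project named myproject and a hypothetical keywords app with a refresh_keyword_data task; the names, the 30-minute beat schedule, and the task body are illustrative, not taken from the question.

```python
# myproject/celery.py -- minimal Celery setup for a Django project (sketch).
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# Run the heavy job every 30 minutes via Celery beat.
app.conf.beat_schedule = {
    "refresh-keyword-data": {
        "task": "keywords.tasks.refresh_keyword_data",
        "schedule": 30 * 60,  # seconds
    },
}


# keywords/tasks.py -- the background task the web views never wait on.
from celery import shared_task


@shared_task
def refresh_keyword_data():
    """Pull data from the external APIs, run the computation, and save the
    results to the database; the Django views only read precomputed rows."""
    # Placeholder for the actual fetch-and-compute pipeline.
    ...
```

The worker and the beat scheduler run as separate processes (e.g. celery -A myproject worker and celery -A myproject beat), so the web requests never block on the computation.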

General guidance around Bigtable performance

I'm using a single-node Bigtable cluster for my sample application running on GKE. The autoscaling feature has been incorporated into the client code.
Sometimes I experience slowness (>80 ms) for GET calls. To investigate this further, I need some clarity on the following Bigtable behaviour.
I have cached the Bigtable table object to ensure faster GET calls. Is the table object persistent on GKE? I have learned that objects are not persistent on Cloud Functions. Should I expect similar behaviour on GKE?
I'm using service-account authentication, but how frequently do auth tokens get refreshed? I have seen frequent refresh logs from the gRPC Java client. I think Bigtable won't be able to serve requests during this token-refresh period (4-5 seconds).
What if the client machine/instance doesn't scale enough? Will that cause slowness for GET calls?
Bigtable client libraries use connection pooling. How frequently do connections/channels close themselves? I have learned that connections are closed after minutes of inactivity (>15 minutes or so).
I'm planning to read only the needed columns instead of the entire row. This can be achieved by specifying the row key as well as a column qualifier filter. Can I expect some performance improvement from not reading the entire row?
According to the official GCP docs, you can find here the causes of slower Bigtable performance. I would suggest going through those docs, as they might be helpful. Also see Troubleshooting performance issues.
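On the last point (reading only the needed columns), here is a minimal sketch of that filter pattern. It uses the google-cloud-bigtable Python client rather than the Java client mentioned in the question, and the project/instance/table ids and the "stats"/"clicks" family/qualifier names are placeholders.

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Create the client and table object once (e.g. at module level) and reuse
# them across requests, so each GET call avoids connection setup cost.
client = bigtable.Client(project="my-project", admin=False)
instance = client.instance("my-instance")
table = instance.table("my-table")

# Read a single row, but fetch only the columns we actually need.
only_needed_columns = row_filters.RowFilterChain(
    filters=[
        row_filters.FamilyNameRegexFilter("stats"),
        row_filters.ColumnQualifierRegexFilter(b"clicks"),
    ]
)
row = table.read_row(b"row-key-123", filter_=only_needed_columns)
if row is not None:
    cell = row.cells["stats"][b"clicks"][0]
    print(cell.value)
```

Filtering server-side reduces the bytes sent over the wire; how much latency it saves depends on how wide the rows are.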

Why is Django on Google App Engine very slow?

I have a Django server deployed on Google App Engine. I am making a simple GET request that takes around 2 seconds, while the same request takes around 300 ms when run locally. Both servers use the same MySQL database on Google Cloud SQL. I am testing this on my home Wi-Fi (100 Mbps), so I don't think it's a network issue; in any case the payload is pretty small (2.5 KB).
Has anyone seen this slowness when deploying to Google App Engine? Is there any configuration change I could make to App Engine that would make it faster?
Any suggestions are welcome.
Thanks!
When comparing Google App Engine's performance with your local environment, you should keep in mind that an instance on GAE needs extra time to import all the necessary libraries and set up the Django framework.
Here, it is stated that the instance startup time for the Standard environment is up to seconds and for Flexible up to minutes. Additionally, I found some Stack Overflow posts that shed some light on this here and here.
You may profile your application by using Cloud Trace to analyze the requests and isolate what causes the issue so that you may improve it afterwards.
In addition to that, there are various ways to optimize your application's performance; typical ones include:
Scaling configuration, by setting "min_idle_instances" so that idle instances are kept running and ready to serve traffic.
Using warmup requests to reduce request and response latency while your app's code is being loaded to a newly created instance (see the sketch after this list).
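As a rough illustration of those two settings, a standard-environment app.yaml could look something like the following; the runtime, entrypoint, and values are placeholders, and the app still needs to implement a handler for /_ah/warmup.

```yaml
# app.yaml (sketch) -- illustrative values only.
runtime: python39
entrypoint: gunicorn -b :$PORT myproject.wsgi

inbound_services:
  - warmup                  # enables /_ah/warmup requests

automatic_scaling:
  min_idle_instances: 2     # keep idle instances warm and ready to serve traffic
```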
Furthermore, here and here you may find the official Running Django on the App Engine environments tutorials so that you can spot any details you may have missed.
Finally, during my investigation I came across PageSpeed Insights, which analyzes the content of a web page and then generates suggestions to make that page faster, which could be handy.
I hope this information is helpful and points you in the right direction.

Sync Framework 2.1 - Large data over WCF

I have implemented data sync using MS Sync Framework 2.1 over WCF to sync multiple SQL Express databases with a central SQL Server. Syncing happens every three minutes through a Windows service. Recently, we noticed that a huge amount of data is being exchanged over the network (~100 MB every 15 minutes). When I checked using Fiddler, the client calls the service with a GetKnowledge request four times per session, and each response is around 6 MB in size, although there are no changes at all in either database. This does not seem normal. How do I optimize the system to reduce such heavy traffic? Please help.
I have defined two scopes: the first one has 15 tables, all download-only; the second one has 3 tables with upload-only direction.
The XML response has a very large number of <range> tags under the coreFragments/coreFragment/ranges tag, which contributes the major portion of the response size.
Let me know if any additional information is required.
It must be the sync knowledge. Do you do lots of deletes? Or do you have lots of replicas? Try running a metadata cleanup and see if it compacts the sync knowledge.
Creating one-to-one scopes and re-provisioning fixed the issue. I am still not sure what caused the original problem.
Do you happen to have any join tables and use an ORM? If you do, then this post might help:
https://kumarkrish.wordpress.com/2015/01/07/microsoft-sync-frameworks-heavy-traffic/

Limit number of business connector users in AX2009

Background
We provide some web services to export and import data for a website. Unfortunately, the programmers of that website don't seem to understand, or don't want to understand, that if they try three times and get three errors, the 1,000,000th attempt will also give an error.
So they constantly open new requests to the web service, which results in a constant flow of new business connector users. The problem is that these create database blocks, and the database cannot resolve this because by the time a block times out, there are a few thousand new business connector users waiting to block that process all over again. This morning the whole server was unresponsive, and a reboot of the AOS took about 32 minutes to complete (normally it would take 2 minutes).
Question
I was searching for a way to limit the number of business connector users. The only related post I found was this one:
http://www.archivum.info/microsoft.public.axapta.programming/2010-01/00045/RE-.NET-business-connector-amp-Web-Services.html
Unfortunately there is no answer to their question and I couldn't find more topics. Does anyone have an idea how I could solve this?
Any help or pointers in the right direction would be greatly appreciated. :)
It sounds as if the problem is with the web service. Can you rework it so that it does not cause blocking?
Meanwhile, look into the MaxConcurrentBCSessions setting.
See:
http://msdn.microsoft.com/en-us/library/aa569637(v=ax.10).aspx