Issue with type: street_address - web-services

I have a loop that sends individual queries for places to the Places API and returns their type where possible. The code works fine for most places, but I have noticed that for a lot of them I receive type: street_address.
When I manually input this search with:
https://maps.googleapis.com/maps/api/place/textsearch/output?parameters
I receive the actual type for the place, NOT street_address. Could this be due to running out of my query quota for the day?
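For reference, a minimal sketch of the kind of request such a loop might send (Python with the requests library; the query and API key are placeholders). The status field is also worth checking: if the daily quota were exhausted, the API would return OVER_QUERY_LIMIT rather than a wrong type.

import requests

# Hypothetical sketch: one Text Search query; YOUR_API_KEY is a placeholder.
params = {"query": "Eiffel Tower, Paris", "key": "YOUR_API_KEY"}
resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params=params,
)
data = resp.json()
print(data["status"])  # OK, ZERO_RESULTS, OVER_QUERY_LIMIT, ...
if data.get("results"):
    print(data["results"][0]["types"])  # e.g. ['point_of_interest', 'establishment']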
Thanks!


Power Query & Zabbix API - DataSource.Error (500) Internal Server Error when requesting a lot of data

I am using Power Query to fetch the data from Zabbix using their API. It works fine when I fetch the data for a few days, but as I increase the period and the amount of data surpasses millions of rows, I just get the 500 Internal Server Error after some time waiting, and the query doesn't return anything.
I am using Web.Contents to get the data as follows:
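(The original query was not preserved; this is a rough reconstruction of the shape of such a call, assuming the Zabbix JSON-RPC endpoint and the five-minute timeout mentioned below. URL, token, and request body are placeholders.)

let
    // Placeholder Zabbix JSON-RPC request; method, params and auth token are illustrative
    url = "http://your-zabbix-server/api_jsonrpc.php",
    body = "{""jsonrpc"":""2.0"",""method"":""history.get"",""params"":{""output"":""extend""},""auth"":""YOUR_TOKEN"",""id"":1}",
    response = Web.Contents(
        url,
        [
            Headers = [#"Content-Type" = "application/json"],
            Content = Text.ToBinary(body),
            // Raise the default timeout to 5 minutes
            Timeout = #duration(0, 0, 5, 0)
        ]
    ),
    json = Json.Document(response)
in
    json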
I have added that timeout, as you can see above, but the error occurs well before 5 minutes have passed. How should I solve this? Is there a way to fetch large amounts of data in Power Query in parts, rather than all at once? Or does this error happen because of connection parameters inherent to the Zabbix configuration?
My team changed all the possible parameters regarding server memory and nothing seemed to work. One thing to note: although Power Query hits the same (500) Internal Server Error whether I request 3 days or 30 days of data, in the first case it fails much faster, while in the second it takes much longer and eventually reaches the same error.
Thanks!
It's a PHP memory limit being hit; you should raise the maximum memory.
For example, in a standard Apache setup you would edit /etc/httpd/conf.d/zabbix.conf and raise the php_value memory_limit setting (then restart Apache!).
The default is 128M; the "right" setting depends on the memory available on the system and the maximum amount of data you want to fetch.
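A sketch of the change (the 512M value is only an illustration; size it to your server):

# /etc/httpd/conf.d/zabbix.conf
php_value memory_limit 512M

Then restart Apache, e.g. systemctl restart httpd on a Red Hat style system.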

How to know when elasticsearch is ready for query after adding new data?

I am trying to write some unit tests using Elasticsearch. I start by calling the index API about 100 times to add new data to my index, then I use the search API with aggs. The problem is that if I don't pause for 1 second after adding the data, I get random results; if I wait 1 second, I always get the same result.
I'd rather not wait a fixed amount of time in my tests; that seems like bad practice. Is there a way to know when the data is ready?
I already wait for a success response from the Elasticsearch index API, but it seems that is not enough.
First, I'd suggest indexing your documents with a single bulk query: it saves some time through less HTTP/TCP overhead.
To answer your question, you should consider using the refresh=true parameter (or refresh=wait_for) when indexing your 100 documents.
As stated in the documentation, it will:
Refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately.
More about it here :
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
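A minimal sketch of both suggestions with the Python client (elasticsearch-py 8.x is assumed; the index name and documents are placeholders):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Index all 100 test documents in one bulk request, and block until the
# affected shards have refreshed so the data is visible to search.
actions = ({"_index": "test-index", "_source": {"value": i}} for i in range(100))
helpers.bulk(es, actions, refresh="wait_for")

# The aggregation now returns deterministic results, no sleep needed.
resp = es.search(
    index="test-index",
    size=0,
    aggs={"total": {"sum": {"field": "value"}}},
)
print(resp["aggregations"]["total"]["value"])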

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a dataflow job processing data from pub/sub defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine and processes many GBs of data, but it fails and retries forever when it has to create another table. The job is meant to run continuously and create day tables on its own; it does fine on the first few, but then gives me, indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
The error itself is correct, since that table doesn't exist. The problem is that the job should create it on its own, because CreateDisposition.CREATE_IF_NEEDED is set.
The number of day tables it creates correctly depends on the number of workers. My guess is that once some worker creates a table, its CreateDisposition flips to CREATE_NEVER, causing the problem, but that is only a guess.
A similar problem was reported here, but without a definitive answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
The ProcessElement definition here seems to give some clues, but I cannot really tell how it behaves with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I am using Apache Beam SDK 2.15.0.
I encountered the same issue, which is still not fixed as of Beam 2.27.0 (January 2021). I therefore had to develop a workaround: a custom PTransform which checks whether the target table exists before the BigQueryIO stage. It uses the BigQuery Java client for this, plus a Guava cache, as well as a windowing strategy (fixed, checking every 15 seconds) to sustain heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
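The linked gist is Java; as a very rough sketch of the same idea in Python (the element layout, table naming scheme, and schema are entirely hypothetical), the check-before-write stage could look like:

import apache_beam as beam
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

class EnsureDayTableExists(beam.DoFn):
    # Creates the element's target day table if it is missing, caching table
    # IDs already seen so BigQuery is queried only once per table per worker.
    def setup(self):
        self.client = bigquery.Client()
        self.known = set()

    def process(self, element):
        table_id = element["table_id"]  # hypothetical, e.g. "project.dataset.events_20210101"
        if table_id not in self.known:
            try:
                self.client.get_table(table_id)
            except NotFound:
                schema = [bigquery.SchemaField("payload", "STRING")]  # placeholder schema
                self.client.create_table(bigquery.Table(table_id, schema=schema))
            self.known.add(table_id)
        yield element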
There was a bug in the past that caused this error, but that particular one was fixed in commit https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310. Can you check whether the version of Beam you are running includes this commit?

Stream Analytics Output

I have a project that uses an event hub to receive data, sent every second. The data is received by a website using SignalR, and that part is all working fine. I have been storing the data in blob storage via a Stream Analytics job, but it is really slow to access, and with the amount of data I am receiving from just 6 devices, it will only get slower as that increases. I need to access the data to display historical graphs on the website, topped up with the live data coming in.
I don't really need to store the data every second, so I thought about storing it only every 30 seconds instead, but into a SQL DB. What I am trying to do is still receive the data every second but store it only every 30. I have tried a tumbling window, but from what I can see it just dumps everything every 30 seconds instead of a single entry.
Am I misunderstanding the tumbling, sliding and hopping windows; am I right in guessing I cannot use them this way? If so, I am guessing the only way to do it would be to feed the output DB back in as an input, so I can cross-reference the timestamp with the current time?
Unless anyone has any other ideas? Any help would be appreciated.
Thanks
Am I misunderstanding the tumbling, sliding and hopping windows
You are correct that these will put all events within the window together. However, they are only valid in a GROUP BY clause, which requires an aggregate function over the group.
There is an aggregate function, Collect(), which creates an array of all the events within a group.
So this should be possible: group the events into a 30-second tumbling window with Collect(), then in a second step CROSS APPLY each record, which outputs all events received within those 30 seconds.
WITH Grouper AS (
    SELECT Collect() AS records
    FROM Input TIMESTAMP BY time
    GROUP BY TumblingWindow(second, 30)
)
SELECT
    record.ArrayValue.FieldA AS FieldA,
    record.ArrayValue.FieldB AS FieldB
INTO Output
FROM Grouper
CROSS APPLY GetArrayElements(Grouper.records) AS record
If you are trying to aggregate 30 entries into one summary row every 30 seconds then a tumbling window is a good choice. Something like the following should work:
SELECT System.TimeStamp AS OutTime, TollId, COUNT(*) as cnt, sum(TollCharge) as TollCharge
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 30)
Thanks for the response. I have been speaking to my contact at Microsoft and he suggested something similar; I had also found something like that in various examples online. What I actually want to do is update the database with the data only every 30 seconds: I will receive an event and store it, and I will not store another until 30 seconds have passed. I am not sure how I can do that with an ASA job, to be honest, as I need a record of the last time it was updated. I actually have a connection to the event hub from my website, so in the receiver I am going to perform a simple check and store the data from there.
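For completeness, "keep only one event every 30 seconds" can also be expressed directly in an ASA query with the TopOne() aggregate, along these lines (the time field name is a placeholder):

SELECT TopOne() OVER (ORDER BY time DESC) AS lastEvent
INTO Output
FROM Input TIMESTAMP BY time
GROUP BY TumblingWindow(second, 30)

This emits one record per 30-second window containing only the most recent event received in that window.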

In Mechanical Turk, how do you limit to one HIT per worker

I know from communication with Mechanical Turk workers that there is a way to limit the number of HITs a specific worker can complete, but I can't figure out how to do it. Any help would be greatly appreciated!
I've developed a script that mostly solves this problem. The main idea is to check the worker ID against a database and then hide the HIT if the worker has already completed a related HIT.
So that you don't need to host your own database server, I've made my script available as a (free) service at: http://uniqueturker.myleott.com. Please let me know if you have any trouble using the script, or if you have any questions or suggestions.
I'm also including the script here, in case you wish to use it with your own URL/database. If you go that route, you'll need to set up a web interface to your DB that takes a worker ID and returns "1" if the worker is allowed to work on the HIT and "0" otherwise. Then you'll just replace "YOUR_URL" below to point to that web interface:
<script type="text/javascript">
(function() {
    // turkGetParam is a helper defined by Amazon's MTurk HIT template
    // boilerplate; it reads a parameter from the HIT's query string.
    var assignmentId = turkGetParam('assignmentId', '');
    // Only run the check once the HIT has been accepted (in preview mode
    // the assignment ID is 'ASSIGNMENT_ID_NOT_AVAILABLE').
    if (assignmentId != '' && assignmentId != 'ASSIGNMENT_ID_NOT_AVAILABLE') {
        var workerId = turkGetParam('workerId', '');
        var url = 'http://YOUR_URL/?workerId=' + workerId;
        // Synchronously ask your web interface whether this worker is
        // still allowed to work on the HIT ('1' = allowed).
        var request = new XMLHttpRequest();
        request.open('GET', url, false);
        request.send();
        if (request.responseText != '1') {
            // Hide the form and ask the worker to return the HIT.
            document.getElementById('mturk_form').style.display = 'none';
            document.getElementsByTagName('body')[0].innerHTML = "You have already completed the maximum number of HITs allowed by this requester. Please click 'Return HIT' to avoid any impact on your approval rating.";
        }
    }
})();
</script>
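The web interface itself is not shown above; a minimal sketch of what it could look like (Flask and SQLite here are purely illustrative, as is the completed(worker_id) table):

from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/")
def check_worker():
    # Return "1" if the worker may take the HIT, "0" if they have already
    # completed a related HIT (recorded in a hypothetical completed table).
    worker_id = request.args.get("workerId", "")
    conn = sqlite3.connect("workers.db")
    try:
        seen = conn.execute(
            "SELECT 1 FROM completed WHERE worker_id = ?", (worker_id,)
        ).fetchone()
    finally:
        conn.close()
    return "0" if seen else "1"

if __name__ == "__main__":
    app.run()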
Create a HIT that really is a single HIT, but use JavaScript to dynamically change the HIT every time it is viewed. Then, when posting the HIT, set "Number of assignments per HIT" to the number of participants you want. This way you will only get unique participants.
Depending on the type of HIT you want to run, this is a technique that might work for you. I have used it to randomize the stimuli shown to participants.
You can also do this with external questions. I run psychology experiments on Mechanical Turk, so I need unique participants. In addition to requesting that workers only perform one HIT, I use a Python script to verify uniqueness. My HITs all run a CGI script to produce the question. The script consults a log file, and if the worker has previously accepted a related job, it politely informs them that, because I need unique participants, the HIT won't be available.
I used to do this with qualifications, but found that it really limited participation.
If you want x unique users, set the number of assignments per HIT to x in the "Design" section. Then, when loading your CSV file, put only one HIT in the file.
See
http://docs.amazonwebservices.com/AWSMechanicalTurkRequester/2008-08-02/
You can set two types of limits:
The maximum number of assignments any Worker can accept for a specific HIT type you've created. This value is undefined until you set it.
The maximum number of assignments any Worker can accept for all your HITs that don't otherwise have a HIT-type-specific limit already assigned. The initial default value is 10.
Initially, all your HITs are grouped together with an overall limit (default of 10) applying to the group, regardless of HIT type.
Note that this refers to the number of assignments that a worker can have currently accepted. Once the worker has submitted an assignment, they can accept another assignment.
You probably shouldn't care how many HITs a worker completes overall, but there might be a reason why you want to change the number a worker can have currently accepted from the default 10. Of course, a worker can only accept one assignment from a HIT with multiple assignments.
If you really, really want to limit the number of HITs a worker can actually complete, you will need to either state that you won't accept more than a certain number per worker, explicitly saying that you will reject any submissions once the limit has been reached, OR you could monkey with qualification types to do it (but that could be a lot of work)!
As an example of the latter, if you want to limit someone to N total assignments, you could create a qualification type for each HIT and grant no more than N types to any one worker.
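For reference, with the modern boto3 MTurk client (which postdates this thread) the qualification-based approach looks roughly like this; every name, value, and file here is a placeholder:

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Create a qualification that marks workers who have already participated.
qual = mturk.create_qualification_type(
    Name="AlreadyParticipated",  # placeholder name
    Description="Granted after a worker completes one of our HITs",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# Require that workers NOT hold the qualification to even see the HIT.
mturk.create_hit(
    Title="Example HIT",
    Description="Placeholder",
    Reward="0.50",
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    MaxAssignments=100,
    Question=open("question.xml").read(),  # ExternalQuestion XML, placeholder
    QualificationRequirements=[{
        "QualificationTypeId": qual_id,
        "Comparator": "DoesNotExist",
        "ActionsGuarded": "DiscoverPreviewAndAccept",
    }],
)

# After approving a worker's assignment, grant them the qualification so
# they cannot take another of these HITs.
mturk.associate_qualification_with_worker(
    QualificationTypeId=qual_id,
    WorkerId="A1EXAMPLEWORKERID",
    IntegerValue=1,
    SendNotification=False,
)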
Just to clarify: if you want only unique workers to complete a single HIT, all you have to do is set the max assignments to however many unique workers you want, and Mechanical Turk by default will assign only unique workers to that HIT.
If you want unique workers across multiple HITs, then you have to get fancy and use an external question coupled with a script that logs worker IDs, etc.
psiTurk (MTurk behavioral research app) automatically prevents workers from repeating HITs using an approach similar to #david-l. Disclosure: I'm one of the developers on the project.