Wonder if someone could help me here. I am trying to download data using the Web Service Task. The data supplier has a limit of 1000 records per call and asked us to iterate through the whole data set using the "select" and "skip" parameters: "For example, to select the first 1000 records in the data set you should set the select parameter to 1000 and the skip parameter to 0. To select the next 1000 records you should set the select parameter to 1000 and the skip parameter to 1000. You should continue to do this until 0 records are returned to you to get the whole data set."
I am not sure how I can implement this in the Web Service Task using a For Loop or Foreach Loop container. Any help or tips will be greatly appreciated.
Many thanks
I don't see a way to do this with the Foreach Loop container alone, because none of its enumerators supports a dynamic HTTP connection. If it is possible at all, it would have to be done with a Script Task that builds a list of URLs for the loop to iterate over; on each iteration, the URL from that dynamic list is assigned to a dynamic HTTP connection that the Web Service Task inside the Foreach Loop then uses.
You might want to consider programming this task entirely in a Script Task and storing the results in a SQL table.
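A minimal sketch of that select/skip loop, written in Python with the requests library purely to illustrate the logic (the endpoint URL, parameter handling, and JSON response shape are assumptions); the same loop would be coded in C# or VB.NET inside the Script Task:

import requests

BASE_URL = "https://example.com/api/records"   # placeholder for the supplier's endpoint
PAGE_SIZE = 1000

def fetch_all():
    all_records = []
    skip = 0
    while True:
        # select = page size, skip = how many records we have already fetched
        response = requests.get(BASE_URL, params={"select": PAGE_SIZE, "skip": skip})
        response.raise_for_status()
        records = response.json()
        if not records:        # 0 records returned means we have the whole data set
            break
        all_records.extend(records)
        skip += PAGE_SIZE      # the next call skips everything fetched so far
    return all_records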
I'm very new to Postman, so please bear with me. Basically, I am trying to get data from the clinicaltrials.gov API, which can only give me 1000 studies at a time. Since the data I need covers about 25000 studies, I'm querying it based on dates. So, is there any way in Postman that I can send multiple GET requests at one time, where I am only changing one parameter?
Here is my URL: ClinicalTrials.gov/api/query/study_fields??expr=AREA[LocationCountry]United States AND AREA[StudyFirstPostDate]RANGE[MIN,01/01/2017] AND AREA[OverallStatus]Recruiting
I will only be changing the RANGE field in each request, but I do not want to change it manually every time. So, is there any way I can maybe add a list of dates and have Postman go through them all?
There are several ways to do this.
So, is there any way in Postman that I can send multiple GET requests at one time, where I am only changing one parameter?
I'm going to assume you don't mind whether the requests run sequentially or in parallel; the latter is less trivial and doesn't seem to add much value for you. So I'll focus on the following problem statement.
We want to retrieve multiple pages of a resource, where the cursor is StudyFirstPostDate. On each page retrieved, the cursor should advance to the latest date seen on the previous page. The following is just one way to code this, but the building blocks are:
1. You have a collection with a single request: the GET described above.
2. A pre-request script reads a collection variable holding the next StudyFirstPostDate.
3. A test script (post-request) resets StudyFirstPostDate to the next value for the pagination.
4. In the test script you save the data the same way you're doing now.
5. You set the next request (postman.setNextRequest("NAMEOFREQUEST")) to the same GET request we're dealing with, effectively creating a loop. When you've retrieved all the pages, you end the loop with postman.setNextRequest(null) - although with a single-request collection, not calling setNextRequest at all will also stop the run. Execution then goes back to step 2 and loops.
This flow will only work on a collection run. Even if you code all of this, just triggering the request by itself will not initiate a loop. setNextRequest only works within a collection run.
Setting an initial value for the variable in the pre-request script
// Seed the cursor the first time the collection runs; later iterations reuse the value set by the test script.
// You could use global or environment variables instead of collection variables, up to you.
let startDate = pm.collectionVariables.get("startDate")
if (!startDate) {
    pm.collectionVariables.set("startDate", "01/01/2017") // initial date taken from the query above
}
Resetting the value in the Tests (post-request) script
// Loop through the results, save the data the way you do now, and work out the next start date for the request.
// variableWithDate below stands for that next date - you compute it from the response yourself.
pm.collectionVariables.set("startDate", variableWithDate)
// If you've reached the end you stop; if not, you call the same request again to loop.
// nextPage is an example of a boolean you've set beforehand (e.g. "did this page return any studies?").
if (nextPage) {
    postman.setNextRequest("NAMEOFREQUEST")
} else {
    postman.setNextRequest(null)
}
I have an array of structures. I need to insert all the rows from that array into a table.
So I have simply used cfquery inside cfloop to insert into the database.
Some people suggested that I not use cfquery inside cfloop, as each iteration makes a new connection to the database.
But in my case, is there any way I can do this without using cfloop inside cfquery?
It's not so much about maintaining connections as it is about hitting the server with 'n' requests to insert or update data, one for every iteration of the cfloop. This will seem fine in a test with a few records, but when you throw it into production and your client pushes your application to around a couple of hundred rows, you're going to hit the database server a couple of hundred times as well.
As Scott suggests, you should look at looping to build a single query rather than making multiple hits to the database. Looping inside the cfquery has the benefit that you can use cfqueryparam, but if you can trust the data, i.e. it has already been sanitised, you might find it easier to use something like cfsavecontent to build up your query and output the string inside the cfquery at the end.
I have used both the query-inside-loop and loop-inside-query methods. While having the loop inside the query is theoretically faster, that is not always the case. You have to try each method and see what works best in your situation.
Here is the syntax for a loop inside a query, using Oracle for the sake of picking a database (the datasource, table, column, and variable names below are placeholders):
<cfquery datasource="#myDatasource#">
insert into myTable
    (field1, field2)
select null, null
from dual
where 1 = 2
<cfloop array="#arrayOfStructs#" index="row">
    union
    select <cfqueryparam value="#row.field1#" cfsqltype="cf_sql_varchar">
         , <cfqueryparam value="#row.field2#" cfsqltype="cf_sql_varchar">
    from dual
</cfloop>
</cfquery>
Depending on the database, convert your array of structures to XML, then pass that as a single parameter to a stored procedure.
In the stored procedure, do an INSERT INTO SELECT, where the SELECT statement selects data from the XML packet. You could insert hundreds or thousands of records with a single INSERT statement this way.
Here's an example.
There is a limit to how many <cfquery><cfloop>... iterations you can do when using <cfqueryparam>, and that limit is vendor-specific. If you do not know how many records you will be generating, it is best to remove <cfqueryparam>, if it is safe to do so - make sure your data is coming from trusted sources and is sanitised. This approach can save huge amounts of processing time, because it only makes one call to the database server, unlike an outer loop.
I am trying to set a value for all items in a domain that do not already have a certain value and that have an additional flag set.
Basically for all my items,
SET ValueA to 100 if ValueB is 0
But I am confused about how to achieve this. So far I've been setting the value for individual items by just using a PutAttributesRequest like this:
// Build the attribute list: set ValueA to 100, replacing any existing value
ArrayList<ReplaceableAttribute> newAttributes = new ArrayList<ReplaceableAttribute>();
newAttributes.add(new ReplaceableAttribute("ValueA", Integer.toString(100), true));

// PutAttributes targets a single item, identified by its item name
PutAttributesRequest newRequest = new PutAttributesRequest();
newRequest.setDomainName(usersDomain);
newRequest.setItemName(userID);
newRequest.setAttributes(newAttributes);
sdb.putAttributes(newRequest);
This works for an individual item and requires me to first get the item name (userID). Does this mean that I have to "list" all of my items and do this one by one?
I suppose that, since I have around 19000+ items, I would also have to use the token to get the next set after the 2000 limit, right?
Isn't there a more efficient way? This might not be so heavy right now, but I expect to eventually have over 100k items.
P.S.: I am using the AWS Java SDK for Eclipse.
If you are asking whether you can do it programmatically by writing your own code, then yes. First you have to know every item name, i.e. in your case the UserID, and then set the value one by one. You can use BatchPutAttributes for this: with a batch PUT you can update 25 items in one request, and you can run 5 to 20 BatchPutAttributes requests in parallel threads to tune the performance.
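As an illustration only - the question uses the AWS Java SDK, but here is a rough sketch of the same batching idea in Python with boto (the select expression and credential handling are assumptions; BatchPutAttributes itself accepts at most 25 items per call):

import boto

conn = boto.connect_sdb()                      # reads AWS credentials from the environment
domain = conn.get_domain("usersDomain")

# Find the items that still need updating; boto follows the NextToken for you while iterating
to_update = domain.select("select itemName() from `usersDomain` where ValueB = '0'")

batch = {}
for item in to_update:
    batch[item.name] = {"ValueA": "100"}       # SimpleDB stores everything as strings
    if len(batch) == 25:                       # BatchPutAttributes allows at most 25 items per request
        domain.batch_put_attributes(batch, replace=True)
        batch = {}
if batch:                                      # flush the final partial batch
    domain.batch_put_attributes(batch, replace=True)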
If you want to do it without writing code, you can use SDBExplorer. Please remember that it will set 100 for all items, because SDBExplorer does not support conditional PUTs. If you would like to do it anyway, follow these steps:
1. Download the SDBExplorer zip version from the download page.
2. Extract it and run the executable.
3. Download the 30-day trial license.
4. Once the license has been downloaded, the main UI will open.
5. Provide a valid Access Key and Secret Key and click the "GO" button.
6. You will see the list of domains in the tree on the left.
7. Right-click the domain in which you would like to set the value for all items.
8. Choose the "Export to CSV" option.
9. Export the content of the domain to CSV: http://www.sdbexplorer.com/documentation/simpledb--how-to-export-domain-in-csv-using-sdbexplorer.html
10. Go to the path where your domain was exported.
11. Open the CSV file.
12. The first column is the item name.
13. Delete all columns other than the item name and the "ValueA" column.
14. Set 100 for every item under the "ValueA" column.
15. Save the CSV.
16. Go back to the SDBExplorer main UI.
17. Select the same domain.
18. Click the "Import" option on the toolbar.
19. A panel will open.
20. Now import the data into the domain: http://www.sdbexplorer.com/documentation/simpledb--how-to-upload-csv-file-data-and-specifying-column-as-amazon-simple-db-item-name.html
21. Once the import is done, explore the domain and you will find the value 100 set on all items for the ValueA column.
Please try the steps first on a dummy domain.
What exactly am I suggesting?
To get every item name in your domain, export the entire content of the domain to a CSV file on your local file system. Once you have all the item names in the CSV, keep only the item-name column and the "ValueA" column, set "100" for all the items in the CSV file, and upload/import the content back into the domain.
Disclosure: I am one of the developers of SDBExplorer.
I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family and use them as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5-based (from what I have read), I'd like to query everything whose token starts with a for the first range. In other words, how would I do a get_range on the MD5 token space that starts at a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
I'd cheat a little:
Create new rows job_(n) with each column representing each row key in the range you want
Pull all columns from that specific row to indicate which rows you should pull from the CF
I do this with users. Users from a particular country get a column in the country-specific row. Users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it is a little more efficient than pulling everything.
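A minimal sketch of that pattern with pycassa - the keyspace, column family names, and bucketing scheme below are made up for illustration:

import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])   # assumed keyspace and host
data_cf = pycassa.ColumnFamily(pool, "MyData")                     # the CF you want to process
index_cf = pycassa.ColumnFamily(pool, "JobIndex")                  # holds the "cheat" rows job_0 .. job_15

# Whenever a data row is written, also record its key as a column name in one of the job_(n) rows
def record_key(row_key, num_jobs=16):
    bucket = hash(row_key) % num_jobs
    index_cf.insert("job_%d" % bucket, {row_key: ""})

# Each of the 16 jobs then reads only its own index row and fetches exactly those data rows;
# xget pages through the columns of a wide row automatically
def rows_for_job(n):
    keys = [col for col, _ in index_cf.xget("job_%d" % n)]
    return data_cf.multiget(keys)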
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?
Actually, I would like to use this for logging.
I want to put a dictionary into beanstalkd.
Every time someone visits my website, I want to put a dictionary into beanstalkd, and then every night I want a script to get all the jobs and stick them in the database.
This will make it fast and easy.
You can configure a large upper limit on the size of each job in beanstalkd (>2MB), but performance does seem to be adversely affected at that point. If the dictionary is large, you probably want to store the actual dictionary in a SQL table and put the ID of that row in the job, then have the worker retrieve the SQL row when it grabs the job with the correlated ID.
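A small sketch of that flow in Python using the beanstalkc client - the tube name, payload shape, and database handling are assumptions:

import json
import beanstalkc

conn = beanstalkc.Connection(host="localhost", port=11300)
conn.use("page_views")                    # tube name is arbitrary

# On each page view: enqueue a small JSON payload (or just a DB row id for large dictionaries)
def log_visit(visit_dict):
    conn.put(json.dumps(visit_dict))

# Nightly script: drain the tube and hand each job to whatever writes your SQL rows
def drain(save_to_db):
    conn.watch("page_views")
    while True:
        job = conn.reserve(timeout=5)     # returns None once the tube has been empty for 5 seconds
        if job is None:
            break
        save_to_db(json.loads(job.body))  # e.g. an INSERT into your table
        job.delete()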