Amazon Mechanical Turk Qualification with Multiple Locales - amazon-web-services

I am trying to define worker qualifications for a HIT via the command line tools.
The requirements for workers are:
Locale: US, CA
Number of accomplished HITs >= 30
Acceptance rate >= 75%
The following is the part of external_hit.properties file
# Worker_PercentAssignmentsApproved > 75%
qualification.1:000000000000000000L0
qualification.comparator.1:greaterthan
qualification.value.1:75
qualification.private.1:false
# Worker_Locale
qualification.2:00000000000000000071
qualification.comparator.2:in
qualification.LocaleValue.2.Country.1=US
qualification.LocaleValue.2.Country.2=CA
# Worker_NumberHITsApproved > 30
qualification.3:00000000000000000040
qualification.comparator.3:GreaterThan
qualification.value.3:30
The problem is that I am not sure about the syntax; in particular, I made up the following part:
qualification.LocaleValue.2.Country.1=US
qualification.LocaleValue.2.Country.2=CA
I didn't find any example of the command line tools' format with multiple locales.
I would appreciate it if you could check the syntax.

The format is a set of comma-separated values:
qualification.LocaleValue.2.Country=US,CA
However, AWS no longer updates the command line tools (and the last time I looked, they appeared to be officially unsupported), so this feature may not work at all, in which case you would have to go through the web API.
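If you do end up going through the web API, the sketch below shows the same three requirements via boto3's MTurk client (a rough illustration, not the command line tools' syntax; the qualification type IDs are the ones from your properties file, and the remaining create_hit arguments are omitted):
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

qualification_requirements = [
    {   # Worker_PercentAssignmentsApproved >= 75
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [75],
    },
    {   # Worker_Locale in (US, CA)
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "In",
        "LocaleValues": [{"Country": "US"}, {"Country": "CA"}],
    },
    {   # Worker_NumberHITsApproved >= 30
        "QualificationTypeId": "00000000000000000040",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [30],
    },
]
# Pass this list as the QualificationRequirements argument of mturk.create_hit(...).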


AWS Tools for PowerShell, version differences

I have been testing an older AWS Tools install using AWSToolsAndSDKForNet_sdk-3.3.398.0_ps-3.3.390.0_tk-1.14.4.1.msi and a newer install using AWSToolsAndSDKForNet_sdk-3.5.2.0_ps-4.1.0.0_tk-1.14.5.0.msi. The code that I am using to test with is
Set-AWSCredential -AccessKey:$ACCESSKEY -SecretKey:$SECRETKEY -StoreAs:default
$items = Get-S3Object -BucketName:$BUCKETNAME -Region:'eu-west-1' -Key:'revit/2020'
Write-Host "$($items.Length) items"
$count = 1
foreach ($item in $items) {
    Write-Host "$count $($item.key)"
    $count++
}
I am seeing VERY different behavior, and can't figure out why. With 3.3 the code works as intended: I end up with a list of files in my bucket and key. Performance is pretty decent; it takes a moment, but I have about 5000 files in my "subfolders".
When I run this with 4.1 it takes 3-5 times as long and returns nothing.
It seems that Help is a bit different too. A first run of get-help Get-S3Object -detailed will take as long as 10 minutes to run, with CPU, memory and disk access often at 99% utilization. A second run is quite quick. 3.3 does nothing of the sort.
So, is this current build of AWS Tools for Powershell just not ready for prime time? My searches for AWS Tools 4.1 performance have turned up nothing.
For what it is worth, I am using the MSI installer because I need the install to actually work consistently, and the NuGet approach has been very problematic on a number of production workstations. But if there is another option I would love to look at it. The main issue is I need ultimately to do the install and immediately load the modules and work with AWS. I don't have that working with the MSI based install yet, but that's for a different thread.
It looks like they changed the results from Get-S3Object. You will need to add -Select S3Objects.Key to get the results you're looking for (or just -Select *). Here's the excerpt from the change notes:
Most cmdlets have a new parameter: -Select. Select can be used to change the value returned by the cmdlet. For example the service API used by Get-S3Object returns a ListObjectsResponse object but the cmdlet is configured to return only the S3Objects field. Now you can specify -Select * to receive the full API response. You can also specify the path to a nested result property like -Select S3Objects.Key. In certain situations it may be useful to return a cmdlet parameter, this can be achieved with -Select ^ParameterName.
Found by going to the Change Notes and doing a CTRL+F for Get-S3Object. Hope this resolves it for you!
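For example, a minimal sketch of the adjusted call (the parameters are simply carried over from your snippet above, with only -Select added):
# Returns the object keys directly, so the loop would print $item itself
$items = Get-S3Object -BucketName:$BUCKETNAME -Region:'eu-west-1' -Key:'revit/2020' -Select S3Objects.Key
Write-Host "$($items.Length) items"
foreach ($item in $items) { Write-Host $item }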

Tensorflow runs "Running per image evaluation" indefinitely

I am running my first TensorFlow job (object detection training) right now, using the TensorFlow Object Detection API. I am using the SSD MobileNet network from the model zoo. I used >>ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config<< as the config file and the >>ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03<< checkpoint as the fine-tune checkpoint.
I started my training with the following command:
PIPELINE_CONFIG_PATH='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config'
MODEL_DIR='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/train'
NUM_TRAIN_STEPS=200000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
Now, coming to my problem, which I hope the community can help me with: I trained the network overnight; it trained for 1400 steps and then started the per-image evaluation, which ran for the rest of the night. The next morning I saw that the network had only been evaluating and the training was still at 1400 steps. You can see part of the console output in the image below.
Console output from evaluation
I tried to take control by using the eval config parameter in the config file.
eval_config: {
metrics_set: "coco_detection_metrics"
use_moving_averages: false
num_examples: 5000
}
I added max_evals = 1, because the documentation says I can limit the evaluation like this. I also changed eval_interval_secs = 3600 because I only wanted one evaluation every hour. Neither option had any effect.
I also tried other config files from the model zoo, with no luck. I searched Google for hours, only to find answers telling me to change the parameters I had already changed. So I am coming to Stack Overflow to find help in this matter.
Can anybody help me, or has anyone maybe had the same experience? Thanks in advance for all your help!
Environment information
$ pip freeze | grep tensor
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-gpu==1.11.0
$ python -V
Python 2.7.12
I figured out a solution for the problem. The problem with TensorFlow 1.10 and later is that you can no longer set checkpoint steps or checkpoint secs in the config file like before. By default, TensorFlow 1.10 and later saves a checkpoint every 10 minutes. If your hardware is not fast enough and you need more than 10 minutes for evaluation, you are stuck in a loop.
So to change the number of seconds or training steps until a new checkpoint is saved (which triggers the evaluation), you have to navigate to model_main.py in the following folder:
tensorflow/models/research/object_detection/
Once you have opened model_main.py, navigate to line 62, where you will find:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
To trigger the checkpoint save after 2500 steps, for example, change the entry to this:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=2500)
Now the model is saved every 2500 steps and afterwards an evaluation is done.
There are multiple parameters you can pass through this option. You can find the documentation here:
tensorflow/tensorflow/contrib/learn/python/learn/estimators/run_config.py
From line 231 to 294 you can see the parameters and their documentation.
I hope this helps you and that you don't have to look for an answer as long as I did.
Could it be that evaluation takes more than 10 minutes in your case? Since 10 minutes is the default evaluation interval, it would then keep evaluating.
Unfortunately, the current API doesn't easily support altering the time interval for evaluation.
By default, evaluation happens after every checkpoint saving, which by default is set to 10 minutes.
Therefore you can change the time for saving a checkpoint by specifying save_checkpoint_secs or save_checkpoint_steps as an input to the instance of MonitoredSession (or MonitoredTrainingSession). Unfortunately, to the best of my knowledge, these parameters cannot be set as flags to model_main.py or from the config file. Therefore, you can either change their value by hard-coding it, or export them so that they become available.
An alternative way, without changing the frequency of saving checkpoints, is to modify the evaluation frequency, which is specified as throttle_secs to tf.estimator.EvalSpec.
See my explanation here as to how to export this parameter to model_main.py.
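For illustration, here is a rough sketch of that alternative with the plain Estimator API (my_model_fn and the input functions are placeholders; model_main.py builds its own TrainSpec/EvalSpec internally, so treat this as the general pattern rather than a drop-in patch):
import tensorflow as tf

run_config = tf.estimator.RunConfig(
    model_dir="train",                  # placeholder path
    save_checkpoints_secs=3600)         # checkpoint (and therefore eval) at most hourly

estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=200000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=100,           # cap the number of eval batches
                                  throttle_secs=3600)  # evaluate at most once per hour

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)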

Google speech adds extra digits and mis-transcribes 9- and 10-digit strings

Scenario: a user speaks a 9 or 10 digit ID and Google speech is used to transcribe it.
Google STT sometimes forces the number into a phone number format, adding mystery digits to make it fit (and thus failing to capture the number accurately).
For example if the caller says "485839485", it may come out as "485-839-4850", with an extra digit that the caller never said. Digits are sometimes added in the middle of the number as well.
This happens even with added hints such as "one,two,three,four,five,six,seven,eight,nine,zero"
Has anyone found a workaround to this issue?
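(For reference, hints of this kind are usually supplied through the speech_contexts field of the recognition config; a minimal sketch, assuming the google-cloud-speech 2.x client and a placeholder audio_bytes:)
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    speech_contexts=[speech.SpeechContext(
        phrases=["one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "zero"])],
)
audio = speech.RecognitionAudio(content=audio_bytes)  # audio_bytes: the caller's raw audio
response = client.recognize(config=config, audio=audio)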
There are many open source speech recognition toolkits which will recognize number sequences reliably and for free; you just need to spend an hour setting them up.
This behavior seems to be related to the logic used by the API's model when performing transcription. Since this issue comes from an internal process that tries to fit the transcribed numbers into a phone-number format, I don't think there is currently a workaround for this scenario; however, I recommend that you take a look at the ticket that has been created to review this issue, as well as the Release Notes documentation of the Speech-to-Text API, to keep track of new functionality added to the service.

Code changes needed for custom distributed ML Engine Experiment

I completed this tutorial on distributed TensorFlow experiments within an ML Engine experiment and I am looking to define my own custom tier instead of the STANDARD_1 tier that they use in their config.yaml file. If using the tf.estimator.Estimator API, are any additional code changes needed to create a custom tier of any size? For example, the article suggests: "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches.", which suggests the config.yaml file below should be possible:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: complex_model_m
  workerCount: 10
  parameterServerCount: 4
Are any code changes needed to the MNIST tutorial to be able to use this custom configuration? Would this distribute the X number of batches across the 10 workers as the tutorial suggests is possible? I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed? My current understanding is that if using the Estimator API (even within your own defined model) there should not need to be any additional changes.
Does any of this change if the config.yaml specifies using GPUs? This article suggests that for the Estimator API "No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs." However, since the config.yaml specifically identifies the machine type for parameter servers and workers, I am expecting that within ML Engine the ClusterSpec will be configured properly based on the config.yaml file. Still, I am not able to find any ML Engine documentation that confirms no changes are needed to take advantage of GPUs.
Lastly, within ML Engine I am wondering whether there are any ways to assess the effect of different configurations. The line "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." suggests that the benefit of additional workers would be roughly linear, but I don't have any intuition about how to determine whether more parameter servers are needed. What would one be able to check (either within the cloud dashboards or TensorBoard) to determine whether they have a sufficient number of parameter servers?
are any additional code changes needed to create a custom tier of any size?
No; no changes are needed to the MNIST sample to get it to work with a different number or type of workers. To use a tf.estimator.Estimator on CloudML Engine, you must have your program invoke learn_runner.run, as exemplified in the samples. When you do so, the framework reads the TF_CONFIG environment variable and populates a RunConfig object with the relevant information, such as the ClusterSpec. It will automatically do the right thing on parameter server nodes and it will use the provided Estimator to start training and evaluation.
Most of the magic happens because tf.estimator.Estimator automatically uses a device setter that distributes ops correctly. That device setter uses the cluster information from the RunConfig object whose constructor, by default, uses TF_CONFIG to do its magic (e.g. here). You can see where the device setter is being used here.
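As a rough sketch of that pattern (not the sample's exact code; my_model_fn and the input functions are placeholders):
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner

def experiment_fn(run_config, hparams):
    # run_config already carries the ClusterSpec, task type and task index
    # parsed from TF_CONFIG, so the same code runs on master, worker and
    # parameter-server nodes.
    estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
    return tf.contrib.learn.Experiment(estimator,
                                       train_input_fn=train_input_fn,
                                       eval_input_fn=eval_input_fn)

learn_runner.run(experiment_fn,
                 run_config=tf.contrib.learn.RunConfig(model_dir="gs://my-bucket/output"))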
This all means that you can just change your config.yaml by adding/removing workers and/or changing their types and things should generally just work.
For sample code using a custom model_fn, see the census/customestimator example.
That said, please note that as you add workers, you are increasing your effective batch size (this is true regardless of whether or not you are using tf.estimator). That is, if your batch_size was 50 and you were using 10 workers, that means each worker is processing batches of size 50, for an effective batch size of 10*50=500. Then if you increase the number of workers to 20, your effective batch size becomes 20*50=1000. You may find that you may need to decrease your learning rate accordingly (linear seems to generally work well; ref).
I poked around some of the other ML Engine samples and found that reddit_tft uses distributed training, but they appear to have defined their own runconfig.cluster_spec within their trainer package (task.py), even though they are also using the Estimator API. So, is there any additional configuration needed?
No additional configuration is needed. The reddit_tft sample does instantiate its own RunConfig; however, the constructor of RunConfig grabs any properties not explicitly set during instantiation by using TF_CONFIG, and it does so only as a convenience to figure out how many parameter servers and workers there are.
Does any of this change if the config.yaml specifies using GPUs?
You should not need to change anything to use tf.estimator.Estimator with GPUs, other than possibly needing to manually assign ops to the GPU (but that's not specific to CloudML Engine); see this article for more info. I will look into clarifying the documentation.
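For completeness, manual assignment looks like this minimal sketch (TF 1.x; features is a placeholder tensor):
import tensorflow as tf

with tf.device("/device:GPU:0"):
    hidden = tf.layers.dense(features, units=128, activation=tf.nn.relu)
with tf.device("/cpu:0"):
    # variable-heavy ops (e.g. large embeddings) can stay on the CPU
    embeddings = tf.get_variable("embeddings", shape=[10000, 64])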

TSFRESH library for python is taking way too long to process

I came across the tsfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.
I wanted to implement the following code, which is shared in the quick start section of the tsfresh documentation, and it seems simple enough:
from tsfresh import extract_relevant_features
feature_filtered_direct = extract_relevant_features(result, y, column_id=0, column_sort=1)
My data included 400,000 rows of sensor data, with 6 sensors each for 15 different IDs. I started running the code, and 17 hours later it still had not finished. I figured this might be too large a data set to run through the relevant feature extractor, so I trimmed it down to 3000 rows, and then further down to 300. None of these actions made the code run in under an hour, and I just ended up shutting it down after an hour or so of waiting. I tried the standard feature extractor as well:
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
I also tried the example dataset that tsfresh presents in their quick start section, which is very similar to my original data, with about the same number of data points as my reduced set.
Does anybody have any experience with this code? How would you go about making it work faster? I'm using Anaconda for python 2.7.
Update
It seems to be related to multiprocessing. Because I am on Windows, the multiprocessing code requires being protected by
if __name__ == "__main__":
    main()
Once I added
if __name__ == "__main__":
    extracted_features = extract_features(timeseries, column_id="id", column_sort="time")
to my code, the example data worked. I'm still having some issues with running the extract_relevant_features function and running the extract_features module on my own data set. It seems as though it continues to run slowly. I have a feeling it's related to the multiprocessing freeze as well, but without any errors popping up it's impossible to tell. It's taking me about 30 minutes to extract features on less than 1% of my dataset.
Which version of tsfresh did you use? Which OS?
We are aware of the high computational cost of some feature calculators. There is little we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.
Have you tried calculating only the basic features by using the MinimalFeatureExtractionSettings? It will only contain basic features such as Max, Min, Median and so on but should run way, way faster.
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time", feature_extraction_settings = MinimalFeatureExtractionSettings())
Also it is probably a good idea to install the latest version from the repo by pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it and the master should contain the newest and freshest version ;).
The syntax has changed slightly (see the docs); the current approach would be:
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())
Or
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())
Since version 0.15.0 we have improved our bindings for Apache Spark and dask.
It is now possible to use the tsfresh feature extraction directly in your usual dask or Spark computation graph.
You can find the bindings in tsfresh.convenience.bindings with the documentation here. For example for dask, it would look something like this (assuming df is a dask.DataFrame, for example the robot failure dataframe from our example)
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
from tsfresh.feature_extraction import EfficientFCParameters

df = df.melt(id_vars=["id", "time"],
             value_vars=["F_x", "F_y", "F_z", "T_x", "T_y", "T_z"],
             var_name="kind", value_name="value")
df_grouped = df.groupby(["id", "kind"])
features = dask_feature_extraction_on_chunk(df_grouped, column_id="id", column_kind="kind",
                                            column_sort="time", column_value="value",
                                            default_fc_parameters=EfficientFCParameters())
# or any other parameter set
Using either dask or Spark (or anything alike) might help you with very large data - both for memory as well as speed (as you can distribute the work over multiple machines). Of course, we still support the usual distributors (docu) as before.
Additional to that, it is also possible to run tsfresh together with a task orchestration system, such as luigi. You can create a task to
* read in the data for only one id and kind
* extract the features
* write out the result to disk
and let luigi handle all the rest. You may find a possible implementation of this here on my blog.
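As an illustration only (the task layout, the load_timeseries_for_id helper and the output paths are assumptions, not taken from the blog post):
import luigi
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

class ExtractFeaturesForId(luigi.Task):
    ts_id = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("features/%s.csv" % self.ts_id)

    def run(self):
        # load_timeseries_for_id is a hypothetical helper returning the rows
        # for a single id as a pandas DataFrame
        df = load_timeseries_for_id(self.ts_id)
        features = extract_features(df, column_id="id", column_sort="time",
                                    default_fc_parameters=MinimalFCParameters(),
                                    n_jobs=1, disable_progressbar=True)
        with self.output().open("w") as f:
            features.to_csv(f)

# luigi then schedules one task per id, e.g.:
# luigi.build([ExtractFeaturesForId(ts_id=i) for i in all_ids], workers=4)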
I've found, at least on a multicore machine, that a better way to distribute the extract_features calculation over independent subgroups (identified by the column_id value) is through joblib.Parallel with the Loky backend.
For example, you define your feature extraction function on a single value of column_id and apply it:
from multiprocessing import cpu_count

from joblib import Parallel, delayed
from tqdm import tqdm
from tsfresh import extract_features

# 'settings' (an fc_parameters dict) and 'my_dataframe' are defined elsewhere
def map_extract_features(df):
    return extract_features(
        timeseries_container=df,
        default_fc_parameters=settings,
        column_id="ID",
        column_sort="DATE",
        n_jobs=1,
        disable_progressbar=True
    ).reset_index().rename({"index": "ID_CONTO"}, axis=1)

out = Parallel(n_jobs=cpu_count() - 1)(
    delayed(map_extract_features)(
        my_dataframe[my_dataframe["ID"] == id]
    ) for id in tqdm(my_dataframe["ID"].unique())
)
This method takes way less memory than specifying column_id directly in the extract_features function.