Cloud ML Feature methods - google-cloud-ml

The pre-processing page in the Cloud ML how-to guide (https://cloud.google.com/ml/docs/how-tos/preprocessing-data) says that you should see the SDK reference documentation for details about each type of feature and the methods available for it.
Can anyone point me to this documentation, or to a list of feature types and their methods? I'm trying to set up a discrete target, but I keep getting "data type int64 expected type: float" errors whenever I set my target to .discrete() rather than .continuous().

You need to download the SDK reference documentation:
1. In a terminal, navigate to the directory where you want to install the docs. If you used ~/google-cloud-ml to download the samples as recommended in the setup guide, that is a good place.
2. Copy the documentation archive to your chosen directory using gsutil:
gsutil cp gs://cloud-ml/sdk/cloudml-docs.latest.tar.gz .
3. Unpack the archive:
tar -xf cloudml-docs.latest.tar.gz
This creates a docs directory inside the directory that you chose. The documentation is essentially a local website: open docs/index.html in your browser to open it at its root. You can find the transform references in there.
(This information is now in the setup guide as well. It's the final step under LOCAL: MAC/LINUX.)

On the type-related errors, let's assume for a moment that your feature set is specified along the following lines:
feature_set = {
    'target': features.target('category').discrete()
}
When a discrete target is specified like this, the data-type of the target feature ends up as int64 due to one of the following:
1. No vocab for the target data-column (i.e. 'category') was generated during the analysis of your data, i.e. the metadata (in the generated metadata.yaml) has an empty list for the target data-column's vocab.
2. A vocab for 'category' was indeed generated, and the data-type of the very first item (or key) of this vocab was an int.
Under these circumstances, if a float is encountered, the transformation to the target feature's data-type will fail. Casting the entire data-column ('category' in this case) to float instead should help, as sketched below.
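As a rough illustration (the file names are hypothetical, and this uses plain pandas rather than the Cloud ML SDK itself), casting the column up front could look like this:
import pandas as pd

# Hypothetical training file; 'category' is the target data-column from the example above.
df = pd.read_csv('training_data.csv')

# Cast the whole target column to float so the analysis step derives a consistent
# float data-type (and vocab) for it, instead of int64.
df['category'] = df['category'].astype(float)

df.to_csv('training_data_float.csv', index=False)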

Related

Beam/Dataflow ReadAllFromParquet doesn't read anything but my job still succeeds?

I have a Dataflow job which:
1. Reads a text file from GCS with other filenames in it
2. Passes the filenames to ReadAllFromParquet to read the .parquet files
3. Writes to BigQuery
Despite my job 'succeeding', it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the files into a list such as: ['gs://my_bucket/my_file1.snappy.parquet', 'gs://my_bucket/my_file2.snappy.parquet', 'gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct, and that the GCS paths to the files are correct, using a logger on the step before ReadAllFromParquet.
This is what my pipeline looks like (omitting the full code for brevity, but I am confident it normally works, as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
    try:
        final_data = (
            pipeline_2
            | 'Create empty PCollection' >> beam.Create([None])
            | 'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
            | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            | 'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId', 'deviceId', 'timestamp'])
            | 'Process all files' >> beam.ParDo(ProcessSch2())
            | 'Transform to rows' >> beam.ParDo(BlisDictSch2())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table=runtime_options.comp_table,
                schema=SCHEMA_2,
                project=pipeline_options_batch.view_as(GoogleCloudOptions).project,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,  # create the table if it does not exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND  # append to existing rows
            )
        )
    except Exception as exception:
        logging.error(exception)
        pass
This is what my job diagram looks like afterwards: [job graph screenshot]
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My current ideas:
1. A bucket permissions issue. I noticed the bucket I am reading from is odd, as earlier I couldn't download the files despite being a project Owner: the project Owners only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin', and manually downloading files with my own account then worked fine. As per the Dataflow documentation, I have ensured that both the default compute service account and the Dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring, as ultimately, if there were a permissions issue, I should see it in the log and the job would fail?
2. ReadAllFromParquet expects the file patterns in a different format? I have shown an example of the list I supply above (and in my job diagram I can see the input collection correctly shows elements added = 48 for the 48 files in the list). I know this format works for ReadAllFromText, so I assumed they are equivalent and should work.
=========
EDIT:
I noticed something else potentially consequential. Comparing against my other job, which uses ReadAllFromText and works fine, I noticed a slight mismatch in naming that is worrying.
This is the name of the output collection for my working job:
Read all files/ReadAllFiles/ReadRange.out0
And this is the name on my parquet job that doesn't actually read anything:
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step in both jobs, but I believe the second part to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py), which both ReadAllFromText and ReadAllFromParquet call.
This seems like a potential bug, but I don't seem to be able to trace it in the source code.
=============
EDIT 2
After some more digging, it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource, whereas ReadAllFromParquet simply calls apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mention whether you are using the latest Beam SDK; try using SDK 2.16 to test the latest changes.
The documentation states that ReadAllFromParquet is an experimental function, as is ReadFromParquet; nonetheless, ReadFromParquet is reported as working in this thread: Apache-Beam: Read parquet files from nested HDFS directories. You might want to try using that function.
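If ReadAllFromParquet really is the culprit, a rough fallback sketch with the non-"All" transform might look like the following (the file pattern is made up, the column names are taken from the question, and everything else is assumed):
import apache_beam as beam

with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        # ReadFromParquet takes a file pattern directly instead of a
        # PCollection of filenames, so the filename-listing step goes away.
        | 'Read parquet files' >> beam.io.ReadFromParquet(
            'gs://my_bucket/my_file*.snappy.parquet',
            columns=['locationItemId', 'deviceId', 'timestamp'])
        # Each element is a dict keyed by column name; print a few to verify data is flowing.
        | 'Log rows' >> beam.Map(print)
    )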

How to train SageMaker BlazingText model using multiple channels

I have two separate normalized text files that I want to train my BlazingText model on.
I am struggling to get this to work and the documentation is not helping.
Basically, I need to figure out how to supply multiple files or S3 prefixes as the "inputs" parameter to the sagemaker.estimator.Estimator.fit() method.
I first tried:
s3_train_data1 = 's3://{}/{}'.format(bucket, prefix1)
s3_train_data2 = 's3://{}/{}'.format(bucket, prefix2)
train_data1 = sagemaker.session.s3_input(s3_train_data1, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
train_data2 = sagemaker.session.s3_input(s3_train_data2, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
bt_model.fit(inputs={'train1': train_data1, 'train2': train_data2}, logs=True)
This doesn't work because SageMaker expects the key in the inputs parameter to be specifically "train".
So then I tried:
bt_model.fit(inputs={'train': train_data1, 'train': train_data2}, logs=True)
This trains the model only on the second dataset and ignores the first one completely.
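(Which makes sense in hindsight: a Python dict literal with a repeated key keeps only the last value, so the first channel is dropped before it ever reaches SageMaker. A quick illustration:)
# Duplicate keys in a dict literal silently collapse to the last one.
inputs = {'train': 'train_data1', 'train': 'train_data2'}
print(inputs)  # {'train': 'train_data2'} -- only the second dataset remains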
Finally, I tried using a manifest file, following the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
(see the manifest file format under the "S3Uri" section)
The documentation says the manifest file is JSON that looks like this example:
[
    {"prefix": "s3://customer_bucket/some/prefix/"},
    "relative/path/to/custdata-1",
    "relative/path/custdata-2"
]
Well, I don't think this is valid JSON in the first place, but what do I know; I still gave it a try.
When I try this:
s3_train_data_manifest = 'https://s3.us-east-2.amazonaws.com/bucketpath/myfilename.manifest'
train_data_merged = sagemaker.session.s3_input(s3_train_data_manifest, distribution='FullyReplicated', content_type='text/plain', s3_data_type='ManifestFile')
data_channel_merged = {'train': train_data_merged}
bt_model.fit(inputs=data_channel_merged, logs=True)
I get an error saying:
ValueError: Error training blazingtext-2018-10-17-XX-XX-XX-XXX: Failed Reason: ClientError: Data download failed:Unable to parse manifest at s3://mybucketpath/myfilename.manifest - invalid format
I tried replacing the square brackets in my manifest file with curly braces, but I still feel the JSON format is missing something that the documentation fails to describe correctly.
You can certainly match multiple files with the same prefix, so your first attempt could have worked as long as you organize your files in your S3 bucket to suit. For example, the prefix s3://mybucket/foo/ will match the files s3://mybucket/foo/bar/data1.txt and s3://mybucket/foo/baz/data2.txt.
However, if there is a third file in your bucket called s3://mybucket/foo/qux/data3.txt that you don't want matched (while still matching the first two), there is no way to achieve that with a single prefix. In these cases a manifest would work. So, in the above example, the manifest would simply be:
[
    {"prefix": "s3://mybucket/foo/"},
    "bar/data1.txt",
    "baz/data2.txt"
]
(And yes, this is valid JSON: it is an array whose first element is an object with an attribute called prefix, and all subsequent elements are strings.)
Please double check your manifest (you didn't actually post it so I can't do that for you) and make sure it conforms to the above syntax.
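For what it's worth, here is a small sketch of building such a manifest programmatically, uploading it, and pointing a single 'train' channel at it (the bucket, keys and file names are made up; s3_input is used as in your question):
import json
import boto3
import sagemaker

# The manifest from the example above, built as a plain Python list.
manifest = [
    {"prefix": "s3://mybucket/foo/"},
    "bar/data1.txt",
    "baz/data2.txt",
]

# Upload it to S3; it is then referenced by its s3:// URI below.
boto3.client('s3').put_object(
    Bucket='mybucket',
    Key='manifests/train.manifest',
    Body=json.dumps(manifest),
)

train_data_merged = sagemaker.session.s3_input(
    's3://mybucket/manifests/train.manifest',
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='ManifestFile')

# bt_model.fit(inputs={'train': train_data_merged}, logs=True)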
If you're still stuck, please open a thread on the AWS SageMaker forums (https://forums.aws.amazon.com/forum.jspa?forumID=285), and after you do that we can set up a PM to try and get to the bottom of this (never post your AWS account ID in a public forum like Stack Overflow or even in the AWS forums).

Set screenshot path from default project location to different folder location

I have a suite with 50 test cases. When I execute the suite, all the failure screenshots end up in the project's folder. I want to store those screenshots in a different directory, named after the test case, and I want this to be a one-time setup rather than something done explicitly for every test case.
There are quite a few ways to change the default screenshot directory.
One way is to set the screenshot_root_directory argument when importing Selenium2Library. See the importing section of Selenium2Library's documentation, and importing libraries in the user guide.
Another way is to use the Set Screenshot Directory keyword, which does pretty much the same thing as specifying a path when importing the library. However, with this keyword you can change the path to a new one whenever you like. For example, you could give each test case its own screenshot directory. Based on your question, this may be the best solution.
And finally, you may also post-process screenshots using an external tool, or even a listener, that moves all screenshots to another directory. The previously mentioned solutions are in most cases much better, but you may still want to do this in some cases, for example when the directory where you want screenshots saved is only created after the tests have finished executing.
I suggest the following.
To use a new directory, put this immediately after you open the browser:
Open Browser    ${URL}    chrome
Set Screenshot Directory    ${OUTPUT FILE}${/}..${/}${TEST_NAME}${/}
To replace the default screenshot name with your own, create the following keyword:
sc
    Capture Page Screenshot    filename=${SUITE_NAME}-{index}.png
Then create another keyword and run it in the test case's Setup:
Register Keyword To Run On Failure    sc
In the above example, I created a new folder named after the test case, and on failure a screenshot is captured with a name based on the suite name (instead of the default 'selenium-screenshot-1.png').

WSDL Document generator wsdl-viewer

I am looking for a tool to generate documentation from a WSDL file. I have found wsdl-viewer.xsl, which seems promising. See https://code.google.com/p/wsdl-viewer/
However, I am seeing a limitation where references to complex data types are not mapped out (or explained). For example, say we have a createSnapshot() operation that creates a snapshot and returns an object representing a snapshot. Running xsltproc(1) with wsdl-viewer.xsl, the rendered documentation has an output section that describes the output as:
Output: createSnapshotOut
parameter type createSnapshotResponse
snapshots type Snapshot
I'd like to be able to click on "Snapshot" and see the schema definition of Snapshot.
Is this possible? Perhaps I am not using xsltproc(1) correctly, or perhaps it is not able to find the .xsd files. Here are some of the relevant files I have:
SnapshotMgmntProvider.wsdl
SnapshotMgmntProviderDefinitions.xsd
SnapshotMgmntProviderTypes.xsd

Compile error using F# WsdlService type provider

I'm trying out type providers in F#. I've had some success using the WsdlService provider in the following fashion:
type ec2 = WsdlService<"http://s3.amazonaws.com/ec2-downloads/ec2.wsdl">
but when I download that wsdl, rename it to .wsdlschema and supply it as a local schema according to the method specified in this example:
type ec2 = WsdlService< ServiceUri="N/A", ForceUpdate = false,
LocalSchemaFile = """C:\ec2.wsdlschema""">
Visual Studio emits an error message:
The type provider 'Microsoft.FSharp.Data.TypeProviders.DesignTime.DataProviders' reported an error: Error: No valid input files specified. Specify either metadata documents or assembly files
This message is wrong, since the file quite plainly is valid, as the previous example proves.
I've considered permissions issues, and I've repeated the same example from my user folder, making sure to grant full control to all users in both cases, as well as running VS as administrator.
Why does the F# compiler think the file isn't valid?
edit #1: I have confirmed that doing the same thing doesn't work for http://gis1.usgs.gov/arcgis/services/gap/GAP_Land_Cover_NVC_Class_Landuse/MapServer?wsdl either (a USGS vegetation-related API) whereas referencing the wsdl online works fine.
Hmmm, it appears that the type provider is rather stubborn and inflexible in that it requires a true "wsdlschema" doc when using the LocalSchemaFile option. A wsdlschema document can contain multiple .wsdl and .xsd files within it, wrapped in some XML to keep them separate. I'm guessing this is some kind of standard thing in the Microsoft toolchain, but perhaps others (e.g. Amazon) don't expose stuff like this.
The first thing the TP attempts to do is unpack the wsdlschema file into its separate parts, and sadly it does the wrong thing if in fact there is no unpacking to be done. Then, when it tries to point svcutil.exe at the unpacked schema files to do the codegen, this dies with the error message you are seeing.
Workaround: Add the expected bits of XML into your file, and it will work.
<?xml version="1.0" encoding="utf-8"?>
<ServiceMetadataFiles>
  <ServiceMetadataFile name="ec2.wsdl">
    [body of your WSDL goes here]
  </ServiceMetadataFile>
</ServiceMetadataFiles>