Use a custom classifier in Glue for multi-line records - amazon-web-services

I have some files in the following format:
AB1|STUFF|1234|
AB2|SF|STUFF|
AB1|STUFF|45670|
AB2|AF|STUFF|
Each bit of data is delimited by '|' and a record is made up of the data in lines AB1 and AB2.
I would like to use a custom grok classifier in Glue, something like the following:
(?<LINE1>AB1)\|%{WORD:ignore1}\|%{NUMBER:id}\|\n%{WORD:LINE2}\|%{WORD:make}\|%{WORD:stuff2}\|
That is, a multi-line grok expression to extract the data from a multi-line record as shown above. I am unsure how the classifiers in Glue work; any comments or advice would be very helpful.

According to the Glue documentation:
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also, line breaks within a pattern are not supported.
I am not sure what the actual question is; if you need general guidance on how to create your own classifier, I would advise you to read this and this.
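Since grok classifiers cannot span lines, one workaround is to merge each AB1/AB2 pair into a single pipe-delimited line before crawling, so that a single-line grok pattern (or the built-in CSV classifier) can parse it. Below is a minimal sketch of that preprocessing step; the file names, and the assumption that records are strict AB1-then-AB2 pairs, are illustrative:

# Merge each AB1/AB2 line pair into one single-line record so a
# single-line Glue classifier can parse it. Assumes strict AB1-then-AB2
# pairs, as in the sample above; file names are illustrative.
def merge_records(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        pending = None
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("AB1|"):
                pending = line
            elif line.startswith("AB2|") and pending is not None:
                # e.g. "AB1|STUFF|1234|AB2|SF|STUFF|"
                dst.write(pending + line + "\n")
                pending = None

merge_records("input.dat", "merged.dat")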

Related

How to structure input/formats for batch inference in SageMaker?

The example provided in the AWS documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html, states that the input CSV can be structured like the sample below. I noticed that batch jobs in SageMaker can accept JSON as well. How should the JSON be structured? Does each record need to be on a single line, as shown in the CSV example, or can it be multi-line?
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
...
It is recommended to make use of JSON Lines (i.e., each JSON object on a single line). You can then set BatchStrategy to MultiRecord and SplitType to Line.
Batch Transform can then fit as many records into each mini-batch as the MaxPayloadInMB limit allows.
Kindly see the CreateTransformJob API for more information.
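As a rough sketch of how these parameters fit together (the job, model, and bucket names, instance type, and payload size below are placeholder assumptions), a transform job over a JSON Lines input could be created with boto3 like this:

# Create a Batch Transform job over a JSON Lines input; names and sizes
# below are placeholder assumptions.
#
# records.jsonl (one JSON object per line), e.g.:
#   {"features": [1.0, 2.0, 3.0]}
#   {"features": [4.0, 5.0, 6.0]}
import boto3

sm = boto3.client("sagemaker")
sm.create_transform_job(
    TransformJobName="my-batch-job",
    ModelName="my-model",
    BatchStrategy="MultiRecord",   # pack several records per request
    MaxPayloadInMB=6,              # upper bound on each mini-batch
    TransformInput={
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/input/records.jsonl",
        }},
        "ContentType": "application/jsonlines",
        "SplitType": "Line",       # one record per line
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)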

Annotation specs - AutoML (GCP)

I'm using the Natural Language module on Google Cloud Platform, more specifically AutoML for text classification.
I came across this error, which I do not understand, after I finished importing my data and the text had been processed:
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Has anyone else run into it?
Thanks
Take a look at the AutoML Quotas & Limits documentation for a better understanding.
It seems that you are hitting the upper limit of labels per dataset. Check it under AutoML limits --> Labels per dataset --> 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.
I also got this error while I was certain that my number of labels was below 5000. It turned out to be an error in my CSV formatting.
When you create your text data using to_csv() in Pandas, it only quotes the parts of the text that contain a comma, while AutoML Text wants all lines of the text to be quoted. I have written the solution in this Stack Overflow answer.
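If that is the cause, forcing Pandas to quote every field on export should fix it. A minimal sketch (the DataFrame contents and file name are illustrative):

# Force quoting of every field so AutoML Text accepts the CSV;
# the data and file name below are illustrative assumptions.
import csv
import pandas as pd

df = pd.DataFrame({"text": ["first description", "second, with a comma"],
                   "label": ["shoe", "shirt"]})

# csv.QUOTE_ALL quotes every field, not just those containing commas
df.to_csv("dataset.csv", index=False, header=False, quoting=csv.QUOTE_ALL)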

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them have helped me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs which have been stripped down to contain only numbers, yet I still get this error. The file has only one column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was using the sample notebook Julien Simon mentioned earlier, but at some point the data was ending up as strings! The funny thing about RCF is that it has to run on numeric data.
What I did was cast the array to a numeric type as a double check, and voilà! It worked. I am at a loss over how the data ended up in string format, but alas, that was the issue. Simple solution.
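As a sketch of that fix (the file name is illustrative, and rcf is the estimator from the snippet above), an explicit numeric cast before building the record set looks like:

# Make sure the training array is numeric before building the record set;
# 'ips.csv' is an illustrative file name.
import numpy as np
import pandas as pd

df = pd.read_csv("ips.csv", header=None)

# An explicit cast catches values that were silently read in as strings
train_data = df.to_numpy().astype(np.float32).reshape(-1, 1)

# rcf is the RandomCutForest estimator defined in the snippet above
rcf.fit(rcf.record_set(train_data))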

How to read a text file and return an additional input field using TextIO?

I have a PCollection of KV where the key is a filename and the value is some additional info about the file (e.g., the "Source" system that generated the file). E.g.,
KV("gs://bucket1/dir1/X1.dat", "SourceX"),
KV("gs://bucket1/dir2/Y1.dat", "SourceY")
I need to read all lines from the files and, together with the "Source" field, return them as a KV PCollection:
KV(line1 from X1.dat, "SourceX")
KV(line2 from X1.dat, "SourceX")
...
KV(line1 from Y1.dat, "SourceY")
I was able to achieve this by calling FileIO.match() followed by a DoFn in which I sequentially read each file and append the Source (retrieved from a map passed in as a side input).
To get the benefit of parallel reading, could I use TextIO.readAll() to achieve this? TextIO.read() returns a PCollection without filename info. How can I join it back to the filename-to-Source map? I tried the WithKeys transform, but it is not working ...
Currently, using FileIO.match() as you are doing is the best way to accomplish this, but once https://github.com/apache/beam/pull/12645 is merged you'll be able to use the new ContextualTextIO transforms.
Note that computing line numbers in a distributed manner is inherently expensive; you might want to see if you can use offsets instead (much easier to compute, and ordered the same as line numbers).
If I understand correctly, you want to read the files in parallel? Unfortunately, TextIO.readAll does not have this feature. You will have to use FileIO.match and then write your DoFn to read the file in the custom way that you want.
This is because you will not be able to do a random seek into a file and preserve the count of line numbers.
Is reading files serially a bottleneck for your pipeline?
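For what it's worth, here is a rough sketch of the same FileIO.match()-then-DoFn pattern in the Beam Python SDK (the question uses the Java SDK; the file pattern and transform names here are illustrative):

# Match files, read each one in a DoFn, and attach its "Source" from a
# dict side input keyed by file path. Pattern and names are illustrative.
import apache_beam as beam
from apache_beam.io import fileio

class ReadLinesWithSource(beam.DoFn):
    def process(self, readable_file, sources):
        # readable_file is a fileio.ReadableFile; sources maps path -> source
        source = sources.get(readable_file.metadata.path)
        for line in readable_file.read_utf8().splitlines():
            yield (line, source)

with beam.Pipeline() as p:
    sources = p | "Sources" >> beam.Create([
        ("gs://bucket1/dir1/X1.dat", "SourceX"),
        ("gs://bucket1/dir2/Y1.dat", "SourceY"),
    ])
    lines = (
        p
        | fileio.MatchFiles("gs://bucket1/dir*/*.dat")
        | fileio.ReadMatches()
        | beam.ParDo(ReadLinesWithSource(),
                     sources=beam.pvalue.AsDict(sources)))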

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and contain many words that are not related to the product. I want to know which rows are about the same product, so I need to tag each description with its main topics. For example, given:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like "shoe" and "sport". So I am looking to build an approach for semantic tagging of sentences, not part-of-speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative, however, could be to compute the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e., words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and hand-pick (or erase) the words that you deem important (or unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
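As a concrete illustration of the idf-ranking idea (a minimal sketch; the sample descriptions and n-gram range are illustrative assumptions), scikit-learn's TfidfVectorizer exposes smoothed idf weights directly:

# Rank terms by smoothed idf; the sample descriptions are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "200 units shoe red sport running import plastic",
    "box shirt blue casual cotton import oversea",
]

vec = TfidfVectorizer(ngram_range=(1, 2), smooth_idf=True)
vec.fit(descriptions)

# Low idf = the term appears in many descriptions; per the reasoning above,
# such terms likely carry semantic weight. Inspect the sorted list and
# hand-pick (or erase) terms to build your tag vocabulary.
for idf, term in sorted(zip(vec.idf_, vec.get_feature_names_out()))[:15]:
    print(f"{term}: {idf:.2f}")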