Can Great Expectations segregate good and bad records? - great-expectations

I am using Great Expectations in my ETL data pipeline for a POC. I have a validation which is failing (as expected), and I have the following data in my validation JSON:
"unexpected_count": 205,
"unexpected_percent": 10.25,
"unexpected_percent_nonmissing": 10.25,
"unexpected_percent_total": 10.25
Is there a way to identify/segregate these bad records (depicted as unexpected_count)?

Related

Amazon SageMaker Ground Truth: Cannot get active learning to work

I am trying to test SageMaker Ground Truth's active learning capability, but cannot figure out how to get the auto-labeling part to work. I started a previous labeling job with an initial model that I had to create manually. This allowed me to retrieve the model's ARN as a starting point for the next job. I uploaded 1,758 dataset objects and labeled 40 of them. I assumed the auto-labeling would take it from here, but the job in SageMaker just says "complete" and only displays the labels that I created. How do I make the auto-labeler work?
Do I have to manually label 1,000 dataset objects before it can start working? I saw this post: Information regarding Amazon Sagemaker groundtruth, where the representative said that some of the 1,000 objects can be auto-labeled, but how is that possible if it needs 1,000 objects to start auto-labeling?
Thanks in advance.
I'm an engineer at AWS. In order to understand the "active learning"/"automated data labeling" feature, it will be helpful to start with a broader recap of how SageMaker Ground Truth works.
First, let's consider the workflow without the active learning feature. Recall that Ground Truth annotates data in batches [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-batching.html]. This means that your dataset is submitted for annotation in "chunks." The size of these batches is controlled by the API parameter MaxConcurrentTaskCount [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount]. This parameter has a default value of 1,000. You cannot control this value when you use the AWS console, so the default value will be used unless you alter it by submitting your job via the API instead of the console.
Now, let's consider how active learning fits into this workflow. Active learning runs in between your batches of manual annotation. Another important detail is that Ground Truth will partition your dataset into a validation set and an unlabeled set. For datasets smaller than 5,000 objects, the validation set will be 20% of your total dataset; for datasets larger than 5,000 objects, the validation set will be 10% of your total dataset. Once the validation set is collected, any data that is subsequently annotated manually constitutes the training set. The collection of the validation set and training set proceeds according to the batch-wise process described in the previous paragraph. A longer discussion of active learning is available in [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html].
That last paragraph was a bit of a mouthful, so I'll provide an example using the numbers you gave.
Example #1
Default MaxConcurrentTaskCount ("batch size") of 1,000
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch #
1. Annotate 351 objects to populate the validation set (1,407 remaining).
2. Annotate 1,000 objects to populate the first iteration of the training set (407 remaining).
3. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 407 objects.
4. (Assume no objects were automatically labeled in step #3) Annotate 407 objects. End labeling job.
Example #2
Non-default MaxConcurrentTaskCount ("batch size") of 250
Total dataset size: 1,758 objects
Computed validation set size: 0.2 * 1758 = 351 objects
Batch #
1. Annotate 250 objects to begin populating the validation set (1,508 remaining).
2. Annotate 101 objects to finish populating the validation set (1,407 remaining).
3. Annotate 250 objects to populate the first iteration of the training set (1,157 remaining).
4. Run active learning. This step may, depending on the accuracy of the model at this stage, result in the annotation of zero, some, or all of the remaining 1,157 objects. All else being equal, we would expect the model to be less accurate than the model in example #1 at this stage, because our training set is only 250 objects here.
5. Repeat alternating steps of annotating batches of 250 objects and running active learning.
Hopefully these examples illustrate the workflow and help you understand the process a little better. Since your dataset consists of 1,758 objects, the upper bound on the number of automated labels that can be supplied is 407 objects (assuming you use the default MaxConcurrentTaskCount).
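The arithmetic in the two examples can be reproduced with a short Python sketch. This is only an illustration of the batching described in this answer, not Ground Truth's actual implementation: the function names are made up for this post, rounding down is assumed because it matches the 351 figure, and a dataset of exactly 5,000 objects is assumed to fall into the 10% bucket.

```python
def validation_set_size(total: int) -> int:
    """Validation set size as described above: 20% below 5,000 objects, else 10%."""
    percent = 20 if total < 5000 else 10  # assumption: exactly 5,000 uses 10%
    return total * percent // 100         # assumption: round down (1758 -> 351)


def batches_before_active_learning(total: int, batch_size: int = 1000):
    """Return (manual batch sizes up to the first active learning run, objects left).

    Hypothetical helper for illustration only.
    """
    remaining_validation = validation_set_size(total)
    unlabeled = total
    batches = []

    # Fill the validation set first, at most batch_size objects per batch.
    while remaining_validation > 0:
        step = min(batch_size, remaining_validation)
        batches.append(step)
        remaining_validation -= step
        unlabeled -= step

    # One more manual batch forms the first iteration of the training set.
    first_training_batch = min(batch_size, unlabeled)
    batches.append(first_training_batch)
    unlabeled -= first_training_batch

    return batches, unlabeled


print(batches_before_active_learning(1758))        # example #1
print(batches_before_active_learning(1758, 250))   # example #2
```

The second value returned is the pool that active learning can draw from on its first run, which is where the upper bound of 407 automated labels for the default batch size comes from.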
Ultimately, 1,758 objects is still a relatively small dataset. We typically recommend at least 5,000 objects to see meaningful results [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html]. Without knowing any other details of your labeling job, it's difficult to gauge why your job didn't result in more automated annotations. A useful starting point might be to inspect the annotations you received, and to determine the quality of the model that was trained during the Ground Truth labeling job.
Best regards from AWS!

How to do prediction with Weka

I'm using Weka to do some text mining and I'm a little confused. I have a set of 9,551 comments that are classified in some way as notes, status of work, not conformity, or warning. How can I predict whether a new comment belongs to a specific class? I preprocessed all the comments with the StringToWordVector filter to obtain a vector of tokens, and then used SimpleKMeans to obtain a number of clusters.
So the question is: if a user posts a new comment, can I predict from this data whether it belongs to one of the comment categories?
Sorry if my question is a little confusing, but so am I.
Thank you.
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved to your validation set. Record the error. What is the difference between the training error and the validation error? If both are low, the model's generalization is "seemingly" good.
Prepare a test dataset that has all the features of your training and validation datasets but where the class/cluster is unknown.
Apply the model on the test data.
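The split in the first step could look roughly like this in Python. It is a minimal sketch rather than anything Weka-specific: `split_labelled` is a hypothetical helper, the 60% figure is just the suggestion above, and the fixed seed is only there to make the shuffle repeatable.

```python
import random


def split_labelled(instances, train_percent=60, seed=0):
    """Shuffle the labelled instances and split them into (training, validation)."""
    shuffled = list(instances)             # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a repeatable split
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 9,551 labelled comments from the question.
training, validation = split_labelled(range(9551))
print(len(training), len(validation))
```

Shuffling before cutting matters when the labelled data is ordered (by date or by class, for example); otherwise the two sets may not look alike.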
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors low? If yes, then save the model and apply it to the test data whose class/cluster is unknown.
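To make the 10-fold setup concrete, here is a small Python sketch of how the folds partition the data. Weka does this for you; `k_fold_indices` is a hypothetical helper shown only to illustrate the idea, and it assumes the instances have already been shuffled.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds over n items."""
    # Spread the remainder over the first n % k folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for fold, (train, test) in enumerate(k_fold_indices(25), start=1):
    print(fold, len(train), len(test))
```

Each instance lands in the held-out part exactly once, and the CV error is the average of the errors measured on the ten held-out parts.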
NB: The training/validation/test errors and their differences will give you a "very initial" idea of whether your model overfits or underfits. They are sanity tests. You need to perform other tests, such as learning curves, to see whether your model overfits, underfits, or fits well. If there appears to be an overfitting or underfitting problem, you will need to try different techniques to overcome it.

Weka cross validation wrong results

I am classifying 5 minutes of EEG data of 4 classes using a Bayesian Network.
When applying cross validation I get 100% correct results whereas when I use training and supplied testing data (the first 3.7 minutes for training, 1.3 minutes for testing) in a separate file I get really low results (30%).
I am new to Weka and do not know how this is possible. Any help would be highly appreciated :)

XML or CSV for "Tabular Data"

I have "Tabular Data" to be sent from server to client, and I am trying to decide whether to go with a CSV-like format or XML.
The data I send can run to several MB. The server will stream it, and the client will read it line by line and start parsing the output as it arrives (the client can't wait for all the data to come).
My current thinking is that CSV would be good: it would reduce the data size and can be parsed faster.
XML is a standard, but I am concerned about parsing the data as it arrives (live parsing) and about the data size.
What would be the best solution?
Thanks for all valuable suggestions.
If it is "Tabular data" and the table is relatively fixed and regular, I would go for a CSV-format. Especially if it is one server and one client.
XML has some advantages if you have multiple clients and want to validate the file format before using the data. On the other hand, XML has cornered the market for "code bloat", so the amount transferred will be much larger.
I would use CSV, with a header which indicate the id of each field.
id, surname, givenname, phone-number
0, Doe, John, 555-937-911
1, Doe, Jane, 555-937-911
As long as you do not forget the header, you should be fine if the data format ever changes. Of course, the client needs to be updated before the server starts sending new streams.
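On the client side, reading such a stream row by row and looking fields up by header name might look like the following Python sketch. It assumes the exact sample above (comma-separated, with a space after each comma); `stream_rows` is a hypothetical name, and the `io.StringIO` object stands in for whatever line iterator your connection gives you.

```python
import csv
import io


def stream_rows(lines):
    """Yield one dict per data row, keyed by header name, as lines arrive."""
    # DictReader consumes the iterable lazily, so rows can be handled before
    # the whole stream has been received.
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        yield row


# Stand-in for the network stream.
incoming = io.StringIO(
    "id, surname, givenname, phone-number\n"
    "0, Doe, John, 555-937-911\n"
    "1, Doe, Jane, 555-937-911\n"
)

for row in stream_rows(incoming):
    print(row["givenname"], row["surname"], row["phone-number"])
```

Because fields are looked up by name rather than by position, the client keeps working if the server later reorders or appends columns.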
If not all clients can be updated easily, then you need a more lenient messaging system.
Google Protocol Buffers was designed for this kind of backward/forward compatibility issue, and combines this with excellent (fast and compact) binary encoding to reduce message sizes.
If you go with this, then the idea is simple: each message represents a line. If you want to stream them, you need a simple "message size | message blob" structure.
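A "message size | message blob" stream can be sketched in a few lines of Python. This shows only the framing, not Protocol Buffers itself: the payloads are opaque bytes, the 4-byte big-endian length prefix is an arbitrary choice for this example, and the function names are hypothetical. It also assumes a buffered stream whose `read(n)` returns `n` bytes unless the stream has ended; a raw socket would need a loop around `recv`.

```python
import io
import struct

LENGTH_PREFIX = struct.Struct(">I")  # 4-byte unsigned big-endian length


def write_message(stream, payload: bytes) -> None:
    """Write one framed message: length prefix followed by the payload."""
    stream.write(LENGTH_PREFIX.pack(len(payload)))
    stream.write(payload)


def read_messages(stream):
    """Yield payloads one at a time until the stream is exhausted."""
    while True:
        header = stream.read(LENGTH_PREFIX.size)
        if not header:
            return  # clean end of stream
        if len(header) < LENGTH_PREFIX.size:
            raise EOFError("truncated length prefix")
        (size,) = LENGTH_PREFIX.unpack(header)
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated message body")
        yield payload


buffer = io.BytesIO()
write_message(buffer, b"0,Doe,John")
write_message(buffer, b"1,Doe,Jane")
buffer.seek(0)
for message in read_messages(buffer):
    print(message)
```

The client can process each row as soon as its blob is complete, which gives the same "parse as it arrives" behaviour as reading CSV line by line.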
Personally, I have always considered XML bloated by design. If you ever go with human-readable formats, then at least choose JSON; you'll cut the tag overhead by half.
I would suggest you go for XML.
There are plenty of libraries available for parsing.
Moreover, if the data format changes later, the parsing logic for XML won't need to change; only the business logic may.
With CSV, however, the parsing logic might need to change.
The CSV format will be smaller, since you only have to declare the headers on the first row, followed by rows of data in which only the commas in between add extra characters to the stream size.

Yahoo Maps Geocode

How do I work around a problem with the Yahoo Maps geocode result set? The result set being returned is wrong: the city field contains the city, region and postal code, as seen below.
Is there a way to work around this issue without breaking scalability?
-33.924320
151.187057
203 Coward St
MASCOT NSW 2020
Australia
AU
The Yahoo geocoding service usually returns either XML or serialized PHP. Since you are querying the geocoding service, I assume you already have the address and want to get the coordinates for your geoPoint. It is possible that you are sending the maps engine a malformed request.
If you think you have found a bug, you can send them an email, but I suggest first checking other locations or posting your code here so that any errors can be spotted.
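In the meantime, a client-side stopgap is to split the combined field yourself. The Python sketch below is based only on the single sample in the question ("MASCOT NSW 2020") and assumes the value always ends with a two- or three-letter region code followed by a four-digit postal code; `split_locality` is a hypothetical helper, and values that do not match are returned unchanged in the city slot.

```python
import re

# Assumed layout: "<city> <REGION> <4-digit postcode>", as in "MASCOT NSW 2020".
_LOCALITY = re.compile(
    r"^(?P<city>.+?)\s+(?P<region>[A-Z]{2,3})\s+(?P<postcode>\d{4})$"
)


def split_locality(value):
    """Split a combined city field into city, region and postcode where possible."""
    text = value.strip()
    match = _LOCALITY.match(text)
    if match is None:
        # Unknown layout: leave the value untouched rather than guess.
        return {"city": text, "region": None, "postcode": None}
    return match.groupdict()


print(split_locality("MASCOT NSW 2020"))
```

This is a single regular-expression match per result, so it should not affect how the lookup scales, but it will need adjusting for countries with a different address layout.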