Weka setDataset(), do I need to use a full data set?

Do I need to use my full training data set, or can I use a data set containing only the attribute descriptions, built from an ARFF file with exactly the same attributes and, say, one instance?
I am using a classifier on an EC2 instance, and I don't want to keep the entire data set on the EC2 instance because it is very large and keeps growing.
Does Weka require the entire data set, or only the description from the ARFF file?

The setDataset() method only attaches the attribute descriptions from the Instances object (your dataset) that you defined previously to your Instance; it does not copy or require the actual data. It therefore does not matter how large the dataset you pass to setDataset() is, and a header-only ARFF file with the same attributes is enough.
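For illustration, here is a minimal sketch of that idea using the python-weka-wrapper3 bindings (an assumption on my part; the question is about the Java API, where the equivalent call is Instance.setDataset(Instances)). The file name header.arff is hypothetical and stands for an ARFF file that declares the same attributes as the training data but contains little or no data.

```python
# Minimal sketch, assuming python-weka-wrapper3 is installed and a JVM can be started.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.dataset import Instance

jvm.start()

# "header.arff" (hypothetical) declares the same attributes as the training set,
# but carries no (or only one) instance, so it stays tiny on the EC2 box.
header = Loader(classname="weka.core.converters.ArffLoader").load_file("header.arff")
header.class_is_last()

# Build a fresh instance and attach the header: setDataset() (the dataset property
# here) only gives the instance access to the attribute descriptions.
# The placeholder values must match the header's attribute order and count.
inst = Instance.create_instance([5.1, 3.5, 1.4, 0.2, 0])
inst.dataset = header
print(inst)

jvm.stop()
```

The same pattern applies in Java: read the header-only ARFF into an Instances object, set the class index, and call setDataset() on each new Instance before handing it to a previously trained model.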

Related

What is an AWS S3 dataset?

Looking at the documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name and AWS autogenerates the names for the files that are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept, or does it have a specific meaning in the context of AWS? What exactly is a dataset, and when should it be set to True?
The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all the extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, and projection_values, is lost when you save to a plain CSV or Parquet file (dataset=False). On the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
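As a rough illustration of the difference, here is a minimal sketch (the bucket name, prefixes, and columns are made up for the example):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"], "year": [2023, 2023]})

# dataset=True: write under a prefix, let awswrangler generate the file names,
# and unlock dataset-only options such as mode and partition_cols.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-dataset/",  # a prefix, not a single file (hypothetical bucket)
    dataset=True,
    mode="append",
    partition_cols=["year"],
)

# dataset=False (the default): one ordinary file written to an exact key,
# which is what you want if, for example, Excel will read the CSV.
wr.s3.to_csv(
    df=df,
    path="s3://my-bucket/exports/report.csv",
    index=False,
)
```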

What are the differences between object storage (for example S3) and a columnar-based technology?

I was thinking about the difference between those two approaches.
Imagine you must handle information about pattern calls, which later should be displayed to the user. A pattern call is a tuple consisting of a unique integer identifier ("id"), a user-defined name ("name"), a project-relative path to the so-called pattern file ("patternFile"), and a convenience flag which states whether the pattern should be called or not. The number of tuples is not known beforehand, and they won't be modified after initialization.
I thought that in this case a column-based approach, with BigQuery for example, would be better in terms of I/O and performance as well as schema evolution, but I actually can't understand why. I would appreciate any help.
Amazon S3 is like a large key-value store. The Key is the filename (with full path) and the Value is the contents of the file. It's just a blob of data.
A columnar data store organizes data in such a way that specific data can be "jumped to", and only desired values need to be read from disk.
If you want to perform a search on the data, then some form of query logic is required. This could be done by storing the data in a database (typically a proprietary format) or by using a columnar storage format such as Parquet or ORC plus a query engine that understands this format (e.g. Amazon Athena).
The difference between S3 and columnar data stores is like the difference between a disk drive and an Oracle database.
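To make the "only desired values need to be read" point concrete, here is a minimal local sketch using Parquet via pandas/pyarrow (an illustration only; the column names follow the pattern-call tuple from the question, and the file name is made up):

```python
import pandas as pd
import pyarrow.parquet as pq

# Write the pattern-call tuples once; they are not modified afterwards.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["alpha", "beta", "gamma"],
    "patternFile": ["patterns/a.pat", "patterns/b.pat", "patterns/c.pat"],
    "callPattern": [True, False, True],
})
df.to_parquet("pattern_calls.parquet")  # requires pyarrow (or fastparquet)

# A columnar format lets the reader fetch only the columns it needs,
# instead of pulling whole objects/rows off storage.
table = pq.read_table("pattern_calls.parquet", columns=["id", "callPattern"])
print(table.to_pandas())
```

With S3 alone, by contrast, you would typically download the whole object and filter it yourself; engines like Athena or BigQuery push that column pruning and filtering down to the storage layer.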

Can I use pre-labeled data in AWS SageMaker Ground Truth NER?

Let's say I have some text data that has already been labeled in SageMaker. This data could have been labeled either by humans or by an NER model. Now let's say I want a human to go back over the dataset, either to label a new entity class or to correct existing labels. How would I set up a labeling job to allow this? I tried using an output manifest from another labeling job, but none of the documents that were already labeled can be accessed by workers for re-labeling.
Yes, this is possible. You are looking for custom labeling workflows. You can also apply either Majority Voting (MV) or MDS to evaluate the accuracy of the job.

Google Dataprep: Save GCS file name as one of the columns

I have a Dataprep flow configured. The dataset is a GCS folder (all files from it), and the target is a BigQuery table.
Since the data comes from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath, which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface.)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476

libpq: get data type

I am coding a C++ project that uses the PostgreSQL database.
I created a table in my database; its column type is character varying(40).
Now I need to SELECT this data FROM the table in my C++ project. I know that I should use the libpq library, which is the C/C++ interface for PostgreSQL.
I have succeeded in selecting data from the table. Now I am wondering whether it's possible to get the data type of the column, for example character varying(40) here.
You need to use PQftype.
As described here: http://www.idiap.ch/~formaz/doc/postgreSQL/libpq-chapter17861.htm
And take a look here about decoding the returned type OIDs: http://www.postgresql.org/message-id/da7021e0608040738l3b0880a1q5a76b838937f8c78#mail.gmail.com
You must also use PQfsize to get the field size, and PQfmod to get the type modifier (which is what carries the (40) in character varying(40)).