Blankspace and colon not found in firstline - amazon-web-services

I have a jupyter notebook in SageMaker in which I want to run the XGBoost algorithm. The data has to match 3 criteria:
- No header row
- Outcome variable in the first column, features in the rest of the columns
- All columns need to be numeric
The error I get is the following:
Error for Training job xgboost-2019-03-13-16-21-25-000:
Failed Reason: ClientError: Blankspace and colon not found in firstline
'0.0,0.0,99.0,314.07,1.0,0.0,0.0,0.0,0.48027846,0.0...' of file 'train.csv'
In the error itself you can see that there are no headers, the outcome is in the first column (it only takes 1.0 and 0.0 values), and all features are numerical. The data is stored in its own bucket.
I have seen a related question on GitHub, but there is no solution there. Also, the example notebook that Amazon provides does not change the default separator or anything else when saving a dataframe to CSV for later use.

The error message indicates that XGBoost expected the input data set in libsvm format rather than CSV. SageMaker XGBoost assumes libsvm input by default; to use a CSV input data set, explicitly specify the content type as text/csv.
For more information: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost
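For illustration, a minimal sketch of how that might look with the (v1) SageMaker Python SDK; the bucket paths, instance type and hyperparameters below are placeholders, not taken from the question:

import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
container = get_image_uri(sess.boto_region_name, "xgboost")  # built-in XGBoost image

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    output_path="s3://your-bucket/xgboost/output",  # placeholder
    sagemaker_session=sess,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# The important part: declare the content type explicitly so the algorithm
# parses the channels as CSV instead of the default libsvm.
train_input = sagemaker.s3_input(
    s3_data="s3://your-bucket/xgboost/train/",
    content_type="text/csv",
)
validation_input = sagemaker.s3_input(
    s3_data="s3://your-bucket/xgboost/validation/",
    content_type="text/csv",
)

xgb.fit({"train": train_input, "validation": validation_input})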

How to structure input/formats for batch inference in SageMaker?

The example provided in the AWS documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html, states that the input CSV can be structured like the sample below. I noticed that batch jobs in SageMaker can also accept JSON. How should the JSON be structured? Does each record need to be on a single line, as in the CSV example, or can it be multiline?
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
...
It is recommended to use JSON Lines, i.e. each JSON object on its own line. You can then set BatchStrategy to MultiRecord and SplitType to Line.
Batch Transform can then fit as many records as possible into a mini-batch within the MaxPayloadInMB limit.
Kindly see the CreateTransformJob API for more information.
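As a rough sketch (not from the answer above), the corresponding boto3 call could look like this; the job name, model name, S3 paths, content type and instance type are all placeholders that depend on your model container:

import boto3

sm = boto3.client("sagemaker")

# The input object on S3 is JSON Lines: one JSON record per line, e.g.
#   {"attribute1": 1.0, "attribute2": 2.0}
#   {"attribute1": 3.0, "attribute2": 4.0}
sm.create_transform_job(
    TransformJobName="my-batch-job",      # placeholder
    ModelName="my-model",                 # placeholder, an existing SageMaker model
    BatchStrategy="MultiRecord",
    MaxPayloadInMB=6,
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://your-bucket/batch-input/",  # placeholder
            }
        },
        "ContentType": "application/jsonlines",  # must match what the container accepts
        "SplitType": "Line",                     # split the file on line boundaries
    },
    TransformOutput={"S3OutputPath": "s3://your-bucket/batch-output/"},  # placeholder
    TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
)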

Annotation specs - AutoML (GCP)

I'm using the Natural Language module on Google Cloud Platform and more specifically AUTOML for text classification.
I came across this error, which I do not understand, after I finished importing my data and the text had been processed:
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Has anyone else encountered it?
Thanks
Take a look at the AutoML Quotas & Limits documentation for a better understanding.
It seems you are hitting the upper limit of labels per dataset: see the AutoML limits --> Labels per dataset --> 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.
I also got this error even though I was certain that my number of labels was below 5000. It turned out to be a problem with my CSV formatting.
When you create your text data with to_csv() in Pandas, it only quotes the text fields that contain a comma, while AutoML Text wants every line of text to be quoted. I have written up the solution in this Stack Overflow answer.
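For reference, a small sketch of that fix; the column names and file name are only for illustration:

import csv
import pandas as pd

df = pd.DataFrame({
    "text": ["a plain sentence", "a sentence, with a comma"],
    "label": ["label_a", "label_b"],
})

# By default to_csv() only quotes the field that contains a comma; QUOTE_ALL
# wraps every field in double quotes, which is the formatting AutoML Text expects.
df.to_csv("automl_import.csv", index=False, header=False, quoting=csv.QUOTE_ALL)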

GCP > Video Intelligence: Prepare CSV error: Has critical error in root level csv, Expected 2 columns, but found 1 columns only

I'm trying to follow the documentation at the GCP link below to prepare my video training data. The doc says that if you want to use GCP to label the videos, you can use the UNASSIGNED feature.
I have my videos uploaded to a bucket.
I have a traffic_video_labels.csv with below rows:
gs://video_intel/1.mp4
gs://video_intel/2.mp4
Now, in the Video Intelligence Import section, I want to use a CSV called check.csv with the row below, which references the video locations. Using the UNASSIGNED value should let me use the labelling feature within GCP.
UNASSIGNED,gs://video_intel/traffic_video_labels.csv
However, when I try to import check.csv, I get the error:
Has critical error in root level csv gs://video_intel/check.csv line 1: Expected 2 columns, but found 1 columns only.
Can anyone pls help with this? thanks!
https://cloud.google.com/video-intelligence/automl/object-tracking/docs/prepare
For the error message "Expected 2 columns, but found 1 columns only.", try to fix the format of your CSV file: open the file in a text editor of your choice (such as Cloud Shell, Sublime, Atom, etc.) to inspect the file format.
When opening a CSV file in Google Sheets or a similar product, you won't be able to spot formatting problems (e.g. empty values from trailing commas) due to limitations of the user interface, but in a text editor you should not run into those issues.
If this does not work, please share your CSV file so I can run a test with it myself.
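As a quick complement to eyeballing the file in a text editor, a small script like the one below (illustrative only, assuming the file is available locally as check.csv) prints every line that does not have exactly two columns:

import csv

with open("check.csv", newline="") as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != 2:
            print(f"line {lineno}: expected 2 columns, found {len(row)}: {row!r}")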

How can I remove junk values and load multiple .csv files (different schema) into BigQuery?

I have many .csv files stored in GCS, and I want to load the data from the .csv files into BigQuery using the command below:
bq load 'dataset.table' gs://path.csv json_schema
I have tried this, but it gives errors, and the same error appears for many files.
(error screenshot)
How can I remove the unwanted values from the .csv files before importing them into the table?
Please suggest the easiest way to load the files.
The answer depends on what you want to do with these junk rows. If you look at the documentation, you have several options:
Number of errors allowed. By default it's set to 0, which is why the load job fails at the first bad line. If you know the total number of bad rows, set this value accordingly and all of those errors will be ignored by the load job.
Ignore unknown values. If your errors occur because some lines contain more columns than the schema defines, this option keeps those lines and loads only the known columns; the others are ignored.
Allow jagged rows. If your errors are caused by lines that are too short (and that is what your message says) and you still want to keep the first columns (because the last ones are optional and/or not relevant), you can enable this option.
For more advanced and specific filtering, you have to perform pre- or post-processing. If that's the case, let me know and I will add that part to my answer.
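If you prefer the Python client over the bq CLI, the same three options map onto LoadJobConfig fields; a rough sketch, with the URI, table id and schema as placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    max_bad_records=100,         # "Number of errors allowed"
    ignore_unknown_values=True,  # drop extra columns that are not in the schema
    allow_jagged_rows=True,      # accept lines missing trailing optional columns
    schema=[
        bigquery.SchemaField("col1", "STRING"),   # placeholder schema
        bigquery.SchemaField("col2", "INTEGER"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/path/*.csv",            # placeholder URI
    "your_project.your_dataset.your_table",   # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for completion; raises if the job still fails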

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
There is also an email referencing it, https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E, which uses Scala instead of pyspark as an example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?)
example dataframe
import random
import string
from pyspark.sql.types import StringType
r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and question: Specify Parquet properties pyspark
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set the conf parquet.strings.signed-min-max.enabled in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from pyspark?
Does pyspark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?
Since historically Parquet writers wrote wrong min/max values for UTF-8 strings, new Parquet implementations skip those stats during reading, unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings only contain ASCII characters, because the corresponding bytes for those will never be negative.
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.
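For completeness, a sketch of passing the property as a read option from pyspark, reusing the private hadoopConfiguration hook shown in the question; this is only safe under the ASCII-only assumption above, and whether a given Spark version actually consults the legacy statistics depends on the parquet-mr version it bundles:

# Read-side counterpart of the write-side attempts above.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.strings.signed-min-max.enabled", "true"
)
df = spark.read.parquet("s3a://test.bucket/option")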