How can I figure out why BigQuery is rejecting my parquet file? - google-cloud-platform

When trying to upload a parquet file into BigQuery, I get this error:
Error while reading data, error message: Read less values than expected from: prod-scotty-45ecd3eb-e041-450c-bac8-3360a39b6c36; Actual: 0, Expected: 10
I don't know why I get the error.
I tried inspecting the file with parquet-tools and it prints the file contents without issues.
The parquet file is written using the parquetjs JavaScript library.
Update: I also filed this in the BigQuery issue tracker here: https://issuetracker.google.com/issues/145797606

It turns out BigQuery doesn't support the latest version of the Parquet format. I changed the writer output to use the version 1 format instead of version 2, and BigQuery accepted it.
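The file in the question was written with parquetjs; as an illustration of the same fix in Python with pyarrow (a different library, used here only as a sketch), you would pin the writer to format version 1.0:

import pyarrow as pa
import pyarrow.parquet as pq

# Sketch: write a tiny table using Parquet format version 1.0,
# the older format version that BigQuery accepted.
table = pa.table({"id": [1, 2, 3]})
pq.write_table(table, "out.parquet", version="1.0")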

From the error message it seems like a rogue line break might be causing this.
We use Dataprep to clean up our data, and it works quite well. Even if I'm wrong about the cause, it is also Google's recommended method for cleaning up / sanitising data for BigQuery.
https://cloud.google.com/dataprep/docs/html/BigQuery-Data-Type-Conversions_102563896

Related

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3, and I am writing a script that can read these files and merge them. I am using AWS Wrangler to do this.
My code is as follows:
import logging
import awswrangler as wr

logger = logging.getLogger(__name__)

try:
    # Stream the small parquet files one chunk at a time
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        # Append each chunk to the dataset under target_path
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
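A minimal sketch of that suggestion (bucket and key are hypothetical), writing a single object to a full key without dataset=True:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # placeholder data

# Without dataset=True, to_parquet writes exactly one object at the key
# given in path (note: mode="append" is not available in this form).
wr.s3.to_parquet(df=df, path="s3://my-bucket/output/merged.parquet")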
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm just going to loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one parquet file unless you get a machine with enough RAM.
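A minimal sketch of that batching approach (paths and the batch size are hypothetical); it produces fewer, larger files rather than a single one:

import awswrangler as wr
import pandas as pd

batch, part = [], 0
# Stream the small files in chunks and flush several chunks at a time
# into one larger parquet object per batch.
for df in wr.s3.read_parquet(path="s3://my-bucket/input/", path_suffix=[".parquet"], chunked=True):
    batch.append(df)
    if sum(len(b) for b in batch) >= 1_000_000:  # ~1M rows per output file
        wr.s3.to_parquet(df=pd.concat(batch), path=f"s3://my-bucket/output/part-{part}.parquet")
        batch, part = [], part + 1

if batch:  # flush whatever is left over
    wr.s3.to_parquet(df=pd.concat(batch), path=f"s3://my-bucket/output/part-{part}.parquet")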

How to resolve the issue of file not found (even though it exists) on GCP's AI-Platform?

FileNotFoundError: File b'gs://text-recognition-modelling/Dhruv/cmle/eval_data_nott03.csv' does not exist
The problem seems to be that you are using pandas to read the CSV directly from the gs:// path. Try using a TensorFlow method to read the CSV instead, since TensorFlow's file utilities understand GCS paths; that should work.
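A minimal sketch of that idea (assuming TensorFlow 2.x is available on AI Platform): open the gs:// object with tf.io.gfile and hand the file handle to pandas.

import pandas as pd
import tensorflow as tf

# tf.io.gfile understands gs:// URIs, so open the object through it
# and let pandas parse the CSV from the file handle.
path = "gs://text-recognition-modelling/Dhruv/cmle/eval_data_nott03.csv"
with tf.io.gfile.GFile(path, "r") as f:
    df = pd.read_csv(f)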

TypeError: Unable to modify PDF file, make sure that output file target is available and that it is not protected

I'm trying to fill a PDF form using HummusJS, but it throws the error TypeError: Unable to modify PDF file, make sure that output file target is available and that it is not protected when running PDF generation in an AWS Lambda function. It works fine on my local machine, and no logs are generated. Is there any debugging option available? I have wasted 3 days trying to solve this issue.
Any help would be highly appreciated.
Thanks
I experienced this issue when the PDF file I was reading had a size of 0. I suspect that somewhere during my tests the PDF I was working with got overwritten with empty data, and it turns out this is what was causing the permission error.
I was working from this sample.

AWS Glue Error "Path does not exist"

Every time I try to run some very simple jobs (import json on s3 to Redshift) I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other (more complex jobs with joins) working reliably. Really not sure what the issue could be - any help would be appreciated.
I'm using 2 DPUs, but have tried 5. I also tried using a different temp directory. Also, there are hundreds of files, and some of the files are very small (a few lines), but I'm not sure if that is relevant.
I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error itself is misleading). After disabling bookmarks, and using a subset of the data, things are working as expected.

Save compressed files into s3 and load in Athena

Hi, I am writing a program that writes to some files (with multiple processes at a time), like:
import gzip, json

# 'at' = append in text mode, so the JSON string can be written directly
with gzip.open('filename.gz', 'at') as f:
    f.write(json.dumps(some_dictionary) + '\n')
    f.flush()
After writing finishes I upload files with:
s3.meta.client(filename, bucket, destination, filename without .gz)
Then I want to query the data from Athena. After MSCK REPAIR everything seems fine, but when I try to select data my rows are empty. Does anyone know what I am doing wrong?
EDIT: My mistake. I had forgotten to set the ContentType parameter to 'text/plain'.
Athena detects the file compression format with the appropriate file extension.
So if you upload a GZIP file, but remove the '.gz' part (as I would guess from your "s3.meta.client(filename, bucket, destination, filename without .gz)" statement), the SerDe is not able to read the information.
If you rename your files to filename.gz, Athena should be able to read your files.
I have fixed the problem by first saving bigger chunks of the files locally and then gzipping them. I repeat the process, but appending to the gzipped file. I read that it is better to add bigger chunks of text than to append line by line.
For the upload I used the boto3 transfer upload_file with extra_args={'ContentEncoding': 'gzip', 'ContentType': 'text/plain'}.
I forgot to add ContentType the first time, so S3 saved the files differently and Athena gave me errors saying my JSON was not formatted right.
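A minimal sketch of that upload (bucket and key are hypothetical), using the boto3 S3 client's upload_file, which accepts the same extra arguments:

import boto3

s3 = boto3.client("s3")
# Keep the .gz extension on the key so Athena detects the compression,
# and set the content headers so S3 stores the object correctly.
s3.upload_file(
    "filename.gz",
    "my-bucket",
    "data/filename.gz",
    ExtraArgs={"ContentEncoding": "gzip", "ContentType": "text/plain"},
)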
I suggest you break the problem into several parts.
First, create a single JSON file that is not gzipped. Store it in Amazon S3, then use Athena to query it.
Once that works, manually gzip the file from the command-line (rather than programmatically), put the file in S3 and use Athena to query it.
If that works, use your code to programmatically gzip it, and try it again.
If that works with a single file, try it with multiple files.
All of the above can be tested with the same command in Athena -- you're simply substituting the source file.
This way, you'll know which part of the process is upsetting Athena without compounding the potential causes.
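For the step where you programmatically gzip the file, a minimal sketch (filenames are hypothetical) that mirrors what gzip data.json does on the command line:

import gzip
import shutil

# Compress an existing JSON file into a .gz alongside it.
with open("data.json", "rb") as src, gzip.open("data.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)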