Using PySpark, I'm trying to save an Avro file with compression (preferably Snappy).
This line of code successfully saves a 264MB file:
df.write.mode('overwrite').format('com.databricks.spark.avro').save('s3n://%s:%s#%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))
When I add the codec option .option('codec', 'snappy'), the code runs successfully but the file size is still 264MB:
df.write.mode('overwrite').option('codec', 'snappy').format('com.databricks.spark.avro').save('s3n://%s:%s#%s/%s' % (access_key, secret_key, aws_bucket_name, output_file))
I've also tried 'SNAPPY' and 'Snappy'; those also run successfully but produce the same file size.
I've read the documentation, but it focuses on Java and Scala. Is this not supported in PySpark, is Snappy the default and simply not documented, or am I not using the correct syntax? There's also a related question (I assume), but it's focused on Hive and has no answers.
TIA
I believe Spark enables Snappy compression by default. If you compare the size against an uncompressed write, you should see the difference.
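For what it's worth, if I recall the spark-avro docs correctly, the codec for com.databricks.spark.avro is controlled through the Spark SQL configuration rather than a write option, so one way to check is to force an uncompressed write and compare sizes. A rough sketch, assuming a Spark 2.x SparkSession named spark, and with output_path standing in for the s3n URL built in the question:
# write uncompressed to get a baseline size
spark.conf.set('spark.sql.avro.compression.codec', 'uncompressed')
df.write.mode('overwrite').format('com.databricks.spark.avro').save(output_path)
# switch back to snappy (or 'deflate') and write again to compare
spark.conf.set('spark.sql.avro.compression.codec', 'snappy')
df.write.mode('overwrite').format('com.databricks.spark.avro').save(output_path)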
I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3, and I am writing a script that can read these files and merge them. I am using AWS Wrangler to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path to a single object.
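A minimal sketch of that suggestion (the bucket and key are placeholders; without dataset=True the path is treated as the full object key, so exactly one file is written, and this only works for a DataFrame that fits in memory):
import awswrangler as wr
# full path to a single object, no dataset=True and no mode="append"
wr.s3.to_parquet(df=df, path="s3://my-bucket/merged/output.parquet")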
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a machine with enough RAM; a rough sketch of that batching idea follows.
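For reference, a rough sketch of that batching approach using the OP's variables (rows_per_file is a made-up threshold to tune; this still writes several objects, just far fewer and larger ones):
import awswrangler as wr
import pandas as pd

batch = []
rows_per_file = 1_000_000  # tune to whatever comfortably fits in memory
for chunk in wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True):
    batch.append(chunk)
    if sum(len(b) for b in batch) >= rows_per_file:
        # flush the buffered chunks as one larger parquet file
        wr.s3.to_parquet(df=pd.concat(batch), dataset=True, path=target_path, mode="append")
        batch = []
if batch:
    wr.s3.to_parquet(df=pd.concat(batch), dataset=True, path=target_path, mode="append")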
When trying to upload a parquet file into BigQuery, I get this error:
Error while reading data, error message: Read less values than expected from: prod-scotty-45ecd3eb-e041-450c-bac8-3360a39b6c36; Actual: 0, Expected: 10
I don't know why I get the error.
I tried inspecting the file with parquet-tools and it prints the file contents without issues.
The parquet file is written using the parquetjs JavaScript library.
Update: I also filed this in the BigQuery issue tracker here: https://issuetracker.google.com/issues/145797606
It turns out BigQuery doesn't support the latest version of the Parquet format. I changed the output so that it does not use the version 2 format, and BigQuery accepted it.
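The question used parquetjs, but the same fix can be reproduced from Python with pyarrow, which lets you pin the older format version explicitly; a small sketch with made-up data and file name:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
# version='1.0' avoids the newer Parquet format features that BigQuery rejected
pq.write_table(table, 'out.parquet', version='1.0')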
From the error message it seems like a rogue line break might be causing this.
We use DataPrep to clean up our data, and it works quite well. If I'm not mistaken, it's also Google's recommended method of cleaning up / sanitising data for BigQuery.
https://cloud.google.com/dataprep/docs/html/BigQuery-Data-Type-Conversions_102563896
I am trying to upload an Excel and a PDF file from local disk to Google Cloud Storage. The Excel file is created using the pandas library in Python. When I try to upload the generated file, I get this error:
'ascii' codec can't decode byte 0xd0 in position 217: ordinal not in range(128)
I am using the App Engine flexible environment. Here is the upload code:
def upload_to_gcs(file_path, file_name, file_type, bucket_name):
    try:
        import google.cloud.storage
        storage_client = google.cloud.storage.Client.from_service_account_json(
            os.getcwd() + 'relative_path_of_service_json_file')
        bucket = storage_client.get_bucket('bucket_name')
        d = bucket.blob(bucket_name + '/' + file_name + '.' + file_type)
        d.upload_from_filename(file_path)
    except Exception, e:
        print str(e)
Thanks in advance
The error is saying that there's a character that can't be translated to ASCII. Normally, this kind of error appears due to special characters in the file name or due to how the file is encoded. Check whether there's any character in the file or the file name that ASCII might have issues with.
Here's another StackOverflow question about the same error. In that question the OP is using the gsutil command, but in the end both the Python client library and the command do the same thing (call the Cloud Storage API), so the solution can be the same for both.
Other SO answers to similar errors tackle the solution programmatically, using decode("utf-8") or decode(encoding='unicode-escape'). These approaches can also help in your case; you should give them a try, for example as sketched below.
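A minimal sketch of that idea inside your Python 2 function, assuming the offending byte strings are UTF-8 encoded (the variable names match the code above):
# decode byte strings to unicode before composing the blob name and uploading,
# assuming the inputs are UTF-8 encoded bytes
if isinstance(file_name, str):
    file_name = file_name.decode('utf-8')
if isinstance(file_path, str):
    file_path = file_path.decode('utf-8')
d = bucket.blob(bucket_name + u'/' + file_name + u'.' + file_type)
d.upload_from_filename(file_path)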
If you still can't find the root cause of the issue, try uploading the file through the API directly. This way you can check if the issue is still present without using the Python Client Library.
I am trying to run an example of an LSTM recurrent neural network presented in this Git repository: https://github.com/mesnilgr/is13.
I've installed Theano and everything, and when I got to the point of running the code I noticed the data was not being downloaded, so I opened an issue on GitHub (https://github.com/mesnilgr/is13/issues/12) and someone came up with a solution that consisted of:
1. Get the data from the Dropbox link he provides.
2. Change the code of the 'load.py' file to download and read the data properly.
The only issue is that the data in the Dropbox folder (https://www.dropbox.com/s/3lxl9jsbw0j7h8a/atis.pkl?dl=0) is not a compressed .gz file as, I suppose, the data from the original repository was. I don't have enough skill to change the code so that it does with the uncompressed data exactly what it would do with the compressed one. Can someone help me?
The modification suggested and the changes I've made are described in the issue I've opened on GitHub (https://github.com/mesnilgr/is13/issues/12).
It looks like your code is using
gzip.open(...)
But if the file is not gzipped then you probably just need to remove the gzip. prefix and use
open(...)
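A minimal sketch of what that change to load.py might look like, assuming the file from the Dropbox link is a plain (uncompressed) pickle and the repository's Python 2 / cPickle setup is otherwise unchanged:
import cPickle

# the original code used gzip.open for atis.pkl.gz; for the plain atis.pkl, use open
with open('atis.pkl', 'rb') as f:
    data = cPickle.load(f)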
Since we updated our installation from CDH 4.1.2 to CDH 4.2.0, we're no longer able to create new tables with compression enabled.
We were using SNAPPY compression successfully before.
Now when we try to execute a create statement like:
create 'tableWithCompression', {NAME => 't1', COMPRESSION => 'SNAPPY'}
an error occurs:
ERROR: Compression SNAPPY is not supported. Use one of LZ4 SNAPPY LZO GZ NONE
We realized that other compression algorithms aren't found either; e.g., the same problem occurs with 'GZ':
ERROR: Compression GZ is not supported. Use one of LZ4 SNAPPY LZO GZ NONE
We've added
"export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/"
to hbase-env.sh.
Unfortunately this did not fix our problem.
What else can we try?
I'm getting the same error. This seems to be a bug in the admin.rb script.
The code in question is this:
if arg.include?(org.apache.hadoop.hbase.HColumnDescriptor::COMPRESSION)
  compression = arg[org.apache.hadoop.hbase.HColumnDescriptor::COMPRESSION].upcase
  unless org.apache.hadoop.hbase.io.hfile.Compression::Algorithm.constants.include?(compression)
    raise(ArgumentError, "Compression #{compression} is not supported. Use one of " + org.apache.hadoop.hbase.io.hfile.Compression::Algorithm.constants.join(" "))
  else
    family.setCompressionType(org.apache.hadoop.hbase.io.hfile.Compression::Algorithm.valueOf(compression))
  end
end
Some "p" statements later, I know that. compression is "SNAPPY", and org.apache.hadoop.hbase.io.hfile.Compression::Algorithm.constants is [:LZ4, :SNAPPY, :LZO, :GZ, :NONE].
See the diffrence? We're comparing strings and symbols. The quick fix is to change the line that sets compression to the following:
compression = arg[org.apache.hadoop.hbase.HColumnDescriptor::COMPRESSION].upcase.to_sym
I guess this has to do with there being a ton of different JRuby variants and configurations; I suppose in some the constants are strings, and in others symbols. A more permanent fix would be to call to_sym on both sides of the comparison.