Concat Avro files in Google Cloud Storage - google-cloud-platform

I have some large .avro files in Google Cloud Storage and I want to concatenate all of them into a single file.
I found
java -jar avro-tools.jar concat
However, since my files are at a Google Storage path (gs://files.avro), I can't concatenate them using avro-tools. Any suggestions on how to solve this?

You can use the gsutil compose command. For example:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
Note: For extremely large files and/or very low per-machine bandwidth, you may want to split the file and upload it from multiple machines, and later compose these parts of the file manually.
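A rough sketch of that split-and-compose workflow (the file name, bucket, and number of parts are placeholders, not from the question):
split -b 1G bigfile part_                    # split the local file into 1 GB pieces
gsutil -m cp part_* gs://my-bucket/parts/    # upload the pieces, possibly from several machines
gsutil compose gs://my-bucket/parts/part_aa gs://my-bucket/parts/part_ab gs://my-bucket/bigfile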
In my case, I tested it with the following values: foo.txt contains the word Hello and bar.txt contains the word World. Running this command:
gsutil compose gs://bucket/foo.txt gs://bucket/bar.txt gs://bucket/baz.txt
baz.txt would then contain:
Hello
World
Note: GCS does not support inter-bucket composing.
In case you encounter an exception related to integrity checks, run gsutil help crcmod for instructions on how to fix it.

Check out https://github.com/spotify/gcs-tools
It is a lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-cli, proto-tools for Scio's Protobuf in Avro files, and magnolify-tools for Magnolify code generation, so that they can be used from regular workstations or laptops, outside of a Google Compute Engine (GCE) instance.
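With the GCS-enabled avro-tools from that project, the concatenation from the original question could look roughly like this (the jar name and bucket paths here are assumptions; check the project's releases for the actual artifact):
java -jar avro-tools-gcs.jar concat gs://my-bucket/part-1.avro gs://my-bucket/part-2.avro gs://my-bucket/combined.avro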

Related

Google Cloud Bucket mounted on Compute Engine Instance using gcsfuse does not create files

I have been able to mount a Google Cloud bucket using
gcsfuse --implicit-dirs production-xxx-appspot /mount
or equally
sudo mount -t gcsfuse -o implicit_dirs,allow_other,uid=1000,gid=1000,key_file=service-account.json production-xxx-appspot /mount
Mounting works fine.
When I execute the following commands after mounting, they also work fine:
mkdir /mount/files/
cp -rf /home/files/* /mount/files/
However, when I use :
mcedit /mount/files/a.txt
or
vi /mount/files/a.txt
The output says that there is no file available, which makes sense.
Is there any other way to handle this situation, so that applications can directly create files on the mounted Google Cloud bucket rather than creating files locally and copying them afterwards?
If you do not want to create files locally and upload them later, you should consider using a file storage system like Google Drive.
Google Cloud Storage is an object storage system, which means objects cannot be modified; you have to write each object completely at once. Object storage also does not work well with traditional databases, because writing objects is a slow process and writing an app to use an object storage API is not as simple as using file storage.
In a file storage system, data is stored as a single piece of information inside a folder, just like you would organize pieces of paper inside a manila folder. When you need to access that piece of data, your computer needs to know the path to find it. (Beware: it can be a long, winding path.)
If you want to use Google Cloud Storage, you need to create your file locally and then push it to your bucket.
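For example, a minimal shell sketch of that create-locally-then-push flow (the file name and object path are placeholders; the bucket name is taken from the question):
vi a.txt                                                  # create and edit the file locally
gsutil cp a.txt gs://production-xxx-appspot/files/a.txt   # then push it to the bucket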
Here is an example of how to configure Google Cloud Storage with Node.js: File Upload example
Here is a tutorial on how to mount object storage on a cloud server using s3fs-fuse
If you want to know more about storage formats, please follow this link
More information about reading and writing to Cloud Storage can be found in this link

What's the quickest way to upload a large CSV file (8GB) from local computer to Google Cloud Storage/BigQuery table?

I have an 8 GB CSV file with 104 million rows sitting on my local hard drive. I need to upload this either directly to BigQuery as a table, or via Google Cloud Storage and then point BigQuery at it. What's the quickest way to accomplish this? After trying the web console upload and the Google Cloud SDK, both are quite slow (moving at 1% progress every few minutes).
Thanks in advance!
All three existing answers are right, but if you have low bandwidth, none of them will help you; you will be physically limited.
My recommendation is to gzip your file before sending it. Text files compress very well (up to 100x), and you can load gzipped files directly into BigQuery without unzipping them.
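A rough sketch of that approach (bucket, dataset, and table names are placeholders):
gzip data.csv                            # compress the CSV locally before upload
gsutil cp data.csv.gz gs://my-bucket/    # upload the much smaller gzipped file
bq load --autodetect --source-format=CSV mydataset.mytable gs://my-bucket/data.csv.gz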
Using the gsutil tool is going to be much faster and more fault-tolerant than the web console (which will probably time out before finishing anyway). You can find detailed instructions here (https://cloud.google.com/storage/docs/uploading-objects#gsutil), but essentially, once you have the gcloud tools installed on your computer, you'll run:
gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
From there, you can load the file into BigQuery (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv), which will all happen on Google's network.
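For instance, the load step could look roughly like this (dataset and table names are placeholders):
bq load --autodetect --source-format=CSV mydataset.mytable gs://[DESTINATION_BUCKET_NAME]/data.csv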
The bottleneck you're going to face is your internet upload speed during the initial upload. What we've done in the past to bypass this is spin up a compute box, run whatever process generated the file, and have it output onto the compute box. Then we use the built-in gsutil tool to upload the file to Cloud Storage. This has the benefit of running entirely on Google's network and will be pretty quick.
I would recommend you take a look at this article, where there are several points to take into consideration.
Basically, the best option is to upload your object using the parallel composite upload feature of gsutil; in the article you can find this command:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket
There you will also find several tips to improve your upload, such as tuning the chunk size of the objects to upload.
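For example, a variant that also sets the component (chunk) size explicitly; the 50M value here is only an illustration, not a recommendation from the article:
gsutil -o GSUtil:parallel_composite_upload_threshold=150M -o GSUtil:parallel_composite_upload_component_size=50M cp ./localbigfile gs://your-bucket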
Once uploaded, I'd go with the option that dweling has provided for the BigQuery part; look further at this document.
Have you considered using the BigQuery command-line tool, as per the example provided below?
bq load --autodetect --source-format=CSV PROJECT_ID:DATASET.TABLE ./path/to/local/file/data.csv
The above command will load the contents of the local CSV file data.csv directly into the specified table, with the schema automatically detected. Alternatively, details on how you can customise the load job to your requirements by passing additional flags can be found here: https://cloud.google.com/bigquery/docs/loading-data-local#bq

Is there a way to grep through text documents stored in Google Cloud Storage?

Question
Is there a way to grep through the text documents stored in Google Cloud Storage?
Background
I am storing over 10 thousand documents (txt files) on a VM, and they are using up space. Before the disk reaches its limit, I want to move the documents to an alternative location.
Currently, I am considering moving them to Google Cloud Storage on GCP.
Issues
I sometimes need to grep the documents for specific keywords.
I was wondering if there is any way I can grep through the documents uploaded to Google Cloud Storage?
I checked the gsutil docs; it seems ls, cp, mv, and rm are supported, but I don't see grep.
Unfortunately, there is no grep-like command in gsutil.
The closest command is gsutil cat.
I suggest you create a small VM and grep there; running it in the cloud will be faster and cheaper.
gsutil cat gs://bucket/* | grep "what you want to grep"
howie's answer is good. I just want to mention that Google Cloud Storage is a product intended to store files and does not care about their contents. Also, it is designed to be massively scalable, and the operation you are asking for is computationally expensive, so it is very unlikely that it will be supported natively in the future.
In your case, I would consider creating an index of the text files and triggering an update for it every time a new file is uploaded to GCS.
I found the answer to this issue:
gcsfuse solved the problem.
Mount the Google Cloud Storage bucket to a specific directory,
and you can grep from there, as sketched below.
https://cloud.google.com/storage/docs/gcs-fuse
https://github.com/GoogleCloudPlatform/gcsfuse
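A minimal sketch of that approach (the bucket name and mount point are placeholders):
gcsfuse my-docs-bucket /mnt/docs       # mount the bucket onto a local directory
grep -rl "keyword" /mnt/docs           # then grep the mounted files as usual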
I have another suggestion. You might want to consider using Google Dataflow to process the documents. You can just move them, but more importantly, you can transform the documents using Dataflow.
I've written a native Linux binary, mrgrep (for Ubuntu 18.04, https://github.com/romange/gaia/releases/tag/v0.1.0), that does exactly this. It reads directly from GCS and, as a bonus, it handles compressed files and is multi-threaded.

Data science workflow with large geospatial datasets

I am relatively new to the docker approach so please bear with me.
The goal is to ingest large geospatial datasets into Google Earth Engine using an open-source, replicable approach. I got everything working on my local machine and on a Google Compute Engine instance, but would like to make the approach accessible to others as well.
The large static geospatial files (NETCDF4) are currently stored on Amazon S3 and Google Cloud Storage (GEOTIFF). I need a couple of Python-based modules to convert and ingest the data into Earth Engine using a command-line interface. This has to happen only once. The data conversion is not very heavy and can be done by one fat instance (32 GB RAM, 16 cores; it takes 2 hours); there is no need for a cluster.
My question is how I should deal with large static datasets in Docker. I thought of the following options, but would like to know best practices.
1) Use docker and mount the amazon s3 and Google Cloud Storage buckets to the docker container.
2) Copy the large datasets to a docker image and use Amazon ECS
3) just use the AWS CLI
4) use Boto3 in Python
5) A fifth option that I am not yet aware of
The Python modules I use include, among others, python-GDAL, pandas, earth-engine, and subprocess.

Doing a remote grep/count on a file stored on amazon S3

We have a cloud-based application which has been storing user projects on the normal disk of our EC2 server. I am in the process of moving our project storage to S3, but I have recently run into a tough challenge. When a project is modified, we sometimes need to run some analysis of the XML files stored in the project. Before, we would do this with a grep and a count that looked for certain XML tags, something like this:
grep -o "<tag" "' + path + '" | wc -l
Now that the files are being stored on S3, I am at a loss for how I might do similar analysis (without downloading the whole project, which would mostly defeat the purpose of switching to S3). Is there any way to do this?
Unfortunately, S3 doesn't provide that functionality. You have to download the file(s) before grep can be applied (even if you use third-party tools like s3cmd, they download the files behind the scenes).
If there aren't too many patterns, you can grep the files before you upload them and keep the results on the local machine. You don't have to hit S3 every time. Yes, you may end up with stale data, but the alternative is expensive.
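Two rough sketches of what that looks like in practice (bucket and file names are placeholders; the second variant is the grep-before-upload idea above):
aws s3 cp s3://my-bucket/projects/project.xml - | grep -o "<tag" | wc -l   # stream the object through grep without keeping a local copy
grep -o "<tag" project.xml | wc -l > project.xml.count                     # or: count before uploading and keep the result alongside the file
aws s3 cp project.xml s3://my-bucket/projects/
aws s3 cp project.xml.count s3://my-bucket/projects/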