I have a 1.08 GB .dat file that I am trying to upload to SAS OnDemand (University Edition), but I am unable to do so because there appears to be a 1 GB upload limit.
Is there a way to "split" the .dat file into 2 smaller files, or any other way to work around this problem?
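If the .dat file is plain text with one record per line, a short script can split it on a line boundary so each part stays under the limit. Below is a minimal Python sketch; the file names and the cutoff size are placeholders, and it assumes records must not be broken across the two parts.

```python
# Split a large text .dat file into two parts on a line boundary.
# File names and the cutoff size are placeholders.
cutoff = 550 * 1024 * 1024  # bytes written to part 1 before switching (~550 MB)

written = 0
part = 1
out = open("data_part1.dat", "w")
with open("data.dat") as src:
    for line in src:
        if part == 1 and written >= cutoff:
            out.close()
            part = 2
            out = open("data_part2.dat", "w")
        out.write(line)
        written += len(line)
out.close()
```

You would then upload the two parts separately and concatenate the resulting data sets on the SAS side.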
Related
I have a 7 GB file with about 10 million records. To process it, we created 100 MB chunks in an S3 bucket and process the files from S3 one by one, inserting records into 24 tables with commit/rollback handling per transaction. It takes more than a week to get through the data. Can anyone suggest a faster approach?
What I tried
I created stored procedures/functions and triggers, but couldn't get it working, as Postgres has a lot of limitations here and my tables have many relationships between them.
Please note this is just one example; I have a lot of files like this.
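For context, this is roughly the shape of the pipeline described above, sketched with boto3 and psycopg2: one transaction per 100 MB chunk, so a failed chunk rolls back on its own. The bucket, the staging table, and the distribute_staging() procedure that fans rows out into the 24 related tables are placeholders, not the real schema.

```python
import io

import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string

resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="chunks/")
keys = [obj["Key"] for obj in resp.get("Contents", [])]

for key in keys:
    body = s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()
    try:
        # `with conn` wraps one transaction per chunk: commit on success,
        # rollback if anything inside raises.
        with conn, conn.cursor() as cur:
            cur.copy_expert("COPY staging_table FROM STDIN WITH CSV", io.BytesIO(body))
            cur.execute("CALL distribute_staging()")  # hypothetical proc feeding the 24 tables
    except Exception as exc:
        print(f"chunk {key} rolled back: {exc}")

conn.close()
```

Bulk COPY into a staging table and doing the fan-out in SQL is usually much faster than row-by-row inserts from the client, though the actual gain depends on the triggers and constraints involved.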
I have a large number of JSON files in Google Cloud Storage that I would like to load into BigQuery. The average file size is 5 MB, uncompressed.
The problem is that they are not newline delimited, so I can't load them into BigQuery as is.
What's my best approach here? Should I use Cloud Functions or Dataprep, or just spin up a server to download each file, reformat it, upload it back to Cloud Storage, and then load it into BigQuery?
Do not compress the data before loading it into BigQuery. Also, 5 MB is small for BigQuery; I would look at consolidation strategies, and perhaps a different file format, while processing each JSON file.
You can use Dataprep, Dataflow, or even Dataproc. Depending on how many files you have, one of these may be the best choice. Anything larger than, say, 100,000 5 MB files will require one of these big systems with many nodes.
Cloud Functions would take too long for anything more than a few thousand files.
Another option is to write a simple Python program that preprocesses your files on Cloud Storage and loads them directly into BigQuery. We are only talking about 20 or 30 lines of code unless you add consolidation. A 5 MB file would take about 500 ms to download, process, and write back; I am not sure about the BigQuery load time. For 50,000 5 MB files, expect 12 to 24 hours for one thread on a large Compute Engine instance (you need high network bandwidth).
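A rough sketch of that approach, assuming each source file holds a single JSON array of records and using the google-cloud-storage and google-cloud-bigquery client libraries (bucket, prefix, and table names are placeholders):

```python
import io
import json

from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)

for blob in storage_client.list_blobs("my-bucket", prefix="raw/"):
    # Each source file is assumed to be one JSON array of records.
    records = json.loads(blob.download_as_bytes())
    ndjson = "\n".join(json.dumps(record) for record in records)
    # Load the converted newline-delimited JSON straight into BigQuery.
    bq_client.load_table_from_file(
        io.BytesIO(ndjson.encode("utf-8")), table_id, job_config=job_config
    ).result()
```

Consolidating many converted files into each load job cuts per-job overhead and keeps you clear of per-table load job quotas, which matters at tens of thousands of files.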
Another option is to spin up multiple Compute Engine instances. One instance puts the names of N files (something like 4 or 16) per message into Pub/Sub, and multiple instances subscribe to the same topic and process the files in parallel. Again, this is only another 100 lines of code or so.
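A sketch of that fan-out using the google-cloud-pubsub client; the project, topic, and subscription names are placeholders, and process_file() is a hypothetical stand-in for the per-file convert-and-load routine from the previous sketch.

```python
import json

from google.cloud import pubsub_v1

project_id = "my-project"  # placeholder

def process_file(name):
    # hypothetical: convert one GCS object and load it (see previous sketch)
    pass

# --- coordinator: publish batches of 16 file names per message ---
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "json-files")
file_names = ["raw/file1.json", "raw/file2.json"]  # full list of GCS object names

for i in range(0, len(file_names), 16):
    batch = json.dumps(file_names[i:i + 16]).encode("utf-8")
    publisher.publish(topic_path, batch).result()

# --- worker: runs on each Compute Engine instance ---
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "json-files-sub")

def handle(message):
    for name in json.loads(message.data.decode("utf-8")):
        process_file(name)
    message.ack()

subscriber.subscribe(subscription_path, callback=handle).result()
```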
If your project consists of many millions of files, network bandwidth and compute time will be an issue unless time is not a factor.
You can use Dataflow to do this.
Choose the “Text Files on Cloud Storage to BigQuery” template:
A pipeline that can read text files stored in GCS, perform a transform via a user-defined JavaScript function, and load the results into BigQuery. This pipeline requires a JavaScript function and a JSON describing the resulting BigQuery schema.
You will need to add a UDF in JavaScript that converts the JSON to newline-delimited JSON when creating the job.
This will retrieve the files from GCS, convert them and upload them to BigQuery automatically.
I am trying to upload a compressed file from my GCS bucket into BigQuery.
In the new UI it is not clear how I should specify that the file needs to be decompressed.
I get an error, as if gs://bucket/folder/file.7z were a .csv file.
Any help?
Unfortunately, .7z files are not supported by BigQuery, only gzip files, and the decompression is handled automatically after you select the data format and create the table.
If you think BigQuery should accept .7z files too, you could file a feature request so the BigQuery engineers have it in mind for future releases.
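As a workaround, you can decompress the archive yourself and re-compress the contents as gzip before loading. A minimal sketch using the third-party py7zr package, assuming the archive contains only CSV files at its top level (file, folder, and bucket names are placeholders):

```python
import gzip
import shutil

import py7zr
from google.cloud import storage

# Extract the .7z locally, then re-compress each member as gzip,
# which BigQuery can ingest directly from GCS.
with py7zr.SevenZipFile("file.7z", mode="r") as archive:
    names = archive.getnames()
    archive.extractall(path="extracted")

bucket = storage.Client().bucket("bucket")  # placeholder bucket
for name in names:
    gz_name = f"{name}.gz"
    with open(f"extracted/{name}", "rb") as src, gzip.open(gz_name, "wb") as dst:
        shutil.copyfileobj(src, dst)
    bucket.blob(f"folder/{gz_name}").upload_from_filename(gz_name)
```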
I've configured DMS to read from a MySQL database and migrate its data to S3 with ongoing replication. Everything seems to work fine: it creates big CSV files for the full load and then starts creating smaller CSV files with the deltas.
The problem is that when I read these CSV files with AWS Glue crawlers, they don't seem to pick up the deltas, or worse, they pick up only the deltas and ignore the big CSV files.
I know that there is a similar post here: Athena can't resolve CSV files from AWS DMS
But it is unanswered and I can't comment there, so I'm opening this one.
Has anyone found a solution to this?
Best regards.
We are able to load uncompressed CSV files and gzipped files completely fine.
However, if we want to load CSV files compressed as .zip, what is the best approach?
Will we need to manually convert the .zip to .gz, or has BigQuery added support to handle this?
Thanks
BigQuery supports loading gzip files.
The limitation is that if you use gzip compression, BigQuery cannot read the data in parallel, so loading compressed CSV data into BigQuery is slower than loading uncompressed data.
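Since BigQuery's load jobs do not accept .zip archives directly, the practical route is to unpack the archive yourself. A minimal sketch with the Python client, assuming each member is a CSV with a header row (file and table names are placeholders):

```python
import io
import zipfile

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes a header row
    autodetect=True,
)

# Unpack the .zip in memory and load each member as plain (uncompressed) CSV,
# which also avoids the serial-read penalty of gzip.
with zipfile.ZipFile("export.zip") as archive:
    for name in archive.namelist():
        client.load_table_from_file(
            io.BytesIO(archive.read(name)), table_id, job_config=job_config
        ).result()
```

If you would rather stage the files in GCS first, re-compressing each member with gzip before uploading works too; it just loads more slowly, as noted above.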
You can try 42Layers.io for this. We use it to import zipped CSV files directly from FTP into BQ, and then set it on a schedule to run every day. They also let you do field mapping to your existing tables within BQ. Pretty neat.