I unloaded a set of 200 million records from Redshift to S3 using SQLWorkbench. I got a message saying "Unload complete, 2,00,00,00,000 records complete". However, when I download this file from S3 and open it, there are only 40 million rows. There were no errors at any point. I am very confused and unable to proceed because of this issue.
What could be causing the issue?
An unload of this size will not be in one file. Each unloaded file is limited to 6.2 GB, or smaller if the MAXFILESIZE parameter is set. Also, if PARALLEL is ON (the default), each slice in Redshift will write its own set of files to S3. I expect you are only looking at one of the many files that were created by the UNLOAD. Each file will have a slice number and a part number attached to the file base name you provided in the UNLOAD statement.
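If you want a single listing of everything the UNLOAD produced, one option is the MANIFEST keyword. A minimal sketch (table name, bucket, prefix, and role ARN are placeholders):
-- MANIFEST writes a <prefix>manifest file to S3 listing every part file the UNLOAD created
unload ('select * from my_big_table')
to 's3://my-bucket/exports/my_big_table_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
manifest;
Listing the S3 prefix (or reading the manifest) should show all of the slice/part files, and their row counts should add up to the total reported by the UNLOAD.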
First of all I would like to mention that I tried searching for this issue in existing SO questions but I couldn't find the scenario I came across. Hence asking a new question.
So I am trying to import data from S3 to Redshift.
The data in S3 is JSON data separated by the newline character, i.e. \n (exported using the UNLOAD command from another Redshift cluster).
The copy command is:
copy redhist_table_name
from 's3://bucket/path/to/s3/json/file.json'
iam_role 'iam_role_info'
region 'region'
json 'auto';
STL_LOAD_ERRORS shows the error as "delimiter not found", but when I looked closely I found that the copy command is copying only the first 1024 characters of the JSON row, which results in the above error.
I looked through all the options that the copy command offers to see if there is a way to increase this limit, but I found none.
Any idea where this limit is coming from? Or is this not the root cause of the issue?
I expect this is not the root cause. STL_LOAD_ERRORS only stores 1024 characters in the "raw_line" column. There may very well be a limit to how long a line can be, but I know that it is much longer than 1024 characters.
Which line in the file is COPY failing on? First or somewhere later in the file?
If it is a line deep in the file, there may be something off about that line. UNLOAD to COPY should work correctly, but I can see that there may be some corner cases (like " in string values). If it is a specific line, then posting that line (sanitized if need be) would be helpful.
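To see exactly which file and line COPY rejected, a query along these lines against STL_LOAD_ERRORS should work (the 1024-character truncation of raw_line is a property of this system table, not of COPY itself):
-- most recent load errors: which file, which line, and why
select query, filename, line_number, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;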
I was recently asked a question in an interview; I'm hoping someone can help me figure it out.
Suppose we have 100 files, and a process reads a file, parses it, and writes the data into a database.
Now let's say the process was at file number 60 when the power went off. How would you design the system so that when power comes back, the process resumes writing data into the database where it left off before the shutdown?
This would be one way:
Loop over:
Pick up a file
Check it hasn't been processed with a query to the database.
Process the file
Update the database
Update the database with a log of the file processed
Commit
Move the file out of the non-processed queue
You can also log the file entry to some other persistent resource.
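A minimal sketch of the database side of that loop, assuming a hypothetical processed_files log table and a hypothetical target table named records:
-- one log row per successfully processed file
create table processed_files (
    file_name    varchar(1024) primary key,
    processed_at timestamp default current_timestamp
);

-- for each file the loop picks up:
begin;
-- 1. skip the file if it already has a log row (the application checks this first)
select 1 from processed_files where file_name = 'file_060.csv';
-- 2. write the parsed rows
insert into records (col_a, col_b) values ('...', '...');
-- 3. log the file in the same transaction, so data and log commit together
insert into processed_files (file_name) values ('file_060.csv');
commit;
After a power failure, any file without a row in processed_files is simply picked up again; anything half-written was rolled back with the uncommitted transaction.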
Q: What if there are many files? Doesn't writing to logs slow down the process?
A: Probably not much, it's just one entry into the database per file. It's the cost of resilience.
Q: What if the files are so small it's almost only updating one row per file?
A: Make your update query idempotent (see the sketch after this list). Don't log, but ensure that files are removed from the queue once the transaction is complete.
Q: What if there are many lines in a file. Do you really want to restart with the first line of a file?
A: Depends on the cost/benefit. You could split the file into smaller ones prior to processing each sub-file. If the power out happens all the time, then that's a good compromise. If it happens very rarely, the extra work by the system may not be worth it.
Q: What if there is a mix of small and large files?
A: Put the files into separate queues that handle them accordingly.
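A sketch of an idempotent write, assuming a hypothetical records table keyed by (file_name, line_no) and a PostgreSQL-style upsert:
-- re-running the same file produces no duplicate rows
insert into records (file_name, line_no, payload)
values ('file_060.csv', 1, '...')
on conflict (file_name, line_no) do nothing;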
The UPS idea by @TimBiegeleisen is very good, though:
"Well actually it is about that, because unplugging a database in the middle of a lengthy transaction might result in corrupted data." – Tim Biegeleisen Feb 22 '20 at 10:24
I've experienced the failure of one such UPS, so you'll need two.
I think you must:
Store a reference to the file somewhere (an ID or an index of the processed file; it depends on the case really).
Define the boundaries of a single transaction. Let it be the full processing of one file: read the file, parse it, store the data in the database, and update the reference to the file you processed. If all of that succeeds, you can commit the transaction to the database.
Have your main task, which processes all the files, look into the reference table and, based on its state, fetch the next file (see the query sketch below).
In this case you create a transaction around the processing of a single file. If anything goes wrong there, you can always rerun the processing job and it will start where it left off.
Please be aware that this is a very simple example; in most scenarios you want to keep transactions as thin as possible.
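A sketch of that "fetch the next file" step, assuming hypothetical file_queue and processed_files (reference) tables:
-- pick the next file that has no entry in the reference table
select f.file_name
from file_queue f
left join processed_files p on p.file_name = f.file_name
where p.file_name is null
order by f.file_name
limit 1;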
My actual data in the CSV extracts starts from line 10. How can I skip the top few lines in a Snowflake load using COPY or any other utility? Do we have anything similar to SKIP_HEADER?
I have files on S3 and it is my stage. I will be creating a Snowpipe on this data source later.
Yes, there is a SKIP_HEADER option for CSV, allowing you to skip a specified number of rows when defining a file format. Please have a look here:
https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html#type-csv
So you create a file format associated with the CSV files you have in mind and then use it when calling the COPY command.
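A minimal sketch, assuming your data starts on line 10 (so 9 lines are skipped) and placeholder names for the file format, table, and S3 stage:
-- file format that skips the first 9 lines of each file
create or replace file format my_csv_format
  type = csv
  skip_header = 9;

-- use it when copying from the S3 stage
copy into my_table
from @my_s3_stage
file_format = (format_name = 'my_csv_format');
The same file format can later be referenced by the Snowpipe's COPY statement.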
I'm loading files into Azure DW from blob store using polybase.
I usually use sys.dm_pdw_exec_requests and sys.dm_pdw_sql_requests to see what any long running processes are doing, but polybase loads have limited information.
Is there a view that can show the list of files Polybase has found in the directory and indicate any kind of progress (maybe completed files or rows loaded)?
We're still adding to the functionality around Polybase monitoring.
Here is a query that will help you to monitor the progress of the current files being loaded. "Current" means that if there are 1,000 files in a data set, and Polybase is processing them 10 at a time, only 10 rows should result from this query at any given time.
-- To track bytes and files
SELECT
r.command,
s.request_id,
r.status,
count(distinct input_name) as nbr_files,
sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
sys.dm_pdw_exec_requests r
inner join sys.dm_pdw_dms_external_work s
on r.request_id = s.request_id
GROUP BY
r.command,
s.request_id,
r.status
ORDER BY
nbr_files desc,
gb_processed desc;
This is an increasingly important topic, and I've created a User Voice task to register user support. Would you mind adding your votes/comments?
While unloading a large result set to S3, Redshift automatically splits the output into multiple files. Is there a way to set the size of each part while unloading?
When unloading, you can use MAXFILESIZE to indicate the maximum size of each file.
For example:
unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
maxfilesize 1 gb;
From the Redshift UNLOAD documentation.
Redshift by default unloads data into multiple files according to the number of slices in your cluster. So, if you have 4 slices in the cluster, you would have 4 files being written concurrently.
Here is a short answer to your question from the documentation; see the UNLOAD reference for details.
"By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files."
I hope this helps.