I created a dataframe and selected some columns say col1col2 and col3 using df.select().
df1=df.select(col1,col2,col3)
I am writing this into a parquet file and saving it to s3.
df1.write.partitionBy("col1").format("parquet").save('s3a://myBucket/fol1/subfolder')
currently there is no location like 's3a://myBucket/fol1/subfolder' in my s3. Only thing I have is 's3a:myBucket'. My question as there are no objects named fol1 and subfolder.Will It create objects itself and save the file? or the code will fail?
I think you're asking if save('s3a://myBucket/fol1/subfolder') will create the fol1/subfolder structure in S3, and if it doesn't, do you need to.
The bottom line is that you don't need to worry about creating the intermediate folder structure because Hadoop FS API creates it for you, as needed.
#SteveLoughran's answer provides much more detail and deserves to be the accepted answer.
Although S3 is an object store, Spark, Hive &c all pretend its a filesystem & use the Hadoop filesystem API.
Some early actions of a spark save() are
call FileSystem.exists(dest) & fail if there's something there (unless you have enabled appending to existing data)
call FileSystem.mkdir(dest).
set up some _temporary dir underneath for the job, renaming things into place when the job is committed.
Action #2 triggers a scan for any entry in the path /a/b/c/dest being a file (Failure), creates an empty directory marker object /a/b/c/dest/. That marker will be deleted as soon as a child directory (i.e _temporary) is created.
At the end of the job then, there won't be any parent marker entries, but they go in there just to keep quiet all those bits of code which expect that after a mkdirs() call that the created directory exists.
Finally, be advised: the whole commit-by-rename mechanism is broken when it comes to S3 as it is (a) slow and (b) at risk of losing data due to directory listing consistency. You need a consistent listing layer (EMR: Consistent S3, Apache Hadoop: S3Guard, Databricks: something also DynamoDB based), and, for maximum performance atop Apache Hadoop 3.1, switch to a specific zero-rename S3A committer.
Related
We have a csv file that is maintained by an analyst who manually updates it at irregular intervals and reuploads (by drag and drop) the same file to an S3 bucket. I have Snowpipe set up to ingest files from this S3 bucket, but it won't re-process the same filename even when the contents change. We don't want to rely on the analyst(s) remembering to manually rename the file each time they upload it, so are looking for an automated solution. I have pretty minimal input on how the analysts work with this file, I just need to ingest it for them. The options I'm considering are:
Somehow adding a timestamp or unique identifier to the filename on
upload (not finding a way to do this easily in S3). I've also
experimented with versioning in the S3 bucket but this doesn't seem
to have any effect.
Somehow forcing the pipe to grab the file again even with the same name. I've read
elsewhere that setting "Force=true" might do it, but that seems to
be an invalid option for a pipe COPY INTO statement.
Here is the pipe configuration, I'm not sure if this will be helpful here:
CREATE OR REPLACE PIPE S3_INGEST_MANUAL_CSV AUTO_INGEST=TRUE AS
COPY INTO DB.SCHEMA.STAGE_TABLE
FROM(
SELECT $1, $2, metadata$filename, metadata$file_row_number
FROM #DB.SCHEMA.S3STAGE
)
FILE_FORMAT=(
TYPE='csv'
skip_header=1
) ON_ERROR='SKIP_FILE_1%'
enter code here
Ignoring the fact that updating the same file rather than having a unique filename is really bad practice, you can use the FORCE option to force the reloading of the same file.
If the file hasn't been changed and you run the process with this option you'll potentially end up with duplicates in your target
I'm trying to figure out how to:
a) Store / insert an object on Google Cloud Storage within a sub-directory
b) List a given sub-directory's contents
I managed to resolve how to get an object here: Google Cloud Storage JSON REST API - Get object held in a sub-directory
However, the same logic doesn't seem to apply to these other types of call.
For store, this works:
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2objectname
but it then stores the file name on GCS as foldername_filename, which doesn't change functionality but isn't really ideal.
For listing objects in a bucket, not sure where the syntax for a nested directory should go in here: storage.googleapis.com/storage/v1/b/bucketname/o.
Any insight much appreciated.
The first thing to know is that GCS does not have directories or folders. It just has objects that share a fixed prefix. There are features that both gsutil and the UI used to create the illusion that folders do exist.
https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
With that out of the way: to create an object with a prefix you need to URL encode the object name, as I recall %2F is the encoding for /:
https://cloud.google.com/storage/docs/request-endpoints#encoding
Finally to list only the objects with a common prefix you would use the prefix parameter:
https://cloud.google.com/storage/docs/json_api/v1/objects/list#parameters
Using prefix=foo/bar/baz (after encoding) would list all the objects in the foo/bar/baz "folder", note that this is recursive, it will include foo/bar/baz/quux/object-name in the results. To stop at one level you want to read about the delimiter parameter too.
I am currently using rclone accessing AWS S3 data, and since I don't use either one much I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks and there are 1000 folders I am looking at. To see these, I am using the windows command prompt and entering :
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this :
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit : I realized I need a way to list the latest file (along with it's folder #) without listing every single file and timestamp possible in all 999 directories. I am starting a bounty and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents from all 999 folders, it's useless as the next folder will be active by that time.
If you wanted to know the specific folder with the very latest file, you should write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which bucket it is in. Here's a Python script that does it:
import boto3
s3_resource = boto3.resource('s3')
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')
date_key_list = [(object.last_modified, object.key) for object in objects]
print(len(date_key_list)) # How many objects?
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!
I have a Dataflow job which:
Reads a text file from GCS with other filenames in it
Passes the filenames to ReadAllFromParquet to read the .parquet files
Writes to BigQuery
Despite my job 'succeeding' it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the files in a list such as:['gs://my_bucket/my_file1.snappy.parquet','gs://my_bucket/my_file2.snappy.parquet','gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct and the GCS paths to the files are correct using a logger on the step before ReadAllFromParquet.
That's what my pipeline looks like (omitting the full code for brevity but I am confident that it normally works as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
try:
final_data = (
pipeline_2
|'Create empty PCollection' >> beam.Create([None])
|'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
|'Read all filenames into a list'>> beam.ParDo(FileIterator(runtime_options.files_bucket))
|'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId','deviceId','timestamp'])
|'Process all files' >> beam.ParDo(ProcessSch2())
|'Transform to rows' >> beam.ParDo(BlisDictSch2())
|'Write to BigQuery' >> beam.io.WriteToBigQuery(
table = runtime_options.comp_table,
schema = SCHEMA_2,
project = pipeline_options_batch.view_as(GoogleCloudOptions).project, #options.display_data()['project'],
create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED, #'CREATE_IF_NEEDED',#create if does not exist.
write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND #'WRITE_APPEND' #add to existing rows,partitoning
)
)
except Exception as exception:
logging.error(exception)
pass
That's what my job diagram looks like after:
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My ideas currently:
A bucket permissions issue. I noticed the bucket I am reading from is odd as earlier I couldn't download the files despite being a project Owner. The Owners of project only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin' and it then worked fine when manually downloading files with my own account. As per the Dataflow documentation I have ensured that both the default compute service account as well as the dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring as ultimately if there was a permissions issue I should see this in the log and the job would fail?
ReadAllFromParquet expects the file patterns in a different format? I have showed an example of the list (in my diagram above I can see the input collection correctly shows elements added = 48 for 48 files in the list) I supply above. I know this format works for ReadAllFromText so I assumed that they are equivalent and should work.
=========
EDIT:
Noticed something else potentially consequential. Comparing against my other job which uses ReadAllFromText and works fine I noticed a slight mismatch in the naming that is worrying.
This is the name of the output collection for my working job:
And that's the name on my parquet job that doesn't actually read anything:
Note specifically
Read all files/ReadAllFiles/ReadRange.out0
vs
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step for both jobs.
But I believe the second to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py) which both ReadAllFromText and ReadAllFromParquet call.
Seems like a potential bug but don't seem to be able to trace it in the source code.
=============
EDIT 2
After some more digging it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource whereas ReadAllFromParquet simply calls
apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mentioned if you are using the last Beam SDK, try using SDK 2.16 to test the last changes.
The doc states that ReadAllFromParquet is an experimental funtion as well as ReadFromParquet; nonetheless, ReadFromParquet is reported as working in this thread Apache-Beam: Read parquet files from nested HDFS directories, you might want to try to using this funtion.
For reference: GET Bucket (List Objects)
When I do a get request on the root bucket it comes back with test/ and test/subdir/ both 0 bytes. Which is correct, there should be 2 folders up there. When I upload a file to test/subdir/file. The root bucket has an item with the key=test/subdir/file. test/ and test/subdir/ are still 0 bytes. When I do a get request on test/subdir/ it returns nothing.
What's going on here?
Note: I do not have access to the console.
Greg, this might sound confusing at first, but the truth is that there's no such thing as "a folder" in Amazon S3. I'll explain.
The data structure of S3 is like a flat list of objects -- not like a tree. When you think you have a "file" called puppy.jpg inside a "folder" called pics, what you actually have is an object which key is pics/puppy.jpg. Note that the / character is not any more special than the . character, or the p characters.
You might be thinking, Bruno is nuts, I see folders in the AWS Management Console. True, you see the folders. But they are actually emulated by the GUI.
When you create a folder through the AWS Management Console, what it will actually do is create an object which name is the full path of the "folder", with a trailing slash, and 0 bytes. Just like the test/ object (not "folder") and the test/subdir/ object (not "folder") you mention in your question.
To actually identify and draw "folders", the AWS Management Console (as well as many other S3 browsing tools) is doing is some API magic with the parameters delimiter and prefix.
Now, knowing the fact that there's no such thing as a folder, and that they are emulated through the use of those 0-byte, trailing-/ objects, it should be easy to understand why you see the test/ object as a 0-byte object... The same reasoning would explain why you see nothing when you do a GET on a "folder" -- you are actually downloading a 0-byte object!
Finally, as a conclusion, there's no easy way to obtain from S3 the size of "a folder" (they don't exist...). The only way would be for you to list all the objects with that prefix and add their sizes. Or keep an index of your object ("files" and "folders") in some kind of database with more advanced querying capabilities.