We have a function that gets the list of files in a zip file, and it works standalone and in Lambda until the file is larger than 512 MB.
The function needs to get a list of files in the zip file and read the contents of a JSON file that should be in the zip file.
This is part of the function:
try:
    s3_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    # s3_object = s3_client.head_object(Bucket=bucketname, Key=filename)
    # s3_object = s3_resource.Object(bucket_name=bucketname, key=filename)
except:
    return ('NotExist')

zip_file = s3_object['Body'].read()
buffer = io.BytesIO(zip_file)
# buffer = io.BytesIO(s3_object.get()['Body'].read())

with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as zip_files:
    for content_filename in zip_files.namelist():
        zipinfo = zip_files.getinfo(content_filename)
        if zipinfo.filename[:2] != '__':
            no_files += 1
        if zipinfo.filename == json_file:
            json_exist = True
            with io.TextIOWrapper(zip_files.open(json_file), encoding='utf-8') as jsonfile:
                object_json = jsonfile.read()
The get_object call is the issue, as it loads the whole file into memory, and obviously the larger the file, the more likely it is to exceed the memory available in Lambda.
I've tried using head_object, but that only gives me the metadata for the file, and I don't know how to get the list of files in the zip file when using head_object or resource.Object.
I would be grateful for any ideas please.
It would likely be the .read() operation that consumes the memory.
So, one option is to simply increase the memory allocation given to the Lambda function.
Or, you can download_file() the zip file, but Lambda functions are only given 512MB in the /tmp/ directory for storage, so you would likely need to mount an Amazon EFS filesystem for additional storage.
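For illustration, a minimal sketch of the download-to-disk route (the bucket/key variables mirror the ones in the question; with EFS you would swap /tmp for the mount point):

import zipfile
import boto3

s3_client = boto3.client('s3')

def list_zip_contents(bucketname, filename):
    # Stream the object to disk instead of holding the whole body in memory.
    local_path = '/tmp/' + filename.split('/')[-1]   # or a path on an EFS mount
    s3_client.download_file(bucketname, filename, local_path)

    # zipfile reads the central directory from disk, so memory use stays small.
    with zipfile.ZipFile(local_path, mode='r', allowZip64=True) as zf:
        return zf.namelist()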
Or you might be able to use smart-open (PyPI) to directly read the contents of the Zip file from S3 -- it knows how to use open() with files in S3 and also Zip files.
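A rough sketch of the smart-open idea, assuming the smart_open package is installed and that its S3 streams are seekable (zipfile needs seek() to find the central directory); the bucket and key names are placeholders:

import zipfile
from smart_open import open as s3_open

bucketname = 'my-bucket'        # placeholder
filename = 'big-archive.zip'    # placeholder

# smart_open streams the object lazily rather than calling .read() on the whole body.
with s3_open(f's3://{bucketname}/{filename}', 'rb') as fileobj:
    with zipfile.ZipFile(fileobj, mode='r', allowZip64=True) as zf:
        print(zf.namelist())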
Not a full answer, but too long for comments.
It is possible to download only part of a file from S3, so you should be able to grab only the list of files and parse that.
The zip file format places the list of archived files at the end of the archive, in a Central directory file header.
You can download part of a file from S3 by specifying a range to the GetObject API call. In Python, with boto3, you would pass the range as a parameter to the get_object() s3 client method, or the get() method of the Object resource.
So, you could read pieces from the end of the file in 1 MiB increments until you find the header signature (0x02014b50), then parse the header and extract the file names. You might even be able to trick Python into thinking it's a proper .zip file and convince it to give you the list while providing only the last piece(s). An elegant solution that doesn't require downloading huge files.
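A hedged sketch of that ranged-read idea, using a suffix Range on get_object to fetch only the tail of the archive (the 1 MiB tail size is an assumption; a very large central directory would need a bigger range or the incremental loop described above):

import boto3

s3_client = boto3.client('s3')

def fetch_zip_tail(bucketname, filename, tail_bytes=1024 * 1024):
    # Suffix range: ask S3 for just the last tail_bytes of the object.
    resp = s3_client.get_object(Bucket=bucketname, Key=filename, Range=f'bytes=-{tail_bytes}')
    tail = resp['Body'].read()

    # The central directory file header signature 0x02014b50 appears on disk
    # as the little-endian byte sequence b'PK\x01\x02'.
    if tail.find(b'PK\x01\x02') == -1:
        raise ValueError('Central directory not found in the last %d bytes; try a larger range' % tail_bytes)

    # From here you could parse the central directory records yourself, or
    # experiment with feeding a padded buffer to zipfile as suggested above.
    return tail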
Or, it might be easier to ask the uploader to provide a list of files with the archive. Depending on your situation, not everything has to be solved in code :).
We have a csv file that is maintained by an analyst who manually updates it at irregular intervals and re-uploads (by drag and drop) the same file to an S3 bucket. I have Snowpipe set up to ingest files from this S3 bucket, but it won't re-process the same filename even when the contents change. We don't want to rely on the analyst(s) remembering to manually rename the file each time they upload it, so we are looking for an automated solution. I have pretty minimal input on how the analysts work with this file; I just need to ingest it for them. The options I'm considering are:
Somehow adding a timestamp or unique identifier to the filename on upload (not finding a way to do this easily in S3). I've also experimented with versioning in the S3 bucket but this doesn't seem to have any effect.
Somehow forcing the pipe to grab the file again even with the same name. I've read elsewhere that setting "Force=true" might do it, but that seems to be an invalid option for a pipe COPY INTO statement.
Here is the pipe configuration; I'm not sure if it will be helpful here:
CREATE OR REPLACE PIPE S3_INGEST_MANUAL_CSV AUTO_INGEST=TRUE AS
COPY INTO DB.SCHEMA.STAGE_TABLE
FROM (
    SELECT $1, $2, metadata$filename, metadata$file_row_number
    FROM @DB.SCHEMA.S3STAGE
)
FILE_FORMAT=(
    TYPE='csv'
    skip_header=1
)
ON_ERROR='SKIP_FILE_1%'
Ignoring the fact that updating the same file rather than using a unique filename each time is really bad practice, you can use the FORCE option to force the reloading of the same file.
If the file hasn't been changed and you run the process with this option, you'll potentially end up with duplicates in your target.
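As a side note on the first option in the question (renaming on upload): this is not part of the answer above, but one hypothetical way to automate it is a small S3-triggered Lambda that copies each new upload to a timestamped key under a prefix the pipe's stage watches. copy_object is a standard boto3 call; the bucket names and prefixes below are placeholders:

import datetime
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the analyst's drop-zone prefix.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # e.g. incoming/report.csv -> ingest/report_20240101T120000.csv
        stamp = datetime.datetime.utcnow().strftime('%Y%m%dT%H%M%S')
        base, dot, ext = key.rsplit('/', 1)[-1].partition('.')
        new_key = f'ingest/{base}_{stamp}{dot}{ext}'

        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': key},
                       Key=new_key)

Pointing the pipe's stage at the renamed prefix means every upload arrives under a new filename, so Snowpipe treats it as a new file.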
I have a use case where I want to read the filename from a metadata table. I have written a pipeline function to read the metadata table, but I am not sure how I can pass this information to ReadFromText, as it only takes a string as input. Is it possible to assign this value to ReadFromText()? Please suggest some workarounds or ideas for how to achieve this. Thanks
code:
pipeline | 'Read from a File' >> ReadFromText(<I want to pass the file path here>,
                                              skip_header_lines=1)
Note: There will be various folders and files in storage, and the files are in csv format, but in my use case I can't directly pass the storage location or filename to the file path in ReadFromText. I want to read it from metadata and pass the value. Hope I am clear. Thanks
I don't understand why you need to read the metadata. If you want to read all the files inside a folder, you can just provide a glob pattern. This solution works in Python; I'm not sure about Java.
p | ReadFromText("./folder/*.csv")
"*" is the glob wildcard here, which allows the pipeline to read all files matching .csv. You can also add something at the start of the pattern.
What you want is textio.ReadAllFromText which reads from a PCollection instead of taking a string directly.
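For example, a minimal sketch under the assumption that the metadata lookup yields file paths as strings (beam.Create stands in for whatever transform actually reads your metadata table, and skip_header_lines on ReadAllFromText is assumed to be available in your Beam version):

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
    file_paths = (
        p
        # Placeholder: replace with the transform that reads paths from the metadata table.
        | 'File paths from metadata' >> beam.Create(['gs://my-bucket/folder/file1.csv'])
    )
    # ReadAllFromText consumes a PCollection of file paths/patterns.
    lines = file_paths | 'Read each file' >> ReadAllFromText(skip_header_lines=1)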
How could I possibly open a symbolic link and get the content of the link itself instead of the file it is pointing to?
By doing:
with open('/home/symlink.txt', 'rb') as f:
    data = f.read()
If the symbolic link points to /foo/faa.txt, the variable data will contain the content of faa.txt. This is a big security and file-handling problem for my server because I'm generating zip archives.
If, for example, a folder contains multiple symbolic links with different names to avoid duplicating files, the zip archive will contain multiple copies of the file instead of multiple symbolic links!
I hope that's clear enough!
An extra explanation:
The point of this is to allow downloading symlinks from a Django server. The way of returning files is the following:
response = HttpResponse()
response.write(data)
return response
This means that data must contain the content that the user will download; I cannot just give it a path. So what I need to do is give it a symbolic link. The problem is that reading a symbolic link makes Python read the content of the file it points to instead of the link's own content. In a few words, the user downloads the real file instead of the symbolic link!
A possible solution to this would be to get the path the symlink points to, and then generate the link in the buffer. Is this possible?
It looks like there are 2 questions here: How can you read a symlink from the filesystem, and how can you store this in a .zip file such that it will be recreated when you unzip it.
Reading a symlink
The contents of a symlink are defined here:
http://man7.org/linux/man-pages/man7/symlink.7.html
A symbolic link is a special type of file whose contents are a string that is the pathname of another file, the file to which the link refers
You can read that path by using os.readlink (https://docs.python.org/2/library/os.html#os.readlink) - this is analogous to C's readlink function.
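For example, a trivial sketch using the path from the question (assuming /home/symlink.txt really is a symlink):

import os

target = os.readlink('/home/symlink.txt')
print(target)  # e.g. '/foo/faa.txt' -- the path the link points to, not the target's contents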
It's also important to note that these symlinks aren't distinguished by their content or file attributes, but by the fact that the file entry on disk points to a string rather than a file object:
In other words, a symbolic link is a pointer to another name, and not to an underlying object.
This means that there isn't really a "file" you could store in the ZIP. So how do the existing zip & unzip utilities do it?
Storing a symlink in a zip file
The spec for the ZIP format is here: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
Note that section 4.5.7 (defining UNIX Extra Field) says:
The variable length data field will contain file type specific data. Currently the only values allowed are the original "linked to" file names for hard or symbolic links, and the major and minor device node numbers for character and block device nodes. [...] Link files will have the name of the original file stored.
This means that to store a symlink, all you need to do is add the UNIX extra field block to the data you are writing (these appear to live immediately after the filename is written, and you need to set the extra field length accordingly), and populate its "Variable length data field" with the path you get from readlink. The content you store for the node will be empty.
If you're using a library to generate the zip data (recommended!), it will probably have an abstraction available for that. If not, I'd suggest you put in a feature request!
Of course, most existing zip and unzip utilities follow the same definition, which is why you are able to zip and unzip symbolic links as if they were regular files.
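For what it's worth, Python's built-in zipfile module doesn't expose the UNIX extra field directly. A common alternative technique, matching what Info-ZIP does in practice, is to store the link target as the member's data and mark the entry as a symlink via the Unix mode bits in the external attributes. A hedged sketch of that approach:

import os
import stat
import zipfile

def add_symlink(zf, link_path, arcname=None):
    """Store the symlink itself (not its target file) in an open ZipFile."""
    target = os.readlink(link_path)                 # the path the link points to
    info = zipfile.ZipInfo(arcname or link_path.lstrip('/'))
    info.create_system = 3                          # 3 = Unix
    # The high 16 bits of external_attr hold the Unix st_mode; S_IFLNK marks a symlink.
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(info, target)                       # the link target is the member data

with zipfile.ZipFile('archive.zip', 'w') as zf:
    add_symlink(zf, '/home/symlink.txt')

unzip (and most other tools) will then recreate the entry as a symlink when it sees those attributes.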
I have a Talend job which is simple, like below:
ts3Connection -> ts3Get -> tfileinputDelimeted -> tmap -> tamazonmysqloutput.
Now the scenario here is that sometimes I get the file in .txt format and sometimes I get it in a zip file.
So I want to use tFileUnarchive to unzip the file if it's a zip, or bypass the tFileUnarchive component if the file is already unzipped, i.e. only in .txt format.
Any help on this is greatly appreciated.
The trick here is to break the file retrieval and potential unzipping into one sub job and then the processing of the files into another sub job afterwards.
Here's a simple example job:
As normal, you connect to S3 and then you might list all the relevant objects in the bucket using the tS3List and then pass this to tS3Get. Alternatively you might have another way of passing the relevant object key that you want to download to tS3Get.
In the above job I set tS3Get up to fetch every object that is iterated on by the tS3List component by setting the key as:
((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from the tS3Get which links the tFileUnarchive with the condition:
((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
This checks whether the file being downloaded from S3 is a .zip file.
The tFileUnarchive component then just needs to be told what to unzip, which will be the file we've just downloaded:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
From here we can now iterate through the downloads folder looking for the file types we want by setting the directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the global expression to "*.csv" (in my case, as I wanted to read in only the CSV files, including the previously zipped ones, that I had in S3).
Finally, we then read the delimited files by setting the file to be read by the tFileInputDelimited component as:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
And in my case I then simply printed this to the console, but obviously you would want to perform some transformation before uploading to your AWS RDS instance.
In Django, when uploading a file whose name contains spaces and brackets, it's stored in the file system with a different filename.
For example, when uploading the file 'lo go (1).jpg' via the admin interface, it's stored on the filesystem as 'lo__go_1.jpg'.
How can I know what the file will be called at upload time? I can't seem to find the source code that replaces the characters.
I found the answer to my question:
https://github.com/django/django/blob/master/django/db/models/fields/files.py#L310
https://github.com/django/django/blob/master/django/core/files/storage.py#L58
https://github.com/django/django/blob/master/django/utils/text.py#L234
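For anyone wanting a quick way to preview the stored name, the logic in those links ultimately goes through django.utils.text.get_valid_filename (called from the storage's get_valid_name()); a small sketch, with the caveat that the exact output can vary between Django versions and a suffix may still be added if the name already exists:

from django.utils.text import get_valid_filename

# Spaces are replaced with underscores and characters such as parentheses are stripped.
print(get_valid_filename('lo go (1).jpg'))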