I have an S3 bucket streaming logs to a Lambda function that tags files based on some logic.
While I have worked around this issue in the past, and I understand there are some characters that need special handling, I'm wondering whether there is a safe way to handle this with some API, or whether it's something I need to handle on my own.
For example I have a lambda function like so:
import boto3

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        objectName = record["s3"]["object"]["key"]
        tags = []
        if "Pizza" in objectName:
            tags.append({"Key": "Project", "Value": "Great"})
        if "Hamburger" in objectName:
            tags.append({"Key": "Project", "Value": "Good"})
        if "Liver" in objectName:
            tags.append({"Key": "Project", "Value": "Yuck"})
        s3.put_object_tagging(
            Bucket=bucket,
            Key=objectName,
            Tagging={
                "TagSet": tags
            }
        )
    return {
        'statusCode': 200,
    }
This code works great. I upload a file to S3 called Pizza-Is-Better-Than-Liver.txt, the function runs, and it tags the file with both Great and Yuck (sorry for the strained example).
However, if I upload the file Pizza Is+AmazeBalls.txt, things go sideways:
Looking at the event in CloudWatch, the object key shows as Pizza+Is%2BAmazeBalls.txt.
Obviously the space is escaped to a + and the + to %2B; when I pass that key to put_object_tagging() it fails with a NoSuchKey error.
My question: is there a defined way to deal with escaped characters in boto3 or some other SDK, or do I just need to handle it myself? I really don't want to add any modules to the function, and I could just do a contains/replace(), but it's odd that I would get something back that I can't immediately use without some transformation.
I'm not uploading the files and can't mandate what people call things (i-have-tried-but-it-fails); if it's a valid Windows or Mac filename it should work (I get that is a whole other issue, but I can deal with that).
EDIT:
After some comments on GitHub, it turns out I should have been using urllib.parse.unquote_plus in this situation; this is the proper way to solve escaping issues like this.
from urllib.parse import unquote_plus
print(unquote_plus("Pizza+Is%2BAmazeBalls.txt"))
# Pizza Is+AmazeBalls.txt
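Applied to the handler above, decoding the key with unquote_plus before calling put_object_tagging would look roughly like this (a trimmed sketch of the original function; only the Pizza tag is shown):

import boto3
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # The key arrives URL-encoded (space -> "+", "+" -> "%2B"), so decode
        # it before handing it back to the S3 API.
        objectName = unquote_plus(record["s3"]["object"]["key"])
        tags = []
        if "Pizza" in objectName:
            tags.append({"Key": "Project", "Value": "Great"})
        s3.put_object_tagging(
            Bucket=bucket,
            Key=objectName,
            Tagging={"TagSet": tags}
        )
    return {"statusCode": 200}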
Original Answer:
Since there were no other answers, I guess I'll post my band-aid:
def format_path(path):
    path = path.replace("+", " ")
    path = path.replace("%21", "!")
    path = path.replace("%24", "$")
    path = path.replace("%26", "&")
    path = path.replace("%27", "'")
    path = path.replace("%28", "(")
    path = path.replace("%29", ")")
    path = path.replace("%2B", "+")
    path = path.replace("%40", "@")
    path = path.replace("%3A", ":")
    path = path.replace("%3B", ";")
    path = path.replace("%2C", ",")
    path = path.replace("%3D", "=")
    path = path.replace("%3F", "?")
    return path
I'm sure there is a simpler, more complete way to do this but this seems to work... for now.
The following packages are installed in my Visual Studio solution’s project:
<PackageReference Include="Amazon.Extensions.Configuration.SystemsManager" Version="4.0.0" />
<PackageReference Include="Amazon.Lambda.APIGatewayEvents" Version="2.5.0" />
<PackageReference Include="Amazon.Lambda.Core" Version="2.1.0" />
<PackageReference Include="Amazon.Lambda.Serialization.SystemTextJson" Version="2.3.0" />
<PackageReference Include="AWSSDK.S3" Version="3.7.9.21" />
<PackageReference Include="AWSSDK.SecretsManager" Version="3.7.2.65" />
<PackageReference Include="AWSSDK.SecretsManager.Caching" Version="1.0.4" />
<PackageReference Include="Microsoft.Extensions.Configuration.EnvironmentVariables" Version="3.1.27" />
<PackageReference Include="Microsoft.Extensions.Configuration.FileExtensions" Version="3.1.27" />
<PackageReference Include="Microsoft.Extensions.Configuration.Json" Version="3.1.27" />
<PackageReference Include="Microsoft.Extensions.DependencyInjection" Version="3.1.27" />
<PackageReference Include="Microsoft.Extensions.Logging" Version="3.1.27" />
<PackageReference Include="Microsoft.Extensions.Logging.Console" Version="3.1.27" />
<PackageReference Include="starkbank-ecdsa" Version="1.3.3" />
<PackageReference Include="Swashbuckle.AspNetCore" Version="6.3.0" />
Let’s say that my AWS Cloud account has the following parameters:
/bible/OldTestament/Law/Genesis/Chapter1
/bible/OldTestament/Law/Genesis/Chapter2
/bible/OldTestament/Law/Genesis/Chapter3
…..
/bible/OldTestament/Law/Exodus/Chapter1
/bible/OldTestament/Law/Exodus/Chapter2
…..
/bible/NewTestament/Gospel/Mark/Chapter1
/bible/NewTestament/Gospel/Mark/Chapter2
…..
/bible/NewTestament/Gospel/John/Chapter1
/bible/NewTestament/Gospel/John/Chapter2
private GetParametersByPathResponse GetMyAppAWSParameters(string path)
{
    GetParametersByPathRequest request = new GetParametersByPathRequest()
    {
        Path = path,
        WithDecryption = false
    };
    return _ssm.GetParametersByPathAsync(request).Result;
}
The aforementioned method works with paths that are just one level up from the leaf node, e.g.:
path = /bible/OldTestament/Law/Genesis/
The returned response parameter list contains:
/bible/OldTestament/Law/Genesis/Chapter1
/bible/OldTestament/Law/Genesis/Chapter2
/bible/OldTestament/Law/Genesis/Chapter3
Or
path = /bible/NewTestament/Gospel/John/
The returned response parameter list contains:
/bible/NewTestament/Gospel/John/Chapter1
/bible/NewTestament/Gospel/John/Chapter2
However, if I provide a shorter path further up the hierarchy, like the following:
path = /bible/OldTestament/Law/
Or
path = /bible/NewTestament/
unfortunately, the returned response parameter list is empty.
Essentially, I was trying to implement code that is flexible, intelligent & sophisticated enough to handle paths regardless of the hierarchy level.
Could someone please provide code that will allow me to do this?
It works when I set Recursive to true.
private GetParametersByPathResponse GetMyAppAWSParameters(string path)
{
    GetParametersByPathRequest request = new GetParametersByPathRequest()
    {
        Path = path,
        Recursive = true,
        WithDecryption = false
    };
    return _ssm.GetParametersByPathAsync(request).Result;
}
I'm trying to get the table name from parquet files using regex. I'm using the following code to attempt this, but the ctSchema dataframe doesn't seem to run, causing the job to return 0 results.
ci = spark.createDataFrame(data=[("", "", "")], schema=ciSchema)
files = dbutils.fs.ls('a filepath goes here')
results = {}
is_error = False

for fi in files:
    try:
        dataFile = spark.read.parquet(fi.path)
        ctSchema = spark.createDataFrame(data=dataFile.dtypes, schema=tSchema).withColumn(
            "TableName",
            regexp_extract(input_file_name(), "([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet", 1),
            lit(fi.name))
        ci = ci.union(ctSchema)
    except Exception as e:
        results[fi.name] = f"Error: {e}"
        is_error = True
Your regex ([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet is incorrect; try this one instead: [a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet.
First, I used page_ instead of shard_, which matches your file name.
Second, you don't want to group ([a-zA-Z0-9]+_[a-zA-Z0-9]+), which would match TCP_119Customer. You only want the second token, so changing it to [a-zA-Z0-9]+_([a-zA-Z0-9]+) will fix the issue.
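Folding the corrected pattern back into the question's loop might look roughly like this (a sketch; spark, dataFile, tSchema, and fi are assumed to exist as in the original code; because the schema dataframe is built with createDataFrame rather than read from the parquet file, the path is taken from lit(fi.path) here instead of input_file_name(), and the file name goes in its own column rather than as a third argument to withColumn):

from pyspark.sql.functions import lit, regexp_extract

# Capture only the table name (the second token) and match "page_" in the file name.
pattern = r"[a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet"

ctSchema = (
    spark.createDataFrame(data=dataFile.dtypes, schema=tSchema)
    .withColumn("TableName", regexp_extract(lit(fi.path), pattern, 1))
    .withColumn("FileName", lit(fi.name))
)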
Alright, so I have the file transfer part working, but what I'm dealing with is on a huge scale (hundreds of thousands of potential uploads); so what I'm trying to do is this:
Trigger a Lambda to move the source uploaded object to a new location
The location should be a named key that includes the object's name (in a different bucket)
I have it moving the files from one S3 bucket to another; I just can't figure out how to get it to create a new key in my destination bucket based on the name of the uploaded file.
Example: uploaded file grandkids.jpg -> the Lambda put trigger moves the file to /grandkids/grandkids.jpg
Thank you all in advance. (It doesn't help that I only know the little bit of Node.js/Python I've picked up from Lambda; I am not an experienced coder at all.)
You just want to split the filename and use that as the prefix, like below.
fn = 'grandkids.jpg'
folder = fn.split('.')[0]
newkey = folder + '/' + fn
print(newkey)
# grandkids/grandkids.jpg
But what if you have a filename with more than one '.'? Use rsplit with 1 to split only on the rightmost '.':
fn = 'my.awesome.grandkids.jpg'
folder = fn.rsplit('.', 1)[0].replace('.', '_') #personal preference to use underscores in folder names
newkey = folder + '/' + fn
print(newkey)
# my_awesome_grandkids/my.awesome.grandkids.jpg
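Wired into the Lambda trigger, the whole move might look roughly like this (a sketch; the destination bucket name is a placeholder, and since S3 has no real "move", the object is copied to the new key and the source is then deleted):

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
DEST_BUCKET = "my-destination-bucket"  # placeholder

def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded, so decode first.
        key = unquote_plus(record["s3"]["object"]["key"])
        folder = key.rsplit(".", 1)[0].replace(".", "_")
        new_key = folder + "/" + key
        # Copy to the destination bucket under the new key, then remove the source.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        s3.delete_object(Bucket=src_bucket, Key=key)
    return {"statusCode": 200}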
I'm trying to apply a regex to something like this in Terraform.
variable "version" {
type = "string"
default = https://s3-us-east-1.amazonaws.com/bucket/folder/item-v1.2/item-item2-v1.2.gz
description = "version"
}
name = "${replace(var.version, "//", "")}"
I just need to replace everything and output only "v1.2" for the name, specifically from the item-v1.2 file path. I was thinking of doing something along the lines of
name = "${replace(var.version, "//", "")}"
wherein I replace everything but v1.2.
Any idea how to do it? Thank you in advance!
There are a couple of issues with your variable definition (e.g. wrong use of quotes, the variable name "version" is reserved due to its special meaning inside module blocks, and so on), so I'll operate on the following:
variable "my_version" {
type = string
default = "https://s3-us-east-1.amazonaws.com/bucket/folder/item-v1.2/item-item2-v1.2.gz"
description = "version"
}
Then you can extract the version (or replace everything else except the version as requested) from the sub-folder path like this (this has been run and tested with terraform console):
$ terraform console
> var.my_version
"https://s3-us-east-1.amazonaws.com/bucket/folder/item-v1.2/item-item2-v1.2.gz"
> regex(".*-(.*)/.*.gz", var.my_version)[0]
"v1.2"
> replace(var.my_version, "/.*-(.*)/.*.gz/", "<before> $1 <after>")
"<before> v1.2 <after>"
I use Paperclip 4.0.2 in my app to upload pictures.
So my Document model has an attached_file called attachment.
The attachment has a few styles, say :medium, :thumb, :facebook.
In my model, I stop the styles processing and extract it into a background job.
class Document < ActiveRecord::Base
  # stop paperclip styles generation
  before_post_process { false }
end
But the :original style file is still uploaded!
I would like to know if it's possible to stop this behavior and copy the file into :original/filename.jpg from a remote directory.
My goal is to use a file that has been uploaded to an S3 /temp/ directory with jQuery File Upload, and copy it to the directory where Paperclip needs it to generate the other styles.
Thank you in advance for your help!
New Answer:
Paperclip attachments get uploaded in the flush_writes method which, for your purposes, is part of the Paperclip::Storage::S3 module. The line responsible for the uploading is:
s3_object(style).write(file, write_options)
So, by means of a monkey patch, you can change this to something like:
s3_object(style).write(file, write_options) unless style.to_s == "original" and @queued_for_write[:your_processed_style].present?
EDIT: this would be accomplished by creating the following file: config/initializers/decorators/paperclip.rb
Paperclip::Storage::S3.class_eval do
  def flush_writes #:nodoc:
    @queued_for_write.each do |style, file|
      retries = 0
      begin
        log("saving #{path(style)}")
        acl = @s3_permissions[style] || @s3_permissions[:default]
        acl = acl.call(self, style) if acl.respond_to?(:call)
        write_options = {
          :content_type => file.content_type,
          :acl => acl
        }
        # add storage class for this style if defined
        storage_class = s3_storage_class(style)
        write_options.merge!(:storage_class => storage_class) if storage_class
        if @s3_server_side_encryption
          write_options[:server_side_encryption] = @s3_server_side_encryption
        end
        style_specific_options = styles[style]
        if style_specific_options
          merge_s3_headers(style_specific_options[:s3_headers], @s3_headers, @s3_metadata) if style_specific_options[:s3_headers]
          @s3_metadata.merge!(style_specific_options[:s3_metadata]) if style_specific_options[:s3_metadata]
        end
        write_options[:metadata] = @s3_metadata unless @s3_metadata.empty?
        write_options.merge!(@s3_headers)
        s3_object(style).write(file, write_options) unless style.to_s == "original" and @queued_for_write[:your_processed_style].present?
      rescue AWS::S3::Errors::NoSuchBucket
        create_bucket
        retry
      rescue AWS::S3::Errors::SlowDown
        retries += 1
        if retries <= 5
          sleep((2 ** retries) * 0.5)
          retry
        else
          raise
        end
      ensure
        file.rewind
      end
    end
    after_flush_writes # allows attachment to clean up temp files
    @queued_for_write = {}
  end
end
Now the original does not get uploaded. You could then add some lines, like those of my original answer below, to your model if you wish to transfer the original to its appropriate final location if it was uploaded to S3 directly.
Original Answer:
Perhaps something like this, placed in your model and executed with the after_create callback:
paperclip_file_path = "relative/final/destination/file.jpg"
s3.buckets[BUCKET_NAME].objects[paperclip_file_path].copy_from("relative/temp/location/file.jpg")
thanks to https://github.com/uberllama