We are currently setting up an Artifactory Pro instance on GCP and want to use GCS as its filestore. The connection to the bucket works, and uploads and downloads to and from the bucket via Artifactory succeed (using a generic repo).
However, Artifactory does not delete an artifact from the bucket when we delete it via the GUI. The artifact disappears from the GUI (the Trash Can is disabled in the System Settings) but continues to exist in the GCS bucket.
This is our binarystore.xml:
<?xml version="1.0" encoding="UTF-8"?>
<config version="v1">
    <chain>
        <provider id="cache-fs" type="cache-fs">
            <provider id="eventual" type="eventual">
                <provider id="retry" type="retry">
                    <provider id="google-storage" type="google-storage"/>
                </provider>
            </provider>
        </provider>
    </chain>
    <provider id="google-storage" type="google-storage">
        <endpoint>commondatastorage.googleapis.com</endpoint>
        <bucketName>rtfdev</bucketName>
        <identity>xxx</identity>
        <credential>xxx</credential>
        <bucketExists>false</bucketExists>
        <httpsOnly>true</httpsOnly>
        <httpsPort>443</httpsPort>
    </provider>
</config>
Our setup:
Artifactory 7.12.6
OS: Debian 10 (buster)
Machine Type: e2-highcpu-4 (4 vCPUs, 4 GB memory)
Disk: 200 GB SSD
The questions are:
Is this working as intended? Does Artifactory never ever delete artifacts in a bucket?
On a related note: How can we convince Artifactory to be more verbose with its interactions with GCS? (the artifactory-binarystore.log is suspiciously empty, console.log is quiet as well...)
The reason you are not seeing the artifact deleted immediately from the storage is that Artifactory uses checksum-based storage.
TL;DR - you will see the artifact deleted from storage once the garbage collection process deletes it.
Artifactory stores any binary file only once. This is what we call "once and once only storage". The first time a file is uploaded, Artifactory runs the required checksum calculations when storing the file. If the file is uploaded again (to a different location, for example), the upload is implemented as a simple database transaction that creates another record mapping the file's checksum to its new location. There is no need to actually store the file again in storage. No matter how many times a file is uploaded, the filestore only hosts a single copy of the file.
Deleting a file is also a simple database transaction in which the corresponding database record is deleted. The file itself is not directly deleted, even if the last database entry pointing to it is removed. Such "orphaned" files are removed in the background by Artifactory's garbage collection processes.
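To make the behaviour described above concrete, here is a minimal toy model of checksum-based storage in Python. It is only a sketch of the idea, not Artifactory's actual implementation: the class name ChecksumStore, its methods, and the use of SHA-1 as the key are assumptions made for illustration.

```python
import hashlib


class ChecksumStore:
    """Toy model of checksum-based storage (not Artifactory's real code)."""

    def __init__(self):
        self.filestore = {}  # checksum -> bytes (stands in for the bucket)
        self.paths = {}      # repo path -> checksum (stands in for the database)

    def upload(self, path, data):
        checksum = hashlib.sha1(data).hexdigest()
        # The binary is written only if this checksum has not been seen before.
        self.filestore.setdefault(checksum, data)
        self.paths[path] = checksum
        return checksum

    def delete(self, path):
        # Only the database record goes away; the binary stays in the filestore.
        del self.paths[path]

    def garbage_collect(self):
        # Remove binaries that no path references any more ("orphans").
        referenced = set(self.paths.values())
        orphans = [c for c in self.filestore if c not in referenced]
        for checksum in orphans:
            del self.filestore[checksum]
        return orphans


store = ChecksumStore()
store.upload("repo-a/app.jar", b"payload")
store.upload("repo-b/copy.jar", b"payload")    # same content, second path
binaries_after_upload = len(store.filestore)   # one binary, two records

store.delete("repo-a/app.jar")
store.delete("repo-b/copy.jar")
binaries_after_delete = len(store.filestore)   # binary is orphaned, not removed

removed = store.garbage_collect()
binaries_after_gc = len(store.filestore)
```

In this model the bucket still holds the binary after both deletes, which matches what the question observed; it only disappears once garbage_collect() runs.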
I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark with local mode (using spark-submit local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrites the particular partition it is writing to, in order to ensure idempotent jobs.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3, and then committing via a rename operation, is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set(
    "spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
)
self.spark.conf.set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
)
self.spark.conf.set(
    "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
I'm afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of zero-rename committers.
The "partitioned" committer was written by Netflix for their use case of updating/overwriting single partitions in an active table. It should work for you, as it is the same use case.
Consult the documentation.
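As a rough sketch of what switching to the partitioned committer could look like, the helper below builds the settings as a plain dictionary. The function name partitioned_committer_conf is hypothetical. The fs.s3a.committer.name and fs.s3a.committer.staging.conflict-mode keys are Hadoop S3A committer options, and the two committer classes are the ones already used in the question. Leaving partitionOverwriteMode at static is an assumption based on the error shown in the question. My understanding is that the Spark save mode then needs to be append rather than overwrite, with the "replace" conflict mode doing the per-partition overwrite, but please verify that against the S3A committer documentation for your Hadoop version.

```python
def partitioned_committer_conf(conflict_mode="replace"):
    """Build Spark settings for the S3A 'partitioned' staging committer.

    conflict_mode decides what happens when a destination partition
    already holds data: 'fail', 'append' or 'replace'.
    """
    if conflict_mode not in {"fail", "append", "replace"}:
        raise ValueError(f"unsupported conflict mode: {conflict_mode}")
    return {
        "spark.hadoop.fs.s3a.committer.name": "partitioned",
        "spark.hadoop.fs.s3a.committer.staging.conflict-mode": conflict_mode,
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
        # Assumption: dynamic mode is what triggers the IOException above,
        # so it is left at Spark's default of "static".
        "spark.sql.sources.partitionOverwriteMode": "static",
    }


settings = partitioned_committer_conf()
```

Each key/value pair can then be passed to SparkSession.builder.config(key, value) before the session is created, or supplied as --conf options to spark-submit.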
My Goal: I have hundreds of Google Cloud Storage folders with hundreds of images in them. I need to be able to zip them up and email a user a link to a single zip file.
I made an attempt to zip these files on an external server using PHP's zip function, but that has proved to be fruitless given the ultimate size of the zip files I'm creating.
I have since found that Google Cloud offers a Bulk Compress Cloud Storage Files utility (docs are at https://cloud.google.com/dataflow/docs/guides/templates/provided-utilities#api). I was able to successfully call this utility, but it zips each file into its own bzip or gzip file.
For instance, if I had the following files in the folder I'm attempting to zip:
apple.jpg
banana.jpg
carrot.jpg
The resulting outputDirectory would have:
apple.bzip2
banana.bzip2
carrot.bzip2
Ultimately, I'm hoping to create a single file named fruits.bzip2 that can be unzipped to reveal these three files.
Here's an example of the request parameters I'm sending to https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Bulk_Compress_GCS_Files:
{
    "jobName": "ziptest15",
    "environment": {
        "zone": "us-central1-a"
    },
    "parameters": {
        "inputFilePattern": "gs://PROJECT_ID.appspot.com/testing/samplefolder1a/*.jpg",
        "outputDirectory": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/",
        "outputFailureFile": "gs://PROJECT_ID.appspot.com/testing/zippedfiles/failure.csv",
        "compression": "BZIP2"
    }
}
The best way to achieve that is to create an app that:
Downloads locally all the files under a GCS prefix (what you call a "directory"; directories don't exist on GCS, only files sharing the same prefix)
Creates an archive (ZIP or TAR; ZIP won't really compress the images, since the image format is already compressed, but what you really want is a single file with all the images in it)
Uploads the archive to GCS
Cleans up the local files
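The four steps above can be sketched in Python as follows. This is a minimal sketch, not a production implementation: the function names build_archive and zip_gcs_prefix are hypothetical, and the GCS part assumes the google-cloud-storage client library is installed and credentials are configured. The archive is written with ZIP_STORED (no compression) because the images are already compressed, and it is created outside the download directory so that it does not include itself.

```python
import os
import tempfile
import zipfile


def build_archive(source_dir, archive_path):
    """Pack every file under source_dir into one uncompressed ZIP."""
    names = []
    # ZIP_STORED skips compression: JPEGs are already compressed.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, source_dir)
                archive.write(full_path, arcname)
                names.append(arcname)
    return names


def zip_gcs_prefix(bucket_name, prefix, archive_blob_name):
    """Download a prefix, archive it, upload the archive, clean up."""
    # Imported here so build_archive() works without the GCS client installed.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    with tempfile.TemporaryDirectory() as work_dir:
        download_dir = os.path.join(work_dir, "files")
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            if blob.name.endswith("/"):
                continue  # skip zero-byte "folder" placeholder objects
            target = os.path.join(
                download_dir, os.path.relpath(blob.name, prefix)
            )
            os.makedirs(os.path.dirname(target), exist_ok=True)
            blob.download_to_filename(target)
        archive_path = os.path.join(work_dir, "archive.zip")
        build_archive(download_dir, archive_path)
        bucket.blob(archive_blob_name).upload_from_filename(archive_path)
    # Leaving the 'with' block deletes the downloaded files and the archive.
```

A call such as zip_gcs_prefix("PROJECT_ID.appspot.com", "testing/samplefolder1a/", "testing/zippedfiles/fruits.zip") would then produce a single archive object. The local disk or memory needs to hold both the downloaded files and the archive at the same time.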
Now you have to choose where to run this app.
On Cloud Run, you are limited by the space available in memory (for now; new features are coming). Currently the limit is 8 GB of memory (soon 16 GB), so your app can process a total image size of about 45% of the memory capacity (45% for the images, 45% for the archive, 10% for the app's memory footprint). Set the concurrency parameter to 1.
If you need more space, you can use Compute Engine.
Set up a startup script that runs your app and automatically stops the VM at the end. The script reads the parameters from the metadata server and runs your app with them.
Before each run, update the Compute Engine metadata with the directory to process (and possibly other app parameters).
-> The drawback is that you can only run one process at a time. Otherwise you need to create a VM for each job and delete the VM, rather than stop it, at the end of the startup script.
An alternative is to use Cloud Build. Run a build with the parameters in the substitution variables and perform the job in Cloud Build. You are limited to 10 builds in parallel. Use the diskSizeGb build option to set the correct disk size according to your file size requirements.
The Dataflow template only compresses each file individually; it does not create an archive.
I have a service that uploads files during the day. The same file gets updated multiple times on different events (there is no deterministic way to know when it gets updated). At the same time there is a client that downloads the file. What happens if the file gets updated during the download? Does S3 still preserve the old version until all active processes using it are done (kind of like a filesystem)? Can the file be corrupted (part from the old version, part from the new)? Can the connection be closed abruptly in this case?
An object will only be created in Amazon S3 if the upload process completed fully. Partial files will not appear in Amazon S3.
Similarly, when overwriting an object in Amazon S3, the object will only be replaced if the new object was fully uploaded. The new object completely replaces the old object.
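Historically there could be a small delay between the upload completing and the new object appearing, because objects in Amazon S3 are replicated between multiple servers for durability. Since December 2020, Amazon S3 provides strong read-after-write consistency, so a read issued after a successful overwrite returns the new object.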
I've set up a localstack install based on the article How to fake AWS locally with LocalStack. I've tested copying a file up to the mocked S3 service and it works great.
I started looking for the test file I uploaded. I see there's an encoded version of the file I uploaded inside .localstack/data/s3_api_calls.json, but I can't find it anywhere else.
Given: DATA_DIR=/tmp/localstack/data I was expecting to find it there, but it's not.
It's not critical that I have access to it directly on the file system, but it would be nice.
My question is: Is there anywhere/way to see files that are uploaded to the localstack's mock S3 service?
After the latest update, now we have only one port which is 4566.
Yes, you can see your file.
Open http://localhost:4566/your-funny-bucket-name/you-weird-file-name in chrome.
You should be able to see the content of your file now.
I went back and re-read the original article which states:
"Once we start uploading, we won't see new files appear in this directory. Instead, our uploads will be recorded in this file (s3_api_calls.json) as raw data."
So, it appears there isn't a direct way to see the files.
However, the Commandeer app provides a view into localstack that includes a directory listing of the mocked S3 buckets. There isn't currently a way to see the contents of the files, but the directory structure is enough for what I'm doing. UPDATE: According to #WallMobile it's now possible to see the contents of files too.
You could use the following command (older localstack versions exposed S3 on port 4572; newer versions use the single port 4566):
aws --endpoint-url=http://localhost:4572 s3 ls s3://<your-bucket-name>
In order to list the exact folder in s3 bucket you could use this command:
aws --endpoint-url=http://localhost:4566 s3 ls s3://<bucket-name>/<folder-in-bucket>/
Image is saved as base64 encoded string in the file recorded_api_calls.json
I have passed DATA_DIR=/tmp/localstack/data
and the file is saved at /tmp/localstack/data/recorded_api_calls.json
Open the file and copy the data (the "d" field) from any API call that looks like this:
"a": "s3", "m": "PUT", "p": "/bucket-name/foo.png"
Then extract the file content by base64-decoding that data.
You can use this script to extract data from localstack s3
https://github.com/nkalra0123/extract-data-from-localstack-s3/blob/main/script.sh
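As an alternative to the shell script, the same extraction can be sketched in Python. This is a minimal sketch that assumes recorded_api_calls.json holds one JSON object per line with the "a", "m", "p" and "d" fields shown above; the function name extract_s3_puts is hypothetical, and the sample record is made up for the demonstration.

```python
import base64
import json


def extract_s3_puts(lines):
    """Yield (path, bytes) for each recorded S3 PUT that carries a payload."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        call = json.loads(line)
        # Bucket-creation PUTs have no payload, so they are skipped here.
        if call.get("a") == "s3" and call.get("m") == "PUT" and call.get("d"):
            yield call["p"], base64.b64decode(call["d"])


# Made-up record in the assumed format, standing in for a line of the file.
sample = [
    json.dumps({
        "a": "s3",
        "m": "PUT",
        "p": "/bucket-name/foo.txt",
        "d": base64.b64encode(b"hello").decode("ascii"),
    }),
]
recovered = dict(extract_s3_puts(sample))
```

To run it against a real data directory, open /tmp/localstack/data/recorded_api_calls.json and pass the file object to extract_s3_puts, then write each payload to a file named after its path.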
From my understanding, localstack saves the data in memory by default. This is what happens unless you specify a data directory. Obviously, if the data is in memory, you won't see any files anywhere.
To create a data directory you can run a command such as:
mkdir ~/.localstack
Then you have to instruct localstack to save the data at that location. You can do so by adding the DATA_DIR=... path and a volume in your docker-compose.yml file like so:
localstack:
  image: localstack/localstack:latest
  ports:
    - 4566:4566
    - 8055:8080
  environment:
    - SERVICES=s3
    - DATA_DIR=/tmp/localstack/data
    - DOCKER_HOST=unix:///var/run/docker.sock
  volumes:
    - /home/alexis/.localstack:/tmp/localstack
Then rebuild and start the container.
Once the localstack process has started, you'll see a JSON database under ~/.localstack/data/....
WARNING: if you are dealing with very large files (GB), then it is going to be DEAD SLOW. The issue is that all the data is saved inside that one JSON file in base64. In other words, it generates a file much bigger than what you sent, and re-reading it is also going to be very slow. It may be possible to fix this issue by setting the legacy storage mechanism to false:
LEGACY_PERSISTENCE=false
I have not tested that flag (yet).
I'm working on a Laravel 5.2 application where users can send a file by POST, the application stores that file in a certain location and retrieves it on demand later. I'm using Amazon Elastic Beanstalk. For local development on my machine, I would like the files to store in a specified local folder on my machine. And when I deploy to AWS-EB, I would like it to automatically switch over and store the files in S3 instead. So I don't want to hard code something like \Storage::disk('s3')->put(...) because that won't work locally.
What I'm trying to do here is similar to what I was able to do with environment variables for database connectivity. I was able to find some great tutorials where you create an .env.elasticbeanstalk file, create a config file at ~/.ebextensions/01envconfig.config to automatically replace the standard .env file on deployment, and modify a few lines of your database.php to automatically pull the appropriate variable.
How do I do something similar with file storage and retrieval?
Ok. Got it working. In /config/filesystems.php, I changed:
'default' => 'local',
to:
'default' => env('DEFAULT_STORAGE') ?: 'local',
In my .env.elasticbeanstalk file (see the original question for an explanation of what this is), I added the following (I'm leaving out my actual key and secret values):
DEFAULT_STORAGE=s3
S3_KEY=[insert your key here]
S3_SECRET=[insert your secret here]
S3_REGION=us-west-2
S3_BUCKET=cameraflock-clips-dev
Note that I had to specify my region as us-west-2 even though S3 shows my environment as Oregon.
In my upload controller, I don't specify a disk. Instead, I use:
\Storage::put($filePath, $filePointer, 'public');
This way, it always uses my "default" disk for the \Storage operation. If I'm in my local environment, that's my public folder. If I'm in AWS-EB, then my Elastic Beanstalk .env file goes into effect and \Storage defaults to S3 with appropriate credentials.