Is it possible to define a TTL on Riak CS? - riak-cs

I'm using Riak CS (Cloud Storage) to store files and I want them to expire using a TTL. I'm OK with defining the same TTL value for all the files, e.g. 1 week.
From what I've understood, Riak CS uses 2 backends:
bitcask for binary data
leveldb for metadata
I know bitcask supports defining a TTL, meaning binary data will be cleaned on a regular basis.
Is it possible to achieve the same with leveldb, i.e. for metadata?

Unfortunately, LevelDB has no such TTL feature, so this won't work. If you want object names to disappear from a bucket listing, the S3 lifecycle API is the suitable interface, but it is not yet implemented.

Related

what does it mean "partitioned data" - S3

I want to use Netflix's outputCommitter (Using Spark with Amazon EMR).
In the README there are 2 options:
S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.
I tried to understand the difference but could not. Can someone explain what "partitioned data" means in S3?
According to the Hadoop docs, "This committer an extension of the “Directory” committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive’s partitioning strategy: different levels of the tree represent different columns."
Search the Hadoop docs for the full details.
Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since their work is a reimplementation of the Netflix work, they should do the same thing here.
I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files among directories to reduce the amount of data that needs to be read from disk.
For example:
/data/month=1/
/data/month=2/
/data/month=3/
...
If a Hive-type query is run against the data with a clause like WHERE month=1, then it would only need to look in the month=1/ subdirectory, thereby skipping two thirds of the disk access in this three-directory example.
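To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not how Hive or Spark implement partition pruning; it only shows that the `column=value` directory names carry enough information to skip files without opening them. The file names and the helper names `parse_partitions` and `prune` are hypothetical.

```python
from typing import Dict, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract Hive-style `column=value` pairs from the directory part of a key."""
    parts = {}
    for segment in key.strip("/").split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            column, value = segment.split("=", 1)
            parts[column] = value
    return parts


def prune(keys: List[str], **wanted: str) -> List[str]:
    """Keep only keys whose partition values match every requested column."""
    return [
        key for key in keys
        if all(parse_partitions(key).get(col) == val for col, val in wanted.items())
    ]


# Hypothetical object keys following the layout above.
keys = [
    "data/month=1/part-0000.orc",
    "data/month=2/part-0000.orc",
    "data/month=3/part-0000.orc",
]
selected = prune(keys, month="1")
print(selected)
```

Only the keys under month=1/ survive the filter, so a reader would never touch the other two directories.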

How can we change compression policy on timescaledb

I am not able to use indexes on compressed chunks, so I need to reconfigure the compression policy so that at least 1 month of data stays in uncompressed (hot) chunks instead of 7 days.
The easiest way to do this is to simply remove the current compression policy and create a new one with a new interval. Removing the policy won't decompress any chunks.
https://docs.timescale.com/api/latest/compression/remove_compression_policy/
That aside, you can see all policies (compression, continuous aggregates, data retention, etc.) through the Jobs view, including the configuration for each job. Technically you can call alter_job and update the configuration (stored as JSONB), but in most cases it's easier to just remove the policy and set up a new one with the appropriate add... API.
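As a rough illustration of the remove-then-add approach, the sketch below builds the two SQL statements from Python. The hypertable name `metrics` and the helper `compression_policy_swap` are assumptions for the example. The helper uses plain string formatting, so it is only suitable for trusted, hard-coded input, and the statements still have to be run through whatever database client you use.

```python
from typing import List


def compression_policy_swap(hypertable: str, compress_after: str) -> List[str]:
    """Return the SQL statements that replace a hypertable's compression policy."""
    return [
        # Dropping the policy does not decompress already-compressed chunks.
        f"SELECT remove_compression_policy('{hypertable}');",
        # New policy: compress chunks only once they are older than the interval.
        f"SELECT add_compression_policy('{hypertable}', INTERVAL '{compress_after}');",
    ]


# 'metrics' is a hypothetical hypertable name.
statements = compression_policy_swap("metrics", "1 month")
for statement in statements:
    print(statement)
```

Chunks that were already compressed under the old 7-day policy stay compressed; if they need to be indexed again, they would have to be decompressed separately.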

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful for making them quickly searchable. The natural, most obvious way is to store additional data in the object metadata and use a Lambda function to write it to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add the object name and some information not contained in the file to a database, and this data exceeds 2KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails after the first succeeds, one can retry until success. What happens if you perform a PUT of an identical object on S3, and you have versioning activated? Will S3 increase the version? In this case, implementing idempotency requires that only a single writer be active at any time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
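The first option can be sketched as follows. This is only an illustration under loud assumptions: a plain dict stands in for the bucket, the hypothetical `FlakyIndex` class stands in for the database and simulates transient failures, and `upload_with_index` is an invented helper name. The dict does not model S3 versioning, where every PUT to a versioning-enabled bucket creates a new object version, so the single-writer concern raised above still applies.

```python
import hashlib


class FlakyIndex:
    """Hypothetical stand-in for the database; fails the first `failures` writes."""

    def __init__(self, failures: int) -> None:
        self.rows = {}
        self.failures = failures

    def upsert(self, key: str, row: dict) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated index outage")
        self.rows[key] = row  # same key and row always give the same final state


def upload_with_index(bucket: dict, index: FlakyIndex, key: str,
                      body: bytes, extra: dict, max_attempts: int = 5) -> int:
    """Write the object, then retry the index write; returns the attempts used."""
    digest = hashlib.sha256(body).hexdigest()
    bucket[key] = body  # step 1: overwriting with identical bytes is harmless here
    row = {"sha256": digest, **extra}
    for attempt in range(1, max_attempts + 1):
        try:
            index.upsert(key, row)  # step 2: idempotent upsert keyed by object key
            return attempt
        except ConnectionError:
            continue  # safe to retry because the upsert is idempotent
    raise RuntimeError(f"index write for {key!r} failed after {max_attempts} attempts")


bucket = {}
index = FlakyIndex(failures=2)
attempts = upload_with_index(bucket, index, "reports/a.pdf", b"hello", {"owner": "alice"})
print(attempts, index.rows["reports/a.pdf"]["owner"])
```

Storing a content hash in the index row is one way to detect later that the bucket and the index have drifted apart; a real system would also need a policy for what to do when the retries are exhausted.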

How to properly (scale-ably) read many ORC files into spark

I'd like to use EMR and Spark to process an AWS S3 inventory report generated in ORC format that has many ORC files (hundreds) and the total size of all the data is around 250GB.
Is there a specific or best practice way to read all the files into one Dataset? It seems like I can pass the sqlContext.read().orc() method a list of files, but I wasn't sure if this would scale/parallelize properly if I pass it a large list of hundreds of files.
What is the best practice way of doing this? Ultimately my goal is to have the contents of all the files in one dataset so that I can run a sql query on the dataset and then call .map on the results for subsequent processing on that result set.
Thanks in advance for your suggestions.
Just specify the folder where your ORC files are located. Spark will automatically detect all of them and load them into a single DataFrame.
sparkSession.read.orc("s3://bucket/path/to/folder/with/orc/files")
You shouldn't need to worry much about scalability, since Spark handles it based on the default config that EMR provides for the selected EC2 instance type. You can experiment with the number of slave nodes and their instance type, though.
Besides that, I would suggest setting maximizeResourceAllocation to true so that executors use the maximum resources available on each slave node.
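For reference, maximizeResourceAllocation is set through the `spark` configuration classification when the cluster is created. The sketch below only builds that JSON fragment in Python; how you pass it to EMR (console, CLI, or SDK) is up to you, and the variable names are just for illustration.

```python
import json

# EMR configuration entry: the "spark" classification with the
# maximizeResourceAllocation property turned on.
spark_configuration = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]

configurations_json = json.dumps(spark_configuration, indent=2)
print(configurations_json)
```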

Is there a way to query S3 object key names for the latest per prefix?

In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix should come in with varying frequency, but might not. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
I think this is what you are looking for:
The variable name is $path, and you can use a regexp to extract the pattern you are querying for...
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')
The S3 API itself does not seem to offer anything usable for that.
However, perhaps another service, like Athena, could do it?
Yes. At the moment, there is no direct way of doing it with S3 alone. Athena will still go through the files to query their content, but it is easier to use thanks to its standard SQL support, and it is faster since the queries run in parallel.
So far I have only found it capable of searching within objects, but
all I care about are their key names.
Both Athena and S3 Select query object contents, not keys.
The best approach I can recommend is to use Amazon DynamoDB to keep the metadata of the files, including file names, for faster querying.
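A minimal sketch of such an index, assuming the "prefix-number" key layout from the question: a plain dict stands in for the DynamoDB table, and the helper names `split_key` and `latest_per_prefix` are hypothetical. In practice the same update would run once per object-created event rather than over a full listing, and deletions of the current latest object would need separate handling.

```python
from typing import Dict, Iterable, Tuple


def split_key(key: str) -> Tuple[str, int]:
    """Split 'A-0003' into ('A', 3); assumes the 'prefix-number' layout."""
    prefix, _, number = key.rpartition("-")
    return prefix, int(number)


def latest_per_prefix(keys: Iterable[str]) -> Dict[str, str]:
    """Return the key with the highest number for each prefix."""
    latest: Dict[str, str] = {}
    for key in keys:
        prefix, number = split_key(key)
        current = latest.get(prefix)
        # Replace the stored key only when the new one has a higher number.
        if current is None or number > split_key(current)[1]:
            latest[prefix] = key
    return latest


keys = ["A-0001", "A-0002", "A-0003", "B-0001", "B-0002",
        "C-0001", "C-0002", "C-0003", "C-0004", "C-0005"]
index = latest_per_prefix(keys)
print(sorted(index.values()))
```

With the index kept up to date, answering "what is the latest key per prefix" becomes a lookup by prefix instead of a listing of the whole bucket.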