Can't loop over a Google page HTTPIterator object twice? - google-cloud-platform

I have what I hope is an easy question. I am using the Google Storage Client library to loop over the blobs in a bucket. After I get the list of blobs in the bucket, I am unable to loop over the results a second time unless I re-run the command that lists the bucket.
I read the documentation on page iterators, but I still don't quite understand why this sort of thing couldn't just be stored in memory like a normal Python variable. Why is a ValueError thrown when I try to loop over the object again? Does anyone have suggestions on how to interact with this data better?

For many sources of data, the set of returned items can be huge. While you may only have dozens or hundreds of objects in your bucket, there is nothing to prevent you from having millions (or billions) of objects. If you list a bucket, it would make no sense to return a million entries and hope to maintain their state in memory. Instead, Google says you should "page" or "iterate" through them. Each time you ask for a new page, you get the next set of data and are presumed to have released your reference to the previous set ... so you only ever hold one page of data at a time on your client.
It is the back-end server that maintains your "window" into the data being returned. All you need to do is say "give me more data ... here is my context" and the next chunk of data is returned.
If you want to walk through your data twice, then I would suggest asking for a second iteration. Be careful, though: the result of the first iteration may not be the same as the second. If new files are added or old ones removed, the results will differ from one iteration to the next.
If you really believe you can hold the results in memory, then as you execute your first iteration, save the results, appending new values as you page through them. This may work for specific use cases, but realize that you are likely setting yourself up for trouble if the number of items grows too large.
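To make that concrete, here is a minimal sketch with the Python google-cloud-storage client ("my-bucket" is a placeholder); it shows both options, re-running the listing for a second pass and materializing the results into a list when the bucket is known to be small:
from google.cloud import storage

client = storage.Client()

# Option 1: each call to list_blobs() returns a fresh iterator backed by paged
# API calls, so the simplest way to "loop twice" is to list twice.
for blob in client.list_blobs("my-bucket"):
    print(blob.name)
for blob in client.list_blobs("my-bucket"):   # second, independent listing
    print(blob.size)

# Option 2: if the bucket is known to be small, materialize the results once.
blobs = list(client.list_blobs("my-bucket"))
for blob in blobs:
    print(blob.name)
for blob in blobs:                            # reuses the in-memory list
    print(blob.size)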

Related

AWS console: how to list S3 bucket content ordered by last modified date

I am writing files to an S3 bucket. How can I see the newly added files? In the console the files are not ordered by the Last modified field, and I can't find a way to sort on that field or any other field.
You cannot sort on that; it is just how the UI works.
The main reason is that for buckets with 1000+ objects, the UI only "knows" about the 1000 elements displayed on the current page. Sorting those would be misleading: it would appear to show you the newest or oldest 1000 objects in the bucket, but it would in fact only order the 1000 objects currently displayed. That would really confuse people, so it is better not to let the user sort at all than to sort incorrectly.
Showing the actual 1000 newest or oldest objects requires listing everything in the bucket, which takes time (minutes or hours for larger buckets), requires many backend requests, and incurs extra cost since List requests are billed. If you want to retrieve the 1000 newest or oldest objects, you need to write code that performs a full listing of the bucket or prefix, orders all the objects, and then displays part of the result.
If you can sufficiently decrease the number of displayed objects with the "Find objects by prefix" field, the sort options become available and meaningful.
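If you do need the newest objects, the full-listing approach is straightforward to script. Here is a minimal sketch using boto3 in Python (bucket name and prefix are placeholders); it lists every object under a prefix and sorts client-side by LastModified:
import boto3

s3 = boto3.client("s3")

# Full listing (paginated, 1000 keys per page), then sort client-side.
paginator = s3.get_paginator("list_objects_v2")
objects = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/"):
    objects.extend(page.get("Contents", []))

newest_first = sorted(objects, key=lambda o: o["LastModified"], reverse=True)
for obj in newest_first[:10]:                 # ten most recently modified keys
    print(obj["LastModified"], obj["Key"])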

Find the size of S3 folders using PHP SDK

I am using an S3 bucket, but I can't find a way to retrieve the size of a specific folder inside my bucket.
The scenario is:
I have a folder for every user on my website (/user1, /user2, ...), and each one has a limited amount of space (1 GB per folder). I need to show on my website the space they still have, along the lines of:
can he add = consumed space (what I'm looking for) + new file size
to determine whether the user still has enough space to upload a new file.
I'm aware that you can loop over the object list, but that isn't ideal for me because of the number and size of the documents.
Any new or direct solution is welcome.
There is no 'quick' way to obtain the amount of storage used in a particular 'folder'.
The correct way would be to call ListObjects(Prefix='folder/',...), iterate through the objects returned and sum the size of each object. Please note that each call returns a maximum of 1000 objects, so the code might need to make repeated calls to ListObjects.
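The question mentions the PHP SDK, but the approach is the same in any SDK; here is a minimal sketch in Python with boto3 (bucket name and prefix are placeholders), paging through ListObjectsV2 and summing the object sizes:
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

def folder_size_bytes(bucket, prefix):
    # Sum the sizes of every object under the prefix; the paginator handles
    # the 1000-object-per-call limit automatically.
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

used = folder_size_bytes("my-bucket", "user1/")
print(f"user1 has used {used / 1024**3:.2f} GB of the 1 GB quota")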
If this method is too slow, you could maintain a database of all objects and their sizes and query the database when the app needs to determine the size. Use Amazon S3 Events to trigger an AWS Lambda function when objects are created/deleted to keep the database up-to-date. This is a rather complex method, so I would suggest the first method unless there is a specific reason why it is not feasible.

C++ and SQLite DELETE query doesn't actually delete the value from the database file

I've come across this issue with SQLite and C++ and I can't find any answer to it.
Everything is working fine in SQLite and C++ (all queries, all outputs, all functions), but I have this one question I can't find a solution to.
I create a database MyTest.db.
I create a table test with an id and a name as fields.
I insert two rows: id=1 name=Name1 and id=2 name=Name2.
I delete the 2nd row.
The data inside the table now shows that I have only id=1 with name=Name1.
When I open MyTest.db with notepad.exe, the values I deleted (id=2, name=Name2) are still inside the database file, even though they no longer appear in the query results for the table.
What I would like to ask anyone who knows about this is:
Is there some other step needed so that the value is also removed from the database file, or is it my mistake with the DELETE statement in SQLite (which I doubt)?
It is as if the database file keeps collecting all the trash inside it without removing DELETED values from its tables...
Any help or suggestion is much appreciated.
If you use "PRAGMA secure_delete=ON;" then SQLite overwrites deleted content with zeros. See https://www.sqlite.org/pragma.html#pragma_secure_delete
Even with secure_delete=OFF, the deleted space will be reused (and overwritten) to store new content the next time you INSERT. SQLite does not leak disk space, nor is it necessary to VACUUM in order to reclaim space. But normally, deleted content is not overwritten as that uses extra CPU cycles and disk I/O.
Basically, all databases only mark rows as active or inactive; they don't delete the actual data from the file immediately. That would be a huge waste of time and resources, since that part of the file can simply be reused.
Since your queries show that the row no longer appears in the results, is this actually a problem? You can always run a VACUUM on the database if you want to reclaim the space, but I would just let the database engine handle everything by itself. It won't "keep collecting all the trash inside it", don't worry.
If you see that the file size keeps growing and the space is not reused, then you can issue a VACUUM from time to time.
You can also test this by inserting new rows after deleting old ones. The engine should reuse those parts of the file at some point.
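To make the two suggestions above concrete, here is a minimal sketch using Python's built-in sqlite3 module with the table and file names from the question; in C++ the same PRAGMA, DELETE and VACUUM statements can be run through sqlite3_exec:
import sqlite3

con = sqlite3.connect("MyTest.db")

# Optional: from now on, overwrite deleted content with zeros on disk.
con.execute("PRAGMA secure_delete = ON;")

con.execute("DELETE FROM test WHERE id = 2;")
con.commit()

# Rebuild the database file and reclaim freed pages (shrinks the file).
con.execute("VACUUM;")
con.close()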

Random file access of std::map data in C++

I have a std::map data structure (key:data pairs) which I need to store in a binary file.
The key is an unsigned short value and is not sequential.
The data is another big structure, but of fixed size.
This map is managed based on user actions (add, modify, or delete), and I have to keep the file updated every time I update the map, so that it survives a system crash.
Adding can always be done at the end of the file, but the user can modify or delete any of the existing records.
That means I have to randomly access the file to update the modified/deleted record.
My questions are:
Is there a way I can reach the modified record in the file directly, without sequentially searching through all the records? (Max record size is 5000.)
On a delete, how do I remove the record from the file and move the next record into the deleted record's position?
Appreciate your help!
Assuming you have no need for the tree structure of std::map and you just need an associative container, the most common way I've seen to do this is to use two files: one with the keys and one with the data. The key file contains all of the keys along with the corresponding offset of their data in the data file. Since you said the data is all the same size, updating is easy to do (it won't change any of the offsets). Adding is done by appending. Deleting is the only hard part: you can delete the key to remove it from the database, but it's up to you whether you want to keep track of "freed" data sections and try to write over them. To keep track of the keys, you might want another associative container (map or unordered_map) in memory holding the location of each key in the key file.
Edit: For example, the key file might be (note that offsets are in bytes)
key1:0
key2:5
and the corresponding data file would be
data1data2
This is a pretty tried-and-true pattern, used in everything from Hadoop to high-speed local databases. To get an idea of the persistence complications you might run into, I would highly recommend reading this Redis blog post; it taught me a lot about persistence when I was dealing with similar issues.
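The question is about C++, but the pattern is language-agnostic, so here is a minimal sketch in Python of a slight variation that embeds the key in each fixed-size record and rebuilds the in-memory index by scanning at startup (file name, record format, and payload size are assumptions; a C++ version would use fstream with seekp/seekg in the same way):
import os
import struct

DATA_FILE = "records.bin"     # hypothetical file name
RECORD_FMT = "<H64s"          # unsigned short key + 64-byte fixed-size payload (assumed size)
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def load_index(f):
    # Rebuild the key -> byte-offset index by scanning the file once at startup.
    index = {}
    f.seek(0)
    offset = 0
    while True:
        chunk = f.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE:
            break
        key, _ = struct.unpack(RECORD_FMT, chunk)
        index[key] = offset
        offset += RECORD_SIZE
    return index

def put(f, index, key, payload):
    # Add a new record at the end, or overwrite an existing one in place.
    offset = index.get(key)
    if offset is None:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        index[key] = offset
    f.seek(offset)
    f.write(struct.pack(RECORD_FMT, key, payload))   # struct pads the payload to 64 bytes
    f.flush()
    os.fsync(f.fileno())      # make the update durable for the crash scenario

mode = "r+b" if os.path.exists(DATA_FILE) else "w+b"
with open(DATA_FILE, mode) as f:
    index = load_index(f)
    put(f, index, 1, b"first record")
    put(f, index, 2, b"second record")
    put(f, index, 1, b"first record, updated")       # same offset, overwritten in place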

Do the cursors used in pagination have a specific lifespan or timeout?

In the documentation on pagination there is a section about using cursors to check for new content. This implies that you can store the cursor and come back later to see if something new has appeared. Do the cursors time out at some point or have a specific lifespan? If I get a cursor while paging through the comments on a post, will that cursor still be valid after an hour, a day, or even a week?
According to the documentation, it will always be valid.
http://developers.facebook.com/docs/reference/api/pagination/
"Cursor pagination is our preferred method of paging, and the list of Graph API endpoints which support it are growing. A cursor refers to a random string of characters which mark a specific point in a list of data. You can reliably assume that a cursor will always point to that same part of the list and use it to page through the data.
If you are using additional filters on the API endpoint when you receive a cursor, this cursor will only work for calls with those filters."
Update
As pointed out by Scutterman, these cursors also have a lifespan. You should discard them after at most a day.
"Pagination markers used in these methods should not be used in long-lived applications as a way to jump back into a stream. For example, storing the cursors in a database and then re-using them a few days later may return stale, incorrect, or no data at all. Make sure that your cursors are relatively fresh - less than a day at most."
Update:
At some point, the documentation added this note:
Don't store cursors. Cursors can quickly become invalid if items are added or deleted.
There is a note in the documentation:
Cursors can't be stored for long periods of time and expected to still work. They are meant to paginate through the result set in a short period of time.
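In practice this means cursors should be consumed as you page rather than persisted. A minimal sketch of cursor-based paging in Python (the endpoint path, API version, and access token are placeholders):
import requests

url = "https://graph.facebook.com/v19.0/POST_ID/comments"   # placeholder post ID and version
params = {"access_token": "ACCESS_TOKEN", "limit": 25}

while True:
    payload = requests.get(url, params=params).json()
    for comment in payload.get("data", []):
        print(comment.get("id"))
    after = payload.get("paging", {}).get("cursors", {}).get("after")
    if not after:
        break
    params["after"] = after   # use the cursor right away; don't store it for later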