gsutil / gcloud storage list files by limits and pagination - google-cloud-platform

Is there any way to list files from a GCS bucket with a limit?
Say I have 2k objects in my bucket, but when I do gsutil ls I only want the first 5 objects, not all of them.
How can I achieve this?
Also, is there any pagination available?
gsutil ls gs://my-bucket/test_file_03102021* 2>/dev/null | grep -i ".txt$" || :

From looking at gsutil help ls, gsutil doesn't currently have an option to limit the number of items returned from an ls call.
While you could pipe the results to something like awk to get only the first 5 items, that would be pretty wasteful if you have lots of objects in your bucket (since gsutil would continue making paginated HTTP calls until it listed all N of your objects).
If you need to do this routinely on a bucket with lots of objects, you're better off writing a short script that uses one of the GCS client libraries. As an example, check out the google-cloud-storage Python library -- specifically, see the list_blobs method, which accepts a max_results parameter.
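For example, a minimal sketch with the google-cloud-storage Python library (the bucket name is a placeholder):

from google.cloud import storage

# List only the first 5 objects; max_results stops the listing after that.
client = storage.Client()
for blob in client.list_blobs("my-bucket", max_results=5):
    print(blob.name)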

Pagination is available when you use the API directly. If you want only the first 5 objects and you use gsutil, you have to wait for the full listing of hundreds (thousands, millions, ...) of files before getting only the first 5.
If you use the API, you can do this:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://storage.googleapis.com/storage/v1/b/<BUCKET_NAME>/o?alt=json&&maxResults=5" \
| jq .items[].name
Of course, you can change the maxResults value.
You can also include a prefix when you filter. More detail is in the API documentation.
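For the pagination itself, the objects.list call returns a nextPageToken that you pass back as pageToken on the next request. A rough Python sketch of the same call as the curl example above (the bucket placeholder and the prefix value are illustrative):

import subprocess
import requests

# Same access token the curl example uses.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

url = "https://storage.googleapis.com/storage/v1/b/<BUCKET_NAME>/o"
params = {"maxResults": 5, "prefix": "test_file_"}  # prefix is optional

while True:
    page = requests.get(
        url, params=params, headers={"Authorization": f"Bearer {token}"}
    ).json()
    for item in page.get("items", []):
        print(item["name"])
    # nextPageToken is only present while there are more results to fetch.
    if "nextPageToken" not in page:
        break
    params["pageToken"] = page["nextPageToken"]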

Related

How can I search unknown folders in an S3 bucket? I have millions of objects in my bucket and only want the folder list

I have a bucket with 3 million objects. I don't even know how many folders are in my S3 bucket, and I don't even know their names. I want to show only the list of folders in AWS S3. Is there any way to get a list of all folders?
I would use the AWS CLI for this. To get started, have a look here.
Then it is a matter of almost-standard Linux commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command just reads what the aws s3 ls command returned and keeps entries ending in /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not wrong) is a Unix-only utility command, but I believe you can achieve this on Windows as well.
Note 2: depending on the number of files, this operation might (will) take a while.
Note 3: in systems like Amazon S3, the term folder usually exists only to give users visual similarity with standard file systems; internally it is just treated as part of the key. You can see this in the (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
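If you do want to walk the 'folders' programmatically, a minimal boto3 sketch of that Delimiter/CommonPrefixes approach could look like this (the bucket name is a placeholder); it lists only the top level, so you would have to recurse into each prefix to build the full hierarchy:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Delimiter='/' makes S3 group keys under their first path segment and
# return those groups as CommonPrefixes instead of individual objects.
for page in paginator.paginate(Bucket="my-bucket", Delimiter="/"):
    for prefix in page.get("CommonPrefixes", []):
        print(prefix["Prefix"])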
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.
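Once you have an inventory report, extracting the 'folders' is a small scripting job. A sketch, assuming the default CSV-format inventory where the bucket name and object key are the first two columns (the file name below is hypothetical):

import csv
import gzip
from urllib.parse import unquote_plus

INVENTORY_FILE = "inventory-data-000.csv.gz"  # one of the gzipped inventory data files

folders = set()
with gzip.open(INVENTORY_FILE, mode="rt", newline="") as f:
    for row in csv.reader(f):
        key = unquote_plus(row[1])  # keys in inventory reports are URL-encoded
        if "/" in key:
            # Everything up to the last '/' is the "folder" part of the key.
            folders.add(key.rsplit("/", 1)[0] + "/")

for folder in sorted(folders):
    print(folder)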

How to set metadata on objects created at a certain time?

I want to set metadata on every object whose creation date is 12 o'clock tonight. For now I can only set metadata for all the objects that are already in a bucket, with the command below:
gsutil -m setmeta -h "Content-Type:application/pdf" -h "Content-disposition: inline" gs://mystorage/pdf/*.pdf
My plan is to set metadata on all new objects by running a gsutil command at midnight automatically, because I already have a command which uploads all files from my server to Google Storage every midnight. The only problem is that I don't know which files are new.
I know that we can use a Google Cloud trigger, but I just want to use a gsutil command if it's possible.
I think there is no feature that gsutil or the GCS API provides to set metadata for objects based on timestamp.
According to the linked documentation, at upload time you can specify one or more metadata properties to associate with objects.
As you mentioned, you already have a command which uploads all files from your server to Google Storage every midnight, so you can set the metadata while uploading the objects. The command may look like this in your case:
gsutil -m -h "Content-Type:application/pdf" -h "Content-Disposition:inline" cp -r images gs://bucket/images
Alternatively, you can list the objects based on their timestamp and store the output in a file, then iterate through each line of that file and run your setmeta command on those objects.
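A rough sketch of that second option using the google-cloud-storage Python client instead of a gsutil loop (the bucket name, prefix, and the 24-hour cut-off are assumptions):

from datetime import datetime, timedelta, timezone
from google.cloud import storage

BUCKET = "mystorage"
cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix="pdf/"):
    # time_created is populated by the listing, so no extra API call is needed here.
    if blob.time_created < cutoff:
        continue
    blob.content_type = "application/pdf"
    blob.content_disposition = "inline"
    blob.patch()  # sends only the changed metadata fields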
Or, you can use Pub/Sub notifications for Cloud Storage, and subscribe to the new objects event OBJECT_FINALIZE.
Some sample code showing this can be found here.
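A hedged sketch of that Pub/Sub approach with the Python clients (the project, subscription, and metadata values are assumptions; the subscription must be attached to a Cloud Storage notification configuration on the bucket):

from google.cloud import pubsub_v1, storage

SUBSCRIPTION = "projects/my-project/subscriptions/gcs-new-objects"  # hypothetical

storage_client = storage.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    # Cloud Storage notifications carry the event type and object location as attributes.
    if message.attributes.get("eventType") == "OBJECT_FINALIZE":
        blob = storage_client.bucket(message.attributes["bucketId"]).blob(
            message.attributes["objectId"]
        )
        blob.content_type = "application/pdf"
        blob.content_disposition = "inline"
        blob.patch()
    message.ack()

# Blocks and processes new-object events as they arrive.
subscriber.subscribe(SUBSCRIPTION, callback=callback).result()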

How is the AWS SSO url generated when you access the management console?

When you log in via SSO in the browser, if you open one of your accounts and then assume a role, a new tab is opened after you click on "Management console". The URL of that link looks something like https://my-sso-portal.awsapps.com/start/#/saml/custom/my-account-name/base-64-string
If you decode that base-64 string in the URL, you can see there are 3 numbers with this structure: number1_ins-number2_p-number3. The first number is your AWS Organization number, the second one identifies the account, and the third one the assumed role.
Even though I figured out the structure of this string, I still have no idea whether it is possible for a user to get the second and third numbers (without using the URL, of course). I basically want to construct that URL programmatically, but it looks like those two numbers are IDs that AWS keeps for itself; I'm not sure though. Does anyone know more about this?
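To make the structure concrete, here is a small round-trip decode of that last path segment in Python (the IDs are made up):

import base64
from urllib.parse import unquote

# Made-up IDs encoded the way the console link encodes them.
segment = base64.b64encode(b"111111111111_ins-2222222222222222_p-3333333333333333").decode()

decoded = base64.b64decode(unquote(segment)).decode()
org_id, instance_id, profile_id = decoded.split("_")
print(org_id, instance_id, profile_id)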
You can use AWS SSO APIs to get this information. I haven't found documentation for them, but the SSO user portal uses them.
The first one (the organization id) can be retrieved using GET https://portal.sso.<SSO_REGION>.amazonaws.com/token/whoAmI. Search for the 'accountId' property in the response.
The second one (the app instance id) can be retrieved using GET https://portal.sso.<SSO_REGION>.amazonaws.com/instance/appinstances. The response contains a list of instances and each contains an 'id' property.
The last one (the profile id) can be retrieved using GET https://portal.sso.<SSO_REGION>.amazonaws.com/instance/appinstance/<APP_INSTANCE_ID>/profiles. The response contains a list of profiles, each has the id property.
For each of these APIs you need the SSO token which can be retrieved by the CLI/SDK. The UI is inserting the token into two headers: 'x-amz-sso-bearer-token' and 'x-amz-sso_bearer_token'.
Taking Gal's answer and building on top of that:
Note that, as far as I can tell, the relevant values returned from these GET requests don't appear to change, so you can cache them locally for each user.
With command-line tools like base64, curl, and jq, and a quick assist from Perl for %-encoding, it can look like this, assuming you have your SSO token stored in TOKEN, the account number (not the alias) in account, the role name in role, and the portal base URL (https://portal.sso.<SSO_REGION>.amazonaws.com) in PORTALBASE:
ORGID=$(curl -s -H "x-amz-sso_bearer_token: $TOKEN" -H "x-amz-sso-bearer-token: $TOKEN" $PORTALBASE/token/whoAmI | jq -r .accountId)
AID=$(curl -s -H "x-amz-sso_bearer_token: $TOKEN" -H "x-amz-sso-bearer-token: $TOKEN" $PORTALBASE/instance/appinstances | jq -r '.result[]|select(.searchMetadata.AccountId=="'$account'").id')
ANAME=$(curl -s -H "x-amz-sso_bearer_token: $TOKEN" -H "x-amz-sso-bearer-token: $TOKEN" $PORTALBASE/instance/appinstances | jq -r '.result[]|select(.searchMetadata.AccountId=="'$account'").name')
RID=$(curl -s -H "x-amz-sso_bearer_token: $TOKEN" -H "x-amz-sso-bearer-token: $TOKEN" $PORTALBASE/instance/appinstance/$AID/profiles | jq -r '.result[]|select(.name=="'$role'").id')
Combine those parts and base64-encode that:
LINK=$(echo -n "${ORGID}_${AID}_${RID}" | base64)
%-encode the account "name" and the link component:
ANAME=$(echo -n "$ANAME" | perl -lne 's/([^a-zA-Z0-9-])/sprintf("%%%02X", ord($1))/ge;print')
LINK=$(echo -n "$LINK" | perl -lne 's/([^a-zA-Z0-9-])/sprintf("%%%02X", ord($1))/ge;print')
And putting it all together, where AWS_SSO_START_URL is something along the lines of https://my-sso-portal.awsapps.com/start/#/:
URL="${AWS_SSO_START_URL}saml/custom/${ANAME}/${LINK}"

Google storage count number of objects in a bucket

I'm looking for a fast way to get the number of archives in a bucket. Right now I'm doing something like this:
gsutil ls -r gs://my_bucket/ | grep tar.gz | wc -l
But it's incredibly slow.
The fastest way would be using either Google Cloud Monitoring [1] and watching the count-of-objects metric, or enabling bucket logging [2] and looking at the storage logs.
These two methods are particularly useful when your bucket contains a very large number of objects and listing them with the API takes too long.
Please note, however, that neither [1] nor [2] shows up-to-the-minute information; they are often refreshed only once in 24 hours. Still, sometimes this is the only way.
[1]https://cloud.google.com/monitoring/support/available-metrics
[2]https://cloud.google.com/storage/docs/access-logs
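If you want to read that metric programmatically rather than in the console, a rough sketch with the google-cloud-monitoring Python client might look like this (the project and bucket names are placeholders, and the storage.googleapis.com/storage/object_count metric is only written about once a day):

import time
from google.cloud import monitoring_v3

PROJECT = "my-project"
BUCKET = "my_bucket"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
# Look back two days so at least one daily data point is included.
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 2 * 24 * 3600}}
)

series = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": (
            'metric.type="storage.googleapis.com/storage/object_count" '
            f'AND resource.labels.bucket_name="{BUCKET}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    # Points are returned newest first; take the most recent object count.
    print(ts.points[0].value.int64_value)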

What is the optimised way to get a count of keys in a riak bucket?

I have a Riak cluster set up with 3 servers. I can look at the Bitcask data to establish how much disk space this cluster is currently using, but I'd also like to find out how many items are currently being stored in the cluster.
The cluster is being used to store images, meaning that binary data is being stored against a key in a set of buckets. I have tried to use MapReduce functions against the HTTP interface in order to return the number of items in the bucket; however, they have timed out.
What is the most time optimised way to get the count of the number of keys from a specific bucket?
Counting the number of keys in a bucket on the Riak cluster is not very efficient, even with the use of the MapReduce functions.
The most efficient way I have found to count the number of items is to do it on the client through the streaming API. The following example uses node-js to do this.
First install the riak-js client
npm install riak-js@latest
Then run the following on the command line to give you your count.
node -e "require('riak-js').getClient({ host: 'hostname', port: 8098 }).count('bucket');"
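A similar streaming count with the official riak Python client, if you prefer Python (the host and bucket names are placeholders):

import riak

client = riak.RiakClient(protocol="pbc", host="hostname", pb_port=8087)
bucket = client.bucket("images")

count = 0
# stream_keys() yields keys in batches as they arrive, so the whole key list
# never has to be materialised at once.
for keys in bucket.stream_keys():
    count += len(keys)
print(count)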
Here is what worked for me. Put it into the console; no further installs needed:
curl -XPOST http://localhost:8098/mapred -H 'Content-Type: application/json' -d '
{"inputs":"THE_BUKET",
"query":[{"map":{"language":"javascript",
"keep":false,
"source":"function(riakobj) {return [1]; }"}},
{"reduce":{"language":"javascript",
"keep":true,
"name":"Riak.reduceSum"}}]}'
There is also an open request on features.basho.com to make this easier (because, as bennettweb pointed out, it's not the most straightforward task).
http://features.basho.com/entries/20721603-efficiently-count-keys-in-a-bucket
Upvotes, comments, etc., are encouraged.
Mark
See http://docs.basho.com/riak/latest/dev/using/2i/, paragraph "Count Bucket Objects via $bucket Index":
curl -XPOST http://localhost:8098/mapred \
-H 'Content-Type: application/json' \
-d '{"inputs":{
"bucket":"mybucket",
"index":"$bucket",
"key":"mybucket"
},
"query":[{"reduce":{"language":"erlang",
"module":"riak_kv_mapreduce",
"function":"reduce_count_inputs",
"arg":{"reduce_phase_batch_size":1000}
}
}]
}'
Using a reduce over the $bucket index is faster than running a full MapReduce over the data.
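The same $bucket-index count can also be sent from Python with requests, reusing the query body from the curl call above (the host and bucket names are placeholders):

import requests

query = {
    "inputs": {"bucket": "mybucket", "index": "$bucket", "key": "mybucket"},
    "query": [
        {
            "reduce": {
                "language": "erlang",
                "module": "riak_kv_mapreduce",
                "function": "reduce_count_inputs",
                "arg": {"reduce_phase_batch_size": 1000},
            }
        }
    ],
}

# requests sets the Content-Type: application/json header when json= is used.
resp = requests.post("http://localhost:8098/mapred", json=query)
print(resp.json())  # e.g. [12345]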