AWS Glue delete all partitions - amazon-web-services

I defined several tables in AWS Glue.
Over the past few weeks, I've had various issues with the table definitions that I had to fix manually: I want to change column names, or types, or change the serialization lib. However, if I already have partitions created, repairing the table doesn't change them, so I have to delete all partitions manually and then repair.
Is there a simple way to do this? Delete all partitions from an AWS Glue table?
I'm using the aws glue batch-delete-partition CLI command, but its syntax is tricky, there is a limit on the number of partitions you can delete in one go, and the whole thing is cumbersome...

For now, I found this command-line solution, running aws glue batch-delete-partition iteratively for batches of 25 partitions using xargs
(here I am assuming there are at most 1000 partitions):
aws glue get-partitions --database-name=<my-database> --table-name=<my-table> | jq -cr '[ { Values: .Partitions[].Values } ]' > partitions.json
seq 0 25 1000 | xargs -I _ bash -c "cat partitions.json | jq -c '.[_:_+25]'" | while read X; do aws glue batch-delete-partition --database-name=<my-database> --table-name=<my-table> --partitions-to-delete="$X"; done
Hope it helps someone, but I'd prefer a more elegant solution

Using Python 3 with boto3 looks a little bit nicer. Albeit not by much :)
Unfortunately AWS doesn't provide a way to delete all partitions without batching the requests, 25 partitions at a time. Note that this first version will only delete the first page of partitions retrieved.
import boto3

glue_client = boto3.client("glue", "us-west-2")

def get_and_delete_partitions(database, table, batch=25):
    # Only fetches the first page of partitions (up to the API's page size).
    partitions = glue_client.get_partitions(
        DatabaseName=database,
        TableName=table)["Partitions"]

    for i in range(0, len(partitions), batch):
        # batch_delete_partition accepts at most 25 partitions per call,
        # identified by their Values.
        to_delete = [{"Values": p["Values"]} for p in partitions[i:i + batch]]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)
EDIT: To delete all partitions (beyond just the first page), using a paginator makes it look cleaner.
import boto3

glue_client = boto3.client("glue", "us-west-2")

def delete_partitions(database, table, partitions, batch=25):
    for i in range(0, len(partitions), batch):
        to_delete = [{"Values": p["Values"]} for p in partitions[i:i + batch]]
        glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete)

def get_and_delete_partitions(database, table):
    # Walk every page of partitions and delete them page by page.
    paginator = glue_client.get_paginator('get_partitions')
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        delete_partitions(database, table, page["Partitions"])
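For example, calling it on a hypothetical table (the database and table names are placeholders):
get_and_delete_partitions("my_database", "my_table")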

Here is a PowerShell version FWIW:
$database = 'your db name'
$table = 'your table name'
# Set the variables above
$batch_size = 25
Set-DefaultAWSRegion -Region eu-west-2
$partition_list = Get-GLUEPartitionList -DatabaseName $database -TableName $table
$selected_partitions = $partition_list
# Uncomment and edit predicate to select only certain partitions
# $selected_partitions = $partition_list | Where-Object {$_.Values[0] -gt '2020-07-20'}
$selected_values = $selected_partitions | Select-Object -Property Values
for ($i = 0; $i -lt $selected_values.Count; $i += $batch_size) {
    $chunk = $selected_values[$i..($i + $batch_size - 1)]
    Remove-GLUEPartitionBatch -DatabaseName $database -TableName $table -PartitionsToDelete $chunk -Force
}
# Now run `MSCK REPAIR TABLE db_name.table_name` to add the partitions again
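If you want to script that last step as well, here is a minimal sketch that kicks off MSCK REPAIR TABLE through Athena with boto3 (the database/table names and the results bucket are placeholders):
import boto3

athena = boto3.client("athena", "eu-west-2")

# Re-register the partitions after the cleanup by running MSCK REPAIR TABLE.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_database.my_table",
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)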

Related

Airflow to copy most recent file from GCS bucket to local

I want to copy the latest file from a GCS bucket to local using Airflow (Cloud Composer).
I was trying to use gsutil cp to get the latest file and load it into local Airflow, but I got this issue: CommandException: No URLs matched. If I check the XCom I am getting value='Objects'. Any suggestions?
download_file = BashOperator(
    task_id='download_file',
    bash_command="gsutil cp $(gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''') /home/airflow/gcs/dags",
    xcom_push=True
)
The gsutil command gsutil ls -l gs://<bucket_name> | sort -k 2 | tail -1 | awk '''{print $3}''' also returns the summary row with the total number of objects, total size, etc. Sorting by date and taking the last row therefore picks that summary row, and its third column is the word 'objects'. That's why you get 'objects' as the value, as in the sample output below:
TOTAL: 6 objects, 28227013 bytes (26.92 MiB)
Try this code to get the second-to-last row instead:
download_file = BashOperator(
    task_id='download_file',
    bash_command="gsutil cp $(gsutil ls -l gs://bucket_name | sort -k 2 | tail -2 | head -n1 | awk '''{print $3}''') /home/airflow/gcs/dags",
    xcom_push=True
)
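If you would rather avoid parsing gsutil output altogether, here is a rough sketch of the same idea with the google-cloud-storage client (typically available in Composer). The bucket name and destination directory are placeholders, and you would call this from a PythonOperator:

from google.cloud import storage

def copy_latest_file(bucket_name="bucket_name", dest_dir="/home/airflow/gcs/dags"):
    client = storage.Client()
    # List every object in the bucket and pick the most recently updated one.
    blobs = list(client.list_blobs(bucket_name))
    latest = max(blobs, key=lambda b: b.updated)
    latest.download_to_filename(f"{dest_dir}/{latest.name.split('/')[-1]}")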

Copy column of list to new column of same list in Sharepoint

I want to copy a whole column's values to a new column.
As a solution, I prepared a workflow:
SET FIELD TO VALUE, and made the workflow start when an item is updated.
But I have 16,000+ rows, and manually updating each one is not feasible.
I also tried using Microsoft Flow but with no success.
Could anyone please suggest a way to achieve this?
I would suggest PowerShell for such 'migration' work. Script from here; the script needs to be run on the SharePoint server.
Add-PSSnapin Microsoft.SharePoint.Powershell -ErrorAction SilentlyContinue
#Parameters
$SiteURL = "http://siteurl/"
$listName = "list"
$web = Get-SPweb $SiteURL
#Use the Display Names
$CopyFromColumnName = "Description" #column copy source
$CopyToColumnName = "Desc" #destination column
#Get the List
$list = $web.lists[$ListName]
#Get all Items
$Items = $list.Items
ForEach ($Item in $items)
{
    #Copy data from one column to another
    $item[$CopyToColumnName] = $item[$CopyFromColumnName]
    #Do a system update to avoid a new version and to keep the same metadata
    $item.SystemUpdate($false)
}
For SharePoint Online, refer to this thread and replace the iteration logic with paging:
$Query = New-Object Microsoft.SharePoint.Client.CamlQuery
$Query.ViewXml = "<View Scope='RecursiveAll'><Query><OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy></Query><RowLimit Paged='TRUE'>$BatchSize</RowLimit></View>"
$Counter = 0
#Batch process list items - to mitigate the list threshold issue on larger lists
Do {
    #Get items from the list
    $ListItems = $List.GetItems($Query)
    $Ctx.Load($ListItems)
    $Ctx.ExecuteQuery()
    $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
    #Loop through each list item
    ForEach ($ListItem in $ListItems)
    {
        # TODO: copy the field value here
        $Counter++
        Write-Progress -PercentComplete ($Counter / ($List.ItemCount) * 100) -Activity "Processing Items $Counter of $($List.ItemCount)" -Status "Copying field values in list items of '$($List.Title)'"
    }
} While ($Query.ListItemCollectionPosition -ne $null)

Split Strings in a Value column with Powercli

This is what I wrote to get the output with PowerCLI:
Get-VM -Name SERVERX | Get-Annotation -CustomAttribute "Last EMC vProxy Backup" | Select-Object @{N='VM';E={$_.AnnotatedEntity}},Value
This is the output
VM Value
-- -----
SERVERX Backup Server=networker01, Policy=vmbackup, Workflow=Linux_Test_Production, Action=Linux_Test_Production, JobId=1039978, StartTime=2018-10-31T00:00:27Z, EndTime=2018-10-31T00:12:45Z
SERVERX1 Backup Server=networker01, Policy=vmbackup, Workflow=Linux_Test_Production, Action=Linux_Test_Production, JobId=1226232, StartTime=2018-12-06T00:00:29Z, EndTime=2018-12-06T00:0...
SERVERX2 Backup Server=networker01, Policy=vmbackup, Workflow=Linux_Test_Production, Action=Linux_Test_Production, JobId=1226239, StartTime=2018-12-05T23:58:27Z, EndTime=2018-12-06T00:0...
But I would like to retrieve only the StartTime and EndTime values.
The desired output is:
VM Value
-- -----
SERVERX StartTime=2018-10-31T00:00:27Z, EndTime=2018-10-31T00:12:45Z
SERVERX1 StartTime=2018-12-06T00:00:29Z, EndTime=2018-12-06T00:11:14Z
SERVERX2 StartTime=2018-12-05T23:58:27Z, EndTime=2018-12-06T00:11:20Z
How can I get this output?
This would be better suited to a PowerShell forum, as this is just data manipulation.
Provided your output always has the same number of commas, then:
$myannotation = Get-VM -Name SERVERX | Get-Annotation -CustomAttribute "Last EMC vProxy Backup" |
    Select-Object @{N='VM';E={$_.AnnotatedEntity}},Value

$table1 = @()
foreach ($a in $myannotation)
{
    $splitter = $a.Value -split ','
    $splitbackupstart = $splitter[5]
    $splitbackupend = $splitter[6]

    $row = '' | Select-Object vmname, backupstart, backupend
    $row.vmname = $a.AnnotatedEntity # or $a.VM - would have to try
    $row.backupstart = $splitbackupstart
    $row.backupend = $splitbackupend
    $table1 += $row
}
$table1
Untested. If the format of the string is going to change over time, then a regex that searches for StartTime will be better.
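To illustrate the regex idea (sketched in Python here just to show the pattern; the sample string is taken from the output above):

import re

value = ("Backup Server=networker01, Policy=vmbackup, Workflow=Linux_Test_Production, "
         "Action=Linux_Test_Production, JobId=1039978, "
         "StartTime=2018-10-31T00:00:27Z, EndTime=2018-10-31T00:12:45Z")

# Extract only the StartTime/EndTime pair, regardless of how many other fields appear.
match = re.search(r"StartTime=(\S+?),\s*EndTime=(\S+)", value)
if match:
    start_time, end_time = match.groups()
    print(f"StartTime={start_time}, EndTime={end_time}")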

How can I get unique values in array in a jmespath query?

In an AWS CLI JMESPath query, with for example the output ["a","a","b","a","b"], how do I extract the unique values to get ["a","b"]?
Unfortunately this is not currently possible in jmespath.
It's not what you asked for but I've used the following:
aws ... | jq -r ".[]" | sort | uniq
This will convert ["a", "a", "b", "a"] to:
a
b
The closest I've come to "unique values"... is to deduplicate outside of JMESPath (so not really in JMESPath pipelines).
aws ec2 describe-images \
    --region us-east-1 \
    --filter "Name=architecture,Values=x86_64" \
    --query 'Images[].ImageOwnerAlias | join(`"\n"`, @)' \
    --output text \
    | sort -u
Output:
amazon
aws-marketplace
If you use JMESPath standalone, you'd write things like this.
jp -u -f myjson.json 'Images[].ImageOwnerAlias | join(`"\n"`, @)' | sort -u
The idea is to get jp to spit out a list of values (on separate lines) and then apply all the power of your favorite sorter. The tricky part is to get the list (of course).
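If you are willing to do the deduplication in Python instead, the jmespath package gives the same result in one place. A small sketch (the file name and expression are just illustrative):

import json
import jmespath

with open("myjson.json") as f:
    data = json.load(f)

# Run the same JMESPath expression, then deduplicate with a set.
values = jmespath.search("Images[].ImageOwnerAlias", data)
unique_values = sorted({v for v in values if v is not None})
print(unique_values)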

AWS CLI "s3 ls" command to list a date range of files in a virtual folder

I'm trying to list files from a virtual folder in S3 within a specific date range. For example: all the files that have been uploaded for the month of February.
I currently run an aws s3 ls command, but that gives all the files:
aws s3 ls s3://Bucket/VirtualFolder/VirtualFolder --recursive --human-readable --summarize > c:File.txt
How can I get it to list only the files within a given date range?
You could filter the results with a tool like awk:
aws s3 ls s3://Bucket/VirtualFolder/VirtualFolder --recursive --human-readable --summarize \
| awk -F'[-: ]' '$1 >= 2016 && $2 >= 3 { print }'
Where awk splits each record using -, :, and space delimiters, so you can address the fields as:
$1 - year
$2 - month
$3 - day
$4 - hour
$5 - minute
$6 - second
The aws cli ls command does not support filters, so you will have to bring back all of the results and filter locally.
Realizing this question was tagged command-line-interface, I have found that the best way to address non-trivial aws-cli needs is to write a Python script.
Tersest example:
$ python3 -c "import boto3; print(boto3.client('s3').list_buckets()['Buckets'][0])"
Returns: (for me)
{'Name': 'aws-glue-scripts-282302944235-us-west-1', 'CreationDate': datetime.datetime(2019, 8, 22, 0, 40, 5, tzinfo=tzutc())}
That one-liner isn't a profound script, but it can be expanded into one. (Probably with less effort than munging a bash script, much as I love bash.) After looking up a few boto3 calls, you can deduce the rest from the equivalent CLI commands.
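For the original question, here is a minimal sketch of that approach which filters objects by LastModified (the bucket, prefix, and date range are placeholders):

from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
start = datetime(2016, 2, 1, tzinfo=timezone.utc)
end = datetime(2016, 3, 1, tzinfo=timezone.utc)

# Page through the prefix and keep only keys last modified inside the range.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="Bucket", Prefix="VirtualFolder/VirtualFolder"):
    for obj in page.get("Contents", []):
        if start <= obj["LastModified"] < end:
            print(obj["LastModified"], obj["Size"], obj["Key"])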