Powershell - gzip large file and load to s3 using stream - amazon-web-services

I'm trying to compress some csv files using gzip and then upload them to S3. I need to use streams to compress and load because the files could be very large and I don't want to write the file back to disk before loading it to s3. I'm new to using streams in Powershell and I'm struggling to figure out the issue.
This is what I have so far but I can't get it to work. It loads a very small gzip file that shows my original file inside but I can't extract it - I get an "Unexpected end of data" error. I believe it's not finalizing the gzip stream or something like that. If I remove the "gzip" commands and just write out the inputFileStream to S3 then it works to load the uncompressed file, so I know the S3 load using a stream works.
Also, I'm using "CopyTo" which I believe will bring the whole file into memory which I don't want either (let me know if I'm not correct with that thinking).
$sourcePath = "c:\temp\myfile.csv"
$bucketName = "mybucket"
$s3Key = "staging/compress_test/"
$fileInfo = Get-Item -Path $sourcePath
$destPath = "$s3Key$($fileInfo.Name).gz"
$outputMemoryStream = New-Object System.IO.MemoryStream
$gzipStream = New-Object System.IO.Compression.GZipStream $outputMemoryStream, ([IO.Compression.CompressionMode]::Compress)
$inputFileStream = New-Object System.IO.FileStream $sourcePath, ([IO.FileMode]::Open), ([IO.FileAccess]::Read), ([IO.FileShare]::Read)
$inputFileStream.CopyTo($gzipStream)
Write-S3Object -BucketName $bucketName -Key $destPath -Stream $outputMemoryStream -ProfileName Dev -Region us-east-1
$inputFileStream.Close()
$outputMemoryStream.Close()
UPDATE: Thanks @FoxDeploy. I got it at least loading the file now. I needed to close the gzip stream before writing to S3 so that the gzip output is finalized. But, as I suspected, the "CopyTo" causes the full file to be compressed and held in memory, and only then does it load to S3. I would like it to stream to S3 as it's compressing, to reduce the memory load, if that's possible.
Here's the current working code:
$sourcePath = "c:\temp\myfile.csv"
$bucketName = "mybucket"
$s3Key = "staging/compress_test/"
$fileInfo = Get-Item -Path $sourcePath
$destPath = "$s3Key$($fileInfo.Name).gz"
$outputMemoryStream = New-Object System.IO.MemoryStream
$gzipStream = New-Object System.IO.Compression.GZipStream $outputMemoryStream, ([IO.Compression.CompressionMode]::Compress), $true
$inputFileStream = New-Object System.IO.FileStream $sourcePath, ([IO.FileMode]::Open), ([IO.FileAccess]::Read), ([IO.FileShare]::Read)
$inputFileStream.CopyTo($gzipStream)
$gzipStream.Close()
Write-S3Object -BucketName $bucketName -Key $destPath -Stream $outputMemoryStream -ProfileName Dev -Region us-east-1
$inputFileStream.Close()
$outputMemoryStream.Close()
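If the memory held by the MemoryStream is the concern, one option (not something Write-S3Object does for you) is to drive S3's multipart upload API directly and send the compressed bytes in bounded chunks, so only roughly one part's worth of compressed data sits in memory at a time. The sketch below is untested and makes a few assumptions: Windows PowerShell with the AWSPowerShell module loaded (so the synchronous AmazonS3Client methods are available), default credentials rather than the Dev profile, and an illustrative part size.
# Rough sketch: compress in chunks and upload each chunk as a multipart part.
$sourcePath = "c:\temp\myfile.csv"
$bucketName = "mybucket"
$destPath   = "staging/compress_test/myfile.csv.gz"
$partSizeMB = 10   # S3 requires every part except the last to be at least 5 MB
$s3 = New-Object Amazon.S3.AmazonS3Client ([Amazon.RegionEndpoint]::USEast1)
$initRequest = New-Object Amazon.S3.Model.InitiateMultipartUploadRequest
$initRequest.BucketName = $bucketName
$initRequest.Key = $destPath
$uploadId = $s3.InitiateMultipartUpload($initRequest).UploadId
$inputFileStream = [System.IO.File]::OpenRead($sourcePath)
$partBuffer = New-Object System.IO.MemoryStream
$gzipStream = New-Object System.IO.Compression.GZipStream $partBuffer, ([IO.Compression.CompressionMode]::Compress), $true
$readBuffer = New-Object byte[] (1MB)
$partNumber = 1
$partResponses = New-Object 'System.Collections.Generic.List[Amazon.S3.Model.UploadPartResponse]'
function Send-Part {
    # Uploads whatever compressed data is currently in $partBuffer as part $partNumber, then empties the buffer.
    $partBuffer.Position = 0
    $partRequest = New-Object Amazon.S3.Model.UploadPartRequest
    $partRequest.BucketName = $bucketName
    $partRequest.Key = $destPath
    $partRequest.UploadId = $uploadId
    $partRequest.PartNumber = $partNumber
    $partRequest.PartSize = $partBuffer.Length
    $partRequest.InputStream = $partBuffer
    $partResponses.Add($s3.UploadPart($partRequest))
    $partBuffer.SetLength(0)
    $partBuffer.Position = 0
}
while (($read = $inputFileStream.Read($readBuffer, 0, $readBuffer.Length)) -gt 0) {
    $gzipStream.Write($readBuffer, 0, $read)
    if ($partBuffer.Length -ge ($partSizeMB * 1MB)) {
        Send-Part
        $partNumber++
    }
}
$gzipStream.Close()   # flushes the remaining compressed data and the gzip footer into $partBuffer
Send-Part             # the last part is allowed to be smaller than 5 MB
$completeRequest = New-Object Amazon.S3.Model.CompleteMultipartUploadRequest
$completeRequest.BucketName = $bucketName
$completeRequest.Key = $destPath
$completeRequest.UploadId = $uploadId
$completeRequest.AddPartETags($partResponses)
$s3.CompleteMultipartUpload($completeRequest) | Out-Null
$inputFileStream.Close()
$partBuffer.Close()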

Related

Download Last 24 hour files from s3 using Powershell

I have an S3 bucket with different filenames. I need to download the specific files (filenames that start with "impression") that were created or modified in the last 24 hours from the S3 bucket to a local folder using PowerShell.
$items = Get-S3Object -BucketName $sourceBucket -ProfileName $profile -Region 'us-east-1' | Sort-Object LastModified -Descending | Select-Object -First 1 | select Key
Write-Host "$($items.Length) objects to copy"
$index = 1
$items | % {
    Write-Host "$index/$($items.Length): $($_.Key)"
    $fileName = $Folder + ".\$($_.Key.Replace('/','\'))"
    Write-Host "$fileName"
    Read-S3Object -BucketName $sourceBucket -Key $_.Key -File $fileName -ProfileName $profile -Region 'us-east-1' > $null
    $index += 1
}
A workaround might be to turn on access logging; since the access logs contain timestamps, you can get all the access logs from the past 24 hours, de-duplicate the repeated S3 objects, then download them all.
You can enable S3 access log in the bucket settings, the logs will be stored in another bucket.
If you end up writing a script for this, just bear in mind downloading the S3 objects will essentially create new access logs, making the operation irreversible.
If you want something fancy, you can even query the logs and de-duplicate them using AWS Athena.
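Alternatively, staying with Get-S3Object, here is a rough sketch (untested) that filters on LastModified and the key prefix directly instead of taking only the newest object; the bucket, profile and destination folder variables are the same placeholders used in the question.
# Rough sketch: download objects whose key starts with "impression" and which were modified in the last 24 hours.
$cutoff = (Get-Date).ToUniversalTime().AddHours(-24)
Get-S3Object -BucketName $sourceBucket -ProfileName $profile -Region 'us-east-1' |
    Where-Object { $_.Key -like 'impression*' -and $_.LastModified.ToUniversalTime() -gt $cutoff } |
    ForEach-Object {
        $fileName = Join-Path $Folder ($_.Key.Replace('/', '\'))
        Read-S3Object -BucketName $sourceBucket -Key $_.Key -File $fileName -ProfileName $profile -Region 'us-east-1' | Out-Null
    }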

aws s3 sync missing to create root folders

I am archiving some folders to S3
Example: C:\UserProfile\E21126\data ....
I expect to have a folder structure in S3 like UserProfiles\E21126.
The problem is that it creates the folders that are under \E21126 but misses creating the root folder \E21126 itself.
Folds1.txt contains these folders to sync:
G:\UserProfiles\E21126
G:\UserProfiles\E47341
G:\UserProfiles\C68115
G:\UserProfiles\C30654
G:\UserProfiles\C52860
G:\UserProfiles\E47341
G:\UserProfiles\C68115
G:\UserProfiles\C30654
G:\UserProfiles\C52860
my code below:
ForEach ($Folder in (Get-Content "F:\scripts\Folds1.txt")) {
    aws s3 sync $Folder s3://css-lvdae1cxfs003-archive/Archive-Profiles/ --acl bucket-owner-full-control --storage-class STANDARD
}
It will upload all the folders with their names, excluding the path. If you want to include UserProfiles in the S3 bucket, then you will need to include that in the key. You need to upload them to the S3 bucket specifying the key name:
aws s3 sync $Folder s3://css-lvdae1cxfs003-archive/Archive-Profiles/UserProfiles --acl bucket-owner-full-control --storage-class STANDARD
And if your folders have a different name instead of the UserProfiles string, then you can get the parent path and then fetch its leaf to get the name out of the string:
PS C:\> Split-Path -Path "G:\UserProfiles\E21126"
G:\UserProfiles
PS C:\> Split-Path -Path "G:\UserProfiles" -Leaf -Resolve
UserProfiles
If you were to modify the text file to contain:
E21126
E47341
C68115
Then you could use the command:
ForEach ($Folder in (Get-Content "F:\scripts\Folds1.txt")) {
    aws s3 sync G:\UserProfiles\$Folder s3://css-lvdae1cxfs003-archive/Archive-Profiles/$Folder/ --acl bucket-owner-full-control --storage-class STANDARD
}
Note that the folder name is included in the destination path.
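Alternatively, if you would rather keep the full paths in Folds1.txt, here is a small sketch (untested) that derives the leaf folder name with Split-Path and appends it to the destination prefix, so each user folder is preserved in the key:
# Rough sketch: keep full paths in the text file and take the leaf folder name (e.g. E21126) for the key.
# Sort-Object -Unique also drops any duplicate entries in the list.
ForEach ($Folder in (Get-Content "F:\scripts\Folds1.txt" | Sort-Object -Unique)) {
    $leaf = Split-Path -Path $Folder -Leaf
    aws s3 sync $Folder s3://css-lvdae1cxfs003-archive/Archive-Profiles/$leaf/ --acl bucket-owner-full-control --storage-class STANDARD
}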

Powershell writing to AWS S3

I'm trying to get powershell to write results to AWS S3 and I can't figure out the syntax. Below is the line that is giving me trouble. If I run this without everything after the ">>" the results print on the screen.
Write-host "Thumbprint=" $i.Thumbprint " Expiration Date="$i.NotAfter " InstanceID ="$instanceID.Content" Subject="$i.Subject >> Write-S3Object -BucketName arn:aws:s3:::eotss-ssl-certificatemanagement
Looks like you have an issue with >>: be aware that you can't pass the Write-Host output into another command like that.
In order to do that, you need to assign the string you want to a variable and then pass it to the -Content parameter.
Take a look at the following code snippet:
Install-Module AWSPowerShell
Import-Module AWSPowerShell
#Set AWS Credential
Set-AWSCredential -AccessKey "AccessKey" -SecretKey "SecretKey"
#File upload
Write-S3Object -BucketName "BucketName" -Key "File upload test" -File "FilePath"
#Content upload
$content = "Thumbprint= $($i.Thumbprint) Expiration Date=$($i.NotAfter) InstanceID = $($instanceID.Content) Subject=$($i.Subject)"
Write-S3Object -BucketName "BucketName" -Key "Content upload test" -Content $content
How to create new AccessKey and SecretKey - Managing Access Keys for Your AWS Account.
AWSPowerShell Module installation.
AWS Tools for PowerShell - S3 Documentation.
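If the goal is one report covering several certificates, a possible variation (hypothetical: the certificate store path, bucket name and object key are placeholders, and the InstanceID part is left out) is to build the whole string first and upload it as a single object:
# Hypothetical sketch: one line per certificate, uploaded as a single S3 object.
$report = Get-ChildItem Cert:\LocalMachine\My | ForEach-Object {
    "Thumbprint=$($_.Thumbprint) Expiration Date=$($_.NotAfter) Subject=$($_.Subject)"
} | Out-String
Write-S3Object -BucketName "BucketName" -Key "cert-report.txt" -Content $report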

Issue with a Powershell script transferring files to a S3 Bucket

I am trying to set up a PowerShell script to automate transferring a directory to an S3 bucket. I have been following the instructions listed at http://todhilton.com/technicalwriting/upload-backup-your-files-to-amazon-s3-with-powershell/ but when I run it I get the following error.
Unable to find type [Amazon.AWSClientFactory].
At line:18 char:9
+ $client=[Amazon.AWSClientFactory]::CreateAmazonS3Client($accessKeyID, ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (Amazon.AWSClientFactory:TypeName) [], RuntimeException
+ FullyQualifiedErrorId : TypeNotFound
The code I have is pasted below... If someone has some insight that would be awesome :)
# Constants
$sourceDrive = "C:\"
$sourceFolder = "Users\Administrator\AppData\Roaming\folder"
$sourcePath = $sourceDrive + $sourceFolder
$s3Bucket = "bucket"
$s3Folder = "Archive"
# Constants – Amazon S3 Credentials
$accessKeyID="KEY"
$secretAccessKey="Secret"
# Constants – Amazon S3 Configuration
$config=New-Object Amazon.S3.AmazonS3Config
$config.RegionEndpoint=[Amazon.RegionEndpoint]::"ap-southeast-2"
$config.ServiceURL = "https://s3-ap-southeast-2.amazonaws.com/"
# Instantiate the AmazonS3Client object
$client=[Amazon.AWSClientFactory]::CreateAmazonS3Client($accessKeyID,$secretAccessKey,$config)
# FUNCTION – Iterate through subfolders and upload files to S3
function RecurseFolders([string]$path) {
    $fc = New-Object -com Scripting.FileSystemObject
    $folder = $fc.GetFolder($path)
    foreach ($i in $folder.SubFolders) {
        $thisFolder = $i.Path
        # Transform the local directory path to notation compatible with S3 Buckets and Folders
        # 1. Trim off the drive letter and colon from the start of the Path
        $s3Path = $thisFolder.ToString()
        $s3Path = $s3Path.SubString(2)
        # 2. Replace back-slashes with forward-slashes
        # Escape the back-slash special character with a back-slash so that it reads it literally, like so: "\\"
        $s3Path = $s3Path -replace "\\", "/"
        $s3Path = "/" + $s3Folder + $s3Path
        # Upload directory to S3
        Write-S3Object -BucketName $s3Bucket -Folder $thisFolder -KeyPrefix $s3Path
    }
    # If subfolders exist in the current folder, then iterate through them too
    foreach ($i in $folder.SubFolders) {
        RecurseFolders($i.Path)
    }
}
# Upload root directory files to S3
$s3Path = "/" + $s3Folder + "/" + $sourceFolder
Write-S3Object -BucketName $s3Bucket -Folder $sourcePath -KeyPrefix $s3Path
# Upload subdirectories to S3
RecurseFolders($sourcePath)
Please check the following: the Amazon.AWSClientFactory type does not exist any more. AWSClientFactory was removed from the AWS SDK for .NET, which is why PowerShell cannot find the type.
I used the script below:
# Bucket region details
$RegionEndpoint = 'us-east-1'
#$ServiceURL = 'https://s3-us-east-1.amazonaws.com/'
#Credentials initialized
$credsCSV = Get-ChildItem "E:\myAPIUser_credentials.csv"
$credsContent = Import-Csv $credsCSV.FullName
$accessKeyID = $credsContent.'Access key ID'
$secretAccessKey = $credsContent.'Secret access key'
Initialize-AWSDefaults -Region $RegionEndpoint -AccessKey $accessKeyID -SecretKey $secretAccessKey
$sourceFolder = "E:\Code\powershell\PSmy\git\AWSPowerShell"
$targetFolder = Get-Date -Format "dd-MMM-yyyy"
Write-S3Object -BucketName $s3Bucket.BucketName -Folder $sourceFolder -Recurse -KeyPrefix $targetFolder\
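If you would rather keep the structure of the original script, here is a minimal sketch of replacing just the removed factory call; the constructor overload is part of the AWS SDK for .NET that the AWSPowerShell module loads, and the credential and config variables come from the original script:
# Instead of [Amazon.AWSClientFactory]::CreateAmazonS3Client(...), construct the client directly:
$config = New-Object Amazon.S3.AmazonS3Config
$config.RegionEndpoint = [Amazon.RegionEndpoint]::APSoutheast2
$client = New-Object Amazon.S3.AmazonS3Client $accessKeyID, $secretAccessKey, $config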

Using powershell to download an embedded video

I need to download a monthly broadcast automatically (will set a scheduled task) using powershell.
Here is the embedded URL: https://www.jw.org/download/?fileformat=MP4&output=html&pub=jwb&issue=201601&option=TRGCHlZRQVNYVrXF&txtCMSLang=E
The only thing that changes each month is the 201602, 201603, etc. Once I am able to pull the 720p video file, I will work on programmatically building that part of the URL based on the current system clock (I can manage this).
I have tried these without success:
Attempt 1:
$source = "https://www.jw.org/download/?fileformat=MP4&output=html&pub=jwb&issue=201601&option=TRGCHlZRQVNYVrXF&txtCMSLang=E"
$destination = "c:\broadcasts\test.mp4"
Invoke-WebRequest $source -OutFile $destination
Attempt 2:
$source = "https://www.jw.org/download/?fileformat=MP4&output=html&pub=jwb&issue=201601&option=TRGCHlZRQVNYVrXF&txtCMSLang=E"
$dest = "c:\broadcasts\test.mp4"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $dest)
Attempt 3:
Import-Module BitsTransfer
$url = "https://www.jw.org/download/?fileformat=MP4&output=html&pub=jwb&issue=201601&option=TRGCHlZRQVNYVrXF&txtCMSLang=E"
$output = "c:\broadcasts\test.mp4"
Start-BitsTransfer -Source $url -Destination $output
All of these attempts end up with a test.mp4 that is basically just an empty file.
Then I found another page that holds the video (and the download links for different qualities) and tried to pull these links using the following (I know I could have used $webpage.links):
Attempt 4:
$webpage = Invoke-WebRequest "http://tv.jw.org/#en/video/VODStudio/pub-jwb_201601_1_VIDEO"
$webpage.RawContent | Out-File "c:\scripts\webpage.txt" ASCII -Width 9999
And found that the raw content doesn't have the mp4 visible. My idea was to pull the raw content, parse it with regex and grab the 720p URL, save it in a variable and then send that to a BitsTransfer bit of code.
Please help?
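For reference, here is a rough sketch of that plan (untested). It assumes the download page exposes a direct .mp4 href and that "720P" appears somewhere in that link; both are guesses, and a page that builds its links with JavaScript will not expose them this way.
# Rough sketch: scrape the download page for a 720p .mp4 link and hand it to BITS.
$page = Invoke-WebRequest "https://www.jw.org/download/?fileformat=MP4&output=html&pub=jwb&issue=201601&option=TRGCHlZRQVNYVrXF&txtCMSLang=E"
$mp4Url = ($page.Links | Where-Object { $_.href -match '720P.*\.mp4' } | Select-Object -First 1).href
if ($mp4Url) {
    Import-Module BitsTransfer
    Start-BitsTransfer -Source $mp4Url -Destination "c:\broadcasts\test.mp4"
}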