AWS Glue causing an issue after reading an array JSON file - amazon-web-services

We need your assistance with the AWS Glue ETL issue below.
We are trying to read JSON files using an AWS Glue dynamic frame.
Example input JSON data:
{"type":"TripLog","count":"2","def":["CreateTimestamp","UUID","DataTimestamp","VIN","DrivingRange","DrivingRangeUnit","FinishPos.Datum","FinishPos.Event","FinishPos.Lat","FinishPos.Lon","FinishPos.Odo","FinishPos.Time","FuelConsumption1Trip","FuelConsumption1TripUnit","FuelConsumptionTripA","FuelConsumptionTripAUnit","FuelUsed","FuelUsedUnit","Mileage","MileageUnit","ODOUnit","Score.AcclAdviceMsg","Score.AcclScore","Score.AcclScoreUnit","Score.BrakeAdviceMsg","Score.BrakeScore","Score.BrakeScoreUnit","Score.ClassJudge","Score.IdleAdviceMsg","Score.IdleScore","Score.IdleScoreUnit","Score.IdleStopTime","Score.LifetimeTotalScore","Score.TotalScore","Score.TotalScoreUnit","StartPos.Datum","StartPos.Lat","StartPos.Lon","StartPos.Odo","StartPos.Time","TripDate","TripId"],"data":[["2017-10-17 08:47:17.930","xxxxxxx","20171017084659"," xxxxxxxxxxx ","419","mile","WGS84","Periodic intervals during IG ON","38,16,39.846","-77,30,45.230","33559","20171017-033104","50.1","M-G - mph(U.S. gallon)","36.0","M-G - mph(U.S. gallon)","428.1","cm3",null,null,"km",null,null,"%",null,null,"%",null,null,null,"%","0x0",null,null,"%","WGS84","39,12,50.988","-76,38,36.417","33410","20171017-015103","20171017-015103","0"],["2017-10-17 08:47:17.930"," xxxxxxx ","20171017084659","xxxxxxxxxxx","414","mile","WGS84","Periodic intervals during IG ON","38,12,12.376","-77,29,57.915","33568","20171017-033604","50.1","M-G - mph(U.S. gallon)","36.0","M-G - mph(U.S. gallon)","838.0","cm3",null,null,"km",null,null,"%",null,null,"%",null,null,null,"%","0x0",null,null,"%","WGS84","39,12,50.988","-76,38,36.417","33410","20171017-015103","20171017-015103","0"]]}
Step 1: Code to read the JSON file into a dynamic frame (landing_location is our file location):
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [landing_location], "recurse": True, "groupFiles": "inPartition", "groupSize": "1048576"},
    format="json",
    transformation_ctx="dyf")
dyf.printSchema()
root
|-- type: string
|-- count: string
|-- def: array
|    |-- element: string
|-- data: array
|    |-- element: array
|    |    |-- element: choice
|    |    |    |-- int
|    |    |    |-- string
Step 2: Convert to a Spark data frame and explode the data.
dtcadf = dyf.toDF()
dtcadf.show(truncate=False)
dtcadf.registerTempTable('dtcadf')
data=spark.sql('select explode(data) from dtcadf')
data.show(1,False)
We get the following error:
An error occurred while calling o270.showString.
Note: the same file is read successfully when we load it directly into a Spark data frame instead of an AWS Glue dynamic frame.
Can you please help us resolve the issue? Let us know if you need any further information from our end.
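The printSchema output above shows a choice (int/string) element type inside the data column, which is a common reason a dynamic frame only fails once it is converted and shown. Below is a minimal sketch of two workarounds, assuming the same glueContext/spark session and the landing_location path from the question; the resolveChoice spec on the data column is an assumption, not something verified against this exact nested schema.

# Option 1 (assumption): resolve the ambiguous element type before converting,
# so toDF() no longer has to deal with a mixed int/string column.
resolved = dyf.resolveChoice(specs=[("data", "cast:string")])
resolved.toDF().show(1, False)

# Option 2: skip the dynamic frame and let Spark infer the schema directly,
# which the note above says already works, then explode the data column.
df = spark.read.json(landing_location)
df.selectExpr("explode(data) AS trip").show(1, False)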

Related

Boto3: Is it possible to parallelize the paginator of the list_objects_v2 API?

I have an S3 bucket called my-bucket with files stored in the following structure:
my-bucket
|-- subfolder_1
|   |-- uniqueId_1
|   |   |-- timestamp1.txt
|   |   |-- timestamp2.txt
|   |   |-- timestamp3.txt
|   |   |-- ...
|   |-- uniqueId_2
|   |   |-- timestamp1.txt
|   |   |-- timestamp2.txt
|   |   |-- timestamp3.txt
|   |   |-- ...
My goal is to get all unique "uniqueIds"; I don't care about the timestamps. However, the following code took about 5 minutes just to get all uniqueIds under subfolder_1.
import time
import boto3

paginator = boto3.client("s3").get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket="my-bucket", Prefix="subfolder_1/")

uniqueKeys = set()
for page in pages:
    for obj in page['Contents']:
        uniqueKeys.add(obj['Key'].split("/")[-2])
print(len(uniqueKeys))
I'm wondering whether it is possible to use multithreading, either to replace the for loop or inside it, to accelerate this process?
Meanwhile, I noticed that the AWS CLI command aws s3 ls s3://my-bucket/subfolder_1/ | wc -l takes only about 1 minute to finish, so an equivalent boto3 implementation would help me as well.
Thank you in advance!
If you just want a list of the 'subdirectories' under subfolder_1, you could use:
boto3.client('s3').list_objects_v2(Bucket='foo',Prefix='subfolder_1/',Delimiter='/')
By specifying the Delimiter, S3 will also return a list of CommonPrefixes, which are effectively the names of the 'subdirectories' within that prefix.
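A minimal sketch of that approach, reusing the bucket and prefix names from the question: with Delimiter="/", S3 returns one CommonPrefixes entry per immediate 'subdirectory', so there is no need to page through every timestamp object or to add multithreading.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="my-bucket", Prefix="subfolder_1/", Delimiter="/")

uniqueKeys = set()
for page in pages:
    # CommonPrefixes holds entries like "subfolder_1/uniqueId_1/"
    for prefix in page.get("CommonPrefixes", []):
        uniqueKeys.add(prefix["Prefix"].rstrip("/").split("/")[-1])
print(len(uniqueKeys))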

Using shellspec for testing

I am trying to implement a testing framework using shellspec.
I have read the article and the README at the shellspec GitHub project,
but I'm still confused about how to customise the project's directories.
I'd like my testing framework to have the following structure:
<root_dir>
|-- README
|
|-- tests
|
|-- test_instance_1
|   |
|   |-- lib
|   |   |
|   |   |-- my_test_1.sh
|   |
|   |-- spec
|       |
|       |-- my_test_1_spec.sh
|
|-- test_instance_2
    |
    |-- lib
    |   |
    |   |-- my_test_2.sh
    |
    |-- spec
        |
        |-- my_test_2_spec.sh
As mentioned in the shellspec GitHub project, it is possible to customise the directory structure:
This is the typical directory structure. Version 0.28.0 allows many of
these to be changed by specifying options, supporting a more flexible
directory structure.
So I tried to modify my .shellspec file in the following way:
--default-path "***/spec"
--execdir #basedir/lib`
But when I run the shellspec command, I get the following errors:
shellspec.sh: eval: line 23: unexpected EOF while looking for matching ``'
shellspec.sh: eval: line 24: syntax error: unexpected end of file
shellspec is run in <root_dir>.
I also saw that there should be a .shellspec-basedir file in each subdirectory, but I don't understand what it should contain.
I'd be happy if someone could give an example of an existing project with a custom directory structure or tell me what I'm doing wrong.
The answer turned out to be very simple. You need to use
--default-path "**/spec"
to find _spec.sh files in all spec/ directories in the project.
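For completeness, a hedged sketch of the corrected .shellspec file: the trailing backtick after --execdir in the question is what triggers the "unexpected EOF while looking for matching ``'" error, and the shellspec README writes the base-directory placeholder as @basedir, so treat the exact option values below as assumptions to check against your shellspec version.

--default-path "**/spec"
--execdir @basedir/lib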

How to run "ssd_object_detection.cpp"?

I'm trying to learn about object detection with deep learning, and I'm currently trying to run the sample code "ssd_object_detection.cpp". In this code it is necessary to input the following parameters:
const char* params
= "{ help | false | print usage }"
"{ proto | | model configuration }"
"{ model | | model weights }"
"{ camera_device | 0 | camera device number}"
"{ video | | video or image for detection}"
"{ min_confidence | 0.5 | min confidence }";
Following the instructions in this code, I downloaded some pretrained models from this page: https://github.com/weiliu89/caffe/tree/ssd#models
So, I obtained a folder with the following files:
deploy.prototxt
finetune_ssd_pascal.py
solver.prototxt
test.prototxt
train.prototxt
VGG_VOC0712Plus_SSD_300x300_ft_iter_160000.caffemodel
Then, I copied all this data to my project folder, and tried this code:
const char* params
= "{ help | false | print usage }"
"{ proto |test.prototxt| model configuration }"
"{ model |VGG_VOC0712Plus_SSD_300x300_ft_iter_160000.caffemodel| model weights }"
"{ camera_device | 0 | camera device number}"
"{ video |MyRoute...| video or image for detection}"
"{ min_confidence | 0.5 | min confidence }";
But in the output I get the following error:
[libprotobuf ERROR C:\build\master_winpack-build-win64-vc14\opencv\3rdparty\protobuf\src\google\protobuf\text_format.cc:298] Error parsing text-format opencv_caffe.NetParameter: 13:18: Message type "opencv_caffe.TransformationParameter" has no field named "resize_param".
OpenCV Error: Unspecified error (FAILED: ReadProtoFromTextFile(param_file, param). Failed to parse NetParameter file: test.prototxt) in cv::dnn::ReadNetParamsFromTextFileOrDie, file C:\build\master_winpack-build-win64-vc14\opencv\modules\dnn\src\caffe\caffe_io.cpp, line 1145
I tried all of the .prototxt files and other possible solutions, but I still can't run the example code.
Can someone explain what parameters I should use in this code, or what I am doing wrong?
Sorry about my bad English, and thanks in advance.

Accessing HTTP resources on AWS S3 with authorization

I have an s3 bucket with a set of static website resources as shown below:
Root-Bucket/
|-- website1/
| |-- index.html
| |-- 1_resource1.jpg
| |-- 1_resourse2.css
|-- website2/
| |-- index.html
| |-- 2_resource1.jpg
| |-- 2_resourse2.css
All of the objects shown above are private by default.
I don't want these resources to be accessible to everyone; only authorized users should be able to view index.html and the attached resources.
Is there any way to serve such a static website with authorization?

Organizing Django + Static Website Folder Hierarchy

I'm currently working on developing a personal Django site that will consist of various technologies / subdomains. My main page(s) will be Django, with a blog.blah.com subdomain that runs WordPress, and several other subdomains for projects (project1.blah.com, project2.blah.com) that are static HTML files (created with Sphinx).
I'm having a lot of trouble organizing my file hierarchy and web server configurations. I'm currently running Apache on port 8080 which serves the Django stuff via mod_wsgi, and I use NGINX on port 80 to handle requests and proxying.
Here's my current filesystem layout. NOTE: I run ALL websites under a single user account.
blah@blah:~$ tree
.
`-- sites
|-- blah.org
| |-- logs
| |-- blah
| | |-- apache
| | | |-- blah.conf
| | | `-- blah.wsgi
| | |-- INSTALL
| | |-- nginx
| | | `-- blah.conf
| | |-- blah
| | | |-- app1
| | | | `-- models.py
| | | |-- app2
| | | | `-- models.py
| | | |-- manage.py
| | | |-- settings.py
| | | `-- urls.py
| | `-- README
| `-- private
`-- blah2.org
Can anyone help me figure out where to place files for a best-practices type of deployment? The structure above ONLY contains my Django code. I've got no idea where to put my static content files (eg: html subdomain sites), and my other services (eg: wordpress stuff).
Any help would be greatly appreciated! Bonus points if you show off your directory structure.
I put my stuff in /srv/www/blah.org/ like this:
blah.org
|-- media
|-- amedia
|-- templates
|-- blah
|   (django app)
|   ...
|-- settings.py
|-- config
|   |-- crontab
|   |-- blag.org.conf (nginx)
|-- manage.py
Then I configure nginx to serve /media/ and /amedia/ as static locations and proxy everything else to gunicorn serving Django.
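A minimal sketch of that nginx server block, assuming the /srv/www/blah.org/ layout above and gunicorn listening on 127.0.0.1:8000 (the port and exact paths are assumptions; they are not stated in the answer):

server {
    listen 80;
    server_name blah.org;

    # static files served directly by nginx
    location /media/  { alias /srv/www/blah.org/media/; }
    location /amedia/ { alias /srv/www/blah.org/amedia/; }

    # everything else is proxied to gunicorn serving Django
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}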