Read FASTQ file into an AWS Glue Job Script - amazon-web-services

I need to read a FASTQ file in an AWS Glue job script, but I'm getting this error:
Traceback (most recent call last):
  File "/opt/amazon/bin/runscript.py", line 59, in <module>
    runpy.run_path(script, run_name='__main__')
  File "/usr/lib64/python3.7/runpy.py", line 261, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib64/python3.7/runpy.py", line 236, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
  File "/tmp/test20200930", line 24
    datasource0 = spark.createDataset(sc.textFile("s3://sample-genes-data/fastq/S_Sonnei_short_reads_1.fastq").sliding(4, 4).map {
    ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/amazon/bin/runscript.py", line 92, in <module>
    while "runpy.py" in new_stack.tb_frame.f_code.co_filename:
AttributeError: 'NoneType' object has no attribute 'tb_frame'
This is my code:
import org.apache.spark.mllib.rdd.RDDFunctions._
datasource0 = spark.createDataset(sc.textFile("s3://sample-genes-data/fastq/S_Sonnei_short_reads_1.fastq").sliding(4, 4).map {
    case Array(id, seq, _, qual) => (id, seq, qual)
}).toDF("identifier", "sequence", "quality")
datasource1 = DynamicFrame.fromDF(datasource0, glueContext, "nullv")
I followed this link:
Read FASTQ file into a Spark dataframe

I was able to run the code by wrapping it inside a GlueApp object. You can use the code below after replacing the S3 path with your own.
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.mllib.rdd.RDDFunctions._
object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._
    val datasource0 = sparkSession.createDataset(spark.textFile("s3://<s3path>").sliding(4, 4).map {
      case Array(id, seq, _, qual) => (id, seq, qual)
    }).toDF("identifier", "sequence", "quality")
    val datasource1 = DynamicFrame(datasource0, glueContext)
    datasource1.show()
    datasource1.printSchema()
    Job.commit()
  }
}
Passed input:
#seq1
AGTCAGTCGAC
+
?##FFBFFDDH
#seq2
CCAGCGTCTCG
+
?88ADA?BDF8
Output:
{"identifier": "#seq1", "sequence": "AGTCAGTCGAC", "quality": "?##FFBFFDDH"}
{"identifier": "#seq2", "sequence": "CCAGCGTCTCG", "quality": "?88ADA?BDF8"}

Related

Django raw sql insert query in triple quotation: Django interprets null values from ajax request data as None column

I am working on a Django/React project. Using DRF, I expose an API route for performing SQL queries against my PostgreSQL database, but I am having issues with my current code setup.
I have set up my INSERT query in the API as a raw query (using a cursor) enclosed in a triple-quoted multi-line string """INSERT...""", and I format my values into it using %s string formatting. In my API, I capture each field from the request body into a variable. Everything works fine when all the request data is filled in, but if a value is null, Django obviously assigns None to the variable.
Back to my SQL query: Django renders the null %s as None, which PostgreSQL treats as a column name instead of a proper NULL value, so it throws ProgrammingError: column "none" does not exist.
Here is some sample code:
React Frontend
const [lastName, setLastName] = useState('')
const [firstName, setFirstName] = useState('')
const [middleName, setMiddleName] = useState('')
const [nameExtn, setNameExtn] = useState('')
const [sex, setSex] = useState('')
const [civilStatus, setCivilStatus] = useState('')
const [bloodType, setBloodType] = useState('')
const [height, setHeight] = useState('')
const [weight, setWeight] = useState('')
const newPersonalInfo = (token, data) => {
  let endpoint = "/jobnet/api/profile/pds/basic/personal/"
  let lookupOptions = {
    method: "POST",
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Token ${token}`
    },
    body: JSON.stringify(data),
    credentials: 'include'
  }
  fetch(endpoint, lookupOptions)
    .then(res => res.json())
    .then(function (info) {
      console.log(info)
    })
    .catch(err => console.log(err));
}

const handleNewPersonalInfo = () => {
  newPersonalInfo(props.user.token, {
    firstname: firstName,
    lastname: lastName,
    middlename: middleName,
    extension: nameExtn,
    birthdate: selectedDate,
    sex: sex,
    civilstatus: civilStatus,
    bloodtype: bloodType,
    height: height,
    weight: weight,
  })
}
...
return (
  <Button
    variant="contained"
    color="primary"
    onClick={handleNewPersonalInfo}
  >
    SAVE
  </Button>
)
Django API (DRF)
class APIListCreate__PersonalInfo(generics.ListCreateAPIView):
    try:
        serializer_class = PDSBasicPersonalInfoSerializer
        permission_classes = (jobnet_permissions.IsAuthenticated,)
        authentication_classes = (knox_TokenAuthentication,)
    except Exception as e:
        traceback.print_exc()

    def get_queryset(self):
        user = self.request.user
        if user and user.is_authenticated:
            query = ("""
                SELECT
                    bsinfo.firstname,
                    bsinfo.middlename,
                    bsinfo.surname,
                    bsinfo.birthdate,
                    bsinfo.sex,
                    bsinfo.extention,
                    bsinfo.civilstatus,
                    bsinfo.height_m,
                    bsinfo.weight_kg,
                    bsinfo.bloodtype,
                    bsinfo.photo_path
                FROM jobnet_app.basicinfo bsinfo
                WHERE
                    id=%s
            """ % user.id)
            return raw_sql_select(query, "default")
        else:
            return None

    def get(self, request):
        data = [
            {
                "first_name": col.firstname,
                "middle_name": col.middlename,
                "last_name": col.surname,
                "name_extension": col.extention,
                "birthdate": col.birthdate,
                "sex": col.sex,
                "civil_status": col.civilstatus,
                "height": col.height_m,
                "weight": col.weight_kg,
                "blood_type": col.bloodtype,
                "photo_path": col.photo_path
            } for col in self.get_queryset()[1]
        ]
        return Response(data[0])

    def post(self, request):
        try:
            user = request.user.id
            firstname = request.data.get('firstname') or ''
            middlename = request.data.get('middlename') or ''
            lastname = request.data.get('lastname') or ''
            birthdate = request.data.get('birthdate') or ''
            sex = request.data.get('sex') or ''
            extension = request.data.get('extension') or ''
            civilstatus = request.data.get('civilstatus') or ''
            height_m = request.data.get('height') or 0
            weight_kg = request.data.get('weight') or 0
            bloodtype = request.data.get('bloodtype') or ''
            query = ("""
                START TRANSACTION;
                INSERT INTO jobnet_app.basicinfo (
                    id,
                    firstname,
                    middlename,
                    surname,
                    birthdate,
                    sex,
                    extention,
                    civilstatus,
                    height_m,
                    bloodtype,
                    weight_kg
                )
                VALUES (%s,'%s','%s','%s','%s','%s','%s','%s',%s,'%s',%s);
            """ % (
                user,
                firstname,
                middlename,
                lastname,
                birthdate,
                sex,
                extension,
                civilstatus,
                height_m,
                bloodtype,
                weight_kg
            ))
            unformatted_query_result = raw_sql_insert(query, "default")
            if unformatted_query_result:
                raw_sql_commit("default")
                return Response({
                    "success": True,
                    "message": "Your basic personal information has been updated successfully."
                }, status=status.HTTP_201_CREATED)
            else:
                raw_sql_rollback("default")
                return Response({
                    "success": False,
                    "message": "There was a problem updating your personal information."
                }, status=status.HTTP_400_BAD_REQUEST)
        except Exception as e:
            traceback.print_exc()
            return Response({
                "success": False,
                "message": "Internal System Error: " + str(e)
            }, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
With the above setup, I get this error:
Traceback (most recent call last):
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 83, in _execute
return self.cursor.execute(sql)
psycopg2.ProgrammingError: column "none" does not exist
LINE 4: ...on','2002-03-18T06:18:45.284Z','Male','','Single',None,'B+',...
^
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\PGADN\Projects\pgadn-v2-website\adnwebsite\reactify\utils.py", line 17, in raw_sql_insert
cn.execute(query)
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 100, in execute
return super().execute(sql, params)
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 68, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 77, in _execute_with_wrappers
return executor(sql, params, many, context)
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 85, in _execute
return self.cursor.execute(sql, params)
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "C:\Users\Acer\Envs\adnwebsite-react\lib\site-packages\django\db\backends\utils.py", line 83, in _execute
return self.cursor.execute(sql)
django.db.utils.ProgrammingError: column "none" does not exist
LINE 4: ...on','2002-03-18T06:18:45.284Z','Male','','Single',None,'B+',...
I am using:
Django 2.0.6
Django Rest Framework 3.10.3
PostgreSQL 11.3 (on my development machine) and PostgreSQL 12.1 (on the production server), although the error above has not yet been reproduced on the server
Note: raw_sql_select, raw_sql_insert, raw_sql_commit, raw_sql_rollback are simply custom-made helper functions which handle the actual cursor execution in the background.
Use SQL parameters instead of % string formatting to build your insert statement.
For example:
# **WRONG**
>>> cur.execute("INSERT INTO numbers VALUES (%s, %s)" % (10, 20))
# **correct**
>>> cur.execute("INSERT INTO numbers VALUES (%s, %s)", (10, 20))
You must be using cursor.execute in your helper functions. Pass the values as a list as the second parameter to cursor.execute, as shown above.
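Applied to the INSERT from the question, a hedged sketch could look like the following. It assumes your custom raw_sql_insert helper is extended to forward a params sequence to cursor.execute(sql, params); None values are then sent to PostgreSQL as real NULLs instead of the literal text None.
query = """
    INSERT INTO jobnet_app.basicinfo (
        id, firstname, middlename, surname, birthdate, sex,
        extention, civilstatus, height_m, bloodtype, weight_kg
    )
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s);
"""
params = [
    user, firstname, middlename, lastname, birthdate, sex,
    extension, civilstatus, height_m, bloodtype, weight_kg,
]
# Hypothetical third argument: the helper would need to pass `params`
# through to cursor.execute(query, params) instead of formatting with %.
unformatted_query_result = raw_sql_insert(query, "default", params)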
Follow this link to read more.

ConfigParser.NoSectionError: No section: 'options'

I tried to read an Odoo 8 config file, but I am getting the error shown below.
My config file looks like:
[options]
#This is the password that allows database operations:
admin_passwd = admin
db_host = False
db_port = False
db_user = kabeer
db_password = password
addons_path = /OdooSass/odoo-8.0-20151229/openerp/addons
My code looks like:
def get_odoo_config():
    res = {}
    if not args.get('odoo_config'):
        return res
    p = ConfigParser.ConfigParser()
    log('Read odoo config', args.get('odoo_config'))
    p.read(args.get('odoo_config'))
    for (name, value) in p.items('options'):
        if value == 'True' or value == 'true':
            value = True
        if value == 'False' or value == 'false':
            value = False
        res[name] = value
    return res
The error is:
Traceback (most recent call last):
File "saas.py", line 118, in <module>
odoo_config = get_odoo_config()
File "saas.py", line 110, in get_odoo_config
for (name, value) in p.items('options'):
File "/usr/lib/python2.7/ConfigParser.py", line 642, in items
raise NoSectionError(section)
ConfigParser.NoSectionError: No section: 'options'
How can I fix this error?
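Not part of the original question, but a minimal check that often explains this error: ConfigParser.read() silently skips files it cannot open and returns the list of files it actually parsed, so an empty list (for example, caused by a wrong path) leads straight to NoSectionError. A hedged sketch, assuming Python 2's ConfigParser as in the traceback:
import ConfigParser

path = args.get('odoo_config')
p = ConfigParser.ConfigParser()
parsed = p.read(path)            # list of files that were actually parsed
if not parsed:
    raise IOError('Config file not found or unreadable: %r' % path)
if not p.has_section('options'):
    raise ValueError('No [options] section in %r; found sections: %s'
                     % (path, p.sections()))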

MailBox to csv using Python

I have downloaded an archive of mails from my Gmail account. I am using the following Python (2.7) code, taken from a blog, to convert the contents of the archive to CSV.
import mailbox
import csv

writer = csv.writer(open("clean_mail.csv", "wb"))
for message in mailbox.mbox('archive.mbox'):
    writer.writerow([message['subject'], message['from'], message['date']])
I want to include the body of the mail (the actual messages) too, but couldn't figure out how. I have not used Python before; can someone please help? I have tried other SO suggestions but couldn't get through.
To do the same task, I have also tried the following code, but I get an indentation error at line 60: return json_msg. I have tried different indentation options but nothing improved.
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'Users/mymachine/client1/Takeout/Mail/archive.mbox'
OUT_FILE = 'Users/mymachine/client1/Takeout/Mail/archive.mbox.json'

def cleanContent(msg):
    msg = quopri.decodestring(msg)
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.
class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)

def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')
    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')
    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() == 'multipart':
            continue
        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date': millis}
    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()
Try this.
import mailbox
import csv

writer = csv.writer(open("clean_mail.csv", "wb"))
for message in mailbox.mbox('archive.mbox'):
    if message.is_multipart():
        content = ''.join(part.get_payload() for part in message.get_payload())
    else:
        content = message.get_payload()
    writer.writerow([message['subject'], message['from'], message['date'], content])
or this:
import mailbox
import csv

def get_message(message):
    if not message.is_multipart():
        return message.get_payload()
    contents = ""
    for msg in message.get_payload():
        contents = contents + str(msg.get_payload()) + '\n'
    return contents

if __name__ == "__main__":
    writer = csv.writer(open("clean_mail.csv", "wb"))
    for message in mailbox.mbox("archive.mbox"):
        contents = get_message(message)
        writer.writerow([message["subject"], message["from"], message["date"], contents])
Find the documentation here.
A little improvement on Rahul's snippet for multipart content:
import sys
import mailbox
import csv
from email.header import decode_header

infile = sys.argv[1]
outfile = sys.argv[2]
writer = csv.writer(open(outfile, "w"))

def get_content(part):
    content = ''
    payload = part.get_payload()
    if isinstance(payload, str):
        content += payload
    else:
        for part in payload:
            content += get_content(part)
    return content

writer.writerow(['date', 'from', 'to', 'subject', 'content'])
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['date'],
        message['from'].strip('>').split('<')[-1],
        message['to'],
        decode_header(message['subject'])[0][0],
        content
    ]
    writer.writerow(row)
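For reference, a short usage sketch, assuming the script above is saved under a hypothetical name such as mbox_to_csv.py:
# Input mbox as the first argument, output CSV as the second.
# python mbox_to_csv.py archive.mbox clean_mail.csv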

Cookie for Django

I use Django 1.6 and the following error occurred:
Exception happened during processing of request from ('server ip ', 54687)
Traceback (most recent call last):
File "/usr/local/Python2.7.5/lib/python2.7/SocketServer.py", line 593, in process_request_thread
self.finish_request(request, client_address)
File "/usr/local/Python2.7.5/lib/python2.7/SocketServer.py", line 334, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/local/Python2.7.5/lib/python2.7/site-packages/django/core/servers/basehttp.py", line 126, in __init__
super(WSGIRequestHandler, self).__init__(*args, **kwargs)
File "/usr/local/Python2.7.5/lib/python2.7/SocketServer.py", line 649, in __init__
self.handle()
File "/usr/local/Python2.7.5/lib/python2.7/wsgiref/simple_server.py", line 124, in handle
handler.run(self.server.get_app())
File "/usr/local/Python2.7.5/lib/python2.7/wsgiref/handlers.py", line 92, in run
self.close()
File "/usr/local/Python2.7.5/lib/python2.7/wsgiref/simple_server.py", line 33, in close
self.status.split(' ',1)[0], self.bytes_sent
AttributeError: 'NoneType' object has no attribute 'split'
I set the cookie from a client program. Why does this happen?
This is the client-side C# (Unity) code that creates the cookie:
Dictionary<string,string> headers = www.responseHeaders;
foreach (KeyValuePair<string,string> kvPair in headers)
{
    if (kvPair.Key.ToLower ().Equals ("set-cookie"))
    {
        string stHeader = "";
        string stJoint = "";
        string[] astCookie = kvPair.Value.Split (
            new string[] {"; "}, System.StringSplitOptions.None);
        foreach (string stCookie in astCookie)
        {
            if (!stCookie.Substring (0, 5).Equals ("path="))
            {
                stHeader += stJoint + stCookie;
                stJoint = "; ";
            }
        }
        if (stHeader.Length > 0)
        {
            this.hsHeader ["Cookie"] = stHeader;
        }
        else
        {
            this.hsHeader.Clear ();
        }
    }
}
And I set it like this:
WWW www = new WWW (openURL, null, this.hsHeader);
I think the server side is okay, because it runs fine from a browser.
stHeader looks like this:
csrftoken=2YiHLunt9QzqIavUbxfhp2sPVU22g3Vv; expires=Sat, 09-Apr-2016 00:11:19 GMT; Max-Age=31449600

pymongo.errors.OperationFailure saying ns does not exist

I got the following error:
Traceback (most recent call last):
pymongo.errors.OperationFailure: command SON([('mapreduce', u'tweets'),
    ('map', Code('function() { emit(this.via, 1); }', {})),
    ('reduce', Code(' function(key,value) {\n var res = 0;\n values.forEach(function(v) {res += 1 })\n return {count: res};\n }\n ', {})),
    ('out', 'via_count')]) on namespace Corpus.$cmd failed: ns doesn't exist
The code is:
from pymongo import MongoClient
from bson.code import Code

con = MongoClient()
db = con.Corpus
tweets = db.tweets

map = Code("function() { emit(this.via, 1); }")
reduce = Code(""" function(key,value) {
        var res = 0;
        values.forEach(function(v) {res += 1 })
        return {count: res};
    }
""")

result = tweets.map_reduce(map, reduce, "via_count")
for doc in db.via_count.find():
    print(doc)
on namespace Corpus.$cmd failed: ns doesn't exist
This means you don't have a collection named 'tweets' (or 'via_count') in the Corpus database.
Unlike a raw query in MongoDB, the map_reduce function does not create the collection if it does not exist.
Even if the database itself does not exist, it raises the same error.
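As a hedged illustration (not part of the answer above), you can verify that the source collection exists before calling map_reduce; the exact call depends on your pymongo version:
from pymongo import MongoClient

con = MongoClient()
db = con.Corpus

# list_collection_names() on newer pymongo; collection_names() on older drivers.
if 'tweets' not in db.list_collection_names():
    raise RuntimeError("Collection 'tweets' does not exist in Corpus; "
                       "insert documents into it before running map_reduce.")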