I have this dataset which I need to parse. I'm not sure if it is a common format. It's not one I recognise, but that doesn't rule out a lot. If anyone can identify it, and perhaps point me to a python library that will help to parse it, that would be great!
Here's a sample:
{"demand":[["2021-07-28T07:20+09:30",24.4],["2021-07-28T07:25+09:30",24.8],["2021-07-28T07:30+09:30",25.0],...,["2021-07-28T19:20+09:30",26.1]]}
From the sample you shared, it looks like JSON to me.
You can load JSON in Python like this:
import json
# Open the JSON file; the with block closes it automatically
with open('data.json', 'r') as f:
    # json.load returns the JSON object as a dictionary
    data = json.load(f)
# Iterate through the list of [timestamp, value] pairs
for i in data["demand"]:
    print(i)
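Each entry in "demand" is a [timestamp, value] pair, and the timestamps look like ISO 8601 strings with a UTC offset. Here is a minimal sketch that turns the pairs into datetime objects, assuming Python 3.7+ (for datetime.fromisoformat) and using a shortened copy of the sample rather than the real file:

```python
import json
from datetime import datetime

# Shortened copy of the sample from the question (the "..." elision removed).
sample = '{"demand":[["2021-07-28T07:20+09:30",24.4],["2021-07-28T07:25+09:30",24.8],["2021-07-28T07:30+09:30",25.0]]}'

data = json.loads(sample)

# Convert each [timestamp, value] pair into a (datetime, float) tuple.
readings = [(datetime.fromisoformat(ts), value) for ts, value in data["demand"]]

for when, value in readings:
    print(when.isoformat(), value)
```

The resulting datetime objects keep the +09:30 offset, so they can be compared or converted to another timezone later.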
I'm trying to read all the YAML files in a directory, but I am having trouble. I am using Python 2.7 (and cannot switch to 3), and all of my files are UTF-8 (and need to stay that way).
import os
import yaml
import codecs
def yaml_reader(filepath):
with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
data = yaml.load_all(file_descriptor)
return data
def yaml_dump(filepath, data):
with open(filepath, 'w') as file_descriptor:
yaml.dump(data, file_descriptor)
if __name__ == "__main__":
filepath = os.listdir(os.getcwd())
data = yaml_reader(filepath)
print data
When I run this code, python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath.
os.listdir(os.getcwd()) returns a list of all the entries in the directory, so you are passing a list to codecs.open() instead of a single filename.
There are multiple problems with your code, apart from it being invalid Python in the way you formatted it here.
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        # load_all is lazy, so build the list while the file is still open
        data = list(yaml.load_all(file_descriptor))
    return data
I hope you realise you're trying to load multiple documents, so you always get a list as a result in data, even if your file contains one document.
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way, which of the files you want to open. If you want to combine all files (assuming they are YAML) in one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
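Note that os.listdir() returns every entry in the directory (including the script itself and any subdirectories), and the names are relative to that directory. Here is a rough sketch of one way to keep only YAML files; the helper name list_yaml_files and the .yaml/.yml extensions are assumptions for illustration:

```python
import os


def list_yaml_files(directory):
    """Return sorted full paths of regular files ending in .yaml or .yml."""
    paths = []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        # Skip subdirectories and anything without a YAML extension.
        if os.path.isfile(full_path) and name.lower().endswith(('.yaml', '.yml')):
            paths.append(full_path)
    return sorted(paths)


if __name__ == "__main__":
    for path in list_yaml_files(os.getcwd()):
        print(path)
```

Each returned path could then be passed to yaml_reader() in place of the bare names from os.listdir().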
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special int/float representations, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3-like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib2 import Path  # on Python 2 the pathlib2 backport is imported under this name
from ruamel.yaml import YAML
if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))
I want to connect to an S3 bucket, get the CSV files and copy the rows to an RDS database. The script uses arcpy, a package I'm not that familiar with; I'm just trying to use the CSV file in the S3 bucket directly as the source, without downloading it to the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection
access_key = ''
secret_key = ''
conn = boto.connect_s3(aws_access_key_id = access_key,aws_secret_access_key = secret_key,host = 's3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url
##Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))
##CopyRows_management (in_rows, out_table, {config_keyword})
#http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
Error: in CopyRows arcgisscripting.ExecuteError. Failed to execute Parameters are not valid
If we use a path on the server as source path like below it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.
It looks like you are trying to use the Pandas dataframe as the table to read from with CopyRows_management? I don't think that is a valid input for the function, thus the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyways.
So either save the CSV somewhere the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, read the contents of the CSV and iterate through it, using an Insert Cursor to write to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table.
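As a rough sketch of that approach: the CSV text (for example, the string returned by k.get_contents_as_string() in the question) is parsed with the csv module, and the rows are written with arcpy.da.InsertCursor. The helper names, the sample text, and the assumption that the CSV header names match the target table's field names are mine, not from the question. The arcpy call is left commented out so the parsing part runs without ArcGIS.

```python
import csv
import io


def parse_csv_text(content):
    """Split CSV text into a header list and a list of data rows."""
    reader = csv.reader(io.StringIO(content))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows


def insert_rows(table_path, header, rows):
    """Write rows to an existing table; assumes header matches its field names."""
    import arcpy  # imported here so the parsing part runs without ArcGIS
    with arcpy.da.InsertCursor(table_path, header) as cursor:
        for row in rows:
            cursor.insertRow(row)


# Hypothetical CSV content standing in for the S3 object's contents.
content = u"id,name\n1,alpha\n2,beta\n"
header, rows = parse_csv_text(content)
print(header, rows)
# insert_rows("RDSTablePath", header, rows)  # requires arcpy and a matching table
```

The csv module returns every value as a string, so values may need converting to the field types of the target table before they are inserted.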
If your RDS instance happens to be Aurora MySQL, take a look at the Loading Data from S3 feature, which lets you skip the code and load the file line by line into your DB.
I'm trying to access the data within the logs that are printed. Here's how to set up logging:
import logging
import httplib as http_client
http_client.HTTPConnection.debuglevel = 1
#Initialize Logging
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True
When I make an HTTP request using Python Requests,
r = requests.get(api + 'download?name=XXXX', headers=header)
it prints out a bunch of lines of log. This is the one I'm interested in:
header: Location: http://09.bm-data-api.prod.XXXXXXX.net/download/XXXXX
How do I pass this header info back into a variable to use?
Thanks
Dave
The value you need has been logged (presumably to a file). So you'll need to read the contents of the logfile, find the line(s) you're interested in and perform the necessary work that you wish to do. You can read a file with:
with open(myfile) as f:
    data = f.readlines()
and you can then iterate over data to find the lines that you're interested in.
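A minimal sketch of that search, assuming the log lines have the same "header: Location: ..." shape as the one in the question; the function name find_location and the first sample line are illustrative:

```python
def find_location(lines):
    """Return the URL from the first 'header: Location:' line, or None."""
    prefix = "header: Location:"
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            # Everything after the prefix is the header value.
            return line[len(prefix):].strip()
    return None


# Sample lines; the second is the one quoted in the question.
data = [
    "header: Content-Length: 0\n",
    "header: Location: http://09.bm-data-api.prod.XXXXXXX.net/download/XXXXX\n",
]
location = find_location(data)
print(location)
```

If that line comes from a redirect response, the same value may also be available directly from the response objects (for example via r.history[0].headers.get('Location')), which would avoid parsing the log at all.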
I'm trying to upload and parse json files using django. Everything works great up until the moment I need to parse the json. Then I get this error:
No JSON object could be decoded: line 1 column 0 (char 0)
Here's my code. (I'm following the instructions here, and overwriting the handle_uploaded_file method.)
def handle_uploaded_file(f, collection):
    # assert False, [f.name, f.size, f.read()[:50]]
    t = f.read()
    for j in serializers.deserialize("json", t):
        add_item_to_database(j)
The weird thing is that when I uncomment the "assert" line, I get this:
[u'myfile.json', 59478, '']
So it looks like my file is getting uploaded with the right size (I've verified this on the server), but the read command seems to be failing entirely.
Any ideas?
I've seen this before. Your file has a length, but reading it returns nothing. I'm wondering if it has already been read... try this:
f.seek(0)
f.read()
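To illustrate why this helps, here is a small sketch using io.BytesIO as a stand-in for the uploaded file (an assumption; Django's upload object is not used here). Once a file-like object has been read to the end, a second read() returns nothing until the position is reset with seek(0):

```python
import io

f = io.BytesIO(b'{"key": "value"}')  # stand-in for the uploaded file

first = f.read()    # reads everything and leaves the position at the end
second = f.read()   # nothing left to read, so this is empty

f.seek(0)           # rewind to the start
third = f.read()    # the full content is available again

print(len(first), len(second), len(third))
```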
I have a txt file that I would like to alter so that I can place the data into columns (see the example below). The reason is so I can import the data into a database/array and perform calculations on it. I tried importing/pasting the data into LibreCalc, but it either imports everything into one column or opens the file in LibreWriter. I'm using Ubuntu 10.04. Any ideas? I'm willing to use another program to work around this issue. I could also work with a comma-delimited file, but I'm not too sure how to convert the data to that format automatically.
Trying to get this:
WAVELENGTH, WAVENUMBER, INTENSITY, CLASSIFICATION, CODE,
1132.8322, 88274.326, 2300, PT II, 9356- 97630, 05,
Here's a link to the full file.
pt.txt file
Try this:
sed -r 's/(\s+)/,\1/g' pt.txt
is this what you want?
awk 'BEGIN{OFS=","}NF>1{$1=$1;print}' pt.txt
If you want the output to look better and you have "column" installed, you can try this too:
awk 'BEGIN{OFS=", "}NF>1{$1=$1;print}' pt.txt|column -t
The awk and sed one-liners are cool, but I expect you'll end up needing to do more than simply splitting up the file. If you do, and if you have access to Python 2.7, the following little script will get you going.
# -*- coding: utf-8 -*-
"""Convert to comma-delimited"""
import csv
from os import path
import re
import sys
def splitline(line):
    # Columns are separated by runs of two or more whitespace characters
    return re.split(r'\s{2,}', line)
def main():
    srcpath = path.abspath(sys.argv[1])
    targetpath = path.splitext(srcpath)[0] + '.csv'
    with open(srcpath) as infile, open(targetpath, 'w') as outfile:
        writer = csv.writer(outfile)
        for line in infile:
            if line.startswith(' '):
                line = line.strip()
                cols = splitline(line)
                writer.writerow(cols)
if __name__ == '__main__':
    main()
The easiest way turned out to be importing using a fixed width like tohuwawohu suggested
Thanks
Without transforming it to a comma-separated file, you could access the csv import options by simply changing the file extension to .csv (maybe you should remove the "header" part manually, so that only the column heads and the data rows remain). After that, you can try to use whitespace as the column delimiter, or even easier: select "fixed width" and set the columns manually. – tohuwawohu Oct 20 at 9:23