Null Byte appending while reading the file through Python pandas - python-2.7

I have created a script that finds the matching rows between two files. I then pass the output file to a function, which uses it as input to create a pivot with pandas.
But something seems to be wrong; below is the code snippet:
import pandas as pd

def CreateSummary(file):
    out_file = file
    file_df = pd.read_csv(out_file)  ## This function is appending NULL
                                     ## bytes at the end of the file
    #print file_df.head(2)
The above code gives me the error:
ValueError: No columns to parse from file
Tried another approach:
file_df = pd.read_csv(out_file, delim_whitespace=True, engine='python')
## This gives me the error:
_csv.Error: line contains NULL byte
Any suggestions and criticism are highly appreciated.
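One common workaround (a sketch, not taken from the question) is to strip the NUL bytes before handing the text to the parser. The same idea works for pd.read_csv by wrapping the cleaned bytes in io.BytesIO; the stdlib-only illustration below uses a hypothetical in-memory sample in place of out_file:

```python
import csv
import io

# Hypothetical stand-in for a file whose tail is padded with NUL bytes.
raw = b"a,b\n1,2\n\x00\x00"

# Drop every NUL byte, then parse as usual. With pandas you would pass
# io.BytesIO(cleaned-bytes) to pd.read_csv instead of using the csv module.
cleaned = raw.replace(b"\x00", b"").decode("utf-8")
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows)  # [['a', 'b'], ['1', '2']]
```

This only masks the symptom, of course; it is still worth finding out which step writes the NUL bytes in the first place.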

Related

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert line elements, trying int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert the strings in the CSV file to more useful types, and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the Python script below, but the output is always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value; this was tested on www.regex101.com.
When I use this in Python, no results are written to the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine the root cause of the missing output and validate whether '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re

filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here I have tried to write some sample code. I hope this will help you solve the problem:
import re

str = "callback=B~12385730561818101591"
rc = re.match(r'.*=B\~([0-9A-Ba-b]+)', str)
print rc.group(1)
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
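A quick sanity check of that pattern against the sample value from the question (a sketch, using re.search as your script does):

```python
import re

# Sample value copied from the question.
item = "callback=B~12385730561818101591"

# '=.+~' skips over the 'B' (or anything else) between '=' and '~',
# and the capturing group keeps only the digits.
m = re.search(r'=.+~([0-9]+)', item)
print(m.group(1))  # 12385730561818101591
```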
EDIT
Following the newly provided information:
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)
    filename1.write(number + '\n')
...
Concerning your line split on the \t (tabulation), you should show an example of the full line.

How to find specific part of line from file and make list of them?

Okay so if my file looks like this:
"1111-11-11";1;99.9;11;11.1;11.1
"2222-22-22";2;88.8;22;22.2;22.2
"3333-33-33";3;77.7;3.3;33.3;33.3
How can I read only the parts "99.9", "88.8" and "77.7" from that file and make a list [99.9, 88.8, 77.7]? Basically, I want to find the parts after n semicolons.
You can open the file and read each line with the open command; for CSV, your code might look like:
import csv

with open('filename.csv', 'rb') as f:
    reader = csv.reader(f)
    listOfRows = list(reader)
You will now have a list of lines; each line requires some processing.
If your lines always have the same structure, you can split them on a ;
list_in_line = line.split(";")
and take the third element of that line.
Please show us some of your work, or better, explain the structure of your data.
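Putting those two steps together on the sample data from the question (a sketch; the in-memory string stands in for the file), the quoted first field is handled by giving csv.reader the ; delimiter:

```python
import csv
import io

# In-memory stand-in for the file shown in the question.
data = ('"1111-11-11";1;99.9;11;11.1;11.1\n'
        '"2222-22-22";2;88.8;22;22.2;22.2\n'
        '"3333-33-33";3;77.7;3.3;33.3;33.3\n')

# The part after the second semicolon is the third field (index 2).
values = [float(row[2]) for row in csv.reader(io.StringIO(data), delimiter=';')]
print(values)  # [99.9, 88.8, 77.7]
```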

Error on pandas.read_hdf

I created an HDF5 file with:
pfad = "E:\Geld\Handelssysteme\Kursdaten\Ivolatity/Daten Monatsoptionen/ODAX_alles.h5"
df.to_hdf(pfad,'df', format='table')
Now I want to read and put a portion of the table back into a dataframe without reading all of the lines in the file.
I tried
df=pandas.read_hdf('pfad', 'df', where = ['expiration<expirations[1] and expiration>=expirations[0]'])
where expirations is a list that contains datetime64[ns] values and I want to get a dataframe where the values in column "expiration" are between expirations[1] and expirations[0].
However, I get a KeyError: 'No object named df in the file'
What would the right syntax be?
The following works instead:
hdf = pandas.HDFStore(pfad)
df = hdf.select('df')

Extract individual tweets from a text file with no line breaks using Python

I am trying to read tweets from a text file from a URL
http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt
Tweets in the file are listed on a single line (there are no line breaks) and separated by the string "EndOfTweet".
I am reading the file using the following code:
import urllib2
wfd = urllib2.urlopen('http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt')
data = wfd.read()
I understand that I have to split on "EndOfTweet" in order to separate the tweets, but since there is only one line, I do not understand how to loop through the file and separate each one.
for line in data:
    line = data.split('EndOfTweet')
You're so close!
By the time you've called wfd.read(), data will contain the raw text of that file. The normal way to loop over a file is to call something like for line in the_file, which just looks for newlines to split the data on. In this case, your data doesn't contain the normal newline terminator; instead, the file uses the text EndOfTweet to separate your lines. Here's what you should have done:
import urllib2
import json

wfd = urllib2.urlopen('http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt')
data = wfd.read()
for line in data.split('EndOfTweet'):
    if not line.strip():
        continue  # skip the empty chunk left after the final delimiter
    # From here, line will contain a single tweet. It appears each line is a JSON parsable structure.
    decoded_line = json.loads(line)
    # Now, let's print out the text of the tweet to show we can
    print decoded_line.get(u'text')
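To see the split-then-decode idea in isolation, here is a small sketch with a made-up two-tweet string in place of the downloaded data (the real file's JSON objects have more fields than just text):

```python
import json

# Made-up miniature stand-in for the downloaded data.
data = '{"text": "first tweet"}EndOfTweet{"text": "second tweet"}EndOfTweet'

texts = []
for line in data.split('EndOfTweet'):
    if not line.strip():
        continue  # the trailing delimiter leaves an empty chunk
    texts.append(json.loads(line)['text'])
print(texts)  # ['first tweet', 'second tweet']
```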