I am getting an error with the code below. One DataFrame, df, is read from a JSON API and a second, df2, is read from a CSV. I want to compare a column of the CSV against the API data and then save the matched values into a new CSV. Can anyone help me?
df2=pd.read_csv(file_path)
r = requests.get('https://data.ct.gov/resource/6tja-6vdt.json')
df = pd.DataFrame(r.json())
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'],'True', 'False')
print(df)
Probably just change
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'],'True', 'False')
to this:
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']
assuming the dtypes are correct.
If the indexes are different on the two dataframes, you might need to .reset_index().
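For completeness, a minimal sketch of the whole flow (assuming both frames end up with the same number of rows in the same order once the indexes are reset; file_path and the API URL come from the question, and 'matched.csv' is just an illustrative output name):

import pandas as pd
import requests

df2 = pd.read_csv(file_path)
r = requests.get('https://data.ct.gov/resource/6tja-6vdt.json')
df = pd.DataFrame(r.json())

# elementwise == aligns on the index, so reset both to a plain RangeIndex
df = df.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']

# save only the rows that matched to a new csv
df[df['verified']].to_csv('matched.csv', index=False)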
I am trying to read a csv file that is in my S3 bucket. I would like to do some manipulations and then finally convert to a dynamic dataframe and write it back to S3.
This is what I have tried so far:
Pure Python:
Val1=""
Val2=""
cols=[]
width=[]
with open('s3://demo-ETL/read/data.csv') as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
for row in readCSV:
print(row)
if ((Val1=="" ) & (Val2=="")):
Val1=row[0]
Val2=row[0]
cols.append(row[1])
width.append(int(row[4]))
else:
continues...
Here I get an error that says it cannot find the file in the directory at all.
Boto3:
import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
contents = data['Body'].read()
print(contents)
for row in contents:
    if ((Val1 == "") & (Val2 == "")):
        Val1 = row[0]
        Val2 = row[0]
        cols.append(row[1])
        width.append(int(row[4]))
    else:
        continues...
Here it says the index is out of range, which is strange because I have 4 comma-separated values in the csv file. When I look at the results from print(contents), I see that it's putting each character in a list, instead of putting each comma-separated value in a list.
Is there a better way to read the csv from s3?
I ended up solving this by reading it as a pandas dataframe. I first created an object with boto3, then read the whole object into a pandas DataFrame, which I then converted into a list.
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('demo-ETL')
obj = bucket.Object(key='read/data.csv')
dataFrame = pd.read_csv(obj.get()['Body'])
l = dataFrame.values.tolist()
for i in l:
    print(i)
get_object returns the Body response value which is of type StreamingBody. Per the docs, if you're trying to go line-by-line you probably want to use iter_lines.
For example:
import boto3
s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
file_lines = data['Body'].iter_lines()
print(file_lines)
This probably does more of what you want.
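A rough follow-up sketch (reusing the bucket and key from the question): feed each decoded line to the csv module so every comma-separated value comes back as a list element.

import csv
import boto3

s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')

# iter_lines() yields one line at a time as bytes, so decode before parsing
for line in data['Body'].iter_lines():
    row = next(csv.reader([line.decode('utf-8')]))
    print(row)  # e.g. ['val1', 'val2', 'val3', 'val4']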
You can use Spark to read the file like this:
df = spark.read.\
    format("csv").\
    option("header", "true").\
    load("s3://bucket-name/file-name.csv")
You can find more options here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']

import csv
with open(r'C:\LIST.csv', 'wb') as f:
    w = csv.writer(f)
    for i in L1:
        w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax for writing each element of the list to a specific column (this is in Python 2.7). Thank you for your help!
(For this script I must use IronPython, and only the built-in libraries that come with IronPython.)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset
csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''
# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')
L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
    # Each row is a tuple, and tuples don't support assignment.
    # Convert to a list first so we can modify it.
    row = list(imported_data[i])
    # Put our value in the 5th column (index 4).
    row[4] = L1[i]
    # Store the row back into the Dataset.
    imported_data[i] = row
# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3
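If you really are limited to the libraries bundled with IronPython, a rough sketch using only the built-in csv module could look like this (Python 2.7 style, rewriting the whole file in place; adjust the row offset if your file has a header row):

import csv

L1 = ['La', 'Lb', 'Lc']

# read every row into memory
with open(r'C:\LIST.csv', 'rb') as f:
    rows = list(csv.reader(f))

# overwrite column no. 5 (index 4) of the first len(L1) rows
for i, value in enumerate(L1):
    rows[i][4] = value

# write the modified rows back out
with open(r'C:\LIST.csv', 'wb') as f:
    csv.writer(f).writerows(rows)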
I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 tsv files (each has two columns without a header: one is a contig_number, the other is a count).
If a tsv does not have a count for a particular contig_number, it does not have that row, so I am attempting to use how='outer' and fill in the missing values with 0 afterwards.
I have been successful with the tsvs that I subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns per file), more and more memory is used...
I got to 500 GB of RAM on the server and called it a day.
This is the code that is successful on the subsetted tsvs:
import glob
import logging
from functools import reduce

import pandas as pd

files = glob.glob('*_count.tsv')
data_frames = []

logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)

logging.info("Merging the tables on contig, and fill in samples with no counts for contigs")
# merge the tables on contig and select how='outer', which will include all rows
# but will leave empty space where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)

logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index
    data_frames.append(df)

df_full = pd.concat(data_frames, axis=1)
and then df_full=df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do better yet by treating them as Series instead of DataFrames.
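A sketch of that variant, assuming the same file layout as above ("combined_count_file.tsv" is just an illustrative output name):

import glob
import pandas as pd

files = glob.glob('*_count.tsv')

series_list = []
for fp in files:
    # take the single data column as a Series named after the file
    s = pd.read_csv(fp, sep='\t', header=None, index_col=0).iloc[:, 0]
    s.name = str(fp)
    s.index.name = "contig"
    series_list.append(s)

# concat aligns the Series on the contig index (an outer join by default)
df_full = pd.concat(series_list, axis=1).fillna(0)
df_full.to_csv("combined_count_file.tsv", sep='\t')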
I have a file with millions of records like this
2017-07-24 18:34:23|CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2
Each record contains around 30 key-value pairs with a "|" delimiter. The positions of the key-value pairs are not constant.
Trying to parse these records using python dictionary or list concepts.
Note: 1st column is not in key-value format
Your file is basically a |-separated csv file holding the timestamp first, then fields whose key and value are separated by :.
So you could use the csv module to read the cells, then pass the result of str.split to dict in a generator expression to build the dictionary for all elements but the first one.
Then update the dict with the timestamp:
import csv

list_of_dicts = []
with open("input.txt") as f:
    cr = csv.reader(f, delimiter="|")
    for row in cr:
        d = dict(v.split(":") for v in row[1:])
        d["date"] = row[0]
        list_of_dicts.append(d)
list_of_dicts contains dictionaries like
{'date': '2017-07-24 18:34:23', 'PROTOCOL': 'SSL-V1.2', 'RESPONSETIME': '23', 'CN': 'SSL', 'CLIENTIP': '127.0.0.9', 'BYTESIZE': '1456'}
Repeat the process below for all the lines in your file. I am not clear about the date-time value, so I haven't included it in the input; you can include it based on your understanding.
import re

given = "CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2"
results = dict()
list_for_this_line = re.split(r'\|', given)
for i in range(len(list_for_this_line)):
    separated_k_v = re.split(':', list_for_this_line[i])
    results[separated_k_v[0]] = separated_k_v[1]
print results
Hope this helps!
I have a field/column in a .csv file that I am loading into Pandas that will not parse as a datetime data type in Pandas. I don't really understand why. I want both FirstTime and SecondTime to parse as datetime64 in Pandas DataFrame.
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
          'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime', 'SecondTime'])
The code above will only parse SecondTime as datetime64[ns]. FirstTime is left as an object data type. If I do the following code instead:
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
          'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime'])
It still will not parse FirstTime as a datetime64[ns].
The format for both columns is the same:
# Example FirstTime
# (%f is always .000)
2015-11-05 16:52:37.000
# Example SecondTime
# (%f is always .000)
2015-11-04 15:33:15.000
What am I missing here? Is the first column not able to be datetime by default or something in Pandas?
Did you try:
df = pd.read_csv('MyData.csv', names=header, parse_dates=True)
I had a similar problem, and it turned out one of my date columns contained an integer cell, so Python recognized that column as "object" while the other one was recognized as "int64". You need to make sure the cells in both columns are consistently formatted.
You can use df.dtypes to see the format of your variables.
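If that is the case here, one way to hunt down the offending cells is to coerce the column and look at what fails to parse (a sketch using the column names from the question; 'bad' is just an illustrative variable name):

import pandas as pd

header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
          'Col5', 'Col6', 'Col7', 'Col8']
df = pd.read_csv('MyData.csv', names=header)

# values that don't match the expected format become NaT instead of raising
parsed = pd.to_datetime(df['FirstTime'], format='%Y-%m-%d %H:%M:%S.%f',
                        errors='coerce')
bad = df.loc[parsed.isna(), 'FirstTime']
print(bad.head())

# once the bad cells are fixed or dropped, assign the parsed column back;
# df.dtypes will then show datetime64[ns] for FirstTime
df['FirstTime'] = parsed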