Formatting thousand separator for numbers in a pandas dataframe - python-2.7

I am trying to write a dataframe to a csv and I would like the numbers in the .csv to be formatted with thousands-separator commas. I don't see any way in the to_csv docs to pass a format or anything like this.
Does anyone know a good way to be able to format my output?
My csv output looks like this:
12172083.89 1341.4078 -9568703.592 10323.7222
21661725.86 -1770.2725 12669066.38 14669.7118
I would like it to look like this:
12,172,083.89 1,341.4078 -9,568,703.592 10,323.7222
21,661,725.86 -1,770.2725 12,669,066.38 14,669.7118

Comma is the default field separator. If you want to choose your own separator, pass the sep parameter to the pandas to_csv() method.
df.to_csv(sep=',')
If your goal is to add thousands separators and export the result to a csv, you can follow this example:
import pandas as pd
df = pd.DataFrame([[12172083.89, 1341.4078, -9568703.592, 10323.7222],
                   [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
                  columns=['A', 'B', 'C', 'D'])
for c in df.columns:
    df[c] = df[c].apply(lambda x: '{0:,}'.format(x))
df.to_csv(sep='\t')
If you just want pandas to show separators when printed out:
pd.options.display.float_format = '{:,}'.format
print(df)

What you're looking to do has nothing to do with csv output but rather is related to the following:
print('{0:,}'.format(123456789000000.546776362))
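produces (on Python 3; a float keeps only about 15 to 17 significant digits, so the extra decimals in the literal are lost)
123,456,789,000,000.55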
See format string syntax.
Also, you'd do well to pay heed to @Peter's comment above about compromising the structure of a csv in the first place.
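To illustrate the concern about csv structure: once the numbers are strings containing commas, a comma-delimited writer has to quote every field, while a tab-delimited one does not. A minimal sketch, assuming the sample values from the question; the column names A to D are only illustrative:

```python
import pandas as pd

# Sample values from the question; the column names are illustrative.
df = pd.DataFrame(
    [[12172083.89, 1341.4078, -9568703.592, 10323.7222],
     [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
    columns=['A', 'B', 'C', 'D'],
)

# Turn every value into a string with thousands separators.
formatted = df.copy()
for c in formatted.columns:
    formatted[c] = formatted[c].map('{:,}'.format)

# With no path argument, to_csv returns the csv text instead of writing a file.
comma_csv = formatted.to_csv(index=False)          # fields containing ',' get quoted
tab_csv = formatted.to_csv(index=False, sep='\t')  # no quoting needed
print(comma_csv)
print(tab_csv)
```

Either way the columns now hold strings, so whatever reads the file back has to strip the separators before treating the values as numbers.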

Related

Any ideas on Iterating over dataframe and applying regex?

This may be a rudimentary problem but I am new to pandas.
I have a csv dataframe and I want to iterate over each row to extract all the string information in a specific column through regex. (The reason I am using regex is that eventually I want to make a separate dataframe from that column.)
I tried iterating with a for loop but I got a ton of errors. So far, it looks like the for loop reads each input row as a list or series rather than a string (correct me if I'm wrong). My main functions are iteritems() and findall(), but no good results so far. How can I approach this problem?
My dataframe looks like this:
df =pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]
My approach looks like this:
for Individual_row in df['TEXT'].iteritems():
    parsed = re.findall('(.*?)\:\s*?\[(.*?)\]', Individual_row)
    res = {g[0].strip() : g[1].strip() for g in parsed}
Many thanks in advance
You can try the following instead of a loop:
df['new_TEXT'] = df['TEXT'].map(lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?):\s*?\[(.*?)\]', x)], na_action='ignore')
This will create a new column with your resultant data.
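For a version that can be run end to end, here is a minimal sketch. The contents of the TEXT column are not shown in the question, so the sample rows and the 'key: [value]' layout are assumptions, and parse is a hypothetical helper name:

```python
import re
import pandas as pd

# Hypothetical rows; the real contents of the TEXT column are not shown in the question.
df = pd.DataFrame({'TEXT': ['name: [Alice] city: [Paris]', None]})

# Compile once; the raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r'(.*?):\s*?\[(.*?)\]')

def parse(text):
    """Return {key: value} for every 'key: [value]' pair found in text."""
    return {k.strip(): v.strip() for k, v in pattern.findall(text)}

# na_action='ignore' skips missing values instead of passing them to parse().
df['parsed'] = df['TEXT'].map(parse, na_action='ignore')
print(df['parsed'].tolist())
```

Note that iterating with iteritems() yields (index, value) pairs rather than plain strings, which is probably why findall() complained in the original loop.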

unexpected character after line continuation character. Also to keep rows after floating point rows in pandas dataframe

I have a dataset in which I want to keep only the row that comes right after each floating-point value row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also I don't know if this code will solve my purpose
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind+=1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it bare like that.
if re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: actually, I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a column mixing decimal and string types), you can find the decimal entries with the regex '\d+\.\d*' (use '\d+\.?\d*' to also match integers). If you shift this mask by one, it marks the entries right after the matches. Use that to select the rows you want in your dataframe.
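The same idea with the intermediate mask spelled out, as a minimal sketch on the sample column from the question. Passing fill_value=False to shift is my own choice here so the mask stays boolean and the comparison with True is not needed:

```python
import pandas as pd

data = pd.DataFrame({'col': ['17.3', 'Hi Hello', 'Pranjal', '17.1',
                             '[aasd]How are you', 'I am fine[:"]', 'Live Free']})

# True where the entry starts with digits followed by a decimal point.
is_number = data['col'].str.match(r'\d+\.\d*')

# Move the mask down one row; the first row has no predecessor, so fill with False.
after_number = is_number.shift(1, fill_value=False)

# Keep only the rows that directly follow a number row.
kept = data[after_number]
print(kept)
```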

Convert SList to Dataframe

I am reading data from a binary .out file using a python module, "SWMMToolbox". The command to read the infiltration time series for RG1 from file.out is as follows:
x = !swmmtoolbox extract 'file.out' subcatchment,RG1,Infiltration_loss
See link for details about swmmtoolbox.
The data type of 'x' is an 'IPython.utils.text.SList'
The data looks like this:
I would like to import this SList into pandas, but am having trouble. I want to get the datetime string as one column and the value after the comma as another. However, when I use
df = pd.DataFrame(data=x)
I get the following:
I also tried to use
df = pd.DataFrame.from_records(x)
but get this:
I tried to use pd.read_csv, but I couldn't get it to work since 'x' is a variable and not a file.
Any suggestions are much appreciated.
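One possible approach, as a minimal sketch: an SList behaves like a list of strings, so the lines can be joined into one text and handed to pd.read_csv through an in-memory buffer. The sample lines and the column names below are made up, since the real output is not shown; if the real output starts with a header line, drop header=None and names:

```python
import io
import pandas as pd

# Stand-in for the SList returned by the shell command; the values are made up.
x = ['2000-01-01 00:00:00,0.0',
     '2000-01-01 01:00:00,0.25',
     '2000-01-01 02:00:00,0.5']

# Join the lines into one csv text and let read_csv split on the comma.
buffer = io.StringIO('\n'.join(x))
df = pd.read_csv(buffer, header=None, names=['datetime', 'value'],
                 parse_dates=['datetime'])
print(df.dtypes)
```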

dataframe to dictionary:python

So, I have a file
F1.txt
CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,056572
0,0,4.0746,0309044
0,0,0.6182,0971094
0,0,15.4834,075614
I want to insert the column names and its dtype into a dictionary with the column names being the key and the corresponding dtype of the column being the value.
My read statement has to be like this:
csv=pandas.read_csv('F2.txt',dtype={'CTRS_ID':str})
I'm expecting something like this:
data = {'CDUS':'int64','CBSCS':'int64','CTRS':'float64','CTRS_ID':'str'}
Can someone help me with this? Thanks in advance.
You can use dtypes to find the type of each column and then transform the result to a dictionary with to_dict. Also, if you want a string representation of the type, you can convert the dtypes output to string:
csv=pandas.read_csv('F2.txt',dtype={'CTRS_ID':str})
csv.dtypes.astype(str).to_dict()
Which gives the output:
{'CBSCS': 'int64', 'CDUS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'object'}
This is actually the right result, since pandas treats string as object.
I don't have enough expertise to elaborate on this, but here are a couple of references:
pandas distinction between str and object types
pandas string data types
"pandas doesn't support the internal string types (in fact they are always converted to object)" [from pandas maintainer @Jeff]
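A self-contained sketch of the same steps, with the file contents from the question inlined through io.StringIO so nothing has to exist on disk. The exact label reported for the string column may differ between pandas versions (older versions report 'object'), so treat the printed name for CTRS_ID as version-dependent:

```python
import io
import pandas as pd

# The file contents from the question, inlined so the sketch runs on its own.
text = """CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,056572
0,0,4.0746,0309044
0,0,0.6182,0971094
0,0,15.4834,075614
"""

# Reading CTRS_ID as str keeps the leading zeros.
df = pd.read_csv(io.StringIO(text), dtype={'CTRS_ID': str})

# Column name -> dtype name, as plain strings.
dtype_map = df.dtypes.astype(str).to_dict()
print(dtype_map)
print(df['CTRS_ID'].tolist())
```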

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here I have tried to write sample code. Hope this will help you in solving the problem:
import re
text = "callback=B~12385730561818101591"
rc = re.match(r'.*=B~([0-9]+)', text)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also, you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the newly provided information:
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)
    filename1.write(number + '\n')
...
Concerning your line split on the \t (tabulation), you should show an example of the full line.
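As a quick check of the pattern on the value from the question, here is a minimal sketch; extract_number is a hypothetical helper name, not something from the original script:

```python
import re

# The capturing group holds only the digits after the '~'.
pattern = re.compile(r'=.+~([0-9]+)')

def extract_number(item):
    """Return the digits after '~', or None when the item does not match."""
    match = pattern.search(item)
    return match.group(1) if match else None

print(extract_number('callback=B~12385730561818101591'))
```

group(1) returns just the captured digits, whereas group(0) would return the whole match including '=B~'.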