I need to enter a regex into a field so that it identifies ONLY the varying date.
All files use the same format: name.%y.%m.%d.blahblahblah
This is an example of what the filename would look like:
LordOfTheRings.14.6.28.The.Twin.Towers
Let's say the filename is stored in x.
import re

f = open(x, 'r')
data = f.read()
f.close()

y = re.sub(r"\.\d{2,4}\.\d{1,2}\.\d{1,2}\.", " - ", x)
new = open(y, 'w')
new.write(data)
new.close()
Afterwards, delete the old file.
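For reference, here is a minimal sketch of the whole operation using os.rename, which renames in place and avoids the copy-and-delete step (the pattern and the " - " replacement come from the code above; reading from the current directory is an assumption):
import os
import re

# 2-4 digit year, then 1-2 digit month and day, each surrounded by literal dots
date_re = re.compile(r"\.\d{2,4}\.\d{1,2}\.\d{1,2}\.")

for name in os.listdir('.'):  # assumed: the files are in the current directory
    new_name = date_re.sub(" - ", name)
    if new_name != name:
        os.rename(name, new_name)  # renames in place, so there is no old file to delete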
I am trying to write a dataframe to a CSV and I would like the numbers to be formatted with thousands separators (commas). I don't see any way in the to_csv docs to apply a format or anything like this.
Does anyone know a good way to be able to format my output?
My csv output looks like this:
12172083.89 1341.4078 -9568703.592 10323.7222
21661725.86 -1770.2725 12669066.38 14669.7118
I would like it to look like this:
12,172,083.89 1,341.4078 -9,568,703.592 10,323.7222
21,661,725.86 -1,770.2725 12,669,066.38 14,669.7118
Comma is the default separator. If you want to choose your own separator, you can do so by setting the sep parameter of pandas' to_csv() method.
df.to_csv(sep=',')
If your goal is to add thousands separators and export the result back to a csv, you can follow this example:
import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078, -9568703.592, 10323.7222],
                   [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
                  columns=['A', 'B', 'C', 'D'])

for c in df.columns:
    df[c] = df[c].apply(lambda x: '{0:,}'.format(x))

df.to_csv(sep='\t')
If you just want pandas to show separators when printed out:
pd.options.display.float_format = '{:,}'.format
print(df)
What you're looking to do has nothing to do with csv output but rather is related to the following:
print('{0:,}'.format(123456789000000.546776362))
produces
123,456,789,000,000.55
(a float only carries about 16 significant digits, so the trailing digits are rounded away). See format string syntax.
Also, you'd do well to pay heed to @Peter's comment above about how putting commas inside the values compromises the structure of a CSV in the first place.
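If you do go ahead with comma-formatted values, one way to keep the file parseable (a sketch, not from the original answers; the output filename is made up) is to force quoting:
import csv
import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078],
                   [21661725.86, -1770.2725]], columns=['A', 'B'])

# Format every cell with thousands separators, then quote every field so
# the embedded commas cannot be confused with the CSV delimiter.
df.applymap('{:,}'.format).to_csv('out.csv', quoting=csv.QUOTE_ALL, index=False)
Pandas' default QUOTE_MINIMAL would also quote just the fields that contain the separator; QUOTE_ALL simply makes the quoting explicit.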
I am trying to work with 12 different csv files, that are stored in datafolder. I have a function that opens each file separately (openFile) and performs a specific calculation on the data. I then want to be able to apply a function to each file. The names of the files are all similar to this: UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv . The code below shows how I was planning to read the files into the openFile function:
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):
        abs_path = os.path.abspath(DATA_PATH + 'datafolder/' + file)
        print(abs_path)
        data = openFile(abs_path)
        data2 = someFunction(data)
I need to merge specific files that have the same two letters in the file name. At the end I should have 6 files instead of 12. The files are not stored in datafolder in the order they need to be merged, since this approach will eventually be used for a larger number of files. The files all have the same header.
Am I able to supply a list of the two-letter key words from the file names to use in a regex? e.g.
list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
Any suggestions on how I can achieve this with or without regex?
You could walk through the file tree and then, based on the two letters in each file name, save each pair of files that needs to be merged.
fileList = {'AC': [], 'FO': [], 'CK': [], 'OR': [], 'RS': [], 'IK': []}

for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):  # ensure we are looking at a csv
        # add the file to its correct bucket based on the letters in its name
        fileList[extractTwoLettersFromFileName(file)].append(file)

for twoLetters, files in fileList.items():
    mergeFiles(files)
I did not provide implementations for extracting the letters and merging the files, but from your question you seem to already have them.
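If you do need the extraction helper, here is a minimal sketch (it assumes the two-letter code is the pair of capitals right after the first '-', as in the names from the question):
import re

def extractTwoLettersFromFileName(file):
    # Assumes the code follows the first '-',
    # e.g. 'UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv' -> 'AC'
    match = re.search(r'-([A-Z]{2})', file)
    return match.group(1) if match else None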
You can first do a simple substring check and then, based on that, classify the filenames into groups:
letters_list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
filename_dict = {}  # two-letter code -> list of matching filenames

# run this for each filename (e.g. from os.listdir):
for letters in letters_list:
    if letters in filename:
        filename_list = filename_dict.get(letters, list())
        filename_list.append(filename)
        filename_dict[letters] = filename_list
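Once the names are grouped, the merge itself could be done with pandas, for example (a sketch; the 'merged_' output names are made up, and it relies on the files sharing a header, as stated in the question):
import pandas as pd

for letters, names in filename_dict.items():
    frames = [pd.read_csv(name) for name in names]  # all files share the same header
    pd.concat(frames, ignore_index=True).to_csv('merged_' + letters + '.csv', index=False)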
Here I use the Path object from pathlib to create a list of files whose names end with '.csv'. Then I use a function that examines each file's name, with a regex, for the presence of one of the strings you mentioned, so that I can build a list of pairs of those strings with their associated filenames. Notice that the length of this list of pairs is 12, and that the filenames can be recovered from it.
Having made that list, I can use groupby from itertools to create two-element lists of files that share a string from the file_kinds list; you can then merge the items in those lists. Note that groupby only groups adjacent items, so the list must be sorted by that key (here the glob results happen to put files of the same kind next to each other).
>>> import re
>>> from pathlib import Path
>>> from itertools import groupby
>>> from operator import itemgetter
>>> file_kinds = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
>>> def try_match(filename):
... m = re.search('(%s)'%'|'.join(file_kinds), filename)
... if m:
... return m.group()
... else:
... return None
...
>>> all_files_list = [(item, try_match(item.name)) for item in list(Path(r'C:/scratch/datafolder').glob('*.csv')) if try_match(item.name)]
>>> len(all_files_list)
12
Expression for extracting full paths from all_files_list:
[str(_[0]) for _ in all_files_list]
>>> for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
... kind, [str(_[0]) for _ in list(files_list)]
...
('AC', ['C:\\scratch\\datafolder\\UOG_001-AC_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-AC__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('CK', ['C:\\scratch\\datafolder\\UOG_001-CK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-CK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('FO', ['C:\\scratch\\datafolder\\UOG_001-FO_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-FO__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('IK', ['C:\\scratch\\datafolder\\UOG_001-IK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-IK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('OR', ['C:\\scratch\\datafolder\\UOG_001-OR_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-OR__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('RS', ['C:\\scratch\\datafolder\\UOG_001-RS_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-RS__B_TOP-Accelerometer-2017-07-22T112654.csv'])
I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')

filename1.close()
Here is some sample code that I hope will help you solve the problem:
import re

s = "callback=B~12385730561818101591"
rc = re.match(r'.*=B~([0-9]+)', s)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the newly provided information:
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)  # group(1) is just the captured digits
    filename1.write(number + '\n')
...
Concerning how your lines should be split (on ',' or on '\t' tabulation), you should show an example of a full line.
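Putting it together, a runnable sketch of the corrected loop (column index 57 and the ',' split are carried over from the question):
import csv
import re

pattern = re.compile(r'=.+~([0-9]+)')

with open('example1.csv') as csvfile, open('example.csv', 'w') as outfile:
    for row in csv.reader(csvfile):
        for item in row[57].split(','):
            m = pattern.search(item)
            if m:
                outfile.write(m.group(1) + '\n')  # write just the digits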
I have data in a SQLite table called file_path, like the following:
full_path
---------
H:\new.docx
H:\outer
H:\outer\inner1
H:\outer\inner2
H:\outer\inner1\inner12
H:\new.docx
H:\outer\in1.pdf
H:\outer\inner1\in11.jpg
H:\outer\inner1\inner12\in121.wma
H:\new1.doc
H:\new2.rtf
H:\new.txt
I want to get the rows that are direct children of "H:", meaning I do not want files/folders that are inside another folder. Is it possible using regex?
You'd look for rows that start with h: but contain only one \ character (no subfolders).
So:
select * from file_path where
(full_path like 'h:%') and
not (full_path like '%\%\%');
Since \ is a literal character in a LIKE pattern, '%\%\%' matches any value containing at least two backslashes; and LIKE is case-insensitive for ASCII in SQLite, so 'h:%' also matches 'H:\...'.
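To answer the regex part directly: SQLite defines a REGEXP operator but ships no implementation, so you have to register one yourself, e.g. from Python (a sketch; the database filename is made up):
import re
import sqlite3

conn = sqlite3.connect('files.db')
# Supply the missing REGEXP implementation.
conn.create_function('REGEXP', 2,
                     lambda pattern, value: re.search(pattern, value) is not None)

# One backslash after the drive letter, then no further backslashes.
rows = conn.execute(
    r"SELECT * FROM file_path WHERE full_path REGEXP '^[Hh]:\\[^\\]+$'"
).fetchall()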
I want all my files to be of the format: 2013-03-31_142436.jpg
i.e. %Y-%m-%d_%H%M%S
I have a script to rename that way but would like to check if the filename is of the format first. I do:
for filename in files:
    # check filename not already in the target format
    filename_without_ext = os.path.splitext(filename)[0]
How do I check that filename_without_ext is of the format %Y-%m-%d_%H%M%S?
Use re:
import re
if re.match(r'\d{4}-\d{2}-\d{2}_\d{6}$', filename_without_ext):
    pass  # of the right format
This will just check it looks like it has a chance of being a valid date. Use Martijn's answer if you require it to be a valid date.
Just try to parse it as a timestamp:
from time import strptime

try:
    strptime(filename_without_ext, '%Y-%m-%d_%H%M%S')
except ValueError:
    pass  # not a valid timestamp
The strptime() test has the advantage that it guarantees that you have a valid datetime value, not just a pattern of digits that still could represent an invalid datetime ('1234-56-78_987654' is not a valid timestamp, for example).
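Tying this back to the loop in the question, a minimal sketch (the rename step itself is left out, as in the question):
import os
from time import strptime

def already_formatted(filename):
    # True if the name (without extension) parses as %Y-%m-%d_%H%M%S.
    name = os.path.splitext(filename)[0]
    try:
        strptime(name, '%Y-%m-%d_%H%M%S')
        return True
    except ValueError:
        return False

for filename in files:
    if already_formatted(filename):
        continue  # already in the target format; skip it
    # ... rename logic goes here ...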