I have a file like this:
"keyName":"type","start":{"row":42,"column":0},"end":{"row":42,"column":3},
"keyName":"left","start":{"row":42,"column":0},"end":{"row":42,"column":3},
I need to extract all the "keyName" values, like "type" and "left", excluding the other values, using Python.
This is a very simple solution that just parses the file as text. I do not know exactly what you need to do, but it may well be efficient enough!
with open("pathToMyFile") as f:
    text = f.read()

for x in text.split(","):
    if "keyName" in x:
        print(x)
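Note that splitting on "," also splits inside the "start" and "end" objects. If you only want the "keyName" values themselves, a small regex sketch (assuming the file really looks like the snippet above) is more robust:

import re

with open("pathToMyFile") as f:
    text = f.read()

# Capture just the value paired with each "keyName" key.
print(re.findall(r'"keyName":"([^"]+)"', text))  # e.g. ['type', 'left']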
I am trying to extract and preprocess log data for a use case.
For instance, the log consists of problem numbers with information for each ID underneath. Each element starts with:
#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
I extracted the identification numbers and saved them as a .csv file:
import re

f = open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8")
change_log = f.readlines()
# findall needs a string, not a list of lines, so join the lines first
number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', ''.join(change_log))
Now what I am trying to achieve is that, for every ID in the .csv file, I can append the corresponding log content, which is:
action
change
#!#!#attribute###
Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.
Each log for an ID starts with "#!#!#identification_number###" and ends with "#!#!#attribute5###<entry>".
I have tried the following code, but the result is empty:
In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))
In:
print(x)
Out:
[]
Try this:
import re

pattern = 'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
re.findall(pattern, string + '#!#!#id', re.DOTALL)
The DOTALL flag makes the dot match newlines as well, so the second capturing group should contain the logs. Appending '#!#!#id' to the string lets the last record match too, and the pattern deliberately starts mid-word ('entification_number') because each match's trailing '#!#!#id' consumes the opening of the next record's marker.
If you want the attributes for each identification number, you can parse each log (obtained from the search above) with the following. A lookahead is used for the closing '#!#' so that it is not consumed and can still open the next attribute marker:
pattern = '#!#!#attribute(.*?)###(.*?)(?=#!#)'
re.findall(pattern, string_for_each_log_match + '#!#', re.DOTALL)
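Put together, a minimal end-to-end sketch (assuming the whole file content is in a variable named string, as above):

import re

# Sketch under the assumption that `string` holds the full log text.
log_pattern = 'entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
attr_pattern = '#!#!#attribute(.*?)###(.*?)(?=#!#)'

for id_no, log in re.findall(log_pattern, string + '#!#!#id', re.DOTALL):
    # Each pair is (attribute suffix, value), e.g. ('1', 'status_change\n').
    attributes = re.findall(attr_pattern, log + '#!#', re.DOTALL)
    print(id_no, attributes)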
If you put each id into the regex using str.format() when you search, you can grab the lines that contain the correct change log.
import re

with open(r'path\to\csv.csv', 'r') as f:
    ids = [line.strip() for line in f]  # strip the newlines so each id drops cleanly into the regex

with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.readlines()

matches = {}
for id_no in ids:
    for i, line in enumerate(change_log):
        reg = '#!#!#identification_number###({})#!#!#change_log###'.format(id_no)
        if re.search(reg, line):
            matches[id_no] = i
            break
This will create a dictionary with the structure {id_no: line_no, ...}. Once you know the line where each log starts, you can grab the lines that follow it, as in the sketch below.
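For example, a minimal sketch of that last step (assuming each log ends with its '#!#!#attribute5###' line, as stated in the question):

# Collect the log body for each id, stopping after the final attribute line.
logs = {}
for id_no, start in matches.items():
    entry = []
    for line in change_log[start + 1:]:
        entry.append(line)
        if line.startswith('#!#!#attribute5###'):
            break
    logs[id_no] = ''.join(entry)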
I want to find and replace all of the managerial positions in a CSV file with the number 3. The list contains different positions, from a simple ",Manager," to ",Construction Project Manager and Project Superintendent,", but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The problem is that sometimes a comma is shared between two adjacent managerial positions. So I need to include the comma when finding the positions, but exclude it when replacing the position with 3! How can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in CSV module instead. Let's not reinvent the wheel here and consider handling CSV as a solved problem.
Here is some sample code that demonstrates how it can be done. The csv module is responsible for reading and writing the file with the correct delimiter and quotation character.
re.search is used to check individual cells/columns for your keyword. If "manager" is found, a 3 is written; otherwise, the original content is kept, and the row is written back when done.
import csv
import sys
import re

infile = r'in.csv'
outfile = r'out.csv'

o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

with open(infile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    try:
        for row in reader:
            newrow = []
            for col in row:
                if re.search("manager", col, re.I):
                    newrow.append("3")
                else:
                    newrow.append(col)
            csvwri.writerow(newrow)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))

o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern that uses lookarounds, so neither of the surrounding commas is consumed by the match:
(?<=,)([A-Za-z. '\s/()\"]*(?:Manager|manager)[A-Za-z. '\s/()]*)(?=,)
Replace the match with 3, as shown in the demo.
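For instance, a quick sketch on a made-up row (the row content here is hypothetical):

import re

# The lookarounds leave both commas in place, so adjacent matches keep their separator.
row = ",Manager,Construction Project Manager and Project Superintendent,Engineer,"
pattern = r"(?<=,)[A-Za-z. '\s/()\"]*(?:Manager|manager)[A-Za-z. '\s/()]*(?=,)"
print(re.sub(pattern, "3", row))  # prints: ,3,3,Engineer,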
However, I believe you are still better off with the csv lib approach.
I am trying to write a dataframe to a csv, and I would like the .csv to be formatted with thousands separators. I don't see any way in the to_csv docs to use a format or anything like this.
Does anyone know a good way to be able to format my output?
My csv output looks like this:
12172083.89 1341.4078 -9568703.592 10323.7222
21661725.86 -1770.2725 12669066.38 14669.7118
I would like it to look like this:
12,172,083.89 1,341.4078 -9,568,703.592 10,323.7222
21,661,725.86 -1,770.2725 12,669,066.38 14,669.7118
Comma is the default separator. If you want to choose your own separator, you can do so by setting the sep parameter of pandas' to_csv() method.
df.to_csv(sep=',')
If your goal is to add thousands separators and export the result back to a csv, you can follow this example:
import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078, -9568703.592, 10323.7222],
                   [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
                  columns=['A', 'B', 'C', 'D'])

# Format every column with thousands separators (this turns the numbers into strings).
for c in df.columns:
    df[c] = df[c].apply(lambda x: '{0:,}'.format(x))

# Use a tab separator so the commas inside the numbers don't break the file.
df.to_csv(sep='\t')
If you just want pandas to show separators when printed out:
pd.options.display.float_format = '{:,}'.format
print(df)
What you're looking to do has nothing to do with csv output but rather is related to the following:
print('{0:,}'.format(123456789000000.55))
produces
123,456,789,000,000.55
See format string syntax.
Also, you'd do well to pay heed to @Peter's comment above about not compromising the structure of a csv in the first place.
I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the Python script below, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine the root cause of the missing output and validate whether '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split on ';' instead of ','.
import csv
import sys
import ast
import re

filename1 = open("example.csv", "w")

with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')

filename1.close()
Here I have tried to write some sample code. Hope this helps you solve the problem:
import re

s = "callback=B~12385730561818101591"  # avoid shadowing the built-in str
rc = re.match(r'.*=B~([0-9A-Ba-b]+)', s)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B between the = and the ~.
Also, you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the newly provided information:
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)  # group(1) is just the digits; group(0) would be the whole match
    filename1.write(number + '\n')
...
Concerning your line split on '\t' (tabulation): you should show an example of the full line.
I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string to two separate variables, and then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing, you can use islice(), which is more performant when generating slices of larger lists.
Code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, keep in mind that the data source you read from must be long enough. If it is not, the slicing approaches (including islice()) silently give you empty results, whereas direct indexing such as data[1] raises an IndexError.
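A quick illustration of that caveat:

from itertools import islice

data = ["only one line\n"]

print(data[1:2])                 # [] -- a too-short slice is silently empty
print(list(islice(data, 1, 2)))  # [] -- islice is silently empty too
print(data[1])                   # IndexError: list index out of range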
Using regex
In the comments below, the OP additionally asked for a list-free approach.
Since reading a file yields a string, and string handling leads to lists later on (or readlines() gives a list of lines directly), I suggested using a regex instead.
I cannot say anything about the performance of list/string handling compared to regex operations. However, this should do the job:
import re

# Non-greedy groups capture exactly the first and second lines; with DOTALL,
# the greedy 'rest' group then swallows everything that remains, newlines included.
regex = r'(?P<first>.+?)\n(?P<second>.+?)\n(?P<rest>.+)'
preg = re.compile(regex, re.DOTALL)

with open('input.txt') as f:
    data = f.read()

match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines:
lines = input_string.splitlines()
After that, you want to assign the first and second lines to variables and the rest to another variable:
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this only splits off the first two lines
title = lines[0]
description = lines[1]
rest = lines[2]
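Since the stated goal was writing the three variables out as columns of a .csv, here is a short sketch of that final step (the output file name is hypothetical):

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'description', 'body'])  # header row
    writer.writerow([title, description, rest])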