'~' leading to null results in python script - regex

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
data = None
patterns = '=(~[0-9]+)'
data1= csv.reader(csvfile)
for row in data1:
var1 = row[57]
for item in var1.split(','):
if re.search(patterns, item):
for data in item:
if 'common' in data:
filename1.write(data + '\n')
filename1.close()

Here I have tried to write sample code. Hope this will help you in solving the problem:
import re
str="callback=B~12385730561818101591"
rc=re.match(r'.*=B\~([0-9A-Ba-b]+)', str)
print rc.group(1)

You regex is wrong for your example :
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also you include the ~ in the capturing group.
Not exatly sure what's your goal but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the new provided information :
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item):
number = result.group(0)
filename1.write(number + '\n')
...
Concerning your line split on the \t (tabulation) you should show an example of the full line

Related

How to extract parts of logs based on identification numbers?

I am trying to extract and preprocess log data for a use case.
For instance, the log consists of problem numbers with information to each ID underneath. Each element starts with:
#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
I extracted the identification numbers and saved them as a .csv file:
f = open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8")
change_log = f.readlines()
number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', change_log)
Now what I am trying to achieve is, that for every ID in the .csv file I can append the corresponding log content, which is:
action
change
#!#!#attribute###
Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.
Each log for an ID starts with "#!#!identification_number###" and ends with "#!#!attribute5### <entry>".
I have tried the following code, but the result is empty:
In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))
In:
print(x)
Out:
[]
Try this:
pattern='entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
re.findall(pattern, string+'#!#!#id', re.DOTALL)
The dotall flag makes the point match newline, so hopefully in the second capturing group you will find the logs.
If you want to get the attributes, for each identification number, you can parse the logs (got for the search above) of each id number with the following:
pattern='#!#!#attribute(.*?)###(.*?)#!#'
re.findall(pattern, string_for_each_log_match+'#!#', re.DOTALL)
If you put each id into the regex when you search using string.format() you can grab the lines that contain the correct changelog.
with open(r'path\to\csv.csv', 'r') as f:
ids = f.readlines()
with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
change_log = f.readlines()
matches = {}
for id_no in ids:
for i in range(len(change_log)):
reg = '#!#!#identification_number###({})#!#!#change_log###'.format(id_no)
if re.search(reg, change_log[i]):
matches[id_no] = i
break
This will create a dictionary with the structure {id_no:line_no,...}.
So once you have all of the lines that tell you where each log starts, you can grab the lines you want that come after these lines.

Python iterate through txt file of dates and change date format

I have a txt file oldDates.txt. I want to loop through it, modify the date formats and write the newly formatted dates to a new txt file. My code so far:
from datetime import datetime
f = open('oldDates.txt', r)
oldDates = []
newDates = []
for line in f.readlines():
oldDates.append(line)
print(line) # for testing
for oldDate in oldDates:
dt = datetime.strptime(oldDate, '%d/%m/%Y').strftime('%d/%m/%Y')
newDates.append(dt)
with open('newDates.txt', 'w') as w:
for newDate in newDates:
w.write(newDate+"\n")
f.close()
w.close()
However, this give an error:
ValueError: unconverted data remains
I'm not sure where I'm going wrong here, and if there's a more efficient way of doing this then I'd be glad to hear about it. The date conversion seems to work fine from the test print.
There are blank lines in the file and I'm wondering if I need to handle these (I'm not sure how).
Any help much appreciated!
Now, I am no expert in Python, but do you verify your input to be the correct format ? If you apply a regexp on the input line you can easily catch blank lines and any incorrect values.

Regular expression search and substitute values in CSV file

I want to find and replace all of the Managerial positions in a CSV file with number 3. The list contains different positions from simple ",Manager," to ",Construction Project Manager and Project Superintendent," but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The Problem is that sometimes a comma is common between two adjacent Managrial position. So I need to include comma when I want to find the positions but I need to exclude it when I want to replace the position with 3! How Can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in CSV module instead. Let's not reinvent the wheel here and consider handling CSV as a solved problem.
Here is some sample code that demonstrates how it can be done: The csv module is responsible for reading and writing the file with the correct delimiter and quotation char.
re.search is used to search individual cells/columns for your keyword. If manager is found, put a 3, otherwise, put the original content and write the row back when done.
import csv, sys, re
infile= r'in.csv'
outfile= r'out.csv'
o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
with open(infile, newline='') as f:
reader = csv.reader(f, delimiter=',', quotechar='\"', quoting=csv.QUOTE_MINIMAL)
try:
for row in reader:
newrow = []
for col in row:
if re.search("manager", col, re.I):
newrow.append("3")
else:
newrow.append(col)
csvwri.writerow(newrow)
except csv.Error as e:
sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))
o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)
Replace with 3, as shown in the demo.
However, I believe you are still better off with the csv lib approach.

Python using RE to find integer in text file in a for

I'm writing a bot in python using tweepy for python 2.7. I'm stumped on how to approach what I am looking to do. Currently the bot finds the tweet id and appends it to a text file. On later runs I want to use regex to search that file for a match and only write if there is no match within the text file. The intent is not to add duplicate tweet ids to my text file which could span a large amount of numbers followed by newline.
Any help is appreciate!
/edit when I try the below code the IDE says match can't be seen and syntax error as a result.
import re,codecs,tweepy
qName = Queue.txt
tweets = api.search(q=searchQuery,count=tweet_count,result_type="recent")
with codecs.open(qName,'a',encoding='utf-8') as f:
for tweet in tweets:
tweetId = tweet.id_str
match = re.findall(tweedId), qName)
#if match = false then do write, else discard and move on
f.write(tweetId + '\n')
If i get you correct,You need not to bother with regex etc. let the special containers do the work for you.I would proceed with non-duplicate-container like dictionary or set e.g read all the data from file into dictionary or set and then go for extending id into this dictionary or set after all write this dictionary or set back into file.
e.g.
>>>data = set()
>>>for i in list('asddddddddddddfgggggg'):
data.add(i)
>>>data
>>>set(['a', 's', 'd', 'g', 'f']) ## see one d and g

Removing Duplicate Lines by Title Only

I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict
infile = "testfile.txt"
outfile = "outfile.txt"
inf = open(infile,"r")
lines = inf.readlines()
inf.close()
newset = list(OrderedDict.fromkeys(lines))
outf = open(outfile,"w")
lstline = len(newset)
for i in range(0,lstline):
ln = newset[i]
outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired results, but no luck so far. I also can't seem to find any documentation on my exact problem so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.
I made a few changes to the program you had set up. First, I changed your file interactions to use "with" statements, since those are very convenient and automatically handle a lot of the functionality you had to write out. Second off, I used a set instead of an OrderedDict because you were basically just trying to emulate set functionality (exclusivity of elements) by using keys in an OrderedDict. If the title hasn't been used, it adds it to the set so it can't be used again and prints the line to the output file. If it has been used, it keeps going. I hope this helps you!
with open("testfile.txt") as infile:
with open("outfile.txt",'w') as outfile:
titleset = set()
for line in infile:
title = line.split('|')[0]
if title not in titleset:
titleset.add(title)
outfile.write(line)