How to format this txt file using Regex - regex

Have a .txt file with data folded up into a single column, looking to turn it into a .csv so I can import it into a DB table.
Source file:
1000
AAAAAAAAAA
100,000.00
2000
BBBBBBBBBB
200,000.00
3000
CCCCCCCCCC
300,000.00
4000
DDDDDDDDDD
400,000.00
Looking to turn it into:
1000,AAAAAAAAA,100,000.00
2000,BBBBBBBBB,200,000.00
3000,CCCCCCCCC,300,000.00
4000,DDDDDDDDD,400,000.00
I've tried this so far and am stuck there:
find - ^(\d+)(\s)
substitute - $1,
That gets me this output:
1000,AAAAAAAAA
100,000.00
2000,BBBBBBBBB
200,000.00
3000,CCCCCCCCC
300,000.00
4000,DDDDDDDDD
400,000.00
Would love any pointers to move ahead.
Thanks,
CH

Try the following find and replace:
Find: (.*)\r?\n(.*)\r?\n(.*)(?:\r?\n|$)
Replace: $1|$2|$3\n
This approach captures each of three successive lines and then joins them into a single line, using a pipe as the separator. Note that a comma is not a safe separator here, because some of your numeric data already contains commas.

If every row consists of 3 items, maybe try reading the txt file line by line and writing every three lines to a csv file as one row?
For example in python:
import csv

def writeToCSV(result):
    # Append the three collected values as one CSV row
    with open('new.csv', 'a', newline='') as writeFile:
        writer = csv.writer(writeFile)
        writer.writerow(result)

result = []
with open('yourfile.txt', 'r') as txtfile:
    ind = 0
    for line in txtfile:
        result.append(line.strip())  # one value per input line
        ind += 1
        if ind == 3:
            ind = 0
            writeToCSV(result)
            result = []

You can use a regex like this:
(\d+)\n(\w+)\n([\d,.]+)
With this replacement string:
$1,$2,$3
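As a rough sketch of how the three-line regex could be applied outside an editor, the snippet below runs it in Python on an inlined copy of the first two records. The use of csv.writer is an assumption on my part, not something from the answers above: it is swapped in so that the amount field, which itself contains a comma, gets quoted instead of breaking the column count.

```python
import csv
import io
import re

# Inlined stand-in for the source file (first two records only).
source = "1000\nAAAAAAAAAA\n100,000.00\n2000\nBBBBBBBBBB\n200,000.00\n"

# One record = an id line, a word line and an amount line.
record = re.compile(r'(\d+)\r?\n(\w+)\r?\n([\d,.]+)')

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for match in record.finditer(source):
    # csv.writer quotes any field that contains the delimiter
    writer.writerow(match.groups())

print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer only requires replacing io.StringIO() with open('new.csv', 'w', newline='').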

Related

Extract rows from csv file using regex substring?

I have a csv file that looks like this (obviously < anystring > means just that).
<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9
I am trying to extract the UP<anystring> rows (rows 3 and 6 in this example), using a negative lookahead to exclude rows 1, 2, 4 and 5:
import re
import csv
search = re.compile(r'.*_UP(?!early|late)')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        if row[0] == search:
            output.append(row)
print(output)
>>>[]
when I am after
print (output)
[<anystring>tony_UP<anystring>_start,7,8,9, <anystring>jane_UP<anystring>_start,7,8,9]
The regex search works when I test on a regex platform but not in python?
Thanks for the comments: the search code now looks like
search = re.compile(r'^.*?_UP(?!early|late).*$')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        search.search(row[0]) # I think this needs an if, but it won't accept a boolean here?
        output.append(row)
This now returns all rows (i.e. it filters nothing, whereas before it filtered everything).
You want to return a list of rows that contain _UP not followed by early or late.
The pattern should look like
search = re.compile(r'_UP(?!early|late)')
You do not need any ^, .*, etc. because when you use re.search, you are looking for a pattern match anywhere inside a string.
Then, all you need is to test the row for the regex match. With csv.reader each row is a list, so test its first field:
if search.search(row[0]):
    output.append(row)
See the Python demo:
import re
csvfile="""<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9""".splitlines()
search = re.compile(r'_UP(?!early|late)')
output = []
for row in csvfile:
    if search.search(row):
        output.append(row)
print(output)
And the output is your expected list:
['<anystring>tony_UP<anystring>_start,7,8,9', '<anystring>jane_UP<anystring>_start,7,8,9']
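For completeness, here is a small sketch of the same filter applied to rows coming from csv.reader, where each row is a list and the pattern is tested against the first field. The sample data is made up (the x prefix and the "other" segment stand in for the <anystring> placeholders), and io.StringIO is used only so the example runs without a file on disk.

```python
import csv
import io
import re

# Made-up stand-in for test.csv; "x" and "other" replace the <anystring> placeholders.
data = io.StringIO(
    "xtony_UPearly_start,1,2,3\n"
    "xtony_UPlate_start,4,5,6\n"
    "xtony_UPother_start,7,8,9\n"
)

search = re.compile(r'_UP(?!early|late)')

# csv.reader yields lists, so the regex is applied to row[0], not to the row itself.
output = [row for row in csv.reader(data) if row and search.search(row[0])]
print(output)
```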

Removing lines with only digits - regex

I am new to both python and regex. I am trying to process a text file where I want to remove lines with only digits and space. This is the regular expression I am using.
^\s*[0-9]*\s*$
I am able to match the lines I want to remove in the Notepad++ find dialog, but when I try to do the same with Python, the lines are not matched. Is there a problem in the regex itself, or is there some problem with my Python code?
Python code that I am using :
contacts = re.sub(r'^\s*[0-9]*\s*$','\n',contents)
Sample text
Age:30
Gender:Male
20
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
30 30
Gender:Female
21
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
30
Gender:Male
Use re.sub in multiline mode:
contacts = re.sub(r'^\s*([0-9]+\s*)+$','\n',contents, flags=re.M)
If you want the beginning ^ and ending $ anchors to match at every line, rather than only at the start and end of the whole string, then you want to be in multiline mode.
In addition, use the following to represent a line only containing clusters of numbers, possibly separated by whitespace:
^\s*([0-9]+\s*)+$
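A minimal sketch of the multiline idea, run on a shortened, made-up version of the sample text. The pattern is a variation, not the one shown above: it uses [ \t] instead of \s so that a match cannot run across line breaks, and it also consumes the trailing newline so the line disappears rather than being left blank.

```python
import re

# Shortened, made-up sample in the same shape as the question's text.
contents = "Age:30\nGender:Male\n20\nName:A\n30 30\nGender:Female\n 21 \nEnd"

# With re.M, ^ and $ match at every line boundary.
# The line must contain at least one digit and otherwise only digits, spaces or tabs.
digit_line = re.compile(r'^[ \t]*[0-9][0-9 \t]*$\n?', flags=re.M)

cleaned = digit_line.sub('', contents)
print(cleaned)
```

Lines such as Age:30 or HouseNo:10/183 are kept because they do not start with a digit or blank, so the anchored pattern never matches them.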
You don't even need regex for that; a simple translate() call that removes the characters you're not interested in, followed by a check whether anything is left, should more than suffice:
import string
# bytes to strip before checking; the files are opened in binary mode, so this must be bytes
clear_chars = (string.digits + string.whitespace).encode("ascii")
# open input.txt for reading, output.txt for writing
with open("input.txt", "rb") as f_in, open("output.txt", "wb") as f_out:
    for line in f_in:  # iterate over the input file line by line
        if line.translate(None, clear_chars):  # remove the chars, check if anything is left
            f_out.write(line)  # write the line to the output file
        # uncomment the following if you want added newlines when pattern matched
        # else:
        #     f_out.write(b"\n")  # write a new line on match
Which will produce for your sample input:
Age:30
Gender:Male
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
Gender:Female
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
Gender:Male
If you want the matching lines replaced with a new line, just uncomment the else clause.

Swapping columns in vi with regex without using awk, read, etc

I have a file of 1000 lines, with 5 to 8 columns in each line separated by :
1:2:3:4:5:6:7:8
4g10:8s:45:9u5b:a:z1
I want to have all lines in some order 4:3:1:2:5:6:7...
How would I swap only first 4 columns with regex?
I think this would probably be easier to do with another approach, but you could use ex to do it, so be in command mode and enter:
:%s/^\([^:]\+\):\([^:]\+\):\([^:]\+\):\([^:]\+\):/\4:\3:\1:\2:/
which will create capture groups for the first 4 colon delimited fields, then replace them in a different order than they were there originally.
Here is a regex that should do what you are looking for:
newtext = re.sub("([^:]+):([^:]+):([^:]+):([^:]+)(:)?(.*)?",r"\4:\3:\1:\2\5\6",text)
The takeaway is that you'll want to use parentheses for capturing and then reference the groups in the order you want in the replacement. Each capture group is just one or more non-: characters, and the groups are separated by :. If there is a possibility of empty fields, change each + to a *.
Here is a sample in Python for clarity:
import re
textlist = [
    "1:2:3:4:5:6:7:8",
    "1:2:3:4:5",
    "1:2:3:4",
]
for text in textlist:
    newtext = re.sub("([^:]+):([^:]+):([^:]+):([^:]+)(:)?(.*)?",r"\4:\3:\1:\2\5\6",text)
    print (newtext)
output:
4:3:1:2:5:6:7:8
4:3:1:2:5
4:3:1:2
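If the file can be processed outside vi, the reordering can also be sketched without a regex by splitting on the separator. This is an alternative to the answers above, not a restatement of them, and swap_first_four is a hypothetical helper name chosen for the example.

```python
def swap_first_four(line, sep=":"):
    """Reorder the first four fields as 4, 3, 1, 2 and keep the rest unchanged."""
    fields = line.split(sep)
    if len(fields) < 4:
        return line  # too few columns to reorder; leave the line alone
    first, second, third, fourth = fields[:4]
    return sep.join([fourth, third, first, second] + fields[4:])


for text in ["1:2:3:4:5:6:7:8", "4g10:8s:45:9u5b:a:z1", "1:2:3"]:
    print(swap_first_four(text))
```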

Python script to extract data from text file

I have a text file which have some website list links like
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple python script which can extract only site names with length of 8 characters... no name more than 8 characters.... the output should be like:
output.txt:
site1
site2325
site3eiu
site4
i have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'
but I don't know how to split() on more than one delimiter in one line. I also tried the re module but couldn't work out how to write the code for it.
Can someone please help me write this script? :(
You can achieve this using a regular expression as below.
import re
no = 8  # maximum number of characters to keep
regesx = r"\bhttps?://(?:www\.)?"  # scheme plus an optional "www." prefix
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)
end = start+no
string1 = text[start:end]
end = string1.find('.')  # cut at the first dot if it falls inside the slice
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
You said you want to extract site names with 8 characters, but the output.txt example shows truncated pieces of domain names. If you instead want to keep only the domain names that have eight or fewer characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s=''
list_u=('http://www.site1.com/','http://site232546ee.com/','https://www.site3eiue213.org/','http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
    extracted = tldextract.extract(l)
    text_s+= extracted.domain + ' '
print (text_s) #gives a string of domain names delimited by whitespace
Step 2: Filter domain names with 8 or fewer characters.
word= text_s.split()
lent= [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
    {'words': word,
     'char_length': lent,
    })
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each url you'll want to strip the front and back, which you can do with a regular expression, though that can be quite hard depending on what you want to be able to strip off.
When you have found a regular expression that works for you, simply check the domain for its length and write those domains to a file that satisfy your conditions using an if statement (if len(domain) <= 8: f.write(domain))
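The steps above can be sketched roughly as follows. This version truncates each name to 8 characters, which is what the expected output.txt in the question shows, rather than filtering by length. It uses urllib.parse from the standard library instead of a hand-written regex, and short_site_name is a hypothetical helper name used only for this example.

```python
from urllib.parse import urlsplit


def short_site_name(url, max_len=8):
    """Return the first label of the host, without 'www.', cut to max_len characters."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0][:max_len]


# Inlined copy of test.txt from the question.
urls = [
    "http://www.site1.com/",
    "http://site232546ee.com/",
    "https://www.site3eiue213.org/",
    "http://site4.biz/",
]
names = [short_site_name(u) for u in urls]
print("\n".join(names))
```

To read from test.txt instead, the list could be built with open("test.txt").read().split(), and the names written to output.txt one per line.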

findall function grabbing the wrong info

I am trying to write a piece of Python to read my files. The code is below:
import re, os
captureLevel = [] # capture read scale.
captureQID = [] # capture questionID.
captureDesc = [] # capture description.
file=open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv','rt')
newfile=open('finalwordlist.csv','w')
mytext=file.read()
for row in mytext.split('\n'):
    grabLevel=re.findall(r'(\d{1})+\n',row)
    captureLevel.append(grabLevel)
    grabQID=re.findall(r'(\w{1}\d{5})',row)
    captureQID.append(grabQID) #ERROR LINE.
    grabDesc=re.findall(r'\,+\s+(\w.+)',row)
    captureDesc.append(grabDesc)
    lineCount = 0
    wordCount = 0
    lines = ''.join(grabDesc).split('.')
    for line in lines:
        lineCount +=1
        for word in line.split(' '):
            wordCount +=1
            newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
newfile.close()
Here are three lines of my data:
a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2
Here is the result:
22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for
The first column of the result should be just one number, but why is it printing out a two digit number?
Any idea what is the problem here? Thanks.
It was the tab versus space difference again. You need to be careful about this, especially in Python: spaces are not treated as equivalent to tabs. Here is a helpful link discussing the difference: http://legacy.python.org/dev/peps/pep-0008/. In brief, PEP 8 recommends spaces for indentation. However, I find that tabs work fine for indentation too. What matters is keeping the indentation consistent, so if you use tabs, use them throughout.
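One way to check a script for this kind of problem is to list which lines are indented with spaces, which with tabs, and which mix both. The following is only a sketch; indentation_report is a hypothetical helper name and the sample source is made up.

```python
def indentation_report(source):
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    report = {"spaces": [], "tabs": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if not indent:
            continue  # unindented lines cannot cause the problem
        if " " in indent and "\t" in indent:
            report["mixed"].append(number)
        elif "\t" in indent:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report


# Made-up sample: line 2 uses spaces, line 3 a tab, line 4 a space followed by a tab.
sample = "for row in rows:\n    a = 1\n\tb = 2\n \tc = 3\n"
print(indentation_report(sample))
```

If more than one of the three lists is non-empty, the file is indented inconsistently and is worth converting to a single style.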