tabulation separated file into pandas - regex

Good afternoon,
Following my question here Reading csv like file to pandas, I have another one that proves a bit more tricky.
The Excel spreadsheet I am trying to read is tab-separated and has a column with quotes. I am trying to use the quotechar parameter to avoid an unexpected extra column, but it does not seem to work because the separator is treated as a regex.
My code is below, if that helps:
umm2017=pd.read_csv(r'{0}\DonneesIndisponibilitesProduction_2017.xls'.format(path_temp),sep = r"\t",encoding='iso-8859-2', quotechar = "'")
umm2017 = umm2017.drop(umm2017.index[len(umm2017) - 1])
umm2017.to_csv(r'{}\umm_rte_2017.csv'.format(path_output), index = 'False')
and the excel I am trying to read with pandas is here
example
And the file I am trying to parse can be found here, at the bottom https://clients.rte-france.com/lang/fr/visiteurs/vie/prod/indisponibilites.jsp
EDIT: I spent the afternoon trying to understand this. It seems quotechar is not supported with the C engine, which is the only one I can use as my file uses a regex as sep. I tried deleting all hyphens from the original file, but that does not work either.
here is the output csv I get; with some extra rows messing things up :
Many thanks

As a workaround, to avoid the issue with pd.read_csv interpreting separators over one character in length as regexes and causing trouble with quoting, you could replace \t with another character first, then feed that to pd.read_csv.
For example, with the 2014 file from your link, ; seemed to work:
import io
import pandas as pd
with open(fileloc, encoding='iso-8859-2') as file:
    data = file.read().replace('\t', ';')
pd.read_csv(io.StringIO(data), sep=';').to_csv(newfileloc, encoding='iso-8859-2', index=False)
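Another thing that may be worth trying: a literal tab ("\t" rather than the raw string r"\t") is a single character, so it does not need regex handling at all. Below is a minimal sketch of that idea using only the standard library csv module, with made-up sample data rather than the RTE file. I would expect sep="\t" together with quotechar="'" to behave the same way in pd.read_csv, but that is an assumption I have not checked against the real file.

```python
import csv
import io

# Made-up sample: tab-separated, with a single-quoted field that contains a tab.
sample = "name\tcomment\tvalue\nunit1\t'planned\toutage'\t42\n"

# The delimiter is a real tab character (one character), so quotechar is honoured
# and the tab inside the quoted field does not start a new column.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quotechar="'"))

for row in rows:
    print(row)
```

Every row keeps three fields, and the quoted field comes back with its embedded tab intact.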

Related

Making a text file which will contain my list items and applying regular expression to it

I am supposed to make a code which will read a text file containing some words with some common linguistic features. Apply some regular expression to all of the words and write one file which will have the changed words.
For now let's say my text file named abcd.txt has these words
king
sing
ping
cling
booked
looked
cooked
packed
My first question starts from here. In my simple text file how to write these words to get the above mentioned results. Shall I write them line-separated or comma separated?
This is the code provided by user palvarez.
import re
with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new.write(new_word)
Can I add something like -
with open("new_abcd", "w+") as file, open("abcd") as original:
    for word in original:
        new_aword = re.sub("ed$", "abcd", word)
        new.write(new_aword)
in the same code file? I want something like -
kabc
sabc
pabc
clabc
bookxyz
lookxyz
cookxyz
packxyz
PS - I don't know whether mentioning this is necessary or not, but I am supposed to do this for a Unicode supported script Devanagari. I didn't use it here in my examples because many of us here can't read the script. Additionally that script uses some diacritics. eg. 'का' has one consonant character 'क' and one vowel symbol 'ा' which together make 'का'. In my regular expression I need to condition the diacritics.
I think your approach of one word per line is better, since you don't have to trouble yourself with delimiters and stripping.
With a file like this:
king
sing
ping
cling
booked
looked
cooked
packed
And code like this, using re.sub to replace a pattern:
import re
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new_word = re.sub("ed$", "abcd", new_word)
        new.write(new_word)
It creates a resulting file:
kxyz
sxyz
pxyz
clxyz
bookabcd
lookabcd
cookabcd
packabcd
I tried out with the diacritic you gave us and it seems to work fine:
print(re.sub("ा$", "ing", "का"))
>>> कing
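On conditioning the diacritics more generally: rather than listing every vowel sign by hand, you could test characters by their Unicode category. This is only a sketch under my own assumptions; strip_trailing_marks is a hypothetical helper name, and treating every combining mark (categories Mn, Mc, Me) as a "diacritic" may be broader than what your rules need.

```python
import re
import unicodedata

def strip_trailing_marks(word):
    """Remove combining marks (Unicode categories Mn, Mc, Me) from the end of a word."""
    chars = list(word)
    while chars and unicodedata.category(chars[-1]).startswith("M"):
        chars.pop()
    return "".join(chars)

KA = "\u0915"       # DEVANAGARI LETTER KA, the consonant in the example
AA_SIGN = "\u093e"  # DEVANAGARI VOWEL SIGN AA, the vowel sign in the example
word = KA + AA_SIGN

# Rule for one specific vowel sign at the end of a word, as in the answer above.
replaced = re.sub(AA_SIGN + "$", "ing", word)

# Category-based variant: drop whatever combining marks end the word.
bare = strip_trailing_marks(word)

print(replaced)
print(bare)
```

The regex version targets one sign exactly, while the category check catches any trailing mark, so which one fits depends on how specific each linguistic rule has to be.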
EDIT: added multiple replacements. You can put your replacements in a list and iterate over it, calling re.sub for each one, as follows.
import re
# List of (pattern, replacement) pairs
replacements = [("ing$", "xyz"), ("ed$", "abcd")]
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = word
        for pattern, replacement in replacements:
            new_word = re.sub(pattern, replacement, word)
            if new_word != word:
                break
        new.write(new_word)
This limits each word to one modification: only the first replacement that actually changes the word is applied.
To start with, I recommend using the with context manager to open your file; that way you do not need to close the file explicitly once you are done with it.
Another advantage is that you can process the file line by line, which is very useful if you are working with larger data sets. Whether to write the words one per line or in CSV format depends on what your output requires and how you want to process it further.
As an example, to read from a file and substitute a substring, you can use re.sub.
import re
with open('abcd.txt', 'r') as f:
    for line in f:
        # do something here
        print(re.sub("ing$", 'ring', line.strip()))
>>
kring
sring
pring
clring
booked
looked
cooked
packed
Another nifty trick is to manage both the input and output utilizing the same context manager like:
import re
with open('abcd.txt', 'r') as f, open('out_abcd.txt', 'w') as o:
    for line in f:
        # notice that we add '\n' to write each output to a new line
        o.write(re.sub("ing$", 'ring', line.strip()) + '\n')
This creates an output file with your new contents in a very memory-efficient way.
If you'd like to write to a CSV file or another specific format, I strongly suggest spending some time with the Python documentation on input and output. If linguistic text processing is what you are going for, then also read up on text encodings for different languages and study Python's regex operations further.

Read unicode csv using regex in python 3

I have an Excel (*.xlsx) file with Unicode/non-English/Amharic characters which I want to save as a Unicode CSV. It seems there is no direct way: I must first save it as Unicode text (.txt) in Excel and then remove the tab characters in Sublime Text 3 or another text editor. Unfortunately the tabs are not consistent between columns. How can I use regex in Python to convert the Unicode .xlsx to a Unicode .csv? The Excel table has some NaN/blank cells, so the table does not have regular spacing (tabs) between columns, and it's hard to replace the tabs with commas using Find and Replace. Any solution?
ስም የወርደሞዝ ጾታ ሥራ ዕድሜ
Excel 2016 has the option to save as "CSV UTF-8 (Comma delimited)". That should work for you unless you are attached to UTF-16LE with tabs, which is what you get from "Unicode text".
If your Excel doesn't have that option then this Python 3 code will convert it:
import csv
with open('book1.txt', 'r', encoding='utf16', newline='') as f1, \
     open('book1.csv', 'w', encoding='utf-8-sig', newline='') as f2:
    r = csv.reader(f1, dialect='excel-tab')
    w = csv.writer(f2, dialect='excel')
    for line in r:
        w.writerow(line)
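On the worry about blank cells: in a tab-delimited export an empty cell should simply appear as two adjacent tabs, and the csv module keeps it as an empty field, so no regex is needed. Here is a small in-memory sketch of that behaviour; the sample rows are made up and io.StringIO stands in for the real files.

```python
import csv
import io

# Made-up sample in tab-delimited layout; the second row has an empty middle cell,
# which shows up as two adjacent tabs.
text = "ስም\tጾታ\tዕድሜ\r\nአበበ\t\t30\r\n"

# Read with the excel-tab dialect: empty cells come back as empty strings.
rows = list(csv.reader(io.StringIO(text, newline=""), dialect="excel-tab"))

# Write the same rows back out as comma-delimited text.
out = io.StringIO(newline="")
csv.writer(out, dialect="excel").writerows(rows)
csv_text = out.getvalue()

print(rows)
print(csv_text)
```

The empty cell survives the round trip as an empty field between two commas, so the column alignment is preserved.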

Regex With Colons in Data

I have a text file which I'm looking to remove some data from. The data is separated using a colon ':' as the delimiter. There are approx 9 separations. The data after the 7th column is most often null and thus useless but the additional colons are still there.
An example of the file would look like this:
column1:column2:column3:column4:column5:column6:column7:column8:column9:column10
I hope to remove the info from after column8. So the data to be removed would be:
:column9:column10
Could someone advise me how to do so in Regex?
I've been reading, and nowhere have I found a way to isolate a colon and the text following it after x number of colons.
Any help you could offer would be much appreciated.
$_ = join ":", ( split /:/, $_, -1 )[0..7];
or
s/(?::[^:]*){2}\z//;
The following regex will keep the first 8 columns and discard all others.
s/^[^:]*(?::[^:]*){7}\K.*//;
Assumes simple single line records.
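For anyone not using Perl, here is a rough Python sketch of the same two ideas (split and rejoin, or a regex that keeps the first 8 fields). The variable names are mine, and like the Perl versions it assumes simple single-line records.

```python
import re

line = "column1:column2:column3:column4:column5:column6:column7:column8:column9:column10"

# Split on every colon, keep the first 8 fields, and join them back together.
kept_by_split = ":".join(line.split(":")[:8])

# Regex variant: capture the first field plus 7 more ":field" groups,
# then drop everything that follows on the same line.
kept_by_regex = re.sub(r"^([^:]*(?::[^:]*){7}).*", r"\1", line)

print(kept_by_split)
print(kept_by_regex)
```

One difference to be aware of: if the line still carries its trailing newline and has more than 8 fields, the split version drops the newline along with the extra fields, while the regex version leaves it in place because "." does not match a newline.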

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should ideally not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file; it'll make your life easier. But if you're stuck with using , as the separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If you always have such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses your input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) way if you already know them, as they are basically an equivalent of a state machine, just in another form.
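The two-state idea can be sketched in Python roughly like this. split_quoted is a hypothetical helper name, and the sketch deliberately ignores escaped quotes ("" inside a quoted field), which the standard csv module does handle; the last lines compare against csv.reader on the example from the question.

```python
import csv

def split_quoted(line, sep=",", quote='"'):
    """Split a line on sep, ignoring separators that appear inside quotes."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes      # switch state; the quote itself is dropped
        elif ch == sep and not in_quotes:
            fields.append("".join(current))  # separator outside quotes ends the field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))        # last field has no trailing separator
    return fields

s = '10,"nikhil,khandare","sachin","rahul",viru'
result = split_quoted(s)
with_csv_module = next(csv.reader([s]))

print(result)
print(with_csv_module)
```

For real CSV files the csv module is probably the safer choice, since it already covers escaped quotes and quoted fields that span lines.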

Easiest way to cross-reference a CSV file with a text file for common strings

I have a list of strings in a CSV file, and another text file that I would like to search for these strings. The CSV file has just the strings that I am interested in, but the text file has a bunch of other text interspersed among the strings of interest (the strings I am interested in are ID numbers for a database of proteins). What would the easiest way of going about this be? I want to check the text file for the presence of every string in the CSV file. I am working in a research lab at a top university, so you would be aiding cutting-edge research!
Thanks :)
I would use Python for this. To print the matching lines, you could do this:
import csv
with open("strings.csv") as csvfile:
    reader = csv.reader(csvfile)
    searchstrings = {row[0] for row in reader}  # Construct a set of keywords
with open("text.txt") as txtfile:
    for number, line in enumerate(txtfile, start=1):  # count lines from 1
        for needle in searchstrings:
            if needle in line:
                print("Line {0}: {1}".format(number, line.strip()))
                break  # only necessary if there are several matches per line
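If what you need is a yes/no answer for every string in the CSV rather than the matching lines, a variation along these lines might help. It is a sketch with made-up IDs, and io.StringIO stands in for strings.csv and text.txt so the example is self-contained.

```python
import csv
import io

# In-memory stand-ins for strings.csv and text.txt (made-up IDs).
csv_data = "P12345\nQ67890\nA11111\n"
text_data = "header line\nfound P12345 in sample 1\nnoise\nQ67890 also present\n"

# One ID per CSV row; skip any empty rows.
wanted = {row[0] for row in csv.reader(io.StringIO(csv_data)) if row}

# Map each ID to the 1-based line numbers where it occurs.
hits = {needle: [] for needle in wanted}
for number, line in enumerate(io.StringIO(text_data), start=1):
    for needle in wanted:
        if needle in line:
            hits[needle].append(number)

found = sorted(n for n, lines in hits.items() if lines)
missing = sorted(n for n, lines in hits.items() if not lines)

print("found:", found)
print("missing:", missing)
```

Note that this is plain substring matching, so a short ID would also match inside a longer one; if that matters for your IDs, a regex with word boundaries would be a stricter test.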