Read unicode csv using regex in Python 3

I have an Excel (*.xlsx) file with Unicode (non-English, Amharic) characters which I want to save as a Unicode CSV. It seems there is no direct way: I first have to save it as Unicode text (*.txt) in Excel and then replace the tab characters in Sublime Text 3 or another text editor. Unfortunately the tabs are not consistent between columns: the table has some NaN/blank cells, so the spacing (tabs) between columns is irregular and it is hard to replace the tabs with commas using Find and Replace. How can I use regex in Python to convert unicode.xlsx to unicode.csv? Any solution? A sample row:
ስም የወርደሞዝ ጾታ ሥራ ዕድሜ

Excel 2016 has the option to save as "CSV UTF-8 (Comma delimited)". That should work for you unless you are attached to UTF-16LE with tabs, which is what you get from "Unicode text".
If your Excel doesn't have that option then this Python 3 code will convert it:
import csv

with open('book1.txt', 'r', encoding='utf16', newline='') as f1, \
     open('book1.csv', 'w', encoding='utf-8-sig', newline='') as f2:
    r = csv.reader(f1, dialect='excel-tab')
    w = csv.writer(f2, dialect='excel')
    for line in r:
        w.writerow(line)
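If pandas is available, you can also skip the "Unicode text" detour entirely and convert the .xlsx straight to a UTF-8 CSV. This is a sketch, assuming pandas plus the openpyxl engine are installed, reusing the same hypothetical file names:

```python
import pandas as pd

def xlsx_to_csv(src, dst):
    # read the workbook (pandas uses openpyxl for .xlsx files) and write
    # UTF-8 CSV; 'utf-8-sig' adds a BOM so Excel re-opens it correctly
    pd.read_excel(src).to_csv(dst, index=False, encoding='utf-8-sig')
```

Call it as xlsx_to_csv('book1.xlsx', 'book1.csv').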

Related

Adding a space within a line in file with a specific pattern

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace as
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d - \d. But what should I write in the replacement, so that I get the above output? I have over 4000 similar lines and cannot do it manually. Also, can I do it in Python, if possible?
Perhaps you could use findall to get your matches and then join them with a whitespace, returning a string where your values are separated by single spaces.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
print(" ".join(matches))
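Since there are over 4000 such lines, the same pattern can be applied to the whole file; "input.txt" and "output.txt" are placeholder names:

```python
import re

# the pattern from the answer above: an optional sign, digits, and an
# optional mantissa/exponent part
regex = re.compile(r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b")

def fix_line(line):
    # re-join the matched numbers with single spaces
    return " ".join(regex.findall(line))

def fix_file(src_path, dst_path):
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(fix_line(line) + "\n")
```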
You could also do it very easily in MS Excel:
- copy the content of your file into a new Excel sheet, in one column
- select the complete column and, from the Data ribbon, select Text to Columns
- a wizard dialog will appear; select Fixed width, then Next
- click on the position where you want the new space, to tell Excel to split the text into a new column after that position, and click Next
- select each column header, choose Text under Column data format to keep all formatting, and click Finish
- you can then copy the new columns or export them to a new text file

Tab-separated file into pandas

Good afternoon,
Following my question here, Reading csv like file to pandas, I have another one that proves a bit more tricky.
The Excel spreadsheet I am trying to read is separated by tabs and has a column with quotes. I am trying to use the quotechar parameter to avoid an unexpected column, but it does not seem to work, as the separator is a regex.
My code is below, if that helps:
umm2017=pd.read_csv(r'{0}\DonneesIndisponibilitesProduction_2017.xls'.format(path_temp),sep = r"\t",encoding='iso-8859-2', quotechar = "'")
umm2017 = umm2017.drop(umm2017.index[len(umm2017) - 1])
umm2017.to_csv(r'{}\umm_rte_2017.csv'.format(path_output), index = 'False')
and the excel I am trying to read with pandas is here
example
And the file I am trying to parse can be found here, at the bottom https://clients.rte-france.com/lang/fr/visiteurs/vie/prod/indisponibilites.jsp
EDIT: I spent the afternoon trying to understand; it seems quotechar is not supported when the separator is a regex. I tried deleting all hyphens from the original file, but it does not work.
Here is the output csv I get, with some extra rows messing things up:
Many thanks
As a workaround, since pd.read_csv interprets separators longer than one character as regexes (which is what causes the trouble with quoting), you could replace \t with another character first, then feed the result to pd.read_csv.
For example, with the 2014 file from your link, ; seemed to work:
import io
import pandas as pd
with open(fileloc, encoding='iso-8859-2') as file:
    data = file.read().replace('\t', ';')

pd.read_csv(io.StringIO(data), sep=';').to_csv(newfileloc, encoding='iso-8859-2', index=False)
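For what it's worth, the root cause seems to be that sep=r"\t" in the question is a two-character string (a backslash followed by "t"), and pandas treats any separator longer than one character as a regular expression, which is what disables the quoting. Passing a literal tab keeps the default C engine, and quotechar then works directly; here is a small self-contained check, with an inline sample standing in for the real file:

```python
import io
import pandas as pd

# inline stand-in for the tab-separated file; the second row has a
# quoted field containing a tab
sample = "a\tb\n'x\ty'\t1\n"
df = pd.read_csv(io.StringIO(sample), sep='\t', quotechar="'")
# df has two columns: the quoted tab did not create a third one
```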

Easiest way to cross-reference a CSV file with a text file for common strings

I have a list of strings in a CSV file, and another text file that I would like to search for these strings. The CSV file has just the strings that I am interested in, but the text file has a bunch of other text interspersed among the strings of interest (the strings I am interested in are ID numbers for a database of proteins). What would the easiest way of going about this be? I want to check the text file for the presence of every string in the CSV file. I am working in a research lab at a top university, so you would be aiding cutting-edge research!
Thanks :)
I would use Python for this. To print the matching lines, you could do this:
import csv

with open("strings.csv") as csvfile:
    reader = csv.reader(csvfile)
    searchstrings = {row[0] for row in reader}  # construct a set of keywords

with open("text.txt") as txtfile:
    for number, line in enumerate(txtfile):
        for needle in searchstrings:
            if needle in line:
                print("Line {0}: {1}".format(number, line.strip()))
                break  # only necessary if there are several matches per line
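If you also want a summary of which IDs were never found, the same idea can be packaged as a function (file names are the same placeholder ones):

```python
import csv

def cross_reference(csv_path, text_path):
    # collect the IDs from the first CSV column
    with open(csv_path, newline="") as f:
        wanted = {row[0] for row in csv.reader(f) if row}
    # read the whole text file once and test membership per ID
    with open(text_path) as f:
        haystack = f.read()
    found = {needle for needle in wanted if needle in haystack}
    return found, wanted - found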

Format lists in VIM

I would like to find a way to easily format lists in Vim.
I checked par and the default formatter of Vim.
e.g.
1. this is my text this is my text this is my text
2. this is my text this is my text this is my text
3. this is my text this is my text this is my text
4. this is my text this is my text this is my text
and this
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
- this is my text this is my text this is my text
when I select the lines and do a default format to a width of 42 with par and Vim, these are the results:
NUMBERED LIST
formatting with par:
par error:
(42) <= (0) + (50)
formatting with vim:
1. this is my text this is my text this is
my text
2. this is my text this is my text this is
my text
3. this is my text this is my text this is
my text
4. this is my text this is my text this is
my text
LIST with '-'
formatting with par:
4 lines filtered (no change)
formatting with vim:
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
- this is my text this is my text this is
my text
Vim does a better job formatting lists, but it is not correct either for numbered lists.
Par has a lot of trouble formatting lists, even when I use the prefix ("p") option like this:
'<,'>!par w42p4dh or '<,'>!par w42p3dh
Does anyone know a good way to format lists without problems?
Try set fo+=n. From :help fo-table:
n When formatting text, recognize numbered lists. This actually uses
the 'formatlistpat' option, thus any kind of list can be used. The
indent of the text after the number is used for the next line. The
default is to find a number, optionally followed by '.', ':', ')',
']' or '}'. Note that 'autoindent' must be set too. Doesn't work
well together with "2".
Example:
1. the first item
wraps
2. the second item
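The help text above notes that 'autoindent' must be set too. A minimal buffer-local setup might look like this; the extended 'formatlistpat' that also recognizes '-', '*' and '+' bullets is my assumption, not part of the quoted help:

```vim
" make gq recognize numbered lists; 'autoindent' is required as well
setlocal autoindent
setlocal formatoptions+=n
" assumption: extend 'formatlistpat' so plain bullets count as list headers too
let &l:formatlistpat = '^\s*\%(\d\+[\]:.)}]\|[-*+]\)\s\+'
```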

Format all IP-Addresses to 3 digits

I'd like to use the search & replace dialog in UltraEdit (Perl-compatible regular expressions) to format a list of IPs into a standard format.
The list contains:
192.168.1.1
123.231.123.2
23.44.193.21
It should be formatted like this:
192.168.001.001
123.231.123.002
023.044.193.021
The regex from http://www.regextester.com/regular+expression+examples.html for IPv4 in PCRE format does not work properly:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]){3}$
I'm stuck. Does anybody have a proper solution that works in UltraEdit?
Thanks in advance!
Set the regular expression engine to Perl (on the advanced section) and replace this:
(?<!\d)(\d\d?)(?!\d)
with this:
0$1
twice. That should do it.
If your input is a single IP address (per line) and nothing else (no other text), this approach will work:
I used "Replace All" with Perl style regular expressions:
Replace (?<!\d)(?=\d\d?(?=[.\s]|$))
with 0
Just replace as often as it matches. If there is other text, things will get more complicated. Maybe the "Search in Column" option is helpful here, in case you are dealing with CSV.
If this is just a one-off data cleaning job, I often just use Excel or OpenOffice Calc for this type of thing:
Open your text file and make sure there is only one IP address per line.
Open Excel (or similar), go to "Data | Import External Data" and import your text file using "." as the separator.
You should now have 4 columns in Excel:
192 | 168 | 1 | 1
In a fifth column, concatenate the four columns with "." in between; use the TEXT function so each octet is padded to three digits (a plain A1 & "." & B1 & ... would drop the leading zeroes):
=TEXT(A1,"000") & "." & TEXT(B1,"000") & "." & TEXT(C1,"000") & "." & TEXT(D1,"000")
This is obviously a quick-and-dirty fix rather than a programmatic approach, but I find this sort of technique useful for cleaning up data every now and then.
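If a short script is an option, the same normalisation is a one-liner per address in Python (a sketch, independent of UltraEdit):

```python
import re

def pad_ip(ip):
    # zero-pad every octet to three digits: 23.44.193.21 -> 023.044.193.021
    return ".".join(octet.zfill(3) for octet in ip.split("."))

# the same thing as a single regex substitution
def pad_ip_re(ip):
    return re.sub(r"\d+", lambda m: m.group().zfill(3), ip)
```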
I'm not sure how you can use a regular expression in the "Replace With" box in UltraEdit, but you can use this regular expression to find your string:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$