Regular expression and csv | Output more readable - regex

I have a text which contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I would like to extract from each article a specific piece of information: the number of people wounded in the terrorist attacks.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as a CSV file, one row per article:
article_1,0
article_2,0
article_3,0
article_4,40
article_5,150
article_6,94
Any suggestions on how to make it more readable?

I rewrote your loop like this and merged it with the CSV writing since you requested it:
import csv
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        # only one of the three groups is non-empty per match; sum it, or use 0 if nothing matched
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i + 1), nb_casualties]
        writer.writerow(row)
get the index of the article using enumerate
sum the number of victims (in case more than one group matches) using a generator expression to convert the matches to integers and pass them to sum, but only if something matched (the ternary expression checks that)
create the row
print it, or optionally write it as a row (one row per iteration) of a csv.writer object

Related

Matching diverse dates in OpenRefine

I am trying to use the value.match command in OpenRefine 2.6 to split the information present in a column into (at least) 2 columns.
The data are, however, quite messed up.
I have sometimes full dates:
May 30, 1949
Sometimes full dates are combined with other dates and attributes:
May 30, 1949, published 1979
May 30, 1949 and 1951, published 1979
May 30, 1949, printed 1980
May 30, 1949, print executed 1988
May 30, 1949, prints executed 1988
published 1940
Sometimes you have a timespan:
1905-05 OR 1905-1906
Sometimes only the year:
1905
Sometimes the year with attributes:
August or September 1908
The data doesn't seem to follow any specific schema or order.
I would like to extract (at least) a start and end year, in order to have two columns:
-------------------------
| start_date | end_date |
| 1905       | 1906     |
-------------------------
without the rest of the attributes.
I can find the last date using
value.match(/.*(\d{4}).*?/)[0]
and the first one with
value.match(/.*^(\d{4}).*?/)[0]
but I have some trouble with the two formulas.
The latter cannot match anything in the case of:
May 30, 1949 and 1951, published 1979
while in the case of:
Paris, winter 1911-12
the latter formula cannot match anything and the former formula matches 1911.
Does anyone know how I can resolve the problem?
I would need a solution that takes the first date as start_date and the final date as end_date, or better (I don't know if it is possible), the earliest date as start_date and the latest date as end_date.
Moreover, I would be glad to have some clue about how to extract other information, such as:
if published or printed or executed is present in the text -> copy the date to a new column named “execution”.
It should be something like: create a new column with
if(value.match("string1|string2|string3" + (\d{4})), "perform the operation", do nothing)
value.match() is a very useful but sometimes tricky function. To extract a pattern from a text, I prefer to use Python/Jython's regular expressions:
import re
pattern = re.compile(r"\d{4}")
return pattern.findall(value)
From there, you can create a string with all the years concatenated:
return ",".join(pattern.findall(value))
Or select only the first:
return pattern.findall(value)[0]
Or the last:
return pattern.findall(value)[-1]
etc.
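If you want both years at once, one option (a sketch, assuming every cell contains at least one four-digit year) is to return them joined by a separator and then use OpenRefine's "Split into several columns" on the result:
import re
pattern = re.compile(r"\d{4}")
years = pattern.findall(value)
# first year found = start_date, last year found = end_date
# (the same year appears twice when the cell contains only one date)
return years[0] + "," + years[-1]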
Same thing for your sub-question:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
return pattern.findall(value)[0][1]
Or:
import re
pattern = re.compile(r"(published|printed|executed)\s+(\d+)")
m = re.search(pattern, value)
return m.group(2)
Example:
Here is a regex which will extract start_date and end_date as named groups.
If there is only one date, it is considered the start_date:
((?<start_date>\d{4}).*?)?(?<end_date>\d{4}|(?<=-)\d{2})?$
Demo
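To sanity-check it outside OpenRefine, here is a rough Python equivalent (note that Python spells named groups (?P<name>...); the test values are taken from the question, and end_date comes back as None when only one date is present):
import re

# Same idea as above, with Python-style named groups; the lookbehind
# picks up two-digit end years such as the "12" in "1911-12".
pattern = re.compile(r"((?P<start_date>\d{4}).*?)?(?P<end_date>\d{4}|(?<=-)\d{2})?$")

for value in ["May 30, 1949", "1905-1906", "Paris, winter 1911-12"]:
    m = pattern.search(value)
    print(value, "->", m.group("start_date"), m.group("end_date"))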

How to know if a variation (e.g. an abbreviation) of a string in a list matches against another list if the original does not?

I am currently searching for a method in R which lets me match/merge two data frames. Alas, both of these data frames contain non-optimal data: they can have certain abbreviations or even typos in them. Therefore I would like to define a list for each abbreviation, and if a string contains one of those elements and the original entries don't match, R should check whether any of the other options of the abbreviation matches. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name         Address
Deloitte Ltd.        New York
Coca-Cola            New York
Tesla ltd            California
Microsoft Limited    Washington
Would have to be merged with "EnterpriseNrList"
Company name         EnterpriseNumber
Deloitte Ltd.        221
Coca-Cola            334
Tesla ltd            725
Microsoft Limited    127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported as the return.
Therefore I would make up a list "Abbreviations" for each possible abbreviation
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of like a contains() in Excel?
3) How could I make up a list that replaces, for the entries that do not match, the abbreviation with all the other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regex to check whether a string contains any of the abbreviation options, and create a list which will then contain -1 if no match could be found and >0 if a match is found. The entries with no pattern match can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options (the "Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines e.g. Coca-Cola with all its possible abbreviations:
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2 I thought it might be a bit easier on the required computing power, as the original list is about 8000 rows.
I would go with a different approach, fixing the tables before the merge.
To fix the abbreviations, I would use a case-insensitive regex with an optional final dot, starting from a list of 'normal word' = vector of abbreviations:
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; the case will be ignored via a parameter of gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the company.name column of each df:
for (i in seq_along(regexes)) {
  Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take into account typos. Here you'll need to work with agrep or adist to manage them.
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
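The same normalize-then-merge idea, as a minimal Python/pandas sketch (an illustration only, mirroring the R example data above; the final merge is the step the R answer leaves implicit):
import re
import pandas as pd

address = pd.DataFrame({
    "Company.name": ["Deloitte Ltd.", "Coca-Cola", "Tesla ltd", "Microsoft Limited"],
    "Address": ["New York", "New York", "California", "Washington"]})
enterprise = pd.DataFrame({
    "Company.name": ["Deloitte Ltd.", "Coca-Cola", "Tesla ltd", "Microsoft Limited"],
    "EnterpriseNumber": [221, 334, 725, 127]})

abbrevs = {"Limited": ["Limited", "Ltd"], "Incorporated": ["Incorporated", "Inc"]}
for canonical, variants in abbrevs.items():
    # alternation with an optional trailing dot, case-insensitive
    regex = re.compile("(" + "|".join(variants) + ")[.]?", re.IGNORECASE)
    for df in (address, enterprise):
        df["Company.name"] = df["Company.name"].str.replace(regex, canonical, regex=True)

merged = address.merge(enterprise, on="Company.name")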
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just look at which elements of your list contain an abbreviation with the grep or grepl functions (grep returns all indexes that have a matching pattern, whereas grepl returns a logical vector).
Also, use the ignore.case=TRUE parameter of these functions, so you don't have to try all capitalized/lowercase possibilities.
If you don't have such a list, my first guess would be to extract the first "word" of each company name (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
If you wanted to also correct for typos, this is more a question of string distance.
Hope that it helped!

Big data free text with special characters - searching via Python gives Unicode errors

I have free text with special characters and line spaces between each record, and it is getting impossible to search for a keyword. I have a big text file with 3 columns (each column is separated by “|”). It seems like each record ends with a } sign. There is a line gap between each row or record. My file size is around 100 MB+.
My objective is to search for multiple keywords and the surrounding words before and after each keyword.
With Stack Overflow's help, I am using this code, but I am getting Unicode errors. Please help.
1. I want to get only positive results, i.e. I don't want to see any data if the search doesn't match.
2. Is it possible to see the first 4 columns of each finding along with the result? Those four columns are fixed length and the same for each record.
My file Sample:
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}{\s2\cf0\cb1
;}}
\par\par\par\b
FOLLOW-According to the United States Census Bureau, the township has a total area of 15.1 square miles
(39 km2), of which, 14.6 square miles (38 km2) of it is land and 0.5 square miles (1.3 km2) of it
(3.58%) is water. It is drained by the Lehigh River on its western \clvertalt\cellx4320
\pard\intbl\s0\ql\widctlpar\plain\f1\fs20\lang4105\f1\fs16 3.87 10^6/uL \cell
\pard\s0\ql\widctlpar\plain\f1\fs20\par\par\b ASSESSMENT:\plain\f1\fs20 Perfect
As of the census[1] of 2000, there were 4,243 people, 1,671 households, and 1,256 families residing in
the township. The population cc:\tab Dhar xdfsd, MD\par\par\par\par\pard\s0\ql\par}
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;} {\s2\cf2\cb1
;}{\s3\f1\fs22\cf2\cb1\tqc\tx4320\tqr\tx8640 header;} {\s4\fs20\cf2\cb1\tqc\tx4320\tqr\tx8640
footer;}}
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\pard\par\pard\s3\tqc\tx4320\tqr\tx8640\qc\widctlpar\f0\fs28 \caps
There were 1,671 households out of which 28.8% had children under the age of 18 living with them, 64.0%
were married couples living together, 6.9% had a female householder with no husband present, census
24.8% were non-families. 19.5% of all households were made up of
30094 - (770) 761-7260 - FAX (678) 413 -1818\par\lang1024\f0\fs20\par\pard\plain\f1\fs20\par\ql\par\par
}
00010007308000003141|730100036|2007-11-19 12:36:28.000|ACCG| {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footer y864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd
\pgwsxn12240\pghsxn15840\marglsxn864\margrsxn864\margtsxn1440\margbsxn864\headery1440\footery864\sbkpage
\pgncont\pgndec
\plain\plain\f1\fs20\lang1033\f1 Home Care Note: CMN received from Home Medical
In the township the population was spread out with 21.4% under the age of 18, 6.5% from 18 to 24, 29.9%
from 25 to 44, 27.7% from 45 to 64, and 14.6% who were 65 years of age or older. The median age was 40
years. For every 100 females there were 101.1 males. For every 100 females age 18 and over, there were
98.5 males
on RA on the 18th of Oct. Cont. O2 at 2L/N/C was ordered. \plain\f1\fs20\par}
00010007308000003141|730100037|2007-11-15 12:05:02.000|ACCG|Clear Document - Certificate
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG {\rtf1\ansi\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \census \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;\red255\green0\blue0
;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1440\margb864\headery1440\footery864\deftab720\formshade
\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
called and faxed to Mike.\plain\f1\fs20\par}
In the above file I am searching for 'census' (not case sensitive) and I find matches in 4 places (2 times in the 1st record and 1 time in each of two other records).
Desired output is below...
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|United States Census Bureau, the t
00010007308000003161|730100039|2007-11-27 09:54:17.000|ACCG|of the census[1] of 2000
00010007308000003141|730100040|2007-11-27 10:05:09.000|ACCG|husband present, census 24.8% were
00010007308000003141|730100038|2007-11-28 08:35:18.000|ACCG|fonttbl{\f0 \census \fcharset0 Times
In the desired example above I chose to display only two words before and after 'census'. It would be great if I had the flexibility to choose more than 2 words, for example 10 words before and 15 words after.
Also, I am reading this from a text file. If you could give me a command to read from and write back to a text file, that would be great. Sorry, I am new to Python, but I love the power of Python.
Thank you so much for your help.
Thank you so much for your help.
You could use the below regex.
>>> s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs20\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1152\margb720\head
ery1152\footery720\deftab720\formshade\aendnotes\aftnnrlc
Called Brian with mike
\pgbrdrhead
12/27/06 fax 293-4812\plain\f1\fs20\par}
4200011|4200007|2010-11-29 12:49:42.000|{\rtf1\ansi
\deflang1033\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss
\fcharset0 Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red255\green0\blue0 ;}
{\stylesheet{\fs20\cf0\cb1 Normal;}{\cs1\cf0\cb1 Default Paragraph
Font;}}
\paperw12240\paperh15840\margl864\margr864\margt1007\margb576\head
ery1007\footery576\deftab720\formshade\aendnotes\aftnnrlc
\pgbrdrhead them numbers and they pt
minutes\plain\f1\fs20\par}"""
>>> import re
>>> ls = re.findall(r'^(\d+\|\d+)\|(?:(?!\n\n)[\s\S])*?(\S+\s+\S+\s+mike\s+\S+\s+\S+)', s)
>>> print(('|'.join([j for i in ls for j in i])).replace('\n',' '))
4200011|4200002|Brian with mike \pgbrdrhead 12/27/06
s = r"""4200011|4200002|2006-12-28 10:28:42.000|{\rtf1\ansi
^^
Try this. Or else you will have to double escape \plain\f1\fs20\par
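If the one-shot regex gets unwieldy, a plain Python sketch along these lines is easier to parameterize (the file name is a placeholder; treating blank lines as record separators and taking the first four |-separated fields are assumptions based on the sample; errors='replace' is one way to sidestep the Unicode errors):
import re

def search_keyword(path, keyword, before=2, after=2):
    # errors='replace' avoids crashes on bytes that are not valid UTF-8
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # records appear to be separated by blank lines in the sample
    for record in re.split(r"\n\s*\n", text):
        fields = record.split("|")
        if len(fields) < 4:
            continue
        header = "|".join(fields[:4])
        words = record.split()
        for i, word in enumerate(words):
            if keyword.lower() in word.lower():
                context = " ".join(words[max(0, i - before):i + after + 1])
                print(header + "|" + context)

search_keyword("records.txt", "census", before=2, after=2)
Writing the hits back to a text file instead of printing them is just a matter of opening a second file with open("out.txt", "w") and calling write on it.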

Splitting strings that contain commas (special characters?)

I'm working from a spreadsheet of values. I have code that pulls a row of content to analyze. I was planning to split it on commas, but some of the strings inside the cells include commas (that aren't regularly spaced, so escaping them would be difficult). I downloaded the sheet as a tsv instead of a csv and re-uploaded it, but my attempts to split on \t haven't been successful. (For good measure, I've also tried \n, \r, and \f to see if they're involved in delimiting cells. They don't seem to be.)
Is there a special character that means "next cell" or "next record" or something like that? Am I better off trying to end each cell with a particular character that I would then have to strip out of my data after splitting? I'd welcome any other ideas!
Code snippet:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var contentChunks = lastRowContents.toString().split('\t');
var product = contentChunks[0];
Logger.log(product);
This outputs the entire row as one item in that array, like so:
product: Wed Jan 05 2005 02:00:00 GMT-0600 (CST),001-2005, Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings ,misbranded,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index271,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index386,Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings,Produced 10/6/2004. The products subject to recall are: One pound bags of "DAY-LEE PRIDE BEEF GYOZA POTSTICKERS, VEGETABLE AND BEEF DUMPLINGS." Each bag bears the code "28004," as well as "Est. 17309" inside the USDA mark of inspection.,The packages state that the gyozas are filled with beef, but they may instead contain shrimp, a known allergen.,The problem was discovered by the establishment.,17309 M Day-Lee Foods Inc. 13055 E. Molette St. Santa Fe Springs, CA 90670,,Approximately 2,520 pounds,California, Colorado, Georgia, Maryland, New York, and Washington.,Class I,U.S. Food and Drug Administration (FDA),,,,,
(just for visibility :)
Since lastRowContents is a 2D array (link to doc), you can access every cell with lastRowContents[0][0], lastRowContents[0][1], lastRowContents[0][2], etc.
In your code:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var product = lastRowContents[0][0];
Logger.log(product);

Vim: Parsing address fields from all around the globe

Intro
This post is long, but I consider it thorough. I hope this post might be helpful to others dealing with addresses, while also teaching complex Vim regexes. Thank you for your time.
Worldwide addresses:
American, Canadian and a few other countries' users are offered 5 fields on a form, which are then displayed in a comma-delimited format that I need to further dissect.
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty as non-US countries do are not segmented into states. And, yes, the same "additional commas" as described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field, then you should be able to place the country, zip, state (if not the empty ",,"), city and street into their respective positions - which are the most important fields to get right. Anything beyond those sections could be lumped together in the first line or two as descriptions of the address (i.e. building name, suite, cross streets, etc.). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
Last section has a digit (therefore a US or Canadian address)
There are a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
Whereas there might be a discrepancy on where "111 Street" or "Suite 101" goes (Address1 or Address2), it at least gets the zip, state, city and address(es) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
Under this approach, foreign address get parsed like:
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any number of sections above the base are added to the second address2 field; if there are 3 sections above the base section count, they are appended to each other inside the address2 field. A rough prototype of this counting logic is sketched below.
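To make the logic concrete, here is a rough prototype of the section counting in Python (illustration only; my real goal is to do this in Vim):
def classify(address):
    sections = [s.strip() for s in address.split(",")]
    # a country name never contains a digit; a US/Canadian zip always does
    has_country = not any(ch.isdigit() for ch in sections[-1])
    base = 6 if has_country else 5
    extra = len(sections) - base  # sections above the base spill into Address2
    return has_country, extra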
Coding
In this approach using Vim, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections when I am not sure how many sections exist?
Example Addresses
Here are some practice addresses (US and foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa
The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
function! ParseAddress(line)
    let r = split(a:line, ',\s*', 1)
    let hadcountry = r[-1] !~ '\d'
    let a = {}
    let a.country = hadcountry ? r[-1] : ''
    let r = r[:-1-hadcountry]
    let a.zip = r[-1]
    let a.state = r[-2]
    let a.city = r[-3]
    let a.header = r[0]
    let nleft = len(r) - 4
    if hadcountry
        let a.address1 = r[-4]
        let a.address2 = join(r[1:nleft-1], ', ')
    else
        let a.address1 = r[1]
        let a.address2 = join(r[2:nleft], ', ')
    endif
    return a
endfunction

function! FormatAddress(a)
    let t = map([
        \ ['Header', 'header'],
        \ ['Address 1', 'address1'],
        \ ['Address 2', 'address2'],
        \ ['Town', 'city'],
        \ ['State/Province', 'state'],
        \ ['Country', 'country'],
        \ ['Zip', 'zip']],
        \ 'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
        \ '? v:val[0] . ": " . a:a[v:val[1]] : ""')
    return join(filter(t, '!empty(v:val)'), '; ')
endfunction
The command below can be used to test the above parsing routines.
:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))
(One can provide a range to the :global command to run it through a smaller number of test address lines.)
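For example, the first practice address above should come out as:
Header: City Gas & Electric - Bldg 4; Address 1: 222 Middle Park Ct; Address 2: CP4120F; Town: Dallas; State/Province: Texas; Zip: 44984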
Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately when you deal with other addresses.
Best practices for storing postal addresses in a database
Is there a common street address database design for all addresses of the world
How many address fields would you use for a UK address
ISO Standard Street Addresses
There are probably other related questions: see the tag street-address for some of them.