Extract Information from Text in R - regex

I am working on Entity Extraction in R.
I have a UniqueID and a Text field, and I need to extract location information from the Text field.
My Text field contains descriptions with location names:
text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE 4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.")
I have a list of locations:
Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")
I need to match against Loc and extract those locations from the text field. In the text field, SENOKO INDUSTRIAL ESTATE is spelt in different ways: Senoko Estate, Senoko (half name), or with a spelling mistake such as senok Est. For all of these misspelt and half-spelt words I need to get the exact name from Loc, i.e. SENOKO INDUSTRIAL ESTATE.
My output would look like this (extract locations from the Text field, using the correct names for half-spelt and misspelt words):
ID Location
123 SERANGOON JUNIOR COLLEGE|Block 4|SHELL TAMPINES AVENUE 4|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE

I don't think this is the prettiest way to answer it, but..
text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE 4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.")
Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")
text <- gsub(".*serang.*", "SERANGOON JUNIOR COLLEGE", text, ignore.case=TRUE)
text <- gsub(".*bl.* 4.*", "Block 4", text, ignore.case=TRUE)
text <- gsub(".*shell.*", "SHELL TAMPINES AVENUE 4", text, ignore.case=TRUE)
text <- gsub(".*senok.*", "SENOKO INDUSTRIAL ESTATE", text, ignore.case=TRUE)
print(text)
I didn't put it exactly in the format you requested, but that would be the contents of the second column (aka Location). I used ".*" before and after the strings you were looking for in case there are other possibilities/typos, which makes it more robust.
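If you do want the single pipe-delimited row from your example, you could collapse the corrected vector afterwards; a rough sketch, assuming an ID of 123 as in your example:
# collapse the corrected locations into one pipe-delimited string for the ID
result <- data.frame(ID = 123, Location = paste(text, collapse = "|"))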
Hope this helps!

Related

How to know if a variation (e.g. an abbreviation) of a string in a list matches against another list if the original does not?

I am currently searching for a method in R which lets me match/merge two data frames. Alas, both of these data frames contain non-optimal data: they can have certain abbreviations or even typos in them. Therefore I would like to define a list for each abbreviation and check whether a string contains one of those elements. If the original entries don't match, R should check whether any of the other variants of the abbreviation has a match. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name Address
Deloitte Ltd. New York
Coca-Cola New York
Tesla ltd California
Microsoft Limited Washington
This would have to be merged with "EnterpriseNrList":
Company name EnterpriseNumber
Deloitte Ltd. 221
Coca-Cola 334
Tesla ltd 725
Microsoft Limited 127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported as the return.
Therefore I would make up a list "Abbreviations" for each possible abbreviation
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), something like a "contains" check in Excel?
3) How could I make up a list that replaces the entries that do not match the abbreviation with all other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regex to check whether a string contains any of the abbreviation options, and create a list which will then contain either -1 if no match could be found or >0 if a match is found. The entries without a pattern match can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines, e.g., Coca-Cola with all its possible abbreviations.
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2 I thought it might be a bit easier on the required computing power, as the original list is about 8000 rows.
I would take a different approach, fixing the tables before the merge.
To fix the abbreviations, I would use a case-insensitive regex with the final dot optional. I start with a list of 'normal word' = vector of abbreviations.
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; the case will be ignored by a parameter in gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the Company.name column of each data frame:
for (i in seq_along(regexes)) {
  Address$Company.name    <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take typos into account. For that you'll need to work with agrep or adist (see the sketch after the input data below).
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just check which elements of your list contain an abbreviation with the grep or grepl functions (grep returns all indexes that have a matching pattern, whereas grepl returns a logical vector).
Also, use the ignore.case = TRUE parameter of these functions, so you don't have to try all capitalised/lowercase possibilities.
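For example, a rough sketch, assuming an Abbreviations vector like the one in your question and the Address data frame shown in the other answer's input data:
Abbreviations <- c("Limited", "Ltd")
# TRUE where the company name contains any abbreviation (optionally followed by a dot), ignoring case
has_abbrev <- grepl(paste0("\\b(", paste(Abbreviations, collapse = "|"), ")\\.?"),
                    Address$Company.name, ignore.case = TRUE)
has_abbrev
# [1]  TRUE FALSE  TRUE  TRUE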
If you don't have such a list, my first guess would be to extract the first "word" of each company (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
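(Note that unlist flattens everything into one long vector of words; to keep exactly one first word per company you could do something like the following, reusing the hypothetical CompanyNames vector above:)
# one first word per company name
sapply(strsplit(CompanyNames, split = " "), `[`, 1)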
If you wanted to also correct for typos, this is more a question of string distance.
Hope that it helped!

Remove defined strings from sentences in dataframe

I need to remove defined strings from sentences in a data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
Sentences User
bad printer for the money wireless setup was surprisingly easy 1
love my samsung galaxy tabinch gb whitethis is the first 2
Defined strings to exclude, e.g.:
stop_words <- c("bad", "money", "love", "is", "the")
I was wondering about something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
But I need it for all items in the list. Then I need the output table in the same structure as the input table sent1, but without the defined strings.
I would really appreciate any help or advice.
You can combine the stop words into a single regex pattern, so you only need a single gsub command.
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
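If you want the result in the same structure as sent1 (i.e. keeping the user column), one way, reusing the objects above:
# copy the input and overwrite only the Sentences column
sent1_clean <- sent1
sent1_clean$Sentences <- gsub(pattern, "", sent1$Sentences)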

R - does failed RegEx pattern matching originate in file conversion or use of tm package?

As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some preprocessing on a corpus of texts using R before processing them further on the NLP platform GATE. I convert the original pdf files to text as follows (the text files, unfortunately, go into the same folder):
dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
Then, having loaded the tm package and physically(!) moved the text files to another folder, I create a corpus:
TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))
I then want to perform a series of custom transformations to clean the texts. I succeeded in replacing a simple string as follows:
ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")
However, a pattern that is a 1-3 digit page number followed by two line breaks and a page break is causing me problems. I want to replace it with a blank space:
EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")
The ([0-9]{1,3}) and \f alone match. The line breaks don't. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.
But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks even though the lines are obviously breaking before the margin, e.g.
[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"
Seeing this, I tried other patterns, e.g. to detect one or more blank space ("[0-9]{1,3}\s*\f"), but no patterns worked.
So my questions are:
Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
If having no line breaks is normal, how can I pattern-match the character on line 5? Is that not a blank space?
(A tangential concern:) When converting the pdf files, is there code that will put them directly in a new folder?
Apologies for extending this, but how can one print or inspect only a few lines of the text object? The tm commands and head(EU) print the entire object, each a very long text.
I know my problem(s) must appear simple and perhaps stupid, but one has to start somewhere and extensive searching has not revealed a source that explains comprehensively how to use RegExes to modify text objects in R. I am so frustrated and hope someone here will take pity and can help me.
Thanks for any advice you can offer.
Brigitte
p.s. I think it's not possible to upload attachments in this forum, therefore, here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the doc is long, I created a snippet of the first 3 pages of the TXT doc, read it into the R corpus ('EU') and printed it to the console and this is it:
dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************",
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report",
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations",
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures",
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities",
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion",
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation",
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition",
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual",
"3.3. Economic and Fiscal Affairs Economic and Monetary Union",
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0),
datetimestamp = structure(list(sec = 50.1142621040344, min = 33L,
hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L,
yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt"), tzone = "GMT"), description = character(0), heading = character(0),
id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author",
"datetimestamp", "description", "heading", "id", "language",
"origin"), class = "TextDocumentMeta")), .Names = c("content",
"meta"), class = c("PlainTextDocument", "TextDocument"))
Yes, working with text in R is not always a smooth experience! But you can get a lot done quickly with some effort (maybe too much effort!)
If you could share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with regex. That would also add a reproducible example to your question, which is an important thing to have in questions here so that people can test their answers and make sure they work for your specific problem.
No need to put the PDF and text files in separate folders; instead you can use a pattern like so:
EU <- Corpus(DirSource(pattern = ".txt"))
This will only read the text files and ignore the PDF files.
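(Regarding your tangential question about the output folder: pdftotext also takes an output file as its second argument, so you can write the .txt files straight into another folder. A rough, untested sketch along the lines of your conversion call:)
# write each converted text file into a separate output folder
outdir <- "./MyFolderWithTXTfiles"
lapply(myfiles, function(i) {
  out <- file.path(outdir, sub("\\.pdf$", ".txt", basename(i)))
  system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"',
               paste0('"', i, '"'), paste0('"', out, '"')), wait = FALSE)
})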
There is no 'snippet view' method in tm, which is annoying. I often use just names(EU) and EU[[1]] for quick looks.
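For a quick look at just a few lines of a single document, you can also index into its content, a small sketch (content() is the accessor in recent tm versions, and your dput shows the text is stored in a content field):
# peek at the first ten lines of the first document in the corpus
head(content(EU[[1]]), 10)
# or, for the structure shown in your dput: EU[[1]]$content[1:10]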
UPDATE
With the data you've just added, I'd suggest a slightly tangential approach. Do the regex work before passing the data to the tm package formats, like so:
# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")
# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)
# convert to text (note my pdftotext is in a different location to yours)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
# read plain text into R
x1 <- readLines("my_pdf.txt")
# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
x4 <- gsub("[0-9]{1,3} \f", "", x3)
# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))
With the snippet of data you provided using dput, the output from this method is
inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics

Splitting strings that contain commas (special characters?)

I'm working from a spreadsheet of values. I have code that pulls a row of content to analyze. I was planning to split it on commas, but some of the strings inside the cells include commas (that aren't regularly spaced, so escaping them would be difficult). I downloaded the sheet as a tsv instead of a csv and re-uploaded it, but my attempts to split on \t haven't been successful. (For good measure, I've also tried \n, \r, and \f to see if they're involved in delimiting cells. They don't seem to be.)
Is there a special character that means "next cell" or "next record" or something like that? Am I better off trying to end each cell with a particular character that I would then have to strip out of my data after splitting? I'd welcome any other ideas!
Code snippet:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var contentChunks = lastRowContents.toString().split('\t');
var product = contentChunks[0];
Logger.log(product);
This outputs the entire row as one item in that array, like so:
product: Wed Jan 05 2005 02:00:00 GMT-0600 (CST),001-2005, Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings ,misbranded,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index271,http://www.fsis.usda.gov/wps/portal/fsis/topics/recalls-and-public-health-alerts/recall-case-archive/recall-case-archive-2005/!ut/p/a1/jZDBCoJAEIafpQdYdlZN9CgLppa7SGS2l1gW0wVTMfHQ06d0MpScOc3w_XzMYIEzLGo56EL2uqllNc3CvkMCNnEpRNz3fAiZ6acOOxDg9gjcZoBLJiBN-JFScJi5Mb9SHvzLRxsERhfTuMCilX2JdP1ocNblSlYVUvKVI9mpUg_54hIZAHt8xWKuATL2qDlbQcRM4NYvsPCHL7B-aPu8ZO9TADr0dh-fh2db/?1dmy&current=true&urile=wcm%3apath%3a%2Ffsis-archives-content%2Finternet%2Fmain%2Ftopics%2Frecalls-and-public-health-alerts%2Frecall-case-archive%2Farchives%2Fct_index386,Day-Lee Pride Beef Gyoza Potstickers, Vegetable and Beef Dumplings,Produced 10/6/2004. The products subject to recall are: One pound bags of "DAY-LEE PRIDE BEEF GYOZA POTSTICKERS, VEGETABLE AND BEEF DUMPLINGS." Each bag bears the code "28004," as well as "Est. 17309" inside the USDA mark of inspection.,The packages state that the gyozas are filled with beef, but they may instead contain shrimp, a known allergen.,The problem was discovered by the establishment.,17309 M Day-Lee Foods Inc. 13055 E. Molette St. Santa Fe Springs, CA 90670,,Approximately 2,520 pounds,California, Colorado, Georgia, Maryland, New York, and Washington.,Class I,U.S. Food and Drug Administration (FDA),,,,,
(just for visibility :)
Since lastRowContents is a 2D array (link to doc), you can access each cell with lastRowContents[0][0], lastRowContents[0][1], lastRowContents[0][2], etc.
In your code:
var lastRowContents = dataSheet.getRange(lastRow, 1, 1, 21).getValues();
var product = lastRowContents[0][0];
Logger.log(product);

Vim: Parsing address fields from all around the globe

Intro
This post is long, but I consider it thorough. I hope this post might be helpful to others dealing with addresses, while also teaching complex Vim regexes. Thank you for your time.
Worldwide addresses:
Users from America, Canada and a few other countries are offered 5 fields on a form, which are then displayed in a comma-delimited format that I need to further dissect. Ideally, the comma-separated content looks like:
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty as non-US countries do are not segmented into states. And, yes, the same "additional commas" as described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field, then you should be able to place the country, zip, state (if not empty ",,"), city and street into their respective positions - which are the most important fields to get right. Anything beyond those sections could be lumped together in the first one or two lines as descriptions of the address (i.e. building, name, suite, cross streets, etc.). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
Last section has a digit (therefore a US or Canadian address)
There are a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
While there might be a discrepancy over where "111 Street" or "Suite 101" goes (Address1 or Address2), it at least gets the zip, state, city and address(es) lumped together and leaves the first section as the "Header" of the address for data entry purposes.
Under this approach, foreign address get parsed like:
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any number of sections above the base are added to a second address2 field. If there are 3 sections above the base section count, they are appended to each other inside the address2 field.
Coding
In this approach using Vim, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections when I am not sure how many sections exist?
Example Addresses
Here are some practice addresses (US and foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa
The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
function! ParseAddress(line)
  let r = split(a:line, ',\s*', 1)
  let hadcountry = r[-1] !~ '\d'
  let a = {}
  let a.country = hadcountry ? r[-1] : ''
  let r = r[:-1-hadcountry]
  let a.zip = r[-1]
  let a.state = r[-2]
  let a.city = r[-3]
  let a.header = r[0]
  let nleft = len(r) - 4
  if hadcountry
    let a.address1 = r[-4]
    let a.address2 = join(r[1:nleft-1], ', ')
  else
    let a.address1 = r[1]
    let a.address2 = join(r[2:nleft], ', ')
  endif
  return a
endfunction

function! FormatAddress(a)
  let t = map([
  \   ['Header', 'header'],
  \   ['Address 1', 'address1'],
  \   ['Address 2', 'address2'],
  \   ['Town', 'city'],
  \   ['State/Province', 'state'],
  \   ['Country', 'country'],
  \   ['Zip', 'zip']],
  \   'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
  \   '? v:val[0] . ": " . a:a[v:val[1]] : ""')
  return join(filter(t, '!empty(v:val)'), '; ')
endfunction
The command below can be used to test the above parsing routines.
:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))
(One can provide a range to the :global command to run it through fewer of the test address lines.)
Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost as soon as you deal with other addresses.
Best practices for storing postal addresses in a database
Is there a common street address database design for all addresses of the world
How many address fields would you use for a UK address
ISO Standard Street Addresses
There are probably other related questions: see the tag street-address for some of them.