I apologize in advance for the length of my question! I am trying to automate building a schedule of papers for our event. Paper and author data is provided in a spreadsheet (which my poor colleague currently uses to manually cut and paste, line by line, into a Word doc). This spreadsheet contains all the info I need to build the schedule, in consistently-named columns, but it can be in any order. Sort of like this (but the real paper titles won't be conveniently numbered):
Jack Doe - Co-Author - Penn State University - Aerodynamics - Aerodynamics Paper I
John Doe - Co-Author - Penn State University - Acoustics - Acoustics Paper I
John Smith - Co-Author - University of VA - Acoustics - Acoustics Paper I
Jane Doe - Main Author - Penn State University - Acoustics - Acoustics Paper I
Bob Smith - Main Author - GA Tech - Acoustics - Acoustics Paper II
Jack Smith - Main Author - University of MD - Acoustics - Acoustics Paper III
Jill Smith - Co-Author - University of MD - Acoustics - Acoustics Paper III
Bob Doe - Main Author - Penn State University - Aerodynamics - Aerodynamics Paper I
My goal is to transmogrify the spreadsheet data such that papers are grouped and ordered by session (i.e. Acoustics, Aerodynamics), then by paper title (i.e. Acoustics Paper I, Acoustics Paper II), then by authors from each university. The catch is that the "main author" for any given paper must be listed first, along with co-authors (if any) from that same school coming next, followed by co-authors from other universities. The other co-authors can be in any order but must also be grouped by university.
So taking the original example, it should come out like this:
ACOUSTICS
Acoustics Paper I
Jane Doe, John Doe, Penn State University; John Smith, University of VA
Acoustics Paper II
Bob Smith, GA Tech
Acoustics Paper III
Jack Smith, Jill Smith, University of MD
AERODYNAMICS
Aerodynamics Paper I
Bob Doe, Jack Doe, Penn State University
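In other words, the ordering rule I am after, sketched as Python-style pseudocode (the field names here are just illustrative, not my actual column names):

```python
def order_authors(authors):
    """Sort one paper's authors: main author first, then co-authors from
    the main author's university, then the rest grouped by university."""
    main = next(a for a in authors if "Main" in a["type"])

    def key(a):
        if a is main:
            return (0, "")                      # main author first
        if a["institution"] == main["institution"]:
            return (1, "")                      # same school next
        return (2, a["institution"])            # others grouped by school

    return sorted(authors, key=key)

paper = [
    {"name": "John Doe", "type": "Co-Author", "institution": "Penn State University"},
    {"name": "John Smith", "type": "Co-Author", "institution": "University of VA"},
    {"name": "Jane Doe", "type": "Main Author", "institution": "Penn State University"},
]
print([a["name"] for a in order_authors(paper)])
# ['Jane Doe', 'John Doe', 'John Smith']
```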
I am almost there, but I can only get it to this:
ACOUSTICS
Acoustics Paper I
Jane Doe, Penn State University; John Doe, Penn State University; John Smith, University of VA;
Acoustics Paper II
Bob Smith, GA Tech;
Acoustics Paper III
Jack Smith, University of MD; Jill Smith, University of MD;
AERODYNAMICS
Aerodynamics Paper I
Bob Doe, Penn State University; Jack Doe, Penn State University;
We are using ACF 2016. What I am doing (my code is below) is reading the spreadsheet into a query object with cfspreadsheet. Then I group the output by session and then by title with nested cfoutputs.
Then, because I could not think of any other way to identify the main author for each paper and put them first, I loop over all authors for that paper and add a flag to identify them, and sort on that with arraySort. Note that I cannot simply sort by author type DESC, because there is another type, "presenting author," which I omitted for brevity (ha). And sometimes the main author can also be the presenting author, so that type would be "main author presenting author."
At any rate, I then loop over the sorted array.
Below is what I have tried so far. I am stuck on getting the university to only show once for each list of authors. I have tried putting another loop in my authorArray loop, but I do not know what to index or loop on, so it just ends up outputting the university name after every author name. I have tried using multidimensional arrays, and even using query of query to try and build a nice, ordered data structure. But I am evidently doing it wrong, because I keep ending up getting stumped by grouping the authors by their university.
I sure would appreciate any tips or hints! Please note that I cannot change the requirement to initially work with this spreadsheet. However, once I get it, I can do anything with the info that I need to get the desired output. So I am entirely open to making any changes or rethinking my entire approach. My code below is the closest I have gotten.
Thank you all very much in advance! Here is what I am using so far:
<cfoutput query="queryPapers" group="PrimarySession">
    #PrimarySession#
    <cfoutput group="Title">
        <p>#Title#</p>
        <cfset authorArray = arrayNew(1)>
        <cfoutput>
            <cfset authorStruct = structNew()>
            <cfset authorStruct.firstName = AuthorFirstName>
            <cfset authorStruct.lastName = AuthorLastName>
            <cfset authorStruct.institution = AuthorInstitution>
            <cfset authorStruct.authorType = AuthorType>
            <cfif findNoCase("Main", AuthorType)>
                <cfset authorStruct.authorMain = "A">
            <cfelse>
                <cfset authorStruct.authorMain = "B">
            </cfif>
            <cfset arrayAppend(authorArray, authorStruct)>
            <cfscript>
                arraySort(
                    authorArray,
                    function (e1, e2) {
                        return compare(e1.authorMain, e2.authorMain);
                    }
                );
            </cfscript>
        </cfoutput>
        <cfloop index="i" from="1" to="#arrayLen(authorArray)#">
            #authorArray[i].firstName# #authorArray[i].lastName#,
            #authorArray[i].institution#;
        </cfloop>
    </cfoutput>
</cfoutput>
Here is some actual output of the above code:
Dynamic Stall Investigations
Sergey Smith,* University of Maryland; Tobias Lersdorf, German University; Pascal Marceau, University of Maryland;
And I am trying to get to
Dynamic Stall Investigations
Sergey Smith,* Pascal Marceau, University of Maryland; Tobias Lersdorf, German University
Thanks very much for reading!
You're on the right track, but I think you're overcomplicating it a bit. You can simplify your nested <cfoutput> processing with the snippet below.
<!--- Nested output loop for displaying required result --->
<cfoutput query="queryPapers" group="PrimarySession">
    <strong>#UCase(PrimarySession)#</strong><br />
    <cfoutput group="Title">
        <i>#Title#</i><br />
        <cfoutput group="AuthorInstitution">
            <cfoutput>
                #AuthorFirstName# #AuthorLastName#,
            </cfoutput>
            #AuthorInstitution#; <!--- display institution once per group --->
        </cfoutput>
        <br /><br /> <!--- double-space after each title group --->
    </cfoutput>
</cfoutput>
Displaying the university only once can be accomplished by adding another level of group nesting and outputting the institution in the footer of that extra nested group.
Making sure the Main Author is always first should be handled in your preprocessing. To do that, use your existing if/else logic ("A" for Main Author, "B" otherwise) and add it as an extra column to your query. This way you can re-sort by including it in your ORDER BY clause before the output loop.
UPDATE
So I realized after posting my first revision that there is a minor logic flaw. It didn't surface because, with the sample data above, the Main Author conveniently always belonged to the university that sorts alphabetically first within the "Title" group. I noticed it after seeing the additional sample output; when I added those rows to my test data, my code also displayed them incorrectly, as below.
Dynamic Stall Investigations
Sergey Smith, University of Maryland; Tobias Lersdorf, German University; Pascal Marceau, University of Maryland;
The solution is to keep the existing authorMain column ("A" for main authors, "B" otherwise) and add a third value, "A2", for non-main authors belonging to the same university as the main author. The tricky part is that you have to inspect a value in another row to determine when to set "A2". The best solution I could think of is to add the two blocks of code below right after populating the authorMain column.
<!--- Sort query so "Main Author" is first within PrimarySession and Title --->
<cfquery name="queryPapers" dbtype="query">
    select *
    from queryPapers
    order by PrimarySession, Title, AuthorMain
</cfquery>

<!--- Loop through the above and update non-"Main Author" rows to "A2" if they share a university with an "A" row --->
<cfset MainInstitution = "">
<cfloop query="queryPapers">
    <cfif queryPapers.authorMain eq "A">
        <cfset MainInstitution = queryPapers.AuthorInstitution>
    <cfelse>
        <cfif MainInstitution eq queryPapers.AuthorInstitution>
            <cfset QuerySetCell(queryPapers, "authorMain", "A2", queryPapers.currentRow)>
        </cfif>
    </cfif>
</cfloop>
First sort by PrimarySession, Title, and AuthorMain; then loop through the recordset, tracking the main author's institution in MainInstitution and updating non-main rows to "A2" when they share it. This generates the proper result while allowing all other code to remain untouched.
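For clarity, here is the same two-pass idea sketched in Python (field names are just illustrative, not from the actual query):

```python
def flag_same_institution(rows):
    """rows are pre-sorted so "A" (main author) comes first per paper.
    Promote non-main authors who share the main author's institution
    from "B" to "A2", so a plain sort on this flag puts them right
    after the main author."""
    main_institution = ""
    for row in rows:
        if row["authorMain"] == "A":
            main_institution = row["institution"]
        elif row["institution"] == main_institution:
            row["authorMain"] = "A2"
    return rows

rows = [
    {"name": "Sergey Smith", "authorMain": "A", "institution": "University of Maryland"},
    {"name": "Tobias Lersdorf", "authorMain": "B", "institution": "German University"},
    {"name": "Pascal Marceau", "authorMain": "B", "institution": "University of Maryland"},
]
flag_same_institution(rows)
print([r["authorMain"] for r in rows])
# ['A', 'B', 'A2']
```

Sorting on the flag afterwards yields A, A2, B, which is exactly the main-author-first, same-school-next ordering.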
You can see the difference between the first revision and the second revision of my code which simulates the OP's scenario.
Related
I am a school teacher working on a document for my school to be able to request students by teacher, but I'm having a lot of trouble getting the code set up to send the email to each student's A2/B6 teacher. The part of the process I need help with is several layers into the project, so the compiled data is fairly complex. I've read a lot of help threads on Stack Overflow over the last two weeks, but I can't find anything that even gets me started on this specific task.
I have a row that contains teachers' emails. Below each email is a column of that teacher's students' names, and next to that is another column containing the requesting teacher's name. I need to write a script that will take the email in cell A2 and send it the range of data in A3:B20, take the email in cell C2 and send it the range C3:D20, take the email in cell E2 and send it the range E3:F20, and so on for 75+ teachers.
Here is a picture of my sheet
Really, my question is: is this possible? If so, do you have any ideas that could point me in the right direction, or a snippet of code to get me started? I am new to Google Apps Script, but I've learned a lot working on this project.
Any help, insights, or suggestions would be really appreciated.
I have created a dummy document with computer-generated names here that shows what my sheet setup is like:
https://docs.google.com/spreadsheets/d/1QONEAxMQLBDKwgaXc4RwH_rgb_RzlxTkHl5euSSB9Wk/edit?usp=sharing
Hopefully, this short example will help you to get started.
function myFunction() {
  var hl = '';
  var ss = SpreadsheetApp.getActive();
  var sh = ss.getSheetByName('Sheet1');
  var subject = 'Enter Subject Here';
  // walk the teacher blocks two columns at a time (A, C, E, ...)
  for (var col = 1; col <= sh.getLastColumn(); col += 2) {
    var rg = sh.getRange(1, col, sh.getLastRow(), 2);
    var vA = rg.getValues();
    var s = 'StudentName,RequestingTeacher\n';
    var html = '<table>';
    html += '<tr><th>StudentName</th><th>RequestingTeacher</th></tr>';
    for (var i = 2; i < vA.length; i++) { // student data starts on row 3
      html += Utilities.formatString('<tr><td>%s</td><td>%s</td></tr>', vA[i][0], vA[i][1]);
      s += Utilities.formatString('%s,%s\n', vA[i][0], vA[i][1]);
    }
    html += '</table>';
    // row 1 holds the teacher's name, row 2 the email address
    //GmailApp.sendEmail(vA[1][0], subject, s, {htmlBody: html});
    hl += Utilities.formatString('RecipientName: %s<br />RecipientEmail: %s<br />Column: %s<br />', vA[0][0], vA[1][0], col);
    hl += html;
    hl += '<br /><br /><br />';
  }
  var ui = HtmlService.createHtmlOutput(hl);
  SpreadsheetApp.getUi().showModelessDialog(ui, 'An Example of What the Email Body Will Look Like');
}
The sendEmail line is commented out and I use a dialog to show you what the emails will look like more or less.
This is what my dialog looks like:
RecipientName: Test Teacher
RecipientEmail: tteacher@schooldistrict.org
Column: 1
StudentName RequestingTeacher
Braydon Nichols
Kiley Lozano
Shania Olsen
Rodney Howell Duckworth
Tiana Shelton HOPE Squad
Stephen Wiggins Moore
Kael Rangel
Beau Pennington
Hezekiah Vincent Batman
Iyana Lewis Moore
Theodore Klein
Rubi Webster S. Ward
Natalee Wong Batman
Chris Rocha Batman
Eileen Smith
Kara Johnston
Carsen Waters Moore
Bria Schmitt Cotterell
Abby Yoder
Natalie Durham
RecipientName: Example Teacher
RecipientEmail: eteacher@schooldistrict.org
Column: 3
StudentName RequestingTeacher
Brandon Bean
Wade Cross
Jaxon Ford
Josie Barajas W. Smith
Aimee Ross
Maren Cox Batman
Kyle Morton
Beatrice Hill W. Smith
Stephen Carroll Batman
Anton Galvan
Marlie Neal Anderson
Alexander Andersen W. Smith
Jacquelyn Boyer
Nora Brennan
Derek Ayers
Van Obrien
Amari Rasmussen
Aiyana Collier Cotterell
Annalise Vance
Kieran Booker
RecipientName: Awesome Teacher
RecipientEmail: ateacher@schooldistrict.org
Column: 5
StudentName RequestingTeacher
Brooklynn Hahn W. Smith
Jenny Lutz W. Smith
Lilian Moreno HOPE Squad
Journey Travis
Kenna Lawson Anderson
Kathy Mccarthy
Dayanara Strickland Moore
Anna Knight
Kamron Osborne
Turner Mcintosh Cotterell
Tyrone Mullins
Selena Oneal
Tabitha Hernandez
Andreas Chan Batman
Dashawn Munoz HOPE Squad
Laylah Morse HOPE Squad
Jamie Anthony
Damion Duffy
Christina Donovan
Hugh Gomez
RecipientName: Dummy Teacher
RecipientEmail: dteacher@schooldistrict.org
Column: 7
StudentName RequestingTeacher
Payton Huerta Moore
Easton Pittman
Lyric Morrow HOPE Squad
Jada Richardson Batman
Jon Mckay HOPE Squad
Demetrius Horton Anderson
Lilly Atkinson
Spencer Mathews W. Smith
Jalen Hanna Dibb
Miracle Best
Emerson Frost
Colt Andersen Dibb
Leanna Gibbs
Liana Branch S. Ward
Jamie Mooney
Mara Escobar Dibb
Liliana Galloway Anderson
Jane Schmitt Cotterell
Aryan Melendez
Dalton Ritter
I have a text file which has information, like so:
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: A1QA985ULVCQOB
review/profileName: Carleen M. Amadio "Lady Dragonfly"
review/helpfulness: 2/2
review/score: 5.0
review/time: 1314057600
review/summary: Fun for adults too!
review/text: I really enjoy these scissors for my inspiration books that I am making (like collage, but in books) and using these different textures these give is just wonderful, makes a great statement with the pictures and sayings. Want more, perfect for any need you have even for gifts as well. Pretty cool!
product/productId: B000GKXY4S
product/title: Crazy Shape Scissor Set
product/price: unknown
review/userId: ALCX2ELNHLQA7
review/profileName: Barbara
review/helpfulness: 0/0
review/score: 5.0
review/time: 1328659200
review/summary: Making the cut!
review/text: Looked all over in art supply and other stores for "crazy cutting" scissors for my 4-year old grandson. These are exactly what I was looking for - fun, very well made, metal rather than plastic blades (so they actually do a good job of cutting paper), safe ("blunt") ends, etc. (These really are for age 4 and up, not younger.) Very high quality. Very pleased with the product.
I want to parse this into a dataframe with productId, title, price, etc. as columns and the data as the rows. How can I do this in R?
A quick and dirty approach:
mytable <- read.table(text=mytxt, sep = ":")
mytable$id <- rep(1:2, each = 10)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
There will be issues if there are other colons in the data. It also assumes that there is an equal number (10) of variables for each case.
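If the stray-colon issue bites, a more robust route is to preprocess the file before (or instead of) `read.table`: split records on blank lines and split each line on the first colon only. A hypothetical Python sketch of that idea:

```python
def parse_reviews(text):
    """Split blank-line-separated records, then split each line on the
    FIRST colon only, so colons inside the review text survive."""
    records = []
    for block in text.strip().split("\n\n"):
        rec = {}
        for line in block.splitlines():
            key, _, val = line.partition(":")
            rec[key.strip()] = val.strip()
        records.append(rec)
    return records

sample = """product/productId: B000GKXY4S
review/summary: Fun for adults too!

product/productId: B000GKXY4S
review/summary: Making the cut!"""

print(parse_reviews(sample)[1]["review/summary"])
# Making the cut!
```

The list of dicts can then be written out as CSV and read into a dataframe, and it doesn't assume a fixed number of fields per record.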
As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some preprocessing on a corpus of texts using R before processing them further on the NLP platform GATE. I convert the original pdf files to text as follows (the text files, unfortunately, go into the same folder):
dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
Then, having loaded the tm package and physically(!) moved the text files to another folder, I create a corpus:
TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))
I then want to perform a series of custom transformations to clean the texts. I succeeded in replacing a simple string as follows:
ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")
However, a pattern that is a 1-3 digit page number followed by two line breaks and a page break is causing me problems. I want to replace it with a blank space:
EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")
The ([0-9]{1,3}) and \f alone match. The line breaks don't. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.
But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks even though the lines are obviously breaking before the margin, e.g.
[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"
Seeing this, I tried other patterns, e.g. to detect one or more blank space ("[0-9]{1,3}\s*\f"), but no patterns worked.
So my questions are:
Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
If the absence of line breaks is normal, how can I pattern-match the character on line 5? Is that not a blank space?
(A tangential concern:) When converting the pdf files, is there code that will put them directly in a new folder?
Apologies for extending this, but how can one print or inspect only a few lines of the text object? The tm commands and head(EU) print the entire object, each a very long text.
I know my problem(s) must appear simple and perhaps stupid, but one has to start somewhere, and extensive searching has not revealed a source that explains comprehensively how to use regexes to modify text objects in R. I am so frustrated and hope someone here will take pity and help me.
Thanks for any advice you can offer.
Brigitte
p.s. I think it's not possible to upload attachments in this forum, therefore, here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the doc is long, I created a snippet of the first 3 pages of the TXT doc, read it into the R corpus ('EU') and printed it to the console and this is it:
dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************",
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report",
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations",
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures",
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities",
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion",
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation",
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition",
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual",
"3.3. Economic and Fiscal Affairs Economic and Monetary Union",
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0),
datetimestamp = structure(list(sec = 50.1142621040344, min = 33L,
hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L,
yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt"), tzone = "GMT"), description = character(0), heading = character(0),
id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author",
"datetimestamp", "description", "heading", "id", "language",
"origin"), class = "TextDocumentMeta")), .Names = c("content",
"meta"), class = c("PlainTextDocument", "TextDocument"))
Yes, working with text in R is not always a smooth experience! But you can get a lot done quickly with some effort (maybe too much effort!)
If you could share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with regex. That would also add a reproducible example to your question, which is an important thing to have in questions here so that people can test their answers and make sure they work for your specific problem.
No need to put PDF and text files in separate folders, instead you can use a pattern like so:
EU <- Corpus(DirSource(pattern = ".txt"))
This will only read the text files and ignore the PDF files
There is no 'snippet view' method in tm, which is annoying. I often use just names(EU) and EU[[1]] for quick looks
UPDATE
With the data you've just added, I'd suggest a slightly tangential approach. Do the regex work before passing the data to the tm package formats, like so:
# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")
# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)
# convert to text (note: my pdftotext is in a different location to yours)
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
# read plain text into R
x1 <- readLines("my_pdf.txt")
# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
x4 <- gsub("[0-9]{1,3} \f", "", x3)
# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))
With the snippet of data your provided using dput, the output from this method is
inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics
I am using Perl to scrape the following into a .txt file which I'd ultimately bring into Stata. What format works best? I have many such observations, so I would like an approach that generalizes.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the number of terms in Location varies, as in these two examples, but I always want the two before USA.
At the moment, I am putting quotes around each variable from the table for the .txt file.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that Stata uses infile or insheet and can handle commas or tabs to separate variables. I did not know how to scrape a variable like Location in Perl with all of those internal commas, so I added the quotes.
There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using the commas to parse. Note that I removed the commas between the quoted fields:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.
This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example, the first comma is found by strpos() like this:
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.
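For comparison, the peel-from-the-end idea is short in a language with negative indexing; here it is sketched in Python (purely illustrative, not something Stata runs):

```python
def county_state(location):
    """Return the two location elements immediately before the final "USA"."""
    parts = [p.strip() for p in location.split(",")]
    assert parts[-1] == "USA"        # sanity check on the assumed format
    return parts[-3], parts[-2]      # county, state

print(county_state("District 1, Ocean City, Cape May, New Jersey, USA"))
# ('Cape May', 'New Jersey')
```

The same logic works however many leading elements (district, precinct, city) the Location string has, which is the point of working backwards from the end.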
Below is an example of a text file I need to parse.
Lead Attorney: John Doe
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
Geographic Area: Wisconsin
Affiliated Offices: None
E-mail: blah@blah.com
I need to parse all the key/value pairs and import it into a database. For example, I will insert 'John Doe' into the [Lead Attorney] column. I started a regex but I'm running into problems when parsing line 2:
Staff Attorneys: John Doe Jr. Paralegal: John Doe III
I started with the following regex:
(\w*.?\w+):\s*(.)(?!(\w.?\w+:.*))
But that does not parse out 'Staff Attorneys: John Doe Jr.' and 'Paralegal: John Doe III'. How can I ensure that my regex returns two groups for every key/value pair even if the key/value pairs are on the same line? Thanks!
Do only certain keys appear as a second key on a line? If so, the text above can be fixed by doing a data.replace('Paralegal:', '\nParalegal:') first. Then there is only one key/value pair per line, and parsing becomes trivial:
>>> data = """Lead Attorney: John Doe
... Staff Attorneys: John Doe Jr. Paralegal: John Doe III
... Geographic Area: Wisconsin
... Affiliated Offices: None
... E-mail: blah@blah.com"""
>>>
>>> result = {}
>>> data = data.replace('Paralegal:', '\nParalegal:')
>>> for line in data.splitlines():
... key, val = line.split(':', 1)
... result[key.strip()] = val.strip()
...
>>> print(result)
{'Staff Attorneys': 'John Doe Jr.', 'Lead Attorney': 'John Doe', 'Paralegal': 'John Doe III', 'Affiliated Offices': 'None', 'Geographic Area': 'Wisconsin', 'E-mail': 'blah@blah.com'}
If "Paralegal:" can also appear at the start of a line, you can write a regexp that performs the replacement only when it's not at the start, or use .find and check that the preceding character is not a newline. If several keywords can appear mid-line like this, you can keep a list of them, etc.
If the keywords can be anything but are only one word each, you can look for ':' and parse backwards to the preceding space, which can be done with regexps.
If the keywords can be anything and include spaces, it's impossible to do automatically.
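For the single-word-key case, here is a hypothetical regex sketch: match a word followed by a colon, then lazily take everything up to the next `word:` or the end of the line. Note it inherits the limitation above: a multi-word key like "Staff Attorneys" loses its first word.

```python
import re

def parse_pairs(line):
    """Find word-colon keys; each value runs until the next key or line end.
    Only handles single-word keys ("Staff" in "Staff Attorneys" is lost)."""
    return re.findall(r'(\w+):\s*(.*?)(?=\s+\w+:|$)', line)

print(parse_pairs("Staff Attorneys: John Doe Jr. Paralegal: John Doe III"))
# [('Attorneys', 'John Doe Jr.'), ('Paralegal', 'John Doe III')]
```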