This is the data in Google Sheets:
Account Number Names
7728550,543216 Govt Req
772855,65432 Vodafone
I am trying to do a lookup of the account numbers with the formula
=QUERY(Sheet1!B$3:C$4,"Select C where B matches '^.*(" & B2 & ").*$' limit 1")
For 772855 this returns Govt Req instead of Vodafone, because the pattern also matches 7728550 in the first row.
How do I solve this? There is a large amount of data, so I can't split the values into separate rows.
use:
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B,
SPLIT(FLATTEN(SPLIT(Sheet1!F2:F, ",")&"×"&Sheet1!G2:G), "×"), 2, )))
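To make the idea behind the formula easier to follow, here is a minimal sketch in Python (not Sheets), assuming the two sample rows from the question. It splits each comma-separated cell into individual account numbers, pairs every number with its name, and then does an exact lookup. The names `rows` and `build_lookup` are made up for illustration.

```python
# Sample rows mirroring the question: (comma-separated account numbers, name)
rows = [
    ("7728550,543216", "Govt Req"),
    ("772855,65432", "Vodafone"),
]


def build_lookup(rows):
    """Map every individual account number to its name."""
    lookup = {}
    for accounts, name in rows:
        for account in accounts.split(","):
            # setdefault keeps the first name seen for a number,
            # similar to VLOOKUP returning the first match
            lookup.setdefault(account.strip(), name)
    return lookup


lookup = build_lookup(rows)

# Exact match only; the "" default plays the role of IFNA
print(lookup.get("772855", ""))
print(lookup.get("7728550", ""))
```

Because the lookup is on whole account numbers rather than substrings, 772855 no longer collides with 7728550.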
I am trying to make a search function that displays the entire row from matching entries.
This is what currently happens using: =IFERROR(IF(B3="","No Results",ARRAYFORMULA(FILTER(DOCS!C2:C, SEARCH(B3, DOCS!C2:C)))),"No Results")
And this is the column I am trying to show
Example
Data:
Fortnite,Video Game,Epic Games
PUBG,Video Game,IDK
Steam,Service,Valve
Amazon,Service,Amazon
Cats,Species,Animal Kingdom
Search Column B for "Service"
(MY CURRENT RESULTS):
Service
Service
(MY INTENDED RESULTS):
Steam,Service,Valve
Amazon,Service,Amazon
=IF(D1<>"", IFERROR(FILTER(A:A, REGEXMATCH(LOWER(A:A), LOWER(D1))), "No Results"), )
=FILTER(DOCS!A2:G, REGEXMATCH(LOWER(DOCS!C2:C), LOWER(J1)), DOCS!E2:E="active")
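The key point is that the first argument of FILTER decides what is returned, while the condition only decides which rows are kept. A rough Python sketch of the same logic, using the sample data from the question (the function name `search_rows` is made up, and unlike REGEXMATCH it treats the search term as literal text):

```python
import re

# Sample rows from the question: (name, category, publisher)
rows = [
    ("Fortnite", "Video Game", "Epic Games"),
    ("PUBG", "Video Game", "IDK"),
    ("Steam", "Service", "Valve"),
    ("Amazon", "Service", "Amazon"),
    ("Cats", "Species", "Animal Kingdom"),
]


def search_rows(rows, term, column=1):
    """Return every whole row whose chosen column contains term, ignoring case."""
    if not term:
        return []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # Test only one column, but keep the entire row
    return [row for row in rows if pattern.search(row[column])]


matches = search_rows(rows, "service")
for row in matches:
    print(",".join(row))
```

Testing column B while returning the full tuple corresponds to passing the whole range (A2:G) to FILTER and only the searched column to REGEXMATCH.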
I just installed the package xml2 and managed to extract the information I was after. The next step is to 'visualize' the extracted information, e.g. with Shiny. Alas, I fail to do the "string parsing" correctly ...
For example: the extracted datasources
xmlfile <- read_xml("~/Sample.xml")
ds <- xml_find_all(xmlfile , ".//datasource")
listds <- unique(unlist(ds, use.names = FALSE))
The data sources are (in this example) two Excel files. Hence the outcome is a list with the names of the two Excel files and the sheets of the respective Excel files:
"Customers (Sample)" "Orders (Sample - Sales (Excel))"
Note: I cannot say why one data source includes "(Excel)" while the other does not.
Anyway, the desired outcome (= visualisation) would be
Datasource: Sample Sheet Name: Customers
Datasource: Sample - Sales Sheet Name: Orders
Question: how do I tell R to find the name within the outer (), i.e. "Sample" or "Sample - Sales", and paste it, and then to find the string outside of the (), i.e. "Customers" or "Orders"?
Thanks a million for any thoughts and advice!
List the ds object and use xml_attr to get the content.
Also, please post the actual file.
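Without the actual file, here is only a sketch of the string parsing itself, written in Python rather than R and assuming the two strings shown in the question. The regular expressions should carry over to R's sub()/regmatches() with perl = TRUE, but that is untested. The name `parse_datasource` is made up for illustration.

```python
import re

# The two strings from the question
names = ["Customers (Sample)", "Orders (Sample - Sales (Excel))"]

# Text before the first "(" is the sheet; everything inside the outer () is the source
OUTER = re.compile(r"^([^(]+?)\s*\((.*)\)$")


def parse_datasource(name):
    """Split 'Sheet (Source)' into (source, sheet); return None if it does not fit."""
    m = OUTER.match(name)
    if m is None:
        return None
    sheet, source = m.group(1), m.group(2)
    # Drop one trailing nested parenthetical such as " (Excel)"
    source = re.sub(r"\s*\([^()]*\)$", "", source)
    return source, sheet


parsed = [parse_datasource(n) for n in names]
for source, sheet in parsed:
    print(f"Datasource: {source} Sheet Name: {sheet}")
```

This assumes at most one level of nesting at the end of the source name; deeper nesting would need a different approach.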
As a relative novice in R and programming, my first ever question in this forum is about regex pattern matching, specifically line breaks. First some background. I am trying to perform some preprocessing on a corpus of texts using R before processing them further on the NLP platform GATE. I convert the original pdf files to text as follows (the text files, unfortunately, go into the same folder):
dest <- "./MyFolderWithPDFfiles"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Program Files (x86)/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE))
Then, having loaded the tm package and physically(!) moved the text files to another folder, I create a corpus:
TextFiles <- "./MyFolderWithTXTfiles"
EU <- Corpus(DirSource(TextFiles))
I then want to perform a series of custom transformations to clean the texts. I succeeded in replacing a simple string as follows:
ReplaceText <- content_transformer(function(x, from, to) gsub(from, to, x, perl=T))
EU2 <- tm_map(EU, ReplaceText, "Table of contents", "TOC")
However, a pattern that is a 1-3 digit page number followed by two line breaks and a page break is causing me problems. I want to replace it with a blank space:
EU2 <- tm_map(EU, ReplaceText, "[0-9]{1,3}\n\n\f", " ")
The ([0-9]{1,3}) and \f alone match. The line breaks don't. If I copy text from one of the original .txt files into the RegExr online tool and test the expression "[0-9]{1,3}\n\n\f", it matches. So the line breaks do exist in the original .txt file.
But when I view one of the .txt files as read into the EU corpus in R, there appear to be no line breaks even though the lines are obviously breaking before the margin, e.g.
[3] "PROGRESS TOWARDS ACCESSION"
[4] "1"
[5] ""
[6] "\fTable of contents"
Seeing this, I tried other patterns, e.g. one that allows any amount of whitespace ("[0-9]{1,3}\\s*\f"), but no patterns worked.
So my questions are:
Am I converting and reading the files into R correctly? If so, what has happened to the line breaks?
If no line breaks is normal, how can I pattern match the character on line 5? Is that not a blank space?
(A tangential concern:) When converting the pdf files, is there code that will put them directly in a new folder?
Apologies for extending this, but how can one print or inspect only a few lines of the text object? The tm commands and head(EU) print the entire object, each a very long text.
I know my problem(s) must appear simple and perhaps stupid, but one has to start somewhere and extensive searching has not revealed a source that explains comprehensively how to use RegExes to modify text objects in R. I am so frustrated and hope someone here will take pity and can help me.
Thanks for any advice you can offer.
Brigitte
p.s. I think it's not possible to upload attachments in this forum, therefore, here is a link to one of the original PDF documents: http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf
Because the doc is long, I created a snippet of the first 3 pages of the TXT doc, read it into the R corpus ('EU') and printed it to the console and this is it:
dput(EU[[2]])
structure(list(content = c("REGULAR REPORT", "FROM THE COMMISSION ON",
"CZECH REPUBLIC'S", "PROGRESS TOWARDS ACCESSION ***********************",
"1", "", "\fTable of contents", "A. Introduction", "a) Preface The Context of the Progress Report",
"b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations",
"B. Criteria for membership", "1. Political criteria", "1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures",
"1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities",
"1.3. General evaluation", "2. Economic criteria", "2.1. Introduction 2.2. Economic developments since the Commission published its Opinion",
"Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation",
"3. Ability to assume the obligations of Membership", "3.1. Internal Market without frontiers General framework The Four Freedoms Competition",
"3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual",
"3.3. Economic and Fiscal Affairs Economic and Monetary Union",
"2", "", "\fTaxation Statistics "), meta = structure(list(author = character(0),
datetimestamp = structure(list(sec = 50.1142621040344, min = 33L,
hour = 15L, mday = 3L, mon = 10L, year = 114L, wday = 1L,
yday = 306L, isdst = 0L), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXlt",
"POSIXt"), tzone = "GMT"), description = character(0), heading = character(0),
id = "CZ1998ProgressSnippet.txt", language = "en", origin = character(0)), .Names = c("author",
"datetimestamp", "description", "heading", "id", "language",
"origin"), class = "TextDocumentMeta")), .Names = c("content",
"meta"), class = c("PlainTextDocument", "TextDocument"))
Yes, working with text in R is not always a smooth experience! But you can get a lot done quickly with some effort (maybe too much effort!)
If you could share one of your PDF files or the output of dput(EU), that might help to identify exactly how to capture your page numbers with regex. That would also add a reproducible example to your question, which is an important thing to have in questions here so that people can test their answers and make sure they work for your specific problem.
There is no need to put the PDF and text files in separate folders. Instead, you can use a pattern like so:
EU <- Corpus(DirSource(pattern = "\\.txt$"))
This reads only the text files and ignores the PDF files.
There is no 'snippet view' method in tm, which is annoying. I often use just names(EU) and EU[[1]] for quick looks
UPDATE
With the data you've just added, I'd suggest a slightly tangential approach. Do the regex work before passing the data to the tm package formats, like so:
# get the PDF
download.file("http://ec.europa.eu/enlargement/archives/pdf/key_documents/1998/czech_en.pdf", "my_pdf.pdf", method = "wget")
# get the file name of the PDF
myfiles <- list.files(path = getwd(), pattern = "pdf", full.names = TRUE)
# convert to text (note: my pdftotext is in a different location to yours);
# wait = TRUE so the .txt file exists before it is read below
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = TRUE))
# read plain text into R
x1 <- readLines("my_pdf.txt")
# make into a single string
x2 <- paste(x1, collapse = " ")
# do some regex...
x3 <- gsub("Table of contents", "TOC", x2)
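# collapsing with " " leaves two spaces between the page number and \f (one for the blank line), so allow one or more
x4 <- gsub("[0-9]{1,3} +\f", "", x3)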
# convert to corpus for text mining operations
x5 <- Corpus(VectorSource(x4))
With the snippet of data you provided using dput, the output from this method is
inspect(x5)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
REGULAR REPORT FROM THE COMMISSION ON CZECH REPUBLIC'S PROGRESS TOWARDS ACCESSION *********************** TOC A. Introduction a) Preface The Context of the Progress Report b) Relations between the European Union and the Czech Republic The enhanced Pre-Accession Strategy Recent developments in bilateral relations B. Criteria for membership 1. Political criteria 1.1. Democracy and the Rule of Law Parliament The Executive The judicial system Anti-Corruption measures 1.2. Human Rights and the Protection of Minorities Civil and Political Rights Economic, Social and Cultural Rights Minority Rights and the Protection of Minorities 1.3. General evaluation 2. Economic criteria 2.1. Introduction 2.2. Economic developments since the Commission published its Opinion Macroeconomic developments Structural reforms 2.3. Assessment in terms of the Copenhagen criteria The existence of a functioning market economy The capacity to cope with competitive pressure and market forces 2.4. General evaluation 3. Ability to assume the obligations of Membership 3.1. Internal Market without frontiers General framework The Four Freedoms Competition 3.2. Innovation Information Society Education, Training and Youth Research and Technological Development Telecommunications Audio-visual 3.3. Economic and Fiscal Affairs Economic and Monetary Union Taxation Statistics
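As for where the line breaks went: the text is read line by line, so each line becomes its own element and the newline characters are consumed as separators. That is why a pattern containing \n\n can never match inside a single element. Here is a small sketch of that effect in Python (not R), assuming a few lines taken from the dput output above:

```python
import re

# A few elements from the dput output in the question
lines = ["PROGRESS TOWARDS ACCESSION", "1", "", "\fTable of contents", "A. Introduction"]

# Matching line by line: no element contains "\n\n\f", so nothing is found
per_line_hits = [ln for ln in lines if re.search(r"[0-9]{1,3}\n\n\f", ln)]

# Collapse to one string first, then the page number, the gap and the
# form feed sit next to each other and can be removed together
joined = " ".join(lines)
cleaned = re.sub(r"[0-9]{1,3}\s*\f", "", joined)
cleaned = cleaned.replace("Table of contents", "TOC")

print(per_line_hits)
print(repr(cleaned))
```

Note that after collapsing, the empty line turns into an extra space, so the pattern has to allow more than one whitespace character before the form feed.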
I want to make a chart of city council members in my city over time. I envision this as something like a line chart. The x axis would be years. There are nine city council seats, so there would be nine straight lines, and each would show who was the city council member over time (perhaps through different colored line segments or by showing their names on mouseover). Perhaps this is more like a timeline.
When I graph the city's budget, since both the years and city budget are type "number," this classic line graph works out nicely.
For this new graph, I am passing all of the data as type "string" since they are people's names, and the Google Charts API is giving the error: "Data column(s) for axis #0 cannot be of type string"
How can I make this chart? (I not only want to graph numeric data like budget surplus or deficit or number of robberies, but also relate [in another chart] who was in charge at that time.)
In PHP, I queried my MySQL database and produced a JSON object in the format the Google Charts API expects. Each seat gets a constant numeric value, which draws a horizontal line over time, and the person's name is passed as the formatted value so it shows on hover, like this:
$conn = mysql_connect("x","y","z");
mysql_select_db("a",$conn);
$sql = "SELECT year,d_mayor,d_council1,d_council2,d_council3
FROM metrics WHERE year
IN ('1998','1999','2000','2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012')";
$sth = mysql_query($sql, $conn) or die(mysql_error());
//start the json data in the format the Google Charts js/API expects to receive it
$JSONdata = "{
\"cols\": [
{\"label\":\"Year\",\"type\":\"string\"},
{\"label\":\"City Council 1\",\"type\":\"number\"},
{\"label\":\"City Council 2\",\"type\":\"number\"},
{\"label\":\"City Council 3\",\"type\":\"number\"},
{\"label\":\"Mayor\",\"type\":\"number\"}
],
\"rows\": [";
//loop through the db query result set and put each row into the chart cell values
$rows = array();
while($r = mysql_fetch_assoc($sth)) {
$rows[] = "{\"c\":[{\"v\": " . $r['year'] . "}, {\"v\": 1, \"f\": \"" . $r['d_council1'] . "\"}, {\"v\": 2, \"f\": \"" . $r['d_council2'] . "\"}, {\"v\": 3, \"f\": \"" . $r['d_council3'] . "\"}, {\"v\": 10, \"f\": \"" . $r['d_mayor'] . "\"}]}";
}
//join the rows with commas (no trailing comma, which strict JSON parsers reject) and end the json object literal
$JSONdata .= implode(",", $rows) . "]}";
echo $JSONdata;
mysql_close($conn);
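For anyone who would rather not assemble the JSON by hand, here is a sketch of the same structure in Python (not PHP), letting a JSON serializer handle quoting and commas. The rows in `db_rows` are hypothetical placeholders standing in for the query result, `build_datatable` is a made-up name, and passing the year as a string to match the "string" column type is my assumption.

```python
import json

# Hypothetical placeholder rows standing in for the database query result
db_rows = [
    {"year": "1998", "d_mayor": "Mayor A", "d_council1": "Member A",
     "d_council2": "Member B", "d_council3": "Member C"},
    {"year": "1999", "d_mayor": "Mayor A", "d_council1": "Member D",
     "d_council2": "Member B", "d_council3": "Member C"},
]


def build_datatable(db_rows):
    """Build a DataTable-style dict: a constant v per seat, the name in f."""
    cols = [
        {"label": "Year", "type": "string"},
        {"label": "City Council 1", "type": "number"},
        {"label": "City Council 2", "type": "number"},
        {"label": "City Council 3", "type": "number"},
        {"label": "Mayor", "type": "number"},
    ]
    # (database column, constant y value that draws the horizontal line)
    seats = [("d_council1", 1), ("d_council2", 2), ("d_council3", 3), ("d_mayor", 10)]
    rows = []
    for r in db_rows:
        cells = [{"v": str(r["year"])}]
        cells += [{"v": level, "f": r[key]} for key, level in seats]
        rows.append({"c": cells})
    return {"cols": cols, "rows": rows}


payload = json.dumps(build_datatable(db_rows))
print(payload)
```

The same idea applies in PHP: build a nested array and serialize it in one step, which also takes care of escaping quotes inside names.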