R: xml2, extracting datasource names - regex

I just installed the xml2 package and managed to extract the information I am after. The next step is to visualize the extracted information, e.g. with R Shiny. Alas, I fail to do the string parsing correctly.
For example, here are the extracted datasources:
xmlfile <- read_xml("~/Sample.xml")
ds <- xml_find_all(xmlfile, ".//datasource")
listds <- unique(unlist(ds, use.names = FALSE))
The datasources are (in this example) two Excel files. Hence the outcome is a list with the names of the two Excel files and the sheets of the respective files:
"Customers (Sample)" "Orders (Sample - Sales (Excel))"
Note: I cannot say why one data source includes "(Excel)" while the other does not.
Anyway, the desired outcome (= visualisation) would be:
Datasource: Sample Sheet Name: Customers
Datasource: Sample - Sales Sheet Name: Orders
Question: how do I tell R to find the name within the parentheses, i.e. "Sample" or "Sample - Sales", and paste this, and then to find the string outside of the parentheses, i.e. "Customers" or "Orders"?
Thanks a million for any thoughts and advice!

List the ds object; use xml_attr() to get the content.
Also, please post the actual file.
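In the meantime, here is a minimal base-R sketch of the string parsing itself, assuming the extracted strings look exactly like the two examples above (the optional "(Excel)" suffix is simply stripped first):
listds <- c("Customers (Sample)", "Orders (Sample - Sales (Excel))")

clean   <- sub("\\s*\\(Excel\\)", "", listds)            # drop the optional "(Excel)" suffix
sheets  <- sub("\\s*\\(.*$", "", clean)                  # text before the first "("
sources <- sub("\\)$", "", sub("^[^(]*\\(", "", clean))  # text inside the parentheses

paste0("Datasource: ", sources, " Sheet Name: ", sheets)
# [1] "Datasource: Sample Sheet Name: Customers"
# [2] "Datasource: Sample - Sales Sheet Name: Orders"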

Error in QUERY and XLOOKUP formula in dropdown list

Someone in this group created a Google Sheets file for me, and I have entered some names in a list in that sheet. Can you prepare the appropriate code to get the result for a given name? The person who made it cannot be contacted; it seems they are busy. I am just learning Google Sheets coding. I am sharing the Google Sheets link below. Can you write the code and make it ready?
spreadsheet sample
CODE:
=QUERY(ALL!B3:K,"where D like '" &
XLOOKUP(A2,Conditions!A1:A,Conditions!B1:B) & "%' AND" & IF(A3="FEE NOT PAID"," K like '"," H like '") &
XLOOKUP(A3,Conditions!D1:D,Conditions!E1:E) & "%'",0)
DROPDOWN LIST items:
(ALL PENDING LICENSE APPLICATION, ALL ISSUED LICENSE APPLICATION, ALL REJECTED LICENSE APPLICATION, ALL SUSPENDED LICENSE, ALL ISSUED NEW LICENSE NO)
try:
=ARRAY_CONSTRAIN(QUERY({ALL!B3:K, FLATTEN(QUERY(TRANSPOSE(ALL!H3:K),,9^9))}, "where 2=2 "&
IF(A2="",,IF(REGEXMATCH(A2, "ALL"), " and Col3 is not null",
" and Col3 contains '"&VLOOKUP(A2, Conditions!A1:B, 2, )&"'"))&
IF(A3="",," and Col11 contains '"&VLOOKUP(A3, Conditions!D1:E, 2, )&"'")), 9^9, 10)

Spark: returning all regex matches for each dataset row

I have a dataset loaded from a .csv file (imitated by ds here) which contains two columns: one with the publishing date of an article (publishDate), and one with the mentioned names and their character offsets in that article (allNames).
I'm trying to count the number of times a name is mentioned per day, and I thought it would be good to start by removing the character offsets from allNames by mapping a regex operation. Have a look at the code:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import spark.implicits._ // needed for .toDS() when not running in the spark-shell

case class Data(publishDate: String, allNames: String)

val ds = Seq(
  Data("01-01-2018", "Channel One,628;Channel One,755;Channel One,1449;Channel One"),
  Data("01-02-2018", "Waite Park,125;City Food,233;Adobe Flash Player,348;Charter Channel,554")
).toDS()

val pattern = """([^\;\,]+),\d+""".r
val processed_ds = ds.map(data => (data.publishDate, (for (m <- pattern.findAllMatchIn(data.allNames)) yield m.group(1)).toList))
This gives a whole list of errors when I call processed_ds.collect().foreach(println).
What is going wrong here?
NOTE: I am new to Scala.
Edit:
The expected output from processed_ds.collect().foreach(println) would be:
("01-01-2018", List("Channel One", "Channel One", "Channel One", "Channel One"))
("01-02-2018", List("Waite Park", "City Food", "Adobe Flash Player", "Charter Channel"))
Or would this be achieved more easily with a split operation of some sort?
If a regexp is not mandatory, this can be solved with the split function:
// split on ";" into "name,offset" pairs, then keep the part before the ","
val result = ds.map(v => (v.publishDate, v.allNames.split(";").map(p => p.split(",")(0)).toList))
result.collect().foreach(println)
Output:
(01-01-2018,List(Channel One, Channel One, Channel One, Channel One))
(01-02-2018,List(Waite Park, City Food, Adobe Flash Player, Charter Channel))

How to read in a table that depends on two previously defined sets

I am optimizing the choice of letters, given the surface each requires on the laser cutter, to maximize the total frequency of the words they can form. I wrote this program for GLPK:
set unicodes;
param surfaces{u in unicodes};
table data IN "CSV" "surfaces.csv": unicodes <- [u], surfaces~s;
set words;
param frequency{w in words}, integer;
table data IN "CSV" "words.csv": words <- [word], frequency~frequency;
Then I want to read in a table giving, for each word, the count of each character identified by its Unicode code point. The sets words and unicodes are already defined. According to page 42 of the manual, I can omit the set and the delimiter:
table name alias IN driver arg . . . arg : set <- [fld, ..., fld], par~fld, ..., par~fld;
...
set is the name of an optional simple set called control set. It can be omitted along with the
delimiter <-;
So I write this:
param spectrum{w in words, u in unicodes} >= 0;
table data IN "CSV" "spectrum.csv": words~word, unicodes~unicode, spectrum~spectrum;
I get the error:
Reading model section from lp...
lp:19: delimiter <- missing where expected
Context: ..., u in unicodes } >= 0 ; table data IN '...' '...' : words ~
If I write:
table data IN "CSV" "spectrum.csv": [words, unicodes] <- [word, unicode], spectrum~spectrum;
I get the error:
Reading model section from lp...
lp:19: syntax error in table statement
Context: ...} >= 0 ; table data IN '...' '...' : [ words , unicodes ] <-
How can I read in a table with data on two sets already defined?
Note: the CSV files are similar to these:
surfaces.csv:
u,s
41,1
42,1.5
43,1.2
words.csv:
word,frequency
abc,10
spectrum.csv:
word,unicode,spectrum
abc,41,1
abc,42,2
abc,43,3
I found the answer with AMPL, A Mathematical Programming Language, which is a superset of GNU MathProg. I needed to define a set with the links between words and unicodes, and use that set as the control set when reading the table:
set links within {words, unicodes};
param spectrum{links} >= 0;
table data IN "CSV" "spectrum.csv": links <- [word, unicode], spectrum~spectrum;
And now I get:
...
INTEGER OPTIMAL SOLUTION FOUND
Time used: 0.0 secs
Memory used: 0.1 Mb (156430 bytes)
The "optional set" in the documentation is still misleading and I filed a bug report. For reference, the AMPL book is free to download and I used the transportation model scattered in page 47 in Section 3.2, page 173 in section 10.1, and page 179 in section 10.2.

R: subset a corpus by metadata (id) matching partial strings

I'm using the R (3.2.3) tm-package (0.6-2) and would like to subset my corpus according to partial string matches contained with the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context. However, I don't manage to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).
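If you would rather stay close to your tm_filter() attempt, a minimal sketch (assuming each document carries a single "id" metadatum) could look like this:
library(tm)
# keep only documents whose "id" metadatum contains the literal string "US"
US <- tm_filter(corpus, FUN = function(doc) grepl("US", meta(doc, "id"), fixed = TRUE))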

How to use regular expressions properly on SQL files?

I have a lot of undocumented and uncommented SQL queries and would like to extract some information from within the SQL statements. In particular, I'm interested in DB names, table names, and if possible column names. The queries usually have the following syntax:
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I thought it might be possible to first extract the parts that begin after FROM, JOIN, and LEFT JOIN. This is usually db.table; single letters such as o, t, and s are aliases for already-referenced tables, and I suppose they are difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
assuming that each statement ends with WHERE/where, ORDER/order, or GROUP... But that doesn't work out as expected.
You haven't indicated which database system you are using, but virtually all such systems have introspection facilities that let you get this information far more easily and reliably than attempting to parse SQL statements. The following code assumes SQLite; it can likely be adapted to your situation by getting a list of your databases and then looping over them, using dbConnect to connect to each one in turn and running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
               dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris
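For the multi-database case mentioned above, a hedged sketch of the loop (the dbfiles paths are hypothetical, assuming each of your databases is an SQLite file):
dbfiles <- c("db1.sqlite", "db2.sqlite")  # hypothetical paths to your databases
allinfo <- Map(function(f) {
  con <- dbConnect(SQLite(), f)
  on.exit(dbDisconnect(con))
  # table name -> column names, as before
  Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
      dbListTables(con))
}, dbfiles)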
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts the substrings with FROM or JOIN followed by text up to the next space
# stri_replace_all_regex replaces FROM or JOIN followed by a space with the empty string
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"