The purpose is to extract the title and tags from a webpage.
I'm using IMPORTDATA and I want the results all in one row, like this:
[webpage] [title] [1st tag] [2nd tag] [3rd tag] [4th tag] ... [last tag]
I'm stuck halfway through the process in Google Sheets.
First tab ("Extracted"): I've extracted the necessary lines from the raw page data:
=query({array_constrain(IMPORTDATA(A1),6375,10)},"WHERE (Col1 CONTAINS 'btn btn-secondary' AND Col1 CONTAINS 'href') or (Col1 CONTAINS 'meta property' AND Col1 CONTAINS 'og:title')")
Second tab (with REGEXEXTRACT): this extracts the text I need, but it only works for the first line (and only for the tags; the title is still missing because it spreads across a few columns...):
=REGEXEXTRACT(query({array_constrain(IMPORTDATA(A1),6375,10)},"WHERE (Col1 CONTAINS 'btn btn-secondary' AND Col1 CONTAINS 'href')"),"\>(.+)\<")
I don't know how to go further :( Any help is appreciated!
=ARRAYFORMULA({REGEXREPLACE(TEXTJOIN(", ",1,
QUERY(ARRAY_CONSTRAIN(SUBSTITUTE(IMPORTDATA(A2),"""",""),1000,15),
"where Col1 contains '<meta property=og:title content='")),
"<meta property=og:title content=| />",""),
TRANSPOSE(REGEXEXTRACT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
ARRAY_CONSTRAIN(SUBSTITUTE(IMPORTDATA(A2),"""",""),8000,3)),,50000)),
"where Col1 contains '<a class=btn btn-secondary'"),"\>(.*)+\<"))})
demo spreadsheet
I have a 3 row by 2 column table
1Q18 hello. testing row one.
2Q18 There are about 7.5b people. That's alot.
3Q18 Last sentence. To be stacking.
I want to split each sentence and keep the quarter label with it; the output would be:
1Q18 hello
1Q18 testing row one
2Q18 There are about 7.5b people
2Q18 That's alot
3Q18 Last sentence
3Q18 To be stacking
I can get one line to work with:
=TRANSPOSE({split(rept(A1&" ",counta(split(B1,".")))," ");split(B1,".")})
which would give me:
1Q18 hello
1Q18 testing row one
I need a formula that works down 100 rows, so I can't manually repeat the formula for each row and stack the pieces with {} and ;.
I've also tried using MAP:
=map(A1:A,B2:B,LAMBDA(x,y,TRANSPOSE({split(rept(x&" ",counta(split(y,".")))," ");split(y,".")})))
but get an error: "Result should be a single column."
try:
=INDEX(QUERY(SPLIT(FLATTEN(LAMBDA(x, IF(x="",,A1:A&""&x))
(SPLIT(B1:B&" ", ". ", ))), ""), "where Col2 is not null", ))
Try the formula below:
=QUERY(REDUCE(,B1:B3,LAMBDA(a,x,{a;TRANSPOSE(INDEX(INDEX(A1:A,ROW(x)) & " " & SPLIT(SUBSTITUTE(x,". ",".|"),"|")))})),"offset 1",0)
Here's another formula you can try:
=ARRAYFORMULA(
QUERY(
REDUCE({0,0},
QUERY(A1:A&"❄️"&SPLIT(B1:B,". ",),
"where Col1 <> '#VALUE!'"),
LAMBDA(a,c,
{a;SPLIT(c,"❄️",,)})),
"where Col2 is not null offset 1",))
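For reference, the same split-and-label transformation is easy to sketch outside of Sheets, e.g. in Python with pandas (the column names "A" and "B" below are stand-ins for the two sheet columns):

```python
import pandas as pd

# The question's table: quarter labels in column A, sentences in column B
df = pd.DataFrame({
    "A": ["1Q18", "2Q18", "3Q18"],
    "B": ["hello. testing row one.",
          "There are about 7.5b people. That's alot.",
          "Last sentence. To be stacking."],
})

# Split each B cell on the literal ". " (so "7.5b" survives intact),
# explode to one sentence per row, then strip the trailing period
out = (df.assign(B=df["B"].str.split(". ", regex=False))
         .explode("B")
         .assign(B=lambda d: d["B"].str.rstrip("."))
         .reset_index(drop=True))
```

Like the REDUCE answers above, this repeats the quarter label once per sentence of its row.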
I have the following dataframe:
Column1 Column2
0 .com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .comFinance
1 .com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .comFinanceDO
2 <br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> FinanceISVDODO Prem
3 <br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
4 <br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> ConsultingTTY
I used the following line of code to get Column2:
df['Column2'] = df['Column1'].str.replace('<br>', '', regex=True)
I want to remove all instances of <br>, so that the column looks like this:
Column2
.com, Finance
.com, Finance, DO
Finance, ISV, DO, DO Prem
Finance
Consulting, TTY
Given the following dataframe:
Column1
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br>
df['Column2'] = df['Column1'].str.replace('<br>', ' ', regex=True).str.strip().replace('\\s+', ', ', regex=True) doesn't work because of sections like <br>DO Prem<br>, which end up like DO, Prem instead of DO Prem.
Split on <br> to make a list, then use a list comprehension to drop the empty strings.
This preserves the spaces that belong inside the values.
Finally, join the list values back into a string with ', '.join([...]).
import pandas as pd
df['Column2'] = df['Column1'].str.split('<br>').apply(lambda x: (', ').join([y for y in x if y != '']))
# output
Column1 Column2
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .com, Finance
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .com, Finance, DO
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> Finance, ISV, DO, DO Prem
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> Finance, TTY
### Replace br with space
df['Column 2'] = df['Column1'].str.replace('<br>', ' ')
### Get rid of spaces before and after the string
df['Column 2'] = df['Column 2'].str.strip()
### Replace the runs of spaces with ", "
df['Column 2'] = df['Column 2'].str.replace('\\s+', ', ', regex=True)
As pointed out by TrentonMcKinney, his solution is better: this one doesn't handle the case where there is a space inside a string value in Column1.
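A variant that sidesteps the space problem entirely is to split on runs of <br> tags with a regex instead of replacing them with spaces (a sketch using two of the rows from the question):

```python
import pandas as pd

# Two of the rows from the question
df = pd.DataFrame({"Column1": [
    ".com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>",
    "<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>",
]})

# Split on one-or-more consecutive <br> tags (never on spaces), then drop
# the empty leading/trailing pieces before joining with ", "
df["Column2"] = (df["Column1"]
                 .str.split(r"(?:<br>)+", regex=True)
                 .apply(lambda parts: ", ".join(p for p in parts if p)))
```

Because the split pattern only matches <br> runs, values like "DO Prem" keep their internal space.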
I have multiple sheets in one Excel file (Sheet1, Sheet2, Sheet3, etc.). I have to list one particular column from all the sheets in a single CSV file. Every sheet has a common column, "Attribute", and only those values should be written to the CSV, line by line (the first sheet's Attribute values on line 1, the second sheet's on line 2, and so on).
For instance:
Sheet1:
Attribute,Order
P,1
Emp_ID,2
DOJ,3
Name,4
Sheet2:
Attribute,Order
C,1
Emp_ID,2
Exp,3
LWD,4
Expected result: (In some .csv file)
P,Emp_ID,DOJ,Name
C,Emp_ID,Exp,LWD
Note: the line starting with P should come first, the line starting with C second, and so on.
Below is my code:
import pandas as pd
excel = 'E:\Python Utility\Inbound.xlsx'
K = 'E:\Python Utility\Headers_Files\All_Header.csv'
df = pd.read_excel(excel,sheet_name = None)
data = pd.DataFrame(df,columns=['Attribute']).T
print(data)
M = data.to_csv(K, encoding='utf-8',index=False,header=False)
print('done')
The output shows as below:
Empty DataFrame Columns: [] Index: [Attribute] done
If I use sheet_name = 'sheet1' then DataFrame works good and data loaded as expected in csv file.
Thanks in advance
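One way to get the expected result (a sketch, not tested against the real workbook): with sheet_name=None, read_excel returns a dict of DataFrames keyed by sheet name, so iterate over it and write each sheet's Attribute column as one comma-joined line. The helper name attribute_lines and the stand-in frames below are illustrative.

```python
import pandas as pd

def attribute_lines(sheets):
    """One CSV line per sheet: that sheet's 'Attribute' values joined by commas.

    `sheets` is the dict returned by pd.read_excel(path, sheet_name=None).
    """
    return [",".join(frame["Attribute"].astype(str)) for frame in sheets.values()]

# Stand-ins for the two sheets in the question; with the real workbook use
#   sheets = pd.read_excel(r'E:\Python Utility\Inbound.xlsx', sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({"Attribute": ["P", "Emp_ID", "DOJ", "Name"],
                            "Order": [1, 2, 3, 4]}),
    "Sheet2": pd.DataFrame({"Attribute": ["C", "Emp_ID", "Exp", "LWD"],
                            "Order": [1, 2, 3, 4]}),
}

with open("All_Header.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(attribute_lines(sheets)) + "\n")
```

The original code fails because pd.DataFrame(df, columns=['Attribute']) is handed a dict of DataFrames, not tabular data, which yields the empty frame you saw.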
I just installed the xml2 package, and I managed to extract the information I'm after. The next step is to "visualize" the extracted information, e.g. with RShiny. Alas, I'm failing to do the string parsing correctly...
For example: the extracted datasources
xmlfile <- read_xml("~/Sample.xml")
ds <- xml_find_all(xmlfile , ".//datasource")
listds <- unique(unlist(ds, use.names = FALSE))
The datasources are (in this example) two Excel files. Hence the outcome is a list with the names of the two Excel files and their respective sheets:
"Customers (Sample)" "Orders (Sample - Sales (Excel))"
Note: I cannot say why one data source includes "(Excel)" while the other does not.
Anyway, the desired outcome (i.e. the visualization) would be:
Datasource: Sample Sheet Name: Customer
Datasource: Sample - Sales Sheet Name: Orders
Question: how do I tell R to find the name within the parentheses (i.e. "Sample" or "Sample - Sales") and paste it, and then to find the string outside the parentheses (i.e. "Customers" or "Orders")?
Thanks a million for any thoughts and advice!
List the ds object and use xml_attr to get the content.
Also, please post the actual file.
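The parsing rule itself (sheet name outside the parentheses, datasource name inside, with an optional trailing "(Excel)" qualifier) can be sketched with a regex. It is shown here in Python for brevity; the same pattern should work with R's sub()/regmatches():

```python
import re

# Captions mirroring the question's extracted datasources
captions = ["Customers (Sample)", "Orders (Sample - Sales (Excel))"]

parsed = []
for cap in captions:
    # Lazy group 1: sheet name before the parentheses;
    # greedy group 2: everything inside the outermost parentheses
    m = re.match(r"^(.*?)\s*\((.*)\)$", cap)
    sheet, source = m.group(1), m.group(2)
    # Drop the optional trailing "(Excel)" qualifier some captions carry
    source = re.sub(r"\s*\(Excel\)$", "", source)
    parsed.append((source, sheet))
    print(f"Datasource: {source}  Sheet Name: {sheet}")
```

The greedy inner group plus the anchored final ")" is what lets the nested "(Excel)" survive into group 2, where it can then be stripped separately.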
I have a lot of undocumented and uncommented SQL queries. I would like to extract some information from the SQL statements; in particular, I'm interested in DB names, table names and, if possible, column names. The queries usually have the following syntax:
SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'
Usually the statements involve several DBs and tables. I would like to extract only the DBs and tables, without any other information. I thought it might be possible to first extract the text that follows FROM, JOIN, or LEFT JOIN; here it's usually db.table. Letters such as o, t, and s already correspond to aliased tables, and I suppose those are difficult to capture. What I tried, without any success, is something like:
gsub(".*FROM \\s*|WHERE|ORDER|GROUP.*", "", vec)
This assumes that each statement ends with WHERE/where, ORDER/order, or GROUP..., but it doesn't work as expected.
You haven't indicated which database system you are using, but virtually all such systems have introspection facilities that would let you get this information far more easily and reliably than by parsing SQL statements. The following code assumes SQLite; it can likely be adapted to your situation by getting a list of your databases, looping over them, and using dbConnect to connect to each one in turn while running code such as this:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite()) # use in memory database for testing
# create two tables for purposes of this test
dbWriteTable(con, "BOD", BOD, row.names = FALSE)
dbWriteTable(con, "iris", iris, row.names = FALSE)
# get all table names and columns
tabinfo <- Map(function(tab) names(fn$dbGetQuery(con, "select * from $tab limit 0")),
dbListTables(con))
dbDisconnect(con)
giving an R list whose names are the table names and whose entries are the column names:
> tabinfo
$BOD
[1] "Time" "demand"
$iris
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
or perhaps long form output is preferred:
setNames(stack(tabinfo), c("column", "table"))
giving:
column table
1 Time BOD
2 demand BOD
3 Sepal.Length iris
4 Sepal.Width iris
5 Petal.Length iris
6 Petal.Width iris
7 Species iris
You could use the stringi package for this.
library(stringi)
# Your string vector
myString <- "SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"
# Three stringi functions are used:
# stri_extract_all_regex extracts the substrings where FROM or JOIN is followed by text up to the next space
# stri_replace_all_regex then strips the leading "FROM " or "JOIN "
# stri_unique keeps only the unique strings
t <- stri_unique(stri_replace_all_regex(stri_extract_all_regex(myString, "((FROM|JOIN) [^\\s]+)", simplify = TRUE),
"(FROM|JOIN) ", ""))
> t
[1] "mydb.table1" "mydb.sometable" "otherdb.sometable"
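For comparison, the same FROM/JOIN extraction can be sketched in Python with the standard re module:

```python
import re

sql = """SELECT *
FROM mydb.table1 m
LEFT JOIN mydb.sometable o ON m.id = o.id
LEFT JOIN mydb.sometable t ON p.id=t.id
LEFT JOIN otherdb.sometable s ON s.column='test'"""

# Capture the whitespace-delimited token that follows FROM or JOIN,
# then deduplicate and sort
tables = sorted(set(re.findall(r"(?:FROM|JOIN)\s+(\S+)", sql)))
```

Like the stringi version, this only handles the simple `FROM db.table` / `JOIN db.table` shape; subqueries or quoted identifiers would need a real SQL parser.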