Reading and compiling .tex files into R-Markdown

Say I have the following .tex file:
\begin{table}[!htbp]
\centering
\setlength{\tabcolsep}{10pt}
\renewcommand{\arraystretch}{1.5}
\resizebox{\textwidth}{!}{\begin{tabular}{|l|c|c|}
\hline
\textbf{Country} & \textbf{Mali} & \textbf{Niger} \\
\hline
Regional Autonomy? & Yes & No \\
\hline
Population (1999, millions) & 10.6 & 10.9 \\
\hline
Tuareg \% of Population (2001) & 10 & 9.3 \\
\hline
GDP (1999, billion USD) & 3.4 & 2.0 \\
\hline
Ethnic Fractionalization (1999) & 0.8 & 0.6 \\
\hline
Area (million sq. km.) & 1.2 & 1.3 \\
\hline
Former French Colony? & Yes & Yes \\
\hline
Political System & Unitary semi-presidential republic & Unitary
semi-presidential republic \\
\hline
\end{tabular}}
\caption{Country Characteristics Around Mali's Decentralization}
\end{table}
I want to read this file into an R Markdown file so that, when I compile the .Rmd file, the table from the .tex file is rendered within the document.

In R Markdown, the raw_tex extension is enabled by default, so you can use the \input command to include the table, just as you would in a .tex file:
\input{table.tex}

Related

pattern to extract linkedin username from text

I am trying to extract linkedin url that is written in this format,
text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT\n"
pattern = \/?in\/.+\/?\s+
I need to extract in/sambhu-patra-49b4759/ from any noisy text like the one above.
It's a LinkedIn URL written in short form.
My pattern is not working.

You can use
import re
m = re.search(r'\bin\s*/\s*(\S+)', text)
if m:
    print(m.group(1))
See the regex demo.
Details:
\b - word boundary
in - the literal text in
\s* - zero or more whitespaces
/ - a / char
\s* - zero or more whitespaces
(\S+) - Capturing group 1: one or more non-whitespace characters.
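
One subtlety is worth checking: in the sample text, the prefix part of the pattern matches the stray "in /", and group 1 then captures the token that follows, which is why the result keeps its in/ prefix. A minimal sketch of that behaviour follows; the second input string is a made-up, cleaner line used only for comparison, and the variable names are illustrative.

```python
import re

PATTERN = re.compile(r'\bin\s*/\s*(\S+)')

# Shortened version of the noisy text from the question.
noisy = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate"
# Hypothetical cleaner input (an assumption, not taken from the question).
clean = "profile: in/sambhu-patra-49b4759/ 2020"

noisy_match = PATTERN.search(noisy)
clean_match = PATTERN.search(clean)

print(noisy_match.group(0))  # whole match, including the stray "in /"
print(noisy_match.group(1))  # captured token, still prefixed with "in/"
print(clean_match.group(1))  # captured token without the "in/" prefix
```

If the input may or may not contain the stray "in /", the captured value should be normalized before use, since group 1 differs between the two cases.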

Another option is to match word characters, optionally followed by repeated groups of a - and more word characters, with an optional / at the end:
(?<!\S)in/\w+(?:-\w+)*/?
The pattern matches:
(?<!\S) Assert a whitespace boundary to the left
in/ Match literally
\w+(?:-\w+)* Match 1+ word chars, optionally followed by repetitions of - and 1+ word chars
/? Match optional /
Regex demo
import re
s = ("patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT")
m = re.search(r"(?<!\S)in/\w+(?:-\w+)*/?", s)
if m:
    print(m.group())
Output
in/sambhu-patra-49b4759/

How about just:
text.split(" ")[5]

This can be done without using any regex:
>>> text = "patra 12 EXPERIENCE in / in/sambhu-patra-49b4759/ 2020 - Now O Skin Curate Research Pvt Ltd Embedded System Developer, WB 0 /bindasssambhul O SKILLS LANGUAGES Arduino English Raspberry Pi Movidius Hindi Bengali ICS Intel Compute Stick PCB Design Python UI Design using Tkinter HOBBIES HTML iti CSS G JavaScript JQuery IOT\n"
>>> s = text[text.find(' in/')+1:]
>>> print (s[0:s.find(' ')])
in/sambhu-patra-49b4759/

Here is one way to do it.
import re
regex = re.compile(r"/\s?in/(.*?)/")
def check(text):
    # search for "/", an optional space, "in/", then capture up to the next "/"
    search = regex.search(text)
    if search is not None:
        print(search.group(1))
check(text)
Output
sambhu-patra-49b4759

How do I capture a pattern that spans multiple lines with a Regex?

Please help. I've read dozens of Stack Overflow articles and online tutorials and can't figure this out!
I need a regular expression that is going to return a match that spans multiple lines and I'm not sure how to do it. For example the text is,
1) 11-JAN-2019 11:04 AM I RF HQCSQT
John Doe,Construction,555-555-5555,
2) 11-JAN-2019 1:42 PM I ADD HQCSQT
John Doe/Construction Worker Request El
ectronic Add Wires: 7600SB=. Building c
odes: ,
3) 11-JAN-2019 1:54 PM I ADD STM003
John Doe/Construction Worker Request El
ectronic Add Wires: 1430SBX=. Building
codes: ,
There are two matches that should come from the above string: entries 2 and 3, each up to the trailing comma (","). See below for an example of a match.
2) 11-JAN-2019 1:42 PM I ADD HQCSQT
John Doe/Construction Worker Request El
ectronic Add Wires: 7600SB=. Building c
odes: ,
So I want to capture the text that starts with the pattern
^\d\)\s+\d\d-\w+-\d+\s+\d+:\d+\s+\w+\s+I\s+ADD\s+(HQCSQT|STM003)
and ends with the pattern
(,\s)$
Note: I tested "(,\s)$" and it is how the line ends when the multiline option is enabled.

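You're already there. You must not be setting the regex options properly. You need to use both Singleline and Multiline modes at the same time: Multiline makes ^ and $ match at the start and end of each line, and Singleline lets . match newline characters.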
Dim input As String = "
1) 11-JAN-2019 11:04 AM I RF HQCSQT
John Doe,Construction,555-555-5555,
2) 11-JAN-2019 1:42 PM I ADD HQCSQT
John Doe/Construction Worker Request El
ectronic Add Wires: 7600SB=. Building c
odes: ,
3) 11-JAN-2019 1:54 PM I ADD STM003
John Doe/Construction Worker Request El
ectronic Add Wires: 1430SBX=. Building
codes: ,
"
Dim pattern As String = "^\d\)\s+\d\d-\w+-\d+\s+\d+:\d+\s+\w+\s+I\s+ADD\s+(HQCSQT|STM003).*?,\s$"
Dim matches As Integer = Regex.Matches(input, pattern, RegexOptions.Multiline Or RegexOptions.Singleline).Count
Console.WriteLine(matches) ' Outputs "2"
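
For comparison, here is a minimal Python sketch of the same idea, where re.MULTILINE and re.DOTALL play the roles of RegexOptions.Multiline and RegexOptions.Singleline. It assumes the log uses plain \n line endings, so the ending is written as ,\s*$ rather than ,\s$; the variable names are illustrative and not part of the original answer.

```python
import re

# The sample log from the question, joined with "\n" (an assumption about line endings).
log = "\n".join([
    "1) 11-JAN-2019 11:04 AM I RF HQCSQT",
    "John Doe,Construction,555-555-5555,",
    "2) 11-JAN-2019 1:42 PM I ADD HQCSQT",
    "John Doe/Construction Worker Request El",
    "ectronic Add Wires: 7600SB=. Building c",
    "odes: ,",
    "3) 11-JAN-2019 1:54 PM I ADD STM003",
    "John Doe/Construction Worker Request El",
    "ectronic Add Wires: 1430SBX=. Building",
    "codes: ,",
])

# MULTILINE: ^ and $ match at line boundaries. DOTALL: . also matches "\n".
pattern = re.compile(
    r"^\d\)\s+\d\d-\w+-\d+\s+\d+:\d+\s+\w+\s+I\s+ADD\s+(?:HQCSQT|STM003).*?,\s*$",
    re.MULTILINE | re.DOTALL,
)

entries = [m.group(0) for m in pattern.finditer(log)]
print(len(entries))
for entry in entries:
    print(entry)
    print("------------")
```

The lazy .*? matters here: it stops at the first comma that ends a line, so each match covers exactly one ADD entry instead of running on to the end of the input.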

It seems like the text can simply be split on double newlines:
Dim parts = Split(text, vbCrLf & vbCrLf)
For i = 0 To parts.Length - 2 Step 2 ' stop early so that parts(i + 1) stays in range
Debug.Print(parts(i) & vbCrLf & vbCrLf & parts(i + 1) & vbCrLf & "------------")
Next

Regex how to find pattern?

I need to extract part of each line below using a regex. I have already found patterns for dddd-dddd and dddd-ddd[x], but what about the text? I need to get the title as a string, for example "British Journal of Applied Science & Technology". How do I write that as a regex?
337 British Journal of Applied Science & Technology 2231-0843 5
338 British Journal of Economics, Management & Trade 2278-098X 5
339 British Journal of Education, Society & Behavioural Science 2278-0998 6
340 British Journal of Environment and Climate Change 2231-4784 5
341 British Journal of Mathematics & Computer Science 2231-0851 4
342 British Journal of Medicine and Medical Research 2231-0614 8
343 British Journal of Pharmaceutical Research 2231-2919 4
344 British Microbiology Research Journal 2231-0886 9
345 Bromatologia i Chemia Toksykologiczna 0365-9445 5
346 Budownictwo Górnicze i Tunelowe 1234-5342 5
347 Budownictwo i Architektura 1899-0665 3
348 Budownictwo, Technologie, Architektura 1644-745X 3
349 Builder 1896-0642 2
350 Built Environment 0263-7960 10
351 Bulgarian Journal of Veterinary Medicine 1311-1477 8
352 Bulgarian Medicine 1314-3387 2
353 Bulletin de la Société des sciences et des lettres de Łódź, Série: Recherches sur les déformations 0459-6854 7
354 Bulletin of Alfred Nobel University. Series "Legal Science" 2226-2873 6
355 Bulletin of Geography. Socio-economic Series 1732-4254 10
356 Bulletin of Geography: Physical Geography Series 2080-7686 9
357 Bulletin of the Polish Academy of Sciences. Mathematics 0239-7269 9
358 Business and Economic Horizons 1804-1205 8
359 Business and Economics Research Journal 1309-2448 10
360 Business Process Management Journal 1463-7154 10

(?<=\d\s)\D+(?=\s\d)
That should find what you need. If you are interested in how it works:
The first part of the regex ((?<=\d\s)) declares that the searched phrase must come after a digit (\d) followed by a whitespace character (\s).
The second part (\D+) is what is actually found. It means one or more non-digit characters.
The third part ((?=\s\d)) makes sure that the result is followed by another whitespace character and a digit.
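
As a quick check of the three parts described above, here is a small Python sketch that applies the pattern to a few lines from the question. Python is used only for illustration (the question does not name a language), and the variable names are made up.

```python
import re

# Lookbehind: digit + whitespace; body: non-digits; lookahead: whitespace + digit.
TITLE = re.compile(r"(?<=\d\s)\D+(?=\s\d)")

lines = [
    "337 British Journal of Applied Science & Technology 2231-0843 5",
    "349 Builder 1896-0642 2",
    "353 Bulletin de la Société des sciences et des lettres de Łódź, "
    "Série: Recherches sur les déformations 0459-6854 7",
]

titles = [TITLE.search(line).group(0) for line in lines]
for title in titles:
    print(title)
```

Because the lookbehind and lookahead are zero-width, the surrounding numbers and spaces are required but not included in the match.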

You can do it with an expression that uses lookahead and lookbehind, like this:
(?<=\d{3}\s).*(?=\s\d{4}-)
This expression requires three digits followed by space in front of the text, and four digits preceded by space and followed by a dash after the text. The name itself is matched by a straight .* pattern.
Demo.

Since you don't specify a target language or anything like that, here's how you could do it with perl:
cat test.txt | perl -pe 's/^\d+\s//' | perl -pe 's/[0-9X "-]+$//'
The second expression might need adapting depending on what the rest of your data looks like.
This prints:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
[snip]
Bulletin of the Polish Academy of Sciences. Mathematics
Business and Economic Horizons
Business and Economics Research Journal
Business Process Management Journal

\d+ (.+) ....-.... \d+
Extracting:
British Journal of Applied Science & Technology
British Journal of Economics, Management & Trade
British Journal of Education, Society & Behavioural Science
British Journal of Environment and Climate Change
British Journal of Mathematics & Computer Science
British Journal of Medicine and Medical Research
British Journal of Pharmaceutical Research
[... cut ...]

(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})
This splits the string into 3 capture groups:
3 digits
Anything NOT containing a digit, up to the next digit
The reference at the end (assumes it begins with 4 digits and is in a consistent format)
See demo here
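
A short Python sketch of how the three groups come out in practice (Python is an assumption here, since the question does not name a language). Because [\D]+ runs right up to the first digit of the reference, the second group ends with a trailing space, so the sketch strips it.

```python
import re

ROW = re.compile(r"(\d{3})\s([\D]+)(\d{4}-\d{3,4}X?\s\d{1,2})")

line = "338 British Journal of Economics, Management & Trade 2278-098X 5"
m = ROW.search(line)

number = m.group(1)
raw_title = m.group(2)       # includes the space before the reference
title = raw_title.strip()    # cleaned-up title
reference = m.group(3)

print(number)
print(repr(raw_title))
print(title)
print(reference)
```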

I understand you are looking for a regex, but if you want something slightly more straightforward, it looks like your document can easily be parsed using simple string manipulation. I offer this idea as an alternative for people not looking to use regex.
// requires: import java.util.StringTokenizer;
String tmp = "340 British Journal of Environment and Climate Change 2231-4784 5";
String ending = tmp.substring(tmp.length() - 11); // the reference at the end, "2231-4784 5"
tmp = tmp.substring(0, tmp.length() - 11); // strip the ending
StringTokenizer st = new StringTokenizer(tmp, " ");
String index = st.nextToken(); // reads the leading number up to the first space
tmp = tmp.substring(index.length()).trim(); // strip the index and the surrounding spaces
Now tmp is the name of the journal, index is the leading number, and the reference at the end is saved as ending. This method only works if every line has the same layout as the one above; for instance, a two-digit final value (as in entry 350) breaks the fixed offset of 11 characters.

This one:
(?<=\d\s)\D+(?=\s\d)
works very well, but I found that titles in my PDF can contain digits, for example
338 British Journal of 5Economics, Management & Trade 2278-098X 5
How can I parse this properly?
PS: I am writing my app in C# (.NET).
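
One possible way to tolerate digits inside titles is to anchor on the fixed parts of each line instead: the leading number, the ISSN near the end, and the final number. The sketch below is in Python for brevity; the pattern itself uses only constructs that the .NET regex engine also supports. It assumes every line ends with an ISSN of the form dddd-dddd or dddd-dddX followed by a number, as in the list above, and extract_title is a hypothetical helper name.

```python
import re

# Leading number, lazily captured title, ISSN (last char digit or X), final number.
ROW = re.compile(r"^\d+\s+(.+?)\s+\d{4}-\d{3}[\dX]\s+\d+$")

def extract_title(line):
    """Return the title part of a line, or None if the line has another layout."""
    m = ROW.match(line.strip())
    return m.group(1) if m else None

print(extract_title("338 British Journal of 5Economics, Management & Trade 2278-098X 5"))
print(extract_title("350 Built Environment 0263-7960 10"))
```

Since the title is captured with .+? rather than \D+, digits in the title no longer end the match; only the ISSN-shaped suffix does.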

Generic regex with format specific conditions

Can anyone help me with a generic regex (in Visual Basic) that can handle the formats below?
2100
2.100
2,100
2 100
2  100 (two spaces between "2" and "1")
10100
10.100
10,100
10 100
10  100 (two spaces between "10" and "100")
The regex should match all numbers in the above formats, not only the 2100 and 10100 examples.
b) I also need a generic regex that matches the formats above but does not match these formats:
2.10
2,10
2.1
2,1
10.1
10,1
10.10
10,10
The regex I have tried, which does not work, is:
Regex(\d+(?:[,.]| {1,2})\d+$)

How about this:
^\d+(([.,]|\s{1,2})\d+)?$
Note the ^ anchor and the group (([.,]|\s{1,2})\d+), which is made optional with a ?.
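
Note that this pattern also accepts strings such as 2.10 or 10,1, which part (b) of the question wants to reject. A stricter variant is sketched here in Python, under the assumption that a valid number is one to three digits, an optional separator, and then exactly three digits. The pattern uses only constructs that .NET's Regex class also supports; is_valid is a hypothetical helper name.

```python
import re

# 1-3 digits, optionally a separator (dot, comma, or one or two spaces),
# then exactly three digits.
NUMBER = re.compile(r"\d{1,3}(?:[.,]| {1,2})?\d{3}")

def is_valid(s):
    # fullmatch requires the whole string to match, like ^...$ in the answer
    return NUMBER.fullmatch(s) is not None

accepted = ["2100", "2.100", "2,100", "2 100", "2  100",
            "10100", "10.100", "10,100", "10 100", "10  100"]
rejected = ["2.10", "2,10", "2.1", "2,1", "10.1", "10,1", "10.10", "10,10"]

print(all(is_valid(s) for s in accepted))
print(not any(is_valid(s) for s in rejected))
```

Requiring exactly three digits after the separator is what separates 2.100 from 2.10; whether numbers below 1000 or with several separators should also be accepted is not specified in the question.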

Vertical spacing of cells containing a parbox

I have a complicated longtable with several levels of nested tabular environments. To get text wrapping inside cells and have the contents aligned at the top I use \parbox[t][][t], however, the height of the parbox is computed without any margin such that the following \hline overlaps with the text.
A minimal example to reproduce this behavior is
\documentclass{article}
\begin{document}
\begin{tabular} {|p{0.2\textwidth}|}
\hline
This cell looks good. \\
\hline
\parbox[t][][t]{1.0\linewidth}{
Not so happy with this.
} \\
\hline
\end{tabular}
\end{document}
This produces the following output (sorry, can't post images yet):
image of generated output
Of course, there is no reason to use a parbox in the example above, but I need them in the actual document.
I would like to avoid providing the height of the parbox (such as \parbox[t][5cm][t]). Is there a clean way to add a margin either to the bottom of a parbox or before an hline?

Sorry to answer my own question, but I have found a solution by adding vspace to each cell outside the parbox.
Here's the code:
\documentclass{article}
\begin{document}
\newcommand{\pb}[1]{\parbox[t][][t]{1.0\linewidth}{#1} \vspace{-2pt}}
\begin{tabular} {|p{0.2\textwidth}|}
\hline
This cell looks good. \\
\hline
\pb{
Now I'm happy with this.
} \\
\hline
\end{tabular}
\end{document}
The output: image of generated output
I missed that before because I didn't have a space between the closing brace of the parbox and the vspace. Turns out that space is crucial.
