Markdown List Parsing - list

I am writing a markdown document with a lot of short lists, each introduced by a heading sentence:
This is a sentence describing a list:

- Hello
- World
This renders fine, but the space between the heading sentence and list makes the markdown disorganized, especially since my document has so many short lists. I'd like to do something like this:
This is a sentence describing a list:
- Hello
- World
so that there is no space between the heading and the list in the markdown file. Unfortunately, markdown renders that as one big sentence ("This is a sentence describing a list: - Hello - World") and not as a list. Is there a way to force a break at the end of the line so that markdown recognizes the dashes as a list? A solution of this format
This is a sentence describing a list:[something like \newline]
- Hello
- World
would be perfect. I would like to do this in straight markdown, but for now I have the additional flexibility of LaTeX commands, since I am embedding markdown within a LaTeX document (using this package: https://ctan.math.washington.edu/tex-archive/macros/generic/markdown/markdown.pdf).

You can use fake bullet lists to achieve this. You can build them up using non-breaking spaces and a ⦁ (Z NOTATION SPOT) character.
This is a sentence describing a list:[space][space]
⦁ Hello[space][space]
⦁ World
Result:
This is a sentence describing a list:
   ⦁  Hello
   ⦁  World
For comparison, here's a real bullet list:
This is a sentence describing a list:

- Hello
- World

How can I create a Regex that matches and transforms a period delimited path?

I am using den4b ReNamer to rename a lot of files that follow a specific pattern. The program allows me to use RegEx (https://www.den4b.com/wiki/ReNamer:Regular_Expressions).
I am stuck trying to conjure up an expression for a specific pattern.
My current RegEx:
Expression: ^(com\.)(([\w\s]*\.){0,4})([\w\s]*)$
Replace: \L$1\L$2\u$4
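Note: \L and \u change the case of the sub-expression that follows them (\L converts it to lower case, \u converts its first character to upper case), as defined in the ReNamer documentation linked above.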
Here are a few example strings so you can get an idea of the input:
Android File Transfer.svg
Angular Console.svg
Au.Edu.Uq.Esys.Escript.svg
Avidemux.svg
Blackmagic Fusion8.svg
Broken Sword.svg
Browser360 Beta.svg
Btsync GUI.svg
Buttercup Desktop.svg
Calc.svg
Calibre EBook Edit.svg
Calibre Viewer.svg
Call Of Duty.svg
com.GitHub.Plugarut.Pwned Checker.svg
com.GitHub.Plugarut.Wingpanel Monitor.svg
com.GitHub.Rickybas.Date Countdown.svg
com.GitHub.Spheras.Desktopfolder.svg
com.GitHub.Themix Project.Oomox.svg
com.GitHub.Unrud.Remote Touchpad.svg
com.GitHub.Unrud.Video Downloader.svg
com.GitHub.Weclaw1.Image Roll.svg
com.GitHub.Zelikos.Rannum.svg
com.Gitlab.Miridyan.Mt.svg
com.Inventwithpython.Flippy.svg
com.Neatdecisions.Detwinner.svg
com.Rafaelmardojai.Share Preview.svg
com.Rafaelmardojai.Webfont Kit Generator.svg
Distributor Logo Antix.svg
Distributor Logo Archlabs.svg
Distributor Logo Dragonflybsd.svg
DOSBox.svg
Drawio.svg
Drweb GUI.svg
For this question I am focused on the strings that begin with com.xxx.xxx.
Since I can't only target those names in Renamer, the expression has to "play nice" with the other input file names and correctly leave them alone. That's why I've prefixed my expression with ^(com\.)
What I want:
Transform the entire string to lower case except for the last period separated part of the string.
Strip white space from the entire string.
For instance:
Original: com.GitHub.Alcadica.Develop.svg
After my Regex: com.github.alcadica.Develop.svg
What I want: com.github.alcadica.Develop.svg
This specific file is correctly renamed. What I'm having trouble with are names that have spaces in any part of the string. I can't figure out how to strip whitespace:
Original: com.Belmoussaoui.Read it Later.svg
After my Regex: com.belmoussaoui.Read it Later.svg
What I want: com.belmoussaoui.ReaditLater.svg
Here is a hypothetical example because I couldn't find a file with more than four parts. I want my pattern to be robust enough to handle this:
Original: com.Shatteredpixel.Another Level.Next.Pixel Dungeon.svg
After my Regex: com.shatteredpixel.another level.next.Pixel Dungeon.svg
What I want: com.shatteredpixel.anotherlevel.next.PixelDungeon.svg
Note that since I'm not using any kind of programming language, I don't have access to common string operations like trim, etc. I can, however, stack expressions. But this would create more overhead and since I am renaming thousands of files at a time I'd ideally like to keep it to one find/replace expression.
Any help would be greatly appreciated. Please let me know if I can provide any more information to make this more clear.
Edit:
I got it to work with the following rules:
Really inefficient, but it works. (Thanks to Jeremy in the comments for the idea)
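For comparison, here is a minimal Python sketch of the transformation described above. It is not a ReNamer rule and not the solution from the edit; it only spells out the intended result. The function name normalize_name is hypothetical, and the sketch assumes every file name ends in an extension such as .svg.

```python
import os
import re


def normalize_name(filename: str) -> str:
    """Lower-case every dot-separated part except the last and strip whitespace.

    Names that do not start with "com." are returned unchanged.
    """
    stem, ext = os.path.splitext(filename)  # set the ".svg" extension aside
    if not stem.startswith("com."):
        return filename
    # Remove all whitespace inside each period-separated part.
    parts = [re.sub(r"\s+", "", part) for part in stem.split(".")]
    # Lower-case everything but the final part, which keeps its casing.
    head = [part.lower() for part in parts[:-1]]
    return ".".join(head + [parts[-1]]) + ext


for name in [
    "com.GitHub.Alcadica.Develop.svg",
    "com.Belmoussaoui.Read it Later.svg",
    "Android File Transfer.svg",
]:
    print(name, "->", normalize_name(name))
```

Splitting on the period first keeps the whitespace stripping and the case change independent of how many parts the name has, which is the part that is awkward to express in a single find/replace expression.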

Regular Expressions - Read through Text Doc and Extract Sentences with a Specific Word

I have a series of large text documents. I need to read through them and - if a particular word appears - extract the entire sentence.
So, if I'm searching for the word wobble and a sentence in the document is Weebles wobble but they don't fall down, I want to extract that sentence.
What is the most efficient way to do this?
I can think of two approaches to this:
Search the document for the word, then extract the particular sentence; or
Iterate through each sentence in the document. Check each sentence for the word. If the sentence has the word extract the sentence.
I would think 1 is more computationally efficient than 2. But not sure what the syntax would be.
Is there another approach I'm not considering?
Any help on efficiency and syntax appreciated.
You first need to split the text document into proper sentences. The best way to do that is with the nltk.data tokenizer; first make sure that you have installed the Python NLTK library properly.
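import nltk.data

# Load the pre-trained Punkt sentence tokenizer (needs the NLTK "punkt" data).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Read the whole document; the context manager closes the file afterwards.
with open("txt_file.txt") as txt:
    data = txt.read()

# Split the text into sentences, then keep those containing the word.
all_sentences = tokenizer.tokenize(data)
required_sentences = []
for each_sentence in all_sentences:
    if 'wobble' in each_sentence:
        required_sentences.append(each_sentence)
print(required_sentences)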
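The first approach from the question (find the word, then cut out the sentence around it) can be sketched with the standard library alone. This is only a rough illustration: the function name sentences_containing is hypothetical, and the sentence boundaries are naive (any ".", "!" or "?"), so abbreviations such as "Dr." will split a sentence early, which the NLTK tokenizer above handles better.

```python
import re

SENTENCE_ENDS = ".!?"  # naive terminators; an assumption of this sketch


def sentences_containing(text: str, word: str) -> list[str]:
    """Return each sentence of `text` that contains `word` as a whole word."""
    found = []
    covered_up_to = 0  # end index of the last sentence already extracted
    for match in re.finditer(r"\b" + re.escape(word) + r"\b", text):
        if match.start() < covered_up_to:
            continue  # this hit lies inside a sentence we already have
        # The sentence starts just after the previous terminator (or at 0).
        start = max(text.rfind(ch, 0, match.start()) for ch in SENTENCE_ENDS) + 1
        # The sentence ends at the next terminator (or at the end of the text).
        ends = [i for i in (text.find(ch, match.end()) for ch in SENTENCE_ENDS) if i != -1]
        end = min(ends) + 1 if ends else len(text)
        found.append(text[start:end].strip())
        covered_up_to = end
    return found


sample = "The toy is old. Weebles wobble but they don't fall down. Nothing here."
print(sentences_containing(sample, "wobble"))
```

Only the text around each hit is examined, so sentences without the word are never split or stored.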

Stripping superscript from plaintext

I often grab quotes from articles that include citations that include superscripted footnotes, which when copied are a pain in the ass. They show up as actual letters in the text as they are pasted in plaintext and not in html.
Is there a way I could run this through a regex to take out these superscripts?
For example
In the abeginning bGod ccreated the dheaven and the eearth.
Should become
In the beginning God created the heaven and the earth.
I can't think of a way to have regex search for misspellings and a corresponding sequential set of numbers and letters.
Any thoughts? I'm also using Sublime Text 3 for the majority of my writing, but I wouldn't mind outsourcing this to an AppleScript, or text replacement app (aText, textExpander, etc.).
Matching Code vs. Matching a Screen
It's hard to tell without seeing an example, but this should be doable if you copy the text from code view, as opposed to the regular browser view. (Ctrl or Cmd-J is your friend). Since writing the rules will take time, this will only be worthwhile for large chunks of text.
In code view, your superscript will be marked up in a way that can be targeted by regex. For instance:
and therefore bananas make you smartera
in the browser view (where the a at the end is a citation note) may look like this in code view:
and therefore bananas make you smarter<span class="mycitations">a</span>
In your editor, using regex, you can process the text to remove all tags, or just certain tags. The rules may not always be easy to write, and of course there are many disclaimers about using regex to parse html.
However, if your source is always the same (Wikipedia for instance), then you can create and save rules that should work across many pages.
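As a rough illustration, a saved rule for the example markup above could look like this in Python. The class name mycitations comes from the example; real pages will use different markup, so the pattern is an assumption that has to be adapted per source, and the function name strip_citations is hypothetical.

```python
import re

# Matches one citation marker as it appears in the example markup above.
# Non-greedy ".*?" stops at the first closing tag, so each marker is
# removed separately instead of swallowing the text between two markers.
CITATION = re.compile(r'<span class="mycitations">.*?</span>')


def strip_citations(html: str) -> str:
    """Remove citation spans, leaving the surrounding text untouched."""
    return CITATION.sub("", html)


source = 'and therefore bananas make you smarter<span class="mycitations">a</span>'
print(strip_citations(source))
```

This only removes the citation spans; any other tags in the copied source would need their own rule or a proper HTML parser.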

Regex expression for searching spaced/broken words in OCR PDFs (goo d ni g ht)

I need to search lots of OCR PDFs. I realized that the words and sentences look perfect visually, but if I copy and paste the content, there are spaces which shouldn't be there!
I can see in the text: good night
If I copy and paste somewhere: goo d ni g ht
I would appreciate advice on handling this situation with a regular expression, considering:
a) The simple case of short words, such as \bgood night\b for goo d ni g ht
b) When there is a line break in the sentence. The regular expression cannot search from one line to the next in the PDF, even within the same paragraph. For example, I am looking for
\bthe sun set and the night comes\b, but the PDF content looks like this when pasted:
line 1: t he sun set an d th e
line 2: nig ht co m es
Many thanks,
Cadu
This random occurrence of spaces in the middle of words can happen in PDFs.
The reason behind it is the complex format that PDF actually is.
You see, a PDF document is actually a container of instructions for rendering the text in a viewer.
Imagine instructions like:
go to position 50, 50.
draw the character 'G'
go to position 56, 50.
draw the character 'O'
etc
Whenever you select something in a viewer (for instance Adobe), the program has to figure out what content overlaps with your selection (already this is not an easy problem). If it's text, it then needs to decide where to add spaces and line breaks. Different viewers (or software) might use different metrics for this. A typical one, for instance, is "insert a space if two characters are further apart than the width of the space character in the same font".
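To make that heuristic concrete, here is a small Python sketch. The Glyph class, the coordinates and the space widths are all made up for illustration; real PDF libraries expose glyph positions in their own ways, and real viewers use more elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the drawn character
    width: float  # horizontal space the character occupies


def join_glyphs(glyphs: list[Glyph], space_width: float) -> str:
    """Rebuild a line of text, inserting a space wherever the gap between
    two neighbouring glyphs is larger than `space_width`."""
    out = []
    prev_right = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_right is not None and g.x - prev_right > space_width:
            out.append(" ")
        out.append(g.char)
        prev_right = g.x + g.width
    return "".join(out)


# "GOOD" drawn with slightly loose spacing between the two O characters.
line = [Glyph("G", 50, 6), Glyph("O", 56, 6), Glyph("O", 66, 6), Glyph("D", 72, 6)]
print(join_glyphs(line, space_width=3))  # threshold smaller than the gap
print(join_glyphs(line, space_width=5))  # threshold larger than the gap
```

The same glyphs come out with or without a space depending only on the threshold, which is how a word that looks fine on screen can be split when it is copied.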
The point is, getting text out of a PDF document is always kind of guesswork. And if you add the fact that it's an OCR PDF, you are adding a further layer of difficulties.
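For the search itself, one possible workaround is to build a pattern that allows optional whitespace (including line breaks) between every pair of letters. The sketch below does this in Python; the function name spaced_pattern is hypothetical, and the regex engine of a given PDF viewer may not accept the same syntax. Because every space is treated as optional, the pattern can also match across real word boundaries, so expect some false positives.

```python
import re


def spaced_pattern(phrase: str) -> re.Pattern:
    """Build a regex matching `phrase` even when stray whitespace or line
    breaks were inserted between any of its characters."""
    # Drop the spaces in the phrase itself; every gap becomes optional.
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    return re.compile(r"\b" + r"\s*".join(chars) + r"\b", re.IGNORECASE)


pasted = "t he sun set an d th e\nnig ht co m es"
print(bool(spaced_pattern("the sun set and the night comes").search(pasted)))
print(bool(spaced_pattern("good night").search("goo d ni g ht")))
```

Since \s also matches newline characters, the same pattern covers case b), where the sentence is broken across two lines.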

How to command with Notepad++ to organize a huge list?

Hey, I would like to know if there is a command that can reorganize a huge list, as in the example below:
Before:
hello:world
mars:jupiter
bomb:fire
water:earth
wind:cosmo
After:
hello
world
mars
jupiter
bomb
fire
water
earth
wind
cosmo
This example shows removing each : and starting a new line in its place. I can't do it by hand because the list contains more than 12,000 lines.
If it is not possible with Notepad++, then other tools are welcome.
Use find/replace in extended mode, in which you can specify \r\n. I.e., replace ':' with '\r\n'.
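Since other tools are welcome, the same replacement can be sketched in a few lines of Python. The function names and the file names in the usage note are hypothetical, and the sketch assumes the file is UTF-8 text that fits in memory, which 12,000 short lines easily do.

```python
def split_pairs(text: str) -> str:
    """Put each side of every 'left:right' pair on its own line."""
    return text.replace(":", "\n")


def convert_file(src: str, dst: str) -> None:
    """Read `src`, split the pairs, and write the result to `dst`."""
    with open(src, encoding="utf-8") as infile:
        converted = split_pairs(infile.read())
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(converted)


print(split_pairs("hello:world\nmars:jupiter"))
```

Calling convert_file("list.txt", "list_split.txt") would process a whole file. Python writes "\n" using the platform's line ending by default, so on Windows the output gets the same \r\n as the Notepad++ replacement.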