Adding a space within a line in file with a specific pattern - replace

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace, so that the line becomes:
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d - \d, but what should I write in the replace field to get the output above? I have over 4000 similar lines and cannot edit them manually. Also, can this be done in Python?

Perhaps you could use findall to get your matches and then use join with a space to return a string in which the values are separated by a single space.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
print(" ".join(matches))
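The question also asks what to put in the replace field. A minimal sketch in Python, assuming every missing space sits between a digit and the sign of the next number: capture the digit and the signed digit as two groups and put a space between them. The function name add_missing_space is made up for this illustration.

```python
import re

# Assumption: a sign that directly follows a digit starts a new number.
# The sign inside an exponent (E+01, E-03) follows "E", so it is left alone.
pattern = re.compile(r"(\d)([+-]\d)")

def add_missing_space(line):
    # Re-emit both captured groups with a space between them.
    return pattern.sub(r"\1 \2", line)

line = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
fixed = add_missing_space(line)
print(fixed)
```

To process a whole file, the same function can be applied to each line in turn. In an editor with regex search and replace, the equivalent would be searching for (\d)([+-]\d) and replacing with \1 \2, although some editors write the groups as $1 $2.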

You could do this very easily in MS Excel:
1) Copy the content of your file into a new Excel sheet, in one column.
2) Select the whole column and, on the Data ribbon, choose Text to Columns.
3) In the wizard dialog that appears, select Fixed width, then click Next.
4) Click the position where you want the new space; this tells Excel to split the text after that position into a new column. Click Next.
5) Select each column header, set the column data format to Text to keep all formatting, and click Finish.
6) Copy the new columns or export them to a new text file.
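The same fixed-width idea can be sketched in Python, assuming (as the Excel approach does) that the break falls at the same character position on every line. The position 18 is derived from the sample line only and is an assumption about the rest of the file.

```python
# Assumption: the second number always ends at character 18, as in the sample line.
SPLIT_AT = 18

def insert_space_at(line, position=SPLIT_AT):
    # Split the line at a fixed column and rejoin the halves with a space.
    return line[:position] + " " + line[position:]

sample = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
print(insert_space_at(sample))
```

If the first column varies in width between lines, a fixed position will not work and a pattern-based approach is the safer choice.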

Related

How to extract text under specific headings from a pdf?

I want to extract text under specific headings from a pdf using python.
For example, I have a pdf with headings Introduction,Summary,Contents. I need to extract only the text under the heading 'Summary'.
How can I do this?
This scenario is exactly what I am working on at my current company: we need to extract the text under a heading. I use a rule-based system, i.e. I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want. This input is matched against the list of headings, and I use Universal Sentence Encoder to find the nearest match. After that I display all the content from that heading up to the next heading.
A PDF is unstructured text, so there are no tags from which to extract data directly. Instead, we use regular expressions to find the desired information in the text.
Extract the raw page text using the following code.
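import re
import fitz  # PyMuPDF

pdf_file = fitz.open("document.pdf")  # hypothetical file name; the original snippet did not show this step
page = pdf_file.loadPage(0)  # 0 is the page number; valid values are 0 to n-1
dl = page.getDisplayList()
tp = dl.getTextPage()
tp_text = tp.extractText()
# Newer PyMuPDF releases use snake_case names (load_page, get_displaylist, get_textpage).
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', tp_text)
Then apply a regular expression as needed, as in the re.split call above (this pattern worked for me, but you may need to change it).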
Here is a detailed example of how this works:
re.findall(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['\n1. heading 1\n', '\n1.2.3 Heading 2\n']
You can use re.split to split the text at the headings and retrieve the text under the heading you want.
re.split(r'\n\d+.+[ \t][a-zA-Z].+\n', "some text\n1. heading 1\nparagraph 1\n1.2.3 Heading 2\nparagraph 2")
Output: ['some text', 'paragraph 1', 'paragraph 2']
In short, the text under the i-th heading (counting from 0) is element i+1 of the split result.
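The findall and split results can be paired up so that a section is looked up by its heading. This is only a sketch: the sample text and the heading names in it are made up, and the pattern is the one quoted in this answer.

```python
import re

# Same heading pattern as in the answer above.
HEADING = r'\n\d+.+[ \t][a-zA-Z].+\n'

# Made-up sample text for illustration only.
text = "some text\n1. Introduction\nparagraph 1\n1.2.3 Summary\nparagraph 2"

headings = [h.strip() for h in re.findall(HEADING, text)]
bodies = re.split(HEADING, text)[1:]  # element 0 is the text before the first heading

# Heading i pairs with body i once the leading element is dropped.
sections = dict(zip(headings, bodies))
print(sections["1.2.3 Summary"])
```

Note that the pattern consumes the newline after each heading, so two headings on consecutive lines would not both be found; real documents may need a pattern adjusted for that.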
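The best method I found uses the regular expression below, where each match is a numbered heading together with the lines that follow it, up to the next numbered heading: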
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"
print(re.findall(regex,samplestring, re.M))
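A small runnable sketch of this regex, assuming headings of the form "number, space, title" at the start of a line. The content of samplestring is made up for illustration.

```python
import re

# Each match is a numbered heading line plus every following line that does
# not itself start with a heading number.
regex = r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*"

# Made-up sample text for illustration only.
samplestring = (
    "1 Introduction\n"
    "intro text\n"
    "2 Summary\n"
    "summary line 1\n"
    "summary line 2\n"
    "2.1 Contents\n"
    "contents text"
)

sections = re.findall(regex, samplestring, re.M)

# Pick the section whose heading line ends with the wanted title.
summary = next(s for s in sections if s.splitlines()[0].endswith("Summary"))
print(summary)
```

One limitation of this approach: a body line that happens to start with a number followed by a space would be treated as a new heading.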

Concatenate text with newlines in PowerBI

If I have a column in a table where each cell contains text, how can I push them as output into e.g. a card and separate the cells with a new line?
I have been using the CONCATENATEX function, which takes a delimiter argument; however the standard new line character ('\n') doesn't work.
It is possible to pass Unicode characters to the CONCATENATEX function, using the UNICHAR(number) function.
The number parameter corresponds to what looks like the decimal UTF-16 or UTF-32 encoding of the character.
This means a new line is given by UNICHAR(10).
A final solution might then be: CONCATENATEX(TableName, TableName[TextColumn], UNICHAR(10))
The accompanying screenshot shows:
the input table in Excel (top left)
The table once imported into Power BI Desktop (top right)
The Measure 'Description' and the output within a Card object (bottom)
In the last line of the Measure code, marked yellow, you can see the use of UNICHAR(10) as a new line separator.
If nothing were to be selected in the Slicer object (i.e. everything is selected by default - no filter is used), then "Show other text" would be displayed in the Card.
In the concatenate code, if you insert a Shift+Enter after the comma, you get a line break without breaking the code.
Example:
'Query1'[LetterCode],
",<Inserted SHIFT-ENTER here>
",
'Query1'[LetterCode],
Substitute \n with the string "#(cr)#(lf)" to create a new line (these are Power Query M escape sequences).

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
The output has to be in different columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace every _ with parentheses to create capture groups, while also excluding the digit string at the end, and then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them into their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text is first cleaned with =REGEXREPLACE(A1,"_+\d*$","") to remove 123421, and SPLIT then gives you one column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") // This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
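For checking the logic outside Sheets, the same two steps (strip the trailing number, then split on "_") can be sketched in Python. This only mirrors the formulas above and is not something Sheets itself runs.

```python
import re

a1 = "campaign1_attribute1_whatever_yes_123421"

# Step 1, like REGEXREPLACE(A1, "_+\d*$", ""): drop the trailing underscore and digits.
a2 = re.sub(r"_+\d*$", "", a1)

# Step 2, like SPLIT(A2, "_"): one list element per column.
a3 = a2.split("_")
print(a3)
```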
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets is that I need to use it in Google Data Studio (which has the same regex functions as Sheets).
To get each column, just use this REGEXP_EXTRACT function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to change in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to redo all the calculated fields, because (for example) you get an error with CPC, CTR and other AdWords metrics that are calculated automatically.
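The {n} pattern can be tried out in Python, whose re module accepts the same syntax for this expression. This is a sketch for testing the regex only; nth_field is a hypothetical helper name and is not part of Data Studio.

```python
import re

def nth_field(text, n):
    # n is (column number - 1): skip n "field_" prefixes, then capture the next field.
    match = re.search(r"^(?:[^_]*_){%d}([^_]*)_" % n, text)
    return match.group(1) if match else None

campaign = "campaign1_attribute1_whatever_yes_123421"
fields = [nth_field(campaign, n) for n in range(4)]
print(fields)
```

Because the pattern requires a "_" after the captured field, the final number is never returned: nth_field(campaign, 4) finds no match.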
Hope it helps!

Regular Expression in ms excel

How can I use regular expressions in Excel?
In my sheet I have columns A and B, with some values in column A. I need to move the data after = into column B. For example, the first row contains SELECT=Hello World; I want to remove the = sign and move Hello World to column B. How can I do this?
Stackoverflow has many posts about adding regular expressions to Excel using VBA. For your particular example, you would need VBA to actually move a substring from one cell to another.
If you simply want to copy the substring, you can do so easily using the MID function:
=IFERROR(MID(A1,FIND("=",A1)+1,999),A1)
I used 999 to ensure that enough characters were grabbed.
IFERROR returns the cell as-is if an equals sign is not found.
To return the portion of string before the equals sign, do this:
=LEFT(A1,FIND("=",A1&"=")-1)
In this case, I appended the equals sign to A1, so FIND won't return an error if not found.
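For comparison, the behaviour of the two formulas can be sketched in Python with str.partition, which splits at the first "=" only. The function name split_on_equals is made up for this illustration.

```python
def split_on_equals(cell):
    # str.partition splits at the first "=" only.
    before, sep, after = cell.partition("=")
    # Mirror the formulas: with no "=", both results fall back to the whole cell.
    return before, (after if sep else cell)

print(split_on_equals("SELECT=Hello World"))
print(split_on_equals("no equals sign"))
```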
You can use the Text to Columns functionality of MS Excel, giving '=' as the delimiter.
Refer to this link:
Chop text in column to 60 charactersblocks
You can simply use the Text to Columns feature of Excel for this.
Follow the steps below:
1) Select column A.
2) Go to the Data tab in the menu bar.
3) Click the Text to Columns icon.
4) Choose the Delimited option and click Next, then check Other under the delimiters and enter '=' in the entry box.
5) Click Finish.
Here is a URL for Text to Columns: http://www.excel-easy.com/examples/text-to-columns.html

Regular Expression Notepad increment numbers in every line

I have to add incrementing numbers near the beginning of every line using Notepad++.
Not at the very beginning, but like this:
when ID = '1' then data
when ID = '2' then data
when ID = '3' then data
.
.
.
.
when ID = '700' then
Is there any way I can increment these numbers by replacing with an expression, or is there a built-in Notepad++ function to do so?
Thanks
If you want to do this with Notepad++, you can do it in the following way.
First, write all 700 lines with the template text (you can use a macro or Edit -> Column Editor). Once you have written them, put the cursor where you want the number, then hold Shift+Alt and select all the lines.
It's not possible to accomplish this with a regular expression, as you will need to have a counter and make arithmetic operations (such as incrementing by one).
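A regular expression alone cannot count, but a scripting language can supply the counter through a replacement function. The following is a minimal sketch in Python, assuming a "#" placeholder marks where the number goes; the placeholder and the variable names are made up for this example.

```python
import itertools
import re

# Assumption: "#" marks where the running number should go in each line.
template_lines = ["when ID = '#' then data"] * 700

counter = itertools.count(1)  # yields 1, 2, 3, ...

# The replacement function is called once per match and pulls the next number.
numbered = [
    re.sub("#", lambda match: str(next(counter)), line)
    for line in template_lines
]

print(numbered[0])
print(numbered[-1])
```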
You can try the cc.p command of ConyEdit. It is a cross-editor plugin for text editors, including Notepad++.
With ConyEdit running, copy the text and the command line below, then paste:
when ID = '#1' then data
cc.p 700