Stata: removing line feed control characters - stata

I have a dataset which I export to a CSV file with the outsheet command. Some rows break onto a new line at a certain place. Using a hexadecimal editor, I could identify the line feed control character "0a" in the record. The value of the variable producing the line break visually shows (in Stata) only 5 characters. But if I count the number of characters:
gen xlen = length(x)
I get 6. I could write a Perl program to get rid of this problem, but I would prefer to remove the control characters in Stata before exporting (for example using regexr()). Does anyone have an idea how to remove the control characters?

The char() function calls up particular ASCII characters. So, you can delete such characters by replacing them with empty strings.
replace x = subinstr(x, char(10), "", .)

Related

SUM multiple values after a substring within all cells in a column in Google Sheets

For an open source chat analyser in Google Sheets, I need to extract all numeric values after a substring (Example), then total them.
For example, if a cell contains Example1 another text 123 Example500 text, Example1 and Example500 should be extracted out, and their numeric values summed to 501.
This is complicated further by needing to obtain the total for a column of messages.
What I've tried already:
=REGEXEXTRACT(A1, "Example(\d+)"): This only extracts the first matching value, but works!
=SUM(SPLIT(A1, "Example")): This works for messages that only include my target string, but falls apart when other strings are included. The output could possibly be filtered to results that start with a number, but this is very messy and possibly a red herring.
CONCATENATEing all my cells together, then searching for numbers. This is error-prone due to additional numbers within messages.
Another idea is to replace every match of Example(\d+)|. with "$1 ", i.e. the captured digits followed by a space (regex101 demo). On the right side of the alternation (any other single character), $1 is unset, so that character is replaced by a space alone. Then split on spaces and sum the numbers; any other digits in the message have already been removed. If Example is a placeholder, replace it with e.g. [[:alpha:]]+ for one or more alphabetic characters.
=IF(ISTEXT(A1);SUM(SPLIT(REGEXREPLACE(A1;"Example(\d+)|.";"$1 ");" "));0)
I added IF(ISTEXT(A1);...) so that only text in the source field is processed (to avoid errors). If the field is empty or not text, the result is set to 0. If the field always contains text, this check is unneeded and can be removed.
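The replace-then-split logic can be checked outside Sheets. The following is a minimal Python sketch of the same idea, not part of the formula itself; the function name sum_after_prefix is made up for this illustration, and the sample message is the one from the question.

```python
import re

def sum_after_prefix(message, prefix="Example"):
    """Sum the numbers that directly follow `prefix` in `message`.

    Mirrors the Sheets formula: each match of prefix(\\d+) is replaced by
    its captured digits plus a space, every other character by a space
    alone, and the remaining tokens are summed.
    """
    if not isinstance(message, str):
        return 0  # plays the role of the ISTEXT guard
    pattern = re.escape(prefix) + r"(\d+)|."
    # A function is used so that an unset group yields "" rather than None.
    digits_only = re.sub(pattern, lambda m: (m.group(1) or "") + " ", message)
    return sum(int(token) for token in digits_only.split())

print(sum_after_prefix("Example1 another text 123 Example500 text"))  # 501
```

A column total would then be the sum of this function over all messages, which is what the array version of the formula does row by row.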
Edit from @TheMaster: As an array formula, we can use BYROW
=BYROW(A:A; LAMBDA(row; IF(ISTEXT(row); SUM(SPLIT(
REGEXREPLACE(row;"Example(\d+)|.";"$1 ");" "));)))
try:
=LAMBDA(x, REGEXEXTRACT(A1, "(\w+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\w+\d+")),
REGEXEXTRACT(x, "\w+(\d+)"), )))(SPLIT(A1, " "))
update 1:
=LAMBDA(x, REGEXEXTRACT(A1, "(\D+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), )))(SPLIT(A1, " "))
update 2:
=INDEX(LAMBDA(xx, REGEXEXTRACT(xx, "(\D+)\d+")&
BYROW(LAMBDA(x, IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), ))(SPLIT(xx, " ")), LAMBDA(x, SUMPRODUCT(x))))
(A1:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
if you start from A2 just change A1: to A2:

BigQuery remove <0x00> hidden characters from a column

I have a table, my_table, with unwanted hidden characters:
id | fruits
1  | STuff1 stuff_2 ����������������������
2  | Blahblah-blahblah �������������
3  | nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too broad: I'm worried that it could accidentally exclude or replace important data. I only want to target these weird characters specifically.
Upon opening the data as CSV, the weird characters show as <0x00>. How do I solve this?
First you have to identify which character this is, because it is non-printable and the symbol shown is just an arbitrary representation. To replace it without removing any other important information, do the following:
Identify the hexadecimal code of the character. Copy it from the CSV and paste it on this site:
Use the replace function in BigQuery to replace the character with that hex code, as follows:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
If the code of your character is different from fffd, put your value in the chr() function.
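The hexadecimal code can also be read off locally instead of on a website. A minimal Python sketch, assuming the cell value has been copied into a string; the helper name suspicious_code_points and the sample value are hypothetical:

```python
import unicodedata

def suspicious_code_points(value):
    """Return the distinct hidden characters in `value` as hex strings."""
    found = []
    for ch in value:
        # Unicode categories starting with "C" cover control and format
        # characters; U+FFFD is the replacement character, checked explicitly.
        if unicodedata.category(ch).startswith("C") or ch == "\ufffd":
            code = f"{ord(ch):04x}"
            if code not in found:
                found.append(code)
    return found

sample = "STuff1 stuff_2 \x00\x00\ufffd"
print(suspicious_code_points(sample))  # ['0000', 'fffd']
```

Whatever value is reported is the one to pass to chr(0x...) in the query.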

Applying a regular expression to a text file Python 3

#returns same result i.e. only the first line as many times as 'draws'
infile = open("results_from_url.txt",'r')
file =infile.read() # essential to get correct formatting
for line in islice(file, 0, draws): # allows you to limit number of draws
    for line in re.split(r"Wins",file)[1].split('\n'):
        mains.append(line[23:38]) # slices first five numbers from line
        stars.append(line[39:44]) # slices last two numbers from line
infile.close()
I am trying to use the above code to iterate through a list of numbers to extract the bits of interest. In this attempt to learn how to use regular expressions in Python 3, I am using lottery results opened from the internet. All this does is to read one line and return it as many times as I instruct in the value of 'draws'. Could someone tell me what I have done incorrectly, please. Does re 'terminate' somehow? The strange thing is if I copy the file into a string and run this routine, it works. I am at a loss - problem 'reading' a file or in my use of the regular expression?
I can't tell you why your code doesn't work, because I cannot reproduce the result you're getting. I'm also not sure what the purpose of
for line in islice(file, 0, draws):
is, because you never use the line variable after that, you immediately overwrite it with
for line in re.split(r"Wins",file)[1].split('\n'):
Plus, you could have used file.split('Wins') instead of re.split(r"Wins",file), so you aren't really using regex at all.
Regex is a tool to find data of a certain format. Why do you use it to split the input text, when you could use it to find the data you're looking for?
What is it you're looking for? A sequence of seven numbers, separated by commas. Translated into regex:
(?:\d+,){7}
However, we want to group the first 5 numbers - the "mains" - and the last 2 numbers - the "stars". So we'll add two named capture groups, named "mains" and "stars":
(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})
This pattern will find all numbers you're looking for.
import re

# Read the whole file; the with-block closes it automatically
with open("infile.txt", 'r') as infile:
    data = infile.read()

mains = []
stars = []
# Five comma-terminated numbers ("mains") followed by two ("stars")
pattern = r'(?P<mains>(?:\d+,){5})(?P<stars>(?:\d+,){2})'
iterator = re.finditer(pattern, data)  # lazily yields one match per draw
for count in range(int(input('Enter number of draws to examine: '))):
    try:
        match = next(iterator)
    except StopIteration:
        print('no more matches')
        break
    mains.append(match.group('mains'))
    stars.append(match.group('stars'))
print(mains, stars)
This will print something like ['01,03,31,42,46,'] ['04,11,']. You may want to remove the commas and convert the numbers to ints, but in essence, this is how you would use regex.
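For the clean-up step just mentioned, one possible sketch: strip the trailing comma from each captured group, split on commas, and convert each piece to int. The function name to_ints is made up for this example, and the input lists are the sample output shown above.

```python
def to_ints(group):
    """Turn a captured group such as '01,03,31,42,46,' into a list of ints."""
    return [int(number) for number in group.rstrip(",").split(",")]

mains = ['01,03,31,42,46,']
stars = ['04,11,']
main_numbers = [to_ints(group) for group in mains]
star_numbers = [to_ints(group) for group in stars]
print(main_numbers, star_numbers)  # [[1, 3, 31, 42, 46]] [[4, 11]]
```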

Copying only the value at column n Vim

I have a file with long lines and need to see/copy the values at specific location(s) for the whole file, without copying the rest of the line.
If the text width is small enough, ~184 columns, I can use :set colorcolumn=num to highlight the value. However, beyond 184 characters the scrolling gets a bit unwieldy.
I tried :g/\%1237c/y Z, for one of the positions I needed, but that yanked the entire line.
E.g., for the smaller sample below, :g/\%49c/y Z will yank all of lines 1 and 2, but I want to yank, or copy, only the character at that column, i.e. = on line 1 and x on line 2.
vim: filetype=help foldmethod=indent foldclose=all modifiable noreadonly
Table of Contents *sfcontents* *vim* *regex* *sfregex*
*sfsearch* - Search specific commands
|Ampersand-replaces-previous-pattern|
|append-a-global-search-to-a-register|
*sfHelp* Various Help related commands
There are two problems with your :g command:
For each matching line, the cursor is positioned on the first column. So even though you've matched at a particular column, that position is lost.
The \%c atom actually matches byte indices (what Vim somewhat confusingly names "columns"), so your measurement will be off for Tab and non-ASCII characters. Use the virtual column atom \%v instead.
Instead of :global, I would use :substitute with a replace-expression, in the idiom described at how to extract regex matches using vim:
:let t=[] | %s/\%49v./\=add(t, submatch(0))[-1]/g | let @@ = join(t, "\n")
Alternatively, if you install my ExtractMatches plugin, it'd be this short command invocation:
:YankMatchesToReg /\%50v./

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by their first letter (P, N, L, S). Each package must have at least two dedicated lines starting with *|, which are then read as comments. Blank lines between different letters are filled with the characters *|. Lines between different letters that do not begin with *| get it added. Blank lines and 'random' characters between identical letters are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string into lines by splitting on the newlines. (The newline characters depend on your operating system.)
lines = regexp(text, '\r\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll be needing to know your way around regular expressions. There's a good guide to them at regular-expressions.info.