REGEX - Remove Unwanted Text - regex

I have a list of items (for example, files in a folder); each item in the list is its own string.
In the example, the digits in X--Y-- increment.
My program has the filenames in a list, e.g. ["file1.txt", "file2.txt"]
item 1:
"X1Y2 alehandro alex.txt"
item 2:
"X1Y3 james file of files.txt"
For each string I want to keep only the first part (the "X1Y2" part), so I need to remove all the extra text from the filename.
I just want a regex that does this; I still struggle with regex.
I need to pass it through a replace-with-"" algorithm (I am using PowerRename from Microsoft PowerToys to do this).
Alternatives in PowerShell are also welcome.
Any advice would be appreciated.
I want the output to be the following:
["X1Y2.txt","X2Y3.txt","X4Y3.txt"]
with the unwanted extra text removed.

A general solution using re.sub along with a list comprehension might be:
import re
files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
output = [re.sub(r'(\S+).*\.(\w+)$', r'\1.\2', f) for f in files]
print(output) # ['X1Y2.txt', 'X1Y3.txt']
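To see what each part of that pattern does, here is a minimal sketch that spells the same regex out with re.VERBOSE. The helper name shorten is only an illustration and is not part of the answer above.

```python
import re

# Same pattern as in the answer, written in verbose mode so each part
# can carry a comment. Whitespace inside the pattern is ignored.
PATTERN = re.compile(r"""
    (\S+)    # group 1: leading run of non-space characters, e.g. X1Y2
    .*       # the unwanted text, greedy up to the last dot
    \.       # the literal dot before the extension
    (\w+)$   # group 2: the extension at the end of the name, e.g. txt
""", re.VERBOSE)


def shorten(name):
    """Hypothetical helper: keep only the leading token and the extension."""
    return PATTERN.sub(r"\1.\2", name)


if __name__ == "__main__":
    for f in ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]:
        print(shorten(f))
```

In a find-and-replace tool the same two groups would typically be referenced as $1.$2 in the replacement field, but that depends on the tool's regex flavor.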

Related

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your questions, you can do so quite easily with a pattern-range selection and the general substitution form with sed (stream editor). For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select lines between PATTERN_A and PATTERN_B
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/') where find is from beginning of line ^ capturing text between \(...\) that contains [ ]*- (any number of spaces and a hyphen) and then replace with \1 (called a backreference that contains all characters you captured with the capture group \(...\)) and appending CCC to its end.
Look things over and let me know if you have questions or if I misinterpreted your question.
You can get the same result with Perl:
> perl -pe ' { s/^(\s*-)/\1CCC/g if /PATTERN_A/../PATTERN_B/ } ' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
>
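For readers who prefer neither sed nor Perl, the same range idea can be sketched in Python. This is only an illustration under assumptions: the function name prepend_in_range and its parameters are made up here, and the markers are matched as plain substrings rather than as regexes.

```python
import re

# A list item: optional leading whitespace followed by a dash.
ITEM = re.compile(r"^(\s*-)")


def prepend_in_range(lines, start="PATTERN_A", end="PATTERN_B", text="CCC"):
    """Insert `text` right after the dash, but only on lines between
    the line containing `start` and the next line containing `end`."""
    out = []
    inside = False
    for line in lines:
        starts_here = not inside and start in line
        if starts_here:
            inside = True
        if inside:
            # A function replacement avoids backreference ambiguity
            # if `text` happens to begin with a digit.
            line = ITEM.sub(lambda m: m.group(1) + text, line)
        # Like a sed range, the end marker is only looked for on lines
        # after the one that opened the range.
        if inside and not starts_here and end in line:
            inside = False
        out.append(line)
    return out


if __name__ == "__main__":
    sample = ["- above", "", "PATTERN_A", "", "- first", "- second",
              "", "PATTERN_B", "", "- below"]
    print("\n".join(prepend_in_range(sample)))
```

Only the items between the two markers are changed; lines above PATTERN_A and below PATTERN_B pass through untouched.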

Remove lines that is shorter than or equal 5 characters after the : using Notepad++

The question is like: Remove lines that is shorter than 5 characters before the # using Notepad++
But it differs a bit...
I have like that:
abc:123
abc:1234
abc:12345
PLEASE NOTE: abc is not on all the lines, it is just an example.
I want to remove the first line in the previous example because 123, the part after the :, is shorter than 5 characters.
Any help would be appreciated.
Thanks!
In Notepad++, open Find and Replace, choose the Regular expression search mode, enter ^((?!.+:\d{5,}).)*$ in the Find what field, leave Replace with empty, and click Replace All.
^((?!.+:\d{5,}).)*$
Without knowing the language there is only so much help I can offer. I'll give you an example of how I would solve this problem in C#.
Start by creating a string for your updated file (without the short lines)
string content = "";
Read a line in from your file.
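Then take the substring of the line that follows the abc: portion and check its length.
// Everything after the first ':' on the line.
string value = line.Substring(line.IndexOf(':') + 1);
// Keep the whole line only when that part has at least 5 characters.
if (value.Length >= 5)
{
    content += line + Environment.NewLine;
}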
At the end, truncate your file and write content to it.
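The same read, filter, and write-back steps can be sketched in Python. The function name keep_long_values and the min_len parameter are assumptions for illustration; the threshold of 5 follows the \d{5,} in the regex above, and lines without a : are dropped, which is also what that regex would do.

```python
def keep_long_values(lines, min_len=5, sep=":"):
    """Keep only lines whose text after the first `sep` has at least
    `min_len` characters. Lines without `sep` are dropped."""
    kept = []
    for line in lines:
        _, found, value = line.partition(sep)
        if found and len(value) >= min_len:
            kept.append(line)
    return kept


if __name__ == "__main__":
    for line in keep_long_values(["abc:123", "abc:1234", "abc:12345"]):
        print(line)
```

To apply this to a file, read it with splitlines(), filter, and write the kept lines back joined by newlines.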

Remove all text from string after a sequence of words in Scala

I am trying to assemble a UDF in Scala that takes a column from a data frame and manipulates it to remove HTML and other useless pieces of text.
The column I need to modify is very messy, sometimes there is HTML, sometimes there is not... Searching SO I have found a regex solution to remove HTML
what I'd like to accomplish now is to find a regex that can find a specific word in the text and delete all the text after that word.
I think I understand from this SO answer that the regex should be something like \).* if you want to remove all after ), so I am trying to adapt this to my case, unsuccessfully due to my lack of knowledge about regex.
I have strings like:
I am interested to hear from you, thanks Sent from iPhone other stuff I want to delete....
I'd like to retain the first part of the string up to "Sent from" excluded, so a perfect output would be:
I am interested to hear from you, thanks
What I have so far is something like:
val toStringNoHTML = udf[String, String](_.toString
// code from SO as linked above
.replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ")
// delete all text after key word
.replaceAll("""'Sent from'.*""", "")
// remove all punctuation
.replaceAll("""[\p{Punct}\n]""", " ")
)
While the HTML gets removed, the "Sent from" and all the text after it do not. Any hint on how to adjust the regex to make it work?
EDIT
As pointed out in the comments, a small typo prevented my code from working. Thanks for the help:
.replaceAll("""'Sent from'.*""", "")
should be
.replaceAll("""Sent from.*""", "")
Instead of doing multiple replaceAll(pattern, blank) I'd be tempted to start with an extraction.
val msgRE = "(.*>)?(.*)Sent from.*".r
val result = udfStr match {
case msgRE(_, msg) => Some(msg.trim) // .replaceAll() can be added here
case _ => None
}
Here the result is an Option[String] but that really depends on how you want to handle the non-matching input.
If more cleaning is needed after the extraction then replaceAll() can be added where indicated (or the extraction pattern can be better refined).
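The "keep everything before the marker" step on its own does not need a regex. As a language-neutral illustration, here is a minimal Python sketch; the function name strip_after_marker is hypothetical, and returning the input unchanged when the marker is missing is an assumption, since the answer above returns None in that case.

```python
def strip_after_marker(text, marker="Sent from"):
    """Return the part of `text` before the first `marker`, trimmed.
    If the marker does not occur, the text is returned unchanged."""
    head, found, _ = text.partition(marker)
    return head.strip() if found else text


if __name__ == "__main__":
    msg = ("I am interested to hear from you, thanks "
           "Sent from iPhone other stuff I want to delete....")
    print(strip_after_marker(msg))
```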

List files in R that do NOT match a pattern

R has a function to list files in a directory, which is list.files(). It comes with the optional parameter pattern= to list only files that match the pattern.
Files in directory data:
File1.csv File2.csv new_File1.csv new_File2.csv
list.files(path="data", pattern="new_")
results in [1] "new_File1.csv" "new_File2.csv".
But how can I invert the search, i.e. list only File1.csv and File2.csv?
I believe you will have to do it yourself, as list.files does not support Perl regex (so you couldn't do something like pattern="^(?!new_)").
i.e. list all files then filter them with grep:
grep(list.files(path="data"), pattern='new_', invert=TRUE, value=TRUE)
The grep(...) does the pattern matching; invert=TRUE inverts the match; value=TRUE returns the values of the matches (i.e. the filenames) rather than the indices of the matches.
I think that the simplest (and probably fastest if you include programmer time) approach is to run list.files 2 times, once to list all the files, then the second time with the pattern of files that you do not want, then use the setdiff function to find those file names that are not in the group that you want to exclude.
Complementing @Greg Snow's answer:
library("here")
path <- here("Data", "Folder", "Subfolder")
trees_to_dfs <- list.files(path, pattern = ".csv")
unwanted <- list.files(path, pattern = "all.csv")
trees_to_dfs <- base::setdiff(trees_to_dfs, unwanted)

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
String test = "filename_author";
String result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
}
}
Here is a Python example:
from re import sub
test = "filename_author";
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.