Edit large textfile in mac terminal - regex

I have this very large dictionary file with 1 word on each line, and I would like to trim it down.
What I would like to do is leave 3-6 letter improper nouns, so it has to detect the words based on these:
if the word is less than 3 letters, delete it
if the word is more than 6 letters, delete it
if the word has a capital letter, delete it
if the word has a single quote or space, delete it.
I used this:
cat Downloads/en-US/en-US.dic | egrep '[a-z]{3,6}' > Downloads/3-6.txt
but the output is incorrect. It outputs the words with greater than 3 characters alright, but that's about my progress so far.
So how do I go about doing this in the mac terminal? There must be a way to do this right?

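Your pattern is not anchored, so it matches any line that contains a run of three lowercase letters anywhere, including longer or capitalized words. The following command selects only words that consist entirely of three to six lowercase a-z letters: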
egrep '^[a-z]{3,6}$' /usr/share/dict/words > filtered.txt
Replace /usr/share/dict/words with your input file, and filtered.txt with a name for your output file. I just verified that this works on my Mac. Hope this helps!
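For comparison, the same anchored filter can be sketched in Python. This is only an illustration and not part of the answer above; the name keep_word and the sample words are made up for the example.

```python
import re

# Equivalent of egrep '^[a-z]{3,6}$': the whole word must be 3-6 lowercase letters.
WORD = re.compile(r"[a-z]{3,6}")

def keep_word(word):
    """Return True when the word is 3 to 6 lowercase ASCII letters and nothing else."""
    # fullmatch plays the role of the ^ and $ anchors in the grep pattern.
    return WORD.fullmatch(word) is not None

# Hypothetical sample covering each rule from the question.
sample = ["at", "cat", "Cat", "can't", "ice cream", "planet", "planets"]
kept = [w for w in sample if keep_word(w)]
print(kept)
```

Using search instead of fullmatch would reproduce the unanchored behavior from the question, where "planets" slips through because it contains a run of lowercase letters.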

Use grep and write a regex rule to match the lines you want to keep. You can get info on grep by typing man grep in the terminal.

Related

grep positive/negative integer value only

I am looking to grep any positive/negative integers only and no decimals, or any other variation including a number.
I have a testpart1.txt which has:
This is a test for Part 1
Lets write -1324
Amount: $42.27
Numbers:
-345,64
067
Phone numbers:
(111)222-2424
This should output the following lines:
This is a test for Part 1
Lets write -1324
067
I am new to bash and I cannot find how to exclude any number separated with a character like a '.' or ','. I am able to get all the numbers with the following code but I am stuck after that:
egrep '[0-9]' testpart1.txt
This gives me the opposite of what I want:
grep '[0-9]\.[0-9]' testpart1.txt
You may use this grep:
grep -E '(^|[[:blank:]])[+-]?[0-9]+([[:blank:]]|$)' file
This is a test for Part 1
Lets write -1324
067
Details:
-E: Enables extended regex matching
(^|[[:blank:]]): Match line start or a space or a tab character
[+-]?: Match optional + or -
[0-9]+: Match 1 or more digits
([[:blank:]]|$): Match line end or a space or a tab character
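The breakdown above can be checked against the sample file with a short Python sketch. Python's re module does not support the POSIX class [[:blank:]], so [ \t] is substituted here as an assumption about what it covers (space and tab); the rest of the pattern is unchanged.

```python
import re

# [ \t] stands in for POSIX [[:blank:]], which Python's re module does not support.
INTEGER = re.compile(r"(^|[ \t])[+-]?[0-9]+([ \t]|$)")

# Contents of testpart1.txt from the question.
lines = [
    "This is a test for Part 1",
    "Lets write -1324",
    "Amount: $42.27",
    "Numbers:",
    "-345,64",
    "067",
    "Phone numbers:",
    "(111)222-2424",
]

# Keep a line only when it holds an integer delimited by blanks or line edges.
matching = [line for line in lines if INTEGER.search(line)]
for line in matching:
    print(line)
```

Lines such as "-345,64" are rejected because the digits are bordered by a comma rather than by a blank or a line edge.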

How to use zgrep to display all words of a x size from a wordlist?

I want to display all the words from my wordlist that start with a w and are 9 letters long. Yesterday I learned a bit more about how to use zgrep, so I came up with:
zgrep '\(^w\)\(^.........$\)' a.gz
But this doesn't work, and I think it's because I don't know how to do an AND between the two conditions. I found that it should be (?=expr)(?=expr), but I can't figure out how to build my command with it.
So how can I build my command using (?=expr)?
For example, if I have a wordlist like this:
Washington
Sausage
Walalalalalaaaa --> shouldn't match
Wwwwwwwww --> should match
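You may use
zgrep '^w[[:alpha:]]\{8\}$' a.gz
This POSIX BRE pattern matches a line that
^w - starts with w
[[:alpha:]]\{8\} - then has exactly eight letters
$ - and then ends.
POSIX BRE has no lookahead, and none is needed here: the anchors and the fixed repetition count express both conditions at once. Matching is case-sensitive by default, so add -i if the words are capitalized as in your sample.
See also section 9.3, Basic Regular Expressions, of the POSIX specification.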
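The same "starts with w and is 9 letters long" test can be sketched in Python without lookaheads. This is an illustration only: [a-z] with IGNORECASE is an ASCII-only stand-in for [[:alpha:]], and the word "wonderful" is added to the question's sample as a hypothetical lowercase case.

```python
import re

# w followed by exactly eight more letters; IGNORECASE mirrors grep -i.
NINE_LETTER_W = re.compile(r"w[a-z]{8}", re.IGNORECASE)

# Sample from the question; the last word is a hypothetical addition.
words = ["Washington", "Sausage", "Walalalalalaaaa", "W" + "w" * 8, "wonderful"]

# fullmatch supplies the ^ and $ anchors, so the length check is implicit.
hits = [w for w in words if NINE_LETTER_W.fullmatch(w)]
print(hits)
```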

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
This optionally looks for any of those rogue logger01 entries (including the space after it). The ? after the group means it can match 0 or 1 times: if the text is there, it is consumed; if not, the match just keeps going.
Following that, it looks for (and captures) exactly 10 hex characters (digits or A-F). The ,? means that a comma is matched if present, but it is optional as well.
Following that, it looks for (and captures) exactly 12 hex characters. Finally, to get rid of any stray trailing space, the final " ?" (a space character followed by ?) optionally matches one.
The replacement emits the first captured group (the 10 hex characters), a tab, the second captured group (the 12 hex characters), and then a newline.
You can try this on regex101 to see the results. The code generator on the left side of that page produces preformatted PHP/JavaScript/Python that you can drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If you use another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
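As a rough illustration of the stricter pattern, here is a Python sketch (Python rather than R, and findall rather than a substitution, both of which are choices made for this example). It drops the optional logger prefix and trailing space because it works line by line and simply ignores anything that is not a timestamp/ID pair; the name split_records is hypothetical.

```python
import re

# Timestamp (10 hex chars), optional comma, then an ID of 6 digits plus 5-6 hex chars.
PAIR = re.compile(r"([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})")

def split_records(lines):
    """Return (timestamp, id) tuples, splitting rows that were run together."""
    rows = []
    for line in lines:
        # findall yields every pair on the line, so a fused row gives two tuples.
        rows.extend(PAIR.findall(line))
    return rows

# A few lines from the raw data in the question.
raw = [
    "logger01",
    "0729131218,020700EE1961",
    "0831103207,0203000316DB0831103253,010700EDE28C",
    "0831103301,010700EDE28C",
]

for timestamp, tag_id in split_records(raw):
    print(f"{timestamp}\t{tag_id}")
```

Because {5,6} is greedy, an 11-character ID that runs straight into the next timestamp would take that timestamp's first digit, so real files are worth spot-checking.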
Paste your regex into https://regex101.com/ to check whether it catches all cases. The 5 or 6 letters or digits could pose an issue, because the pattern may swallow the first digit of the next timestamp when the logger drops a comma. Appending '\n' to the end of the tmp string should work, provided the regex catches all cases.

Regex - how to make sure a string contain a word and numbers

I need a little help with Regex.
I want the regex to validate the following sentences:
fdsufgdsugfugh PCL 6
dfdagf PCL 11
fdsfds PCL6
fsfs PCL13
kl;klkPCL6
fdsgfdsPCL13
Some chars, then PCL, and then 6 or a greater number.
How can this be done?
I'd go with something like this:
^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$
Meaning:
(.*) = some chars
(PCL *) = then PCL followed by optional spaces
([6-9][0-9]*|[1-5][0-9]+) = then 6 or a greater number
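A quick way to sanity-check this pattern is to run it over the sample sentences from the question. The Python sketch below does that; the strings in the invalid list are hypothetical counterexamples added for illustration.

```python
import re

# Some chars, PCL, optional spaces, then a number that is 6 or greater.
PCL = re.compile(r"^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$")

# Sentences from the question that should validate.
valid = [
    "fdsufgdsugfugh PCL 6",
    "dfdagf PCL 11",
    "fdsfds PCL6",
    "fsfs PCL13",
    "kl;klkPCL6",
    "fdsgfdsPCL13",
]
# Hypothetical inputs that should be rejected.
invalid = ["abc PCL 5", "abc PCL", "abc PCL 05"]

accepted = [s for s in valid + invalid if PCL.match(s)]
numbers = [int(PCL.match(s).group(3)) for s in valid]
print(accepted)
print(numbers)
```

The third group holds the number itself, so a caller that prefers a numeric comparison could capture any digits and test int(...) >= 6 instead of encoding the range in the regex.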
This one should suit your needs:
^.*PCL\s*(?:[6-9]|\d{2,})$
In bash:
EXPR='^[a-zA-Z]\+ *PCL *\([6-9]\|[0-9]\{2,\}\)'
Translated:
The line begins with at least 1 letter (upper or lower case)
Any number of spaces, PCL, any number of spaces
Either a digit from 6 to 9, or a number with at least 2 digits
Used with something like grep "$EXPR" file.txt, this expression prints the valid lines to stdout. The single quotes keep the shell from eating the backslashes and splitting the pattern at the spaces.
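This worked well for me, and it reads logically according to the way you described the matching:
/[^PCL]+PCL\s*[6-9]\d*/
Note that [^PCL] is a character class (any character other than P, C, or L), not a negated word, and that [6-9]\d* rejects the numbers 10 through 59.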

Regex match and grouping

Here's a sample string that I want to run a regex on:
101-nocola_conte_-_fuoco_fatuo_(koop_remix)
The first digit in "101" is the disc number and the next 2 digits are the track number. How do I match the track number and ignore the disc number (first digit)?
Something like
/^\d(\d\d)/
Would match one digit at the start of the string, then capture the following two digits
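As a small illustration of that capture group, here is how it might look in Python (the question does not name a language, so this choice is an assumption). re.match already anchors at the start of the string, so the ^ is omitted.

```python
import re

name = "101-nocola_conte_-_fuoco_fatuo_(koop_remix)"

# Skip the disc digit, then capture the two track digits.
m = re.match(r"\d(\d\d)", name)
track = m.group(1) if m else None
print(track)
```

The captured value is a string, so the leading zero of the track number is preserved.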
Do you mean that you don't mind what the disc number is, but you want to match, say, track number 01?
In perl you would match it like so: "^[0-9]01.*"
or more simply "^.01.*" - which means that you don't even mind if the first char is not a digit.
^\d(\d\d)
You may need a \ in front of the ( and the ), depending on the environment in which you intend to run the regex (vi(1), for example).
Which programming language? For the shell something with egrep will do the job:
echo '101-nocola_conte_-_fuoco_fatuo_(koop_remix)' | egrep -o '^[0-9]{3}' | egrep -o '[0-9]{2}$'