Retrieve char defined in string - c++

I'm currently writing an assembler and VM program. My assembler reads in a .asm file and converts it to byte code that my VM then runs.
Currently I read in a line from my assembly file, break that line into its components, and then determine what the line contains (is it a directive or an instruction):
getline(assemblyFile, line);
istringstream iss(line);
vector<string> instruction{
std::istream_iterator<std::string>(iss),{}
};
This gives me a vector of strings that has been working well for me up to this point. If my directive is an int, I'm able to retrieve it simply by saying
mem[dataCounter] = stoi(instruction[VALUE]);
This was also working well when I was using ASCII values for my characters. However, I'm trying now to be able to provide either ASCII representation, or use a notation like
J .BYT 'J'
Where the first J is a label, the .BYT tells me what data type it is, and my 'J' is the byte I'm wanting to store in my byte array. If I don't use quotes,
J .BYT J
the following works nicely
mem[dataCounter] = int(instruction[VALUE].c_str()[0]);
(this gives me the decimal/byte value), where instruction holds the whole line and VALUE is an index of 2. If I use the quoted form, it of course returns the opening quote. Not using quotes may be a solution in and of itself; however, I'm also having trouble reading in special characters, such as spaces or newline characters. In the case of spaces, my directive looks like
SPACE .BYT ' '
which returns me a vector that has four elements, "SPACE", ".BYT", "'" and "'", and in the case of my newline which I've been attempting as
NEWLN .BYT \n
I have three elements with the last being "\n".
In none of these cases have I found a way to convert the characters I am trying to represent in my .asm file to their equivalent char/decimal value. I would like to continue to use string, as it's been convenient and changing would require a fair bit of refactoring, but that can be done if it is needed to support the functionality.
What methods/functions are available that can help me retrieve these characters, in particular the special characters?

I would use strtok() and treat special characters with caution.
For example, I would examine whether the token is a newline, and if it is explicitly state it.
For the ' ', I would search for it in the string, and if found, remember its information (starting position for example in the string) and then erase it from the string. Afterwards, I would split into tokens.
Minimal Example for demonstrative purposes only:
#include <cstdio>
#include <cstring>
#include <string>
#include <iostream>
int main ()
{
//std::string str ="SPACE .BYT \n";
//std::string str = "J .BYT 'J'";
std::string str ="SPACE .BYT ' '";
std::size_t start_position_to_erase = str.find("' '");
if(start_position_to_erase != std::string::npos) {
std::cout << "Found: " << str.substr(start_position_to_erase, 3) << std::endl; // second argument is a length, not an end position
str.erase(start_position_to_erase, 3);
}
char * pch;
printf ("Splitting string \"%s\" into tokens:\n", str.c_str());
pch = strtok (&str[0], " "); // strtok writes into its input, so pass the string's own mutable buffer rather than casting c_str()
while (pch != NULL)
{
if(pch[0] == '\n')
printf ("\\n");
else
printf ("%s\n",pch);
pch = strtok (NULL, " ");
}
return 0;
}
Output:
Found: ' '
Splitting string "SPACE .BYT " into tokens:
SPACE
.BYT
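An alternative that keeps std::string and avoids strtok is to stop splitting the operand on whitespace at all: read the label and the directive as words, take the rest of the line verbatim as the operand, and then decode it. The following is only a sketch under some assumptions: every line has a label, a directive and an operand as in the question's examples; the operand is a quoted character, a quoted or bare backslash escape, a bare character, or a decimal number; and the names byte_operand, unescape and decode_byte are made up for illustration.

```cpp
#include <cctype>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns everything after the directive, e.g. "' '"
// for "SPACE .BYT ' '". Leading whitespace is skipped, the rest is kept.
std::string byte_operand(const std::string& line) {
    std::istringstream iss(line);
    std::string label, directive, rest;
    iss >> label >> directive;           // e.g. "SPACE" and ".BYT"
    std::getline(iss >> std::ws, rest);  // the untouched operand text
    return rest;
}

// Translates the character that follows a backslash (only a few escapes).
char unescape(char c) {
    switch (c) {
        case 'n':  return '\n';
        case 't':  return '\t';
        case '0':  return '\0';
        case '\\': return '\\';
        case '\'': return '\'';
        default:   throw std::invalid_argument("unknown escape");
    }
}

// Hypothetical helper: turns the operand text into the byte value to store.
int decode_byte(std::string op) {
    const bool quoted =
        op.size() >= 3 && op.front() == '\'' && op.back() == '\'';
    if (quoted) op = op.substr(1, op.size() - 2);  // drop the two quotes

    if (op.size() == 2 && op[0] == '\\') return unescape(op[1]);
    if (quoted) {
        if (op.size() != 1) throw std::invalid_argument("bad character literal");
        return static_cast<unsigned char>(op[0]);
    }
    if (op.size() == 1 && !std::isdigit(static_cast<unsigned char>(op[0])))
        return static_cast<unsigned char>(op[0]);  // bare character such as J
    return std::stoi(op);                          // plain decimal such as 74
}
```

With this, mem[dataCounter] = decode_byte(byte_operand(line)); would store 32 for SPACE .BYT ' ' and 10 for both NEWLN .BYT \n and NEWLN .BYT '\n'.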

Related

Parsing tokens issue during interpreter development

I'm building a code interpreter in C++ and while I have the whole token logic working, I ran into an unexpected issue.
The user inputs a string into the console, the program parses said string into different objects type Token, the problem is that the way I do this is the following:
void splitLine(string aLine) {
stringstream ss(aLine);
string stringToken, outp;
char delim = ' ';
// Break input string aLine into tokens and store them in rTokenBag
while (getline(ss, stringToken, delim)) {
// assign the parsed value of stringToken to t; this labels invalid tokens
Token t (readToken(stringToken));
R_Tokens.push_back(t);
}
}
The issue here is that if the parser receives a string, say Hello World!, it will split it into 2 tokens, Hello and World!
The main goal is for the code to recognize double quotes as the start of a string Token and store it whole (from " to ") as a single Token.
So if I type in x = "hello world", it should store x as a token, then = as a token, and then hello world as a single token instead of splitting it.
You can use the C++14 std::quoted manipulator.
#include <string>
#include <sstream>
#include <iomanip>
#include <iostream>
void splitLine(std::string aLine) {
std::istringstream iss(aLine);
std::string stringToken;
// Break input string aLine into tokens and store them in rTokenBag
while(iss >> std::quoted(stringToken)) {
std::cout << stringToken << "\n";
}
}
int main() {
splitLine("Heloo world \"single token\" new tokens");
}
You really don't want to tokenize a programming language by splitting at a delimiter.
A proper tokenizer will switch on the first character to decide what kind of token to read and then keep reading as long as it finds characters that fit that token type and then emit that token when it finds the first non-matching character (which will then be used as the starting point for the next token).
That could look something like this (let's say it is an istreambuf_iterator or some other iterator that iterates over the input character-by-character):
Token Tokenizer::next_token() {
if (isalpha(*it)) {
return read_identifier();
} else if(isdigit(*it)) {
return read_number();
} else if(*it == '"') {
return read_string();
} /* ... */
}
Token Tokenizer::read_string() {
// This should only be called when the current character is a "
assert(*it == '"');
it++;
string contents;
while(*it != '"') {
contents.push_back(*it);
it++;
}
it++; // step past the closing " so the next token starts after it
return Token(TokenKind::StringToken, contents);
}
What this doesn't handle are escape sequences or the case where we reach the end of file without seeing a second ", but it should give you the basic idea.
Something like std::quoted might solve your immediate problem with string literals, but it won't help you if you want x="hello world" to be tokenized the same way as x = "hello world" (which you almost certainly do).
PS: You can also read the whole source into memory first and then let your tokens contain indices or pointers into the source rather than strings (so instead of the contents variable, you'd just save the start index before the loop and then return Token(TokenKind::StringToken, start_index, current_index)). Which one's better depends partly on what you do in the parser. If your parser directly produces results and you don't need to keep the tokens around after processing them, the first one is more memory-efficient because you never need to hold the whole source in memory. If you create an AST, the memory consumption will be about the same either way, but the second version will allow you to have one big string instead of many small ones.
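To make the idea concrete, here is a small self-contained sketch of such a character-driven tokenizer that also covers the two gaps mentioned above (backslash escapes and a missing closing quote). The Token and TokenKind types are stand-ins, since the question's Token class is not shown, and the function works on a std::string with an index instead of an iterator purely to keep the example short.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token type; the real Token class from the question is unknown.
enum class TokenKind { Identifier, Number, String, Symbol };

struct Token {
    TokenKind kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    // Helper so every <cctype> call receives an unsigned char value.
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(src[k]); };

    while (i < src.size()) {
        if (std::isspace(at(i))) { ++i; continue; }  // whitespace only separates

        if (std::isalpha(at(i))) {
            const std::size_t start = i;
            while (i < src.size() && std::isalnum(at(i))) ++i;
            tokens.push_back({TokenKind::Identifier, src.substr(start, i - start)});
        } else if (std::isdigit(at(i))) {
            const std::size_t start = i;
            while (i < src.size() && std::isdigit(at(i))) ++i;
            tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else if (src[i] == '"') {
            std::string contents;
            ++i;                                     // skip the opening quote
            while (i < src.size() && src[i] != '"') {
                // A backslash makes the next character literal (so \" stays in).
                if (src[i] == '\\' && i + 1 < src.size()) ++i;
                contents.push_back(src[i++]);
            }
            if (i == src.size())
                throw std::runtime_error("unterminated string literal");
            ++i;                                     // skip the closing quote
            tokens.push_back({TokenKind::String, contents});
        } else {
            // Anything else becomes a one-character symbol token, e.g. '='.
            tokens.push_back({TokenKind::Symbol, std::string(1, src[i++])});
        }
    }
    return tokens;
}
```

Because the decision is made per character rather than per space-separated word, x="hello world" and x = "hello world" both come out as the same three tokens.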
So I finally figured it out, and I CAN use getline() to achieve my goals.
This new code runs and parses the way I need it to be:
void splitLine(string aLine) {
stringstream ss(aLine);
string stringToken, outp;
char delim = ' ';
while (getline(ss, stringToken, delim)) { // Break line into tokens and store them in rTokenBag
//new code starts here
// if the current parse sub string starts with double quotes
if (stringToken[0] == '"' ) {
string torzen;
// parse me the rest of ss until you find another double quotes
getline(ss, torzen, '"' );
// Give back the space cut from the initial getline(), add the parsed sub string from the second getline(), and add a double quote at the end that was cut by the second getline()
stringToken += ' ' + torzen + '"';
}
// And we can all continue with our lives
Token t (readToken(stringToken)); // assign the parsed value of stringToken to t; this labels invalid tokens
R_Tokens.push_back(t);
}
}
Thanks to everyone who answered and commented, you were all of great help!

Can I use 2 or more delimiters in C++ function getline? [duplicate]

I would like to know how I can use 2 or more delimiters in the getline function. Here is my problem:
The program reads a text file... each line is going to be like:
New Your, Paris, 100
CityA, CityB, 200
I am using getline(file, line), but that gives me the whole line, when I want to get CityA, then CityB and then the number; and if I use ',' as the delimiter, I won't know where the next line starts, so I'm trying to figure out some solution.
So, how could I use both comma and \n as delimiters?
By the way, I'm manipulating the string type, not char, so strtok is not possible :/
some scratch:
string line;
ifstream file("text.txt");
if(file.is_open())
while(!file.eof()){
getline(file, line);
// here I need to get each string before comma and \n
}
You can read a line using std::getline, then pass the line to a std::stringstream and read the comma-separated values off it:
string line;
ifstream file("text.txt");
if(file.is_open()){
while(getline(file, line)){ // get a whole line
std::stringstream ss(line);
string field;
while(getline(ss, field, ',')){
// field now holds one comma-separated entity
}
}
}
No, std::getline() only accepts a single character, to override the default delimiter. std::getline() does not have an option for multiple alternate delimiters.
The correct way to parse this kind of input is to use the default std::getline() to read the entire line into a std::string, then construct a std::istringstream, and then parse it further, into comma-separate values.
However, if you are truly parsing comma-separated values, you should be using a proper CSV parser.
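For the simple three-field input in the question, the two-level approach (split on '\n' first, then on ',') could be fleshed out roughly as follows. This is a sketch, not a CSV parser: it assumes exactly the "CityA, CityB, 200" layout, has no quoting rules, and the Route struct and the trim_field and read_routes names are invented for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one "CityA, CityB, 200" line.
struct Route {
    std::string from;
    std::string to;
    int distance;
};

// Removes leading and trailing spaces, tabs and carriage returns.
std::string trim_field(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

std::vector<Route> read_routes(std::istream& in) {
    std::vector<Route> routes;
    std::string line;
    while (std::getline(in, line)) {            // outer split: '\n'
        std::istringstream fields(line);
        std::string from, to, distance;
        if (std::getline(fields, from, ',') &&  // inner split: ','
            std::getline(fields, to, ',') &&
            std::getline(fields, distance)) {
            // std::stoi skips leading spaces; it throws if no number follows.
            routes.push_back({trim_field(from), trim_field(to), std::stoi(distance)});
        }
        // Lines with fewer than three fields are silently skipped here.
    }
    return routes;
}
```

Passing an std::ifstream opened on text.txt to read_routes works the same way as the std::istringstream used for testing, since both are std::istream objects.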
Often, it is more intuitive and efficient to parse character input in a hierarchical, tree-like manner, where you start by splitting the string into its major blocks, then go on to process each of the blocks, splitting them up into smaller parts, and so on.
An alternative to this is to tokenize like strtok does -- from the beginning of input, handling one token at a time until the end of input is encountered. This may be preferred when parsing simple inputs, because it is straightforward to implement. This style can also be used when parsing inputs with nested structure, but this requires maintaining some kind of context information, which might grow too complex to maintain inside a single function or limited region of code.
Someone relying on the C++ std library usually ends up using a std::stringstream, along with std::getline to tokenize string input. But, this only gives you one delimiter. They would never consider using strtok, because it is a non-reentrant piece of junk from the C runtime library. So, they end up using streams, and with only one delimiter, one is obligated to use a hierarchical parsing style.
But zneak brought up std::string::find_first_of, which takes a set of characters and returns the position nearest to the beginning of the string containing a character from the set. And there are other member functions: find_last_of, find_first_not_of, and more, which seem to exist for the sole purpose of parsing strings. But std::string stops short of providing useful tokenizing functions.
Another option is the <regex> library, which can do anything you want, but it is new and you will need to get used to its syntax.
But, with very little effort, you can leverage existing functions in std::string to perform tokenizing tasks, and without resorting to streams. Here is a simple example. get_to() is the tokenizing function and tokenize demonstrates how it is used.
The code in this example will be slower than strtok, because it constantly erases characters from the beginning of the string being parsed, and also copies and returns substrings. This makes the code easy to understand, but it does not mean more efficient tokenizing is impossible. It wouldn't even be that much more complicated than this -- you would just keep track of your current position, use this as the start argument in std::string member functions, and never alter the source string. And even better techniques exist, no doubt.
To understand the example's code, start at the bottom, where main() is and where you can see how the functions are used. The top of this code is dominated by basic utility functions and dumb comments.
#include <iostream>
#include <string>
#include <utility>
namespace string_parsing {
// in-place trim whitespace off ends of a std::string
inline void trim(std::string &str) {
auto space_is_it = [] (char c) {
// A few asks:
// * Suppress criticism WRT localization concerns
// * Avoid jumping to conclusions! And seeing monsters everywhere!
// Things like...ah! Believing "thoughts" that assumptions were made
// regarding character encoding.
// * If an obvious, portable alternative exists within the C++ Standard Library,
// you will see it in 2.0, so no new defect tickets, please.
// * Go ahead and ignore the rumor that using lambdas just to get
// local function definitions is "cheap" or "dumb" or "ignorant."
// That's the latest round of FUD from...*mumble*.
return c > '\0' && c <= ' ';
};
for(auto rit = str.rbegin(); rit != str.rend(); ++rit) {
if(!space_is_it(*rit)) {
if(rit != str.rbegin()) {
str.erase(&*rit - &*str.begin() + 1);
}
for(auto fit=str.begin(); fit != str.end(); ++fit) {
if(!space_is_it(*fit)) {
if(fit != str.begin()) {
str.erase(str.begin(), fit);
}
return;
} } } }
str.clear();
}
// get_to(string, <delimiter set> [, delimiter])
// The input+output argument "string" is searched for the first occurrence of one
// from a set of delimiters. All characters to the left of, and the delimiter itself
// are deleted in-place, and the substring which was to the left of the delimiter is
// returned, with whitespace trimmed.
// <delimiter set> is forwarded to std::string::find_first_of, so its type may match
// whatever this function's overloads accept, but this is usually expressed
// as a string literal: ", \n" matches commas, spaces and linefeeds.
// The optional output argument "found_delimiter" receives the delimiter character just found.
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters, char& found_delimiter) {
const auto pos = str.find_first_of(std::forward<D>(delimiters));
if(pos == std::string::npos) {
// When none of the delimiters are present,
// clear the string and return its last value.
// This effectively makes the end of a string an
// implied delimiter.
// This behavior is convenient for parsers which
// consume chunks of a string, looping until
// the string is empty.
// Without this feature, it would be possible to
// continue looping forever, when an iteration
// leaves the string unchanged, usually caused by
// a syntax error in the source string.
// So the implied end-of-string delimiter takes
// away the caller's burden of anticipating and
// handling the range of possible errors.
found_delimiter = '\0';
std::string result;
std::swap(result, str);
trim(result);
return result;
}
found_delimiter = str[pos];
auto left = str.substr(0, pos);
trim(left);
str.erase(0, pos + 1);
return left;
}
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters) {
char discarded_delimiter;
return get_to(str, std::forward<D>(delimiters), discarded_delimiter);
}
inline std::string pad_right(const std::string& str,
std::string::size_type min_length,
char pad_char=' ')
{
if(str.length() >= min_length ) return str;
return str + std::string(min_length - str.length(), pad_char);
}
inline void tokenize(std::string source) {
std::cout << source << "\n\n";
bool quote_opened = false;
while(!source.empty()) {
// If we just encountered an open-quote, only include the quote character
// in the delimiter set, so that a quoted token may contain any of the
// other delimiters.
const char* delimiter_set = quote_opened ? "'" : ",'{}";
char delimiter;
auto token = get_to(source, delimiter_set, delimiter);
quote_opened = delimiter == '\'' && !quote_opened;
std::cout << " " << pad_right('[' + token + ']', 16)
<< " " << delimiter << '\n';
}
std::cout << '\n';
}
}
int main() {
string_parsing::tokenize("{1.5, null, 88, 'hi, {there}!'}");
}
This outputs:
{1.5, null, 88, 'hi, {there}!'}
[] {
[1.5] ,
[null] ,
[88] ,
[] '
[hi, {there}!] '
[] }
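The more efficient variant mentioned earlier in this answer (keep a current position, pass it as the start argument to std::string::find_first_of, and never modify the source) might look like the sketch below. It is deliberately simpler than get_to(): there is no trimming and no quote handling, and the name split_any is made up.

```cpp
#include <string>
#include <vector>

// Splits `source` on any character in `delimiters` without erasing from or
// otherwise modifying the source; only a start index moves forward.
// Empty fields between adjacent delimiters are kept.
std::vector<std::string> split_any(const std::string& source,
                                   const std::string& delimiters) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (true) {
        const auto pos = source.find_first_of(delimiters, start);
        if (pos == std::string::npos) {
            tokens.push_back(source.substr(start));  // tail after the last delimiter
            break;
        }
        tokens.push_back(source.substr(start, pos - start));
        start = pos + 1;                             // resume after the delimiter
    }
    return tokens;
}
```

For the original question, split_any(text, ",\n") treats both the comma and the newline as delimiters in a single pass, at the cost of no longer knowing which fields belonged to which line.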
I don't think that's how you should attack the problem (even if you could do it); instead:
1. Use what you have to read in each line.
2. Then split up that line by the commas to get the pieces that you want.
If strtok will do the job for #2, you can always convert your string into a char array.

How to assign string a char array that starts from the middle of the array?

For example in the following code:
char name[20] = "James Johnson";
I want to assign all the characters after the white space, up to the end of the char array, to a string, so that the string ends up like the following (this is not how I would initialize it, it just shows the idea):
string s = "Johnson";
Therefore, essentially, the string will only accept the last name. How can I do this?
I think you want something like this:
string s="";
for(int i=strlen(name)-1;i>=0;i--)
{
if(name[i]==' ')break;
else s+=name[i];
}
reverse(s.begin(),s.end());
This needs
#include <algorithm>
for reverse (and <cstring> for strlen).
There's always more than one way to do it - it depends on exactly what you're asking.
You could either:
search for the position of the first space, and then point a char* at one-past-that position (look up strchr in <cstring>)
split the string into a list of sub-strings, where your split character is a space (look up strtok or boost split)
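The first option could be sketched like this. The function name after_first_space is made up, and returning the whole input when there is no space is just one possible choice for that case.

```cpp
#include <cstring>
#include <string>

// Returns the text after the first space in a C string, or the whole
// string when it contains no space at all.
std::string after_first_space(const char* name) {
    const char* space = std::strchr(name, ' ');  // nullptr if no space is found
    return space ? std::string(space + 1) : std::string(name);
}
```

With char name[20] = "James Johnson";, the call after_first_space(name) gives "Johnson". If the last name should win when there are middle names, std::strrchr finds the last space instead of the first.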
std::string has a whole arsenal of functions for string manipulation, and I recommend you use those.
You can find the first whitespace character using std::string::find_first_of, and split the string from there:
char name[20] = "James Johnson";
// Convert whole name to string
std::string wholeName(name);
// Create a new string from the whole name starting from one character past the first whitespace
std::string lastName(wholeName, wholeName.find_first_of(' ') + 1);
std::cout << lastName << std::endl;
If you're worried about multiple names, you can also use std::string::find_last_of
If you're worried about the names not being separated by a space, you could use std::string::find_first_not_of and search for letters of the alphabet. The example given in the link is:
std::string str ("look for non-alphabetic characters...");
std::size_t found = str.find_first_not_of("abcdefghijklmnopqrstuvwxyz ");
if (found!=std::string::npos)
{
std::cout << "The first non-alphabetic character is " << str[found];
std::cout << " at position " << found << '\n';
}

Split a wstring by specified separator

I have a std::wstring variable that contains some text, and I need to split it by a separator. How could I do this? I would rather not use Boost, because it generates some warnings. Thank you.
EDIT 1
this is an example text:
hi how are you?
and this is the code:
typedef boost::tokenizer<boost::char_separator<wchar_t>, std::wstring::const_iterator, std::wstring> Tok;
boost::char_separator<wchar_t> sep;
Tok tok(this->m_inputText, sep);
for(Tok::iterator tok_iter = tok.begin(); tok_iter != tok.end(); ++tok_iter)
{
cout << *tok_iter;
}
the results are:
hi
how
are
you
?
I don't understand why the last character is always split off into a separate token...
In your code, question mark appears on a separate line because that's how boost::tokenizer works by default.
If your desired output is four tokens ("hi", "how", "are", and "you?"), you could
a) change char_separator you're using to
boost::char_separator<wchar_t> sep(L" ", L"");
b) use boost::split which, I think, is the most direct answer to "split a wstring by specified character"
#include <string>
#include <iostream>
#include <vector>
#include <boost/algorithm/string.hpp>
int main()
{
std::wstring m_inputText = L"hi how are you?";
std::vector<std::wstring> tok;
split(tok, m_inputText, boost::is_any_of(L" "));
for(std::vector<std::wstring>::iterator tok_iter = tok.begin();
tok_iter != tok.end(); ++tok_iter)
{
std::wcout << *tok_iter << '\n';
}
}
test run: https://ideone.com/jOeH9
You're default constructing boost::char_separator. The documentation says:
The function std::isspace() is used to identify dropped delimiters and std::ispunct() is used to identify kept delimiters. In addition, empty tokens are dropped.
Since std::ispunct(L'?') is true, it is treated as a "kept" delimiter, and reported as a separate token.
You can use the wcstok function.
You said you don't want Boost, so here is a lower-level option.
This is maybe a weird approach for C++, but I once used it in a MUD where I needed a lot of tokenization in C.
Take this block of memory, held in the array chars:
char chars[] = "I like to fiddle with memory";
If you need to tokenize on a space character:
create an array of char* called splitvalues, big enough to store all tokens
walk a pointer through chars until it reaches '\0'
if no token is in progress, store the current address in splitvalues[counter] and increment counter
if the current character is ' ', write 0 there and mark the token as finished
When you finish, the original string is destroyed, so do not use it as a whole any more; instead you have an array of pointers to the tokens. The number of tokens is the counter variable (the upper bound of the array).
In short, the approach is:
iterate over the string and, at the first character of each token, record the token's start pointer
overwrite each separator character with zero, which terminates a string in C
count how many tokens you recorded
PS. I'm not sure whether a similar approach works in a Unicode environment, though.

C++ printf: newline (\n) from commandline argument

How do I print a format string passed as an argument?
example.cpp:
#include <iostream>
int main(int ac, char* av[])
{
printf(av[1],"anything");
return 0;
}
try:
example.exe "print this\non newline"
output is:
print this\non newline
instead I want:
print this
on newline
No, do not do that! That is a very severe vulnerability. You should never accept format strings as input. If you would like to print a newline whenever you see a "\n", a better approach would be:
#include <iostream>
#include <cstdlib>
int main(int argc, char* argv[])
{
if ( argc != 2 ){
std::cerr << "Exactly one parameter required!" << std::endl;
return 1;
}
int idx = 0;
const char* str = argv[1];
while ( str[idx] != '\0' ){
if ( (str[idx]=='\\') && (str[idx+1]=='n') ){
std::cout << std::endl;
idx+=2;
}else{
std::cout << str[idx];
idx++;
}
}
return 0;
}
Or, if you are including the Boost C++ Libraries in your project, you can use the boost::replace_all function to replace instances of "\\n" with "\n", as suggested by Pukku.
At least if I understand correctly, your question is really about converting the "\n" escape sequence into a new-line character. That happens at compile time, so if (for example) you enter the "\n" on the command line, it gets printed out as "\n" instead of being converted to a new-line character.
I wrote some code years ago to convert escape sequences when you want it done. Please don't pass it as the first argument to printf though. If you want to print a string entered by the user, use fputs, or the "%s" conversion format:
int main(int argc, char **argv) {
if (argc > 1)
printf("%s", translate(argv[1]));
return 0;
}
You can't do that because \n and the like are parsed by the C compiler. In the generated code, the actual numerical value is written.
What this means is that your input string will have to actually contain the character value 13 (or 10 or both) to be considered a new line because the C functions do not know how to handle these special characters since the C compiler does it for them.
Alternatively you can just replace every instance of \\n with \n in your string before sending it to printf.
Passing user arguments directly to printf as the format string opens the door to an exploit called a "format string attack".
See Wikipedia and Much more details
There's no way to automatically have the string contain a newline. You'll have to do some kind of string replace on your own before you use the parameter.
It is only the compiler that converts \n etc to the actual ASCII character when it finds that sequence in a string.
If you want to do it for a string that you get from somewhere, you need to manipulate the string directly and replace the string "\n" with a CR/LF etc. etc.
If you do that, don't forget that "\\" becomes '\' too.
Please never ever use char* buffers in C++, there is a nice std::string class that's safer and more elegant.
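A std::string-based version of that manual replacement might look like the sketch below. The function name translate_escapes is made up, and only \n, \t and \\ are handled; anything else is copied through unchanged.

```cpp
#include <string>

// Replaces the two-character sequences \n, \t and \\ with the characters
// they name. Unknown escapes and a trailing backslash are copied unchanged.
std::string translate_escapes(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[i + 1]) {
                case 'n':  out += '\n'; ++i; continue;  // also skip the 'n'
                case 't':  out += '\t'; ++i; continue;
                case '\\': out += '\\'; ++i; continue;
                default:   break;  // not a known escape: copy the backslash
            }
        }
        out += in[i];
    }
    return out;
}
```

The result should then be printed as data, for example with std::cout << translate_escapes(argv[1]) or printf("%s", ...), never as the format string itself.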
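I know this has been answered, but in case the thread is still active:
you can also let the shell insert the newline before the program sees the argument (this relies on a shell with $(...) command substitution, such as bash):
example.exe "print this$(echo -e "\n ")on newline"
I tried this and it worked.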
Regards,
Shahid nx