String representations in C

What is a string?

That is a very important question. Most programmers think that strings represent text. This is not the real meaning of the word 'string'. Text is usually represented by a byte string, but does not have to be. It may as well be represented by an octet string or a 32 bit integer string.

Different representations

NUL terminated

This is the representation most commonly used in the C programming language. All string literals are NUL terminated. C has two kinds of string literals: character and wide character string literals. The latter is written L"text" and is of type wchar_t const*. The representation in memory of "text" would be equivalent to:

{ 't', 'e', 'x', 't', 0 }
// In ASCII values:
{ 0x74, 0x65, 0x78, 0x74, 0 }
// In EBCDIC values:
{ 0xa3, 0x85, 0xa7, 0xa3, 0 }

L"text"'s representation is even more platform dependent and could be:

{ L't', L'e', L'x', L't', 0 }
// In ASCII values:
{ 0x0074, 0x0065, 0x0078, 0x0074, 0 }

This is not surprising. More surprising would be the representation of "☺" and L"☺":

{ 0x3f, 0x98, 0xba, 0 }
{ 0x263a, 0 }

The former is the UTF-8 octet-representation of the smiley face, the latter is the Unicode code point in a wchar_t string. This is all highly platform dependent and could be all different for you. This text, however, is not about platform dependentness of string literals. It is about how strings can be represented in memory.


struct string

A more sophisticated string representation is the following struct, frequently used by C++'s std::string implementations:

struct string
  char *rep;
  size_t len;
  size_t max;

string.rep is the internal representation of the string, which is just a data pointer and not necessarily a NUL terminated C string. len holds the current length and max is the number of bytes this string can hold before it needs a reallocation.



Netstrings are strings in the form


They are stored in octet strings.


String with size prepended

This is a char* object that looks just like any normal C string, except that it has a hidden size prepended to it. In memory, a "Hello" string would look like this:

// Big endian ASCII
0x00, 0x05, 0x48, 0x65, 0x6c, 0x6c, 0x6f
// Little endian ASCII
0x05, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f

The pointer actually points at the third character, so the length is hidden from normal functions and only the special functions that destructively operate on the string's representation know about its existence.


Note that you can prepend anything to the actual string data, as long as you remember the offset. This technique is commonly used in hashed strings that contain their own hash code, making getting the hash value a constant time operation.

If you do it smartly, you can minimise storage overhead by storing up to 0x7f in the last byte and look at the previous byte if it is greater than 0x7f. If the previous byte is also greater than 0x7f, proceed with the next previous byte, and so on. For further details, see the wikipedia article on BER.