Tue, Nov. 8th, 2005, 10:15 pm
Post character limitations question.

Ahh, just what I was looking for. Look at the LiveJournal limitations FAQ page.

Entries have a limit of 65,535 bytes or characters. Are some characters somehow more, uh, byte-consuming than others? How does this work?

UPDATE: cgranade provides the answer.

Wed, Nov. 9th, 2005 08:40 am (UTC)
guido_jacobs

Mr. Owl: "Let's find out! A-one...a-two...a-three...*crunch* Three."

Wed, Nov. 9th, 2005 09:37 am (UTC)
cgranade: On Unicode.

Are some characters somehow more, uh, byte-consuming than others?


Abso-fucking-lutely. Right-click on the page and click Page Info... (since you're using Firefox). See the Encoding item? It will show that the page is encoded in UTF-8. What is that? Well, under Unicode, characters don't have a single fixed binary representation. Instead, familiar characters such as A have code points such as U+0041. These code points wind up being represented differently depending on which of the seven Unicode encoding schemes is used. The most common is UTF-8, which maps each code point to a string of one to four bytes. For the ASCII range (U+0000 through U+007F), these mappings are exactly the same as in ASCII: one byte per character. More exotic characters (from the standpoint of ASCII) cost more: code points from U+0080 through U+07FF, which includes the accented letters of Latin-1, take two bytes, and those from U+0800 through U+FFFF take three. The rarer characters with code points above U+FFFF take four bytes, which is the maximum.



In comparison, under UTF-16, every character takes at least two bytes. Code points above U+FFFF are encoded as a surrogate pair, which takes four bytes. In UTF-32 (almost unheard of), every character is exactly four bytes long.
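If you want to see the sizes for yourself, here is a rough Python sketch that encodes a few sample characters and counts the bytes. The sample characters and the helper name `encoded_lengths` are just picked for illustration; the `-le` codec variants are used so that Python doesn't prepend a byte order mark and inflate the counts.

```python
# Byte length of a few sample characters under three Unicode encodings.
# Escapes are used so the script itself is plain ASCII:
#   U+0041 LATIN CAPITAL LETTER A, U+00E9 e-acute,
#   U+20AC EURO SIGN, U+1D11E MUSICAL SYMBOL G CLEF
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001d11e"]


def encoded_lengths(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # little-endian, no byte order mark
        len(ch.encode("utf-32-le")),  # little-endian, no byte order mark
    )


for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8}  UTF-16={utf16}  UTF-32={utf32}")
```

The first sample should come out as one byte in UTF-8, and the last one as four bytes in all three encodings.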



In short, as long as you're below U+0080 (that is, plain ASCII) and are using UTF-8, every character costs exactly one byte, so there should be no difference between characters.
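To tie this back to the entry limit, here is a small sketch of how you could check a draft before posting. It assumes the 65,535 figure from the FAQ is counted in UTF-8 bytes, which the FAQ wording quoted above doesn't settle; `utf8_size` and `fits_in_entry` are made-up helper names, not anything LiveJournal provides.

```python
# Assumption: the limit quoted from the FAQ is measured in UTF-8 bytes.
ENTRY_LIMIT_BYTES = 65_535


def utf8_size(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_entry(text, limit=ENTRY_LIMIT_BYTES):
    """True if the UTF-8 encoding of the text is within the limit."""
    return utf8_size(text) <= limit


ascii_post = "a" * 65_535       # one byte per character
euro_post = "\u20ac" * 21_845   # three bytes per character: 21,845 * 3 = 65,535

print(fits_in_entry(ascii_post))            # True
print(fits_in_entry(euro_post))             # True
print(fits_in_entry(euro_post + "\u20ac"))  # False: three bytes over
```

Under that assumption, a post made entirely of three-byte characters runs out of room after a third as many characters as an all-ASCII one.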

Wed, Nov. 9th, 2005 11:00 am (UTC)
masstreble: Re: On Unicode.

That sounds about right! Now I'm prepared with the knowledge necessary to make the largest LJ post possible!