Limitations of UTF-8 in SQL Server

Microsoft in the News

Attention all you hackers out there!

Microsoft is looking to reward people who can discover vulnerabilities in Azure DevOps (previously known as Visual Studio Team Services) and the latest release of Azure DevOps server. They will pay up to $20,000 US for the discovery of eligible vulnerabilities.

Even if you find something that will help them improve DevOps, but it isn’t an eligible vulnerability, Microsoft will still give you public recognition for your contribution.

The devil is in the details, and the details can be found here:

Limitations of UTF-8 in SQL Server

By implementing UTF-8 within SQL Server 2019, you MAY find significant space saving. I say “may” because it will depend on what the majority of the characters you are storing are.

For most of us, the vast majority of characters we are entering into the fields are found in the standard ASCII character set. These are the characters that are found on your keyboard, including those accessed using the Shift key. But because we are not willing to place a bet saying that only those characters will be entered, we end up formatting the field for NVARCHAR. NVARCHAR uses UTF-16 which is a 16 bit variable width encoding system. If the vast majority of the characters being stored are found in the ASCII character set, using UTF-16 to store them requires 2 bytes per character. If we use UTF-8, those characters require only 1 byte to store them.

For the majority of the data you store, using UTF-8 means you use half the storage you would otherwise need. For the next 1920 characters outside the standard ASCII set, UTF-8 needs 2 bytes to store them, so here, your storage saving is zero.

Once you are past the first 2048 characters, the next 63,488 characters require 3 bytes in UTF-8, but only 2 bytes in UTF-16. For these infrequently used characters (in North America), you have an increase in your storage requirements of 1 byte per character.

When you go beyond the first 2048 characters and start using Chinese or Japanese characters, both UTF-8 and UTF-16 use 4 bytes per character and again, your storage savings is zero.

Type # of Characters Bytes  UTF-8 Bytes UTF-16 Savings in Bytes per character
ASCII 128 1 2 1
Latin-script alphabets 1920 2 2 0
Remaining alphabets (Asian) 63488 3 2 -1
Everything else (emojis, etc.) Oodles 4 4 0

Obviously, you will only enjoy significant savings in space requirements if the majority of what you are storing falls in the ASCII character set.

Leave a Reply

Your email address will not be published. Required fields are marked *