UTF-8 Encoding

Basics

Applications in CODESYS can process a wide variety of characters, for example, to output an error message in various languages, or to display a visualization in a language selected by the user that accepts user input in a wide variety of languages, characters, or symbols. If a comprehensive character set is not necessary, or if a project should not be changed, then strings encoded in Latin-1 format can still be used.

Table 493: Character set tables
Character set	Code page number	Description	Character encoding
ASCII	20127	128 characters. Suitable for English texts.	7-bit encoded characters
DOS-Latin-1	819, 850	Complies with ISO 8859. Suitable for Western European languages in the Windows command line window.	8-bit encoded characters
Latin-1	28591	Complies with ISO 8859-1. Often used for HTML pages that contain äöüß. Does not include € or, for example, some special French characters.	8-bit encoded characters
Windows 1252 Encoding	1252	Default Windows character set for Western European countries. Windows uses the UTF-16 format internally. Contains all characters from ISO 8859-1 and ISO 8859-15, but partly with different encoding.	8-bit encoded characters
Unicode		Universal character set for all possible languages, including historical languages, Braille, music notation, and emojis. More than 100,000 characters can be displayed. Each character has a numeric code. In contrast to ASCII, a separation is made between the assignment of code points to characters and the encoding of the characters. Numeric codes < 128 are ASCII compatible. Numeric codes < 256 are ISO 8859-1 compatible. For more information, see unicode.org.
Unicode 14.0		144,697 characters
UTF-16	1200	Special Unicode encoding. Used in some operating systems (Windows, OS X) and programming languages (Java, .NET) for internal character representation. Note that different computer architectures store the bytes of each 16-bit unit in a different order. UTF-16LE uses little-endian byte order.	16-bit encoded characters. Each character is encoded in either 2 bytes or 4 bytes.
UTF-8	65001	Byte-oriented encoding format of Unicode characters. Most widespread. Used in GNU/Linux and Unix operating systems, and in various Internet services (email, web, browser). Compatible with ASCII in the first 128 characters (0–127).	Tuple of 8-bit words per character. Characters are encoded with variable length, from 1 to 4 bytes.
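
The differences between the character sets in the table become concrete when the same characters are encoded with each of them. The following Python sketch is an illustration outside of CODESYS. It relies only on Python's built-in codecs (ascii, latin-1, cp1252, utf-8, utf-16-le, utf-16-be). The helper name encode_hex and the sample characters are assumptions made for this example.

```python
from typing import Optional

# Sample characters: plain ASCII, a Latin-1 umlaut, the euro sign, an emoji.
SAMPLES = ["A", "ü", "€", "😀"]
# Python codec names that correspond to rows of the character set table.
CODECS = ["ascii", "latin-1", "cp1252", "utf-8", "utf-16-le", "utf-16-be"]


def encode_hex(char: str, codec: str) -> Optional[str]:
    """Return the encoded bytes as a hex string, or None if the
    character does not exist in the given character set."""
    try:
        return char.encode(codec).hex(" ")
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Print one line per character and codec.
    for char in SAMPLES:
        for codec in CODECS:
            print(f"{char!r:6} {codec:10} {encode_hex(char, codec)}")
```

In this sketch, ü occupies one byte in Latin-1 and Windows 1252 but two bytes in UTF-8, € cannot be represented in Latin-1 at all, and the emoji needs 4 bytes in both UTF-8 and UTF-16, with the byte order differing between UTF-16LE and UTF-16BE.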

UTF-8 in CODESYS

UTF-8 can encode the complete Unicode character set, which makes it the most comprehensive encoding available for the “STRING” data type. Therefore, it is recommended that you enable UTF-8 encoding for new projects as well as for existing projects that are to be used in a new context.

Table 494: Project-wide encoding in CODESYS
Data type	Compile option: UTF8 Encoding for STRING	Which encoding is used project-wide?
STRING	Enabled	UTF-8
STRING	Disabled	Windows 1252 encoding (default Windows encoding), Latin-1
WSTRING	Enabled	UTF-16
WSTRING	Disabled	UTF-16

In CODESYS, the “STRING” data type can be encoded in Latin-1 or UTF-8 formats. The “WSTRING” data type always encodes its characters as Unicode in UTF-16.

Encoding a single string literal in UTF-8 format

Even if the project-wide encoding format is set to Latin-1, you can encode a single literal in UTF-8 format. To do this, add the “UTF8#” type prefix to the literal.

{attribute 'monitoring_encoding' := 'UTF-8'}
strVarUtf8: STRING := UTF8#'你好,世界!ÜüÄäÖö';

For more information, see:

Constant: UTF8# String

Pragma Attribute: monitoring_encoding

String conversion for UTF-8 encoding

If you have enabled UTF-8 encoding project-wide, then you can use the string conversion functions as usual.

String manipulation

Use library functions to manipulate your strings.

Index access to a “STRING” variable often produces the desired result as long as the string contains only ASCII characters. Nevertheless, it is better not to use this construct. It is not just bad programming style: with UTF-8 encoding, a single character can occupy several bytes, so index access can lead to unwanted string manipulation.
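
The effect can be illustrated outside of CODESYS with a short Python sketch. It assumes that an index access addresses a single byte of the string buffer. The function names overwrite_byte and is_valid_utf8 and the sample words are hypothetical and serve only this illustration.

```python
def overwrite_byte(text: str, index: int, new_char: str) -> bytes:
    """Encode text as UTF-8 and overwrite exactly one byte of the buffer,
    the way a byte-wise index write would (assumption of this sketch).
    new_char must be a single ASCII character."""
    buf = bytearray(text.encode("utf-8"))
    buf[index] = ord(new_char)
    return bytes(buf)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # ASCII only: byte index and character index are identical.
    print(overwrite_byte("Tor", 1, "a"))

    # "Tür" is stored as 54 C3 BC 72, so index 1 hits the first half of "ü".
    damaged = overwrite_byte("Tür", 1, "u")
    print(damaged, is_valid_utf8(damaged))
```

With the ASCII-only word, the write replaces one whole character. With "Tür", the same write replaces only the first byte of the two-byte character ü and leaves a stray continuation byte behind, so the buffer is no longer valid UTF-8.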

UTF-8 encoding only for project-wide configuration

UTF-8 encoding is used if the project-wide compile option UTF8 Encoding for STRING is enabled. Library functions and add-ons then also follow this setting.

If you use individual UTF-8 encoded strings without the project-wide option, then you have to make sure that they are interpreted correctly wherever they are used. For example, if the option is not enabled, a string variable in the OPC server is converted to UTF-8 before being transferred to a client. A value such as “UTF8#'äöü'”, which is already UTF-8 encoded, would then be misinterpreted. Similar problems can arise when outputting strings in the visualization.
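
The following Python sketch shows what such a misinterpretation can look like. It is a simplified model, not the actual implementation of the OPC server: the function transfer is hypothetical and only assumes that the sender treats the stored bytes as Latin-1 and converts them to UTF-8 before sending.

```python
def transfer(raw: bytes, assumed_encoding: str) -> bytes:
    """Hypothetical model of a sender that interprets the stored bytes
    in assumed_encoding and converts them to UTF-8 for the client."""
    return raw.decode(assumed_encoding).encode("utf-8")


if __name__ == "__main__":
    # A Latin-1 string is converted correctly.
    latin1_bytes = "äöü".encode("latin-1")
    print(transfer(latin1_bytes, "latin-1").decode("utf-8"))

    # A string that is already UTF-8 is encoded a second time.
    utf8_bytes = "äöü".encode("utf-8")
    print(transfer(utf8_bytes, "latin-1").decode("utf-8"))
```

In this model, the Latin-1 string arrives at the client unchanged, whereas the string that was already UTF-8 encoded arrives as six garbled characters because each of its bytes was treated as a separate Latin-1 character.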