Lob的字符集和字节序

402e里这么描述CLOBNCLOB所用的字符集:

CLOB is used to store character data using the database character set. The internal

code is 112.

NCLOB is used to store character data using the national character set. The internal

code is 112.

 

但这是不对的,至少是不全面的,我们来看一下真相是什么。

对于定长的字符集而言,如WE8MSWIN1252 , AR8ISO8859P6等,确实就如402e里所描述的那样,CLOB所用的字符集就是和库里所用的字符集一致。

 

对于变长的字符集而言,如ZHS16GBKUTF8AL32UTF8等,oracle这里实际上会做一个字符集间的转换,也就是说,实际存储在CLOB/NCLOB里的字符集其实是internal fixed-width Unicode character set这个internal fixed-width Unicode character set目前来看,只能是UCS2 或者AL16UTF16

 

UCS2AL16UTF16的大致区别为:

* UCS2 is a subset of the UTF16 standard.

* UCS2的字节序跟库所在的platform的字节序一致

* AL16UTF16的字节序始终是big endian

从这里我们可以推断出,在big endianplatformUCS2就等于AL16UTF16

 

如下的这张图非常经典,已经说明了一切:

lob_internal_characterset

 

关于这个图的详细信息大家可以去看CLOBs and NCLOBs character set storage in Oracle Release 8i, 9i, 10g and 11g. [ID 257772.1]”里面解释的非常清楚。

 

里面有一段话这么说:

On Big Endian platforms, when a database with a varying-width database or national character set is migrated to Oracle Database 10g, no conversion of the stored CLOB and NCLOB data is necessary as UCS2 and AL16UTF16 are equivalent. On Little Endian platforms, the storage character set is not changed as part of the upgrade. Only column meta-data is updated to indicate that the CLOB/NCLOB columns are in Little Endian format. There is no usage impact after the migration because the internal format is not exposed to users, and proper conversions will happen implicitly in the system.

 

这里的meta-data是指lob$里的字段propertysql.bsq里明确指出:

property      number not null,           /* 0x00 = user defined lob column */

                                    /* 0x01 = kernel column(s) stored as lob */

                                     /* 0x02 = user lob column with row data */

                                            /* 0x04 = partitioned LOB column */

                                   /* 0x0008 = LOB In Global Temporary Table */

                                          /* 0x0010 = Session-specific table */

                                      /* 0x0020 = lob with compressed header */

                                        /* 0x0040 = lob using shared segment */

                                  /* 0x0080 = first lob using shared segment */

                                   /* 0x0100 = klob and inline image coexist */

                                /* 0x0200 = LOB data in little endian format */

 

最后我们的结论为:

1、如果库的字符集是定长字符集,则CLOB所用的字符集和库的字符集一致,字节序也和库所在的platform的字节序一致。

2、当库的字符集是变长字符集的时候,CLOB/NCLOB所用的字符集只能是UCS2或者AL16UTF16AL16UFT16始终是big endianUCS2的字节序则和库所在的platform的字节序一致。具体又分为如下这几种情况:

 1) 如果是8i/9i,则CLOB/NCLOB所用的字符集都是UCS2

 2) 如果是10gCLOB/NCLOB所用的字符集基本上都是AL16UTF16,为UCS2的只有上图中所描述的那两种情况

 

其实这篇文章只是上述那篇metalink文档的一个翻译和总结,我为什么要写这篇文章来源于我昨天跟老熊之间的一次电话交流,他纠正了我长久以来的一个错误观点并且使我对lob internal的认识已经变得异常清晰,非常感谢老熊!



Leave a Reply

Your email address will not be published. Required fields are marked *