
Unicode/UTF-8 setup

Post by JunkGuy » Wed Mar 05, 2003 1:21 pm

Hello. I have a PHP application (a set of scripts) that was written to handle ASCII. I need to convert it to Unicode (take in Unicode characters, display them, process strings, etc.). I have rebuilt Apache/PHP with the "--enable-multibyte=UNICODE" parameter, and I have played with the php.ini file as per http://www.php.net/manual/tw/ref.mbstring.php, but whenever I enter non-English/non-ASCII characters into a form, submit them, and then redisplay the results, I end up with garbage (ASCII representations of the multibyte Unicode characters). I have no idea why.

Does anybody have any tips, info, help, gotchas, etc on setting up Unicode, converting functionality over, etc? I have not modified any of my PHP code as of yet because I don't know what all to do (mod PHP.ini again? change string functions in the code? etc.).

Thank you for your time.
JunkGuy
New php-forum User
Posts: 5
Joined: Wed Mar 05, 2003 1:13 pm

Re: Unicode/UTF-8 setup

Post by WiZARD » Thu Mar 06, 2003 12:15 am

First, try setting the following in your existing HTML:
Code:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

where charset is your encoding.
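
The same charset can also be declared in the HTTP response header from PHP, which browsers give precedence over the meta tag. A minimal sketch, assuming UTF-8 is the target encoding (as in the question); contentTypeHeader() and metaCharsetTag() are hypothetical helper names, not part of PHP:

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Hypothetical helper: builds the matching meta tag so the page and the
// HTTP header never disagree about the encoding.
function metaCharsetTag(string $charset = 'UTF-8'): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . $charset . '">';
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader());
}

echo metaCharsetTag(), "\n";
```

Keeping both declarations in one place avoids the case where the header says one charset and the page says another.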
WiZARD
Moderator
Posts: 1257
Joined: Thu Jun 20, 2002 10:14 pm
Location: Ukraine, Crimea, Simferopol

Re: Unicode/UTF-8 setup

Post by WiZARD » Thu Mar 06, 2003 12:28 am

By the way, this is from the PHP manual:
Code:
Since PHP is basically designed for ISO-8859-1, some multi-byte character encoding does not work well with PHP. Therefore, it is important to set mbstring.internal_encoding to a character encoding that works with PHP.
...................
These are examples of internal character encoding that works with PHP and does NOT work with PHP.


Character encodings work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings do NOT work with PHP:
JIS, SJIS
 



Character encoding, that does not work with PHP, may be converted with mbstring's HTTP input/output conversion feature/function.

Note: SJIS should not be used for internal encoding unless the reader is familiar with parser/compiler, character encoding and character encoding issues.

Note: If you use databases with PHP, it is recommended that you use the same character encoding for both database and internal encoding for ease of use and better performance.

If you are using PostgreSQL, it supports character encoding that is different from backend character encoding. See the PostgreSQL manual for details.
The following configure options are related to the mbstring module.



--enable-mbstring: Enable mbstring functions. This option is required to use mbstring functions.

Note: As of PHP 4.3.0, the option --enable-mbstring will be enabled by default and replaced with --with-mbstring[=LANG] to support Chinese, Korean and Russian language support. Japanese character encoding is supported by default. If --with-mbstring=cn is used, simplified chinese encoding will be supported. If --with-mbstring=tw is used, traditional chinese encoding will be supported. If --with-mbstring=kr is used, korean encoding will be supported. If --with-mbstring=ru is used, russian encoding will be supported. If --with-mbstring=all is added, all supported character encoding in mbstring will be enabled, but the binary size of PHP will be maximized because of huge Unicode character maps. Note that Chinese, Korean and Russian encoding is experimentally supported in PHP 4.3.0.

--enable-mbstr-enc-trans : Enable HTTP input character encoding conversion using mbstring conversion engine. If this feature is enabled, HTTP input character encoding may be converted to mbstring.internal_encoding automatically.

Note: As of PHP 4.3.0, the option --enable-mbstr-enc-trans will be eliminated and replaced with mbstring.encoding_translation. HTTP input character encoding conversion is enabled when this is set to On (the default is Off).

--enable-mbregex: Enable regular expression functions with multibyte character support.

Example 1. php.ini setting example

; Set default language
mbstring.language        = English; Set default language to English (default)
mbstring.language        = Japanese; Set default language to Japanese

;; Set default internal encoding
;; Note: Make sure to use character encoding works with PHP
mbstring.internal_encoding    = UTF-8  ; Set internal encoding to UTF-8

;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On

;; Set default HTTP input character encoding
;; Note: Script cannot change http_input setting.
mbstring.http_input           = pass    ; No conversion.
mbstring.http_input           = auto    ; Set HTTP input to auto
                                ; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS"
mbstring.http_input           = SJIS    ; Set HTTP input to SJIS
mbstring.http_input           = UTF-8,SJIS,EUC-JP ; Specify order

;; Set default HTTP output character encoding
mbstring.http_output          = pass    ; No conversion
mbstring.http_output          = UTF-8   ; Set HTTP output encoding to UTF-8

;; Set default character encoding detection order
mbstring.detect_order         = auto    ; Set detect order to auto
mbstring.detect_order         = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order

;; Set default substitute character
mbstring.substitute_character = 12307   ; Specify Unicode value
mbstring.substitute_character = none    ; Do not print character
mbstring.substitute_character = long    ; Long Example: U+3000,JIS+7E7E
 
 


Example 2. php.ini setting for EUC-JP users

;; Disable Output Buffering
output_buffering      = Off

;; Set HTTP header charset
default_charset       = EUC-JP   

;; Set default language to Japanese
mbstring.language = Japanese

;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On

;; Set HTTP input encoding conversion to auto
mbstring.http_input   = auto

;; Convert HTTP output to EUC-JP
mbstring.http_output  = EUC-JP   

;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP   

;; Do not print invalid characters
mbstring.substitute_character = none
 
 


Example 3. php.ini setting for SJIS users

;; Enable Output Buffering
output_buffering     = On

;; Set mb_output_handler to enable output conversion
output_handler       = mb_output_handler

;; Set HTTP header charset
default_charset      = Shift_JIS

;; Set default language to Japanese
mbstring.language = Japanese

;; Set http input encoding conversion to auto
mbstring.http_input  = auto

;; Convert to SJIS
mbstring.http_output = SJIS   

;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP   

;; Do not print invalid characters
mbstring.substitute_character = none
 
Example 5. php.ini setting example

;; Enable output character encoding conversion for all PHP pages

;; Enable Output Buffering
output_buffering    = On

;; Set mb_output_handler to enable output conversion
output_handler      = mb_output_handler
 
 


Example 6. Script example

<?php

// Enable output character encoding conversion only for this page

// Set HTTP output character encoding to SJIS
mb_http_output('SJIS');

// Start buffering and specify "mb_output_handler" as
// callback function
ob_start('mb_output_handler');

?>
 
 


Supported Character Encodings
Currently, the following character encoding is supported by the mbstring module. Character encoding may be specified for mbstring functions' encoding parameter.

The following character encoding is supported in this PHP extension:

UCS-4, UCS-4BE, UCS-4LE, UCS-2, UCS-2BE, UCS-2LE, UTF-32, UTF-32BE, UTF-32LE, UTF-16, UTF-16BE, UTF-16LE, UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, ISO-2022-JP, JIS, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, byte2be, byte2le, byte4be, byte4le, BASE64, 7bit, 8bit and UTF7-IMAP.

As of PHP 4.3.0, the following character encoding support will be added experimentally: EUC-CN, CP936, HZ, EUC-TW, CP950, BIG-5, EUC-KR, UHC (CP949), ISO-2022-KR, Windows-1251 (CP1251), Windows-1252 (CP1252), CP866, KOI8-R.

php.ini entries that accept an encoding name also accept "auto" and "pass". mbstring functions that accept an encoding name also accept "auto".

If "pass" is set, no character encoding conversion is performed.

If "auto" is set, it is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS".

See also mb_detect_order()

Note: "Supported character encoding" does not mean that it works as internal character code.
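
As an illustration of detection with an explicit candidate list instead of "auto", here is a minimal sketch using mb_detect_encoding() in strict mode. The sample strings and the candidate list are assumptions chosen for the example:

```php
<?php
// Candidate encodings, checked in strict mode so that byte sequences
// invalid in a candidate rule that candidate out.
$candidates = ['ASCII', 'UTF-8'];

// "héllo" with "é" as the two UTF-8 bytes 0xC3 0xA9.
$validUtf8 = "h\xC3\xA9llo";

// "héllo" with "é" as the single ISO-8859-1 byte 0xE9:
// neither ASCII nor valid UTF-8.
$latin1 = "h\xE9llo";

$detectedUtf8   = mb_detect_encoding($validUtf8, $candidates, true);
$detectedLatin1 = mb_detect_encoding($latin1, $candidates, true);

var_dump($detectedUtf8);   // string(5) "UTF-8"
var_dump($detectedLatin1); // bool(false)
```

A result of false means none of the listed candidates matched, which is usually a sign that the data arrived in an encoding the script did not expect.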

Overloading PHP string functions with multi byte string functions
Because most PHP applications are written for languages that use single-byte character encodings, multibyte string handling (including Japanese) presents some difficulties. Most PHP string functions, such as substr(), do not support multibyte strings.

The multibyte extension (mbstring) provides multibyte-aware versions of some PHP string functions (e.g. mb_substr() is the multibyte counterpart of substr()).

The multibyte extension (mbstring) also supports 'function overloading' to add multibyte string functionality without code modification. With function overloading, some PHP string functions are overloaded by their multibyte counterparts. For example, mb_substr() is called instead of substr() if function overloading is enabled. Function overloading makes it easy to port an application that supports only single-byte encodings to a multibyte application.
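
To make the difference concrete, here is a minimal sketch comparing the byte-oriented functions with their mb_* counterparts on a UTF-8 string. The sample string is an assumption chosen for the example. Note that the mbstring.func_overload setting was deprecated in PHP 7.2 and removed in PHP 8.0, so on current PHP versions the mb_* functions have to be called explicitly, as done here:

```php
<?php
// "naïve" in UTF-8: the "ï" occupies two bytes (0xC3 0xAF).
$s = "na\xC3\xAFve";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLen = strlen($s);              // 6
$charLen = mb_strlen($s, 'UTF-8');  // 5

// substr() cuts at a byte offset and can split a character in half.
$byteCut = substr($s, 0, 3);               // "na" plus a lone 0xC3 byte
$charCut = mb_substr($s, 0, 3, 'UTF-8');   // "naï"

// The byte-wise cut is no longer valid UTF-8; the character-wise cut is.
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8'); // false
$charCutValid = mb_check_encoding($charCut, 'UTF-8'); // true

echo $byteLen, ' ', $charLen, "\n";
echo bin2hex($byteCut), ' ', bin2hex($charCut), "\n";
```

Passing the encoding explicitly to each mb_* call keeps the result independent of the mbstring.internal_encoding setting.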


More UTF-8 problems

Post by JunkGuy » Fri Mar 07, 2003 8:06 am

I rebuilt php and apache with UTF-8 as the mb encoding. Previously I had built them with UNICODE as the encoding method and nothing worked.

After I rebuilt it, I modified a few database insert and select statements to use "utf8_encode" to allow me to insert the characters into the database and display them again when I retrieved them. Everything works fine if I use the forms I have in my pages to submit the data and then use the pages to retrieve the data and display it. I also added the "<meta..." stuff into the HTML of the page, and also as a "headers('<meta...');" function call in my php at the top of each page.

However, there is another component of my application that uses a database client application (Oracle Forms) to insert data into the same database that I'm using. It also enters UTF-8 data into the database and can retrieve it just fine: it displays correctly, no problems. But when my PHP pages retrieve the same data that Forms inserted, it shows up as "??????". Additionally, when Forms is used to view the data that the PHP pages have inserted into the database, it shows up as "& # 3AB4" (but put together, no spaces), etc.

The client systems/apps have no problem viewing the data inserted by Oracle Forms, so my conclusion is that PHP is encoding the data incorrectly. How can I get PHP to encode the data correctly, so that it is compatible with Oracle Forms and the client apps, and can pull their data out and display it correctly?

The following is taken from my PHP.INI:
output_buffering = On
.
.
output_handler = mb_output_handler
.
.
default_mimetype = "text/html"
default_charset = "UTF-8"
.
.
[mbstring]
; language for internal character representation.
;mbstring.language =

; internal/script encoding.
; Some encoding cannot work as internal encoding.
; (e.g. SJIS, BIG5, ISO-2022-*)
mbstring.internal_encoding = UTF-8

; http input encoding.
;mbstring.http_input = auto

; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = UTF-8

; enable automatic encoding translation according to
; mbstring.internal_encoding setting. Input chars are
; converted to internal encoding by setting this to On.
; Note: Do _not_ use automatic encoding translation for
; portable libs/applications.
mbstring.encoding_translation = On

; automatic encoding detection order.
; auto means
;mbstring.detect_order = auto

; substitute_character used when character cannot be converted
; one from another
mbstring.substitute_character = none;


I have also modified httpd.conf as follows:
AddCharset UTF-8 .php

Again, the only real changes I have made to my PHP code are adding the META headers and a few test usages of "utf8_encode" and "utf8_decode" in places where both PHP and Oracle Forms could get at, display, and manipulate the same data.
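
For reference, utf8_encode() treats its input as ISO-8859-1, so applying it to a string that is already UTF-8 encodes it a second time. A minimal sketch of that effect follows. It uses mb_convert_encoding() with an explicit ISO-8859-1 source, which performs the same conversion (utf8_encode() itself is deprecated as of PHP 8.2); toUtf8Once() is a hypothetical helper name, and whether this is what happens in the application above is an assumption, not something the thread confirms:

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Same conversion utf8_encode() performs: input read as ISO-8859-1,
// output written as UTF-8. On UTF-8 input, each byte is re-encoded.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9

// Hypothetical helper: convert only when the input is not already valid
// UTF-8. This is a heuristic, since some ISO-8859-1 byte sequences also
// happen to be valid UTF-8.
function toUtf8Once(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(toUtf8Once("\xE9")), "\n";     // c3a9 (ISO-8859-1 "é" converted)
echo bin2hex(toUtf8Once("\xC3\xA9")), "\n"; // c3a9 (already UTF-8, unchanged)
```

Comparing bin2hex() output of the stored value against the expected UTF-8 bytes is a quick way to see which side wrote the data in which encoding.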

Thanks for your time.

Re: More UTF-8 problems

Post by WiZARD » Fri Mar 07, 2003 8:48 am

I have one last question: does your Oracle database understand UTF-8, or some other character set?
Try inserting some data into the DB and viewing it from another client program.
Last edited by WiZARD on Fri Mar 07, 2003 8:50 am, edited 1 time in total.

Post by JunkGuy » Fri Mar 07, 2003 8:49 am

The database has been converted to UTF-8...it would have to be for the Forms and other client software to work properly.

Post by WiZARD » Fri Mar 07, 2003 8:54 am

JunkGuy wrote: The database has been converted to UTF-8...it would have to be for the Forms and other client software to work properly.

No, no, no. Try entering some data from the PHP form!

Post by JunkGuy » Fri Mar 07, 2003 10:14 am

As I said, I have. If I input data into a PHP form, place the data into the database using PHP, then extract that data and display it in the page generated by PHP, it works fine.

If I look at that same data in Forms, it is invalid (junk data).

If I enter data in Forms, place the data into the database using Forms, then extract that data and display it in Forms, it works fine.

If I enter data in Forms, place the data into the database using Forms, then extract that data and display it in the page generated by PHP, it is invalid (junk data that shows up as "?????").

Post by WiZARD » Sun Mar 09, 2003 12:41 am

Code please

Post by JunkGuy » Mon Mar 10, 2003 7:24 am

utf8_encode(<string>)

Post by WiZARD » Mon Mar 10, 2003 11:55 pm

Hi JunkGuy!
See http://www.php.net/manual/ru/function.utf8-encode.php and scroll down the page; you may find an example of decoding to UTF-8 by ronen at greyzone dot com.
Try it!

Post by WiZARD » Tue Mar 11, 2003 12:02 am

And one more thing, see this:
http://www.winsyntax.com/ I just recommend looking at how people write a project for converting strings.