Sending Japanese emails by a Perl script
WARNING: Thanks to generous feedback from J, I have realized that the following method, which uses ISO-2022-JP for Japanese emails, is obsolete. Accordingly, I am marking the main part of this article as NOT-RECOMMENDED. In the comment section below, I have posted a fixed script that uses UTF-8 instead. The UTF-8 based script should work not only for Japanese but for any other non-English language as well.
Sending Japanese emails is not as easy as sending English emails.
Simply sending your Japanese email from the Unix command line to your recipient across the Internet often does not work. Without the practice explained in this article, your Japanese email is always susceptible to Mojibake (文字化け), in which the Japanese characters in your email arrive garbled on the recipient's side.
In this article, I introduce a simple and practical way to do the crucial pre-processing of a Japanese email (i.e. Unicode to ISO-2022-JP conversion and Japanese MIME-header generation) using Perl. Finally, I demonstrate how to send out the "well-formed" Japanese email using the sendEmail program. (For an introduction to sending an email from the command line with Perl, see this previous article.)
Background
To prevent Mojibake of your Japanese email, stick to the following de facto standard rules commonly practiced by Japanese IT professionals:
- Convert all Japanese characters into ISO-2022-JP (7-bit) character code.
- Apply MIME encoding to Japanese characters that appear in header fields such as Subject, To, From, etc.
Modern Linux distributions have mature Unicode support. Many text editors (e.g. Emacs, gedit) support Unicode by default, so you can easily compose a Japanese email and save it in Unicode (UTF-8). For that reason, I assume that your Japanese email is originally written in UTF-8.
Mail servers have traditionally handled only 7-bit character codes (most notably US-ASCII), and many mail servers in the world still support only 7-bit transmission (that is, they are not 8-bit clean). On the other hand, most Japanese character codes (e.g. EUC-JP, Shift_JIS, and Japanese text in UTF-8) are 8-bit codes. The only 7-bit Japanese code is ISO-2022-JP. Hence ISO-2022-JP is the only safe option for avoiding Mojibake caused by the loss of the top bit of 8-bit Japanese characters as they are relayed across mail servers.
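To see the 7-bit versus 8-bit distinction concretely, here is a minimal sketch that encodes the same string in several character codes and reports whether any byte has its top bit set. It uses Perl's core Encode module rather than nkf, which is a substitution made only for illustration, and the helper name is_7bit_clean is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return 1 if the byte string contains no byte with the top bit set.
sub is_7bit_clean {
    my ($bytes) = @_;
    return $bytes !~ /[\x80-\xFF]/ ? 1 : 0;
}

# "日本語" written with Unicode escapes, so the sketch does not depend
# on the encoding of this source file.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

for my $charset (qw(iso-2022-jp euc-jp shiftjis UTF-8)) {
    my $bytes = encode($charset, $text);
    printf "%-12s %s\n", $charset, is_7bit_clean($bytes) ? '7-bit' : '8-bit';
}
```

Only the ISO-2022-JP bytes should be reported as 7-bit, because that encoding switches character sets with escape sequences instead of using the top bit.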
The message header of an email is re-arranged during transmission. For that reason, only ASCII characters are allowed to appear in message headers. RFC 1522 (since superseded by RFC 2047) defines rules for embedding non-ASCII characters in message headers using the standard MIME (Multipurpose Internet Mail Extensions) encoding method. For example, to embed the Japanese string "日本語のタイトル" in the subject field, the string is MIME-encoded as
=?ISO-2022-JP?B?GyRCRnxLXDhsJE4lPyUkJUglaw==?=
following the form of
=?<charset>?<method>?<encoded string>?=
where ? is a delimiter and B indicates the BASE64 encoding method.
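The encoded-word form described here can be sketched in a few lines of Perl. This is an illustration under assumptions, not the script used in this article: the helper name encoded_word is hypothetical, and it relies on the core Encode and MIME::Base64 modules instead of nkf.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Build one B-encoded word: =?<charset>?B?<base64 of the encoded bytes>?=
sub encoded_word {
    my ($charset, $text) = @_;
    my $bytes = encode($charset, $text);      # characters -> bytes in $charset
    my $b64   = encode_base64($bytes, '');    # '' suppresses line breaks
    return "=?$charset?B?$b64?=";
}

my $subject = "\x{65E5}\x{672C}\x{8A9E}";     # "日本語"
print encoded_word('UTF-8', $subject), "\n";  # =?UTF-8?B?5pel5pys6Kqe?=
print encoded_word('ISO-2022-JP', $subject), "\n";
```

The sketch does not split long text: an encoded word may be at most 75 characters long, so longer headers need several encoded words. The base64 text produced for ISO-2022-JP may also differ slightly from an nkf-based pipeline, depending on whether the converter appends the final escape sequence that switches back to ASCII.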
How to run the program
1. Getting resources
Please get the following files and save them into the same directory.
- sendEmail: A Perl script that handles all the email-specification details of constructing and sending emails. If you are using Ubuntu, you can install it with
$ sudo apt-get install sendemail
If sendEmail is not included in the repository of your Linux distribution (or if you are using Windows), you can download a copy from sendEmail's website.
- send-email-jp.pl: The main Perl script developed in this article.
- mybase64encode.pl: A Perl script that takes an ISO-2022-JP string from standard input and outputs the MIME message header of the string to standard output.
- body-utf8.txt: The Japanese message body file written in Unicode (UTF-8).
2. Installing nkf
We use nkf, a command-line Japanese character code converter, to convert from Unicode (UTF-8) to ISO-2022-JP. In Ubuntu, install it with
$ sudo apt-get install nkf
3. Running the script
Make sure that the Perl scripts have executable permission:
$ chmod 775 send-email-jp.pl mybase64encode.pl
Finally, run the script:
$ ./send-email-jp.pl
In your own applications, please modify the script (e.g. To, From, Message body) accordingly. Enjoy sending Japanese emails!
Details
#!/usr/bin/perl
$to_jp_part_utf8 = '担当者';
$to_eng_part_utf8 = '<info@example.com>';
$to_jp_part_7bit_mime = `echo -n "$to_jp_part_utf8" | nkf | ./mybase64encode.pl`;
$to_7bit_mime = $to_jp_part_7bit_mime . " " . $to_eng_part_utf8;
$from_jp_part_utf8 = '山田太郎';
$from_eng_part_utf8 = '<taro@example.net>';
$from_jp_part_7bit_mime = `echo -n "$from_jp_part_utf8" | nkf | ./mybase64encode.pl`;
$from_7bit_mime = $from_jp_part_7bit_mime . " " . $from_eng_part_utf8;
$sub_utf8 = '日本語のタイトル';
$sub_7bit_mime = `echo -n "$sub_utf8" | nkf | ./mybase64encode.pl`;
$cmd = "cat ./body-utf8.txt | nkf > body-7bit.txt";
system("$cmd");
$cmd = "sendEmail " .
"-t \"$to_7bit_mime\" " .
"-f \"$from_7bit_mime\" " .
"-u \"$sub_7bit_mime\" " .
"-o message-file=body-7bit.txt " .
"-o message-charset=ISO-2022-JP " .
"-s localhost:25" .
"\n";
print $cmd;
system("$cmd");
system("rm -f body-7bit.txt");
This is the complete Perl script (send-email-jp.pl) that does the proper pre-processing of a Japanese email and sends it out from the command line using the sendEmail program.
$to_jp_part_utf8 = '担当者';
$to_eng_part_utf8 = '<info@example.com>';
$to_jp_part_7bit_mime = `echo -n "$to_jp_part_utf8" | nkf | ./mybase64encode.pl`;
$to_7bit_mime = $to_jp_part_7bit_mime . " " . $to_eng_part_utf8;
This block constructs a MIME-encoded header of the To-field. In the original Unicode form, the field looks like
担当者 <info@example.com>
We convert it into a MIME-encoded To-field as
=?ISO-2022-JP?B?GyRCQzRFdjxU?= <info@example.com>
where the Japanese part (the recipient's name) is first converted from Unicode into ISO-2022-JP by nkf, then MIME-encoded by the script mybase64encode.pl. The last line concatenates the Japanese part and the ASCII part (the email address).
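One way to check a generated header before sending is to decode it back and look at the result. The following is a minimal sketch of that idea, assuming a single B-encoded word; the helper name decode_encoded_word is hypothetical, and it uses the core Encode and MIME::Base64 modules.

```perl
use strict;
use warnings;
use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Decode a single B-encoded word of the form =?<charset>?B?<base64>?=
# and return the text as Perl characters (nothing if it does not match).
sub decode_encoded_word {
    my ($word) = @_;
    my ($charset, $b64) = $word =~ /\A=\?([^?]+)\?[Bb]\?([^?]*)\?=\z/
        or return;
    return decode($charset, decode_base64($b64));
}

binmode STDOUT, ':encoding(UTF-8)';   # we print non-ASCII characters

# The To-field from this article: encoded name, a space, then the address.
my $to = '=?ISO-2022-JP?B?GyRCQzRFdjxU?= <info@example.com>';
my ($word, $address) = split / /, $to, 2;
print decode_encoded_word($word), " $address\n";
```

If the header was built correctly, the printed line shows the original Japanese name followed by the address.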
$sub_utf8 = '日本語のタイトル';
$sub_7bit_mime = `echo -n "$sub_utf8" | nkf | ./mybase64encode.pl`;
This block creates a MIME-encoded Subject header. Since this field can be entirely in Japanese, we MIME-encode the whole field.
$cmd = "cat ./body-utf8.txt | nkf > body-7bit.txt";
system("$cmd");
This block converts the Japanese message body file body-utf8.txt (Unicode) into the file body-7bit.txt (ISO-2022-JP).
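The same conversion can also be done inside Perl without calling nkf. The sketch below is an alternative under assumptions, not the method used in this article: it uses the core Encode module, and the function name convert_body_to_iso2022jp is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Read a UTF-8 text file and write its ISO-2022-JP equivalent.
# Returns the number of bytes written.
sub convert_body_to_iso2022jp {
    my ($in_path, $out_path) = @_;

    open my $in, '<:raw', $in_path or die "Cannot read $in_path: $!";
    my $utf8_bytes = do { local $/; <$in> };   # slurp the whole file
    close $in;
    $utf8_bytes = '' unless defined $utf8_bytes;

    my $text = decode('UTF-8', $utf8_bytes);   # bytes -> characters
    my $jis  = encode('iso-2022-jp', $text);   # characters -> 7-bit bytes

    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    print {$out} $jis;
    close $out or die "Cannot close $out_path: $!";
    return length $jis;
}

# Usage: perl convert.pl body-utf8.txt body-7bit.txt
convert_body_to_iso2022jp(@ARGV) if @ARGV == 2;
```

Characters that ISO-2022-JP cannot represent are replaced by a substitution character in this sketch, so the output should be checked when the body contains unusual characters.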
$cmd = "sendEmail " .
"-t \"$to_7bit_mime\" " .
"-f \"$from_7bit_mime\" " .
"-u \"$sub_7bit_mime\" " .
"-o message-file=body-7bit.txt " .
"-o message-charset=ISO-2022-JP " .
"-s localhost:25" .
"\n";
print $cmd;
system("$cmd");
system("rm -f body-7bit.txt");
Finally, we pass the To, From, and Subject fields, which contain MIME-encoded ISO-2022-JP strings, as options to the sendEmail program. We also pass the ISO-2022-JP message body file and explicitly specify that the message uses the ISO-2022-JP character set. The command is executed by the system() function. After sending the email, we remove the temporary body file.
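Because the command is assembled as one string and run through the shell, quotes or other special characters in a header value could break it. A possible variation is to pass the arguments to system() as a list, which bypasses the shell. The sketch below uses only the sendEmail options already shown in this article; the helper name sendemail_args is hypothetical, and the From value omits the encoded name for brevity.

```perl
use strict;
use warnings;

# Assemble the sendEmail argument list used in this article.
sub sendemail_args {
    my (%opt) = @_;
    return (
        'sendEmail',
        '-t', $opt{to},
        '-f', $opt{from},
        '-u', $opt{subject},
        '-o', "message-file=$opt{body_file}",
        '-o', "message-charset=$opt{charset}",
        '-s', $opt{server},
    );
}

my @cmd = sendemail_args(
    to        => '=?ISO-2022-JP?B?GyRCQzRFdjxU?= <info@example.com>',
    from      => '<taro@example.net>',
    subject   => '=?ISO-2022-JP?B?GyRCRnxLXDhsJE4lPyUkJUglaw==?=',
    body_file => 'body-7bit.txt',
    charset   => 'ISO-2022-JP',
    server    => 'localhost:25',
);

print join(' ', @cmd), "\n";
# Uncomment to actually send; the list form of system() skips the shell.
# system(@cmd) == 0 or die "sendEmail failed: $?";
```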
#!/usr/bin/perl
use MIME::Base64;
undef $/;                               # Read all of standard input at once
$body = <>;
$body_encoded = encode_base64($body);
$body_encoded =~ s/\s+$//;              # Remove newlines
print '=?ISO-2022-JP?B?' . $body_encoded . '?=';
This script (mybase64encode.pl) turns an ISO-2022-JP string given on standard input into a complete MIME-encoded header value. To do the BASE64 encoding, we use the encode_base64() function from Perl's standard MIME::Base64 module. The last line constructs the complete MIME header value by string concatenation.
References
"Mojibake", http://en.wikipedia.org/wiki/Mojibake
"Japanese Emails on the Internet", http://www.kanzaki.com/docs/jis-mail.html (In Japanese)
Think twice before using ISO-2022-JP
I do not think ISO-2022-JP is advisable nowadays, and here are the reasons:
Sending a message in ISO-2022-JP has a good number of shortcomings.
- ISO-2022-JP is not able to represent non-Japanese characters. In fact, it does not even fully represent English. ISO-2022-JP-2 and successors fix English support, but still cannot represent many common graphic characters and other languages. If you need *ANY* non-Japanese character, you cannot use this. It makes it even harder than it already is for Japanese people to access foreign information.
- ISO-2022-JP is not even able to fully represent the Japanese language. A number of uncommon Japanese kanji are missing from ISO-2022-JP, because they are missing from the JIS X-* standards in the first place.
- If you enter text normally with a standard IME on a modern computer, characters outside of ISO-2022-JP are commonplace and not marked as such; sending e-mail in ISO-2022-JP will trash these, for no reason that is apparent to the end user.
- ISO-2022-JP has poor support outside of Japan. Using this character set makes it even harder for foreigners to access Japanese content.
- ISO-2022-JP has almost ubiquitous support in Japan, but still not as good as Unicode.
- ISO-2022-JP is slow to process and not robust to data truncation.
- ISO-2022-JP is an old, legacy encoding: it has declining support, and more and more new devices do not support it. For example, the iPhone does not support it out of the box. By contrast, Unicode enjoys better and growing support, including inside of Japan.
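As a rough illustration of the point about characters outside ISO-2022-JP, the sketch below checks whether a string survives a round trip through the encoding. It is a heuristic under assumptions, using Perl's core Encode module; the helper name survives_iso2022jp is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return 1 if $text is unchanged after converting it to ISO-2022-JP
# and back, 0 if something was lost or substituted on the way.
sub survives_iso2022jp {
    my ($text) = @_;
    my $round_trip = decode('iso-2022-jp', encode('iso-2022-jp', $text));
    return $round_trip eq $text ? 1 : 0;
}

my %samples = (
    'Japanese text' => "\x{65E5}\x{672C}\x{8A9E}",   # "日本語"
    'Hangul'        => "\x{D55C}",                   # a Korean syllable
    'Emoji'         => "\x{1F600}",                  # a smiley face
);

for my $label (sort keys %samples) {
    printf "%-14s %s\n", $label,
        survives_iso2022jp($samples{$label}) ? 'survives' : 'lost';
}
```

A check like this could be run on a message before choosing a character set, falling back to UTF-8 whenever something would be lost.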
To sum it up, the suggested use of ISO-2022-JP will ensure you will have problems every time you deal with a foreign computer, every time you enter complicated characters with your IME, every time you get network problems, and will get you more and more problems as time passes. It was a good idea ten years ago. It just is not any more.
Sending messages directly in UTF-8 does not have any of these problems.
I have never run into a mailer that was not 8-bit clean in practice. But if you really need to transfer your data over a 7-bit channel, it is much more advisable to use MIME and base64-encode your UTF-8 data.
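A minimal sketch of that suggestion, assuming the body is available as a Perl character string: the UTF-8 bytes are base64-encoded, so every byte on the wire is 7-bit ASCII, and two standard MIME headers tell the recipient how to decode them. The helper name base64_utf8_body is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Encode a text body as UTF-8, then as base64 (lines of at most 76 columns).
sub base64_utf8_body {
    my ($text) = @_;
    return encode_base64(encode('UTF-8', $text));
}

my $body = "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}\n";   # "こんにちは"

# Headers that describe the encoded body to the receiving mail client.
print "Content-Type: text/plain; charset=UTF-8\n";
print "Content-Transfer-Encoding: base64\n";
print "\n";
print base64_utf8_body($body);
```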
The Perl code would be simpler. The support is better. It will continue to work in the foreseeable future, as opposed to ISO-2022-JP. It will work outside of Japan, it will work better even inside of Japan, and it will allow you to use any character your computer allows you to input. It's just better.
Re: Think twice before using ISO-2022-JP
Hi J,
Thank you very much for reading my article and providing very informative feedback. I have learned many new things about ISO-2022-JP from your comment, and I now realize that the method I used is obsolete.
Before writing the article, I tested sending Japanese emails directly in UTF-8, as you suggest, to various email accounts of my own and to others that I had access to. At that time, I saw failures in which the Japanese part suffered mojibake. After I converted the UTF-8 into ISO-2022-JP with nkf and sent the emails again, everything worked fine. However, I may have misconfigured some other aspect of my mail setup when sending the UTF-8 Japanese emails.
So I would like to mark the article as "not-recommended". I will also try to get UTF-8 Japanese emails working reliably in my environment.
FIX: Sending Japanese emails using UTF-8
I have learned that using ISO-2022-JP for sending Japanese emails is against the trend. As suggested, I tried sending Japanese emails using UTF-8 instead of ISO-2022-JP. It turned out that I had been making a careless mistake before (assigning the wrong character code in a sendEmail command option!). Now UTF-8 based Japanese emailing is working fine.
Here, I post a working Perl script that uses UTF-8. When using the script, you should modify the values (To, From, Subject, message body file, and mail server) to suit your situation.
Because it is based on UTF-8, the following emailing script works not only for Japanese emails but also for any other non-English language.
send-email-utf8.pl
#!/usr/bin/perl
use MIME::Base64;

sub conv_utf8_mime {
    my $str = $_[0];
    my $str_enc = encode_base64($str);
    $str_enc =~ s/\s+$//; # Remove newlines
    my $str_utf8_mime = '=?UTF-8?B?' . $str_enc . '?=';
    return $str_utf8_mime;
}

$to_jp_part_utf8 = '担当者';
$to_addr_part_utf8 = '<info@example.com>';
$to_utf8_mime = &conv_utf8_mime($to_jp_part_utf8) . " " . $to_addr_part_utf8;

$from_jp_part_utf8 = '山田太郎';
$from_addr_part_utf8 = '<taro@example.net>';
$from_utf8_mime = &conv_utf8_mime($from_jp_part_utf8) . " " . $from_addr_part_utf8;

$sub_utf8 = '日本語のタイトル';
$sub_utf8_mime = &conv_utf8_mime($sub_utf8);

$cmd = "sendEmail " .
    "-t \"$to_utf8_mime\" " .
    "-f \"$from_utf8_mime\" " .
    "-u \"$sub_utf8_mime\" " .
    "-o message-file=body-utf8.txt " .
    "-o message-charset=UTF-8 " .
    "-s localhost:25" .
    "\n";
print $cmd;
system("$cmd");