Published in IEEE Advance Computing conference 2009

Framework for Web Application Internationalization and Localization Supporting Indian Languages

Jalindar Baban Karande, M. L. Dhore, Sandip R. Shinde

Abstract: This paper underlines the multilingual nature of India by analyzing the country's census data, and the resulting need for multilingual software systems. It reviews current technology for internationalization and localization and its limitations for Indian society, and proposes a framework for localizing web applications in Indian languages. The proposed framework is based on the characteristics of Indic scripts and of Unicode.

I. INTRODUCTION

Traditionally, web sites have been developed for the English-speaking community, with only limited attempts to develop web sites in other languages. Today, most of the functionality provided by traditional desktop software is being replaced by web-based online applications, and organizations are expanding their business to the global market. Globalization of business, however, requires web applications to be localized to the language and cultural environment of their users [1]. India has 22 officially recognized languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, Urdu, Bodo, Maithili, Dogri and Santhali. A web application developed for an American or any other community therefore has to be localized for at least 22 languages before it can serve the Indian community. Most online users of a web application communicate with each other in their own languages. In countries like America nearly all people communicate in the same language, but the situation is different in India. Fig. 1 shows the language distribution in India.

In fact, there is no state in India where every citizen speaks the same language. For example, Fig. 2 shows the language distribution in the state of Maharashtra.

Fig. 1. Distribution of population by language in India

Fig. 2. Distribution of population by language in Maharashtra

II. EXISTING FRAMEWORKS

Stavros Kokkotos proposed an architecture for the development of internationalized software called ISDAi [2]. Its design is highly modular, but it is costly to implement because it requires building up separate libraries and configuration files. Terence Parr [3] proposed an XML-based string template for localizing strings and other data types such as currency, date and time using the locale. N. Anbarasan [4] described the localization process for the web in Indian languages and several issues related to it. Valentina Dagient proposed a framework for the internationalization of open source software [5], as most open source projects are localized for different languages and cultures. Jesus Cardenosa et al. [6] proposed an approach for localizing existing software without changing its source code. Most of the existing frameworks are based on the Roman script, and cases involving multiple scripts are not considered.

III. INDIC SCRIPT

S. P. Mudur et al. [7] explained several properties of Indic scripts and how these properties make software development in Indian languages difficult. Fortunately, some properties of Indic scripts, such as their phonetic nature and highly logical writing system, can be used to advantage in software development for Indian languages.

India is a multilingual country with 22 recognized official languages and over 6000 dialects. Of these languages, some are of Perso-Arabic origin, with script and writing rules similar to Arabic. Most of the others share writing rules derived from the ancient Brahmi script, even though their scripts are visually distinct. Panini's phonetic classification of the Indian alphabets into vowels (V) and consonants (C) serves as a common base for all Indian languages of Brahmi origin. It also provides a unique encoding for any word in the language. The written forms differ, because different letter shapes and different shaping rules are used. The method of combining these two basic groups (C and V) to form syllables is itself a unique and scientific approach, common to all the Indian scripts.

IV. PROPERTIES OF UNICODE

In the previous section we outlined features common to most Indian scripts. Fortunately, these common features are preserved in Unicode as well. Consonants with the same pronunciation have the same offset from the starting code point of their respective script blocks, and the same holds for vowels. Table I shows the offsets for one set of consonants.

Similarly, the offsets for vowels and digits are the same in most of the languages.

Consonant	Hindi Unicode (hex)	Hindi offset	Gujarati Unicode (hex)	Gujarati offset	Telugu Unicode (hex)	Telugu offset
k	0915	21	0A95	21	0C15	21
K	0916	22	0A96	22	0C16	22
g	0917	23	0A97	23	0C17	23
G	0918	24	0A98	24	0C18	24
w	0919	25	0A99	25	0C19	25

TABLE I UNICODE AND OFFSET FOR ONE SET OF CONSONANTS

The offsets for consonants are not the same for the Dravidian script Tamil, but they are the same for the other Dravidian scripts, i.e. Telugu, Malayalam, and Kannada.
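The offset property can be checked with a short Python sketch. The block starting points below are the Unicode block bases for Devanagari, Gujarati and Telugu; the names `BLOCK_START` and `offset_in_block` are illustrative and not part of the paper.

```python
# Starting code points of three Unicode Indic blocks.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Gujarati": 0x0A80,
    "Telugu": 0x0C00,
}


def offset_in_block(ch: str, script: str) -> int:
    """Return the offset of a character from the start of its script block."""
    return ord(ch) - BLOCK_START[script]


# The consonant KA (first row of Table I) in the three scripts.
ka = {"Devanagari": "\u0915", "Gujarati": "\u0A95", "Telugu": "\u0C15"}
offsets = {script: offset_in_block(ch, script) for script, ch in ka.items()}
print(offsets)  # {'Devanagari': 21, 'Gujarati': 21, 'Telugu': 21}
```

The same call applied to the next consonant in each script (U+0916, U+0A96, U+0C16) gives 22 in all three cases.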

V. PROPOSED MODEL

Fig. 3. Proposed model for localization of web applications in Indian languages

Fig. 3 shows the proposed model for localizing a web application in Indian languages. The model handles static text and dynamic text differently. Its handling of static text has some similarity to the approach of Jesus Cardenosa et al. [6], which uses the langID attribute of the localization tag to identify the language for which a resource file is localized. E.g.

In the proposed model we use a separate file for every language. The langID attribute is removed from the localization tag, and the language for which a resource is localized is identified by the file name itself. We use the convention <filename>..xml for naming resource files. For example, if the home page of a web application, “Home.aspx”, has to be localized for the Hindi and Marathi languages and Indian culture, we store all static text strings in the files Home.aspx.HNIN.xml and Home.aspx.MRIN.xml respectively. To find a localized UI string we only need to append HNIN.xml or MRIN.xml to the name of the file whose localized strings we are looking for, so no time is spent parsing a file to find the resources of a particular language. Here static text does not mean the static part of the web application; it is text whose content is static. This includes the following:

• Text displayed on the web application.

• Messages displayed at runtime but having static content, e.g. error messages, prompts to the user, guideline messages etc.

• URLs of the images displayed on the web page, as different cultures may require different versions of some images.

Dynamic text is treated in two separate ways. Text such as names of persons and places is not translated; it is transliterated from one script to another. In the proposed approach, transliteration from one Indic script to another is carried out using the properties of Unicode discussed in the previous section. As the offsets of most consonants and vowels are the same in most Indian scripts, we propose the following formula for transliteration.

$$ch_n = ch_o - S_o + S_n$$

where $ch_n$ is the Unicode code point of the character in the new script, $ch_o$ is the Unicode code point of the character in the old script, $S_n$ is the starting code point of the new script and $S_o$ is the starting code point of the old script.
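The formula can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the table `SCRIPT_START`, the constant `BLOCK_SIZE` and the function `transliterate` are names assumed here, and characters outside the source block (spaces, Latin letters, punctuation) are simply passed through unchanged.

```python
# Starting code points (S_o / S_n) of some Unicode Indic blocks.
SCRIPT_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def transliterate(text: str, old: str, new: str) -> str:
    """Apply ch_n = ch_o - S_o + S_n to every character of the old script."""
    s_o, s_n = SCRIPT_START[old], SCRIPT_START[new]
    out = []
    for ch in text:
        cp = ord(ch)
        if s_o <= cp < s_o + BLOCK_SIZE:
            out.append(chr(cp - s_o + s_n))  # shift into the new block
        else:
            out.append(ch)  # not in the old script: keep as-is
    return "".join(out)


# Devanagari KA, MA, LA (U+0915 U+092E U+0932)
# become Gujarati KA, MA, LA (U+0A95 U+0AAE U+0AB2).
print(transliterate("\u0915\u092E\u0932", "Devanagari", "Gujarati"))
```

A shifted code point is not guaranteed to be an assigned letter in every target script, so a production version would need a fallback for such gaps; the sketch does not attempt one.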

Transliterating text from English (Roman script) to an Indic script, or vice versa, requires a different method, because Indic scripts are phonetic whereas the Roman script is not. We prefer to use a lookup table for transliteration between the Roman script and Devanagari. To transliterate text between any other Indic script and the Roman script, we use Devanagari as an intermediate script. This approach works well with most names, but not all. To ensure correct transliteration, the system should keep a database of previously transliterated strings. The user may notice errors in a transliteration; a transliteration completed with the assistance of the user is stored in the database and consulted in future transliterations.
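A rough Python sketch of this scheme is given below. It is only an illustration under several assumptions: the tiny table `ROMAN_TO_DEVANAGARI`, the in-memory dictionary `corrections` (standing in for the database of previous transliterations) and all function names are hypothetical, and a real table would cover the full alphabet and vowel signs.

```python
DEVANAGARI_START = 0x0900

# Hypothetical lookup table: Roman chunks to Devanagari.
# A bare consonant is written with the virama sign (U+094D).
ROMAN_TO_DEVANAGARI = {
    "ka": "\u0915", "ma": "\u092E", "la": "\u0932",
    "k": "\u0915\u094D", "m": "\u092E\u094D", "l": "\u0932\u094D",
}
MAX_CHUNK = max(len(key) for key in ROMAN_TO_DEVANAGARI)

# Stand-in for the database of user-assisted transliterations.
corrections = {}


def remember_correction(word: str, fixed: str) -> None:
    """Store a transliteration confirmed or corrected by the user."""
    corrections[word] = fixed


def roman_to_devanagari(word: str) -> str:
    """Look up stored corrections first, then use greedy longest match."""
    if word in corrections:
        return corrections[word]
    out, i = [], 0
    while i < len(word):
        for size in range(MAX_CHUNK, 0, -1):
            chunk = word[i:i + size]
            if chunk in ROMAN_TO_DEVANAGARI:
                out.append(ROMAN_TO_DEVANAGARI[chunk])
                i += len(chunk)
                break
        else:
            out.append(word[i])  # no table entry: copy unchanged
            i += 1
    return "".join(out)


def roman_to_indic(word: str, target_start: int) -> str:
    """Use Devanagari as the intermediate script, then shift the block."""
    deva = roman_to_devanagari(word)
    return "".join(
        chr(ord(c) - DEVANAGARI_START + target_start)
        if DEVANAGARI_START <= ord(c) < DEVANAGARI_START + 0x80 else c
        for c in deva
    )
```

With this table, `roman_to_devanagari("kamal")` ends in a virama, which is not how the name is normally written. After `remember_correction("kamal", "\u0915\u092E\u0932")` the stored form is returned instead, which mirrors the role of the user-assisted database described above.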

For translation of other dynamic text, a lexicon server or multilingual wordnet [8] server is used. The multilingual dictionary is prepared only for words in the context of the application, which reduces the size of the dictionary, the search time, and the confusion caused by the same word having different meanings in different contexts. The multilingual dictionary can be sliced to produce a small bilingual dictionary in an XML file, which is sent to the client so that translation can be performed on the client side. Transliteration is also performed on the client side if the client has sufficient processing power, to reduce the load on the web server.
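Slicing a multilingual dictionary into a bilingual XML file might look like the Python sketch below. The in-memory structure `MULTILINGUAL`, the XML element names and both function names are assumptions made for illustration; the paper does not specify a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical application-specific multilingual dictionary:
# one entry per concept, keyed by language code.
MULTILINGUAL = {
    "name": {"en": "name", "hi": "नाम", "mr": "नाव"},
    "water": {"en": "water", "hi": "पानी", "mr": "पाणी"},
    "city": {"en": "city", "hi": "शहर"},  # no Marathi entry yet
}


def slice_bilingual(multi: dict, src: str, dst: str) -> dict:
    """Keep only the source/target pair, skipping incomplete entries."""
    return {
        words[src]: words[dst]
        for words in multi.values()
        if src in words and dst in words
    }


def to_xml(pairs: dict, src: str, dst: str) -> str:
    """Serialize a bilingual dictionary as a small XML document."""
    root = ET.Element("dictionary", {"source": src, "target": dst})
    for source_word, target_word in pairs.items():
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "source").text = source_word
        ET.SubElement(entry, "target").text = target_word
    return ET.tostring(root, encoding="unicode")


en_mr = slice_bilingual(MULTILINGUAL, "en", "mr")
xml_text = to_xml(en_mr, "en", "mr")
print(xml_text)
```

The resulting string contains only the English and Marathi pairs, so the file sent to the client stays small compared with the full multilingual dictionary.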

VI. INPUT METHOD EDITOR

The most critical problem in multilingual web application development is keyboard support for multiple languages. Keyboard codes can be embedded into a web application if it supports only one language, but this becomes complicated when multiple languages are involved. One solution is to use the operating system's keyboard, provided the operating system is localized to the user's language; if it is not, this option is of no use. Third-party IMEs (Input Method Editors) are available for most languages, e.g. BarahaIME 1.0 supports eleven languages, but they place an additional burden on the user, who has to download and install them.

Currently the authors use a JavaScript-based online keyboard to support input methods for users. This keyboard can be localized to different languages with little effort. The authors are working on the design of a text box web control with a keyboard layout that supports most Indic scripts and can be localized to any Indic language without change in source

VII. CONCLUSION

Localization of web applications is essential for every non-English-speaking country. Localization of web applications in India poses more problems than in any other country because of the multilingual culture of the country. A few people have tried to localize web applications to Indian languages, but 100% localization is still not possible. Existing frameworks take care of static text only, and they do not consider multilingual software development. The proposed model will help to localize web applications with minimal human support. It also takes care of dynamic text and multilingual software development.

ACKNOWLEDGMENT

The authors would like to thank Prof. S. G. Pukale, Prof. P. S. Dhabe and Prof. N. Z. Tarapore for their valuable suggestions and guidance during the work. The authors would like to thank the staff members of the Department of Computer Engineering of Vishwakarma Institute of Technology, Pune, India for their support and cooperation.

REFERENCES

[1] J. Hogan, C. Ho-Stuart, and B. Pham, “Key challenges in software internationalisation,” in Australian Workshop on Software Internationalisation 2004, vol. 32, 2004.

[2] S. Kokkotos and C. Spyropoulos, “An architecture for designing internationalized software,” in Software Technology and Engineering Practice, pp. 13–21, July 1997.

[3] T. Parr, “Web application internationalization and localization in action,” in 6th International Conference on Web Engineering, vol. 263, pp. 64–70, ACM New York, NY, USA, 2006.

[4] N. Anbarasan, “Software localization process and issues,” Tamil Internet, 2003.

[5] V. Dagient and R. Laucius, “Internationalization of open source: Framework and some issues,” in IEEE 2nd International Conference on Information Technology: Research and Education (ITRE), IEEE Computer Society, 2004.

[6] J. Cardenosa, C. Gallardo, and A. Martin, “Internationalization and localization after system development: A practical case,” International Journal “Information Technologies and Knowledge”, vol. 1, pp. 121–127, 2007.

[7] S. P. Mudur, N. Nayak, S. Shanbhag, and R. K. Joshi, “An architecture for the shaping of Indic texts,” Computers and Graphics, vol. 23, pp. 7–24, 1999.

[8] J. Ramanand, A. Ukey, B. K. Singh, and P. Bhattacharyya, “Mapping and structural analysis of multilingual wordnets,” in Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, IEEE Computer Society, 2000.
