Kannada Language and Script - Shodhganga


Chapter 2

Kannada Language and Script

2.1 Introduction

Every human community has a language. Language is an acquired cognitive ability that humans use to communicate with other members of their community. Natural languages can be either spoken or signed. Spoken language uses the human speech organs to produce sounds. These sounds are combined into words and sentences according to well-defined rules. The study of languages is called linguistics. However, the development of writing systems, and the process by which they have supplanted traditional oral systems of communication, has been sporadic, uneven, and slow. Once established, writing systems on the whole change more slowly than their spoken counterparts, and often preserve features and expressions that are no longer available in the spoken language.

The present chapter gives an overview of the Kannada language and script. The basic Kannada characters are listed, and the possible combinations of basic characters leading to compound characters are discussed in detail. The purpose of this chapter is to introduce the complexity of the Kannada script and thereby show the challenges involved in data collection and classification.

2.2 Language

A language is the cognitive ability of a human being to learn and use a complex communication system to express one's thoughts through a predefined set of rules and conventions understood and shared by a community [99]. Language is processed in Broca's and Wernicke's areas of the human brain, and human beings acquire language through social interaction in early childhood [100, 101]. Natural languages are

Online Recognition of Isolated Handwritten Characters


either spoken or signed. Spoken language uses the audio modality, whereas sign language uses the visual modality. All spoken languages have phonemes of at least two different categories, vowels and consonants, which can combine to form syllables. The number of spoken languages and dialects in the world is estimated to be around 7000 [99].

2.3 Writing System and Script

The different ways of representing language in graphical or symbolic form are called writing systems [102]. The first writing systems are estimated to have been invented at the beginning of the Bronze Age. The great benefit of a writing system is that it keeps a persistent record of information expressed in a language, so that the information can be retrieved independently of the initial act of formulation. Writing systems have added a new dimension to communication and have made communication across distances possible.

A fundamental unit, or minimally significant element, of a writing system is called a grapheme; equivalent terms are glyph, sign, and character [103]. A well-defined set of graphemes of a writing system is called a script. A set of predefined rules and conventions assigns meaning to the graphemes, fixes their writing order, and defines the relations among the graphemes of a script. The various writing styles of a grapheme are called its allographs. Writing styles vary from writer to writer and also with the instrument used for writing. The product of a writing system is called text. The act of composing text is referred to as writing, and the act of interpreting text is called reading. Writing systems are classified according to a few common features and the units they use for writing [102-104]. The main classes are as follows.

2.3.1 Pictographic/Ideographic Writing System

In a pictographic writing system, the graphemes are iconic pictures, and they cannot express everything that can be communicated by a language. In ideographic scripts, the graphemes are ideograms that represent concepts or ideas rather than specific words of a language. Ex. Hieroglyphs, Testerian.


2.3.2 Logographic Writing System

In logographic writing systems, glyphs represent words or morphemes rather than phonetic elements. In addition to logograms, these scripts have graphemes that represent phonetic elements and specify the sound of a logogram. Ex. Egyptian, Hanzi, Kanji, Hanja.

2.3.3 Syllabary Writing System

In a syllabary writing system, each grapheme represents a whole syllable, which is typically a combination of consonant and vowel segments. Ex. Hiragana, Katakana.

2.3.4 Segmental Writing System

In a segmental writing system, graphemes represent the phonemes of a language. The correspondence between the graphemes of the script and the phonemes of the language is not necessarily one-to-one: a phoneme may be represented by a string of several graphemes. Segmental scripts are classified into abjads and alphabets based on the graphemes used to represent phonemes. The alphabets are further classified into linear and non-linear alphabets. Ex. Arabic, Sindhi, Urdu, Tibetan, Devanagari, Kannada, Malayalam, Telugu, etc.

2.4 Kannada Language and Script

Kannada is a southern Dravidian language and one of the scheduled languages of India, spoken predominantly in the state of Karnataka [105]. It is the official and administrative language of Karnataka [106]. In 2008, the Government of India officially recognized Kannada as a classical language [107]. The Kannada script is an abugida (alphasyllabary) descended from the Brahmi script [108]. It is a segmental, non-linear alphabetic script in which each consonant carries an inherent vowel. Each written symbol corresponds to one syllable, as opposed to one phoneme in languages like English. Each letter is called an akshara, akkara, or varna, and each letter has its own form (akara) and sound (shabda), giving its visible and audible representations. Kannada characters are written in horizontal lines from left to right as isolated letters. The spacing between parts of characters and between words varies


from writer to writer. The script is also used for writing the Konkani, Tulu, Beary Bashe, and Kodava languages. The Kannada alphabet is popularly known as Aksharamale or Varnamale, and the current Varnamale consists of forty-nine characters. To keep the recognition system compatible with the earlier Varnamale set, fifty characters have been considered in the present work. However, the number of written symbols is far greater than fifty, because characters can combine to form compound characters (Ottaksharas). In addition, the script is complicated by the various combinations of half-letters or symbols that attach to letters in a manner similar to diacritical marks. A sample of printed Kannada text is shown in Fig. 2.1.

ಕನ್ನಡ ರಾಜ್ಯೋತ್ಸವ ಪ್ರತಿ ವರ್ಷದ ನವೆಂಬರ್ ೧ ರಂದು ಆಚರಿಸಲಾಗುತ್ತದೆ. ಮೈಸೂರು ರಾಜ್ಯವು (ಈಗಿನ ಕರ್ನಾಟಕ) ೧೯೫೬ರ ನವೆಂಬರ್ ೧ ರಂದು ನಿರ್ಮಾಣವಾದುದರ ಸಂಕೇತವಾಗಿ ಈ ರಾಜ್ಯೋತ್ಸವವನ್ನು ಆಚರಿಸಲಾಗುತ್ತದೆ.

Figure 2.1 Sample of printed Kannada text.

2.5 Classification of Kannada Varnamale

The fifty basic characters are classified into three categories [109]: Swaras (vowels), Vyanjanas (consonants), and Yogavahakas (part vowel, part consonant).

2.5.1 Swaras (Vowels)

There are fourteen vowels, called swaras. Table 2.1 shows the graphemes of the vowels and the corresponding ITRANS (Indian language Transliteration) codes. In the revised version of the Varnamale, the character ೠ has been dropped from the list.

Table 2.1 Glyphs of Kannada Swaras (Vowels).

Vowels | ಅ | ಆ | ಇ | ಈ | ಉ | ಊ | ಋ | ೠ | ಎ | ಏ | ಐ | ಒ | ಓ | ಔ
ITRANS | a | aa | i | I | u | U | Ru | RU | e | E | ai | o | O | ou

Table 2.2 Glyphs of Kannada Yogavahakas (part-vowel, part-consonant).

Name | Anusvara | Visarga
Yogavahakas | ◌ಂ | ◌ಃ
ITRANS | aM | aH


2.5.2 Yogavahakas (part-vowel, part-consonant)

There are two Yogavahakas, called Anusvara and Visarga. In general, these two are also grouped under the vowel category. The graphemes of these characters are shown in Table 2.2. The dotted circle in the table indicates that these graphemes always appear after a vowel or a consonant.

2.5.3 Vyanjanas (Consonants)

The Vyanjanas are classified into structured and unstructured consonants. The structured consonants are further classified into five groups according to where the tongue touches the palate while the character is pronounced. The unstructured consonants are those that do not belong to any of the structured groups. Tables 2.3 and 2.4 show the graphemes of the consonants modified by the vowel ಅ.

Table 2.3 Glyphs of structured Kannada Vyanjanas (Consonants).

Group | Voiceless | Voiceless Aspirate | Voiced | Voiced Aspirate | Nasal
Velars | ಕ (ka) | ಖ (kha) | ಗ (ga) | ಘ (gha) | ಙ (nga)
Palatals | ಚ (ca) | ಛ (Cha) | ಜ (ja) | ಝ (jha) | ಞ (nja)
Retroflex | ಟ (Ta) | ಠ (Tha) | ಡ (Da) | ಢ (Dha) | ಣ (Na)
Dentals | ತ (ta) | ಥ (tha) | ದ (da) | ಧ (dha) | ನ (na)
Labials | ಪ (pa) | ಫ (pha) | ಬ (ba) | ಭ (bha) | ಮ (ma)

Table 2.4 Glyphs of unstructured Kannada Vyanjanas (Consonants).

Unstructured Consonants | ಯ | ರ | ಲ | ವ | ಶ | ಷ | ಸ | ಹ | ಳ
ITRANS | ya | ra | la | va | sha | Sha | sa | ha | La
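The classification of Section 2.5 can be sketched as a small lookup structure. The snippet below is only an illustration: it assumes that each basic character is identified by its Unicode code point (the chapter itself is concerned with handwritten glyphs, not encodings), and the names `SWARAS`, `YOGAVAHAKAS`, `STRUCTURED_VYANJANAS`, `UNSTRUCTURED_VYANJANAS`, `classify`, and `varnamale_size` are hypothetical.

```python
# Hypothetical lookup tables for the fifty-character Varnamale of Section 2.5.
# Assumption: every basic character is identified by its Unicode code point.

SWARAS = [chr(c) for c in (
    0x0C85, 0x0C86, 0x0C87, 0x0C88, 0x0C89, 0x0C8A, 0x0C8B,
    0x0CE0,  # vocalic RR, the vowel dropped from the revised list
    0x0C8E, 0x0C8F, 0x0C90, 0x0C92, 0x0C93, 0x0C94,
)]

YOGAVAHAKAS = [chr(0x0C82), chr(0x0C83)]  # anusvara, visarga

# Five structured groups of five consonants each (rows of Table 2.3).
STRUCTURED_VYANJANAS = {
    "velar": [chr(c) for c in range(0x0C95, 0x0C9A)],
    "palatal": [chr(c) for c in range(0x0C9A, 0x0C9F)],
    "retroflex": [chr(c) for c in range(0x0C9F, 0x0CA4)],
    "dental": [chr(c) for c in range(0x0CA4, 0x0CA9)],
    "labial": [chr(c) for c in range(0x0CAA, 0x0CAF)],
}

# ya, ra, la, va, sha, Sha, sa, ha, La (Table 2.4).
UNSTRUCTURED_VYANJANAS = [chr(c) for c in (
    0x0CAF, 0x0CB0, 0x0CB2, 0x0CB5, 0x0CB6, 0x0CB7, 0x0CB8, 0x0CB9, 0x0CB3,
)]


def classify(ch):
    """Return the Varnamale category of a single basic character."""
    if ch in SWARAS:
        return "swara"
    if ch in YOGAVAHAKAS:
        return "yogavahaka"
    for group, members in STRUCTURED_VYANJANAS.items():
        if ch in members:
            return f"vyanjana ({group})"
    if ch in UNSTRUCTURED_VYANJANAS:
        return "vyanjana (unstructured)"
    return "other"


def varnamale_size():
    """Count the basic characters across all three categories."""
    structured = sum(len(m) for m in STRUCTURED_VYANJANAS.values())
    return (len(SWARAS) + len(YOGAVAHAKAS)
            + structured + len(UNSTRUCTURED_VYANJANAS))
```

Under these assumptions `varnamale_size()` gives 14 + 2 + 25 + 9 = 50, matching the fifty characters considered in this work.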

2.6 Kannada Character Combinations

In Kannada, vowels (V) are indicated with diacritic marks when they combine with a consonant (C), and these diacritics may appear above, below, before, or after the consonant. When Vs are written alone as independent letters, they appear at the beginning of syllables. The consonant-vowel (CV) combinations are formed by retaining most of the consonant glyph and attaching a glyph corresponding to the vowel modifier. A C without a V is called a dead consonant. A C is made a dead C by adding a halant (◌್) symbol to it.

Similarly, up to three Cs can combine with a V, giving CCV and CCCV combinations. When two consonants combine, the second consonant changes its shape or size and is written to the bottom right of the first one; it is then called a consonant-conjunct. There are 34 consonant-conjuncts, one corresponding to each consonant. For example, the CCV combination ಸ್ವ (swa) is obtained by writing the consonant-conjunct ◌್ವ, corresponding to the consonant ವ (va), below the vowel-modified consonant ಸ (sa). In CCCV combinations, the vowel modifies the first consonant, and the other two consonants form consonant-conjuncts written below the CV. For example, the CCCV combination ತ್ಸ್ಯ (tsya) is obtained by writing the two consonant-conjuncts ◌್ಸ and ◌್ಯ, corresponding to the consonants ಸ (sa) and ಯ (ya), below the vowel-modified consonant ತ (ta).

Because so many combinations of basic characters lead to new characters in this manner, the character set of the Kannada script is very large. The glyphs of the CV combinations of the character ಣ are shown in Table 2.5, and the dead consonants and a few of the glyphs of CCV and CCCV combinations are shown in Table 2.6. The glyphs of the consonant-conjuncts are shown in Figure 2.2.
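As an illustration of how dead consonants and CCV/CCCV combinations are put together, the sketch below composes them as Unicode strings. This is an assumption made for the example, not part of the recognition system described in this work: in the Unicode encoding a halant (U+0CCD) is placed between the consonants of a cluster, and any dependent vowel sign is stored after the last consonant, even though it is drawn on the first one. The names `HALANT`, `dead`, and `compose` are hypothetical.

```python
# A minimal sketch, assuming Unicode encoding of Kannada text.
HALANT = "\u0ccd"  # Kannada sign virama (halant)


def dead(consonant):
    """Return the dead (vowel-less) form of a consonant."""
    return consonant + HALANT


def compose(consonants, vowel_sign=""):
    """Build a CV, CCV, or CCCV string from one to three consonants.

    The consonants are joined with halants so that a renderer draws the
    second and third as conjuncts below the first. vowel_sign is an
    optional dependent vowel sign; the empty string keeps the inherent 'a'.
    """
    if not 1 <= len(consonants) <= 3:
        raise ValueError("expected one to three consonants")
    return HALANT.join(consonants) + vowel_sign


SA, VA, TA, YA = "\u0cb8", "\u0cb5", "\u0ca4", "\u0caf"

swa = compose([SA, VA])        # CCV: sa + conjunct of va
tsya = compose([TA, SA, YA])   # CCCV: ta + conjuncts of sa and ya
```

The two examples reproduce the swa and tsya combinations discussed above; whether the conjuncts are displayed correctly depends on the font and rendering engine.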

2.7 Kannada Numerals and Special Symbols

The script contains ten Kannada numerals, corresponding to the decimal number system, and a set of special symbols. In addition, owing to the influence of the English language, writers also use punctuation marks and Indo-Arabic numerals. Table 2.7 shows the glyphs of these symbols.

In general, a typical Kannada character can be a V, a dead consonant, a CV, a CCV, a CCCV, or a numeral [109]. In addition, text may contain special symbols and Indo-Arabic numerals. A direct count of all character combinations, shown in Table 2.8, gives 647999 possible Kannada characters. However, many of these combinations are not defined in the language or phonetic structure of Kannada. Hence, the actual number of characters is smaller than this value, but still large enough to require special attention when designing a recognition system.


Table 2.5 Glyphs of CV combinations of character 'ಣ' with their vowel diacritics.

Dead Consonant + Vowel Diacritic | CV | ITRANS
ಣ್ + ಅ | ಣ | Na
ಣ್ + ◌ಾ | ಣಾ | Naa
ಣ್ + ◌ಿ | ಣಿ | Ni
ಣ್ + ◌ೀ | ಣೀ | NI
ಣ್ + ◌ು | ಣು | Nu
ಣ್ + ◌ೂ | ಣೂ | NU
ಣ್ + ◌ೃ | ಣೃ | NRu
ಣ್ + ◌ೄ | ಣೄ | NRU
ಣ್ + ◌ೆ | ಣೆ | Ne
ಣ್ + ◌ೇ | ಣೇ | NE
ಣ್ + ◌ೈ | ಣೈ | Nai
ಣ್ + ◌ೊ | ಣೊ | No
ಣ್ + ◌ೋ | ಣೋ | NO
ಣ್ + ◌ೌ | ಣೌ | Nou
ಣ್ + ◌ಂ | ಣಂ | NaM
ಣ್ + ◌ಃ | ಣಃ | NaH
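The sixteen CV forms of Table 2.5 can be generated mechanically for any consonant. The sketch below assumes Unicode dependent vowel signs and the ITRANS suffixes used in this chapter; the mapping `VOWEL_SIGNS` and the function `cv_forms` are hypothetical names introduced only for illustration.

```python
# Hypothetical mapping from the ITRANS vowel suffixes used in this chapter
# to Unicode dependent vowel signs (assumption: Unicode-encoded text).
VOWEL_SIGNS = {
    "a": "",          # inherent vowel, no sign is written
    "aa": "\u0cbe",
    "i": "\u0cbf",
    "I": "\u0cc0",
    "u": "\u0cc1",
    "U": "\u0cc2",
    "Ru": "\u0cc3",
    "RU": "\u0cc4",
    "e": "\u0cc6",
    "E": "\u0cc7",
    "ai": "\u0cc8",
    "o": "\u0cca",
    "O": "\u0ccb",
    "ou": "\u0ccc",
    "aM": "\u0c82",   # anusvara
    "aH": "\u0c83",   # visarga
}


def cv_forms(consonant, itrans_prefix):
    """Map each ITRANS label to the consonant plus its vowel sign."""
    return {
        itrans_prefix + suffix: consonant + sign
        for suffix, sign in VOWEL_SIGNS.items()
    }


# The sixteen forms of the retroflex nasal (ITRANS prefix "N").
forms = cv_forms("\u0ca3", "N")
```

The same function applied to each of the 34 consonants yields the 544 CV combinations counted in Table 2.8.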

Table 2.6 Glyphs of dead consonants, CCV, and CCCV combinations.

Dead Consonants (ITRANS labels only) | (k) (K) (g) (G) (ng) (ch) (t) (T) (D) (Dh) (n) (t) (y) (r) (l) (v) (ks) (sh) (Ch) (th) (s) (j) (J) (d) (h) (dh) (nj) (N) (L)
CCV | ಕ್ತ (kta) | ಸ್ವ (swa) | ಣ್ಣೊ (NNo) | ಕ್ಯ (kya)
CCCV | ಸ್ತ್ರಿ (stri) | ಸ್ಕ್ರು (skru) | ತ್ಸ್ಯ (tsya)

Figure 2.2 Glyphs of consonant conjuncts corresponding to each consonant.

Table 2.7 Kannada numerals, Indo-Arabic numerals and special symbols.

Kannada Numerals | ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯
Indo-Arabic Numerals | 0 1 2 3 4 5 6 7 8 9
Punctuation Marks & Special Symbols | ~ ( ) [ ] { } .. - + * % . ! ? / \ < > , “ ”

Table 2.8 Calculation of possible Kannada character combinations.

Type of the Character | Possible Combinations
Vowels (V) | 16
Dead Consonants | 34
CV (34 x 16) | 544
CCV (34 x 34 x 16) | 18496
CCCV (34 x 34 x 34 x 16) | 628864
Kannada Numerals | 10
Indo-Arabic Numerals | 10
Special Symbols | 25
Total | 647999
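The arithmetic behind Table 2.8 can be checked with a few lines of Python. This is only a verification sketch: the function name `possible_combinations` is hypothetical, and the reading of the 16 vowels as the 14 swaras plus the 2 yogavahakas is an assumption consistent with Section 2.5.2.

```python
def possible_combinations(vowels=16, consonants=34,
                          kannada_numerals=10, indo_arabic_numerals=10,
                          special_symbols=25):
    """Reproduce the per-type counts of Table 2.8.

    vowels=16 is assumed to be 14 swaras plus 2 yogavahakas.
    """
    return {
        "V": vowels,
        "Dead consonants": consonants,
        "CV": consonants * vowels,
        "CCV": consonants ** 2 * vowels,
        "CCCV": consonants ** 3 * vowels,
        "Kannada numerals": kannada_numerals,
        "Indo-Arabic numerals": indo_arabic_numerals,
        "Special symbols": special_symbols,
    }


counts = possible_combinations()
total = sum(counts.values())
```

With the default values the total agrees with the figure given in the table. The CCCV term accounts for most of it, which is why the set of defined characters is far smaller than this upper bound.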

2.8 Summary

In this chapter, the salient features of the Kannada language and script are discussed. The various combinations of basic characters that lead to new graphemes are analyzed to show the complexity of the script. Such a large character set raises major issues both in data collection and in the design of a robust classifier. In addition, because many characters have nearly identical shapes, misclassifications are likely and degrade recognition performance.
