                                                               Knowledge-Based Systems 17 (2004) 219–227
                                                                                                                              www.elsevier.com/locate/knosys
               A multilingual text mining approach to web cross-lingual text retrieval
                                                          Rowena Chau*, Chung-Hsing Yeh
                                School of Business Systems, Faculty of Information Technology, Monash University, Clayton, Vic. 3800, Australia
                                                              Received 26 August 2003; accepted 6 April 2004
                                                                      Available online 28 May 2004
             Abstract
               To enable concept-based cross-lingual text retrieval (CLTR) using multilingual text mining, our approach will first discover the
             multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term
             relationships, in turn, are used to discover the conceptual content of the multilingual text, which is either a document containing potentially
             relevant information or a query expressing an information need. When language-independent concepts hidden beneath both document and
             query are revealed, concept-based matching is made possible. Hence, concept-based CLTR is facilitated. This approach is employed for
             developing a multi-agent system to facilitate concept-based CLTR on the Web.
             © 2004 Elsevier B.V. All rights reserved.
             Keywords: Multilingual text mining; Cross-lingual text retrieval; Agent; Fuzzy clustering; Fuzzy classification
             * Corresponding author. E-mail address: rowena.chau@infotech.monash.edu.au (R. Chau).
             0950-7051/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
             doi:10.1016/j.knosys.2004.04.001

             1. Introduction

The exponential growth of the World Wide Web over the globe is the most influential factor contributing to the increasing awareness of cross-lingual text retrieval (CLTR) in recent years. Relevant information exists in different languages. A user may want to find documents in languages other than the one the query is formulated in. Among the various CLTR techniques developed recently, query translation is the most extensively studied. Such CLTR approaches are developed mainly to facilitate term-based lexical transfer between a single pair of source and target languages. However, a bilingual lexical transfer is not sufficient to fully support users' need for multilingual information seeking.

Within a multilingual information community, users often rely on CLTR to explore global knowledge relevant to a certain topic or area. Instead of looking for some specific documents that can be characterized by a few translation equivalents of the query terms, users are often interested in a broader view of a particular domain. They are thinking in terms of concepts and expecting to receive all relevant documents existing in any language. In such cases, concept-based CLTR capable of identifying multilingual documents about the concept of a query is necessary.

Documents and queries about the same concept do not necessarily contain matching sets of translation equivalents of each other. Conceptual relevance between documents and queries is not to be determined in an explicit way. To realize concept-based CLTR, the development of a conceptual interlingua to support lexical transfer across multiple languages is required. To encode a conceptual interlingua, terms from multiple languages describing the same concept should be mapped to a language-independent scheme. In this way, it is possible to match a term to its corresponding counterparts in all other languages and to achieve concept-based CLTR.

A multilingual thesaurus (e.g. EuroWordNet) encoding the conceptual relationships among multilingual terms is one such conceptual interlingua that has been used to achieve this goal [7]. However, the manual construction of multilingual thesauri is very labor-intensive and their coverage is not domain specific. An automatic and possibly unsupervised approach for generating such linguistic knowledge for CLTR, by discovering structures of lexical relationships among multilingual terms from text of the relevant domain, is highly desirable.

To provide better support to CLTR, a knowledge discovery technology, known as text mining, looks promising for discovering this kind of in-depth multilingual linguistic knowledge. Typically, text mining concerns the discovery and extraction of hidden relationships, such as
conceptual associations, among textual items, including terms and documents.

To enable concept-based CLTR using multilingual text mining, our approach will first discover the multilingual concept–term relationships from linguistically diverse textual data relevant to a domain. Second, the multilingual concept–term relationships, in turn, are used to discover the conceptual content of the multilingual text, which can be either a document containing potentially relevant information or a query expressing an information need. When language-independent concepts hidden beneath both documents and queries are revealed, concept-based matching is made possible, thus facilitating concept-based CLTR. This approach is employed for developing a multi-agent system to facilitate concept-based CLTR on the Web.

             2. Current CLTR techniques

Given a query expressed in one language, the objective of CLTR is to search for relevant documents in other languages. To break the language barrier, either document or query translation is required. As query translation is less resource demanding than document translation, it has proven to be a more feasible approach to CLTR. There are three major approaches to query translation: (a) machine translation, (b) knowledge-based methods using machine-readable dictionaries [2,8], and (c) corpus-based methods using a parallel corpus [14].

Although translating a query using machine translation is straightforward, it is argued that machine translation and CLTR have divergent concerns [13]. Machine translation, which aims at syntactically accurate translation, is redundant for CLTR. Since a query is short, often ungrammatical and formulated with just a few terms, it offers little context for the machine translation system to translate accurately. Besides, machine translation always replaces the original query term with only one of its many possible synonymous translations in the target language. This prevents query expansion, by which all synonymous terms are considered in order to enhance recall.

A query can easily be translated by replacing every query term with the set of all its possible translations as encoded in a machine-readable dictionary. However, this approach is ineffective mainly due to the translation ambiguity of polysemous terms (i.e. terms with multiple meanings). A polysemous term may have several alternative translations carrying different senses (meanings) in any foreign language. Translating a query by including every possible translation of every query term can greatly increase the set of possible meanings in the translated query, thus contributing to poor precision. Moreover, inadequate coverage of specific terminology and phrases is also a serious shortcoming of such machine-readable dictionaries.

An alternative to a machine-readable dictionary is a parallel corpus. A parallel corpus is a set of identical texts written in multiple languages. Corpus-based query translation is based on the idea that terms are represented as points in a multi-dimensional semantic space, and terms (in different languages) mapped to the same set of points in that semantic space are used to describe the same concept. Geometric relationships between terms within the semantic space are automatically extracted by analyzing co-occurrence statistics of terms across a parallel corpus. By substituting every query term with its geometrically close translations in the semantic space, query translation is then facilitated [6,12]. The corpus-based approach is most effective for CLTR when the document collection is domain-specific. In this paper, a corpus-based approach to CLTR that applies multilingual text mining using a parallel corpus is proposed.

             3. A multilingual text mining approach to cross-lingual text retrieval

Our work for enabling CLTR with multilingual text mining is focused on exploiting the knowledge discovery capability of text mining over multilingual text. This is a logical approach due to the complementary nature of these two areas. Both CLTR and multilingual text mining analyze multilingual textual data employing techniques from information retrieval, natural language processing and machine learning. In terms of the functions they perform, CLTR facilitates multilingual information access while multilingual text mining enables knowledge discovery from multilingual texts. The objective of CLTR is to locate relevant documents from a multilingual document collection in response to a query represented by a set of terms, while the objective of multilingual text mining is to reveal concepts and their relationships embedded within a collection of multilingual texts. To determine the conceptual relevance between documents and a query written in different languages, CLTR requires understanding of their semantics. Multilingual text mining has the potential to complement CLTR by discovering intrinsic meanings of multilingual texts. Our approach to concept-based CLTR with multilingual text mining is depicted in Fig. 1.

Within an integrated framework, multilingual text mining yields knowledge that supports CLTR. First, the multilingual concept–term relationships, which are necessary for a CLTR system to associate documents and query across languages, are mined from a parallel corpus. This is achieved by a fuzzy multilingual term clustering algorithm. By grouping conceptually related multilingual terms into clusters, the multilingual concept–term relationships are revealed. Second, using the conceptual relationships among multilingual terms discovered in the previous step as the linguistic knowledge base, conceptual content exhibiting ideas hidden beneath the multilingual texts is also mined. This is facilitated by a fuzzy multilingual text categorization algorithm. As a result, both documents and query in
different languages can then be encoded with language-independent concepts, instead of language-specific terms. As such, concept-based matching is made possible and concept-based CLTR is facilitated.

                                                    Fig. 1. A multilingual text mining approach to concept-based CLTR.

             3.1. Mining the conceptual relationship of multilingual terms

Successful application of text mining in supporting monolingual information retrieval has been well reported [1]. To facilitate CLTR, our first multilingual text mining task is to discover the conceptual relationships among multilingual terms. Towards this end, a fuzzy multilingual term clustering algorithm is developed using a fuzzy clustering technique, known as fuzzy c-means [3]. Its purpose is to generate a partition of a set of multilingual terms for revealing their concept–term relationships with additional concept membership degrees. Application of the multilingual term clustering algorithm thus results in a collection of concepts represented by clusters of conceptually related multilingual terms. This collection of clusters, analogous to a multilingual thesaurus, represents a compression and reflection of the usage of multiple languages. Its importance in concept-based CLTR is in providing a concept-oriented frame of lexical reference. A cluster of conceptually related multilingual terms helps enormously in focusing solely on relevant lexical alternatives by establishing a virtual semantic domain.

Clustering is an unsupervised method for automatic class formation. It offers the advantage that a priori knowledge of classes is not required. Typically, clustering algorithms (e.g. k-means) [9] aim to maximize inter-cluster distances and minimize intra-cluster distances of some similarity measure. In the context of mining conceptual relationships among multilingual terms, clustering looks at building up clusters of semantically related multilingual terms.

As concepts tend to overlap in terms of meaning, crisp clustering algorithms like k-means, which generate partitions such that each term is assigned to exactly one cluster, are inadequate for representing the real textual data structure. In this respect, fuzzy clustering methods that allow objects (terms) to be classified to more than one cluster with different membership values are more appropriate. With the application of fuzzy c-means, the resulting fuzzy multilingual term clusters, which are overlapping, will provide a more realistic representation of the multilingual semantic space.

The fuzzy c-means algorithm aims at minimizing the objective function

$$J(X, U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^{m} \, d^{2}(v_i, x_k)$$
under the constraints $\sum_{k=1}^{n} \mu_{ik} > 0$ for all $i \in \{1, \dots, c\}$ and $\sum_{i=1}^{c} \mu_{ik} = 1$ for all $k \in \{1, \dots, n\}$, where $X = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^p$ is the set of objects; $c$ the number of fuzzy clusters; $\mu_{ik} \in [0, 1]$ the membership degree of object $x_k$ to cluster $i$; $v_i$ the prototype (cluster center) of cluster $i$; and $d(v_i, x_k)$ the Euclidean distance between prototype $v_i$ and object $x_k$. The parameter $m > 1$ is the fuzziness index. For $m \to 1$, the clusters tend to be crisp, i.e. either $\mu_{ik} \to 1$ or $\mu_{ik} \to 0$; for $m \to \infty$, $\mu_{ik} \to 1/c$.

On the basis of the objective function optimization, fuzzy c-means is most suitable for finding optimal groupings of objects that best represent the structure of the data set. By minimizing the sum of within-group variance, the strength of associations of objects is maximized within clusters and minimized between clusters. In this respect, fuzzy c-means is particularly useful in text mining applications, such as term clustering, where intrinsic conceptual structure and semantic relationships among terms must be revealed in order to gain knowledge for better text categorization and retrieval.

Statistical analysis of parallel corpora has been proven to be an effective means of extracting useful multilingual lexical knowledge for CLTR and this has been successfully applied to the development of translation models for CLTR [12]. Text in parallel translation is increasingly available as a result of the global explosion of the World Wide Web. Toward using the World Wide Web as a source of parallel text, effective techniques for automatically identifying parallel translated documents on the Web have also been developed [4,15].

Based on the hypothesis that semantically related multilingual terms representing similar concepts tend to co-occur with similar inter- and intra-document frequencies across a parallel corpus, fuzzy c-means can be applied to sort a set of multilingual terms into clusters (concepts) such that terms belonging to any one of the clusters (concepts) should be as similar as possible while terms of different clusters (concepts) are as dissimilar as possible in terms of the concepts they represent.

To realize the idea of mining the multilingual concept–term relationship using fuzzy c-means, a fuzzy multilingual term clustering algorithm is developed. To begin with, a set of multilingual terms, which are the objects to be clustered, is first extracted from a parallel corpus of N parallel documents. Each term is then represented as an input vector of N features where each of the N parallel documents is regarded as an input feature with each feature value representing the frequency of that term in the nth parallel document. Details of the fuzzy multilingual term clustering algorithm are presented as follows:

The fuzzy multilingual term clustering algorithm:

1. Initialize the membership values $\mu_{ik}$ of the $K$ multilingual terms $x_k$ to each of the $c$ concepts (clusters), for $i = 1, \dots, c$ and $k = 1, \dots, K$, randomly such that

   $$\sum_{i=1}^{c} \mu_{ik} = 1, \quad \forall k = 1, \dots, K \tag{1}$$

   and

   $$\mu_{ik} \in [0, 1], \quad \forall i = 1, \dots, c; \ \forall k = 1, \dots, K \tag{2}$$

2. Calculate the concept prototypes (cluster centers) $v_i$ using these membership values $\mu_{ik}$:

   $$v_i = \frac{\sum_{k=1}^{K} (\mu_{ik})^{m} x_k}{\sum_{k=1}^{K} (\mu_{ik})^{m}}, \quad \forall i = 1, \dots, c \tag{3}$$

3. Calculate the new membership values $\mu_{ik}^{new}$ using these cluster centers $v_i$:

   $$\mu_{ik}^{new} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\lVert v_i - x_k \rVert}{\lVert v_j - x_k \rVert} \right)^{2/(m-1)}}, \quad \forall i = 1, \dots, c; \ \forall k = 1, \dots, K \tag{4}$$

4. If $\lVert \mu^{new} - \mu \rVert > \varepsilon$, let $\mu = \mu^{new}$ and go to step 2. Otherwise, stop.

5. Concept labeling. As a result of clustering, every multilingual term is assigned to various concepts (clusters) with various membership values. To apply these found clusters as a multilingual concept directory, concepts can be labeled by giving meaningful tags. This can be done manually using expert knowledge or by selecting the term being assigned the highest membership in each cluster for every language involved. As a result, a fuzzy partition of the multilingual term space acting as a multilingual linguistic knowledge base is now available for mining the conceptual content of all multilingual text.

              3.2. Mining the conceptual content of multilingual text

Aiming at discovering the conceptual content of both multilingual document and query, our second multilingual text mining task concerns the mapping of multilingual text to concepts. This process is considered a text categorization task.

Text categorization is conducted based on the cluster hypothesis [16], which states that documents with similar contents are relevant to the same concept. To accomplish the task, the crisp k-nearest neighbor algorithm [5] is among the most widely used methods [11,17]. It determines the membership of an unclassified text d to a concept c by examining whether the k pre-classified texts that are closest to d have also been classified to c.
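
Returning to the fuzzy multilingual term clustering algorithm of Section 3.1, steps 1 to 4 (Eqs. (1) to (4)) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `fuzzy_c_means`, the toy frequency vectors, the default values of `m`, `eps` and `max_iter`, the use of the largest absolute membership change as the norm in step 4, and the treatment of a term that coincides with a prototype are all assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(terms, c, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Fuzzy c-means over term vectors, following steps 1-4 of Section 3.1.

    terms: list of K equal-length frequency vectors, one per term.
    Returns (mu, centers): mu[i][k] is the membership of term k in
    concept i, and centers[i] is the prototype v_i.
    """
    rng = random.Random(seed)
    K, dim = len(terms), len(terms[0])

    # Step 1: random memberships, scaled so each term's memberships sum to 1.
    mu = [[rng.random() + 1e-9 for _ in range(K)] for _ in range(c)]
    for k in range(K):
        col = sum(mu[i][k] for i in range(c))
        for i in range(c):
            mu[i][k] /= col

    centers = []
    for _ in range(max_iter):
        # Step 2: each prototype is a membership-weighted mean, Eq. (3).
        centers = []
        for i in range(c):
            w = [mu[i][k] ** m for k in range(K)]
            total = sum(w)
            centers.append(
                [sum(w[k] * terms[k][f] for k in range(K)) / total
                 for f in range(dim)]
            )

        # Step 3: recompute memberships from distances, Eq. (4).
        new_mu = [[0.0] * K for _ in range(c)]
        power = 2.0 / (m - 1.0)
        for k in range(K):
            d = [math.dist(centers[i], terms[k]) for i in range(c)]
            on_center = [i for i in range(c) if d[i] == 0.0]
            if on_center:
                # Assumption: a term lying exactly on a prototype belongs
                # fully to it (shared equally if several coincide).
                for i in on_center:
                    new_mu[i][k] = 1.0 / len(on_center)
                continue
            for i in range(c):
                new_mu[i][k] = 1.0 / sum((d[i] / d[j]) ** power
                                         for j in range(c))

        # Step 4: stop once the largest membership change is at most eps.
        change = max(abs(new_mu[i][k] - mu[i][k])
                     for i in range(c) for k in range(K))
        mu = new_mu
        if change <= eps:
            break
    return mu, centers


# Invented toy data: 6 terms described by frequencies in N = 4 documents.
term_vectors = [
    [5, 4, 0, 0], [4, 5, 0, 0], [5, 5, 0, 1],   # terms used in documents 1-2
    [0, 0, 5, 4], [0, 1, 4, 5], [0, 0, 5, 5],   # terms used in documents 3-4
]
memberships, prototypes = fuzzy_c_means(term_vectors, c=2)
```

In this sketch `memberships[i][k]` plays the role of $\mu_{ik}$, so each term keeps a graded membership in every concept rather than a single crisp label. Concept labeling (step 5) is left out because it depends on expert knowledge or on per-language term selection.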
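
The crisp k-nearest neighbor decision described in Section 3.2 can be sketched as follows. This is only a minimal illustration under stated assumptions: the text above does not fix a distance measure or a voting rule, so Euclidean distance and a simple majority vote are assumed here, and the function name `knn_categorize`, the two-dimensional feature vectors and the concept labels are invented for the example.

```python
import math
from collections import Counter


def knn_categorize(text_vec, labeled_texts, k=3):
    """Crisp k-nearest neighbor categorization.

    labeled_texts: list of (vector, concept) pairs for pre-classified texts.
    Returns the concept that is most common among the k texts closest to
    text_vec (Euclidean distance and majority vote are assumptions).
    """
    # Keep the k pre-classified texts closest to the unclassified text.
    nearest = sorted(
        labeled_texts, key=lambda item: math.dist(item[0], text_vec)
    )[:k]
    # Each neighbor casts one vote for its own concept.
    votes = Counter(concept for _, concept in nearest)
    return votes.most_common(1)[0][0]


# Invented pre-classified texts with hypothetical two-feature vectors.
labeled = [
    ([1.0, 0.0], "finance"), ([0.9, 0.2], "finance"), ([0.8, 0.1], "finance"),
    ([0.0, 1.0], "sport"), ([0.1, 0.9], "sport"), ([0.2, 1.0], "sport"),
]
concept_a = knn_categorize([0.85, 0.1], labeled, k=3)
concept_b = knn_categorize([0.1, 0.95], labeled, k=3)
```

Because each text receives exactly one concept, this crisp rule cannot express that a text partly belongs to several overlapping concepts, which is the limitation that graded memberships are meant to address.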