Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation

Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0192, Japan
{oda.yusuke.on9, fudaba.hiroyuki.ev6, neubig, hata, ssakti, tomoki, s-nakamura}@is.naist.jp

Abstract—Pseudo-code written in natural language can aid the comprehension of source code in unfamiliar programming languages. However, the great majority of source code has no corresponding pseudo-code, because pseudo-code is redundant and laborious to create. If pseudo-code could be generated automatically and instantly from given source code, we could allow for on-demand production of pseudo-code without human effort. In this paper, we propose a method to automatically generate pseudo-code from source code, specifically adopting the statistical machine translation (SMT) framework. SMT, which was originally designed to translate between two natural languages, allows us to automatically learn the relationship between source code/pseudo-code pairs, making it possible to create a pseudo-code generator with less human effort. In experiments, we generated English or Japanese pseudo-code from Python statements using SMT, and found that the generated pseudo-code is largely accurate and aids code understanding.

Keywords: Algorithms, Education, Statistical Approach

I. INTRODUCTION

Understanding source code is an essential skill for all programmers. This is true for programmers at all levels, as it is necessary to read and understand code that others have written to, for example, work efficiently in a group or to integrate and modify open-source software. On a macro level, there are a variety of tools to aid engineers in comprehending the overall structure of programming projects. For example, DeLine et al. proposed a tool that allows programming experts to evaluate large-scale cooperative software engineering projects [1]. Rahman et al. also proposed a system that recommends methods for fixing source code when it does not work as expected [2].

On a more fine-grained level, when we try to understand the behavior of source code in detail, we usually need to read each statement in the source code carefully, and understand what each statement does. Of course, thoroughly reading and understanding the source code of existing software is possible (although time consuming) for veteran programmers. In contrast, this process is much more difficult for beginner programmers or programmers who are learning a new programming language. Such inexperienced readers sometimes do not understand the grammar and style of the source code at hand, so reading source code written in such languages imposes a large burden.

On the other hand, in educational texts about programming and algorithms, it is common to use "pseudo-code," which describes the behavior of statements in the program using natural language (usually in English, or the programmers' mother tongue) or mathematical expressions. Pseudo-code aids the comprehension of beginners because it explicitly describes what the program is doing, but is more readable than an unfamiliar programming language.

Fig. 1 shows an example of Python source code, and English pseudo-code that describes each corresponding statement in the source code.1 If the reader is a beginner at Python (or a beginner at programming itself), the left side of Fig. 1 may be difficult to understand. On the other hand, the right side of the figure can be easily understood by most English speakers, and we can also learn how to write specific operations in Python (e.g. if we want to check if the type of the variable is not an integer, we can see that we write "if not isinstance(something, int):"). In other words, pseudo-code aids the "bottom-up comprehension" [3] of given source code.

However, in real programming environments, pseudo-code corresponding to source code rarely exists, because pseudo-code is not necessary once a programmer has a good grasp of the programming language and project. In addition, indiscriminately inserting pseudo-code into existing source code manually would impose a large burden on programmers. On the other hand, if pseudo-code could be generated automatically, this burden could be reduced, and pseudo-code could be created for actual programming projects. If we want to practically use automatically generated pseudo-code, we can say that satisfying the following 4 points is required:

• pseudo-code is accurate enough to describe the behavior of the original source code,
• pseudo-code should be provided upon the readers' request,
• pseudo-code should be automatically generated to avoid placing a burden on programmers, and
• the method to generate pseudo-code should be efficient, to avoid making the readers wait.

In this study, we propose a method to automatically generate pseudo-code from source code. In particular, our proposed method makes two major contributions:

1While there are many varieties of pseudo-code, in this paper we assume that pseudo-code is a "line-to-line" translation between programming and natural languages, as shown in Fig. 1. This assumption clearly defines the relationship between source code and pseudo-code and is a convenient first step towards applying machine translation to this task.

Fig. 1. Example of source code written in Python and corresponding pseudo-code written in English.
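The "line-to-line" assumption of footnote 1 can be made concrete as a tiny parallel corpus in which each source statement is paired with exactly one pseudo-code sentence. The pairs below are illustrative examples in the spirit of Fig. 1, not taken from the authors' dataset:

```python
# A miniature "parallel corpus" in the line-to-line format this paper
# assumes: one pseudo-code sentence per source code statement.
# These pairs are illustrative, not drawn from the authors' dataset.
parallel_corpus = [
    ("if not isinstance(something, int):",
     "if the type of something is not an integer"),
    ("for i in range(10):", "for every i from 0 up to 9"),
    ("x = x + 1", "increment x by 1"),
]

# Each pair aligns a source statement with its natural language
# translation, which is exactly the form of data an SMT system trains on.
for code, pseudo in parallel_corpus:
    print(f"{code:40s} -> {pseudo}")
```

Training data of this shape is all that is needed to retarget the generator to a new programming or natural language, which is the extensibility argument made in the contributions below.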
• To our knowledge, this is the first method for generating pseudo-code that completely describes the corresponding source code. This should be contrasted with previous work on comment generation, which aims to help experienced engineers by reducing the amount of source code to be read, and is described in §II.

• We propose a framework to perform pseudo-code generation using statistical machine translation (SMT). SMT is a technology that can automatically learn how to translate between two languages, and was designed for translating between natural languages, such as English and Japanese. Our proposed method of applying this to pseudo-code generation has several benefits, the largest of which is that it is easy to extend the generator to other combinations of programming and natural languages simply by preparing data similar to that in Fig. 1.

In this paper, we first refer to related works on automatic comment generation and clarify the differences from and merits of our method (§II). Second, we summarize the two SMT frameworks used in this study (§III). Next, we explain how to apply the SMT framework to pseudo-code generation (§IV), how we gather source code/pseudo-code parallel data to train our pseudo-code generation system (§V), and the method to evaluate the generated pseudo-code using automatic and code understanding criteria (§VI). In experiments, we apply our pseudo-code generation system to the Python-to-English and Python-to-Japanese pseudo-code generation tasks, and we find that providing automatically generated pseudo-code along with the original source code makes the code easier to understand for programming beginners (§VII). We finally mention the conclusion and future directions, including applications to other software engineering tasks (§VIII).2

2Datasets used to construct and evaluate our pseudo-code generators are available at http://ahclab.naist.jp/pseudogen/

II. RELATED WORKS

There is a significant amount of previous work on automatic comment and document generation. The important difference between our work and conventional studies is the motivation. Previous works are normally based on the idea of "reducing" the amount of code to be read. This is a plausible idea for veteran engineers, because their objective is to understand a large amount of source code efficiently, and thus comments are expected to concisely summarize what the code is doing, instead of covering each statement in detail. From a technical point of view, however, pseudo-code generation is similar to automatic comment generation or automatic documentation, as both generate natural language descriptions from source code, so in this section we outline the methods used in previous approaches and contrast them with our method.

There are two major paradigms for automatic comment generation: rule-based approaches and data-based approaches. Regarding the former, for example, Sridhara et al. perform summary comment generation for Java methods by analyzing the actual method definition using manually defined heuristics [4], [5]. Buse et al. also proposed a method to generate documents that include the specification of exceptions that could be thrown by a function and the cases in which exceptions occur [6]. Moreno et al. also developed a method to generate summaries that aid understanding, especially focusing on class definitions [7]. These rule-based approaches use detailed information closely related to the structure of source code, allowing for the handling of complicated language-specific structures, and are able to generate accurate comments when their rules closely match the given data. However, if we need new heuristics for a specific variety of source code or project, or want to create a system for a new programming language or natural language, the system creators must manually append these heuristics themselves. This places a burden on the maintainers of comment generation systems, and is a fundamental limitation of rule-based systems.

In contrast to rule-based systems, data-based approaches can be found in the comment generation and code summarization fields. Wong et al. proposed a method for automatic comment generation that extracts comments from entries of programming question and answer sites using information retrieval techniques [8]. There are also methods to generate summaries for code based on automatic text summarization [9] or topic modeling [10] techniques, possibly in combination with the physical actions of expert engineers [11]. This sort of data-based approach has a major merit: if we want to improve the accuracy of the system, we need only increase the amount of "training data" used to construct it. However, existing methods are largely based on retrieving already existing comments, and thus also have significant issues with "data sparseness": if a comment describing the existing code doesn't already exist in the training data, there is no way to generate accurate comments.

SMT, which we describe in the next section, combines the advantages of these approaches, allowing for detailed generation of text, like the rule-based approaches, while being learnable from data, like the data-based approaches.

III. STATISTICAL MACHINE TRANSLATION

SMT is an application of natural language processing (NLP) which discovers the lexical or grammatical relationships between two natural languages (such as English and Japanese), and converts sentences described in one natural language into another natural language. SMT algorithms used in recent years are mainly based on two ideas:

• extract the relationships (usually called "rules") between small fragments of the input and output languages, and
• use these relationships to synthesize the translation results of a new input sentence, using statistical models to decide which translation is best [12], [13], [14].

SMT frameworks have quickly developed in recent years, mainly because of the large amount of data available and the increase in the calculation power of computers. In this section, we describe the foundations of the SMT framework. Especially, we explain the details of phrase-based machine translation (PBMT) and tree-to-string machine translation (T2SMT), which are major SMT frameworks, used in this work to convert source code into pseudo-code.

A. Overview of Statistical Machine Translation

At first, we introduce some notation for the formulation of SMT-based pseudo-code generation. Let s = [s_1, s_2, ..., s_|s|] describe the "source sentence," an array of input tokens, and t = [t_1, t_2, ..., t_|t|] describe the "target sentence," an array of output tokens. The notation |·| represents the length of a sequence. In this paper, we are considering source code to pseudo-code translation, so s represents the tokens in the input source code statement and t represents the words of a pseudo-code sentence. For example, in the small Python to English example in Fig. 2, s is described as the sequence of Python tokens ["if", "x", "%", "5", "==", "0", ":"], and t is described as ["if", "x", "is", "divisible", "by", "5"], with |s| = 7 and |t| = 6.

Fig. 2. Example of Python to English PBMT pseudo-code generation.

The objective of SMT is to generate the most probable target sentence t̂ given a source sentence s. Specifically, we do so by defining a model specifying the conditional probability distribution of t given s, or Pr(t|s), and finding the t̂ that maximizes this probability:

    t̂ ≡ arg max_t Pr(t|s).                                    (1)

The difference between SMT frameworks lies in the method for calculating Pr(t|s). This probability is estimated using a set of source/target sentence pairs called a "parallel corpus." For example, Fig. 1 is one example of the type of parallel corpus targeted in this study, which has one-by-one correspondences between each line in the source code and pseudo-code.

B. Phrase-based Machine Translation

One of the most popular SMT frameworks is phrase-based machine translation (PBMT) [15], which directly uses the phrase-to-phrase relationships between source and target language pairs. In software engineering studies, Karaivanov et al. proposed a method applying the PBMT framework to programming language conversion, which learns the relationships between parts of statements in two programming languages (e.g. "System.out.println" in Java to "Console.WriteLine" in C#) [16].

To describe PBMT modeling, we introduce a set of phrase pairs ϕ = [ϕ_1, ϕ_2, ..., ϕ_|ϕ|]. Each ϕ_n = ⟨s^(n) → t^(n)⟩ represents the n-th subsequence of the source sentence s^(n) and the target subsequence t^(n) associated with the corresponding source subsequence. For example, in Fig. 2, s is separated into |ϕ| = 4 phrases:

    ϕ = [ ⟨"if" → "if"⟩,
          ⟨"x" → "x"⟩,
          ⟨"% 5" → "by 5"⟩,
          ⟨"== 0 :" → "is divisible"⟩ ].                       (2)

ϕ is generated using a "phrase table," which contains various phrase-to-phrase relationships with probabilities, and is extracted from a parallel corpus as explained in §III-D.

Given ϕ, we can generate the target sentence t from the source sentence by concatenating each t^(n). But simply concatenating the t^(n) according to their order cannot produce an accurate target sentence, because the grammatical characteristics (e.g. ordering of tokens) of the source and target languages are usually different. For example, if we concatenate the target side phrases of Equation (2), we obtain the target sentence "if x by 5 is divisible," which is not a fluent English sentence.

To avoid this problem, we need to perform "reordering," which chooses the proper order of phrases in the target sentence. To express reordering, we introduce the phrase alignment a = [a_1, a_2, ..., a_|ϕ|], where each a_n is an integer that represents the order of the n-th phrase pair in the source sentence. In Fig. 2, we assume that a = [1, 2, 4, 3], which means that the first and second phrase pairs keep their own positions, and the third and fourth phrase pairs are swapped before the target sentence is generated.

Formally, the conditional probability Pr(t|s) of the PBMT model is usually estimated using a log-linear model that combines several probabilities calculated over the source and target sentences [17]:

    t̂ ≡ arg max_t Pr(t|s)                                             (3)
      ≃ arg max_{t,ϕ,a} Pr(t, ϕ, a|s)                                 (4)
      ≃ arg max_{t,ϕ,a} exp(w^T f(t, ϕ, a, s)) / Σ_{t′} exp(w^T f(t′, ϕ, a, s))   (5)
      = arg max_{t,ϕ,a} w^T f(t, ϕ, a, s),                            (6)

where f(t, ϕ, a, s) represents feature functions calculated during the translation process, and w represents the weight vector of the corresponding features, which defines the importance of each feature. Intuitively, Equation (6) means that the PBMT model finds the sentence which has the highest "score," calculated as the weighted sum of the features: w^T f(t, ϕ, a, s). Some typical examples of these features include:

• Language model Pr(t), which measures the fluency of the sentence t under the target language, as described in §III-E.
• Phrase translation model, which calculates the product of the probabilities Pr(t^(n)|s^(n)) of the individual phrases in the sentence.
• Reordering model Pr(a|ϕ), which calculates the probability of arranging the phrases in a particular order.

While PBMT's mechanism of translating and reordering short phrases is simple, it also lacks the expressive power to intuitively model pseudo-code generation. For example, we intuitively know that the English string including two wildcards "X is divisible by Y" corresponds to the source code "X % Y == 0:," but PBMT is not capable of using these wildcards. Thus, ϕ in the example in Fig. 2 uses obviously "wrong" correspondences such as ⟨"== 0 :" → "is divisible"⟩. In addition, source code has an inherent hierarchical structure which cannot be exploited when only performing phrase-to-phrase translation and reordering.

C. Tree-to-string Machine Translation

As we mentioned in the previous section, PBMT-based pseudo-code generation cannot take advantage of wildcards or the hierarchical correspondences between two languages. T2SMT uses the parse tree of the source sentence T_s instead of the source tokens s to avoid this problem [18], as shown in Fig. 3.

T2SMT was originally conceived for translating natural languages such as English, and because natural languages include ambiguities, we obtain the parse tree T_s using a probabilistic parser that defines the probability of T_s given s, or Pr(T_s|s) [19], [20]. Thus, the formulation of T2SMT can be obtained by introducing this parsing probability into Equation (1):

    t̂ = arg max_t Pr(t|s)                                     (7)
      ≃ arg max_{t,T_s} Pr(t|T_s) Pr(T_s|s).                   (8)

Fortunately, the probability Pr(T_s|s) can be ignored in our proposed method for pseudo-code generation, because all practical programming languages have a compiler or interpreter that can parse the corresponding source code deterministically, and thus there is only one possible parse tree. So the formulation is less complex:

    t̂ ≃ arg max_t Pr(t|T_s).                                  (9)

Fig. 3 shows the process of T2SMT-based pseudo-code generation. First, we obtain the parse tree T_s by transforming the input statement into a token array using tokenization, and parsing the token array into a parse tree using parsing. The important point of Fig. 3 is that T_s is defined by the grammar of the programming language, and is not an "abstract" syntax tree. This requirement comes from the characteristics of the inner workings of the T2SMT algorithm, and this topic is described in detail in §IV.

The probability Pr(t|T_s) is defined similarly to that of PBMT (usually formulated as a log-linear model), with two major differences:

• In T2SMT, we use the "derivation" d = [d_1, d_2, ..., d_|d|] instead of the phrase pairs ϕ in PBMT. Each d_n = ⟨T_s^(n) → t^(n)⟩ represents the relationship between a source subtree (the gray boxes in Fig. 3) and a target phrase with wildcards. All derivations are connected according to the structure of the original parse tree T_s, and the target sentence is generated by replacing wildcards with their corresponding phrases.
• The T2SMT translation process does not include explicit reordering models, because the ordering of the wildcards in the target phrase naturally defines the target ordering.

D. Extracting SMT Rules

To train the PBMT and T2SMT models, we have to extract the translation rules, which define the relationships between parts of the source and target language sentences, from the given parallel corpus. To do this, we use a "word alignment" between both sentences. Word alignments represent the word-to-word level relationships between the two languages, as shown in Fig. 4. In standard SMT training, word alignments are automatically calculated from the statistics of a parallel corpus by using a probabilistic model and unsupervised machine learning techniques [21], [22].
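The PBMT pipeline of §III-B — translate each source phrase with the phrase table, then reorder the target phrases — can be sketched on the paper's own Fig. 2 example. The phrase pairs and the alignment a = [1, 2, 4, 3] are the ones given in Equation (2) and the surrounding text; the code itself is only a minimal illustration of the mechanism, not the statistical decoder an actual PBMT system uses:

```python
# Minimal sketch of PBMT generation for the Fig. 2 example:
# translate phrase-by-phrase using a phrase table, then place the
# n-th target phrase at the position given by the alignment a.
phrase_table = {            # phrase pairs from Equation (2)
    "if": "if",
    "x": "x",
    "% 5": "by 5",
    "== 0 :": "is divisible",
}

def translate(source_phrases, alignment):
    """Translate each phrase, then put the n-th target phrase at
    1-based position alignment[n], as described in Sec. III-B."""
    targets = [phrase_table[p] for p in source_phrases]
    ordered = [None] * len(targets)
    for n, pos in enumerate(alignment):
        ordered[pos - 1] = targets[n]
    return " ".join(ordered)

phrases = ["if", "x", "% 5", "== 0 :"]

# Monotone concatenation gives the disfluent sentence from the text:
print(translate(phrases, [1, 2, 3, 4]))  # "if x by 5 is divisible"
# With a = [1, 2, 4, 3] the third and fourth phrases are swapped:
print(translate(phrases, [1, 2, 4, 3]))  # "if x is divisible by 5"
```

In a real system the decoder does not receive ϕ and a; it searches over all segmentations, translations, and orderings, scoring candidates with the log-linear model of Equation (6).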
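The claim in §III-C that Pr(T_s|s) can be ignored rests on the fact that a programming language's parser yields exactly one tree for a given statement. This determinism can be checked with Python's standard ast module; note that the paper's T2SMT actually operates on the concrete parse tree defined by the language grammar rather than the abstract syntax tree, so ast is used here only to illustrate the determinism argument:

```python
import ast

# Parsing source code is deterministic: unlike a probabilistic natural
# language parser, the same statement always yields the same tree, so
# Pr(Ts|s) places all probability mass on a single parse.
src = "if x % 5 == 0: pass"
tree1 = ast.dump(ast.parse(src))
tree2 = ast.dump(ast.parse(src))
assert tree1 == tree2  # only one possible parse tree

# The tree also exposes the hierarchical structure that PBMT cannot
# exploit: the condition is a comparison whose left side is a modulo.
node = ast.parse(src).body[0]
print(type(node).__name__)               # If
print(type(node.test).__name__)          # Compare
print(type(node.test.left.op).__name__)  # Mod
```

It is exactly this subtree structure (e.g. the modulo-comparison pattern) that a T2SMT derivation can map to a wildcard phrase such as "X is divisible by Y."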