Learning-based Identification of Coding Best Practices from Software Documentation

Neela Sawant, AWS AI, Amazon, Bangalore, India (nsawant@amazon.com)
Srinivasan H Sengamedu, AWS AI, Amazon, Seattle, USA (sengamed@amazon.com)

Abstract—Automatic identification of coding best practices can scale the development of code and application analyzers. We present Doc2BP, a deep learning tool to identify coding best practices in software documentation. Natural language descriptions are mapped to an informative embedding space, optimized under the dual objectives of binary and few shot classification. The binary objective powers general classification into known best practice categories using a deep learning classifier. The few shot objective facilitates example-based classification into novel categories by matching embeddings with user-provided examples at run-time, without having to retrain the underlying model. We analyze the effects of manually and synthetically labeled examples, context, and cross-domain information.

We have applied Doc2BP to Java, Python, AWS Java SDK, and AWS CloudFormation documentations. With respect to prior works that primarily leverage keyword heuristics and our own parts of speech pattern baselines, we obtain 3-5% F1 score improvement for Java and Python, and 15-20% for AWS Java SDK and AWS CloudFormation. Experiments with four few shot use-cases show promising results (5-shot accuracy of 99%+ for Java NullPointerException and AWS Java metrics, 65% for AWS CloudFormation numerics, and 35% for Python best practices). Doc2BP has contributed new rules and improved specifications in Amazon's code and application analyzers: (a) 500+ new checks in cfn-lint, an open-source AWS CloudFormation linter, (b) over 97% automated coverage of metrics APIs and related practices in Amazon DevOps Guru, (c) support for nullable AWS APIs in Amazon CodeGuru's Java NullPointerException (NPE) detector, (d) 200+ new best practices for Java, Python, and respective AWS SDKs in Amazon CodeGuru, and (e) 2% reduction in false positives in Amazon CodeGuru's Java resource leak detector.

Index Terms—natural language understanding, information extraction, embeddings, deep learning, few shot learning

I. INTRODUCTION

Creating quality software requires an in-depth knowledge of coding best practices on various aspects such as data structures, error handling, resource management, multiprocessing, and security. Coding best practices need to be identified before they can be incorporated in developer code or implemented as static analyzer checks. However, identification is non-trivial since best practice descriptions can be fragmented in documentation and hard to find due to significant differences in keywords, form, and semantics [1]-[3]. For example,

• "Document.getText Method Now Allows for Partial Returns. For more efficient use, callers should invoke segment.setPartialReturn(true) and be prepared to receive a portion at a time" (Java 11 Swing API reference)
• "It is good programming practice to not use mutable objects as default values. Instead, use None as the default value and inside the function, check if the parameter is None and create a new list/ dictionary/ whatever if it is" (Python 3.7 tutorial)
• "When using the DynamoDBMapper to add or edit signed (or encrypted and signed) items, configure it to use a save behavior, such as PUT, that includes all attributes. Otherwise, you might not be able to decrypt your data" (AWS Java SDK guide)
• "The minimum maintenance window is 60 minutes" (AWS CloudFormation user guide).

Our goal is to automate best practice identification from the documentation on various languages, frameworks, and applications to help scale the development of related code and application analyzers. Identified best practices can be implemented as new static analysis rules or used to enhance existing rules by covering more APIs and properties. Our primary use-case is Amazon CodeGuru (https://aws.amazon.com/codeguru/) [4], a developer tool that provides intelligent recommendations to improve code quality and identify an application's most expensive lines of code. The first three coding best practices described above were implemented as new rules in CodeGuru's Java, Python, and AWS Java SDK code analyzers, respectively. The fourth practice was used to update an existing rule in cfn-lint, a linter for AWS CloudFormation [5].

Prior works in automated best practice identification primarily rely on keyword heuristics curated on a case-by-case basis, for example, extracting warnings and recommendations by matching keywords such as 'must', 'should', 'require', 'encourage', and 'recommend' [2], [6], [7]. However, such heuristics fail to generalize for various reasons.

• Keyword mismatch - Keywords may differ across use-cases, for example in describing nullable APIs (null in Java and None in Python) or resource leaks (terminate or kill in Python and tear down in AWS CloudFormation instead of shutdown and close in Java and AWS Java).
• Context sensitivity - Best practices may be contextual. For example, AWS SDK for Java [8] describes over 2000 metrics APIs to monitor the health and behavior of AWS services. The text is not consistently structured, requiring the context of each metric to be inferred before extracting related best practice descriptions. For example, for the Amazon Lex^1 RuntimeSystemErrors metric, relevant practices include "The response code range for a system error is 500 to 599", "Valid dimension for the PostContent operation with the Text or Speech InputMode: BotName, BotAlias, Operation, InputMode", and "Unit: Count". For the AWS Lambda^2 Errors metric, relevant practices include "Sum statistic", "To calculate the error rate, divide the value of Errors by the value of Invocations", etc.
• Non-keyword patterns - Best practice descriptions may not be keyword based. For example, AWS CloudFormation [9] describes value constraints on resources and properties of AWS cloud services such as (a) "TargetValue range is 8.515920e-109 to 1.174271e+108 (Base 10) or 2e-360 to 2e360 (Base 2)", (b) "The minimum window is a 60 minute", (c) "Up to five VPC security group IDs, of the form sg-xxxxxxxx", (d) "The total number of allowed resources is 250". These practices are numeric patterns.

Our main contribution is Doc2BP, a deep learning tool to identify best practice descriptions from software documentation. The tool is aimed at reducing the overhead in maintaining multiple heuristics and simplifying new rule creation for different programming languages and frameworks. The tool supports two modes, general classification and example-based classification, powered by a common embedding space for natural language descriptions and jointly optimized under the dual objectives of binary and few shot classification respectively. The binary classification objective ensures coverage of known categories in available training data via a deep learning classifier, whereas the few shot objective allows classification into previously unseen categories based on the embedding similarity with a few user-labeled examples at inference time, without retraining the underlying deep learning model.

We have extensively applied Doc2BP on Java, Python, AWS Java SDK, and AWS CloudFormation documentations. The choice of documentations reflects the domains supported by Amazon CodeGuru at the time of writing this paper, and offers a good mix of general-purpose and specialized domains to study. Section III motivates the learning based approach using a case study on AWS CloudFormation. Sections IV and V present the representation learning formulation and overall Doc2BP system. Section VI covers extensive experiments with manually and synthetically labeled examples, context, and cross-domain information. With respect to prior keyword heuristics and our own parts of speech (POS) pattern baselines, we obtain 3-5% F1 score improvement in best practice detection for Java and Python and 15-20% for AWS Java SDK and AWS CloudFormation. We experiment with four use-cases in few shot setting with promising results (5-shot accuracy of 99%+ for Java NullPointerException and AWS Java metrics, 65% for AWS CloudFormation numerics, and 35% for Python best practices). These results indicate that Doc2BP is an effective solution for both general-purpose and specialized requirements. Section VII details the real-world impact of Doc2BP on multiple code and application analyzers such as cfn-lint - an AWS CloudFormation linter [5], Amazon DevOps Guru - a cloud operations service to improve application availability [10], and Amazon CodeGuru - an automated code review tool for multiple programming languages and frameworks including Java, Python, and respective AWS SDKs [4].

^1 https://docs.aws.amazon.com/lex/latest/dg/monitoring-aws-lex-cloudwatch.html
^2 https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html

II. RELATED WORK

We now present prior work in extracting information from software documentation as well as related work in deep learning, natural language understanding, and few shot learning.

A. Information Extraction from Software Documentation

Monperrus et al. conducted a formal study of the types of knowledge in software documentation [6]. They proposed a list of keywords based on a manual review of Java documentation, RFC 2119 - "Key words for use in RFCs to Indicate Requirement Levels" [11], Oracle technical reports [12], and research papers [13]. Use-cases include extraction of method call practices ("Subclasses should not call this internal method"), subclassing practices ("Subclasses may override any of the following methods: isLabelProperty, getImage, getText, dispose"), or synchronization practices ("If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally"). This approach has been reused in other general-purpose studies [2], [3], [14]-[16] and extended for specialized requirements such as interrupt conditions [17] and performance concerns [7], [18]. For performance concerns, keywords can be fast, slow, expensive, cheap, performance, speedup, efficient, etc. and their inflections (e.g., efficiency, efficiently) [7], resulting in findings such as "Raising this value decreases the number of seeds found, which makes mean shift computationally cheaper". Table I lists popular prior work.

Few studies have used specialized natural language processing for healthcare [19], resource and method handling [20], [21], bug report analysis [22], [23], and software categorization [24]. A recent survey [25] indicates that less than 5% of research in security patterns uses natural language processing, for example, to extract access control requirements [26], [27], privacy policy visualization and summarization [28], inconsistent security requirements detection [29], and mining cyber threats from online documents [30], [31], and logs [32].

TABLE I
KEYWORD HEURISTICS IN POPULAR SOFTWARE DOCUMENTATION MINING LITERATURE

Monperrus et al. [6]
  ControlFlow:Conditional   "(assum|only|debug|restrict|never|condition|strict|necessar|portab|strong)"
  ControlFlow:Temporal      "(call|invo|before|after|between|once|prior)"
  Recommend:Warning         "(warn|aware|error|note)"
  Recommend:Affirmative     "(must|mandat|require|shall|should|encourage|recommend|may)"
  Recommend:Alternative     "(desir|alternativ|addition)"
  Performance:Performance   "(performan|efficien|fast|quick|better|best)"
  Concurrency:Concurrency   "(concurren|synchron|lock|thread|simultaneous)"
  Subclassing:Subclassing   "(extend|overrid|overload|overwrit|re.?implement|sub.?class|super|inherit)"

Li et al. [2]
  ControlFlow:Conditional   "(under the condition|whether|if|when|assume that)"
  ControlFlow:Temporal      "(before|after)"
  Recommend:Warning         "(insecure|susceptible|error|null|exception|unavailable|not thread safe|illegal|inappropriate)"
  Recommend:Affirmative     "(must|should|have to|need to)"
  Recommend:Alternative     "(instead of|rather than|otherwise)"
  Recommend:Recommendation  "(deprecate|better to|best to|recommended|less desirable|discourage)"
  Recommend:Negative        "(do not|be not|never)"
  Recommend:Emphasis        "(none|only|always)"
  Recommend:Note            "(note that|notably|caution)"

Tao 2020 [7]
  Performance:Performance   "(fast|slow|expensive|cheap|performan|speedup|computation|accelerat|intensi|scalable|efficien)"

B. Related Work in Machine Learning

We now discuss concepts related to the Doc2BP formulation.

1) Deep Learning and Natural Language Understanding: Deep learning has achieved a major breakthrough in many fields [33]-[36]. The seminal survey by Allamanis et al. [37] covers many applications such as code search, code completion, code generation, and documentation improvement. Subsequently many powerful neural models such as CodeBERT [38], PLBART [39], and CodeT5 [40] have been applied to problems of bug detection [41], code review generation [42], and code and documentation synthesis [43], [44].
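The keyword heuristics of Table I amount to case-insensitive regular-expression alternations over word stems. A minimal sketch of how such a heuristic matcher could look (only a small illustrative subset of the Table I stems is included; category names follow the table, test sentences are our own):

```python
import re

# Keyword stems adapted from Table I (Monperrus et al. [6]); a partial,
# illustrative subset, not the full published heuristic set.
HEURISTICS = {
    "Recommend:Affirmative": r"\b(must|mandat|require|shall|should|encourage|recommend)\w*",
    "Recommend:Warning": r"\b(warn|aware|error|note)\w*",
    "Performance:Performance": r"\b(performan|efficien|fast|quick|better|best)\w*",
}

def match_categories(sentence):
    """Return the heuristic categories whose keyword pattern fires on the sentence."""
    return [cat for cat, pat in HEURISTICS.items()
            if re.search(pat, sentence, flags=re.IGNORECASE)]

print(match_categories("Callers should invoke setPartialReturn(true)."))
```

Note how a purely numeric constraint such as "The minimum maintenance window is 60 minutes" fires no category at all, which is exactly the generalization gap the paper targets.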
Detecting best practices, recommendations, and warnings is related to traditional natural language understanding tasks such as sentiment analysis [45]-[48] and suggestion mining [49]-[51]. In the general literature, these tasks have been modeled using classical approaches such as parts of speech [52]-[54] and deep learning [35], [36], [55]. We have not seen any prior work generally applying deep learning or advanced natural language understanding for best practice identification from software documentation. Section 5.6 of the Allamanis survey [37] states 'Also out-of-scope is work that combines natural language information with APIs' and refers readers to investigate work already discussed above.

2) Few Shot Learning: Few-shot learning classifies new data having seen only a few training examples [56]. Few shot learning can be made tractable by incorporating in pre-training [57] knowledge from similar tasks, useful parameters, or data [58]-[60]. Similarity based algorithms such as matching networks [61] or prototypical networks [62] learn embeddings from training tasks that allow classification of unseen classes with few examples. Our approach is inspired by matching networks [61] and weakly supervised training [58].

III. FROM KEYWORDS TO LEARNING SYNTAX PATTERNS

We conducted a case study with AWS CloudFormation (a specialized framework for 200+ AWS services) to motivate learning based approaches for detecting non-keyword patterns given a few examples. Based on a manual documentation review of two AWS services - CloudTrail^3 and CodeCommit^4 - we extracted about 50 best practice examples ranging from general recommendations (e.g., "You can only use this property to add code when creating a repository with an AWS CloudFormation template at creation time"; "This property cannot be used for updating code to an existing repository"), to alpha-numeric value constraints (e.g., "Be between 3 and 128 characters"; "Start with a letter or number, and end with a letter or number"). We chose parts of speech (POS) representations, mapping each word to its POS tag according to its syntactic role in the sentence (noun, pronoun, adjective, determiner, verb) [63]. We then applied PrefixSpan [64], a rule induction algorithm, to infer frequent POS subsequence patterns. Given two sequences x = (x_1, x_2, ..., x_m) and y = (y_1, y_2, ..., y_n), x is called a subsequence of y, denoted as x ⊆ y, if there exist integers 1 ≤ a_1 < a_2 < ... < a_m ≤ n such that x_1 = y_{a_1}, x_2 = y_{a_2}, ..., x_m = y_{a_m}. Figure 1 shows the frequent POS subsequence patterns learned from the selected examples.

Fig. 1. POS patterns learned from documentation of two AWS services.

We observed the following:

1) Ability to Replace Keyword-based Solutions: POS subsequences ['MD', 'VB'] and ['MD', 'RB'] occur in sentences whose POS sequence matches the regular expression {<MD> <.*>* <VB|RB>}. MD, VB, and RB are POS tags representing modal structure, base verbs, and adverbs respectively. The pattern triggers on all imperative sentences, for example, "The value must be no more than 255 characters". Comparing the detections from the imperative POS pattern with the keyword heuristics in Table I, we find a significant overlap, as seen in Figure 2. We find that the imperative pattern detects about 50% of all detections in the affirmative category and over 40% of detections by the conditional category.

^3 https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-cloudtrail-trail.html
^4 https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-codecommit-repository.html
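The subsequence relation x ⊆ y defined above is straightforward to operationalize. A minimal sketch in Python, with the POS tags hand-assigned here for illustration (a real pipeline would obtain them from a POS tagger):

```python
def is_subsequence(x, y):
    """Check x ⊆ y: all items of x appear in y in order, gaps allowed."""
    it = iter(y)
    return all(item in it for item in x)  # `in` advances the iterator, preserving order

# POS tags for "The value must be no more than 255 characters",
# hand-assigned for illustration (a real pipeline would run a POS tagger).
tags = ["DT", "NN", "MD", "VB", "DT", "JJR", "IN", "CD", "NNS"]

print(is_subsequence(["MD", "VB"], tags))        # imperative pattern
print(is_subsequence(["DT", "NN", "CD"], tags))  # numeric-constraint pattern
```

Both patterns fire on this sentence, matching the paper's observation that a single sentence can trigger the imperative and the numeric subsequences at once.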
2) Ability to Capture Non-Keyword Information: The subsequence ['DT', 'NN', 'CD'] (in regular expression format, {<DT> <.*>* <NN> <.*>* <CD>}) is a pattern containing CD, i.e., a cardinal digit. It matches a wide variety of numeric value constraints, for example, (a) "The maximum length is 200 characters", (b) "The number of resources cannot exceed 250 across events", (c) "The count of allowed data resources is 250", and (d) "This can be a number from 1 - 1024". The ability to detect non-keyword patterns is an additional benefit of the learning based approach.

Fig. 2. Cooccurrence analysis between the imperative POS pattern and multiple keyword heuristics shows that a single learned rule can significantly replace detections from multiple heuristics. The imperative pattern detects about 50% of all detections in the affirmative category and over 40% of detections in the conditional category. The diagonal is suppressed to improve visual contrast.

To summarize, learning based algorithms can infer useful patterns from few examples, replace or augment keyword heuristics, and capture non-keyword requirements. This is possible because natural language constructs as well as software documentation exhibit reasonably consistent structures. This insight has led to our detailed deep learning formulation, described below.

IV. REPRESENTATION LEARNING FRAMEWORK

We are given a training dataset S containing M labeled examples, S = {(x_1, y_1), ..., (x_M, y_M)}, where x_i ∈ R^D is a D-dimensional feature vector and y_i ∈ {0, 1} is a best practice label. For a subset S_c containing M_c known best practices, S_c ⊂ S = {(x_j, y_j) | y_j = 1}_{j=1}^{M_c}, we are also given an additional label z_j ∈ {1, ..., N} to denote the category of best practice from N categories known at training time, for example, related to performance, security, subclassing, etc. For simplicity, we denote it as S_c = {(x_j, z_j)}_{j=1}^{M_c}.

The core idea is to learn a metric space where each example can be encoded into a smaller L-dimensional dense representation (i.e., an embedding) with function f_φ : R^D → R^L, L ≤ D, with φ representing the learnable embedding parameters. We optimize the embedding space under the dual objectives of binary and few shot classification. The binary classification objective ensures coverage of known categories in the available training data, by training a deep learning classifier for general classification. By default, such a classifier cannot adapt to categories not included in the training set. To avoid model retraining for emerging requirements, we introduce a few shot learning capability that performs example-based classification, i.e., predicting new classes based on embedding similarity with a few user-labeled examples at run time, without modifying model parameters. This objective encourages examples belonging to the same category to be co-located in the embedding space, thereby facilitating similarity based, non-parametric classification.

A. Binary Classification

Let g be any binary classifier parameterized via θ. If g is a logistic regression parameterized by θ = {w, b}, label y can be modeled as a function of the input embedding f_φ(x) as follows:

    ŷ = P(y = 1 | x; φ, θ) = σ(w^T f_φ(x) + b)    (1)

where σ(a) = 1/(1 + e^{-a}) is the logistic sigmoid function. Loss between the predicted and actual probability distributions (ŷ and y) is quantified using binary cross entropy (BCE):

    L_BCE = − Σ_{i=1}^{M} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]    (2)

B. Few Shot Classification

Our formulation of example-based classification is founded on two ideas. First, for the model to generalize to the test environment given a small number of new labeled examples (few shot), it should be trained under a similar setting. Secondly, the model should classify new test examples without any changes to the model parameters. For these purposes, we adopt the following episodic training strategy.

We create an n-way-k-shot episodic training, where the labeled dataset S_c = {(x_j, z_j)}_{j=1}^{M_c} is converted into several training episodes (e.g., mini-batches) by subsampling n training classes as well as k examples within each class. Each episode consists of n × k labeled examples (support set B) and an additional t examples (test set), also sampled from the same n classes. The test label z is modeled based on the embedding similarity of the test and the support examples. The similarity function between two embeddings, say a(., .), can be any attention kernel, such as a kernel density estimator or a k-nearest neighbor, that produces a similarity score. Similar to matching networks [61], we model a as a soft-max over the cosine similarity c(., .) of embeddings, i.e.,

    a(x, x'; φ) = e^{c(f_φ(x), f_φ(x'))} / Σ_{x''} e^{c(f_φ(x), f_φ(x''))}    (3)

The label distribution ẑ is a function of class similarities:

    ẑ = P(z | x; φ) = Σ_{(x', z') ∈ B} a(x, x'; φ) z'    (4)

Loss between the predicted and actual probability distributions (ẑ and z) is quantified using general cross entropy (CE):

    L_CE = − Σ_{i=1}^{M_c} z_i log(ẑ_i)    (5)
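The two objectives above can be sketched in a few lines of numpy. This is an illustrative toy (random data, a linear map standing in for the learned embedding f_φ, one-hot support labels), not the Doc2BP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, n, k = 16, 4, 3, 5           # input dim, embedding dim, n-way, k-shot

W_embed = rng.normal(size=(D, L))  # toy stand-in for the learned f_phi
w, b = rng.normal(size=L), 0.0     # logistic-regression head, theta = {w, b}

def embed(x):
    return x @ W_embed             # f_phi(x): R^D -> R^L

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# --- Binary objective (Eqs. 1-2) ---
def bce_loss(X, y):
    y_hat = sigmoid(embed(X) @ w + b)                                # Eq. 1
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # Eq. 2

# --- Few shot objective (Eqs. 3-4), matching-network style ---
def cosine(u, V):
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u) + 1e-9)

def few_shot_predict(x, support_X, support_z):
    """Label distribution for x given an n*k support set B (Eqs. 3-4)."""
    sims = cosine(embed(x), embed(support_X))
    a = np.exp(sims) / np.exp(sims).sum()  # Eq. 3: softmax over cosine similarities
    onehot = np.eye(n)[support_z]          # support labels z' as one-hot rows
    return a @ onehot                      # Eq. 4: similarity-weighted label mix

support_X = rng.normal(size=(n * k, D))
support_z = np.repeat(np.arange(n), k)     # n-way-k-shot support labels
z_hat = few_shot_predict(rng.normal(size=D), support_X, support_z)
print(np.round(z_hat, 3))                  # a proper distribution over the n classes
```

Note that `few_shot_predict` touches no trainable state beyond the frozen embedding, which is the point of Eq. (4): new categories are handled purely by similarity to the user-provided support set.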