2312.08976.md

Background

  • Background The paper explores a new task of code generation with external entities, requiring the model to reuse functions defined in the project during generation. The authors find that existing retrieval-augmented large language models struggle to assign relevance scores to similarly named entities and cannot simply expand the entity set in the prompt, because the limited context window cannot accommodate the arbitrarily large context of an entire project.

  • Existing Work The paper notes that, although large pre-trained language models such as T5 and GPT-3 have made significant strides in natural language tasks and code generation, they fall short in utilizing external entities from knowledge bases during generation. In traditional code generation settings, for instance, when a solution must call functions defined elsewhere in the project, models often fail to use those functions effectively.

Core Contributions

  • Introduced a new architecture for entity-augmented decoding
    • Challenge 1: How to perform direct entity lookup in the dynamic vocabulary during code generation. To address this, the model integrates the entity retriever directly into the language model's decoder instead of the encoder. This lets the generator score all entities at every decoding step, eliminating the top-k approximation used by earlier retrieval-augmented architectures (a minimal sketch of such a decoding step follows this list).

    • Challenge 2: How to handle particularly diverse samples that may simultaneously refer to documents on different topics. The proposed model lets the decoder condition not only on the input but also on the output generated so far, which helps with such diverse samples and allows the model to outperform common baselines across several scenarios, including project-level code generation as well as Bash and SQL scripting.
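
Below is a minimal sketch (not the authors' code) of how a decoder-integrated entity lookup can work: at each step the decoder state is scored against the regular vocabulary and against every project entity's embedding, forming one dynamic vocabulary with no top-k truncation. The helper names (`step_logits`, `entity_proj`) and tensor sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def step_logits(hidden: torch.Tensor,
                lm_head: torch.nn.Linear,
                entity_embeddings: torch.Tensor,
                entity_proj: torch.nn.Linear) -> torch.Tensor:
    """Return logits over [regular vocab | all project entities].

    hidden:            (batch, d_model) decoder state at the current step
    lm_head:           maps d_model -> |V| regular-token logits
    entity_embeddings: (num_entities, d_ent) embeddings from the retriever
    entity_proj:       maps d_model -> d_ent so states and entities are comparable
    """
    vocab_logits = lm_head(hidden)                # (batch, |V|)
    query = entity_proj(hidden)                   # (batch, d_ent)
    entity_logits = query @ entity_embeddings.T   # (batch, num_entities)
    # No top-k truncation: every project entity is scored at every step.
    return torch.cat([vocab_logits, entity_logits], dim=-1)


# Toy usage with made-up sizes.
d_model, d_ent, vocab, n_ent = 16, 8, 100, 5
lm_head = torch.nn.Linear(d_model, vocab)
entity_proj = torch.nn.Linear(d_model, d_ent)
hidden = torch.randn(2, d_model)
entities = torch.randn(n_ent, d_ent)
logits = step_logits(hidden, lm_head, entities, entity_proj)
probs = F.softmax(logits, dim=-1)                 # (2, vocab + n_ent)
next_ids = probs.argmax(dim=-1)  # ids >= vocab denote out-of-vocabulary entity tokens
```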

Implementation and Deployment

The model proposed in the paper outperforms common baselines in several scenarios, particularly project-level code generation and Bash and SQL scripting. It comprises two components: an entity retriever and a generator (an LLM). The entity retriever converts entity descriptions into entity embeddings, using cross-attention with the generator's last input token representations. The generator is a general-purpose LLM for autoregressive decoding, augmented with the retriever: it takes the input string and the entity embeddings and produces a code string. The resulting string contains regular tokens and out-of-vocabulary entity tokens, which are converted back to regular tokens with simple post-processing. The authors build on the T5 architecture, which improves the summarization of entity descriptions through cross-attention with the generator. A sketch of this flow appears below.
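
The following is a hedged end-to-end sketch under assumed interfaces: entity descriptions are embedded by a retriever, the generator decodes a string that may contain entity tokens, and a post-processing step maps those tokens back to real identifiers. The placeholder format `<ENT_i>` and all function names are illustrative assumptions, not the paper's API.

```python
import re
from typing import Any, Callable, Dict, List


def resolve_entity_tokens(generated: str, entity_names: List[str]) -> str:
    """Post-processing: replace out-of-vocabulary entity tokens such as
    "<ENT_3>" with the corresponding project entity's real identifier."""
    def substitute(match: "re.Match") -> str:
        idx = int(match.group(1))
        return entity_names[idx] if idx < len(entity_names) else match.group(0)

    return re.sub(r"<ENT_(\d+)>", substitute, generated)


def generate_with_entities(prompt: str,
                           entity_descriptions: Dict[str, str],
                           embed_entities: Callable[[List[str]], Any],
                           decode: Callable[[str, Any], str]) -> str:
    """Pipeline: embed entity descriptions with the retriever, decode with the
    entity-augmented generator, then resolve entity tokens in the output."""
    names = list(entity_descriptions)
    embeddings = embed_entities([entity_descriptions[n] for n in names])
    raw_output = decode(prompt, embeddings)  # may contain <ENT_i> tokens
    return resolve_entity_tokens(raw_output, names)


# The post-processing step on its own:
print(resolve_entity_tokens("cfg = <ENT_0>(<ENT_1>())", ["load_config", "parse_args"]))
# -> cfg = load_config(parse_args())
```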

Summary

The paper proposes an innovative architecture to tackle the task of code generation with external entities. The architecture can scale without sacrificing performance, and with the integration of the entity retriever into the decoder rather than the encoder, the model can inspect all entities at once and directly use them. The new architecture not only resolves the limitations of existing models but also demonstrates its superiority in several experimental scenarios.