GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. Anthology ID: W18-5446. DOI: 10.18653/v1/W18-5446.

Abstract: Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.
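The online platform ranks submissions by a single benchmark score aggregated over the nine tasks. As a rough illustration only (not the platform's actual implementation), the overall score can be sketched as a macro-average of per-task metrics, where a task that reports two metrics (e.g. F1 and accuracy) first contributes the mean of the pair; the task names below are the real GLUE tasks, but the numbers are made up:

```python
def glue_score(task_metrics: dict) -> float:
    """Macro-average over tasks; a task with multiple metrics
    contributes the mean of those metrics (a common convention,
    assumed here rather than quoted from the paper)."""
    per_task = [sum(m) / len(m) for m in task_metrics.values()]
    return sum(per_task) / len(per_task)

# Hypothetical per-task results on the nine GLUE tasks.
results = {
    "CoLA":  (30.0,),        # Matthews correlation
    "SST-2": (90.0,),        # accuracy
    "MRPC":  (80.0, 71.0),   # F1, accuracy
    "QQP":   (60.0, 81.0),   # F1, accuracy
    "STS-B": (70.0, 69.0),   # Pearson, Spearman correlation
    "MNLI":  (72.0,),        # accuracy
    "QNLI":  (75.0,),        # accuracy
    "RTE":   (58.0,),        # accuracy
    "WNLI":  (65.0,),        # accuracy
}

print(round(glue_score(results), 2))  # → 67.28
```

Averaging across tasks, rather than pooling examples, is what makes the low-resource tasks (e.g. RTE, WNLI) count as much as the large ones, rewarding sample-efficient transfer.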