The PerlGPT MetaCPAN Curator

Building a Training Dataset to Fine-Tune the Perl LLM

Authors

  • William Braswell

Keywords

perl, perlgpt, llama, codellama, llm, ai, artificialintelligence, generativeai

Abstract

PerlGPT is a proposed large language model (LLM) comparable to ChatGPT 4.0, further trained exclusively on Perl-related content. PerlGPT will be based on Meta’s Code Llama or an equivalent language model, with all new components implemented in Perl where possible and released as free-and-open-source software (FOSS), unlike ChatGPT and other proprietary LLM systems. Development of PerlGPT version 1.0 will consist of training a 13B-parameter (or larger) base language model on Perl-related stimulus/response pairs curated from CPAN, the Comprehensive Perl Archive Network. The goal of version 1.0 is to deliver an LLM capable of generating pure-Perl source code in collaboration with a Perl programmer, similar to Microsoft Bing and GitHub Copilot. The PerlGPT MetaCPAN Curator is the component responsible for searching all Perl distributions (software releases) on CPAN and selecting only the highest-quality Perl code for the PerlGPT training dataset. The MetaCPAN Curator is itself written in parallelized Perl and is designed to the same high standards of source-code quality used to judge other Perl distributions.

Published

11/17/2024

Section

Full Paper (10-36 pages + References)

Categories