The PerlGPT MetaCPAN Curator
Building a Training Dataset to Fine-Tune the Perl LLM
Keywords:
perl, perlgpt, llama, codellama, llm, ai, artificialintelligence, generativeai

Abstract
PerlGPT is a proposed large language model (LLM) comparable to ChatGPT 4.0, further trained exclusively on Perl-related content. PerlGPT will be based on Meta's Code Llama or an equivalent language model, with all new components implemented in Perl where possible and released as free-and-open-source software (FOSS), unlike ChatGPT and other proprietary LLM systems. Development of PerlGPT version 1.0 will consist of fine-tuning a 13B-parameter (or larger) base language model on Perl-related stimulus/response pairs curated from CPAN, the Comprehensive Perl Archive Network. The goal of version 1.0 is to deliver an LLM capable of generating pure-Perl source code in collaboration with a Perl programmer, similar to Microsoft Bing and GitHub Copilot. The PerlGPT MetaCPAN Curator is the component responsible for searching through all Perl distributions (software releases) on CPAN and choosing only the highest-quality Perl code for the PerlGPT training dataset. The MetaCPAN Curator is itself written in parallelized Perl, and is designed according to the same high standards of source code quality used to judge other Perl distributions.
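As a sketch of the curation workflow the abstract describes, the fragment below enumerates the latest release of every CPAN distribution through the MetaCPAN API, applies a simple quality gate, and forks parallel workers to process each candidate. MetaCPAN::Client and Parallel::ForkManager are real CPAN modules, but everything else here is an illustrative assumption: in particular, the 95% CPAN Testers pass-rate threshold is a hypothetical stand-in for whatever quality criteria the actual Curator uses.

#!/usr/bin/env perl
# Illustrative sketch only: enumerate the latest release of each CPAN
# distribution via the MetaCPAN API, keep releases passing a simple
# (hypothetical) quality gate, and process candidates in parallel.
use strict;
use warnings;
use feature 'say';
use MetaCPAN::Client;
use Parallel::ForkManager;

my $mcpan = MetaCPAN::Client->new();
my $pm    = Parallel::ForkManager->new(8);    # up to 8 worker processes

# Only the latest (non-superseded) release of each distribution.
my $releases = $mcpan->release( { status => 'latest' } );

while ( my $release = $releases->next ) {
    # Hypothetical quality gate: CPAN Testers summary must be >= 95% PASS.
    my $tests = $release->tests or next;      # { pass, fail, na, unknown }
    my $total = ( $tests->{pass} // 0 ) + ( $tests->{fail} // 0 );
    next unless $total && $tests->{pass} / $total >= 0.95;

    $pm->start and next;    # fork a worker; parent continues the scan
    # Worker: download and inspect the release, then emit stimulus/response
    # pairs for the training dataset (details omitted in this sketch).
    say 'candidate: ', $release->distribution, ' ', $release->version;
    $pm->finish;
}
$pm->wait_all_children;

Forking one worker per candidate keeps the slow, I/O-bound download-and-inspect step off the main query loop, which is what makes a sweep of all of CPAN tractable.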
License
Copyright (c) 2024 William Braswell (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.