El corpus ROBOT-TALK para el reconocimiento del origen robótico de textos en español

Lara Alonso Simón [1]; Ana M.ª Fernández-Pampillón Cesteros [1]

[1] Universidad Complutense de Madrid, Madrid, Spain
Published in: Alfinge: revista de filología, ISSN 0213-1854, No. 37, 2025, pp. 9-32
Language: Spanish
DOI: 10.21071/arf.v37i.18687
Parallel titles:
- The ROBOT-TALK corpus for recognising the robotic origin of Spanish texts
Abstract
- español
  ROBOT-TALK es un corpus monitor comparable de textos humanos en español y su contrapartida escrita por grandes modelos generativos del lenguaje (LLM). Su objetivo es permitir el estudio de posibles rasgos lingüísticos diferenciadores entre textos generados automáticamente y los escritos por las personas. El corpus constituye un recurso lingüístico en español para el reconocimiento de autoría humana vs. “robótica” de textos y está diseñado para (1) permitir estudios lingüísticos contrastivos entre LLM y humanos o entre LLM, (2) estudiar la evolución lingüística de los LLM, y (3) servir de soporte en la creación de métodos lingüísticos y herramientas informáticas para la atribución de la autoría humana o automática. Contiene textos de tres géneros diferentes en la lengua escrita (artículos científicos, noticias y reseñas). Cada par de textos, de longitud similar, trata el mismo tema para poder comparar entre dos tipos de escritura y analizar con fiabilidad las características discursivas de los textos. Se recogen muestras de gpt-3, text-davinci-003, babbage-002, curie, gpt-3.5-turbo, gpt-4, bloom, bard, gemini-2.0-flash, gemini-2.5-flash, falcon-180B-chat, Mixtral-8x7B-Instruct-v0.1, claude-3-5-sonnet-20240620, claude-3-7-sonnet-20250219 y DeepSeek-V3. El etiquetado en XML de los textos del corpus permite su consulta con cualquier herramienta de análisis textual que soporte este estándar de marcado. ROBOT-TALK se ha utilizado con la herramienta SketchEngine para realizar (1) un análisis lingüístico con el fin de encontrar los rasgos más salientes que caracterizan los textos generados por los LLM; (2) un análisis estadístico de rasgos lingüísticos propios de los LLM frente a un posible estilo general humano en español; (3) un análisis lingüístico forense para verificar la fiabilidad en la atribución de autoría; (4) la construcción de clasificadores automáticos binarios y multiclase basados en aprendizaje automático para distinguir textos robóticos y humanos.
- English
  ROBOT-TALK is a comparable corpus of human texts in Spanish and their counterparts written by large language models (LLMs). Its objective is to enable the study of possible linguistic features that differentiate between automatically generated texts and those written by humans. The corpus is a Spanish language resource for recognising human vs. “robotic” authorship of texts and it is designed to (1) enable contrastive linguistic studies between LLMs and humans or between LLMs, (2) study the linguistic evolution of LLMs, and (3) support the creation of linguistic methods and computational tools for attributing human or automatic authorship. It contains texts of three different genres in written language (scientific articles, news articles, and reviews). Each pair of texts, of similar length, deals with the same topic so that the two types of writing can be compared and the discursive characteristics of the texts can be reliably analysed. Samples are collected from gpt-3, text-davinci-003, babbage-002, curie, gpt-3.5-turbo, gpt-4, bloom, bard, gemini-2.0-flash, gemini-2.5-flash, falcon-180B-chat, Mixtral-8x7B-Instruct-v0.1, claude-3-5-sonnet-20240620, claude-3-7-sonnet-20250219 and DeepSeek-V3. The XML tagging of the texts in the corpus allows them to be queried with any text analysis tool that supports this markup standard. ROBOT-TALK has been used with the SketchEngine tool to perform (1) a linguistic analysis to find the most salient features that characterise the texts generated by LLMs; (2) a statistical analysis of linguistic features specific to LLMs compared to a possible general human style in Spanish; (3) a forensic linguistic analysis to verify the reliability of authorship attribution; and (4) the construction of automatic binary and multi-class classifiers based on machine learning to distinguish between robotic and human texts.