AI and the Race to Decipher Lost Languages

Shreshta Agarwal '28

Throughout history, language has been a vital component of the human experience, used by every civilization in some form. Whether through oral or written means, humans rely on a shared channel of communication to express feelings and needs and to establish relationships with the world around them, securing survival through social support and protection against external threats. Communities construct and share unique languages, which inform researchers about the history and culture of a region. Despite the importance of language, only around 6,000 languages remain in use today, meaning nearly 81% have become extinct (Armstrong, n.d.). Though the shrinking number of languages may ease communication between groups, it also threatens the cultural connotations languages have long carried, which are personal to the people who use them. Historians and scientists alike are concerned about the “myriad details” of ancient life that are close to being lost (Armstrong, n.d.). However, the development of artificial intelligence (AI) in recent years has proven to be a promising step toward recovering lost texts and providing a better picture of historic communication and culture.

The Massachusetts Institute of Technology’s (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL) has begun developing an AI technique trained to predict translations of texts. The AI works using Natural Language Processing (NLP), which studies the patterns of a language, including its phonology (sounds), morphology (word structure), and sentence structure. The AI processes how real-life speakers and writers use a language to communicate ideas, which keeps its output relevant and understandable in a social context. CSAIL’s work is unique in that it builds on existing AI models and needs a smaller sample size for training than those models need on their own. CSAIL’s research uses principles from Bayesian Program Learning, which introduces inductive bias to train an algorithm to predict “a wide diversity of natural language phenomena” (Albright et al., 2022). The approach applies broadly across languages because it “[investigates] subtle similarities in the way diverse languages transform words,” meaning that a language need not be fully decoded before patterns are discerned (Gordon, 2023). Though it cannot isolate a definitive translation, the system can narrow down possibilities. Yilun Du, an MIT student and CSAIL affiliate, says the Laboratory “enlists a multitude of AI models[…] [which] can sharpen and improve their own answers by scrutinizing the responses offered by their counterparts” (Gordon, 2023). This method, called multiagent debate, allows AI models to eliminate each other’s answers and converge on common ones, which in the context of a language algorithm makes the translation process more efficient for researchers. According to T. Florian Jaeger, a professor of brain and cognitive sciences and computer science at the University of Rochester, one of the system’s main limitations is that it cannot take into account the human information-processing biases that influence language development (Gordon, 2023). Still, CSAIL continues to develop language AI and to help researchers ideate and refine translations for lost languages. Such breakthroughs in translation cannot provide access to historical records, but they can reveal a connection between social circumstances and the outlooks of ancient authors, which still apply to our views on today’s developing world.
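The convergence step in multiagent debate can be illustrated with a toy sketch. The Python below is not CSAIL’s system: real debate involves language models reading one another’s reasoning, whereas here each hypothetical agent simply holds one candidate answer and adopts the most common answer whenever it has more support than the agent’s own. The function name `debate` and the sample answers are assumptions made for illustration.

```python
from collections import Counter


def debate(initial_answers, rounds=3):
    """Toy stand-in for multiagent debate (illustrative assumption only).

    Each round, every agent sees all current answers and switches to the
    most common one if it has strictly more support than its own answer.
    """
    answers = list(initial_answers)
    if not answers:
        return answers
    for _ in range(rounds):
        counts = Counter(answers)
        best, best_count = counts.most_common(1)[0]
        # An agent keeps its answer unless the leading answer has more votes.
        revised = [best if best_count > counts[a] else a for a in answers]
        if revised == answers:  # nobody changed their mind; stop early
            break
        answers = revised
    return answers


# Hypothetical candidate translations proposed by five agents.
proposals = ["water", "water", "river", "water", "fire"]
print(debate(proposals))
```

In this toy run, the two agents holding minority answers switch after one round, so all five agree on the candidate that most of them started with. When support is evenly split, no agent switches and the disagreement is left for a human researcher to resolve.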

Researchers are using a similar pattern-recognition method to decode the Indus River Valley civilization’s ancient scripts. The Indus civilization, which formed around 3300 BCE in what are now Pakistan and India, left a set of writings likely resembling graphemes (characters representing a group of sounds). In 2024, computer scientist Debasis Mitra, alongside a team of researchers and Indian Statistical Institute students, began grant-funded research to digitize the scripts. His method includes the Automatic Script Recognition Network (ASR-net), which he developed and which uses over 1,000 photos to find repeating symbols in the texts (Lowenstein, 2024). ASR-net combines preexisting AI models to detect individual symbols, much as software for reading modern languages does. With Mitra’s ASR-net, researchers have been able to identify the Indus Valley scripts’ symbols with a 95% accuracy rate (Atturu, 2024). Mitra also developed the Motif Identification and Prediction Network (MIP-net) to sort Indus seals. Each seal combines characters and inscriptions, making it uniquely symbolic. MIP-net not only recognizes characters within the seals but also identifies contextual connections between characters, revealing the seals’ potential meaning. The model learns to recognize meaning from seals whose motifs have already been identified. Since AI language models are usually trained on existing translation attempts, a lack of data limits the team’s research. Still, Mitra and his team have successfully identified and digitized some symbols, offering researchers a condensed set of data from which to theorize meanings. Mitra hopes that in the future, layering AI models can help not only translate the Indus scripts themselves but also “uncover broader patterns and trends in human development” (Atturu, 2024). Mitra says that translation of the Indus River Valley scripts is important to him as it is “part of [his] history” (Lowenstein, 2024). The project has also motivated American students to visit India, revealing how historic research drives curiosity about today’s cultures and diverse perspectives (Lowenstein, 2024).
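Once symbols have been digitized, a simple way to look for repeating patterns is to count how often sequences of symbols occur together. The Python sketch below is an illustration only, not ASR-net or MIP-net: it assumes each inscription has already been reduced to a list of symbol labels, and the labels themselves (“fish,” “jar,” and so on) are invented placeholders rather than real Indus data.

```python
from collections import Counter


def ngram_counts(inscriptions, n=2):
    """Count every run of n consecutive symbols across all inscriptions."""
    counts = Counter()
    for signs in inscriptions:
        for i in range(len(signs) - n + 1):
            counts[tuple(signs[i:i + n])] += 1
    return counts


# Hypothetical digitized inscriptions; the symbol names are placeholders.
inscriptions = [
    ["fish", "jar", "arrow"],
    ["fish", "jar"],
    ["man", "fish", "jar"],
    ["arrow", "man"],
]

pairs = ngram_counts(inscriptions, n=2)
print(pairs.most_common(1))
```

Here the pair (“fish,” “jar”) appears in three of the four made-up inscriptions, which is the kind of recurring combination a researcher might flag for closer study.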

Beyond translation, AI can make texts more accessible to historians, giving them the opportunity to focus on analysis. In April 2025, researchers from Cornell University and Tel Aviv University (TAU) presented ProtoSnap, a generative AI trained to recognize cuneiform characters. Excavated tablets contain over 1,000 distinct characters, not counting regional variations, so a large portion of tablets has not been deciphered. ProtoSnap scans images of tablets and recognizes cuneiform characters using a diffusion model, a type of AI commonly used for tasks like image generation. The diffusion model aligns cuneiform images pixel-by-pixel with templates of existing characters. The presentation of the model at the 2025 International Conference on Learning Representations showed that ProtoSnap can “estimate the diverse configurations of cuneiform characters” regardless of style differences between eras, areas, and writers (Alper et al., 2025). Computers can also recognize ProtoSnap’s scans much more accurately than data from earlier AI models (Waldron, 2025). Thus, AI can process texts much faster and more precisely than human researchers, who can then confirm the translations. Co-researcher and TAU archaeology professor Yoram Cohen expects ProtoSnap to “[lead] to new measurable insights about ancient societies – their religion, economy, social and legal life” (Waldron, 2025). Translating and publishing cuneiform texts will simplify the learning process for historians and museum-goers alike.
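The idea of aligning an image with a character template can be sketched in a much simpler form than ProtoSnap’s. The Python below is brute-force template matching, not a diffusion model: it slides a small template over a black-and-white image and reports where the pixels agree most. The function name `best_match` and the tiny grids are assumptions made for illustration.

```python
def best_match(image, template):
    """Return (row, col, score) for the offset where the template fits best.

    Both arguments are lists of rows of 0/1 pixels. The score is the number
    of pixels that agree between the template and the image patch under it.
    """
    image_h, image_w = len(image), len(image[0])
    temp_h, temp_w = len(template), len(template[0])
    best = (0, 0, -1)
    for r in range(image_h - temp_h + 1):
        for c in range(image_w - temp_w + 1):
            score = sum(
                image[r + i][c + j] == template[i][j]
                for i in range(temp_h)
                for j in range(temp_w)
            )
            if score > best[2]:  # keep the first offset with the top score
                best = (r, c, score)
    return best


# Hypothetical 2x2 "wedge" template and a 4x5 image containing it.
template = [
    [1, 0],
    [1, 1],
]
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(best_match(image, template))
```

A real tablet photo varies in lighting, damage, and handwriting, which is why ProtoSnap relies on a learned model rather than exact pixel comparison like this.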

Artificial intelligence is helping to bring back languages that have been lost for millennia. Recent developments like CSAIL’s model and ProtoSnap offer a glimpse of AI’s ability to simplify and expedite the decipherment of languages. By using AI in linguistics, scientists and historians can gain a clearer view of the lives of ancient civilizations.

References

AI models make precise copies of cuneiform characters. (2025, March 4). EurekAlert!. Retrieved April 23, 2025, from https://www.eurekalert.org/news-releases/1075730
Albright, A., et al. (2022, August 30). Synthesizing theories of human language with Bayesian program induction. Nature Communications. Retrieved April 23, 2025, from https://doi.org/10.1038/s41467-022-32012-w
Alper, M., et al. (2025). ProtoSnap: Prototype Alignment for Cuneiform Signs. Tel Aviv University Visual Artificial Intelligence Lab. Retrieved April 23, 2025, from https://tau-vailab.github.io/ProtoSnap/
Armstrong, R. (n.d.). Language Death. University of Houston. Retrieved April 23, 2025, from https://engines.egr.uh.edu/episode/2723
Atturu, D. (2024, May). Deep Learning in Indus Valley Script Digitizations. Florida Institute of Technology. Retrieved May 6, 2025, from https://repository.fit.edu/etd/1416
Du, Y., et al. (2023, May 23). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv. Retrieved April 23, 2025, from https://arxiv.org/pdf/2305.14325
Gordon, R. (2023, September 18). Multi-AI collaboration helps reasoning and factual accuracy in large language models. Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory. Retrieved April 23, 2025, from https://www.csail.mit.edu/news/multi-ai-collaboration-helps-reasoning-and-factual-accuracy-large-language-models
Lowenstein, A. (2024, March 22). Researcher uses machine learning to help digitize ancient texts from Indus civilization. Phys.org. Retrieved May 6, 2025, from https://phys.org/news/2024-03-machine-digitize-ancient-texts-indus.html
Waldron, P. (2025, March 4). AI models make precise copies of cuneiform characters. Cornell Chronicle. Retrieved April 23, 2025, from https://news.cornell.edu/stories/2025/03/ai-models-make-precise-copies-cuneiform-characters
Zewe, A. (2022, August 30). AI that can learn the patterns of human language. Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory. Retrieved April 23, 2025, from https://www.csail.mit.edu/news/ai-can-learn-patterns-human-language