Written by: Hunter Lockwood Thompson, Kayla Becker, and Stella Beerman
Generative Artificial Intelligence (AI) has quickly become a daily tool for many people, including Indigenous communities and employees of Tribal Nations. This document is intended to guide National Breath of Life communities in making informed decisions about the use of generative AI in language and cultural work.
Generative AI refers to deep-learning models that can create new, high-quality content based on the vast datasets they were trained on. While often associated with text, these tools can also generate images, videos, and audio. Common tools include ChatGPT, Claude, Google Gemini, Midjourney, DALL-E 3, ElevenLabs, Suno, Sora, and Runway Gen-2.
For communities specifically engaged in language and cultural revitalization, these tools can accelerate the research workflow. However, there are also very real risks to consider before choosing to utilize them.

When language, stories, and cultural knowledge are entered into commercial systems, they are effectively “scraped” and ingested by external models. This creates a scenario where unique cultural information could become the intellectual property of the corporation that owns the model, stripped of its context and removed from community control.
Additionally, these generative AI tools are not human; they are trained on general data from the internet. Models trained on public data often don’t have enough information about a specific community’s stories, values, language, or culture to generate accurate answers. The result is what are called “hallucinations”: the AI simply makes things up. It can produce incorrect language, false information, and images based on harmful, pan-Indigenous stereotypes.
These hallucinations aren’t just incorrect; they are phrased to look like they are true. Most generative AI tools are also trained to be “sycophantic,” meaning the models often offer “user-pleasing” responses that validate the user’s perspective rather than giving truthful answers. This kind of content actively undermines generations of work by culture bearers to share accurate information within their communities.
How LLMs Scrape
Commercial generative AI tools are built using Large Language Models (LLMs). These are like massive vacuums. They ingest terabytes of text from the open internet, like books, websites, articles, and social media, as well as user inputs to tools like ChatGPT or Google Gemini. When they “learn” a language, they are not understanding it like a human but instead recognizing statistical patterns.
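To give a toy sense of what “statistical patterns” means here, the short Python sketch below predicts the next word purely from counts in a made-up three-sentence corpus. This is a drastic simplification of what an LLM does, but the underlying idea, pattern-matching rather than understanding, is the same:

```python
# A toy illustration of "statistical patterns": predicting the next
# word from simple bigram counts. The corpus is invented; a real LLM
# is vastly more complex, but the same basic idea applies.
from collections import Counter, defaultdict

corpus = [
    "the dog chased the cat",
    "the dog ate the food",
    "the cat chased the mouse",
]

# Count which word follows which word across the corpus.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follows[current_word][next_word] += 1

# The model "knows" nothing about dogs or cats; it only knows that
# "dog" followed "the" twice, so that becomes its best guess.
print(follows["the"].most_common(1))  # [('dog', 2)]
```

When a model like this has seen very little text in a given language, its “best guess” is exactly the kind of confident fabrication described above.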
If a person uploads language materials, like a dictionary or a book of stories, those documents are processed on external servers owned by various companies. At that point, they will be stored, logged, and might even be used to train future models, depending on the company’s policies and the user’s settings.
Most companies say users can “opt out” of having their input used as training material, but once cultural materials leave community-controlled infrastructure, there is no way to guarantee how that information will be retained, analyzed, or accessed. The concern is that the AI model could then regurgitate that information to anyone who asks, removing the community’s ability to vet who accesses restricted knowledge.
The core mechanism of generative AI is data extraction. These models are learning engines that continuously ingest data to refine their outputs. This creates a critical vulnerability: knowledge entered into these systems could effectively become the intellectual property of the technology provider.
Paradox of Participation
Research in the field of Indigenous AI ethics highlights a difficult bind known as the “Paradox of Participation.”
1. The Risk of Abstention: If Indigenous nations refuse to engage with AI, they risk Digital Erasure, in which their languages and perspectives are absent from the systems increasingly shaping public knowledge.
2. The Risk of Participation: If a nation participates by feeding data into current models, it risks Digital Exploitation, in which that data is extracted and commercialized without community consent.
Compounding this paradox is a lack of transparency. Commercial AI models are “Black Boxes”: we do not know exactly what data they were trained on, and there is no mechanism for them to “unlearn” data once it has been ingested. If sensitive data is uploaded, there is no easy “delete” button for the trained model.
The risks are real, but so are the opportunities. Across Indian Country, leaders and their communities are developing “Sovereign-Aligned” frameworks that allow tribes to develop and use AI tools without compromising community knowledge.
Intentional and Responsible Use of AI
Before discussing why a community might decide to use AI tools, it’s worth spending a little time talking about how. So-called “flagship” AI models like Anthropic’s Claude and OpenAI’s ChatGPT can be accessed in two main ways. First, and probably more familiar to most users, is by visiting their websites or downloading their apps onto your phone or tablet. By default, when you use the website or phone app, both Anthropic and OpenAI are allowed to use your input to train their models. It’s always good practice to avoid sending Claude or ChatGPT anything sensitive – chats with AI models aren’t legally protected! – but if you’re a regular user of these websites and phone apps, it’s a good idea to navigate to the settings and turn off anything that allows those companies to train on your data.
The second way of interacting with these AI models is by using what’s called an application programming interface (API) to make your request. Using the API requires a bit more technical know-how and cannot be done for free, but the payoff is that, under both companies’ current policies, data sent through the API is not used to train their models. Depending on your use case, it may be worth spending the time and money for that stronger protection.
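As a minimal sketch of what this looks like in practice, the Python snippet below sends one request to Claude through Anthropic’s official client library. The model name and prompt are placeholders, and an API key (which you pay for per use) is assumed to be set in your environment:

```python
# A minimal sketch of an API request to Claude using Anthropic's
# official Python client (pip install anthropic). Under Anthropic's
# current policy, data sent this way is not used for training.
import anthropic

# The client reads your API key from the ANTHROPIC_API_KEY
# environment variable by default.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; check current model names
    max_tokens=500,
    messages=[
        {
            "role": "user",
            "content": "What do the abbreviations 'v.t.' and 'v.i.' mean in old dictionaries?",
        }
    ],
)

print(response.content[0].text)
```

OpenAI’s API works much the same way through its own client library; in both cases you pay per request rather than a flat monthly subscription.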
Generally speaking, it’s always a good idea to think critically about what information you’re sending out over the internet, where you’re sending it, and why!
With that out of the way, there are ways of using AI tools that will keep your data protected and private. You could decide to never tell the model which language you’re working with, for instance. Or, if you’re asking about specific patterns in the language, you can replace the real language data with test data or with alphanumeric codes like “root1-suffix1, root2-suffix1,” and so on. Using strategies like this, some of us at the Myaamia Center have used AI tools to speed up Latin translations and double-check ambiguous transcriptions of primary source data, all without sending tech companies a single Myaamiaataweenki ‘Miami language’ word.
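To make that masking strategy concrete, here is a small Python sketch of the coding approach. All the forms below are invented placeholders, and the real-to-code mappings never leave your machine:

```python
# A sketch of masking language data with alphanumeric codes before
# asking an AI tool about structural patterns. All forms below are
# invented placeholders, not real language data.

# Hypothetical (root, suffix) analyses from your own notes.
analyzed_forms = [
    ("kweh", "-ami"),
    ("kweh", "-olo"),
    ("patu", "-ami"),
]

root_codes, suffix_codes, masked = {}, {}, []
for root, suffix in analyzed_forms:
    r = root_codes.setdefault(root, f"root{len(root_codes) + 1}")
    s = suffix_codes.setdefault(suffix, f"suffix{len(suffix_codes) + 1}")
    masked.append(f"{r}-{s}")

# Only the masked forms go into a prompt; the code tables stay local
# so you can translate the model's answer back yourself.
print(masked)  # ['root1-suffix1', 'root1-suffix2', 'root2-suffix1']
```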
LLMs also excel at writing code, and ILDA allows communities to download their own dictionary and archival data in a structured format like JSON or CSV. That means it’s possible, for example, to have AI tools write a simple computer program that reads an ILDA dictionary JSON file and selects a random “word of the day” – and you don’t have to (and shouldn’t!) upload your dictionary directly to an AI model to do so. All you have to do is describe the structure of your data, describe what you want to do, and let the LLM handle the bulk of the coding logic.
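As a rough sketch of what that might look like, assuming the export is a JSON list of entry objects with hypothetical field names like “headword” and “definition” (your actual export will differ), the whole program could be as short as this:

```python
# A sketch of a "word of the day" script for a dictionary exported
# from ILDA as JSON. The field names "headword" and "definition" are
# hypothetical; adjust them to match your community's actual export.
# The dictionary file never leaves your computer.
import json
import random

with open("dictionary.json", encoding="utf-8") as f:
    entries = json.load(f)  # assumes the export is a list of entry objects

entry = random.choice(entries)
print("Word of the day:", entry["headword"])
print("Definition:", entry["definition"])
```

This is exactly the kind of program an LLM can write from a plain-language description of your data’s structure, without ever seeing the data itself.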
But the future may not entirely revolve around Large Language Models controlled by Big Tech companies. As consumer-grade technology improves, we have seen an explosion of Small Language Models that can run “locally”, entirely on a user’s computer, laptop, or phone, without ever sending any data over the internet. The big flagship models are designed to do everything from translating Latin to creating images, but smaller, more specialized models may be better suited for tribal needs. At the Myaamia Center, the institutional home of National Breath of Life, we are experimenting with these sorts of solutions for automated machine translation and text-to-speech, among other uses, and look forward to sharing what we’ve learned with National Breath of Life participants over time.
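For a sense of how simple local inference has become, the sketch below loads a small open-weights model with Hugging Face’s transformers library. The model name is just one example of a small model; the weights are downloaded once, and after that the text you process never leaves your machine:

```python
# A sketch of running a small language model entirely locally using
# Hugging Face's transformers library (pip install transformers torch).
# The model name is one example of a small open-weights model; any
# similar model works. Weights download once; after that, your prompts
# and data are processed on your own hardware, not a company's servers.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

result = generator(
    "Write one sentence welcoming students to a language class.",
    max_new_tokens=60,
)
print(result[0]["generated_text"])
```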
National Breath of Life, ILDA, and AI
To be clear, the Indigenous Languages Digital Archive (ILDA) does not use AI “under the hood”. Your data belongs to your community, and the only information that will show up in your database is what you’ve entered there yourself. None of that is changing.
In response to these challenges, the Myaamia Center is dedicated to understanding how best to use these tools while prioritizing data sovereignty. National Breath of Life, through the Myaamia Center overseen by the Miami Tribe of Oklahoma, has developed several policies and procedures around the ILDA software to protect users’ data from being scraped by LLMs.
Because the Miami Tribe of Oklahoma holds the copyright to the ILDA software, they take the responsibility of protecting other communities’ data very seriously. The Miami Tribe is committed to ensuring that every community using ILDA maintains complete control over its own digital archives and language materials.
To make this possible, National BoL has established clear policies to safeguard user data. Every community can password-protect its dictionaries and archives. This ensures that only authorized users, chosen by the community, can access the data.
These security measures also prevent “data scraping.” By password-protecting archives, communities ensure their knowledge isn’t scraped by generative AI tools without their explicit consent. Before using ILDA, each community enters into a transparent privacy policy and user agreement with the Miami Tribe. This ensures everyone is on the same page regarding how the software interacts with their data.
Additionally, National BoL data, through the Myaamia Center at Miami University, is stored on Amazon Web Services (AWS) servers. The data on these servers is protected by a contractual agreement between the University and AWS. AWS is committed to protecting this data because that’s their whole business model; if AWS suddenly decided that they owned everything they stored, they would face immediate economic and legal challenges from powerful entities around the world, ranging from Netflix to NASA.
By providing these safeguards, National BoL ensures that communities can use digital tools for language revitalization with the peace of mind that their data remains firmly under community control.
Generative AI can be a powerful tool, but its risks should be carefully considered. For Indigenous communities, the choice of whether or how to use it should be guided by sovereignty and cultural responsibility. By approaching AI intentionally, and by building and supporting Indigenous-controlled technologies, communities can benefit from AI without sacrificing control over their own knowledge.
