How do AI tools use personal data?
AI models are increasingly becoming part of our daily routines. From drafting that perfect email to planning a family vacation to generating a logo for your next side gig, the capabilities of AI seem endless—almost magical. However, the “magic” behind these powerful large language models is built on something quite ordinary: data. Vast amounts of text data are used to train and refine large language models, enabling them to understand and generate human-like responses.
Given the significance of data in powering AI, it raises an important question: Is it safe to enter personal information into AI chats? What happens to this data once it has been entered, and is it possible to delete it?
Why AI cares about your data
In the field of artificial intelligence (AI) we often hear of different “models” or “model versions”. The term “model” in this context does not come out of the blue. If you want to create an AI to predict house prices, you generally start by collecting thousands of examples (the data), which are then distilled into a “model,” a smaller abstraction of the real world. The resulting model can then generalize from the existing examples to make predictions about houses it has never seen.
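To make this concrete, here is a minimal sketch of that house-price idea in Python using the scikit-learn library. The features, figures, and choice of library are invented purely for illustration; real models are trained on far more data.

```python
# Minimal sketch: "training" a house-price model from example data.
# All numbers below are made up for illustration only.
from sklearn.linear_model import LinearRegression

# Each example: [square meters, number of bedrooms] -> sale price
examples = [
    [50, 1], [80, 2], [120, 3], [200, 4],
]
prices = [150_000, 230_000, 340_000, 520_000]

# Fitting the model distills the examples into a small abstraction
# of the data (here, just a handful of learned coefficients).
model = LinearRegression()
model.fit(examples, prices)

# The model can now generalize to a house it has never seen.
print(model.predict([[95, 2]]))  # estimated price for a 95 m², 2-bedroom home
```

Large language models work on the same principle, just at an enormous scale: instead of a few housing examples, they are trained on huge collections of text.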
Using vast amounts of data to create models is fundamental to the success of all major large language models (LLMs) on the market. OpenAI, Google, Meta, Anthropic, and others all rely on huge datasets to train their AI models. Some researchers predict that the currently available stock of human-generated public text could be fully used up for training by 2028.
Given that AI companies are hungry for data, how does this affect your data? The default for many publicly available tools like ChatGPT is to store the data users enter and potentially use it for training purposes, which raises the data privacy concerns I outline below.
Risks of AI data sharing
Data privacy concerns have been discussed in many publications and range from identity theft and fraud, mass surveillance, and the erosion of civil liberties to targeted misinformation and manipulation.
With the rise of new AI technologies there is another major concern: uncertainty. While some say LLMs cannot store personal data, researchers have extracted training data from public models, including personal data such as names, phone numbers, and email addresses. Ultimately, large-scale LLMs are a very new technology; the field is moving incredibly fast, and not all risks and mitigation measures are fully understood.
In light of this uncertainty, individuals should be particularly cautious about the information they share online, especially when interacting with platforms and services powered by LLMs. These models are trained on vast datasets, and while they are designed to generate responses based on patterns in data rather than specific memories, there is still a risk that sensitive information might be inadvertently exposed or mishandled. This becomes even more concerning as LLMs are increasingly integrated into everyday tools, from customer service bots to content creation platforms.
Protecting your data
Given the known (and potentially unknown) risks, here are some practical tips for using LLMs while mitigating those risks.
It’s important to remember that protecting data goes beyond just your own personal information. The data you share could also involve sensitive information about your family, friends, or even your company.
1. Minimize data
The first and most fundamental rule for enhancing data privacy is data minimization: personal data that is never shared in the first place cannot be misused. This applies to any data sharing in the digital and non-digital world, not only to AI tools.
You likely wouldn’t upload your medical records into a random online form or share sensitive health data with people you don’t trust. Similarly, it’s essential to avoid sharing personal data with AI tools that don’t guarantee robust privacy protections. After all, almost every request you make to an AI model is sent to the servers of the company operating it, and that company then holds your data. By keeping sensitive information private, you reduce the risk of data misuse or exposure.
For example, if you use an AI tool to make sense of the long contract that comes with your new insurance policy, first remove sensitive information like your address or social security number, as in the sketch below.
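Here is a minimal Python sketch of what that kind of clean-up can look like. The patterns below are simplified examples I made up for illustration; real redaction tools use far more robust detection.

```python
# Minimal sketch: stripping obvious personal details from text before
# pasting it into an AI chat. Patterns are simplified for illustration.
import re

PATTERNS = {
    r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b": "[EMAIL]",     # email addresses
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",              # US-style social security numbers
    r"\b\+?\d[\d\s().-]{7,}\d\b": "[PHONE]",        # rough phone-number shapes
}

def redact(text: str) -> str:
    """Replace matching personal details with placeholder tags."""
    for pattern, placeholder in PATTERNS.items():
        text = re.sub(pattern, placeholder, text)
    return text

contract = "Policyholder: Jane Doe, jane.doe@example.com, 555-123-4567, SSN 123-45-6789."
print(redact(contract))
# Policyholder: Jane Doe, [EMAIL], [PHONE], SSN [SSN]
```

The same idea applies whether you do it by hand or with a tool: the less identifying detail leaves your device, the less can be stored or exposed later.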
2. Use data controls
Most AI tools have started to introduce data controls that give you at least some say over how the data you enter is used. For some AI tools you can opt out of training data collection. Here is an overview of how to disable data collection for several popular tools:
https://www.wired.com/story/how-to-stop-your-data-from-being-used-to-train-ai/
Going a step further, some providers like OpenAI even offer “zero data retention” for sensitive data; however, these features are not always available to individuals.
3. Install browser extensions
If you use AI models a lot, the risk of accidental data sharing grows with the amount of data you share and upload. Some companies offer products and tools that help detect and remove sensitive data before it is shared. An easy-to-use option is to install a browser extension.
Conclusion
The success of AI models depends on the vast amounts of data they are trained on, but this dependency also introduces risks, especially when it comes to personal data. By practicing data minimization, making use of available data controls, and leveraging tools to detect sensitive information, individuals can reduce the chances of their data being mishandled. In this rapidly developing landscape, safeguarding personal information is not just advisable—it’s essential for protecting one’s privacy and security in the digital age.
Thanks to Oli Ross and Jessica Traynor
Resources
- https://helloomo.ai/blog/data-privacy-llms
- https://www.sectionschool.com/blog/your-privacy-guide-to-ai-chatbots
- https://openai.com/enterprise-privacy/
- https://www.wired.com/story/how-to-stop-your-data-from-being-used-to-train-ai/
- Will we run out of data? Limits of LLM scaling based on human-generated data
- https://www.zendata.dev/post/what-californias-ab-1008-could-mean-for-data-privacy-and-ai
- Extracting Training Data from Large Language Models
- https://usekoda.com/secured-conversations-data-privacy-in-the-age-of-ai-chatbots/
What do you want to know about data, privacy, or technology?
Data Curious is a public resource supported by Good Research LLC in collaboration with the Center for Digital Civil Society at University of San Diego.
To contact us, send us an email at hello@datacurious.org.