What data is used to train an AI, where does it come from, and who owns it?

Artificial intelligence (AI) relies on large amounts of data. Machine learning algorithms learn to find inter-dependencies and patterns amongst data sets and apply those learnings to new data they are presented with. It follows that the higher the volume and quality of the data (being uniform, diverse, comprehensive and relevant), the more accurate the algorithms.


Data is used throughout all stages of the AI development process and can broadly be categorised into the following:

  1. Training data - data used to fit the AI model
  2. Validation data - data used during development to tune the model and compare candidate models
  3. Test data - data held back to evaluate the final model's performance on unseen data
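To make the three categories above concrete, the sketch below splits a dataset into training, validation and test sets. It is a minimal illustration using only the Python standard library; the function name and the 80/10/10 fractions are illustrative assumptions, not taken from any particular toolkit.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle records and split them into training, validation and test sets.

    Whatever remains after the training and validation portions
    becomes the test set. Fractions here are illustrative.
    """
    shuffled = records[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example: 100 dummy records split roughly 80/10/10
train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

The key point for the discussion that follows is that all three sets are drawn from the same underlying data, so any rights attaching to that data attach to every stage of development.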

Training data can be either structured or unstructured, an example of the former being market data presented in tables, and of the latter audio, video and images.


Training data can be sourced either internally (for example, customer data held by organisations) or externally, from third-party sources.

Internal data is often used for very specific AI training or for more niche internal projects. Examples include Spotify’s AI DJ, which draws on your listening history to generate playlists, and Facebook, which runs its users’ data through its recommendation algorithm to surface recommended content.

Data can also be obtained from vendors who collect and sell it in large volumes. Reddit, for example, began charging for access to its API in April 2023, likely in response to the success of ChatGPT and the opportunity to generate a new revenue stream through sales of its data for AI training purposes.

Other sources of external data include open datasets provided by, for example, governments, research institutions, and companies. Companies also use internet scrapers to obtain data, but doing so carries a higher risk of infringing copyright.


Data is not owned per se; instead, different rights may attach to it, and the owner of those rights may enforce their rights to restrict use of data by third parties. Each of the laws of copyright, confidentiality, and sui generis database rights may apply to training data.

Copyright is most likely to be relevant; it subsists in most human-created text (including code), images, audio, video, and other literary, artistic, dramatic or musical works, and is infringed where all or a substantial part of the work in question is copied.

Database rights may also apply; these protect against data being extracted from a database without the permission of the database’s owner.

The law of confidence is less likely to be relevant to most uses of training data, unless such data has been disclosed to the party relying on it for training purposes in confidence.


Using unlicensed or unauthorised data may carry significant risks, arising out of the rights described above.

The owner of those rights may bring litigation for infringement of copyright, of database right, or for breach of confidence.

For example, Getty Images has famously commenced legal proceedings in the UK and USA against Stability AI, asserting that Stability AI’s use of Getty’s images in the training dataset for its “Stable Diffusion” generative AI constitutes copyright infringement. Getty also argues that information extracted from training data and stored as “latent images” comprises infringing copies of its works. Finally, it argues that outputs from Stable Diffusion reliant on those latent images also constitute infringing works. The outcome of this litigation is pending, and there are complex arguments in relation to each alleged infringement, but it shows that a tainted training dataset can infect the entirety of a generative AI.

It’s worth noting that copyright infringement may occur every time a work is copied. Therefore, even if the company training a generative AI obtains a licence to a dataset, it may still be infringing copyright if the licensor of that dataset does not itself have the rights to the data within it. When licensing data, it is important to know its source, and to obtain warranties and an indemnity confirming that your use will not infringe the rights of a third party, and that you will be compensated if that proves incorrect.

As an aside, data privacy and protection laws such as the GDPR should always be kept in mind, particularly where data used in training may identify an individual.


  • Use data which is out of copyright, or which is provided expressly for the purpose for which you are using it (i.e. training a generative AI).
  • Where data is not provided openly for your purposes, seek to obtain a licence for that data. Where obtaining such a licence, ensure that it contains warranties and ideally an indemnity protecting you from third party allegations of infringement.
  • Be aware that web scraping carries a significant risk of pulling down infringing data.
  • Open datasets will likely have their own conditions that must be complied with (for example, that the data is not to be used for commercial purposes); check that your intended use is permitted.
  • Ensure that you are not using data capable of personally identifying an individual, or that you have the necessary consents to do so.

The legislation in this area is likely to develop, and may well differ slightly from country to country, and we will keep the Potter Clarkson AI Hub up to date as those changes occur.

This article forms part of our AI Hub, which you can access here.

Potter Clarkson’s specialist electronics and communications team includes a number of attorneys with extensive experience in software and AI inventions. If we can help you with an issue relating to the protection and commercialisation of innovation in any area of artificial intelligence, please get in touch.