The Data Broker Industry and the People It Has Never Met

by Scott

Somewhere in a database you have never heard of, maintained by a company whose name you do not know, there is a file about you. It probably contains your name, your current address, and most of the addresses where you have lived over the past decade. It likely includes your date of birth, your estimated household income, your marital status, and whether you own or rent your home. It may contain information about your political affiliation, your religious beliefs, your health conditions, and the medications you take. It almost certainly contains a record of your purchasing behavior, your browsing history, and your interests as inferred from your online activity. If you have ever been involved in a lawsuit, declared bankruptcy, received a traffic ticket, or appeared in a property record, that information is probably there too. The company that holds this file has never spoken to you, has never asked for your consent, and in all likelihood has no legal obligation to tell you it exists.

This is not a hypothetical scenario. It is a description of the routine operations of the data broker industry, a sector of the economy that most people have never heard of and that exercises remarkable influence over how information about private individuals is collected, packaged, sold, and used. Data brokers are companies whose primary business is the aggregation of personal information from a wide range of sources and the sale or licensing of that information to clients who use it for purposes ranging from targeted advertising to background checks to fraud detection to political campaigning. The industry is large, profitable, largely unregulated in most of the world, and almost entirely invisible to the people whose information it trades.

The origins of the data broker industry lie in the direct marketing business of the mid-twentieth century, when companies began systematically compiling mailing lists of consumers with particular characteristics. A company selling gardening supplies had an obvious interest in reaching people who had previously purchased gardening products, and the business of maintaining and renting such lists became a niche industry. The computerization of record-keeping through the 1970s and 1980s dramatically expanded the scale at which such lists could be compiled and the precision with which they could be segmented, and companies that specialized in this work began to resemble something closer to what we now call data brokers.

The transformation of the industry into its current form happened through the convergence of several developments in the 1990s and 2000s. The digitization of public records made enormous quantities of government data available in machine-readable form at relatively low cost. The growth of the internet created new channels through which behavioral data could be collected at scale. The emergence of loyalty programs, credit cards, and other consumer products that collected transactional data as a byproduct of their primary function created rivers of behavioral information flowing toward companies that knew how to aggregate and monetize it. And the development of sophisticated database technology and data matching techniques made it possible to link information from disparate sources into comprehensive profiles of individual people with a reliability and scale that earlier techniques could not approach.

The sources from which data brokers draw their information are remarkably diverse, and understanding them helps explain both the comprehensiveness of the profiles they build and the difficulty of limiting the information they can access. Public records are one foundational layer. Property records, court filings, voter registration databases, professional licenses, business registrations, and a range of other government-maintained records are public in the sense that they can be accessed by anyone, and data brokers systematically collect and process them. The information in these records is there for legitimate reasons of transparency and accountability, but the aggregation of it by private companies for commercial purposes was not the use case its collection was designed to serve.

Commercial transactions are another major source. When you use a credit card, subscribe to a service, make a purchase from a retailer, or fill out a warranty registration card, you are generating transactional data that may be shared with data brokers through a variety of mechanisms. Some retailers sell purchase data to data brokers directly. Credit card companies may share anonymized transaction data with analytics firms that use it to infer interests and behaviors. Loyalty programs, whose main value to their operators is the behavioral data they collect from members, are significant sources of detailed purchase history. The terms of service and privacy policies through which this sharing is authorized are typically written in language that is difficult for non-specialists to interpret and that few consumers read.

Online behavioral data has become an increasingly important source as internet usage has become central to daily life. The tracking technologies embedded in websites and mobile applications, including cookies, device fingerprints, pixel trackers, and various other mechanisms, collect data about what people read, search for, click on, and spend time with online. This data flows through a complex ecosystem of advertising technology companies, data management platforms, and data brokers that aggregate and process it. The consent mechanisms through which this data collection is nominally authorized, the cookie consent banners and privacy notices that websites display, are widely recognized by researchers to be ineffective at communicating meaningful choices to users and at obtaining genuinely informed consent.

Location data is among the most sensitive and most commercially valuable categories of information in the data broker ecosystem. Smartphone applications that request location permissions in order to provide their primary function frequently share that location data with third parties as a secondary commercial activity. The resulting streams of precise location information, time-stamped records of where specific devices have been throughout the day, can reveal information about a person’s home, workplace, medical appointments, religious observances, political activities, personal relationships, and daily routines that they would consider deeply private. Several investigative journalism projects have obtained location datasets from data brokers and demonstrated that, contrary to claims that the data is anonymized, the precision of the location records makes re-identification straightforward.

The clients who purchase data from brokers are as diverse as the sources from which the data is drawn. Advertisers are among the largest customers, using data broker profiles to target marketing messages to individuals with particular characteristics or behavioral histories. Insurance companies use data broker information to supplement the information provided in insurance applications, assessing risk based on behavioral and demographic data that applicants have not provided directly. Employers use background check services, many of which are data brokers or rely on data broker information, to investigate job applicants. Landlords use similar services to screen prospective tenants. Lenders use data broker information in credit decisions, particularly in the market for short-term and alternative credit products.

The use of data broker information in consequential decisions about people’s access to employment, housing, insurance, and credit raises serious concerns about accuracy and fairness. Data broker files frequently contain errors, outdated information, and inaccurate inferences. A person may be incorrectly associated with a criminal record belonging to someone with a similar name. An address that appeared in a database years ago because of a brief connection may persist in profiles and suggest a current residence that is incorrect. Inferences about income, health status, or creditworthiness derived from behavioral data may be systematically wrong in ways that disadvantage particular demographic groups. The people affected by these errors typically have no way of knowing the information exists, no mechanism for reviewing it, and no practical recourse for correcting it.

The political uses of data broker information have attracted attention since the revelations around Cambridge Analytica’s activities during the 2016 election cycle, but the use of commercial data in political campaigning predates that episode by many years. Political campaigns have long purchased voter data enhanced with commercial data broker information to build detailed profiles of potential supporters and voters, and the micro-targeting of political advertising based on those profiles has become standard practice. The use of detailed personal data to identify and influence persuadable voters raises questions about the relationship between data-driven political communication and democratic deliberation that remain unresolved.

Law enforcement use of data broker information represents a particularly contested area. In many jurisdictions, law enforcement agencies can purchase commercially available data from data brokers without the judicial oversight that would be required to obtain the same information through compelled disclosure from the original holders. An agency that could not obtain a person’s location history from their mobile carrier without a warrant may be able to purchase a commercially aggregated location dataset covering the same person and time period without any judicial authorization. This practice, sometimes called data laundering in the civil liberties literature, has been documented and criticized by privacy advocates and has prompted legislative attention in some jurisdictions, though comprehensive regulation has been slow to develop.

The health information that appears in data broker profiles is particularly sensitive and, in the United States, particularly poorly protected. The primary federal health privacy law, the Health Insurance Portability and Accountability Act, applies to healthcare providers, health plans, healthcare clearinghouses, and their business associates, but not to the broad range of commercial entities that can infer health information from behavioral data. A data broker that infers from purchase records and app usage that an individual is likely to have a particular health condition is not covered by the same legal protections that would apply to a hospital holding the same information. The resulting market in inferred health data is poorly regulated and widely used by insurers, pharmaceutical marketers, and other entities with commercial interests in health information.

The regulatory landscape for the data broker industry reflects the difficulty of governing a sector that operates across the full complexity of the modern data economy. The European Union’s General Data Protection Regulation, which took effect in 2018, established a comprehensive framework of rights for individuals with respect to their personal data, including rights of access, correction, deletion, and portability, along with requirements for lawful bases for processing personal data and limits on decisions made solely by automated means. The GDPR’s application to data brokers has been the subject of enforcement actions and ongoing interpretation, and its actual effectiveness at constraining data broker practices remains debated. The legal basis under which data brokers claim to process personal data is typically legitimate interests, a flexible category whose application to commercial data trading has been questioned but not definitively resolved.

In the United States, the regulatory environment is considerably more fragmented. There is no comprehensive federal privacy law governing commercial data practices. A patchwork of sector-specific laws covers particular categories of information or particular types of entities, and a growing number of state laws, including the California Consumer Privacy Act and the California Privacy Rights Act that amended and expanded it, have established rights for state residents with respect to their personal data. The data broker industry has been subject to specific state-level registration requirements in a small number of states, most notably Vermont and California, which require data brokers to register with state authorities and disclose their data practices. These requirements have created some transparency about the scale and nature of the industry without substantially constraining its practices.

Attempts by individuals to locate and remove their information from data broker databases run into the practical challenge that there are hundreds of data broker companies, each with its own opt-out process, and that the opt-out processes are typically designed to be cumbersome and incomplete. A person who successfully removes their information from one broker may find it reappears after a period of time as the broker refreshes its data from sources that continue to contain that information. A small industry of privacy protection services has emerged to help individuals manage the process of submitting opt-out requests to large numbers of data brokers, but these services operate within the same structural constraints and can only partially mitigate the problem.

The children who are growing up today are accumulating data broker profiles from birth, as their parents share information about them on social media, as their devices generate behavioral data, and as the systems around them collect records that will persist for decades. The data footprint of a person born today will be orders of magnitude larger than that of a person born two decades ago, because the systems collecting and retaining data have expanded so rapidly and so thoroughly into daily life. The implications of this accumulation for the adults these children will become, for the decisions that will be made about them based on records of their earliest years, are only beginning to be understood.

What the data broker industry represents, in the broadest sense, is the emergence of a parallel information system about human beings, one that operates outside the knowledge and consent of the people it describes and that is governed primarily by the commercial interests of the companies that operate it. The people whose information flows through this system are simultaneously its subject and its product, the raw material from which a multibillion-dollar industry extracts value. They have, in most cases, no relationship with the companies that hold their information, no ability to review what is held, and no effective means of correcting errors or limiting uses they would find objectionable. The industry has never met them. It knows them anyway.