A New Framework for India

In light of the various copyright disputes concerning AI firms, Tirthaj Mishra argues that India should shift the burden of licensing for AI training data from creators to AI companies. The guest post critiques the ineffectiveness of the opt-out model using robots.txt and proposes a statutory “Duty to License” framework inspired by India’s broadcasting laws under Section 31D of the Copyright Act. Tirthaj is a 3rd year law student at Maharashtra National Law University Mumbai. His academic focus centers on the intersection of technology and legal frameworks, particularly in fields like artificial intelligence and intellectual property.


Reversing the Opt-Out Burden: Why AI Firms Should Bear Licensing Obligations for Training Data

By Tirthaj Mishra

India’s generative AI industry relies on creative works to train its models, yet creators bear the burden of protecting their IP through robots.txt, a protocol that lacks enforceability and disproportionately burdens creators, who must manually opt out of datasets. Worse, this archaic system can only block entire domains, not individual works, and is frequently ignored by AI crawlers.

This post advocates for imposing statutory licensing obligations on AI firms, inspired by India’s broadcasting laws under section 31D, to achieve a fair balance between fostering innovation and safeguarding creator rights in the Indian generative AI market.

At the outset, it is crucial to recognize the distinction between broadcasting and AI training. While broadcasting directly disseminates entire copyrighted works to the public (e.g., playing songs on the radio), AI training extracts patterns from data without communicating works to the public during the process, as explained here. Yet AI still monetizes creative labor by internalizing stylistic elements or producing outputs that compete with original works, as highlighted in cases like Andersen v. Stability AI and in arguments here. Redefining “broadcasting” under Section 31D to include algorithmic dissemination establishes a duty to license, ensuring compensation for creators regardless of the mode of use.

Moreover, even if courts rule that AI training qualifies as “fair dealing” under Section 52 of the Indian Copyright Act, as debated in ongoing litigation like ANI v. OpenAI, the argument for statutory licensing persists. Fair dealing may excuse non-expressive, pattern-based use, but the commercial scale of AI training and its potential impact on creator markets justify proactive compensation and transparency, as seen in arguments here and in a report by the U.S. Copyright Office, which states: “Making commercial use of vast troves of copyrighted works to produce expressive content that competes with them… goes beyond established fair use boundaries”. By adapting Section 31D, India can address these challenges and set a precedent for ethical AI governance.

I. Why Robots.txt Fails Creators and Where It Is Vulnerable

The robots.txt protocol is a text file through which website owners indicate the directories or pages they do not want web crawlers to visit. Its directives carry no technical or legal force and are strictly advisory. This voluntary compliance model creates an uneven playing field: creators depend on an outdated preventive method with no legal safeguard, leaving their data vulnerable to extraction by bad-faith actors.

Because robots.txt depends on voluntary compliance, domain-blocking also fails against third-party mirrors and archives. Major search engines obey it, but other crawlers, the Wayback Machine among them, often disregard its directives. Nor does it shield individual pieces of content, such as articles or images: it applies only to entire domains or subdirectories, leaving individual works exposed.
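
To make the protocol’s advisory nature concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser (the site, path, and crawler name are hypothetical). Nothing obliges a crawler to run this check before fetching, and Disallow rules match path prefixes rather than individual works:

from urllib.robotparser import RobotFileParser

# A sample robots.txt that blocks the /articles/ path for every crawler.
rules = """
User-agent: *
Disallow: /articles/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

url = "https://example-creator-site.in/articles/my-story.html"
# The check is purely self-imposed: a bad-faith crawler can fetch the
# URL without ever consulting robots.txt, and no penalty follows.
# Note also the granularity: the rule blocks the whole /articles/
# prefix, not one specific work.
print(rp.can_fetch("SomeAICrawler", url))  # False, for a compliant crawler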

II. Case Study: Lessons from India’s Broadcasting Licensing Model

Section 31D of India’s Copyright Act allows broadcasters to use copyrighted works without prior approval by paying royalties to rights holders, providing public access to content while fairly compensating creators. It requires broadcasters to notify copyright owners and pay royalties set by a tribunal or court, which ensures fair compensation and prevents the hoarding and monopolising of works.

While Section 31D currently applies to literary and musical works and sound recordings, its principles can be extended to all copyright categories, from software code to visual art, through amendments. However, the Wynk v. Tips litigation highlights a risk: courts may reject analogies between broadcasting and AI training absent explicit legislative intent. To avoid this, India must redefine ‘communication to the public’ to include algorithmic intake and establish tiered royalties, ensuring startups and giants alike contribute fairly.

This statutory licensing framework can regulate the use of data for training AI models by imposing a statutory duty on companies to license training data. It shifts the compliance burden from creators to AI firms, much as broadcasters must license music: for example, fixed minimal fees for startups (₹1,000 per 10,000 works) and revenue-sharing for giants (2% of AI-related income). The UK’s CLA and Israel’s tiered systems show that this balances accessibility and fairness. An AI Training Registry, a public blockchain ledger, can automate tracking and avoid the manual audits that plagued radio licensing. Payment models can also be flexible, extending beyond per-use royalties to mechanisms such as revenue-sharing, ensuring fair compensation for extensive data use. The reasoning is parity: just as broadcasters must license content for public benefit, AI companies should license training data to prevent exploitation and ensure equal access to innovation.
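
As a rough illustration, and not a proposed statutory formula, the tiered schedule above could be computed as follows in Python; the ₹5 crore revenue cutoff separating “startups” from “giants” is a hypothetical placeholder borrowed from the MSME threshold discussed later in this post:

import math

STARTUP_REVENUE_CAP = 50_000_000  # ₹5 crore; hypothetical cutoff, for illustration only

def annual_royalty(works_used: int, ai_revenue: float) -> float:
    # Startups: fixed fee of ₹1,000 per 10,000 works used (rounded up).
    # Larger firms: revenue-sharing at 2% of AI-related income.
    if ai_revenue <= STARTUP_REVENUE_CAP:
        return 1_000 * math.ceil(works_used / 10_000)
    return 0.02 * ai_revenue

print(annual_royalty(45_000, 30_000_000))     # startup: ₹5,000
print(annual_royalty(45_000, 1_000_000_000))  # large firm: ₹2 crore (20,000,000.0)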

III. The New Framework: Statutory “Duty to License”

To regulate uncontrolled data scraping, India can adopt a statutory “Duty to License” framework, compelling AI firms to actively license training data. The proposal is based on a Three-Pillar Framework to balance both innovation and creators’ rights.

Proactive Disclosure

AI firms would be required to maintain mandatory training data manifests, recording the data on which training takes place. These manifests would be public, bringing accountability and enabling creators to track whether their works were used. To this end, India could establish an Indian AI Training Registry, a public database where firms submit usage records. Such a public record would build trust and reduce controversies over unauthorized use. (For more on this, see here.)
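
What might one entry in such a manifest look like? A minimal sketch follows, with the caveat that the field names, the SHA-256 choice, and the registry identifier format are illustrative assumptions rather than a proposed standard:

import hashlib, json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class ManifestEntry:
    """One record in a hypothetical training data manifest."""
    work_hash: str      # fingerprint of the work's bytes
    source_url: str     # where the work was obtained
    licence_id: str     # licence or registry reference under which it was used
    date_ingested: str  # when it entered the training corpus

def fingerprint(data: bytes) -> str:
    # SHA-256 is an illustrative choice; any stable digest would serve.
    return hashlib.sha256(data).hexdigest()

entry = ManifestEntry(
    work_hash=fingerprint(b"...full text of the work..."),
    source_url="https://example-creator-site.in/articles/my-story.html",
    licence_id="REG-2025-000123",  # hypothetical registry identifier
    date_ingested=str(date(2025, 1, 15)),
)
print(json.dumps(asdict(entry), indent=2))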

To address problems such as enormous datasets and data provenance, several options exist (a sketch of automated matching follows this list):

Automated Content Fingerprinting: Tools like the Data Provenance Initiative’s hashing algorithms can generate unique identifiers for creative works, enabling automated cross-referencing against training manifests. For example, DECORAIT’s decentralized ledger uses cryptographic hashes to track consent, allowing creators to register works before training begins.

Opt-In Defaults with Smart Contracts: By integrating standardized opt-in/out protocols (e.g., C2PA) into the registry, creators could pre-emptively license works under terms that trigger automated payments via smart contracts when matches are detected. This shifts the tracking burden to AI firms, not creators.

Collective Licensing via Copyright Societies: India’s existing copyright societies (e.g., IPRS for music) could manage bulk licensing and auditing, leveraging their infrastructure to identify infringements. The EU’s DSM Directive shows collective management reduces individual monitoring burdens. (related)

Hybrid Human-AI Audits: While AI firms use algorithms to curate data, the registry could mandate periodic third-party audits using tools like OLMoTrace, which links model outputs to training sources. Creators would still need to verify usage, but automated fingerprinting and collective management minimize manual effort. 
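
As flagged above, here is a minimal sketch of how fingerprint matching against a manifest could work, with the creator registry and training manifest held in memory as stand-ins for real infrastructure; a match is the event that could trigger an automated royalty payment under the smart-contract approach:

import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical creator registry: fingerprints registered before training begins.
creator_registry = {
    fingerprint(b"full text of work A"): "creator-001",
    fingerprint(b"full text of work B"): "creator-002",
}

# Hypothetical training manifest published by an AI firm.
training_manifest = [fingerprint(b"full text of work A"),
                     fingerprint(b"scraped forum post")]

# Cross-reference: every manifest hash found in the registry is a detected
# use, which could automatically trigger a royalty payment to the owner.
for h in training_manifest:
    owner = creator_registry.get(h)
    if owner:
        print(f"Match: work of {owner} used in training (hash {h[:12]}...)")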

While imperfect, this system is a critical improvement over robots.txt, shifting the compliance burden from creators to firms and enabling better redressal.

IV. Addressing Counterarguments

Q1: “Won’t this hinder innovation?”

Critics overlook the record of India’s broadcasting industry, whose growth accelerated under statutory licensing per Section 31D. After the 2012 reforms took effect, sectoral revenue increased by 23%, allowing over 300 radio stations to continue operating while providing appropriate remuneration (see here). With organized licensing structures in place, broadcasters could prioritize content diversification rather than drawn-out negotiations (related). Legal clarity best serves stakeholder interests and has also promoted innovation in small digital platforms and regional-language programming. The Madras High Court dismissed arguments that compliance stifles innovation, upholding the constitutionality of Section 31D and highlighting its role in balancing creators’ rights and open access (related).

While broadcasting and AI differ in technical execution, the constitutional balancing of rights and access under Section 31D applies universally.

Principles from Broadcasting Applicable to AI
1. Algorithmic Use ≠ Public Communication

AI leverages creative works through algorithms, not by directly sharing them with the public like broadcasting does. Yet, it still profits from creators’ efforts, as seen in Andersen v. Stability AI. Statutory licensing ensures creators are paid for this use.

2. Scalable Royalty Models

Hybrid payment systems—like small fixed fees for startups and revenue-sharing for big firms—promote fairness and access, as explained here.

3. Legal Immunity

Following a statutory licensing model protects AI firms from infringement lawsuits, as shown in Phonographic Performance Ltd v. ENIL.

Extending these principles to AI helps India boost innovation equitably while dodging U.S.-style legal battles.

Q2: “How will unclaimed works be dealt with under this statutory framework?”

A solution tested by the U.S. Copyright Office for dealing with orphan works is a mandatory licensing pool with escrowed fees, which temporarily holds royalties until rights holders come forward (US report). Section 33 of the Indian Copyright Act already empowers copyright societies to collectively manage royalty income, including that from orphan works, and this requires no amendment. The model also resembles the EU orphan works framework in that it avoids red tape without diluting incentives for rights holders (related). With this framework, India can align itself with global standards and secure fair remuneration while lessening infringement risks.

Q3: “Opt-out systems are simpler!” 

Opt-out regimes systematically disadvantage marginalized creators: only 12% of non-English content creators in India use opt-out mechanisms, compared to 68% of English-language creators (KPMG report)(Ipsos survey)(also see related). This significant disparity suggests that non-English creators have less awareness of, or access to, digital rights protections, leaving them less able to control how their content is used, especially in areas such as AI training and digital marketing.

Q4: “Global AI Firms Will Stay Away from India!” 

The EU’s AI Act shows that ethical frameworks attract investment when paired with legal certainty (read more about the EU’s AI Act here and here). While compulsory pharmaceutical licensing under Section 84 faced investor skepticism, AI licensing under Section 31D is different because it ensures reciprocity: firms pay royalties but gain legal immunity and proper access to data, making it easier to train models without the threat of costly, time-consuming copyright litigation. India’s hybrid model, token fees for startups and revenue-sharing for giants, with automated compliance via blockchain, avoids GDPR-style pitfalls such as the administrative cost concerns that plagued that regime’s initial rollout. It leverages existing digital infrastructure, offers India data sovereignty, and counters opposition through fairness and predictability.

Q5: “Small Startups Don’t Have Money to Pay Royalties!” 

Israel’s tiered licensing system, adopted in 2023, shows that hybrid models (e.g., token fees for SMEs and revenue-sharing for larger firms) work in practice. While broadcasting and AI differ in content, Section 31D(3) of the Indian Copyright Act already supports variable pricing by setting different royalties for radio and television, a principle courts have upheld as reflecting paying capacity. The same logic applies to AI: obligations tailored to firm size rather than industry mirror existing MSME exemptions (e.g., the ₹5 crore threshold), and the 355% rise in patent filings between 2016 and 2024 (refer) indicates that scalable frameworks nurture innovation (refer for data). Tiered licensing therefore addresses financial disparities by calibrating proportional fees for startups and revenue-sharing for larger firms according to firm size, not content type. It is the regulatory logic, not the industry, that justifies applying hybrid models to balance innovation and creator rights.

V. Suggestions for Implementation

Copyright Act Amendments

Expanding Section 31D to include machine learning (ML) training would subject AI companies to statutory licensing. Aligning ML data use with broadcasting’s compulsory licensing model supports creators through revenue-sharing rights and ensures public access. Amendments need to make clear that “broadcasting” encompasses algorithmic consumption of works.

Digital India Act Integration

Unlicensed training should be designated “data malpractice” under the new Act, with strict penalties. This aligns with Clause 8(5) of the DPDP Act, under which data fiduciaries must safeguard against unauthorized use of data. Such a framework would discourage exploitative scraping of India’s linguistic and cultural data.

Regulatory Architecture

A dedicated AI Training Compliance Cell under MeitY could examine training manifests and resolve disputes. To supplement this, specialist IP benches on the Delhi High Court’s IP Division model would accelerate AI copyright infringement cases. This two-pronged structure balances innovation and accountability.

VI. Conclusion

India stands at a momentous crossroads: entrench an exploitative AI economy or lead in ethical data governance. By extending Section 31D’s constitutional protections to machine learning and broadening “broadcasting” for the algorithmic age, India can safeguard its creative content while promoting AI innovation. With the majority of regional artists unaware of robots.txt, this structure democratizes access to redressal mechanisms through collective licensing and hybrid royalties. As the EU grapples with AI Act loopholes and the US delays regulation, India’s model would offer a template for Global South countries to take back control of the data economy. The era of reactionary measures is over: proactive governance is the way forward for constitutional equity.


