Securing the Future with GenAI Data Access Controls

Why you need data access controls for your GenAI systems

As generative AI (GenAI) rapidly transforms industries, the need for stringent data access controls is becoming a critical security priority. According to IBM’s Cost of a Data Breach 2024 report [1], a staggering 46% of breaches involve customer personal data, a concerning statistic as GenAI models increasingly process sensitive, proprietary, and personal data. With their ability to generate content and insights at scale, GenAI systems pose new challenges for securing data pipelines, making it essential for organisations to adopt granular, adaptive access controls to mitigate risks while harnessing the full potential of these powerful tools.

 

Objectives of GenAI access controls

The objective is to control access to a GenAI-based application using the principle of least privilege: a user can only use appropriate prompts to interact with a GenAI application, which in turn can only access data approved for that user’s level in order to provide inference (evaluated for appropriateness). This may involve any combination of control features such as data classification and categorisation, role definition, role-resource mapping, attribute-based permissioning, data masking, and encryption at rest and in transit.

 

Access control through roles and attributes

Role-based access control (RBAC) and the related attribute-based access control (ABAC) are methods of regulating access to computer or network resources based on the roles or other attributes of individual users within an organisation. RBAC ensures that only authorised individuals can access specific resources, performing only the actions necessary for their roles. ABAC uses multiple attributes to determine access to resources, of which role can be one.

The benefit of RBAC is its simplicity; users do not need to manage or remember specific permissions as their role automatically determines their access. This facilitates changes in user roles and enhances security and compliance. 
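The distinction between the two models can be sketched in a few lines of Python. The roles, resources, and attributes below are illustrative assumptions, not taken from any particular product:

```python
# Minimal sketch contrasting an RBAC decision with an ABAC decision.
# All role names, resources, and attributes here are illustrative.

ROLE_PERMISSIONS = {
    "analyst": {"sales_reports"},
    "admin": {"sales_reports", "model_config"},
}

def rbac_allows(role: str, resource: str) -> bool:
    """RBAC: access follows solely from the user's role."""
    return resource in ROLE_PERMISSIONS.get(role, set())

def abac_allows(user: dict, resource: dict) -> bool:
    """ABAC: combine several attributes, of which role is just one."""
    return (
        rbac_allows(user["role"], resource["name"])           # role attribute
        and user["department"] == resource["owning_dept"]     # department attribute
        and resource["classification"] in user["clearances"]  # sensitivity attribute
    )

user = {"role": "analyst", "department": "finance", "clearances": {"internal"}}
report = {"name": "sales_reports", "owning_dept": "finance", "classification": "internal"}

print(rbac_allows("analyst", "sales_reports"))  # True
print(abac_allows(user, report))                # True
```

Note how the ABAC check still passes through the role check: role becomes one attribute among several, which is exactly how RBAC is extended rather than replaced.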

There are, however, challenges even at an organisational level in implementing a robust role-based access system. These challenges result from:

1️⃣ Role complexity, with too many roles and a complex, network-like role hierarchy

2️⃣ Role confusion where it is unclear which role is appropriate for a particular user or task

3️⃣ Maintaining role definition accuracy over time; stale definitions lead to outdated and inconsistent access controls

4️⃣ Managing joiner/mover/leaver (JML) processes and ensuring alignment with RBAC

5️⃣ Frequent changes in dynamic environments

Additionally, more fine-grained access control becomes necessary as we delve into the complexities of AI application deployments involving ever-changing AI architectures. This is where extending RBAC with attributes to develop an entitlement and permissioning system becomes an immediate necessity.

 

Specific challenges for implementing RBAC for GenAI

GenAI applications are those that use large language models (LLMs) to generate natural language text or perform natural language understanding tasks.

LLMs are powerful tools that can enable various scenarios such as content creation, summarisation, translation, question answering, and conversational agents. However, LLMs also pose significant security challenges that need to be addressed by developers and administrators of GenAI applications. These challenges include:

1️⃣ Protecting the confidentiality and integrity of the data used to train and query the LLMs

2️⃣ Ensuring the availability and reliability of the LLMs and their services

3️⃣ Preventing the misuse or abuse of the LLMs by malicious actors or unintended users

4️⃣ Monitoring and auditing the LLMs’ outputs and behaviours for quality, accuracy, and compliance

5️⃣ Managing the ethical and social implications of the LLMs’ outputs and impacts

Effective GenAI access control requires a deep understanding of the AI system architecture and a precise identification and definition of the target features and objects accessible to AI users. These target features must be governed individually with entitlements, ensuring that users and system resources can only access, operate on, and deliver information consistent with the entitlements defined.

While RBAC is well understood in enterprise security, implementing it for GenAI systems can be challenging:

1️⃣ Missing built-in access controls: GenAI applications do not inherently have integrated RBAC features, which can lead to a host of data privacy and security issues

2️⃣ Unstructured input: Inputs to GenAI applications are usually unstructured. Requests (prompts) are typically in natural language, unlike the highly structured API calls of conventional applications, where identity-based policies are easier to implement

3️⃣ Natural language output: Typical outcomes from a GenAI application are natural text that can contain any kind of information in response to the request (prompt). These outcomes may contain sensitive information in the form of code or unstructured text

4️⃣ Model’s inherent structure: AI models are inherently complex and sometimes monolithic. Controlling access to specific parts of the model is complex

5️⃣ Extensibility: Advanced techniques like soft prompts & fine-tuning can extend a model’s existing functionality and are candidates for control

Given the impracticality of deploying multiple models, each trained specifically for an individual role in the RBAC system, access should not be binary. It should also consider additional variables, such as the hyper-parameters used to control model behaviour.
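This non-binary, shared-model idea can be sketched as per-role ceilings on generation hyper-parameters, so that one model serves every role while each request is clamped to that role's limits. The role names and limits below are assumptions for illustration:

```python
# Illustrative sketch: one shared model, per-role caps on generation
# hyper-parameters instead of a binary allow/deny. Role names and
# limit values are assumptions, not a real policy.

ROLE_LIMITS = {
    "viewer":    {"max_output_tokens": 256,  "max_temperature": 0.2},
    "analyst":   {"max_output_tokens": 1024, "max_temperature": 0.7},
    "developer": {"max_output_tokens": 4096, "max_temperature": 1.0},
}

def clamp_request(role: str, requested: dict) -> dict:
    """Clamp a user's requested hyper-parameters to their role's ceiling."""
    # Unknown roles default to the most restrictive profile.
    limits = ROLE_LIMITS.get(role, ROLE_LIMITS["viewer"])
    return {
        "max_output_tokens": min(requested.get("max_output_tokens", 256),
                                 limits["max_output_tokens"]),
        "temperature": min(requested.get("temperature", 0.0),
                           limits["max_temperature"]),
    }

print(clamp_request("viewer", {"max_output_tokens": 9999, "temperature": 1.5}))
```

The same pattern extends to any other behavioural control, such as context-window size or tool-use permissions.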

The requirement to share the same model among multiple users with different levels of access calls for a more holistic view of access controls, taking into consideration the inputs, the outputs and, in the case of retrieval-augmented generation (RAG)-based models, the AI agents.

 

Pre-requisites for a successful RBAC/ABAC implementation

A successful RBAC implementation for GenAI requires a foundation of key pre-requisites. Organisations must first define clear roles, ensuring each role has well-articulated permissions tailored to data sensitivity and usage within the GenAI framework. Comprehensive data classification is crucial, enabling more granular control over who accesses specific datasets.

Additionally, regular audits and monitoring processes should be established to prevent privilege creep and ensure compliance. Lastly, cross-departmental collaboration is vital, as security, IT, and AI teams must align on policies to effectively manage the unique risks posed by GenAI systems while maintaining operational efficiency.

| Pre-requisite | Description |
| --- | --- |
| Role definition | Agree on role definitions and the role hierarchy |
| Access mapping | Mapping of roles and resources to data access entitlements |
| Data classification | Classification and categorisation of data |
| Data labelling | Labelling raw data based on privacy, security and sensitivity |
| Data curation | Creation of training and validation datasets and embeddings, incorporating data classification and privacy elements |
| IAM (Identity & Access Management) | Roles aligned to IAM and managed via JML processes |
| RBAC | Resource role-based policies for controlling access |
| ABAC | Resource attribute-based policies for controlling access |
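A minimal sketch of how the first few pre-requisites (role definition, access mapping, data classification) fit together in code. The dataset names, classification labels, and roles are illustrative assumptions:

```python
# Sketch tying together classified datasets, role definitions, and a
# role-to-entitlement mapping. All names and labels are illustrative.

# Data classification: every dataset carries a sensitivity label.
DATASETS = {
    "hr_records":     {"classification": "confidential"},
    "public_faq":     {"classification": "public"},
    "sales_pipeline": {"classification": "internal"},
}

# Role definition + access mapping: the classifications each role may read.
ROLE_ENTITLEMENTS = {
    "employee":   {"public"},
    "analyst":    {"public", "internal"},
    "hr_partner": {"public", "internal", "confidential"},
}

def readable_datasets(role: str) -> set:
    """Datasets whose classification falls within the role's entitlements."""
    allowed = ROLE_ENTITLEMENTS.get(role, set())
    return {name for name, meta in DATASETS.items()
            if meta["classification"] in allowed}

print(sorted(readable_datasets("analyst")))  # ['public_faq', 'sales_pipeline']
```

In practice the role table would be sourced from the IAM system and kept current by the JML processes listed above, rather than hard-coded.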

 

Implementation approach

Implementing RBAC for AI applications (GenAI, RAG-based, AI models) requires an understanding of the AI architecture in play and is best approached in layers that closely follow the AI architecture components. GenAI models can generate diverse content, including language output, audio, images, and even video.

Our initial focus is on large language models (LLMs), which generate natural language output. This creates a scope boundary and provides a view of the threat surface to be covered for the LLM application, and of the controls required to ensure data privacy and security.

Subsequent enhancements will investigate how the controls can be extended to cover the additional complexities of the multimodal capabilities of GenAI applications.

Securing LLM applications using access controls can be achieved through a layered approach, where each layer is secured against unauthorised access alongside core data security measures such as masking and encryption. The layers are:

Layer 1: End user layer access control

* This layer controls who can access the GenAI tools themselves. It involves defining user roles and permissions to determine which employees can interact with the GenAI applications/ agents

* The focus is on ensuring that only authorised users can use the GenAI tools, thereby preventing unauthorised access to the system itself

Layer 2: AI layer access control

* At this layer, access control governs what data and functionality the GenAI model and agents can access, based on the permissions of the user making the request

* The AI model respects the user’s role and permissions when processing prompts and retrieving information, ensuring that sensitive data is accessed only by users with appropriate clearance
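One way this layer of enforcement can look in a RAG pipeline is filtering retrieved chunks against the requesting user's entitlements before they ever reach the model's context window. The document labels below are illustrative:

```python
# Sketch of AI-layer enforcement in a RAG pipeline: retrieved chunks are
# filtered against the user's entitlements *before* they are assembled
# into the model's context. Labels and texts are illustrative.

def filter_context(chunks: list, user_clearances: set) -> list:
    """Keep only the chunks the user is entitled to see."""
    return [c for c in chunks if c["label"] in user_clearances]

retrieved = [
    {"text": "Q3 revenue summary", "label": "internal"},
    {"text": "Board meeting minutes", "label": "restricted"},
]

context = filter_context(retrieved, {"public", "internal"})
print([c["text"] for c in context])  # ['Q3 revenue summary']
```

Filtering before context assembly, rather than after generation, ensures the model never sees material the user is not cleared for, so it cannot leak it.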

Layer 3: Data layer access control

* At this layer, access control governs what data the AI model can access, based on the permissions of the user making the request

* The focus here is controlling access to data being used and produced by the model

Layer 4: Infrastructure layer access control

* At this layer, access control governs who has access to the infrastructure where the AI solution is deployed

* The focus here is in providing secure access to the GenAI deployment infrastructure

To ensure consistent security throughout the AI lifecycle, RBAC policies should integrate model architecture information, training procedures, data, and logs with the access policies for inference use. Adopting a data-centric approach to designing RBAC policies allows organisations to implement granular policies while treating AI systems as a single entity throughout their life cycle. While RBAC may set a foundational security baseline for enterprise AI systems, it falls short of the nuanced granularity required for data access by agents, and this is where attribute-based access control (ABAC) comes into play.

Put together, RBAC and ABAC should provide a level of security commensurate with the secure use of GenAI applications.

 

Solution architecture for GenAI data access controls

A layered AI solutions architecture is outlined here:

Layer 1: End user layer access control

| Component | Threats | Controls |
| --- | --- | --- |
| End user application: a web application serving as the front end for delivering inference | Access to GenAI application | RBAC: controlling who has access to the web application |
| AI inputs (prompt template): GenAI inputs, or “prompts”, that instruct the GenAI model | Prompt injection | RBAC policy to control access to templates; limiting prompt parameter length; limiting prompts to specific formats; restricting parameter values to a predefined set |
| Prompt intent detection | Adversarial attack | DLP: security policies at inference endpoints to control data privacy, sensitivity and exfiltration; RBAC: role-based control over who (human) and what (system) can access the API; encrypted traffic: bidirectional encryption to prevent data exploitation in transit |
| API: connects to the underlying application/model, transferring information bidirectionally | Access to GenAI application | DLP: security policies at inference endpoints to control data privacy, sensitivity and exfiltration; RBAC: role-based control over who (human) and what (system) can access the API; encrypted traffic: bidirectional encryption to prevent data exploitation in transit |
| AI outcomes: outcomes from GenAI models (primarily unstructured), analysed, tagged and flagged for potentially harmful or sensitive content | Harmful/sensitive data leakage | RBAC: policies on harmful or sensitive content |
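The prompt-template controls listed above (parameter length limits, format restrictions, and predefined value sets) can be sketched as a validation step applied before the template is filled. The template and rules below are illustrative assumptions:

```python
import re

# Sketch of prompt-template controls: parameter length limits, format
# restrictions, and predefined value sets, enforced before rendering.
# The template and its rules are illustrative, not from any product.

TEMPLATE = "Summarise the {doc_type} report for region {region}."
RULES = {
    "doc_type": {"allowed": {"sales", "finance"}},            # predefined set
    "region":   {"max_len": 12, "pattern": r"^[A-Za-z ]+$"},  # length + format
}

def render_prompt(params: dict) -> str:
    """Validate each parameter against its rules, then fill the template."""
    for name, value in params.items():
        rule = RULES[name]
        if "allowed" in rule and value not in rule["allowed"]:
            raise ValueError(f"{name}: value not in allowed set")
        if "max_len" in rule and len(value) > rule["max_len"]:
            raise ValueError(f"{name}: too long")
        if "pattern" in rule and not re.match(rule["pattern"], value):
            raise ValueError(f"{name}: bad format")
    return TEMPLATE.format(**params)

print(render_prompt({"doc_type": "sales", "region": "EMEA"}))
```

Because users can only supply constrained parameter values rather than free-form text, this style of template sharply reduces the surface available for prompt injection.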

 

Layer 2: AI layer access control

| Component | Threats | Controls |
| --- | --- | --- |
| LLM model weights: the trained, critical parameters of the model | Data/IP theft, model manipulation | ABAC: granular policies to control which resources can access or modify GenAI model weights; data encryption: encrypt all training data at rest and in transit to protect against unauthorised access and tampering |
| LLM hyperparameters: settings like temperature, input context size and output size, influencing model behaviour and output | Model behaviour/IP theft | ABAC: granular policies to control which resources can access or modify GenAI model hyperparameters |
| RAG LLM: enables access to external data not included in the model during training | Data leakage, data corruption, service disruption | ABAC: granular policies by which users are assigned specific roles and permissions to use a particular agent or agent capability, covering agents, tools and reader/retriever components |
| Training: training GenAI models with RBAC features | Data poisoning | Model training: apply RBAC principles during the data preprocessing, tokenisation and embedding stages, filtering training data based on the defined access controls so the LLM only learns from data appropriate for each user role; adversarial training: a defensive technique that introduces adversarial examples into the training data to teach the model to correctly classify intentionally misleading inputs |
| Vector DB: used to store, index and retrieve embeddings for use by the AI models | Data leakage, service disruption, model inference manipulation | RBAC: policies ensuring that only authorised personnel have access to the database, with access levels differentiated by role even within that group |
| Fine-tuning: methods for fine-tuning LLMs (e.g. LoRA) that introduce additional tuning parameters | Model behaviour, IP theft, data leakage | ABAC: granular policies to control which resources can access or modify GenAI model tuning parameters |

 

Layer 3: Data layer access control

| Component | Threats | Controls |
| --- | --- | --- |
| Transactional & analytical data store: data warehouses/lakes for real-time transactional business data (financial and non-financial) alongside analytical and slow-changing data (e.g. HR data) | Data poisoning | RBAC/ABAC: granular control of data access within the organisation; data encryption: encrypt all training data at rest and in transit to protect against unauthorised access and tampering |
| Journals data store | Data & IP theft, service recovery disruption | RBAC/ABAC: granular control of data access within the organisation |
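The core data-security measure of masking, mentioned alongside encryption earlier, can be sketched as a transformation applied before records leave the data layer for training or retrieval pipelines. The field names below are illustrative:

```python
# Sketch of data-layer masking: sensitive field values are replaced
# before records are exposed to training or retrieval pipelines.
# Field names and the record are illustrative.

SENSITIVE_FIELDS = {"email", "account_number"}

def mask_record(record: dict) -> dict:
    """Replace sensitive field values with a fixed mask token."""
    return {k: ("****" if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}

row = {"name": "A. Sample", "email": "a.sample@example.com", "balance": 120.5}
print(mask_record(row))  # {'name': 'A. Sample', 'email': '****', 'balance': 120.5}
```

Masking at the data layer complements the access-layer controls above: even if an over-privileged query slips through, the sensitive values themselves are no longer present.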

 

Layer 4: Infrastructure layer access control

| Component | Threats | Controls |
| --- | --- | --- |
| Infrastructure: development, training and production environments used to deploy AI inferencing applications | Service disruption, privacy & compliance breach | RBAC/ABAC: granular control of AI infrastructure; data encryption: encrypt all training data at rest and in transit to protect against unauthorised access and tampering |
| Entitlements management: defining the user entitlements and permissions required to enforce access controls | Unauthorised access to systems within the organisation/information theft | RBAC: policies ensuring that only authorised personnel have access to the entitlements management application and data |

 

Conclusion

In conclusion, effectively implementing role-based access control (RBAC) for generative AI is crucial for safeguarding sensitive data while maximising the technology’s potential. By establishing clear roles, conducting thorough data classification, and fostering collaboration across teams, organisations can create a robust security framework that mitigates risks associated with GenAI. Regular audits and monitoring will further enhance the system’s resilience against insider threats and compliance breaches. As the landscape of AI continues to evolve, organisations must remain proactive in refining their access control strategies to ensure that innovation does not come at the expense of security and data integrity.

 

References

  1. IBM, Cost of a Data Breach Report 2024: https://www.ibm.com/reports/data-breach

 

 

Jaidip Banerjee
Thushan Kumaraswamy