Sensitive Data Detection

MCPProxy includes automatic sensitive data detection that scans MCP tool call arguments and responses for secrets, credentials, API keys, and other potentially exposed data. This feature helps identify accidental data exposure in your AI agent workflows.

Full Configuration Schema

{
  "sensitive_data_detection": {
    "enabled": true,
    "scan_requests": true,
    "scan_responses": true,
    "max_payload_size_kb": 1024,
    "entropy_threshold": 4.5,
    "categories": {
      "cloud_credentials": true,
      "private_key": true,
      "api_token": true,
      "auth_token": true,
      "sensitive_file": true,
      "database_credential": true,
      "high_entropy": true,
      "credit_card": true
    },
    "custom_patterns": [
      {
        "name": "acme_api_key",
        "regex": "ACME-[A-Z0-9]{32}",
        "severity": "high",
        "category": "api_token"
      }
    ],
    "sensitive_keywords": ["SECRET_PROJECT", "INTERNAL_KEY"]
  }
}

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable or disable sensitive data detection entirely |
| scan_requests | boolean | true | Scan tool call arguments for sensitive data |
| scan_responses | boolean | true | Scan tool responses for sensitive data |
| max_payload_size_kb | integer | 1024 | Maximum payload size to scan, in kilobytes |
| entropy_threshold | float | 4.5 | Shannon entropy threshold for high-entropy string detection |

Detection Categories

MCPProxy detects sensitive data across multiple categories. Each category can be individually enabled or disabled.

Category Reference

| Category | Description | Severity | Examples |
| --- | --- | --- | --- |
| cloud_credentials | Cloud provider credentials | Critical/High | AWS access keys, GCP API keys, Azure connection strings |
| private_key | Cryptographic private keys | Critical | RSA, EC, DSA, OpenSSH, PGP private keys |
| api_token | Service API tokens | Critical/High | GitHub PATs, Stripe keys, OpenAI keys, Anthropic keys |
| auth_token | Authentication tokens | High/Medium | JWT tokens, Bearer tokens |
| sensitive_file | Sensitive file paths | High | SSH keys, credentials files, private key files |
| database_credential | Database connection strings | Critical/High | MySQL, PostgreSQL, MongoDB, Redis connection strings |
| high_entropy | High-entropy strings | Medium | Random strings that may be secrets |
| credit_card | Payment card numbers | Critical | Credit card numbers (Luhn-validated) |

Built-in Detection Patterns

Cloud Credentials

  • AWS Access Key: AKIA..., ASIA... (20 characters)
  • AWS Secret Key: 40-character base64 strings
  • GCP API Key: AIza... (39 characters)
  • GCP Service Account: JSON with "type": "service_account"
  • Azure Client Secret: 34+ character strings with special characters
  • Azure Connection String: Contains AccountKey=...

Private Keys

  • RSA Private Key: -----BEGIN RSA PRIVATE KEY-----
  • EC Private Key: -----BEGIN EC PRIVATE KEY-----
  • DSA Private Key: -----BEGIN DSA PRIVATE KEY-----
  • OpenSSH Private Key: -----BEGIN OPENSSH PRIVATE KEY-----
  • PGP Private Key: -----BEGIN PGP PRIVATE KEY BLOCK-----
  • PKCS#8 Private Key: -----BEGIN PRIVATE KEY-----

API Tokens

  • GitHub: ghp_..., gho_..., ghs_..., ghr_..., github_pat_...
  • GitLab: glpat-...
  • Stripe: sk_live_..., pk_live_..., sk_test_...
  • Slack: xoxb-..., xoxp-..., xapp-...
  • SendGrid: SG....
  • OpenAI: sk-..., sk-proj-...
  • Anthropic: sk-ant-api...
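
Several of these prefixes overlap: an Anthropic key beginning with sk-ant-api also begins with the OpenAI prefix sk-. The sketch below illustrates why a prefix-based matcher would need to test more specific prefixes first. It is not MCPProxy's implementation; the `TOKEN_PREFIXES` table and the `guess_provider` helper are hypothetical names, and only prefixes that the list above spells out in full are included.

```python
from typing import Optional

# Hypothetical lookup table built from the prefixes listed above.
# Order matters: more specific prefixes come before shorter ones they extend.
TOKEN_PREFIXES = [
    ("github", ("ghp_", "gho_", "ghs_", "ghr_", "github_pat_")),
    ("gitlab", ("glpat-",)),
    ("stripe", ("sk_live_", "pk_live_", "sk_test_")),
    ("slack", ("xoxb-", "xoxp-", "xapp-")),
    ("anthropic", ("sk-ant-api",)),       # must precede the generic "sk-"
    ("openai", ("sk-proj-", "sk-")),
]


def guess_provider(token: str) -> Optional[str]:
    """Return the first provider whose prefix matches, or None."""
    for provider, prefixes in TOKEN_PREFIXES:
        if token.startswith(prefixes):  # str.startswith accepts a tuple
            return provider
    return None
```

A real detector would also check token length and character set, since a bare prefix match is prone to false positives.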

Authentication Tokens

  • JWT: Base64-encoded tokens starting with eyJ
  • Bearer Token: Bearer ... authorization headers

Database Credentials

  • MySQL connection strings: mysql://user:pass@host
  • PostgreSQL connection strings: postgresql://user:pass@host
  • MongoDB connection strings: mongodb://user:pass@host
  • Redis connection strings: redis://:pass@host
  • Database password environment variables: DB_PASSWORD=...

Credit Cards

  • Visa, Mastercard, American Express, Discover, JCB, Diners Club
  • Validated using the Luhn algorithm
  • Known test card numbers are flagged as likely examples (see is_likely_example under Detection Results)
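
For reference, the Luhn check mentioned above can be sketched as follows. This is a generic implementation of the public algorithm, not MCPProxy's code; the `luhn_valid` name and the 13 to 19 digit length bound are assumptions made for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    # Assumed length bound for payment card numbers.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

The widely published Visa test number 4111 1111 1111 1111 passes this check, which is why a detector needs a separate list of known test numbers to mark such matches as examples.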

Enabling/Disabling Categories

To disable specific categories:

{
  "sensitive_data_detection": {
    "categories": {
      "cloud_credentials": true,
      "private_key": true,
      "api_token": true,
      "auth_token": true,
      "sensitive_file": true,
      "database_credential": true,
      "high_entropy": false,
      "credit_card": true
    }
  }
}

Categories not specified in the configuration are enabled by default.
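
Because unspecified categories default to enabled, a configuration only needs to list the categories it turns off. A minimal sketch of that merge, assuming the eight category names from the reference table (`DEFAULT_CATEGORIES` and `effective_categories` are hypothetical names, not part of MCPProxy):

```python
# Every built-in category is assumed to start out enabled.
DEFAULT_CATEGORIES = {
    "cloud_credentials": True,
    "private_key": True,
    "api_token": True,
    "auth_token": True,
    "sensitive_file": True,
    "database_credential": True,
    "high_entropy": True,
    "credit_card": True,
}


def effective_categories(config: dict) -> dict:
    """Overlay the user's category settings on the all-enabled defaults."""
    section = config.get("sensitive_data_detection", {})
    merged = dict(DEFAULT_CATEGORIES)
    merged.update(section.get("categories", {}))
    return merged
```

Under this reading, a config containing only "high_entropy": false behaves the same as the longer example above.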

Custom Patterns Configuration

You can define custom detection patterns for organization-specific secrets or internal credentials.

Regex-Based Patterns

Use regular expressions to match specific formats:

{
  "sensitive_data_detection": {
    "custom_patterns": [
      {
        "name": "acme_api_key",
        "regex": "ACME-[A-Z0-9]{32}",
        "severity": "high",
        "category": "api_token"
      },
      {
        "name": "internal_service_token",
        "regex": "SVC_[a-zA-Z0-9]{24}_[0-9]{10}",
        "severity": "critical",
        "category": "auth_token"
      },
      {
        "name": "internal_db_password",
        "regex": "(?i)INTERNAL_DB_PASS=[^\\s]+",
        "severity": "critical",
        "category": "database_credential"
      }
    ]
  }
}
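
It can help to try custom patterns against sample strings before deploying them. The sketch below loads the three patterns from a JSON fragment and tests them with Python's `re` module. This assumes the expressions behave the same way in MCPProxy's regex engine, which may not hold for every construct; `CONFIG_JSON` and `matching_patterns` are names invented for the example.

```python
import json
import re

# The fragment as it would appear in the config file. The raw string keeps
# the doubled backslash, which JSON decodes to the regex token \s.
CONFIG_JSON = r'''
{
  "custom_patterns": [
    {"name": "acme_api_key", "regex": "ACME-[A-Z0-9]{32}"},
    {"name": "internal_service_token", "regex": "SVC_[a-zA-Z0-9]{24}_[0-9]{10}"},
    {"name": "internal_db_password", "regex": "(?i)INTERNAL_DB_PASS=[^\\s]+"}
  ]
}
'''


def matching_patterns(text: str, config_json: str = CONFIG_JSON) -> list:
    """Return the names of all custom patterns that match `text`."""
    patterns = json.loads(config_json)["custom_patterns"]
    return [p["name"] for p in patterns if re.search(p["regex"], text)]
```

Note that a single backslash before `s` would be an invalid JSON escape, so the doubled form in the config is required.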

Keyword-Based Patterns

Use simple keyword matching for straightforward detection:

{
  "sensitive_data_detection": {
    "custom_patterns": [
      {
        "name": "internal_project_id",
        "keywords": ["PROJ-SECRET", "INTERNAL-KEY", "CONFIDENTIAL-TOKEN"],
        "severity": "medium"
      },
      {
        "name": "legacy_api_marker",
        "keywords": ["X-Legacy-Auth", "OldApiKey"],
        "severity": "low",
        "category": "api_token"
      }
    ]
  }
}

Pattern Configuration Options

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique identifier for the pattern |
| regex | string | No* | Regular expression pattern |
| keywords | array | No* | List of keywords to match (case-insensitive) |
| severity | string | Yes | Risk level: critical, high, medium, or low |
| category | string | No | Category for grouping (defaults to custom) |

*Either regex or keywords must be specified, but not both.
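
The field rules above can be checked before a config is deployed. The following is a hypothetical pre-flight validator written for illustration; `validate_pattern` is not an MCPProxy function, and the error messages are invented.

```python
from typing import List

VALID_SEVERITIES = {"critical", "high", "medium", "low"}


def validate_pattern(pattern: dict) -> List[str]:
    """Return a list of problems found in one custom pattern definition."""
    errors = []
    if not pattern.get("name"):
        errors.append("name is required")
    if pattern.get("severity") not in VALID_SEVERITIES:
        errors.append("severity must be critical, high, medium, or low")
    # Exactly one of regex / keywords must be present.
    has_regex = "regex" in pattern
    has_keywords = "keywords" in pattern
    if has_regex == has_keywords:
        errors.append("specify exactly one of regex or keywords")
    return errors
```

An empty list means the definition satisfies the table; category is optional, so it is not checked.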

Severity Levels

| Severity | Description | Use Cases |
| --- | --- | --- |
| critical | Immediate security risk | Private keys, cloud credentials, production API keys |
| high | Significant security concern | API tokens, database passwords, OAuth tokens |
| medium | Potential security issue | High-entropy strings, internal tokens |
| low | Informational | Keywords, debug markers |

Sensitive Keywords Configuration

For simple keyword matching without creating full pattern definitions, use the sensitive_keywords array:

{
  "sensitive_data_detection": {
    "sensitive_keywords": [
      "SUPER_SECRET",
      "INTERNAL_API_KEY",
      "CONFIDENTIAL_TOKEN",
      "PRIVATE_DATA",
      "DO_NOT_SHARE"
    ]
  }
}

Keywords are matched case-insensitively. Each match is reported with:

  • Type: sensitive_keyword
  • Category: custom
  • Severity: low
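
As a rough illustration of this behavior, the sketch below performs case-insensitive substring matching and reports each hit with the three fixed fields listed above. The `find_sensitive_keywords` function and the extra `keyword` field are assumptions added for the example; the source does not say whether MCPProxy matches on substrings or word boundaries.

```python
def find_sensitive_keywords(text: str, keywords: list) -> list:
    """Report each configured keyword that occurs in `text`, ignoring case."""
    lowered = text.lower()
    return [
        {
            "type": "sensitive_keyword",
            "category": "custom",
            "severity": "low",
            "keyword": keyword,  # hypothetical field, for readability only
        }
        for keyword in keywords
        if keyword.lower() in lowered
    ]
```

Since every keyword match is reported at low severity, secrets that warrant a higher severity are better expressed as entries in custom_patterns.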

Entropy Threshold Tuning

Understanding Shannon Entropy

Shannon entropy measures the randomness of a string. Higher entropy indicates more randomness, which often suggests a secret or credential.

Entropy Ranges:

| Range | Description | Examples |
| --- | --- | --- |
| < 3.0 | Low entropy | Natural language, repeated characters |
| 3.0 - 4.0 | Medium entropy | Encoded data, UUIDs |
| 4.0 - 4.5 | High entropy | Possibly a secret |
| > 4.5 | Very high entropy | Likely a random secret |
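
To make these ranges concrete, here is a minimal sketch of per-character Shannon entropy in bits. It assumes MCPProxy uses the standard base-2 definition over character frequencies, which is consistent with the ranges in the table but is not confirmed by the source.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Per-character Shannon entropy of `text`, in bits."""
    if not text:
        return 0.0
    length = len(text)
    counts = Counter(text)
    # H = -sum(p * log2(p)) over the frequency p of each distinct character.
    return -sum((n / length) * math.log2(n / length) for n in counts.values())
```

Under this definition the entropy of a string can never exceed log2 of its length, so a 20-character string tops out at about 4.32 bits. A string needs at least 23 distinct characters to exceed the default threshold of 4.5.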

Adjusting the Threshold

The default threshold of 4.5 balances detection coverage against false positives:

{
  "sensitive_data_detection": {
    "entropy_threshold": 4.5
  }
}

Lower threshold (e.g., 4.0):

  • More detections
  • Higher false positive rate
  • Use when security is paramount

Higher threshold (e.g., 5.0):

  • Fewer detections
  • Lower false positive rate
  • Use when dealing with many encoded strings

High-Entropy Detection Behavior

  • Scans for strings 20+ characters matching base64-like patterns
  • Applies entropy calculation to each candidate
  • Skips strings already matched by other patterns (to avoid duplicates)
  • Limited to 5 high-entropy matches per scan to prevent noise

Performance Considerations

Payload Size Limits

The max_payload_size_kb setting controls the maximum size of content scanned:

{
  "sensitive_data_detection": {
    "max_payload_size_kb": 1024
  }
}

Impact:

  • Larger limits increase scan time
  • Content exceeding the limit is truncated
  • Truncated scans are marked with truncated: true in results
  • Default of 1024 KB (1 MB) balances thoroughness with performance
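
The truncation behavior can be pictured as follows. This sketch assumes 1 KB means 1024 bytes and that the scanner keeps the beginning of the payload; `prepare_payload` is a hypothetical helper, not an MCPProxy API.

```python
def prepare_payload(payload: bytes, max_payload_size_kb: int = 1024):
    """Return the portion of `payload` to scan and whether it was cut."""
    limit = max_payload_size_kb * 1024  # assumed: 1 KB = 1024 bytes
    truncated = len(payload) > limit
    return payload[:limit], truncated
```

The practical consequence is that a secret located past the limit is never scanned, so a result with truncated: true should not be read as a clean bill of health.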

High-Security Environments:

{
  "sensitive_data_detection": {
    "enabled": true,
    "scan_requests": true,
    "scan_responses": true,
    "max_payload_size_kb": 2048,
    "entropy_threshold": 4.0
  }
}

Performance-Sensitive Environments:

{
  "sensitive_data_detection": {
    "enabled": true,
    "scan_requests": true,
    "scan_responses": false,
    "max_payload_size_kb": 512,
    "entropy_threshold": 4.8,
    "categories": {
      "high_entropy": false
    }
  }
}

Minimal Detection (Critical Only):

{
  "sensitive_data_detection": {
    "enabled": true,
    "scan_requests": true,
    "scan_responses": true,
    "categories": {
      "cloud_credentials": true,
      "private_key": true,
      "api_token": true,
      "auth_token": false,
      "sensitive_file": false,
      "database_credential": true,
      "high_entropy": false,
      "credit_card": true
    }
  }
}

Detection Limits

  • Maximum 50 detections per scan to prevent excessive processing
  • High-entropy detection limited to 5 matches per content block
  • Patterns are evaluated in order, and evaluation stops once the detection limit is reached

Detection Results

When sensitive data is detected, the result includes:

{
  "detected": true,
  "detections": [
    {
      "type": "aws_access_key",
      "category": "cloud_credentials",
      "severity": "critical",
      "location": "arguments",
      "is_likely_example": false
    }
  ],
  "scan_duration_ms": 12,
  "truncated": false
}

Result Fields

| Field | Description |
| --- | --- |
| detected | true if any sensitive data was found |
| detections | Array of detection details |
| scan_duration_ms | Time taken to scan, in milliseconds |
| truncated | true if the payload exceeded the maximum size and was truncated |

Detection Fields

| Field | Description |
| --- | --- |
| type | Pattern name that matched (e.g., aws_access_key) |
| category | Detection category (e.g., cloud_credentials) |
| severity | Risk level (critical, high, medium, low) |
| location | Where the match was found (arguments or response) |
| is_likely_example | true if the match appears to be a known test/example value |
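
A consumer of these results might filter out likely examples and low-severity noise before alerting. The sketch below works on a result object shaped like the JSON above. How the result is obtained from MCPProxy is outside its scope, and `SEVERITY_RANK` and `actionable` are hypothetical names.

```python
import json

# Ordering of the four documented severity levels, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def actionable(result: dict, min_severity: str = "high") -> list:
    """Return detections at or above `min_severity` that are not examples."""
    threshold = SEVERITY_RANK[min_severity]
    return [
        d
        for d in result.get("detections", [])
        if SEVERITY_RANK[d["severity"]] >= threshold
        and not d.get("is_likely_example", False)
    ]


sample = json.loads("""
{
  "detected": true,
  "detections": [
    {"type": "aws_access_key", "category": "cloud_credentials",
     "severity": "critical", "location": "arguments",
     "is_likely_example": false}
  ],
  "scan_duration_ms": 12,
  "truncated": false
}
""")

for detection in actionable(sample):
    print(detection["type"], detection["severity"], detection["location"])
```

Checking the truncated field alongside the detections is also worthwhile, because an empty list on a truncated scan covers only part of the payload.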

Disabling Detection

To completely disable sensitive data detection:

{
  "sensitive_data_detection": {
    "enabled": false
  }
}

Or to scan only requests (not responses):

{
  "sensitive_data_detection": {
    "scan_requests": true,
    "scan_responses": false
  }
}