# duplicateDetection

> Command: `duplicateDetection`  
> Category: **Analysis Tools**  
> Status: Production Ready

## Description

Finds duplicate records in HANA tables using various matching strategies. It supports exact matching, fuzzy matching with similarity thresholds, and partial key matching to identify near-duplicates.

### What Are Duplicate Records?

**Duplicate records** are multiple entries in a table that represent the same real-world entity but were entered separately. Common examples:

- Same customer entered twice with slightly different names (John Smith vs. Jon Smith)
- Same product created multiple times due to system errors
- Duplicate transactions from failed batch retries
- Data imported twice due to incomplete cleanup

### Why Is Duplicate Detection Critical?

Duplicate data creates significant problems across your organization:

**Data Quality Issues:**

- **False Uniqueness**: Records that should be unique (customers, products) appear multiple times
- **Skewed Metrics**: Counts, aggregations, and statistics become inaccurate
- **Broken Relationships**: Foreign key references may point to wrong duplicate copies
- **Data Inconsistency**: Updates to one copy don't reflect in other duplicates

**Business Impact:**

- **Incorrect Revenue**: Duplicate customer records inflate customer counts and revenue figures
- **Invalid Analytics**: Reports show wrong trends, patterns, and insights
- **Marketing Waste**: Marketing campaigns target duplicate customer records unnecessarily
- **Compliance Risk**: Regulations (GDPR, CCPA) require accurate, non-redundant personal data
- **Loss of Trust**: Duplicate billing or communications damage customer relationships
- **Decision Errors**: Leadership makes decisions based on inflated or inaccurate data

**Operational & Financial Impact:**

- **Processing Waste**: Systems process duplicate records unnecessarily (storage, memory, CPU)
- **Storage Growth**: Database grows unnecessarily with redundant data
- **Manual Cleanup Costs**: Requires time-consuming manual review and merging
- **Integration Failures**: Other systems reject or duplicate data when integrating duplicates
- **Customer Support Issues**: Customers report receiving duplicate communications or bills
- **System Performance**: More records mean slower queries and reports

**Common Real-World Scenarios:**

1. **E-commerce**: Customer "John Smith" entered as "Jon Smith" and "John Smyth" → duplicate orders and shipping
2. **Healthcare**: Patient registered twice under slightly different spellings → medication overdose risk
3. **CRM**: Company "ABC Corp" and "ABC Corporation" tracked as different accounts → lost sales tracking
4. **Finance**: Same invoice processed twice → double-counting revenue
5. **Manufacturing**: Part number "A001" and "A-001" treated as different items → inventory mismatch

### How Duplicate Detection Helps

#### 1. Data Quality Assurance

```bash
# Identify duplicate customers by key columns
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns CUSTOMER_EMAIL \
  --mode exact
```

Find exact duplicates so you can decide which record to keep.

#### 2. Fuzzy Matching for Near-Duplicates

```bash
# Find similar customer names (typos, variations)
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns FIRST_NAME,LAST_NAME \
  --mode fuzzy \
  --threshold 0.85
```

Discover records that are similar but not identical (85% match threshold).

#### 3. Post-Migration Validation

```bash
# Ensure migration didn't create duplicates
hana-cli duplicateDetection \
  --table PRODUCTS \
  --checkColumns PRODUCT_SKU,PRODUCT_NAME \
  --limit 100000
```

Verify data integrity after system migration or import.

#### 4. Merge Strategy Planning

```bash
# Generate detailed duplicate report for analysis
hana-cli duplicateDetection \
  --table SUPPLIERS \
  --keyColumns SUPPLIER_NAME,COUNTRY \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates-analysis.json
```

Export duplicates for review and decision-making before merging.

#### 5. Ongoing Monitoring

```bash
# Regular duplicate checks as part of data governance
hana-cli duplicateDetection \
  --table vendor_contracts \
  --checkColumns vendor_id,contract_number \
  --mode exact \
  --output daily-dups.csv
```

Monitor for new duplicates introduced by ongoing operations.

## Syntax

```bash
hana-cli duplicateDetection [options]
```

## Aliases

- `dupdetect`
- `findDuplicates`
- `duplicates`

## Command Diagram

```mermaid
graph TD
    A["🔷 hana-cli duplicateDetection"]
    A --> B["📋 Required Parameters"]
    B --> B1["-t, --table: Table to check"]
    B1 --> B2["-k, --keyColumns: Key columns"]
    B2 --> C["📍 Schema & Connection"]
    C --> C1["-s, --schema: Schema for table"]
    C1 --> C2["-a, --admin: Connect via admin"]
    C2 --> C3["--conn: Connection file override"]
    C3 --> D["📊 Column Selection"]
    D --> D1["-c, --checkColumns: Columns to check"]
    D1 --> D2["-e, --excludeColumns: Exclude columns"]
    D2 --> E["🔬 Detection Options"]
    E --> E1["-m, --mode: exact/fuzzy/partial"]
    E1 --> E2["--threshold, --th: Match threshold"]
    E2 --> E3["-l, --limit: Max rows to check"]
    E3 --> E4["--timeout, --to: Timeout"]
    E4 --> F["🔢 Output & Format"]
    F --> F1["-o, --output: Report file"]
    F1 --> F2["-f, --format: Report format"]
    F2 --> F3["-p, --profile: CDS Profile"]
    F3 --> G["🔍 Troubleshooting"]
    G --> G1["--disableVerbose, --quiet"]
    G1 --> G2["-d, --debug: Debug mode"]
    G2 --> H["✅ Help: -h, --help"]
    
    style A fill:#0070C0,color:#fff,stroke:#fff,stroke-width:2px
    style H fill:#51CF66,color:#fff,stroke:#fff,stroke-width:2px
```

## Parameters

### Positional Arguments

This command has no positional arguments.

### Options

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--table` | `-t` | string | required | Name of the table to check |
| `--schema` | `-s` | string | `**CURRENT_SCHEMA**` | Schema for table |
| `--keyColumns` | `-k` | string | required | Comma-separated key columns for matching |
| `--checkColumns` | `-c` | string | - | Columns to Check for Duplicates (comma-separated, optional) |
| `--excludeColumns` | `-e` | string | - | Columns to Exclude from Check (comma-separated, optional) |
| `--mode` | `-m` | string | `exact` | Detection Mode. Choices: `exact`, `fuzzy`, `partial` |
| `--threshold` | `--th` | number | `0.95` | Similarity Threshold for Fuzzy Matching (0-1) |
| `--output` | `-o` | string | - | Output Report File Path |
| `--format` | `-f` | string | `summary` | Report Format. Choices: `json`, `csv`, `summary` |
| `--limit` | `-l` | number | `10000` | Maximum Rows to Check |
| `--timeout` | `--to` | number | `3600` | Operation Timeout in Seconds |
| `--profile` | `-p` | string | - | CDS Profile |
| `--help` | `-h` | boolean | - | Show help |

### Connection Parameters

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--admin` | `-a` | boolean | `false` | Connect via admin (default-env-admin.json) |
| `--conn` | - | string | - | Connection Filename to override default-env.json |

### Troubleshooting

| Option | Alias | Type | Default | Description |
| --- | --- | --- | --- | --- |
| `--disableVerbose` | `--quiet` | boolean | `false` | Disable verbose output - removes extra output mainly intended for human-readable usage |
| `--debug` | `-d` | boolean | `false` | Debug hana-cli itself by adding lots of intermediate details |

For a complete list of parameters and options, use:

```bash
hana-cli duplicateDetection --help
```

## Detection Modes

- **exact** (default) - Find identical values in key columns
- **fuzzy** - Find similar values using Levenshtein distance and similarity threshold
- **partial** - Find duplicates using only first key column

## Similarity Threshold

The threshold determines what counts as a match in fuzzy mode:

- `1.0` (100%) - Exact match only
- `0.95` (95%) - Allow 1-2 character differences per field
- `0.90` (90%) - Allow 3-4 character differences per field
- `0.85` (85%) - More lenient matching

## Output Examples

### Summary (default)

```bash
Duplicate Detection Report
==========================

Total Rows: 10000
Unique Rows: 9850
Duplicate Groups: 75
Total Duplicates: 150

Duplicate Groups:
  Group: John||Smith, Records: 2, Match: 100%
  Group: John||Smyth, Records: 3, Match: 95%
  Group: Jane||Doe, Records: 2, Match: 100%
  ...
```

### JSON

```json
{
  "totalRows": 10000,
  "uniqueRows": 9850,
  "duplicateGroups": 75,
  "totalDuplicates": 150,
  "duplicates": [
    {
      "matchKey": "John||Smith",
      "matchPercentage": 100,
      "count": 2,
      "records": [
        {
          "rowNumber": 5,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        },
        {
          "rowNumber": 1250,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        }
      ]
    }
  ]
}
```

### CSV

```csv
Group,Rows,Similarity
"John||Smith",2,100%
"John||Smyth",3,95%
"Jane||Doe",2,100%
```

## Understanding Results

### Exact Matches

All values in key columns are identical. These are definite duplicates.

### Fuzzy Matches

Values are similar but not identical. The similarity percentage indicates how close they are.

### Partial Matches

Duplicates identified based on a subset of key columns.

## Examples

### Exact duplicate detection

```bash
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode exact
```

### Fuzzy duplicate detection with threshold

```bash
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates.json
```

### Exclude specific columns

```bash
hana-cli duplicateDetection --table PRODUCTS \
  --keyColumns "SKU" \
  --excludeColumns "CREATED_DATE,MODIFIED_DATE" \
  --limit 50000
```

### Partial matching

```bash
hana-cli duplicateDetection --table SUPPLIERS \
  --keyColumns "COMPANY_NAME" \
  --mode partial
```

## Handling Duplicates

After identifying duplicates, you can:

1. **Report Only** - Generate report and review manually
2. **Tag Records** - Add a flag/status column to mark duplicates
3. **Merge Records** - Combine duplicate records into one
4. **Delete Duplicates** - Remove duplicate entries (keep first occurrence)
5. **Review Process** - Use data steward process to determine action

Example workflow:

```bash
# Step 1: Identify fuzzy duplicates
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME,EMAIL" \
  --mode fuzzy --threshold 0.92 \
  --format json --output duplicates.json

# Step 2: Review and manually validate
# (Review duplicates.json and create merge/delete list)

# Step 3: Execute cleanup
# (Use data governance process or scripts)
```

## Return Codes

- `0` - Detection completed successfully
- `1` - Detection error or database connection issue

## Performance Tips

1. Use `exact` mode for better performance on large tables
2. Use `--limit` to test on a subset first
3. Specify key columns prudently
4. Use `--excludeColumns` to skip irrelevant columns
5. Increase `--threshold` for faster fuzzy matching

## Related Commands

- `dataValidator` - Validate data against business rules
- `dataProfile` - Generate statistical profiles

See the [Commands Reference](../all-commands.md) for other commands in this category.

## See Also

- [Category: Analysis Tools](..)
- [All Commands A-Z](../all-commands.md)
