Pattern Development Guide
This guide is for experienced contributors who want to create advanced patterns or improve existing ones. It covers best practices, advanced techniques, and common pitfalls to avoid.
Advanced Pattern Design
Understanding Capture Groups
Capture groups are crucial for extracting version information. The version_group
field specifies which capture group contains the version string. Make sure this is correctly set and tested.
Complex Pattern Structures
For complex software that reports versions in multiple ways, consider creating multiple patterns in a single file:
{
"vendor": "Example Vendor",
"product": "Example Product",
"all_versions": [
{
"name": "Header Pattern",
"pattern": "Server: Example/([\\d.]+)",
"version_group": 1,
"priority": 150,
"confidence": 0.9
},
{
"name": "Body Pattern",
"pattern": "Version: ([\\d.]+)",
"version_group": 1,
"priority": 100,
"confidence": 0.8
}
]
}
Version-Specific Patterns
For software with different version reporting mechanisms, use the versions
field to organize patterns by version range:
{
"vendor": "Example Vendor",
"product": "Example Product",
"versions": {
"2.x": [
{
"name": "Version 2 Pattern",
"pattern": "Example v2\\.([\\d]+)",
"version_group": 1,
"priority": 180,
"confidence": 0.95
}
],
"3.x": [
{
"name": "Version 3 Pattern",
"pattern": "Example v3\\.([\\d]+)",
"version_group": 1,
"priority": 180,
"confidence": 0.95
}
]
}
}
Priority and Confidence Scoring
Priority Guidelines
- 180-200: Highly specific patterns with minimal false positives
- 150-179: Specific patterns with good reliability
- 100-149: General patterns that work across versions
- 50-99: Broad patterns with potential for false positives
- 0-49: Experimental or low-confidence patterns
Confidence Guidelines
- 0.9-1.0: Extremely reliable with comprehensive test cases
- 0.8-0.89: Very reliable with good test coverage
- 0.7-0.79: Reliable with adequate test coverage
- 0.5-0.69: Moderately reliable, needs more testing
- 0.0-0.49: Low confidence, experimental
Advanced Regex Techniques
Non-Capturing Groups
Use non-capturing groups (?:...)
for grouping without affecting version_group:
"pattern": "Server: (?:Apache|httpd)/([\\d.]+)"
Lookaheads and Lookbehinds
Use lookaheads (?=...)
and lookbehinds (?<=...)
for context-sensitive matching:
"pattern": "(?<=Server: )Apache/([\\d.]+)"
Escaping Special Characters
Remember to escape special regex characters in JSON strings:
// Correct - escaped backslashes
"pattern": "Server: Apache/([\\d.]+)"
// Incorrect - unescaped backslashes
"pattern": "Server: Apache/([\d.]+)"
Test Case Best Practices
Comprehensive Coverage
Include test cases for:
- Positive matches with expected versions
- Negative matches that should not match
- Edge cases and unusual version formats
- Different software deployment scenarios
Example Test Cases
"test_cases": [
{
"input": "Server: Apache/2.4.41 (Ubuntu)",
"expected_version": "2.4.41"
},
{
"input": "Server: nginx/1.18.0",
"expected_version": "1.18.0"
},
{
"input": "Server: IIS/10.0",
"expected_version": "10.0"
},
{
"input": "Server: Apache", // No version
"expected_version": null
}
]
Performance Considerations
Avoid Catastrophic Backtracking
Write efficient regex patterns to avoid performance issues:
- Use specific patterns instead of generic ones
- Avoid nested quantifiers when possible
- Use atomic groups
(?>...)
for performance-critical patterns - Test patterns against large datasets to ensure performance
Validation and Testing
Using the Validation Tool
Always validate your patterns with our tool:
python tools/validate-new-pattern.py patterns/by-vendor/vendor/product.json
Manual Testing
Test your patterns manually with the pattern matcher:
python tools/pattern-matcher.py "Server: Apache/2.4.41" vendor product
Common Pitfalls to Avoid
False Positives
- Make patterns specific enough to avoid matching unrelated software
- Test against diverse datasets to identify potential false positives
- Use context clues to increase specificity
Version Extraction Issues
- Ensure version_group points to the correct capture group
- Handle different version formats (semver, date-based, etc.)
- Consider partial version strings vs. full versions