DNSDB Search with Compatible Regular Expressions
DNSDB Farsight Compatible Regular Expressions (FCRE) provides regular expression (regexp) functionality for searching Domain Name System (DNS) hostnames and rdata values in DNSDB. The system evaluates regexp searches against the DNS master file form of the hostnames and rdata values, which by design contains only printable ASCII characters. The system converts all non-printable characters, including octets outside the ASCII range, to escape sequences in the form \DDD (backslash followed by three decimal digits) per RFC 1035. This is only applicable to RData (RHS) queries.
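As a rough illustration of the \DDD convention described above, the following Python sketch renders raw octets the way this page says they appear in searchable strings. The function name to_master_file_text is hypothetical, and the sketch covers only the non-printable rule. It does not attempt any other escaping that full RFC 1035 master file form may apply.

```python
def to_master_file_text(raw: bytes) -> str:
    """Render octets as printable ASCII, using \\DDD for everything else.

    Assumption: "printable" means the ASCII range 0x20 (space) to 0x7E (~).
    """
    out = []
    for octet in raw:
        if 0x20 <= octet <= 0x7E:
            out.append(chr(octet))
        else:
            # Backslash followed by exactly three decimal digits.
            out.append("\\%03d" % octet)
    return "".join(out)


print(to_master_file_text(b"a\tb"))         # a\009b
print(to_master_file_text(b"caf\xc3\xa9"))  # caf\195\169
```

A regexp that needs to find such a sequence then matches the literal backslash and digits, not the original octet.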
For this limited use case, DNSDB FCRE provides a simplified subset of the Portable Operating System Interface (POSIX) Extended Regular Expression syntax, with the most notable restrictions being:
- Only printable characters are allowed in a regexp.
- Hexadecimal or octal escape sequences aren't allowed in a regexp.
- Only special characters may be escaped with '\'. Note that ']' and '}' aren't considered special characters, but '[' and '{' are.
- POSIX collating elements (e.g., [=ch=], [.a.]) in character classes aren't supported. The sequences [= and [. aren't allowed in character classes.
- As in POSIX regexps, the character '\' has no special meaning within a character class, so the class [\w] matches the characters '\' or 'w'.
- Capturing groups and backreferences aren't supported.
Note that the third restriction (only special characters may be escaped) means that Perl Compatible Regular Expressions (PCRE) extensions such as \w (word characters) and \d (digits) aren't allowed in FCRE regexps.
Regexp syntax
A regular expression is a string of printable characters. The following characters have special meaning:
- \ -- Escape the next character, which must be a special character. A regexp may not end with an unescaped backslash, or contain an unescaped backslash followed by a character other than another backslash or the special characters listed below, except inside of a character class.
- ^ -- Matches the beginning of the subject string.
- $ -- Matches the end of the subject string.
- [ -- Begin a character class.
- . -- A special character class matching any character.
- ( -- Begin a sub-pattern. Sub-patterns may occur within other sub-patterns.
- ) -- End a sub-pattern.
- | -- Specify an alternative match. A pattern or subpattern matches if the pattern before or after the '|' matches.
- * -- Match the previous character, character class, or subpattern zero or more times.
- ? -- Match the previous character, character class, or subpattern at most once.
- + -- Match the previous character, character class, or subpattern at least once.
- { -- If followed by a character other than a decimal digit, is treated as a literal '{' character. Such a '{' may be escaped with a backslash even though it isn't technically a special character in this context. If followed by a decimal digit, begins a bounded match specification. "{n}" matches exactly n repetitions of the previous character, character class, or subpattern. "{n,m}" with m >= n matches at least n but at most m repetitions.
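The escaping rule above can be checked mechanically before a query is submitted. The Python sketch below is not part of DNSDB; the names SPECIALS, find_bad_escapes, and _skip_class are hypothetical, and the scanner covers only the backslash rule (it is not a full FCRE validator). It skips over character classes, where the backslash is an ordinary character.

```python
# Characters this page lists as special outside a character class.
SPECIALS = set("\\^$[.()|*?+{")


def _skip_class(pattern: str, start: int) -> int:
    """Return the index just past the character class opening at `start`."""
    n = len(pattern)
    j = start + 1
    if j < n and pattern[j] == "^":
        j += 1
    if j < n and pattern[j] == "]":  # a leading ] is a literal member
        j += 1
    while j < n and pattern[j] != "]":
        if pattern.startswith("[:", j):  # named class such as [:digit:]
            end = pattern.find(":]", j + 2)
            if end == -1:
                return n
            j = end + 2
        else:
            j += 1
    return j + 1


def find_bad_escapes(pattern: str) -> list:
    """Return positions of backslashes that do not escape a special character."""
    bad = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern) or pattern[i + 1] not in SPECIALS:
                bad.append(i)
            i += 2
        elif ch == "[":
            i = _skip_class(pattern, i)
        else:
            i += 1
    return bad


print(find_bad_escapes(r"google\.com"))  # []
print(find_bad_escapes(r"\w+"))          # [0]
print(find_bad_escapes(r"[\w]"))         # []
```

An empty list means no escape violates the rule; it does not mean the pattern is otherwise valid.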
Character class syntax
A character class is a set of characters enclosed between an opening [ and a closing ]. Within the character class, the following characters have special handling:
- ^ -- If the first character after the opening [, denotes a negated character class, i.e. a class which matches any character not listed in the remainder of the class.
- ] -- If the first character after the opening [ or [^, encodes a literal ] as a member of the class. A ] after the first character after the opening [ or [^ ends the character class.
- - -- If the first character after the opening [ or [^ or the last character before the closing ], encodes a literal - as a member of the character class. If between two characters A and B, encodes the range of characters between A and B, inclusive, as members of the character class. The character A must occur before B in ASCII encoding.
The sequences [. and [= aren't allowed between the opening [ or [^ and the closing ], to prevent confusion with unsupported POSIX collation sequences and collation classes.
If the sequence [: appears in a character class, it must be the beginning of one of the following POSIX character classes:
- [:alnum:] -- Alphanumeric characters 0-9, A-Z, and a-z
- [:alpha:] -- Alphabetic characters A-Z, a-z
- [:blank:] -- Blank characters (space and tab)
  - Only printable characters occur in searchable strings and space is the only printable whitespace character, thus use of [:blank:] is equivalent to a space character.
  - Tabs in data appear as the escape sequence \009 and can be matched with the pattern \\009 (the backslash must itself be escaped).
- [:cntrl:] -- Control characters
  - Only printable characters occur in searchable strings, so [:cntrl:] won't match any characters.
  - Control characters in data will appear as escape sequences in the form \DDD (backslash followed by three digits). To match one of those, you need to escape the backslash with another backslash. Use the pattern \\[[:digit:]]{3} in a regular expression.
- [:digit:] -- Decimal digits 0-9
- [:graph:] -- Any printable character other than space.
  - Only printable characters occur in searchable strings, thus a character class containing [:graph:] is equivalent to [^ ] (negated character class containing only a space).
- [:lower:] -- Lower case alphabetic characters a-z
  - Hostnames will be folded to lower case, thus use of [:lower:] is equivalent to [:alpha:].
- [:print:] -- Any printable character
  - Only printable characters occur in searchable strings, so [:print:] matches any character.
- [:punct:] -- Punctuation characters (printable characters other than space and [:alnum:])
- [:space:] -- Any whitespace character (tab, newline, vertical tab, form feed, carriage return, and space)
  - The space character is the only printable whitespace character, thus use of [:space:] is equivalent to a space character.
  - Tabs in data appear as the escape sequence \009 and can be matched with the pattern \\009. The other whitespace characters can be matched the same way, using their decimal escape sequences.
- [:upper:] -- Upper case alphabetic characters A-Z
  - Since all of our data is indexed as lower-case, this isn't useful as it is equivalent to [:lower:].
- [:xdigit:] -- Hexadecimal digits 0-9, a-f, A-F
The above named character classes must appear inside an enclosing [ and ], e.g. [[:digit:][:punct:]] to match a digit or punctuation character. Without the enclosing brackets, [:digit:] is an ordinary character class that matches any one of the characters :, d, i, g, or t.
Neither the above character classes nor a character range may begin or end a character range. For example, the character class expressions [0-[:alpha:]] and [a-n-z] are invalid.
All other characters between the opening [ or [^ and the closing ] are added to the character class, including the backslash \ character.
There is no way to express a character class containing a single ^ character: an escaped \^ should be used instead of a character class.
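Python's re module does not understand POSIX named classes, so patterns written with them cannot be tried locally as-is. The sketch below is one hedged way to experiment: it rewrites each [:name:] token into an approximately equivalent Python range. The names POSIX_CLASSES and to_python_regex are hypothetical, the substitution is naive (it also rewrites a [:name:] that is not inside enclosing brackets), and Python's regexp dialect differs from FCRE in other respects, so treat the result as an approximation only.

```python
import re
import string

# Approximate Python equivalents for the named classes listed above.
POSIX_CLASSES = {
    "alnum": "0-9A-Za-z",
    "alpha": "A-Za-z",
    "blank": " \\t",
    "cntrl": "\\x00-\\x1f\\x7f",
    "digit": "0-9",
    "graph": "!-~",
    "lower": "a-z",
    "print": " -~",
    "punct": re.escape(string.punctuation),
    "space": " \\t\\n\\v\\f\\r",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}


def to_python_regex(fcre: str) -> str:
    """Replace [:name:] tokens with Python ranges; unknown names raise KeyError."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_CLASSES[m.group(1)], fcre)


# A digit-or-punctuation class, as in the example above.
print(bool(re.fullmatch(to_python_regex("[[:digit:][:punct:]]+"), "1.2-3")))  # True

# Matching an escaped control character such as a tab (\009) in the data.
print(to_python_regex(r"\\[[:digit:]]{3}"))  # \\[0-9]{3}
```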
Important notes
- Regular expression searches aren't case sensitive.
- Regular expression patterns are not "anchored" front and back by default. (This is a major difference from glob searches.)
- To exactly match a literal . (such as between labels in a DNS name), escape it with a backslash, as in the pattern google\.com (without the escape, the pattern google.com also matches 'googlexcom', 'google_com', etc.). Escaping isn't necessary if the . is inside a character class, for example foo[._-]bar.
- All rrnames (i.e. hostnames) in the DNS dataset end in a ., which must be accounted for in regular expressions.
- All well-formed rdata we currently index in the DNS dataset ends in a . or a ", which should be accounted for in regular expressions.
- Wildcard operators (., *, +, ?, and character classes like [a-z]) must have at least two non-wildcard characters immediately before or after them. Anchors (^ and $) count as non-wildcard characters for this purpose.
Valid wildcard patterns
- Pattern: ^vpn-[a-z]+\..*$
  - Valid: The $ anchor provides a second non-wildcard character after .*
- Pattern: ^vpn-[a-z]+\..{2,}
  - Valid: Explicit 2+ character match after the dot
- Pattern: example\.com.*
  - Valid: "com" provides 3 characters before the wildcard
- Pattern: ^www\..*\.com\.$
  - Valid: Anchors and literal characters satisfy the requirement
Invalid wildcard patterns
- Pattern: ^vpn-[a-z]+\..*
  - Invalid: Only one character (the escaped dot) before the final .* wildcard
- Pattern: x.*
  - Invalid: Only one character before the wildcard
- Pattern: .*y
  - Invalid: Only one character after the wildcard
- Pattern: .+example
  - Invalid: Only one character (the start anchor is implicit but not present)
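The notes on case, anchoring, escaped dots, and trailing dots can be tried out locally with Python's re module as a stand-in. This is only an approximation: re is not the FCRE engine, it does not enforce the wildcard rule above, and the helper name fcre_like_search is hypothetical.

```python
import re


def fcre_like_search(pattern: str, value: str) -> bool:
    """Unanchored, case-insensitive search, approximating the notes above."""
    return re.search(pattern, value, re.IGNORECASE) is not None


# Case is ignored, and an unanchored pattern may match anywhere in the name.
print(fcre_like_search(r"google\.com", "WWW.Google.com."))  # True

# An unescaped dot matches any character, so this hits 'googlexcom'.
print(fcre_like_search(r"google.com", "googlexcom.net."))   # True
print(fcre_like_search(r"google\.com", "googlexcom.net."))  # False

# Hostnames end in a dot, so an end anchor must account for it.
print(fcre_like_search(r"^www\.example\.com$", "www.example.com."))    # False
print(fcre_like_search(r"^www\.example\.com\.$", "www.example.com."))  # True
```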
Examples
The following examples show regular expression patterns and some of their matching values:
- Pattern: www\..*\.com
  - Matches: Hostnames containing "www." followed later by ".com" (the pattern is unanchored, so the match may start or end in the middle of a label)
  - Example results:
    - www.example.com.
    - dev-www.subdomain.example.com.
    - www.example.com.cdn.net.
    - stage-www.dev.community.org.
- Pattern: ^www\..*\.com$
  - Matches: Hostnames starting with "www." and ending in ".com"
  - Example results: No results
  - Note: Hostnames in the DNS dataset contain a trailing ".", which must be accounted for in regexps. This pattern is missing the trailing dot.
- Pattern: ^www\..*\.com\.$
  - Matches: Hostnames starting with "www." and ending in ".com."
  - Example results:
    - www.example.com.
    - www.subdomain.example.com.
- Pattern: ^www\.[^.]+\.com\.$
  - Matches: Hostnames starting with "www." and ending in ".com." with no other dots in between
  - Example results:
    - www.example.com.
    - www.other-domain.com.
- Pattern: ^((dev|stage)-)?www\.[^.]+\.(net|edu)\.$
  - Matches: Hostnames starting with "www" optionally preceded by a "dev-" or "stage-" prefix in a .net or .edu domain
  - Example results:
    - www.college.edu.
    - dev-www.isp.net.
- Pattern: ^"v=spf1 .* ~all"$
  - Matches: TXT records encoding an SPF policy with a ~all default
  - Example results:
    - "v=spf1 a mx ~all"
    - "v=spf1" " a " "10.2.0.0/16" " ~all"
- Pattern: (^|[-._])star([-_]?)z[-._]
  - Matches: Hostnames that start with "star", or have "star" as a label or otherwise separate from other letters/digits, followed by an optional dash or underscore, then a z, then a period, dash or underscore
  - Use case: Looking for a visibly embedded trademark
  - Example results:
    - star-z.at.
    - edge-star-z-mini-shv-02-mia3.goldmansachs.de.
    - starz.webex.com.
    - shooting-starz.tv.
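Several of the example patterns use only syntax that Python's re module shares with FCRE, so they can be sanity-checked offline before spending a query. The sketch below is an assumption-laden approximation, not DNSDB behavior: the names CHECKS and matches are hypothetical, and the hostnames marked as non-matches were chosen here for contrast and do not come from the dataset.

```python
import re


def matches(pattern: str, name: str) -> bool:
    """Unanchored, case-insensitive search as a local stand-in for FCRE."""
    return re.search(pattern, name, re.IGNORECASE) is not None


# (pattern, hostname, expected) triples.
CHECKS = [
    (r"www\..*\.com", "www.example.com.", True),
    (r"www\..*\.com", "stage-www.dev.community.org.", True),
    (r"^www\..*\.com\.$", "www.subdomain.example.com.", True),
    (r"^www\.[^.]+\.com\.$", "www.other-domain.com.", True),
    (r"^www\.[^.]+\.com\.$", "www.subdomain.example.com.", False),
    (r"(^|[-._])star([-_]?)z[-._]", "starz.webex.com.", True),
    (r"(^|[-._])star([-_]?)z[-._]", "shooting-starz.tv.", True),
    (r"(^|[-._])star([-_]?)z[-._]", "starzone.example.", False),
]

for pattern, name, expected in CHECKS:
    result = matches(pattern, name)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{status}: {pattern!r} vs {name!r} -> {result}")
```

A mismatch here points at a pattern problem worth fixing first; agreement does not guarantee the same results from DNSDB.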