Regex (with lookaround) optimization
I am trying to pull entities out of text, and I have a simple mechanism (until I deploy an NLP solution) to avoid negation. E.g., I'd like to find
patient has history of cynicisimitis
but avoid
no history of cynicisimitis
and avoid
family history of cynicisimitis
To that end, I am using multiple lookbehinds, which make the regex look like this:
((?<!(?i)no.{1,25}|denies.{1,35}|family.{1,35}|father.{1,10}|mother.{1,10})(?-i)${stringtomatch})
I tried adding \b to the negative lookbehind, thinking it would reduce the number of entry points the processor has to try, but it made performance worse.
The problem: it appears to be performing badly.
What you can do:
- Use \b to avoid false matches (in particular for the word "no").
- Remove the useless (?-i): an inline modifier only applies to the group it is in.
- Factorize where possible to reduce the performance impact of .{m,n}.
You obtain:
(?<!(?i)\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\b)history of cynicisimitis\b
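To illustrate why the word boundary matters for "no", here is a small sketch in Python (chosen only for illustration; the post does not name a language). Python's re module only accepts fixed-width lookbehinds, so the "no" alternative is tested here as a forward pattern anchored at the end of the text that precedes the match. The sample sentence is made up.

```python
import re

# Made-up text that precedes "history of cynicisimitis" in a sentence.
prefix = "diagnosis confirms "

# Without a word boundary, the "no" inside "diagnosis" counts as a negation cue.
loose = re.compile(r"no.{1,25}\Z", re.IGNORECASE)
# With \b on both sides, only the standalone word "no" counts.
strict = re.compile(r"\bno\b.{1,25}\Z", re.IGNORECASE)

loose_hit = loose.search(prefix) is not None
strict_hit = strict.search(prefix) is not None
print(loose_hit, strict_hit)
```

The loose pattern rejects a sentence that contains no negation at all, which is the kind of false match the \b version avoids.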
What you can try:
Use lazy quantifiers instead of greedy quantifiers:
\bno\b.{1,25}?
Put the lookbehind after stringtomatch:
\bhistory of cynicisimitis\b(?<!(?i)\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\bhistory of cynicisimitis)
Use a basic string search (which is far faster than a regex) to find the offsets of stringtomatch, extract the substrings from offset-50 to offset+stringtomatch.length+1, and then test your pattern on those substrings.
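The string-search idea could look roughly like the sketch below. It is written in Python as an assumption, since the post does not name a language, and the names find_unnegated, NEGATION_BEFORE and window are hypothetical. Because Python's re module rejects variable-width lookbehinds, the sketch does not run the lookbehind on the whole substring; it tests only the 50 characters before each hit, with the negation cues rewritten as a forward pattern anchored at the end of that window.

```python
import re

# Negation cues from the post, anchored at the end of the window that
# precedes the hit (a stand-in for the negative lookbehind).
NEGATION_BEFORE = re.compile(
    r"\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\Z",
    re.IGNORECASE,
)


def find_unnegated(text, string_to_match, window=50):
    """Return the offsets of string_to_match not preceded by a negation cue."""
    offsets = []
    start = text.find(string_to_match)  # plain, case-sensitive string search
    while start != -1:
        # Only the short window before the hit is handed to the regex engine.
        before = text[max(0, start - window):start]
        if not NEGATION_BEFORE.search(before):
            offsets.append(start)
        start = text.find(string_to_match, start + 1)
    return offsets


sample = (
    "no history of cynicisimitis. "
    "patient has history of cynicisimitis. "
    "family history of cynicisimitis."
)
for offset in find_unnegated(sample, "history of cynicisimitis"):
    print(offset, repr(sample[max(0, offset - 12):offset]))
```

The regex engine only ever sees a window of at most 50 characters per hit, so its cost no longer depends on the length of the full text.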