Regex (with lookaround) optimization


I am trying to pull entities out of text, and I have a simple mechanism (until I deploy an NLP solution) to avoid negation. For example, I'd like to find:

patient has history of cynicisimitis

but avoid

no history of cynicisimitis

and avoid

family history of cynicisimitis

To that end, I am using multiple lookbehinds, which makes the regex look like this:

((?<!(?i)no.{1,25}|denies.{1,35}|family.{1,35}|father.{1,10}|mother.{1,10})(?-i)${stringtomatch}) 

I tried adding \b to the negative lookbehind, thinking it would reduce the number of entry points the regex processor has to try, but it made performance worse.

The problem: it appears to be performing badly.

What you can do:

  • use \b to avoid false matches (in particular with the word "no")
  • remove the useless (?-i) (an inline modifier only applies to the group it is in)
  • factor out common parts where possible to reduce the performance impact of .{m,n}

You obtain:

(?<!(?i)\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\b)history of cynicisimitis\b

What you can try:

  • using lazy quantifiers instead of greedy quantifiers: \bno\b.{1,25}?

  • putting the lookbehind after stringtomatch:

    \bhistory of cynicisimitis\b(?<!(?i)\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\bhistory of cynicisimitis)

  • using a basic string search (which is far faster than a regex) to find the offsets of stringtomatch, extracting the substrings from offset-50 to offset+stringtomatch.length+1, and then testing the pattern on those substrings.
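
The last suggestion can be sketched roughly as follows. The question does not say which language is in use, so this is only an illustration in Python, and the names `NEGATION`, `find_unnegated` and `window` are made up for the example. Python's built-in `re` module does not accept variable-length lookbehinds, so instead of a lookbehind the sketch takes the 50 characters before each hit and checks whether a negation cue runs right up to the end of that prefix (`\Z`). The window of 50 is assumed to be enough because the longest cue plus its gap is 41 characters.

```python
import re

# Negation cues taken from the factored pattern above, each followed by its
# maximum gap. \Z anchors the cue so that it must reach the end of the prefix,
# i.e. end right where the target string begins.
NEGATION = re.compile(
    r"\b(?:no\b.{1,25}|(?:denies|family)\b.{1,35}|(?:fa|mo)ther\b.{1,10})\Z",
    re.IGNORECASE,
)


def find_unnegated(text, target, window=50):
    """Return the start offsets of `target` not preceded by a negation cue.

    `window` is the number of characters inspected before each hit
    (hypothetical parameter, mirroring the offset-50 idea).
    """
    hits = []
    start = text.find(target)  # plain, case-sensitive substring search
    while start != -1:
        prefix = text[max(0, start - window):start]
        if not NEGATION.search(prefix):
            hits.append(start)
        start = text.find(target, start + 1)
    return hits


if __name__ == "__main__":
    phrase = "history of cynicisimitis"
    for sentence in (
        "patient has history of cynicisimitis",
        "no history of cynicisimitis",
        "family history of cynicisimitis",
    ):
        print(sentence, "->", find_unnegated(sentence, phrase))
```

The regex engine now only runs on a short prefix for each literal hit rather than at every position of the whole text, which is where the saving is expected to come from. Whether it is actually faster for your data should be measured.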

