Bug #1752
Updated by Nandini Bansal about 3 years ago
In partial_header_match, a filter runs after candidate generation and penalizes KPs that start and end with the same words as a header variant. Some KPs have been observed to be penalized unnecessarily.

We need to check the POS tags of the KPs with spaCy. When the word count of the header variant is less than the word count of the KP, the uncommon word between the header variant and the KP is "ADJ", and the rest of the words are "NOUN", reduce the penalty to 0.05.

For example:

1) tkinter standard dialog |%| -> tkinter dialogs
[('tkinter', 'NOUN'), ('standard', 'ADJ'), ('dialog', 'NOUN')]
[('tkinter', 'NOUN'), ('dialogs', 'NOUN')]
---------------
2) ascii lower-case character |%| -> ascii characters
[('ascii', 'NOUN'), ('lower', 'ADJ'), ('-', 'PUNCT'), ('case', 'NOUN'), ('character', 'NOUN')]
[('ascii', 'NOUN'), ('characters', 'NOUN')]
---------------
3) ascii white-space characters |%| -> ascii characters
[('ascii', 'NOUN'), ('white', 'ADJ'), ('-', 'PUNCT'), ('space', 'NOUN'), ('characters', 'NOUN')]
[('ascii', 'NOUN'), ('characters', 'NOUN')]
---------------
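
A rough sketch of how the check could look, working directly on (word, POS) pairs in the form shown in the examples. The function names, the `_norm` helper (a crude stand-in for lemmatization so that "dialog" matches "dialogs"), and the choice to ignore PUNCT tokens are all assumptions for illustration, not the existing partial_header_match code. It also assumes one reading of the rule: the extra words in the KP contain at least one ADJ, any other extra words are NOUN, and the shared words are NOUN.

```python
REDUCED_PENALTY = 0.05  # value taken from the issue description


def _norm(word):
    """Hypothetical helper: lowercase and drop a trailing 's' (crude lemma stand-in)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def is_adjective_insertion(kp_tags, header_tags):
    """Return True if the KP looks like the header variant plus an adjectival modifier.

    Both arguments are lists of (word, pos) pairs, e.g. as produced from spaCy tokens.
    """
    # Ignore punctuation such as the hyphen in "lower-case" (assumption).
    kp = [(w, p) for w, p in kp_tags if p != "PUNCT"]
    header = [(w, p) for w, p in header_tags if p != "PUNCT"]

    # The header variant must be shorter than the KP.
    if len(header) >= len(kp):
        return False

    header_words = {_norm(w) for w, _ in header}
    extra = [p for w, p in kp if _norm(w) not in header_words]
    shared = [p for w, p in kp if _norm(w) in header_words]

    return (
        bool(shared)
        and "ADJ" in extra                               # at least one adjective was added
        and all(p in ("ADJ", "NOUN") for p in extra)     # other extra words are nouns
        and all(p == "NOUN" for p in shared)             # shared words are nouns
    )


def adjust_penalty(kp_tags, header_tags, penalty):
    """Lower the penalty to REDUCED_PENALTY for qualifying KPs; otherwise keep it."""
    if is_adjective_insertion(kp_tags, header_tags):
        return min(penalty, REDUCED_PENALTY)
    return penalty
```

With real input, the (word, POS) pairs would come from spaCy tokens (`token.text`, `token.pos_`), and `token.lemma_` would be a better basis for matching than `_norm`.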