Feature #1856

Penalise the multi-word KP based on "Past" tense & vector similarity score

Added by Nandini Bansal about 3 years ago. Updated almost 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version:
Start date: 11/03/2021
Due date:
% Done: 0%
Estimated time: 2.50 h

Description

In filter_by_past_tense, there is a condition that checks whether the KP is multi-word. If it is, we skip the past-tense check and simply keep the KP in the final list of KPs. We don't want to do that anymore.

Cases like "contained objects" matching "container objects" in the C-API book are passing through because of the above condition. Let us change the function so that, if the KP is multi-word, we do the following:
1) Check whether the KP contains a past-tense word. Let's call it A.
2) If it does, find A's stemmed equivalent in the header variant. Let's call it B.
3) Find the vector similarity score vec_sim between A and B.
4) If vec_sim < 0.374, apply an additional penalty of 0.15 to the KP.
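
The four steps above can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: is_past_tense, stem, cosine and past_tense_penalty are hypothetical helper names, the suffix-based tense check and stemmer are crude stand-ins for whatever tagger and stemmer the pipeline actually uses, and vectors is assumed to be a plain word-to-embedding mapping. Only the 0.374 threshold and the 0.15 penalty come from this ticket.

```python
import math

SIM_THRESHOLD = 0.374  # from the ticket
EXTRA_PENALTY = 0.15   # from the ticket


def is_past_tense(word):
    # Assumption: crude stand-in for a real POS tagger.
    return word.lower().endswith("ed")


def stem(word):
    # Assumption: crude stand-in for a real stemmer; strips one common suffix.
    w = word.lower()
    for suffix in ("ed", "er", "ing", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def past_tense_penalty(kp, header, vectors):
    """Return the extra penalty for a multi-word KP matched against a header."""
    kp_words = kp.split()
    if len(kp_words) < 2:
        return 0.0  # single-word KPs are left to the existing past-tense check

    # Index header words by stem so step 2 is a dictionary lookup.
    header_by_stem = {stem(w): w.lower() for w in header.split()}

    for a in kp_words:
        if not is_past_tense(a):              # step 1: find A
            continue
        b = header_by_stem.get(stem(a))       # step 2: find B
        if b is None or a.lower() not in vectors or b not in vectors:
            continue                          # no counterpart or no embedding
        vec_sim = cosine(vectors[a.lower()], vectors[b])  # step 3
        if vec_sim < SIM_THRESHOLD:           # step 4
            return EXTRA_PENALTY
    return 0.0
```

With toy embeddings in which "contained" and "container" point in unrelated directions, past_tense_penalty("contained objects", "container objects", vectors) returns the 0.15 penalty; if the two embeddings are close, it returns 0.0. How the penalty is combined with the existing KP score is left to the caller, since the ticket does not specify it.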

This will help us reduce the scores of KPs like "named attributes", "contained objects", and "based class", which are actually quite horrible.

Test this change with the C-API book and Lib Ref book.
