Bug #1616
Some unwanted removals of KPs starting with VBG/IN
0%
Description
As per Issues #1514 and #1520, we discard KPs that start with the POS tags "IN", "VBG" or "VB". However, I just noticed that along with bad ones like "preceding regular expressions" and "unforgiving regular expressions", we are also losing some good ones, such as:
1. defining a function -> defining and using functions
2. defining a function -> defining functions
3. importing standard library modules -> importing from python's standard library
4. while true loop -> while loops
The examples listed above are from the Whirlwind book. There are very likely other examples in the Whirlwind, C API and Tutorial books as well.
Part A: Identify more such cases if there are any.
Part B: Make changes to ensure that the above cases and similar ones are not skipped. We can use the information provided by the POS tags for this. A basic outline of the idea: check whether the first word of a header variant and the first word of a KP are the same (in their stemmed forms), then apply some filters based on NOUNs.
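
A rough sketch of how the Part B check could look, for discussion only. It assumes that in the examples above the left side is the KP and the right side is the header variant, and that KPs arrive already POS-tagged with Penn Treebank tags. The names `keep_kp`, `crude_stem` and `DISCARD_TAGS` are hypothetical, `crude_stem` is only a stand-in for whichever stemmer the pipeline already uses, and the NOUN filter shown (at least one noun after the first word) is just one possible choice.

```python
# Hypothetical sketch; none of these names come from the existing code base.
DISCARD_TAGS = {"IN", "VBG", "VB"}  # first-word tags discarded per #1514 / #1520


def crude_stem(word):
    """Very rough suffix stripper; a placeholder for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip if at least 3 characters remain, to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keep_kp(kp_tagged, header_variants):
    """Decide whether a KP should be kept.

    kp_tagged: list of (token, pos_tag) pairs for the KP.
    header_variants: list of header strings to compare against.
    """
    if not kp_tagged:
        return False
    first_word, first_tag = kp_tagged[0]
    # KPs that do not start with a discarded tag are unaffected by this rule.
    if first_tag not in DISCARD_TAGS:
        return True
    # Assumed NOUN filter: require at least one noun after the first word.
    if not any(tag.startswith("NN") for _, tag in kp_tagged[1:]):
        return False
    # Keep the KP only if some header variant starts with the same stemmed word.
    stem = crude_stem(first_word)
    for header in header_variants:
        words = header.split()
        if words and crude_stem(words[0]) == stem:
            return True
    return False
```

With hand-assigned tags, `keep_kp([("defining", "VBG"), ("a", "DT"), ("function", "NN")], ["defining and using functions"])` keeps the KP, while "preceding regular expressions" checked against a header such as "regular expressions" is still discarded because the stemmed first words differ.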