Bug #1514
Remove key phrases that start with a preposition (ADP/IN POS tag)
100%
Description
There are some cases of key phrases such as:
1. "within regular expressions" linked to "regular expressions"
2. "before python" linked to "before python initialization"
3. "since unicode strings" linked to "unicode strings"
4. "as a context manager" linked to "connection as a context manager"
These key phrases look bad and should not be present. What they have in common is that the first word carries the ADP/IN POS tag.
The change belongs in the pos_filter_cands method in BR3_IR3_tagger.py. Add a filter that discards any key phrase that starts with an ADP/IN word and is not exactly the same as its header variant. To check whether the key phrase and the header variant are equivalent, compare their stemmed forms. Use global_stem_dict to access the pre-calculated stemmed forms of some key phrases and header variants.
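A minimal sketch of the filter described above, for discussion only. The actual signature of pos_filter_cands and the shape of global_stem_dict are not shown in this ticket, so the helper names (stem_of, keep_candidate, filter_adp_candidates), the (key_phrase, first_word_tag, header_variant) tuple layout, the assumption that the stem dictionary maps a phrase to its stemmed string, and the lowercase fallback for phrases missing from it are all hypothetical.

```python
# Hypothetical sketch; not the actual code in BR3_IR3_tagger.py.
# Assumes the stem dictionary maps a phrase to its pre-calculated stemmed string.

ADP_TAGS = {"ADP", "IN"}  # universal and Penn Treebank tags for prepositions


def stem_of(phrase, stem_dict):
    """Return the pre-calculated stem, or a lowercased fallback when it is missing."""
    return stem_dict.get(phrase, phrase.lower().strip())


def keep_candidate(key_phrase, first_word_tag, header_variant, stem_dict):
    """Keep a key phrase unless it starts with ADP/IN and differs from its header variant."""
    if first_word_tag not in ADP_TAGS:
        return True
    # A phrase starting with a preposition survives only if it is
    # equivalent (by stemmed form) to the header variant it links to.
    return stem_of(key_phrase, stem_dict) == stem_of(header_variant, stem_dict)


def filter_adp_candidates(cands, stem_dict):
    """cands: iterable of (key_phrase, first_word_tag, header_variant) tuples."""
    return [c for c in cands if keep_candidate(c[0], c[1], c[2], stem_dict)]


# Example data modelled on the cases listed in this ticket (stems are illustrative).
example_stems = {
    "within regular expressions": "within regular express",
    "regular expressions": "regular express",
}
example_cands = [
    ("within regular expressions", "IN", "regular expressions"),
    ("regular expressions", "JJ", "regular expressions"),
    ("as a context manager", "IN", "as a context manager"),
]
kept = filter_adp_candidates(example_cands, example_stems)
```

With this example data, "within regular expressions" is discarded because it starts with an IN word and its stem differs from that of "regular expressions", while a phrase that starts with a preposition but matches its header variant is kept.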
Test the changes with the following books:
1. Python Whirlwind Tour.txt
2. Python Tutorial.txt
3. Python 3 - Library Reference.txt
URL: https://edutestdev-240612.appspot.com/document/python-whirlwind-tour/m?documentURL=10054%2Fds9aug1528%2FWhirlwindTourOfPython%2F14-strings-and-regular-expressions-Special-characters-can-match-character-groups-94.html
The above URL has "within regular expressions" tagged as a purple link (or PL).
Subtasks