Bug #1514

Remove key phrases that start with a preposition (IN/ADP POS tag)

Added by Nandini Bansal over 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Start date:
08/20/2021
Due date:
% Done:

100%

Estimated time:
3.00 h (Total: 6.00 h)

Description

There are some cases of key phrases such as:

1. "within regular expressions" linked to "regular expressions"
2. "before python" linked to "before python initialization"
3. "since unicode strings" linked to "unicode strings"
4. "as a context manager" linked to "connection as a context manager"

These key phrases read poorly and should not be present. What they all have in common is a first word with an ADP/IN POS tag.

Make the change in the pos_filter_cands method in BR3_IR3_tagger.py. Add a filter that discards any key phrase that starts with an ADP/IN word and is not exactly the same as its header variant. To check whether the key phrase and the header variant are equivalent, compare their stemmed forms. global_stem_dict gives access to pre-calculated stemmed forms of some key phrases and header variants.
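
A rough sketch of the filter described above, for discussion only. The real signature of pos_filter_cands is not shown in this ticket, so the function name drop_adp_led_candidates, the (key phrase, header variant, first-word tag) tuple layout, and the fallback_stem helper are all hypothetical. It also assumes global_stem_dict maps a phrase string to its stemmed string; adjust if the actual structure differs.

```python
# Hypothetical sketch: names and data layout are assumptions, not the
# actual code in BR3_IR3_tagger.py.

ADP_TAGS = {"ADP", "IN"}  # universal and Penn Treebank tags named in the ticket


def fallback_stem(phrase):
    """Placeholder stemmer used only when a phrase is missing from the
    pre-calculated dictionary: lowercases and drops one trailing 's'
    from words longer than three characters."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def drop_adp_led_candidates(cands, global_stem_dict, stem=fallback_stem):
    """Keep a candidate unless its first word is tagged ADP/IN and its
    stemmed form differs from the stemmed header variant.

    cands: iterable of (key_phrase, header_variant, first_word_pos_tag).
    global_stem_dict: assumed to map phrase -> pre-calculated stem.
    """
    kept = []
    for phrase, header, first_tag in cands:
        phrase_stem = global_stem_dict.get(phrase)
        if phrase_stem is None:
            phrase_stem = stem(phrase)
        header_stem = global_stem_dict.get(header)
        if header_stem is None:
            header_stem = stem(header)

        # Discard only when both conditions from the ticket hold.
        if first_tag in ADP_TAGS and phrase_stem != header_stem:
            continue
        kept.append((phrase, header, first_tag))
    return kept
```

With this sketch, a candidate such as ("within regular expressions", "regular expressions", "IN") would be discarded, while a phrase whose stemmed form matches its header variant is kept even if it starts with a preposition.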

Test the changes with the following books:
1. Python Whirlwind Tour.txt
2. Python Tutorial.txt
3. Python 3 - Library Reference.txt

URL: https://edutestdev-240612.appspot.com/document/python-whirlwind-tour/m?documentURL=10054%2Fds9aug1528%2FWhirlwindTourOfPython%2F14-strings-and-regular-expressions-Special-characters-can-match-character-groups-94.html
The above URL has "within regular expressions" tagged as a purple link (or PL).


Subtasks

Bug #1520: Instead of removing entire key phrase starting with IN pos tag for all cases, we can process the keyphrase and discard just the word (Closed, 08/20/2021, Rohit Choudhary)
