Task #1726

Handling cases of bad header variants like "representation"

Added by Nandini Bansal about 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version:
Start date: 10/13/2021
Due date:
% Done: 0%
Estimated time: 5.00 h (Total: 8.50 h)

Description

In BR3_IR3_tagger.py, we have a function called variations_in_common_section_words that strips all the common words (extracted from the dataset) from the beginning and end of the header variant to generate new header variants. While most of the header variants generated are good, there are some bad cases like "representation" which do not necessarily result in good KPs. We need to eliminate these cases.
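For reference, a minimal sketch of the kind of stripping described above. This is not the actual code from BR3_IR3_tagger.py: the function name strip_common_edge_words, the lowercasing, and the toy common-word set are all assumptions made for illustration.

```python
def strip_common_edge_words(header, common_words):
    """Drop common words from the start and end of a header.

    Hypothetical helper: words in the middle of the header are kept
    even if they are common, only the edges are trimmed.
    """
    words = header.lower().split()
    start, end = 0, len(words)
    # Trim common words from the beginning.
    while start < end and words[start] in common_words:
        start += 1
    # Trim common words from the end.
    while end > start and words[end - 1] in common_words:
        end -= 1
    return " ".join(words[start:end])


# Toy common-word set, not the real list extracted from the dataset.
toy_common_words = {"introduction", "to", "data"}
variant = strip_common_edge_words("Introduction to data representation", toy_common_words)
print(variant)  # representation
```

With this toy input the header collapses to the single generic word "representation", which is the kind of bad variant this task is about.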

To do so, we can add a CW (common words) filter of 4K words. If a generated header variant is a single word and it appears in the 4K CW list, we save it with ignore_flag = True.
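A minimal sketch of the proposed filter, assuming the 4K CW list is loaded as a Python set of lowercase words. The function name flag_header_variant and the dictionary record shape are hypothetical; only ignore_flag comes from this task.

```python
def flag_header_variant(variant, cw_list):
    """Return a record for the variant, with ignore_flag set when the
    variant is a single word that appears in the common-words list.

    Hypothetical helper: the record layout is an assumption.
    """
    words = variant.strip().lower().split()
    ignore = len(words) == 1 and words[0] in cw_list
    return {"variant": variant, "ignore_flag": ignore}


# Toy stand-in for the 4K common-words list.
toy_cw_list = {"representation", "section", "example"}

for candidate in ["representation", "binary representation", "tokenizer"]:
    print(flag_header_variant(candidate, toy_cw_list))
```

Only the single-word variant found in the list is flagged. Multi-word variants and single words outside the list are kept as they are.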

First, let us look at the header variants that this filter would remove for the Library Reference book and check whether it is okay for us to lose them. If the results look good, we can apply the filter and test the changes on the full book.


Subtasks

Bug #1743: Checking singular and plural forms of the tmp_var from variations_in_common_section_words in common words list (Resolved, 10/13/2021)

Bug #1744: Calculating the fullness_ratio of the header variants to decide a threshold for removal of header variants (Resolved, 10/13/2021)
