Task #1726
Handling cases of bad header variants like "representation"
0%
Description
In BR3_IR3_tagger.py, the function variations_in_common_section_words generates new header variants by stripping the common words (extracted from the dataset) from the beginning and end of a header variant. Most of the generated variants are good, but some, such as "representation", are bad cases that do not necessarily result in good KPs. We need to eliminate these cases.
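For illustration only, here is a minimal sketch of how stripping common words from both ends can leave a bare single-word variant. It is not the actual code of variations_in_common_section_words; the helper name strip_common_words and the sample word set are assumptions made up for this example.

```python
def strip_common_words(header, common_words):
    """Drop common words from the start and end of a header (hypothetical helper)."""
    tokens = header.lower().split()
    start, end = 0, len(tokens)
    # Advance past leading common words.
    while start < end and tokens[start] in common_words:
        start += 1
    # Back up over trailing common words.
    while end > start and tokens[end - 1] in common_words:
        end -= 1
    return " ".join(tokens[start:end])


# Assumed sample of common section words, not the real dataset-derived list.
SAMPLE_COMMON_WORDS = {"overview", "of", "the", "introduction", "to"}

variant = strip_common_words("Overview of the representation", SAMPLE_COMMON_WORDS)
print(variant)
```

With this sample set, the header "Overview of the representation" is reduced to the single word "representation", which is the kind of variant this task wants to catch.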
To do so, we can add a common-word (CW) filter of 4K words. If a generated header variant is a single word and that word is in the 4K CW list, we save it with ignore_flag = True.
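A rough sketch of the proposed rule, assuming the 4K CW list is available as a Python set of lowercase words. The function name flag_header_variant and the dictionary shape are hypothetical; only ignore_flag comes from this task.

```python
def flag_header_variant(variant, cw_list):
    """Return the variant with ignore_flag set per the single-word CW rule (sketch)."""
    words = variant.split()
    # Flag only single-word variants that appear in the CW list.
    ignore = len(words) == 1 and words[0].lower() in cw_list
    return {"variant": variant, "ignore_flag": ignore}


# Tiny stand-in for the 4K CW list (assumption for the example).
SAMPLE_CW_LIST = {"representation", "example", "summary"}

print(flag_header_variant("representation", SAMPLE_CW_LIST))
print(flag_header_variant("data representation", SAMPLE_CW_LIST))
```

Multi-word variants are never flagged by this rule, even if every word in them is common.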
First, let us look at the header variants that this filter would eliminate for the Library Reference book and analyze whether it is okay for us to lose them. If losing them looks acceptable, we can remove them and test the changes on the full book.
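One possible way to do this review is a dry run that only lists the variants the filter would catch, without changing anything. This is a sketch under assumptions: the function name variants_filter_would_drop is hypothetical, and the inputs are taken to be a plain list of variant strings and a set of lowercase CW words.

```python
def variants_filter_would_drop(variants, cw_list):
    """List the unique single-word variants found in the CW list (dry run, sketch)."""
    dropped = {
        v for v in variants
        if len(v.split()) == 1 and v.lower() in cw_list
    }
    # Sorted output makes manual review easier.
    return sorted(dropped)


# Made-up sample data for illustration, not taken from the Library Reference book.
sample_variants = ["representation", "file objects", "summary", "representation"]
sample_cw_list = {"representation", "summary"}

for v in variants_filter_would_drop(sample_variants, sample_cw_list):
    print(v)
```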
Updated by Nandini Bansal about 3 years ago
- Estimated time changed from 5.00 h to 48.00 h
Estimated time increased because we are stuck on some cases that are difficult to handle.