Feature #1593

Eliminate certain header variants generated from variations_in_common_section_words

Added by Nandini Bansal about 3 years ago. Updated about 3 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
Start date:
09/01/2021
Due date:
% Done:

0%

Estimated time:
4.00 h

Description

When a two-word header_var consists entirely of NOUNs/PROPNs and the generated tmp_variant is a single word that exists in the 20K common words list, do not save the header variant. These changes are to be made in the variations_in_common_section_words function of BR3_IR3_tagger.py.

This is an experimental change, and good results are not guaranteed. To evaluate its impact, we need to regenerate master_cands.pkl, run the entire annotation, save similar_docs.csv, and perform the KPI exercise on it. Based on the KPI stats and manual inspection of the annotated text files, we will decide whether to keep this change.

Datasets to test with:
1. Whirlwind Book
2. Library Reference
3. Tutorial Book

NOTE: When checking whether tmp_variant is present in the 20K common words (CW), check both the singular and plural forms of the KP.
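
The rule described above could be sketched roughly as follows. This is only an illustration, not the actual code in BR3_IR3_tagger.py: the function names should_drop_variant and number_forms, the (word, POS tag) input format, and the naive singular/plural guessing are all assumptions made for this example. The POS tags are assumed to be computed beforehand by whatever tagger the pipeline already uses.

```python
from typing import Sequence, Set, Tuple

# Universal POS tags treated as "nominal" for this rule.
NOMINAL_TAGS = {"NOUN", "PROPN"}


def number_forms(word: str) -> Set[str]:
    """Hypothetical helper: the word plus naive singular/plural guesses.

    It deliberately over-generates; a wrong guess is harmless because it
    simply will not match anything in the common-words set.
    """
    w = word.lower()
    forms = {w}
    # Singular guesses.
    if w.endswith("ies") and len(w) > 3:
        forms.add(w[:-3] + "y")
    elif w.endswith("es") and len(w) > 2:
        forms.update({w[:-2], w[:-1]})
    elif w.endswith("s") and len(w) > 1:
        forms.add(w[:-1])
    # Plural guesses.
    if w.endswith("y") and len(w) > 1 and w[-2] not in "aeiou":
        forms.add(w[:-1] + "ies")
    elif w.endswith(("s", "x", "z", "ch", "sh")):
        forms.add(w + "es")
    else:
        forms.add(w + "s")
    return forms


def should_drop_variant(
    header_var: Sequence[Tuple[str, str]],  # assumed format: (word, POS tag) pairs
    tmp_variant: str,
    common_words: Set[str],  # assumed: lowercase 20K common words
) -> bool:
    """Return True when the header variant should not be saved."""
    # Only two-word headers are affected.
    if len(header_var) != 2:
        return False
    # Both words must be NOUN or PROPN.
    if not all(tag in NOMINAL_TAGS for _, tag in header_var):
        return False
    # The generated variant must be a single word.
    if len(tmp_variant.split()) != 1:
        return False
    # Drop if the singular or plural form is a common word.
    return not number_forms(tmp_variant).isdisjoint(common_words)
```

With a toy common-words set {"list", "library"}, a header tagged [("list", "NOUN"), ("methods", "NOUN")] with tmp_variant "lists" would be dropped, because the singular form "list" is a common word. A production version would probably use a proper lemmatizer or inflection library instead of the suffix rules shown here.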

#1

Updated by Nandini Bansal about 3 years ago

  • Status changed from New to Rejected
