How to Build a Relation Extractor to Extract Specific Relations from a Given Corpus in Python

extractor, python, NLP
  1. Objective

In this task, you will build a relation extractor to extract specific relations from a given corpus. Note that it will take quite some time to complete this project even if you are familiar with various Python libraries and have solid Python programming experience. Therefore, we strongly recommend that you start working on this homework as early as possible.

 

  2. Background

Text documents often contain valuable structured data that is hidden in regular English sentences. To extract it, we usually take two steps. The first step is to perform Named Entity Recognition (NER) to identify entity mentions and classify them into the correct types (PERSON, ORG, DATE, GPE, etc.). The second step is to perform relation extraction, exploiting the NER results.

For example, in the sentence in Figure 1, there are 6 entities: 3 PERSON entities, 1 DATE entity, and 2 GPE entities. There are 5 different relations that hold between those 6 entities. In this homework, the NER results are given, and you will focus on the relation extraction step.

Relations are usually represented as triplets of (subject, predicate, object). For example, the above 5 relations can be represented as:

("Henry Hansen", DateOfBirth, "April 28, 1907")

("Henry Hansen", PlaceOfBirth, "La Crosse")

("Henry Hansen", PlaceOfBirth, "Wisconsin")

("Henry Hansen", HasParent, "Andrew")

("Henry Hansen", HasParent, "Emma Petersen Hansen")
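
In Python, such triplets can be kept as plain tuples. The minimal sketch below simply lists the five relations above (the assignment code later wraps each triplet in a small Relation class):

relations = [
    ("Henry Hansen", "DateOfBirth", "April 28, 1907"),
    ("Henry Hansen", "PlaceOfBirth", "La Crosse"),
    ("Henry Hansen", "PlaceOfBirth", "Wisconsin"),
    ("Henry Hansen", "HasParent", "Andrew"),
    ("Henry Hansen", "HasParent", "Emma Petersen Hansen"),
]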

 

  3. Overview of the Tasks

You need to implement a program in Python 3 that extracts two specific relations from an input file consisting of sentences and NER results. The two relations are: DateOfBirth and HasParent.

Solution:

  • A detailed description of our approach, why we chose it, and how we formulated the task.

1. I configured the Stanford NLP model, which already includes an NLP dictionary.

2. The configuration setup is in the CONFIG file.

3. I modified the extractor.py file and pass the training.json file as a parameter to run.py, which calls extraction_relations; that function in turn calls the extractor.py methods extract_date_of_birth and extract_has_parent.

4. The method that calculates precision and the F1 score, calculate_f1_score, is in run.py.

5. F1 = 2 * (precision * recall) / (precision + recall) (a small worked example follows the JSON sample below).

6. The training.json file format is as shown below; the values differ from record to record.

 

[
  {
    "sentence_id": "TR.00001",
    "sentence": {
      "text": "Bill was born 1986.",
      "annotation": {
        "1": ["Bill", "bill", "NNP", "B-PERSON"],
        "2": ["was", "be", "VBD", "O"],
        "3": ["born", "bear", "VBN", "O"],
        "4": ["1986", "BIL", "CD", "B-DATE"],
        "5": [".", ".", ".", "O"]
      },
      "relations": [
        {
          "subject": "bill",
          "predicate": "DateOfBirth",
          "object": "1986"
        }
      ]
    }
  }
]
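
As a quick check of the F1 formula in point 5, here is a minimal sketch that uses hypothetical counts (they are for illustration only and are not taken from the training data):

# Hypothetical counts, for illustration only.
number_of_extracted = 8   # relations the extractor produced
number_of_ground = 10     # relations in the ground truth
number_of_correct = 6     # extracted relations that match the ground truth

precision = number_of_correct / number_of_extracted   # 0.75
recall = number_of_correct / number_of_ground         # 0.6
f1 = 2 * (precision * recall) / (precision + recall)  # ~0.67

print(precision, recall, f1)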

 

 

  • A detailed description of the patterns used in your extractor. How did you discover them?

1. A Named Entity Recognition pattern is used that gives the relation between the subject, the predicate, and the object of a sentence:

Relation("Subject", predicate, "Object")

2. A name or place is not present in the dictionary, so it is treated as a noun; we later use it as the subject, the predicate is HasParent or DateOfBirth, and the DATE entity is the object.

3. I used the standard Stanford library, where a model relevant to natural language processing and Named Entity Recognition is already available.

4. The extractor.py file uses the above description as a pattern and extracts the DateOfBirth and HasParent relations from a sentence.

5. Each NER span is assigned a type tag; if entity 1 is not the same entity as entity 2, the pair is emitted as a candidate relation (see the sketch after this list).
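
The sketch below illustrates that pairing idea on the sample sentence. It assumes the entities have already been grouped into (type, text) pairs, as in the training.json sample above; it is only an illustration of the pattern, not the actual extractor.py code.

# Hypothetical entity list mirroring the sample sentence "Bill was born 1986."
entities = [("PERSON", "Bill"), ("DATE", "1986")]

candidates = []
for subj_type, subj_text in entities:
    for obj_type, obj_text in entities:
        # Keep pairs of two different entities whose types match the
        # DateOfBirth pattern: a PERSON subject and a DATE object.
        if (subj_type, subj_text) != (obj_type, obj_text) \
                and subj_type == "PERSON" and obj_type == "DATE":
            candidates.append((subj_text, "DateOfBirth", obj_text))

print(candidates)  # [('Bill', 'DateOfBirth', '1986')]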

 

  • How do you experiment and improve your extractor?

Running experiments on many sentences trains the program and also helps build a model that is later used for NLP and NER.

We can improve performance by adding more sentences and training the program on more data: the larger the training set, the better the accuracy and precision, which ultimately leads to a better score.

Code:

run.py

import argparse
import os
import json
import sys
import pprint

def load_data(filename):
    """
    This function reads a json file and checks whether it is a valid
    input file for the 6714 16s2 project.
    :param filename:
    :return: A list of sentence records.
    """
    if not os.path.exists(filename):
        return None
    try:
        json_data = open(filename).read()
        data = json.loads(json_data)
        return data
    except IOError:
        print('Oops! {0} cannot be opened.'.format(filename))
    except:
        print("Unexpected error:", sys.exc_info()[0])
        raise

def extraction_relations(record):
    """
    This function reads a sentence record and then calls the extraction functions in the
    extractor module to extract relations.
    :param record: The sentence record.
    :return: A list of relations.
    """
    results = []

    from extractor import extract_date_of_birth, extract_has_parent

    # Debug output: show a few annotation entries of the current sentence.
    print(record["sentence"]["annotation"]["1"])
    print(record["sentence"]["annotation"]["3"])
    print(record["sentence"]["annotation"]["4"])
    # Extract HasParent relations.
    results.extend(extract_has_parent(record["sentence"]))
    # Extract DateOfBirth relations.
    results.extend(extract_date_of_birth(record["sentence"]))
    return results

def calculate_f1_score(data, results, predicate="DateOfBirth"):
    """
    Calculate the F1 score for the given relations.
    Returns:
        Precision, Recall, F1.
    """
    # print(data[0]["sentence"]["relations"])

    
    # Initialize Counters.
    number_of_correct = 0
    number_of_incorrect = 0
    number_of_ground = 0
    number_of_extracted = 0
    number_of_missing = 0

    
    for i in range(len(data)):

        # Get the ground truth relations of the given predicate type.
        ground_relations = [(r["subject"].lower().replace(' ', ''),
                             r["predicate"].lower(),
                             r["object"].lower().replace(' ', ''))
                            for r in data[i]["sentence"]["relations"]
                            if r["predicate"].lower() == predicate.lower()]

        # Get the extracted relations of the given predicate type.
        extracted_relations = [(r.subject.lower().replace(' ', ''),
                                r.predicate.lower(),
                                r.object.lower().replace(' ', ''))
                               for r in results[i]
                               if r.predicate.lower() == predicate.lower()]

        # Perform set operations.
        ground_relations_set = set([x[0] + x[1] + x[2] for x in ground_relations])
        extracted_relations_set = set([x[0] + x[1] + x[2] for x in extracted_relations])
        correct_relations_set = ground_relations_set.intersection(extracted_relations_set)
        incorrect_relations_set = extracted_relations_set - ground_relations_set
        missing_relations_set = ground_relations_set - extracted_relations_set

        # Modify counters.
        number_of_extracted += len(extracted_relations_set)
        number_of_ground += len(ground_relations_set)
        number_of_correct += len(correct_relations_set)
        number_of_incorrect += len(incorrect_relations_set)
        number_of_missing += len(missing_relations_set)

    # Calculate the precision, recall and f1.
    precision = number_of_correct * 1.0 / number_of_extracted if number_of_extracted > 0 else 0
    recall = (number_of_ground - number_of_missing) * 1.0 / number_of_ground if number_of_ground > 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if precision + recall > 0 else 0
    
    return precision, recall, f1

def run(filename):
    """
    Load the input file, extract relations for every record, and score them.
    Returns:
        scores: A dictionary containing Precision, Recall, F1 for each predicate.
        allresults: A list of all extracted results.
    """
    allresults = []
    data = load_data(filename)
    # data = data[:100]
    for rcd in data:
        try:
            results = extraction_relations(rcd)
            allresults.append(results)
        except:
            # Ignore all exceptions and continue.
            allresults.append([])

    scores = {}
    p, r, f1 = calculate_f1_score(data, allresults, 'HasParent')
    scores['HasParent'] = {"Precision": p, "Recall": r, "F1": f1}

    p, r, f1 = calculate_f1_score(data, allresults, 'DateOfBirth')
    scores['DateOfBirth'] = {"Precision": p, "Recall": r, "F1": f1}

    return scores, allresults

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", help="increase output verbosity",
                        action="store_true")
    parser.add_argument("input", type=str,
                        help="The input json file location.")
    args = parser.parse_args()
    filename = args.input
    scores, _ = run(filename)

    with open('score_output.json', 'w') as outfile:
        json.dump(scores, outfile, indent=4)

    # Print to the screen.
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(scores)
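
A quick usage sketch, assuming run.py, extractor.py, and relation.py sit in one directory together with training.json (use whatever your input file is actually called):

from run import run

scores, allresults = run("training.json")
print(scores["DateOfBirth"]["F1"], scores["HasParent"]["F1"])

From the command line the equivalent is python3 run.py training.json, which also writes the scores to score_output.json.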

 

relation.py

class Relation:
    """
    The data class that holds one relation.
    """
    def __init__(self, subject, predicate, object):
        self._subject = subject
        self._predicate = predicate
        self._object = object

    @property
    def subject(self):
        return self._subject

    @property
    def predicate(self):
        return self._predicate

    @property
    def object(self):
        return self._object

    def __repr__(self):
        return '<' + self._subject + ' --' + self._predicate + '--> ' + self._object + '>'
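
A quick usage sketch of the class, using the sample values from training.json:

from relation import Relation

rel = Relation("Bill", "DateOfBirth", "1986")
print(rel.subject, rel.predicate, rel.object)  # Bill DateOfBirth 1986
print(rel)                                     # <Bill --DateOfBirth--> 1986>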

extractor.py

from relation import Relation
import ast

def iob_2_ner_span(ioblist):
    """
    Convert a list of [token, IOB-tag] pairs into entity spans of the form
    (start, end, type, text).
    """
    type = ""
    start = 0
    span = []
    for i in range(len(ioblist)):
        [tok, tag] = ioblist[i]
        if tag.startswith("B"):
            # A new entity starts; close the previous one, if any.
            if type != "":
                span.append((start, i, type, ' '.join([x for [x, y] in ioblist[start:i]])))
            type = tag[2:]
            start = i
        elif tag == "O":
            # Outside any entity; close the previous one, if any.
            if type != "":
                span.append((start, i, type, ' '.join([x for [x, y] in ioblist[start:i]])))
            type = ""
    if type != "":
        span.append((start, len(ioblist), type, ' '.join([x for [x, y] in ioblist[start:]])))

    # print(span)
    return span
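
# For the sample sentence from training.json, the conversion looks like this
# (a small illustrative sketch; the token/tag pairs come from the annotation shown earlier):
#
#   iob_2_ner_span([["Bill", "B-PERSON"], ["was", "O"], ["born", "O"],
#                   ["1986", "B-DATE"], [".", "O"]])
#   -> [(0, 1, 'PERSON', 'Bill'), (3, 4, 'DATE', '1986')]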
    
def extract_date_of_birth(sentence):
    predicate = "DateOfBirth"
    results = []

    annotation = sentence["annotation"]
    text = sentence["text"]

    print(text)

    # The annotation is a dict keyed by token position ("1", "2", ...),
    # where every value is [token, lemma, POS, IOB tag].
    tokens = [annotation[key] for key in sorted(annotation, key=int)]
    ioblist = [[tok[0], tok[3]] for tok in tokens]
    ner = iob_2_ner_span(ioblist)
    print(ner)

    # Pair every PERSON entity (subject) with every DATE entity (object).
    for e1 in ner:
        for e2 in ner:
            if e1 is not e2 and e1[2] == "PERSON" and e2[2] == "DATE":
                rel = Relation(e1[3], predicate, e2[3])
                results.append(rel)
                print(e1)

    return results
    

def extract_has_parent(sentence):
    predicate = "HasParent"
    results = []
    annotation = sentence["annotation"]
    text = sentence["text"]

    print(text)

    # Build the [lemma, IOB tag] list in token order and group it into spans.
    tokens = [annotation[key] for key in sorted(annotation, key=int)]
    ner = iob_2_ner_span(ioblist=[[tok[1], tok[3]] for tok in tokens])

    # Pair every PERSON entity with every other PERSON entity.
    for e1 in ner:
        for e2 in ner:
            if e1 is not e2 and e1[2] == "PERSON" and e2[2] == "PERSON":
                rel = Relation(e1[3], predicate, e2[3])
                results.append(rel)

    return results
    

def main():
    # Call run() on the training data.
    from run import run
    run("training.json")


if __name__ == "__main__":
    main()
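
The two extraction functions can also be called directly with a sentence dict shaped like the records in training.json. The sketch below reuses the sample sentence from above (the lemma column is illustrative); besides the debug prints, the final print shows the extracted relation:

from extractor import extract_date_of_birth

sentence = {
    "text": "Bill was born 1986.",
    "annotation": {
        "1": ["Bill", "bill", "NNP", "B-PERSON"],
        "2": ["was", "be", "VBD", "O"],
        "3": ["born", "bear", "VBN", "O"],
        "4": ["1986", "1986", "CD", "B-DATE"],
        "5": [".", ".", ".", "O"],
    },
}
print(extract_date_of_birth(sentence))  # [<Bill --DateOfBirth--> 1986>]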

Output Screens:
