How to recognize names and entities in Java

In this post we look at how to recognize names using Java. The topic belongs to natural language processing (NLP). In an NLP application, a developer may want to extract the names and entities present in a text, and information extraction has a specific subtask that identifies those entities.

This post is part of a series of posts on natural language processing.

This task is called NER (named-entity recognition). Its goal is to recognize predefined categories such as names of people, organizations, locations, time expressions, quantities, monetary values, and percentages.

Here is a simple example in English:

Jim bought 300 shares of Acme Corp. in 2006.
[Jim]_Person bought 300 shares of [Acme Corp.]_Organization in [2006]_Time.

This kind of software is built on classifiers and models. A classifier is trained on a dataset that contains many examples of correctly labeled objects. Training produces a model, and new inputs supplied by the user are then classified according to that model. Each classification is a prediction based on the examples seen during training.
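The train, model, predict cycle described above can be sketched with a deliberately tiny example. This is a toy only: it counts how often each token was seen with each label and predicts the most frequent one, which is far simpler than the statistical sequence models used by real NER systems. The class name `ToyEntityClassifier`, its methods, and the label strings are hypothetical names chosen for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of train -> model -> predict. Not a real NER algorithm. */
public class ToyEntityClassifier {

  // The "model": for each token, how many times each label was seen in training.
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  /** Training step: record one correctly labeled example. */
  public void train(String token, String label) {
    counts.computeIfAbsent(token, k -> new HashMap<>())
          .merge(label, 1, Integer::sum);
  }

  /** Prediction step: most frequent label for the token, or "O" (no entity) if unseen. */
  public String predict(String token) {
    Map<String, Integer> labels = counts.get(token);
    if (labels == null) {
      return "O";
    }
    String best = "O";
    int bestCount = 0;
    for (Map.Entry<String, Integer> entry : labels.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    ToyEntityClassifier classifier = new ToyEntityClassifier();
    // Hand-made training data, invented for this sketch.
    classifier.train("Jim", "PERSON");
    classifier.train("Acme", "ORGANIZATION");
    classifier.train("Corp.", "ORGANIZATION");
    classifier.train("bought", "O");

    for (String token : "Jim bought 300 shares of Acme Corp. in 2006".split(" ")) {
      System.out.println(token + "/" + classifier.predict(token));
    }
  }
}
```

A lookup table like this cannot label words it has never seen, which is exactly the gap a trained statistical model is meant to close by generalizing from context.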

A practical example

For a practical example of recognizing names in Java, Stanford University has created an implementation of this algorithm and trained it to recognize 3 specific classes: people, organizations, and locations. The input can be any text, and the algorithm tries to predict which words in each sentence are people, organizations, or locations. Stanford University maintains good documentation for the tool, including a working example.

Keep in mind that the Stanford software is free to use, but you need to download its models and CoreNLP. We also recommend that developers visit the site of the Stanford Natural Language Processing group.

This group maintains not only the NER but also many other interesting tools. Here is an example of Java code:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.*;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.sequences.DocumentReaderAndWriter;
import edu.stanford.nlp.util.Triple;

import java.util.List;


public class NERDemo {

  public static void main(String[] args) throws Exception {

    String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";

    if (args.length > 0) {
      serializedClassifier = args[0];
    }

    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(serializedClassifier);

    if (args.length > 1) {
      // A second argument is treated as the path of a text file to classify.
      String fileContents = IOUtils.slurpFile(args[1]);
      List<List<CoreLabel>> out = classifier.classify(fileContents);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      out = classifier.classifyFile(args[1]);
      for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
          System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
        }
        System.out.println();
      }

      System.out.println("---");
      List<Triple<String, Integer, Integer>> list = classifier.classifyToCharacterOffsets(fileContents);
      for (Triple<String, Integer, Integer> item : list) {
        System.out.println(item.first() + ": " + fileContents.substring(item.second(), item.third()));
      }
      System.out.println("---");
      System.out.println("Ten best entity labelings");
      DocumentReaderAndWriter<CoreLabel> readerAndWriter = classifier.makePlainTextReaderAndWriter();
      classifier.classifyAndWriteAnswersKBest(args[1], 10, readerAndWriter);

      System.out.println("---");
      System.out.println("Per-token marginalized probabilities");
      classifier.printProbs(args[1], readerAndWriter);

    } else {
      // No file given: classify two hardcoded example sentences in several output formats.
      String[] example = {"Good afternoon Rajat Raina, how are you today?",
                          "I go to school at Stanford University, which is located in California." };
      for (String str : example) {
        System.out.println(classifier.classifyToString(str));
      }
      System.out.println("---");

      for (String str : example) {
        // This one puts in spaces and newlines between tokens, so just print not println.
        System.out.print(classifier.classifyToString(str, "slashTags", false));
      }
      System.out.println("---");

      for (String str : example) {
        // This one is best for dealing with the output as a TSV (tab-separated column) file.
        // The first column gives entities, the second their classes, and the third the remaining text in a document
        System.out.print(classifier.classifyToString(str, "tabbedEntities", false));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyWithInlineXML(str));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.println(classifier.classifyToString(str, "xml", true));
      }
      System.out.println("---");

      for (String str : example) {
        System.out.print(classifier.classifyToString(str, "tsv", false));
      }
      System.out.println("---");

      // This gets out entities with character offsets
      int j = 0;
      for (String str : example) {
        j++;
        List<Triple<String,Integer,Integer>> triples = classifier.classifyToCharacterOffsets(str);
        for (Triple<String,Integer,Integer> trip : triples) {
          System.out.printf("%s over character offsets [%d, %d) in sentence %d.%n",
                  trip.first(), trip.second(), trip.third(), j);
        }
      }
      System.out.println("---");

      // This prints out all the details of what is stored for each token
      int i=0;
      for (String str : example) {
        for (List<CoreLabel> lcl : classifier.classify(str)) {
          for (CoreLabel cl : lcl) {
            System.out.print(i++ + ": ");
            System.out.println(cl.toShorterString());
          }
        }
      }
      System.out.println("---");
    }
  }
}
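Once the text has been tagged, a common next step is to pull the entities back out of the result. The sketch below uses only the Java standard library and assumes the inline XML output wraps each entity as `<LABEL>text</LABEL>` with uppercase label names. The sample string is hand-written in that assumed shape and was not captured from a real run; `InlineXmlEntities` and `extract` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts "LABEL: text" pairs from text tagged with inline XML-style entity tags. */
public class InlineXmlEntities {

  // Matches <TAG>text</TAG>; the back-reference \1 requires the closing tag to match the opening one.
  private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

  /** Returns one "LABEL: text" string per tagged span, in order of appearance. */
  public static List<String> extract(String tagged) {
    List<String> entities = new ArrayList<>();
    Matcher matcher = ENTITY.matcher(tagged);
    while (matcher.find()) {
      entities.add(matcher.group(1) + ": " + matcher.group(2));
    }
    return entities;
  }

  public static void main(String[] args) {
    // Hand-written sample in the assumed output shape.
    String tagged = "I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, "
        + "which is located in <LOCATION>California</LOCATION>.";
    for (String entity : extract(tagged)) {
      System.out.println(entity);
    }
    // Prints:
    // ORGANIZATION: Stanford University
    // LOCATION: California
  }
}
```

When the classifier object is available, the character-offset triples shown in the demo avoid string parsing altogether; a regex pass like this one is mainly useful when only the tagged text has been stored.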
