Parse only the car makes (brands) and models from a sentence



To perform pattern matching for extracting car makes from a sentence, you can use spaCy's `Matcher` class. The `Matcher` allows you to define patterns based on token attributes and match them against a given document. Here's an example of how you can use `Matcher` for this task:


```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

def extract_car_make(sentence, car_make_list):
    doc = nlp(sentence)

    matcher = Matcher(nlp.vocab)
    # One single-token pattern per make, so each brand can match on its own
    patterns = [[{"LOWER": make.lower()}] for make in car_make_list]
    matcher.add("CAR_MAKE", patterns)

    matches = matcher(doc)
    car_make = None

    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text

    return car_make

# Example usage
sentence = "I saw a red Ford Mustang on the street."
car_make_list = ["Ford", "Chevrolet", "Toyota", "BMW"]
make = extract_car_make(sentence, car_make_list)
print(make)  # Output: Ford
```

In this example, the `extract_car_make` function takes a sentence and a list of car makes as input. The sentence is processed using spaCy's `nlp` function to obtain a parsed `doc` object. We initialize a `Matcher` instance by providing the vocabulary of the loaded language model.

We create one pattern per car make using a list comprehension, converting each make to lowercase. Each pattern is a list containing a single token specification, which requires the token's lowercase form to equal the make. For example, `[{"LOWER": "ford"}]` is the pattern for "Ford".

We add all the patterns to the matcher with `matcher.add("CAR_MAKE", patterns)`, where "CAR_MAKE" is a unique ID for this group of patterns. (In spaCy v2 the signature was `matcher.add("CAR_MAKE", None, *patterns)`; the example above uses the v3 API.)

We then use `matcher(doc)` to find matches in the document. If there is a match, we extract the car make from the first match by getting the start and end indices and using `doc[start:end].text`.

In the example usage, the sentence "I saw a red Ford Mustang on the street" is passed to the `extract_car_make` function, along with a car make list containing "Ford", "Chevrolet", "Toyota", and "BMW". The function identifies "Ford" as the car make, and it is printed as the output.

You can modify the `car_make_list` variable to include the specific car makes you want to extract using pattern matching. The function will then find the first matching car make from the list in the sentence and return it. 


Every brand has to match as its own pattern, which is why the patterns are built as a list of single-token patterns rather than one long pattern. Inside a class that holds the pipeline and the make list, a working method could look like this:


```python
def extract_car_make(self, text):
    doc = self.nlp(text)
    car_make = None

    matcher = Matcher(self.nlp.vocab)
    # One single-token pattern per make in the list
    patterns = [[{"LOWER": make.lower()}] for make in self.car_make_list]
    matcher.add("CAR_MAKE", patterns)

    matches = matcher(doc)
    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text

    # Look up the matched make in the database
    make = VehicleMerk.objects.get(name=car_make)
    return make
```
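Note that `VehicleMerk.objects.get(name=car_make)` raises `VehicleMerk.DoesNotExist` when no make was matched (`car_make` stays `None`), so in production code you may want to guard against that case before the lookup.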

With this method you can also parse a subcategory (here: the model) from the same text. For example:

```python
def extract_car_make_and_model(self, text):
    doc = self.nlp(text)
    car_make = None
    car_model = None

    # First pass: match the make
    matcher = Matcher(self.nlp.vocab)
    patterns = [[{"LOWER": make.lower()}] for make in self.car_make_list]
    matcher.add("CAR_MAKE", patterns)
    matches = matcher(doc)
    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text

    make = VehicleMerk.objects.get(name=car_make)

    # Second pass: match only the models that belong to the matched make
    model_list = VehicleModel.objects.filter(vehicle_merk=make).values_list('name', flat=True)
    matcher_model = Matcher(self.nlp.vocab)
    patterns = [[{"LOWER": name.lower()}] for name in model_list]
    matcher_model.add("CAR_MODEL", patterns)
    matches = matcher_model(doc)
    if matches:
        match_id, start, end = matches[0]
        car_model = doc[start:end].text

    model = VehicleModel.objects.get(name=car_model, vehicle_merk=make)
    return make, model
```

Alternatively, you can define make and model together in one pattern if you need them as a single string. In our case we need to add a car from the list on our site and run price calculations, so it is necessary that the make and the model come out as separate strings.
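For illustration, a combined two-token pattern could look like this minimal sketch (reusing the `nlp` object from the first example; the "Ford Mustang" pair is just an example pairing):

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Two-token pattern: the make followed immediately by the model
pattern = [{"LOWER": "ford"}, {"LOWER": "mustang"}]
matcher.add("CAR_MAKE_AND_MODEL", [pattern])

doc = nlp("I saw a red Ford Mustang on the street.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Ford Mustang
```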

The current approach runs into trouble with multi-word makes such as Alfa Romeo, Panther Westwinds, and Martin Motors. The problem lies in the pattern matcher: it compares each pattern entry against individual tokens, so a single-token pattern like `[{"LOWER": "alfa romeo"}]` can never match, because the tokenizer splits the name into the two tokens "alfa" and "romeo". To address this, the matching should be done across the entire text rather than token by token.
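The failure is easy to reproduce. In this minimal sketch, the single-token pattern finds nothing, while a hand-split two-token pattern does match:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("She drives an Alfa Romeo.")

matcher = Matcher(nlp.vocab)
# Single-token pattern: never matches, "Alfa Romeo" is two tokens
matcher.add("ONE_TOKEN", [[{"LOWER": "alfa romeo"}]])
print(len(matcher(doc)))  # 0

# Two-token pattern: matches, but requires splitting every name in advance
matcher.add("TWO_TOKENS", [[{"LOWER": "alfa"}, {"LOWER": "romeo"}]])
print(len(matcher(doc)))  # 1
```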


Customizing spaCy’s Tokenizer class

Furthermore, we have a problem with the standard tokenizer.

We have to use our own, because a make or model name can contain a hyphen, and the default tokenizer then splits that name into two or three tokens, so matching no longer works. A make like Mercedes-Benz, for example, is not matched.


Our custom configuration treats only `~` as an infix, so hyphens no longer split a token:


```python
import re
from spacy.tokenizer import Tokenizer

# Only "~" is treated as an infix, so hyphenated names stay together
self.nlp.tokenizer = Tokenizer(self.nlp.vocab, infix_finditer=re.compile(r'''[~]''').finditer)
```
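As a quick check outside the class (a sketch with a plain `nlp` object), compare the tokenization before and after swapping in the custom tokenizer:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
print([t.text for t in nlp("Mercedes-Benz")])
# ['Mercedes', '-', 'Benz']  -- the default tokenizer splits on the hyphen

nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=re.compile(r'''[~]''').finditer)
print([t.text for t in nlp("Mercedes-Benz")])
# ['Mercedes-Benz']  -- kept as one token
```

Be aware that this minimal tokenizer has no prefix or suffix rules either, so punctuation is no longer split off; that is a trade-off to keep in mind.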


A custom tokenizer in spaCy is needed when the default tokenization provided by spaCy does not meet your specific requirements. You may require a custom tokenizer in the following scenarios:

1. Non-standard text: If you are working with text that deviates from standard language rules or includes domain-specific terms or symbols, a custom tokenizer can handle tokenizing such text more accurately.

2. Specialized tokenization rules: If your text follows specific rules or patterns that differ from the default tokenization behavior of spaCy, a custom tokenizer allows you to define and implement these specialized tokenization rules.

3. Tokenizing non-textual data: If you are working with data types other than traditional text, such as code snippets, URLs, or social media posts, a custom tokenizer can handle these data types appropriately.

To create a custom tokenizer in spaCy, you can define a callable that takes the text as input and returns a `Doc` object containing the tokens. This callable should implement your specific tokenization logic. Once defined, you assign it to `nlp.tokenizer`; the tokenizer sits in front of the pipeline rather than being a regular pipeline component.

Here's an example of how you can define a custom tokenizer function in spaCy:
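```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

def custom_tokenizer(text):
    # Split the text on whitespace and build a Doc from the resulting words
    words = text.split()
    return Doc(nlp.vocab, words=words)

nlp.tokenizer = custom_tokenizer

doc = nlp("A blue Mercedes-Benz drove by.")
print([token.text for token in doc])
# ['A', 'blue', 'Mercedes-Benz', 'drove', 'by.']
```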

In the above example, the `custom_tokenizer` function splits the text on whitespace to generate tokens. You can modify this function according to your specific tokenization needs.

Remember to adjust the custom tokenizer according to your specific requirements and the structure of the text you are working with.

Regular expressions

As described above, multi-word makes such as Alfa Romeo, Panther Westwinds, and Martin Motors span several tokens, so token-level patterns cannot match them; searching the entire text with regular expressions gets around this.

spaCy's `Matcher` also offers a `REGEX` operator, but it is important to note that it functions on individual tokens rather than the entire text: every expression you provide is matched against a single token. If you need to match on the entire text instead, run the regular expression over `doc.text` and map the character offsets back to a `Span`, as shown below.
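For contrast, a token-level `REGEX` pattern looks like this (a sketch following the spaCy documentation; it matches two adjacent tokens such as "United States"):

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Each REGEX is applied to one token's text, not to the whole document
pattern = [
    {"TEXT": {"REGEX": r"^[Uu](nited|\.?)$"}},
    {"TEXT": {"REGEX": r"^[Ss](tates|\.?)$"}},
]
matcher.add("US", [pattern])
```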

Full-text example from the spaCy documentation:

```python
import spacy
import re

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # This is a Span object, or None if the match doesn't map to a valid token sequence
    if span is not None:
        print("Found match:", span.text)
```

Car make and model example:



```python
# Inside the extraction method: try each make pattern until one matches
for pattern in self.make_patterns:
    car_make = self.parse_make(pattern, doc)
    if car_make:
        break
```

```python
def parse_make(self, expression, doc):
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        # This is a Span object, or None if the match doesn't map to a valid token sequence
        if span is not None:
            return span.text
```
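The snippets above assume `self.make_patterns` already exists; the text does not show how it is built. One plausible construction (a hypothetical sketch, not the original code) turns each stored make name into a case-insensitive, whole-word regex:

```python
import re

# Hypothetical: build regex patterns from the make names in the database
self.make_patterns = [
    r"(?i)\b" + re.escape(name) + r"\b"  # case-insensitive, whole words only
    for name in VehicleMerk.objects.values_list("name", flat=True)
]
```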


