Parse only the car makes (brands) and models from a sentence



To perform pattern matching for extracting car makes from a sentence, you can use spaCy's `Matcher` class. The `Matcher` allows you to define patterns based on token attributes and match them against a given document. Here's an example of how you can use `Matcher` for this task:


```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

def extract_car_make(sentence, car_make_list):
    doc = nlp(sentence)

    matcher = Matcher(nlp.vocab)
    # One single-token pattern per make, so any listed make can match on its own
    patterns = [[{"LOWER": make.lower()}] for make in car_make_list]
    matcher.add("CAR_MAKE", patterns)

    matches = matcher(doc)
    car_make = None

    if matches:
        # Keep only the first match
        match_id, start, end = matches[0]
        car_make = doc[start:end].text

    return car_make

# Example usage
sentence = "I saw a red Ford Mustang on the street."
car_make_list = ["Ford", "Chevrolet", "Toyota", "BMW"]
make = extract_car_make(sentence, car_make_list)
print(make)  # Output: Ford
```

In this example, the `extract_car_make` function takes a sentence and a list of car makes as input. The sentence is processed using spaCy's `nlp` function to obtain a parsed `doc` object. We initialize a `Matcher` instance by providing the vocabulary of the loaded language model.

We build one pattern per car make with a list comprehension. Each pattern is a list containing a single token description that matches on the token's lowercase form; for example, `[{"LOWER": "ford"}]` is the pattern for "Ford". Putting all the token descriptions into one pattern would not work, because a single pattern describes consecutive tokens and would only match a sentence that lists every make in a row.

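We add the patterns to the matcher using `matcher.add("CAR_MAKE", patterns)`, where "CAR_MAKE" is the ID under which the matches are reported. This is the spaCy v3 signature; the older v2 form `matcher.add("CAR_MAKE", None, pattern)` no longer works.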

We then use `matcher(doc)` to find matches in the document. If there is a match, we extract the car make from the first match by getting the start and end indices and using `doc[start:end].text`.

In the example usage, the sentence "I saw a red Ford Mustang on the street" is passed to the `extract_car_make` function, along with a car make list containing "Ford", "Chevrolet", "Toyota", and "BMW". The function identifies "Ford" as the car make, and it is printed as the output.

You can modify the `car_make_list` variable to include the specific car makes you want to extract using pattern matching. The function will then find the first matching car make from the list in the sentence and return it. 
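The function above returns only the first match. If a sentence can mention several makes, a variant can collect every match instead. The following is a minimal sketch under assumptions: `extract_all_car_makes` is a hypothetical helper name, and `spacy.blank("en")` stands in for `en_core_web_sm` only so the example runs without downloading a model, since `LOWER` patterns need nothing beyond the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough for LOWER patterns
nlp = spacy.blank("en")

def extract_all_car_makes(sentence, car_make_list):
    """Return every listed make found in the sentence, in order of appearance."""
    doc = nlp(sentence)
    matcher = Matcher(nlp.vocab)
    # One single-token pattern per make
    patterns = [[{"LOWER": make.lower()}] for make in car_make_list]
    matcher.add("CAR_MAKE", patterns)
    # Sort by start index so the result follows the sentence order
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]

makes = extract_all_car_makes(
    "I traded my Toyota for a BMW, then test-drove a ford.",
    ["Ford", "Chevrolet", "Toyota", "BMW"],
)
print(makes)
```

Each make is returned as it is written in the sentence, so a lowercase "ford" stays lowercase. Normalize the result if you need to compare it against the original list.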


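Every make has to be its own pattern in the list passed to `matcher.add`. As a method on a class that holds the pipeline and the list of makes, a working version could look like this: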


```python
def extract_car_make(self, text):
    doc = self.nlp(text)

    matcher = Matcher(self.nlp.vocab)
    # One single-token pattern per make
    patterns = [[{"LOWER": make.lower()}] for make in self.car_make_list]
    matcher.add("CAR_MAKE", patterns)

    matches = matcher(doc)
    if not matches:
        # No known make in the text; avoid a failing database lookup
        return None

    match_id, start, end = matches[0]
    car_make = doc[start:end].text
    # The match is case-insensitive, so the lookup is too (assumes the Django ORM)
    return VehicleMerk.objects.get(name__iexact=car_make)
```

With the same approach you can also parse a subcategory, such as the model, from the same text. For example:

```python
def extract_car_make_and_model(self, text):
    doc = self.nlp(text)

    # Step 1: find the make
    make_matcher = Matcher(self.nlp.vocab)
    make_patterns = [[{"LOWER": name.lower()}] for name in self.car_make_list]
    make_matcher.add("CAR_MAKE", make_patterns)
    matches = make_matcher(doc)
    if not matches:
        return None, None
    match_id, start, end = matches[0]
    # Case-insensitive lookup, to mirror the LOWER match (assumes the Django ORM)
    make = VehicleMerk.objects.get(name__iexact=doc[start:end].text)

    # Step 2: search only among the models that belong to that make
    model_list = VehicleModel.objects.filter(vehicle_merk=make).values_list('name', flat=True)
    model_matcher = Matcher(self.nlp.vocab)
    model_patterns = [[{"LOWER": name.lower()}] for name in model_list]
    model_matcher.add("CAR_MODEL", model_patterns)
    matches = model_matcher(doc)
    if not matches:
        return make, None
    match_id, start, end = matches[0]
    model = VehicleModel.objects.get(name__iexact=doc[start:end].text, vehicle_merk=make)
    return make, model
```

Alternatively, you can combine make and model in a single pattern if you need them together as one string. In our case we need to select a car from the list on our site and run price calculations, so the make and the model have to be returned separately.
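For the combined variant, a single pattern can describe the make tokens followed directly by the model tokens. This is a minimal sketch under assumptions: `extract_make_and_model_span` and the `models_by_make` dictionary are hypothetical stand-ins for the database queries above, and `spacy.blank("en")` is used only so the example runs without a model download.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) is enough for LOWER patterns
nlp = spacy.blank("en")

def extract_make_and_model_span(sentence, models_by_make):
    """Return the first 'make model' string found in the sentence, or None."""
    patterns = []
    for make, models in models_by_make.items():
        for model in models:
            # Tokenize the name with the same tokenizer as the sentence,
            # so multi-word makes or models become multi-token patterns
            tokens = nlp.make_doc(f"{make} {model}")
            patterns.append([{"LOWER": token.text.lower()} for token in tokens])
    if not patterns:
        return None

    matcher = Matcher(nlp.vocab)
    matcher.add("CAR_MAKE_MODEL", patterns)
    doc = nlp(sentence)
    matches = matcher(doc)
    if not matches:
        return None
    _, start, end = matches[0]
    return doc[start:end].text

# Hypothetical data in place of the VehicleMerk / VehicleModel tables
models_by_make = {"Ford": ["Mustang", "Focus"], "Toyota": ["Corolla"]}
result = extract_make_and_model_span("I saw a red Ford Mustang on the street.", models_by_make)
print(result)
```

The trade-off is that this only matches when the model directly follows the make, and it returns one string rather than two separate records.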



Furthermore, there is a problem with the standard tokenizer.

We have to use our own, because a model name can contain a hyphen. The default tokenizer splits such a name into two or three tokens, and the single-token patterns then no longer match these models.

Customizing spaCy’s Tokenizer class

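Our custom configuration treats only `~` as an infix, so hyphenated names stay in one token. It needs `import re` and `from spacy.tokenizer import Tokenizer`:

```python
self.nlp.tokenizer = Tokenizer(self.nlp.vocab, infix_finditer=re.compile(r'''[~]''').finditer)
```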
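The effect of the custom tokenizer can be checked by comparing token lists. The sketch below assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without a model download; the sentence, the model name "CR-V", and the helper `has_model` are illustrative only.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokenizer import Tokenizer

text = "I drive a Honda CR-V"

# Default English tokenizer: a hyphen between letters is split off as an infix
default_nlp = spacy.blank("en")
default_tokens = [token.text for token in default_nlp(text)]

# Custom tokenizer: whitespace splitting, with "~" as the only infix
custom_nlp = spacy.blank("en")
custom_nlp.tokenizer = Tokenizer(
    custom_nlp.vocab, infix_finditer=re.compile(r"[~]").finditer
)
custom_tokens = [token.text for token in custom_nlp(text)]

def has_model(nlp, sentence, model_name):
    """Check whether a single-token LOWER pattern finds the model name."""
    matcher = Matcher(nlp.vocab)
    matcher.add("CAR_MODEL", [[{"LOWER": model_name.lower()}]])
    return bool(matcher(nlp(sentence)))

print(default_tokens)
print(custom_tokens)
print(has_model(default_nlp, text, "CR-V"), has_model(custom_nlp, text, "CR-V"))
```

One caveat of building the `Tokenizer` this way: it has no prefix or suffix rules, so punctuation stays attached to words. A name at the end of a sentence, such as "CR-V.", becomes a single token including the period and will not match unless the text is cleaned first.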

