Convert doc or docx files to txt in Python

So, say you have many doc or docx files that you have to convert to txt, and maybe to HTML too, and then, for example, add to a database. For that you can use these commands and libs.

In most cases we use tika, but there is also another interesting lib, textract.

import textract

# textract chooses a parser from the file extension; the result is bytes, not str
text = textract.process("path/to/file.extension")
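
One thing to watch for: textract.process hands back bytes, so the result usually has to be decoded before it is written to a .txt file. Here is a minimal sketch, assuming the output is UTF-8 (which, as far as I know, is textract's default). The names save_text and convert_with_textract are made up for this example and are not part of textract.

```python
from pathlib import Path


def save_text(raw, txt_path, encoding="utf-8"):
    """Write extracted text to txt_path and return it as str.

    Accepts bytes (what textract returns) or an already decoded str.
    """
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    Path(txt_path).write_text(text, encoding=encoding)
    return text


def convert_with_textract(src_path, txt_path):
    """Hypothetical helper: extract src_path with textract, save as txt."""
    import textract  # third-party, imported here so save_text works without it

    return save_text(textract.process(src_path), txt_path)
```

Calling convert_with_textract("path/to/file.extension", "file.txt") would then leave the plain text in file.txt.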

.docx via python-docx2txt

So, if you want, you can use this package on its own, without textract.

sudo pip3 install docx2txt

A minimal script looks like this:

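import os

import docx2txt

# The second argument is the folder where embedded images are written.
# Create it first, in case it does not exist yet.
os.makedirs("text/", exist_ok=True)
text = docx2txt.process("job today.docx", "text/")

print(text)  # or inspect it with: import pdb; pdb.set_trace()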

Run it to test.
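
For the workflow from the intro (many files, then into a database), the same call can be wrapped in a loop. This is only a sketch under my own assumptions: SQLite as the database, a table called documents, and a helper called convert_folder are all invented for the example. The extractor is passed in as a function, so docx2txt.process can be swapped for something else later.

```python
import sqlite3
from pathlib import Path


def convert_folder(src_dir, db_path, extract):
    """Extract text from every .docx in src_dir and store it in SQLite.

    extract is any callable that takes a file path and returns str,
    for example docx2txt.process. Returns the number of files stored.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body TEXT)"
    )
    count = 0
    for path in sorted(Path(src_dir).glob("*.docx")):
        text = extract(str(path))
        # Re-running the script overwrites the row for the same file name.
        conn.execute(
            "INSERT OR REPLACE INTO documents (name, body) VALUES (?, ?)",
            (path.name, text),
        )
        count += 1
    conn.commit()
    conn.close()
    return count
```

With docx2txt installed, convert_folder("docs", "documents.db", docx2txt.process) would fill the table from a folder named docs (both names are placeholders).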


For more advanced cases, when you have to support many file formats like PDF, DOC, etc., tika is the best tool I have seen so far.

It runs a small server written in Java. So in this case, when I convert only docx files, it is a bit overkill.
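
One way to get the best of both is to use docx2txt for .docx and only wake up tika for everything else. A rough sketch, assuming the tika Python package (pip install tika, plus a Java runtime) and its parser.from_file call, which returns a dict with a "content" key. The names tika_text and extract_text are hypothetical helpers for this example.

```python
from pathlib import Path


def tika_text(parsed):
    """Pull the text out of a dict shaped like tika's parser output.

    "content" can be None when nothing was extracted, so fall back to "".
    """
    return (parsed.get("content") or "").strip()


def extract_text(path):
    """Use docx2txt for .docx files and tika for any other format."""
    if Path(path).suffix.lower() == ".docx":
        import docx2txt  # lightweight, no Java server involved

        return docx2txt.process(path)

    from tika import parser  # talks to the Java Tika server

    return tika_text(parser.from_file(path))
```

This way the Java server is only needed when a non-docx file actually shows up.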