Convert doc or docx files to txt in Python


Suppose you have many document files that you need to convert to txt, and maybe to HTML as well, and then store the result in a database, for example. In that case you can use the commands and libraries below.


In most cases we use Tika, but textract is another interesting library.


import textract
text = textract.process("path/to/file.extension")
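Note that textract.process returns bytes rather than str, so the result usually needs decoding before it is written to a txt file. A minimal sketch of that step follows; the helper name extract_to_txt and its extract parameter are my own assumptions for illustration, not part of textract.

```python
from pathlib import Path


def extract_to_txt(src, dst, extract=None):
    """Extract text from src, write it to dst as UTF-8, and return the text.

    `extract` is a hypothetical hook for illustration; by default it is
    textract.process, imported lazily so textract is only needed when used.
    """
    if extract is None:
        import textract  # third-party: pip3 install textract
        extract = textract.process
    raw = extract(str(src))
    # textract.process returns bytes (UTF-8 encoded by default), so decode it.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

Calling extract_to_txt("path/to/file.extension", "file.txt") should then leave the extracted text in file.txt.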



.docx via python-docx2txt



textract handles .docx files through python-docx2txt, so if you prefer you can use that package directly, without textract.


https://github.com/ankushshah89/python-docx2txt



sudo pip3 install docx2txt

A minimal script looks like this:


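import docx2txt

# The second argument is the directory where embedded images are extracted;
# it should already exist. Leave it out if you only need the text.
text = docx2txt.process("job today.docx", "text/")

# Drop into the debugger to inspect `text` interactively.
import pdb;pdb.set_trace()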


Save it as convert2text.py and run it to test:

python3 convert2text.py
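Since the goal is usually to convert many files at once, the same call can be wrapped in a loop over a folder. This is only a sketch under my own assumptions: the function name convert_folder, the folder layout, and the extract parameter are hypothetical, and only docx2txt.process comes from the library.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, extract=None):
    """Convert every .docx in src_dir to a .txt file in dst_dir.

    `extract` is a hypothetical hook for illustration; by default it is
    docx2txt.process, imported lazily so docx2txt is only needed when used.
    Returns the list of .txt paths that were written.
    """
    if extract is None:
        import docx2txt  # third-party: pip3 install docx2txt
        extract = docx2txt.process
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        text = extract(str(docx_path))
        out_path = dst / (docx_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

For example, convert_folder("docs/", "text/") would write text/report.txt for docs/report.docx. From there the text can be loaded into a database or processed further.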






For more advanced cases, when you have to support many file formats such as PDF, DOC, etc., Tika is the best tool I have seen so far.


It runs a small server written in Java, so when I only need to convert docx files it is a bit of overkill.
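For comparison, here is roughly what the Tika route looks like from Python, assuming the tika package from PyPI (tika-python) and a working Java install. The helper name tika_text and its parse parameter are my own for illustration; parser.from_file and the "content" key come from tika-python.

```python
def tika_text(path, parse=None):
    """Return the plain text Tika extracts from path, or "" if there is none.

    `parse` is a hypothetical hook for illustration; by default it is
    tika.parser.from_file, imported lazily because it needs Java and
    starts the Tika server on first use.
    """
    if parse is None:
        from tika import parser  # third-party: pip3 install tika
        parse = parser.from_file
    parsed = parse(str(path))
    # The parsed result is a dict; "content" can be None when no text is found.
    return (parsed.get("content") or "").strip()
```

The first call can be slow, since the server has to start before anything is parsed.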


