Suppose you have many documents that you need to convert to plain text (and perhaps to HTML as well) and then, for example, store in a database. The commands and libraries below can help.
In most cases we use tika, but textract is another interesting library.
import textract
text = textract.process("path/to/file.extension")
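Note that textract.process returns bytes rather than str in Python 3, so the result usually needs decoding before further use. A minimal sketch, assuming UTF-8 output; the helper name extract_text and its extractor parameter are hypothetical and only exist here so the function can be tried without textract installed:

```python
def extract_text(path, extractor=None, encoding="utf-8"):
    """Return the text of `path` as a str.

    `extractor` is a hypothetical hook (not part of textract): any callable
    that takes a path and returns bytes. By default textract.process is used.
    """
    if extractor is None:
        import textract  # third-party, imported lazily
        extractor = textract.process
    raw = extractor(path)  # bytes
    # Decode defensively: undecodable bytes are replaced, not raised.
    return raw.decode(encoding, errors="replace").strip()
```

Calling extract_text("path/to/file.extension") then gives a string that is ready to be saved or inserted into a database.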
textract handles .docx files via python-docx2txt.
So, if you only need .docx, you can use that package directly, without textract.
https://github.com/ankushshah89/python-docx2txt
sudo pip3 install docx2txt
A minimal script looks like this:
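import docx2txt
# The optional second argument is a directory where embedded images are saved.
text = docx2txt.process("job today.docx", "text/")
# Print the extracted text to check the result.
print(text)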
Run it to test:
python3 convert2text.py
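For the "many files" case, the same call can be looped over a folder. This is only a sketch under a few assumptions: the function name convert_folder, the folder layout and the convert parameter are hypothetical, and docx2txt.process is called with the file path only, so no images are extracted.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert=None):
    """Convert every .docx in src_dir to a .txt file in out_dir.

    `convert` is a hypothetical hook: any callable taking a path string and
    returning the text. By default docx2txt.process is used.
    """
    if convert is None:
        import docx2txt  # third-party, imported lazily
        convert = docx2txt.process
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for docx in sorted(Path(src_dir).glob("*.docx")):
        text = convert(str(docx))
        target = out / (docx.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written  # paths of the .txt files that were created
```

Something like convert_folder("docs/", "text/") would then leave one .txt file per .docx, which is a convenient intermediate step before loading the text into a database.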
File formats supported by textract, and what it uses for each:
.csv via python builtins
.doc via antiword
.docx via python-docx2txt
.eml via python builtins
.epub via ebooklib
.gif via tesseract-ocr
.jpg and .jpeg via tesseract-ocr
.json via python builtins
.html and .htm via beautifulsoup4
.mp3 via sox, SpeechRecognition, and pocketsphinx
.msg via msg-extractor
.odt via python builtins
.ogg via sox, SpeechRecognition, and pocketsphinx
.pdf via pdftotext (default) or pdfminer.six
.png via tesseract-ocr
.pptx via python-pptx
.ps via ps2text
.rtf via unrtf
.tiff and .tif via tesseract-ocr
.txt via python builtins
.wav via SpeechRecognition and pocketsphinx
.xlsx via xlrd
.xls via xlrd
For more advanced cases, when you have to support many file formats such as PDF, DOC, etc., tika is the best tool I have seen so far.
It runs a small server written in Java, so when I only convert .docx files it is a bit overkill.
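For comparison, here is roughly what the tika route looks like from Python. This sketch assumes the tika-python package (pip3 install tika) and a working Java runtime; the helper name tika_text and its parse parameter are hypothetical. As far as I know, parser.from_file returns a dict with "content" and "metadata" keys, and xmlContent=True asks for XHTML instead of plain text, which covers the HTML case mentioned at the start.

```python
def tika_text(path, parse=None, as_xhtml=False):
    """Extract text (or XHTML) from `path` through Apache Tika.

    `parse` is a hypothetical hook with the same call shape as
    tika.parser.from_file, which is used by default.
    """
    if parse is None:
        from tika import parser  # third-party, talks to the Java Tika server
        parse = parser.from_file
    parsed = parse(path, xmlContent=as_xhtml)
    # "content" can be None when nothing could be extracted.
    content = parsed.get("content")
    return (content or "").strip()
```

The first call is slow because the Java server has to start, which is the overhead mentioned above.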