I have two PDF files. On the Security tab, both files show Security Method: No Security, Document Assembly: Not Allowed, and Page Extraction: Not Allowed. All other items are allowed. I am using the standard iTextSharp method to retrieve text from a PDF:
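    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(fileName);
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        // Also tried: new LocationTextExtractionStrategy()
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        // Round-trips the string through Encoding.Default; characters that
        // encoding cannot represent are replaced.
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.Append(currentText);
    }
    pdfReader.Close();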
From the first file I can get currentText without any problem, but from the second file I cannot retrieve any text: currentText is empty. I tried LocationTextExtractionStrategy, but the result is the same. I opened this file in SodaPDF and converted it to a txt file, but that file is empty too (while the first file converts to txt without any problems). Is it possible to read text from this file from C# or with any other application? If I buy Adobe Reader, will I be able to convert this file to txt? What is the difference between these two files?
Answers
I would suggest you download and try XsPdf for .NET to convert the PDF to a text file. If your file contains images and you need to extract the text from those images, you can convert the PDF file to images and then perform OCR using .NET.
Many PDFs actually consist of images, such as scanned pages. As Bruno Lowagie said, you cannot extract text from an image-only PDF; you need third-party OCR for this.
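To check whether a file falls into the image-only category described above, one option is to list the pages for which iTextSharp finds no text at all. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name PdfTextCheck and the method name PagesWithoutText are hypothetical names chosen for illustration. An empty result for a page suggests, but does not prove, that the page is a scanned image and needs OCR.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
static class PdfTextCheck
{
    // Returns the 1-based numbers of the pages for which
    // iTextSharp extracts no text (or only whitespace).
    public static List<int> PagesWithoutText(string fileName)
    {
        var emptyPages = new List<int>();
        PdfReader reader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string pageText = PdfTextExtractor.GetTextFromPage(
                    reader, page, new SimpleTextExtractionStrategy());

                // No extractable text: this page is a candidate for OCR.
                if (string.IsNullOrWhiteSpace(pageText))
                {
                    emptyPages.Add(page);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return emptyPages;
    }
}
```

If PagesWithoutText returns every page of the second file and none of the first, that would be consistent with the second file containing only page images.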
You can use Adobe Acrobat to convert the PDF to an editable format such as Word or HTML.