I want to extract text (Unicode and ASCII) from PDF files and analyse the contents further before storing them in an SQL database. Is there an easy way to do this?
This article on XsPDF.com is what you are looking for; check out: Extract Unicode text from PDF in C#.
Answers
After a long search on the internet, I couldn't find an article about extracting Unicode text from PDF. So I planned to implement my own function to extract Unicode text from PDF files in C#, and ended up starting from a procedure found in the samples of the XsPDF library. I can't guarantee that this library is suitable for all cases, but it worked well in my tests.
I've tested some free C# libraries found through Google, but they produced unexpected formatting errors. For example, some characters were scrambled, and spaces (' ') appeared inside words, between every letter. Even worse, huge blocks of spaces took up several lines. And you would not want text extracted from a PDF document without its line breaks, right? Trying out multiple libraries to find a reliable one is extremely frustrating.
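If you are stuck with output that has a space between every letter, a small post-processing step can sometimes repair it before the text goes into the database. The following is only a minimal sketch, not part of any PDF library: the class and method names (`ExtractedTextCleanup`, `CollapseSpacedLetters`) are made up for illustration. It assumes that a run of three or more single letters separated by exactly one space is really one word, so it will also merge genuine sequences such as "x y z", and it can only keep words apart if the extractor left more than one space between them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractedTextCleanup
{
    // Matches a run of three or more single letters, each separated by exactly
    // one space, bounded by whitespace or the start/end of the string.
    // \p{L} covers Unicode letters, not just ASCII.
    private static readonly Regex SpacedLetters =
        new Regex(@"(?<!\S)(?:\p{L} ){2,}\p{L}(?!\S)");

    // Hypothetical helper: removes the single spaces inside each matched run
    // and leaves all other text, including line breaks, untouched.
    public static string CollapseSpacedLetters(string text)
    {
        return SpacedLetters.Replace(text, m => m.Value.Replace(" ", ""));
    }
}
```

Calling `ExtractedTextCleanup.CollapseSpacedLetters(pageText)` on each extracted page would be one way to use it. A regex is used here because the symptom is purely textual; it cannot fix scrambled characters, which come from the font encoding inside the PDF.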
If you are processing PDF files in order to extract well-formatted Unicode text, I suggest considering the XsPDF PDF text extractor SDK. So far it is the only one that works properly with my PDF documents, and it was the most stable and accurate of the ones I tried.
Yeah, this is the one I was using. It was pretty good, even though it is not free for commercial use. Its SDK is reliable and competitively priced. It may be worth a try.