I have many large PDF files, and I only need to read part of each one. I want to read a PDF file and write its contents to another file, such as a .txt file (or any other type of file). However, I want to limit the size of the output file: once the text file reaches about 15 MB, I want to stop reading the PDF and keep the text file created so far. How can I do this in C#?

Here is the code I use to read the whole file (image content is not important to me):

using (StreamReader sr = new StreamReader(@"F:\1.pdf"))
{
    using (StreamWriter sw = new StreamWriter(@"F:\test.txt"))
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            sw.WriteLine(line);
            sw.Flush();
        }
    }
}

Reading text from PDF file in c#

Answers

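The .NET base class library cannot extract text from a PDF directly. Reading a PDF with StreamReader returns the raw file contents (PDF syntax and, typically, compressed binary streams), not the document's text. You first need to convert the PDF to text (or XML, or HTML).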

There are many PDF libraries that can convert PDF to text, such as iTextSharp (one of the most popular, and open source), as well as many other tools.
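As a rough illustration, per-page text extraction with iTextSharp might look like the sketch below. It assumes the iTextSharp 5.x NuGet package is referenced; the class name PdfPages and the method name ReadPages are made up for this example. Pages are yielded lazily, so a caller that stops enumerating early never parses the rest of the document.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;          // assumption: iTextSharp 5.x is installed
using iTextSharp.text.pdf.parser;

public static class PdfPages       // hypothetical helper, not part of iTextSharp
{
    // Lazily yields the extracted text of each page, in document order.
    public static IEnumerable<string> ReadPages(string pdfPath)
    {
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                yield return PdfTextExtractor.GetTextFromPage(reader, page);
            }
        }
        finally
        {
            // Runs when enumeration finishes or is abandoned early.
            reader.Close();
        }
    }
}
```

Extraction quality depends on how the PDF was produced; scanned pages without a text layer will come back empty.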

To control the size of the output text file, you should:

  • get the number of pages in the PDF
  • run the PDF-to-text conversion page by page, checking the size of the output text file as you go
  • once the file size exceeds 15 MB, stop the conversion and move on to the next file
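The steps above can be sketched roughly as follows. This is a minimal sketch that uses only the .NET base class library and assumes the page text arrives as an IEnumerable<string>, for example from a PDF library's per-page extraction. The names CappedTextWriter and WritePagesUntilLimit are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CappedTextWriter   // hypothetical helper for illustration
{
    // Writes pages in order and stops after the first page that brings the
    // output to maxBytes or more. Returns the number of pages written.
    public static int WritePagesUntilLimit(
        IEnumerable<string> pages, string outputPath, long maxBytes)
    {
        // UTF-8 without a byte order mark, so the running count matches the file size.
        var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        long bytesWritten = 0;
        int pagesWritten = 0;

        using (var writer = new StreamWriter(outputPath, false, encoding))
        {
            foreach (string page in pages)
            {
                writer.WriteLine(page);
                bytesWritten += encoding.GetByteCount(page)
                              + encoding.GetByteCount(writer.NewLine);
                pagesWritten++;

                if (bytesWritten >= maxBytes)
                {
                    break;   // limit reached: skip the remaining pages
                }
            }
        }

        return pagesWritten;
    }
}
```

For a 15 MB cap, pass 15L * 1024 * 1024 as maxBytes. Because the check runs after each page is written, the file can overshoot the limit by up to one page, which fits the "about 15 MB" requirement. Counting bytes in memory avoids flushing and querying the file size after every page.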

You have to use a PDF library to do this. There are many free and paid PDF libraries that can handle this task. I recently used the EO.Pdf library to read PDF pages and extract their content. Its advantages are that it is available as a NuGet package and is actively developed. The downside is that you have to pay for commercial use.