VBA read PDF

VBA - Visual Basic for Application

With VBA you can write many helpful macros to automate work processes within Microsoft Office. A very useful function would be to read values from a PDF file and transfer them to Excel or Word using VBA. This would avoid tedious typing or manual copying from the PDF into Excel or Word and save a lot of time.

Preparation: Convert PDF to text file

To accomplish this requires a little preparation. Those who have the paid version of Adobe Acrobat can try it directly through the library. But since most will not have the paid version, we use a command line tool to convert the PDF file to a text file, and then read the text file. This work is done by the pdftotext.exe of XpdfReader. Simply download the tools and place the exe (pdftotext.exe) in any folder. The exe-file can also be located in the same folder path as the PDF files to be read.

The readout of the PDF is to be done using the example of an invoice. Let’s assume we have the following invoice and want to read the total amount from it and write it to an Excel file.

Screenshot of PDF-Invoice as an expample
pdf invoice example

In this example we have placed the PDF and the pdftotext.exe in the same directory.

Screenshot File Explorer with pdf file that shall be read out and pdftotext.exe
Example PDF for export text values to microsoft office, e.g. Excel, Word, Access

In order to be able to read the values of the PDF file correctly later with a VBA macro, we must first understand what the text file looks like that the command line tool creates. This is the only way to parse the text file correctly later. The best way to do this is to first create the text file manually using the command prompt (cmd). There you can also learn the syntax, which we will need later in the VBA code. The text file is automatically saved in the same directory and under the same name as the PDF file. So if the PDF file is called Invoice-2022-5340.pdf, then the text file is called Invoice-2022-5340.txt.

For our example, the generated text file looks like this:

Screenshot converted text from PDF for further use in Microsoft Office (Excel, Word, Access)
Readed text from PDF within txt-file

If you know the structure of the text file, you can think about the best way to process/parse the text and how to get the desired information.

Note: The pdftotext.exe tool is not a guarantee that it will work or that everything will be read correctly. There may be PDF files that cannot be read because perhaps the font is not known, the PDF is encrypted, copy-protected or similar. Also in this example it is noticeable that for example the Euro (€) sign has not been exported.

VBA-Code

Now what does the VBA macro look like, for example, to read in the total price from the invoice?
The following procedure:

  1. Read the text from the PDF and save it to a text file
  2. Open the text file and assign the text to a variable
  3. parse the text, preferably with regex, to get the desired values
  4. write the desired value into an Excel cell or into Word
  5. Delete the text file again
Sub ReadPDFFile()
    Dim WSHShell As Object
    Dim FSO As Object
    Dim regex As Object
    Dim strCommand, pdfFilePath, txtFilePath, strText, txt As String
    On Error GoTo ErrorHandling

    Set WSHShell = CreateObject("WScript.Shell")
    Set FSO = CreateObject("Scripting.FileSystemObject")

    ''Directory path to PDF file
    pdfFilePath = "c:\temp\Invoice-2022-5340.pdf"

    'convert to text file with command line tool pdftotext.exe
    strCommand = """c:\temp\pdftotext.exe"" -raw " & """" & pdfFilePath & """"
    WSHShell.Run strCommand, 0, True
    
    'Determine path of text file and read text file
    txtFilePath = Replace(pdfFilePath, ".pdf", ".txt")
    strText = FSO.OpenTextFile(txtFilePath).ReadAll
    
    'Parse text with regex
    Set regex = CreateObject("vbscript.regexp")
    regex.Pattern = "Total:\s*\d*,\d{2}"
    Set match1 = regex.Execute(strText)
    If match1.Count > 0 Then
        txt = match1(0)
        txt = Replace(txt, "Total: ", "")
        'Insert into Excel cell
        Cells(3, 3).Value = txt
    End If
    
    'Delete text file again after work is done
    Kill (txtFilePath)

    Set regex = Nothing
    Set WSHShell = Nothing
    Set FSO = Nothing
    
    Exit Sub
ErrorHandling:
        'Perform a few cleanups in case of an error.
        Kill (txtFilePath)
        Set regex = Nothing
        Set WSHShell = Nothing
        Set FSO = Nothing
End Sub

This code reads the PDF file and writes the total amount of the invoice to the Excel cell C3.

Screenshot read out PDF file via VBA
Result: Values of PDF read out and inserted to excel file
Bestseller No. 1
Bestseller No. 2

Leave a Reply

Your email address will not be published. Required fields are marked *