With VBA you can write many helpful macros to automate work processes within Microsoft Office. A very useful function would be to read values from a PDF file and transfer them to Excel or Word using VBA. This would avoid tedious typing or manual copying from the PDF into Excel or Word and save a lot of time.
Preparation: Convert PDF to text file
To accomplish this requires a little preparation. Those who have the paid version of Adobe Acrobat can try it directly through the library. But since most will not have the paid version, we use a command line tool to convert the PDF file to a text file, and then read the text file. This work is done by the pdftotext.exe of XpdfReader. Simply download the tools and place the exe (pdftotext.exe) in any folder. The exe-file can also be located in the same folder path as the PDF files to be read.
The readout of the PDF is to be done using the example of an invoice. Let’s assume we have the following invoice and want to read the total amount from it and write it to an Excel file.
In this example we have placed the PDF and the pdftotext.exe in the same directory.
In order to be able to read the values of the PDF file correctly later with a VBA macro, we must first understand what the text file looks like that the command line tool creates. This is the only way to parse the text file correctly later. The best way to do this is to first create the text file manually using the command prompt (cmd). There you can also learn the syntax, which we will need later in the VBA code. The text file is automatically saved in the same directory and under the same name as the PDF file. So if the PDF file is called Invoice-2022-5340.pdf, then the text file is called Invoice-2022-5340.txt.
For our example, the generated text file looks like this:
If you know the structure of the text file, you can think about the best way to process/parse the text and how to get the desired information.
Note: The pdftotext.exe tool is not a guarantee that it will work or that everything will be read correctly. There may be PDF files that cannot be read because perhaps the font is not known, the PDF is encrypted, copy-protected or similar. Also in this example it is noticeable that for example the Euro (€) sign has not been exported.
VBA-Code
Now what does the VBA macro look like, for example, to read in the total price from the invoice?
The following procedure:
- Read the text from the PDF and save it to a text file
- Open the text file and assign the text to a variable
- parse the text, preferably with regex, to get the desired values
- write the desired value into an Excel cell or into Word
- Delete the text file again
Sub ReadPDFFile()
Dim WSHShell As Object
Dim FSO As Object
Dim regex As Object
Dim strCommand, pdfFilePath, txtFilePath, strText, txt As String
On Error GoTo ErrorHandling
Set WSHShell = CreateObject("WScript.Shell")
Set FSO = CreateObject("Scripting.FileSystemObject")
''Directory path to PDF file
pdfFilePath = "c:\temp\Invoice-2022-5340.pdf"
'convert to text file with command line tool pdftotext.exe
strCommand = """c:\temp\pdftotext.exe"" -raw " & """" & pdfFilePath & """"
WSHShell.Run strCommand, 0, True
'Determine path of text file and read text file
txtFilePath = Replace(pdfFilePath, ".pdf", ".txt")
strText = FSO.OpenTextFile(txtFilePath).ReadAll
'Parse text with regex
Set regex = CreateObject("vbscript.regexp")
regex.Pattern = "Total:\s*\d*,\d{2}"
Set match1 = regex.Execute(strText)
If match1.Count > 0 Then
txt = match1(0)
txt = Replace(txt, "Total: ", "")
'Insert into Excel cell
Cells(3, 3).Value = txt
End If
'Delete text file again after work is done
Kill (txtFilePath)
Set regex = Nothing
Set WSHShell = Nothing
Set FSO = Nothing
Exit Sub
ErrorHandling:
'Perform a few cleanups in case of an error.
Kill (txtFilePath)
Set regex = Nothing
Set WSHShell = Nothing
Set FSO = Nothing
End Sub
This code reads the PDF file and writes the total amount of the invoice to the Excel cell C3.