Automated Credit Card Statement Analysis and Personal Finance Management with Python: From PDF Parsing to Spending Pattern Visualization
Are you wasting time manually analyzing credit card statements? We introduce a method to automatically parse PDF statements using Python and visualize spending patterns for efficient personal finance management. The battle with Excel sheets is now over.
1. The Challenge / Context
How much time do you spend each month when you receive your credit card statement? The process of transferring data to Excel, categorizing it, and identifying spending patterns is cumbersome and time-consuming. Furthermore, the analysis tools provided by credit card companies are often limited and do not meet individual needs. This becomes even more difficult when using multiple cards, making it harder to grasp the overall spending flow. By leveraging Python, you can solve these problems and automate personal finance management, saving time and performing more accurate analysis.
2. Deep Dive: pdfminer.six & Pandas
The core technologies are the pdfminer.six library for PDF parsing and the Pandas library for data analysis and manipulation. pdfminer.six is used to extract text from PDF documents, and Pandas facilitates analysis by converting the extracted data into a DataFrame format.
pdfminer.six provides the functionality to accurately extract text by analyzing the layout of PDF documents. In addition to simply extracting text, it can also extract various information such as font size and position, allowing for effective processing of complex PDF documents.
Pandas offers powerful data manipulation and analysis capabilities. DataFrame is optimized for handling tabular data and can perform various operations such as filtering, sorting, grouping, and aggregation. Furthermore, Pandas integrates with various data visualization tools, making it easy to represent spending patterns graphically.
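As a minimal sketch of the operations mentioned above, the snippet below filters, sorts, groups, and aggregates a tiny hand-written DataFrame. The rows are made-up sample data, and the column names (`Date`, `Description`, `Amount`) are chosen only to match the columns used later in this guide; nothing here comes from a real statement.

```python
import pandas as pd

# Made-up sample transactions (illustrative only)
sample = pd.DataFrame({
    'Date': ['2024.01.05', '2024.01.12', '2024.02.03', '2024.02.20'],
    'Description': ['Starbucks', 'Coupang', 'Starbucks', 'Taxi'],
    'Amount': [4500.0, 32000.0, 5100.0, 12800.0],
})

# Filtering: keep transactions of 5,000 or more
large = sample[sample['Amount'] >= 5000]

# Sorting: largest amount first
ranked = sample.sort_values('Amount', ascending=False)

# Grouping and aggregation: total and number of purchases per merchant
per_merchant = sample.groupby('Description')['Amount'].agg(['sum', 'count'])

print(large)
print(ranked.head(2))
print(per_merchant)
```

The same four operations are what the later steps rely on once real transactions are loaded into a DataFrame.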
3. Step-by-Step Guide / Implementation
Step 1: Install Required Libraries
First, install pdfminer.six, Pandas, and Matplotlib (Matplotlib is used for the charts in Step 6). Run the following command in your terminal or command prompt.
pip install pdfminer.six pandas matplotlib
Step 2: Extract Text from PDF File
The following code is a basic example of extracting text from a PDF file. Change the file path so that it points to your actual statement PDF.
from pdfminer.high_level import extract_text
def extract_text_from_pdf(pdf_path):
    # Extract all text from the PDF as a single string
    text = extract_text(pdf_path)
    return text
pdf_file_path = 'your_credit_card_statement.pdf'  # Change to actual file path
extracted_text = extract_text_from_pdf(pdf_file_path)
print(extracted_text)
Step 3: Clean Extracted Text Data
The extracted text often contains unnecessary spaces, special characters, and credit card company logo information. You need to extract only the necessary data using regular expressions or string manipulation functions.
import re
def clean_text(text):
    # Remove unnecessary text (e.g., page numbers, credit card company logo text)
    cleaned_text = re.sub(r'Page \d+ of \d+', '', text)
    # Placeholder pattern: replace it with the logo or header text that appears in your statement
    cleaned_text = re.sub(r'Credit card company logo related text', '', cleaned_text)
    cleaned_text = cleaned_text.strip()
    return cleaned_text
cleaned_text = clean_text(extracted_text)
print(cleaned_text)
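Beyond removing fixed strings, the "unnecessary spaces" mentioned above can be handled line by line. The following is a minimal sketch, not part of the original script: `normalize_whitespace` is a hypothetical helper name, and the `raw` string is a made-up fragment standing in for extracted statement text.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces/tabs into one space and drop empty lines."""
    lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs with a single space, then trim the ends
        line = re.sub(r'[ \t]+', ' ', line).strip()
        if line:  # skip lines that are empty after trimming
            lines.append(line)
    return '\n'.join(lines)

# Made-up fragment of extracted statement text
raw = "2024.01.05   Starbucks \t 4,500\n\n   \n2024.01.06  Coupang   32,000  \n"
print(normalize_whitespace(raw))
```

Keeping one transaction per line with single spaces tends to make the regular expressions in the next step simpler to write.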
Step 4: Extract and Structure Transaction Data
Extract transaction data such as date, merchant, and amount from the cleaned text. Since each credit card company has a different statement format, you need to adjust the regular expression or string parsing method to match the statement format. The key is to skillfully use **regular expressions**. For example, if the date format is 'YYYY.MM.DD', you can use a regular expression like `r'\d{4}\.\d{2}\.\d{2}'`.
import re
import pandas as pd
def extract_transactions(text):
    # One transaction per line: date, description, amount.
    # Anchored to a single line (re.MULTILINE) so that adjacent rows are never merged.
    # Adjust this pattern to match your credit card statement format.
    pattern = re.compile(
        r'^[ \t]*(\d{4}\.\d{2}\.\d{2})[ \t]+(.+?)[ \t]+([\d,]+(?:\.\d+)?)[ \t]*$',
        re.MULTILINE,
    )
    transactions = []
    for match in pattern.finditer(text):
        date = match.group(1)
        description = match.group(2).strip()
        amount = float(match.group(3).replace(',', ''))  # Remove commas and convert to float
        transactions.append({'Date': date, 'Description': description, 'Amount': amount})
    return transactions
transactions = extract_transactions(cleaned_text)
df = pd.DataFrame(transactions)
print(df)
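Because each card company formats its statement differently, one possible arrangement is to keep a separate pattern per issuer and select it by name. The sketch below assumes two invented layouts: `card_a` (dates as `YYYY.MM.DD`) and `card_b` (dates as `MM/DD/YYYY` with a `$` before the amount). The issuer names, the `PATTERNS` table, and `parse_line` are all hypothetical and would need to be replaced with patterns written for your own statements.

```python
import re

# Hypothetical formats: one pattern per card issuer, all sharing the same group names
PATTERNS = {
    'card_a': re.compile(
        r'^(?P<date>\d{4}\.\d{2}\.\d{2})\s+(?P<desc>.+?)\s+(?P<amount>[\d,]+(?:\.\d+)?)$'
    ),
    'card_b': re.compile(
        r'^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<desc>.+?)\s+\$(?P<amount>[\d,]+(?:\.\d+)?)$'
    ),
}

def parse_line(line, issuer):
    """Return a transaction dict for one statement line, or None if it does not match."""
    match = PATTERNS[issuer].match(line.strip())
    if match is None:
        return None
    return {
        'Date': match.group('date'),
        'Description': match.group('desc'),
        'Amount': float(match.group('amount').replace(',', '')),
    }

print(parse_line('2024.01.05 Starbucks Gangnam 4,500', 'card_a'))
print(parse_line('01/05/2024 Coffee Shop $4.50', 'card_b'))
print(parse_line('Total due this month', 'card_a'))  # not a transaction line
```

Using the same group names in every pattern means the code that builds the DataFrame does not need to know which issuer a line came from.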
Step 5: Categorize Spending
Categorize spending based on the extracted transaction details. For example, "Starbucks" can be categorized as "Cafe", "Coupang" as "Online Shopping", and so on. Define categories and rules in advance, and use Pandas' `apply` function to assign a category to each transaction.
def categorize_transaction(description):
    # Keywords must match the merchant names as they are printed on your statement
    if 'Starbucks' in description:
        return 'Cafe'
    elif 'Coupang' in description:
        return 'Online Shopping'
    elif 'Taxi' in description:
        return 'Transportation'
    else:
        return 'Other'
df['Category'] = df['Description'].apply(categorize_transaction)
print(df)
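As the list of merchants grows, a chain of `if`/`elif` branches can become hard to maintain. One alternative, sketched here under assumed names, is to define the rules in advance as a table. `CATEGORY_RULES` and `categorize_by_rules` are hypothetical, and the keywords are examples only.

```python
# Hypothetical keyword-to-category rules; extend with the merchants on your own statements
CATEGORY_RULES = {
    'Cafe': ['starbucks', 'coffee'],
    'Online Shopping': ['coupang', 'amazon'],
    'Transportation': ['taxi', 'subway'],
}

def categorize_by_rules(description, rules=CATEGORY_RULES, default='Other'):
    """Return the first category whose keyword appears in the description."""
    text = description.lower()  # make the match case-insensitive
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default

for name in ['STARBUCKS GANGNAM', 'Kakao Taxi', 'Local Bookstore']:
    print(name, '->', categorize_by_rules(name))
```

With a DataFrame, the function can be passed to `apply` in the same way, for example `df['Description'].apply(categorize_by_rules)`. Adding a merchant then only requires a new keyword in the table.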
Step 6: Visualize Spending Patterns
Visualize spending patterns using data visualization libraries such as Matplotlib or Seaborn. Represent monthly spending trends and the proportion of spending by category in graphs to easily understand your personal financial situation.
import matplotlib.pyplot as plt
# Monthly spending trend
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m.%d')  # Matches the 'YYYY.MM.DD' dates parsed in Step 4
monthly_spending = df.groupby(df['Date'].dt.to_period('M'))['Amount'].sum()
monthly_spending.plot(kind='line', title='Monthly Spending Trend')
plt.show()
# Spending proportion by category
category_spending = df.groupby('Category')['Amount'].sum()
category_spending.plot(kind='pie', autopct='%1.1f%%', title='Spending Proportion by Category')
plt.show()
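Before plotting, it can help to look at the numbers behind the charts. The sketch below builds a month-by-category table and the percentage share per category. It runs on made-up sample rows rather than the `df` from the previous steps, and the `Month` column is a helper introduced only for this example.

```python
import pandas as pd

# Made-up categorized transactions (illustrative only)
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-15']),
    'Category': ['Cafe', 'Transportation', 'Cafe', 'Cafe'],
    'Amount': [4500.0, 12800.0, 5100.0, 4900.0],
})
sample['Month'] = sample['Date'].dt.strftime('%Y-%m')  # e.g. '2024-01'

# Rows: month, columns: category, values: summed amount (0 where nothing was spent)
monthly_by_category = sample.pivot_table(
    index='Month', columns='Category', values='Amount',
    aggfunc='sum', fill_value=0,
)
print(monthly_by_category)

# Share of total spending per category, in percent
category_share = sample.groupby('Category')['Amount'].sum()
category_share = (category_share / category_share.sum() * 100).round(1)
print(category_share)
```

A table like `monthly_by_category` can also be passed to `.plot(kind='bar', stacked=True)` if a stacked bar chart is preferred over separate line and pie charts.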
4. Real-world Use Case / Example
Through this workflow, I save more than 2 hours every month. Previously, I spent a lot of time manually entering and analyzing credit card statements in Excel, but now, by simply running a Python script, I can identify spending patterns and focus on reducing unnecessary expenses. In particular, reviewing the 'Other' category each month has helped me spot spending habits I had not noticed and rein them in.
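The monthly review of the 'Other' category described above could look roughly like this. The data is made up for illustration, and `df_example` stands in for the categorized DataFrame produced in Step 5.

```python
import pandas as pd

# Made-up categorized transactions (illustrative only)
df_example = pd.DataFrame({
    'Description': ['Local Bookstore', 'Gym Membership', 'Local Bookstore',
                    'Starbucks', 'Gym Membership'],
    'Category': ['Other', 'Other', 'Other', 'Cafe', 'Other'],
    'Amount': [15000.0, 50000.0, 22000.0, 4500.0, 50000.0],
})

# Keep only uncategorized spending, then total it per merchant
other = df_example[df_example['Category'] == 'Other']
other_by_merchant = (
    other.groupby('Description')['Amount']
    .agg(total='sum', count='count')
    .sort_values('total', ascending=False)
)
print(other_by_merchant)
```

Merchants that keep showing up near the top of this table are natural candidates for a category rule of their own.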
5. Pros & Cons / Critical Analysis
- Pros:
  - Time savings through automated data extraction and analysis
  - Personalized spending pattern analysis and visualization
  - Flexible application to various credit card statement formats (using regular expressions)
- Cons:
  - Requires some programming knowledge for initial setup and script writing
  - Requires script modification if the credit card statement format changes
  - Text extraction accuracy may vary depending on the quality of the PDF statement
6. FAQ
- Q: Can I use other PDF parsing libraries instead of pdfminer.six?
A: Yes, you can use other PDF parsing libraries such as pypdf (the successor to PyPDF2) or textract. Each library has its pros and cons, so choose according to the statement format and your personal preference. pdfminer.six generally offers high accuracy, but for PDFs with complex layouts another library such as textract might yield better results, so it is worth comparing them on your own statements.
- Q: Are there any security issues?
A: Credit card statements contain sensitive personal information, so security measures must be taken for the script execution environment and data storage location. If possible, run the script in a local environment and encrypt the extracted data for storage. Also, when uploading scripts to public repositories like GitHub, be careful not to include personal information such as statement files or extracted data.
- Q: Can I export to an Excel file?
A: Yes. You can export a DataFrame to an Excel file with Pandas' `to_excel()` method, for example `df.to_excel('credit_card_analysis.xlsx', index=False)`. Writing `.xlsx` files requires the openpyxl package (`pip install openpyxl`).
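For the security question above, one small precaution is to mask card numbers in the extracted text before saving or sharing it. The following is a minimal sketch: it assumes 16-digit numbers written in groups of four (separated by spaces or hyphens, or not separated at all), and `mask_card_numbers` is a hypothetical helper. It is not a substitute for encrypting stored data.

```python
import re

# Assumed format: 16 digits in four groups, optionally separated by a space or hyphen
CARD_NUMBER = re.compile(r'\b(\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})\b')

def mask_card_numbers(text):
    """Replace all but the last four digits of each card number with asterisks."""
    return CARD_NUMBER.sub(lambda m: '****-****-****-' + m.group(4), text)

print(mask_card_numbers('Card 1234-5678-9012-3456 charged 4,500'))
```

Card numbers in other lengths or groupings would need their own pattern.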
7. Conclusion
Automated credit card statement analysis with Python can substantially streamline personal finance management. While it requires some initial effort for setup, building an automated workflow saves time and effort each month and leads to more effective financial management. Prepare your credit card statement PDF and try automating your own personal finance management with the code above!


