When you start performing OCR (Optical Character Recognition) continuously and frequently rather than as a one-off task, the criteria for selecting tools changes dramatically.
Once you move beyond the stage of "using free tools" or "using GUI services," the configuration of Google Cloud Vision API (GCP) + Visual Studio Code (VSC) + CLI emerges as a realistic option.
In this article, I will explain why this configuration tends to be the "expert-oriented but correct route" from the perspectives of accuracy, cost, and operation.
Why OCR Workflows Break Down as Frequency Increases
OCR may seem like a simple task at first glance, but as the volume increases, the following problems become apparent:
- Changing settings every time becomes tedious.
- Rework increases due to fluctuations in accuracy.
- Manual operations become a bottleneck.
- Post-processing (formatting, saving, reusing) cannot keep up.
In particular, GUI-based OCR services have the weakness that human operation costs increase linearly. In an environment where OCR is performed frequently, this point becomes fatal.
Why Google Cloud Vision OCR Accuracy is Stable
Google Cloud Vision API's OCR (Document Text Detection) has accuracy that can withstand business use in the following respects:
- High capability for Japanese (horizontal, vertical, mixed) text.
- Recognition is possible while maintaining multiple columns and layout structures.
- Can be used with almost no awareness of pre-processing.
With OSS-based OCR (such as Tesseract), adjustments such as:
- Binarization and tilt correction
- DPI and font dependency
- Layout collapse countermeasures
are necessary, but with Cloud Vision OCR, the major difference is that this adjustment cost can be solved with money.
As the frequency increases, this difference directly translates to a difference in work time.
Actually Not Expensive? The Cost of Cloud Vision OCR
Cloud Vision OCR is a pay-as-you-go system.
As a guideline:
- Around several hundred to 1,000 yen per 1,000 pages
- For small to medium-scale use, it often fits within several hundred to several thousand yen per month.
What is important is that:
- Wasteful trials can be reduced.
- You can call the API only as much as necessary.
The reality is that fixing conditions and processing with CLI tends to be cheaper in the end than repeating trial and error and re-uploading with GUI operations.
Why GCP x VSC x CLI is Strong
VSC: A Place for Thinking and Verification
When handling OCR in Visual Studio Code:
- You can check OCR results (JSON) immediately.
- Formatting, processing, and post-processing can be done in place.
- Easy to link with AI, Linters, and scripts.
OCR is "production" after recognition. VSC can consolidate that process into one environment.
CLI: Ensuring Reproducibility and Scalability
By making it CLI-based:
- You can process under the same conditions any number of times.
- Batch processing becomes possible.
- Logs remain, and failures can be verified.
Reproducibility and automation, which are difficult with GUI, become prerequisites with CLI.
GCP: No Need to Manage Infrastructure
- Improvements to the OCR engine are done by Google.
- Can be used without being conscious of scale.
- Can naturally connect with other processes as an API.
The strength of GCP is that you can concentrate on using the OCR environment rather than "building" it.
Technical Pain Points and Their True Nature
The reason why this configuration is easily shunned is also clear.
- IAM (permission settings) is difficult to understand.
- It is easy to get stuck around service accounts.
- Initial setup is heavy.
However, this is the type of pain where:
Once you get through it, you hardly have to touch it afterwards.
Initial costs are high, but operational costs become very low. As frequency increases, this difference becomes effective.
Cases Where This Configuration is Suitable
- Performing OCR regularly and in large quantities.
- Wanting to structure and reuse results.
- Wanting to automate post-processing.
- Wanting to nurture the OCR environment as an "asset".
Conversely, if the use is only a few times a month, this configuration is excessive.
Summary
In an environment where OCR is performed frequently:
- Stability of accuracy
- Low operation cost
- Automation and reproducibility
become most important.
The realistic option that satisfies those conditions is GCP x VSC x CLI.
Although it is expert-oriented, it can be said to be the correct route when viewed on a results basis.
Reference URLs
Google Cloud Vision OCR Official Documentation https://cloud.google.com/vision/docs/ocr
Cloud Vision API Pricing https://cloud.google.com/vision/pricing
Cloud Vision OCR Use Cases https://cloud.google.com/use-cases/ocr
Cloud Vision OCR Series Articles
This article is part of a series on setting up and operating OCR using Cloud Vision API.
- [Summary] Cloud Vision OCR x CLI x GCP Setup & Troubleshooting Log](/archives/ocr-series-summary-en)
- Setup: Why Cloud Vision API OCR Failed with IAM and How I Finally Got It Working
- Operation: Why Running Cloud Vision OCR via CLI Was the Easiest Method
- Workflow: Why GCP x VSC x CLI is the Correct Route for Frequent OCR
- Tools: Explanation of Practical Tools Created to Run Cloud Vision OCR via CLI
- Code: Free Code: Minimal Script to Safely Run Cloud Vision OCR via CLI
- Extra: When Stuck in GCP GUI, Escaping to Cloud Shell + CUI Was Faster (Failure Story)