OCR – Christopher S Rose

I discovered by accident that Google Docs has the holy grail for people who work with Arabic and Persian: Optical Character Recognition. It’s a little cumbersome, but I’ve been using it to produce translations of newspaper articles in about as much time as it would normally take me to try to skim them.

Is this cheating? Probably. Also, don’t care. My productivity has gone up substantially.

Let’s take this column I’ve been working on out of Al-Muqattam, an Egyptian newspaper, from 1918. The scan of microfilm looks like this:

The problem–as anyone who works on this period immediately understands–is that the original was handwritten on hard wax, printed onto paper, and then photographed. The writing is inconsistent, and (especially for a non-native speaker) hard to decipher. Is that a smudge or the letter ق? Is that a bump or an initial م as rendered in the ruq’a script?

Someone clued me in to the fact that the camera function on Google Translate does a passable job of reading Arabic, and if you need a quick (usually really bad) translation, it will suffice. But it only works on your phone or tablet.

That’s a problem for those of us who are more interested in getting a clean version of the original Arabic text. The camera function isn’t perfect, and its output is hard to edit. Further, getting the text from your device onto a computer is difficult, since the main function of Google Translate is … well, to translate.

Just when I had resigned myself to holding up my phone to my computer screen, I found that there is another way.

A better way.

1. What you need.

You need a Google account.
Sorry, if you’re anti-evil-corporate entity, this won’t work. I appreciate where you’re coming from, but you need Google Docs for this.
A computer.
A file in the target language.
A screen capture program, preferably one that has a crop function (the Windows 10 Snip & Sketch program is perfect for this).

2. How to start

(note: if you’re working with a document where the text is already in one column–like a letter or printed report–you can skip this step.)

Open your document and get it on the screen. My screen looks like this.

Since I’m working with a newspaper, I do this one column at a time.

Take a snip of your document — here’s what the snip I’m working with looks like:

It’s a regular old JPG called, inventively, “November 27-01.jpg” because I’m working with the issue from November 27, 1918, and it’s the first image.

Okay, here’s where the fun starts.

3. Open Google Drive

This is important: open Google Drive, not Google Docs.
Upload your images.
Right click on the first image you want to work with. In the popup menu, select “Open with > Google Docs”

This is the important part. You cannot open a blank Google Doc and insert the image. I mean, you can, but the part that you really want to happen next — the OCR magic! — doesn’t happen if you do that.
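If you have a whole stack of snips, the same conversion can in principle be requested through the Google Drive API instead of right-clicking each file. What follows is only a rough sketch, not something from my own workflow: it assumes the google-api-python-client package is installed and that `service` is an already-authorised Drive v3 client, and the function names (`ocr_doc_metadata`, `ocr_image`) are made up for illustration. Whether the results match “Open with > Google Docs” exactly is also an assumption.

```python
import os


def ocr_doc_metadata(image_name):
    """Build metadata asking Drive to store the upload as a Google Doc.

    Requesting the Google Docs MIME type for an image upload is what
    asks Drive to convert (and OCR) the file rather than store it as-is.
    """
    title, _extension = os.path.splitext(image_name)
    return {
        "name": title,
        "mimeType": "application/vnd.google-apps.document",
    }


def ocr_image(service, image_path, language="ar"):
    """Upload one JPG snip, let Drive convert it, return the Doc's text.

    `service` is assumed to be an authorised Drive v3 client object.
    """
    # Imported here so the metadata helper above works without the
    # Google client library installed.
    from googleapiclient.http import MediaFileUpload

    media = MediaFileUpload(image_path, mimetype="image/jpeg")
    created = service.files().create(
        body=ocr_doc_metadata(os.path.basename(image_path)),
        media_body=media,
        ocrLanguage=language,  # language hint for the OCR step
        fields="id",
    ).execute()

    # Export the converted Doc as plain text; "utf-8-sig" drops a
    # leading byte-order mark if the export includes one.
    raw = service.files().export_media(
        fileId=created["id"], mimeType="text/plain"
    ).execute()
    return raw.decode("utf-8-sig")
```

With a working `service`, something like `ocr_image(service, "November 27-01.jpg")` would be the call for the snip used in this post. You would still need to proof the result by hand.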

4. Wait for it …

When you open your new Google doc (it’ll happen automatically), you’ll see this.

In the words of Parker Posey from the otherwise forgettable “Superman Returns” (2006), “Gee, Lex. That’s really something.”

Yeah, there it is. Your image. In a Google Doc. Just like I told you not to do yourself.

But wait … scroll down.

You see that microscopic text at the bottom? Let’s zoom in a bit.

Why, if I didn’t know any better, I’d say it’s Arabic!

So, usually what I do is copy and paste it into a new document, enlarge it, and choose a more attractive font (under “more fonts” you can select Arabic as the language and have your pick).

Here’s the end result:

Now, as you can see … there are some errors in there. The article’s subtitle is أو الحمى الاسبانية, not او الحي الاسبانية. So, usually I put them side by side like this and start proofing:

But proofing text that’s already been rendered by a program that knows a much wider range of words than I do is so much quicker than trying to figure it out on my own. This took less than five minutes:

الانفلونزا
أو الحمى الاسبانية

علم القراء مما نشرناه قبلا من هذه الوافدة انها توشك أن تعم العالم بأسره وانها تكاد تحرم الناس الشعور بالفرح من جراء عقد الهدية بها تلبسهم من ثياب الحداد على احيائهم وان الخوف الأكبر من الاصابة بها انما هو المضاعفات التي تصيب المريض إذا لم يعن به العناية الكافية وأشد هذه المضاعفات خطرا ذات الرئة وقد نشرنا ما يلي مقتطفات من صحف أوروبا من سير المرض فيها ونشرت جريدتنا السودان في عددها الأخير فصلا بعنوان “الانفلونزا على الطريق” قالت فيه “لا يكاد أحد يعود إلى السودان من مصر هذا السيف الاوصاب بالانفلونزا أو الحمى الإسبانية قبل وصوله إلى الخرطوم أو حال وصوله اليها.”
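If you want a record of what the proofing actually changed, the raw OCR text and the corrected text can be compared word by word. This is a minimal sketch using Python’s standard `difflib` module; the function name `changed_words` is hypothetical, and the sample strings are the subtitle error mentioned above.

```python
import difflib


def changed_words(ocr_text, proofed_text):
    """Return (ocr, proofed) pairs for each run of words that differs."""
    ocr_words = ocr_text.split()
    proofed_words = proofed_text.split()
    matcher = difflib.SequenceMatcher(
        a=ocr_words, b=proofed_words, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing words from each side together
            changes.append(
                (" ".join(ocr_words[i1:i2]), " ".join(proofed_words[j1:j2]))
            )
    return changes


ocr_subtitle = "او الحي الاسبانية"
proofed_subtitle = "أو الحمى الاسبانية"
for before, after in changed_words(ocr_subtitle, proofed_subtitle):
    print(before, "->", after)
```

Comparing whole words rather than single letters keeps each correction readable, which matters when the difference is one hamza or one missing letter.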

Now, you can leave it here. Honestly, just having the text in a much clearer and easy-to-read format is a huge boon (and reminded me that, hey, I can read Arabic!).

You can also take the next step and get a quick and dirty translation by running the result through Google Translate:

Influenza
Or Spanish fever

Readers learned from what we published earlier from this invader is about to pervade the entire world and that it almost deprives people of the joy of holding a gift by wearing them from the clothes of mourning for their neighborhoods and that the greatest fear of injury to them is the complications that afflict the patient if he does not mean adequate care and more severe These complications are a pneumonia hazard, and we have published the following excerpts from the European newspapers about the course of the disease in it. Our newspaper Sudan published in its latest issue a chapter entitled “Flu on the road” in which it said: “No one returns to Sudan from Egypt. This sword is infected with flu or Spanish fever before arriving. To Khartoum or if and He reached it. “

Is this even remotely ready for publication? Hell no. I’m a little embarrassed that this is the passage I chose to show here, but it’s been a long day and I’m too tired to do a different series of screenshots.

What I did do was put both the Arabic and the quick-and-dirty English into a note attached to the Zotero entry for this particular article, so I have it at my fingertips. The most useful function here is that the quick and dirty can help me find a reference to something I missed. But if it yields material I want to quote or highlight, then I can do my own translation since GT isn’t quite there yet.

The one quibble I have is that, for some reason, Google Translate doesn’t like the Eastern Arabic (Arabic-Indic) numerals used in Egypt and the Mashriq. You’ll have to do those manually, but that’s a small price to pay to have GOOGLE FREAKING DECIPHER OLD ARABIC TEXT.
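If doing the numbers by hand gets tedious, swapping the digits is easy to script before you paste the text anywhere. Here is a small sketch in plain Python; the function name `westernize_digits` is my own invention, and it assumes Western digits are what you want to end up with. It also covers the Persian and Urdu digit forms, in case you are working with those.

```python
# Arabic-Indic digits U+0660..U+0669 (Egypt, the Mashriq) and the
# extended forms U+06F0..U+06F9 (Persian, Urdu), built from code points
ARABIC_INDIC = "".join(chr(0x0660 + n) for n in range(10))
EXTENDED = "".join(chr(0x06F0 + n) for n in range(10))

# Map both digit sets onto 0-9; every other character is left alone
DIGIT_TABLE = str.maketrans(ARABIC_INDIC + EXTENDED, "0123456789" * 2)


def westernize_digits(text):
    """Replace Arabic-Indic digits in text with Western digits."""
    return text.translate(DIGIT_TABLE)
```

Only the digit characters change, so dates and figures embedded in a paragraph of Arabic come through with everything around them untouched.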

Update

I am told by people who’ve been experimenting with other languages that this works pretty well for Persian, Kurdish (in the Arabo-Persian script), Urdu, and even Ottoman Turkish.

I’ve also been told the equivalent function in Hebrew works in the literal sense but produces much poorer results.

Update update

I’d quite forgotten about this post until Alex Mallett tested the technique and posted it on Digital Orientalist, and I will say I quite agree with his conclusions. Machine translation is still very clunky. To me, the most useful part of this whole exercise, and the one I still use regularly, is getting a transcribed version of the original Arabic scan. My Arabic typing is slow, and Google does a better job since it knows all the words I don’t. But for English? Yeah, I’ll do it myself!

How to use Google Docs to OCR Arabic text