Preventing Email Harvesters and Search Engines from Using OCR to Extract Addresses From Your Images

If you want to avoid spam, or don’t want to be found by email address or name through a search engine, you should keep your email address and name off of web pages.

Sometimes, though, you need to show the address. This post discusses how to put together an image that contains your email address, or any other text, so that software doing optical character recognition (OCR) has a hard time extracting the text from it.

The drawing tool I’m using is The Gimp. I’m also using the Free Online OCR web page to process the image. (If you have a better OCR, please contact me so I can add it here.)

Here’s a graphic that’s easy for computers to parse.

The website extracted both addresses.

One way to confuse the OCR software is to alter the colors so the contrast between the foreground and background changes. I just used the “Colors -> Invert” menu.

The website extracted part of the address, but didn’t get the entire address out of it.
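If you would rather script the inversion than click through a menu, here is a minimal sketch in Python. It assumes the image is already loaded as a grid of 8-bit grayscale values (0 is black, 255 is white). The `invert` function and the tiny sample image are made up for illustration; they are not part of GIMP or of any OCR tool.

```python
def invert(pixels):
    """Return a copy of a grayscale image with every value flipped (0 <-> 255)."""
    return [[255 - value for value in row] for row in pixels]


# Hypothetical 2x3 image: a dark stroke (0) on a light background (255).
image = [
    [255, 0, 255],
    [255, 0, 255],
]

# After inverting, the stroke is light and the background is dark.
inverted = invert(image)
```

Inverting twice gives back the original image, so nothing is lost; only the polarity of text against background flips. If you use an imaging library such as Pillow, its `ImageOps.invert` function does the same job on a real image file.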

Another way to confuse the OCR software is to use some subtle filters on the image. Software is bad at “filtering out” irrelevant information. This image adds random noise/grain.

The OCR software failed to read anything on this.
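The noise/grain idea can be sketched in a few lines of Python too. This is not the filter GIMP applies, just a simple stand-in under stated assumptions: the image is a grid of 8-bit grayscale values, and each pixel gets a random offset of up to `amount` in either direction. The `add_noise` name and its `amount` and `seed` parameters are hypothetical, chosen for this example.

```python
import random


def add_noise(pixels, amount=40, seed=None):
    """Return a copy of a grayscale image with a random offset added to each pixel.

    Each offset is an integer in [-amount, amount]; results are clamped to 0..255.
    """
    rng = random.Random(seed)  # a fixed seed makes the grain reproducible
    noisy = []
    for row in pixels:
        noisy.append(
            [min(255, max(0, value + rng.randint(-amount, amount))) for value in row]
        )
    return noisy


# Hypothetical 2x3 image: a dark stroke (0) on a light background (255).
image = [
    [255, 0, 255],
    [255, 0, 255],
]
grainy = add_noise(image, amount=40, seed=1)
```

A larger `amount` makes the grain stronger, which is harder on the OCR software but also harder on human readers, so it is worth running the result through an OCR tool and a pair of eyes before publishing it.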

Likewise, this more extreme distortion stopped the OCR from extracting the text.

In the following image, noise alone was also enough to confuse the website, but I suspect a little bit of alteration and filtering would make this image OCR-able:

By incorporating some noise and changes to background colors in your design, you can create an image that will resist some types of optical character recognition.
