How to properly format a scanned book.

Archgaull

Estimable
Nov 12, 2014
9
0
4,510
I recently acquired an HP Officejet Pro 8610, and I'm eagerly using it to convert my collection of books into PDFs for convenience. However, I'm having an annoying time doing so. The HP lets me scan and convert directly to PDF, but it treats the scan as an image, meaning I can't format it properly.

I can also scan with OCR, but when I open the result in Word to edit it, the text is still laid out as it was on the printed page. See the screenshot (QgjmlSQ.png).
No matter what I try, I cannot get the text to use the full width of the document; it stays like that. Does anyone have an idea how to fix it?
 

GObonzo

Distinguished
Apr 5, 2011
30
0
18,610
There should be a format option in Acrobat and Word called something like "Justify". It spreads each line to fill the page from the left margin to the right.

Normally, scanning a document just takes a "snapshot" of the page. It does not scan each letter or word individually and place them on the page as text.
 

GObonzo

Distinguished
Apr 5, 2011
30
0
18,610
There are scanners that are meant for this type of thing, but I believe yours just creates an image and doesn't scan the words individually. It could possibly just be the setting you're using when scanning, though.

The types that are meant to convert documents directly to text were rather expensive last I checked.
 
In Word, make paragraph marks visible (File menu, "Options", "Display", then put a checkmark next to "Paragraph marks"). You will see that each line of your text ends with a paragraph mark, and this is your problem. You want only complete paragraphs (several lines of text) to end with a paragraph mark.
 
Solution

Archgaull

Estimable
Nov 12, 2014
9
0
4,510
Yeah, that seems to be the cause of the problem. Is there a way to automatically fix this issue, or would I have to do it all by hand?

EDIT: I managed to fix it by using a Word tool that clears all formatting. For whatever reason, you have to use this tool specifically (in the Font section, the little button with an eraser on it).

Even if you copy and paste into a new Word document and tell Word to get rid of all formatting, it doesn't do that. You have to click that button in particular.
 
I do this in the following way:
1. Replace every paragraph mark followed by four spaces (or however many there are at the beginning of each paragraph) with a paragraph mark and a special token (e.g. ####). This marks the beginning of every paragraph with that token.
2. Replace all paragraph marks with a space. This joins the consecutive lines of each paragraph into a single paragraph.
3. Replace every token (####) with a paragraph mark. This restores the breaks between paragraphs, which step 2 removed along with the unwanted ones.
Consult Word help on how to search and replace paragraph marks (if my memory serves me correctly, the code is "^p").
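
For anyone who would rather script this than click through Find and Replace, here is a rough Python sketch of the same token idea. It assumes the OCR output has been saved as plain text with "\n" line endings and that every real paragraph starts with at least four spaces of indent. The function name and the TOKEN constant are made up for illustration, and in this sketch the token is turned back into a paragraph break at the end so the paragraphs stay separate.

```python
import re

TOKEN = "####"  # illustrative marker; any string that never occurs in the text works


def rejoin_paragraphs(text: str, indent: int = 4, token: str = TOKEN) -> str:
    """Merge OCR line breaks, keeping only breaks that precede an indented line."""
    if token in text:
        raise ValueError("token already occurs in the text; pick another one")
    # Step 1: a line break followed by the indent marks a real paragraph start.
    pattern = r"\n {%d,}" % indent
    marked = re.sub(pattern, lambda m: "\n" + token, text)
    # Step 2: every line break (wanted or not) becomes a space.
    joined = marked.replace("\n", " ")
    # Step 3: each token becomes a paragraph break again.
    paragraphs = joined.split(token)
    return "\n".join(p.strip() for p in paragraphs)


if __name__ == "__main__":
    sample = (
        "    It was a bright cold day\n"
        "in April.\n"
        "    The clocks were striking\n"
        "thirteen."
    )
    print(rejoin_paragraphs(sample))
```

If the scan uses a different indent, pass it as the indent argument. Text with no indent at all will collapse into one paragraph, so check a page or two by eye before running it on a whole book.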