Can you explain your recipe in layman's terms? I'm having trouble: I got to the third line and don't know what to do. Thanks in advance!
Sure, it's actually very simple, although tedious and error-prone for a human to do. At steps 4 and 5, what you want to do is find a search string on the last page of text you can see that is most likely to be unique among all the books whose text is indexed by Amazon.

For example, let's take "Principles and Practices of Interconnection Networks" by William James Dally. By default, Amazon's search-inside link allows you to browse up to page 6. At that point, find a string that's likely to be unique to this particular book, e.g., "as the 2,176 processor ports". Search on that on Amazon and you'll find a link to page 6. Click on that. That will give you pages 6, 7, and 8. Again, find a string that's likely to be unique to this book, e.g., "resulting in 576-bit packets". Search on that and you'll find a link to page 8. Click on that and you'll get pages 8, 9, and 10. Repeat as needed.

Obviously, this is cumbersome to do by hand. Hence, what we first need is decent OCR to map the page JPEGs to text. That shouldn't be too hard given the "OCR-friendly" fonts of most textbooks, say. Second, we need to automate finding strings that are most likely to be unique to a given book. This seems like a relatively standard text data mining problem.

Also note that we don't need to stop on page p when we want page p+1. Since Amazon gives you a 3-page window, you just need a one-page overlap to continue.
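To make the "find a string that's likely to be unique" step a bit more concrete, here is a minimal Python sketch of one possible heuristic: score every n-word phrase on a page by how rare its words are in some background word-frequency table, and keep the highest-scoring ones. Everything in it is an assumption for illustration and not part of the recipe above: the names `tokenize` and `rank_phrases`, the `background` counts (which you would have to build from whatever general text you have on hand), and the add-one-smoothed rarity score itself. It only ranks candidates; it does not check that a phrase really is unique.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep internal commas and hyphens so tokens such as "2,176"
    # and "576-bit" survive as single words.
    return re.findall(r"[A-Za-z0-9]+(?:[,-][A-Za-z0-9]+)*", text)


def rank_phrases(page_text, background, n=5, top=3):
    """Return up to `top` n-word phrases whose words are rarest in `background`.

    `background` maps lowercase words to counts from some general corpus
    (hypothetical here; any large body of ordinary text would do).
    """
    tokens = tokenize(page_text)
    total = sum(background.values())
    vocab = len(background) + 1

    def rarity(word):
        # Add-one smoothing: unseen words get a finite but high score.
        p = (background.get(word.lower(), 0) + 1) / (total + vocab)
        return -math.log(p)

    scored = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        scored.append((sum(rarity(w) for w in window), " ".join(window)))

    # Highest total rarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phrase for _, phrase in scored[:top]]


if __name__ == "__main__":
    # Toy background counts, made up for the example.
    background = Counter({"the": 50, "of": 30, "as": 20,
                          "network": 5, "processor": 3, "ports": 3})
    page = ("The network connects the processors as the 2,176 "
            "processor ports of the machine")
    for phrase in rank_phrases(page, background):
        print(phrase)
```

Phrases containing numbers or unusual compounds tend to score well under this heuristic, which matches the hand-picked examples above. A real version would probably want a much larger background table and some tolerance for OCR mistakes.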