Design My Own Web Archiving Software

commissarmo

Distinguished
Jan 5, 2010
I am investigating the possibility of designing my own program to solve a rather esoteric problem that I have not found a solution for. I am wondering whether it is feasible, and whether anyone can validate the idea before I go through the effort of building it.

-------PROBLEM
I am an avid web user; I view on average 300 unique webpages per day. I also keep a LOT of tabs open across dozens of windows in 4 different browsers to keep track of things I don't want to forget but don't want archived yet - my workflow requires them to stay 'in front of me'. This often results in the browsers crashing, a problem I have tried in vain to fix but mostly just tolerate, with various addons like Session Manager helping me recover when they crash.

1. Related to this problem is that I want to be able to archive entire webpages so that if they are taken down at some point, I have a copy of them - this has happened a LOT to me in the past.

2. Also related is that I want to have a complete archived HISTORY record of every webpage/URL I have ever visited.

3. These two problems are only partially solved by current tools, despite my extensive research. Browsers do not store a very complete history (I have always found that pages are missing; I suspect it depends on whether they were reached through cross-links, in web applets, etc.). Saving individual webpages is of course easy, but it requires many clicks, the formatting of the page is often destroyed when saved as HTML, and the links frequently do not work.

------CURRENT SOLUTION
The BEST solution to this bookmarking/history/web-archive problem I have found so far, after MUCH work, is PINBOARD.IN - this service lets you one-click a bookmarklet on the page you're on, and it saves the link like a bookmark BUT ALSO CRAWLS THE PAGE and archives it for you in its cloud storage (you pay for this component).

I like it, but it lacks two things. First, you have to click the bookmarklet every time you're on a page (this is annoying, and/or I forget, and I don't want to have to click it on EVERY page I EVER visit). Second, it doesn't keep a record of EVERY page I visit, which is what I want.

------PROPOSED SOLUTION
I want to build a program (desktop or browser-based, I don't care) that:
1. Captures a COMPLETE URL history of every single URL visited on the computer - people have told me I can get this off my router with some sort of URL logging.

2. I then want a crawler/spider program to comb every page in THAT URL record (every page my computer visits) and ARCHIVE a complete working copy of the page with the formatting, links, etc. maintained (a rough sketch of this step follows below this list). I don't care how much space this takes up. I do NOT need to save entire websites - just the webpages visited.

3. ADDITIONAL FEATURES: I would like some ability to search/organize this archive of complete webpages. Another feature would be the ability to 'map' or visualize the 'breadcrumb pathway' by which I arrived at each page (i.e., did I get there from a Google search result, a Wikipedia link, etc.).
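To make step 2 concrete, a minimal Python sketch might look like the following. It assumes a plain-text file of URLs, one per line (the filename "history_urls.txt" is hypothetical), uses only the standard library, and saves only the raw HTML - images, stylesheets, and scripts are NOT captured, so the saved copy will not look like the live page.

```python
# Minimal archiver sketch: fetch every URL listed in a (hypothetical)
# "history_urls.txt" file and save the raw HTML to an archive folder.
import hashlib
import pathlib
import urllib.request

ARCHIVE_DIR = pathlib.Path("archive")
ARCHIVE_DIR.mkdir(exist_ok=True)

with open("history_urls.txt", encoding="utf-8") as url_list:
    for line in url_list:
        url = line.strip()
        if not url:
            continue
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                html = response.read()
        except Exception as err:          # page gone, timed out, etc.
            print(f"FAILED {url}: {err}")
            continue
        # Name the file after a hash of the URL so every page gets a
        # unique, filesystem-safe filename.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        (ARCHIVE_DIR / name).write_bytes(html)
        print(f"saved {url} -> {name}")
```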


------RECOMMENDATIONS?

Given that virtually all of these things are possible MANUALLY - I assume all of this is eminently possible given enough effort.

I have minimal (pathetic) programming experience, and have only written Perl scripts for bioinformatics work, but I am willing to learn whatever I need to learn to do this. I have similarly minimal knowledge of Python.

------Does anyone have any commentary, ideas, suggestions, or helpful advice that might aid me in this quest?

Thanks in advance for any replies.
 
randomizer

Distinguished
One of the problems you will face is when new content is loaded into a page asynchronously after you navigate to it. This usually won't update your browser history, which is why you may be finding that pages are seemingly missing. The browser will only store full page loads in its history unless a script on the site pushes a new entry itself (e.g. the gallery links here update the history and provide something to bookmark, yet never reload the page). This is simply an unavoidable limitation with sites that make heavy use of asynchronous (also commonly referred to as AJAX) requests. The browser doesn't know what you call a "page", and if it isn't given a history entry then there is nothing to bookmark (or crawl with a spider, for that matter) to ensure the modified content is recoverable without going through all the clicks again.
 

commissarmo

Distinguished
Hmm... OK (and thank you for reading the long post) - I'm not sure that would necessarily prevent building at least something similar to this?

I understand you're saying that the browser needs to create a history entry before there is a 'page' that any potential software could process.

1. Firstly - I'm willing to live with just the pages that ARE in history being archived if that's the case.

2. If I did only care about the pages that ARE in browser history - would such a program be possible then? I've been looking into spiders/crawlers and am not really sure how they work yet...

3. Also - I can manually save/archive any page I visit in my browser (as, say, full HTML) just by 'saving it' to various formats, regardless of how the page was loaded.

Why can't I simply automate this manual process?
 

randomizer

Distinguished
A crawler seems like overkill and a bit of a roundabout way of achieving your goal. A crawler would also need to be smart enough to grab images and stylesheets in addition to the HTML and rewrite any links to these so that they point to local copies (which is what the browser does when saving as full HTML I believe). Ideally you want to save the page when you visit it in your browser, not hope that it is still there some time later. I'm not sure what browser you use, but this addon for Firefox might do what you want: https://addons.mozilla.org/en-US/firefox/addon/shelve/ It mentions an autosave feature, but I have not tested it myself.
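To illustrate what "rewrite any links to these so that they point to local copies" involves, here is a rough Python sketch - an illustration only, not what the browser or the addon actually does. It handles only absolute URLs in <img src> and <link href>; a real saver also has to deal with relative URLs, CSS url() references, scripts, iframes, character encodings, and more.

```python
# Rough sketch only: download the images/stylesheets referenced by one saved
# HTML file and rewrite the references to point at the local copies.
import hashlib
import pathlib
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse

class ResourceCollector(HTMLParser):
    """Collect the URLs of images and stylesheets referenced by the page."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and (attrs.get("src") or "").startswith("http"):
            self.resources.append(attrs["src"])
        elif tag == "link" and (attrs.get("href") or "").startswith("http"):
            self.resources.append(attrs["href"])

page_file = pathlib.Path("page.html")            # hypothetical saved page
asset_dir = pathlib.Path("page_files")
asset_dir.mkdir(exist_ok=True)

html = page_file.read_text(encoding="utf-8", errors="replace")
collector = ResourceCollector()
collector.feed(html)

for url in set(collector.resources):
    # Keep the original file extension (if any), but name the copy after a
    # hash of the URL so it is unique and filesystem-safe.
    ext = pathlib.Path(urlparse(url).path).suffix
    local_name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ext
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            (asset_dir / local_name).write_bytes(resp.read())
    except Exception as err:
        print(f"could not fetch {url}: {err}")
        continue
    # Point the page at the local copy instead of the original server.
    html = html.replace(url, f"{asset_dir.name}/{local_name}")

page_file.with_name("page_local.html").write_text(html, encoding="utf-8")
```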
 
Solution

commissarmo

Distinguished
This "Shelve" Definitely seems worth trying! Seems like it's not updated for the current version but I will also try contacting the creator of the Add on to get their take of possible!

I'll reply here with what I discover, for future users - thanks!
 

commissarmo

Distinguished
1. OK, wow. I have played with Shelve a bit, and it's not exactly what I want: in terms of functionality and design, the GUI is confusing and not very good, and it doesn't have an option (or I can't figure out how to make it) to save entire websites. I note I originally said I didn't care about that, but after using Shelve I realized that viewing the pages later can be restrictive, because only the exact pages visited can be seen, of course - this is fine for articles, but less so for websites that have many pages to them.

There are some other issues, including the way it stores the files, which is by website rather than by history - so if I go to Amazon to look at a product today and go back tomorrow, it will save both to the Amazon folder, making it difficult to search by chronology. I assume you could use the browser history side by side, though I'm not sure that's helpful to me.

I'm also not sure at this point whether it can preserve hyperlinks to other pages - even without saving those pages themselves, just keeping the links working in the saved HTML document.

2. ALL OF THAT SAID - this is a brilliant solution that obviously is trying to solve the problem I had in mind, which is surprising in some ways since I didn't think anyone else in the world wanted this functionality to begin with.

3. The developer has made the Source Code available on SourceForge, though apparently he's the only contributor; he also has an email contact so I will reach out to him.

4. I think I will take it upon myself to attempt to contribute and really flesh out this addon, and maybe learn to program in the process. Despite my reservations, I really think this is a DECEPTIVELY POWERFUL tool, and with some work it could be an amazing personal web archiver - imagine building one's own personal offline web, saved forever - everything you ever saw on the web, in a file (a big file, granted).

5. There are so many possibilities for how this could be expanded. I think merging in or replicating the MHT file format (which stores a page as a single file rather than a pile of separate HTML components) would be a great addition as well: https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/

6. Just to give an idea of what it 'looks' like: as I browse now (it's Firefox-only, but that doesn't bother me; I'll see if there's a Chrome equivalent, but I mainly use Firefox anyway), I keep the Shelve output folder open on another monitor... and watch as every new page loaded in the browser gets saved to disk, along with all its HTML, images, and internal links. It's very cool.

7. THANK YOU SO MUCH FOR YOUR RECOMMENDATION AND HELP!!! :D
 

commissarmo

Distinguished
After testing Shelve more, I've realized that it works BEST when you combine it with the MAFF Firefox addon. Shelve will then archive every visited page (if you want that) as a .maff file, which is a SINGLE file per page rather than the mess of files that saving as HTML creates. This is much cleaner, and it functions like a huge, fully accessible archived history of your web browsing if you set it to capture every page automatically in the background, as I have.

The ONLY problem (though it is a major one I am working on) is that this activity, perhaps as expected, SLOWS THE BROWSER significantly in some cases.

Watching the Resource Monitor on Windows, I can see that when an especially graphics-heavy page loads (ads, buttons, etc.), the browser hangs for a while as all the I/O for saving/archiving that page to disk takes place.

This is very unfortunate, and I'm investigating potential solutions.

--I have already speed-optimized my Firefox install with pipelining, memory caching, and a tab-suspender addon, all of which limit the footprint of the browser and accelerate page loading, but because the file-saving process occurs at every page load... it doesn't really matter.

--Because I do want every page saved... creating delays in the saving process won't really help.

--I suppose this might simply be the trade-off of this intensive archiving: in principle, clicking through pages creates a huge amount of information which needs to be written to the .maff format on disk, creating overhead. I don't click through pages that quickly anyway, but the problem is that it hangs the responsiveness of the page while I'm visiting it. Firefox also obviously isn't designed/optimized for large-volume I/O to the OS, so presumably this is also part of the problem.

--Still working with it - will post here for others when I figure out more.
 

randomizer

Distinguished
I'd say that the problem is most likely the addon rather than the browser. It probably performs all I/O on the main thread which blocks the UI until it is finished. This doesn't matter if it's a trivially quick operation, but I/O is anything but quick.
 

commissarmo

Distinguished
Still testing, and performance seems to have improved a bit. I suspect this has something to do with the way Windows allocates virtual memory (I've always noticed that browser performance with a lot of tabs improves after it has been open for a while).

Is there a way to write addons that don't run on the "main thread"? Given how often addons cause problems in Firefox, I assume this problem is well studied...
 

randomizer

Distinguished
Yes, but it's not necessarily needed. Firefox has APIs for doing I/O asynchronously; that is, the main thread (though it could be any thread) initiates the I/O operation but is free to continue with other work until it is notified that the operation has completed. Writing synchronous code that requires the thread to wait is explicitly discouraged where async code is possible. The interface implemented by the addon doesn't state whether it is synchronous or asynchronous and gives no recommendation either way. It is probably synchronous.

I'm not sure how much in-memory processing is required when saving the document using the interface implemented by this addon. It could actually be quite slow. The files aren't that large after all, so I wouldn't expect the browser to hang for long due to I/O, just very frequently.
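The pattern being described is Firefox/JavaScript territory, but it can be sketched generically. Below is a Python analogy (not Firefox addon code): the "main" thread hands the slow write to a worker and is notified via a callback when it finishes, rather than blocking until the disk catches up.

```python
# Generic illustration of asynchronous I/O, NOT Firefox addon code: hand a
# slow write off to a background thread and get notified when it is done,
# instead of blocking the "main" (UI) thread on disk I/O.
from concurrent.futures import ThreadPoolExecutor
import time

def save_page(path, data):
    """Stand-in for the slow part: writing a captured page to disk."""
    with open(path, "wb") as fh:
        fh.write(data)
    return path

def on_saved(future):
    """Called on completion; the main thread never had to wait for the write."""
    print("finished writing", future.result())

executor = ThreadPoolExecutor(max_workers=1)

# "Main thread": kick off the write, then immediately go back to other work
# (in a browser, that work is keeping the page responsive).
future = executor.submit(save_page, "captured_page.html", b"<html>...</html>")
future.add_done_callback(on_saved)

print("main thread is free to keep doing other things")
time.sleep(0.5)            # pretend to do more work while the write completes
executor.shutdown(wait=True)
```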
 

commissarmo

Distinguished
Ah wow I didn't know (any of) that.

I'm fiddling with all manner of settings but it seems it's not going to go away. Text pages like Wikipedia are completely fine.

But when it comes to a blog-like page, it locks up - due, I assume, to the huge number of images, ads, buttons, and whatnot on the page.

Using Performance Monitor, I've been watching everything as I use it. The files themselves don't actually seem to be the problem; the largest was 6,000 KB.

I believe the main problem (and having worked with software engineers before, I can hear them freaking out at my very narrow use case here, lol) is the use of the addon with the MAFF format.

It seems a huge amount of the I/O involves writing MAFF-related files to the temp folder. (Apparently stock Firefox also writes a lot of history and session saves in the background, which I didn't know until now.)

Guessing - I would guess this is the packaging of the MAFF archive, which is a single file and (again, a guess) requires some assembly/processing/saving of the images and page elements to build the archive file.

I do remember that performance was better before I discovered the MAFF addon combination (though admittedly I'm not so foolish as to assume this type of system would run as fast as normal).

Obviously there are a lot of complex code issues in using two addons that weren't written together, and I haven't been able to reach the dev.

I feel SO close, and otherwise it works perfectly!!! I just need to figure out this performance issue, which definitely makes browsing impossible because of the slowdown.

1. I've investigated trying to stop Firefox's unrelated disk writes (I suppose I could stop history from being written, given that I'm archiving a better one, though apparently Firefox writes more than just history).

2. I'm looking into going back to saving as HTML Complete and then converting that to MAFF outside the browser, since I think the in-browser MAFF packaging is what's causing the slowdown (a rough sketch of such a converter is below this list).

3. Getting even more creative: I have a server I use for basic cloud-type stuff, so I'm trying to imagine whether I can offload some of this work to the server somehow.

This idea comes to me because of Pinboard.in, which again essentially does what I want by archiving the page, but does it on a server that gets handed the URL.

4. I've also realized a lesser solution would be simply automating Pinboard.in to run the bookmarklet on every page visited somehow, though I'm considering that a last resort given how perfectly the Shelve + MAFF addon combination accomplishes exactly what I want.
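For item 2, here is a rough sketch of what an outside-the-browser converter could look like. As I understand it, a .maff file is just a ZIP archive containing one folder per page with an index.html, an index_files/ folder, and an index.rdf metadata file; the metadata written below is a guess at what the MAFF addon expects, so treat the whole thing as approximate.

```python
# Rough converter sketch: package one "HTML Complete" save (page.html plus
# its page_files folder) into a .maff-style ZIP archive. The index.rdf
# metadata written here is a guess at the real MAFF layout.
import pathlib
import zipfile

def html_complete_to_maff(html_path, original_url, maff_path):
    html_path = pathlib.Path(html_path)
    support_dir = html_path.with_name(html_path.stem + "_files")
    folder = "0001"                                     # one page per archive

    rdf = f"""<?xml version="1.0"?>
<RDF:RDF xmlns:MAF="http://maf.mozdev.org/metadata/rdf#"
         xmlns:RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <RDF:Description RDF:about="urn:root">
    <MAF:originalurl RDF:resource="{original_url}"/>
    <MAF:indexfilename RDF:resource="index.html"/>
  </RDF:Description>
</RDF:RDF>
"""

    with zipfile.ZipFile(maff_path, "w", zipfile.ZIP_DEFLATED) as maff:
        maff.write(html_path, f"{folder}/index.html")
        maff.writestr(f"{folder}/index.rdf", rdf)
        if support_dir.is_dir():
            # Copy the page's support files (images, CSS, etc.) into the
            # archive under index_files/.
            for item in support_dir.rglob("*"):
                if item.is_file():
                    rel = item.relative_to(support_dir)
                    maff.write(item, f"{folder}/index_files/{rel}")

# Hypothetical usage:
# html_complete_to_maff("Some Article.html", "http://example.com/article",
#                       "Some Article.maff")
```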
 

randomizer

Distinguished
If you want to see if I/O is the problem, you could create a RAM disk and save to that for testing purposes. Writing to RAM is orders of magnitude faster, so you'll spend much less time waiting than when writing to a spinner. If it's still really slow, then the primary cause of the slowdown is probably not the I/O overhead.
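One quick way to put numbers on that: time the same burst of writes against the RAM disk and against the hard disk. A throwaway Python sketch follows; the paths are hypothetical, so point them at wherever your RAM disk is mounted.

```python
# Throwaway benchmark: time the same burst of small writes to two locations.
# The paths are hypothetical; point them at your RAM disk and your hard disk.
import pathlib
import time

def time_writes(target_dir, file_count=200, size_bytes=64 * 1024):
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    payload = b"x" * size_bytes
    start = time.perf_counter()
    for i in range(file_count):
        (target / f"test_{i}.bin").write_bytes(payload)
    return time.perf_counter() - start

print("RAM disk :", time_writes(r"R:\io_test"), "seconds")
print("Hard disk:", time_writes(r"C:\Temp\io_test"), "seconds")
```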
 

commissarmo

Distinguished
Latest Work:

I am quite convinced at this point that the act of creating the MAFF file is what basically kills the browser when a page loads that has a lot of photos, etc. Surprisingly, I detect no browser issues with most pages. But still, the fact that it basically shuts down the browser every time a heavy page comes up is unacceptable to me. The I/O writes are all for the components of saving the page as MAFF, and there are a LOT of them.

1. I moved the temp folders (both of them, user and system) to my RAM disk (where I also moved the Firefox profile I'm using now). Obviously this wasn't a well-controlled test because I did both at once, but I could not get the temp folder move to work - I changed the path in Advanced System Settings, but it kept writing to the hard disk anyway.

2. Even if I did get that to work, I'm not sure whether it would help - I assume it would just increase the speed, but presumably it would still slow down somewhat? I read about moving the temp folder and followed the instructions, but for some reason they didn't seem to take.

3. Trying something else, I switched back to just saving as HTML Complete, and all the slowdown issues went away completely, even on the same image- and ad-heavy pages I was testing before. It works very well logging every page as HTML Complete, so I suppose I'm very happy about that, as it does achieve (if very messily) the functionality I wanted.

4. I tried the MAFF conversion on the HTML Complete files, but it's not quite that good. Several fail for some reason I'm looking into, but I imagine it must be a fairly complex process because the converter needs to pull in all the HTML components plus the HTML page and write the MAFF archive. It did work for most, but not all. It also, of course, is not automated, so I would have to run it myself (not a huge problem).

5. Also - since the HTML Complete archives can simply be sorted by file type, it's not that big a deal I suppose; all the associated folders can be sorted out, and the saves can be listed chronologically (a small sketch of that is below this list). This is where I currently am. It achieves what I want: a fully archived web history of every visited page, saved to disk and readable offline with links (mostly) working.

6. The only reason I'm even still working on this is that I saw how beautiful the MAFF format is - a single file, very clean, everything perfectly formatted, and well compressed. And it works SO well on most pages. It's just that a lot of these newfangled blog- and news-type pages have SO many ads, images, and other junk running that it freezes when it hits one of them.

7. If I can figure out how to properly move both the system AND the user (AppData/Local/Temp) temp folders to the RAM disk, I will try that - the assumption being that if the I/O involved in creating the MAFF file can be pushed to RAM, it'll be fast enough that the delay won't be noticeable (or at least minimized).
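For item 5, here is a tiny Python sketch of what I mean by listing the saves chronologically: it walks the (hypothetical) archive folder, ignores the *_files folders, and prints each saved page by its modification time.

```python
# Sketch: build a chronological index of the "HTML Complete" saves by listing
# each saved .html file (ignoring the *_files folders) sorted by its
# modification time. The archive folder path is hypothetical.
import pathlib
from datetime import datetime

archive = pathlib.Path("WebArchive")          # wherever the pages are saved

pages = sorted(archive.glob("*.html"), key=lambda p: p.stat().st_mtime)
for page in pages:
    saved_at = datetime.fromtimestamp(page.stat().st_mtime)
    print(f"{saved_at:%Y-%m-%d %H:%M}  {page.name}")
```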