Design My Own Web Archiving Software
I am investigating the possibility of writing my own program to solve a rather esoteric problem I have not found a solution for. I am wondering whether it is feasible, and whether anyone can validate the idea before I put in the effort of building it.
-------PROBLEM
I am an avid web user; I view on average 300 unique webpages per day. I also keep a LOT of tabs open across dozens of windows in 4 different browsers to keep track of things I don't want to forget but don't want archived yet - my workflow requires them to stay 'in front of me'. This often causes the browsers to crash, a problem I have tried in vain to fix but mostly just tolerate, using addons like Session Manager to recover when they do.
1. I want to be able to archive entire webpages so that if they are taken down at some point, I still have a copy - this has happened a LOT to me in the past.
2. I also want a complete archived HISTORY record of every webpage/URL I have ever visited.
3. Despite extensive research, I have found these two problems only partially solved. Browsers do not store a full history (I have always found pages missing; I suspect it depends on whether they are reached through cross-links, in web applets, etc.). Saving webpages manually is easy but requires many clicks, the page's formatting is often destroyed when saved as HTML, and the links frequently do not work.
------CURRENT SOLUTION
The BEST solution to this bookmarking/history/web archive problem I have found so far after MUCH work is PINBOARD.IN - this service allows you to one-click a page you're on, and it saves the link like a bookmark, BUT ALSO CRAWLS THE PAGE and archives it for you on its cloud storage (you pay for this component).
I like it, but it has two shortcomings. First, you have to click the bookmarklet every time you're on a page - this is annoying, I often forget, and I don't want to have to click it on EVERY page I EVER visit. Second, it doesn't keep a record of EVERY page I visit, which is what I want.
------PROPOSED SOLUTION
I want to build a program (desktop or browser-based, I don't care) that:
1. Extracts a COMPLETE URL history of every single URL visited on the computer - people have told me I can get this off my router with some sort of URL logging.
2. I then want a crawler/spider program to comb every page in THAT URL record (every page my computer visits) and ARCHIVE a complete working copy of the page with the format, Links, etc. maintained. I don't care how much space this takes up. I do NOT need to save entire websites - just webpages visited.
3. ADDITIONAL FEATURES: I would like some ability to search/organize this archive of complete webpages. Another feature would be the ability to 'map' or visualize the 'breadcrumb pathway' by which I browsed the pages (e.g. did I get there from a Google search result, a Wikipedia link, etc.).
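One note on step 1: you may not even need router-level logging, because browsers already keep a fuller history on disk than their UI shows. Firefox, for instance, stores it in a SQLite file called places.sqlite, with URLs in a moz_places table and one row per visit in moz_historyvisits. A minimal Python sketch of pulling that out (run it against a copy of the file, since Firefox locks the live one; the function name is my own):

```python
import sqlite3

def extract_history(places_db):
    """Return (url, visit_date, from_visit) tuples from a copy of
    Firefox's places.sqlite, ordered by visit time."""
    con = sqlite3.connect(places_db)
    rows = con.execute(
        "SELECT p.url, v.visit_date, v.from_visit "
        "FROM moz_places p "
        "JOIN moz_historyvisits v ON v.place_id = p.id "
        "ORDER BY v.visit_date"
    ).fetchall()
    con.close()
    return rows
```

As a bonus, the from_visit column records which visit led to each page, which is exactly the raw data the 'breadcrumb pathway' visualization in feature 3 would need.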
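For step 2, wget already solves the 'saved page is broken' complaint: --page-requisites downloads the images/CSS/JS a page needs, and --convert-links rewrites links so the saved copy renders offline. A rough sketch driving it from Python, given a list of URLs from step 1 (the per-page directory layout and filename scheme here are my own invention, not anything standard):

```python
import re
import subprocess
from pathlib import Path

def wget_command(url, archive_dir):
    """Build a wget invocation that saves one page plus the assets it
    needs to render, with links rewritten for offline viewing."""
    # Turn the URL into a filesystem-safe directory name (ad-hoc scheme).
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:100]
    dest = Path(archive_dir) / slug
    return [
        "wget",
        "--adjust-extension",    # save with .html extension where needed
        "--span-hosts",          # assets often live on CDN hosts
        "--page-requisites",     # fetch images, CSS, JS for this page
        "--convert-links",       # rewrite links so the copy works offline
        "--no-parent",           # never crawl upward - one page only
        "--directory-prefix", str(dest),
        url,
    ]

def archive(urls, archive_dir="web_archive"):
    """Archive each URL into its own subdirectory; failures are skipped."""
    for url in urls:
        subprocess.run(wget_command(url, archive_dir), check=False)
```

Because each page lands in its own directory as plain files, the searching feature could start as nothing fancier than grep over the archive.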
------RECOMMENDATIONS?
Given that virtually all of these things are possible MANUALLY - I assume all of this is eminently possible given enough effort.
I have minimal (pathetic) programming experience, and have only written Perl scripts for bioinformatics work, but I am willing to learn whatever I need to learn to do this. I have similarly minimal knowledge of Python.
------Does anyone have any commentary, ideas, suggestions, or helpful advice that might aid me in this quest?
Thanks in advance for any replies.