Filter and Extract text from a document

smalltech

Distinguished
Apr 10, 2009
11
0
18,560
Hello,

I have a problem extracting specific phrases within a long list of rows. Here is an example of 6 rows from my long document:

We sell big and small blue widgets at http://www.bluewidgetsdomain.com/
Our website is http://www.bluewidgetsdomain.com/
We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/
Our website is http://www.green-widgets-domain.net/
We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/
Our website is http://www.red-widgets-domain.org/

Qn 1) How can I extract the words bluewidgetsdomain, green-widgets-domain, red-widgets-domain from each row and delete the rest of the words?

Qn 2) For the rows that have the phrase [widgets at], I want to extract all the words before [widgets at]. I would also like to know how to extract all the words after [widgets at]

Qn 3) I want to extract only the domains ending with .com. (For example, here http://www.bluewidgetsdomain.com/ will be extracted.)

Qn 4) I want to extract the words between [We sell] and [at]. (example, for row one, the extracted words will be [big and small blue widgets], for row 3 the extracted words will be [many kinds of widgets. Go to this site for green widgets], for row 5 the extracted words will be [widgets. Check out red widgets] )

Qn 5) If the domain has dashes, I want to remove the dashes. (For example, http://www.green-widgets-domain.net/ will become http://www.greenwidgetsdomain.net/)

Qn 6) I want to remove the slash at the end of each domain. (For example, http://www.green-widgets-domain.net/ will become http://www.green-widgets-domain.net)

Qn 7) I want to delete all rows that start with [Our website]

I appreciate any help. Thanks in advance!
 

r_manic

Distinguished
Jan 7, 2009
630
0
18,960
This is actually very easy, even with Notepad's Edit > Replace function. Just look for the common patterns among the lines! If you're looking for an automated solution, you'll have to use a scripting language such as PHP, which means you'll have to learn the language.
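
For anyone who does go the scripting route, here is a minimal sketch of the "common patterns" idea, written in Python rather than PHP (my choice, not something suggested in this thread). It covers Qn 7 and Qn 2 with plain string methods. The function names drop_our_website and split_on_marker are made up for illustration, and the rows are just the samples from the first post.

```python
# Sample rows copied from the first post.
rows = [
    "We sell big and small blue widgets at http://www.bluewidgetsdomain.com/",
    "Our website is http://www.bluewidgetsdomain.com/",
    "We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/",
    "Our website is http://www.green-widgets-domain.net/",
    "We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/",
    "Our website is http://www.red-widgets-domain.org/",
]

MARKER = "widgets at"  # the common phrase named in Qn 2


def drop_our_website(lines):
    """Qn 7: keep only the rows that do not start with 'Our website'."""
    return [line for line in lines if not line.startswith("Our website")]


def split_on_marker(line, marker=MARKER):
    """Qn 2: return (text before marker, text after marker).

    Returns None when the marker does not occur in the line.
    """
    before, found, after = line.partition(marker)
    if not found:
        return None
    return before.strip(), after.strip()


kept = drop_our_website(rows)
for line in kept:
    print(split_on_marker(line))
```

Note that str.partition splits on the first occurrence of the marker. That is fine for these samples, but a row containing [widgets at] twice would need a decision about which occurrence to use.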
 

smalltech

Notepad's Edit > Replace function cannot do the things I want.

I am extracting text. Although each line has some words that are similar, I want to extract the words before / after the similar words, not replace them.

For example, if you look at my 6 rows of examples, they have the similar words [widgets at] or [Our website is]

Is there a way to tell the program that I want to extract all the words after [widgets at] or [Our website is]? (So I can extract all the urls from them)

Or maybe there is some program that I can set to extract the text between [http://] and [/] (So I can extract all the urls from them)

Or is there a program that can find [http://] and delete all the words in front of it, then find [/] and delete all the words after it?

Thanks
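
The "find [http://], then find the next [/]" idea described above maps fairly directly onto a regular expression. Below is a rough Python sketch (Python is an assumption on my part, nobody in the thread named it) that handles Qn 1 and Qn 3. The helper names extract_host, extract_name and com_urls are hypothetical, and the code assumes every URL starts with http:// and that the host starts with www. as in the samples.

```python
import re

# A few of the sample rows from the first post.
rows = [
    "We sell big and small blue widgets at http://www.bluewidgetsdomain.com/",
    "Our website is http://www.green-widgets-domain.net/",
    "We sell widgets. Check out red widgets at http://www.red-widgets-domain.org/",
]

# "http://" followed by everything up to (not including) the next "/" or whitespace.
URL_HOST = re.compile(r"http://([^/\s]+)")


def extract_host(line):
    """Return the host part of the first URL in the line, or None if there is no URL."""
    match = URL_HOST.search(line)
    return match.group(1) if match else None


def extract_name(line):
    """Qn 1: return the middle part of the host, e.g. 'bluewidgetsdomain'."""
    host = extract_host(line)
    if host is None:
        return None
    if host.startswith("www."):
        host = host[len("www."):]
    # Drop the final suffix such as ".com", ".net" or ".org".
    return host.rsplit(".", 1)[0]


def com_urls(lines):
    """Qn 3: return the URLs whose host ends with '.com'."""
    result = []
    for line in lines:
        host = extract_host(line)
        if host and host.endswith(".com"):
            result.append("http://" + host + "/")
    return result


print([extract_name(row) for row in rows])
print(com_urls(rows))
```

With the three sample rows this prints the three names (bluewidgetsdomain, green-widgets-domain, red-widgets-domain) and then the single .com URL. Hosts with two-part suffixes such as .co.uk are not handled by this sketch.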
 

r_manic

Ah... now I see what you mean. What sort of programming experience do you have? You'll need to write a program to process your strings properly. If you've got none, there's a 15-day free trial of this text parser available: http://www.template-parser.com/
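
If you do end up writing a small program, the remaining questions (Qn 4, Qn 5 and Qn 6) could look roughly like this in Python. This is only a sketch under some assumptions: the language is my pick, the function names are invented for illustration, the closing [at] in Qn 4 is taken to be the one directly in front of the URL, and URLs are assumed to start with http:// and contain no spaces.

```python
import re

rows = [
    "We sell big and small blue widgets at http://www.bluewidgetsdomain.com/",
    "We sell many kinds of widgets. Go to this site for green widgets at http://www.green-widgets-domain.net/",
    "Our website is http://www.green-widgets-domain.net/",
]

# Qn 4: capture the words between "We sell" and the "at" that precedes the URL.
BETWEEN = re.compile(r"We sell\s+(.*?)\s+at\s+http://")

# Any run of non-space characters starting with "http://".
URL = re.compile(r"http://\S+")

# A URL followed by one or more slashes right before whitespace or end of line.
TRAILING_SLASH = re.compile(r"(http://\S+?)/+(?=\s|$)")


def between_we_sell_and_at(line):
    """Qn 4: return the captured words, or None if the row does not match."""
    match = BETWEEN.search(line)
    return match.group(1) if match else None


def remove_dashes(line):
    """Qn 5: remove dashes inside URLs only, leaving the rest of the row alone."""
    return URL.sub(lambda m: m.group(0).replace("-", ""), line)


def strip_trailing_slash(line):
    """Qn 6: remove the slash (or slashes) at the very end of each URL."""
    return TRAILING_SLASH.sub(r"\1", line)


for row in rows:
    print(between_we_sell_and_at(row))
    print(strip_trailing_slash(remove_dashes(row)))
```

One caveat: remove_dashes strips dashes from the whole URL, not just the domain. That makes no difference for the sample rows, but a URL with a dash in its path would lose that dash too.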