Thread information:
Forum:
WinDev Forum
Posts in thread:
19
First post:
8 years, 2 months ago
Last post:
8 years, 2 months ago
Participating authors:
DanM, Chris L, Dan M, Piet van Zanten, Ruben Sanchez Peña, DarrenF, Al, Stefan Bentvelsen

Extracting Data after httpGetResult()

Opening post by Dan M on 12.07.2009 01:20

I am using HTTPRequest and then HTTPGetResult to capture the HTML of a page.

I now want to extract data from this page.

I only need a few pieces of data, which are in a table ...

Is there an example of how to do this, or ...

Can someone tell me which functions I should be looking at to figure out how to extract the needed data ...

I am trying to extract ... the part number, manufacturer, quantity, and price

The code I need to work with starts in a tag and then is followed with
that is how I know where the record starts and then the next record starts.

Is there a way to, say, locate a specific tag and then extract XX digits to the right, or until the closing tag is reached? .... then find the next one and extract XX digits to the right ... etc ...

or ... find all the information between the opening and closing tag labeled...

Here is a piece of the HTML code that I am trying to extract data from (there are 7 records) ... (there is a bunch of code above this, but nothing I need and I don't think it is relevant) ...

I tried to put the code here ... but it will not accept my post if I do ... How can I display the HTML code ...

Can I use Snagit? or something else?

Replies:

Here is a piece of the HTML code that I am trying to extract data from (there are 7 records) ... (there is a bunch of code above this, but nothing I need and I don't think it is relevant) ...

EUPEC TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KE3
NO STOCK
Est. Lead Time 28 days
$1,859.60
Volume Discounts Available
QTY.

EUPEC TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KF4
NO STOCK
Est. Lead Time 27 days
$2,208.28
Volume Discounts Available
QTY.

EUPEC TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KL4C
NO STOCK
Est. Lead Time 28 days
$2,208.28
Volume Discounts Available
QTY.

EUPEC TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3-B2
NO STOCK
Est. Lead Time 28 days
$2,862.80
Volume Discounts Available
QTY.

EUPEC TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3
NO STOCK
Est. Lead Time 28 days
$2,208.28
Volume Discounts Available
QTY.

EUPEC TRANSISTOR
IGBT 1600A 1700V SINGLE
ITEM # FZ1600R17KF6-B2
NO STOCK
Est. Lead Time 28 days
QTY.

EUPEC TRANSISTOR
IGBT
ITEM # FZ1600R17KF6C-B2
NO STOCK
Est. Lead Time 28 days
$2,933.14
Volume Discounts Available
QTY.

1 - 7 of 7 Matches

26010 Pinehurst Drive, Madison Heights, MI  48071

About Us | © Copyright 2009 Galco Industrial Electronics, All Rights Reserved | Terms of Use

by Dan M on 12.07.2009 01:15
Hello Dan

The code you attached caused your message to be picked up by the spam filter.
Can you use Snagit or another screenshot tool and then use the "TinyPic" option shown at the top of the forum page to display the graphic image?

Regards
Al

by Al on 12.07.2009 04:53
Hi Dan,

if you have the page in a string, I think you can use Position() for that.
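
A minimal sketch of the Position() idea, assuming the raw page source is in a string and that each part number follows the text "ITEM # " and runs up to the start of the next HTML tag. The variable names and the "next tag ends the value" rule are assumptions made for illustration, not something confirmed for this page.

```wlanguage
// Assumption: the page source is already available through HTTPGetResult()
sPage is string = HTTPGetResult()
sMarker is string = "ITEM # "
sPartNumber is string
nStart is int
nEnd is int

// Position() returns 0 when the searched text is not found
nStart = Position(sPage, sMarker)
WHILE nStart > 0
	// Move past the marker to the first character of the value
	nStart += Length(sMarker)
	// Assumption: the value ends where the next HTML tag begins
	nEnd = Position(sPage, "<", nStart)
	IF nEnd = 0 THEN
		BREAK
	END
	sPartNumber = sPage[[nStart TO nEnd - 1]]
	Trace(sPartNumber)
	// Look for the next record after the one just read
	nStart = Position(sPage, sMarker, nEnd)
END
```

The third parameter of Position() is the start position of the search, which is what lets the loop walk from one record to the next.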


by Stefan Bentvelsen on 12.07.2009 08:33
Dan

I've done quite a bit of this stuff over the past few years, extracting data from web pages typically (though not always) in table form.

The good news is that it can be done and it works well, even with some fairly complicated pages. (One application I use accesses a main page, extracts links to subsidiary pages and then extracts the data from these subsidiary pages, which is in the form of groups of tables.)

The bad news is that it takes a lot of time and effort to set this up. I haven't found any easy way of doing this, although I have discovered certain functions in WinDev which have made this easier.

I did think the new HTMLToText function in version 14 would be a great help but because it strips all the formatting and just leaves a long line of text, I have not been able to use this. The HTMLToRTF function leaves formatting like Bold but doesn't leave the built-in structure of the web page.

Basically, I save the web page to a text file then parse this. It's the parsing that takes the time, basically a trial-and-error process. I use a test window to do this: one with two large edit fields showing before and after views. I build up the code gradually to get the information required.

I don't know whether you're interested but here's some detail of what I've done.



Writing to a text file


FileID = fOpen("meeting.html", foCreate) // can also be a .txt file

IF FileID <> -1 THEN
    // First attempt at fetching the page
    Callres = HTTPRequest(sWebAddress)

    // Retry to cope with delays in connection
    // (the cap of 10 attempts is an assumed value; without a cap the loop never ends on a dead link)
    WHILE Callres = False AND iCount3 < 10
        iCount3++
        FOR icount2 = 1 TO 1000
            // crude pause between attempts
        END
        Callres = HTTPRequest(sWebAddress)
    END

    IF Callres = True THEN
        // Save the page retrieved into the file
        fWrite(FileID, HTTPGetResult())
    ELSE
        Trace("There's a problem with timing.")
    END
    fClose(FileID)
END





Next step: get to the body of the text and clean it up a bit (but leaving essential breaks and formatting)


IF fSize("meeting.html") > 1000

FileId = fOpen("meeting.html",foRead)

sTextVersion = fRead(FileId,fSize("meeting.html"))

icount2 = PositionOccurrence(sTextVersion,"/h6 h2",1,IgnoreCase)

sTextVersion = sTextVersion[[(icount2+9) TO ]]
icount2 = PositionOccurrence(sTextVersion,"div id='Footer'>",1,IgnoreCase)
sTextVersion = sTextVersion[[1 TO (icount2-27)]]

sTextVersion = Replace(sTextVersion,"/table>","&&&"+CR+CR,IgnoreCase) /// "&&&" used as a marker for later substitution
sTextVersion = Replace(sTextVersion,"/tbody>>","",IgnoreCase)
sTextVersion = Replace(sTextVersion,"h2>","",IgnoreCase)
sTextVersion = Replace(sTextVersion,"/tr> tr> td>","",IgnoreCase)
sTextVersion = Replace(sTextVersion,"",CR,IgnoreCase)
sTextVersion = Replace(sTextVersion," /td> td>",TAB,IgnoreCase)
sTextVersion = Replace(sTextVersion," /td> td>",TAB,IgnoreCase)
sTextVersion = Replace(sTextVersion," /td>","",IgnoreCase)
sTextVersion = Replace(sTextVersion," /tr>&&&","&&&",IgnoreCase)
sTextVersion = Replace(sTextVersion," /h2> h3>","##",IgnoreCase)

icount2 = PositionOccurrence(sTextVersion,"&&&",1,FromEnd)
sTextVersion = sTextVersion[[1 TO (icount2+2)]]


sTextFile = sVenue+".txt"

FileID2 = fOpen(sTextFile,foCreate)

IF FileID2 <> -1 THEN
    WriteRes = fWrite(FileID2,sTextVersion)

    // fWrite returns -1 when the write fails
    IF WriteRes = -1 THEN
        INFO("Oops! Error in writing sTextVersion")
    END
    fClose(FileID2)
END

I've had to remove all the " tr> td>" because that is what was necessary to separate this piece of data from something else in the table.

String slicing is the other important component of this process. Having determined the position of a code or code group, I now know where to get the data. If it's a fixed length, I can use that in the arguments of the string slice; if the length is variable, then I need to determine the position of the end marker or the start of the next piece of data.
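
The two slicing cases can be sketched as follows. The sample record is made up from the data quoted in this thread, and the fixed length of 12 characters is only true for this particular part number, so treat both as assumptions.

```wlanguage
// Hypothetical sample: one part-number line followed by the next line of the record
sRecord is string = "ITEM # FZ1600R12KE3" + CR + "NO STOCK"
sMarker is string = "ITEM # "

// First character of the value, just after the marker
nStart is int = Position(sRecord, sMarker) + Length(sMarker)

// Fixed length: take exactly 12 characters from the start position
sFixed is string = sRecord[[nStart TO nStart + 11]]

// Variable length: stop just before the end marker (here the end of the line)
nEnd is int = Position(sRecord, CR, nStart)
sVariable is string = sRecord[[nStart TO nEnd - 1]]

Trace(sFixed)
Trace(sVariable)
```

The variable-length form is usually the safer one for part numbers, since they differ in length from record to record.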




When you've got rid of the dross and broken the file down into useable sections then use the ExtractString function.


sRaceString = ExtractString(sTextVersion,firstRank,"&&&")

NumberOfRaces = 0
WHILE Length(sRaceString)>10
    ProcessOneRace(sRaceString)
    NumberOfRaces += 1
    sRaceString = ExtractString(sTextVersion,nextRank,"&&&")
END


You'll recall I said I used the "&&&" as a marker for later substitution, maybe a TAB or CR. In this case, I've used it as a marker for a slab of text. In this particular application, sRaceString is a long string which contains all the data from one race (but one race only) extracted from a page which lists many races as well as general data about the meeting.

I next work on this sRaceString to extract what could be called the 'header' information and I am then left with a slab of text which contains just the data about the runners. I again use the ExtractString function to pick out the information for the individual runners, one string per runner. Finally, another ExtractString function picks up the individual items from each runner: name, rider, weight, etc. (This ExtractString function has proved enormously useful, saving much time and effort compared with my earlier versions, when I had to work on PositionOccurrence with HTML codes right down to the individual items of data. The sooner you can use ExtractString in your parsing process, the quicker and easier this process becomes.)




I don't know whether any of this helps but it might give you some ideas for what you're doing. As I said at the beginning, I haven't found any short easy way of doing this - it takes literally hours of trial-and-error. The only saving grace is that WD makes it very quick and easy to make a change and test it out. Write something in your code, test the window, close it and you're immediately back to your code with the cursor in the same position as you left it.

Best of luck.

Chris L
Melbourne, Oz








by Chris L on 12.07.2009 12:04

Here is the link to view the html from which I want to extract the data

http://i28.tinypic.com/2ic0d47.jpg

by DanM on 12.07.2009 12:04
Chris,

Thanks for pointing me in the right direction.

By using ... EDT_Edit1 = HTMLToText(HTTPGetResult()) (in WinDev12)

(in the Entry of EDT1) I am able to get the data to look like this (see below). It is getting very close.

Now, there is a lot of data before the table that holds the data I want to extract, a lot more than I am showing here. The data below "1 - 7 of 7 Matches ..."
is the data I want to extract ...

How do I tell WinDev where to start reading data, or how do I trim the string so that it starts after this point? For example, there is a heading phrase

IMAGE DESCRIPTION AVAILABILITY PRICE

I want to trim the string further so that only the data after this phrase remains. This should leave me with just the data I want to save to a data table (except for the junk at the end).

Also, what is the function to read data line by line? It looks like each record below is 16 lines ... starting with the word EUPEC ... (the manufacturer)

I have been playing with PositionOccurrence, Position, ExtractString ... but no luck getting it to work yet ...

Thanks for your help so far ... Dan

===============================================================

Wire Duct
Flexible Duct
Panel Duct
Wireholders
Wireholders
 Coming soon ...
 Coming soon ...
 Coming soon ...
Buy Products > Search: FZ1600

Narrow your Search
 
1 - 7 of 7 Matches Show  Items / Page   
IMAGE DESCRIPTION AVAILABILITY PRICE  


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KE3

NO STOCK
Est. Lead Time
28 days

$1,859.60
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KF4

NO STOCK
Est. Lead Time
27 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1200V SINGLE
ITEM # FZ1600R12KL4C

NO STOCK
Est. Lead Time
28 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3-B2

NO STOCK
Est. Lead Time
28 days

$2,862.80
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT SINGLE
ITEM # FZ1600R17KE3

NO STOCK
Est. Lead Time
28 days

$2,208.28
Volume
Discounts
Available
QTY.


EUPEC
TRANSISTOR
IGBT 1600A 1700V SINGLE
ITEM # FZ1600R17KF6-B2

NO STOCK
Est. Lead Time
28 days
? QTY.


EUPEC
TRANSISTOR
IGBT
ITEM # FZ1600R17KF6C-B2

NO STOCK
Est. Lead Time
28 days

$2,933.14
Volume
Discounts
Available
QTY.
1 - 7 of 7 Matches    



by DanM on 12.07.2009 13:29
There is a phrase in the string that is the beginning of where I need to extract the data ...

1 - 7 of 7 Matches Show Items / Page
IMAGE DESCRIPTION AVAILABILITY PRICE

What is the function I can use to determine the position of the end of this phrase?

I think from there I could use the Right function to grab everything from that point on?

Any ideas??

Dan

by DanM on 12.07.2009 16:52
I am getting closer, but ... this will not let me extract correctly every time ...

First, I am able to use ...

gResStart = HTTPRequest("http://www.onlinecomponents.com/buy/HONEYWELL/1TL1-2G/")

... and then ...

EDT_Edit1 = HTMLToText(HTTPGetResult())

This gets me to the point where the HTML from the page is in text format in a string ...

Now I am trying to get rid of all the information I do not need. I am currently doing it manually, by slowly figuring out at what position the data starts and stops, as below ...

MyString is string = EDT_Edit1

FirstCut is a string = Right(MyString, 640) // keeps the last 640 characters
EDT_Edit2 = Left(FirstCut, 220) // then the first 220 characters of that

Does anyone know of a way to ExtractString between two TAGS or phrases??
Then I would be able to extract the information I need without identifying the position or location of the beginning of the data.

OR ... How do I identify the position or location of a phrase? I think I would then be able to use the Right and Left functions on the string to remove the unneeded data??

Any thoughts, suggestions, ideas ...

Dan





by DanM on 12.07.2009 18:01
Hi Dan,

The beauty of extractstring is that it will not only use a single character as an argument, but also a multicharacter string. So if you wanted to extract the body part of a html string then you would use:

sBody=Extractstring(Extractstring(sHTML,1,"[/body]"),2,"[body]")
The HTML page is split in two by the body end tag. The inner ExtractString returns the part in front of the [/body] tag. This is the first argument of the outer ExtractString, which returns the part after the [body] tag (rank 2).
Using this technique you can narrow down your search step by step.
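
The same two-step narrowing can be sketched for the plain-text listing shown earlier in this thread. This is only an illustration: it assumes the heading phrase appears exactly once and that HTMLToText leaves both phrases with exactly this spacing, which may not hold for the real page. The variable names are made up.

```wlanguage
sText is string = HTMLToText(HTTPGetResult())
sStartMarker is string = "IMAGE DESCRIPTION AVAILABILITY PRICE"
sEndMarker is string = "1 - 7 of 7 Matches"
sRecords is string

// Rank 2 of the split on the heading = the text that follows the heading
sAfterHeading is string = ExtractString(sText, 2, sStartMarker)

// ExtractString returns EOT when the requested rank does not exist
IF sAfterHeading = EOT THEN
	Trace("Heading phrase not found")
ELSE
	// Rank 1 of the split on the end phrase = the text before the closing count
	sRecords = ExtractString(sAfterHeading, 1, sEndMarker)
	Trace(sRecords)
END
```

The order of the two steps matters in this listing because "1 - 7 of 7 Matches" also appears above the heading; cutting at the heading first removes that earlier occurrence.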

Regards,
Piet

Note: Because the forum will not display pointy brackets or anything between them, I replaced them with square brackets.


by Piet van Zanten on 12.07.2009 21:36
Piet,

That is amazing ... Thank you ...

but I am still stuck when it comes to the column qty and prices ...

This allows me to get down to ...

1TL1-2G
In Stock: 36 pcs. can ship now
Factory Lead-Time: 6 weeks
Pricing for 1TL1-2G Quantity Price
1 - 24 $34.56
25 - 49 $28.69
50 - 249 $25.24
250 - 999 $23.50
1000 + $21.43

Now, many part numbers have different column quantities (i.e. 1-2, 3-5, 6-10 or 1-9, 10-24, 25-99 or 1-99, 100-499, 500+)

How would I use the ExtractString function to separate by line? OR can I extract by line?
Is there a "CR" character I can extract by? Or some other way?

Example (based on above data)

$partnumber = 1TL1-2G
etc ... (I now understand how to strip the additional fields up top) ...

$qty_on_hand = ExtractString(ExtractString($qtyStep1,1,"pcs."),2,"In Stock")

but what can I do about the multiple column qty & pricing, since the column qty & prices change?

... I will not know what the beginning and end strings to extract by will be?

If there is a way to extract by line, I could use the $ in the price to do a ... (from beginning of line to $, and from $ to end of line)

Does that make sense ???

This would be the needed outcome ....

$col1_qty = 1 - 24
$col1_price = $34.56
$col2_qty = 25 - 49
$col2_price = $28.69
$col3_qty = 50 - 249
$col3_price = $25.24
$col4_qty = 250 - 999
$col4_price = $23.50
$col5_qty = 1000 +
$col5_price = $21.43















by DanM on 13.07.2009 01:05
Hi Dan,

To extract parts separated by a repeating string look at:
FOR EACH STRING sPart OF sContent SEPARATED BY "anything"
Discard what you don't need and add the relevant sParts to an array and break down the elements of the array with the same technique.
Use breakpoints and the debugger to track the results and narrow down your search.
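
As a rough sketch of this loop applied to the price-break lines quoted above: it assumes the lines are separated by CR and that every price-break line contains exactly one "$". The sample text and variable names are made up for illustration.

```wlanguage
// Hypothetical sample built from the price lines quoted in this thread
sBlock is string = "Pricing for 1TL1-2G Quantity Price" + CR + "1 - 24 $34.56" + CR + "25 - 49 $28.69" + CR + "1000 + $21.43"
sQty is string
sPrice is string

FOR EACH STRING sLine OF sBlock SEPARATED BY CR
	// Only lines that contain a "$" are treated as price breaks
	IF Position(sLine, "$") > 0 THEN
		// Text before the "$" is the quantity range, text after it is the price
		sQty = NoSpace(ExtractString(sLine, 1, "$"))
		sPrice = "$" + NoSpace(ExtractString(sLine, 2, "$"))
		Trace(sQty + TAB + sPrice)
	END
END
```

Because the loop takes whatever lines it finds, it does not need to know in advance how many price breaks a part has or where the ranges begin and end.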

You must find patterns that are always the same, otherwise it will be impossible to extract the data. Typically you can look for tables:

by Piet van Zanten on 13.07.2009 06:55
Dan

Piet's highlighted the problem/challenge. You have to find a separator for each part, whether you're using ExtractString or the FOR EACH construction mentioned above. That's why I ended up rejecting the HTMLToText function. As I mentioned in my previous posting, this function removed all the formatting which meant that it was next to impossible to find unique separators.

One thing I tried was analysing the ASCII codes of every character in the string. Sometimes there were hidden characters, characters which did not show on the screen when displayed as standard text. For instance, the end of a line could be an ASCII code 13 or 10 or both. That's why looking for a standard CR doesn't always work.
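
A small sketch of that kind of inspection, followed by one possible way to bring the different line endings to a single form. It assumes the text is short enough to dump to the trace window; the variable names are made up.

```wlanguage
// sText is assumed to hold the text being parsed
sText is string = HTMLToText(HTTPGetResult())
i is int

// List the code of every character so that hidden ones become visible
FOR i = 1 TO Length(sText)
	Trace(NumToString(i) + ": " + NumToString(Asc(sText[[i]])))
END

// Bring every line-ending variant (13+10, 13 alone, 10 alone) to the CR constant
sText = Replace(sText, Charact(13) + Charact(10), Charact(10))
sText = Replace(sText, Charact(13), Charact(10))
sText = Replace(sText, Charact(10), CR)
```

After this clean-up, a split on CR sees the same separator on every line, whichever form the page originally used.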

If you still can't find the unique characters or character combinations to distinguish lines in your example then the only solution I can think of is to go back to your original HTML file (before transformation with the HTMLToText function), work out the particular combination of codes which distinguish each line and then substitute/replace with a special string such as "&&&"; you can then use this later as your unique separator.

Rest assured, it's all possible (speaking from considerable experience parsing a range of web pages) but as I've indicated, it's very much grunt work, plain hard laborious slogging.

Have fun.

Chris

by Chris L on 13.07.2009 13:34
Hi,

I don't know the source of the HTML file you are trying to process, but is it possible to get the information supplied to you in XML?

Piet and Chris are quite correct - it's all about finding unique separators that can't appear anywhere else - even the word "Display" (for example) could appear in the product description area. But all this also assumes that the layout of the HTML page doesn't change over time.

If you can get it in XML it's much easier to process...

Cheers...

by DarrenF on 13.07.2009 15:32
Darren,

Unfortunately ... we are not able to get the XML from many of the suppliers, so I will need to do this for many (too many) suppliers until they get up to speed on the XML feed.

Chris,

Did you say ... it's very much grunt work, plain hard laborious slogging. Have fun.

... all in the same breath? I do not recall the last time I was slogging and had fun ... LOL

but ... I am so close ... I think if I can figure out this last piece ... I will be on my way ...

Here it is ...

I am back to the HTML (without the HTMLToText) as you discussed. My line of code is ...

sPriceBreakQty = ExtractString(ExtractString(sAPriceBreakLine, 1, "[/td]"),2,"[td width="50%" align="center" class="regprice"]")

but, that does not work because the HTML has quotes around all the data (50%, center, and regprice)

So then I tried removing the quotes around those 3 pieces of data ... but that does not work either.

so now I am thinking about what you said earlier about unique identifiers ... I thought I would replace any of the HTML that had the quotes in it ...

What if I replace [/td]"),2,"[td width="50%" align="center" class="regprice"]
... with ENDOFPRICE as a unique identifier ...

BUT ...

How do you do a replace when the item you want to replace has quotes in it when you need to put quotes around the item???

Any Ideas??

Dan


by Dan M on 13.07.2009 16:23
Hi. If you want to use " in a string, you must write "" for each one.

sPriceBreakQty = ExtractString(ExtractString(sAPriceBreakLine, 1, "[/td]"),2,"[td width=""50%"" align=""center"" class=""regprice""]")
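
The same doubling also answers the Replace question. Below is a sketch under assumptions: the sample line is invented, the real page is assumed to use pointy brackets where this thread shows square ones, and STARTOFPRICE / ENDOFPRICE are arbitrary markers chosen for the example.

```wlanguage
// Hypothetical sample of one price-break line from the raw HTML
sAPriceBreakLine is string = "<tr><td width=""50%"" align=""center"" class=""regprice"">1 - 24</td></tr>"

// The tag is written with doubled quotes so that it fits inside a string literal
sCellTag is string = "<td width=""50%"" align=""center"" class=""regprice"">"

// Swap the awkward tags for plain markers that contain no quotes
sMarked is string = Replace(sAPriceBreakLine, sCellTag, "STARTOFPRICE")
sMarked = Replace(sMarked, "</td>", "ENDOFPRICE")

// Then extract between the two markers
sPriceBreakQty is string = ExtractString(ExtractString(sMarked, 1, "ENDOFPRICE"), 2, "STARTOFPRICE")
Trace(sPriceBreakQty)
```

Replacing the tags first is optional, since the doubled-quote form works directly in ExtractString, but plain markers can make the later extraction lines easier to read.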




by Ruben Sanchez Peña on 13.07.2009 18:11
Dan

"... it's very much grunt work, plain hard laborious slogging. Have fun."

Yes, it was said tongue in cheek! As I think you've discovered, it's both boring and frustrating. You'd go out of your mind if you tried to do it without breaks of doing other work.

Having completed the work, I must admit it is satisfying to be able to click a button and see thousands of pieces of data stored in the appropriate files. Until of course someone decides to update the website! That happened to me at the beginning of this year; the webmaster of the major site I use decided to modernise with style sheets and the rest. It certainly looks better but it meant that many of the codes changed so I had to start over again. Luckily it gets easier second (third, fourth, etc) time round!

Ruben's info on putting quotes within quotes is worth noting - it crops up in all sorts of places. If you want to use quotes within a string defined with quotes, simply double the quotes.

sItemDescription = "Microprocessor transistor (known as ""MOSFET"")"
It looks strange when the quotes are at the end of the string and you have three quotes together but that's correct.

sItemDescription = "Transistor ""BJT"""


I'll change my parting salutation to ""Best of luck""!

Chris


by Chris L on 14.07.2009 01:35
Chris,

Well, I have made it ... I ended up needing 1 more scrub of the data and this was how it was done ... (by adding the HTMLToText)

before adding HTMLToText : http://tinypic.com/r/fjpwsy/3

after adding HTMLToText : http://tinypic.com/r/1zl9e6t/3

sPriceBreakQty = HTMLToText(ExtractString(ExtractString(sAPriceBreakLine, 1, ""),2,""))

sPriceBreakPrice =HTMLToText(ExtractString(ExtractString(sAPriceBreakLine, 1, ""),2,""))

Apparently, there was still some HTML formatting in the string which was preventing me from getting to the actual data. I was able to see the data in an info box, but when I attempted to display it in a table it would come up blank. Once I added the HTMLToText, whatever was in front of the data was stripped, and now all is good.

Thank you ALL for your help on this journey !!

Dan

by DanM on 14.07.2009 14:06
Well done, Dan!

Perseverance pays off.

It'll be easier next time!

Chris

by Chris L on 15.07.2009 05:45