  #1  
2024-09-13, 09:12 AM
mjcrossuk
 
Join Date: Mar 2007
Any good Python coders here? Especially with regard to web scraping?

I'm trying to scrape the torrent details from TTD to put into a database which I would make available online, in the event that TTD closes down.

Please see this thread for more details: http://www.thetradersden.org/forums/...d.php?t=202461

I've not used the requests package before, and I'm having problems.

I'm starting from this page: http://www.thetradersden.org/forums/....php/f-11.html, then going into the list pages for some of the categories (e.g. Audio, Audio Inactive, Audio Pulled). The next-level pages list torrents, 250 per page, each with a link to the torrent detail thread. For example, http://www.thetradersden.org/forums/....php/f-12.html is page 1 of 147 listing Active Audio torrents.

Sometimes my scraping code retrieves an index page successfully, but other times the request returns a status of 200 with an empty response body.

Trying to retrieve a torrent thread, e.g. http://www.thetradersden.org/forums/.../t-203252.html, always gives a 200 status and an empty response.

If the request had failed, I would expect a non-200 status code, but that's not what I get.

Could it be authentication? Caching? Is the TTD backend blocking scraping of torrent detail threads?
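
To help tell those possibilities apart, here is a rough sketch of the kind of check that could be run on each response. It is only an illustration, not my actual scraping code: `summarize_response` and `fetch_and_summarize` are made-up helper names, the User-Agent string is an arbitrary placeholder, and I'm assuming the usual `requests.Response` attributes (`status_code`, `url`, `history`, `headers`, `content`). A redirect chain in `history` might point at a login bounce, and a `Content-Length` that disagrees with the actual body length might point at caching or a proxy, though neither is conclusive.

```python
from types import SimpleNamespace


def summarize_response(resp):
    """Collect the fields most useful for diagnosing a '200 but empty' reply.

    `resp` is anything shaped like a requests.Response
    (status_code, url, history, headers, content).
    """
    return {
        "status": resp.status_code,
        "final_url": resp.url,
        # Status code of each hop in a redirect chain, if there was one.
        "redirects": [r.status_code for r in resp.history],
        "content_type": resp.headers.get("Content-Type"),
        "declared_length": resp.headers.get("Content-Length"),
        "actual_length": len(resp.content),
        "is_empty": len(resp.content) == 0,
    }


def fetch_and_summarize(url, user_agent="Mozilla/5.0 (compatible; archive-test)"):
    """Fetch `url` with an explicit User-Agent and summarize the reply.

    The User-Agent value is a placeholder assumption, not something the
    site is known to require.
    """
    import requests  # imported here so summarize_response works without it

    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    return summarize_response(resp)


# Offline demo with a stand-in object that mimics the symptom described above.
fake = SimpleNamespace(
    status_code=200,
    url="http://example.invalid/t-1.html",
    history=[],
    headers={"Content-Type": "text/html", "Content-Length": "0"},
    content=b"",
)
print(summarize_response(fake))
```

Comparing the summary for a page that loads properly against one that comes back empty should at least show whether the difference is in the redirects, the headers, or only the body.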

TIA

I realise that web scraping these pages may not be the best way to get the info; a better way would be a dump/extract of the backend database(s). If the scraping does work, I would be mindful of NOT scraping 100,000+ pages quickly.
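
As a sketch of what "not quickly" could look like in practice: a loop that pauses between requests. The name `crawl_politely`, the 5-second default delay and the `fetch` callable are all assumptions for illustration; `fetch` stands in for whatever function ends up downloading a page.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page body.
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results
```

At 5 seconds per page, 100,000 pages would take roughly 139 hours (about six days), so the delay would need to be agreed with whoever runs the site.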