mjcrossuk
2024-09-13, 09:12 AM
I'm trying to scrape the torrent details from TTD to put into a database which I would make available online, in the event that TTD closes down.
Please see this thread for more details: http://www.thetradersden.org/forums/showthread.php?t=202461
I've not used the requests package before, and I'm having problems.
I'm starting from this page: http://www.thetradersden.org/forums/archive/index.php/f-11.html, then going into the list pages for some of the categories (e.g. Audio, Audio Inactive, Audio Pulled). The next-level pages list torrents, 250 per page, each with a link to the torrent detail thread. For example, http://www.thetradersden.org/forums/archive/index.php/f-12.html is page 1 of 147 listing Active Audio torrents.
Sometimes my scraping code retrieves an index page, but other times the request returns a status of 200 with an empty response body.
Trying to retrieve a torrent thread, e.g. http://www.thetradersden.org/forums/archive/index.php/t-203252.html, always gives a 200 status and an empty response.
If the request fails, I should get a non-200 status code, but I don't.
Could it be authentication? Caching? Is the TTD backend blocking scraping of torrent detail threads?
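To help narrow down which of these it is, a small diagnostic along the following lines might be useful. This is only a sketch: `summarize_response` and `fetch_and_summarize` are hypothetical helper names, and the User-Agent value is whatever the caller chooses to pass, not something TTD is known to require. It reports the body length, a few headers and any redirects alongside the status code.

```python
def summarize_response(resp):
    """Collect the fields that tend to explain a '200 but empty' response.

    `resp` is expected to look like a requests.Response: it needs
    .status_code, .headers, .content and .history.
    """
    # Lower-case the header names so lookups work on a plain dict too.
    headers = {k.lower(): v for k, v in resp.headers.items()}
    return {
        "status": resp.status_code,
        "body_bytes": len(resp.content),
        "content_type": headers.get("content-type"),
        "content_encoding": headers.get("content-encoding"),
        "server": headers.get("server"),
        # Status codes of any redirects followed before the final response.
        "redirects": [r.status_code for r in resp.history],
        "looks_empty": resp.status_code == 200 and len(resp.content) == 0,
    }


def fetch_and_summarize(url, user_agent):
    """Fetch one URL with an explicit User-Agent and summarize the result."""
    import requests  # imported here so summarize_response works without it

    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    return summarize_response(resp)
```

Comparing the summary for an index page that worked with one for a thread page that came back empty, and with what a browser shows for the same URL, may indicate whether the difference lies in the request headers, a redirect, or something on the server side.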
TIA
I realise that web scraping these pages may not be the best way to get the info; a better way would be a dump/extract of the backend database(s). If the scraping does work, I would be mindful of NOT scraping 100,000+ pages quickly.
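On the point about not scraping quickly, one way to pace the requests is sketched below. The name `fetch_politely` and the 5-second default are assumptions chosen for illustration, not a rate TTD has asked for; `fetch` is any callable that takes a URL and returns the page.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is supplied by the caller. `sleep` is injectable so the
    pacing can be checked without actually waiting.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

With the requests package this could be called as `fetch_politely(urls, lambda u: requests.get(u, timeout=30).text)`.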