Compare commits


No commits in common. "main" and "dev" have entirely different histories.
main ... dev

4 changed files with 137 additions and 674 deletions


@@ -14,7 +14,10 @@ jobs:
- name: Checkout original repository
uses: actions/checkout@v3
with:
repository: nanos/FediFetcher
fetch-depth: 0
- name: Checkout latest release tag
run: git checkout $(git describe --tags)
- name: Set up Python
uses: actions/setup-python@v4
with:
@@ -30,7 +33,7 @@ jobs:
path: artifacts
- name: Get Directory structure
run: ls -lR
- run: python find_posts.py --lock-hours=0 --access-token=${{ secrets.ACCESS_TOKEN }} -c="./config.json" - run: python find_posts.py --lock-hours=0 --access-token=${{ secrets.ACCESS_TOKEN }} --server=${{ vars.MASTODON_SERVER }} --reply-interval-in-hours=${{ vars.REPLY_INTERVAL_IN_HOURS || 0 }} --home-timeline-length=${{ vars.HOME_TIMELINE_LENGTH || 0 }} --max-followings=${{ vars.MAX_FOLLOWINGS || 0 }} --user=${{ vars.USER }} --max-followers=${{ vars.MAX_FOLLOWERS || 0 }} --http-timeout=${{ vars.HTTP_TIMEOUT || 5 }} --max-follow-requests=${{ vars.MAX_FOLLOW_REQUESTS || 0 }} --on-fail=${{ vars.ON_FAIL }} --on-start=${{ vars.ON_START }} --on-done=${{ vars.ON_DONE }} --max-bookmarks=${{ vars.MAX_BOOKMARKS || 0 }} --remember-users-for-hours=${{ vars.REMEMBER_USERS_FOR_HOURS || 168 }} --from-notifications=${{ vars.FROM_NOTIFICATIONS || 0 }} --backfill-with-context=${{ vars.BACKFILL_WITH_CONTEXT || 1 }} --backfill-mentioned-users=${{ vars.BACKFILL_MENTIONED_USERS || 1 }} --max-favourites=${{ vars.MAX_FAVOURITES || 0}}
- name: Upload artifacts
uses: actions/upload-artifact@v3
with:

README.md

@@ -6,7 +6,6 @@ This GitHub repository provides a simple script that can pull missing posts into
1. fetch missing replies to posts that users on your instance have already replied to,
2. fetch missing replies to the most recent posts in your home timeline,
3. fetch missing replies to your bookmarks.
4. fetch missing replies to your favourites.
2. It can also backfill profiles on your instance. In particular it can
1. fetch missing posts from users that have recently appeared in your notifications,
1. fetch missing posts from users that you have recently followed,
@@ -25,7 +24,7 @@ For detailed information on the how and why, please read the [FediFetcher for Ma
FediFetcher makes use of the Mastodon API. It'll run against any instance implementing this API, and whilst it was built for Mastodon, it's been [confirmed working against Pleroma](https://fed.xnor.in/objects/6bd47928-704a-4cb8-82d6-87471d1b632f) as well.
FediFetcher will pull in posts and profiles from any servers running the following software: Mastodon, Pleroma, Akkoma, Pixelfed, Hometown, Misskey, Firefish (Calckey), Foundkey, and Lemmy. FediFetcher will pull in posts and profiles from any server that implements the Mastodon API, including Mastodon, Pleroma, Akkoma, Pixelfed, and probably others.
## Setup
@@ -35,63 +34,56 @@ You can run FediFetcher either as a GitHub Action, as a scheduled cron job on yo
Regardless of how you want to run FediFetcher, you must first get an access token:
#### If you are an Admin on your instance
1. In Mastodon go to Preferences > Development > New Application
1. Give it a nice name
2. Enable the required scopes for your options. You could tick `read` and `admin:read:accounts`, or see below for a list of which scopes are required for which options.
3. Save
4. Copy the value of `Your access token`
#### If you are not an Admin on your Instance If you are not a server admin, you do not have access to Preferences > Development. You can use [GetAuth for Mastodon](https://getauth.thms.uk) to generate an Access Token instead.
1. Go to [GetAuth for Mastodon](https://getauth.thms.uk?scopes=read&client_name=FediFetcher) ### 2.1) Configure and run the GitHub Action
2. Type in your Mastodon instance's domain
3. Copy the token.
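Whichever route you took, it is worth checking that the token actually works before you configure anything else. A minimal sketch using `curl`; the server name and token are placeholders, and the call assumes the `read:accounts` scope is enabled:

```bash
# Ask the Mastodon API which account the token belongs to.
# Replace both placeholders with your own values.
curl -s \
  -H "Authorization: Bearer <YOUR_ACCESS_TOKEN>" \
  "https://<your.mastodon.server>/api/v1/accounts/verify_credentials"
# A JSON object describing your account means the token is valid;
# an error response means it is not.
```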
### 2) Configure and run FediFetcher To run FediFetcher as a GitHub Action:
Run FediFetcher as a GitHub Action, a cron job, or a container:
#### To run FediFetcher as a GitHub Action:
1. Fork this repository
2. Add your access token:
1. Go to Settings > Secrets and Variables > Actions
2. Click New Repository Secret
3. Supply the Name `ACCESS_TOKEN` and provide the Token generated above as Secret
3. Create a file called `config.json` with your [configuration options](#configuration-options) in the repository root. **Do NOT include the Access Token in your `config.json`!** 3. Provide the required environment variables, to configure your Action:
1. Go to Settings > Environments
2. Click New Environment
3. Provide the name `Mastodon`
4. Add environment variables to configure your action as described below.
4. Finally go to the Actions tab and enable the action. The action should now automatically run approximately once every 10 min.
> **Note** Keep in mind that [the schedule event can be delayed during periods of high loads of GitHub Actions workflow runs](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule).
>
> Keep in mind that [the schedule event can be delayed during periods of high loads of GitHub Actions workflow runs](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule).
#### To run FediFetcher as a cron job: ### 2.2) Run FediFetcher locally as a cron job
1. Clone this repository. If you want to, you can of course also run FediFetcher locally as a cron job:
1. To get started, clone this repository.
2. Install requirements: `pip install -r requirements.txt`
3. Create a `json` file with [your configuration options](#configuration-options). You may wish to store this in the `./artifacts` directory, as that directory is `.gitignore`d 3. Then simply run this script like so: `python find_posts.py --access-token=<TOKEN> --server=<SERVER>` etc. (Read below, or run `python find_posts.py -h` to get a list of all options.)
4. Then simply run this script like so: `python find_posts.py -c=./artifacts/config.json`.
If desired, all configuration options can be provided as command line flags, instead of through a JSON file. An [example script](./examples/FediFetcher.sh) can be found in the `examples` folder. An [example script](./examples/FediFetcher.sh) can be found in the `examples` folder.
When using a cronjob, we are using file based locking to avoid multiple overlapping executions of the script. The timeout period for the lock can be configured using `lock-hours`. When using a cronjob, we are using file based locking to avoid multiple overlapping executions of the script. The timeout period for the lock can be configured using `--lock-hours`.
> **Note** If you are running FediFetcher locally, my recommendation is to run it manually once, before turning on the cron job: The first run will be significantly slower than subsequent runs, and that will help you prevent overlapping during that first run.
>
> If you are running FediFetcher locally, my recommendation is to run it manually once, before turning on the cron job: The first run will be significantly slower than subsequent runs, and that will help you prevent overlapping during that first run.
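For reference, a crontab entry might look like the sketch below. The schedule, install path, and option values are assumptions; adjust them to your setup, and run `python find_posts.py -h` for the full list of flags.

```bash
# Example crontab entry (edit with `crontab -e`): run FediFetcher every 30 minutes.
# The install path, schedule, and option values are placeholders for illustration.
*/30 * * * * cd /opt/FediFetcher && python3 find_posts.py --access-token=<TOKEN> --server=<your.mastodon.server> --home-timeline-length=200 --max-followings=80 --from-notifications=1 --lock-hours=24 >> fedifetcher.log 2>&1
```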
#### To run FediFetcher from a container: *Note:* if you wish to run FediFetcher using Windows Task Scheduler, you can rename the script to the `.pyw` extension instead of `.py`, and it will run silently, without opening a console window.
### 2.3) Run FediFetcher from a container
FediFetcher is also available in a pre-packaged container, [FediFetcher](https://github.com/nanos/FediFetcher/pkgs/container/fedifetcher) - Thank you [@nikdoof](https://github.com/nikdoof).
1. Pull the container from `ghcr.io`, using Docker or your container tool of choice: `docker pull ghcr.io/nanos/fedifetcher:latest`
2. Run the container, passing the configurations options as command line arguments: `docker run -it ghcr.io/nanos/fedifetcher:latest --access-token=<TOKEN> --server=<SERVER>` 2. Run the container, passing the command line arguments like running the script directly: `docker run -it ghcr.io/nanos/fedifetcher:latest --access-token=<TOKEN> --server=<SERVER>`
> **Note** The same rules for running this as a cron job apply to running the container: don't overlap any executions.
>
> The same rules for running this as a cron job apply to running the container: don't overlap any executions.
Persistent files are stored in `/app/artifacts` within the container, so you may want to map this to a local folder on your system.
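For example, a sketch of a run that keeps the state on the host (the host path is an assumption):

```bash
# Persist FediFetcher's state between runs by mounting /app/artifacts from the host.
docker run --rm \
  -v /opt/fedifetcher/artifacts:/app/artifacts \
  ghcr.io/nanos/fedifetcher:latest \
  --access-token=<TOKEN> --server=<your.mastodon.server> --home-timeline-length=200
```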
@@ -99,21 +91,15 @@ An [example Kubernetes CronJob](./examples/k8s-cronjob.yaml) for running the con
### Configuration options
FediFetcher has quite a few configuration options, so here is my quick configuration advice, that should probably work for most people: FediFetcher has quite a few configuration options, so here is my quick configuration advice, that should probably work for most people (use the *Environment Variable Name* if you are running FediFetcher as a GitHub Action, otherwise use the *Command line flag*):
> **Warning** | Environment Variable Name | Command line flag | Recommended Value |
> |:-------------------------|:-------------------|:-----------|
> **Do NOT** include your `access-token` in the `config.json` when running FediFetcher as GitHub Action. When running FediFetcher as GitHub Action **ALWAYS** [set the Access Token as an Action Secret](#to-run-fedifetcher-as-a-github-action). | -- | `--access-token` | (Your access token) |
| `MASTODON_SERVER`|`--server` | (your Mastodon server name) |
```json | `HOME_TIMELINE_LENGTH` | `--home-timeline-length` | `200` |
{ | `MAX_FOLLOWINGS` | `--max-followings` | `80` |
"access-token": "Your access token", | `FROM_NOTIFICATIONS` | `--from-notifications` | `1` |
"server": "your.mastodon.server",
"home-timeline-length": 200,
"max-followings": 80,
"from-notifications": 1
}
```
If you configure FediFetcher this way, it'll fetch missing remote replies to the last 200 posts in your home timeline. It'll additionally backfill profiles of the last 80 people you followed, and of every account who appeared in your notifications during the past hour.
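As a command line (for the cron job or container variants), that recommended configuration looks roughly like this; the token and server name are placeholders:

```bash
# Recommended starting configuration, taken from the table above.
python3 find_posts.py \
  --access-token=<TOKEN> \
  --server=<your.mastodon.server> \
  --home-timeline-length=200 \
  --max-followings=80 \
  --from-notifications=1
```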
@@ -121,51 +107,50 @@ If you configure FediFetcher this way, it'll fetch missing remote replies to the
Please find the list of all configuration options, including descriptions, below:
Option | Required? | Notes | | Environment Variable Name | Command line flag | Required? | Notes |
|:----------------------------------------------------|-----------|:------| |:---------------------------------------------------|:----------------------------------------------------|-----------|:------|
|`access-token` | Yes | The access token. If using GitHub action, this needs to be provided as a Secret called `ACCESS_TOKEN`. If running as a cron job or a container, you can supply this option as array, to [fetch posts for multiple users](https://blog.thms.uk/2023/04/muli-user-support-for-fedifetcher) on your instance. | | -- | `--access-token` | Yes | The access token. If using GitHub action, this needs to be provided as a Secret called `ACCESS_TOKEN`. If running as a cron job or a container, you can supply this argument multiple times, to [fetch posts for multiple users](https://blog.thms.uk/2023/04/muli-user-support-for-fedifetcher) on your instance. |
|`server`|Yes|The domain only of your mastodon server (without `https://` prefix) e.g. `mstdn.thms.uk`. | |`MASTODON_SERVER`|`--server`|Yes|The domain only of your mastodon server (without `https://` prefix) e.g. `mstdn.thms.uk`. |
|`home-timeline-length` | No | Provide to fetch remote replies to posts in the API-Key owner's home timeline. Determines how many posts we'll fetch replies for. Recommended value: `200`. | `HOME_TIMELINE_LENGTH` | `--home-timeline-length` | No | Provide to fetch remote replies to posts in the API-Key owner's home timeline. Determines how many posts we'll fetch replies for. Recommended value: `200`.
| `max-bookmarks` | No | Provide to fetch remote replies to any posts you have bookmarked. Determines how many of your bookmarks you want to get replies to. Recommended value: `80`. Requires an access token with `read:bookmarks` scope. | `REPLY_INTERVAL_IN_HOURS` | `--reply-interval-in-hours` | No | Provide to fetch remote replies to posts that have received replies from users on your own instance. Determines how far back in time we'll go to find posts that have received replies. Recommend value: `0` (disabled). Requires an access token with `admin:read:accounts`.
| `max-favourites` | No | Provide to fetch remote replies to any posts you have favourited. Determines how many of your favourites you want to get replies to. Recommended value: `40`. Requires an access token with `read:favourites` scope. | `MAX_BOOKMARKS` | `--max-bookmarks` | No | Provide to fetch remote replies to any posts you have bookmarked. Determines how many of your bookmarks you want to get replies to. Recommended value: `80`. Requires an access token with `read:bookmarks` scope.
| `max-followings` | No | Provide to backfill profiles for your most recent followings. Determines how many of your last followings you want to backfill. Recommended value: `80`. | `MAX_FAVOURITES` | `--max-favourites` | No | Provide to fetch remote replies to any posts you have favourited. Determines how many of your favourites you want to get replies to. Recommended value: `40`. Requires an access token with `read:favourites` scope.
| `max-followers` | No | Provide to backfill profiles for your most recent followers. Determines how many of your last followers you want to backfill. Recommended value: `80`. | `MAX_FOLLOWINGS` | `--max-followings` | No | Provide to backfill profiles for your most recent followings. Determines how many of your last followings you want to backfill. Recommended value: `80`.
| `max-follow-requests` | No | Provide to backfill profiles for the API key owner's most recent pending follow requests. Determines how many of your last follow requests you want to backfill. Recommended value: `80`. | `MAX_FOLLOWERS` | `--max-followers` | No | Provide to backfill profiles for your most recent followers. Determines how many of your last followers you want to backfill. Recommended value: `80`.
| `from-notifications` | No | Provide to backfill profiles of anyone mentioned in your recent notifications. Determines how many hours of notifications you want to look at. Requires an access token with `read:notifications` scope. Recommended value: `1`, unless you run FediFetcher less than once per hour. | `MAX_FOLLOW_REQUESTS` | `--max-follow-requests` | No | Provide to backfill profiles for the API key owner's most recent pending follow requests. Determines how many of your last follow requests you want to backfill. Recommended value: `80`.
| `reply-interval-in-hours` | No | Provide to fetch remote replies to posts that have received replies from users on your own instance. Determines how far back in time we'll go to find posts that have received replies. You must be administrator on your instance to use this option, and this option is not supported on Pleroma / Akkoma and its forks. Recommend value: `0` (disabled). Requires an access token with `admin:read:accounts`. | `FROM_NOTIFICATIONS` | `--from-notifications` | No | Provide to backfill profiles of anyone mentioned in your recent notifications. Determines how many hours of notifications you want to look at. Requires an access token with `read:notifications` scope. Recommended value: `1`, unless you run FediFetcher less than once per hour.
|`backfill-with-context` | No | Set to `0` to disable fetching remote replies while backfilling profiles. This is enabled by default, but you can disable it, if it's too slow for you. |`BACKFILL_WITH_CONTEXT` | `--backfill-with-context` | No | Set to `0` to disable fetching remote replies while backfilling profiles. This is enabled by default, but you can disable it, if it's too slow for you.
|`backfill-mentioned-users` | No | Set to `0` to disable backfilling any mentioned users when fetching the home timeline. This is enabled by default, but you can disable it, if it's too slow for you. |`BACKFILL_MENTIONED_USERS` | `--backfill-mentioned-users` | No | Set to `0` to disable backfilling any mentioned users when fetching the home timeline. This is enabled by default, but you can disable it, if it's too slow for you.
| `remember-users-for-hours` | No | How long between back-filling attempts for non-followed accounts? Defaults to `168`, i.e. one week. | `REMEMBER_USERS_FOR_HOURS` | `--remember-users-for-hours` | No | How long between back-filling attempts for non-followed accounts? Defaults to `168`, i.e. one week.
| `remember-hosts-for-days` | No | How long should FediFetcher cache host info for? Defaults to `30`. | `HTTP_TIMEOUT` | `--http-timeout` | No | The timeout for any HTTP requests to the Mastodon API in seconds. Defaults to `5`.
| `http-timeout` | No | The timeout for any HTTP requests to the Mastodon API in seconds. Defaults to `5`. | -- | `--lock-hours` | No | Determines after how many hours a lock file should be discarded. Not relevant when running the script as GitHub Action, as concurrency is prevented using a different mechanism. Recommended value: `24`.
| `lock-hours` | No | Determines after how many hours a lock file should be discarded. Not relevant when running the script as GitHub Action, as concurrency is prevented using a different mechanism. Recommended value: `24`. | -- | `--lock-file` | No | Location for the lock file. If not specified, will use `lock.lock` under the state directory. Not relevant when running the script as GitHub Action.
| `lock-file` | No | Location for the lock file. If not specified, will use `lock.lock` under the state directory. Not relevant when running the script as GitHub Action. | -- | `--state-dir` | No | Directory storing persistent files, and the default location for lock file. Not relevant when running the script as GitHub Action.
| `state-dir` | No | Directory storing persistent files, and the default location for lock file. Not relevant when running the script as GitHub Action. | `ON_START` | `--on-start` | No | Optionally provide a callback URL that will be pinged when processing is starting. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
| `on-start` | No | Optionally provide a callback URL that will be pinged when processing is starting. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io. | `ON_DONE` | `--on-done` | No | Optionally provide a callback URL that will be called when processing is finished. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
| `on-done` | No | Optionally provide a callback URL that will be called when processing is finished. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io. | `ON_FAIL` | `--on-fail` | No | Optionally provide a callback URL that will be called when processing has failed. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
| `on-fail` | No | Optionally provide a callback URL that will be called when processing has failed. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
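To illustrate the `on-start` / `on-done` / `on-fail` callbacks, here is a sketch using healthchecks.io-style ping URLs; the check UUID and the other values are placeholders:

```bash
# Ping a monitoring service when a run starts, succeeds, or fails.
python3 find_posts.py \
  --access-token=<TOKEN> \
  --server=<your.mastodon.server> \
  --home-timeline-length=200 \
  --on-start=https://hc-ping.com/<check-uuid>/start \
  --on-done=https://hc-ping.com/<check-uuid> \
  --on-fail=https://hc-ping.com/<check-uuid>/fail
```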
### Multi User support #### Multi User support
If you wish to [run FediFetcher for multiple users on your instance](https://blog.thms.uk/2023/04/muli-user-support-for-fedifetcher?utm_source=github), you can supply the `access-token` as an array, with different access tokens for different users. That will allow you to fetch replies and/or backfill profiles for multiple users on your account. If you wish to [run FediFetcher for multiple users on your instance](https://blog.thms.uk/2023/04/muli-user-support-for-fedifetcher?utm_source=github), you can supply the `--access-token` argument multiple times, with different access tokens for different users. That will allow you to fetch replies and/or backfill profiles for multiple users on your account. Have a look at the [sample script provided](./examples/FediFetcher-multiple-users.sh).
This is only supported when running FediFetcher as cron job, or container. Multi-user support is not available when running FediFetcher as GitHub Action.
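A minimal sketch of such a multi-user invocation (tokens and server are placeholders; the sample script linked above is the fuller version):

```bash
# Fetch replies and backfill profiles for two users on the same instance
# by passing --access-token once per user.
python3 find_posts.py \
  --server=<your.mastodon.server> \
  --access-token=<TOKEN_FOR_USER_1> \
  --access-token=<TOKEN_FOR_USER_2> \
  --home-timeline-length=200 \
  --from-notifications=1
```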
### Required Access Token Scopes #### Required Access Token Scopes
- For all actions, your access token must include these scopes:
- `read:search`
- `read:statuses`
- `read:accounts`
- If you are supplying `reply-interval-in-hours` you must additionally enable this scope: - If you are supplying `REPLY_INTERVAL_IN_HOURS` / `--reply-interval-in-hours` you must additionally enable this scope:
- `admin:read:accounts`
- If you are supplying `max-follow-requests` you must additionally enable this scope: - If you are supplying `MAX_FOLLOW_REQUESTS` / `--max-follow-requests` you must additionally enable this scope:
- `read:follows`
- If you are supplying `max-bookmarks` you must additionally enable this scope: - If you are supplying `MAX_BOOKMARKS` / `--max-bookmarks` you must additionally enable this scope:
- `read:bookmarks`
- If you are supplying `max-favourites` you must additionally enable this scope: - If you are supplying `MAX_FAVOURITES` / `--max-favourites` you must additionally enable this scope:
- `read:favourites`
- If you are supplying `from-notifications` you must additionally enable this scope: - If you are supplying `FROM_NOTIFICATIONS` / `--from-notifications` you must additionally enable this scope:
- `read:notifications`
## Acknowledgments


@@ -1,7 +1,6 @@
#!/usr/bin/env python3
from datetime import datetime, timedelta
import string
from dateutil import parser
import itertools
import json
@@ -12,13 +11,11 @@ import requests
import time
import argparse
import uuid
import defusedxml.ElementTree as ET
argparser=argparse.ArgumentParser()
argparser.add_argument('-c','--config', required=False, type=str, help='Optionally provide a path to a JSON file containing configuration options. If not provided, options must be supplied using command line flags.') argparser.add_argument('--server', required=True, help="Required: The name of your server (e.g. `mstdn.thms.uk`)")
argparser.add_argument('--server', required=False, help="Required: The name of your server (e.g. `mstdn.thms.uk`)") argparser.add_argument('--access-token', action="append", required=True, help="Required: The access token can be generated at https://<server>/settings/applications, and must have read:search, read:statuses and admin:read:accounts scopes. You can supply this multiple times, if you want tun run it for multiple users.")
argparser.add_argument('--access-token', action="append", required=False, help="Required: The access token can be generated at https://<server>/settings/applications, and must have read:search, read:statuses and admin:read:accounts scopes. You can supply this multiple times, if you want tun run it for multiple users.")
argparser.add_argument('--reply-interval-in-hours', required = False, type=int, default=0, help="Fetch remote replies to posts that have received replies from users on your own instance in this period")
argparser.add_argument('--home-timeline-length', required = False, type=int, default=0, help="Look for replies to posts in the API-Key owner's home timeline, up to this many posts")
argparser.add_argument('--user', required = False, default='', help="Use together with --max-followings or --max-followers to tell us which user's followings/followers we should backfill")
@@ -29,7 +26,6 @@ argparser.add_argument('--max-bookmarks', required = False, type=int, default=0,
argparser.add_argument('--max-favourites', required = False, type=int, default=0, help="Fetch remote replies to the API key owners Favourites. We'll fetch replies to at most this many favourites")
argparser.add_argument('--from-notifications', required = False, type=int, default=0, help="Backfill accounts of anyone appearing in your notifications, during the last hours")
argparser.add_argument('--remember-users-for-hours', required=False, type=int, default=24*7, help="How long to remember users that you aren't following for, before trying to backfill them again.")
argparser.add_argument('--remember-hosts-for-days', required=False, type=int, default=30, help="How long to remember host info for, before checking again.")
argparser.add_argument('--http-timeout', required = False, type=int, default=5, help="The timeout for any HTTP requests to your own, or other instances.")
argparser.add_argument('--backfill-with-context', required = False, type=int, default=1, help="If enabled, we'll fetch remote replies when backfilling profiles. Set to `0` to disable.")
argparser.add_argument('--backfill-mentioned-users', required = False, type=int, default=1, help="If enabled, we'll backfill any mentioned users when fetching remote replies to timeline posts. Set to `0` to disable.")
@@ -67,17 +63,17 @@ def get_favourites(server, access_token, max):
"Authorization": f"Bearer {access_token}",
})
def add_user_posts(server, access_token, followings, known_followings, all_known_users, seen_urls, seen_hosts): def add_user_posts(server, access_token, followings, know_followings, all_known_users, seen_urls):
for user in followings:
if user['acct'] not in all_known_users and not user['url'].startswith(f"https://{server}/"):
posts = get_user_posts(user, known_followings, server, seen_hosts) posts = get_user_posts(user, know_followings, server)
if(posts != None):
count = 0
failed = 0
for post in posts:
if post.get('reblog') is None and post.get('renoteId') is None and post.get('url') is not None and post.get('url') not in seen_urls: if post['reblog'] == None and post['url'] != None and post['url'] not in seen_urls:
added = add_post_with_context(post, server, access_token, seen_urls, seen_hosts) added = add_post_with_context(post, server, access_token, seen_urls)
if added is True:
seen_urls.add(post['url'])
count += 1
@@ -85,159 +81,61 @@ def add_user_posts(server, access_token, followings, known_followings, all_known
failed += 1
log(f"Added {count} posts for user {user['acct']} with {failed} errors")
if failed == 0:
known_followings.add(user['acct']) know_followings.add(user['acct'])
all_known_users.add(user['acct'])
def add_post_with_context(post, server, access_token, seen_urls, seen_hosts): def add_post_with_context(post, server, access_token, seen_urls):
added = add_context_url(post['url'], server, access_token)
if added is True:
seen_urls.add(post['url'])
if ('replies_count' in post or 'in_reply_to_id' in post) and getattr(arguments, 'backfill_with_context', 0) > 0: if (post['replies_count'] or post['in_reply_to_id']) and arguments.backfill_with_context > 0:
parsed_urls = {}
parsed = parse_url(post['url'], parsed_urls)
if parsed == None:
return True
known_context_urls = get_all_known_context_urls(server, [post],parsed_urls, seen_hosts) known_context_urls = get_all_known_context_urls(server, [post],parsed_urls)
add_context_urls(server, access_token, known_context_urls, seen_urls)
return True
return False
def get_user_posts(user, known_followings, server, seen_hosts): def get_user_posts(user, know_followings, server):
parsed_url = parse_user_url(user['url'])
if parsed_url == None:
# We are adding it as 'known' anyway, because we won't be able to fix this.
known_followings.add(user['acct']) know_followings.add(user['acct'])
return None
if(parsed_url[0] == server):
log(f"{user['acct']} is a local user. Skip")
known_followings.add(user['acct']) know_followings.add(user['acct'])
return None
post_server = get_server_info(parsed_url[0], seen_hosts)
if post_server is None:
log(f'server {parsed_url[0]} not found for post')
return None
if post_server['mastodonApiSupport']:
return get_user_posts_mastodon(parsed_url[1], post_server['webserver'])
if post_server['lemmyApiSupport']:
return get_user_posts_lemmy(parsed_url[1], user['url'], post_server['webserver'])
if post_server['misskeyApiSupport']:
return get_user_posts_misskey(parsed_url[1], post_server['webserver'])
log(f'server api unknown for {post_server["webserver"]}, cannot fetch user posts')
return None
def get_user_posts_mastodon(userName, webserver):
try:
user_id = get_user_id(webserver, userName) user_id = get_user_id(parsed_url[0], parsed_url[1])
except Exception as ex:
log(f"Error getting user ID for user {userName}: {ex}") log(f"Error getting user ID for user {user['acct']}: {ex}")
return None
try:
url = f"https://{webserver}/api/v1/accounts/{user_id}/statuses?limit=40" url = f"https://{parsed_url[0]}/api/v1/accounts/{user_id}/statuses?limit=40"
response = get(url)
if(response.status_code == 200):
return response.json()
elif response.status_code == 404:
raise Exception(
f"User {userName} was not found on server {webserver}" f"User {user['acct']} was not found on server {parsed_url[0]}"
)
else:
raise Exception(
f"Error getting URL {url}. Status code: {response.status_code}" f"Error getting URL {url}. Status code: {response.status_code}"
)
except Exception as ex:
log(f"Error getting posts for user {userName}: {ex}") log(f"Error getting posts for user {user['acct']}: {ex}")
return None
def get_user_posts_lemmy(userName, userUrl, webserver):
# community
if re.match(r"^https:\/\/[^\/]+\/c\/", userUrl):
try:
url = f"https://{webserver}/api/v3/post/list?community_name={userName}&sort=New&limit=50"
response = get(url)
if(response.status_code == 200):
posts = [post['post'] for post in response.json()['posts']]
for post in posts:
post['url'] = post['ap_id']
return posts
except Exception as ex:
log(f"Error getting community posts for community {userName}: {ex}")
return None
# user
if re.match(r"^https:\/\/[^\/]+\/u\/", userUrl):
try:
url = f"https://{webserver}/api/v3/user?username={userName}&sort=New&limit=50"
response = get(url)
if(response.status_code == 200):
comments = [post['post'] for post in response.json()['comments']]
posts = [post['post'] for post in response.json()['posts']]
all_posts = comments + posts
for post in all_posts:
post['url'] = post['ap_id']
return all_posts
except Exception as ex:
log(f"Error getting user posts for user {userName}: {ex}")
return None
def get_user_posts_misskey(userName, webserver):
# query user info via search api
# we could filter by host but there's no way to limit that to just the main host on firefish currently
# on misskey it works if you supply '.' as the host but firefish does not
userId = None
try:
url = f'https://{webserver}/api/users/search-by-username-and-host'
resp = post(url, { 'username': userName })
if resp.status_code == 200:
res = resp.json()
for user in res:
if user['host'] is None:
userId = user['id']
break
else:
log(f"Error finding user {userName} from {webserver}. Status Code: {resp.status_code}")
return None
except Exception as ex:
log(f"Error finding user {userName} from {webserver}. Exception: {ex}")
return None
if userId is None:
log(f'Error finding user {userName} from {webserver}: user not found on server in search')
return None
try:
url = f'https://{webserver}/api/users/notes'
resp = post(url, { 'userId': userId, 'limit': 40 })
if resp.status_code == 200:
notes = resp.json()
for note in notes:
if note.get('url') is None:
# add this to make it look like Mastodon status objects
note.update({ 'url': f"https://{webserver}/notes/{note['id']}" })
return notes
else:
log(f"Error getting posts by user {userName} from {webserver}. Status Code: {resp.status_code}")
return None
except Exception as ex:
log(f"Error getting posts by user {userName} from {webserver}. Exception: {ex}")
return None
def get_new_follow_requests(server, access_token, max, known_followings):
"""Get any new follow requests for the specified user, up to the max number provided"""
@@ -456,24 +354,21 @@ def get_reply_toots(user_id, server, access_token, seen_urls, reply_since):
)
def get_all_known_context_urls(server, reply_toots, parsed_urls, seen_hosts): def get_all_known_context_urls(server, reply_toots,parsed_urls):
"""get the context toots of the given toots from their original server""" """get the context toots of the given toots from their original server"""
known_context_urls = set() known_context_urls = set(
filter(
for toot in reply_toots: lambda url: not url.startswith(f"https://{server}/"),
if toot_has_parseable_url(toot, parsed_urls): itertools.chain.from_iterable(
url = toot["url"] if toot["reblog"] is None else toot["reblog"]["url"] get_toot_context(*parse_url(toot["url"] if toot["reblog"] is None else toot["reblog"]["url"],parsed_urls), toot["url"])
parsed_url = parse_url(url, parsed_urls) for toot in filter(
context = get_toot_context(parsed_url[0], parsed_url[1], url, seen_hosts) lambda toot: toot_has_parseable_url(toot,parsed_urls),
if context is not None: reply_toots
for item in context: )
known_context_urls.add(item) ),
else: )
log(f"Error getting context for toot {url}") )
known_context_urls = set(filter(lambda url: not url.startswith(f"https://{server}/"), known_context_urls))
log(f"Found {len(known_context_urls)} known context toots") log(f"Found {len(known_context_urls)} known context toots")
return known_context_urls return known_context_urls
@@ -538,11 +433,6 @@ def parse_user_url(url):
if match is not None:
return match
match = parse_lemmy_profile_url(url)
if match is not None:
return match
# Pixelfed profile paths do not use a subdirectory, so we need to match for them last.
match = parse_pixelfed_profile_url(url)
if match is not None:
return match
@@ -562,21 +452,11 @@ def parse_url(url, parsed_urls):
if match is not None:
parsed_urls[url] = match
if url not in parsed_urls:
match = parse_lemmy_url(url)
if match is not None:
parsed_urls[url] = match
if url not in parsed_urls:
match = parse_pixelfed_url(url)
if match is not None:
parsed_urls[url] = match
if url not in parsed_urls:
match = parse_misskey_url(url)
if match is not None:
parsed_urls[url] = match
if url not in parsed_urls:
log(f"Error parsing toot URL {url}")
parsed_urls[url] = None
@@ -586,7 +466,7 @@ def parse_url(url, parsed_urls):
def parse_mastodon_profile_url(url):
"""parse a Mastodon Profile URL and return the server and username"""
match = re.match(
r"https://(?P<server>[^/]+)/@(?P<username>[^/]+)", url r"https://(?P<server>.*)/@(?P<username>.*)", url
)
if match is not None:
return (match.group("server"), match.group("username"))
@@ -595,7 +475,7 @@ def parse_mastodon_profile_url(url):
def parse_mastodon_url(url):
"""parse a Mastodon URL and return the server and ID"""
match = re.match(
r"https://(?P<server>[^/]+)/@(?P<username>[^/]+)/(?P<toot_id>[^/]+)", url r"https://(?P<server>.*)/@(?P<username>.*)/(?P<toot_id>.*)", url
)
if match is not None:
return (match.group("server"), match.group("toot_id"))
@@ -604,14 +484,14 @@ def parse_mastodon_url(url):
def parse_pleroma_url(url):
"""parse a Pleroma URL and return the server and ID"""
match = re.match(r"https://(?P<server>[^/]+)/objects/(?P<toot_id>[^/]+)", url) match = re.match(r"https://(?P<server>.*)/objects/(?P<toot_id>.*)", url)
if match is not None:
server = match.group("server")
url = get_redirect_url(url)
if url is None:
return None
match = re.match(r"/notice/(?P<toot_id>[^/]+)", url) match = re.match(r"/notice/(?P<toot_id>.*)", url)
if match is not None:
return (server, match.group("toot_id"))
return None
@@ -619,7 +499,7 @@ def parse_pleroma_url(url):
def parse_pleroma_profile_url(url):
"""parse a Pleroma Profile URL and return the server and username"""
match = re.match(r"https://(?P<server>[^/]+)/users/(?P<username>[^/]+)", url) match = re.match(r"https://(?P<server>.*)/users/(?P<username>.*)", url)
if match is not None:
return (match.group("server"), match.group("username"))
return None
@@ -627,16 +507,7 @@ def parse_pleroma_profile_url(url):
def parse_pixelfed_url(url):
"""parse a Pixelfed URL and return the server and ID"""
match = re.match(
r"https://(?P<server>[^/]+)/p/(?P<username>[^/]+)/(?P<toot_id>[^/]+)", url r"https://(?P<server>.*)/p/(?P<username>.*)/(?P<toot_id>.*)", url
)
if match is not None:
return (match.group("server"), match.group("toot_id"))
return None
def parse_misskey_url(url):
"""parse a Misskey URL and return the server and ID"""
match = re.match(
r"https://(?P<server>[^/]+)/notes/(?P<toot_id>[^/]+)", url
)
if match is not None:
return (match.group("server"), match.group("toot_id"))
@@ -644,26 +515,11 @@ def parse_misskey_url(url):
def parse_pixelfed_profile_url(url):
"""parse a Pixelfed Profile URL and return the server and username"""
match = re.match(r"https://(?P<server>[^/]+)/(?P<username>[^/]+)", url) match = re.match(r"https://(?P<server>.*)/(?P<username>.*)", url)
if match is not None:
return (match.group("server"), match.group("username"))
return None
def parse_lemmy_url(url):
"""parse a Lemmy URL and return the server, and ID"""
match = re.match(
r"https://(?P<server>[^/]+)/(?:comment|post)/(?P<toot_id>[^/]+)", url
)
if match is not None:
return (match.group("server"), match.group("toot_id"))
return None
def parse_lemmy_profile_url(url):
"""parse a Lemmy Profile URL and return the server and username"""
match = re.match(r"https://(?P<server>[^/]+)/(?:u|c)/(?P<username>[^/]+)", url)
if match is not None:
return (match.group("server"), match.group("username"))
return None
def get_redirect_url(url):
"""get the URL given URL redirects to"""
@@ -688,37 +544,20 @@ def get_redirect_url(url):
return None
def get_all_context_urls(server, replied_toot_ids, seen_hosts): def get_all_context_urls(server, replied_toot_ids):
"""get the URLs of the context toots of the given toots""" """get the URLs of the context toots of the given toots"""
return filter( return filter(
lambda url: not url.startswith(f"https://{server}/"), lambda url: not url.startswith(f"https://{server}/"),
itertools.chain.from_iterable( itertools.chain.from_iterable(
get_toot_context(server, toot_id, url, seen_hosts) get_toot_context(server, toot_id, url)
for (url, (server, toot_id)) in replied_toot_ids
),
)
def get_toot_context(server, toot_id, toot_url, seen_hosts): def get_toot_context(server, toot_id, toot_url):
"""get the URLs of the context toots of the given toot""" """get the URLs of the context toots of the given toot"""
url = f"https://{server}/api/v1/statuses/{toot_id}/context"
post_server = get_server_info(server, seen_hosts)
if post_server is None:
log(f'server {server} not found for post')
return []
if post_server['mastodonApiSupport']:
return get_mastodon_urls(post_server['webserver'], toot_id, toot_url)
if post_server['lemmyApiSupport']:
return get_lemmy_urls(post_server['webserver'], toot_id, toot_url)
if post_server['misskeyApiSupport']:
return get_misskey_urls(post_server['webserver'], toot_id, toot_url)
log(f'unknown server api for {server}')
return []
def get_mastodon_urls(webserver, toot_id, toot_url):
url = f"https://{webserver}/api/v1/statuses/{toot_id}/context"
try:
resp = get(url)
except Exception as ex:
@@ -733,119 +572,17 @@ def get_mastodon_urls(webserver, toot_id, toot_url):
except Exception as ex:
log(f"Error parsing context for toot {toot_url}. Exception: {ex}")
return []
elif resp.status_code == 429:
reset = datetime.strptime(resp.headers['x-ratelimit-reset'], '%Y-%m-%dT%H:%M:%S.%fZ')
log(f"Rate Limit hit when getting context for {toot_url}. Waiting to retry at {resp.headers['x-ratelimit-reset']}")
time.sleep((reset - datetime.now()).total_seconds() + 1)
return get_toot_context(server, toot_id, toot_url)
log(
f"Error getting context for toot {toot_url}. Status code: {resp.status_code}"
)
return []
def get_lemmy_urls(webserver, toot_id, toot_url):
if toot_url.find("/comment/") != -1:
return get_lemmy_comment_context(webserver, toot_id, toot_url)
if toot_url.find("/post/") != -1:
return get_lemmy_comments_urls(webserver, toot_id, toot_url)
else:
log(f'unknown lemmy url type {toot_url}')
return []
def get_lemmy_comment_context(webserver, toot_id, toot_url):
"""get the URLs of the context toots of the given toot"""
comment = f"https://{webserver}/api/v3/comment?id={toot_id}"
try:
resp = get(comment)
except Exception as ex:
log(f"Error getting comment {toot_id} from {toot_url}. Exception: {ex}")
return []
if resp.status_code == 200:
try:
res = resp.json()
post_id = res['comment_view']['comment']['post_id']
return get_lemmy_comments_urls(webserver, post_id, toot_url)
except Exception as ex:
log(f"Error parsing context for comment {toot_url}. Exception: {ex}")
return []
def get_lemmy_comments_urls(webserver, post_id, toot_url):
"""get the URLs of the comments of the given post"""
urls = []
url = f"https://{webserver}/api/v3/post?id={post_id}"
try:
resp = get(url)
except Exception as ex:
log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
return []
if resp.status_code == 200:
try:
res = resp.json()
if res['post_view']['counts']['comments'] == 0:
return []
urls.append(res['post_view']['post']['ap_id'])
except Exception as ex:
log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
url = f"https://{webserver}/api/v3/comment/list?post_id={post_id}&sort=New&limit=50"
try:
resp = get(url)
except Exception as ex:
log(f"Error getting comments for post {post_id} from {toot_url}. Exception: {ex}")
return []
if resp.status_code == 200:
try:
res = resp.json()
list_of_urls = [comment_info['comment']['ap_id'] for comment_info in res['comments']]
log(f"Got {len(list_of_urls)} comments for post {toot_url}")
urls.extend(list_of_urls)
return urls
except Exception as ex:
log(f"Error parsing comments for post {toot_url}. Exception: {ex}")
log(f"Error getting comments for post {toot_url}. Status code: {resp.status_code}")
return []
def get_misskey_urls(webserver, post_id, toot_url):
"""get the URLs of the comments of a given misskey post"""
urls = []
url = f"https://{webserver}/api/notes/children"
try:
resp = post(url, { 'noteId': post_id, 'limit': 100, 'depth': 12 })
except Exception as ex:
log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
return []
if resp.status_code == 200:
try:
res = resp.json()
log(f"Got children for misskey post {toot_url}")
list_of_urls = [f'https://{webserver}/notes/{comment_info["id"]}' for comment_info in res]
urls.extend(list_of_urls)
except Exception as ex:
log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
else:
log(f"Error getting post {post_id} from {toot_url}. Status Code: {resp.status_code}")
url = f"https://{webserver}/api/notes/conversation"
try:
resp = post(url, { 'noteId': post_id, 'limit': 100 })
except Exception as ex:
log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
return []
if resp.status_code == 200:
try:
res = resp.json()
log(f"Got conversation for misskey post {toot_url}")
list_of_urls = [f'https://{webserver}/notes/{comment_info["id"]}' for comment_info in res]
urls.extend(list_of_urls)
except Exception as ex:
log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
else:
log(f"Error getting post {post_id} from {toot_url}. Status Code: {resp.status_code}")
return urls
def add_context_urls(server, access_token, context_urls, seen_urls):
"""add the given toot URLs to the server"""
@@ -886,6 +623,11 @@ def add_context_url(url, server, access_token):
"Make sure you have the read:search scope enabled for your access token."
)
return False
elif resp.status_code == 429:
reset = datetime.strptime(resp.headers['x-ratelimit-reset'], '%Y-%m-%dT%H:%M:%S.%fZ')
log(f"Rate Limit hit when adding url {search_url}. Waiting to retry at {resp.headers['x-ratelimit-reset']}")
time.sleep((reset - datetime.now()).total_seconds() + 1)
return add_context_url(url, server, access_token)
else:
log(
f"Error adding url {search_url} to server {server}. Status code: {resp.status_code}"
@@ -922,30 +664,12 @@ def get_paginated_mastodon(url, max, headers = {}, timeout = 0, max_tries = 5):
if(isinstance(max, int)):
while len(result) < max and 'next' in response.links:
response = get(response.links['next']['url'], headers, timeout, max_tries)
if response.status_code != 200: result = result + response.json()
raise Exception(
f"Error getting URL {response.url}. \
Status code: {response.status_code}"
)
response_json = response.json()
if isinstance(response_json, list):
result += response_json
else: else:
break while parser.parse(result[-1]['created_at']) >= max and 'next' in response.links:
else:
while result and parser.parse(result[-1]['created_at']) >= max \
and 'next' in response.links:
response = get(response.links['next']['url'], headers, timeout, max_tries) response = get(response.links['next']['url'], headers, timeout, max_tries)
if response.status_code != 200: result = result + response.json()
raise Exception(
f"Error getting URL {response.url}. \
Status code: {response.status_code}"
)
response_json = response.json()
if isinstance(response_json, list):
result += response_json
else:
break
return result
@@ -971,61 +695,9 @@ def get(url, headers = {}, timeout = 0, max_tries = 5):
raise Exception(f"Maximum number of retries exceeded for rate limited request {url}")
return response
def post(url, json, headers = {}, timeout = 0, max_tries = 5):
"""A simple wrapper to make a post request while providing our user agent, and respecting rate limits"""
h = headers.copy()
if 'User-Agent' not in h:
h['User-Agent'] = 'FediFetcher (https://go.thms.uk/mgr)'
if timeout == 0:
timeout = arguments.http_timeout
response = requests.post( url, json=json, headers= h, timeout=timeout)
if response.status_code == 429:
if max_tries > 0:
reset = parser.parse(response.headers['x-ratelimit-reset'])
now = datetime.now(datetime.now().astimezone().tzinfo)
wait = (reset - now).total_seconds() + 1
log(f"Rate Limit hit requesting {url}. Waiting {wait} sec to retry at {response.headers['x-ratelimit-reset']}")
time.sleep(wait)
return post(url, json, headers, timeout, max_tries - 1)
raise Exception(f"Maximum number of retries exceeded for rate limited request {url}")
return response
def log(text):
print(f"{datetime.now()} {datetime.now().astimezone().tzinfo}: {text}")
class ServerList:
def __init__(self, iterable):
self._dict = {}
for item in iterable:
if('last_checked' in iterable[item]):
iterable[item]['last_checked'] = parser.parse(iterable[item]['last_checked'])
self.add(item, iterable[item])
def add(self, key, item):
self._dict[key] = item
def get(self, key):
return self._dict[key]
def pop(self,key):
return self._dict.pop(key)
def __contains__(self, item):
return item in self._dict
def __iter__(self):
return iter(self._dict)
def __len__(self):
return len(self._dict)
def toJSON(self):
return json.dumps(self._dict,default=str)
class OrderedSet:
"""An ordered set implementation over a dict"""
@@ -1068,160 +740,8 @@ class OrderedSet:
return len(self._dict)
def toJSON(self):
return json.dumps(self._dict,default=str) return json.dump(self._dict, f, default=str)
def get_server_from_host_meta(server):
url = f'https://{server}/.well-known/host-meta'
try:
resp = get(url, timeout = 30)
except Exception as ex:
log(f"Error getting host meta for {server}. Exception: {ex}")
return None
if resp.status_code == 200:
try:
hostMeta = ET.fromstring(resp.text)
lrdd = hostMeta.find('.//{http://docs.oasis-open.org/ns/xri/xrd-1.0}Link[@rel="lrdd"]')
url = lrdd.get('template')
match = re.match(
r"https://(?P<server>[^/]+)/", url
)
if match is not None:
return match.group("server")
else:
raise Exception(f'server not found in lrdd for {server}')
return None
except Exception as ex:
log(f'Error parsing host meta for {server}. Exception: {ex}')
return None
else:
log(f'Error getting host meta for {server}. Status Code: {resp.status_code}')
return None
def get_nodeinfo(server, seen_hosts, host_meta_fallback = False):
url = f'https://{server}/.well-known/nodeinfo'
try:
resp = get(url, timeout = 30)
except Exception as ex:
log(f"Error getting host node info for {server}. Exception: {ex}")
return None
# if well-known nodeinfo isn't found, try to check host-meta for a webfinger URL
# needed on servers where the display domain is different than the web domain
if resp.status_code != 200 and not host_meta_fallback:
# not found, try to check host-meta as a fallback
log(f'nodeinfo for {server} not found, checking host-meta')
new_server = get_server_from_host_meta(server)
if new_server is not None:
if new_server == server:
log(f'host-meta for {server} did not get a new server.')
return None
else:
return get_nodeinfo(new_server, seen_hosts, True)
else:
return None
if resp.status_code == 200:
try:
nodeInfo = resp.json()
for link in nodeInfo['links']:
if link['rel'] in [
'http://nodeinfo.diaspora.software/ns/schema/2.0',
'http://nodeinfo.diaspora.software/ns/schema/2.1',
]:
nodeLoc = link['href']
break
except Exception as ex:
log(f'error getting server {server} info from well-known node info. Exception: {ex}')
return None
else:
log(f'Error getting well-known host node info for {server}. Status Code: {resp.status_code}')
return None
if nodeLoc is None:
log(f'could not find link to node info in well-known nodeinfo of {server}')
return None
# regrab server from nodeLoc, again in the case of different display and web domains
match = re.match(
r"https://(?P<server>[^/]+)/", nodeLoc
)
if match is None:
log(f"Error getting web server name from {server}.")
return None
server = match.group('server')
# return early if the web domain has been seen previously (in cases with host-meta lookups)
if server in seen_hosts:
return seen_hosts.get(server)
try:
resp = get(nodeLoc, timeout = 30)
except Exception as ex:
log(f"Error getting host node info for {server}. Exception: {ex}")
return None
if resp.status_code == 200:
try:
nodeInfo = resp.json()
if 'activitypub' not in nodeInfo['protocols']:
log(f'server {server} does not support activitypub, skipping')
return None
return {
'webserver': server,
'software': nodeInfo['software']['name'],
'version': nodeInfo['software']['version'],
'rawnodeinfo': nodeInfo,
}
except Exception as ex:
log(f'error getting server {server} info from nodeinfo. Exception: {ex}')
return None
else:
log(f'Error getting host node info for {server}. Status Code: {resp.status_code}')
return None
def get_server_info(server, seen_hosts):
if server in seen_hosts:
serverInfo = seen_hosts.get(server)
if('info' in serverInfo and serverInfo['info'] == None):
return None
return serverInfo
nodeinfo = get_nodeinfo(server, seen_hosts)
if nodeinfo is None:
seen_hosts.add(server, {
'info': None,
'last_checked': datetime.now()
})
else:
set_server_apis(nodeinfo)
seen_hosts.add(server, nodeinfo)
if server is not nodeinfo['webserver']:
seen_hosts.add(nodeinfo['webserver'], nodeinfo)
return nodeinfo
def set_server_apis(server):
# support for new server software should be added here
software_apis = {
'mastodonApiSupport': ['mastodon', 'pleroma', 'akkoma', 'pixelfed', 'hometown'],
'misskeyApiSupport': ['misskey', 'calckey', 'firefish', 'foundkey'],
'lemmyApiSupport': ['lemmy']
}
# software that has specific API support but is not compatible with FediFetcher for various reasons:
# * gotosocial - All Mastodon APIs require access token (https://github.com/superseriousbusiness/gotosocial/issues/2038)
for api, softwareList in software_apis.items():
server[api] = server['software'] in softwareList
# search `features` list in metadata if available
if 'metadata' in server['rawnodeinfo'] and 'features' in server['rawnodeinfo']['metadata'] and type(server['rawnodeinfo']['metadata']['features']) is list:
features = server['rawnodeinfo']['metadata']['features']
if 'mastodon_api' in features:
server['mastodonApiSupport'] = True
server['last_checked'] = datetime.now()
if __name__ == "__main__": if __name__ == "__main__":
start = datetime.now() start = datetime.now()
@ -1230,26 +750,6 @@ if __name__ == "__main__":
arguments = argparser.parse_args() arguments = argparser.parse_args()
if(arguments.config != None):
if os.path.exists(arguments.config):
with open(arguments.config, "r", encoding="utf-8") as f:
config = json.load(f)
for key in config:
setattr(arguments, key.lower().replace('-','_'), config[key])
else:
log(f"Config file {arguments.config} doesn't exist")
sys.exit(1)
if(arguments.server == None or arguments.access_token == None):
log("You must supply at least a server name and an access token")
sys.exit(1)
# in case someone provided the server name as url instead,
setattr(arguments, 'server', re.sub(r"^(https://)?([^/]*)/?$", "\\2", arguments.server))
runId = uuid.uuid4()
if(arguments.on_start != None and arguments.on_start != ''):
@@ -1299,7 +799,6 @@ if __name__ == "__main__":
REPLIED_TOOT_SERVER_IDS_FILE = os.path.join(arguments.state_dir, "replied_toot_server_ids")
KNOWN_FOLLOWINGS_FILE = os.path.join(arguments.state_dir, "known_followings")
RECENTLY_CHECKED_USERS_FILE = os.path.join(arguments.state_dir, "recently_checked_users")
SEEN_HOSTS_FILE = os.path.join(arguments.state_dir, "seen_hosts")
seen_urls = OrderedSet([])
@@ -1333,25 +832,6 @@ if __name__ == "__main__":
all_known_users = OrderedSet(list(known_followings) + list(recently_checked_users))
if os.path.exists(SEEN_HOSTS_FILE):
with open(SEEN_HOSTS_FILE, "r", encoding="utf-8") as f:
seen_hosts = ServerList(json.load(f))
for host in list(seen_hosts):
serverInfo = seen_hosts.get(host)
if 'last_checked' in serverInfo:
serverAge = datetime.now(serverInfo['last_checked'].tzinfo) - serverInfo['last_checked']
if(serverAge.total_seconds() > arguments.remember_hosts_for_days * 24 * 60 * 60 ):
seen_hosts.pop(host)
elif('info' in serverInfo and serverInfo['info'] == None and serverAge.total_seconds() > 60 * 60 ):
# Don't cache failures for more than 24 hours
seen_hosts.pop(host)
else:
seen_hosts = ServerList({})
if(isinstance(arguments.access_token, str)):
setattr(arguments, 'access_token', [arguments.access_token])
for token in arguments.access_token:
if arguments.reply_interval_in_hours > 0:
@@ -1361,19 +841,19 @@ if __name__ == "__main__":
reply_toots = get_all_reply_toots(
arguments.server, user_ids, token, seen_urls, arguments.reply_interval_in_hours
)
known_context_urls = get_all_known_context_urls(arguments.server, reply_toots,parsed_urls, seen_hosts) known_context_urls = get_all_known_context_urls(arguments.server, reply_toots,parsed_urls)
seen_urls.update(known_context_urls)
replied_toot_ids = get_all_replied_toot_server_ids(
arguments.server, reply_toots, replied_toot_server_ids, parsed_urls
)
context_urls = get_all_context_urls(arguments.server, replied_toot_ids, seen_hosts) context_urls = get_all_context_urls(arguments.server, replied_toot_ids)
add_context_urls(arguments.server, token, context_urls, seen_urls)
if arguments.home_timeline_length > 0:
"""Do the same with any toots on the key owner's home timeline """
timeline_toots = get_timeline(arguments.server, token, arguments.home_timeline_length)
known_context_urls = get_all_known_context_urls(arguments.server, timeline_toots,parsed_urls, seen_hosts) known_context_urls = get_all_known_context_urls(arguments.server, timeline_toots,parsed_urls)
add_context_urls(arguments.server, token, known_context_urls, seen_urls)
# Backfill any post authors, and any mentioned users
@@ -1395,40 +875,40 @@ if __name__ == "__main__":
if user not in mentioned_users and user['acct'] not in all_known_users:
mentioned_users.append(user)
add_user_posts(arguments.server, token, filter_known_users(mentioned_users, all_known_users), recently_checked_users, all_known_users, seen_urls, seen_hosts) add_user_posts(arguments.server, token, filter_known_users(mentioned_users, all_known_users), recently_checked_users, all_known_users, seen_urls)
if arguments.max_followings > 0:
log(f"Getting posts from last {arguments.max_followings} followings")
user_id = get_user_id(arguments.server, arguments.user, token)
followings = get_new_followings(arguments.server, user_id, arguments.max_followings, all_known_users)
add_user_posts(arguments.server, token, followings, known_followings, all_known_users, seen_urls, seen_hosts) add_user_posts(arguments.server, token, followings, known_followings, all_known_users, seen_urls)
if arguments.max_followers > 0:
log(f"Getting posts from last {arguments.max_followers} followers")
user_id = get_user_id(arguments.server, arguments.user, token)
followers = get_new_followers(arguments.server, user_id, arguments.max_followers, all_known_users)
add_user_posts(arguments.server, token, followers, recently_checked_users, all_known_users, seen_urls, seen_hosts) add_user_posts(arguments.server, token, followers, recently_checked_users, all_known_users, seen_urls)
if arguments.max_follow_requests > 0:
log(f"Getting posts from last {arguments.max_follow_requests} follow requests")
follow_requests = get_new_follow_requests(arguments.server, token, arguments.max_follow_requests, all_known_users)
add_user_posts(arguments.server, token, follow_requests, recently_checked_users, all_known_users, seen_urls, seen_hosts) add_user_posts(arguments.server, token, follow_requests, recently_checked_users, all_known_users, seen_urls)
if arguments.from_notifications > 0:
log(f"Getting notifications for last {arguments.from_notifications} hours")
notification_users = get_notification_users(arguments.server, token, all_known_users, arguments.from_notifications)
add_user_posts(arguments.server, token, notification_users, recently_checked_users, all_known_users, seen_urls, seen_hosts) add_user_posts(arguments.server, token, notification_users, recently_checked_users, all_known_users, seen_urls)
if arguments.max_bookmarks > 0:
log(f"Pulling replies to the last {arguments.max_bookmarks} bookmarks")
bookmarks = get_bookmarks(arguments.server, token, arguments.max_bookmarks)
known_context_urls = get_all_known_context_urls(arguments.server, bookmarks,parsed_urls, seen_hosts) known_context_urls = get_all_known_context_urls(arguments.server, bookmarks,parsed_urls)
add_context_urls(arguments.server, token, known_context_urls, seen_urls)
if arguments.max_favourites > 0:
log(f"Pulling replies to the last {arguments.max_favourites} favourites")
favourites = get_favourites(arguments.server, token, arguments.max_favourites)
known_context_urls = get_all_known_context_urls(arguments.server, favourites,parsed_urls, seen_hosts) known_context_urls = get_all_known_context_urls(arguments.server, favourites,parsed_urls)
add_context_urls(arguments.server, token, known_context_urls, seen_urls)
with open(KNOWN_FOLLOWINGS_FILE, "w", encoding="utf-8") as f:
@@ -1441,10 +921,7 @@ if __name__ == "__main__":
json.dump(dict(list(replied_toot_server_ids.items())[-10000:]), f)
with open(RECENTLY_CHECKED_USERS_FILE, "w", encoding="utf-8") as f:
f.write(recently_checked_users.toJSON()) recently_checked_users.toJSON()
with open(SEEN_HOSTS_FILE, "w", encoding="utf-8") as f:
f.write(seen_hosts.toJSON())
os.remove(LOCK_FILE)
requirements.txt
View file
@@ -2,9 +2,7 @@ certifi==2022.12.7
charset-normalizer==3.0.1
docutils==0.19
idna==3.4
python-dateutil==2.8.2
requests==2.28.2
six==1.16.0
smmap==5.0.0
urllib3==1.26.14
defusedxml==0.7.1 python-dateutil==2.8.2