As of Python 3.9, it's no longer necessary to import things like List
and Dict from the typing module, and we can just use the built-in types
like this.
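For example (count_tags is just a made-up name for illustration), an annotation that previously needed imports from typing:

from typing import Dict, List

def count_tags(tags: List[str]) -> Dict[str, int]:
    ...

can now be written with the built-in types directly:

def count_tags(tags: list[str]) -> dict[str, int]:
    ...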
This also involved installing some new type-stub packages for a few of the
major third-party libraries.
I also had to change some of the imports in some model files in strange
ways; I'm not sure why some of those changes were necessary. I suspect this
might be a bug in mypy, but I'm not sure whether I'll be able to build a
reproduction of it to be able to report it.
This adds the backend pieces (no interface yet) to configure Lua scripts
that will be applied to topics and comments in response to different events.
Initially, it only supports running a script when a new topic or comment
is posted. For example, here is a Lua script that would prepend a new
topic's title with "[Text] " or "[Link] " depending on its type, as well
as replace its tags with either "text" or "link":
function on_topic_post (topic)
    if (topic.is_text_type) then
        topic.title = "[Text] " .. topic.title
        topic.tags = {"text"}
    elseif (topic.is_link_type) then
        topic.title = "[Link] " .. topic.title
        topic.tags = {"link"}
    end
end
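For a rough idea of how the Python side could invoke such a hook, here's an illustrative sketch using the lupa library; it skips the sandboxing entirely and the function/field names are made up, so it's not the actual implementation:

from lupa import LuaRuntime

def run_topic_post_hook(script_source: str, topic_fields: dict) -> dict:
    # Illustrative only: no sandboxing, instruction limits, or memory limits here.
    lua = LuaRuntime(register_eval=False, unpack_returned_tuples=True)
    lua.execute(script_source)
    hook = lua.globals().on_topic_post
    if hook is None:
        return topic_fields
    topic_table = lua.table_from(topic_fields)
    hook(topic_table)
    # Copy any changes the script made back out (tags will come back as a Lua table).
    return {key: topic_table[key] for key in topic_fields}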
There can be a global script as well as group-specific scripts, and the
scripts are sandboxed, with limited access to data and only a restricted
subset of Lua's built-in functions available. The Lua sandboxing
code comes from Splash (https://github.com/scrapinghub/splash). It will
need to be modified, but this commit keeps it unmodified so that future
changes can be more easily tracked by comparing to the original state of
the file.
The sandboxing also includes some restrictions on the number of instructions
executed and on memory usage, but this might be more effectively managed at
the OS level. More research will still need to be done on security and
resource restrictions before this feature can be safely opened up to users.
This adds some very simple metrics to all of the background jobs that
consume the event streams. Currently, the only "real" metric is a
counter tracking how many messages have been processed by that consumer,
but a lot of the value will come from being able to utilize the
automatic "up" metric provided by Prometheus to monitor and make sure
that all of the jobs are running.
I decided to use ports starting from 25010 for these jobs. This is
completely arbitrary; it's just a fairly large range of unassigned ports, so
it shouldn't conflict with anything.
I'm not a fan of how much hard-coding is involved here for the different
ports and jobs in the Prometheus config, but it's also not a big deal.
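On the job side, the setup is basically just this (an illustrative sketch using prometheus_client; the metric/function names are made up):

from prometheus_client import Counter, start_http_server

# Each consumer exposes its metrics on its own port (starting from 25010),
# which also gives us the automatic "up" metric per job.
MESSAGES_PROCESSED = Counter(
    "consumer_messages_processed_total",
    "Number of stream messages processed by this consumer",
)

start_http_server(25010)

def handle_message(message) -> None:
    # ... actual processing of the message ...
    MESSAGES_PROCESSED.inc()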
RabbitMQ was used to support asynchronous/background processing tasks,
such as determining word count for text topics and scraping the
destinations or relevant APIs for link topics. This commit replaces
RabbitMQ's role (as the message broker) with Redis streams.
This included building a new "PostgreSQL to Redis bridge" that takes
over the previous role of pg-amqp-bridge: listening for NOTIFY messages
on a particular PostgreSQL channel and translating them to messages in
appropriate Redis streams.
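The core loop of the bridge looks roughly like this (a simplified sketch; the connection string, channel name, and payload format are made up for illustration):

import select

import psycopg2
import psycopg2.extensions
from redis import Redis

conn = psycopg2.connect("dbname=tildes")  # hypothetical connection string
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
redis = Redis()

with conn.cursor() as cursor:
    cursor.execute("LISTEN postgresql_events;")  # hypothetical channel name

while True:
    # wait up to 5 seconds for a NOTIFY to arrive
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        # assume the payload looks like "<stream name>:<message data>"
        stream, _, data = notify.payload.partition(":")
        redis.xadd(stream, {"data": data})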
One particular change of note is that the names of message "sources"
were adjusted a little and standardized. For example, the routing key
for a message caused by a new comment was previously "comment.created",
but is now "comments.insert". Similarly, "comment.edited" became
"comments.update.markdown". The new naming scheme uses the table name,
proper name for the SQL operation, and column name instead of the
previous unpredictable terms.
This adds settings into pyproject.toml for the isort tool to match up
with the styles I've generally been using, and then applies it to the
whole project (by running "isort -rc").
Most of these changes are very minor, but it's good to fix the few
inconsistencies that were around.
This changes the "activity" topic-sorting method to look for
"interesting" activity instead of everything, and adds a new "All
activity" method that retains the previous behavior.
Currently, "interesting activity" excludes any comments that have active
Noise, Offtopic, or Malice labels, or any of their children. These
checks are also applied based on labeling activity: for example, if someone
posts a new comment it will bump the thread initially, but if that comment
is then labeled as Noise, the thread will "un-bump" and go back to its
previous position in the Activity sort.
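In rough pseudocode, the exclusion rule is something like this (the attribute names are hypothetical; the real check is driven by labeling activity as described above):

EXCLUDED_LABELS = {"Noise", "Offtopic", "Malice"}

def counts_as_interesting(comment) -> bool:
    # A comment is excluded if it, or any comment above it in the thread,
    # currently has an active Noise, Offtopic, or Malice label.
    node = comment
    while node is not None:
        if EXCLUDED_LABELS & {label.name for label in node.active_labels}:
            return False
        node = node.parent  # hypothetical parent-comment attribute
    return True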
There were also some minor changes to the appearance to support adding
another sorting option, such as shortening the displayed names on the
"tabs" (for example, showing "Votes" instead of "Most votes"). This probably
needs some further work, but is okay for now.
Using bootstrap() seems to cause issues with re-declaring the Prometheus
metrics (which happens in the tweens that we don't really need or want
anyway). There might be better ways to do this, including not attaching the
tweens for scripts, but this seems to work fine (and was already being done
this way in the YouTube API consumer).
The site-icons spritesheet has already become unwieldy - it's almost 1 MB,
consists mostly of rarely-needed icons, and needs to be fully replaced and
re-downloaded whenever a new icon is added. With HTTP/2 now being widely
supported, spritesheets seem to be mostly obsolete, and I probably never
should have done it that way in the first place.
This commit changes over to simply using individual icon images, and
rebuilds the CSS file whenever new icons are downloaded. This new CSS
file will probably be somewhat large, but should gzip extremely well.
This probably still needs some work to support cache-busting on the CSS
file.
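The CSS generation itself is simple - roughly something like this (the paths and class-naming scheme are illustrative, not the actual build code):

from pathlib import Path

ICONS_DIR = Path("static/images/site-icons")  # hypothetical path
OUTPUT_FILE = Path("static/css/site-icons.css")  # hypothetical path

def rebuild_site_icons_css() -> None:
    # one rule per icon image, instead of background offsets into a spritesheet
    rules = []
    for icon in sorted(ICONS_DIR.glob("*.png")):
        name = icon.stem.replace(".", "_")
        rules.append(
            f'.topic-icon-{name} {{ background-image: url("/images/site-icons/{icon.name}"); }}'
        )
    OUTPUT_FILE.write_text("\n".join(rules) + "\n")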
A lot of the code in common between this and the EmbedlyScraper should
probably be generalized out to a base class soon, but let's make sure
this works first.
mypy 0.640 has made it so that it's no longer necessary to annotate the
return type of __init__ methods, since it's always None. The only time it's
still necessary is when the method doesn't have any arguments (other than
self), since the annotation is what indicates that the method should still
be type-checked.
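To illustrate (the class names are just examples):

class Topic:
    # mypy 0.640+ infers the None return type, so "-> None" isn't needed here
    def __init__(self, title: str):
        self.title = title

class TagTracker:
    # no arguments to annotate, so "-> None" is still needed to mark the
    # method as typed and have mypy check its body
    def __init__(self) -> None:
        self.count = 0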
This adds a trigger to the scraper_results table which will add RabbitMQ
messages whenever a scrape finishes, as well as a consumer that picks up
these messages and uses Embedly data to download (and resize if necessary)
the favicons from any sites that are scraped. These are
downloaded into the input folder for the site-icons-spriter, so it
should be able to use these to generate spritesheets.
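The download/resize step is roughly this (a simplified sketch using requests and Pillow; the paths and the 32x32 size are assumptions):

from io import BytesIO
from pathlib import Path

import requests
from PIL import Image

ICON_INPUT_DIR = Path("site-icons-spriter/input")  # hypothetical path

def download_favicon(domain: str, favicon_url: str) -> None:
    response = requests.get(favicon_url, timeout=5)
    response.raise_for_status()
    image = Image.open(BytesIO(response.content))
    # shrink the favicon if it's larger than 32x32
    if image.width > 32 or image.height > 32:
        image.thumbnail((32, 32))
    image.save(ICON_INPUT_DIR / f"{domain}.png")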
This isn't working very well in a lot of cases, and shouldn't be used until
I've got some workarounds for a lot of the issues that I'm finding.
This reverts commit 369f273f8e.
As part of scraping a link, Embedly will often remove tracking variables
from the query string, follow redirects, and so on. This will start using
the URL returned in an Embedly result to replace the one that was originally
submitted when it's different (though the original one will still be kept in
the original_url column).
Not really a big deal, but deleted topics are getting sent back through
this consumer when the clean_private_data script erases their data,
since that changes the markdown and puts them into the topic.edited
queue. There shouldn't be any reason to process deleted topics and
re-add "blank" metadata (0 word count, no excerpt), so we can just skip
them.
This adds a consumer (in prod only) that uses Embedly's Extract API to
scrape the links from all new link topics and stores some of the data in
the topic's content_metadata column.
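The scrape step is essentially a call like this (an illustrative sketch; key handling and which fields get kept are simplified):

import requests

EMBEDLY_EXTRACT_URL = "https://api.embed.ly/1/extract"

def scrape_link(url: str, api_key: str) -> dict:
    response = requests.get(
        EMBEDLY_EXTRACT_URL,
        params={"key": api_key, "url": url},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # keep only a few of the returned fields (this selection is illustrative)
    return {
        key: data[key]
        for key in ("title", "description", "provider_name")
        if key in data
    }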
Previously, any topic processed by this consumer would have its
content_metadata completely replaced. This won't work once other
consumers or processes start being able to set that data, since we don't
know that this one will always run first.
This commit updates the method the consumer uses so that it merges its data
with anything that's already in the topic's content_metadata column instead
of replacing it. It would probably be good to generalize this method out
somehow so that it can be used in other places more easily.
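As a sketch of the merge behavior (simplified, with illustrative attribute names):

def merge_content_metadata(topic, new_metadata: dict) -> None:
    # keep whatever other consumers/processes have already stored, and only
    # add or overwrite the keys this consumer is responsible for
    existing = topic.content_metadata or {}
    topic.content_metadata = {**existing, **new_metadata}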
This follows the REUSE practices to add license and copyright info to
all source files: https://reuse.software/practices/2.0/
In addition, LICENSE.md was switched to a plaintext LICENSE file, to
support the tag-value header as recommended.
Note that files that are closer to configuration than code did not have
headers added. This includes all Salt files, Alembic files, and Python
files such as most __init__.py files that only import other files, since
those are similar to header files which are not considered
copyrightable.
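The added headers follow the REUSE tag-value style, so the top of each Python source file looks roughly like this (placeholders here rather than the actual values):

# Copyright (c) <year> <copyright holders>
# SPDX-License-Identifier: <license identifier>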
This commit contains only changes that were made automatically by Black
(except for some minor fixes to string un-wrapping and two
format-disabling blocks in the user and group schemas). Some manual
cleanup/adjustments will probably need to be made in a follow-up commit,
but this one contains the result of running Black on the codebase
without significant further manual tweaking.
This detects mentions of users in comments using the same pattern as the
markdown parsing uses to generate user links. Mentioned users are sent a
notification, and mentions are added/deleted if needed on comment edits.
As part of this, setup was done to generate RabbitMQ messages for comment
creation and edits, and the mentions are handled by an async consumer of
these messages.
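The detection itself is essentially a regex scan over the comment markdown, conceptually like this (the pattern below is a simplified stand-in; the real one is shared with the markdown user-link parsing):

import re
from typing import Set

# simplified stand-in: "@" followed by a username of word characters/hyphens
MENTION_PATTERN = re.compile(r"@([\w-]+)")

def find_mentioned_users(markdown: str) -> Set[str]:
    return {username.lower() for username in MENTION_PATTERN.findall(markdown)}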