<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="shortcut icon" href="../img/favicon.ico" />
<title>Concept of the Extractor - NewPipe Development Documentation</title>
<!-- local fonts -->
<link rel="stylesheet" href="../css/local_fonts.css" type="text/css" />
<link rel="stylesheet" href="../css/theme.css" type="text/css" />
<link rel="stylesheet" href="../css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="../css/theme_child.css" type="text/css" />
<!-- local code syntax highlighting -->
<link rel="stylesheet" href="../css/github.min.css" type="text/css" />
<link rel="stylesheet" href="../css/highlight.css" type="text/css" />
<script>
// Current page data
var mkdocs_page_name = "Concept of the Extractor";
var mkdocs_page_input_path = "01_Concept_of_the_extractor.md";
var mkdocs_page_url = null;
</script>
<script src="../js/jquery-2.1.1.min.js" defer></script>
<script src="../js/modernizr-2.8.3.min.js" defer></script>
<script src="../js/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href=".." class="icon icon-home"> NewPipe Development Documentation
</a><div role="search">
<form id ="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" title="Type search term here" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<ul>
<li class="toctree-l1"><a class="reference internal" href="..">Welcome to the NewPipe Development Docs</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../00_Prepare_everything/">Before You Start</a>
</li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal current" href="./">Concept of the Extractor</a>
<ul class="current">
<li class="toctree-l2"><a class="reference internal" href="#the-collectorextractor-pattern">The Collector/Extractor Pattern</a>
<ul>
<li class="toctree-l3"><a class="reference internal" href="#usage-in-the-front-end">Usage in the Front End</a>
</li>
<li class="toctree-l3"><a class="reference internal" href="#typical-implementation-of-a-single-data-extractor">Typical Implementation of a Single Data Extractor</a>
</li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="#collectorextractor-pattern-for-lists">Collector/Extractor Pattern for Lists</a>
</li>
<li class="toctree-l2"><a class="reference internal" href="#listextractor">ListExtractor</a>
</li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../02_Concept_of_LinkHandler/">Concept of the LinkHandler</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../03_Implement_a_service/">Implementing a Service</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../04_Run_changes_in_App/">Testing Your Changes in the App</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../05_Mock_tests/">Mock Tests</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../06_releasing/">Releasing a New NewPipe Version</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../07_release_instructions/">Release instructions for normal releases</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../08_documentation/">About This Documentation</a>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../09_maintainers_view/">Maintainers' Section</a>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="Mobile navigation menu">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="..">NewPipe Development Documentation</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content"><div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href=".." class="icon icon-home" aria-label="Docs"></a> &raquo;</li>
<li class="breadcrumb-item active">Concept of the Extractor</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div class="section" itemprop="articleBody">
<h1 id="concept-of-the-extractor">Concept of the Extractor</h1>
<h2 id="the-collectorextractor-pattern">The Collector/Extractor Pattern</h2>
<p>Before you start coding your own service, you need to understand the basic concept of the extractor itself. There is a pattern
you will find all over the code, called the <strong>collector/extractor</strong> pattern. The idea behind it is that
the <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html">extractor</a>
produces fragments of data, and the collector collects them and assembles that data into a readable format for the front end.
The collector also controls the parsing process and takes care of error handling: if the extractor fails at any
point, the collector decides whether or not it should continue parsing. This requires the extractor to be made up of
multiple methods, one method for every data field the collector wants to have. The collectors are provided by NewPipe;
you only need to take care of the extractors.</p>
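<p>The division of labour described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not NewPipe's actual collector: <code>CollectorPatternSketch</code>, <code>FieldExtractor</code>, <code>Info</code> and <code>collect()</code> are hypothetical names, and the choice of which field is mandatory and which is optional is made up for the example.</p>

```java
/** Hypothetical names throughout; none of these types are part of NewPipeExtractor. */
class CollectorPatternSketch {

    /** The extractor exposes one method per data field. */
    interface FieldExtractor {
        String getTitle() throws Exception;
        String getUploader() throws Exception;
    }

    /** The assembled result that is handed to the front end. */
    static final class Info {
        String title;
        String uploader = "";
        final StringBuilder errors = new StringBuilder();
    }

    /** The collector drives the extraction and decides which failures are fatal. */
    static Info collect(FieldExtractor extractor) throws Exception {
        Info info = new Info();

        // Assumed to be mandatory: a failure here aborts the whole extraction.
        info.title = extractor.getTitle();

        // Assumed to be optional: record the error and continue parsing.
        try {
            info.uploader = extractor.getUploader();
        } catch (Exception e) {
            info.errors.append(e.getMessage()).append('\n');
        }
        return info;
    }
}
```

<p>Because every data field has its own method, the collector can treat each failure individually instead of losing the whole result when a single field cannot be parsed.</p>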
<h3 id="usage-in-the-front-end">Usage in the Front End</h3>
<p>A typical call for retrieving data from a website would look like this:</p>
<pre><code class="language-java">Info info;
try {
    // Create a new Extractor with a given context provided as parameter.
    Extractor extractor = new Extractor(some_meta_info);
    // Retrieve the data from the extractor and build the info package.
    info = Info.getInfo(extractor);
} catch (Exception e) {
    // Handle errors raised when the collector decided to abort the extraction.
}
</code></pre>
<h3 id="typical-implementation-of-a-single-data-extractor">Typical Implementation of a Single Data Extractor</h3>
<p>The typical implementation of a single data extractor, on the other hand, would look like this:</p>
<pre><code class="language-java">class MyExtractor extends FutureExtractor {
    public MyExtractor(RequiredInfo requiredInfo, ForExtraction forExtraction) {
        super(requiredInfo, forExtraction);
        ...
    }

    @Override
    public void fetch() {
        // Actually fetch the page data here.
    }

    @Override
    public String someDataField()
            throws ExtractionException { // Must be thrown if extracting this field fails.
        // Get the piece of information and return it.
    }

    ... // More data fields
}
</code></pre>
<h2 id="collectorextractor-pattern-for-lists">Collector/Extractor Pattern for Lists</h2>
<p>Information can be represented as a list. In NewPipe, a list is represented by an
<a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html">InfoItemsCollector</a>.
An InfoItemsCollector will collect and assemble a list of <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html">InfoItem</a>s.
For each item that should be extracted, a new Extractor must be created, and given to the InfoItemsCollector via <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-">commit()</a>.</p>
<p><img alt="InfoItemsCollector_objectdiagram.svg" src="../img/InfoItemsCollector_objectdiagram.svg" /></p>
<p>If you are implementing a list in your service you need to implement an <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html">InfoItemExtractor</a>,
that will be able to retrieve data for one and only one InfoItem. This extractor will then be <em>committed</em> to the <strong>InfoItemsCollector</strong> that can collect the type of InfoItems you want to generate.</p>
<p>A common implementation would look like this:</p>
<pre><code>private SomeInfoItemCollector collectInfoItemsFromElement(Element e) {
    // Read *Some* as a concrete type such as Stream or Channel;
    // e.g. StreamInfoItemsCollector and ChannelInfoItemsCollector are provided by NewPipe.
    SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());

    for (final Element li : e.children()) {
        collector.commit(new InfoItemExtractor() {
            @Override
            public String getName() throws ParsingException {
                ...
            }

            @Override
            public String getUrl() throws ParsingException {
                ...
            }
            ...
        });
    }
    return collector;
}
</code></pre>
<h2 id="listextractor">ListExtractor</h2>
<p>There is more to know about lists:</p>
<ol>
<li>
<p>When a streaming site shows a list of items, it usually offers some additional information about that list like its title, a thumbnail,
and its creator. This information is called the <strong>list header</strong>.</p>
</li>
<li>
<p>When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down.</p>
</li>
</ol>
<p>Both of these problems are solved by the <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html">ListExtractor</a>, which takes care of extracting additional metadata about the list
and of splitting lists into several pages, so-called <a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html">InfoItemsPage</a>s.
Each page has its own URL and needs to be extracted separately.</p>
<p>For extracting list header information, a <code>ListExtractor</code> behaves like a regular extractor. For handling <code>InfoItemsPage</code>s, it adds methods
such as:</p>
<ul>
<li><a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--">getInitialPage()</a>
which will return the first page of InfoItems.</li>
<li><a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getNextPageUrl--">getNextPageUrl()</a>
If a second page of InfoItems is available, this returns the URL pointing to it.</li>
<li><a href="https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getPage-java.lang.String-">getPage()</a>
returns the ListExtractor.InfoItemsPage for a given URL, which was retrieved by the <code>getNextPageUrl()</code> method of the previous page.</li>
</ul>
<p>The first page is handled specially because many websites, such as YouTube, load the first page of
items like a regular web page, but all the others via AJAX requests.</p>
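<p>From the caller's side, the paging methods listed above boil down to a simple loop: fetch the initial page, then keep requesting the next page until no next page URL is left. The following is a minimal sketch with hypothetical stand-in types (<code>PagingSketch</code>, <code>Page</code>, <code>PagedSource</code>); it does not use the real <code>ListExtractor</code> API, and it assumes that each page carries the URL of its successor.</p>

```java
/** Hypothetical names throughout; none of these types are part of NewPipeExtractor. */
class PagingSketch {

    /** Stand-in for a page of items plus the URL of the page that follows it. */
    static final class Page {
        final String[] items;
        final String nextPageUrl; // null when no page follows

        Page(String[] items, String nextPageUrl) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
        }
    }

    /** Stand-in for the paging part of a list extractor. */
    interface PagedSource {
        Page getInitialPage() throws Exception;
        Page getPage(String pageUrl) throws Exception;
    }

    /** Walks through all pages and returns the total number of items seen. */
    static int countAllItems(PagedSource source) throws Exception {
        // The first page is requested without a URL.
        Page page = source.getInitialPage();
        int total = page.items.length;

        // Every further page is requested by the URL the previous page provided.
        while (page.nextPageUrl != null) {
            page = source.getPage(page.nextPageUrl);
            total += page.items.length;
        }
        return total;
    }
}
```

<p>A real front end would usually load the next page only on demand, for example when the user scrolls to the end of the list, rather than walking all pages at once.</p>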
<p>An InfoItemsPage itself has two constructors, which take these parameters:</p>
<ul>
<li>The <strong>InfoItemsCollector</strong> of the list that the page should represent.</li>
<li>A <strong>nextPageUrl</strong>, which represents the URL of the following page (may be null if no page follows).</li>
<li>Optionally, <strong>errors</strong>, which is a list of exceptions that may have happened during extraction.</li>
</ul>
<p>Here is a simplified reference implementation of a list extractor that only extracts pages, but not metadata:</p>
<pre><code>class MyListExtractor extends ListExtractor {
    ...
    private Document document;
    ...

    public InfoItemsPage&lt;SomeInfoItem&gt; getPage(String pageUrl)
            throws ExtractionException {
        SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
        document = myFunctionToGetThePageHTMLWhatever(pageUrl);

        // Remember this part from the simple list extraction.
        for (final Element li : document.children()) {
            collector.commit(new InfoItemExtractor() {
                @Override
                public String getName() throws ParsingException {
                    ...
                }

                @Override
                public String getUrl() throws ParsingException {
                    ...
                }
                ...
            });
        }
        return new InfoItemsPage&lt;SomeInfoItem&gt;(collector, myFunctionToGetTheNextPageUrl(document));
    }

    public InfoItemsPage&lt;SomeInfoItem&gt; getInitialPage() throws ExtractionException {
        // The document was already initialized by the fetch() function.
        return getPage(getTheCurrentPageUrl(document));
    }
    ...
}
</code></pre>
</div>
</div><footer>
<div class="rst-footer-buttons" role="navigation" aria-label="Footer Navigation">
<a href="../00_Prepare_everything/" class="btn btn-neutral float-left" title="Before You Start"><span class="icon icon-circle-arrow-left"></span> Previous</a>
<a href="../02_Concept_of_LinkHandler/" class="btn btn-neutral float-right" title="Concept of the LinkHandler">Next <span class="icon icon-circle-arrow-right"></span></a>
</div>
<hr/>
<div role="contentinfo">
<!-- Copyright etc -->
</div>
Built with <a href="https://www.mkdocs.org/">MkDocs</a> using a <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<div class="rst-versions" role="note" aria-label="Versions">
<span class="rst-current-version" data-toggle="rst-current-version">
<span><a href="../00_Prepare_everything/" style="color: #fcfcfc">&laquo; Previous</a></span>
<span><a href="../02_Concept_of_LinkHandler/" style="color: #fcfcfc">Next &raquo;</a></span>
</span>
</div>
<script src="../js/jquery-3.6.0.min.js"></script>
<script>var base_url = "..";</script>
<script src="../js/theme_extra.js"></script>
<script src="../js/theme.js"></script>
<script src="../search/main.js"></script>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>