Add new tldts-icann package which does not contain private rules #1888

Merged: 1 commit, Dec 5, 2023
2 changes: 1 addition & 1 deletion packages/tldts-experimental/test/publicsuffix.test.ts
@@ -5,5 +5,5 @@ import { publicSuffixListTests } from 'tldts-tests';
 import * as tld from '../index';

 describe('tldts experimental', () => {
-  publicSuffixListTests(tld.getDomain);
+  publicSuffixListTests(tld.getDomain, { includePrivate: true });
 });
2 changes: 1 addition & 1 deletion packages/tldts-experimental/test/tld.test.ts
@@ -5,5 +5,5 @@ import { tldtsTests } from 'tldts-tests';
 import * as tld from '../index';

 describe('tldts experimental', () => {
-  tldtsTests(tld);
+  tldtsTests(tld, { includePrivate: true });
 });
290 changes: 290 additions & 0 deletions packages/tldts-icann/README.md
@@ -0,0 +1,290 @@
# tldts - Blazing Fast URL Parsing (ICANN rules only)

`tldts` is a JavaScript library to extract hostnames, domains, public suffixes, top-level domains and subdomains from URLs.

**Features**:

1. Tuned for **performance** (on the order of 0.1 to 1 μs per input)
2. Handles both URLs and hostnames
3. Full Unicode/IDNA support
4. Supports parsing email addresses
5. Detects IPv4 and IPv6 addresses
6. Continuously updated version of the public suffix list
7. **TypeScript**, ships with `umd`, `esm`, `cjs` bundles and _type definitions_
8. Small bundles and small memory footprint
9. Battle tested: full test coverage and production use

# Install

```bash
npm install --save tldts-icann
```

# Usage

Programmatically:

```js
const { parse } = require('tldts-icann');

// Retrieve hostname-related information for a given URL
parse('http://www.writethedocs.org/conf/eu/2017/');
// { domain: 'writethedocs.org',
// domainWithoutSuffix: 'writethedocs',
// hostname: 'www.writethedocs.org',
// isIp: false,
// publicSuffix: 'org',
// subdomain: 'www' }
```

_ES6 module imports_ are also supported:

```js
import { parse } from 'tldts-icann';
```

Alternatively, you can try it _directly in your browser_ here: https://npm.runkit.com/tldts

# API

- `tldts.parse(url | hostname, options)`
- `tldts.getHostname(url | hostname, options)`
- `tldts.getDomain(url | hostname, options)`
- `tldts.getPublicSuffix(url | hostname, options)`
- `tldts.getSubdomain(url | hostname, options)`
- `tldts.getDomainWithoutSuffix(url | hostname, options)`

The behavior of `tldts` can be customized using an `options` argument for all
the functions exposed as part of the public API. This is useful to both change
the behavior of the library as well as fine-tune the performance depending on
your inputs.

```js
{
  // Extract and validate hostname (default: true)
  // When set to `false`, inputs will be considered valid hostnames.
  extractHostname: boolean;
  // Validate hostnames after parsing (default: true)
  // If a hostname is not valid, no further processing is performed. When set
  // to `false`, inputs to the library will be considered valid and parsing will
  // proceed regardless.
  validateHostname: boolean;
  // Perform IP address detection (default: true).
  detectIp: boolean;
  // Assume that both URLs and hostnames can be given as input (default: true)
  // If set to `false` we assume only URLs will be given as input, which
  // speeds up processing.
  mixedInputs: boolean;
  // Specifies extra valid suffixes (default: null)
  validHosts: string[] | null;
}
```
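
As a rough illustration of how these defaults combine with a caller's partial options, the sketch below merges the two. `DEFAULT_OPTIONS` and `withDefaults` are hypothetical names used only for this example and are not part of the `tldts-icann` API; the default values are the ones listed above.

```js
// Hypothetical helper, not part of tldts-icann: the defaults mirror the
// values documented in the options listing above.
const DEFAULT_OPTIONS = Object.freeze({
  extractHostname: true,
  validateHostname: true,
  detectIp: true,
  mixedInputs: true,
  validHosts: null,
});

// Return a complete options object; keys the caller omits keep their default.
function withDefaults(options = {}) {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Example: a caller that only ever passes clean hostnames.
const hostnameOnly = withDefaults({ extractHostname: false, detectIp: false });
console.log(hostnameOnly);
```

In practice you would pass only the keys you want to change, for instance `{ extractHostname: false }`, and let the library apply the rest.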

The `parse` method returns handy **properties about a URL or a hostname**.

```js
const tldts = require('tldts-icann');

tldts.parse('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv');
// { domain: 'amazonaws.com',
// domainWithoutSuffix: 'amazonaws',
// hostname: 'spark-public.s3.amazonaws.com',
// isIp: false,
// publicSuffix: 'com',
// subdomain: 'spark-public.s3' }

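// `tldts-icann` ships ICANN rules only, so a private suffix such as
// `s3.amazonaws.com` is not part of its list. Passing `allowPrivateDomains`
// is therefore expected to leave the result unchanged:
tldts.parse(
  'https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv',
  { allowPrivateDomains: true },
);
// { domain: 'amazonaws.com',
// domainWithoutSuffix: 'amazonaws',
// hostname: 'spark-public.s3.amazonaws.com',
// isIp: false,
// publicSuffix: 'com',
// subdomain: 'spark-public.s3' }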

tldts.parse('gopher://domain.unknown/');
// { domain: 'domain.unknown',
// domainWithoutSuffix: 'domain',
// hostname: 'domain.unknown',
// isIp: false,
// publicSuffix: 'unknown',
// subdomain: '' }

tldts.parse('https://192.168.0.0'); // IPv4
// { domain: null,
// domainWithoutSuffix: null,
// hostname: '192.168.0.0',
// isIp: true,
// publicSuffix: null,
// subdomain: null }

tldts.parse('https://[::1]'); // IPv6
// { domain: null,
// domainWithoutSuffix: null,
// hostname: '::1',
// isIp: true,
// publicSuffix: null,
// subdomain: null }

tldts.parse('tldts@emailprovider.co.uk'); // email
// { domain: 'emailprovider.co.uk',
// domainWithoutSuffix: 'emailprovider',
// hostname: 'emailprovider.co.uk',
// isIp: false,
// publicSuffix: 'co.uk',
// subdomain: '' }
```

| Property Name         | Type   | Description                                     |
| :-------------------- | :----- | :---------------------------------------------- |
| `hostname`            | `str`  | `hostname` of the input extracted automatically |
| `domain`              | `str`  | Domain (tld + sld)                              |
| `domainWithoutSuffix` | `str`  | Domain without public suffix                    |
| `subdomain`           | `str`  | Subdomain (what comes before `domain`)          |
| `publicSuffix`        | `str`  | Public Suffix (tld) of `hostname`               |
| `isIp`                | `bool` | Is `hostname` an IP address?                    |
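
The fields relate to each other in a simple way: `hostname` is `subdomain`, `domainWithoutSuffix` and `publicSuffix` joined with dots, and `domain` is the last two of those. The sketch below rebuilds the other fields from a hostname and an already-known public suffix. `splitHostname` is a hypothetical helper written for illustration; it performs no suffix lookup of its own and is not how the library is implemented.

```js
// Hypothetical helper (not part of tldts-icann): derive the remaining fields
// from a hostname and a public suffix that is already known.
function splitHostname(hostname, publicSuffix) {
  const empty = { domain: null, domainWithoutSuffix: null, subdomain: null };
  // The hostname must end with '.' + publicSuffix to contain a domain.
  if (!hostname.endsWith('.' + publicSuffix)) return empty;

  // Everything before the suffix, e.g. 'www.writethedocs'.
  const prefix = hostname.slice(0, hostname.length - publicSuffix.length - 1);
  if (prefix.length === 0) return empty;
  const lastDot = prefix.lastIndexOf('.');

  // The label right before the suffix is the domain without suffix.
  const domainWithoutSuffix = prefix.slice(lastDot + 1);
  return {
    domain: domainWithoutSuffix + '.' + publicSuffix,
    domainWithoutSuffix,
    subdomain: lastDot === -1 ? '' : prefix.slice(0, lastDot),
  };
}

console.log(splitHostname('www.writethedocs.org', 'org'));
console.log(splitHostname('spark-public.s3.amazonaws.com', 'com'));
```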

## Single purpose methods

These methods are shorthands for retrieving a single value. They perform
better than `parse` because less work is needed.

### getHostname(url | hostname, options?)

Returns the hostname from a given string.

```javascript
const { getHostname } = require('tldts-icann');

getHostname('google.com'); // returns `google.com`
getHostname('fr.google.com'); // returns `fr.google.com`
getHostname('fr.google.google'); // returns `fr.google.google`
getHostname('foo.google.co.uk'); // returns `foo.google.co.uk`
getHostname('t.co'); // returns `t.co`
getHostname('fr.t.co'); // returns `fr.t.co`
getHostname(
  'https://user:password@example.co.uk:8080/some/path?and&query#hash',
); // returns `example.co.uk`
```

### getDomain(url | hostname, options?)

Returns the fully qualified domain from a given string.

```javascript
const { getDomain } = require('tldts-icann');

getDomain('google.com'); // returns `google.com`
getDomain('fr.google.com'); // returns `google.com`
getDomain('fr.google.google'); // returns `google.google`
getDomain('foo.google.co.uk'); // returns `google.co.uk`
getDomain('t.co'); // returns `t.co`
getDomain('fr.t.co'); // returns `t.co`
getDomain('https://user:password@example.co.uk:8080/some/path?and&query#hash'); // returns `example.co.uk`
```

### getDomainWithoutSuffix(url | hostname, options?)

Returns the domain (as returned by `getDomain(...)`) without the public suffix part.

```javascript
const { getDomainWithoutSuffix } = require('tldts-icann');

getDomainWithoutSuffix('google.com'); // returns `google`
getDomainWithoutSuffix('fr.google.com'); // returns `google`
getDomainWithoutSuffix('fr.google.google'); // returns `google`
getDomainWithoutSuffix('foo.google.co.uk'); // returns `google`
getDomainWithoutSuffix('t.co'); // returns `t`
getDomainWithoutSuffix('fr.t.co'); // returns `t`
getDomainWithoutSuffix(
  'https://user:password@example.co.uk:8080/some/path?and&query#hash',
); // returns `example`
```

### getSubdomain(url | hostname, options?)

Returns the complete subdomain for a given string.

```javascript
const { getSubdomain } = require('tldts-icann');

getSubdomain('google.com'); // returns ``
getSubdomain('fr.google.com'); // returns `fr`
getSubdomain('google.co.uk'); // returns ``
getSubdomain('foo.google.co.uk'); // returns `foo`
getSubdomain('moar.foo.google.co.uk'); // returns `moar.foo`
getSubdomain('t.co'); // returns ``
getSubdomain('fr.t.co'); // returns `fr`
getSubdomain(
  'https://user:password@secure.example.co.uk:443/some/path?and&query#hash',
); // returns `secure`
```

### getPublicSuffix(url | hostname, options?)

Returns the [public suffix][] for a given string.

```javascript
const { getPublicSuffix } = require('tldts-icann');

getPublicSuffix('google.com'); // returns `com`
getPublicSuffix('fr.google.com'); // returns `com`
getPublicSuffix('google.co.uk'); // returns `co.uk`
getPublicSuffix('s3.amazonaws.com'); // returns `com`
getPublicSuffix('tld.is.unknown'); // returns `unknown`
```
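
To make the idea of a public suffix concrete, here is a minimal sketch of longest-suffix matching against a tiny, hand-written rule set. `RULES` and `lookupPublicSuffix` are hypothetical names; the real list is far larger and also contains wildcard and exception rules, which this sketch leaves out. Hostnames that match no rule fall back to their last label, which is consistent with the `tld.is.unknown` example above.

```js
// Tiny illustrative rule set; the real public suffix list is much larger.
const RULES = new Set(['com', 'uk', 'co.uk']);

// Hypothetical helper: return the longest suffix of `hostname` found in RULES,
// or the last label when nothing matches.
function lookupPublicSuffix(hostname) {
  const labels = hostname.split('.');
  // Candidates go from longest to shortest, so the first hit is the longest.
  for (let i = 0; i < labels.length; i += 1) {
    const candidate = labels.slice(i).join('.');
    if (RULES.has(candidate)) return candidate;
  }
  return labels[labels.length - 1];
}

console.log(lookupPublicSuffix('google.co.uk'));
console.log(lookupPublicSuffix('tld.is.unknown'));
```

The library itself uses a precomputed trie (see `src/suffix-trie` in the package) rather than a linear scan like this one.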

# Troubleshooting

## Retrieving subdomain of `localhost` and custom hostnames

`tldts` methods `getDomain` and `getSubdomain` are designed to **work only with _known and valid_ TLDs**.
This way, you can trust what a domain is.

`localhost` is a valid hostname but not a TLD. To have it treated as a valid suffix, pass the `validHosts` option, which every method exposed by `tldts` accepts:

```js
const tldts = require('tldts-icann');

tldts.getDomain('localhost'); // returns null
tldts.getSubdomain('vhost.localhost'); // returns null

tldts.getDomain('localhost', { validHosts: ['localhost'] }); // returns 'localhost'
tldts.getSubdomain('vhost.localhost', { validHosts: ['localhost'] }); // returns 'vhost'
```
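
One way to picture what `validHosts` does: when the hostname is, or ends with, one of the listed hosts, that host is treated as the domain and the remaining labels as the subdomain. The sketch below illustrates that behavior under this assumption. It is not the library's implementation, and `matchValidHost` is a hypothetical name.

```js
// Hypothetical helper (not the tldts implementation): treat a listed host as
// the domain and whatever precedes it as the subdomain.
function matchValidHost(hostname, validHosts) {
  for (const host of validHosts) {
    if (hostname === host) {
      return { domain: host, subdomain: '' };
    }
    // Require a dot boundary so 'notlocalhost' does not match 'localhost'.
    if (hostname.endsWith('.' + host)) {
      const subdomain = hostname.slice(0, hostname.length - host.length - 1);
      return { domain: host, subdomain };
    }
  }
  return null; // no listed host applies
}

console.log(matchValidHost('vhost.localhost', ['localhost']));
console.log(matchValidHost('notlocalhost', ['localhost']));
```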

## Updating the TLDs List

`tldts` made the opinionated choice of shipping with a list of suffixes directly
in its bundle. There is currently no mechanism to update the lists yourself, but
we make sure that the version shipped is always up-to-date.

If you keep `tldts` updated, the lists should be up-to-date as well!

# Performance

`tldts` is the _fastest JavaScript library_ available for parsing hostnames. It is able to parse _millions of inputs per second_ (typically 2-3M depending on your hardware and inputs). It also offers granular options to fine-tune the behavior and performance of the library depending on the kind of inputs you are dealing with. For example, if you know you only manipulate valid hostnames, you can disable the hostname extraction step with `{ extractHostname: false }`.

Please see [this detailed comparison](./comparison/comparison.md) with other available libraries.
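
Throughput figures depend heavily on hardware and inputs, so it can be worth measuring on your own data. Below is a minimal timing sketch for Node.js. `measureThroughput` and `lastLabel` are hypothetical names, and `lastLabel` is only a stand-in for whichever `tldts-icann` function you want to measure.

```js
// Hypothetical micro-benchmark helper: run `fn` over `inputs` for a number of
// rounds and return the observed calls per second.
function measureThroughput(fn, inputs, rounds = 100) {
  let sink = 0; // use the results so the calls are not optimized away
  const start = process.hrtime.bigint();
  for (let r = 0; r < rounds; r += 1) {
    for (const input of inputs) {
      const result = fn(input);
      if (result !== null) sink += result.length;
    }
  }
  const elapsedNs = Number(process.hrtime.bigint() - start);
  const calls = rounds * inputs.length;
  return { calls, sink, callsPerSecond: calls / (elapsedNs / 1e9) };
}

// Stand-in for a real parser call such as `getDomain`.
const lastLabel = (hostname) => hostname.slice(hostname.lastIndexOf('.') + 1);

const stats = measureThroughput(lastLabel, ['example.com', 'foo.example.co.uk']);
console.log(stats.calls, Math.round(stats.callsPerSecond));
```

Single runs like this are noisy. For numbers you intend to rely on, use a larger input set, add warm-up rounds, and repeat the measurement several times.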

## Contributors

`tldts` is based upon the excellent `tld.js` library and would not exist without
the many contributors who worked on the project:
<a href="graphs/contributors"><img src="https://opencollective.com/tldjs/contributors.svg?width=890" /></a>

This project would not be possible without Mozilla's amazing
[public suffix list][]. Thank you for your hard work!

# License

[MIT License](LICENSE).

[badge-ci]: https://secure.travis-ci.org/remusao/tldts.svg?branch=master
[badge-downloads]: https://img.shields.io/npm/dm/tldts.svg
[public suffix list]: https://publicsuffix.org/list/
[list the recent changes]: https://github.com/publicsuffix/list/commits/master
[changes Atom Feed]: https://github.com/publicsuffix/list/commits/master.atom
[public suffix]: https://publicsuffix.org/learn/
62 changes: 62 additions & 0 deletions packages/tldts-icann/index.ts
@@ -0,0 +1,62 @@
import {
  FLAG,
  getEmptyResult,
  IOptions,
  IResult,
  parseImpl,
  resetResult,
} from 'tldts-core';

import suffixLookup from './src/suffix-trie';

// For all methods but 'parse', it does not make sense to allocate an object
// every single time to only return the value of a specific attribute. To avoid
// this un-necessary allocation, we use a global object which is re-used.
const RESULT: IResult = getEmptyResult();

export function parse(url: string, options: Partial<IOptions> = {}): IResult {
  return parseImpl(url, FLAG.ALL, suffixLookup, options, getEmptyResult());
}

export function getHostname(
  url: string,
  options: Partial<IOptions> = {},
): string | null {
  /*@__INLINE__*/ resetResult(RESULT);
  return parseImpl(url, FLAG.HOSTNAME, suffixLookup, options, RESULT).hostname;
}

export function getPublicSuffix(
  url: string,
  options: Partial<IOptions> = {},
): string | null {
  /*@__INLINE__*/ resetResult(RESULT);
  return parseImpl(url, FLAG.PUBLIC_SUFFIX, suffixLookup, options, RESULT)
    .publicSuffix;
}

export function getDomain(
  url: string,
  options: Partial<IOptions> = {},
): string | null {
  /*@__INLINE__*/ resetResult(RESULT);
  return parseImpl(url, FLAG.DOMAIN, suffixLookup, options, RESULT).domain;
}

export function getSubdomain(
  url: string,
  options: Partial<IOptions> = {},
): string | null {
  /*@__INLINE__*/ resetResult(RESULT);
  return parseImpl(url, FLAG.SUB_DOMAIN, suffixLookup, options, RESULT)
    .subdomain;
}

export function getDomainWithoutSuffix(
  url: string,
  options: Partial<IOptions> = {},
): string | null {
  /*@__INLINE__*/ resetResult(RESULT);
  return parseImpl(url, FLAG.ALL, suffixLookup, options, RESULT)
    .domainWithoutSuffix;
}
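
The comment about the shared `RESULT` object describes a general allocation-avoidance pattern. A minimal, self-contained sketch of it follows. `createResult`, `reset`, `fill`, `getHost` and `parseAll` are hypothetical stand-ins and not the `tldts-core` functions. The shared object is only safe because each single-value accessor returns one primitive field and never the object itself.

```js
// Hypothetical stand-ins for getEmptyResult / resetResult / parseImpl.
function createResult() {
  return { hostname: null, domain: null };
}
function reset(result) {
  result.hostname = null;
  result.domain = null;
}
function fill(input, result) {
  result.hostname = input.toLowerCase();
  return result;
}

// One shared object, reused by every single-value accessor.
const SHARED = createResult();

function getHost(input) {
  reset(SHARED); // clear leftovers from the previous call
  return fill(input, SHARED).hostname; // return a string, never SHARED itself
}

// `parseAll` hands the result to the caller, so it must allocate a fresh one.
function parseAll(input) {
  return fill(input, createResult());
}

console.log(getHost('Example.COM'), parseAll('Example.COM'));
```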