Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature](function)support url domain functions #42488

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

zhangstar333
Copy link
Contributor

@zhangstar333 zhangstar333 commented Oct 25, 2024

Proposed changes

support top_level_domain/first_significant_subdomain/cut_to_first_significant_subdomain functions
doc: apache/doris-website#1230

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@LiBinfeng-01
Copy link
Contributor

need to add fe fold const when adding new scalar function to fe, which can refer to: https://selectdb.feishu.cn/wiki/BGfGwgY2uiQrK7krGUVcYJFknVg?from=from_copylink

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 39. Check the log or trigger a new build to see more.

be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
be/src/vec/functions/url/find_symbols.h Show resolved Hide resolved
@zhangstar333
Copy link
Contributor Author

run buildall

@zhangstar333
Copy link
Contributor Author

run buildall

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

requires(0 <= sizeof...(symbols) && sizeof...(symbols) <= 16)
{
#if defined(__SSE4_2__)
if (sizeof...(symbols) >= 5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
if (sizeof...(symbols) >= 5)
if (sizeof...(symbols) >= 5) {

be/src/vec/functions/url/find_symbols.h:344:

-     else
+     } else

if (sizeof...(symbols) >= 5)
return find_first_symbols_sse42<positive, return_mode, sizeof...(symbols), symbols...>(
begin, end);
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
else
else {

be/src/vec/functions/url/find_symbols.h:346:

-         return find_first_symbols_sse2<positive, return_mode, symbols...>(begin, end);
+         return find_first_symbols_sse2<positive, return_mode, symbols...>(begin, end);
+ }

inline const char* find_first_symbols_dispatch(const std::string_view haystack,
const SearchSymbols& symbols) {
#if defined(__SSE4_2__)
if (symbols.str.size() >= 5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
if (symbols.str.size() >= 5)
if (symbols.str.size() >= 5) {

be/src/vec/functions/url/find_symbols.h:356:

-     else
+     } else

if (symbols.str.size() >= 5)
return find_first_symbols_sse42<positive, return_mode>(haystack.begin(), haystack.end(),
symbols);
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
else
else {

be/src/vec/functions/url/find_symbols.h:359:

-                 haystack.begin(), haystack.end(), symbols.str.data(), symbols.str.size());
+                 haystack.begin(), haystack.end(), symbols.str.data(), symbols.str.size());
+ }

while (pos < end) {
const char* delimiter_or_end = find_first_symbols<symbols...>(pos, end);

if (!token_compress || pos < delimiter_or_end) to.emplace_back(pos, delimiter_or_end - pos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
if (!token_compress || pos < delimiter_or_end) to.emplace_back(pos, delimiter_or_end - pos);
if (!token_compress || pos < delimiter_or_end) { to.emplace_back(pos, delimiter_or_end - pos);
}

#include <cstring>

#define TOTAL_KEYWORDS 5045
#define MIN_WORD_LENGTH 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: macro 'MIN_WORD_LENGTH' defines an integral constant; prefer an enum instead [modernize-macro-to-enum]

#define MIN_WORD_LENGTH 4
        ^


#define TOTAL_KEYWORDS 5045
#define MIN_WORD_LENGTH 4
#define MAX_WORD_LENGTH 34
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: macro 'MAX_WORD_LENGTH' defines an integral constant; prefer an enum instead [modernize-macro-to-enum]

#define MAX_WORD_LENGTH 34
        ^

#define TOTAL_KEYWORDS 5045
#define MIN_WORD_LENGTH 4
#define MAX_WORD_LENGTH 34
#define MIN_HASH_VALUE 75
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: macro 'MIN_HASH_VALUE' defines an integral constant; prefer an enum instead [modernize-macro-to-enum]

#define MIN_HASH_VALUE 75
        ^

#define MIN_WORD_LENGTH 4
#define MAX_WORD_LENGTH 34
#define MIN_HASH_VALUE 75
#define MAX_HASH_VALUE 110600
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: macro 'MAX_HASH_VALUE' defines an integral constant; prefer an enum instead [modernize-macro-to-enum]

#define MAX_HASH_VALUE 110600
        ^

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.41% (9713/25964)
Line Coverage: 28.68% (80572/280967)
Region Coverage: 28.11% (41654/148156)
Branch Coverage: 24.67% (21158/85778)
Coverage Report: http://coverage.selectdb-in.cc/coverage/1fddbae0c025fc94f2ee358047cdc094f0bbcc3f_1fddbae0c025fc94f2ee358047cdc094f0bbcc3f/report/index.html

struct ExtractTopLevelDomain {
static size_t get_reserve_length_for_element() { return 5; }

static void execute(const char* data, size_t size, const char*& res_data, size_t& res_size) {
Copy link
Contributor

@HappenLee HappenLee Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return StringRef, we better refactor the code

@zhangstar333
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 37.42% (9716/25966)
Line Coverage: 28.70% (80606/280875)
Region Coverage: 28.13% (41668/148124)
Branch Coverage: 24.70% (21177/85748)
Coverage Report: http://coverage.selectdb-in.cc/coverage/1fddbae0c025fc94f2ee358047cdc094f0bbcc3f_1fddbae0c025fc94f2ee358047cdc094f0bbcc3f/report/index.html

#include <cstdint>
#include <string>

#if defined(__SSE2__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only define SSE4_2 enoungh. and here need add header


const auto* host_end = host_view.data() + host_view.size();
const char* last_dot = find_last_symbols_or_null<'.'>(host_view.data(), host_end);
if (!last_dot) {
Copy link
Contributor

@HappenLee HappenLee Oct 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not find the last_dot what should the res_data and res_size ? DANGER ! it's may be UB

};

struct FirstSignificantSubdomainDefaultLookup {
bool operator()(StringRef host) const { return tldLookup::isValid(host.data, host.size); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_valid


static void execute(const Pos data, const size_t size, Pos& res_data, size_t& res_size,
Pos* out_domain_end = nullptr) {
FirstSignificantSubdomainDefaultLookup loookup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why here we need a class, not direct call the static func tldLooup::is_valid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants